
On 7/16/19 2:35 PM, Peter Krempa wrote:
On Thu, Jul 11, 2019 at 17:53:57 +0200, Michal Privoznik wrote:
There is this class of PCI devices that act like disks: NVMe. Therefore, they are both PCI devices and disks. While we already have <hostdev/> (and can assign a NVMe device to a domain successfully) we don't have disk representation. There are three problems with PCI assignment in case of a NVMe device:
1) domains with <hostdev/> can't be migrated
2) NVMe device is assigned whole, there's no way to assign only a namespace
3) Because hypervisors see <hostdev/> they don't put block layer on top of it - users don't get all the fancy features like snapshots
NVMe namespaces are way of splitting one continuous NVDIMM memory
s/NVDIMM memory/NVMe device/
into smaller ones, effectively creating smaller NVMe-s (which can then be partitioned, LVMed, etc.)
Because of all of this the following XML was chosen to model a NVMe device:
<disk type='nvme' device='disk'> <driver name='qemu' type='raw'/> <source type='pci' managed='yes' namespace='1'> <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/> </source> <target dev='vda' bus='virtio'/> </disk>
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- docs/formatdomain.html.in | 45 +++++++++++++++++++++-- docs/schemas/domaincommon.rng | 32 ++++++++++++++++ tests/qemuxml2argvdata/disk-nvme.xml | 55 ++++++++++++++++++++++++++++ 3 files changed, 129 insertions(+), 3 deletions(-) create mode 100644 tests/qemuxml2argvdata/disk-nvme.xml
diff --git a/docs/formatdomain.html.in b/docs/formatdomain.html.in index a7a6ec32a5..545578076d 100644 --- a/docs/formatdomain.html.in +++ b/docs/formatdomain.html.in @@ -2922,6 +2922,13 @@ </backingStore> <target dev='vdd' bus='virtio'/> </disk> + <disk type='nvme' device='disk'> + <driver name='qemu' type='raw'/> + <source type='pci' managed='yes' namespace='1'>
The 'type' filed may get confusing a bit as it is supposed to be stored in virStorageSource->nvme->type, while virStorageSource has it's own type.
Well, I'm trygin to mix two things here: <disk/> and <hostdev/> (well, PCI devices). There are other types of NVMe than PCIe attached ones. They are called NVMe-oF (NVMe over Fabrics), so I've figured that we want to specifically tell that NVMe we are dealing with here is the PCIe one. But I guess it can also be added later, if we ever decide to support NVMe-oF (disks without @type would have 'pci' autofilled in a post parse callback).
Also I'm pondering whether managed='yes' belongs as a top-level attribute under 'source' as it's possibly specific to the 'pci' setting of type.
Yeah, designing future proof XML is hard. I'm open to any suggestion.
+ <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/> + </source> + <target dev='vde' bus='virtio'/> + </disk> </devices> ...</pre>
[...]
@@ -3118,6 +3126,31 @@ <span class="since">Since 1.0.5</span> </p> </dd> + <dt><code>nvme</code></dt> + <dd> + To specify disk source for NVMe disk the <code>source</code> + element has the following attributes: + <dl> + <dt><code>type</code></dt> + <dd>The type of address specified in <code>address</code> + sub-element. Currently, only <code>pci</code> value is + accepted. + </dd> + + <dt><code>managed</code></dt> + <dd>This attribute instructs libvirt to detach NVMe + controller automatically on domain startup (<code>yes</code>) + or expect the controller to be detached by system + administrator (<code>no</code>). + </dd> + + <dt><code>namespace</code></dt> + <dd>The namespace ID which should be assigned to the domain. + According to NVMe standard, namespace numbers start from 1, + including. + </dd> + </dl> + </dd> </dl> With "file", "block", and "volume", one or more optional sub-elements <code>seclabel</code>, <a href="#seclabel">described @@ -3280,11 +3313,17 @@ initiator IQN needed to access the source via mandatory attribute <code>name</code>. </dd> + <dt><code>address</code></dt> + <dd>For disk of type <code>nvme</code> this element + specifies the PCI address of the host NVMe + controller. + <span class="since">Since 5.5.0</span> + </dd> </dl>
<p> - For a "file" or "volume" disk type which represents a cdrom or floppy - (the <code>device</code> attribute), it is possible to define + For a "file", "volume" or "nvme" disk type which represents a cdrom or + floppy (the <code>device</code> attribute), it is possible to define
You specifically forbid startup policy in the next commit, so what's the point of documenting it here?
Ah, good point. Back in the day I still wanted to make startupPolicy work, but then decided not to.
policy what to do with the disk if the source file is not accessible. (NB, <code>startupPolicy</code> is not valid for "volume" disk unless the specified storage volume is of "file" type). This is done by the diff --git a/docs/schemas/domaincommon.rng b/docs/schemas/domaincommon.rng index 31db599ab9..f367e8f6fd 100644 --- a/docs/schemas/domaincommon.rng +++ b/docs/schemas/domaincommon.rng
[...]
@@ -1918,6 +1919,37 @@ </optional> </define>
+ <define name="diskSourceNvme"> + <attribute name="type"> + <value>nvme</value> + </attribute> + <optional> + <element name="source"> + <attribute name="type"> + <value>pci</value> + </attribute> + <attribute name="namespace"> + <ref name="uint32"/> + </attribute> + <optional> + <attribute name="managed"> + <ref name="virYesNo"/> + </attribute> + </optional> + <element name="address"> + <ref name="pciaddress"/> + </element> + <ref name="diskSourceCommon"/> + <optional> + <ref name="storageStartupPolicy"/> + </optional> + <optional> + <ref name="encryption"/> + </optional> + </element> + </optional> + </define> + <define name="diskTarget"> <data type="string"> <param name="pattern">(ioemu:)?(fd|hd|sd|vd|xvd|ubd)[a-zA-Z0-9_]+</param> diff --git a/tests/qemuxml2argvdata/disk-nvme.xml b/tests/qemuxml2argvdata/disk-nvme.xml new file mode 100644 index 0000000000..0b3dbad4eb --- /dev/null +++ b/tests/qemuxml2argvdata/disk-nvme.xml @@ -0,0 +1,55 @@ +<domain type='qemu'> + <name>QEMUGuest1</name> + <uuid>c7a5fdbd-edaf-9455-926a-d65c16db1809</uuid> + <memory unit='KiB'>219136</memory> + <currentMemory unit='KiB'>219136</currentMemory> + <vcpu placement='static'>1</vcpu> + <os> + <type arch='i686' machine='pc'>hvm</type> + <boot dev='hd'/> + </os> + <clock offset='utc'/> + <on_poweroff>destroy</on_poweroff> + <on_reboot>restart</on_reboot> + <on_crash>destroy</on_crash> + <devices> + <emulator>/usr/bin/qemu-system-i686</emulator> + <disk type='nvme' device='disk'> + <driver name='qemu' type='raw'/> + <source type='pci' managed='yes' namespace='1'> + <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/> + </source> + <target dev='vda' bus='virtio'/> + </disk> + <disk type='nvme' device='disk'> + <driver name='qemu' type='raw'/> + <source type='pci' managed='yes' namespace='2'> + <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
Make at heast one of them use qcow2 as format.
Okay.
+ </source> + <target dev='vdb' bus='virtio'/> + </disk> + <disk type='nvme' device='disk'> + <driver name='qemu' type='raw'/> + <source type='pci' managed='no' namespace='1'> + <address domain='0x0000' bus='0x02' slot='0x00' function='0x0'/> + </source> + <target dev='vdc' bus='virtio'/> + </disk> + <disk type='nvme' device='disk'> + <driver name='qemu' type='raw'/> + <source type='pci' managed='no' namespace='2'> + <address domain='0x0001' bus='0x02' slot='0x00' function='0x0'/> + <encryption format='luks'> + <secret type='passphrase' uuid='0a81f5b2-8403-7b23-c8d6-21ccc2f80d6f'/> + </encryption> + </source> + <target dev='vdd' bus='virtio'/> + </disk> + <controller type='usb' index='0'/> + <controller type='pci' index='0' model='pci-root'/> + <controller type='scsi' index='0' model='virtio-scsi'/> + <input type='mouse' bus='ps2'/> + <input type='keyboard' bus='ps2'/> + <memballoon model='none'/> + </devices> +</domain>
I'm also missing any form of documentation describing the caveats (e.g. users should not pass in a NVMe disk the host is using)
Since I'm using virHostdevManager to track used NVMe-s, any attempt to start a domain with a taken NVMe disk will fail. So users will see that immediately :-) But okay, I will enhance the docs. Michal