On 12/9/19 11:55 PM, Cole Robinson wrote:
On 12/2/19 9:26 AM, Michal Privoznik wrote:
> There is this class of PCI devices that act like disks: NVMe.
> Therefore, they are both PCI devices and disks. While we already
> have <hostdev/> (and can assign a NVMe device to a domain
> successfully) we don't have disk representation. There are three
> problems with PCI assignment in case of a NVMe device:
>
> 1) domains with <hostdev/> can't be migrated
>
> 2) NVMe device is assigned whole, there's no way to assign only a
> namespace
>
> 3) Because hypervisors see <hostdev/> they don't put block layer
> on top of it - users don't get all the fancy features like
> snapshots
>
> NVMe namespaces are way of splitting one continuous NVDIMM memory
> into smaller ones, effectively creating smaller NVMe-s (which can
> then be partitioned, LVMed, etc.)
>
> Because of all of this the following XML was chosen to model a
> NVMe device:
>
> <disk type='nvme' device='disk'>
> <driver name='qemu' type='raw'/>
> <source type='pci' managed='yes' namespace='1'>
> <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
> </source>
> <target dev='vda' bus='virtio'/>
> </disk>
>
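For anyone following along, here is a minimal sketch of how a management tool might read the proposed XML back and recover the host PCI address and namespace. It is illustrative Python only, not libvirt code: parse_nvme_source is a hypothetical helper name, and the "namespace must be at least 1" check simply mirrors what the docs in this patch say about namespace numbering.

```python
import xml.etree.ElementTree as ET

# The <disk type='nvme'> example from the commit message above.
DISK_XML = """
<disk type='nvme' device='disk'>
  <driver name='qemu' type='raw'/>
  <source type='pci' managed='yes' namespace='1'>
    <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
  </source>
  <target dev='vda' bus='virtio'/>
</disk>
"""


def parse_nvme_source(disk_xml):
    """Hypothetical helper: extract NVMe source details from a disk element."""
    disk = ET.fromstring(disk_xml)
    if disk.get("type") != "nvme":
        raise ValueError("not an NVMe disk")

    source = disk.find("source")
    if source is None or source.get("type") != "pci":
        raise ValueError("only type='pci' sources are accepted")

    addr = source.find("address")
    if addr is None:
        raise ValueError("missing <address> sub-element")

    # Namespace IDs start from 1, as the proposed documentation states.
    namespace = int(source.get("namespace", "0"))
    if namespace < 1:
        raise ValueError("namespace must be >= 1")

    # Attributes are hex strings such as '0x01'; int(..., 16) accepts the prefix.
    parts = [int(addr.attrib[k], 16) for k in ("domain", "bus", "slot", "function")]
    pci = "{:04x}:{:02x}:{:02x}.{:x}".format(*parts)

    return {
        "pci": pci,
        "namespace": namespace,
        "managed": source.get("managed") == "yes",
    }


if __name__ == "__main__":
    print(parse_nvme_source(DISK_XML))
```

The dictionary layout is just a convenience for the example; the real parser in libvirt would fill its own structures.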
> Signed-off-by: Michal Privoznik <mprivozn(a)redhat.com>
> ---
> docs/formatdomain.html.in | 57 +++++++++++++++++++++++--
> docs/schemas/domaincommon.rng | 32 ++++++++++++++
> tests/qemuxml2argvdata/disk-nvme.xml | 63 ++++++++++++++++++++++++++++
> 3 files changed, 149 insertions(+), 3 deletions(-)
> create mode 100644 tests/qemuxml2argvdata/disk-nvme.xml
>
> diff --git a/docs/formatdomain.html.in b/docs/formatdomain.html.in
> index 6df4a8b26e..fe871d933f 100644
> --- a/docs/formatdomain.html.in
> +++ b/docs/formatdomain.html.in
> @@ -2944,6 +2944,13 @@
> </backingStore>
> <target dev='vdd' bus='virtio'/>
> </disk>
> + <disk type='nvme' device='disk'>
> + <driver name='qemu' type='raw'/>
> + <source type='pci' managed='yes' namespace='1'>
> + <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
> + </source>
> + <target dev='vde' bus='virtio'/>
> + </disk>
> </devices>
> ...</pre>
>
> @@ -2957,7 +2964,8 @@
> Valid values are "file", "block",
> "dir" (<span class="since">since 0.7.5</span>),
> "network" (<span class="since">since 0.8.7</span>), or
> - "volume" (<span class="since">since 1.0.5</span>)
> + "volume" (<span class="since">since 1.0.5</span>), or
> + "nvme" (<span class="since">since 5.6.0</span>)
6.0.0 or whatever version this will land in
> and refer to the underlying source for the disk.
> <span class="since">Since 0.0.3</span>
> </dd>
> @@ -3140,6 +3148,43 @@
> <span class="since">Since 1.0.5</span>
> </p>
> </dd>
> + <dt><code>nvme</code></dt>
> + <dd>
> + To specify disk source for NVMe disk the <code>source</code>
> + element has the following attributes:
> + <dl>
> + <dt><code>type</code></dt>
> + <dd>The type of address specified in <code>address</code>
> + sub-element. Currently, only <code>pci</code> value is
> + accepted.
> + </dd>
> +
> + <dt><code>managed</code></dt>
> + <dd>This attribute instructs libvirt to detach NVMe
> + controller automatically on domain startup (<code>yes</code>)
> + or expect the controller to be detached by system
> + administrator (<code>no</code>).
> + </dd>
> +
> + <dt><code>namespace</code></dt>
> + <dd>The namespace ID which should be assigned to the domain.
> + According to NVMe standard, namespace numbers start from 1,
> + including.
> + </dd>
> + </dl>
> +
> + The difference between <code>&lt;disk type='nvme'&gt;</code>
> + and <code>&lt;hostdev/&gt;</code> is that the latter is plain
> + host device assignment with all its limitations (e.g. no live
> + migration), while the former makes hypervisor to run the NVMe
> + disk through hypervisor's block layer thus enabling all
> + features provided by the layer (e.g. snapshots, domain
> + migration, etc.). Moreover, since the NVMe disk is unbinded
> + from its PCI driver, the host kernel storage stack is not
> + involved (compared to passing say <code>/dev/nvme0n1</code> via
> + <code>&lt;disk type='block'&gt;</code> and therefore lower
> + latencies can be achieved.
> + </dd>
> </dl>
> With "file", "block", and "volume", one or more optional
> sub-elements <code>seclabel</code>, <a href="#seclabel">described
> @@ -3302,11 +3347,17 @@
> initiator IQN needed to access the source via mandatory
> attribute <code>name</code>.
> </dd>
> + <dt><code>address</code></dt>
> + <dd>For disk of type <code>nvme</code> this element
> + specifies the PCI address of the host NVMe
> + controller.
> + <span class="since">Since 5.6.0</span>
Same
> + </dd>
> </dl>
>
> <p>
> - For a "file" or "volume" disk type which represents a cdrom or floppy
> - (the <code>device</code> attribute), it is possible to define
> + For a "file" or "volume" disk type which represents a cdrom or
> + floppy (the <code>device</code> attribute), it is possible to define
Stray change?
Oh right. I've realigned this area when adding the address description.
But this change does not belong here.
Also, in the test XML you need to "s/qemu-system-i686/qemu-system-i386/"
or you'll hit a weird error. And VIR_TEST_REGENERATE_OUTPUT is also
busted, see my patches elsewhere on this list.
Yeah, I've noticed Dan posted patches after these. I've fixed that
locally but never replied to this patch. Sorry.
Reviewed-by: Cole Robinson <crobinso(a)redhat.com>
Thanks,
Michal