On 12/2/19 3:26 PM, Michal Privoznik wrote:
There is this class of PCI devices that act like disks: NVMe.
Therefore, they are both PCI devices and disks. While we already
have <hostdev/> (and can assign an NVMe device to a domain
successfully), we don't have a disk representation. There are three
problems with PCI assignment in the case of an NVMe device (a plain
<hostdev/> example is sketched after the list for contrast):
1) domains with <hostdev/> can't be migrated
2) the NVMe device is assigned as a whole; there's no way to assign
   only a namespace
3) because hypervisors see only a <hostdev/>, they don't put a block
   layer on top of it, so users don't get all the fancy features like
   snapshots
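
For reference, this is roughly how such a device is assigned today via
plain PCI passthrough (the PCI address below is illustrative):

  <hostdev mode='subsystem' type='pci' managed='yes'>
    <source>
      <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
    </source>
  </hostdev>

The whole controller goes to the guest, with no way to pick a namespace
and no block layer in between.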
NVMe namespaces are a way of splitting one continuous NVMe storage
area into smaller pieces, effectively creating smaller NVMe devices
(which can then be partitioned, LVMed, etc.).
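
To make that concrete, a controller with two namespaces typically shows
up on a Linux host like this (device names are illustrative):

  /dev/nvme0     <- the controller (character device)
  /dev/nvme0n1   <- namespace 1 (block device)
  /dev/nvme0n2   <- namespace 2 (block device)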
Because of all of this, the following XML was chosen to model an
NVMe device:
  <disk type='nvme' device='disk'>
    <driver name='qemu' type='raw'/>
    <source type='pci' managed='yes' namespace='1'>
      <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
    </source>
    <target dev='vda' bus='virtio'/>
  </disk>
Last week I discussed this on IRC with Dan and Maxim (both CC'ed), and
there was a suggestion to accept a /dev/nvmeXXX path instead of a PCI
address. The reasoning was that there is a tool Maxim wrote (alas, not
merged into qemu/kvm yet) that acts as a standalone daemon which does
the VFIO magic and then serves the qemus connecting to it (this allows
an NVMe disk to be shared between multiple qemus, which is currently
not allowed due to a VFIO restriction). And if we accepted /dev/nvmeXXX
here, we could change the backend less invasively - we could use either
qemu's -drive nvme://XXXX or the new tool.
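
For reference, qemu's built-in NVMe block driver already takes a PCI
address plus namespace; quoting the syntax from memory, with an
illustrative address and ids, it looks roughly like:

  -drive file=nvme://0000:01:00.0/1,if=none,id=drive0,format=raw
  -device virtio-blk-pci,drive=drive0

or, in -blockdev terms, something like:

  -blockdev driver=nvme,device=0000:01:00.0,namespace=1,node-name=nvmedrv0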
On the other hand, /dev/nvmeXXX (even though it may be a bit more user
friendly) wouldn't work if the host kernel doesn't have the NVMe driver
or if the disk is already detached from it. The PCI address, as I have
it here, doesn't have that problem.
Note that sysfs offers the translation both ways, [PCI address,
namespace] <-> /dev/nvmeXXX, so that shouldn't be a limitation.
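
For illustration (hypothetical paths; controller and namespace numbers
will differ per host):

  # namespace block device -> PCI address (embedded in the sysfs path)
  $ readlink -f /sys/block/nvme0n1
  /sys/devices/pci0000:00/0000:00:1c.4/0000:01:00.0/nvme/nvme0/nvme0n1

  # PCI address -> controller (and from there its nvmeXnY namespaces)
  $ ls /sys/bus/pci/devices/0000:01:00.0/nvme/
  nvme0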
Thoughts?
Michal