On 12/2/19 3:26 PM, Michal Privoznik wrote:
There is this class of PCI devices that act like disks: NVMe.
Therefore, they are both PCI devices and disks. While we already
have <hostdev/> (and can assign an NVMe device to a domain
successfully), we don't have a disk representation. There are three
problems with PCI assignment in the case of an NVMe device (a plain
<hostdev/> example is sketched after the list for contrast):
1) domains with <hostdev/> can't be migrated
2) the NVMe device is assigned as a whole; there's no way to assign
   only a namespace
3) because hypervisors see only a <hostdev/>, they don't put a block
   layer on top of it, so users don't get all the fancy features like
   snapshots
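
For reference, this is roughly how such a device is assigned today via
plain PCI passthrough (the PCI address below is illustrative):

  <hostdev mode='subsystem' type='pci' managed='yes'>
    <source>
      <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
    </source>
  </hostdev>

The whole controller goes to the guest, with no way to pick a namespace
and no block layer in between.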
NVMe namespaces are a way of splitting one continuous NVMe storage
area into smaller pieces, effectively creating smaller NVMe devices
(which can then be partitioned, LVMed, etc.).
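
To make that concrete, a controller with two namespaces typically shows
up on a Linux host like this (device names are illustrative):

  /dev/nvme0     <- the controller (character device)
  /dev/nvme0n1   <- namespace 1 (block device)
  /dev/nvme0n2   <- namespace 2 (block device)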
Because of all of this, the following XML was chosen to model an
NVMe device:
  <disk type='nvme' device='disk'>
    <driver name='qemu' type='raw'/>
    <source type='pci' managed='yes' namespace='1'>
      <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
    </source>
    <target dev='vda' bus='virtio'/>
  </disk>
Last week I discussed this on IRC with Dan and Maxim (both CC'ed), and
there was a suggestion to accept a /dev/nvmeXXX path instead of a PCI
address. The reasoning was that there is a tool Maxim wrote (alas, not
merged into qemu/kvm yet) that acts as a standalone daemon which does
the VFIO magic and then serves the qemus connecting to it (this allows
an NVMe disk to be shared between multiple qemus, which is currently
not allowed due to a VFIO restriction). And if we accepted /dev/nvmeXXX
here, we could change the backend less invasively - we could use either
qemu's -drive nvme://XXXX or the new tool.
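
For reference, qemu's built-in NVMe block driver already takes a PCI
address plus namespace; quoting the syntax from memory, with an
illustrative address and ids, it looks roughly like:

  -drive file=nvme://0000:01:00.0/1,if=none,id=drive0,format=raw
  -device virtio-blk-pci,drive=drive0

or, in -blockdev terms, something like:

  -blockdev driver=nvme,device=0000:01:00.0,namespace=1,node-name=nvmedrv0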
On the other hand, /dev/nvmeXXX (even though it may be a bit more user
friendly) wouldn't work if the host kernel doesn't have the NVMe driver
or if the disk is already detached from it. The PCI address, as I have
it here, doesn't have that problem.
Note that sysfs offers the translation both ways, [PCI address,
namespace] <-> /dev/nvmeXXX, so that shouldn't be a limitation.
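
For illustration (hypothetical paths; controller and namespace numbers
will differ per host):

  # namespace block device -> PCI address (embedded in the sysfs path)
  $ readlink -f /sys/block/nvme0n1
  /sys/devices/pci0000:00/0000:00:1c.4/0000:01:00.0/nvme/nvme0/nvme0n1

  # PCI address -> controller (and from there its nvmeXnY namespaces)
  $ ls /sys/bus/pci/devices/0000:01:00.0/nvme/
  nvme0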
Thoughts?
Michal