[libvirt] [PATCH v1 00/31] Introduce NVMe support

These patches introduce support for NVMe disks into libvirt. Note that even without them it is possible to use NVMe disks for your domains in two ways:

1) <hostdev/> - This is regular PCI assignment with all the drawbacks (no migration, no snapshots, ...)

2) <disk/> - Since NVMe disks are accessible via /dev/nvme* they can be assigned to domains. The problem is that because qemu accesses /dev/nvme*, the host kernel's storage stack is involved, which adds significant latency [1].

The solution to this problem is to combine 1) and 2):

- Bypass the host kernel's storage stack by detaching the NVMe disk from the host (and attaching it to the VFIO driver), and

- Plug the NVMe disk into qemu's block layer so that all the fancy features can be supported.

On the qemu command line this is done via:

  -drive file.driver=nvme,file.device=0000:01:00.0,file.namespace=1,format=raw,\
  if=none,id=drive-virtio-disk0 \
  -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,\
  id=virtio-disk0,bootindex=1 \

You can also find my patches on my github [2].

1: https://www.linux-kvm.org/images/4/4c/Userspace_NVMe_driver_in_QEMU_-_Fam_Zh...
2: https://github.com/zippy2/libvirt/commits/nvme

Michal Prívozník (31):
  virHostdevPreparePCIDevices: Separate out function body
  virHostdevReAttachPCIDevices: Separate out function body
  virpcimock: Move actions checking one level up
  Revert "virpcitest: Test virPCIDeviceDetach failure"
  virpcimock: Create driver_override file in device dirs
  virPCIDeviceAddressEqual: Fix const correctness
  virPCIDeviceAddressAsString: Fix const correctness
  virpci: Introduce virPCIDeviceAddressCopy
  qemuDomainDeviceDefValidateDisk: Reorder some checks
  schemas: Introduce disk type NVMe
  conf: Format and parse NVMe type disk
  util: Introduce virNVMeDevice module
  virhostdev: Include virNVMeDevice module
  virhostdevtest: Don't proceed to test cases if init failed
  virhostdevtest: s/CHECK_LIST_COUNT/CHECK_PCI_LIST_COUNT/
  virpcimock: Introduce NVMe driver and devices
  virhostdevtest: Test virNVMeDevice assignment
  qemu: prepare NVMe devices too
  qemu: Take NVMe disks into account when calculating memlock limit
  virstoragefile: Introduce virStorageSourceChainHasNVMe
  domain_conf: Introduce virDomainDefHasNVMeDisk
  qemu_domain: Separate VFIO code
  qemu_domain: Introduce NVMe path getting helpers
  qemu: Create NVMe disk in domain namespace
  qemu: Allow NVMe disk in CGroups
  security_selinux: Simplify virSecuritySELinuxSetImageLabelInternal
  virSecuritySELinuxRestoreImageLabelInt: Don't skip non-local storage
  qemu_capabilities: Introduce QEMU_CAPS_DRIVE_NVME
  qemu: Generate command line of NVMe disks
  qemu: Don't leak storage perms on failure in qemuDomainAttachDiskGeneric
  qemu_hotplug: Prepare NVMe disks on hotplug

 docs/formatdomain.html.in | 45 +-
 docs/schemas/domaincommon.rng | 32 ++
 src/conf/domain_conf.c | 160 +++++++
 src/conf/domain_conf.h | 6 +
 src/libvirt_private.syms | 26 ++
 src/qemu/qemu_block.c | 24 +
 src/qemu/qemu_capabilities.c | 4 +
 src/qemu/qemu_capabilities.h | 3 +
 src/qemu/qemu_cgroup.c | 59 ++-
 src/qemu/qemu_command.c | 4 +
 src/qemu/qemu_domain.c | 115 ++++-
 src/qemu/qemu_domain.h | 6 +
 src/qemu/qemu_driver.c | 4 +
 src/qemu/qemu_hostdev.c | 49 ++-
 src/qemu/qemu_hostdev.h | 10 +
 src/qemu/qemu_hotplug.c | 76 +++-
 src/qemu/qemu_migration.c | 1 +
 src/qemu/qemu_process.c | 7 +
 src/security/security_dac.c | 38 ++
 src/security/security_selinux.c | 95 ++--
 src/util/Makefile.inc.am | 2 +
 src/util/virhostdev.c | 350 +++++++++++++--
 src/util/virhostdev.h | 25 ++
 src/util/virnvme.c | 412 ++++++++++++++++++
 src/util/virnvme.h | 89 ++++
 src/util/virpci.c | 12 +-
 src/util/virpci.h | 8 +-
 src/util/virstoragefile.c | 73 ++++
 src/util/virstoragefile.h | 17 +
 src/xenconfig/xen_xl.c | 1 +
 .../caps_2.12.0.aarch64.xml | 1 +
 .../caps_2.12.0.ppc64.xml | 1 +
 .../caps_2.12.0.s390x.xml | 1 +
 .../caps_2.12.0.x86_64.xml | 1 +
 .../qemucapabilitiesdata/caps_3.0.0.ppc64.xml | 1 +
 .../caps_3.0.0.riscv32.xml | 1 +
 .../caps_3.0.0.riscv64.xml | 1 +
 .../qemucapabilitiesdata/caps_3.0.0.s390x.xml | 1 +
 .../caps_3.0.0.x86_64.xml | 1 +
 .../qemucapabilitiesdata/caps_3.1.0.ppc64.xml | 1 +
 .../caps_3.1.0.x86_64.xml | 1 +
 .../caps_4.0.0.aarch64.xml | 1 +
 .../qemucapabilitiesdata/caps_4.0.0.ppc64.xml | 1 +
 .../caps_4.0.0.riscv32.xml | 1 +
 .../caps_4.0.0.riscv64.xml | 1 +
 .../qemucapabilitiesdata/caps_4.0.0.s390x.xml | 1 +
 .../caps_4.0.0.x86_64.xml | 1 +
 .../caps_4.1.0.x86_64.xml | 1 +
 .../disk-nvme.x86_64-latest.args | 52 +++
 tests/qemuxml2argvdata/disk-nvme.xml | 63 +++
 tests/qemuxml2argvtest.c | 1 +
 tests/qemuxml2xmloutdata/disk-nvme.xml | 1 +
 tests/qemuxml2xmltest.c | 1 +
 tests/virhostdevtest.c | 185 ++++++--
 tests/virpcimock.c | 76 +++-
 tests/virpcitest.c | 32 --
 tests/virpcitestdata/0000-01-00.0.config | Bin 0 -> 4096 bytes
 tests/virpcitestdata/0000-02-00.0.config | Bin 0 -> 4096 bytes
 58 files changed, 1978 insertions(+), 204 deletions(-)
 create mode 100644 src/util/virnvme.c
 create mode 100644 src/util/virnvme.h
 create mode 100644 tests/qemuxml2argvdata/disk-nvme.x86_64-latest.args
 create mode 100644 tests/qemuxml2argvdata/disk-nvme.xml
 create mode 120000 tests/qemuxml2xmloutdata/disk-nvme.xml
 create mode 100644 tests/virpcitestdata/0000-01-00.0.config
 create mode 100644 tests/virpcitestdata/0000-02-00.0.config

--
2.21.0
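To make the -drive syntax from the cover letter a bit more concrete, here is a minimal sketch in C of how such an argument could be assembled from a PCI address and a namespace ID. This is not code from the series: format_nvme_drive() and all of its parameters are hypothetical, and only the option names (file.driver, file.device, file.namespace, format, if, id) are taken from the command line quoted in the cover letter.

```c
#include <stdio.h>

/* Hypothetical helper, not part of libvirt: format the value of a
 * "-drive" option for an NVMe controller at the given PCI address.
 * Returns 0 on success, -1 if @buf is too small. */
int
format_nvme_drive(char *buf, size_t buflen,
                  unsigned int domain, unsigned int bus,
                  unsigned int slot, unsigned int function,
                  unsigned int nsid, const char *id)
{
    int n = snprintf(buf, buflen,
                     "file.driver=nvme,"
                     "file.device=%04x:%02x:%02x.%x,"
                     "file.namespace=%u,format=raw,if=none,id=%s",
                     domain, bus, slot, function, nsid, id);

    /* snprintf() reports the length it wanted to write, so a value
     * that does not fit into @buf means the output was truncated. */
    return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}
```

Called with domain 0, bus 1, slot 0, function 0, namespace 1 and the ID "drive-virtio-disk0", this yields the same -drive value as in the cover letter (without the line continuations).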

In near future we will have a list of PCI devices we want to detach (held in virPCIDeviceListPtr) but we don't have virDomainHostdevDefPtr. That's okay because virHostdevPreparePCIDevices() works with virPCIDeviceListPtr mostly anyway. And the very few places where it needs virDomainHostdevDefPtr are not interesting for our case.

Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
 src/util/virhostdev.c | 48 ++++++++++++++++++++++++++++---------------
 1 file changed, 32 insertions(+), 16 deletions(-)

diff --git a/src/util/virhostdev.c b/src/util/virhostdev.c
index a3647a6cf4..88b0828675 100644
--- a/src/util/virhostdev.c
+++ b/src/util/virhostdev.c
@@ -613,27 +613,22 @@ virHostdevRestoreNetConfig(virDomainHostdevDefPtr hostdev,
     }
 }

-int
-virHostdevPreparePCIDevices(virHostdevManagerPtr mgr,
-                            const char *drv_name,
-                            const char *dom_name,
-                            const unsigned char *uuid,
-                            virDomainHostdevDefPtr *hostdevs,
-                            int nhostdevs,
-                            unsigned int flags)
+
+static int
+virHostdevPreparePCIDevicesImpl(virHostdevManagerPtr mgr,
+                                const char *drv_name,
+                                const char *dom_name,
+                                const unsigned char *uuid,
+                                virPCIDeviceListPtr pcidevs,
+                                virDomainHostdevDefPtr *hostdevs,
+                                int nhostdevs,
+                                unsigned int flags)
 {
-    VIR_AUTOUNREF(virPCIDeviceListPtr) pcidevs = NULL;
     int last_processed_hostdev_vf = -1;
     size_t i;
     int ret = -1;
     virPCIDeviceAddressPtr devAddr = NULL;

-    if (!nhostdevs)
-        return 0;
-
-    if (!(pcidevs = virHostdevGetPCIHostDeviceList(hostdevs, nhostdevs)))
-        return -1;
-
     virObjectLock(mgr->activePCIHostdevs);
     virObjectLock(mgr->inactivePCIHostdevs);

@@ -906,10 +901,31 @@ virHostdevPreparePCIDevices(virHostdevManagerPtr mgr,
  cleanup:
     virObjectUnlock(mgr->activePCIHostdevs);
     virObjectUnlock(mgr->inactivePCIHostdevs);
-
     return ret;
 }

+
+int
+virHostdevPreparePCIDevices(virHostdevManagerPtr mgr,
+                            const char *drv_name,
+                            const char *dom_name,
+                            const unsigned char *uuid,
+                            virDomainHostdevDefPtr *hostdevs,
+                            int nhostdevs,
+                            unsigned int flags)
+{
+    VIR_AUTOUNREF(virPCIDeviceListPtr) pcidevs = NULL;
+
+    if (!nhostdevs)
+        return 0;
+
+    if (!(pcidevs = virHostdevGetPCIHostDeviceList(hostdevs, nhostdevs)))
+        return -1;
+
+    return virHostdevPreparePCIDevicesImpl(mgr, drv_name, dom_name, uuid,
+                                           pcidevs, hostdevs, nhostdevs, flags);
+}
+
 /*
  * Pre-condition: inactivePCIHostdevs & activePCIHostdevs
  * are locked
--
2.21.0
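The shape of this refactoring can be shown with a stripped-down sketch. None of the names below are libvirt APIs: struct dev_list, prepare_devices_impl(), prepare_hostdevs() and prepare_from_list() are hypothetical stand-ins, and the "work" is reduced to counting devices. The point is only that once the body takes a ready-made list, a second caller that never had hostdev definitions can reuse it.

```c
#include <stddef.h>

/* Hypothetical stand-in for a PCI device list; not a libvirt type. */
struct dev_list {
    const char **addrs;
    size_t naddrs;
};

/* Shared body: it needs only the device list. For illustration it just
 * returns the number of devices it would detach. */
static int
prepare_devices_impl(const struct dev_list *devs)
{
    return (int)devs->naddrs;
}

/* Existing entry point: builds the list from hostdev definitions
 * (modelled here as plain address strings) and then delegates. */
int
prepare_hostdevs(const char **hostdev_addrs, size_t nhostdevs)
{
    struct dev_list devs = { hostdev_addrs, nhostdevs };

    if (!nhostdevs)
        return 0;

    return prepare_devices_impl(&devs);
}

/* A later caller that already holds a list (for instance one built
 * from NVMe disks) can skip the hostdev step entirely. */
int
prepare_from_list(const struct dev_list *devs)
{
    return prepare_devices_impl(devs);
}
```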

On Thu, Jul 11, 2019 at 17:53:48 +0200, Michal Privoznik wrote:
In near future we will have a list of PCI devices we want to detach (held in virPCIDeviceListPtr) but we don't have virDomainHostdevDefPtr. That's okay because virHostdevPreparePCIDevices() works with virPCIDeviceListPtr mostly anyway. And the very few places where it needs virDomainHostdevDefPtr are not interesting for our case.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/util/virhostdev.c | 48 ++++++++++++++++++++++++++++--------------- 1 file changed, 32 insertions(+), 16 deletions(-)
diff --git a/src/util/virhostdev.c b/src/util/virhostdev.c index a3647a6cf4..88b0828675 100644 --- a/src/util/virhostdev.c +++ b/src/util/virhostdev.c @@ -613,27 +613,22 @@ virHostdevRestoreNetConfig(virDomainHostdevDefPtr hostdev, } }
-int -virHostdevPreparePCIDevices(virHostdevManagerPtr mgr, - const char *drv_name, - const char *dom_name, - const unsigned char *uuid, - virDomainHostdevDefPtr *hostdevs, - int nhostdevs, - unsigned int flags) + +static int +virHostdevPreparePCIDevicesImpl(virHostdevManagerPtr mgr, + const char *drv_name, + const char *dom_name, + const unsigned char *uuid, + virPCIDeviceListPtr pcidevs, + virDomainHostdevDefPtr *hostdevs, + int nhostdevs, + unsigned int flags) { - VIR_AUTOUNREF(virPCIDeviceListPtr) pcidevs = NULL; int last_processed_hostdev_vf = -1; size_t i; int ret = -1; virPCIDeviceAddressPtr devAddr = NULL;
- if (!nhostdevs) - return 0; - - if (!(pcidevs = virHostdevGetPCIHostDeviceList(hostdevs, nhostdevs))) - return -1; - virObjectLock(mgr->activePCIHostdevs); virObjectLock(mgr->inactivePCIHostdevs);
@@ -906,10 +901,31 @@ virHostdevPreparePCIDevices(virHostdevManagerPtr mgr, cleanup: virObjectUnlock(mgr->activePCIHostdevs); virObjectUnlock(mgr->inactivePCIHostdevs); - return ret;
Spurious whitespace change.
}
+ +int +virHostdevPreparePCIDevices(virHostdevManagerPtr mgr, + const char *drv_name, + const char *dom_name, + const unsigned char *uuid, + virDomainHostdevDefPtr *hostdevs, + int nhostdevs, + unsigned int flags) +{ + VIR_AUTOUNREF(virPCIDeviceListPtr) pcidevs = NULL; + + if (!nhostdevs) + return 0; + + if (!(pcidevs = virHostdevGetPCIHostDeviceList(hostdevs, nhostdevs))) + return -1; + + return virHostdevPreparePCIDevicesImpl(mgr, drv_name, dom_name, uuid, + pcidevs, hostdevs, nhostdevs, flags); +} +
Two empty lines please, similarly to what you are adding around the new funcs. ACK

In near future we will have a list of PCI devices we want to re-attach to the host (held in virPCIDeviceListPtr) but we don't have virDomainHostdevDefPtr. That's okay because virHostdevReAttachPCIDevices() works with virPCIDeviceListPtr mostly anyway. And the very few places where it needs virDomainHostdevDefPtr are not interesting for our case.

Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
 src/util/virhostdev.c | 58 +++++++++++++++++++++++++++----------------
 1 file changed, 37 insertions(+), 21 deletions(-)

diff --git a/src/util/virhostdev.c b/src/util/virhostdev.c
index 88b0828675..07397b9682 100644
--- a/src/util/virhostdev.c
+++ b/src/util/virhostdev.c
@@ -953,30 +953,18 @@ virHostdevReattachPCIDevice(virHostdevManagerPtr mgr,
     }
 }

-/* @oldStateDir:
- * For upgrade purpose: see virHostdevRestoreNetConfig
- */
-void
-virHostdevReAttachPCIDevices(virHostdevManagerPtr mgr,
-                             const char *drv_name,
-                             const char *dom_name,
-                             virDomainHostdevDefPtr *hostdevs,
-                             int nhostdevs,
-                             const char *oldStateDir)
+
+static void
+virHostdevReAttachPCIDevicesImpl(virHostdevManagerPtr mgr,
+                                 const char *drv_name,
+                                 const char *dom_name,
+                                 virPCIDeviceListPtr pcidevs,
+                                 virDomainHostdevDefPtr *hostdevs,
+                                 int nhostdevs,
+                                 const char *oldStateDir)
 {
-    VIR_AUTOUNREF(virPCIDeviceListPtr) pcidevs = NULL;
     size_t i;

-    if (!nhostdevs)
-        return;
-
-    if (!(pcidevs = virHostdevGetPCIHostDeviceList(hostdevs, nhostdevs))) {
-        VIR_ERROR(_("Failed to allocate PCI device list: %s"),
-                  virGetLastErrorMessage());
-        virResetLastError();
-        return;
-    }
-
     virObjectLock(mgr->activePCIHostdevs);
     virObjectLock(mgr->inactivePCIHostdevs);

@@ -1100,6 +1088,34 @@ virHostdevReAttachPCIDevices(virHostdevManagerPtr mgr,
     virObjectUnref(pcidevs);
 }

+
+/* @oldStateDir:
+ * For upgrade purpose: see virHostdevRestoreNetConfig
+ */
+void
+virHostdevReAttachPCIDevices(virHostdevManagerPtr mgr,
+                             const char *drv_name,
+                             const char *dom_name,
+                             virDomainHostdevDefPtr *hostdevs,
+                             int nhostdevs,
+                             const char *oldStateDir)
+{
+    VIR_AUTOUNREF(virPCIDeviceListPtr) pcidevs = NULL;
+
+    if (!nhostdevs)
+        return;
+
+    if (!(pcidevs = virHostdevGetPCIHostDeviceList(hostdevs, nhostdevs))) {
+        VIR_ERROR(_("Failed to allocate PCI device list: %s"),
+                  virGetLastErrorMessage());
+        virResetLastError();
+        return;
+    }
+
+    virHostdevReAttachPCIDevicesImpl(mgr, drv_name, dom_name, pcidevs,
+                                     hostdevs, nhostdevs, oldStateDir);
+}
+
 int
 virHostdevUpdateActivePCIDevices(virHostdevManagerPtr mgr,
                                  virDomainHostdevDefPtr *hostdevs,
--
2.21.0

On Thu, Jul 11, 2019 at 17:53:49 +0200, Michal Privoznik wrote:
In near future we will have a list of PCI devices we want to re-attach to the host (held in virPCIDeviceListPtr) but we don't have virDomainHostdevDefPtr. That's okay because virHostdevReAttachPCIDevices() works with virPCIDeviceListPtr mostly anyway. And the very few places where it needs virDomainHostdevDefPtr are not interesting for our case.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/util/virhostdev.c | 58 +++++++++++++++++++++++++++---------------- 1 file changed, 37 insertions(+), 21 deletions(-)
ACK

The pci_driver_bind() and pci_driver_unbind() functions are "internal implementation", meaning other parts of the code should be able to call them and get the job done. Checking for actions (PCI_ACTION_BIND and PCI_ACTION_UNBIND) should be done in handlers (pci_driver_handle_bind() and pci_driver_handle_unbind()). Surprisingly, the other two actions (PCI_ACTION_NEW_ID and PCI_ACTION_REMOVE_ID) are checked already at this level.

Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
 tests/virpcimock.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/tests/virpcimock.c b/tests/virpcimock.c
index beb5e1490d..6865f992dc 100644
--- a/tests/virpcimock.c
+++ b/tests/virpcimock.c
@@ -551,8 +551,8 @@ pci_driver_bind(struct pciDriver *driver,
     int ret = -1;
     char *devpath = NULL, *driverpath = NULL;

-    if (dev->driver || PCI_ACTION_BIND & driver->fail) {
-        /* Device already bound or failing driver requested */
+    if (dev->driver) {
+        /* Device already bound */
         errno = ENODEV;
         return ret;
     }
@@ -598,8 +598,8 @@ pci_driver_unbind(struct pciDriver *driver,
     int ret = -1;
     char *devpath = NULL, *driverpath = NULL;

-    if (dev->driver != driver || PCI_ACTION_UNBIND & driver->fail) {
-        /* Device not bound to the @driver or failing driver used */
+    if (dev->driver != driver) {
+        /* Device not bound to the @driver */
         errno = ENODEV;
         return ret;
     }
@@ -669,8 +669,8 @@ pci_driver_handle_bind(const char *path)
     struct pciDevice *dev = pci_device_find_by_content(path);
     struct pciDriver *driver = pci_driver_find_by_path(path);

-    if (!driver || !dev) {
-        /* This should never happen (TM) */
+    if (!driver || !dev || PCI_ACTION_BIND & driver->fail) {
+        /* No driver, no device or failing driver requested */
         errno = ENODEV;
         goto cleanup;
     }
@@ -686,8 +686,8 @@ pci_driver_handle_unbind(const char *path)
     int ret = -1;
     struct pciDevice *dev = pci_device_find_by_content(path);

-    if (!dev || !dev->driver) {
-        /* This should never happen (TM) */
+    if (!dev || !dev->driver || PCI_ACTION_UNBIND & dev->driver->fail) {
+        /* No device, device not binded or failing driver requested */
         errno = ENODEV;
         goto cleanup;
     }
--
2.21.0
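The split this patch is after (failure injection in the entry point, plain state checks in the helper) can be illustrated with a much smaller model. Everything below is a hypothetical simplification and not the real virpcimock: struct mock_driver, struct mock_device, mock_bind() and mock_handle_bind() are invented names, and the real handlers take a sysfs path rather than pointers.

```c
#include <errno.h>
#include <stddef.h>

/* Simplified, hypothetical model of the mock; not the real virpcimock. */
enum {
    ACTION_BIND = 1 << 0,
    ACTION_UNBIND = 1 << 1,
};

struct mock_driver {
    const char *name;
    int fail;                    /* bitmask of actions that must fail */
};

struct mock_device {
    struct mock_driver *driver;  /* NULL when the device is unbound */
};

/* Helper: performs the bind and checks only the device state, so other
 * parts of the mock can call it regardless of injected failures. */
int
mock_bind(struct mock_driver *driver, struct mock_device *dev)
{
    if (dev->driver) {
        /* Device already bound */
        errno = ENODEV;
        return -1;
    }

    dev->driver = driver;
    return 0;
}

/* Entry point for a write to ".../bind": the injected failure is
 * evaluated here and nowhere else. */
int
mock_handle_bind(struct mock_driver *driver, struct mock_device *dev)
{
    if (!driver || !dev || (driver->fail & ACTION_BIND)) {
        errno = ENODEV;
        return -1;
    }

    return mock_bind(driver, dev);
}
```

With a driver whose fail mask contains ACTION_BIND, mock_handle_bind() refuses the request while a direct mock_bind() call (as an autobind-style helper would make) still succeeds.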

On Thu, Jul 11, 2019 at 17:53:50 +0200, Michal Privoznik wrote:
The pci_driver_bind() and pci_driver_unbind() functions are "internal implementation", meaning other parts of the code should be able to call them and get the job done. Checking for actions (PCI_ACTION_BIND and PCI_ACTION_UNBIND) should be done in handlers (pci_driver_handle_bind() and pci_driver_handle_unbind()). Surprisingly, the other two actions (PCI_ACTION_NEW_ID and PCI_ACTION_REMOVE_ID) are checked already at this level.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- tests/virpcimock.c | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/tests/virpcimock.c b/tests/virpcimock.c index beb5e1490d..6865f992dc 100644 --- a/tests/virpcimock.c +++ b/tests/virpcimock.c @@ -551,8 +551,8 @@ pci_driver_bind(struct pciDriver *driver, int ret = -1; char *devpath = NULL, *driverpath = NULL;
- if (dev->driver || PCI_ACTION_BIND & driver->fail) { - /* Device already bound or failing driver requested */ + if (dev->driver) { + /* Device already bound */ errno = ENODEV; return ret; }
So this function ...
@@ -669,8 +669,8 @@ pci_driver_handle_bind(const char *path) struct pciDevice *dev = pci_device_find_by_content(path); struct pciDriver *driver = pci_driver_find_by_path(path);
- if (!driver || !dev) { - /* This should never happen (TM) */ + if (!driver || !dev || PCI_ACTION_BIND & driver->fail) { + /* No driver, no device or failing driver requested */ errno = ENODEV; goto cleanup; }
... is called here, which you fix, but also in pci_device_autobind and pci_driver_handle_new_id which are not fixed by this commit. I don't quite understand deeply what this is supposed to do, so I don't know what's supposed to happen in that case, but this seems suspicious to me. Please try explaining/justifying why the two other call paths are not changed. Also I did not bother checking the unbind code for the same problem.

On 7/16/19 2:03 PM, Peter Krempa wrote:
On Thu, Jul 11, 2019 at 17:53:50 +0200, Michal Privoznik wrote:
The pci_driver_bind() and pci_driver_unbind() functions are "internal implementation", meaning other parts of the code should be able to call them and get the job done. Checking for actions (PCI_ACTION_BIND and PCI_ACTION_UNBIND) should be done in handlers (pci_driver_handle_bind() and pci_driver_handle_unbind()). Surprisingly, the other two actions (PCI_ACTION_NEW_ID and PCI_ACTION_REMOVE_ID) are checked already at this level.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- tests/virpcimock.c | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/tests/virpcimock.c b/tests/virpcimock.c index beb5e1490d..6865f992dc 100644 --- a/tests/virpcimock.c +++ b/tests/virpcimock.c @@ -551,8 +551,8 @@ pci_driver_bind(struct pciDriver *driver, int ret = -1; char *devpath = NULL, *driverpath = NULL;
- if (dev->driver || PCI_ACTION_BIND & driver->fail) { - /* Device already bound or failing driver requested */ + if (dev->driver) { + /* Device already bound */ errno = ENODEV; return ret; }
So this function ...
@@ -669,8 +669,8 @@ pci_driver_handle_bind(const char *path) struct pciDevice *dev = pci_device_find_by_content(path); struct pciDriver *driver = pci_driver_find_by_path(path);
- if (!driver || !dev) { - /* This should never happen (TM) */ + if (!driver || !dev || PCI_ACTION_BIND & driver->fail) { + /* No driver, no device or failing driver requested */ errno = ENODEV; goto cleanup; }
... is called here, which you fix, but also in
pci_device_autobind and pci_driver_handle_new_id which are not fixed by this commit.
I don't quite understand deeply what this is supposed to do, so I don't know what's supposed to happen in that case, but this seems suspicious to me. Please try explaining/justifying why the two other call paths are not changed.
Also I did not bother checking the unbind code for the same problem.
This whole PCI_ACTION_BIND mess exists because with a RHEL-7 kernel it's not possible to bind a PCI device directly to the vfio-pci driver. I mean, with a RHEL-7 kernel the following steps fail:

  01:00.0 Non-Volatile memory controller: Device 1cc1:8201 (rev 03)
          Subsystem: Device 1cc1:8201
          Kernel driver in use: nvme

  # echo "1cc1 8201" > /sys/bus/pci/drivers/vfio-pci/new_id
  # echo "0000:01:00.0" > /sys/bus/pci/devices/0000\:01\:00.0/driver/unbind
  # echo "0000:01:00.0" > /sys/bus/pci/drivers/vfio-pci/bind

  01:00.0 Non-Volatile memory controller: Device 1cc1:8201 (rev 03)
          Subsystem: Device 1cc1:8201
          Kernel driver in use: vfio-pci

But on anything else (e.g. my vanilla kernel) this is allowed. Anyway, this is irrelevant because even on RHEL-7 we are using 'driver_override' (which is way simpler to use and doesn't create a window where an unbound PCI device can be claimed by a different PCI driver).

How does this concern pcimock? Well, there are only so many "entry" points to the mock. These are functions which have "handle" in their name and are called from within pci_driver_handle_change(). Every other function is just a helper. For instance, pci_device_autobind() is called from pci_device_new_from_stub() which just creates a PCI device during the mock initialization. Or, it's called on a write to "drivers_probe" which again succeeds (on both RHEL-7 and vanilla kernels). And for pci_driver_handle_new_id it's the same story.

If you want, I can add those checks there so that this patch looks complete. But honestly, it doesn't matter because this code path will not be used once 'driver_override' is implemented (patch 05/31).

Michal
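For comparison with the new_id + unbind + bind steps, the driver_override route boils down to three sysfs writes. The sketch below only plans those writes (path plus value) so it can be inspected without touching /sys. It reflects my understanding of the kernel's sysfs interface and is not taken from libvirt's virpci.c; struct sysfs_write, fill() and plan_override_rebind() are hypothetical names.

```c
#include <stdio.h>

/* One sysfs write: the file to open and the string to write into it. */
struct sysfs_write {
    char path[128];
    char value[64];
};

/* Format @arg into @dst using @fmt; -1 if the result would not fit. */
static int
fill(char *dst, size_t len, const char *fmt, const char *arg)
{
    int n = snprintf(dst, len, fmt, arg);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Fill @steps with the three writes of a driver_override rebind for
 * the PCI device @addr (e.g. "0000:01:00.0"). Returns 0 on success. */
int
plan_override_rebind(const char *addr, const char *driver,
                     struct sysfs_write steps[3])
{
    /* 1) Name the only driver that is allowed to claim the device. */
    if (fill(steps[0].path, sizeof(steps[0].path),
             "/sys/bus/pci/devices/%s/driver_override", addr) < 0 ||
        fill(steps[0].value, sizeof(steps[0].value), "%s", driver) < 0)
        return -1;

    /* 2) Detach the device from whatever driver currently holds it. */
    if (fill(steps[1].path, sizeof(steps[1].path),
             "/sys/bus/pci/devices/%s/driver/unbind", addr) < 0 ||
        fill(steps[1].value, sizeof(steps[1].value), "%s", addr) < 0)
        return -1;

    /* 3) Ask the PCI core to probe again; the override picks the driver. */
    if (fill(steps[2].path, sizeof(steps[2].path),
             "%s", "/sys/bus/pci/drivers_probe") < 0 ||
        fill(steps[2].value, sizeof(steps[2].value), "%s", addr) < 0)
        return -1;

    return 0;
}
```

Because the override is in place before the device is unbound, there is no moment at which a different driver could claim it, which is the window mentioned above for the old method.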

This reverts commit b70c093ffa00cd87c8d39d3652b798f033a81faf.

In the next commit virpcimock is going to be extended and thus binding a PCI device to the vfio-pci driver will finally succeed. Remove this test as it will no longer make sense.

Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
 tests/virpcitest.c | 32 --------------------------------
 1 file changed, 32 deletions(-)

diff --git a/tests/virpcitest.c b/tests/virpcitest.c
index 961a7eff1a..9ecd1b7d27 100644
--- a/tests/virpcitest.c
+++ b/tests/virpcitest.c
@@ -256,36 +256,6 @@ testVirPCIDeviceDetachSingle(const void *opaque)
     return ret;
 }

-static int
-testVirPCIDeviceDetachFail(const void *opaque)
-{
-    const struct testPCIDevData *data = opaque;
-    int ret = -1;
-    virPCIDevicePtr dev;
-
-    dev = virPCIDeviceNew(data->domain, data->bus, data->slot, data->function);
-    if (!dev)
-        goto cleanup;
-
-    virPCIDeviceSetStubDriver(dev, VIR_PCI_STUB_DRIVER_VFIO);
-
-    if (virPCIDeviceDetach(dev, NULL, NULL) < 0) {
-        if (virTestGetVerbose() || virTestGetDebug())
-            virDispatchError(NULL);
-        virResetLastError();
-        ret = 0;
-    } else {
-        virReportError(VIR_ERR_INTERNAL_ERROR,
-                       "Attaching device %s to %s should have failed",
-                       virPCIDeviceGetName(dev),
-                       virPCIStubDriverTypeToString(VIR_PCI_STUB_DRIVER_VFIO));
-    }
-
- cleanup:
-    virPCIDeviceFree(dev);
-    return ret;
-}
-
 static int
 testVirPCIDeviceReattachSingle(const void *opaque)
 {
@@ -421,8 +391,6 @@ mymain(void)
     DO_TEST_PCI(testVirPCIDeviceIsAssignable, 5, 0x90, 1, 0);
     DO_TEST_PCI(testVirPCIDeviceIsAssignable, 1, 1, 0, 0);

-    DO_TEST_PCI(testVirPCIDeviceDetachFail, 0, 0x0a, 1, 0);
-
     /* Reattach a device already bound to non-stub a driver */
     DO_TEST_PCI_DRIVER(0, 0x0a, 1, 0, "i915");
     DO_TEST_PCI(testVirPCIDeviceReattachSingle, 0, 0x0a, 1, 0);
--
2.21.0

Newer kernels (v3.16-rc1~29^2~6^4) have a 'driver_override' file which simplifies the way of binding a PCI device to a desired driver. Libvirt has had support for this for some time too (v2.3.0-rc1~236), but our virpcimock has not. So far we did not care because our code is designed to deal with this situation. Except for one hypothetical case: binding a device to the vfio-pci driver can be successful only via driver_override. Any attempt to bind a PCI device to the vfio-pci driver using the old method (new_id + unbind + bind) will fail because of b803b29c1a5. While on a vanilla kernel I'm able to use the old method successfully, it's failing on RHEL kernels (not sure why).

Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
 tests/virpcimock.c | 57 +++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 51 insertions(+), 6 deletions(-)

diff --git a/tests/virpcimock.c b/tests/virpcimock.c
index 6865f992dc..18d06d11d4 100644
--- a/tests/virpcimock.c
+++ b/tests/virpcimock.c
@@ -87,6 +87,11 @@ char *fakesysfspcidir;
  *   Probe for a driver that handles the specified device.
  *   Data in format "DDDD:BB:DD.F" (Domain:Bus:Device.Function).
  *
+ * /sys/bus/pci/devices/<device>/driver_override
+ *   Name of a driver that overrides preferred driver can be written
+ *   here. The device will be attached to it on drivers_probe event.
+ *   Writing an empty string (or "\n") clears the override.
+ *
  * As a little hack, we are not mocking write to these files, but close()
  * instead. The advantage is we don't need any self growing array to hold the
  * partial writes and construct them back. We can let all the writes finish,
@@ -147,6 +152,7 @@ static struct pciDevice *pci_device_find_by_content(const char *path);
 static void pci_driver_new(const char *name, int fail, ...);
 static struct pciDriver *pci_driver_find_by_dev(struct pciDevice *dev);
 static struct pciDriver *pci_driver_find_by_path(const char *path);
+static struct pciDriver *pci_driver_find_by_driver_override(struct pciDevice *dev);
 static int pci_driver_bind(struct pciDriver *driver, struct pciDevice *dev);
 static int pci_driver_unbind(struct pciDriver *driver, struct pciDevice *dev);
 static int pci_driver_handle_change(int fd, const char *path);
@@ -202,7 +208,8 @@ make_symlink(const char *path,
 static int
 pci_read_file(const char *path,
               char *buf,
-              size_t buf_size)
+              size_t buf_size,
+              bool truncate)
 {
     int ret = -1;
     int fd = -1;
@@ -224,7 +231,8 @@ pci_read_file(const char *path,
         goto cleanup;
     }

-    if (ftruncate(fd, 0) < 0)
+    if (truncate &&
+        ftruncate(fd, 0) < 0)
         goto cleanup;

     ret = 0;
@@ -398,6 +406,8 @@ pci_device_new_from_stub(const struct pciDevice *data)
         ABORT("@tmp overflow");
     make_file(devpath, "class", tmp, -1);

+    make_file(devpath, "driver_override", NULL, -1);
+
     if (snprintf(tmp, sizeof(tmp),
                  "%s/../../../kernel/iommu_groups/%d",
                  devpath, dev->iommuGroup) < 0) {
@@ -441,7 +451,7 @@ pci_device_find_by_content(const char *path)
 {
     char tmp[32];

-    if (pci_read_file(path, tmp, sizeof(tmp)) < 0)
+    if (pci_read_file(path, tmp, sizeof(tmp), true) < 0)
         return NULL;

     return pci_device_find_by_id(tmp);
@@ -450,7 +460,10 @@ pci_device_find_by_content(const char *path)
 static int
 pci_device_autobind(struct pciDevice *dev)
 {
-    struct pciDriver *driver = pci_driver_find_by_dev(dev);
+    struct pciDriver *driver = pci_driver_find_by_driver_override(dev);
+
+    if (!driver)
+        driver = pci_driver_find_by_dev(dev);

     if (!driver) {
         /* No driver found. Nothing to do */
@@ -544,6 +557,36 @@ pci_driver_find_by_path(const char *path)
     return NULL;
 }

+static struct pciDriver *
+pci_driver_find_by_driver_override(struct pciDevice *dev)
+{
+    struct pciDriver *ret = NULL;
+    char *path = NULL;
+    char tmp[32];
+    size_t i;
+
+    if (virAsprintfQuiet(&path,
+                         SYSFS_PCI_PREFIX "devices/%s/driver_override",
+                         dev->id) < 0)
+        return NULL;
+
+    if (pci_read_file(path, tmp, sizeof(tmp), false) < 0)
+        goto cleanup;
+
+    for (i = 0; i < nPCIDrivers; i++) {
+        struct pciDriver *driver = pciDrivers[i];
+
+        if (STREQ(tmp, driver->name)) {
+            ret = driver;
+            break;
+        }
+    }
+
+ cleanup:
+    VIR_FREE(path);
+    return ret;
+}
+
 static int
 pci_driver_bind(struct pciDriver *driver,
                 struct pciDevice *dev)
@@ -657,6 +700,8 @@ pci_driver_handle_change(int fd ATTRIBUTE_UNUSED, const char *path)
         ret = pci_driver_handle_remove_id(path);
     else if (STREQ(file, "drivers_probe"))
         ret = pci_driver_handle_drivers_probe(path);
+    else if (STREQ(file, "driver_override"))
+        ret = 0; /* nada */
     else
         ABORT("Not handled write to: %s", path);
     return ret;
@@ -711,7 +756,7 @@ pci_driver_handle_new_id(const char *path)
         goto cleanup;
     }

-    if (pci_read_file(path, buf, sizeof(buf)) < 0)
+    if (pci_read_file(path, buf, sizeof(buf), true) < 0)
         goto cleanup;

     if (sscanf(buf, "%x %x", &vendor, &device) < 2) {
@@ -766,7 +811,7 @@ pci_driver_handle_remove_id(const char *path)
         goto cleanup;
     }

-    if (pci_read_file(path, buf, sizeof(buf)) < 0)
+    if (pci_read_file(path, buf, sizeof(buf), true) < 0)
         goto cleanup;

     if (sscanf(buf, "%x %x", &vendor, &device) < 2) {
--
2.21.0
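The lookup order that pci_device_autobind() gains in this patch (override first, ID match second) can be shown on a small in-memory table. This is a hypothetical simplification rather than the mock itself: struct drv_entry, struct dev_entry and pick_driver() are invented, each driver claims a single vendor:device pair, and the override is a string instead of a file. Ignoring a trailing newline is my own addition to reflect the "\n" case from the comment in the patch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified model; the real mock keeps more state. */
struct drv_entry {
    const char *name;
    unsigned int vendor;
    unsigned int device;       /* the single ID this driver claims */
};

struct dev_entry {
    unsigned int vendor;
    unsigned int device;
    const char *override;      /* driver_override contents, or NULL */
};

/* Pick a driver for @dev: the one named in driver_override if it
 * exists, otherwise the first one whose ID matches. NULL if none. */
const struct drv_entry *
pick_driver(const struct drv_entry *drivers, size_t ndrivers,
            const struct dev_entry *dev)
{
    /* Length up to the first newline; "" and "\n" both mean "unset". */
    size_t len = dev->override ? strcspn(dev->override, "\n") : 0;
    size_t i;

    if (len > 0) {
        for (i = 0; i < ndrivers; i++) {
            if (strlen(drivers[i].name) == len &&
                strncmp(drivers[i].name, dev->override, len) == 0)
                return &drivers[i];
        }
    }

    /* No usable override: fall back to matching by vendor:device. */
    for (i = 0; i < ndrivers; i++) {
        if (drivers[i].vendor == dev->vendor &&
            drivers[i].device == dev->device)
            return &drivers[i];
    }

    return NULL;
}
```

As in the patch, an override naming a driver the table does not know simply falls through to the ID match.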

On Thu, Jul 11, 2019 at 17:53:52 +0200, Michal Privoznik wrote:
Newer kernels (v3.16-rc1~29^2~6^4) have a 'driver_override' file which simplifies the way of binding a PCI device to a desired driver. Libvirt has had support for this for some time too (v2.3.0-rc1~236), but our virpcimock has not. So far we did not care because our code is designed to deal with this situation. Except for one hypothetical case: binding a device to the vfio-pci driver can be successful only via driver_override. Any attempt to bind a PCI device to the vfio-pci driver using the old method (new_id + unbind + bind) will fail because of b803b29c1a5. While on a vanilla kernel
You've reverted the mentioned commit just before this patch. Also I did not understand what this is supposed to do from the commit message. Perhaps due to my limited understanding of the PCI detachment code. So either somebody else reviews this or you'll need to rephrase the commit message.
I'm able to use the old method successfully, it's failing on RHEL kernels (not sure why).
This does not inspire confidence.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- tests/virpcimock.c | 57 +++++++++++++++++++++++++++++++++++++++++----- 1 file changed, 51 insertions(+), 6 deletions(-)
diff --git a/tests/virpcimock.c b/tests/virpcimock.c index 6865f992dc..18d06d11d4 100644 --- a/tests/virpcimock.c +++ b/tests/virpcimock.c @@ -87,6 +87,11 @@ char *fakesysfspcidir; * Probe for a driver that handles the specified device. * Data in format "DDDD:BB:DD.F" (Domain:Bus:Device.Function). * + * /sys/bus/pci/devices/<device>/driver_override + * Name of a driver that overrides preferred driver can be written + * here. The device will be attached to it on drivers_probe event. + * Writing an empty string (or "\n") clears the override. + * * As a little hack, we are not mocking write to these files, but close() * instead. The advantage is we don't need any self growing array to hold the * partial writes and construct them back. We can let all the writes finish, @@ -147,6 +152,7 @@ static struct pciDevice *pci_device_find_by_content(const char *path); static void pci_driver_new(const char *name, int fail, ...); static struct pciDriver *pci_driver_find_by_dev(struct pciDevice *dev); static struct pciDriver *pci_driver_find_by_path(const char *path); +static struct pciDriver *pci_driver_find_by_driver_override(struct pciDevice *dev); static int pci_driver_bind(struct pciDriver *driver, struct pciDevice *dev); static int pci_driver_unbind(struct pciDriver *driver, struct pciDevice *dev); static int pci_driver_handle_change(int fd, const char *path); @@ -202,7 +208,8 @@ make_symlink(const char *path, static int pci_read_file(const char *path, char *buf, - size_t buf_size) + size_t buf_size, + bool truncate) { int ret = -1; int fd = -1; @@ -224,7 +231,8 @@ pci_read_file(const char *path, goto cleanup; }
- if (ftruncate(fd, 0) < 0) + if (truncate && + ftruncate(fd, 0) < 0) goto cleanup;
ret = 0; @@ -398,6 +406,8 @@ pci_device_new_from_stub(const struct pciDevice *data) ABORT("@tmp overflow"); make_file(devpath, "class", tmp, -1);
+ make_file(devpath, "driver_override", NULL, -1); + if (snprintf(tmp, sizeof(tmp), "%s/../../../kernel/iommu_groups/%d", devpath, dev->iommuGroup) < 0) { @@ -441,7 +451,7 @@ pci_device_find_by_content(const char *path) { char tmp[32];
- if (pci_read_file(path, tmp, sizeof(tmp)) < 0) + if (pci_read_file(path, tmp, sizeof(tmp), true) < 0) return NULL;
return pci_device_find_by_id(tmp); @@ -450,7 +460,10 @@ pci_device_find_by_content(const char *path) static int pci_device_autobind(struct pciDevice *dev) { - struct pciDriver *driver = pci_driver_find_by_dev(dev); + struct pciDriver *driver = pci_driver_find_by_driver_override(dev); + + if (!driver) + driver = pci_driver_find_by_dev(dev);
if (!driver) { /* No driver found. Nothing to do */ @@ -544,6 +557,36 @@ pci_driver_find_by_path(const char *path) return NULL; }
+static struct pciDriver * +pci_driver_find_by_driver_override(struct pciDevice *dev) +{ + struct pciDriver *ret = NULL; + char *path = NULL; + char tmp[32]; + size_t i; + + if (virAsprintfQuiet(&path, + SYSFS_PCI_PREFIX "devices/%s/driver_override", + dev->id) < 0) + return NULL; + + if (pci_read_file(path, tmp, sizeof(tmp), false) < 0) + goto cleanup; + + for (i = 0; i < nPCIDrivers; i++) { + struct pciDriver *driver = pciDrivers[i]; + + if (STREQ(tmp, driver->name)) { + ret = driver; + break; + } + } + + cleanup: + VIR_FREE(path);
VIR_AUTOFREE should be available.
+ return ret; +} + static int pci_driver_bind(struct pciDriver *driver, struct pciDevice *dev) @@ -657,6 +700,8 @@ pci_driver_handle_change(int fd ATTRIBUTE_UNUSED, const char *path) ret = pci_driver_handle_remove_id(path); else if (STREQ(file, "drivers_probe")) ret = pci_driver_handle_drivers_probe(path); + else if (STREQ(file, "driver_override")) + ret = 0; /* nada */ else ABORT("Not handled write to: %s", path); return ret; @@ -711,7 +756,7 @@ pci_driver_handle_new_id(const char *path) goto cleanup; }
- if (pci_read_file(path, buf, sizeof(buf)) < 0) + if (pci_read_file(path, buf, sizeof(buf), true) < 0) goto cleanup;
if (sscanf(buf, "%x %x", &vendor, &device) < 2) { @@ -766,7 +811,7 @@ pci_driver_handle_remove_id(const char *path) goto cleanup; }
- if (pci_read_file(path, buf, sizeof(buf)) < 0) + if (pci_read_file(path, buf, sizeof(buf), true) < 0) goto cleanup;
if (sscanf(buf, "%x %x", &vendor, &device) < 2) { -- 2.21.0
-- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list

On 7/16/19 2:13 PM, Peter Krempa wrote:
On Thu, Jul 11, 2019 at 17:53:52 +0200, Michal Privoznik wrote:
Newer kernels (v3.16-rc1~29^2~6^4) have a 'driver_override' file which simplifies binding a PCI device to the desired driver. Libvirt has supported this for some time too (v2.3.0-rc1~236), but our virpcimock has not. So far we did not care because our code is designed to deal with this situation. Except for one hypothetical case: binding a device to the vfio-pci driver can be successful only via driver_override. Any attempt to bind a PCI device to the vfio-pci driver using the old method (new_id + unbind + bind) will fail because of b803b29c1a5. While on vanilla kernel
You've reverted the mentioned commit just before this patch.
Yep, I needed to do that s
Also I did not understand what this is supposed to do from the commit message. Perhaps due to my limited understanding of the pci detachment code. So either somebody else reviews this or you'll need to rephrase the commit message.
I'm able to use the old method successfully, it's failing on RHEL kernels (not sure why).
This does not inspire confidence.
I've asked Alex Williamson about this and he confirmed that he can see this behaviour, but he could not understand why the RHEL-7 kernel behaves that way either.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- tests/virpcimock.c | 57 +++++++++++++++++++++++++++++++++++++++++----- 1 file changed, 51 insertions(+), 6 deletions(-)
diff --git a/tests/virpcimock.c b/tests/virpcimock.c index 6865f992dc..18d06d11d4 100644 --- a/tests/virpcimock.c +++ b/tests/virpcimock.c @@ -87,6 +87,11 @@ char *fakesysfspcidir; * Probe for a driver that handles the specified device. * Data in format "DDDD:BB:DD.F" (Domain:Bus:Device.Function). * + * /sys/bus/pci/devices/<device>/driver_override + * Name of a driver that overrides preferred driver can be written + * here. The device will be attached to it on drivers_probe event. + * Writing an empty string (or "\n") clears the override. + * * As a little hack, we are not mocking write to these files, but close() * instead. The advantage is we don't need any self growing array to hold the * partial writes and construct them back. We can let all the writes finish, @@ -147,6 +152,7 @@ static struct pciDevice *pci_device_find_by_content(const char *path); static void pci_driver_new(const char *name, int fail, ...); static struct pciDriver *pci_driver_find_by_dev(struct pciDevice *dev); static struct pciDriver *pci_driver_find_by_path(const char *path); +static struct pciDriver *pci_driver_find_by_driver_override(struct pciDevice *dev); static int pci_driver_bind(struct pciDriver *driver, struct pciDevice *dev); static int pci_driver_unbind(struct pciDriver *driver, struct pciDevice *dev); static int pci_driver_handle_change(int fd, const char *path); @@ -202,7 +208,8 @@ make_symlink(const char *path, static int pci_read_file(const char *path, char *buf, - size_t buf_size) + size_t buf_size, + bool truncate) { int ret = -1; int fd = -1; @@ -224,7 +231,8 @@ pci_read_file(const char *path, goto cleanup; }
- if (ftruncate(fd, 0) < 0) + if (truncate && + ftruncate(fd, 0) < 0) goto cleanup;
ret = 0; @@ -398,6 +406,8 @@ pci_device_new_from_stub(const struct pciDevice *data) ABORT("@tmp overflow"); make_file(devpath, "class", tmp, -1);
+ make_file(devpath, "driver_override", NULL, -1); + if (snprintf(tmp, sizeof(tmp), "%s/../../../kernel/iommu_groups/%d", devpath, dev->iommuGroup) < 0) { @@ -441,7 +451,7 @@ pci_device_find_by_content(const char *path) { char tmp[32];
- if (pci_read_file(path, tmp, sizeof(tmp)) < 0) + if (pci_read_file(path, tmp, sizeof(tmp), true) < 0) return NULL;
return pci_device_find_by_id(tmp); @@ -450,7 +460,10 @@ pci_device_find_by_content(const char *path) static int pci_device_autobind(struct pciDevice *dev) { - struct pciDriver *driver = pci_driver_find_by_dev(dev); + struct pciDriver *driver = pci_driver_find_by_driver_override(dev); + + if (!driver) + driver = pci_driver_find_by_dev(dev);
This could explain why the check from 03/31 can't be here. If it were, then not even 'driver_override' would be able to bind a PCI device to the vfio-pci driver (which obviously is not the case).
if (!driver) { /* No driver found. Nothing to do */ @@ -544,6 +557,36 @@ pci_driver_find_by_path(const char *path) return NULL; }
+static struct pciDriver * +pci_driver_find_by_driver_override(struct pciDevice *dev) +{ + struct pciDriver *ret = NULL; + char *path = NULL; + char tmp[32]; + size_t i; + + if (virAsprintfQuiet(&path, + SYSFS_PCI_PREFIX "devices/%s/driver_override", + dev->id) < 0) + return NULL; + + if (pci_read_file(path, tmp, sizeof(tmp), false) < 0) + goto cleanup; + + for (i = 0; i < nPCIDrivers; i++) { + struct pciDriver *driver = pciDrivers[i]; + + if (STREQ(tmp, driver->name)) { + ret = driver; + break; + } + } + + cleanup: + VIR_FREE(path);
VIR_AUTOFREE should be available.
Ah, good point. Consider fixed. Michal
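The lookup order discussed above (honour driver_override first, fall back to ID matching only if it yields nothing) and the scope-based cleanup that VIR_AUTOFREE provides can be sketched in plain C. This is only a sketch under assumptions, not the virpcimock code: read_file(), find_driver_by_override(), struct driver and the AUTOFREE macro are hypothetical names, and AUTOFREE relies on the GCC/Clang cleanup attribute.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical stand-in for VIR_AUTOFREE: free the pointer when the
 * variable goes out of scope (GCC/Clang extension). */
static void auto_free(void *p) { free(*(void **)p); }
#define AUTOFREE(type) __attribute__((cleanup(auto_free))) type

struct driver { const char *name; };

/* Read a file into buf, strip one trailing newline, and optionally
 * truncate the file afterwards. The override file must survive reads,
 * so its caller passes truncate = false. */
static int
read_file(const char *path, char *buf, size_t buf_size, bool truncate)
{
    int ret = -1;
    ssize_t n;
    int fd = open(path, O_RDWR);

    if (fd < 0)
        return -1;

    n = read(fd, buf, buf_size - 1);
    if (n < 0)
        goto cleanup;
    buf[n] = '\0';
    if (n > 0 && buf[n - 1] == '\n')
        buf[n - 1] = '\0';

    if (truncate && ftruncate(fd, 0) < 0)
        goto cleanup;

    ret = 0;
 cleanup:
    close(fd);
    return ret;
}

/* Return the driver named in <devdir>/driver_override, or NULL when the
 * file is missing, empty, or names an unknown driver. On NULL the
 * caller is expected to fall back to vendor/device ID matching. */
static struct driver *
find_driver_by_override(const char *devdir,
                        struct driver *drivers, size_t ndrivers)
{
    size_t len;
    char tmp[32];
    size_t i;

    assert(devdir != NULL);
    len = strlen(devdir) + sizeof("/driver_override");

    AUTOFREE(char *) path = malloc(len);
    if (!path)
        return NULL;
    snprintf(path, len, "%s/driver_override", devdir);

    if (read_file(path, tmp, sizeof(tmp), false) < 0)
        return NULL; /* path is freed automatically */

    for (i = 0; i < ndrivers; i++) {
        if (strcmp(tmp, drivers[i].name) == 0)
            return &drivers[i];
    }
    return NULL;
}
```

With the cleanup attribute there is no need for a cleanup label in the lookup function, which is the simplification the review asks for.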

This function does not change any of the passed addresses. It just reads them. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/util/virpci.c | 4 ++-- src/util/virpci.h | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/src/util/virpci.c b/src/util/virpci.c index 75e8daadd5..5392d62406 100644 --- a/src/util/virpci.c +++ b/src/util/virpci.c @@ -1718,8 +1718,8 @@ virPCIDeviceAddressIsEmpty(const virPCIDeviceAddress *addr) } bool -virPCIDeviceAddressEqual(virPCIDeviceAddress *addr1, - virPCIDeviceAddress *addr2) +virPCIDeviceAddressEqual(const virPCIDeviceAddress *addr1, + const virPCIDeviceAddress *addr2) { if (addr1->domain == addr2->domain && addr1->bus == addr2->bus && diff --git a/src/util/virpci.h b/src/util/virpci.h index 457be3c929..a940608701 100644 --- a/src/util/virpci.h +++ b/src/util/virpci.h @@ -231,8 +231,8 @@ bool virPCIDeviceAddressIsValid(virPCIDeviceAddressPtr addr, bool report); bool virPCIDeviceAddressIsEmpty(const virPCIDeviceAddress *addr); -bool virPCIDeviceAddressEqual(virPCIDeviceAddress *addr1, - virPCIDeviceAddress *addr2); +bool virPCIDeviceAddressEqual(const virPCIDeviceAddress *addr1, + const virPCIDeviceAddress *addr2); char *virPCIDeviceAddressAsString(virPCIDeviceAddressPtr addr) ATTRIBUTE_NONNULL(1); -- 2.21.0

On Thu, Jul 11, 2019 at 17:53:53 +0200, Michal Privoznik wrote:
This function does not change any of the passed addresses. It just reads them.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> ---
ACK
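For illustration, the effect of the const qualifiers can be sketched with a simplified address type. The struct and function names below are hypothetical stand-ins, not the libvirt definitions; the point is only that a comparison which reads its arguments can accept pointers to const objects without a cast.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for virPCIDeviceAddress. */
struct pci_addr {
    unsigned int domain;
    unsigned int bus;
    unsigned int slot;
    unsigned int function;
};

/* Read-only comparison: both parameters are pointers to const, so
 * callers holding only a const pointer can use it directly. */
static bool
pci_addr_equal(const struct pci_addr *a, const struct pci_addr *b)
{
    assert(a && b);
    return a->domain == b->domain &&
           a->bus == b->bus &&
           a->slot == b->slot &&
           a->function == b->function;
}
```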

This function does not change any of the passed addresses. It just reads them. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/util/virpci.c | 2 +- src/util/virpci.h | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/src/util/virpci.c b/src/util/virpci.c index 5392d62406..59f478dd41 100644 --- a/src/util/virpci.c +++ b/src/util/virpci.c @@ -1731,7 +1731,7 @@ virPCIDeviceAddressEqual(const virPCIDeviceAddress *addr1, } char * -virPCIDeviceAddressAsString(virPCIDeviceAddressPtr addr) +virPCIDeviceAddressAsString(const virPCIDeviceAddress *addr) { char *str; diff --git a/src/util/virpci.h b/src/util/virpci.h index a940608701..1efd8b77ed 100644 --- a/src/util/virpci.h +++ b/src/util/virpci.h @@ -234,7 +234,7 @@ bool virPCIDeviceAddressIsEmpty(const virPCIDeviceAddress *addr); bool virPCIDeviceAddressEqual(const virPCIDeviceAddress *addr1, const virPCIDeviceAddress *addr2); -char *virPCIDeviceAddressAsString(virPCIDeviceAddressPtr addr) +char *virPCIDeviceAddressAsString(const virPCIDeviceAddress *addr) ATTRIBUTE_NONNULL(1); int virPCIDeviceAddressParse(char *address, virPCIDeviceAddressPtr bdf); -- 2.21.0

On Thu, Jul 11, 2019 at 17:53:54 +0200, Michal Privoznik wrote:
This function does not change any of the passed addresses. It just reads them.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> ---
ACK

This helper is cleaner than plain memcpy() because one doesn't have to look into virPCIDeviceAddress struct to see if it contains any strings / pointers. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/libvirt_private.syms | 1 + src/util/virpci.c | 6 ++++++ src/util/virpci.h | 2 ++ 3 files changed, 9 insertions(+) diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index 02d5b7acce..6cef8d20fe 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -2612,6 +2612,7 @@ virObjectUnref; # util/virpci.h virPCIDeviceAddressAsString; +virPCIDeviceAddressCopy; virPCIDeviceAddressEqual; virPCIDeviceAddressGetIOMMUGroupAddresses; virPCIDeviceAddressGetIOMMUGroupNum; diff --git a/src/util/virpci.c b/src/util/virpci.c index 59f478dd41..03ce651f40 100644 --- a/src/util/virpci.c +++ b/src/util/virpci.c @@ -1730,6 +1730,12 @@ virPCIDeviceAddressEqual(const virPCIDeviceAddress *addr1, return false; } +void virPCIDeviceAddressCopy(virPCIDeviceAddressPtr dst, + const virPCIDeviceAddress *src) +{ + memcpy(dst, src, sizeof(*src)); +} + char * virPCIDeviceAddressAsString(const virPCIDeviceAddress *addr) { diff --git a/src/util/virpci.h b/src/util/virpci.h index 1efd8b77ed..72e90a1ef3 100644 --- a/src/util/virpci.h +++ b/src/util/virpci.h @@ -233,6 +233,8 @@ bool virPCIDeviceAddressIsEmpty(const virPCIDeviceAddress *addr); bool virPCIDeviceAddressEqual(const virPCIDeviceAddress *addr1, const virPCIDeviceAddress *addr2); +void virPCIDeviceAddressCopy(virPCIDeviceAddressPtr dst, + const virPCIDeviceAddress *src); char *virPCIDeviceAddressAsString(const virPCIDeviceAddress *addr) ATTRIBUTE_NONNULL(1); -- 2.21.0

On Thu, Jul 11, 2019 at 17:53:55 +0200, Michal Privoznik wrote:
This helper is cleaner than plain memcpy() because one doesn't have to look into virPCIDeviceAddress struct to see if it contains any strings / pointers.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/libvirt_private.syms | 1 + src/util/virpci.c | 6 ++++++ src/util/virpci.h | 2 ++ 3 files changed, 9 insertions(+)
[...]
diff --git a/src/util/virpci.c b/src/util/virpci.c index 59f478dd41..03ce651f40 100644 --- a/src/util/virpci.c +++ b/src/util/virpci.c @@ -1730,6 +1730,12 @@ virPCIDeviceAddressEqual(const virPCIDeviceAddress *addr1, return false; }
Please add a comment stating that this is a deep copy and also note with the definitions of 'struct _virZPCIDeviceAddress' and 'struct _virPCIDeviceAddress' that there is a deep-copy function which needs to be fixed when adding new members.
+void virPCIDeviceAddressCopy(virPCIDeviceAddressPtr dst, + const virPCIDeviceAddress *src) +{ + memcpy(dst, src, sizeof(*src)); +} + char * virPCIDeviceAddressAsString(const virPCIDeviceAddress *addr) {
ACK with the above addressed.
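A rough sketch of what the requested comments might look like, using a simplified, hypothetical address struct rather than the real virPCIDeviceAddress (which has more members):

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, simplified stand-in for virPCIDeviceAddress.
 *
 * NOTE: pci_addr_copy() below relies on this struct holding only plain
 * values. If a pointer or string member is ever added, the copy helper
 * must be updated to duplicate it. */
struct pci_addr {
    unsigned int domain;
    unsigned int bus;
    unsigned int slot;
    unsigned int function;
};

/* Deep copy of *src into *dst. A flat memcpy() counts as a deep copy
 * only because the struct has no pointer members. */
static void
pci_addr_copy(struct pci_addr *dst, const struct pci_addr *src)
{
    assert(dst && src);
    memcpy(dst, src, sizeof(*src));
}
```

Keeping the note next to the struct definition means that whoever adds a member sees the reminder without having to know the helper exists.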

I find this function more readable if checks for passed storage source are done first and backing chain is done last. Mixing them together does not hurt, but is less readable. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_domain.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index 0f1fda2384..f09abc8a73 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -5178,11 +5178,6 @@ qemuDomainDeviceDefValidateDisk(const virDomainDiskDef *disk, return -1; } - for (n = disk->src; virStorageSourceIsBacking(n); n = n->backingStore) { - if (qemuDomainValidateStorageSource(n, qemuCaps) < 0) - return -1; - } - if (disk->device == VIR_DOMAIN_DISK_DEVICE_CDROM && disk->bus == VIR_DOMAIN_DISK_BUS_VIRTIO) { virReportError(VIR_ERR_CONFIG_UNSUPPORTED, @@ -5191,6 +5186,11 @@ qemuDomainDeviceDefValidateDisk(const virDomainDiskDef *disk, return -1; } + for (n = disk->src; virStorageSourceIsBacking(n); n = n->backingStore) { + if (qemuDomainValidateStorageSource(n, qemuCaps) < 0) + return -1; + } + return 0; } -- 2.21.0

On Thu, Jul 11, 2019 at 17:53:56 +0200, Michal Privoznik wrote:
I find this function more readable if checks for passed storage source are done first and backing chain is done last. Mixing them together does not hurt, but is less readable.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_domain.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-)
ACK

There is this class of PCI devices that act like disks: NVMe. Therefore, they are both PCI devices and disks. While we already have <hostdev/> (and can assign a NVMe device to a domain successfully) we don't have disk representation. There are three problems with PCI assignment in case of a NVMe device: 1) domains with <hostdev/> can't be migrated 2) NVMe device is assigned whole, there's no way to assign only a namespace 3) Because hypervisors see <hostdev/> they don't put block layer on top of it - users don't get all the fancy features like snapshots NVMe namespaces are way of splitting one continuous NVDIMM memory into smaller ones, effectively creating smaller NVMe-s (which can then be partitioned, LVMed, etc.) Because of all of this the following XML was chosen to model a NVMe device: <disk type='nvme' device='disk'> <driver name='qemu' type='raw'/> <source type='pci' managed='yes' namespace='1'> <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/> </source> <target dev='vda' bus='virtio'/> </disk> Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- docs/formatdomain.html.in | 45 +++++++++++++++++++++-- docs/schemas/domaincommon.rng | 32 ++++++++++++++++ tests/qemuxml2argvdata/disk-nvme.xml | 55 ++++++++++++++++++++++++++++ 3 files changed, 129 insertions(+), 3 deletions(-) create mode 100644 tests/qemuxml2argvdata/disk-nvme.xml diff --git a/docs/formatdomain.html.in b/docs/formatdomain.html.in index a7a6ec32a5..545578076d 100644 --- a/docs/formatdomain.html.in +++ b/docs/formatdomain.html.in @@ -2922,6 +2922,13 @@ </backingStore> <target dev='vdd' bus='virtio'/> </disk> + <disk type='nvme' device='disk'> + <driver name='qemu' type='raw'/> + <source type='pci' managed='yes' namespace='1'> + <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/> + </source> + <target dev='vde' bus='virtio'/> + </disk> </devices> ...</pre> @@ -2935,7 +2942,8 @@ Valid values are "file", "block", "dir" (<span class="since">since 0.7.5</span>), 
"network" (<span class="since">since 0.8.7</span>), or - "volume" (<span class="since">since 1.0.5</span>) + "volume" (<span class="since">since 1.0.5</span>), or + "nvme" (<span class="since">since 5.5.0</span>) and refer to the underlying source for the disk. <span class="since">Since 0.0.3</span> </dd> @@ -3118,6 +3126,31 @@ <span class="since">Since 1.0.5</span> </p> </dd> + <dt><code>nvme</code></dt> + <dd> + To specify disk source for NVMe disk the <code>source</code> + element has the following attributes: + <dl> + <dt><code>type</code></dt> + <dd>The type of address specified in <code>address</code> + sub-element. Currently, only <code>pci</code> value is + accepted. + </dd> + + <dt><code>managed</code></dt> + <dd>This attribute instructs libvirt to detach NVMe + controller automatically on domain startup (<code>yes</code>) + or expect the controller to be detached by system + administrator (<code>no</code>). + </dd> + + <dt><code>namespace</code></dt> + <dd>The namespace ID which should be assigned to the domain. + According to NVMe standard, namespace numbers start from 1, + including. + </dd> + </dl> + </dd> </dl> With "file", "block", and "volume", one or more optional sub-elements <code>seclabel</code>, <a href="#seclabel">described @@ -3280,11 +3313,17 @@ initiator IQN needed to access the source via mandatory attribute <code>name</code>. </dd> + <dt><code>address</code></dt> + <dd>For disk of type <code>nvme</code> this element + specifies the PCI address of the host NVMe + controller. + <span class="since">Since 5.5.0</span> + </dd> </dl> <p> - For a "file" or "volume" disk type which represents a cdrom or floppy - (the <code>device</code> attribute), it is possible to define + For a "file", "volume" or "nvme" disk type which represents a cdrom or + floppy (the <code>device</code> attribute), it is possible to define policy what to do with the disk if the source file is not accessible. 
(NB, <code>startupPolicy</code> is not valid for "volume" disk unless the specified storage volume is of "file" type). This is done by the diff --git a/docs/schemas/domaincommon.rng b/docs/schemas/domaincommon.rng index 31db599ab9..f367e8f6fd 100644 --- a/docs/schemas/domaincommon.rng +++ b/docs/schemas/domaincommon.rng @@ -1603,6 +1603,7 @@ <ref name="diskSourceDir"/> <ref name="diskSourceNetwork"/> <ref name="diskSourceVolume"/> + <ref name="diskSourceNvme"/> </choice> </define> @@ -1918,6 +1919,37 @@ </optional> </define> + <define name="diskSourceNvme"> + <attribute name="type"> + <value>nvme</value> + </attribute> + <optional> + <element name="source"> + <attribute name="type"> + <value>pci</value> + </attribute> + <attribute name="namespace"> + <ref name="uint32"/> + </attribute> + <optional> + <attribute name="managed"> + <ref name="virYesNo"/> + </attribute> + </optional> + <element name="address"> + <ref name="pciaddress"/> + </element> + <ref name="diskSourceCommon"/> + <optional> + <ref name="storageStartupPolicy"/> + </optional> + <optional> + <ref name="encryption"/> + </optional> + </element> + </optional> + </define> + <define name="diskTarget"> <data type="string"> <param name="pattern">(ioemu:)?(fd|hd|sd|vd|xvd|ubd)[a-zA-Z0-9_]+</param> diff --git a/tests/qemuxml2argvdata/disk-nvme.xml b/tests/qemuxml2argvdata/disk-nvme.xml new file mode 100644 index 0000000000..0b3dbad4eb --- /dev/null +++ b/tests/qemuxml2argvdata/disk-nvme.xml @@ -0,0 +1,55 @@ +<domain type='qemu'> + <name>QEMUGuest1</name> + <uuid>c7a5fdbd-edaf-9455-926a-d65c16db1809</uuid> + <memory unit='KiB'>219136</memory> + <currentMemory unit='KiB'>219136</currentMemory> + <vcpu placement='static'>1</vcpu> + <os> + <type arch='i686' machine='pc'>hvm</type> + <boot dev='hd'/> + </os> + <clock offset='utc'/> + <on_poweroff>destroy</on_poweroff> + <on_reboot>restart</on_reboot> + <on_crash>destroy</on_crash> + <devices> + <emulator>/usr/bin/qemu-system-i686</emulator> + <disk type='nvme' 
device='disk'> + <driver name='qemu' type='raw'/> + <source type='pci' managed='yes' namespace='1'> + <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/> + </source> + <target dev='vda' bus='virtio'/> + </disk> + <disk type='nvme' device='disk'> + <driver name='qemu' type='raw'/> + <source type='pci' managed='yes' namespace='2'> + <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/> + </source> + <target dev='vdb' bus='virtio'/> + </disk> + <disk type='nvme' device='disk'> + <driver name='qemu' type='raw'/> + <source type='pci' managed='no' namespace='1'> + <address domain='0x0000' bus='0x02' slot='0x00' function='0x0'/> + </source> + <target dev='vdc' bus='virtio'/> + </disk> + <disk type='nvme' device='disk'> + <driver name='qemu' type='raw'/> + <source type='pci' managed='no' namespace='2'> + <address domain='0x0001' bus='0x02' slot='0x00' function='0x0'/> + <encryption format='luks'> + <secret type='passphrase' uuid='0a81f5b2-8403-7b23-c8d6-21ccc2f80d6f'/> + </encryption> + </source> + <target dev='vdd' bus='virtio'/> + </disk> + <controller type='usb' index='0'/> + <controller type='pci' index='0' model='pci-root'/> + <controller type='scsi' index='0' model='virtio-scsi'/> + <input type='mouse' bus='ps2'/> + <input type='keyboard' bus='ps2'/> + <memballoon model='none'/> + </devices> +</domain> -- 2.21.0

On Thu, Jul 11, 2019 at 17:53:57 +0200, Michal Privoznik wrote:
There is this class of PCI devices that act like disks: NVMe. Therefore, they are both PCI devices and disks. While we already have <hostdev/> (and can assign a NVMe device to a domain successfully) we don't have disk representation. There are three problems with PCI assignment in case of a NVMe device:
1) domains with <hostdev/> can't be migrated
2) NVMe device is assigned whole, there's no way to assign only a namespace
3) Because hypervisors see <hostdev/> they don't put block layer on top of it - users don't get all the fancy features like snapshots
NVMe namespaces are way of splitting one continuous NVDIMM memory
s/NVDIMM memory/NVMe device/
into smaller ones, effectively creating smaller NVMe-s (which can then be partitioned, LVMed, etc.)
Because of all of this the following XML was chosen to model a NVMe device:
<disk type='nvme' device='disk'> <driver name='qemu' type='raw'/> <source type='pci' managed='yes' namespace='1'> <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/> </source> <target dev='vda' bus='virtio'/> </disk>
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- docs/formatdomain.html.in | 45 +++++++++++++++++++++-- docs/schemas/domaincommon.rng | 32 ++++++++++++++++ tests/qemuxml2argvdata/disk-nvme.xml | 55 ++++++++++++++++++++++++++++ 3 files changed, 129 insertions(+), 3 deletions(-) create mode 100644 tests/qemuxml2argvdata/disk-nvme.xml
diff --git a/docs/formatdomain.html.in b/docs/formatdomain.html.in index a7a6ec32a5..545578076d 100644 --- a/docs/formatdomain.html.in +++ b/docs/formatdomain.html.in @@ -2922,6 +2922,13 @@ </backingStore> <target dev='vdd' bus='virtio'/> </disk> + <disk type='nvme' device='disk'> + <driver name='qemu' type='raw'/> + <source type='pci' managed='yes' namespace='1'>
The 'type' field may get a bit confusing as it is supposed to be stored in virStorageSource->nvme->type, while virStorageSource has its own type. Also I'm pondering whether managed='yes' belongs as a top-level attribute under 'source' as it's possibly specific to the 'pci' setting of type.
+ <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/> + </source> + <target dev='vde' bus='virtio'/> + </disk> </devices> ...</pre>
[...]
@@ -3118,6 +3126,31 @@ <span class="since">Since 1.0.5</span> </p> </dd> + <dt><code>nvme</code></dt> + <dd> + To specify disk source for NVMe disk the <code>source</code> + element has the following attributes: + <dl> + <dt><code>type</code></dt> + <dd>The type of address specified in <code>address</code> + sub-element. Currently, only <code>pci</code> value is + accepted. + </dd> + + <dt><code>managed</code></dt> + <dd>This attribute instructs libvirt to detach NVMe + controller automatically on domain startup (<code>yes</code>) + or expect the controller to be detached by system + administrator (<code>no</code>). + </dd> + + <dt><code>namespace</code></dt> + <dd>The namespace ID which should be assigned to the domain. + According to NVMe standard, namespace numbers start from 1, + including. + </dd> + </dl> + </dd> </dl> With "file", "block", and "volume", one or more optional sub-elements <code>seclabel</code>, <a href="#seclabel">described @@ -3280,11 +3313,17 @@ initiator IQN needed to access the source via mandatory attribute <code>name</code>. </dd> + <dt><code>address</code></dt> + <dd>For disk of type <code>nvme</code> this element + specifies the PCI address of the host NVMe + controller. + <span class="since">Since 5.5.0</span> + </dd> </dl>
<p> - For a "file" or "volume" disk type which represents a cdrom or floppy - (the <code>device</code> attribute), it is possible to define + For a "file", "volume" or "nvme" disk type which represents a cdrom or + floppy (the <code>device</code> attribute), it is possible to define
You specifically forbid startup policy in the next commit, so what's the point of documenting it here?
policy what to do with the disk if the source file is not accessible. (NB, <code>startupPolicy</code> is not valid for "volume" disk unless the specified storage volume is of "file" type). This is done by the diff --git a/docs/schemas/domaincommon.rng b/docs/schemas/domaincommon.rng index 31db599ab9..f367e8f6fd 100644 --- a/docs/schemas/domaincommon.rng +++ b/docs/schemas/domaincommon.rng
[...]
@@ -1918,6 +1919,37 @@ </optional> </define>
+ <define name="diskSourceNvme"> + <attribute name="type"> + <value>nvme</value> + </attribute> + <optional> + <element name="source"> + <attribute name="type"> + <value>pci</value> + </attribute> + <attribute name="namespace"> + <ref name="uint32"/> + </attribute> + <optional> + <attribute name="managed"> + <ref name="virYesNo"/> + </attribute> + </optional> + <element name="address"> + <ref name="pciaddress"/> + </element> + <ref name="diskSourceCommon"/> + <optional> + <ref name="storageStartupPolicy"/> + </optional> + <optional> + <ref name="encryption"/> + </optional> + </element> + </optional> + </define> + <define name="diskTarget"> <data type="string"> <param name="pattern">(ioemu:)?(fd|hd|sd|vd|xvd|ubd)[a-zA-Z0-9_]+</param> diff --git a/tests/qemuxml2argvdata/disk-nvme.xml b/tests/qemuxml2argvdata/disk-nvme.xml new file mode 100644 index 0000000000..0b3dbad4eb --- /dev/null +++ b/tests/qemuxml2argvdata/disk-nvme.xml @@ -0,0 +1,55 @@ +<domain type='qemu'> + <name>QEMUGuest1</name> + <uuid>c7a5fdbd-edaf-9455-926a-d65c16db1809</uuid> + <memory unit='KiB'>219136</memory> + <currentMemory unit='KiB'>219136</currentMemory> + <vcpu placement='static'>1</vcpu> + <os> + <type arch='i686' machine='pc'>hvm</type> + <boot dev='hd'/> + </os> + <clock offset='utc'/> + <on_poweroff>destroy</on_poweroff> + <on_reboot>restart</on_reboot> + <on_crash>destroy</on_crash> + <devices> + <emulator>/usr/bin/qemu-system-i686</emulator> + <disk type='nvme' device='disk'> + <driver name='qemu' type='raw'/> + <source type='pci' managed='yes' namespace='1'> + <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/> + </source> + <target dev='vda' bus='virtio'/> + </disk> + <disk type='nvme' device='disk'> + <driver name='qemu' type='raw'/> + <source type='pci' managed='yes' namespace='2'> + <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
Make at least one of them use qcow2 as format.
+ </source> + <target dev='vdb' bus='virtio'/> + </disk> + <disk type='nvme' device='disk'> + <driver name='qemu' type='raw'/> + <source type='pci' managed='no' namespace='1'> + <address domain='0x0000' bus='0x02' slot='0x00' function='0x0'/> + </source> + <target dev='vdc' bus='virtio'/> + </disk> + <disk type='nvme' device='disk'> + <driver name='qemu' type='raw'/> + <source type='pci' managed='no' namespace='2'> + <address domain='0x0001' bus='0x02' slot='0x00' function='0x0'/> + <encryption format='luks'> + <secret type='passphrase' uuid='0a81f5b2-8403-7b23-c8d6-21ccc2f80d6f'/> + </encryption> + </source> + <target dev='vdd' bus='virtio'/> + </disk> + <controller type='usb' index='0'/> + <controller type='pci' index='0' model='pci-root'/> + <controller type='scsi' index='0' model='virtio-scsi'/> + <input type='mouse' bus='ps2'/> + <input type='keyboard' bus='ps2'/> + <memballoon model='none'/> + </devices> +</domain>
I'm also missing any form of documentation describing the caveats (e.g. users should not pass in an NVMe disk the host is using) or any advantages/reasons for using this.

On 7/16/19 2:35 PM, Peter Krempa wrote:
On Thu, Jul 11, 2019 at 17:53:57 +0200, Michal Privoznik wrote:
There is this class of PCI devices that act like disks: NVMe. Therefore, they are both PCI devices and disks. While we already have <hostdev/> (and can assign a NVMe device to a domain successfully) we don't have disk representation. There are three problems with PCI assignment in case of a NVMe device:
1) domains with <hostdev/> can't be migrated
2) NVMe device is assigned whole, there's no way to assign only a namespace
3) Because hypervisors see <hostdev/> they don't put block layer on top of it - users don't get all the fancy features like snapshots
NVMe namespaces are way of splitting one continuous NVDIMM memory
s/NVDIMM memory/NVMe device/
into smaller ones, effectively creating smaller NVMe-s (which can then be partitioned, LVMed, etc.)
Because of all of this the following XML was chosen to model a NVMe device:
<disk type='nvme' device='disk'> <driver name='qemu' type='raw'/> <source type='pci' managed='yes' namespace='1'> <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/> </source> <target dev='vda' bus='virtio'/> </disk>
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- docs/formatdomain.html.in | 45 +++++++++++++++++++++-- docs/schemas/domaincommon.rng | 32 ++++++++++++++++ tests/qemuxml2argvdata/disk-nvme.xml | 55 ++++++++++++++++++++++++++++ 3 files changed, 129 insertions(+), 3 deletions(-) create mode 100644 tests/qemuxml2argvdata/disk-nvme.xml
diff --git a/docs/formatdomain.html.in b/docs/formatdomain.html.in index a7a6ec32a5..545578076d 100644 --- a/docs/formatdomain.html.in +++ b/docs/formatdomain.html.in @@ -2922,6 +2922,13 @@ </backingStore> <target dev='vdd' bus='virtio'/> </disk> + <disk type='nvme' device='disk'> + <driver name='qemu' type='raw'/> + <source type='pci' managed='yes' namespace='1'>
The 'type' field may get a bit confusing as it is supposed to be stored in virStorageSource->nvme->type, while virStorageSource has its own type.
Well, I'm trying to mix two things here: <disk/> and <hostdev/> (well, PCI devices). There are other types of NVMe than PCIe attached ones. They are called NVMe-oF (NVMe over Fabrics), so I figured that we want to specifically say that the NVMe we are dealing with here is the PCIe one. But I guess it can also be added later, if we ever decide to support NVMe-oF (disks without @type would have 'pci' autofilled in a post parse callback).
Also I'm pondering whether managed='yes' belongs as a top-level attribute under 'source' as it's possibly specific to the 'pci' setting of type.
Yeah, designing future proof XML is hard. I'm open to any suggestion.
+ <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/> + </source> + <target dev='vde' bus='virtio'/> + </disk> </devices> ...</pre>
[...]
@@ -3118,6 +3126,31 @@ <span class="since">Since 1.0.5</span> </p> </dd> + <dt><code>nvme</code></dt> + <dd> + To specify disk source for NVMe disk the <code>source</code> + element has the following attributes: + <dl> + <dt><code>type</code></dt> + <dd>The type of address specified in <code>address</code> + sub-element. Currently, only <code>pci</code> value is + accepted. + </dd> + + <dt><code>managed</code></dt> + <dd>This attribute instructs libvirt to detach NVMe + controller automatically on domain startup (<code>yes</code>) + or expect the controller to be detached by system + administrator (<code>no</code>). + </dd> + + <dt><code>namespace</code></dt> + <dd>The namespace ID which should be assigned to the domain. + According to NVMe standard, namespace numbers start from 1, + including. + </dd> + </dl> + </dd> </dl> With "file", "block", and "volume", one or more optional sub-elements <code>seclabel</code>, <a href="#seclabel">described @@ -3280,11 +3313,17 @@ initiator IQN needed to access the source via mandatory attribute <code>name</code>. </dd> + <dt><code>address</code></dt> + <dd>For disk of type <code>nvme</code> this element + specifies the PCI address of the host NVMe + controller. + <span class="since">Since 5.5.0</span> + </dd> </dl>
<p> - For a "file" or "volume" disk type which represents a cdrom or floppy - (the <code>device</code> attribute), it is possible to define + For a "file", "volume" or "nvme" disk type which represents a cdrom or + floppy (the <code>device</code> attribute), it is possible to define
You specifically forbid startup policy in the next commit, so what's the point of documenting it here?
Ah, good point. Back in the day I still wanted to make startupPolicy work, but then decided not to.
policy what to do with the disk if the source file is not accessible. (NB, <code>startupPolicy</code> is not valid for "volume" disk unless the specified storage volume is of "file" type). This is done by the diff --git a/docs/schemas/domaincommon.rng b/docs/schemas/domaincommon.rng index 31db599ab9..f367e8f6fd 100644 --- a/docs/schemas/domaincommon.rng +++ b/docs/schemas/domaincommon.rng
[...]
@@ -1918,6 +1919,37 @@ </optional> </define>
+ <define name="diskSourceNvme"> + <attribute name="type"> + <value>nvme</value> + </attribute> + <optional> + <element name="source"> + <attribute name="type"> + <value>pci</value> + </attribute> + <attribute name="namespace"> + <ref name="uint32"/> + </attribute> + <optional> + <attribute name="managed"> + <ref name="virYesNo"/> + </attribute> + </optional> + <element name="address"> + <ref name="pciaddress"/> + </element> + <ref name="diskSourceCommon"/> + <optional> + <ref name="storageStartupPolicy"/> + </optional> + <optional> + <ref name="encryption"/> + </optional> + </element> + </optional> + </define> + <define name="diskTarget"> <data type="string"> <param name="pattern">(ioemu:)?(fd|hd|sd|vd|xvd|ubd)[a-zA-Z0-9_]+</param> diff --git a/tests/qemuxml2argvdata/disk-nvme.xml b/tests/qemuxml2argvdata/disk-nvme.xml new file mode 100644 index 0000000000..0b3dbad4eb --- /dev/null +++ b/tests/qemuxml2argvdata/disk-nvme.xml @@ -0,0 +1,55 @@ +<domain type='qemu'> + <name>QEMUGuest1</name> + <uuid>c7a5fdbd-edaf-9455-926a-d65c16db1809</uuid> + <memory unit='KiB'>219136</memory> + <currentMemory unit='KiB'>219136</currentMemory> + <vcpu placement='static'>1</vcpu> + <os> + <type arch='i686' machine='pc'>hvm</type> + <boot dev='hd'/> + </os> + <clock offset='utc'/> + <on_poweroff>destroy</on_poweroff> + <on_reboot>restart</on_reboot> + <on_crash>destroy</on_crash> + <devices> + <emulator>/usr/bin/qemu-system-i686</emulator> + <disk type='nvme' device='disk'> + <driver name='qemu' type='raw'/> + <source type='pci' managed='yes' namespace='1'> + <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/> + </source> + <target dev='vda' bus='virtio'/> + </disk> + <disk type='nvme' device='disk'> + <driver name='qemu' type='raw'/> + <source type='pci' managed='yes' namespace='2'> + <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
Make at least one of them use qcow2 as format.
Okay.
+ </source> + <target dev='vdb' bus='virtio'/> + </disk> + <disk type='nvme' device='disk'> + <driver name='qemu' type='raw'/> + <source type='pci' managed='no' namespace='1'> + <address domain='0x0000' bus='0x02' slot='0x00' function='0x0'/> + </source> + <target dev='vdc' bus='virtio'/> + </disk> + <disk type='nvme' device='disk'> + <driver name='qemu' type='raw'/> + <source type='pci' managed='no' namespace='2'> + <address domain='0x0001' bus='0x02' slot='0x00' function='0x0'/> + <encryption format='luks'> + <secret type='passphrase' uuid='0a81f5b2-8403-7b23-c8d6-21ccc2f80d6f'/> + </encryption> + </source> + <target dev='vdd' bus='virtio'/> + </disk> + <controller type='usb' index='0'/> + <controller type='pci' index='0' model='pci-root'/> + <controller type='scsi' index='0' model='virtio-scsi'/> + <input type='mouse' bus='ps2'/> + <input type='keyboard' bus='ps2'/> + <memballoon model='none'/> + </devices> +</domain>
I'm also missing any form of documentation describing the caveats (e.g. users should not pass in an NVMe disk the host is using).
Since I'm using virHostdevManager to track used NVMe-s, any attempt to start a domain with a taken NVMe disk will fail. So users will see that immediately :-) But okay, I will enhance the docs. Michal
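To make the "a taken NVMe disk will fail" behaviour concrete, here is a minimal sketch of an in-use list. It is not virHostdevManager's real API; nvmeKey, nvmeTracker, nvmeTrackerClaim and nvmeTrackerRelease are made-up names, and keying on PCI address plus namespace is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_TRACKED 16

/* Hypothetical key: host PCI address of the controller plus namespace ID. */
typedef struct {
    unsigned int domain, bus, slot, function;
    unsigned long namespace_id;
} nvmeKey;

/* Hypothetical tracker: a small fixed-size list of keys currently in use. */
typedef struct {
    nvmeKey used[MAX_TRACKED];
    size_t nused;
} nvmeTracker;

static bool
nvmeKeyEqual(const nvmeKey *a, const nvmeKey *b)
{
    return a->domain == b->domain && a->bus == b->bus &&
           a->slot == b->slot && a->function == b->function &&
           a->namespace_id == b->namespace_id;
}

/* Returns 0 when the device was free and is now claimed,
 * -1 when it is already taken or the list is full. */
int
nvmeTrackerClaim(nvmeTracker *t, const nvmeKey *key)
{
    size_t i;

    assert(t != NULL && key != NULL);

    for (i = 0; i < t->nused; i++) {
        if (nvmeKeyEqual(&t->used[i], key))
            return -1;
    }

    if (t->nused == MAX_TRACKED)
        return -1;

    t->used[t->nused++] = *key;
    return 0;
}

/* Drops the key from the list (no-op if it was never claimed). */
void
nvmeTrackerRelease(nvmeTracker *t, const nvmeKey *key)
{
    size_t i;

    assert(t != NULL && key != NULL);

    for (i = 0; i < t->nused; i++) {
        if (nvmeKeyEqual(&t->used[i], key)) {
            /* Move the last entry into the freed slot. */
            t->used[i] = t->used[--t->nused];
            return;
        }
    }
}
```

A second claim of the same address and namespace fails until the first holder releases it, which is the point at which a domain start would report an error. Note that a list like this only knows about devices claimed through it; it does not by itself detect an NVMe disk the host is still using.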

To simplify implementation, some restrictions are added. For instance, an NVMe disk can't go to any bus but virtio and has to be type of 'disk' and can't have startupPolicy set. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/conf/domain_conf.c | 129 +++++++++++++++++++++++++ src/libvirt_private.syms | 1 + src/qemu/qemu_block.c | 1 + src/qemu/qemu_command.c | 1 + src/qemu/qemu_driver.c | 4 + src/qemu/qemu_migration.c | 1 + src/util/virstoragefile.c | 59 +++++++++++ src/util/virstoragefile.h | 15 +++ src/xenconfig/xen_xl.c | 1 + tests/qemuxml2argvdata/disk-nvme.xml | 12 ++- tests/qemuxml2xmloutdata/disk-nvme.xml | 1 + tests/qemuxml2xmltest.c | 1 + 12 files changed, 224 insertions(+), 2 deletions(-) create mode 120000 tests/qemuxml2xmloutdata/disk-nvme.xml diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 3323c9a5b1..73f5e1fa0f 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -5088,6 +5088,11 @@ virDomainDiskDefPostParse(virDomainDiskDefPtr disk, return -1; } + if (disk->src->type == VIR_STORAGE_TYPE_NVME) { + if (disk->src->nvme->managed == VIR_TRISTATE_BOOL_ABSENT) + disk->src->nvme->managed = VIR_TRISTATE_BOOL_YES; + } + if (disk->info.type == VIR_DOMAIN_DEVICE_ADDRESS_TYPE_NONE && virDomainDiskDefAssignAddress(xmlopt, disk, def) < 0) { return -1; @@ -5938,6 +5943,38 @@ virDomainDiskDefValidate(const virDomainDiskDef *disk) return -1; } + if (disk->src->type == VIR_STORAGE_TYPE_NVME) { + /* These might not be valid for all hypervisors, but be + * strict now and possibly refine in the future. 
*/ + if (disk->device != VIR_DOMAIN_DISK_DEVICE_DISK) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("Unsupported disk type '%s' for NVMe disk"), + virDomainDiskDeviceTypeToString(disk->device)); + return -1; + } + + if (disk->bus != VIR_DOMAIN_DISK_BUS_VIRTIO) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("Unsupported bus '%s' for NVMe disk"), + virDomainDiskBusTypeToString(disk->bus)); + return -1; + } + + if (disk->startupPolicy != VIR_DOMAIN_STARTUP_POLICY_DEFAULT && + disk->startupPolicy != VIR_DOMAIN_STARTUP_POLICY_MANDATORY) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("Unsupported startup policy '%s' for NVMe disk"), + virDomainStartupPolicyTypeToString(disk->startupPolicy)); + return -1; + } + + if (disk->src->shared) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("Unsupported <shareable/> for NVMe disk")); + return -1; + } + } + return 0; } @@ -9184,6 +9221,76 @@ virDomainDiskSourceNetworkParse(xmlNodePtr node, } +static int +virDomainDiskSourceNVMeParse(xmlNodePtr node, + xmlXPathContextPtr ctxt, + virStorageSourcePtr src) +{ + VIR_AUTOPTR(virStorageSourceNVMeDef) nvme = NULL; + VIR_AUTOFREE(char *) type = NULL; + VIR_AUTOFREE(char *) namespace = NULL; + VIR_AUTOFREE(char *) managed = NULL; + xmlNodePtr address; + + if (VIR_ALLOC(nvme) < 0) + return -1; + + if (!(type = virXMLPropString(node, "type"))) { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("missing 'type' attribute to disk source")); + return -1; + } + + if (STRNEQ(type, "pci")) { + virReportError(VIR_ERR_XML_ERROR, + _("unsupported source type '%s'"), + type); + return -1; + } + + if (!(namespace = virXMLPropString(node, "namespace"))) { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("missing 'namespace' attribute to disk source")); + return -1; + } + + if (virStrToLong_ul(namespace, NULL, 10, &nvme->namespace) < 0) { + virReportError(VIR_ERR_XML_ERROR, + _("malformed namespace '%s'"), + namespace); + return -1; + } + + /* NVMe namespaces start from 1 */ + if 
(nvme->namespace == 0) { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("NVMe namespace can't be zero")); + return -1; + } + + if ((managed = virXMLPropString(node, "managed"))) { + if ((nvme->managed = virTristateBoolTypeFromString(managed)) <= 0) { + virReportError(VIR_ERR_XML_ERROR, + _("malformed managed value '%s'"), + managed); + return -1; + } + } + + if (!(address = virXPathNode("./address", ctxt))) { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("NVMe disk source is missing address")); + return -1; + } + + if (virPCIDeviceAddressParseXML(address, &nvme->pciAddr) < 0) + return -1; + + VIR_STEAL_PTR(src->nvme, nvme); + return 0; +} + + static int virDomainDiskSourcePRParse(xmlNodePtr node, xmlXPathContextPtr ctxt, @@ -9284,6 +9391,10 @@ virDomainStorageSourceParse(xmlNodePtr node, if (virDomainDiskSourcePoolDefParse(node, &src->srcpool) < 0) return -1; break; + case VIR_STORAGE_TYPE_NVME: + if (virDomainDiskSourceNVMeParse(node, ctxt, src) < 0) + return -1; + break; case VIR_STORAGE_TYPE_NONE: case VIR_STORAGE_TYPE_LAST: virReportError(VIR_ERR_INTERNAL_ERROR, @@ -23922,6 +24033,19 @@ virDomainDiskSourceFormatNetwork(virBufferPtr attrBuf, } +static void +virDomainDiskSourceNVMeFormat(virBufferPtr attrBuf, + virBufferPtr childBuf, + const virStorageSourceNVMeDef *nvme) +{ + virBufferAddLit(attrBuf, " type='pci'"); + virBufferAsprintf(attrBuf, " managed='%s'", + virTristateBoolTypeToString(nvme->managed)); + virBufferAsprintf(attrBuf, " namespace='%ld'", nvme->namespace); + virPCIDeviceAddressFormat(childBuf, nvme->pciAddr, false); +} + + static int virDomainDiskSourceFormatPrivateData(virBufferPtr buf, virStorageSourcePtr src, @@ -24008,6 +24132,11 @@ virDomainDiskSourceFormat(virBufferPtr buf, break; + case VIR_STORAGE_TYPE_NVME: + if (src->nvme) + virDomainDiskSourceNVMeFormat(&attrBuf, &childBuf, src->nvme); + break; + case VIR_STORAGE_TYPE_NONE: case VIR_STORAGE_TYPE_LAST: virReportError(VIR_ERR_INTERNAL_ERROR, diff --git a/src/libvirt_private.syms 
b/src/libvirt_private.syms index 6cef8d20fe..350b638193 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -2994,6 +2994,7 @@ virStorageSourceNetworkAssignDefaultPorts; virStorageSourceNew; virStorageSourceNewFromBacking; virStorageSourceNewFromBackingAbsolute; +virStorageSourceNVMeDefFree; virStorageSourceParseRBDColonString; virStorageSourcePoolDefFree; virStorageSourcePoolModeTypeFromString; diff --git a/src/qemu/qemu_block.c b/src/qemu/qemu_block.c index 0a6522577d..5eeb3757f1 100644 --- a/src/qemu/qemu_block.c +++ b/src/qemu/qemu_block.c @@ -1050,6 +1050,7 @@ qemuBlockStorageSourceGetBackendProps(virStorageSourcePtr src, break; case VIR_STORAGE_TYPE_VOLUME: + case VIR_STORAGE_TYPE_NVME: case VIR_STORAGE_TYPE_NONE: case VIR_STORAGE_TYPE_LAST: return NULL; diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index 688dc324c6..927641cf46 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -1184,6 +1184,7 @@ qemuGetDriveSourceString(virStorageSourcePtr src, break; case VIR_STORAGE_TYPE_VOLUME: + case VIR_STORAGE_TYPE_NVME: case VIR_STORAGE_TYPE_NONE: case VIR_STORAGE_TYPE_LAST: break; diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index a52b54b9d8..9bf12ea20d 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -14637,6 +14637,7 @@ qemuDomainSnapshotPrepareDiskExternalInactive(virDomainSnapshotDiskDefPtr snapdi case VIR_STORAGE_TYPE_DIR: case VIR_STORAGE_TYPE_VOLUME: + case VIR_STORAGE_TYPE_NVME: case VIR_STORAGE_TYPE_NONE: case VIR_STORAGE_TYPE_LAST: virReportError(VIR_ERR_INTERNAL_ERROR, @@ -14653,6 +14654,7 @@ qemuDomainSnapshotPrepareDiskExternalInactive(virDomainSnapshotDiskDefPtr snapdi case VIR_STORAGE_TYPE_NETWORK: case VIR_STORAGE_TYPE_DIR: case VIR_STORAGE_TYPE_VOLUME: + case VIR_STORAGE_TYPE_NVME: case VIR_STORAGE_TYPE_NONE: case VIR_STORAGE_TYPE_LAST: virReportError(VIR_ERR_INTERNAL_ERROR, @@ -14715,6 +14717,7 @@ 
qemuDomainSnapshotPrepareDiskExternalActive(virDomainSnapshotDiskDefPtr snapdisk case VIR_STORAGE_TYPE_DIR: case VIR_STORAGE_TYPE_VOLUME: + case VIR_STORAGE_TYPE_NVME: case VIR_STORAGE_TYPE_NONE: case VIR_STORAGE_TYPE_LAST: virReportError(VIR_ERR_INTERNAL_ERROR, @@ -14837,6 +14840,7 @@ qemuDomainSnapshotPrepareDiskInternal(virDomainDiskDefPtr disk, case VIR_STORAGE_TYPE_DIR: case VIR_STORAGE_TYPE_VOLUME: + case VIR_STORAGE_TYPE_NVME: case VIR_STORAGE_TYPE_NONE: case VIR_STORAGE_TYPE_LAST: virReportError(VIR_ERR_INTERNAL_ERROR, diff --git a/src/qemu/qemu_migration.c b/src/qemu/qemu_migration.c index 2436f5051b..87adccab3d 100644 --- a/src/qemu/qemu_migration.c +++ b/src/qemu/qemu_migration.c @@ -227,6 +227,7 @@ qemuMigrationDstPrecreateDisk(virConnectPtr conn, case VIR_STORAGE_TYPE_BLOCK: case VIR_STORAGE_TYPE_DIR: + case VIR_STORAGE_TYPE_NVME: case VIR_STORAGE_TYPE_NONE: case VIR_STORAGE_TYPE_LAST: virReportError(VIR_ERR_INTERNAL_ERROR, diff --git a/src/util/virstoragefile.c b/src/util/virstoragefile.c index 269d0050fd..18aa33fe05 100644 --- a/src/util/virstoragefile.c +++ b/src/util/virstoragefile.c @@ -57,6 +57,7 @@ VIR_ENUM_IMPL(virStorage, "dir", "network", "volume", + "nvme", ); VIR_ENUM_IMPL(virStorageFileFormat, @@ -2114,6 +2115,48 @@ virStoragePRDefCopy(virStoragePRDefPtr src) } +static virStorageSourceNVMeDefPtr +virStorageSourceNVMeDefCopy(const virStorageSourceNVMeDef *src) +{ + VIR_AUTOPTR(virStorageSourceNVMeDef) ret = NULL; + + if (VIR_ALLOC(ret) < 0) + return NULL; + + *ret = *src; + VIR_RETURN_PTR(ret); +} + + +static bool +virStorageSourceNVMeDefIsEqual(const virStorageSourceNVMeDef *a, + const virStorageSourceNVMeDef *b) +{ + if (!a && !b) + return true; + + if (!a || !b) + return false; + + if (a->namespace != b->namespace || + a->managed != b->managed || + !virPCIDeviceAddressEqual(&a->pciAddr, &b->pciAddr)) + return false; + + return true; +} + + +void +virStorageSourceNVMeDefFree(virStorageSourceNVMeDefPtr def) +{ + if (!def) + return; + + 
VIR_FREE(def); +} + + virSecurityDeviceLabelDefPtr virStorageSourceGetSecurityLabelDef(virStorageSourcePtr src, const char *model) @@ -2323,6 +2366,10 @@ virStorageSourceCopy(const virStorageSource *src, !(def->pr = virStoragePRDefCopy(src->pr))) return NULL; + if (src->nvme && + !(def->nvme = virStorageSourceNVMeDefCopy(src->nvme))) + return NULL; + if (virStorageSourceInitiatorCopy(&def->initiator, &src->initiator)) return NULL; @@ -2376,6 +2423,10 @@ virStorageSourceIsSameLocation(virStorageSourcePtr a, } } + if (a->type == VIR_STORAGE_TYPE_NVME && + !virStorageSourceNVMeDefIsEqual(a->nvme, b->nvme)) + return false; + return true; } @@ -2463,6 +2514,7 @@ virStorageSourceIsLocalStorage(const virStorageSource *src) case VIR_STORAGE_TYPE_NETWORK: case VIR_STORAGE_TYPE_VOLUME: + case VIR_STORAGE_TYPE_NVME: case VIR_STORAGE_TYPE_LAST: case VIR_STORAGE_TYPE_NONE: return false; @@ -2493,6 +2545,10 @@ virStorageSourceIsEmpty(virStorageSourcePtr src) src->protocol == VIR_STORAGE_NET_PROTOCOL_NONE) return true; + if (src->type == VIR_STORAGE_TYPE_NVME && + !src->nvme) + return true; + return false; } @@ -2548,6 +2604,7 @@ virStorageSourceClear(virStorageSourcePtr def) VIR_FREE(def->compat); virStorageEncryptionFree(def->encryption); virStoragePRDefFree(def->pr); + virStorageSourceNVMeDefFree(def->nvme); virStorageSourceSeclabelsClear(def); virStoragePermsFree(def->perms); VIR_FREE(def->timestamps); @@ -3776,6 +3833,7 @@ virStorageSourceUpdatePhysicalSize(virStorageSourcePtr src, /* We shouldn't get VOLUME, but the switch requires all cases */ case VIR_STORAGE_TYPE_VOLUME: + case VIR_STORAGE_TYPE_NVME: case VIR_STORAGE_TYPE_NONE: case VIR_STORAGE_TYPE_LAST: virReportError(VIR_ERR_INTERNAL_ERROR, @@ -4242,6 +4300,7 @@ virStorageSourceIsRelative(virStorageSourcePtr src) case VIR_STORAGE_TYPE_NETWORK: case VIR_STORAGE_TYPE_VOLUME: + case VIR_STORAGE_TYPE_NVME: case VIR_STORAGE_TYPE_NONE: case VIR_STORAGE_TYPE_LAST: return false; diff --git a/src/util/virstoragefile.h 
b/src/util/virstoragefile.h index 38ba901858..a1294ea608 100644 --- a/src/util/virstoragefile.h +++ b/src/util/virstoragefile.h @@ -31,6 +31,7 @@ #include "virsecret.h" #include "virautoclean.h" #include "virenum.h" +#include "virpci.h" /* Minimum header size required to probe all known formats with * virStorageFileProbeFormat, or obtain metadata from a known format. @@ -52,6 +53,7 @@ typedef enum { VIR_STORAGE_TYPE_DIR, VIR_STORAGE_TYPE_NETWORK, VIR_STORAGE_TYPE_VOLUME, + VIR_STORAGE_TYPE_NVME, VIR_STORAGE_TYPE_LAST } virStorageType; @@ -231,6 +233,14 @@ struct _virStorageSourceInitiatorDef { char *iqn; /* Initiator IQN */ }; +typedef struct _virStorageSourceNVMeDef virStorageSourceNVMeDef; +typedef virStorageSourceNVMeDef *virStorageSourceNVMeDefPtr; +struct _virStorageSourceNVMeDef { + unsigned long namespace; + int managed; /* enum virTristateBool */ + virPCIDeviceAddress pciAddr; +}; + typedef struct _virStorageDriverData virStorageDriverData; typedef virStorageDriverData *virStorageDriverDataPtr; @@ -262,6 +272,8 @@ struct _virStorageSource { bool encryptionInherited; virStoragePRDefPtr pr; + virStorageSourceNVMeDefPtr nvme; /* type == VIR_STORAGE_TYPE_NVME */ + virStorageSourceInitiatorDef initiator; virObjectPtr privateData; @@ -416,6 +428,9 @@ bool virStoragePRDefIsManaged(virStoragePRDefPtr prd); bool virStorageSourceChainHasManagedPR(virStorageSourcePtr src); +void virStorageSourceNVMeDefFree(virStorageSourceNVMeDefPtr def); +VIR_DEFINE_AUTOPTR_FUNC(virStorageSourceNVMeDef, virStorageSourceNVMeDefFree); + virSecurityDeviceLabelDefPtr virStorageSourceGetSecurityLabelDef(virStorageSourcePtr src, const char *model); diff --git a/src/xenconfig/xen_xl.c b/src/xenconfig/xen_xl.c index ca094d30c2..fa15e4e2a5 100644 --- a/src/xenconfig/xen_xl.c +++ b/src/xenconfig/xen_xl.c @@ -1662,6 +1662,7 @@ xenFormatXLDiskSrc(virStorageSourcePtr src, char **srcstr) break; case VIR_STORAGE_TYPE_VOLUME: + case VIR_STORAGE_TYPE_NVME: case VIR_STORAGE_TYPE_NONE: case 
VIR_STORAGE_TYPE_LAST: break; diff --git a/tests/qemuxml2argvdata/disk-nvme.xml b/tests/qemuxml2argvdata/disk-nvme.xml index 0b3dbad4eb..fe956d5ab6 100644 --- a/tests/qemuxml2argvdata/disk-nvme.xml +++ b/tests/qemuxml2argvdata/disk-nvme.xml @@ -20,6 +20,7 @@ <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/> </source> <target dev='vda' bus='virtio'/> + <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/> </disk> <disk type='nvme' device='disk'> <driver name='qemu' type='raw'/> @@ -27,6 +28,7 @@ <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/> </source> <target dev='vdb' bus='virtio'/> + <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/> </disk> <disk type='nvme' device='disk'> <driver name='qemu' type='raw'/> @@ -34,6 +36,7 @@ <address domain='0x0000' bus='0x02' slot='0x00' function='0x0'/> </source> <target dev='vdc' bus='virtio'/> + <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/> </disk> <disk type='nvme' device='disk'> <driver name='qemu' type='raw'/> @@ -44,10 +47,15 @@ </encryption> </source> <target dev='vdd' bus='virtio'/> + <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/> </disk> - <controller type='usb' index='0'/> + <controller type='usb' index='0'> + <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/> + </controller> <controller type='pci' index='0' model='pci-root'/> - <controller type='scsi' index='0' model='virtio-scsi'/> + <controller type='scsi' index='0' model='virtio-scsi'> + <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/> + </controller> <input type='mouse' bus='ps2'/> <input type='keyboard' bus='ps2'/> <memballoon model='none'/> diff --git a/tests/qemuxml2xmloutdata/disk-nvme.xml b/tests/qemuxml2xmloutdata/disk-nvme.xml new file mode 120000 index 0000000000..ea9eb267ac --- /dev/null +++ b/tests/qemuxml2xmloutdata/disk-nvme.xml @@ -0,0 +1 @@ 
+../qemuxml2argvdata/disk-nvme.xml \ No newline at end of file diff --git a/tests/qemuxml2xmltest.c b/tests/qemuxml2xmltest.c index 6d808e172f..c9f3a8dbfa 100644 --- a/tests/qemuxml2xmltest.c +++ b/tests/qemuxml2xmltest.c @@ -336,6 +336,7 @@ mymain(void) DO_TEST("disk-network-sheepdog", NONE); DO_TEST("disk-network-vxhs", NONE); DO_TEST("disk-network-tlsx509", NONE); + DO_TEST("disk-nvme", QEMU_CAPS_VIRTIO_SCSI); DO_TEST("disk-scsi", QEMU_CAPS_SCSI_LSI, QEMU_CAPS_SCSI_MEGASAS, QEMU_CAPS_SCSI_MPTSAS1068, QEMU_CAPS_SCSI_DISK_WWN); DO_TEST("disk-virtio-scsi-reservations", -- 2.21.0

On Thu, Jul 11, 2019 at 17:53:58 +0200, Michal Privoznik wrote:
To simplify implementation, some restrictions are added. For instance, an NVMe disk can't go to any bus but virtio and has to be type of 'disk' and can't have startupPolicy set.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/conf/domain_conf.c | 129 +++++++++++++++++++++++++ src/libvirt_private.syms | 1 + src/qemu/qemu_block.c | 1 + src/qemu/qemu_command.c | 1 + src/qemu/qemu_driver.c | 4 + src/qemu/qemu_migration.c | 1 + src/util/virstoragefile.c | 59 +++++++++++ src/util/virstoragefile.h | 15 +++ src/xenconfig/xen_xl.c | 1 + tests/qemuxml2argvdata/disk-nvme.xml | 12 ++- tests/qemuxml2xmloutdata/disk-nvme.xml | 1 + tests/qemuxml2xmltest.c | 1 + 12 files changed, 224 insertions(+), 2 deletions(-) create mode 120000 tests/qemuxml2xmloutdata/disk-nvme.xml
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 3323c9a5b1..73f5e1fa0f 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -5088,6 +5088,11 @@ virDomainDiskDefPostParse(virDomainDiskDefPtr disk, return -1; }
+ if (disk->src->type == VIR_STORAGE_TYPE_NVME) { + if (disk->src->nvme->managed == VIR_TRISTATE_BOOL_ABSENT) + disk->src->nvme->managed = VIR_TRISTATE_BOOL_YES; + } + if (disk->info.type == VIR_DOMAIN_DEVICE_ADDRESS_TYPE_NONE && virDomainDiskDefAssignAddress(xmlopt, disk, def) < 0) { return -1; @@ -5938,6 +5943,38 @@ virDomainDiskDefValidate(const virDomainDiskDef *disk) return -1; }
+ if (disk->src->type == VIR_STORAGE_TYPE_NVME) {
Note that this can potentially happen in the backing chain as well, so this should be checked throughout the whole backing chain. Also this seems all to belong to the qemu specific post parse callback.
+ /* These might not be valid for all hypervisors, but be + * strict now and possibly refine in the future. */ + if (disk->device != VIR_DOMAIN_DISK_DEVICE_DISK) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("Unsupported disk type '%s' for NVMe disk"), + virDomainDiskDeviceTypeToString(disk->device)); + return -1; + } + + if (disk->bus != VIR_DOMAIN_DISK_BUS_VIRTIO) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("Unsupported bus '%s' for NVMe disk"), + virDomainDiskBusTypeToString(disk->bus)); + return -1; + } + + if (disk->startupPolicy != VIR_DOMAIN_STARTUP_POLICY_DEFAULT && + disk->startupPolicy != VIR_DOMAIN_STARTUP_POLICY_MANDATORY) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("Unsupported startup policy '%s' for NVMe disk"), + virDomainStartupPolicyTypeToString(disk->startupPolicy)); + return -1; + } + + if (disk->src->shared) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("Unsupported <shareable/> for NVMe disk")); + return -1; + } + } +

On 7/11/19 6:05 PM, Peter Krempa wrote:
On Thu, Jul 11, 2019 at 17:53:58 +0200, Michal Privoznik wrote:
To simplify implementation, some restrictions are added. For instance, an NVMe disk can't go to any bus but virtio and has to be type of 'disk' and can't have startupPolicy set.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> ---
[...]
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 3323c9a5b1..73f5e1fa0f 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -5088,6 +5088,11 @@ virDomainDiskDefPostParse(virDomainDiskDefPtr disk, return -1; }
+ if (disk->src->type == VIR_STORAGE_TYPE_NVME) { + if (disk->src->nvme->managed == VIR_TRISTATE_BOOL_ABSENT) + disk->src->nvme->managed = VIR_TRISTATE_BOOL_YES; + } + if (disk->info.type == VIR_DOMAIN_DEVICE_ADDRESS_TYPE_NONE && virDomainDiskDefAssignAddress(xmlopt, disk, def) < 0) { return -1; @@ -5938,6 +5943,38 @@ virDomainDiskDefValidate(const virDomainDiskDef *disk) return -1; }
+ if (disk->src->type == VIR_STORAGE_TYPE_NVME) {
Note that this can potentially happen in the backing chain as well, so this should be checked throughout the whole backing chain.
Is that so? I mean, other checks done in this function check only 'top level' disk->src too.
Also this seems all to belong to the qemu specific post parse callback.
Possibly. But since other drivers would still use the virNVMeDevice module I'm adding later in this series, and since the module is built on these assumptions, I figured the best place to check for them is in the driver agnostic callback. Michal

On Thu, Jul 11, 2019 at 18:12:16 +0200, Michal Privoznik wrote:
On 7/11/19 6:05 PM, Peter Krempa wrote:
On Thu, Jul 11, 2019 at 17:53:58 +0200, Michal Privoznik wrote:
To simplify implementation, some restrictions are added. For instance, an NVMe disk can't go to any bus but virtio and has to be type of 'disk' and can't have startupPolicy set.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> ---
[...]
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 3323c9a5b1..73f5e1fa0f 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -5088,6 +5088,11 @@ virDomainDiskDefPostParse(virDomainDiskDefPtr disk, return -1; } + if (disk->src->type == VIR_STORAGE_TYPE_NVME) { + if (disk->src->nvme->managed == VIR_TRISTATE_BOOL_ABSENT) + disk->src->nvme->managed = VIR_TRISTATE_BOOL_YES; + } + if (disk->info.type == VIR_DOMAIN_DEVICE_ADDRESS_TYPE_NONE && virDomainDiskDefAssignAddress(xmlopt, disk, def) < 0) { return -1; @@ -5938,6 +5943,38 @@ virDomainDiskDefValidate(const virDomainDiskDef *disk) return -1; } + if (disk->src->type == VIR_STORAGE_TYPE_NVME) {
Note that this can potentially happen in the backing chain as well, so this should be checked throughout the whole backing chain.
Is that so? I mean, other checks done in this function check only 'top level' disk->src too.
Yes, it certainly will be possible with blockdev. Also, you can have such a file in the backing chain which gets detected from the file metadata on the disk, so such a check will probably need to be duplicated also when starting the VM (the validate callback function may be a better match).
Also this seems all to belong to the qemu specific post parse callback.
Possibly. But since other drivers would still use the virNVMeDevice module I'm adding later in this series, and since the module is built on these assumptions, I figured the best place to check for them is in the driver agnostic callback.
Fair enough.

On Thu, Jul 11, 2019 at 18:16:58 +0200, Peter Krempa wrote:
On Thu, Jul 11, 2019 at 18:12:16 +0200, Michal Privoznik wrote:
On 7/11/19 6:05 PM, Peter Krempa wrote:
On Thu, Jul 11, 2019 at 17:53:58 +0200, Michal Privoznik wrote:
To simplify implementation, some restrictions are added. For instance, an NVMe disk can't go to any bus but virtio and has to be type of 'disk' and can't have startupPolicy set.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> ---
[...]
Yes, it certainly will be possible with blockdev. Also, you can have such a file in the backing chain which gets detected from the file metadata on the disk, so such a check will probably need to be duplicated also when starting the VM (the validate callback function may be a better match).
Also this seems all to belong to the qemu specific post parse callback.
Possibly. But since other drivers would still use the virNVMeDevice module I'm adding later in this series, and since the module is built on these assumptions, I figured the best place to check for them is in the driver agnostic callback.
Fair enough.
Thinking about this a bit more, if there will be a separate module for this, that module should expose the validator. Also I'm not persuaded about the universality of this code at all, thus I doubt that it will be reused in other hypervisors as it requires a userspace driver for NVMe in the hypervisor, which is a pretty niche configuration/use case.

On 7/16/19 2:38 PM, Peter Krempa wrote:
On Thu, Jul 11, 2019 at 18:16:58 +0200, Peter Krempa wrote:
On Thu, Jul 11, 2019 at 18:12:16 +0200, Michal Privoznik wrote:
On 7/11/19 6:05 PM, Peter Krempa wrote:
On Thu, Jul 11, 2019 at 17:53:58 +0200, Michal Privoznik wrote:
To simplify implementation, some restrictions are added. For instance, an NVMe disk can't go to any bus but virtio and has to be type of 'disk' and can't have startupPolicy set.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> ---
[...]
Yes, it certainly will be possible with blockdev. Also, you can have such a file in the backing chain which gets detected from the file metadata on the disk, so such a check will probably need to be duplicated also when starting the VM (the validate callback function may be a better match).
Also this seems all to belong to the qemu specific post parse callback.
Possibly. But since other drivers would still use the virNVMeDevice module I'm adding later in this series, and since the module is built on these assumptions, I figured the best place to check for them is in the driver agnostic callback.
Fair enough.
Thinking about this a bit more, if there will be a separate module for this, that module should expose the validator.
Well, the module sees the virNVMeDevice struct and not virDomainDiskDef (these are two different structures). And since we do not want code in src/util/ to include anything from src/conf/, I'm not quite sure how to access virDomainDiskDef from src/util/virnvme.c.
Also I'm not persuaded about the universality of this code at all, thus I doubt that it will be reused in other hypervisors as it requires a userspace driver for NVMe in the hypervisor, which is a pretty niche configuration/use case.
Well, nearly everything that we work on is qemu specific, because quite frankly, we only touch other drivers when an internal API is changed and a codebase wide adoption is needed. Michal

On Thu, Jul 11, 2019 at 17:53:58 +0200, Michal Privoznik wrote:
To simplify implementation, some restrictions are added. For instance, an NVMe disk can't go to any bus but virtio and has to be type of 'disk' and can't have startupPolicy set.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/conf/domain_conf.c | 129 +++++++++++++++++++++++++ src/libvirt_private.syms | 1 + src/qemu/qemu_block.c | 1 + src/qemu/qemu_command.c | 1 + src/qemu/qemu_driver.c | 4 + src/qemu/qemu_migration.c | 1 + src/util/virstoragefile.c | 59 +++++++++++ src/util/virstoragefile.h | 15 +++ src/xenconfig/xen_xl.c | 1 + tests/qemuxml2argvdata/disk-nvme.xml | 12 ++- tests/qemuxml2xmloutdata/disk-nvme.xml | 1 + tests/qemuxml2xmltest.c | 1 + 12 files changed, 224 insertions(+), 2 deletions(-) create mode 120000 tests/qemuxml2xmloutdata/disk-nvme.xml
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 3323c9a5b1..73f5e1fa0f 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c
[...]
@@ -5938,6 +5943,38 @@ virDomainDiskDefValidate(const virDomainDiskDef *disk) return -1; }
+ if (disk->src->type == VIR_STORAGE_TYPE_NVME) { + /* These might not be valid for all hypervisors, but be + * strict now and possibly refine in the future. */ + if (disk->device != VIR_DOMAIN_DISK_DEVICE_DISK) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("Unsupported disk type '%s' for NVMe disk"), + virDomainDiskDeviceTypeToString(disk->device)); + return -1; + } + + if (disk->bus != VIR_DOMAIN_DISK_BUS_VIRTIO) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("Unsupported bus '%s' for NVMe disk"), + virDomainDiskBusTypeToString(disk->bus)); + return -1; + } + + if (disk->startupPolicy != VIR_DOMAIN_STARTUP_POLICY_DEFAULT && + disk->startupPolicy != VIR_DOMAIN_STARTUP_POLICY_MANDATORY) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("Unsupported startup policy '%s' for NVMe disk"), + virDomainStartupPolicyTypeToString(disk->startupPolicy)); + return -1; + } + + if (disk->src->shared) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("Unsupported <shareable/> for NVMe disk")); + return -1; + } + } + return 0; }
As noted in the other thread, this really should be extracted, placed in the validation callback rather than post parse and must iterate the backing chain if you want this to keep working with -blockdev.
@@ -9184,6 +9221,76 @@ virDomainDiskSourceNetworkParse(xmlNodePtr node, }
+static int +virDomainDiskSourceNVMeParse(xmlNodePtr node, + xmlXPathContextPtr ctxt, + virStorageSourcePtr src) +{ + VIR_AUTOPTR(virStorageSourceNVMeDef) nvme = NULL; + VIR_AUTOFREE(char *) type = NULL; + VIR_AUTOFREE(char *) namespace = NULL; + VIR_AUTOFREE(char *) managed = NULL; + xmlNodePtr address; + + if (VIR_ALLOC(nvme) < 0) + return -1; + + if (!(type = virXMLPropString(node, "type"))) { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("missing 'type' attribute to disk source")); + return -1; + } + + if (STRNEQ(type, "pci")) { + virReportError(VIR_ERR_XML_ERROR, + _("unsupported source type '%s'"), + type); + return -1; + } + + if (!(namespace = virXMLPropString(node, "namespace"))) { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("missing 'namespace' attribute to disk source")); + return -1; + } + + if (virStrToLong_ul(namespace, NULL, 10, &nvme->namespace) < 0) { + virReportError(VIR_ERR_XML_ERROR, + _("malformed namespace '%s'"), + namespace); + return -1; + } + + /* NVMe namespaces start from 1 */ + if (nvme->namespace == 0) { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("NVMe namespace can't be zero")); + return -1; + } + + if ((managed = virXMLPropString(node, "managed"))) { + if ((nvme->managed = virTristateBoolTypeFromString(managed)) <= 0) { + virReportError(VIR_ERR_XML_ERROR, + _("malformed managed value '%s'"), + managed); + return -1; + } + } + + if (!(address = virXPathNode("./address", ctxt))) { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("NVMe disk source is missing address")); + return -1; + }
I'm displeased that this is yet another function adding validation in the parser. You don't make the status quo any worse though so this is not a request to change it.
+ + if (virPCIDeviceAddressParseXML(address, &nvme->pciAddr) < 0) + return -1; + + VIR_STEAL_PTR(src->nvme, nvme); + return 0; +} + + static int virDomainDiskSourcePRParse(xmlNodePtr node, xmlXPathContextPtr ctxt,
[...]
diff --git a/src/qemu/qemu_migration.c b/src/qemu/qemu_migration.c index 2436f5051b..87adccab3d 100644 --- a/src/qemu/qemu_migration.c +++ b/src/qemu/qemu_migration.c
Missing change to 'qemuMigrationSrcIsSafe' to reject VMs containing NVMe disks not having shared source. Plus that function should probably check the full backing chain rather than the top element only.
diff --git a/src/util/virstoragefile.c b/src/util/virstoragefile.c index 269d0050fd..18aa33fe05 100644 --- a/src/util/virstoragefile.c +++ b/src/util/virstoragefile.c
[...]
@@ -2114,6 +2115,48 @@ virStoragePRDefCopy(virStoragePRDefPtr src) }
+static virStorageSourceNVMeDefPtr +virStorageSourceNVMeDefCopy(const virStorageSourceNVMeDef *src) +{ + VIR_AUTOPTR(virStorageSourceNVMeDef) ret = NULL; + + if (VIR_ALLOC(ret) < 0) + return NULL; + + *ret = *src;
You opted to use memcpy for the PCI address a few patches ago.
+ VIR_RETURN_PTR(ret); +}
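The remark above is about consistency between the two copy styles. A minimal sketch of one way to keep them aligned, assuming simplified stand-in types: `PciAddress`, `NvmeDef`, `pciAddressCopy` and `nvmeDefCopy` are hypothetical names used for illustration, not the libvirt definitions. For a struct without pointer members, plain assignment and memcpy produce the same result; routing the address through a single helper just keeps every caller uniform.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for virPCIDeviceAddress and
 * virStorageSourceNVMeDef; field names are illustrative only. */
typedef struct {
    unsigned int domain;
    unsigned int bus;
    unsigned int slot;
    unsigned int function;
} PciAddress;

typedef struct {
    unsigned int nsid;      /* namespace ID */
    int managed;            /* tristate stored as int */
    PciAddress pciAddr;
} NvmeDef;

/* Single helper for copying an address, so all callers copy it
 * the same way. */
static void
pciAddressCopy(PciAddress *dst, const PciAddress *src)
{
    memcpy(dst, src, sizeof(*dst));
}

/* Field-by-field copy: if a pointer member is ever added to
 * NvmeDef, this is the one place that must learn to deep-copy it. */
static NvmeDef *
nvmeDefCopy(const NvmeDef *src)
{
    NvmeDef *ret = calloc(1, sizeof(*ret));

    if (!ret)
        return NULL;

    ret->nsid = src->nsid;
    ret->managed = src->managed;
    pciAddressCopy(&ret->pciAddr, &src->pciAddr);
    return ret;
}
```

The caller owns the returned pointer and releases it with free().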
[...]
@@ -2463,6 +2514,7 @@ virStorageSourceIsLocalStorage(const virStorageSource *src)
case VIR_STORAGE_TYPE_NETWORK: case VIR_STORAGE_TYPE_VOLUME: + case VIR_STORAGE_TYPE_NVME:
Welp. While I agree that virStorageSourceIsLocalStorage should return false you should really add a documentation comment explaining why NVMe is different. (e.g. regular code accessing src->path would not work).
case VIR_STORAGE_TYPE_LAST: case VIR_STORAGE_TYPE_NONE: return false; @@ -2493,6 +2545,10 @@ virStorageSourceIsEmpty(virStorageSourcePtr src) src->protocol == VIR_STORAGE_NET_PROTOCOL_NONE) return true;
+ if (src->type == VIR_STORAGE_TYPE_NVME && + !src->nvme) + return true;
Formatting a disk type='nvme' without any data would not be parseable in our parser, so this will never happen.
+ return false; }
[...]
diff --git a/src/util/virstoragefile.h b/src/util/virstoragefile.h index 38ba901858..a1294ea608 100644 --- a/src/util/virstoragefile.h +++ b/src/util/virstoragefile.h
[...]
@@ -231,6 +233,14 @@ struct _virStorageSourceInitiatorDef { char *iqn; /* Initiator IQN */ };
Add a comment noting that the copy function needs to be fixed if this struct is ever updated.
+typedef struct _virStorageSourceNVMeDef virStorageSourceNVMeDef; +typedef virStorageSourceNVMeDef *virStorageSourceNVMeDefPtr; +struct _virStorageSourceNVMeDef { + unsigned long namespace;
'long' is either 32 or 64 bit depending on the architecture, please use unsigned int or unsigned long long.
+ int managed; /* enum virTristateBool */ + virPCIDeviceAddress pciAddr; +}; + typedef struct _virStorageDriverData virStorageDriverData; typedef virStorageDriverData *virStorageDriverDataPtr;
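To illustrate the width concern, here is a minimal sketch of parsing the namespace attribute into a fixed-meaning `unsigned int`, assuming only the C standard library. `parseNamespace` is a hypothetical helper, not a libvirt function; it rejects malformed input, values that do not fit, and zero (the patch notes that NVMe namespaces start from 1).

```c
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: parse a decimal namespace ID.
 * Returns 0 on success, -1 on malformed input, overflow or zero. */
static int
parseNamespace(const char *str, unsigned int *nsid)
{
    char *end = NULL;
    unsigned long val;

    /* strtoul() accepts leading whitespace and a sign; insist on
     * a digit so that "-1" is not silently wrapped around. */
    if (!str || !isdigit((unsigned char)str[0]))
        return -1;

    errno = 0;
    val = strtoul(str, &end, 10);
    if (errno != 0 || *end != '\0')
        return -1;

    /* Where unsigned long is 64 bits wide, make sure the value
     * still fits into unsigned int. */
    if (val > UINT_MAX)
        return -1;

    /* NVMe namespaces start from 1. */
    if (val == 0)
        return -1;

    *nsid = (unsigned int)val;
    return 0;
}
```

With this shape the accepted range is the same on every architecture, and the error path stays in one place.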
[...]
@@ -416,6 +428,9 @@ bool virStoragePRDefIsManaged(virStoragePRDefPtr prd); bool virStorageSourceChainHasManagedPR(virStorageSourcePtr src);
+void virStorageSourceNVMeDefFree(virStorageSourceNVMeDefPtr def); +VIR_DEFINE_AUTOPTR_FUNC(virStorageSourceNVMeDef, virStorageSourceNVMeDefFree);
Do these need to be exposed?
+ virSecurityDeviceLabelDefPtr virStorageSourceGetSecurityLabelDef(virStorageSourcePtr src, const char *model);
[...]
diff --git a/tests/qemuxml2argvdata/disk-nvme.xml b/tests/qemuxml2argvdata/disk-nvme.xml index 0b3dbad4eb..fe956d5ab6 100644 --- a/tests/qemuxml2argvdata/disk-nvme.xml +++ b/tests/qemuxml2argvdata/disk-nvme.xml @@ -20,6 +20,7 @@ <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/> </source> <target dev='vda' bus='virtio'/> + <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/> </disk> <disk type='nvme' device='disk'> <driver name='qemu' type='raw'/> @@ -27,6 +28,7 @@ <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/> </source> <target dev='vdb' bus='virtio'/> + <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/> </disk> <disk type='nvme' device='disk'> <driver name='qemu' type='raw'/> @@ -34,6 +36,7 @@ <address domain='0x0000' bus='0x02' slot='0x00' function='0x0'/> </source> <target dev='vdc' bus='virtio'/> + <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/> </disk> <disk type='nvme' device='disk'> <driver name='qemu' type='raw'/> @@ -44,10 +47,15 @@ </encryption> </source> <target dev='vdd' bus='virtio'/> + <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/> </disk> - <controller type='usb' index='0'/> + <controller type='usb' index='0'> + <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/> + </controller> <controller type='pci' index='0' model='pci-root'/> - <controller type='scsi' index='0' model='virtio-scsi'/> + <controller type='scsi' index='0' model='virtio-scsi'> + <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/> + </controller> <input type='mouse' bus='ps2'/> <input type='keyboard' bus='ps2'/> <memballoon model='none'/>
All of these belong to the previous patch adding the test file in the first place.

On 7/16/19 3:00 PM, Peter Krempa wrote:
On Thu, Jul 11, 2019 at 17:53:58 +0200, Michal Privoznik wrote:
To simplify implementation, some restrictions are added. For instance, an NVMe disk can't go to any bus but virtio, has to be of type 'disk', and can't have startupPolicy set.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/conf/domain_conf.c | 129 +++++++++++++++++++++++++ src/libvirt_private.syms | 1 + src/qemu/qemu_block.c | 1 + src/qemu/qemu_command.c | 1 + src/qemu/qemu_driver.c | 4 + src/qemu/qemu_migration.c | 1 + src/util/virstoragefile.c | 59 +++++++++++ src/util/virstoragefile.h | 15 +++ src/xenconfig/xen_xl.c | 1 + tests/qemuxml2argvdata/disk-nvme.xml | 12 ++- tests/qemuxml2xmloutdata/disk-nvme.xml | 1 + tests/qemuxml2xmltest.c | 1 + 12 files changed, 224 insertions(+), 2 deletions(-) create mode 120000 tests/qemuxml2xmloutdata/disk-nvme.xml
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 3323c9a5b1..73f5e1fa0f 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c
[...]
@@ -5938,6 +5943,38 @@ virDomainDiskDefValidate(const virDomainDiskDef *disk) return -1; }
+ if (disk->src->type == VIR_STORAGE_TYPE_NVME) { + /* These might not be valid for all hypervisors, but be + * strict now and possibly refine in the future. */ + if (disk->device != VIR_DOMAIN_DISK_DEVICE_DISK) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("Unsupported disk type '%s' for NVMe disk"), + virDomainDiskDeviceTypeToString(disk->device)); + return -1; + } + + if (disk->bus != VIR_DOMAIN_DISK_BUS_VIRTIO) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("Unsupported bus '%s' for NVMe disk"), + virDomainDiskBusTypeToString(disk->bus)); + return -1; + } + + if (disk->startupPolicy != VIR_DOMAIN_STARTUP_POLICY_DEFAULT && + disk->startupPolicy != VIR_DOMAIN_STARTUP_POLICY_MANDATORY) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("Unsupported startup policy '%s' for NVMe disk"), + virDomainStartupPolicyTypeToString(disk->startupPolicy)); + return -1; + } + + if (disk->src->shared) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("Unsupported <shareable/> for NVMe disk")); + return -1; + } + } + return 0; }
As noted in the other thread, this really should be extracted, placed in the validation callback rather than post parse and must iterate the backing chain if you want this to keep working with -blockdev.
I'm not sure I understand what you mean. This function is called virDomainDiskDefValidate() and therefore it is a validation callback rather than a post-parse callback. Where do you want me to put these checks?
@@ -9184,6 +9221,76 @@ virDomainDiskSourceNetworkParse(xmlNodePtr node, }
+static int +virDomainDiskSourceNVMeParse(xmlNodePtr node, + xmlXPathContextPtr ctxt, + virStorageSourcePtr src) +{ + VIR_AUTOPTR(virStorageSourceNVMeDef) nvme = NULL; + VIR_AUTOFREE(char *) type = NULL; + VIR_AUTOFREE(char *) namespace = NULL; + VIR_AUTOFREE(char *) managed = NULL; + xmlNodePtr address; + + if (VIR_ALLOC(nvme) < 0) + return -1; + + if (!(type = virXMLPropString(node, "type"))) { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("missing 'type' attribute to disk source")); + return -1; + } + + if (STRNEQ(type, "pci")) { + virReportError(VIR_ERR_XML_ERROR, + _("unsupported source type '%s'"), + type); + return -1; + } + + if (!(namespace = virXMLPropString(node, "namespace"))) { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("missing 'namespace' attribute to disk source")); + return -1; + } + + if (virStrToLong_ul(namespace, NULL, 10, &nvme->namespace) < 0) { + virReportError(VIR_ERR_XML_ERROR, + _("malformed namespace '%s'"), + namespace); + return -1; + } + + /* NVMe namespaces start from 1 */ + if (nvme->namespace == 0) { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("NVMe namespace can't be zero")); + return -1; + } + + if ((managed = virXMLPropString(node, "managed"))) { + if ((nvme->managed = virTristateBoolTypeFromString(managed)) <= 0) { + virReportError(VIR_ERR_XML_ERROR, + _("malformed managed value '%s'"), + managed); + return -1; + } + } + + if (!(address = virXPathNode("./address", ctxt))) { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("NVMe disk source is missing address")); + return -1; + }
I'm displeased that this is yet another function adding validation in the parser. You don't make the status quo any worse though so this is not a request to change it.
What validation do you mean? namespace != 0 check? Well, that stems straight from NVMe specs, so it is completely independent of any driver. Or do you mean this !address check? Well, it's needed later in the parsing.
+ + if (virPCIDeviceAddressParseXML(address, &nvme->pciAddr) < 0) + return -1; + + VIR_STEAL_PTR(src->nvme, nvme); + return 0; +} + + static int virDomainDiskSourcePRParse(xmlNodePtr node, xmlXPathContextPtr ctxt,
[...]
diff --git a/src/qemu/qemu_migration.c b/src/qemu/qemu_migration.c index 2436f5051b..87adccab3d 100644 --- a/src/qemu/qemu_migration.c +++ b/src/qemu/qemu_migration.c
Missing change to 'qemuMigrationSrcIsSafe' to reject VMs containing NVMe disks not having shared source.
An NVMe disk can't be shared. It's even explicitly denied in the validation callback. The reason is that there is no way to share a single PCI device with multiple domains. But this is a good point. I'll probably put it into a different commit though. Even though I'm changing some parts of the qemu driver here, it's only because of the way we handle switch() with enums.
Plus that function should probably check the full backing chain rather than the top element only.
Pre-existing, but I can try to fix it. Okay.
diff --git a/src/util/virstoragefile.c b/src/util/virstoragefile.c index 269d0050fd..18aa33fe05 100644 --- a/src/util/virstoragefile.c +++ b/src/util/virstoragefile.c
[...]
@@ -2114,6 +2115,48 @@ virStoragePRDefCopy(virStoragePRDefPtr src) }
+static virStorageSourceNVMeDefPtr +virStorageSourceNVMeDefCopy(const virStorageSourceNVMeDef *src) +{ + VIR_AUTOPTR(virStorageSourceNVMeDef) ret = NULL; + + if (VIR_ALLOC(ret) < 0) + return NULL; + + *ret = *src;
You opted to use memcpy for the PCI address a few patches ago.
Darn! You're right. Honestly, I wanted to use coccinelle to adapt our code to virPCIDeviceAddressCopy(); I've written the spatch to do that but then failed to install coccinelle on my rawhide box. And then I simply forgot about it. Ehm.
+ VIR_RETURN_PTR(ret); +}
[...]
@@ -2463,6 +2514,7 @@ virStorageSourceIsLocalStorage(const virStorageSource *src)
case VIR_STORAGE_TYPE_NETWORK: case VIR_STORAGE_TYPE_VOLUME: + case VIR_STORAGE_TYPE_NVME:
Welp. While I agree that virStorageSourceIsLocalStorage should return false you should really add a documentation comment explaining why NVMe is different. (e.g. regular code accessing src->path would not work).
I think I'm mentioning this somewhere later in the series, but it makes sense to add it here. Okay.
case VIR_STORAGE_TYPE_LAST: case VIR_STORAGE_TYPE_NONE: return false; @@ -2493,6 +2545,10 @@ virStorageSourceIsEmpty(virStorageSourcePtr src) src->protocol == VIR_STORAGE_NET_PROTOCOL_NONE) return true;
+ if (src->type == VIR_STORAGE_TYPE_NVME && + !src->nvme) + return true;
Formatting a disk type='nvme' without any data would not be parseable in our parser, so this will never happen.
+ return false; }
[...]
diff --git a/src/util/virstoragefile.h b/src/util/virstoragefile.h index 38ba901858..a1294ea608 100644 --- a/src/util/virstoragefile.h +++ b/src/util/virstoragefile.h
[...]
@@ -231,6 +233,14 @@ struct _virStorageSourceInitiatorDef { char *iqn; /* Initiator IQN */ };
Add a comment noting that the copy function needs to be fixed if this struct is ever updated.
+typedef struct _virStorageSourceNVMeDef virStorageSourceNVMeDef; +typedef virStorageSourceNVMeDef *virStorageSourceNVMeDefPtr; +struct _virStorageSourceNVMeDef { + unsigned long namespace;
'long' is either 32 or 64 bit depending on the architecture, please use unsigned int or unsigned long long.
+ int managed; /* enum virTristateBool */ + virPCIDeviceAddress pciAddr; +}; + typedef struct _virStorageDriverData virStorageDriverData; typedef virStorageDriverData *virStorageDriverDataPtr;
[...]
@@ -416,6 +428,9 @@ bool virStoragePRDefIsManaged(virStoragePRDefPtr prd); bool virStorageSourceChainHasManagedPR(virStorageSourcePtr src);
+void virStorageSourceNVMeDefFree(virStorageSourceNVMeDefPtr def); +VIR_DEFINE_AUTOPTR_FUNC(virStorageSourceNVMeDef, virStorageSourceNVMeDefFree);
Do these need to be exposed?
Yes. In fact, you can see it used right in this patch in virDomainDiskSourceNVMeParse() which lives in src/conf/domain_conf.c.
+ virSecurityDeviceLabelDefPtr virStorageSourceGetSecurityLabelDef(virStorageSourcePtr src, const char *model);
[...]
diff --git a/tests/qemuxml2argvdata/disk-nvme.xml b/tests/qemuxml2argvdata/disk-nvme.xml index 0b3dbad4eb..fe956d5ab6 100644 --- a/tests/qemuxml2argvdata/disk-nvme.xml +++ b/tests/qemuxml2argvdata/disk-nvme.xml @@ -20,6 +20,7 @@ <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/> </source> <target dev='vda' bus='virtio'/> + <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/> </disk> <disk type='nvme' device='disk'> <driver name='qemu' type='raw'/> @@ -27,6 +28,7 @@ <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/> </source> <target dev='vdb' bus='virtio'/> + <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/> </disk> <disk type='nvme' device='disk'> <driver name='qemu' type='raw'/> @@ -34,6 +36,7 @@ <address domain='0x0000' bus='0x02' slot='0x00' function='0x0'/> </source> <target dev='vdc' bus='virtio'/> + <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/> </disk> <disk type='nvme' device='disk'> <driver name='qemu' type='raw'/> @@ -44,10 +47,15 @@ </encryption> </source> <target dev='vdd' bus='virtio'/> + <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/> </disk> - <controller type='usb' index='0'/> + <controller type='usb' index='0'> + <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/> + </controller> <controller type='pci' index='0' model='pci-root'/> - <controller type='scsi' index='0' model='virtio-scsi'/> + <controller type='scsi' index='0' model='virtio-scsi'> + <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/> + </controller> <input type='mouse' bus='ps2'/> <input type='keyboard' bus='ps2'/> <memballoon model='none'/>
All of these belong to the previous patch adding the test file in the first place.
Okay. Michal

On Wed, Jul 17, 2019 at 17:05:08 +0200, Michal Privoznik wrote:
On 7/16/19 3:00 PM, Peter Krempa wrote:
On Thu, Jul 11, 2019 at 17:53:58 +0200, Michal Privoznik wrote:
To simplify implementation, some restrictions are added. For instance, an NVMe disk can't go to any bus but virtio, has to be of type 'disk', and can't have startupPolicy set.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/conf/domain_conf.c | 129 +++++++++++++++++++++++++ src/libvirt_private.syms | 1 + src/qemu/qemu_block.c | 1 + src/qemu/qemu_command.c | 1 + src/qemu/qemu_driver.c | 4 + src/qemu/qemu_migration.c | 1 + src/util/virstoragefile.c | 59 +++++++++++ src/util/virstoragefile.h | 15 +++ src/xenconfig/xen_xl.c | 1 + tests/qemuxml2argvdata/disk-nvme.xml | 12 ++- tests/qemuxml2xmloutdata/disk-nvme.xml | 1 + tests/qemuxml2xmltest.c | 1 + 12 files changed, 224 insertions(+), 2 deletions(-) create mode 120000 tests/qemuxml2xmloutdata/disk-nvme.xml
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 3323c9a5b1..73f5e1fa0f 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c
[...]
@@ -5938,6 +5943,38 @@ virDomainDiskDefValidate(const virDomainDiskDef *disk) return -1; } + if (disk->src->type == VIR_STORAGE_TYPE_NVME) { + /* These might not be valid for all hypervisors, but be + * strict now and possibly refine in the future. */ + if (disk->device != VIR_DOMAIN_DISK_DEVICE_DISK) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("Unsupported disk type '%s' for NVMe disk"), + virDomainDiskDeviceTypeToString(disk->device)); + return -1; + } + + if (disk->bus != VIR_DOMAIN_DISK_BUS_VIRTIO) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("Unsupported bus '%s' for NVMe disk"), + virDomainDiskBusTypeToString(disk->bus)); + return -1; + } + + if (disk->startupPolicy != VIR_DOMAIN_STARTUP_POLICY_DEFAULT && + disk->startupPolicy != VIR_DOMAIN_STARTUP_POLICY_MANDATORY) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("Unsupported startup policy '%s' for NVMe disk"), + virDomainStartupPolicyTypeToString(disk->startupPolicy)); + return -1; + } + + if (disk->src->shared) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("Unsupported <shareable/> for NVMe disk")); + return -1; + } + } + return 0; }
As noted in the other thread, this really should be extracted, placed in the validation callback rather than post parse and must iterate the backing chain if you want this to keep working with -blockdev.
I'm not sure I understand what you mean. This function is called virDomainDiskDefValidate() and therefore it is a validation callback rather than a post-parse callback. Where do you want me to put these checks?
Sorry I thought it was in post-parse. This is okay.
@@ -9184,6 +9221,76 @@ virDomainDiskSourceNetworkParse(xmlNodePtr node, } +static int +virDomainDiskSourceNVMeParse(xmlNodePtr node, + xmlXPathContextPtr ctxt, + virStorageSourcePtr src) +{ + VIR_AUTOPTR(virStorageSourceNVMeDef) nvme = NULL; + VIR_AUTOFREE(char *) type = NULL; + VIR_AUTOFREE(char *) namespace = NULL; + VIR_AUTOFREE(char *) managed = NULL; + xmlNodePtr address; + + if (VIR_ALLOC(nvme) < 0) + return -1; + + if (!(type = virXMLPropString(node, "type"))) { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("missing 'type' attribute to disk source")); + return -1; + } + + if (STRNEQ(type, "pci")) { + virReportError(VIR_ERR_XML_ERROR, + _("unsupported source type '%s'"), + type); + return -1; + } + + if (!(namespace = virXMLPropString(node, "namespace"))) { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("missing 'namespace' attribute to disk source")); + return -1; + } + + if (virStrToLong_ul(namespace, NULL, 10, &nvme->namespace) < 0) { + virReportError(VIR_ERR_XML_ERROR, + _("malformed namespace '%s'"), + namespace); + return -1; + } + + /* NVMe namespaces start from 1 */ + if (nvme->namespace == 0) { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("NVMe namespace can't be zero")); + return -1; + } + + if ((managed = virXMLPropString(node, "managed"))) { + if ((nvme->managed = virTristateBoolTypeFromString(managed)) <= 0) { + virReportError(VIR_ERR_XML_ERROR, + _("malformed managed value '%s'"), + managed); + return -1; + } + } + + if (!(address = virXPathNode("./address", ctxt))) { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("NVMe disk source is missing address")); + return -1; + }
I'm displeased that this is yet another function adding validation in the parser. You don't make the status quo any worse though so this is not a request to change it.
What validation do you mean? namespace != 0 check? Well, that stems straight from NVMe specs, so it is completely independent of any driver. Or do you mean this !address check? Well, it's needed later in the parsing.
Yes, but that stuff still does not have to be intermixed in the parser code. It's a pre-existing mess though.
+ + if (virPCIDeviceAddressParseXML(address, &nvme->pciAddr) < 0) + return -1; + + VIR_STEAL_PTR(src->nvme, nvme); + return 0; +} + + static int virDomainDiskSourcePRParse(xmlNodePtr node, xmlXPathContextPtr ctxt,
[...]
diff --git a/src/qemu/qemu_migration.c b/src/qemu/qemu_migration.c index 2436f5051b..87adccab3d 100644 --- a/src/qemu/qemu_migration.c +++ b/src/qemu/qemu_migration.c
Missing change to 'qemuMigrationSrcIsSafe' to reject VMs containing NVMe disks not having shared source.
An NVMe disk can't be shared. It's even explicitly denied in the validation callback. The reason is that there is no way to share a single PCI device with multiple domains.
I meant that the NVMe disk is NOT available on the destination of the migration. That means that 'qemuMigrationSrcIsSafe' must reject it as not having shared storage (note that "shared" here has a different connotation than <shareable/>, just look at the named function).
But this is a good point. I'll probably put it into a different commit though. Even though I'm changing some parts of the qemu driver here, it's only because of the way we handle switch() with enums.
Plus that function should probably check the full backing chain rather than the top element only.
Pre-existing, but I can try to fix it. Okay.
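The two requests above (reject NVMe-backed disks for migration, and look at the whole backing chain rather than only the top element) can be sketched roughly as follows. This is only an illustration under simplified assumptions: `Src`, `SrcType`, `srcChainHasNVMe` and `nvmeAllowsMigration` are hypothetical stand-ins, and the real qemuMigrationSrcIsSafe() considers more than this one rule.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced model of a storage source. */
typedef enum {
    SRC_TYPE_FILE,
    SRC_TYPE_BLOCK,
    SRC_TYPE_NETWORK,
    SRC_TYPE_NVME,
} SrcType;

typedef struct Src Src;
struct Src {
    SrcType type;
    Src *backingStore;  /* next (older) image in the chain, or NULL */
};

/* Walk the whole chain, not just the top image. */
static bool
srcChainHasNVMe(const Src *top)
{
    const Src *n;

    for (n = top; n; n = n->backingStore) {
        if (n->type == SRC_TYPE_NVME)
            return true;
    }

    return false;
}

/* Only the NVMe rule is modelled here: a host-local NVMe device is
 * not reachable from the destination, so any disk whose chain
 * contains one blocks migration. */
static bool
nvmeAllowsMigration(const Src *const *disks, size_t ndisks)
{
    size_t i;

    for (i = 0; i < ndisks; i++) {
        if (srcChainHasNVMe(disks[i]))
            return false;
    }

    return true;
}
```

A disk whose top image is an ordinary file but whose backing image is NVMe is rejected as well, which is the case a top-element-only check would miss.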

This module will be used by virHostdevManager and is inspired by the virPCIDevice module. The two are very similar, except for what identifies an NVMe device: a PCI address AND a namespace ID. This means that an NVMe device can appear in a domain multiple times, each time with a different namespace. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/libvirt_private.syms | 18 ++ src/util/Makefile.inc.am | 2 + src/util/virnvme.c | 412 +++++++++++++++++++++++++++++++++++++++ src/util/virnvme.h | 89 +++++++++ 4 files changed, 521 insertions(+) create mode 100644 src/util/virnvme.c create mode 100644 src/util/virnvme.h diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index 350b638193..856b770e57 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -2585,6 +2585,24 @@ virNumaSetPagePoolSize; virNumaSetupMemoryPolicy; +# util/virnvme.h +virNVMeDeviceAddressGet; +virNVMeDeviceCopy; +virNVMeDeviceFree; +virNVMeDeviceListAdd; +virNVMeDeviceListCount; +virNVMeDeviceListCreateDetachList; +virNVMeDeviceListDel; +virNVMeDeviceListGet; +virNVMeDeviceListLookup; +virNVMeDeviceListLookupIndex; +virNVMeDeviceListNew; +virNVMeDeviceNew; +virNVMeDeviceUsedByClear; +virNVMeDeviceUsedByGet; +virNVMeDeviceUsedBySet; + + # util/virobject.h virClassForObject; virClassForObjectLockable; diff --git a/src/util/Makefile.inc.am b/src/util/Makefile.inc.am index a47f333a98..998bec741e 100644 --- a/src/util/Makefile.inc.am +++ b/src/util/Makefile.inc.am @@ -143,6 +143,8 @@ UTIL_SOURCES = \ util/virnetlink.h \ util/virnodesuspend.c \ util/virnodesuspend.h \ + util/virnvme.c \ + util/virnvme.h \ util/virkmod.c \ util/virkmod.h \ util/virnuma.c \ diff --git a/src/util/virnvme.c b/src/util/virnvme.c new file mode 100644 index 0000000000..53724b63f7 --- /dev/null +++ b/src/util/virnvme.c @@ -0,0 +1,412 @@ +/* + * virnvme.c: helper APIs for managing NVMe devices + * + * This library is free software; you can redistribute it and/or + * modify it under the terms of the 
GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * This library is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this library. If not, see + * <http://www.gnu.org/licenses/>. + */ + +#include <config.h> + +#include "virnvme.h" +#include "virobject.h" +#include "virpci.h" +#include "viralloc.h" +#include "virlog.h" +#include "virstring.h" + +VIR_LOG_INIT("util.pci"); +#define VIR_FROM_THIS VIR_FROM_NONE + +struct _virNVMeDevice { + virPCIDeviceAddress address; /* PCI address of controller */ + unsigned long namespace; /* Namespace ID */ + bool managed; + + char *drvname; + char *domname; +}; + + +struct _virNVMeDeviceList { + virObjectLockable parent; + + size_t count; + virNVMeDevicePtr *devs; +}; + + +static virClassPtr virNVMeDeviceListClass; + +static void virNVMeDeviceListDispose(void *obj); + +static int +virNVMeOnceInit(void) +{ + if (!VIR_CLASS_NEW(virNVMeDeviceList, virClassForObjectLockable())) + return -1; + + return 0; +} + +VIR_ONCE_GLOBAL_INIT(virNVMe); + + +virNVMeDevicePtr +virNVMeDeviceNew(const virPCIDeviceAddress *address, + unsigned long namespace, + bool managed) +{ + VIR_AUTOPTR(virNVMeDevice) dev = NULL; + + if (VIR_ALLOC(dev) < 0) + return NULL; + + virPCIDeviceAddressCopy(&dev->address, address); + dev->namespace = namespace; + dev->managed = managed; + + VIR_RETURN_PTR(dev); +} + + +void +virNVMeDeviceFree(virNVMeDevicePtr dev) +{ + if (!dev) + return; + + virNVMeDeviceUsedByClear(dev); + VIR_FREE(dev); +} + + +virNVMeDevicePtr +virNVMeDeviceCopy(const virNVMeDevice *dev) +{ + VIR_AUTOPTR(virNVMeDevice) copy = NULL; + + if (VIR_ALLOC(copy) < 0 
|| + VIR_STRDUP(copy->drvname, dev->drvname) < 0 || + VIR_STRDUP(copy->domname, dev->domname) < 0) + return NULL; + + virPCIDeviceAddressCopy(&copy->address, &dev->address); + copy->namespace = dev->namespace; + copy->managed = dev->managed; + + VIR_RETURN_PTR(copy); +} + + +const virPCIDeviceAddress * +virNVMeDeviceAddressGet(const virNVMeDevice *dev) +{ + return &dev->address; +} + + +void +virNVMeDeviceUsedByClear(virNVMeDevicePtr dev) +{ + VIR_FREE(dev->drvname); + VIR_FREE(dev->domname); +} + + +void +virNVMeDeviceUsedByGet(const virNVMeDevice *dev, + const char **drv, + const char **dom) +{ + *drv = dev->drvname; + *dom = dev->domname; +} + + +int +virNVMeDeviceUsedBySet(virNVMeDevicePtr dev, + const char *drv, + const char *dom) +{ + if (VIR_STRDUP(dev->drvname, drv) < 0 || + VIR_STRDUP(dev->domname, dom) < 0) { + virNVMeDeviceUsedByClear(dev); + return -1; + } + + return 0; +} + + +virNVMeDeviceListPtr +virNVMeDeviceListNew(void) +{ + virNVMeDeviceListPtr list; + + if (virNVMeInitialize() < 0) + return NULL; + + if (!(list = virObjectLockableNew(virNVMeDeviceListClass))) + return NULL; + + return list; +} + + +static void +virNVMeDeviceListDispose(void *obj) +{ + virNVMeDeviceListPtr list = obj; + size_t i; + + for (i = 0; i < list->count; i++) + virNVMeDeviceFree(list->devs[i]); + + VIR_FREE(list->devs); +} + + +size_t +virNVMeDeviceListCount(const virNVMeDeviceList *list) +{ + return list->count; +} + + +int +virNVMeDeviceListAdd(virNVMeDeviceListPtr list, + const virNVMeDevice *dev) +{ + virNVMeDevicePtr tmp; + + if ((tmp = virNVMeDeviceListLookup(list, dev))) { + VIR_AUTOFREE(char *) addrStr = virPCIDeviceAddressAsString(&tmp->address); + virReportError(VIR_ERR_INTERNAL_ERROR, + _("NVMe device %s namespace %lu is already on the list"), + NULLSTR(addrStr), tmp->namespace); + return -1; + } + + if (!(tmp = virNVMeDeviceCopy(dev)) || + VIR_APPEND_ELEMENT(list->devs, list->count, tmp) < 0) { + virNVMeDeviceFree(tmp); + return -1; + } + + return 0; +} + + +int 
+virNVMeDeviceListDel(virNVMeDeviceListPtr list, + const virNVMeDevice *dev) +{ + ssize_t idx; + virNVMeDevicePtr tmp = NULL; + + if ((idx = virNVMeDeviceListLookupIndex(list, dev)) < 0) { + VIR_AUTOFREE(char *) addrStr = virPCIDeviceAddressAsString(&dev->address); + virReportError(VIR_ERR_INTERNAL_ERROR, + _("NVMe device %s namespace %lu not found"), + NULLSTR(addrStr), dev->namespace); + return -1; + } + + tmp = list->devs[idx]; + VIR_DELETE_ELEMENT(list->devs, idx, list->count); + virNVMeDeviceFree(tmp); + return 0; +} + + +virNVMeDevicePtr +virNVMeDeviceListGet(virNVMeDeviceListPtr list, + size_t i) +{ + return i < list->count ? list->devs[i] : NULL; +} + + +virNVMeDevicePtr +virNVMeDeviceListLookup(virNVMeDeviceListPtr list, + const virNVMeDevice *dev) +{ + ssize_t idx; + + if ((idx = virNVMeDeviceListLookupIndex(list, dev)) < 0) + return NULL; + + return list->devs[idx]; +} + + +ssize_t +virNVMeDeviceListLookupIndex(virNVMeDeviceListPtr list, + const virNVMeDevice *dev) +{ + size_t i; + + if (!list) + return -1; + + for (i = 0; i < list->count; i++) { + virNVMeDevicePtr other = list->devs[i]; + + if (virPCIDeviceAddressEqual(&dev->address, &other->address) && + dev->namespace == other->namespace) + return i; + } + + return -1; +} + + +static virNVMeDevicePtr +virNVMeDeviceListLookupByPCIAddress(virNVMeDeviceListPtr list, + const virPCIDeviceAddress *address) +{ + size_t i; + + if (!list) + return NULL; + + for (i = 0; i < list->count; i++) { + virNVMeDevicePtr other = list->devs[i]; + + if (virPCIDeviceAddressEqual(address, &other->address)) + return other; + } + + return NULL; +} + + +virPCIDeviceListPtr +virNVMeDeviceListCreateDetachList(virNVMeDeviceListPtr activeList, + virNVMeDeviceListPtr toDetachList) +{ + VIR_AUTOUNREF(virPCIDeviceListPtr) pciDevices = NULL; + size_t i; + + if (!(pciDevices = virPCIDeviceListNew())) + return NULL; + + for (i = 0; i < toDetachList->count; i++) { + const virNVMeDevice *d = toDetachList->devs[i]; + 
VIR_AUTOPTR(virPCIDevice) pci = NULL; + + /* If there is a NVMe device with the same PCI address on + * the activeList, the device is already detached. */ + if (virNVMeDeviceListLookupByPCIAddress(activeList, &d->address)) + continue; + + /* It may happen that we want to detach two namespaces + * from the same NVMe device. This will be represented as + * two different instances of virNVMeDevice, but + * obviously we want to put the PCI device on the detach + * list only once. */ + if (virPCIDeviceListFindByIDs(pciDevices, + d->address.domain, + d->address.bus, + d->address.slot, + d->address.function)) + continue; + + if (!(pci = virPCIDeviceNew(d->address.domain, + d->address.bus, + d->address.slot, + d->address.function))) + return NULL; + + /* NVMe devices must be bound to vfio */ + virPCIDeviceSetStubDriver(pci, VIR_PCI_STUB_DRIVER_VFIO); + virPCIDeviceSetManaged(pci, d->managed); + + if (virPCIDeviceListAdd(pciDevices, pci) < 0) + return NULL; + + /* avoid freeing the device */ + pci = NULL; + } + + VIR_RETURN_PTR(pciDevices); +} + + +virPCIDeviceListPtr +virNVMeDeviceListCreateReAttachList(virNVMeDeviceListPtr activeList, + virNVMeDeviceListPtr toReAttachList) +{ + VIR_AUTOUNREF(virPCIDeviceListPtr) pciDevices = NULL; + size_t i; + + if (!(pciDevices = virPCIDeviceListNew())) + return NULL; + + for (i = 0; i < toReAttachList->count; i++) { + const virNVMeDevice *d = toReAttachList->devs[i]; + VIR_AUTOPTR(virPCIDevice) pci = NULL; + size_t nused = 0; + + /* Check if there is any other NVMe device with the same PCI address as + * @d. To simplify this, let's just count how many NVMe devices with + * the same PCI address there are on the @activeList. 
*/ + for (i = 0; i < activeList->count; i++) { + virNVMeDevicePtr other = activeList->devs[i]; + + if (!virPCIDeviceAddressEqual(&d->address, &other->address)) + continue; + + nused++; + } + + /* Now, the following cases can happen: + * nused > 1 -> there are other NVMe device active, do NOT detach it + * nused == 1 -> we've found only @d on the @activeList, detach it + * nused == 0 -> huh, wait, what? @d is NOT on the @active list, how can + * we reattach it? + */ + + if (nused == 0) { + /* Shouldn't happen (TM) */ + VIR_AUTOFREE(char *) addrStr = virPCIDeviceAddressAsString(&d->address); + virReportError(VIR_ERR_INTERNAL_ERROR, + _("NVMe device %s namespace %lu not found"), + NULLSTR(addrStr), d->namespace); + return NULL; + } else if (nused > 1) { + /* NVMe device is still in use */ + continue; + } + + /* nused == 1 -> detach the device */ + if (!(pci = virPCIDeviceNew(d->address.domain, + d->address.bus, + d->address.slot, + d->address.function))) + return NULL; + + /* NVMe devices must be bound to vfio */ + virPCIDeviceSetStubDriver(pci, VIR_PCI_STUB_DRIVER_VFIO); + virPCIDeviceSetManaged(pci, d->managed); + + if (virPCIDeviceListAdd(pciDevices, pci) < 0) + return NULL; + + /* avoid freeing the device */ + pci = NULL; + } + + VIR_RETURN_PTR(pciDevices); +} diff --git a/src/util/virnvme.h b/src/util/virnvme.h new file mode 100644 index 0000000000..edf5fe58ab --- /dev/null +++ b/src/util/virnvme.h @@ -0,0 +1,89 @@ +/* + * virnvme.h: helper APIs for managing NVMe devices + * + * This library is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * This library is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the GNU + * Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this library. If not, see + * <http://www.gnu.org/licenses/>. + */ + +#pragma once + +#include "virpci.h" + +typedef struct _virNVMeDevice virNVMeDevice; +typedef virNVMeDevice *virNVMeDevicePtr; +typedef struct _virNVMeDeviceList virNVMeDeviceList; +typedef virNVMeDeviceList *virNVMeDeviceListPtr; + +virNVMeDevicePtr +virNVMeDeviceNew(const virPCIDeviceAddress *address, + unsigned long namespace, + bool managed); + +void +virNVMeDeviceFree(virNVMeDevicePtr dev); + +VIR_DEFINE_AUTOPTR_FUNC(virNVMeDevice, virNVMeDeviceFree); + +virNVMeDevicePtr +virNVMeDeviceCopy(const virNVMeDevice *dev); + +const virPCIDeviceAddress * +virNVMeDeviceAddressGet(const virNVMeDevice *dev); + +void +virNVMeDeviceUsedByClear(virNVMeDevicePtr dev); + +void +virNVMeDeviceUsedByGet(const virNVMeDevice *dev, + const char **drv, + const char **dom); + +int +virNVMeDeviceUsedBySet(virNVMeDevicePtr dev, + const char *drv, + const char *dom); + +virNVMeDeviceListPtr +virNVMeDeviceListNew(void); + +size_t +virNVMeDeviceListCount(const virNVMeDeviceList *list); + +int +virNVMeDeviceListAdd(virNVMeDeviceListPtr list, + const virNVMeDevice *dev); + +int +virNVMeDeviceListDel(virNVMeDeviceListPtr list, + const virNVMeDevice *dev); + +virNVMeDevicePtr +virNVMeDeviceListGet(virNVMeDeviceListPtr list, + size_t i); + +virNVMeDevicePtr +virNVMeDeviceListLookup(virNVMeDeviceListPtr list, + const virNVMeDevice *dev); + +ssize_t +virNVMeDeviceListLookupIndex(virNVMeDeviceListPtr list, + const virNVMeDevice *dev); + +virPCIDeviceListPtr +virNVMeDeviceListCreateDetachList(virNVMeDeviceListPtr activeList, + virNVMeDeviceListPtr toDetachList); + +virPCIDeviceListPtr +virNVMeDeviceListCreateReAttachList(virNVMeDeviceListPtr activeList, + virNVMeDeviceListPtr toReAttachList); -- 2.21.0

On Thu, Jul 11, 2019 at 17:53:59 +0200, Michal Privoznik wrote:
This module will be used by virHostdevManager and it's inspired by the virPCIDevice module. They are very similar, except for what identifies an NVMe device: PCI address AND namespace ID. This means that an NVMe device can appear in a domain multiple times, each time with a different namespace.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/libvirt_private.syms | 18 ++ src/util/Makefile.inc.am | 2 + src/util/virnvme.c | 412 +++++++++++++++++++++++++++++++++++++++ src/util/virnvme.h | 89 +++++++++ 4 files changed, 521 insertions(+) create mode 100644 src/util/virnvme.c create mode 100644 src/util/virnvme.h
[...]
diff --git a/src/util/virnvme.c b/src/util/virnvme.c new file mode 100644 index 0000000000..53724b63f7 --- /dev/null +++ b/src/util/virnvme.c @@ -0,0 +1,412 @@ +/* + * virnvme.c: helper APIs for managing NVMe devices + * + * This library is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * This library is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this library. If not, see + * <http://www.gnu.org/licenses/>. + */ + +#include <config.h> + +#include "virnvme.h" +#include "virobject.h" +#include "virpci.h" +#include "viralloc.h" +#include "virlog.h" +#include "virstring.h" + +VIR_LOG_INIT("util.pci");
please use a different log domain
+#define VIR_FROM_THIS VIR_FROM_NONE + +struct _virNVMeDevice { + virPCIDeviceAddress address; /* PCI address of controller */ + unsigned long namespace; /* Namespace ID */
unsigned int/unsigned long long
+ bool managed; + + char *drvname; + char *domname; +}; + + +struct _virNVMeDeviceList { + virObjectLockable parent; + + size_t count; + virNVMeDevicePtr *devs; +}; +
[...]
+int +virNVMeDeviceListAdd(virNVMeDeviceListPtr list, + const virNVMeDevice *dev) +{ + virNVMeDevicePtr tmp; + + if ((tmp = virNVMeDeviceListLookup(list, dev))) { + VIR_AUTOFREE(char *) addrStr = virPCIDeviceAddressAsString(&tmp->address); + virReportError(VIR_ERR_INTERNAL_ERROR, + _("NVMe device %s namespace %lu is already on the list"), + NULLSTR(addrStr), tmp->namespace); + return -1; + } + + if (!(tmp = virNVMeDeviceCopy(dev)) || + VIR_APPEND_ELEMENT(list->devs, list->count, tmp) < 0) { + virNVMeDeviceFree(tmp); + return -1; + } + + return 0; +} + + +int +virNVMeDeviceListDel(virNVMeDeviceListPtr list, + const virNVMeDevice *dev) +{ + ssize_t idx; + virNVMeDevicePtr tmp = NULL; + + if ((idx = virNVMeDeviceListLookupIndex(list, dev)) < 0) { + VIR_AUTOFREE(char *) addrStr = virPCIDeviceAddressAsString(&dev->address); + virReportError(VIR_ERR_INTERNAL_ERROR, + _("NVMe device %s namespace %lu not found"), + NULLSTR(addrStr), dev->namespace); + return -1; + } + + tmp = list->devs[idx]; + VIR_DELETE_ELEMENT(list->devs, idx, list->count); + virNVMeDeviceFree(tmp); + return 0; +} + + +virNVMeDevicePtr +virNVMeDeviceListGet(virNVMeDeviceListPtr list, + size_t i)
[1] (see below)
+{ + return i < list->count ? list->devs[i] : NULL; +} + + +virNVMeDevicePtr +virNVMeDeviceListLookup(virNVMeDeviceListPtr list, + const virNVMeDevice *dev) +{ + ssize_t idx; + + if ((idx = virNVMeDeviceListLookupIndex(list, dev)) < 0) + return NULL; + + return list->devs[idx]; +} + + +ssize_t
This function seems to be too unsafe to export as people might want to store the index while not holding the lock and something would then change it. Also [1] has the same issue.
+virNVMeDeviceListLookupIndex(virNVMeDeviceListPtr list, + const virNVMeDevice *dev) +{ + size_t i; + + if (!list) + return -1; + + for (i = 0; i < list->count; i++) { + virNVMeDevicePtr other = list->devs[i]; + + if (virPCIDeviceAddressEqual(&dev->address, &other->address) && + dev->namespace == other->namespace) + return i; + } + + return -1; +} + + +static virNVMeDevicePtr +virNVMeDeviceListLookupByPCIAddress(virNVMeDeviceListPtr list, + const virPCIDeviceAddress *address) +{ + size_t i; + + if (!list) + return NULL; + + for (i = 0; i < list->count; i++) { + virNVMeDevicePtr other = list->devs[i]; + + if (virPCIDeviceAddressEqual(address, &other->address)) + return other; + } + + return NULL; +} + + +virPCIDeviceListPtr +virNVMeDeviceListCreateDetachList(virNVMeDeviceListPtr activeList, + virNVMeDeviceListPtr toDetachList) +{ + VIR_AUTOUNREF(virPCIDeviceListPtr) pciDevices = NULL; + size_t i; + + if (!(pciDevices = virPCIDeviceListNew())) + return NULL; + + for (i = 0; i < toDetachList->count; i++) { + const virNVMeDevice *d = toDetachList->devs[i]; + VIR_AUTOPTR(virPCIDevice) pci = NULL; + + /* If there is a NVMe device with the same PCI address on + * the activeList, the device is already detached. */ + if (virNVMeDeviceListLookupByPCIAddress(activeList, &d->address)) + continue; + + /* It may happen that we want to detach two namespaces + * from the same NVMe device. This will be represented as + * two different instances of virNVMeDevice, but + * obviously we want to put the PCI device on the detach + * list only once. 
*/ + if (virPCIDeviceListFindByIDs(pciDevices, + d->address.domain, + d->address.bus, + d->address.slot, + d->address.function)) + continue; + + if (!(pci = virPCIDeviceNew(d->address.domain, + d->address.bus, + d->address.slot, + d->address.function))) + return NULL; + + /* NVMe devices must be bound to vfio */ + virPCIDeviceSetStubDriver(pci, VIR_PCI_STUB_DRIVER_VFIO); + virPCIDeviceSetManaged(pci, d->managed); + + if (virPCIDeviceListAdd(pciDevices, pci) < 0) + return NULL; + + /* avoid freeing the device */ + pci = NULL; + } + + VIR_RETURN_PTR(pciDevices); +} + + +virPCIDeviceListPtr +virNVMeDeviceListCreateReAttachList(virNVMeDeviceListPtr activeList,
This function is too complex to be without a comment describing it, especially since it returns a list of PCI devices.
+ virNVMeDeviceListPtr toReAttachList) +{ + VIR_AUTOUNREF(virPCIDeviceListPtr) pciDevices = NULL; + size_t i; + + if (!(pciDevices = virPCIDeviceListNew())) + return NULL; + + for (i = 0; i < toReAttachList->count; i++) { + const virNVMeDevice *d = toReAttachList->devs[i]; + VIR_AUTOPTR(virPCIDevice) pci = NULL; + size_t nused = 0; + + /* Check if there is any other NVMe device with the same PCI address as + * @d. To simplify this, let's just count how many NVMe devices with + * the same PCI address there are on the @activeList. */ + for (i = 0; i < activeList->count; i++) { + virNVMeDevicePtr other = activeList->devs[i]; + + if (!virPCIDeviceAddressEqual(&d->address, &other->address)) + continue; + + nused++; + } + + /* Now, the following cases can happen: + * nused > 1 -> there are other NVMe device active, do NOT detach it + * nused == 1 -> we've found only @d on the @activeList, detach it + * nused == 0 -> huh, wait, what? @d is NOT on the @active list, how can + * we reattach it? + */ + + if (nused == 0) { + /* Shouldn't happen (TM) */ + VIR_AUTOFREE(char *) addrStr = virPCIDeviceAddressAsString(&d->address); + virReportError(VIR_ERR_INTERNAL_ERROR, + _("NVMe device %s namespace %lu not found"), + NULLSTR(addrStr), d->namespace); + return NULL; + } else if (nused > 1) { + /* NVMe device is still in use */ + continue; + } + + /* nused == 1 -> detach the device */ + if (!(pci = virPCIDeviceNew(d->address.domain, + d->address.bus, + d->address.slot, + d->address.function))) + return NULL; + + /* NVMe devices must be bound to vfio */ + virPCIDeviceSetStubDriver(pci, VIR_PCI_STUB_DRIVER_VFIO); + virPCIDeviceSetManaged(pci, d->managed); + + if (virPCIDeviceListAdd(pciDevices, pci) < 0) + return NULL; + + /* avoid freeing the device */ + pci = NULL; + } + + VIR_RETURN_PTR(pciDevices); +}
Note that I did not look at the patches using the code, thus comments about some APIs being unnecessary may not be true.
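As an illustration of the kind of description asked for above, the re-attach logic can be restated as a small standalone sketch. This is not libvirt code: PciAddr, NvmeDev and nvmeCreateReAttachList() are hypothetical stand-ins that use plain arrays instead of virPCIDeviceList and virNVMeDeviceList. The idea it shows is that a PCI device goes on the re-attach list only when exactly one active NVMe namespace still refers to it. The sketch uses separate indices for the outer and inner loops, whereas the quoted function uses `i` for both.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for virPCIDeviceAddress and
 * virNVMeDevice; these are NOT the libvirt definitions. */
typedef struct {
    unsigned int domain, bus, slot, function;
} PciAddr;

typedef struct {
    PciAddr addr;       /* PCI address of the controller */
    unsigned int nsid;  /* namespace ID */
} NvmeDev;

static bool
pciAddrEqual(const PciAddr *a, const PciAddr *b)
{
    return a->domain == b->domain && a->bus == b->bus &&
           a->slot == b->slot && a->function == b->function;
}

/**
 * nvmeCreateReAttachList:
 * @active: NVMe namespaces currently in use
 * @toReAttach: NVMe namespaces that are being released
 * @out: receives PCI addresses that can be given back to the host
 *
 * For every entry of @toReAttach, count how many entries of @active
 * share its PCI address:
 *   0  -> the entry was never active, report an error
 *   >1 -> another namespace still uses the controller, skip it
 *   1  -> this was the last user, put the address into @out
 *
 * Returns the number of addresses stored in @out, or -1 on error.
 */
static int
nvmeCreateReAttachList(const NvmeDev *active, size_t nactive,
                       const NvmeDev *toReAttach, size_t nreattach,
                       PciAddr *out, size_t outcap)
{
    size_t i, j;
    size_t nout = 0;

    for (i = 0; i < nreattach; i++) {
        const NvmeDev *d = &toReAttach[i];
        size_t nused = 0;

        /* The inner loop has its own index so it cannot disturb 'i'. */
        for (j = 0; j < nactive; j++) {
            if (pciAddrEqual(&d->addr, &active[j].addr))
                nused++;
        }

        if (nused == 0)
            return -1;
        if (nused > 1)
            continue;

        assert(nout < outcap);
        out[nout++] = d->addr;
    }

    return (int)nout;
}
```

With two active namespaces on 0000:01:00.0 and one on 0000:02:00.0, releasing one namespace from each controller yields only 0000:02:00.0 in the output, because the first controller still has a user.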

On 7/16/19 3:54 PM, Peter Krempa wrote:
On Thu, Jul 11, 2019 at 17:53:59 +0200, Michal Privoznik wrote:
This module will be used by virHostdevManager and it's inspired by the virPCIDevice module. They are very similar, except for what identifies an NVMe device: PCI address AND namespace ID. This means that an NVMe device can appear in a domain multiple times, each time with a different namespace.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/libvirt_private.syms | 18 ++ src/util/Makefile.inc.am | 2 + src/util/virnvme.c | 412 +++++++++++++++++++++++++++++++++++++++ src/util/virnvme.h | 89 +++++++++ 4 files changed, 521 insertions(+) create mode 100644 src/util/virnvme.c create mode 100644 src/util/virnvme.h
[...]
diff --git a/src/util/virnvme.c b/src/util/virnvme.c new file mode 100644 index 0000000000..53724b63f7 --- /dev/null +++ b/src/util/virnvme.c @@ -0,0 +1,412 @@ +/* + * virnvme.c: helper APIs for managing NVMe devices + * + * This library is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * This library is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this library. If not, see + * <http://www.gnu.org/licenses/>. + */ + +#include <config.h> + +#include "virnvme.h" +#include "virobject.h" +#include "virpci.h" +#include "viralloc.h" +#include "virlog.h" +#include "virstring.h" + +VIR_LOG_INIT("util.pci");
please use a different log domain
+#define VIR_FROM_THIS VIR_FROM_NONE + +struct _virNVMeDevice { + virPCIDeviceAddress address; /* PCI address of controller */ + unsigned long namespace; /* Namespace ID */
unsigned int/unsigned long long
+ bool managed; + + char *drvname; + char *domname; +}; + + +struct _virNVMeDeviceList { + virObjectLockable parent; + + size_t count; + virNVMeDevicePtr *devs; +}; +
[...]
+int +virNVMeDeviceListAdd(virNVMeDeviceListPtr list, + const virNVMeDevice *dev) +{ + virNVMeDevicePtr tmp; + + if ((tmp = virNVMeDeviceListLookup(list, dev))) { + VIR_AUTOFREE(char *) addrStr = virPCIDeviceAddressAsString(&tmp->address); + virReportError(VIR_ERR_INTERNAL_ERROR, + _("NVMe device %s namespace %lu is already on the list"), + NULLSTR(addrStr), tmp->namespace); + return -1; + } + + if (!(tmp = virNVMeDeviceCopy(dev)) || + VIR_APPEND_ELEMENT(list->devs, list->count, tmp) < 0) { + virNVMeDeviceFree(tmp); + return -1; + } + + return 0; +} + + +int +virNVMeDeviceListDel(virNVMeDeviceListPtr list, + const virNVMeDevice *dev) +{ + ssize_t idx; + virNVMeDevicePtr tmp = NULL; + + if ((idx = virNVMeDeviceListLookupIndex(list, dev)) < 0) { + VIR_AUTOFREE(char *) addrStr = virPCIDeviceAddressAsString(&dev->address); + virReportError(VIR_ERR_INTERNAL_ERROR, + _("NVMe device %s namespace %lu not found"), + NULLSTR(addrStr), dev->namespace); + return -1; + } + + tmp = list->devs[idx]; + VIR_DELETE_ELEMENT(list->devs, idx, list->count); + virNVMeDeviceFree(tmp); + return 0; +} + + +virNVMeDevicePtr +virNVMeDeviceListGet(virNVMeDeviceListPtr list, + size_t i)
[1] (see below)
+{ + return i < list->count ? list->devs[i] : NULL; +} + + +virNVMeDevicePtr +virNVMeDeviceListLookup(virNVMeDeviceListPtr list, + const virNVMeDevice *dev) +{ + ssize_t idx; + + if ((idx = virNVMeDeviceListLookupIndex(list, dev)) < 0) + return NULL; + + return list->devs[idx]; +} + + +ssize_t
This function seems to be too unsafe to export as people might want to store the index while not holding the lock and something would then change it. Also [1] has the same issue.
This is no different from the other modules used by virhostdev. I agree that they are unsafe, but at the same time I think rewriting them all (to keep consistency) wouldn't result in cleaner code. Note that the list is locked outside of this source file (it is locked from within virhostdev) - again, not something that complies with our rules, but it makes sense IMO. Note that all other APIs require locking by the caller as well (e.g. virNVMeDeviceListAdd()). I'll add a comment to the header file saying that virNVMeDeviceList is a lockable object that requires the caller to acquire the lock and hold it throughout the whole section involving it.
+virNVMeDeviceListLookupIndex(virNVMeDeviceListPtr list, + const virNVMeDevice *dev) +{ + size_t i; + + if (!list) + return -1; + + for (i = 0; i < list->count; i++) { + virNVMeDevicePtr other = list->devs[i]; + + if (virPCIDeviceAddressEqual(&dev->address, &other->address) && + dev->namespace == other->namespace) + return i; + } + + return -1; +} + + +static virNVMeDevicePtr +virNVMeDeviceListLookupByPCIAddress(virNVMeDeviceListPtr list, + const virPCIDeviceAddress *address) +{ + size_t i; + + if (!list) + return NULL; + + for (i = 0; i < list->count; i++) { + virNVMeDevicePtr other = list->devs[i]; + + if (virPCIDeviceAddressEqual(address, &other->address)) + return other; + } + + return NULL; +} + + +virPCIDeviceListPtr +virNVMeDeviceListCreateDetachList(virNVMeDeviceListPtr activeList, + virNVMeDeviceListPtr toDetachList) +{ + VIR_AUTOUNREF(virPCIDeviceListPtr) pciDevices = NULL; + size_t i; + + if (!(pciDevices = virPCIDeviceListNew())) + return NULL; + + for (i = 0; i < toDetachList->count; i++) { + const virNVMeDevice *d = toDetachList->devs[i]; + VIR_AUTOPTR(virPCIDevice) pci = NULL; + + /* If there is a NVMe device with the same PCI address on + * the activeList, the device is already detached. */ + if (virNVMeDeviceListLookupByPCIAddress(activeList, &d->address)) + continue; + + /* It may happen that we want to detach two namespaces + * from the same NVMe device. This will be represented as + * two different instances of virNVMeDevice, but + * obviously we want to put the PCI device on the detach + * list only once. 
*/ + if (virPCIDeviceListFindByIDs(pciDevices, + d->address.domain, + d->address.bus, + d->address.slot, + d->address.function)) + continue; + + if (!(pci = virPCIDeviceNew(d->address.domain, + d->address.bus, + d->address.slot, + d->address.function))) + return NULL; + + /* NVMe devices must be bound to vfio */ + virPCIDeviceSetStubDriver(pci, VIR_PCI_STUB_DRIVER_VFIO); + virPCIDeviceSetManaged(pci, d->managed); + + if (virPCIDeviceListAdd(pciDevices, pci) < 0) + return NULL; + + /* avoid freeing the device */ + pci = NULL; + } + + VIR_RETURN_PTR(pciDevices); +} + + +virPCIDeviceListPtr +virNVMeDeviceListCreateReAttachList(virNVMeDeviceListPtr activeList,
This function is too complex to be without a comment describing it, especially since it returns a list of PCI devices.
Okay, I'll add a comment here too. Michal
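The locking contract discussed in this reply (index-returning helpers are only safe while the caller holds the list lock) can be sketched in isolation. The following is a minimal illustration under assumptions, not libvirt code: DevList, devListLookupIndex() and devListReplace() are hypothetical names, and a plain pthread mutex stands in for virObjectLockable.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/types.h>  /* ssize_t */

/* Hypothetical list type; the mutex plays the role of the lockable
 * parent object of virNVMeDeviceList. */
typedef struct {
    pthread_mutex_t lock;
    size_t count;
    unsigned int ids[8];
} DevList;

/* Caller MUST hold list->lock. The returned index is only meaningful
 * until the lock is released. */
static ssize_t
devListLookupIndex(const DevList *list, unsigned int id)
{
    size_t i;

    for (i = 0; i < list->count; i++) {
        if (list->ids[i] == id)
            return (ssize_t)i;
    }

    return -1;
}

/* Example caller: the lock is taken once and held across both the
 * lookup and the use of the index, so the index cannot go stale. */
static int
devListReplace(DevList *list, unsigned int oldId, unsigned int newId)
{
    ssize_t idx;
    int ret = -1;

    assert(list != NULL);

    pthread_mutex_lock(&list->lock);

    idx = devListLookupIndex(list, oldId);
    if (idx >= 0 && devListLookupIndex(list, newId) < 0) {
        list->ids[idx] = newId;
        ret = 0;
    }

    pthread_mutex_unlock(&list->lock);
    return ret;
}
```

The design choice mirrors the one described above: the low-level helpers do no locking themselves, and the documentation comment on each of them states that the caller owns the lock for the whole section.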

Now that we have the virNVMeDevice module (introduced in the previous commit), let's use it in virHostdev to track which NVMe devices are free to be used by a domain and which are taken.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/libvirt_private.syms | 3 + src/util/virhostdev.c | 244 +++++++++++++++++++++++++++++++++++++++ src/util/virhostdev.h | 25 ++++ 3 files changed, 272 insertions(+)
diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index 856b770e57..bc6583562a 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -2077,18 +2077,21 @@ virHostdevPCINodeDeviceReAttach; virHostdevPCINodeDeviceReset; virHostdevPrepareDomainDevices; virHostdevPrepareMediatedDevices; +virHostdevPrepareNVMeDevices; virHostdevPreparePCIDevices; virHostdevPrepareSCSIDevices; virHostdevPrepareSCSIVHostDevices; virHostdevPrepareUSBDevices; virHostdevReAttachDomainDevices; virHostdevReAttachMediatedDevices; +virHostdevReAttachNVMeDevices; virHostdevReAttachPCIDevices; virHostdevReAttachSCSIDevices; virHostdevReAttachSCSIVHostDevices; virHostdevReAttachUSBDevices; virHostdevUpdateActiveDomainDevices; virHostdevUpdateActiveMediatedDevices; +virHostdevUpdateActiveNVMeDevices; virHostdevUpdateActivePCIDevices; virHostdevUpdateActiveSCSIDevices; virHostdevUpdateActiveUSBDevices; diff --git a/src/util/virhostdev.c b/src/util/virhostdev.c index 07397b9682..90d94b0a92 100644 --- a/src/util/virhostdev.c +++ b/src/util/virhostdev.c @@ -137,6 +137,7 @@ virHostdevManagerDispose(void *obj) virObjectUnref(hostdevMgr->activeSCSIHostdevs); virObjectUnref(hostdevMgr->activeSCSIVHostHostdevs); virObjectUnref(hostdevMgr->activeMediatedHostdevs); + virObjectUnref(hostdevMgr->activeNVMeHostdevs); VIR_FREE(hostdevMgr->stateDir); } @@ -167,6 +168,9 @@ virHostdevManagerNew(void) if (!(hostdevMgr->activeMediatedHostdevs = virMediatedDeviceListNew())) return NULL; + if (!(hostdevMgr->activeNVMeHostdevs = virNVMeDeviceListNew())) + return NULL; + if (privileged) { if
(VIR_STRDUP(hostdevMgr->stateDir, HOSTDEV_STATE_DIR) < 0) return NULL; @@ -2229,3 +2233,243 @@ virHostdevUpdateActiveDomainDevices(virHostdevManagerPtr mgr, return 0; } + + +static virNVMeDeviceListPtr +virHostdevGetNVMeDeviceList(virDomainDiskDefPtr *disks, + size_t ndisks, + const char *drv_name, + const char *dom_name) +{ + VIR_AUTOUNREF(virNVMeDeviceListPtr) nvmeDevices = NULL; + size_t i; + + if (!(nvmeDevices = virNVMeDeviceListNew())) + return NULL; + + for (i = 0; i < ndisks; i++) { + virDomainDiskDefPtr disk = disks[i]; + virStorageSourcePtr n; + + for (n = disk->src; virStorageSourceIsBacking(n); n = n->backingStore) { + VIR_AUTOPTR(virNVMeDevice) dev = NULL; + const virStorageSourceNVMeDef *srcNVMe = n->nvme; + + if (n->type != VIR_STORAGE_TYPE_NVME) + continue; + + if (!(dev = virNVMeDeviceNew(&srcNVMe->pciAddr, + srcNVMe->namespace, + srcNVMe->managed))) + return NULL; + + if (virNVMeDeviceUsedBySet(dev, drv_name, dom_name) < 0) + return NULL; + + if (virNVMeDeviceListAdd(nvmeDevices, dev) < 0) + return NULL; + } + } + + VIR_RETURN_PTR(nvmeDevices); +} + + +int +virHostdevPrepareNVMeDevices(virHostdevManagerPtr hostdev_mgr, + const char *drv_name, + const char *dom_name, + virDomainDiskDefPtr *disks, + size_t ndisks) +{ + VIR_AUTOUNREF(virNVMeDeviceListPtr) nvmeDevices = NULL; + VIR_AUTOUNREF(virPCIDeviceListPtr) pciDevices = NULL; + const unsigned int pciFlags = 0; + virNVMeDevicePtr temp = NULL; + size_t i; + ssize_t lastGoodNVMeIdx = -1; + int ret = -1; + + if (!(nvmeDevices = virHostdevGetNVMeDeviceList(disks, ndisks, drv_name, dom_name))) + return -1; + + if (virNVMeDeviceListCount(nvmeDevices) == 0) + return 0; + + virObjectLock(hostdev_mgr->activeNVMeHostdevs); + + /* Firstly, let's check if all devices are free */ + for (i = 0; i < virNVMeDeviceListCount(nvmeDevices); i++) { + const virNVMeDevice *dev = virNVMeDeviceListGet(nvmeDevices, i); + const virPCIDeviceAddress *addr = NULL; + VIR_AUTOFREE(char *) addrStr = NULL; + const char 
*actual_drvname = NULL; + const char *actual_domname = NULL; + + temp = virNVMeDeviceListLookup(hostdev_mgr->activeNVMeHostdevs, dev); + + /* Not on the list means not used */ + if (!temp) + continue; + + virNVMeDeviceUsedByGet(temp, &actual_drvname, &actual_domname); + addr = virNVMeDeviceAddressGet(dev); + addrStr = virPCIDeviceAddressAsString(addr); + + virReportError(VIR_ERR_OPERATION_INVALID, + _("NVMe device %s already in use by driver %s domain %s"), + NULLSTR(addrStr), actual_drvname, actual_domname); + goto cleanup; + } + + if (!(pciDevices = virNVMeDeviceListCreateDetachList(hostdev_mgr->activeNVMeHostdevs, + nvmeDevices))) + goto cleanup; + + /* This looks like a good opportunity to merge inactive NVMe devices onto + * the active list. This, however, means that if something goes wrong we + * have to perform a rollback before returning.*/ + for (i = 0; i < virNVMeDeviceListCount(nvmeDevices); i++) { + temp = virNVMeDeviceListGet(nvmeDevices, i); + + if (virNVMeDeviceListAdd(hostdev_mgr->activeNVMeHostdevs, temp) < 0) + goto rollback; + + lastGoodNVMeIdx = i; + } + + if (virHostdevPreparePCIDevicesImpl(hostdev_mgr, + drv_name, dom_name, NULL, + pciDevices, NULL, 0, pciFlags) < 0) + goto rollback; + + ret = 0; + cleanup: + virObjectUnlock(hostdev_mgr->activeNVMeHostdevs); + return ret; + + rollback: + while (lastGoodNVMeIdx >= 0) { + temp = virNVMeDeviceListGet(nvmeDevices, lastGoodNVMeIdx); + + virNVMeDeviceListDel(hostdev_mgr->activeNVMeHostdevs, temp); + + lastGoodNVMeIdx--; + } + goto cleanup; +} + + +int +virHostdevReAttachNVMeDevices(virHostdevManagerPtr hostdev_mgr, + const char *drv_name, + const char *dom_name, + virDomainDiskDefPtr *disks, + size_t ndisks) +{ + VIR_AUTOUNREF(virNVMeDeviceListPtr) nvmeDevices = NULL; + VIR_AUTOUNREF(virPCIDeviceListPtr) pciDevices = NULL; + size_t i; + int ret = -1; + + if (!(nvmeDevices = virHostdevGetNVMeDeviceList(disks, ndisks, drv_name, dom_name))) + return -1; + + if (virNVMeDeviceListCount(nvmeDevices) == 0) 
+ return 0; + + virObjectLock(hostdev_mgr->activeNVMeHostdevs); + + if (!(pciDevices = virNVMeDeviceListCreateReAttachList(hostdev_mgr->activeNVMeHostdevs, + nvmeDevices))) + goto cleanup; + + virHostdevReAttachPCIDevicesImpl(hostdev_mgr, + drv_name, dom_name, pciDevices, + NULL, 0, NULL); + + for (i = 0; i < virNVMeDeviceListCount(nvmeDevices); i++) { + virNVMeDevicePtr temp = virNVMeDeviceListGet(nvmeDevices, i); + + if (virNVMeDeviceListDel(hostdev_mgr->activeNVMeHostdevs, temp) < 0) + goto cleanup; + } + + ret = 0; + cleanup: + virObjectUnlock(hostdev_mgr->activeNVMeHostdevs); + return ret; +} + + +int +virHostdevUpdateActiveNVMeDevices(virHostdevManagerPtr hostdev_mgr, + const char *drv_name, + const char *dom_name, + virDomainDiskDefPtr *disks, + size_t ndisks) +{ + VIR_AUTOUNREF(virNVMeDeviceListPtr) nvmeDevices = NULL; + VIR_AUTOUNREF(virPCIDeviceListPtr) pciDevices = NULL; + virNVMeDevicePtr temp = NULL; + size_t i; + ssize_t lastGoodNVMeIdx = -1; + ssize_t lastGoodPCIIdx = -1; + int ret = -1; + + if (!(nvmeDevices = virHostdevGetNVMeDeviceList(disks, ndisks, drv_name, dom_name))) + return -1; + + if (virNVMeDeviceListCount(nvmeDevices) == 0) + return 0; + + virObjectLock(hostdev_mgr->activeNVMeHostdevs); + virObjectLock(hostdev_mgr->activePCIHostdevs); + virObjectLock(hostdev_mgr->inactivePCIHostdevs); + + if (!(pciDevices = virNVMeDeviceListCreateDetachList(hostdev_mgr->activeNVMeHostdevs, + nvmeDevices))) + goto cleanup; + + for (i = 0; i < virNVMeDeviceListCount(nvmeDevices); i++) { + temp = virNVMeDeviceListGet(nvmeDevices, i); + + if (virNVMeDeviceListAdd(hostdev_mgr->activeNVMeHostdevs, temp) < 0) + goto rollback; + + lastGoodNVMeIdx = i; + } + + for (i = 0; i < virPCIDeviceListCount(pciDevices); i++) { + virPCIDevicePtr actual = virPCIDeviceListGet(pciDevices, i); + + if (virPCIDeviceListAddCopy(hostdev_mgr->activePCIHostdevs, actual) < 0) + goto rollback; + + lastGoodPCIIdx = i; + } + + ret = 0; + cleanup: + 
virObjectUnlock(hostdev_mgr->inactivePCIHostdevs); + virObjectUnlock(hostdev_mgr->activePCIHostdevs); + virObjectUnlock(hostdev_mgr->activeNVMeHostdevs); + return ret; + + rollback: + while (lastGoodNVMeIdx >= 0) { + temp = virNVMeDeviceListGet(nvmeDevices, lastGoodNVMeIdx); + + virNVMeDeviceListDel(hostdev_mgr->activeNVMeHostdevs, temp); + + lastGoodNVMeIdx--; + } + while (lastGoodPCIIdx >= 0) { + virPCIDevicePtr actual = virPCIDeviceListGet(pciDevices, i); + + virPCIDeviceListDel(hostdev_mgr->activePCIHostdevs, actual); + + lastGoodPCIIdx--; + } + goto cleanup; +} diff --git a/src/util/virhostdev.h b/src/util/virhostdev.h index 88501e2743..98dc226631 100644 --- a/src/util/virhostdev.h +++ b/src/util/virhostdev.h @@ -29,6 +29,7 @@ #include "virscsivhost.h" #include "conf/domain_conf.h" #include "virmdev.h" +#include "virnvme.h" typedef enum { VIR_HOSTDEV_STRICT_ACS_CHECK = (1 << 0), /* strict acs check */ @@ -53,6 +54,9 @@ struct _virHostdevManager { virSCSIDeviceListPtr activeSCSIHostdevs; virSCSIVHostDeviceListPtr activeSCSIVHostHostdevs; virMediatedDeviceListPtr activeMediatedHostdevs; + /* NVMe devices are PCI devices really, but one NVMe disk can + * have multiple namespaces. 
*/ + virNVMeDeviceListPtr activeNVMeHostdevs; }; virHostdevManagerPtr virHostdevManagerGetDefault(void); @@ -201,3 +205,24 @@ int virHostdevPCINodeDeviceReAttach(virHostdevManagerPtr mgr, int virHostdevPCINodeDeviceReset(virHostdevManagerPtr mgr, virPCIDevicePtr pci) ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2); + +int +virHostdevPrepareNVMeDevices(virHostdevManagerPtr hostdev_mgr, + const char *drv_name, + const char *dom_name, + virDomainDiskDefPtr *disks, + size_t ndisks); + +int +virHostdevReAttachNVMeDevices(virHostdevManagerPtr hostdev_mgr, + const char *drv_name, + const char *dom_name, + virDomainDiskDefPtr *disks, + size_t ndisks); + +int +virHostdevUpdateActiveNVMeDevices(virHostdevManagerPtr hostdev_mgr, + const char *drv_name, + const char *dom_name, + virDomainDiskDefPtr *disks, + size_t ndisks); -- 2.21.0
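The merge-then-rollback pattern used by the prepare and update functions in this patch can be shown on its own. The sketch below is a simplified illustration under assumptions: IdList, idListAdd(), idListDel() and idListMerge() are hypothetical names operating on plain integers, not the virNVMeDeviceList API. Entries are added one by one while remembering the index of the last successful addition, and on failure everything up to that index is removed again. The rollback loop here indexes by the last-good counter itself; the second rollback loop in the patch's virHostdevUpdateActiveNVMeDevices() passes `i` to virPCIDeviceListGet(), where lastGoodPCIIdx may have been intended.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>  /* ssize_t */

#define LIST_CAP 8

/* Hypothetical fixed-capacity list of unique IDs. */
typedef struct {
    size_t count;
    unsigned int ids[LIST_CAP];
} IdList;

static ssize_t
idListFind(const IdList *l, unsigned int id)
{
    size_t i;

    for (i = 0; i < l->count; i++) {
        if (l->ids[i] == id)
            return (ssize_t)i;
    }
    return -1;
}

/* Fails if @id is already present or the list is full. */
static int
idListAdd(IdList *l, unsigned int id)
{
    if (l->count == LIST_CAP || idListFind(l, id) >= 0)
        return -1;
    l->ids[l->count++] = id;
    return 0;
}

/* Removes @id if present, keeping the order of the other entries. */
static void
idListDel(IdList *l, unsigned int id)
{
    ssize_t idx = idListFind(l, id);
    size_t i;

    if (idx < 0)
        return;
    for (i = (size_t)idx; i + 1 < l->count; i++)
        l->ids[i] = l->ids[i + 1];
    l->count--;
}

/* Add all @ids to @active, or none of them. */
static int
idListMerge(IdList *active, const unsigned int *ids, size_t nids)
{
    ssize_t lastGood = -1;
    size_t i;

    assert(active != NULL);

    for (i = 0; i < nids; i++) {
        if (idListAdd(active, ids[i]) < 0)
            goto rollback;
        lastGood = (ssize_t)i;
    }
    return 0;

 rollback:
    /* Undo only what this call added: entries 0..lastGood. */
    while (lastGood >= 0) {
        idListDel(active, ids[lastGood]);
        lastGood--;
    }
    return -1;
}
```

If the third of three IDs collides with an entry that was already active, the two IDs added before it are removed again and the pre-existing entry is left untouched.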

The myInit() function is called before any of the test cases because it prepares all internal structures for individual cases. Well, if it fails there's no point in proceeding with testing. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- tests/virhostdevtest.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/tests/virhostdevtest.c b/tests/virhostdevtest.c index 20eaca82e0..cf39c83c76 100644 --- a/tests/virhostdevtest.c +++ b/tests/virhostdevtest.c @@ -574,8 +574,11 @@ mymain(void) ret = -1; \ } while (0) - if (myInit() < 0) + if (myInit() < 0) { fprintf(stderr, "Init data structures failed."); + virFileDeleteTree(fakerootdir); + return EXIT_FAILURE; + } DO_TEST(testVirHostdevRoundtripNoGuest); DO_TEST(testVirHostdevRoundtripUnmanaged); -- 2.21.0

On Thu, Jul 11, 2019 at 17:54:01 +0200, Michal Privoznik wrote:
The myInit() function is called before any of the test cases because it prepares all internal structures for individual cases. Well, if it fails there's no point in proceeding with testing.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> ---
ACK

In the near future we will need to check the number of members of two different types of lists: PCI and NVMe. Rename CHECK_LIST_COUNT to CHECK_PCI_LIST_COUNT to mark explicitly what type of list it is working with.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- tests/virhostdevtest.c | 83 ++++++++++++++++++++++-------------------- 1 file changed, 43 insertions(+), 40 deletions(-)
diff --git a/tests/virhostdevtest.c b/tests/virhostdevtest.c index cf39c83c76..7d15a87797 100644 --- a/tests/virhostdevtest.c +++ b/tests/virhostdevtest.c @@ -34,10 +34,10 @@ VIR_LOG_INIT("tests.hostdevtest"); -# define CHECK_LIST_COUNT(list, cnt) \ +# define CHECK_LIST_COUNT(list, cnt, cb) \ do { \ size_t actualCount; \ - if ((actualCount = virPCIDeviceListCount(list)) != cnt) { \ + if ((actualCount = cb(list)) != cnt) { \ virReportError(VIR_ERR_INTERNAL_ERROR, \ "Unexpected count of items in " #list ": %zu, " \ "expecting %zu", actualCount, (size_t) cnt); \ @@ -45,6 +45,9 @@ VIR_LOG_INIT("tests.hostdevtest"); } \ } while (0) +# define CHECK_PCI_LIST_COUNT(list, cnt) \ + CHECK_LIST_COUNT(list, cnt, virPCIDeviceListCount) + # define TEST_STATE_DIR abs_builddir "/hostdevmgr" static const char *drv_name = "test_driver"; static const char *dom_name = "test_domain"; @@ -143,16 +146,16 @@ testVirHostdevPreparePCIHostdevs_unmanaged(void) if (virHostdevPreparePCIDevices(mgr, drv_name, dom_name, uuid, NULL, 0, 0) < 0) goto cleanup; - CHECK_LIST_COUNT(mgr->activePCIHostdevs, active_count); - CHECK_LIST_COUNT(mgr->inactivePCIHostdevs, inactive_count); + CHECK_PCI_LIST_COUNT(mgr->activePCIHostdevs, active_count); + CHECK_PCI_LIST_COUNT(mgr->inactivePCIHostdevs, inactive_count); /* Test unmanaged hostdevs */ VIR_DEBUG("Test >=1 unmanaged hostdevs"); if (virHostdevPreparePCIDevices(mgr, drv_name, dom_name, uuid, hostdevs, nhostdevs, 0) < 0) goto cleanup; - CHECK_LIST_COUNT(mgr->activePCIHostdevs, active_count + nhostdevs); - CHECK_LIST_COUNT(mgr->inactivePCIHostdevs, inactive_count - nhostdevs);
+ CHECK_PCI_LIST_COUNT(mgr->activePCIHostdevs, active_count + nhostdevs); + CHECK_PCI_LIST_COUNT(mgr->inactivePCIHostdevs, inactive_count - nhostdevs); /* Test conflict */ active_count = virPCIDeviceListCount(mgr->activePCIHostdevs); @@ -161,22 +164,22 @@ testVirHostdevPreparePCIHostdevs_unmanaged(void) if (!virHostdevPreparePCIDevices(mgr, drv_name, dom_name, uuid, &hostdevs[0], 1, 0)) goto cleanup; - CHECK_LIST_COUNT(mgr->activePCIHostdevs, active_count); - CHECK_LIST_COUNT(mgr->inactivePCIHostdevs, inactive_count); + CHECK_PCI_LIST_COUNT(mgr->activePCIHostdevs, active_count); + CHECK_PCI_LIST_COUNT(mgr->inactivePCIHostdevs, inactive_count); VIR_DEBUG("Test: prepare same hostdevs for same driver, diff domain again"); if (!virHostdevPreparePCIDevices(mgr, drv_name, "test_domain1", uuid, &hostdevs[1], 1, 0)) goto cleanup; - CHECK_LIST_COUNT(mgr->activePCIHostdevs, active_count); - CHECK_LIST_COUNT(mgr->inactivePCIHostdevs, inactive_count); + CHECK_PCI_LIST_COUNT(mgr->activePCIHostdevs, active_count); + CHECK_PCI_LIST_COUNT(mgr->inactivePCIHostdevs, inactive_count); VIR_DEBUG("Test: prepare same hostdevs for diff driver/domain again"); if (!virHostdevPreparePCIDevices(mgr, "test_driver1", dom_name, uuid, &hostdevs[2], 1, 0)) goto cleanup; - CHECK_LIST_COUNT(mgr->activePCIHostdevs, active_count); - CHECK_LIST_COUNT(mgr->inactivePCIHostdevs, inactive_count); + CHECK_PCI_LIST_COUNT(mgr->activePCIHostdevs, active_count); + CHECK_PCI_LIST_COUNT(mgr->inactivePCIHostdevs, inactive_count); ret = 0; @@ -203,14 +206,14 @@ testVirHostdevReAttachPCIHostdevs_unmanaged(void) VIR_DEBUG("Test 0 hostdevs"); virHostdevReAttachPCIDevices(mgr, drv_name, dom_name, NULL, 0, NULL); - CHECK_LIST_COUNT(mgr->activePCIHostdevs, active_count); - CHECK_LIST_COUNT(mgr->inactivePCIHostdevs, inactive_count); + CHECK_PCI_LIST_COUNT(mgr->activePCIHostdevs, active_count); + CHECK_PCI_LIST_COUNT(mgr->inactivePCIHostdevs, inactive_count); VIR_DEBUG("Test >=1 unmanaged hostdevs"); 
virHostdevReAttachPCIDevices(mgr, drv_name, dom_name, hostdevs, nhostdevs, NULL); - CHECK_LIST_COUNT(mgr->activePCIHostdevs, active_count - nhostdevs); - CHECK_LIST_COUNT(mgr->inactivePCIHostdevs, inactive_count + nhostdevs); + CHECK_PCI_LIST_COUNT(mgr->activePCIHostdevs, active_count - nhostdevs); + CHECK_PCI_LIST_COUNT(mgr->inactivePCIHostdevs, inactive_count + nhostdevs); ret = 0; @@ -236,14 +239,14 @@ testVirHostdevPreparePCIHostdevs_managed(bool mixed) if (virHostdevPreparePCIDevices(mgr, drv_name, dom_name, uuid, hostdevs, nhostdevs, 0) < 0) goto cleanup; - CHECK_LIST_COUNT(mgr->activePCIHostdevs, active_count + nhostdevs); + CHECK_PCI_LIST_COUNT(mgr->activePCIHostdevs, active_count + nhostdevs); /* If testing a mixed roundtrip, devices are already in the inactive list * before we start and are removed from it as soon as we attach them to * the guest */ if (mixed) - CHECK_LIST_COUNT(mgr->inactivePCIHostdevs, inactive_count - nhostdevs); + CHECK_PCI_LIST_COUNT(mgr->inactivePCIHostdevs, inactive_count - nhostdevs); else - CHECK_LIST_COUNT(mgr->inactivePCIHostdevs, inactive_count); + CHECK_PCI_LIST_COUNT(mgr->inactivePCIHostdevs, inactive_count); /* Test conflict */ active_count = virPCIDeviceListCount(mgr->activePCIHostdevs); @@ -252,22 +255,22 @@ testVirHostdevPreparePCIHostdevs_managed(bool mixed) if (!virHostdevPreparePCIDevices(mgr, drv_name, dom_name, uuid, &hostdevs[0], 1, 0)) goto cleanup; - CHECK_LIST_COUNT(mgr->activePCIHostdevs, active_count); - CHECK_LIST_COUNT(mgr->inactivePCIHostdevs, inactive_count); + CHECK_PCI_LIST_COUNT(mgr->activePCIHostdevs, active_count); + CHECK_PCI_LIST_COUNT(mgr->inactivePCIHostdevs, inactive_count); VIR_DEBUG("Test: prepare same hostdevs for same driver, diff domain again"); if (!virHostdevPreparePCIDevices(mgr, drv_name, "test_domain1", uuid, &hostdevs[1], 1, 0)) goto cleanup; - CHECK_LIST_COUNT(mgr->activePCIHostdevs, active_count); - CHECK_LIST_COUNT(mgr->inactivePCIHostdevs, inactive_count); + 
CHECK_PCI_LIST_COUNT(mgr->activePCIHostdevs, active_count); + CHECK_PCI_LIST_COUNT(mgr->inactivePCIHostdevs, inactive_count); VIR_DEBUG("Test: prepare same hostdevs for diff driver/domain again"); if (!virHostdevPreparePCIDevices(mgr, "test_driver1", dom_name, uuid, &hostdevs[2], 1, 0)) goto cleanup; - CHECK_LIST_COUNT(mgr->activePCIHostdevs, active_count); - CHECK_LIST_COUNT(mgr->inactivePCIHostdevs, inactive_count); + CHECK_PCI_LIST_COUNT(mgr->activePCIHostdevs, active_count); + CHECK_PCI_LIST_COUNT(mgr->inactivePCIHostdevs, inactive_count); ret = 0; @@ -294,19 +297,19 @@ testVirHostdevReAttachPCIHostdevs_managed(bool mixed) VIR_DEBUG("Test 0 hostdevs"); virHostdevReAttachPCIDevices(mgr, drv_name, dom_name, NULL, 0, NULL); - CHECK_LIST_COUNT(mgr->activePCIHostdevs, active_count); - CHECK_LIST_COUNT(mgr->inactivePCIHostdevs, inactive_count); + CHECK_PCI_LIST_COUNT(mgr->activePCIHostdevs, active_count); + CHECK_PCI_LIST_COUNT(mgr->inactivePCIHostdevs, inactive_count); VIR_DEBUG("Test >=1 hostdevs"); virHostdevReAttachPCIDevices(mgr, drv_name, dom_name, hostdevs, nhostdevs, NULL); - CHECK_LIST_COUNT(mgr->activePCIHostdevs, active_count - nhostdevs); + CHECK_PCI_LIST_COUNT(mgr->activePCIHostdevs, active_count - nhostdevs); /* If testing a mixed roundtrip, devices are added back to the inactive * list as soon as we detach from the guest */ if (mixed) - CHECK_LIST_COUNT(mgr->inactivePCIHostdevs, inactive_count + nhostdevs); + CHECK_PCI_LIST_COUNT(mgr->inactivePCIHostdevs, inactive_count + nhostdevs); else - CHECK_LIST_COUNT(mgr->inactivePCIHostdevs, inactive_count); + CHECK_PCI_LIST_COUNT(mgr->inactivePCIHostdevs, inactive_count); ret = 0; @@ -326,8 +329,8 @@ testVirHostdevDetachPCINodeDevice(void) inactive_count = virPCIDeviceListCount(mgr->inactivePCIHostdevs); if (virHostdevPCINodeDeviceDetach(mgr, dev[i]) < 0) goto cleanup; - CHECK_LIST_COUNT(mgr->activePCIHostdevs, active_count); - CHECK_LIST_COUNT(mgr->inactivePCIHostdevs, inactive_count + 1); + 
CHECK_PCI_LIST_COUNT(mgr->activePCIHostdevs, active_count); + CHECK_PCI_LIST_COUNT(mgr->inactivePCIHostdevs, inactive_count + 1); } ret = 0; @@ -347,8 +350,8 @@ testVirHostdevResetPCINodeDevice(void) inactive_count = virPCIDeviceListCount(mgr->inactivePCIHostdevs); if (virHostdevPCINodeDeviceReset(mgr, dev[i]) < 0) goto cleanup; - CHECK_LIST_COUNT(mgr->activePCIHostdevs, active_count); - CHECK_LIST_COUNT(mgr->inactivePCIHostdevs, inactive_count); + CHECK_PCI_LIST_COUNT(mgr->activePCIHostdevs, active_count); + CHECK_PCI_LIST_COUNT(mgr->inactivePCIHostdevs, inactive_count); } ret = 0; @@ -369,8 +372,8 @@ testVirHostdevReAttachPCINodeDevice(void) inactive_count = virPCIDeviceListCount(mgr->inactivePCIHostdevs); if (virHostdevPCINodeDeviceReAttach(mgr, dev[i]) < 0) goto cleanup; - CHECK_LIST_COUNT(mgr->activePCIHostdevs, active_count); - CHECK_LIST_COUNT(mgr->inactivePCIHostdevs, inactive_count - 1); + CHECK_PCI_LIST_COUNT(mgr->activePCIHostdevs, active_count); + CHECK_PCI_LIST_COUNT(mgr->inactivePCIHostdevs, inactive_count - 1); } ret = 0; @@ -393,15 +396,15 @@ testVirHostdevUpdateActivePCIHostdevs(void) if (virHostdevUpdateActivePCIDevices(mgr, NULL, 0, drv_name, dom_name) < 0) goto cleanup; - CHECK_LIST_COUNT(mgr->activePCIHostdevs, active_count); - CHECK_LIST_COUNT(mgr->inactivePCIHostdevs, inactive_count); + CHECK_PCI_LIST_COUNT(mgr->activePCIHostdevs, active_count); + CHECK_PCI_LIST_COUNT(mgr->inactivePCIHostdevs, inactive_count); VIR_DEBUG("Test >=1 hostdevs"); if (virHostdevUpdateActivePCIDevices(mgr, hostdevs, nhostdevs, drv_name, dom_name) < 0) goto cleanup; - CHECK_LIST_COUNT(mgr->activePCIHostdevs, active_count + nhostdevs); - CHECK_LIST_COUNT(mgr->inactivePCIHostdevs, inactive_count); + CHECK_PCI_LIST_COUNT(mgr->activePCIHostdevs, active_count + nhostdevs); + CHECK_PCI_LIST_COUNT(mgr->inactivePCIHostdevs, inactive_count); ret = 0; -- 2.21.0

On Thu, Jul 11, 2019 at 17:54:02 +0200, Michal Privoznik wrote:
In near future we will need to check for number of members of two different types of lists: PCI and NVMe. Rename CHECK_LIST_COUNT to CHECK_PCI_LIST_COUNT to mark explicitly what type of list it is working with.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> ---
ACK

The device configs (which are actually the same one config) come from a NVMe disk of mine. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- tests/virpcimock.c | 3 +++ tests/virpcitestdata/0000-01-00.0.config | Bin 0 -> 4096 bytes tests/virpcitestdata/0000-02-00.0.config | Bin 0 -> 4096 bytes 3 files changed, 3 insertions(+) create mode 100644 tests/virpcitestdata/0000-01-00.0.config create mode 100644 tests/virpcitestdata/0000-02-00.0.config diff --git a/tests/virpcimock.c b/tests/virpcimock.c index 18d06d11d4..a26b7c1b2e 100644 --- a/tests/virpcimock.c +++ b/tests/virpcimock.c @@ -880,6 +880,7 @@ init_env(void) MAKE_PCI_DRIVER("i915", 0x8086, 0x0046, 0x8086, 0x0047); MAKE_PCI_DRIVER("pci-stub", -1, -1); pci_driver_new("vfio-pci", PCI_ACTION_BIND, -1, -1); + MAKE_PCI_DRIVER("nvme", 0x1cc1, 0x8201); # define MAKE_PCI_DEVICE(Id, Vendor, Device, ...) \ do { \ @@ -902,6 +903,8 @@ init_env(void) MAKE_PCI_DEVICE("0000:0a:01.0", 0x8086, 0x0047); MAKE_PCI_DEVICE("0000:0a:02.0", 0x8286, 0x0048); MAKE_PCI_DEVICE("0000:0a:03.0", 0x8386, 0x0048); + MAKE_PCI_DEVICE("0000:01:00.0", 0x1cc1, 0x8201, .iommuGroup = 8, .klass = 0x010802); + MAKE_PCI_DEVICE("0000:02:00.0", 0x1cc1, 0x8201, .iommuGroup = 9, .klass = 0x010802); } diff --git a/tests/virpcitestdata/0000-01-00.0.config b/tests/virpcitestdata/0000-01-00.0.config new file mode 100644 index 0000000000000000000000000000000000000000..f92455e2ac5701ce60a51ae19828658b80744399 GIT binary patch literal 4096 zcmeHIF;Buk6#lMP3WC_80!0h0vM?|ZHYb~)#L>jXKcF*<t0@ew#(&`84`6g`<KpZu zaP%LTxZu0C6dH+PAO@3rm)v{zz3*Nx-?guS#YUQHfGar$G8Om~evt*l6}THG3$%ls zbL8T+aGAkfSZ5AOg~nJxaQ{&~b`11hPvNqjks{E-76s`bTjV$zsdNdt2Zx}86r3!0 z5-k@njLH$yMaSt!;XCjcMN7{rhH)Lzgm%?1tR|ap5RC)?OfZuh+-MNn&bPwMnC1$Y zWtY532v8x%Zq%*)y_#9Aly`TwONPEx+$`iba#<~-a)litu!!pkegQ#S0so<=$gQVo zbjxR?KpRD$uKLK=emkc^$-Zgf<*MZ@;uWY8!!UdWrXoze;QR3q_ajXzAQg}b)NzRk z87VP9U%z%(HWJ0B|6JTCgnRh9Wz<V@F5}c=IPK9xUNln!?lk>Mr%9L>;{S2-*!p%x 
w=b5pKIZ+ec2|JnLT|5EZ*;+<Yfz>MDr9bc*RexLU6J#~1fK)@Fxm<1DBgMR{#J2 literal 0 HcmV?d00001 diff --git a/tests/virpcitestdata/0000-02-00.0.config b/tests/virpcitestdata/0000-02-00.0.config new file mode 100644 index 0000000000000000000000000000000000000000..ebb44d8f69c91809b82e8d2669026dfff42c3100 GIT binary patch literal 4096 zcmeHIJ5Iwu5Pj>-$HXLdfc(IT4QW!Oh|*DEDG*U2(QpB%)6hF9Xc0G{-~cHpZ9zfJ z322bGLCP>|J5DSjlp;bw+F5C5_RZVz>a9KYO*YD;3~)tdAWH!g;g^|DT!A}LQllO0 zf<ukg!legyL7fFC5gKC!{{2_w#5T}-JA=b|MuI>KOBAGo6v%Nj66qpz7dAnM2{>Nx zI9e@W7?nb%gO1$~!w=vwj8>jg7)EtS6WUe7uo7>+ML1#rsDf3w!Hov7tz0X}jA<@| znO4!A1^^YZtw!BE*soP9<<j2nPSMZ{`E4z?rDikf6j#_0e3Q7Y;A`;P3iuB_MQ$@K zL$`cR3bc`bbvTZ_%x~vZDA})?c)4!b%Xk`9Vi*Rmz)Xah7kn=o;(nw_1*8H}fjX`* zB_kyU=<7E&%Z8$O^q-3wg>Vm(Pe#2&br`1}!)cEm@WPoIaHr{&J59pe0RNAZ%Qm+& w+Ruz#E{GcIPT1)j@8SvQ&et-M3anQFH~E3rsQUYQpCGGA1*8H}fj?2;8=mzw{{R30 literal 0 HcmV?d00001 -- 2.21.0

On Thu, Jul 11, 2019 at 17:54:03 +0200, Michal Privoznik wrote:
The device configs (which are actually the same one config) come from a NVMe disk of mine.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> ---
ACK

Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- tests/virhostdevtest.c | 97 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 97 insertions(+) diff --git a/tests/virhostdevtest.c b/tests/virhostdevtest.c index 7d15a87797..e34959c4f1 100644 --- a/tests/virhostdevtest.c +++ b/tests/virhostdevtest.c @@ -48,6 +48,9 @@ VIR_LOG_INIT("tests.hostdevtest"); # define CHECK_PCI_LIST_COUNT(list, cnt) \ CHECK_LIST_COUNT(list, cnt, virPCIDeviceListCount) +# define CHECK_NVME_LIST_COUNT(list, cnt) \ + CHECK_LIST_COUNT(list, cnt, virNVMeDeviceListCount) + # define TEST_STATE_DIR abs_builddir "/hostdevmgr" static const char *drv_name = "test_driver"; static const char *dom_name = "test_domain"; @@ -57,6 +60,36 @@ static int nhostdevs = 3; static virDomainHostdevDefPtr hostdevs[] = {NULL, NULL, NULL}; static virPCIDevicePtr dev[] = {NULL, NULL, NULL}; static virHostdevManagerPtr mgr; +static const size_t ndisks = 3; +static virDomainDiskDefPtr disks[] = {NULL, NULL, NULL}; +static const char *diskXML[] = { + "<disk type='nvme' device='disk'>" + " <driver name='qemu' type='raw'/>" + " <source type='pci' managed='yes' namespace='1'>" + " <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>" + " </source>" + " <target dev='vda' bus='virtio'/>" + " <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>" + "</disk>", + + "<disk type='nvme' device='disk'>" + " <driver name='qemu' type='raw'/>" + " <source type='pci' managed='yes' namespace='2'>" + " <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>" + " </source>" + " <target dev='vdb' bus='virtio'/>" + " <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>" + "</disk>", + + "<disk type='nvme' device='disk'>" + " <driver name='qemu' type='raw'/>" + " <source type='pci' managed='no' namespace='1'>" + " <address domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>" + " </source>" + " <target dev='vdc' bus='virtio'/>" + " <address type='pci' 
domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>" + "</disk>" +}; static void myCleanup(void) @@ -67,6 +100,9 @@ myCleanup(void) virDomainHostdevDefFree(hostdevs[i]); } + for (i = 0; i < ndisks; i++) + virDomainDiskDefFree(disks[i]); + if (mgr) { if (!getenv("LIBVIRT_SKIP_CLEANUP")) virFileDeleteTree(mgr->stateDir); @@ -75,6 +111,7 @@ myCleanup(void) virObjectUnref(mgr->activeUSBHostdevs); virObjectUnref(mgr->inactivePCIHostdevs); virObjectUnref(mgr->activeSCSIHostdevs); + virObjectUnref(mgr->activeNVMeHostdevs); VIR_FREE(mgr->stateDir); VIR_FREE(mgr); } @@ -107,6 +144,11 @@ myInit(void) virPCIDeviceSetStubDriver(dev[i], VIR_PCI_STUB_DRIVER_KVM); } + for (i = 0; i < ndisks; i++) { + if (!(disks[i] = virDomainDiskDefParse(diskXML[i], NULL, NULL, 0))) + goto cleanup; + } + if (VIR_ALLOC(mgr) < 0) goto cleanup; if ((mgr->activePCIHostdevs = virPCIDeviceListNew()) == NULL) @@ -117,6 +159,8 @@ myInit(void) goto cleanup; if ((mgr->activeSCSIHostdevs = virSCSIDeviceListNew()) == NULL) goto cleanup; + if ((mgr->activeNVMeHostdevs = virNVMeDeviceListNew()) == NULL) + goto cleanup; if (VIR_STRDUP(mgr->stateDir, TEST_STATE_DIR) < 0) goto cleanup; if (virFileMakePath(mgr->stateDir) < 0) @@ -550,6 +594,58 @@ testVirHostdevOther(const void *opaque ATTRIBUTE_UNUSED) return ret; } +static int +testNVMeDiskRoundtrip(const void *opaque ATTRIBUTE_UNUSED) +{ + int ret = -1; + + /* Don't rely on a state that previous test cases might have + * left the manager in. Start with a clean slate. 
*/ + virHostdevReAttachPCIDevices(mgr, drv_name, dom_name, + hostdevs, nhostdevs, NULL); + + CHECK_NVME_LIST_COUNT(mgr->activeNVMeHostdevs, 0); + CHECK_PCI_LIST_COUNT(mgr->activePCIHostdevs, 0); + CHECK_PCI_LIST_COUNT(mgr->inactivePCIHostdevs, 0); + + /* Firstly, attach all NVMe disks */ + if (virHostdevPrepareNVMeDevices(mgr, drv_name, dom_name, disks, ndisks) < 0) + goto cleanup; + + CHECK_NVME_LIST_COUNT(mgr->activeNVMeHostdevs, 3); + CHECK_PCI_LIST_COUNT(mgr->activePCIHostdevs, 2); + CHECK_PCI_LIST_COUNT(mgr->inactivePCIHostdevs, 0); + + /* Now, try to detach the first one. */ + if (virHostdevReAttachNVMeDevices(mgr, drv_name, dom_name, disks, 1) < 0) + goto cleanup; + + CHECK_NVME_LIST_COUNT(mgr->activeNVMeHostdevs, 2); + CHECK_PCI_LIST_COUNT(mgr->activePCIHostdevs, 2); + CHECK_PCI_LIST_COUNT(mgr->inactivePCIHostdevs, 0); + + /* And the last one */ + if (virHostdevReAttachNVMeDevices(mgr, drv_name, dom_name, &disks[2], 1) < 0) + goto cleanup; + + CHECK_NVME_LIST_COUNT(mgr->activeNVMeHostdevs, 1); + CHECK_PCI_LIST_COUNT(mgr->activePCIHostdevs, 1); + CHECK_PCI_LIST_COUNT(mgr->inactivePCIHostdevs, 0); + + /* Finally, detach the middle one */ + if (virHostdevReAttachNVMeDevices(mgr, drv_name, dom_name, &disks[1], 1) < 0) + goto cleanup; + + CHECK_NVME_LIST_COUNT(mgr->activeNVMeHostdevs, 0); + CHECK_PCI_LIST_COUNT(mgr->activePCIHostdevs, 0); + CHECK_PCI_LIST_COUNT(mgr->inactivePCIHostdevs, 0); + + ret = 0; + cleanup: + return ret; +} + + # define FAKEROOTDIRTEMPLATE abs_builddir "/fakerootdir-XXXXXX" static int @@ -588,6 +684,7 @@ mymain(void) DO_TEST(testVirHostdevRoundtripManaged); DO_TEST(testVirHostdevRoundtripMixed); DO_TEST(testVirHostdevOther); + DO_TEST(testNVMeDiskRoundtrip); myCleanup(); -- 2.21.0

The qemu driver has its own wrappers around virHostdev module (so that some arguments are filled in automatically). Extend these to include NVMe devices too. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_hostdev.c | 49 ++++++++++++++++++++++++++++++++++++++--- src/qemu/qemu_hostdev.h | 10 +++++++++ 2 files changed, 56 insertions(+), 3 deletions(-) diff --git a/src/qemu/qemu_hostdev.c b/src/qemu/qemu_hostdev.c index 92b037e1ed..efa4d62f1f 100644 --- a/src/qemu/qemu_hostdev.c +++ b/src/qemu/qemu_hostdev.c @@ -96,13 +96,28 @@ qemuHostdevUpdateActiveMediatedDevices(virQEMUDriverPtr driver, } +int +qemuHostdevUpdateActiveNVMeDevices(virQEMUDriverPtr driver, + virDomainDefPtr def) +{ + return virHostdevUpdateActiveNVMeDevices(driver->hostdevMgr, + QEMU_DRIVER_NAME, + def->name, + def->disks, + def->ndisks); +} + + int qemuHostdevUpdateActiveDomainDevices(virQEMUDriverPtr driver, virDomainDefPtr def) { - if (!def->nhostdevs) + if (!def->nhostdevs && !def->ndisks) return 0; + if (qemuHostdevUpdateActiveNVMeDevices(driver, def) < 0) + return -1; + if (qemuHostdevUpdateActivePCIDevices(driver, def) < 0) return -1; @@ -226,6 +241,17 @@ qemuHostdevPreparePCIDevicesCheckSupport(virDomainHostdevDefPtr *hostdevs, return true; } +int +qemuHostdevPrepareNVMeDevices(virQEMUDriverPtr driver, + const char *name, + virDomainDiskDefPtr *disks, + size_t ndisks) +{ + return virHostdevPrepareNVMeDevices(driver->hostdevMgr, + QEMU_DRIVER_NAME, + name, disks, ndisks); +} + int qemuHostdevPreparePCIDevices(virQEMUDriverPtr driver, const char *name, @@ -342,9 +368,12 @@ qemuHostdevPrepareDomainDevices(virQEMUDriverPtr driver, virQEMUCapsPtr qemuCaps, unsigned int flags) { - if (!def->nhostdevs) + if (!def->nhostdevs && !def->ndisks) return 0; + if (qemuHostdevPrepareNVMeDevices(driver, def->name, def->disks, def->ndisks) < 0) + return -1; + if (qemuHostdevPreparePCIDevices(driver, def->name, def->uuid, def->hostdevs, def->nhostdevs, qemuCaps, flags) < 0) @@ -369,6 
+398,17 @@ qemuHostdevPrepareDomainDevices(virQEMUDriverPtr driver, return 0; } +void +qemuHostdevReAttachNVMeDevices(virQEMUDriverPtr driver, + const char *name, + virDomainDiskDefPtr *disks, + size_t ndisks) +{ + virHostdevReAttachNVMeDevices(driver->hostdevMgr, + QEMU_DRIVER_NAME, + name, disks, ndisks); +} + void qemuHostdevReAttachPCIDevices(virQEMUDriverPtr driver, const char *name, @@ -448,9 +488,12 @@ void qemuHostdevReAttachDomainDevices(virQEMUDriverPtr driver, virDomainDefPtr def) { - if (!def->nhostdevs) + if (!def->nhostdevs && !def->ndisks) return; + qemuHostdevReAttachNVMeDevices(driver, def->name, def->disks, + def->ndisks); + qemuHostdevReAttachPCIDevices(driver, def->name, def->hostdevs, def->nhostdevs); diff --git a/src/qemu/qemu_hostdev.h b/src/qemu/qemu_hostdev.h index f6d76c1c2a..4afb103354 100644 --- a/src/qemu/qemu_hostdev.h +++ b/src/qemu/qemu_hostdev.h @@ -27,6 +27,8 @@ bool qemuHostdevHostSupportsPassthroughLegacy(void); bool qemuHostdevHostSupportsPassthroughVFIO(void); +int qemuHostdevUpdateActiveNVMeDevices(virQEMUDriverPtr driver, + virDomainDefPtr def); int qemuHostdevUpdateActiveMediatedDevices(virQEMUDriverPtr driver, virDomainDefPtr def); int qemuHostdevUpdateActivePCIDevices(virQEMUDriverPtr driver, @@ -38,6 +40,10 @@ int qemuHostdevUpdateActiveSCSIDevices(virQEMUDriverPtr driver, int qemuHostdevUpdateActiveDomainDevices(virQEMUDriverPtr driver, virDomainDefPtr def); +int qemuHostdevPrepareNVMeDevices(virQEMUDriverPtr driver, + const char *name, + virDomainDiskDefPtr *disks, + size_t ndisks); int qemuHostdevPreparePCIDevices(virQEMUDriverPtr driver, const char *name, const unsigned char *uuid, @@ -67,6 +73,10 @@ int qemuHostdevPrepareDomainDevices(virQEMUDriverPtr driver, virQEMUCapsPtr qemuCaps, unsigned int flags); +void qemuHostdevReAttachNVMeDevices(virQEMUDriverPtr driver, + const char *name, + virDomainDiskDefPtr *disks, + size_t ndisks); void qemuHostdevReAttachPCIDevices(virQEMUDriverPtr driver, const char *name, 
virDomainHostdevDefPtr *hostdevs, -- 2.21.0

On Thu, Jul 11, 2019 at 17:54:05 +0200, Michal Privoznik wrote:
The qemu driver has its own wrappers around virHostdev module (so that some arguments are filled in automatically). Extend these to include NVMe devices too.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_hostdev.c | 49 ++++++++++++++++++++++++++++++++++++++--- src/qemu/qemu_hostdev.h | 10 +++++++++ 2 files changed, 56 insertions(+), 3 deletions(-)
diff --git a/src/qemu/qemu_hostdev.c b/src/qemu/qemu_hostdev.c index 92b037e1ed..efa4d62f1f 100644 --- a/src/qemu/qemu_hostdev.c +++ b/src/qemu/qemu_hostdev.c @@ -96,13 +96,28 @@ qemuHostdevUpdateActiveMediatedDevices(virQEMUDriverPtr driver, }
+int +qemuHostdevUpdateActiveNVMeDevices(virQEMUDriverPtr driver,
Please include "Disks" in the function name to clarify it's not a hostdev from the domain's point of view. ACK to the rest

We have this beautiful function that does crystal ball divination. The function is named qemuDomainGetMemLockLimitBytes() and it calculates the upper limit of how much locked memory is given guest going to need. The function bases its guess on devices defined for a domain. For instance, if there is a VFIO hostdev defined then it adds 1GiB to the guessed maximum. Since NVMe disks are pretty much VFIO hostdevs (but not quite), we have to do the same sorcery. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_domain.c | 15 +++++++++++++++ 1 file changed, 15 insertions(+) diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index f09abc8a73..09e5ee37f4 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -10950,6 +10950,21 @@ qemuDomainGetMemLockLimitBytes(virDomainDefPtr def) } } + for (i = 0; i < def->ndisks; i++) { + virDomainDiskDefPtr disk = def->disks[i]; + virStorageSourcePtr n; + + if (!disk->src) + continue; + + for (n = disk->src; virStorageSourceIsBacking(n); n = n->backingStore) { + if (n->type == VIR_STORAGE_TYPE_NVME) { + memKB = virDomainDefGetMemoryTotal(def) + 1024 * 1024; + goto done; + } + } + } + done: return memKB << 10; } -- 2.21.0

On Thu, Jul 11, 2019 at 17:54:06 +0200, Michal Privoznik wrote:
We have this beautiful function that does crystal ball divination. The function is named qemuDomainGetMemLockLimitBytes() and it calculates the upper limit of how much locked memory is given guest going to need. The function bases its guess on devices defined for a domain. For instance, if there is a VFIO hostdev defined then it adds 1GiB to the guessed maximum. Since NVMe disks are pretty much VFIO hostdevs (but not quite), we have to do the same sorcery.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_domain.c | 15 +++++++++++++++ 1 file changed, 15 insertions(+)
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index f09abc8a73..09e5ee37f4 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c
Preceding hunk: for (i = 0; i < def->nhostdevs; i++) { virDomainHostdevSubsysPtr subsys = &def->hostdevs[i]->source.subsys; if (def->hostdevs[i]->mode == VIR_DOMAIN_HOSTDEV_MODE_SUBSYS && (subsys->type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_MDEV || (subsys->type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI && subsys->u.pci.backend == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_VFIO))) { memKB = virDomainDefGetMemoryTotal(def) + 1024 * 1024; goto done; } }
@@ -10950,6 +10950,21 @@ qemuDomainGetMemLockLimitBytes(virDomainDefPtr def) } }
+ for (i = 0; i < def->ndisks; i++) { + virDomainDiskDefPtr disk = def->disks[i]; + virStorageSourcePtr n; + + if (!disk->src) + continue; + + for (n = disk->src; virStorageSourceIsBacking(n); n = n->backingStore) { + if (n->type == VIR_STORAGE_TYPE_NVME) { + memKB = virDomainDefGetMemoryTotal(def) + 1024 * 1024; + goto done; + } + } + }
Please set a boolean such as 'needVFIO' in the above hunk and here, and do the calculation once based on that boolean. This implementation creates two instances that would both need fixing if we ever had to change the number. ACK with that change
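
For illustration only, the suggested shape could look roughly like the sketch below. The Sketch* types and the baseLimitKB field are hypothetical stand-ins, not libvirt's real virDomainDef; the boolean fields replace the actual hostdev backend check and the backing chain walk. The only point is that the "total memory + 1 GiB" arithmetic appears once.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for libvirt's structures. */
typedef struct {
    bool usesVFIO;      /* stand-in for the mdev / VFIO PCI backend check */
} SketchHostdev;

typedef struct {
    bool chainHasNVMe;  /* stand-in for walking the disk's backing chain */
} SketchDisk;

typedef struct {
    unsigned long long totalMemKB;   /* stand-in for virDomainDefGetMemoryTotal() */
    unsigned long long baseLimitKB;  /* assumed limit computed by earlier rules */
    const SketchHostdev *hostdevs;
    size_t nhostdevs;
    const SketchDisk *disks;
    size_t ndisks;
} SketchDomainDef;

static unsigned long long
sketchGetMemLockLimitBytes(const SketchDomainDef *def)
{
    unsigned long long memKB = def->baseLimitKB;
    bool needVFIO = false;
    size_t i;

    /* Both loops only record the fact that VFIO is needed ... */
    for (i = 0; i < def->nhostdevs && !needVFIO; i++)
        needVFIO = def->hostdevs[i].usesVFIO;

    for (i = 0; i < def->ndisks && !needVFIO; i++)
        needVFIO = def->disks[i].chainHasNVMe;

    /* ... so the 1 GiB adjustment lives in exactly one place. */
    if (needVFIO)
        memKB = def->totalMemKB + 1024 * 1024;

    return memKB << 10;  /* KiB to bytes */
}
```

With this shape, changing the extra 1 GiB later would touch a single line regardless of how many device types end up requiring VFIO.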

This function will return true if there's a storage source of type VIR_STORAGE_TYPE_NVME, or false otherwise. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/libvirt_private.syms | 1 + src/util/virstoragefile.c | 14 ++++++++++++++ src/util/virstoragefile.h | 2 ++ 3 files changed, 17 insertions(+) diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index bc6583562a..5b7aa58dd8 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -2994,6 +2994,7 @@ virStoragePRDefIsManaged; virStoragePRDefParseXML; virStorageSourceBackingStoreClear; virStorageSourceChainHasManagedPR; +virStorageSourceChainHasNVMe; virStorageSourceClear; virStorageSourceCopy; virStorageSourceFindByNodeName; diff --git a/src/util/virstoragefile.c b/src/util/virstoragefile.c index 18aa33fe05..a9ceb697cf 100644 --- a/src/util/virstoragefile.c +++ b/src/util/virstoragefile.c @@ -2157,6 +2157,20 @@ virStorageSourceNVMeDefFree(virStorageSourceNVMeDefPtr def) } +bool +virStorageSourceChainHasNVMe(const virStorageSource *src) +{ + const virStorageSource *n; + + for (n = src; virStorageSourceIsBacking(n); n = n->backingStore) { + if (n->type == VIR_STORAGE_TYPE_NVME) + return true; + } + + return false; +} + + virSecurityDeviceLabelDefPtr virStorageSourceGetSecurityLabelDef(virStorageSourcePtr src, const char *model) diff --git a/src/util/virstoragefile.h b/src/util/virstoragefile.h index a1294ea608..8afd5d60cb 100644 --- a/src/util/virstoragefile.h +++ b/src/util/virstoragefile.h @@ -431,6 +431,8 @@ virStorageSourceChainHasManagedPR(virStorageSourcePtr src); void virStorageSourceNVMeDefFree(virStorageSourceNVMeDefPtr def); VIR_DEFINE_AUTOPTR_FUNC(virStorageSourceNVMeDef, virStorageSourceNVMeDefFree); +bool virStorageSourceChainHasNVMe(const virStorageSource *src); + virSecurityDeviceLabelDefPtr virStorageSourceGetSecurityLabelDef(virStorageSourcePtr src, const char *model); -- 2.21.0

On Thu, Jul 11, 2019 at 17:54:07 +0200, Michal Privoznik wrote:
This function will return true if there's a storage source of type VIR_STORAGE_TYPE_NVME, or false otherwise.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/libvirt_private.syms | 1 + src/util/virstoragefile.c | 14 ++++++++++++++ src/util/virstoragefile.h | 2 ++ 3 files changed, 17 insertions(+)
[...]
diff --git a/src/util/virstoragefile.c b/src/util/virstoragefile.c index 18aa33fe05..a9ceb697cf 100644 --- a/src/util/virstoragefile.c +++ b/src/util/virstoragefile.c @@ -2157,6 +2157,20 @@ virStorageSourceNVMeDefFree(virStorageSourceNVMeDefPtr def) }
+bool +virStorageSourceChainHasNVMe(const virStorageSource *src) +{ + const virStorageSource *n; + + for (n = src; virStorageSourceIsBacking(n); n = n->backingStore) { + if (n->type == VIR_STORAGE_TYPE_NVME)
It occurs to me that if you introduce this function earlier you will be able to save some code in the previous patches. ACK to this though.
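
As a rough illustration of the saving, here is a minimal sketch of a chain-walking helper and a caller that collapses to a single check per disk. The SketchSource type is a hypothetical stand-in for virStorageSource, and the chain is assumed to be terminated by a NULL backingStore, which is simpler than what virStorageSourceIsBacking() actually tests.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the storage source types. */
typedef enum {
    SKETCH_STORAGE_FILE,
    SKETCH_STORAGE_NVME
} SketchStorageType;

typedef struct SketchSource {
    SketchStorageType type;
    const struct SketchSource *backingStore;  /* NULL ends the chain (assumption) */
} SketchSource;

/* Returns true if any element of the backing chain is an NVMe source. */
static bool
sketchChainHasNVMe(const SketchSource *src)
{
    const SketchSource *n;

    for (n = src; n; n = n->backingStore) {
        if (n->type == SKETCH_STORAGE_NVME)
            return true;
    }

    return false;
}

/* A caller such as the memlock calculation no longer needs its own
 * nested loop; one helper call per disk is enough. */
static bool
sketchDisksHaveNVMe(const SketchSource *const *disks, size_t ndisks)
{
    size_t i;

    for (i = 0; i < ndisks; i++) {
        if (sketchChainHasNVMe(disks[i]))
            return true;
    }

    return false;
}
```

The second function has the same shape as the domain-level helper posted in the following patch.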

This function will return true if any of disks (or their backing chain) for given domain contains an NVMe disk. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/conf/domain_conf.c | 14 ++++++++++++++ src/conf/domain_conf.h | 3 +++ src/libvirt_private.syms | 1 + 3 files changed, 18 insertions(+) diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 73f5e1fa0f..1b6ee3bfa6 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -31431,6 +31431,20 @@ virDomainDefHasManagedPR(const virDomainDef *def) } +bool +virDomainDefHasNVMeDisk(const virDomainDef *def) +{ + size_t i; + + for (i = 0; i < def->ndisks; i++) { + if (virStorageSourceChainHasNVMe(def->disks[i]->src)) + return true; + } + + return false; +} + + /** * virDomainGraphicsDefHasOpenGL: * @def: domain definition diff --git a/src/conf/domain_conf.h b/src/conf/domain_conf.h index c1b5fc1337..2a067633bd 100644 --- a/src/conf/domain_conf.h +++ b/src/conf/domain_conf.h @@ -3611,6 +3611,9 @@ virDomainDiskGetDetectZeroesMode(virDomainDiskDiscard discard, bool virDomainDefHasManagedPR(const virDomainDef *def); +bool +virDomainDefHasNVMeDisk(const virDomainDef *def); + bool virDomainGraphicsDefHasOpenGL(const virDomainDef *def); diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index 5b7aa58dd8..15bfae115f 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -284,6 +284,7 @@ virDomainDefHasDeviceAddress; virDomainDefHasManagedPR; virDomainDefHasMemballoon; virDomainDefHasMemoryHotplug; +virDomainDefHasNVMeDisk; virDomainDefHasUSB; virDomainDefHasVcpusOffline; virDomainDefLifecycleActionAllowed; -- 2.21.0

On Thu, Jul 11, 2019 at 17:54:08 +0200, Michal Privoznik wrote:
This function will return true if any of disks (or their backing chain) for given domain contains an NVMe disk.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/conf/domain_conf.c | 14 ++++++++++++++ src/conf/domain_conf.h | 3 +++ src/libvirt_private.syms | 1 + 3 files changed, 18 insertions(+)
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 73f5e1fa0f..1b6ee3bfa6 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -31431,6 +31431,20 @@ virDomainDefHasManagedPR(const virDomainDef *def) }
+bool +virDomainDefHasNVMeDisk(const virDomainDef *def) +{ + size_t i; + + for (i = 0; i < def->ndisks; i++) { + if (virStorageSourceChainHasNVMe(def->disks[i]->src)) + return true; + } + + return false;
Same comment as in previous patch. ACK.

This piece of code will be re-used later. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/conf/domain_conf.c | 17 +++++++++++++++++ src/conf/domain_conf.h | 3 +++ src/libvirt_private.syms | 1 + src/qemu/qemu_domain.c | 13 ++----------- 4 files changed, 23 insertions(+), 11 deletions(-) diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 1b6ee3bfa6..e71e484a6f 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -31445,6 +31445,23 @@ virDomainDefHasNVMeDisk(const virDomainDef *def) } +bool +virDomainDefHasVFIOHostdev(const virDomainDef *def) +{ + size_t i; + + for (i = 0; i < def->nhostdevs; i++) { + const virDomainHostdevDef *tmp = def->hostdevs[i]; + if (tmp->mode == VIR_DOMAIN_HOSTDEV_MODE_SUBSYS && + tmp->source.subsys.type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI && + tmp->source.subsys.u.pci.backend == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_VFIO) + return true; + } + + return false; +} + + /** * virDomainGraphicsDefHasOpenGL: * @def: domain definition diff --git a/src/conf/domain_conf.h b/src/conf/domain_conf.h index 2a067633bd..5c6c5b7a33 100644 --- a/src/conf/domain_conf.h +++ b/src/conf/domain_conf.h @@ -3614,6 +3614,9 @@ virDomainDefHasManagedPR(const virDomainDef *def); bool virDomainDefHasNVMeDisk(const virDomainDef *def); +bool +virDomainDefHasVFIOHostdev(const virDomainDef *def); + bool virDomainGraphicsDefHasOpenGL(const virDomainDef *def); diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index 15bfae115f..4fda747fb3 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -287,6 +287,7 @@ virDomainDefHasMemoryHotplug; virDomainDefHasNVMeDisk; virDomainDefHasUSB; virDomainDefHasVcpusOffline; +virDomainDefHasVFIOHostdev; virDomainDefLifecycleActionAllowed; virDomainDefMaybeAddController; virDomainDefMaybeAddInput; diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index 09e5ee37f4..2a7f09ce24 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -11764,7 +11764,7 @@ 
qemuDomainGetHostdevPath(virDomainDefPtr def, bool includeVFIO = false; char **tmpPaths = NULL; int *tmpPerms = NULL; - size_t i, tmpNpaths = 0; + size_t tmpNpaths = 0; int perm = 0; *npaths = 0; @@ -11787,16 +11787,7 @@ qemuDomainGetHostdevPath(virDomainDefPtr def, perm = VIR_CGROUP_DEVICE_RW; if (teardown) { - size_t nvfios = 0; - for (i = 0; i < def->nhostdevs; i++) { - virDomainHostdevDefPtr tmp = def->hostdevs[i]; - if (tmp->mode == VIR_DOMAIN_HOSTDEV_MODE_SUBSYS && - tmp->source.subsys.type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI && - tmp->source.subsys.u.pci.backend == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_VFIO) - nvfios++; - } - - if (nvfios == 0) + if (!virDomainDefHasVFIOHostdev(def)) includeVFIO = true; } else { includeVFIO = true; -- 2.21.0

On Thu, Jul 11, 2019 at 17:54:09 +0200, Michal Privoznik wrote:
This piece of code will be re-used later.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/conf/domain_conf.c | 17 +++++++++++++++++ src/conf/domain_conf.h | 3 +++ src/libvirt_private.syms | 1 + src/qemu/qemu_domain.c | 13 ++----------- 4 files changed, 23 insertions(+), 11 deletions(-)
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 1b6ee3bfa6..e71e484a6f 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -31445,6 +31445,23 @@ virDomainDefHasNVMeDisk(const virDomainDef *def) }
+bool
If you plan to add NVMe disk handling to this function later, please add a comment explaining what is happening here and why.
+virDomainDefHasVFIOHostdev(const virDomainDef *def) +{
ACK

Couple of places in the QEMU driver will want to know what paths
are associated with NVMe disks (for instance CGroup code or
namespaces code). Introduce helpers which return desired paths
(for instance /dev/vfio/vfio and /dev/vfio/N).

Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
 src/qemu/qemu_domain.c | 44 ++++++++++++++++++++++++++++++++++++++++++
 src/qemu/qemu_domain.h |  6 ++++++
 2 files changed, 50 insertions(+)

diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index 2a7f09ce24..949bbace88 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -11723,6 +11723,50 @@ qemuDomainSupportsVideoVga(virDomainVideoDefPtr video,
 }
 
 
+char *
+qemuDomainGetNVMeDiskPath(const virStorageSourceNVMeDef *nvme)
+{
+    VIR_AUTOPTR(virPCIDevice) pci = NULL;
+
+    /* All NVMe devices are VFIO PCI devices */
+    if (!(pci = virPCIDeviceNew(nvme->pciAddr.domain,
+                                nvme->pciAddr.bus,
+                                nvme->pciAddr.slot,
+                                nvme->pciAddr.function)))
+        return NULL;
+
+    return virPCIDeviceGetIOMMUGroupDev(pci);
+}
+
+
+char **
+qemuDomainGetDiskNVMePaths(const virDomainDef *def,
+                           const virStorageSource *src,
+                           bool teardown)
+{
+    VIR_AUTOFREE(char *) iommuGroup = NULL;
+    VIR_AUTOSTRINGLIST paths = NULL;
+    bool includeVFIO = !teardown;
+
+    if (!(iommuGroup = qemuDomainGetNVMeDiskPath(src->nvme)))
+        return NULL;
+
+    if (virStringListAdd(&paths, iommuGroup) < 0)
+        return NULL;
+
+    if (teardown && def &&
+        !virDomainDefHasNVMeDisk(def) &&
+        !virDomainDefHasVFIOHostdev(def))
+        includeVFIO = true;
+
+    if (includeVFIO &&
+        virStringListAdd(&paths, QEMU_DEV_VFIO) < 0)
+        return NULL;
+
+    VIR_RETURN_PTR(paths);
+}
+
+
 /**
  * qemuDomainGetHostdevPath:
  * @def: domain definition
diff --git a/src/qemu/qemu_domain.h b/src/qemu/qemu_domain.h
index 3eea8b0f96..82e225088d 100644
--- a/src/qemu/qemu_domain.h
+++ b/src/qemu/qemu_domain.h
@@ -1011,6 +1011,12 @@ int qemuDomainCheckMonitor(virQEMUDriverPtr driver,
 bool qemuDomainSupportsVideoVga(virDomainVideoDefPtr video,
                                 virQEMUCapsPtr qemuCaps);
 
+char * qemuDomainGetNVMeDiskPath(const virStorageSourceNVMeDef *nvme);
+
+char ** qemuDomainGetDiskNVMePaths(const virDomainDef *def,
+                                   const virStorageSource *src,
+                                   bool teardown);
+
 int qemuDomainGetHostdevPath(virDomainDefPtr def,
                              virDomainHostdevDefPtr dev,
                              bool teardown,
-- 
2.21.0

On Thu, Jul 11, 2019 at 17:54:10 +0200, Michal Privoznik wrote:
Couple of places in the QEMU driver will want to know what paths are associated with NVMe disks (for instance CGroup code or namespaces code). Introduce helpers which return desired paths (for instance /dev/vfio/vfio and /dev/vfio/N).
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_domain.c | 44 ++++++++++++++++++++++++++++++++++++++++++ src/qemu/qemu_domain.h | 6 ++++++ 2 files changed, 50 insertions(+)
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index 2a7f09ce24..949bbace88 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -11723,6 +11723,50 @@ qemuDomainSupportsVideoVga(virDomainVideoDefPtr video, }
+char * +qemuDomainGetNVMeDiskPath(const virStorageSourceNVMeDef *nvme)
The name of this function is VERY misleading. It returns the path of the IOMMU group associated with the host portion of the NVMe device, but the name implies something completely different. Include IOMMUGroup in the name and add a comment.
+{ + VIR_AUTOPTR(virPCIDevice) pci = NULL; + + /* All NVMe devices are VFIO PCI devices */ + if (!(pci = virPCIDeviceNew(nvme->pciAddr.domain, + nvme->pciAddr.bus, + nvme->pciAddr.slot, + nvme->pciAddr.function))) + return NULL; + + return virPCIDeviceGetIOMMUGroupDev(pci); +} + + +char ** +qemuDomainGetDiskNVMePaths(const virDomainDef *def,
Also this name looks troubling.
+ const virStorageSource *src, + bool teardown) +{ + VIR_AUTOFREE(char *) iommuGroup = NULL; + VIR_AUTOSTRINGLIST paths = NULL; + bool includeVFIO = !teardown; + + if (!(iommuGroup = qemuDomainGetNVMeDiskPath(src->nvme))) + return NULL; + + if (virStringListAdd(&paths, iommuGroup) < 0) + return NULL; + + if (teardown && def && + !virDomainDefHasNVMeDisk(def) && + !virDomainDefHasVFIOHostdev(def)) + includeVFIO = true; + + if (includeVFIO && + virStringListAdd(&paths, QEMU_DEV_VFIO) < 0) + return NULL;
I don't like this. It hides the logic necessary to detach VFIO groups in a random function, and the hostdev code will require exactly the same treatment. Additionally, any further VFIO-based device would mean handling this in 3 places. Can't we consolidate that somehow?

On 7/16/19 4:30 PM, Peter Krempa wrote:
On Thu, Jul 11, 2019 at 17:54:10 +0200, Michal Privoznik wrote:
Couple of places in the QEMU driver will want to know what paths are associated with NVMe disks (for instance CGroup code or namespaces code). Introduce helpers which return desired paths (for instance /dev/vfio/vfio and /dev/vfio/N).
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_domain.c | 44 ++++++++++++++++++++++++++++++++++++++++++ src/qemu/qemu_domain.h | 6 ++++++ 2 files changed, 50 insertions(+)
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index 2a7f09ce24..949bbace88 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -11723,6 +11723,50 @@ qemuDomainSupportsVideoVga(virDomainVideoDefPtr video, }
+char * +qemuDomainGetNVMeDiskPath(const virStorageSourceNVMeDef *nvme)
The name of this function is VERY misleading. It returns the path of the IOMMU group associated with the host portion of the NVMe device, but the name implies something completely different.
Include IOMMUGroup in the name and add a comment.
+{ + VIR_AUTOPTR(virPCIDevice) pci = NULL; + + /* All NVMe devices are VFIO PCI devices */ + if (!(pci = virPCIDeviceNew(nvme->pciAddr.domain, + nvme->pciAddr.bus, + nvme->pciAddr.slot, + nvme->pciAddr.function))) + return NULL; + + return virPCIDeviceGetIOMMUGroupDev(pci); +} + + +char ** +qemuDomainGetDiskNVMePaths(const virDomainDef *def,
Also this name looks troubling.
+ const virStorageSource *src, + bool teardown) +{ + VIR_AUTOFREE(char *) iommuGroup = NULL; + VIR_AUTOSTRINGLIST paths = NULL; + bool includeVFIO = !teardown; + + if (!(iommuGroup = qemuDomainGetNVMeDiskPath(src->nvme))) + return NULL; + + if (virStringListAdd(&paths, iommuGroup) < 0) + return NULL; + + if (teardown && def && + !virDomainDefHasNVMeDisk(def) && + !virDomainDefHasVFIOHostdev(def)) + includeVFIO = true; + + if (includeVFIO && + virStringListAdd(&paths, QEMU_DEV_VFIO) < 0) + return NULL;
I don't like this. It hides the logic necessary to detach VFIO groups in a random function, and the hostdev code will require exactly the same treatment. Additionally, any further VFIO-based device would mean handling this in 3 places.
Can't we consolidate that somehow?
Agreed, it's ugly. But I don't have any better idea, sorry. Do you have any suggestions?

Michal

If a domain has an NVMe disk configured, then we need to create
/dev/vfio/* paths in domain's namespace so that qemu can open
them.

Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
 src/qemu/qemu_domain.c | 35 ++++++++++++++++++++++++++++++++++-
 1 file changed, 34 insertions(+), 1 deletion(-)

diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index 949bbace88..cd3205a588 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -11831,7 +11831,8 @@ qemuDomainGetHostdevPath(virDomainDefPtr def,
 
                 perm = VIR_CGROUP_DEVICE_RW;
                 if (teardown) {
-                    if (!virDomainDefHasVFIOHostdev(def))
+                    if (!virDomainDefHasVFIOHostdev(def) &&
+                        !virDomainDefHasNVMeDisk(def))
                         includeVFIO = true;
                 } else {
                     includeVFIO = true;
@@ -12415,6 +12416,22 @@ qemuDomainSetupDisk(virQEMUDriverConfigPtr cfg ATTRIBUTE_UNUSED,
     int ret = -1;
 
     for (next = disk->src; virStorageSourceIsBacking(next); next = next->backingStore) {
+        /* NVMe disks must be checked before virStorageSourceIsLocalStorage()
+         * is called. This is because while NVMe disks are local, they don't
+         * have next->path set. */
+        if (next->type == VIR_STORAGE_TYPE_NVME) {
+            VIR_AUTOSTRINGLIST nvmePaths = NULL;
+            size_t i;
+
+            if (!(nvmePaths = qemuDomainGetDiskNVMePaths(NULL, next, false)))
+                goto cleanup;
+
+            for (i = 0; nvmePaths[i]; i++) {
+                if (qemuDomainCreateDevice(nvmePaths[i], data, false) < 0)
+                    goto cleanup;
+            }
+        }
+
         if (!next->path || !virStorageSourceIsLocalStorage(next)) {
             /* Not creating device. Just continue. */
             continue;
@@ -13462,12 +13479,28 @@ qemuDomainNamespaceSetupDisk(virDomainObjPtr vm,
                              virStorageSourcePtr src)
 {
     virStorageSourcePtr next;
+    VIR_AUTOSTRINGLIST nvmePaths = NULL;
     const char **paths = NULL;
     size_t npaths = 0;
     char *dmPath = NULL;
     int ret = -1;
 
     for (next = src; virStorageSourceIsBacking(next); next = next->backingStore) {
+        /* NVMe disks must be checked before virStorageSourceIsLocalStorage()
+         * is called. This is because while NVMe disks are local, they don't
+         * have next->path set. */
+        if (next->type == VIR_STORAGE_TYPE_NVME) {
+            size_t i;
+
+            if (!(nvmePaths = qemuDomainGetDiskNVMePaths(NULL, next, false)))
+                goto cleanup;
+
+            for (i = 0; nvmePaths[i]; i++) {
+                if (VIR_APPEND_ELEMENT_COPY(paths, npaths, nvmePaths[i]) < 0)
+                    goto cleanup;
+            }
+        }
+
         if (virStorageSourceIsEmpty(next) ||
             !virStorageSourceIsLocalStorage(next)) {
             /* Not creating device. Just continue. */
-- 
2.21.0

On Thu, Jul 11, 2019 at 17:54:11 +0200, Michal Privoznik wrote:
If a domain has an NVMe disk configured, then we need to create /dev/vfio/* paths in domain's namespace so that qemu can open them.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_domain.c | 35 ++++++++++++++++++++++++++++++++++- 1 file changed, 34 insertions(+), 1 deletion(-)
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index 949bbace88..cd3205a588 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -11831,7 +11831,8 @@ qemuDomainGetHostdevPath(virDomainDefPtr def,
perm = VIR_CGROUP_DEVICE_RW; if (teardown) { - if (!virDomainDefHasVFIOHostdev(def)) + if (!virDomainDefHasVFIOHostdev(def) && + !virDomainDefHasNVMeDisk(def))
As said previously, I don't like this construct. This hunk also feels like it does not belong in this patch.
includeVFIO = true; } else { includeVFIO = true; @@ -12415,6 +12416,22 @@ qemuDomainSetupDisk(virQEMUDriverConfigPtr cfg ATTRIBUTE_UNUSED, int ret = -1;
for (next = disk->src; virStorageSourceIsBacking(next); next = next->backingStore) { + /* NVMe disks must be checked before virStorageSourceIsLocalStorage() + * is called. This is because while NVMe disks are local, they don't + * have next->path set. */ + if (next->type == VIR_STORAGE_TYPE_NVME) { + VIR_AUTOSTRINGLIST nvmePaths = NULL; + size_t i; + + if (!(nvmePaths = qemuDomainGetDiskNVMePaths(NULL, next, false))) + goto cleanup; + + for (i = 0; nvmePaths[i]; i++) { + if (qemuDomainCreateDevice(nvmePaths[i], data, false) < 0)
/dev/vfio will be included for every NVMe-backed disk.
+ goto cleanup; + } + } + if (!next->path || !virStorageSourceIsLocalStorage(next)) { /* Not creating device. Just continue. */ continue; @@ -13462,12 +13479,28 @@ qemuDomainNamespaceSetupDisk(virDomainObjPtr vm, virStorageSourcePtr src) { virStorageSourcePtr next; + VIR_AUTOSTRINGLIST nvmePaths = NULL; const char **paths = NULL; size_t npaths = 0; char *dmPath = NULL; int ret = -1;
for (next = src; virStorageSourceIsBacking(next); next = next->backingStore) { + /* NVMe disks must be checked before virStorageSourceIsLocalStorage() + * is called. This is because while NVMe disks are local, they don't + * have next->path set. */
Well, that's why I've requested the comment for virStorageSourceIsLocalStorage outlining why we don't consider NVMe as local. You can then say that it's because of that weirdness.
+ if (next->type == VIR_STORAGE_TYPE_NVME) { + size_t i; + + if (!(nvmePaths = qemuDomainGetDiskNVMePaths(NULL, next, false))) + goto cleanup; + + for (i = 0; nvmePaths[i]; i++) { + if (VIR_APPEND_ELEMENT_COPY(paths, npaths, nvmePaths[i]) < 0)
/dev/vfio will be present multiple times again.
+ goto cleanup; + } + } + if (virStorageSourceIsEmpty(next) || !virStorageSourceIsLocalStorage(next)) { /* Not creating device. Just continue. */ -- 2.21.0
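The duplication pointed out in this review (the shared VFIO node being collected once per NVMe disk in the backing chain) could be avoided by deduplicating paths as they are gathered. The following is only a minimal sketch of that idea: `struct path_set`, `PATH_SET_MAX` and `path_set_add_unique()` are hypothetical names made up for illustration and are not libvirt APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical fixed-size collection of device paths; not a libvirt type. */
#define PATH_SET_MAX 16

struct path_set {
    const char *paths[PATH_SET_MAX];
    size_t npaths;
};

/* Add @path to @set unless an equal string is already stored.
 * Returns 0 on success (a duplicate is silently ignored) and
 * -1 if the set is full. The set only borrows the pointer. */
static int
path_set_add_unique(struct path_set *set, const char *path)
{
    size_t i;

    assert(set && path);

    for (i = 0; i < set->npaths; i++) {
        if (strcmp(set->paths[i], path) == 0)
            return 0;
    }

    if (set->npaths == PATH_SET_MAX)
        return -1;

    set->paths[set->npaths++] = path;
    return 0;
}
```

With such a helper, adding /dev/vfio/vfio for a second NVMe disk becomes a no-op, so each path would be created in the namespace only once.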

On 7/16/19 4:37 PM, Peter Krempa wrote:
On Thu, Jul 11, 2019 at 17:54:11 +0200, Michal Privoznik wrote:
If a domain has an NVMe disk configured, then we need to create /dev/vfio/* paths in domain's namespace so that qemu can open them.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_domain.c | 35 ++++++++++++++++++++++++++++++++++- 1 file changed, 34 insertions(+), 1 deletion(-)
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index 949bbace88..cd3205a588 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -11831,7 +11831,8 @@ qemuDomainGetHostdevPath(virDomainDefPtr def,
perm = VIR_CGROUP_DEVICE_RW; if (teardown) { - if (!virDomainDefHasVFIOHostdev(def)) + if (!virDomainDefHasVFIOHostdev(def) && + !virDomainDefHasNVMeDisk(def))
As said previously, I don't like this construct. This hunk also feels like it does not belong in this patch.
The thing is that NVMe disks are both hostdevs and disks, so whenever we deal with /dev/vfio/* we have to consider both. One solution that comes to my mind is to take /dev/vfio/vfio completely out of the picture at the qemuDomainGetHostdevPath() and qemuDomainGetNVMeDiskIOMMUGroupPaths() level: have them return only the single path the device is associated with (/dev/vfio/N) and let the caller check whether /dev/vfio/vfio must also be included in whatever it wants to do.

Michal
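The caller-side check proposed here can be sketched roughly as follows. This is an illustration only: `struct dom_def`, `dom_uses_vfio()` and `need_vfio_control_node()` are hypothetical stand-ins (the real code would consult virDomainDefHasVFIOHostdev() and virDomainDefHasNVMeDisk()), and the sketch assumes the definition passed in on teardown no longer contains the device being removed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for virDomainDef. */
struct dom_def {
    size_t nvfio_hostdevs;  /* PCI hostdevs using the VFIO backend */
    size_t nnvme_disks;     /* disks of type NVMe */
};

/* True if any device left in @def still needs /dev/vfio/vfio. */
static bool
dom_uses_vfio(const struct dom_def *def)
{
    assert(def);
    return def->nvfio_hostdevs > 0 || def->nnvme_disks > 0;
}

/* Decide whether the shared /dev/vfio/vfio node has to be handled
 * in addition to the per-device /dev/vfio/N path.
 * Setup: always. Teardown: only when no other VFIO user remains
 * (assumption: @def already lacks the device being removed). */
static bool
need_vfio_control_node(const struct dom_def *def, bool teardown)
{
    if (!teardown)
        return true;

    return !dom_uses_vfio(def);
}
```

The path helpers would then return only /dev/vfio/N, and the hostdev, cgroup and namespace callers would share this single decision instead of each open-coding it.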

If a domain has an NVMe disk configured, then we need to allow it
on devices CGroup so that qemu can access it. There is one caveat
though - if an NVMe disk is read only we need CGroup to allow
write too. This is because when opening the device, qemu does
couple of ioctl()-s which are considered as write.

Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
 src/qemu/qemu_cgroup.c | 59 +++++++++++++++++++++++++++++++++---------
 1 file changed, 47 insertions(+), 12 deletions(-)

diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c
index 19ca60905a..2a7fc07ac7 100644
--- a/src/qemu/qemu_cgroup.c
+++ b/src/qemu/qemu_cgroup.c
@@ -118,10 +118,29 @@ qemuSetupImageCgroupInternal(virDomainObjPtr vm,
                              virStorageSourcePtr src,
                              bool forceReadonly)
 {
-    if (!src->path || !virStorageSourceIsLocalStorage(src)) {
-        VIR_DEBUG("Not updating cgroups for disk path '%s', type: %s",
-                  NULLSTR(src->path), virStorageTypeToString(src->type));
-        return 0;
+    VIR_AUTOFREE(char *) path = NULL;
+    bool readonly = src->readonly || forceReadonly;
+
+    if (src->type == VIR_STORAGE_TYPE_NVME) {
+        /* Even though disk is R/O we can't make it so in
+         * CGroups. QEMU will try to do some ioctl()-s over the
+         * device and such operations are R/W. */
+        readonly = false;
+
+        if (!(path = qemuDomainGetNVMeDiskPath(src->nvme)))
+            return -1;
+
+        if (qemuSetupImagePathCgroup(vm, QEMU_DEV_VFIO, false) < 0)
+            return -1;
+    } else {
+        if (!src->path || !virStorageSourceIsLocalStorage(src)) {
+            VIR_DEBUG("Not updating cgroups for disk path '%s', type: %s",
+                      NULLSTR(src->path), virStorageTypeToString(src->type));
+            return 0;
+        }
+
+        if (VIR_STRDUP(path, src->path) < 0)
+            return -1;
     }
 
     if (virStoragePRDefIsManaged(src->pr) &&
@@ -129,7 +148,7 @@ qemuSetupImageCgroupInternal(virDomainObjPtr vm,
         qemuSetupImagePathCgroup(vm, QEMU_DEVICE_MAPPER_CONTROL_PATH, false) < 0)
         return -1;
 
-    return qemuSetupImagePathCgroup(vm, src->path, src->readonly || forceReadonly);
+    return qemuSetupImagePathCgroup(vm, path, readonly);
 }
 
 
@@ -146,6 +165,7 @@ qemuTeardownImageCgroup(virDomainObjPtr vm,
                         virStorageSourcePtr src)
 {
     qemuDomainObjPrivatePtr priv = vm->privateData;
+    VIR_AUTOFREE(char *) path = NULL;
     int perms = VIR_CGROUP_DEVICE_RWM;
     size_t i;
     int ret;
@@ -154,10 +174,25 @@ qemuTeardownImageCgroup(virDomainObjPtr vm,
                                 VIR_CGROUP_CONTROLLER_DEVICES))
         return 0;
 
-    if (!src->path || !virStorageSourceIsLocalStorage(src)) {
-        VIR_DEBUG("Not updating cgroups for disk path '%s', type: %s",
-                  NULLSTR(src->path), virStorageTypeToString(src->type));
-        return 0;
+    if (src->type == VIR_STORAGE_TYPE_NVME) {
+        if (!(path = qemuDomainGetNVMeDiskPath(src->nvme)))
+            return -1;
+
+        ret = virCgroupDenyDevicePath(priv->cgroup, QEMU_DEV_VFIO, perms, true);
+        virDomainAuditCgroupPath(vm, priv->cgroup, "deny",
+                                 QEMU_DEV_VFIO,
+                                 virCgroupGetDevicePermsString(perms), ret);
+        if (ret < 0)
+            return -1;
+    } else {
+        if (!src->path || !virStorageSourceIsLocalStorage(src)) {
+            VIR_DEBUG("Not updating cgroups for disk path '%s', type: %s",
+                      NULLSTR(src->path), virStorageTypeToString(src->type));
+            return 0;
+        }
+
+        if (VIR_STRDUP(path, src->path) < 0)
+            return -1;
     }
 
     if (virFileExists(QEMU_DEVICE_MAPPER_CONTROL_PATH)) {
@@ -184,11 +219,11 @@ qemuTeardownImageCgroup(virDomainObjPtr vm,
         }
     }
 
-    VIR_DEBUG("Deny path %s", src->path);
+    VIR_DEBUG("Deny path %s", path);
 
-    ret = virCgroupDenyDevicePath(priv->cgroup, src->path, perms, true);
+    ret = virCgroupDenyDevicePath(priv->cgroup, path, perms, true);
 
-    virDomainAuditCgroupPath(vm, priv->cgroup, "deny", src->path,
+    virDomainAuditCgroupPath(vm, priv->cgroup, "deny", path,
                              virCgroupGetDevicePermsString(perms), ret);
 
     /* If you're looking for a counter part to
-- 
2.21.0
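The read-only caveat from the commit message boils down to one small decision, sketched below in isolation. The enums and `image_cgroup_perms()` are hypothetical and only model the rule stated in the patch; they are not the libvirt cgroup API.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced set of storage source types. */
enum src_type {
    SRC_TYPE_FILE,
    SRC_TYPE_BLOCK,
    SRC_TYPE_NVME,
};

/* Hypothetical permission bits for a device cgroup rule. */
enum dev_perm {
    DEV_PERM_READ = 1 << 0,
    DEV_PERM_WRITE = 1 << 1,
};

/* Compute the permissions to allow for a disk source.
 * For NVMe the read-only request is ignored because, per the
 * commit message, qemu issues ioctl()s on open that count as
 * writes; every other type honours the read-only flags. */
static int
image_cgroup_perms(enum src_type type, bool src_readonly, bool force_readonly)
{
    bool readonly = src_readonly || force_readonly;

    assert(type <= SRC_TYPE_NVME);

    if (type == SRC_TYPE_NVME)
        readonly = false;

    return readonly ? DEV_PERM_READ : (DEV_PERM_READ | DEV_PERM_WRITE);
}
```

A read-only block device would get read access only, while a read-only NVMe disk would still get read and write access to its VFIO group node.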

On Thu, Jul 11, 2019 at 17:54:12 +0200, Michal Privoznik wrote:
If a domain has an NVMe disk configured, then we need to allow it on devices CGroup so that qemu can access it. There is one caveat though - if an NVMe disk is read only we need CGroup to allow write too. This is because when opening the device, qemu does couple of ioctl()-s which are considered as write.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_cgroup.c | 59 +++++++++++++++++++++++++++++++++--------- 1 file changed, 47 insertions(+), 12 deletions(-)
diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c index 19ca60905a..2a7fc07ac7 100644 --- a/src/qemu/qemu_cgroup.c +++ b/src/qemu/qemu_cgroup.c @@ -118,10 +118,29 @@ qemuSetupImageCgroupInternal(virDomainObjPtr vm, virStorageSourcePtr src, bool forceReadonly) { - if (!src->path || !virStorageSourceIsLocalStorage(src)) { - VIR_DEBUG("Not updating cgroups for disk path '%s', type: %s", - NULLSTR(src->path), virStorageTypeToString(src->type)); - return 0; + VIR_AUTOFREE(char *) path = NULL; + bool readonly = src->readonly || forceReadonly; + + if (src->type == VIR_STORAGE_TYPE_NVME) { + /* Even though disk is R/O we can't make it so in + * CGroups. QEMU will try to do some ioctl()-s over the + * device and such operations are R/W. */ + readonly = false;
Yeah, that should be fine. We tell qemu to open it R/O afterwards and we can't do better here.
+ + if (!(path = qemuDomainGetNVMeDiskPath(src->nvme))) + return -1; + + if (qemuSetupImagePathCgroup(vm, QEMU_DEV_VFIO, false) < 0) + return -1; + } else { + if (!src->path || !virStorageSourceIsLocalStorage(src)) { + VIR_DEBUG("Not updating cgroups for disk path '%s', type: %s", + NULLSTR(src->path), virStorageTypeToString(src->type)); + return 0; + } + + if (VIR_STRDUP(path, src->path) < 0) + return -1; }
if (virStoragePRDefIsManaged(src->pr) && @@ -129,7 +148,7 @@ qemuSetupImageCgroupInternal(virDomainObjPtr vm, qemuSetupImagePathCgroup(vm, QEMU_DEVICE_MAPPER_CONTROL_PATH, false) < 0) return -1;
- return qemuSetupImagePathCgroup(vm, src->path, src->readonly || forceReadonly); + return qemuSetupImagePathCgroup(vm, path, readonly); }
[...]
@@ -154,10 +174,25 @@ qemuTeardownImageCgroup(virDomainObjPtr vm, VIR_CGROUP_CONTROLLER_DEVICES)) return 0;
- if (!src->path || !virStorageSourceIsLocalStorage(src)) { - VIR_DEBUG("Not updating cgroups for disk path '%s', type: %s", - NULLSTR(src->path), virStorageTypeToString(src->type)); - return 0; + if (src->type == VIR_STORAGE_TYPE_NVME) { + if (!(path = qemuDomainGetNVMeDiskPath(src->nvme))) + return -1; + + ret = virCgroupDenyDevicePath(priv->cgroup, QEMU_DEV_VFIO, perms, true); + virDomainAuditCgroupPath(vm, priv->cgroup, "deny", + QEMU_DEV_VFIO, + virCgroupGetDevicePermsString(perms), ret);
What if the IOMMU group is shared by another disk, or perhaps even a hostdev?
+ if (ret < 0) + return -1; + } else { + if (!src->path || !virStorageSourceIsLocalStorage(src)) { + VIR_DEBUG("Not updating cgroups for disk path '%s', type: %s", + NULLSTR(src->path), virStorageTypeToString(src->type)); + return 0; + } + + if (VIR_STRDUP(path, src->path) < 0) + return -1; }
if (virFileExists(QEMU_DEVICE_MAPPER_CONTROL_PATH)) {
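One way to address the shared IOMMU group question raised above would be to check, before denying a /dev/vfio/N path, whether any other device of the domain still lives in the same group. The sketch below assumes a hypothetical flat list of VFIO users (`struct vfio_user`) and a hypothetical helper `iommu_group_still_used()`; libvirt has no such structures in this form.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record for one VFIO-backed device of a domain
 * (either a PCI hostdev or an NVMe disk). */
struct vfio_user {
    int iommu_group;    /* N in /dev/vfio/N */
};

/* Return true if a device other than users[skip] is a member of
 * IOMMU group @group, i.e. the group node must stay accessible. */
static bool
iommu_group_still_used(const struct vfio_user *users,
                       size_t nusers,
                       size_t skip,
                       int group)
{
    size_t i;

    assert(users || nusers == 0);

    for (i = 0; i < nusers; i++) {
        if (i != skip && users[i].iommu_group == group)
            return true;
    }

    return false;
}
```

The teardown path would then deny the group device only when this returns false for the disk being removed.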

This function calls virSecuritySELinuxSetFilecon() or
virSecuritySELinuxSetFileconOptional() from a lot of places. It
works, because in all places we're passing src->path which is
what we wanted. But not anymore. We will want to be able to pass
a different path and thus the function must be reworked a bit.

Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
 src/security/security_selinux.c | 39 ++++++++++++++-------------------
 1 file changed, 17 insertions(+), 22 deletions(-)

diff --git a/src/security/security_selinux.c b/src/security/security_selinux.c
index ea20373a90..99cef3f212 100644
--- a/src/security/security_selinux.c
+++ b/src/security/security_selinux.c
@@ -1820,7 +1820,10 @@ virSecuritySELinuxSetImageLabelInternal(virSecurityManagerPtr mgr,
     virSecurityDeviceLabelDefPtr disk_seclabel;
     virSecurityDeviceLabelDefPtr parent_seclabel = NULL;
     bool remember;
-    int ret;
+    const char *path = src->path;
+    const char *tcon = NULL;
+    bool optional = false;
+    int ret = -1;
 
     if (!src->path || !virStorageSourceIsLocalStorage(src))
         return 0;
@@ -1853,40 +1856,32 @@ virSecuritySELinuxSetImageLabelInternal(virSecurityManagerPtr mgr,
         if (!disk_seclabel->relabel)
             return 0;
 
-        ret = virSecuritySELinuxSetFilecon(mgr, src->path,
-                                           disk_seclabel->label, remember);
+        tcon = disk_seclabel->label;
     } else if (parent_seclabel && (!parent_seclabel->relabel || parent_seclabel->label)) {
         if (!parent_seclabel->relabel)
             return 0;
 
-        ret = virSecuritySELinuxSetFilecon(mgr, src->path,
-                                           parent_seclabel->label, remember);
+        tcon = parent_seclabel->label;
     } else if (!parent || parent == src) {
         if (src->shared) {
-            ret = virSecuritySELinuxSetFileconOptional(mgr,
-                                                       src->path,
-                                                       data->file_context,
-                                                       remember);
+            tcon = data->file_context;
+            optional = true;
         } else if (src->readonly) {
-            ret = virSecuritySELinuxSetFileconOptional(mgr,
-                                                       src->path,
-                                                       data->content_context,
-                                                       remember);
+            tcon = data->content_context;
+            optional = true;
         } else if (secdef->imagelabel) {
-            ret = virSecuritySELinuxSetFileconOptional(mgr,
-                                                       src->path,
-                                                       secdef->imagelabel,
-                                                       remember);
+            tcon = secdef->imagelabel;
+            optional = true;
         } else {
-            ret = 0;
+            return 0;
         }
     } else {
-        ret = virSecuritySELinuxSetFileconOptional(mgr,
-                                                   src->path,
-                                                   data->content_context,
-                                                   remember);
+        optional = true;
+        tcon = data->content_context;
     }
 
+    ret = virSecuritySELinuxSetFileconHelper(mgr, path, tcon, optional, remember);
+
     if (ret == 1 && !disk_seclabel) {
         /* If we failed to set a label, but virt_use_nfs let us
          * proceed anyway, then we don't need to relabel later. */
-- 
2.21.0

On Thu, Jul 11, 2019 at 17:54:13 +0200, Michal Privoznik wrote:
This function calls virSecuritySELinuxSetFilecon() or virSecuritySELinuxSetFileconOptional() from a lot of places. It works, because in all places we're passing src->path which is what we wanted. But not anymore. We will want to be able to pass a different path and thus the function must be reworked a bit.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/security/security_selinux.c | 39 ++++++++++++++------------------- 1 file changed, 17 insertions(+), 22 deletions(-)
ACK

This function is currently not called for any type of storage source that is not considered 'local' (as defined by virStorageSourceIsLocalStorage()). Well, NVMe disks are not 'local' from that point of view and therefore we will need to call this function more frequently. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/security/security_dac.c | 38 ++++++++++++++++++++++ src/security/security_selinux.c | 56 ++++++++++++++++++++++++++++----- 2 files changed, 87 insertions(+), 7 deletions(-) diff --git a/src/security/security_dac.c b/src/security/security_dac.c index 137daf5d28..d0b84c99b0 100644 --- a/src/security/security_dac.c +++ b/src/security/security_dac.c @@ -912,6 +912,23 @@ virSecurityDACSetImageLabelInternal(virSecurityManagerPtr mgr, return -1; } + /* This is not very clean. But so far we don't have NVMe + * storage pool backend so that its chownCallback would be + * called. And this place looks least offensive. */ + if (src->type == VIR_STORAGE_TYPE_NVME) { + const virStorageSourceNVMeDef *nvme = src->nvme; + VIR_AUTOFREE(char *) vfioGroupDev = NULL; + VIR_AUTOUNREF(virPCIDevicePtr) pci = virPCIDeviceNew(nvme->pciAddr.domain, + nvme->pciAddr.bus, + nvme->pciAddr.slot, + nvme->pciAddr.function); + if (!pci || + !(vfioGroupDev = virPCIDeviceGetIOMMUGroupDev(pci))) + return -1; + + return virSecurityDACSetOwnership(mgr, NULL, vfioGroupDev, user, group, false); + } + /* We can't do restore on shared resources safely. Not even * with refcounting implemented in XATTRs because if there * was a domain running with the feature turned off the @@ -1001,6 +1018,27 @@ virSecurityDACRestoreImageLabelInt(virSecurityManagerPtr mgr, } } + /* This is not very clean. But so far we don't have NVMe + * storage pool backend so that its chownCallback would be + * called. And this place looks least offensive. 
*/ + if (src->type == VIR_STORAGE_TYPE_NVME) { + const virStorageSourceNVMeDef *nvme = src->nvme; + VIR_AUTOFREE(char *) vfioGroupDev = NULL; + VIR_AUTOUNREF(virPCIDevicePtr) pci = virPCIDeviceNew(nvme->pciAddr.domain, + nvme->pciAddr.bus, + nvme->pciAddr.slot, + nvme->pciAddr.function); + if (!pci || + !(vfioGroupDev = virPCIDeviceGetIOMMUGroupDev(pci))) + return -1; + + /* Ideally, we would check if there is not another PCI + * device within domain def that is in the same IOMMU + * group. But we're not doing that for hostdevs yet. */ + + return virSecurityDACRestoreFileLabelInternal(mgr, NULL, vfioGroupDev, false); + } + return virSecurityDACRestoreFileLabelInternal(mgr, src, NULL, true); } diff --git a/src/security/security_selinux.c b/src/security/security_selinux.c index 99cef3f212..a2e4dcb6da 100644 --- a/src/security/security_selinux.c +++ b/src/security/security_selinux.c @@ -1751,9 +1751,8 @@ virSecuritySELinuxRestoreImageLabelInt(virSecurityManagerPtr mgr, { virSecurityLabelDefPtr seclabel; virSecurityDeviceLabelDefPtr disk_seclabel; - - if (!src->path || !virStorageSourceIsLocalStorage(src)) - return 0; + VIR_AUTOFREE(char *) vfioGroupDev = NULL; + const char *path = src->path; seclabel = virDomainDefGetSecurityLabelDef(def, SECURITY_SELINUX_NAME); if (seclabel == NULL) @@ -1785,9 +1784,16 @@ virSecuritySELinuxRestoreImageLabelInt(virSecurityManagerPtr mgr, * ownership, because that kills access on the destination host which is * sub-optimal for the guest VM's I/O attempts :-) */ if (migrated) { - int rc = virFileIsSharedFS(src->path); - if (rc < 0) - return -1; + int rc = 1; + + if (virStorageSourceIsLocalStorage(src)) { + if (!src->path) + return 0; + + if ((rc = virFileIsSharedFS(src->path)) < 0) + return -1; + } + if (rc == 1) { VIR_DEBUG("Skipping image label restore on %s because FS is shared", src->path); @@ -1795,7 +1801,26 @@ virSecuritySELinuxRestoreImageLabelInt(virSecurityManagerPtr mgr, } } - return virSecuritySELinuxRestoreFileLabel(mgr, 
src->path, true); + /* This is not very clean. But so far we don't have NVMe + * storage pool backend so that its chownCallback would be + * called. And this place looks least offensive. */ + if (src->type == VIR_STORAGE_TYPE_NVME) { + const virStorageSourceNVMeDef *nvme = src->nvme; + VIR_AUTOUNREF(virPCIDevicePtr) pci = virPCIDeviceNew(nvme->pciAddr.domain, + nvme->pciAddr.bus, + nvme->pciAddr.slot, + nvme->pciAddr.function); + if (!pci || + !(vfioGroupDev = virPCIDeviceGetIOMMUGroupDev(pci))) + return -1; + + /* Ideally, we would check if there is not another PCI + * device within domain def that is in the same IOMMU + * group. But we're not doing that for hostdevs yet. */ + path = vfioGroupDev; + } + + return virSecuritySELinuxRestoreFileLabel(mgr, path, true); } @@ -1820,6 +1845,7 @@ virSecuritySELinuxSetImageLabelInternal(virSecurityManagerPtr mgr, virSecurityDeviceLabelDefPtr disk_seclabel; virSecurityDeviceLabelDefPtr parent_seclabel = NULL; bool remember; + VIR_AUTOFREE(char *) vfioGroupDev = NULL; const char *path = src->path; const char *tcon = NULL; bool optional = false; @@ -1880,6 +1906,22 @@ virSecuritySELinuxSetImageLabelInternal(virSecurityManagerPtr mgr, tcon = data->content_context; } + /* This is not very clean. But so far we don't have NVMe + * storage pool backend so that its chownCallback would be + * called. And this place looks least offensive. */ + if (src->type == VIR_STORAGE_TYPE_NVME) { + const virStorageSourceNVMeDef *nvme = src->nvme; + VIR_AUTOUNREF(virPCIDevicePtr) pci = virPCIDeviceNew(nvme->pciAddr.domain, + nvme->pciAddr.bus, + nvme->pciAddr.slot, + nvme->pciAddr.function); + if (!pci || + !(vfioGroupDev = virPCIDeviceGetIOMMUGroupDev(pci))) + return -1; + + path = vfioGroupDev; + } + ret = virSecuritySELinuxSetFileconHelper(mgr, path, tcon, optional, remember); if (ret == 1 && !disk_seclabel) { -- 2.21.0

The summary is misleading. Mention NVMe, and since you are fixing two functions, don't mention the function name.

On Thu, Jul 11, 2019 at 17:54:14 +0200, Michal Privoznik wrote:
This function is currently not called for any type of storage source that is not considered 'local' (as defined by virStorageSourceIsLocalStorage()). Well, NVMe disks are not 'local' from that point of view and therefore we will need to call this function more frequently.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/security/security_dac.c | 38 ++++++++++++++++++++++ src/security/security_selinux.c | 56 ++++++++++++++++++++++++++++----- 2 files changed, 87 insertions(+), 7 deletions(-)
diff --git a/src/security/security_dac.c b/src/security/security_dac.c index 137daf5d28..d0b84c99b0 100644 --- a/src/security/security_dac.c +++ b/src/security/security_dac.c @@ -912,6 +912,23 @@ virSecurityDACSetImageLabelInternal(virSecurityManagerPtr mgr, return -1; }
+ /* This is not very clean. But so far we don't have NVMe + * storage pool backend so that its chownCallback would be + * called. And this place looks least offensive. */ + if (src->type == VIR_STORAGE_TYPE_NVME) { + const virStorageSourceNVMeDef *nvme = src->nvme; + VIR_AUTOFREE(char *) vfioGroupDev = NULL; + VIR_AUTOUNREF(virPCIDevicePtr) pci = virPCIDeviceNew(nvme->pciAddr.domain,
This is not a virObject.
+ nvme->pciAddr.bus, + nvme->pciAddr.slot, + nvme->pciAddr.function); + if (!pci || + !(vfioGroupDev = virPCIDeviceGetIOMMUGroupDev(pci))) + return -1; + + return virSecurityDACSetOwnership(mgr, NULL, vfioGroupDev, user, group, false); + } + /* We can't do restore on shared resources safely. Not even * with refcounting implemented in XATTRs because if there * was a domain running with the feature turned off the @@ -1001,6 +1018,27 @@ virSecurityDACRestoreImageLabelInt(virSecurityManagerPtr mgr, } }
+ /* This is not very clean. But so far we don't have NVMe + * storage pool backend so that its chownCallback would be + * called. And this place looks least offensive. */ + if (src->type == VIR_STORAGE_TYPE_NVME) { + const virStorageSourceNVMeDef *nvme = src->nvme; + VIR_AUTOFREE(char *) vfioGroupDev = NULL; + VIR_AUTOUNREF(virPCIDevicePtr) pci = virPCIDeviceNew(nvme->pciAddr.domain,
Same as above.
+ nvme->pciAddr.bus, + nvme->pciAddr.slot, + nvme->pciAddr.function); + if (!pci || + !(vfioGroupDev = virPCIDeviceGetIOMMUGroupDev(pci))) + return -1; + + /* Ideally, we would check if there is not another PCI + * device within domain def that is in the same IOMMU + * group. But we're not doing that for hostdevs yet. */ + + return virSecurityDACRestoreFileLabelInternal(mgr, NULL, vfioGroupDev, false); + } + return virSecurityDACRestoreFileLabelInternal(mgr, src, NULL, true); }
diff --git a/src/security/security_selinux.c b/src/security/security_selinux.c index 99cef3f212..a2e4dcb6da 100644 --- a/src/security/security_selinux.c +++ b/src/security/security_selinux.c @@ -1751,9 +1751,8 @@ virSecuritySELinuxRestoreImageLabelInt(virSecurityManagerPtr mgr, { virSecurityLabelDefPtr seclabel; virSecurityDeviceLabelDefPtr disk_seclabel; - - if (!src->path || !virStorageSourceIsLocalStorage(src)) - return 0; + VIR_AUTOFREE(char *) vfioGroupDev = NULL; + const char *path = src->path;
seclabel = virDomainDefGetSecurityLabelDef(def, SECURITY_SELINUX_NAME); if (seclabel == NULL) @@ -1785,9 +1784,16 @@ virSecuritySELinuxRestoreImageLabelInt(virSecurityManagerPtr mgr, * ownership, because that kills access on the destination host which is * sub-optimal for the guest VM's I/O attempts :-) */ if (migrated) { - int rc = virFileIsSharedFS(src->path); - if (rc < 0) - return -1; + int rc = 1; + + if (virStorageSourceIsLocalStorage(src)) { + if (!src->path) + return 0; + + if ((rc = virFileIsSharedFS(src->path)) < 0) + return -1; + } + if (rc == 1) { VIR_DEBUG("Skipping image label restore on %s because FS is shared", src->path); @@ -1795,7 +1801,26 @@ virSecuritySELinuxRestoreImageLabelInt(virSecurityManagerPtr mgr, } }
- return virSecuritySELinuxRestoreFileLabel(mgr, src->path, true); + /* This is not very clean. But so far we don't have NVMe + * storage pool backend so that its chownCallback would be + * called. And this place looks least offensive. */ + if (src->type == VIR_STORAGE_TYPE_NVME) { + const virStorageSourceNVMeDef *nvme = src->nvme; + VIR_AUTOUNREF(virPCIDevicePtr) pci = virPCIDeviceNew(nvme->pciAddr.domain,
Same as above.
+ nvme->pciAddr.bus, + nvme->pciAddr.slot, + nvme->pciAddr.function); + if (!pci || + !(vfioGroupDev = virPCIDeviceGetIOMMUGroupDev(pci))) + return -1; + + /* Ideally, we would check if there is not another PCI + * device within domain def that is in the same IOMMU + * group. But we're not doing that for hostdevs yet. */ + path = vfioGroupDev; + } + + return virSecuritySELinuxRestoreFileLabel(mgr, path, true); }
@@ -1820,6 +1845,7 @@ virSecuritySELinuxSetImageLabelInternal(virSecurityManagerPtr mgr, virSecurityDeviceLabelDefPtr disk_seclabel; virSecurityDeviceLabelDefPtr parent_seclabel = NULL; bool remember; + VIR_AUTOFREE(char *) vfioGroupDev = NULL; const char *path = src->path; const char *tcon = NULL; bool optional = false; @@ -1880,6 +1906,22 @@ virSecuritySELinuxSetImageLabelInternal(virSecurityManagerPtr mgr, tcon = data->content_context; }
+ /* This is not very clean. But so far we don't have NVMe + * storage pool backend so that its chownCallback would be + * called. And this place looks least offensive. */ + if (src->type == VIR_STORAGE_TYPE_NVME) { + const virStorageSourceNVMeDef *nvme = src->nvme; + VIR_AUTOUNREF(virPCIDevicePtr) pci = virPCIDeviceNew(nvme->pciAddr.domain,
Same as above.
+ nvme->pciAddr.bus, + nvme->pciAddr.slot, + nvme->pciAddr.function); + if (!pci || + !(vfioGroupDev = virPCIDeviceGetIOMMUGroupDev(pci))) + return -1; + + path = vfioGroupDev; + } + ret = virSecuritySELinuxSetFileconHelper(mgr, path, tcon, optional, remember);
if (ret == 1 && !disk_seclabel) { -- 2.21.0
-- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
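For readers following the labeling hunks above: the path that ends up being chowned or relabeled is the VFIO group node, not a /dev/nvme* node. A minimal sketch of how such a path can be derived is below, assuming the group number is the last component of the device's sysfs "iommu_group" symlink target. The helper name exampleVFIOGroupDevPath is hypothetical and is not libvirt API; libvirt's own virPCIDeviceGetIOMMUGroupDev() may well differ in detail.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not libvirt API): build the VFIO group node path
 * from the target of the sysfs "iommu_group" symlink.
 * Returns 0 on success, -1 if the target carries no group number or
 * the output buffer is too small. */
static int
exampleVFIOGroupDevPath(const char *linkTarget, char *out, size_t outlen)
{
    const char *group;
    int n;

    assert(linkTarget && out);

    /* The group number is the last path component of the link target. */
    group = strrchr(linkTarget, '/');
    group = group ? group + 1 : linkTarget;
    if (*group == '\0')
        return -1;

    n = snprintf(out, outlen, "/dev/vfio/%s", group);
    if (n < 0 || (size_t)n >= outlen)
        return -1;

    return 0;
}
```

Since every device in an IOMMU group shares the one group node, restoring its label affects any other device of that group still assigned to the domain, which is what the "Ideally, we would check" comment in the patch refers to.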

This capability tracks if qemu is capable of: -drive file.driver=nvme The feature was added in QEMU's commit of v2.12.0-rc0~104^2~2. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_capabilities.c | 4 ++++ src/qemu/qemu_capabilities.h | 3 +++ tests/qemucapabilitiesdata/caps_2.12.0.aarch64.xml | 1 + tests/qemucapabilitiesdata/caps_2.12.0.ppc64.xml | 1 + tests/qemucapabilitiesdata/caps_2.12.0.s390x.xml | 1 + tests/qemucapabilitiesdata/caps_2.12.0.x86_64.xml | 1 + tests/qemucapabilitiesdata/caps_3.0.0.ppc64.xml | 1 + tests/qemucapabilitiesdata/caps_3.0.0.riscv32.xml | 1 + tests/qemucapabilitiesdata/caps_3.0.0.riscv64.xml | 1 + tests/qemucapabilitiesdata/caps_3.0.0.s390x.xml | 1 + tests/qemucapabilitiesdata/caps_3.0.0.x86_64.xml | 1 + tests/qemucapabilitiesdata/caps_3.1.0.ppc64.xml | 1 + tests/qemucapabilitiesdata/caps_3.1.0.x86_64.xml | 1 + tests/qemucapabilitiesdata/caps_4.0.0.aarch64.xml | 1 + tests/qemucapabilitiesdata/caps_4.0.0.ppc64.xml | 1 + tests/qemucapabilitiesdata/caps_4.0.0.riscv32.xml | 1 + tests/qemucapabilitiesdata/caps_4.0.0.riscv64.xml | 1 + tests/qemucapabilitiesdata/caps_4.0.0.s390x.xml | 1 + tests/qemucapabilitiesdata/caps_4.0.0.x86_64.xml | 1 + tests/qemucapabilitiesdata/caps_4.1.0.x86_64.xml | 1 + 20 files changed, 25 insertions(+) diff --git a/src/qemu/qemu_capabilities.c b/src/qemu/qemu_capabilities.c index 02e84edc15..87f506b09e 100644 --- a/src/qemu/qemu_capabilities.c +++ b/src/qemu/qemu_capabilities.c @@ -533,6 +533,9 @@ VIR_ENUM_IMPL(virQEMUCaps, "x86-max-cpu", "cpu-unavailable-features", "canonical-cpu-features", + + /* 335 */ + "drive-nvme", ); @@ -1274,6 +1277,7 @@ static struct virQEMUCapsStringFlags virQEMUCapsQMPSchemaQueries[] = { { "query-iothreads/ret-type/poll-max-ns", QEMU_CAPS_IOTHREAD_POLLING }, { "query-display-options/ret-type/+egl-headless/rendernode", QEMU_CAPS_EGL_HEADLESS_RENDERNODE }, { "nbd-server-add/arg-type/bitmap", QEMU_CAPS_NBD_BITMAP }, + { "blockdev-add/arg-type/+nvme", 
QEMU_CAPS_DRIVE_NVME }, }; typedef struct _virQEMUCapsObjectTypeProps virQEMUCapsObjectTypeProps; diff --git a/src/qemu/qemu_capabilities.h b/src/qemu/qemu_capabilities.h index 915ba6cb2e..fc288b023c 100644 --- a/src/qemu/qemu_capabilities.h +++ b/src/qemu/qemu_capabilities.h @@ -515,6 +515,9 @@ typedef enum { /* virQEMUCapsFlags grouping marker for syntax-check */ QEMU_CAPS_CPU_UNAVAILABLE_FEATURES, /* "unavailable-features" CPU property */ QEMU_CAPS_CANONICAL_CPU_FEATURES, /* avoid CPU feature aliases */ + /* 335 */ + QEMU_CAPS_DRIVE_NVME, /* -drive file.driver=nvme */ + QEMU_CAPS_LAST /* this must always be the last item */ } virQEMUCapsFlags; diff --git a/tests/qemucapabilitiesdata/caps_2.12.0.aarch64.xml b/tests/qemucapabilitiesdata/caps_2.12.0.aarch64.xml index 140da91b86..3b72024216 100644 --- a/tests/qemucapabilitiesdata/caps_2.12.0.aarch64.xml +++ b/tests/qemucapabilitiesdata/caps_2.12.0.aarch64.xml @@ -153,6 +153,7 @@ <flag name='memory-backend-memfd.hugetlb'/> <flag name='iothread.poll-max-ns'/> <flag name='memory-backend-file.align'/> + <flag name='drive-nvme'/> <version>2011090</version> <kvmVersion>0</kvmVersion> <microcodeVersion>61700807</microcodeVersion> diff --git a/tests/qemucapabilitiesdata/caps_2.12.0.ppc64.xml b/tests/qemucapabilitiesdata/caps_2.12.0.ppc64.xml index fd9ae0bcb8..d4ed7cea49 100644 --- a/tests/qemucapabilitiesdata/caps_2.12.0.ppc64.xml +++ b/tests/qemucapabilitiesdata/caps_2.12.0.ppc64.xml @@ -151,6 +151,7 @@ <flag name='memory-backend-memfd.hugetlb'/> <flag name='iothread.poll-max-ns'/> <flag name='memory-backend-file.align'/> + <flag name='drive-nvme'/> <version>2011090</version> <kvmVersion>0</kvmVersion> <microcodeVersion>42900807</microcodeVersion> diff --git a/tests/qemucapabilitiesdata/caps_2.12.0.s390x.xml b/tests/qemucapabilitiesdata/caps_2.12.0.s390x.xml index 2930381068..91aee9f838 100644 --- a/tests/qemucapabilitiesdata/caps_2.12.0.s390x.xml +++ b/tests/qemucapabilitiesdata/caps_2.12.0.s390x.xml @@ -121,6 +121,7 @@ 
<flag name='memory-backend-memfd.hugetlb'/> <flag name='iothread.poll-max-ns'/> <flag name='memory-backend-file.align'/> + <flag name='drive-nvme'/> <version>2012000</version> <kvmVersion>0</kvmVersion> <microcodeVersion>39100807</microcodeVersion> diff --git a/tests/qemucapabilitiesdata/caps_2.12.0.x86_64.xml b/tests/qemucapabilitiesdata/caps_2.12.0.x86_64.xml index 61b3602c48..96fa7c43c9 100644 --- a/tests/qemucapabilitiesdata/caps_2.12.0.x86_64.xml +++ b/tests/qemucapabilitiesdata/caps_2.12.0.x86_64.xml @@ -195,6 +195,7 @@ <flag name='iothread.poll-max-ns'/> <flag name='memory-backend-file.align'/> <flag name='x86-max-cpu'/> + <flag name='drive-nvme'/> <version>2011090</version> <kvmVersion>0</kvmVersion> <microcodeVersion>43100807</microcodeVersion> diff --git a/tests/qemucapabilitiesdata/caps_3.0.0.ppc64.xml b/tests/qemucapabilitiesdata/caps_3.0.0.ppc64.xml index 40718981a8..d848e55fc1 100644 --- a/tests/qemucapabilitiesdata/caps_3.0.0.ppc64.xml +++ b/tests/qemucapabilitiesdata/caps_3.0.0.ppc64.xml @@ -151,6 +151,7 @@ <flag name='memory-backend-memfd.hugetlb'/> <flag name='iothread.poll-max-ns'/> <flag name='memory-backend-file.align'/> + <flag name='drive-nvme'/> <version>2012050</version> <kvmVersion>0</kvmVersion> <microcodeVersion>42900757</microcodeVersion> diff --git a/tests/qemucapabilitiesdata/caps_3.0.0.riscv32.xml b/tests/qemucapabilitiesdata/caps_3.0.0.riscv32.xml index 865becc179..d50789711a 100644 --- a/tests/qemucapabilitiesdata/caps_3.0.0.riscv32.xml +++ b/tests/qemucapabilitiesdata/caps_3.0.0.riscv32.xml @@ -92,6 +92,7 @@ <flag name='memory-backend-memfd.hugetlb'/> <flag name='iothread.poll-max-ns'/> <flag name='memory-backend-file.align'/> + <flag name='drive-nvme'/> <version>3000000</version> <kvmVersion>0</kvmVersion> <microcodeVersion>0</microcodeVersion> diff --git a/tests/qemucapabilitiesdata/caps_3.0.0.riscv64.xml b/tests/qemucapabilitiesdata/caps_3.0.0.riscv64.xml index eb54aeaff3..06a2e98b90 100644 --- 
a/tests/qemucapabilitiesdata/caps_3.0.0.riscv64.xml +++ b/tests/qemucapabilitiesdata/caps_3.0.0.riscv64.xml @@ -92,6 +92,7 @@ <flag name='memory-backend-memfd.hugetlb'/> <flag name='iothread.poll-max-ns'/> <flag name='memory-backend-file.align'/> + <flag name='drive-nvme'/> <version>3000000</version> <kvmVersion>0</kvmVersion> <microcodeVersion>0</microcodeVersion> diff --git a/tests/qemucapabilitiesdata/caps_3.0.0.s390x.xml b/tests/qemucapabilitiesdata/caps_3.0.0.s390x.xml index d511377262..a8fb99c9fd 100644 --- a/tests/qemucapabilitiesdata/caps_3.0.0.s390x.xml +++ b/tests/qemucapabilitiesdata/caps_3.0.0.s390x.xml @@ -123,6 +123,7 @@ <flag name='memory-backend-memfd.hugetlb'/> <flag name='iothread.poll-max-ns'/> <flag name='memory-backend-file.align'/> + <flag name='drive-nvme'/> <version>3000000</version> <kvmVersion>0</kvmVersion> <microcodeVersion>39100757</microcodeVersion> diff --git a/tests/qemucapabilitiesdata/caps_3.0.0.x86_64.xml b/tests/qemucapabilitiesdata/caps_3.0.0.x86_64.xml index c6394db602..1c6fd4dd83 100644 --- a/tests/qemucapabilitiesdata/caps_3.0.0.x86_64.xml +++ b/tests/qemucapabilitiesdata/caps_3.0.0.x86_64.xml @@ -198,6 +198,7 @@ <flag name='memory-backend-file.align'/> <flag name='nvdimm.unarmed'/> <flag name='x86-max-cpu'/> + <flag name='drive-nvme'/> <version>3000000</version> <kvmVersion>0</kvmVersion> <microcodeVersion>43100757</microcodeVersion> diff --git a/tests/qemucapabilitiesdata/caps_3.1.0.ppc64.xml b/tests/qemucapabilitiesdata/caps_3.1.0.ppc64.xml index ee6921ff92..dc340838a8 100644 --- a/tests/qemucapabilitiesdata/caps_3.1.0.ppc64.xml +++ b/tests/qemucapabilitiesdata/caps_3.1.0.ppc64.xml @@ -156,6 +156,7 @@ <flag name='memory-backend-file.align'/> <flag name='memory-backend-file.pmem'/> <flag name='overcommit'/> + <flag name='drive-nvme'/> <version>3000091</version> <kvmVersion>0</kvmVersion> <microcodeVersion>42900758</microcodeVersion> diff --git a/tests/qemucapabilitiesdata/caps_3.1.0.x86_64.xml 
b/tests/qemucapabilitiesdata/caps_3.1.0.x86_64.xml index a8cb061bf3..f1517fa365 100644 --- a/tests/qemucapabilitiesdata/caps_3.1.0.x86_64.xml +++ b/tests/qemucapabilitiesdata/caps_3.1.0.x86_64.xml @@ -201,6 +201,7 @@ <flag name='nvdimm.unarmed'/> <flag name='overcommit'/> <flag name='x86-max-cpu'/> + <flag name='drive-nvme'/> <version>3000092</version> <kvmVersion>0</kvmVersion> <microcodeVersion>43100758</microcodeVersion> diff --git a/tests/qemucapabilitiesdata/caps_4.0.0.aarch64.xml b/tests/qemucapabilitiesdata/caps_4.0.0.aarch64.xml index 250b7edd52..d2966b6045 100644 --- a/tests/qemucapabilitiesdata/caps_4.0.0.aarch64.xml +++ b/tests/qemucapabilitiesdata/caps_4.0.0.aarch64.xml @@ -163,6 +163,7 @@ <flag name='machine.virt.iommu'/> <flag name='bitmap-merge'/> <flag name='nbd-bitmap'/> + <flag name='drive-nvme'/> <version>4000000</version> <kvmVersion>0</kvmVersion> <microcodeVersion>61700758</microcodeVersion> diff --git a/tests/qemucapabilitiesdata/caps_4.0.0.ppc64.xml b/tests/qemucapabilitiesdata/caps_4.0.0.ppc64.xml index 24b55002a6..4afbc543cc 100644 --- a/tests/qemucapabilitiesdata/caps_4.0.0.ppc64.xml +++ b/tests/qemucapabilitiesdata/caps_4.0.0.ppc64.xml @@ -168,6 +168,7 @@ <flag name='query-current-machine'/> <flag name='bitmap-merge'/> <flag name='nbd-bitmap'/> + <flag name='drive-nvme'/> <version>4000000</version> <kvmVersion>0</kvmVersion> <microcodeVersion>42900758</microcodeVersion> diff --git a/tests/qemucapabilitiesdata/caps_4.0.0.riscv32.xml b/tests/qemucapabilitiesdata/caps_4.0.0.riscv32.xml index 230e1e7c99..1b8137612a 100644 --- a/tests/qemucapabilitiesdata/caps_4.0.0.riscv32.xml +++ b/tests/qemucapabilitiesdata/caps_4.0.0.riscv32.xml @@ -166,6 +166,7 @@ <flag name='query-current-machine'/> <flag name='bitmap-merge'/> <flag name='nbd-bitmap'/> + <flag name='drive-nvme'/> <version>4000000</version> <kvmVersion>0</kvmVersion> <microcodeVersion>0</microcodeVersion> diff --git a/tests/qemucapabilitiesdata/caps_4.0.0.riscv64.xml 
b/tests/qemucapabilitiesdata/caps_4.0.0.riscv64.xml index 4b2f4cf628..bd06a816d8 100644 --- a/tests/qemucapabilitiesdata/caps_4.0.0.riscv64.xml +++ b/tests/qemucapabilitiesdata/caps_4.0.0.riscv64.xml @@ -166,6 +166,7 @@ <flag name='query-current-machine'/> <flag name='bitmap-merge'/> <flag name='nbd-bitmap'/> + <flag name='drive-nvme'/> <version>4000000</version> <kvmVersion>0</kvmVersion> <microcodeVersion>0</microcodeVersion> diff --git a/tests/qemucapabilitiesdata/caps_4.0.0.s390x.xml b/tests/qemucapabilitiesdata/caps_4.0.0.s390x.xml index a1ac2587a0..82326dec79 100644 --- a/tests/qemucapabilitiesdata/caps_4.0.0.s390x.xml +++ b/tests/qemucapabilitiesdata/caps_4.0.0.s390x.xml @@ -131,6 +131,7 @@ <flag name='query-current-machine'/> <flag name='bitmap-merge'/> <flag name='nbd-bitmap'/> + <flag name='drive-nvme'/> <version>4000000</version> <kvmVersion>0</kvmVersion> <microcodeVersion>39100758</microcodeVersion> diff --git a/tests/qemucapabilitiesdata/caps_4.0.0.x86_64.xml b/tests/qemucapabilitiesdata/caps_4.0.0.x86_64.xml index 716b756979..b8b46a7fa7 100644 --- a/tests/qemucapabilitiesdata/caps_4.0.0.x86_64.xml +++ b/tests/qemucapabilitiesdata/caps_4.0.0.x86_64.xml @@ -205,6 +205,7 @@ <flag name='bitmap-merge'/> <flag name='nbd-bitmap'/> <flag name='x86-max-cpu'/> + <flag name='drive-nvme'/> <version>4000000</version> <kvmVersion>0</kvmVersion> <microcodeVersion>43100758</microcodeVersion> diff --git a/tests/qemucapabilitiesdata/caps_4.1.0.x86_64.xml b/tests/qemucapabilitiesdata/caps_4.1.0.x86_64.xml index 9cbf65b405..6bbe0603d3 100644 --- a/tests/qemucapabilitiesdata/caps_4.1.0.x86_64.xml +++ b/tests/qemucapabilitiesdata/caps_4.1.0.x86_64.xml @@ -207,6 +207,7 @@ <flag name='x86-max-cpu'/> <flag name='cpu-unavailable-features'/> <flag name='canonical-cpu-features'/> + <flag name='drive-nvme'/> <version>4000050</version> <kvmVersion>0</kvmVersion> <microcodeVersion>43100759</microcodeVersion> -- 2.21.0

On Thu, Jul 11, 2019 at 17:54:15 +0200, Michal Privoznik wrote:
This capability tracks if qemu is capable of:
-drive file.driver=nvme
The feature was added in QEMU's commit of v2.12.0-rc0~104^2~2.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_capabilities.c | 4 ++++ src/qemu/qemu_capabilities.h | 3 +++ tests/qemucapabilitiesdata/caps_2.12.0.aarch64.xml | 1 + tests/qemucapabilitiesdata/caps_2.12.0.ppc64.xml | 1 + tests/qemucapabilitiesdata/caps_2.12.0.s390x.xml | 1 + tests/qemucapabilitiesdata/caps_2.12.0.x86_64.xml | 1 + tests/qemucapabilitiesdata/caps_3.0.0.ppc64.xml | 1 + tests/qemucapabilitiesdata/caps_3.0.0.riscv32.xml | 1 + tests/qemucapabilitiesdata/caps_3.0.0.riscv64.xml | 1 + tests/qemucapabilitiesdata/caps_3.0.0.s390x.xml | 1 + tests/qemucapabilitiesdata/caps_3.0.0.x86_64.xml | 1 + tests/qemucapabilitiesdata/caps_3.1.0.ppc64.xml | 1 + tests/qemucapabilitiesdata/caps_3.1.0.x86_64.xml | 1 + tests/qemucapabilitiesdata/caps_4.0.0.aarch64.xml | 1 + tests/qemucapabilitiesdata/caps_4.0.0.ppc64.xml | 1 + tests/qemucapabilitiesdata/caps_4.0.0.riscv32.xml | 1 + tests/qemucapabilitiesdata/caps_4.0.0.riscv64.xml | 1 + tests/qemucapabilitiesdata/caps_4.0.0.s390x.xml | 1 + tests/qemucapabilitiesdata/caps_4.0.0.x86_64.xml | 1 + tests/qemucapabilitiesdata/caps_4.1.0.x86_64.xml | 1 + 20 files changed, 25 insertions(+)
I've seen a few patches for the userspace NVMe stuff recently. Are you sure that it will work with qemu as old as 2.12?

On 7/11/19 6:15 PM, Peter Krempa wrote:
On Thu, Jul 11, 2019 at 17:54:15 +0200, Michal Privoznik wrote:
This capability tracks if qemu is capable of:
-drive file.driver=nvme
The feature was added in QEMU's commit of v2.12.0-rc0~104^2~2.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_capabilities.c | 4 ++++ src/qemu/qemu_capabilities.h | 3 +++ tests/qemucapabilitiesdata/caps_2.12.0.aarch64.xml | 1 + tests/qemucapabilitiesdata/caps_2.12.0.ppc64.xml | 1 + tests/qemucapabilitiesdata/caps_2.12.0.s390x.xml | 1 + tests/qemucapabilitiesdata/caps_2.12.0.x86_64.xml | 1 + tests/qemucapabilitiesdata/caps_3.0.0.ppc64.xml | 1 + tests/qemucapabilitiesdata/caps_3.0.0.riscv32.xml | 1 + tests/qemucapabilitiesdata/caps_3.0.0.riscv64.xml | 1 + tests/qemucapabilitiesdata/caps_3.0.0.s390x.xml | 1 + tests/qemucapabilitiesdata/caps_3.0.0.x86_64.xml | 1 + tests/qemucapabilitiesdata/caps_3.1.0.ppc64.xml | 1 + tests/qemucapabilitiesdata/caps_3.1.0.x86_64.xml | 1 + tests/qemucapabilitiesdata/caps_4.0.0.aarch64.xml | 1 + tests/qemucapabilitiesdata/caps_4.0.0.ppc64.xml | 1 + tests/qemucapabilitiesdata/caps_4.0.0.riscv32.xml | 1 + tests/qemucapabilitiesdata/caps_4.0.0.riscv64.xml | 1 + tests/qemucapabilitiesdata/caps_4.0.0.s390x.xml | 1 + tests/qemucapabilitiesdata/caps_4.0.0.x86_64.xml | 1 + tests/qemucapabilitiesdata/caps_4.1.0.x86_64.xml | 1 + 20 files changed, 25 insertions(+)
I've seen a few patches for the userspace NVMe stuff recently. Are you sure that it will work with qemu as old as 2.12?
No. But qemu reports it supports NVMe since 2.12. Unless we want to do a version check in addition to this, I'm not sure how to have the capability only for 'newer' qemus (and what 'newer' actually means). The worst thing that could happen is that qemu will fail to start / crash and libvirt reattaches the NVMe disk back to the host. But that's the risk with any feature, isn't it?
Michal
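To make the trade-off in this exchange concrete, here is a minimal sketch of a check that combines the probed capability with an optional minimum version. Everything in it is hypothetical (exampleQEMUInfo, exampleNVMeUsable, the minVersion parameter); it is not how libvirt's capability code is structured. The version encoding follows the <version> values in the test data above, e.g. 2012000 for 2.12.0.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for probed QEMU data; not libvirt's real type. */
typedef struct {
    bool driveNVMe;        /* QEMU advertises the nvme block driver */
    unsigned int version;  /* major * 1000000 + minor * 1000 + micro */
} exampleQEMUInfo;

/* Returns true if NVMe disks may be used. A minVersion of 0 means
 * "trust the probed capability alone". */
static bool
exampleNVMeUsable(const exampleQEMUInfo *info, unsigned int minVersion)
{
    assert(info);

    if (!info->driveNVMe)
        return false;

    return minVersion == 0 || info->version >= minVersion;
}
```

Passing 0 corresponds to the approach taken in the patch; any non-zero cut-off would be an arbitrary choice, which is the difficulty Michal points out.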

Now, that we have everything prepared, we can generate command line for NVMe disks. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_block.c | 25 ++++++++- src/qemu/qemu_command.c | 3 ++ src/qemu/qemu_process.c | 7 +++ .../disk-nvme.x86_64-latest.args | 52 +++++++++++++++++++ tests/qemuxml2argvtest.c | 1 + 5 files changed, 87 insertions(+), 1 deletion(-) create mode 100644 tests/qemuxml2argvdata/disk-nvme.x86_64-latest.args diff --git a/src/qemu/qemu_block.c b/src/qemu/qemu_block.c index 5eeb3757f1..bf49b7bc7c 100644 --- a/src/qemu/qemu_block.c +++ b/src/qemu/qemu_block.c @@ -992,6 +992,25 @@ qemuBlockStorageSourceGetVvfatProps(virStorageSourcePtr src) } +static virJSONValuePtr +qemuBlockStorageSourceGetNVMeProps(virStorageSourcePtr src) +{ + const virStorageSourceNVMeDef *nvme = src->nvme; + VIR_AUTOFREE(char *) pciAddr = NULL; + virJSONValuePtr ret = NULL; + + if (!(pciAddr = virPCIDeviceAddressAsString(&nvme->pciAddr))) + return NULL; + + ignore_value(virJSONValueObjectCreate(&ret, + "s:driver", "nvme", + "s:device", pciAddr, + "U:namespace", nvme->namespace, + NULL)); + return ret; +} + + static int qemuBlockStorageSourceGetBlockdevGetCacheProps(virStorageSourcePtr src, virJSONValuePtr props) @@ -1049,8 +1068,12 @@ qemuBlockStorageSourceGetBackendProps(virStorageSourcePtr src, return NULL; break; - case VIR_STORAGE_TYPE_VOLUME: case VIR_STORAGE_TYPE_NVME: + if (!(fileprops = qemuBlockStorageSourceGetNVMeProps(src))) + return NULL; + break; + + case VIR_STORAGE_TYPE_VOLUME: case VIR_STORAGE_TYPE_NONE: case VIR_STORAGE_TYPE_LAST: return NULL; diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index 927641cf46..e42377927e 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -1594,6 +1594,9 @@ qemuDiskSourceNeedsProps(virStorageSourcePtr src, src->haveTLS == VIR_TRISTATE_BOOL_YES) return true; + if (actualType == VIR_STORAGE_TYPE_NVME) + return true; + return false; } diff --git a/src/qemu/qemu_process.c 
b/src/qemu/qemu_process.c index aa09ef175a..ebe35e6363 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -5294,6 +5294,13 @@ qemuProcessStartValidateDisks(virDomainObjPtr vm, _("PowerPC pseries machines do not support floppy device")); return -1; } + + if (src->type == VIR_STORAGE_TYPE_NVME && + !virQEMUCapsGet(qemuCaps, QEMU_CAPS_DRIVE_NVME)) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("NVMe disks are not supported with this QEMU binary")); + return -1; + } } return 0; diff --git a/tests/qemuxml2argvdata/disk-nvme.x86_64-latest.args b/tests/qemuxml2argvdata/disk-nvme.x86_64-latest.args new file mode 100644 index 0000000000..5ff41de7b9 --- /dev/null +++ b/tests/qemuxml2argvdata/disk-nvme.x86_64-latest.args @@ -0,0 +1,52 @@ +LC_ALL=C \ +PATH=/bin \ +HOME=/tmp/lib/domain--1-QEMUGuest1 \ +USER=test \ +LOGNAME=test \ +XDG_DATA_HOME=/tmp/lib/domain--1-QEMUGuest1/.local/share \ +XDG_CACHE_HOME=/tmp/lib/domain--1-QEMUGuest1/.cache \ +XDG_CONFIG_HOME=/tmp/lib/domain--1-QEMUGuest1/.config \ +QEMU_AUDIO_DRV=none \ +/usr/bin/qemu-system-i686 \ +-name guest=QEMUGuest1,debug-threads=on \ +-S \ +-object secret,id=masterKey0,format=raw,\ +file=/tmp/lib/domain--1-QEMUGuest1/master-key.aes \ +-machine pc,accel=tcg,usb=off,dump-guest-core=off \ +-m 214 \ +-overcommit mem-lock=off \ +-smp 1,sockets=1,cores=1,threads=1 \ +-uuid c7a5fdbd-edaf-9455-926a-d65c16db1809 \ +-display none \ +-no-user-config \ +-nodefaults \ +-chardev socket,id=charmonitor,fd=1729,server,nowait \ +-mon chardev=charmonitor,id=monitor,mode=control \ +-rtc base=utc \ +-no-shutdown \ +-no-acpi \ +-boot strict=on \ +-device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 \ +-device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x3 \ +-drive file.driver=nvme,file.device=0000:01:00.0,file.namespace=1,format=raw,\ +if=none,id=drive-virtio-disk0 \ +-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,\ +id=virtio-disk0,bootindex=1 \ +-drive 
file.driver=nvme,file.device=0000:01:00.0,file.namespace=2,format=raw,\ +if=none,id=drive-virtio-disk1 \ +-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk1,\ +id=virtio-disk1 \ +-drive file.driver=nvme,file.device=0000:02:00.0,file.namespace=1,format=raw,\ +if=none,id=drive-virtio-disk2 \ +-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk2,\ +id=virtio-disk2 \ +-object secret,id=virtio-disk3-luks-secret0,\ +data=9eao5F8qtkGt+seB1HYivWIxbtwUu6MQtg1zpj/oDtUsPr1q8wBYM91uEHCn6j/1,\ +keyid=masterKey0,iv=AAECAwQFBgcICQoLDA0ODw==,format=base64 \ +-drive file.driver=nvme,file.device=0001:02:00.0,file.namespace=2,\ +key-secret=virtio-disk3-luks-secret0,format=luks,if=none,id=drive-virtio-disk3 \ +-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk3,\ +id=virtio-disk3 \ +-sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,\ +resourcecontrol=deny \ +-msg timestamp=on diff --git a/tests/qemuxml2argvtest.c b/tests/qemuxml2argvtest.c index 0ac128be00..d4a16c35a6 100644 --- a/tests/qemuxml2argvtest.c +++ b/tests/qemuxml2argvtest.c @@ -1068,6 +1068,7 @@ mymain(void) driver.config->vxhsTLS = 0; VIR_FREE(driver.config->vxhsTLSx509certdir); DO_TEST("disk-no-boot", NONE); + DO_TEST_CAPS_LATEST("disk-nvme"); DO_TEST_PARSE_ERROR("disk-device-lun-type-invalid", QEMU_CAPS_VIRTIO_SCSI); DO_TEST_FAILURE("disk-usb-nosupport", NONE); -- 2.21.0
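As an illustration of the -drive strings in the expected .args file above, the following sketch formats a PCI address and namespace into the same shape. The type examplePCIAddr and the function exampleFormatNVMeDrive are made up for this example; the patch itself builds JSON props via virJSONValueObjectCreate() and leaves the flattening to existing code. Only the raw-format case is covered.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for a PCI address; not libvirt's type. */
typedef struct {
    unsigned int domain;
    unsigned int bus;
    unsigned int slot;
    unsigned int function;
} examplePCIAddr;

/* Format a raw-format NVMe -drive argument into @out.
 * Returns 0 on success, -1 on truncation or formatting error. */
static int
exampleFormatNVMeDrive(const examplePCIAddr *addr,
                       unsigned long long ns,
                       const char *alias,
                       char *out, size_t outlen)
{
    int n;

    assert(addr && alias && out);

    /* The PCI address is printed as DDDD:BB:SS.F in hexadecimal. */
    n = snprintf(out, outlen,
                 "file.driver=nvme,file.device=%04x:%02x:%02x.%u,"
                 "file.namespace=%llu,format=raw,if=none,id=drive-%s",
                 addr->domain, addr->bus, addr->slot, addr->function,
                 ns, alias);
    if (n < 0 || (size_t)n >= outlen)
        return -1;

    return 0;
}
```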

On Thu, Jul 11, 2019 at 17:54:16 +0200, Michal Privoznik wrote:
Now, that we have everything prepared, we can generate command line for NVMe disks.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_block.c | 25 ++++++++- src/qemu/qemu_command.c | 3 ++ src/qemu/qemu_process.c | 7 +++ .../disk-nvme.x86_64-latest.args | 52 +++++++++++++++++++ tests/qemuxml2argvtest.c | 1 + 5 files changed, 87 insertions(+), 1 deletion(-) create mode 100644 tests/qemuxml2argvdata/disk-nvme.x86_64-latest.args
Note that when you enabled this, you neither disallowed snapshots (as in creating a local file snapshot on top of the NVMe image) nor implemented the backing store string parser for this.
This means that once you create the snapshot, restarting the VM will become impossible, as we will not be able to parse the backing store string (which is probably a bad idea altogether, since the disk can change PCI addresses in the meantime, so referring to it via the one stored in the backing file would be wrong anyway).
You'll probably need to disable snapshots (see qemuDomainSnapshotPrepareDiskExternalActive) if the 'domdisk' is NVMe, at least until we enable -blockdev support.
Inactive external snapshots should be fine since we validate that only BLOCK and FILE disks are allowed in qemuDomainSnapshotPrepareDiskExternalInactive.
At any rate, please also add a TEST_DISK_TO_JSON case for this in tests/qemublocktest.c.
Alternatively, depending on how strictly we want to prevent parsing nvme:// from the backing store string, we'll also need changes to virStorageSourceNewFromBackingAbsolute which will deny the nvme backing store.
The unfortunate part is that imposing all these limitations basically removes all the advantages of using NVMe disks via the qemu block layer.

On 7/16/19 5:35 PM, Peter Krempa wrote:
On Thu, Jul 11, 2019 at 17:54:16 +0200, Michal Privoznik wrote:
Now, that we have everything prepared, we can generate command line for NVMe disks.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_block.c | 25 ++++++++- src/qemu/qemu_command.c | 3 ++ src/qemu/qemu_process.c | 7 +++ .../disk-nvme.x86_64-latest.args | 52 +++++++++++++++++++ tests/qemuxml2argvtest.c | 1 + 5 files changed, 87 insertions(+), 1 deletion(-) create mode 100644 tests/qemuxml2argvdata/disk-nvme.x86_64-latest.args
Note that when you enabled this, you neither disallowed snapshots (as in creating a local file snapshot on top of the NVMe image) nor implemented the backing store string parser for this.
This means that once you create the snapshot, restarting the VM will become impossible, as we will not be able to parse the backing store string (which is probably a bad idea altogether, since the disk can change PCI addresses in the meantime, so referring to it via the one stored in the backing file would be wrong anyway).
You'll probably need to disable snapshots (see qemuDomainSnapshotPrepareDiskExternalActive) if the 'domdisk' is NVMe, at least until we enable -blockdev support.
Fair enough.
Inactive external snapshots should be fine since we validate that only BLOCK and FILE disks are allowed in qemuDomainSnapshotPrepareDiskExternalInactive.
At any rate, please also add a TEST_DISK_TO_JSON case for this in tests/qemublocktest.c.
Okay.
Alternatively, depending on how strictly we want to prevent parsing nvme:// from the backing store string, we'll also need changes to virStorageSourceNewFromBackingAbsolute which will deny the nvme backing store.
The unfortunate part is that imposing all these limitations basically removes all the advantages of using NVMe disks via the qemu block layer.
But those limitations would exist only for the time being, until we switch to -blockdev, right? I am willing to take that risk and allow snapshots only with -blockdev. Also, given that we are already at 31 patches in this series, we can forbid snapshots for now and save the work for a follow-up series (if somebody really needs to do snapshots). Note that these patches still add some value even if we disallow snapshots: the domain can still migrate, for instance.
Michal
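The restriction agreed on here could look roughly like the sketch below: refuse an external snapshot of a running domain when the disk being snapshotted is NVMe and -blockdev is not in use. All names (exampleStorageType, exampleSnapshotCheckDomDisk) are invented for illustration; the real check would live in qemuDomainSnapshotPrepareDiskExternalActive and report a proper libvirt error.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical subset of storage source types; not libvirt's enum. */
typedef enum {
    EXAMPLE_STORAGE_FILE,
    EXAMPLE_STORAGE_BLOCK,
    EXAMPLE_STORAGE_NETWORK,
    EXAMPLE_STORAGE_NVME
} exampleStorageType;

/* Returns 0 if an external active snapshot on top of the domain disk
 * is acceptable, -1 if it must be refused. */
static int
exampleSnapshotCheckDomDisk(exampleStorageType domDiskType,
                            bool haveBlockdev)
{
    /* Without -blockdev the NVMe source would have to be recovered from
     * the backing store string of the overlay, which cannot be parsed. */
    if (domDiskType == EXAMPLE_STORAGE_NVME && !haveBlockdev)
        return -1;

    return 0;
}
```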

At the very beginning of the attach function, qemuDomainStorageSourceChainAccessAllow() is called, which modifies CGroups, locks and seclabels for the new disk and its backing chain. This must be followed by a counterpart which reverts all the changes if something goes wrong. This boils down to calling qemuDomainStorageSourceChainAccessRevoke(), which is done under the 'error' label. But not all failure branches jump there. Some just jump onto the 'cleanup' label where no revoke is done. Such a mistake is easy to make because the 'cleanup' label does exist. Therefore, dissolve the 'error' block in 'cleanup' and have everything jump onto the 'cleanup' label.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
 src/qemu/qemu_hotplug.c | 19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)
diff --git a/src/qemu/qemu_hotplug.c b/src/qemu/qemu_hotplug.c
index 7e9c1a1649..3c6c0da3a0 100644
--- a/src/qemu/qemu_hotplug.c
+++ b/src/qemu/qemu_hotplug.c
@@ -624,13 +624,13 @@ qemuDomainAttachDiskGeneric(virQEMUDriverPtr driver,
     VIR_AUTOFREE(char *) corAlias = NULL;
 
     if (qemuDomainStorageSourceChainAccessAllow(driver, vm, disk->src) < 0)
-        goto cleanup;
+        return -1;
 
     if (qemuAssignDeviceDiskAlias(vm->def, disk, priv->qemuCaps) < 0)
-        goto error;
+        goto cleanup;
 
     if (qemuDomainPrepareDiskSource(disk, priv, cfg) < 0)
-        goto error;
+        goto cleanup;
 
     if (virQEMUCapsGet(priv->qemuCaps, QEMU_CAPS_BLOCKDEV)) {
         if (disk->copy_on_read == VIR_TRISTATE_SWITCH_ON &&
@@ -647,13 +647,13 @@ qemuDomainAttachDiskGeneric(virQEMUDriverPtr driver,
     }
 
     if (!(devstr = qemuBuildDiskDeviceStr(vm->def, disk, 0, priv->qemuCaps)))
-        goto error;
+        goto cleanup;
 
     if (VIR_REALLOC_N(vm->def->disks, vm->def->ndisks + 1) < 0)
-        goto error;
+        goto cleanup;
 
     if (qemuHotplugAttachManagedPR(driver, vm, disk->src, QEMU_ASYNC_JOB_NONE) < 0)
-        goto error;
+        goto cleanup;
 
     qemuDomainObjEnterMonitor(driver, vm);
 
@@ -674,7 +674,7 @@ qemuDomainAttachDiskGeneric(virQEMUDriverPtr driver,
 
     if (qemuDomainObjExitMonitor(driver, vm) < 0) {
         ret = -2;
-        goto error;
+        goto cleanup;
     }
 
     virDomainAuditDisk(vm, NULL, disk->src, "attach", true);
@@ -683,6 +683,8 @@ qemuDomainAttachDiskGeneric(virQEMUDriverPtr driver,
     ret = 0;
 
  cleanup:
+    if (ret < 0)
+        ignore_value(qemuDomainStorageSourceChainAccessRevoke(driver, vm, disk->src));
     qemuDomainSecretDiskDestroy(disk);
     VIR_FREE(devstr);
     return ret;
@@ -700,9 +702,6 @@ qemuDomainAttachDiskGeneric(virQEMUDriverPtr driver,
         ret = -2;
 
     virDomainAuditDisk(vm, NULL, disk->src, "attach", false);
-
- error:
-    ignore_value(qemuDomainStorageSourceChainAccessRevoke(driver, vm, disk->src));
     goto cleanup;
 }
 
-- 
2.21.0
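The pattern this patch converges on can be shown in isolation. The sketch below uses a counter in place of the real CGroup/lock/seclabel changes; exampleAccess, exampleAllow, exampleRevoke and exampleAttach are hypothetical names, not libvirt functions. The point is that a single 'cleanup' label keyed on ret leaves no failure path that can skip the revoke.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for granted access; not libvirt's type. */
typedef struct {
    int granted; /* number of outstanding grants */
} exampleAccess;

static int
exampleAllow(exampleAccess *a)
{
    a->granted++;
    return 0;
}

static void
exampleRevoke(exampleAccess *a)
{
    a->granted--;
}

static int
exampleAttach(exampleAccess *a, bool failLater)
{
    int ret = -1;

    assert(a);

    /* Nothing has been granted yet, so a plain return is safe here. */
    if (exampleAllow(a) < 0)
        return -1;

    /* Every later failure jumps to the one and only label. */
    if (failLater)
        goto cleanup;

    ret = 0;

 cleanup:
    /* Undo the grant only when the function did not succeed. */
    if (ret < 0)
        exampleRevoke(a);
    return ret;
}
```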

On Thu, Jul 11, 2019 at 17:54:17 +0200, Michal Privoznik wrote:
At the very beginning of the attach function, qemuDomainStorageSourceChainAccessAllow() is called, which modifies CGroups, locks and seclabels for the new disk and its backing chain. This must be followed by a counterpart which reverts all the changes if something goes wrong. This boils down to calling qemuDomainStorageSourceChainAccessRevoke(), which is done under the 'error' label. But not all failure branches jump there. Some just jump onto the 'cleanup' label where no revoke is done. Such a mistake is easy to make because the 'cleanup' label does exist. Therefore, dissolve the 'error' block in 'cleanup' and have everything jump onto the 'cleanup' label.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_hotplug.c | 19 +++++++++---------- 1 file changed, 9 insertions(+), 10 deletions(-)
ACK

Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
 src/qemu/qemu_hotplug.c | 65 +++++++++++++++++++++++++++++++++++++----
 1 file changed, 60 insertions(+), 5 deletions(-)

diff --git a/src/qemu/qemu_hotplug.c b/src/qemu/qemu_hotplug.c
index 3c6c0da3a0..6dbff23aa0 100644
--- a/src/qemu/qemu_hotplug.c
+++ b/src/qemu/qemu_hotplug.c
@@ -605,6 +605,54 @@ qemuDomainChangeEjectableMedia(virQEMUDriverPtr driver,
 }
 
 
+static int
+qemuDomainStorageSourcePrepareDisk(virQEMUDriverPtr driver,
+                                   virDomainObjPtr vm,
+                                   virDomainDiskDefPtr disk,
+                                   bool teardown)
+{
+    int rc;
+    bool adjustMemlock = false;
+    bool reattach = false;
+
+    if (!virDomainDefHasNVMeDisk(vm->def) &&
+        !virStorageSourceChainHasNVMe(disk->src))
+        return 0;
+
+    if (teardown) {
+        adjustMemlock = true;
+        reattach = true;
+        goto rollback;
+    }
+
+    /* Tentatively add disk to domain def so that memlock limit can be computed. */
+    vm->def->disks[vm->def->ndisks++] = disk;
+    rc = qemuDomainAdjustMaxMemLock(vm);
+    vm->def->disks[--vm->def->ndisks] = NULL;
+
+    if (rc < 0)
+        return -1;
+
+    adjustMemlock = true;
+
+    if (qemuHostdevPrepareNVMeDevices(driver, vm->def->name, &disk, 1) < 0)
+        return -1;
+
+    reattach = true;
+
+    return 0;
+
+ rollback:
+    if (reattach)
+        qemuHostdevReAttachNVMeDevices(driver, vm->def->name, &disk, 1);
+
+    if (adjustMemlock)
+        qemuDomainAdjustMaxMemLock(vm);
+
+    return 0;
+}
+
+
 /**
  * qemuDomainAttachDiskGeneric:
  *
@@ -623,8 +671,14 @@ qemuDomainAttachDiskGeneric(virQEMUDriverPtr driver,
     VIR_AUTOPTR(virJSONValue) corProps = NULL;
     VIR_AUTOFREE(char *) corAlias = NULL;
 
+    if (VIR_REALLOC_N(vm->def->disks, vm->def->ndisks + 1) < 0)
+        return -1;
+
+    if (qemuDomainStorageSourcePrepareDisk(driver, vm, disk, false) < 0)
+        return -1;
+
     if (qemuDomainStorageSourceChainAccessAllow(driver, vm, disk->src) < 0)
-        return -1;
+        goto cleanup;
 
     if (qemuAssignDeviceDiskAlias(vm->def, disk, priv->qemuCaps) < 0)
         goto cleanup;
@@ -649,9 +703,6 @@ qemuDomainAttachDiskGeneric(virQEMUDriverPtr driver,
     if (!(devstr = qemuBuildDiskDeviceStr(vm->def, disk, 0, priv->qemuCaps)))
         goto cleanup;
 
-    if (VIR_REALLOC_N(vm->def->disks, vm->def->ndisks + 1) < 0)
-        goto cleanup;
-
     if (qemuHotplugAttachManagedPR(driver, vm, disk->src, QEMU_ASYNC_JOB_NONE) < 0)
         goto cleanup;
 
@@ -683,8 +734,10 @@ qemuDomainAttachDiskGeneric(virQEMUDriverPtr driver,
     ret = 0;
 
  cleanup:
-    if (ret < 0)
+    if (ret < 0) {
         ignore_value(qemuDomainStorageSourceChainAccessRevoke(driver, vm, disk->src));
+        qemuDomainStorageSourcePrepareDisk(driver, vm, disk, true);
+    }
     qemuDomainSecretDiskDestroy(disk);
     VIR_FREE(devstr);
     return ret;
@@ -4267,6 +4320,8 @@ qemuDomainRemoveDiskDevice(virQEMUDriverPtr driver,
     dev.data.disk = disk;
     ignore_value(qemuRemoveSharedDevice(driver, &dev, vm->def->name));
 
+    qemuDomainStorageSourcePrepareDisk(driver, vm, disk, true);
+
     if (virStorageSourceChainHasManagedPR(disk->src) &&
         qemuHotplugRemoveManagedPR(driver, vm, QEMU_ASYNC_JOB_NONE) < 0)
         goto cleanup;
-- 
2.21.0
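
The patch moves the VIR_REALLOC_N call ahead of the prepare step because the new helper writes the disk into the slot at index ndisks before the disk is formally attached. A minimal standalone sketch of that "reserve a slot, append tentatively, compute, pop" pattern follows. Every name in it (struct domain, reserve_one, compute_limit, limit_with_extra_disk) is hypothetical, and the summing in compute_limit is only a stand-in for whatever qemuDomainAdjustMaxMemLock really computes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the libvirt structures. */
struct disk {
    size_t locked_bytes;    /* memory this disk would need locked */
};

struct domain {
    struct disk **disks;
    size_t ndisks;          /* entries currently in use */
    size_t capacity;        /* entries allocated */
};

/* Make sure one more pointer fits; mirrors the role of the early realloc. */
static int reserve_one(struct domain *dom)
{
    struct disk **tmp;

    if (dom->ndisks < dom->capacity)
        return 0;

    tmp = realloc(dom->disks, (dom->ndisks + 1) * sizeof(*tmp));
    if (!tmp)
        return -1;

    dom->disks = tmp;
    dom->capacity = dom->ndisks + 1;
    return 0;
}

/* Placeholder limit computation: just sums what each disk asks for. */
static size_t compute_limit(const struct domain *dom)
{
    size_t total = 0;

    for (size_t i = 0; i < dom->ndisks; i++)
        total += dom->disks[i]->locked_bytes;
    return total;
}

/* Append the disk only for the duration of the computation, then pop it. */
static size_t limit_with_extra_disk(struct domain *dom, struct disk *extra)
{
    size_t limit;

    /* The caller must have reserved the slot, otherwise this writes
     * past the end of the array. */
    assert(dom->ndisks < dom->capacity);

    dom->disks[dom->ndisks++] = extra;
    limit = compute_limit(dom);
    dom->disks[--dom->ndisks] = NULL;

    return limit;
}
```

Calling limit_with_extra_disk() without a prior successful reserve_one() trips the assertion, which is the situation the reordered VIR_REALLOC_N avoids in the patch.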

On Thu, Jul 11, 2019 at 17:54:18 +0200, Michal Privoznik wrote:
> Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
> ---
>  src/qemu/qemu_hotplug.c | 65 +++++++++++++++++++++++++++++++++++++----
>  1 file changed, 60 insertions(+), 5 deletions(-)
>
> diff --git a/src/qemu/qemu_hotplug.c b/src/qemu/qemu_hotplug.c
> index 3c6c0da3a0..6dbff23aa0 100644
> --- a/src/qemu/qemu_hotplug.c
> +++ b/src/qemu/qemu_hotplug.c
> @@ -605,6 +605,54 @@ qemuDomainChangeEjectableMedia(virQEMUDriverPtr driver,
>  }
>
>
> +static int
> +qemuDomainStorageSourcePrepareDisk(virQEMUDriverPtr driver,
> +                                   virDomainObjPtr vm,
> +                                   virDomainDiskDefPtr disk,
> +                                   bool teardown)
> +{
> +    int rc;
> +    bool adjustMemlock = false;
> +    bool reattach = false;
> +
> +    if (!virDomainDefHasNVMeDisk(vm->def) &&
> +        !virStorageSourceChainHasNVMe(disk->src))
> +        return 0;
> +
> +    if (teardown) {
> +        adjustMemlock = true;
> +        reattach = true;
> +        goto rollback;
> +    }
> +
> +    /* Tentatively add disk to domain def so that memlock limit can be computed. */
> +    vm->def->disks[vm->def->ndisks++] = disk;
> +    rc = qemuDomainAdjustMaxMemLock(vm);
> +    vm->def->disks[--vm->def->ndisks] = NULL;
> +
> +    if (rc < 0)
> +        return -1;
> +
> +    adjustMemlock = true;

What's the point of this ...

> +
> +    if (qemuHostdevPrepareNVMeDevices(driver, vm->def->name, &disk, 1) < 0)
> +        return -1;

... if this just exits the function?

> +
> +    reattach = true;
> +
> +    return 0;
> +
> + rollback:
> +    if (reattach)
> +        qemuHostdevReAttachNVMeDevices(driver, vm->def->name, &disk, 1);
> +
> +    if (adjustMemlock)
> +        qemuDomainAdjustMaxMemLock(vm);
> +
> +    return 0;
> +}
> +
> +
>  /**
>   * qemuDomainAttachDiskGeneric:
>   *
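
To illustrate the remark above: a flag that records a completed step only matters if the failure path actually reaches the code that reads it. The sketch below is one possible shape for that, with each failure jumping to the rollback label instead of returning directly. It is not the fix that was posted or merged, and every name in it (struct prepare_state, adjust_memlock, attach_device, reattach_device, prepare_disk) is a hypothetical stand-in for the corresponding libvirt helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bookkeeping so the effect of each step can be observed. */
struct prepare_state {
    int memlock_adjustments;  /* how often the limit was recomputed */
    int attached;             /* 1 while the device is bound for the guest */
    bool fail_attach;         /* knob that makes the attach step fail */
};

static int adjust_memlock(struct prepare_state *st)
{
    st->memlock_adjustments++;
    return 0;
}

static int attach_device(struct prepare_state *st)
{
    if (st->fail_attach)
        return -1;
    st->attached = 1;
    return 0;
}

static void reattach_device(struct prepare_state *st)
{
    st->attached = 0;
}

static int prepare_disk(struct prepare_state *st, bool teardown)
{
    bool undo_memlock = false;
    bool undo_attach = false;
    int ret = -1;

    assert(st != NULL);

    if (teardown) {
        /* Teardown undoes everything a successful prepare did. */
        undo_memlock = true;
        undo_attach = true;
        ret = 0;
        goto rollback;
    }

    if (adjust_memlock(st) < 0)
        goto rollback;
    undo_memlock = true;

    /* Jump rather than return, so that undo_memlock is honoured. */
    if (attach_device(st) < 0)
        goto rollback;

    return 0;

 rollback:
    if (undo_attach)
        reattach_device(st);

    if (undo_memlock)
        adjust_memlock(st);   /* recompute the limit without the disk */

    return ret;
}
```

With this shape a failed attach leaves the memlock limit recomputed, whereas a plain return -1 at the same point would skip the rollback block entirely.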
[...]
> @@ -683,8 +734,10 @@ qemuDomainAttachDiskGeneric(virQEMUDriverPtr driver,
>      ret = 0;
>
>   cleanup:
> -    if (ret < 0)
> +    if (ret < 0) {
>          ignore_value(qemuDomainStorageSourceChainAccessRevoke(driver, vm, disk->src));
> +        qemuDomainStorageSourcePrepareDisk(driver, vm, disk, true);
> +    }
>      qemuDomainSecretDiskDestroy(disk);
>      VIR_FREE(devstr);
>      return ret;
> @@ -4267,6 +4320,8 @@ qemuDomainRemoveDiskDevice(virQEMUDriverPtr driver,
>      dev.data.disk = disk;
>      ignore_value(qemuRemoveSharedDevice(driver, &dev, vm->def->name));
>
> +    qemuDomainStorageSourcePrepareDisk(driver, vm, disk, true);
> +
I'd prefer if you could base this on top of the new disk source tracking
for new blockjobs I've done for -blockdev support:

  git fetch https://github.com/pipo/libvirt.git job-tracking-send

That code tentatively moves this out if the disk backend is still required
while the frontend was unplugged.
participants (2)
- Michal Privoznik
- Peter Krempa