[libvirt] [PATCH v3 00/30] Introduce NVMe support

v3 of: https://www.redhat.com/archives/libvir-list/2019-September/msg01209.html

Builds just fine ;-)
https://travis-ci.org/zippy2/libvirt/builds/619622349

diff to v2:
- Fixed bug reported by Cole - inability to plug regular <hostdev/>-s,
  caused by v2.

Michal Prívozník (30):
  qemu: Explicitly add/remove /dev/vfio/vfio to/from NS/CGroups
  qemuDomainGetHostdevPath: Use more g_autoptr()/g_autofree
  qemuDomainGetHostdevPath: Don't include /dev/vfio/vfio in returned paths
  qemu: Drop some 'cleanup' labels
  virpci: Introduce and use virPCIDeviceAddressGetIOMMUGroupDev
  virHostdevPreparePCIDevices: Separate out function body
  virHostdevReAttachPCIDevices: Separate out function body
  virpci: Introduce virPCIDeviceAddressCopy
  qemuMigrationSrcIsSafe: Rework slightly
  schemas: Introduce disk type NVMe
  conf: Format and parse NVMe type disk
  virstoragefile: Introduce virStorageSourceChainHasNVMe
  domain_conf: Introduce virDomainDefHasNVMeDisk
  util: Introduce virNVMeDevice module
  virhostdev: Include virNVMeDevice module
  virpcimock: Introduce NVMe driver and devices
  virhostdevtest: Test virNVMeDevice assignment
  qemu: prepare NVMe devices too
  qemu: Take NVMe disks into account when calculating memlock limit
  qemu: Create NVMe disk in domain namespace
  qemu: Mark NVMe disks as 'need VFIO'
  qemu: Allow NVMe disk in CGroups
  virSecuritySELinuxRestoreImageLabelInt: Don't skip non-local storage
  qemu_capabilities: Introduce QEMU_CAPS_DRIVE_NVME
  qemu: Generate command line of NVMe disks
  qemu_monitor_text: Catch IOMMU/VFIO related errors in qemuMonitorTextAddDrive
  qemu: Don't leak storage perms on failure in qemuDomainAttachDiskGeneric
  qemu: Allow forcing VFIO when computing memlock limit
  qemu_hotplug: Prepare NVMe disks on hotplug
  virsh: Introduce nvme disk to domblklist

 docs/formatdomain.html.in                           |  57 ++-
 docs/schemas/domaincommon.rng                       |  32 ++
 po/POTFILES.in                                      |   1 +
 src/conf/domain_conf.c                              | 109 +++++
 src/conf/domain_conf.h                              |   3 +
 src/libvirt_private.syms                            |  28 ++
 src/libxl/xen_xl.c                                  |   1 +
 src/qemu/qemu_block.c                               |  25 +
 src/qemu/qemu_capabilities.c                        |   2 +
 src/qemu/qemu_capabilities.h                        |   1 +
 src/qemu/qemu_cgroup.c                              | 203 +++++---
 src/qemu/qemu_command.c                             |   6 +-
 src/qemu/qemu_domain.c                              | 375 +++++++++------
 src/qemu/qemu_domain.h                              |  15 +-
 src/qemu/qemu_driver.c                              |   4 +
 src/qemu/qemu_hostdev.c                             |  71 ++-
 src/qemu/qemu_hostdev.h                             |  16 +
 src/qemu/qemu_hotplug.c                             |  31 +-
 src/qemu/qemu_migration.c                           |  30 +-
 src/qemu/qemu_monitor_text.c                        |   7 +
 src/qemu/qemu_process.c                             |   7 +
 src/security/security_apparmor.c                    |  25 +-
 src/security/security_dac.c                         |  30 ++
 src/security/security_selinux.c                     |  51 +-
 src/util/Makefile.inc.am                            |   2 +
 src/util/virhostdev.c                               | 431 +++++++++++++++--
 src/util/virhostdev.h                               |  37 ++
 src/util/virnvme.c                                  | 447 ++++++++++++++++++
 src/util/virnvme.h                                  |  97 ++++
 src/util/virpci.c                                   |  29 ++
 src/util/virpci.h                                   |   5 +
 src/util/virstoragefile.c                           |  71 +++
 src/util/virstoragefile.h                           |  19 +
 .../caps_2.12.0.aarch64.xml                         |   1 +
 .../caps_2.12.0.ppc64.xml                           |   1 +
 .../caps_2.12.0.s390x.xml                           |   1 +
 .../caps_2.12.0.x86_64.xml                          |   1 +
 tests/qemucapabilitiesdata/caps_3.0.0.ppc64.xml     |   1 +
 .../caps_3.0.0.riscv32.xml                          |   1 +
 .../caps_3.0.0.riscv64.xml                          |   1 +
 tests/qemucapabilitiesdata/caps_3.0.0.s390x.xml     |   1 +
 .../caps_3.0.0.x86_64.xml                           |   1 +
 tests/qemucapabilitiesdata/caps_3.1.0.ppc64.xml     |   1 +
 .../caps_3.1.0.x86_64.xml                           |   1 +
 .../caps_4.0.0.aarch64.xml                          |   1 +
 tests/qemucapabilitiesdata/caps_4.0.0.ppc64.xml     |   1 +
 .../caps_4.0.0.riscv32.xml                          |   1 +
 .../caps_4.0.0.riscv64.xml                          |   1 +
 tests/qemucapabilitiesdata/caps_4.0.0.s390x.xml     |   1 +
 .../caps_4.0.0.x86_64.xml                           |   1 +
 .../caps_4.1.0.x86_64.xml                           |   1 +
 .../caps_4.2.0.aarch64.xml                          |   1 +
 tests/qemucapabilitiesdata/caps_4.2.0.ppc64.xml     |   1 +
 tests/qemucapabilitiesdata/caps_4.2.0.s390x.xml     |   1 +
 .../caps_4.2.0.x86_64.xml                           |   1 +
 tests/qemumemlocktest.c                             |   2 +-
 .../disk-nvme.x86_64-latest.args                    |  63 +++
 tests/qemuxml2argvdata/disk-nvme.xml                |  63 +++
 tests/qemuxml2argvtest.c                            |   1 +
 tests/qemuxml2xmloutdata/disk-nvme.xml              |   1 +
 tests/qemuxml2xmltest.c                             |   1 +
 tests/virhostdevtest.c                              |  97 ++++
 tests/virpcimock.c                                  |   3 +
 tests/virpcitestdata/0000-01-00.0.config            | Bin 0 -> 4096 bytes
 tests/virpcitestdata/0000-02-00.0.config            | Bin 0 -> 4096 bytes
 tools/virsh-domain-monitor.c                        |  31 +-
 66 files changed, 2240 insertions(+), 312 deletions(-)
 create mode 100644 src/util/virnvme.c
 create mode 100644 src/util/virnvme.h
 create mode 100644 tests/qemuxml2argvdata/disk-nvme.x86_64-latest.args
 create mode 100644 tests/qemuxml2argvdata/disk-nvme.xml
 create mode 120000 tests/qemuxml2xmloutdata/disk-nvme.xml
 create mode 100644 tests/virpcitestdata/0000-01-00.0.config
 create mode 100644 tests/virpcitestdata/0000-02-00.0.config

-- 
2.23.0
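[Editor's sketch] For readers who want to see where the series ends up, the new disk type added by the schemas/conf patches looks roughly like this in the domain XML (adapted from the disk-nvme.xml test file added by the series; treat exact attribute names as governed by the schema patch):

```xml
<disk type='nvme' device='disk'>
  <driver name='qemu' type='raw'/>
  <!-- The NVMe disk is addressed by its PCI address plus a namespace ID -->
  <source type='pci' managed='yes' namespace='1'>
    <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
  </source>
  <target dev='vda' bus='virtio'/>
</disk>
```

Since QEMU accesses such a disk through VFIO, the NVMe controller has to be detached from the host much like a `<hostdev/>`, which is why so much of the series touches the hostdev, namespace, cgroup and memlock code.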

In the near future, the decision what to do with /dev/vfio/vfio with
respect to the domain namespace and CGroup is going to be moved out of
qemuDomainGetHostdevPath(), because there will be other types of devices
than hostdevs that need access to VFIO.

All functions that I'm changing (except qemuSetupHostdevCgroup()) assume
that the hostdev we are adding to/removing from the VM is not in the
definition yet (because of how qemuDomainNeedsVFIO() is written).
Fortunately, this assumption holds. For qemuSetupHostdevCgroup(), the
worst thing that may happen is that we allow /dev/vfio/vfio which was
already allowed.

Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
 src/qemu/qemu_cgroup.c | 47 +++++++++++++++++++++++++++++++++++++++++-
 src/qemu/qemu_domain.c | 36 ++++++++++++++++++++++++++++++++
 2 files changed, 82 insertions(+), 1 deletion(-)

diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c
index 31cf71146b..65b148cd83 100644
--- a/src/qemu/qemu_cgroup.c
+++ b/src/qemu/qemu_cgroup.c
@@ -25,6 +25,7 @@
 #include "qemu_domain.h"
 #include "qemu_process.h"
 #include "qemu_extdevice.h"
+#include "qemu_hostdev.h"
 #include "virlog.h"
 #include "viralloc.h"
 #include "virerror.h"
@@ -359,6 +360,17 @@ qemuTeardownInputCgroup(virDomainObjPtr vm,
 }
 
 
+/**
+ * qemuSetupHostdevCgroup:
+ * vm: domain object
+ * @dev: device to allow
+ *
+ * For given host device @dev allow access to in Cgroups.
+ * Note, @dev must not be in @vm's definition.
+ *
+ * Returns: 0 on success,
+ *         -1 otherwise.
+ */
 int
 qemuSetupHostdevCgroup(virDomainObjPtr vm,
                        virDomainHostdevDefPtr dev)
@@ -385,6 +397,16 @@ qemuSetupHostdevCgroup(virDomainObjPtr vm,
         goto cleanup;
     }
 
+    if (qemuHostdevNeedsVFIO(dev)) {
+        VIR_DEBUG("Cgroup allow %s perms=%d", QEMU_DEV_VFIO, VIR_CGROUP_DEVICE_RW);
+        rv = virCgroupAllowDevicePath(priv->cgroup, QEMU_DEV_VFIO,
+                                      VIR_CGROUP_DEVICE_RW, false);
+        virDomainAuditCgroupPath(vm, priv->cgroup, "allow",
+                                 QEMU_DEV_VFIO, "rw", rv);
+        if (rv < 0)
+            goto cleanup;
+    }
+
     ret = 0;
  cleanup:
@@ -395,9 +417,21 @@ qemuSetupHostdevCgroup(virDomainObjPtr vm,
     return ret;
 }
 
+
+/**
+ * qemuTeardownHostdevCgroup:
+ * @vm: domain object
+ * @dev: device to tear down
+ *
+ * For given host device @dev deny access to it in CGroups.
+ * Note, @dev must not be in @vm's definition.
+ *
+ * Returns: 0 on success,
+ *         -1 otherwise.
+ */
 int
 qemuTeardownHostdevCgroup(virDomainObjPtr vm,
-                        virDomainHostdevDefPtr dev)
+                          virDomainHostdevDefPtr dev)
 {
     qemuDomainObjPrivatePtr priv = vm->privateData;
     char **path = NULL;
@@ -421,6 +455,17 @@ qemuTeardownHostdevCgroup(virDomainObjPtr vm,
         goto cleanup;
     }
 
+    if (qemuHostdevNeedsVFIO(dev) &&
+        !qemuDomainNeedsVFIO(vm->def)) {
+        VIR_DEBUG("Cgroup deny " QEMU_DEV_VFIO);
+        rv = virCgroupDenyDevicePath(priv->cgroup, QEMU_DEV_VFIO,
+                                     VIR_CGROUP_DEVICE_RWM, false);
+        virDomainAuditCgroupPath(vm, priv->cgroup, "deny",
+                                 QEMU_DEV_VFIO, "rwm", rv);
+        if (rv < 0)
+            goto cleanup;
+    }
+
     ret = 0;
  cleanup:
     for (i = 0; i < npaths; i++)
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index d1596a28ca..550ae848df 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -13542,6 +13542,10 @@ qemuDomainSetupHostdev(virQEMUDriverConfigPtr cfg G_GNUC_UNUSED,
         goto cleanup;
     }
 
+    if (qemuHostdevNeedsVFIO(dev) &&
+        qemuDomainCreateDevice(QEMU_DEV_VFIO, data, false) < 0)
+        goto cleanup;
+
     ret = 0;
  cleanup:
     for (i = 0; i < npaths; i++)
@@ -14565,6 +14569,17 @@ qemuDomainNamespaceTeardownDisk(virDomainObjPtr vm G_GNUC_UNUSED,
 }
 
 
+/**
+ * qemuDomainNamespaceSetupHostdev:
+ * @vm: domain object
+ * @hostdev: hostdev to create in @vm's namespace
+ *
+ * For given @hostdev, create its devfs representation (if it has one) in
+ * domain namespace. Note, @hostdev must not be in @vm's definition.
+ *
+ * Returns: 0 on success,
+ *         -1 otherwise.
+ */
 int
 qemuDomainNamespaceSetupHostdev(virDomainObjPtr vm,
                                 virDomainHostdevDefPtr hostdev)
@@ -14579,6 +14594,11 @@ qemuDomainNamespaceSetupHostdev(virDomainObjPtr vm,
     if (qemuDomainNamespaceMknodPaths(vm, (const char **)paths, npaths) < 0)
         goto cleanup;
 
+    if (qemuHostdevNeedsVFIO(hostdev) &&
+        !qemuDomainNeedsVFIO(vm->def) &&
+        qemuDomainNamespaceMknodPath(vm, QEMU_DEV_VFIO) < 0)
+        goto cleanup;
+
     ret = 0;
  cleanup:
     for (i = 0; i < npaths; i++)
@@ -14588,6 +14608,17 @@ qemuDomainNamespaceSetupHostdev(virDomainObjPtr vm,
 }
 
 
+/**
+ * qemuDomainNamespaceTeardownHostdev:
+ * @vm: domain object
+ * @hostdev: hostdev to remove in @vm's namespace
+ *
+ * For given @hostdev, remove its devfs representation (if it has one) in
+ * domain namespace. Note, @hostdev must not be in @vm's definition.
+ *
+ * Returns: 0 on success,
+ *         -1 otherwise.
+ */
 int
 qemuDomainNamespaceTeardownHostdev(virDomainObjPtr vm,
                                    virDomainHostdevDefPtr hostdev)
@@ -14603,6 +14634,11 @@ qemuDomainNamespaceTeardownHostdev(virDomainObjPtr vm,
     if (qemuDomainNamespaceUnlinkPaths(vm, (const char **)paths, npaths) < 0)
         goto cleanup;
 
+    if (qemuHostdevNeedsVFIO(hostdev) &&
+        !qemuDomainNeedsVFIO(vm->def) &&
+        qemuDomainNamespaceUnlinkPath(vm, QEMU_DEV_VFIO) < 0)
+        goto cleanup;
+
     ret = 0;
  cleanup:
     for (i = 0; i < npaths; i++)

-- 
2.23.0
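The branching added above hinges on qemuHostdevNeedsVFIO(): VFIO-assigned PCI devices and mediated devices talk to QEMU through /dev/vfio/vfio. A simplified standalone sketch of such a predicate (hypothetical types, not the actual libvirt definitions):

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for the libvirt hostdev types. */
typedef enum { SUBSYS_USB, SUBSYS_PCI, SUBSYS_SCSI, SUBSYS_MDEV } SubsysType;
typedef enum { PCI_BACKEND_KVM, PCI_BACKEND_VFIO } PCIBackend;

typedef struct {
    SubsysType type;
    PCIBackend pciBackend;   /* only meaningful for SUBSYS_PCI */
} Hostdev;

/* A device needs /dev/vfio/vfio if it is a VFIO-assigned PCI device
 * or a mediated device (mdevs are always driven through VFIO). */
bool hostdevNeedsVFIO(const Hostdev *dev)
{
    return (dev->type == SUBSYS_PCI && dev->pciBackend == PCI_BACKEND_VFIO) ||
           dev->type == SUBSYS_MDEV;
}
```

Note the asymmetry in the patch: on teardown it additionally checks qemuDomainNeedsVFIO(vm->def), so /dev/vfio/vfio is only taken away once no device remaining in the definition needs it.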

On 12/2/19 9:26 AM, Michal Privoznik wrote:
> In the near future, the decision what to do with /dev/vfio/vfio with
> respect to the domain namespace and CGroup is going to be moved out of
> qemuDomainGetHostdevPath(), because there will be other types of devices
> than hostdevs that need access to VFIO.
> 
> All functions that I'm changing (except qemuSetupHostdevCgroup()) assume
> that the hostdev we are adding to/removing from the VM is not in the
> definition yet (because of how qemuDomainNeedsVFIO() is written).
> Fortunately, this assumption holds.
> 
> For qemuSetupHostdevCgroup(), the worst thing that may happen is that we
> allow /dev/vfio/vfio which was already allowed.
> Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
> ---
>  src/qemu/qemu_cgroup.c | 47 +++++++++++++++++++++++++++++++++++++++++-
>  src/qemu/qemu_domain.c | 36 ++++++++++++++++++++++++++++++++
>  2 files changed, 82 insertions(+), 1 deletion(-)
> 
> diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c
> index 31cf71146b..65b148cd83 100644
> --- a/src/qemu/qemu_cgroup.c
> +++ b/src/qemu/qemu_cgroup.c
> @@ -25,6 +25,7 @@
>  #include "qemu_domain.h"
>  #include "qemu_process.h"
>  #include "qemu_extdevice.h"
> +#include "qemu_hostdev.h"
>  #include "virlog.h"
>  #include "viralloc.h"
>  #include "virerror.h"
> @@ -359,6 +360,17 @@ qemuTeardownInputCgroup(virDomainObjPtr vm,
>  }
> +/**
> + * qemuSetupHostdevCgroup:
> + * vm: domain object
> + * @dev: device to allow
> + *
> + * For given host device @dev allow access to in Cgroups.
> + * Note, @dev must not be in @vm's definition.
> + *
This last line should be dropped; the function is still called like that
(with a @dev that already is in the definition) via
qemuSetupDevicesCgroup().
> + * Returns: 0 on success,
> + *         -1 otherwise.
> + */
>  int
>  qemuSetupHostdevCgroup(virDomainObjPtr vm,
>                         virDomainHostdevDefPtr dev)
> @@ -385,6 +397,16 @@ qemuSetupHostdevCgroup(virDomainObjPtr vm,
>          goto cleanup;
>      }
Maybe just add a comment here explaining that, at VM startup, this can
lead to /dev/vfio/vfio being added to the cgroup multiple times, but
that has never been shown to be a problem.

Reviewed-by: Cole Robinson <crobinso@redhat.com>

- Cole

There are several variables which could be automatically freed upon
return from the function. I'm not changing @tmpPaths (which is a string
list) because it is going to be removed in the next commit.

Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
 src/qemu/qemu_domain.c | 18 ++++++------------
 1 file changed, 6 insertions(+), 12 deletions(-)

diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index 550ae848df..6daae24c3d 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -12891,14 +12891,14 @@ qemuDomainGetHostdevPath(virDomainDefPtr def,
     virDomainHostdevSubsysSCSIPtr scsisrc = &dev->source.subsys.u.scsi;
     virDomainHostdevSubsysSCSIVHostPtr hostsrc = &dev->source.subsys.u.scsi_host;
     virDomainHostdevSubsysMediatedDevPtr mdevsrc = &dev->source.subsys.u.mdev;
-    virPCIDevicePtr pci = NULL;
-    virUSBDevicePtr usb = NULL;
-    virSCSIDevicePtr scsi = NULL;
-    virSCSIVHostDevicePtr host = NULL;
-    char *tmpPath = NULL;
+    g_autoptr(virPCIDevice) pci = NULL;
+    g_autoptr(virUSBDevice) usb = NULL;
+    g_autoptr(virSCSIDevice) scsi = NULL;
+    g_autoptr(virSCSIVHostDevice) host = NULL;
+    g_autofree char *tmpPath = NULL;
     bool includeVFIO = false;
     char **tmpPaths = NULL;
-    int *tmpPerms = NULL;
+    g_autofree int *tmpPerms = NULL;
     size_t tmpNpaths = 0;
     int perm = 0;
@@ -13025,12 +13025,6 @@ qemuDomainGetHostdevPath(virDomainDefPtr def,
     ret = 0;
  cleanup:
     virStringListFreeCount(tmpPaths, tmpNpaths);
-    VIR_FREE(tmpPerms);
-    virPCIDeviceFree(pci);
-    virUSBDeviceFree(usb);
-    virSCSIDeviceFree(scsi);
-    virSCSIVHostDeviceFree(host);
-    VIR_FREE(tmpPath);
     return ret;
 }

-- 
2.23.0

On 12/2/19 9:26 AM, Michal Privoznik wrote:
> There are several variables which could be automatically freed upon
> return from the function. I'm not changing @tmpPaths (which is a string
> list) because it is going to be removed in the next commit.
> 
> Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
> ---
>  src/qemu/qemu_domain.c | 18 ++++++------------
>  1 file changed, 6 insertions(+), 12 deletions(-)
Reviewed-by: Cole Robinson <crobinso@redhat.com>

- Cole

Now that all callers of qemuDomainGetHostdevPath() handle
/dev/vfio/vfio on their own, we can safely drop its handling in this
function. In the near future, the decision whether a domain needs the
VFIO file is going to include more device types than just
virDomainHostdev.

Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
 src/qemu/qemu_cgroup.c | 52 ++++++++--------------
 src/qemu/qemu_domain.c | 98 +++++++++--------------------------------
 src/qemu/qemu_domain.h |  9 ++--
 3 files changed, 42 insertions(+), 117 deletions(-)

diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c
index 65b148cd83..d54d01d4c0 100644
--- a/src/qemu/qemu_cgroup.c
+++ b/src/qemu/qemu_cgroup.c
@@ -376,26 +376,23 @@ qemuSetupHostdevCgroup(virDomainObjPtr vm,
                        virDomainHostdevDefPtr dev)
 {
     qemuDomainObjPrivatePtr priv = vm->privateData;
-    char **path = NULL;
-    int *perms = NULL;
-    size_t i, npaths = 0;
+    g_autofree char *path = NULL;
+    int perms;
     int rv, ret = -1;
 
     if (!virCgroupHasController(priv->cgroup, VIR_CGROUP_CONTROLLER_DEVICES))
         return 0;
 
-    if (qemuDomainGetHostdevPath(NULL, dev, false, &npaths, &path, &perms) < 0)
+    if (qemuDomainGetHostdevPath(dev, &path, &perms) < 0)
         goto cleanup;
 
-    for (i = 0; i < npaths; i++) {
-        VIR_DEBUG("Cgroup allow %s perms=%d", path[i], perms[i]);
-        rv = virCgroupAllowDevicePath(priv->cgroup, path[i], perms[i], false);
-        virDomainAuditCgroupPath(vm, priv->cgroup, "allow", path[i],
-                                 virCgroupGetDevicePermsString(perms[i]),
-                                 rv);
-        if (rv < 0)
-            goto cleanup;
-    }
+    VIR_DEBUG("Cgroup allow %s perms=%d", path, perms);
+    rv = virCgroupAllowDevicePath(priv->cgroup, path, perms, false);
+    virDomainAuditCgroupPath(vm, priv->cgroup, "allow", path,
+                             virCgroupGetDevicePermsString(perms),
+                             rv);
+    if (rv < 0)
+        goto cleanup;
 
     if (qemuHostdevNeedsVFIO(dev)) {
         VIR_DEBUG("Cgroup allow %s perms=%d", QEMU_DEV_VFIO, VIR_CGROUP_DEVICE_RW);
@@ -410,10 +407,6 @@ qemuSetupHostdevCgroup(virDomainObjPtr vm,
     ret = 0;
  cleanup:
-    for (i = 0; i < npaths; i++)
-        VIR_FREE(path[i]);
-    VIR_FREE(path);
-    VIR_FREE(perms);
     return ret;
 }
@@ -434,26 +427,22 @@ qemuTeardownHostdevCgroup(virDomainObjPtr vm,
                           virDomainHostdevDefPtr dev)
 {
     qemuDomainObjPrivatePtr priv = vm->privateData;
-    char **path = NULL;
-    size_t i, npaths = 0;
+    g_autofree char *path = NULL;
     int rv, ret = -1;
 
     if (!virCgroupHasController(priv->cgroup, VIR_CGROUP_CONTROLLER_DEVICES))
         return 0;
 
-    if (qemuDomainGetHostdevPath(vm->def, dev, true,
-                                 &npaths, &path, NULL) < 0)
+    if (qemuDomainGetHostdevPath(dev, &path, NULL) < 0)
         goto cleanup;
 
-    for (i = 0; i < npaths; i++) {
-        VIR_DEBUG("Cgroup deny %s", path[i]);
-        rv = virCgroupDenyDevicePath(priv->cgroup, path[i],
-                                     VIR_CGROUP_DEVICE_RWM, false);
-        virDomainAuditCgroupPath(vm, priv->cgroup,
-                                 "deny", path[i], "rwm", rv);
-        if (rv < 0)
-            goto cleanup;
-    }
+    VIR_DEBUG("Cgroup deny %s", path);
+    rv = virCgroupDenyDevicePath(priv->cgroup, path,
+                                 VIR_CGROUP_DEVICE_RWM, false);
+    virDomainAuditCgroupPath(vm, priv->cgroup,
+                             "deny", path, "rwm", rv);
+    if (rv < 0)
+        goto cleanup;
 
     if (qemuHostdevNeedsVFIO(dev) &&
         !qemuDomainNeedsVFIO(vm->def)) {
@@ -468,9 +457,6 @@ qemuTeardownHostdevCgroup(virDomainObjPtr vm,
     ret = 0;
  cleanup:
-    for (i = 0; i < npaths; i++)
-        VIR_FREE(path[i]);
-    VIR_FREE(path);
     return ret;
 }
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index 6daae24c3d..be769cb3ff 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -12861,29 +12861,23 @@ qemuDomainNeedsVFIO(const virDomainDef *def)
 
 /**
  * qemuDomainGetHostdevPath:
- * @def: domain definition
  * @dev: host device definition
- * @teardown: true if device will be removed
- * @npaths: number of items in @path and @perms arrays
  * @path: resulting path to @dev
  * @perms: Optional pointer to VIR_CGROUP_DEVICE_* perms
  *
  * For given device @dev fetch its host path and store it at
- * @path. If a device requires other paths to be present/allowed
- * they are stored in the @path array after the actual path.
- * Optionally, caller can get @perms on the path (e.g. rw/ro).
+ * @path. Optionally, caller can get @perms on the path (e.g.
+ * rw/ro).
  *
- * The caller is responsible for freeing the memory.
+ * The caller is responsible for freeing the @path when no longer
+ * needed.
  *
  * Returns 0 on success, -1 otherwise.
  */
 int
-qemuDomainGetHostdevPath(virDomainDefPtr def,
-                         virDomainHostdevDefPtr dev,
-                         bool teardown,
-                         size_t *npaths,
-                         char ***path,
-                         int **perms)
+qemuDomainGetHostdevPath(virDomainHostdevDefPtr dev,
+                         char **path,
+                         int *perms)
 {
     int ret = -1;
     virDomainHostdevSubsysUSBPtr usbsrc = &dev->source.subsys.u.usb;
@@ -12896,14 +12890,9 @@ qemuDomainGetHostdevPath(virDomainDefPtr def,
     g_autoptr(virSCSIDevice) scsi = NULL;
     g_autoptr(virSCSIVHostDevice) host = NULL;
     g_autofree char *tmpPath = NULL;
-    bool includeVFIO = false;
-    char **tmpPaths = NULL;
     g_autofree int *tmpPerms = NULL;
-    size_t tmpNpaths = 0;
     int perm = 0;
 
-    *npaths = 0;
-
     switch ((virDomainHostdevMode) dev->mode) {
     case VIR_DOMAIN_HOSTDEV_MODE_SUBSYS:
         switch ((virDomainHostdevSubsysType)dev->source.subsys.type) {
@@ -12920,12 +12909,6 @@ qemuDomainGetHostdevPath(virDomainDefPtr def,
                     goto cleanup;
 
                 perm = VIR_CGROUP_DEVICE_RW;
-                if (teardown) {
-                    if (!virDomainDefHasVFIOHostdev(def))
-                        includeVFIO = true;
-                } else {
-                    includeVFIO = true;
-                }
             }
             break;
@@ -12981,7 +12964,6 @@ qemuDomainGetHostdevPath(virDomainDefPtr def,
             if (!(tmpPath = virMediatedDeviceGetIOMMUGroupDev(mdevsrc->uuidstr)))
                 goto cleanup;
 
-            includeVFIO = true;
             perm = VIR_CGROUP_DEVICE_RW;
             break;
         case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_LAST:
@@ -12995,36 +12977,11 @@ qemuDomainGetHostdevPath(virDomainDefPtr def,
         break;
     }
 
-    if (tmpPath) {
-        size_t toAlloc = 1;
-
-        if (includeVFIO)
-            toAlloc = 2;
-
-        if (VIR_ALLOC_N(tmpPaths, toAlloc) < 0 ||
-            VIR_ALLOC_N(tmpPerms, toAlloc) < 0)
-            goto cleanup;
-        tmpPaths[0] = g_strdup(tmpPath);
-        tmpNpaths = toAlloc;
-        tmpPerms[0] = perm;
-
-        if (includeVFIO) {
-            tmpPaths[1] = g_strdup(QEMU_DEV_VFIO);
-            tmpPerms[1] = VIR_CGROUP_DEVICE_RW;
-        }
-    }
-
-    *npaths = tmpNpaths;
-    tmpNpaths = 0;
-    *path = tmpPaths;
-    tmpPaths = NULL;
-    if (perms) {
-        *perms = tmpPerms;
-        tmpPerms = NULL;
-    }
+    *path = g_steal_pointer(&tmpPath);
+    if (perms)
+        *perms = perm;
     ret = 0;
  cleanup:
-    virStringListFreeCount(tmpPaths, tmpNpaths);
     return ret;
 }
@@ -13525,16 +13482,13 @@ qemuDomainSetupHostdev(virQEMUDriverConfigPtr cfg G_GNUC_UNUSED,
                        const struct qemuDomainCreateDeviceData *data)
 {
     int ret = -1;
-    char **path = NULL;
-    size_t i, npaths = 0;
+    g_autofree char *path = NULL;
 
-    if (qemuDomainGetHostdevPath(NULL, dev, false, &npaths, &path, NULL) < 0)
+    if (qemuDomainGetHostdevPath(dev, &path, NULL) < 0)
         goto cleanup;
 
-    for (i = 0; i < npaths; i++) {
-        if (qemuDomainCreateDevice(path[i], data, false) < 0)
-            goto cleanup;
-    }
+    if (qemuDomainCreateDevice(path, data, false) < 0)
+        goto cleanup;
 
     if (qemuHostdevNeedsVFIO(dev) &&
         qemuDomainCreateDevice(QEMU_DEV_VFIO, data, false) < 0)
@@ -13542,9 +13496,6 @@ qemuDomainSetupHostdev(virQEMUDriverConfigPtr cfg G_GNUC_UNUSED,
     ret = 0;
  cleanup:
-    for (i = 0; i < npaths; i++)
-        VIR_FREE(path[i]);
-    VIR_FREE(path);
     return ret;
 }
@@ -14579,13 +14530,12 @@ qemuDomainNamespaceSetupHostdev(virDomainObjPtr vm,
                                 virDomainHostdevDefPtr hostdev)
 {
     int ret = -1;
-    char **paths = NULL;
-    size_t i, npaths = 0;
+    g_autofree char *path = NULL;
 
-    if (qemuDomainGetHostdevPath(NULL, hostdev, false, &npaths, &paths, NULL) < 0)
+    if (qemuDomainGetHostdevPath(hostdev, &path, NULL) < 0)
         goto cleanup;
 
-    if (qemuDomainNamespaceMknodPaths(vm, (const char **)paths, npaths) < 0)
+    if (qemuDomainNamespaceMknodPath(vm, path) < 0)
         goto cleanup;
 
     if (qemuHostdevNeedsVFIO(hostdev) &&
@@ -14595,9 +14545,6 @@ qemuDomainNamespaceSetupHostdev(virDomainObjPtr vm,
     ret = 0;
  cleanup:
-    for (i = 0; i < npaths; i++)
-        VIR_FREE(paths[i]);
-    VIR_FREE(paths);
     return ret;
 }
@@ -14618,14 +14565,12 @@ qemuDomainNamespaceTeardownHostdev(virDomainObjPtr vm,
                                    virDomainHostdevDefPtr hostdev)
 {
     int ret = -1;
-    char **paths = NULL;
-    size_t i, npaths = 0;
+    g_autofree char *path = NULL;
 
-    if (qemuDomainGetHostdevPath(vm->def, hostdev, true,
-                                 &npaths, &paths, NULL) < 0)
+    if (qemuDomainGetHostdevPath(hostdev, &path, NULL) < 0)
         goto cleanup;
 
-    if (qemuDomainNamespaceUnlinkPaths(vm, (const char **)paths, npaths) < 0)
+    if (qemuDomainNamespaceUnlinkPath(vm, path) < 0)
         goto cleanup;
 
     if (qemuHostdevNeedsVFIO(hostdev) &&
@@ -14635,9 +14580,6 @@ qemuDomainNamespaceTeardownHostdev(virDomainObjPtr vm,
     ret = 0;
  cleanup:
-    for (i = 0; i < npaths; i++)
-        VIR_FREE(paths[i]);
-    VIR_FREE(paths);
     return ret;
 }
diff --git a/src/qemu/qemu_domain.h b/src/qemu/qemu_domain.h
index db45a932dc..65bd83aece 100644
--- a/src/qemu/qemu_domain.h
+++ b/src/qemu/qemu_domain.h
@@ -1079,12 +1079,9 @@ bool qemuDomainSupportsVideoVga(virDomainVideoDefPtr video,
 
 bool qemuDomainNeedsVFIO(const virDomainDef *def);
 
-int qemuDomainGetHostdevPath(virDomainDefPtr def,
-                             virDomainHostdevDefPtr dev,
-                             bool teardown,
-                             size_t *npaths,
-                             char ***path,
-                             int **perms);
+int qemuDomainGetHostdevPath(virDomainHostdevDefPtr dev,
+                             char **path,
+                             int *perms);
 
 int qemuDomainBuildNamespace(virQEMUDriverConfigPtr cfg,
                              virSecurityManagerPtr mgr,

-- 
2.23.0

On 12/2/19 9:26 AM, Michal Privoznik wrote:
> Now that all callers of qemuDomainGetHostdevPath() handle
> /dev/vfio/vfio on their own, we can safely drop its handling in this
> function. In the near future, the decision whether a domain needs the
> VFIO file is going to include more device types than just
> virDomainHostdev.
> 
> Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
> ---
>  src/qemu/qemu_cgroup.c | 52 ++++++++--------------
>  src/qemu/qemu_domain.c | 98 +++++++++--------------------------------
>  src/qemu/qemu_domain.h |  9 ++--
>  3 files changed, 42 insertions(+), 117 deletions(-)
Reviewed-by: Cole Robinson <crobinso@redhat.com>

- Cole

Previous patches rendered some of the 'cleanup' labels needless. Drop
them.

Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
 src/qemu/qemu_cgroup.c | 25 +++++++++------------
 src/qemu/qemu_domain.c | 50 ++++++++++++++++--------------------------
 2 files changed, 29 insertions(+), 46 deletions(-)

diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c
index d54d01d4c0..d23318db8c 100644
--- a/src/qemu/qemu_cgroup.c
+++ b/src/qemu/qemu_cgroup.c
@@ -378,13 +378,13 @@ qemuSetupHostdevCgroup(virDomainObjPtr vm,
     qemuDomainObjPrivatePtr priv = vm->privateData;
     g_autofree char *path = NULL;
     int perms;
-    int rv, ret = -1;
+    int rv;
 
     if (!virCgroupHasController(priv->cgroup, VIR_CGROUP_CONTROLLER_DEVICES))
         return 0;
 
     if (qemuDomainGetHostdevPath(dev, &path, &perms) < 0)
-        goto cleanup;
+        return -1;
 
     VIR_DEBUG("Cgroup allow %s perms=%d", path, perms);
     rv = virCgroupAllowDevicePath(priv->cgroup, path, perms, false);
@@ -392,7 +392,7 @@ qemuSetupHostdevCgroup(virDomainObjPtr vm,
                              virCgroupGetDevicePermsString(perms),
                              rv);
     if (rv < 0)
-        goto cleanup;
+        return -1;
 
     if (qemuHostdevNeedsVFIO(dev)) {
         VIR_DEBUG("Cgroup allow %s perms=%d", QEMU_DEV_VFIO, VIR_CGROUP_DEVICE_RW);
@@ -401,13 +401,10 @@ qemuSetupHostdevCgroup(virDomainObjPtr vm,
         virDomainAuditCgroupPath(vm, priv->cgroup, "allow",
                                  QEMU_DEV_VFIO, "rw", rv);
         if (rv < 0)
-            goto cleanup;
+            return -1;
     }
 
-    ret = 0;
-
- cleanup:
-    return ret;
+    return 0;
 }
@@ -428,13 +425,13 @@ qemuTeardownHostdevCgroup(virDomainObjPtr vm,
 {
     qemuDomainObjPrivatePtr priv = vm->privateData;
     g_autofree char *path = NULL;
-    int rv, ret = -1;
+    int rv;
 
     if (!virCgroupHasController(priv->cgroup, VIR_CGROUP_CONTROLLER_DEVICES))
         return 0;
 
     if (qemuDomainGetHostdevPath(dev, &path, NULL) < 0)
-        goto cleanup;
+        return -1;
 
     VIR_DEBUG("Cgroup deny %s", path);
     rv = virCgroupDenyDevicePath(priv->cgroup, path,
@@ -442,7 +439,7 @@ qemuTeardownHostdevCgroup(virDomainObjPtr vm,
     virDomainAuditCgroupPath(vm, priv->cgroup,
                              "deny", path, "rwm", rv);
     if (rv < 0)
-        goto cleanup;
+        return -1;
 
     if (qemuHostdevNeedsVFIO(dev) &&
         !qemuDomainNeedsVFIO(vm->def)) {
@@ -452,12 +449,10 @@ qemuTeardownHostdevCgroup(virDomainObjPtr vm,
         virDomainAuditCgroupPath(vm, priv->cgroup, "deny",
                                  QEMU_DEV_VFIO, "rwm", rv);
         if (rv < 0)
-            goto cleanup;
+            return -1;
     }
 
-    ret = 0;
- cleanup:
-    return ret;
+    return 0;
 }
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index be769cb3ff..21a4dad37d 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -12879,7 +12879,6 @@ qemuDomainGetHostdevPath(virDomainHostdevDefPtr dev,
                          char **path,
                          int *perms)
 {
-    int ret = -1;
     virDomainHostdevSubsysUSBPtr usbsrc = &dev->source.subsys.u.usb;
     virDomainHostdevSubsysPCIPtr pcisrc = &dev->source.subsys.u.pci;
     virDomainHostdevSubsysSCSIPtr scsisrc = &dev->source.subsys.u.scsi;
@@ -12903,10 +12902,10 @@ qemuDomainGetHostdevPath(virDomainHostdevDefPtr dev,
                                        pcisrc->addr.slot,
                                        pcisrc->addr.function);
                 if (!pci)
-                    goto cleanup;
+                    return -1;
 
                 if (!(tmpPath = virPCIDeviceGetIOMMUGroupDev(pci)))
-                    goto cleanup;
+                    return -1;
 
                 perm = VIR_CGROUP_DEVICE_RW;
             }
@@ -12919,7 +12918,7 @@ qemuDomainGetHostdevPath(virDomainHostdevDefPtr dev,
                                    usbsrc->device,
                                    NULL);
             if (!usb)
-                goto cleanup;
+                return -1;
 
             tmpPath = g_strdup(virUSBDeviceGetPath(usb));
             perm = VIR_CGROUP_DEVICE_RW;
@@ -12940,7 +12939,7 @@ qemuDomainGetHostdevPath(virDomainHostdevDefPtr dev,
                                         dev->shareable);
                 if (!scsi)
-                    goto cleanup;
+                    return -1;
 
                 tmpPath = g_strdup(virSCSIDeviceGetPath(scsi));
                 perm = virSCSIDeviceGetReadonly(scsi) ?
@@ -12952,7 +12951,7 @@ qemuDomainGetHostdevPath(virDomainHostdevDefPtr dev,
             if (hostsrc->protocol ==
                 VIR_DOMAIN_HOSTDEV_SUBSYS_SCSI_HOST_PROTOCOL_TYPE_VHOST) {
                 if (!(host = virSCSIVHostDeviceNew(hostsrc->wwpn)))
-                    goto cleanup;
+                    return -1;
 
                 tmpPath = g_strdup(virSCSIVHostDeviceGetPath(host));
                 perm = VIR_CGROUP_DEVICE_RW;
@@ -12962,7 +12961,7 @@ qemuDomainGetHostdevPath(virDomainHostdevDefPtr dev,
         case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_MDEV:
             if (!(tmpPath = virMediatedDeviceGetIOMMUGroupDev(mdevsrc->uuidstr)))
-                goto cleanup;
+                return -1;
 
             perm = VIR_CGROUP_DEVICE_RW;
             break;
@@ -12980,9 +12979,7 @@ qemuDomainGetHostdevPath(virDomainHostdevDefPtr dev,
     *path = g_steal_pointer(&tmpPath);
     if (perms)
         *perms = perm;
-    ret = 0;
- cleanup:
-    return ret;
+    return 0;
 }
@@ -13481,22 +13478,19 @@ qemuDomainSetupHostdev(virQEMUDriverConfigPtr cfg G_GNUC_UNUSED,
                        virDomainHostdevDefPtr dev,
                        const struct qemuDomainCreateDeviceData *data)
 {
-    int ret = -1;
     g_autofree char *path = NULL;
 
     if (qemuDomainGetHostdevPath(dev, &path, NULL) < 0)
-        goto cleanup;
+        return -1;
 
     if (qemuDomainCreateDevice(path, data, false) < 0)
-        goto cleanup;
+        return -1;
 
     if (qemuHostdevNeedsVFIO(dev) &&
         qemuDomainCreateDevice(QEMU_DEV_VFIO, data, false) < 0)
-        goto cleanup;
+        return -1;
 
-    ret = 0;
- cleanup:
-    return ret;
+    return 0;
 }
@@ -14529,23 +14523,20 @@ int
 qemuDomainNamespaceSetupHostdev(virDomainObjPtr vm,
                                 virDomainHostdevDefPtr hostdev)
 {
-    int ret = -1;
     g_autofree char *path = NULL;
 
     if (qemuDomainGetHostdevPath(hostdev, &path, NULL) < 0)
-        goto cleanup;
+        return -1;
 
     if (qemuDomainNamespaceMknodPath(vm, path) < 0)
-        goto cleanup;
+        return -1;
 
     if (qemuHostdevNeedsVFIO(hostdev) &&
         !qemuDomainNeedsVFIO(vm->def) &&
         qemuDomainNamespaceMknodPath(vm, QEMU_DEV_VFIO) < 0)
-        goto cleanup;
+        return -1;
 
-    ret = 0;
- cleanup:
-    return ret;
+    return 0;
 }
@@ -14564,23 +14555,20 @@ int
 qemuDomainNamespaceTeardownHostdev(virDomainObjPtr vm,
                                    virDomainHostdevDefPtr hostdev)
 {
-    int ret = -1;
     g_autofree char *path = NULL;
 
     if (qemuDomainGetHostdevPath(hostdev, &path, NULL) < 0)
-        goto cleanup;
+        return -1;
 
     if (qemuDomainNamespaceUnlinkPath(vm, path) < 0)
-        goto cleanup;
+        return -1;
 
     if (qemuHostdevNeedsVFIO(hostdev) &&
         !qemuDomainNeedsVFIO(vm->def) &&
         qemuDomainNamespaceUnlinkPath(vm, QEMU_DEV_VFIO) < 0)
-        goto cleanup;
+        return -1;
 
-    ret = 0;
- cleanup:
-    return ret;
+    return 0;
 }

-- 
2.23.0

On 12/2/19 9:26 AM, Michal Privoznik wrote:
> Previous patches rendered some of the 'cleanup' labels needless. Drop
> them.
> 
> Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
> ---
>  src/qemu/qemu_cgroup.c | 25 +++++++++------------
>  src/qemu/qemu_domain.c | 50 ++++++++++++++++--------------------------
>  2 files changed, 29 insertions(+), 46 deletions(-)
Reviewed-by: Cole Robinson <crobinso@redhat.com>

- Cole

Sometimes we have a PCI address and not a fully allocated virPCIDevice,
and yet we still want to know its /dev/vfio/N path. Introduce the
virPCIDeviceAddressGetIOMMUGroupDev() function exactly for that.

Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
 src/libvirt_private.syms |  1 +
 src/qemu/qemu_domain.c   | 11 +----------
 src/util/virpci.c        | 15 +++++++++++++++
 src/util/virpci.h        |  1 +
 4 files changed, 18 insertions(+), 10 deletions(-)

diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms
index 8fe0bf9365..0988f36c20 100644
--- a/src/libvirt_private.syms
+++ b/src/libvirt_private.syms
@@ -2692,6 +2692,7 @@ virPCIDeviceAddressAsString;
 virPCIDeviceAddressEqual;
 virPCIDeviceAddressFree;
 virPCIDeviceAddressGetIOMMUGroupAddresses;
+virPCIDeviceAddressGetIOMMUGroupDev;
 virPCIDeviceAddressGetIOMMUGroupNum;
 virPCIDeviceAddressGetSysfsFile;
 virPCIDeviceAddressIOMMUGroupIterate;
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index 21a4dad37d..59e98de2d2 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -12884,12 +12884,10 @@ qemuDomainGetHostdevPath(virDomainHostdevDefPtr dev,
     virDomainHostdevSubsysSCSIPtr scsisrc = &dev->source.subsys.u.scsi;
     virDomainHostdevSubsysSCSIVHostPtr hostsrc = &dev->source.subsys.u.scsi_host;
     virDomainHostdevSubsysMediatedDevPtr mdevsrc = &dev->source.subsys.u.mdev;
-    g_autoptr(virPCIDevice) pci = NULL;
     g_autoptr(virUSBDevice) usb = NULL;
     g_autoptr(virSCSIDevice) scsi = NULL;
     g_autoptr(virSCSIVHostDevice) host = NULL;
     g_autofree char *tmpPath = NULL;
-    g_autofree int *tmpPerms = NULL;
     int perm = 0;
 
     switch ((virDomainHostdevMode) dev->mode) {
@@ -12897,14 +12895,7 @@ qemuDomainGetHostdevPath(virDomainHostdevDefPtr dev,
         switch ((virDomainHostdevSubsysType)dev->source.subsys.type) {
         case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI:
             if (pcisrc->backend == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_VFIO) {
-                pci = virPCIDeviceNew(pcisrc->addr.domain,
-                                      pcisrc->addr.bus,
-                                      pcisrc->addr.slot,
-                                      pcisrc->addr.function);
-                if (!pci)
-                    return -1;
-
-                if (!(tmpPath = virPCIDeviceGetIOMMUGroupDev(pci)))
+                if (!(tmpPath = virPCIDeviceAddressGetIOMMUGroupDev(&pcisrc->addr)))
                     return -1;
 
                 perm = VIR_CGROUP_DEVICE_RW;
diff --git a/src/util/virpci.c b/src/util/virpci.c
index 9bea5a20d0..b0a4107551 100644
--- a/src/util/virpci.c
+++ b/src/util/virpci.c
@@ -1942,6 +1942,21 @@ virPCIDeviceAddressGetIOMMUGroupNum(virPCIDeviceAddressPtr addr)
 }
 
 
+char *
+virPCIDeviceAddressGetIOMMUGroupDev(const virPCIDeviceAddress *devAddr)
+{
+    g_autoptr(virPCIDevice) pci = NULL;
+
+    if (!(pci = virPCIDeviceNew(devAddr->domain,
+                                devAddr->bus,
+                                devAddr->slot,
+                                devAddr->function)))
+        return NULL;
+
+    return virPCIDeviceGetIOMMUGroupDev(pci);
+}
+
+
 /* virPCIDeviceGetIOMMUGroupDev - return the name of the device used
  * to control this PCI device's group (e.g. "/dev/vfio/15")
  */
diff --git a/src/util/virpci.h b/src/util/virpci.h
index cfb4581edf..1c94dc307c 100644
--- a/src/util/virpci.h
+++ b/src/util/virpci.h
@@ -195,6 +195,7 @@ int virPCIDeviceAddressGetIOMMUGroupAddresses(virPCIDeviceAddressPtr devAddr,
                                               virPCIDeviceAddressPtr **iommuGroupDevices,
                                               size_t *nIommuGroupDevices);
 int virPCIDeviceAddressGetIOMMUGroupNum(virPCIDeviceAddressPtr addr);
+char *virPCIDeviceAddressGetIOMMUGroupDev(const virPCIDeviceAddress *devAddr);
 char *virPCIDeviceGetIOMMUGroupDev(virPCIDevicePtr dev);
 
 int virPCIDeviceIsAssignable(virPCIDevicePtr dev,

-- 
2.23.0

On 12/2/19 9:26 AM, Michal Privoznik wrote:
Sometimes we have a PCI address but not a fully allocated virPCIDevice, and yet we still want to know its /dev/vfio/N path. Introduce the virPCIDeviceAddressGetIOMMUGroupDev() function exactly for that.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/libvirt_private.syms | 1 + src/qemu/qemu_domain.c | 11 +---------- src/util/virpci.c | 15 +++++++++++++++ src/util/virpci.h | 1 + 4 files changed, 18 insertions(+), 10 deletions(-)
Reviewed-by: Cole Robinson <crobinso@redhat.com> - Cole

In near future we will have a list of PCI devices we want to detach (held in virPCIDeviceListPtr) but we don't have virDomainHostdevDefPtr. That's okay because virHostdevPreparePCIDevices() works with virPCIDeviceListPtr mostly anyway. And in very few places where it needs virDomainHostdevDefPtr are not interesting for our case. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> ACKed-by: Peter Krempa <pkrempa@redhat.com> --- src/util/virhostdev.c | 48 +++++++++++++++++++++++++++++-------------- 1 file changed, 33 insertions(+), 15 deletions(-) diff --git a/src/util/virhostdev.c b/src/util/virhostdev.c index 39e6b8f49f..b8e6e2bc36 100644 --- a/src/util/virhostdev.c +++ b/src/util/virhostdev.c @@ -714,27 +714,22 @@ virHostdevReattachAllPCIDevices(virHostdevManagerPtr mgr, } } -int -virHostdevPreparePCIDevices(virHostdevManagerPtr mgr, - const char *drv_name, - const char *dom_name, - const unsigned char *uuid, - virDomainHostdevDefPtr *hostdevs, - int nhostdevs, - unsigned int flags) + +static int +virHostdevPreparePCIDevicesImpl(virHostdevManagerPtr mgr, + const char *drv_name, + const char *dom_name, + const unsigned char *uuid, + virPCIDeviceListPtr pcidevs, + virDomainHostdevDefPtr *hostdevs, + int nhostdevs, + unsigned int flags) { - g_autoptr(virPCIDeviceList) pcidevs = NULL; int last_processed_hostdev_vf = -1; size_t i; int ret = -1; virPCIDeviceAddressPtr devAddr = NULL; - if (!nhostdevs) - return 0; - - if (!(pcidevs = virHostdevGetPCIHostDeviceList(hostdevs, nhostdevs))) - return -1; - virObjectLock(mgr->activePCIHostdevs); virObjectLock(mgr->inactivePCIHostdevs); @@ -984,6 +979,29 @@ virHostdevPreparePCIDevices(virHostdevManagerPtr mgr, return ret; } + +int +virHostdevPreparePCIDevices(virHostdevManagerPtr mgr, + const char *drv_name, + const char *dom_name, + const unsigned char *uuid, + virDomainHostdevDefPtr *hostdevs, + int nhostdevs, + unsigned int flags) +{ + g_autoptr(virPCIDeviceList) pcidevs = NULL; + + if (!nhostdevs) + return 0; + + if 
(!(pcidevs = virHostdevGetPCIHostDeviceList(hostdevs, nhostdevs))) + return -1; + + return virHostdevPreparePCIDevicesImpl(mgr, drv_name, dom_name, uuid, + pcidevs, hostdevs, nhostdevs, flags); +} + + /* @oldStateDir: * For upgrade purpose: see virHostdevRestoreNetConfig */ -- 2.23.0

On 12/2/19 9:26 AM, Michal Privoznik wrote:
In the near future we will have a list of PCI devices we want to detach (held in a virPCIDeviceListPtr) but no virDomainHostdevDefPtr. That's okay, because virHostdevPreparePCIDevices() works with virPCIDeviceListPtr mostly anyway, and the very few places where it needs a virDomainHostdevDefPtr are not interesting for our case.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> ACKed-by: Peter Krempa <pkrempa@redhat.com> --- src/util/virhostdev.c | 48 +++++++++++++++++++++++++++++-------------- 1 file changed, 33 insertions(+), 15 deletions(-)
Reviewed-by: Cole Robinson <crobinso@redhat.com> - Cole

In near future we will have a list of PCI devices we want to re-attach to the host (held in virPCIDeviceListPtr) but we don't have virDomainHostdevDefPtr. That's okay because virHostdevReAttachPCIDevices() works with virPCIDeviceListPtr mostly anyway. And in very few places where it needs virDomainHostdevDefPtr are not interesting for our case. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> ACKed-by: Peter Krempa <pkrempa@redhat.com> --- src/util/virhostdev.c | 58 +++++++++++++++++++++++++++---------------- 1 file changed, 37 insertions(+), 21 deletions(-) diff --git a/src/util/virhostdev.c b/src/util/virhostdev.c index b8e6e2bc36..94a0185597 100644 --- a/src/util/virhostdev.c +++ b/src/util/virhostdev.c @@ -1002,30 +1002,17 @@ virHostdevPreparePCIDevices(virHostdevManagerPtr mgr, } -/* @oldStateDir: - * For upgrade purpose: see virHostdevRestoreNetConfig - */ -void -virHostdevReAttachPCIDevices(virHostdevManagerPtr mgr, - const char *drv_name, - const char *dom_name, - virDomainHostdevDefPtr *hostdevs, - int nhostdevs, - const char *oldStateDir) +static void +virHostdevReAttachPCIDevicesImpl(virHostdevManagerPtr mgr, + const char *drv_name, + const char *dom_name, + virPCIDeviceListPtr pcidevs, + virDomainHostdevDefPtr *hostdevs, + int nhostdevs, + const char *oldStateDir) { - g_autoptr(virPCIDeviceList) pcidevs = NULL; size_t i; - if (!nhostdevs) - return; - - if (!(pcidevs = virHostdevGetPCIHostDeviceList(hostdevs, nhostdevs))) { - VIR_ERROR(_("Failed to allocate PCI device list: %s"), - virGetLastErrorMessage()); - virResetLastError(); - return; - } - virObjectLock(mgr->activePCIHostdevs); virObjectLock(mgr->inactivePCIHostdevs); @@ -1121,6 +1108,35 @@ virHostdevReAttachPCIDevices(virHostdevManagerPtr mgr, virObjectUnlock(mgr->inactivePCIHostdevs); } + +/* @oldStateDir: + * For upgrade purpose: see virHostdevRestoreNetConfig + */ +void +virHostdevReAttachPCIDevices(virHostdevManagerPtr mgr, + const char *drv_name, + const char *dom_name, + 
virDomainHostdevDefPtr *hostdevs, + int nhostdevs, + const char *oldStateDir) +{ + g_autoptr(virPCIDeviceList) pcidevs = NULL; + + if (!nhostdevs) + return; + + if (!(pcidevs = virHostdevGetPCIHostDeviceList(hostdevs, nhostdevs))) { + VIR_ERROR(_("Failed to allocate PCI device list: %s"), + virGetLastErrorMessage()); + virResetLastError(); + return; + } + + virHostdevReAttachPCIDevicesImpl(mgr, drv_name, dom_name, pcidevs, + hostdevs, nhostdevs, oldStateDir); +} + + int virHostdevUpdateActivePCIDevices(virHostdevManagerPtr mgr, virDomainHostdevDefPtr *hostdevs, -- 2.23.0

On 12/2/19 9:26 AM, Michal Privoznik wrote:
In the near future we will have a list of PCI devices we want to re-attach to the host (held in a virPCIDeviceListPtr) but no virDomainHostdevDefPtr. That's okay, because virHostdevReAttachPCIDevices() works with virPCIDeviceListPtr mostly anyway, and the very few places where it needs a virDomainHostdevDefPtr are not interesting for our case.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> ACKed-by: Peter Krempa <pkrempa@redhat.com> --- src/util/virhostdev.c | 58 +++++++++++++++++++++++++++---------------- 1 file changed, 37 insertions(+), 21 deletions(-)
Reviewed-by: Cole Robinson <crobinso@redhat.com> - Cole

This helper is cleaner than plain memcpy() because one doesn't have to look into virPCIDeviceAddress struct to see if it contains any strings / pointers. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> ACKed-by: Peter Krempa <pkrempa@redhat.com> --- src/libvirt_private.syms | 1 + src/util/virpci.c | 14 ++++++++++++++ src/util/virpci.h | 4 ++++ 3 files changed, 19 insertions(+) diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index 0988f36c20..6e4ef0413b 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -2689,6 +2689,7 @@ virObjectUnref; # util/virpci.h virPCIDeviceAddressAsString; +virPCIDeviceAddressCopy; virPCIDeviceAddressEqual; virPCIDeviceAddressFree; virPCIDeviceAddressGetIOMMUGroupAddresses; diff --git a/src/util/virpci.c b/src/util/virpci.c index b0a4107551..99a8002743 100644 --- a/src/util/virpci.c +++ b/src/util/virpci.c @@ -1340,6 +1340,20 @@ virPCIDeviceAddressEqual(const virPCIDeviceAddress *addr1, return false; } +/** + * virPCIDeviceAddressCopy: + * @dst: where to store address + * @src: source address to copy + * + * Creates a deep copy of given @src address and stores it into + * @dst which has to be pre-allocated by caller. + */ +void virPCIDeviceAddressCopy(virPCIDeviceAddressPtr dst, + const virPCIDeviceAddress *src) +{ + memcpy(dst, src, sizeof(*src)); +} + char * virPCIDeviceAddressAsString(const virPCIDeviceAddress *addr) { diff --git a/src/util/virpci.h b/src/util/virpci.h index 1c94dc307c..f6796fc422 100644 --- a/src/util/virpci.h +++ b/src/util/virpci.h @@ -44,6 +44,7 @@ typedef virZPCIDeviceAddress *virZPCIDeviceAddressPtr; struct _virZPCIDeviceAddress { unsigned int uid; /* exempt from syntax-check */ unsigned int fid; + /* Don't forget to update virPCIDeviceAddressCopy if needed. 
*/ }; #define VIR_PCI_DEVICE_ADDRESS_FMT "%04x:%02x:%02x.%d" @@ -56,6 +57,7 @@ struct _virPCIDeviceAddress { int multi; /* virTristateSwitch */ int extFlags; /* enum virPCIDeviceAddressExtensionFlags */ virZPCIDeviceAddress zpci; + /* Don't forget to update virPCIDeviceAddressCopy if needed. */ }; typedef enum { @@ -236,6 +238,8 @@ bool virPCIDeviceAddressIsEmpty(const virPCIDeviceAddress *addr); bool virPCIDeviceAddressEqual(const virPCIDeviceAddress *addr1, const virPCIDeviceAddress *addr2); +void virPCIDeviceAddressCopy(virPCIDeviceAddressPtr dst, + const virPCIDeviceAddress *src); char *virPCIDeviceAddressAsString(const virPCIDeviceAddress *addr) ATTRIBUTE_NONNULL(1); -- 2.23.0

On 12/2/19 9:26 AM, Michal Privoznik wrote:
This helper is cleaner than a plain memcpy() because one doesn't have to look into the virPCIDeviceAddress struct to see whether it contains any strings / pointers.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> ACKed-by: Peter Krempa <pkrempa@redhat.com> --- src/libvirt_private.syms | 1 + src/util/virpci.c | 14 ++++++++++++++ src/util/virpci.h | 4 ++++ 3 files changed, 19 insertions(+)
Reviewed-by: Cole Robinson <crobinso@redhat.com> - Cole

There are going to be more disk types that are considered unsafe with respect to migration. Therefore, move the error reporting call outside of if() body and rework if-else combo to switch(). Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_migration.c | 25 ++++++++++++++++++++----- 1 file changed, 20 insertions(+), 5 deletions(-) diff --git a/src/qemu/qemu_migration.c b/src/qemu/qemu_migration.c index c38bf342d7..e4760c6ad3 100644 --- a/src/qemu/qemu_migration.c +++ b/src/qemu/qemu_migration.c @@ -1228,6 +1228,8 @@ qemuMigrationSrcIsSafe(virDomainDefPtr def, for (i = 0; i < def->ndisks; i++) { virDomainDiskDefPtr disk = def->disks[i]; const char *src = virDomainDiskGetSource(disk); + int actualType = virStorageSourceGetActualType(disk->src); + bool unsafe = false; /* Disks without any source (i.e. floppies and CD-ROMs) * OR readonly are safe. */ @@ -1241,21 +1243,34 @@ qemuMigrationSrcIsSafe(virDomainDefPtr def, continue; /* However, disks on local FS (e.g. ext4) are not safe. */ - if (virStorageSourceGetActualType(disk->src) == VIR_STORAGE_TYPE_FILE) { + switch ((virStorageType) actualType) { + case VIR_STORAGE_TYPE_FILE: if ((rc = virFileIsSharedFS(src)) < 0) { return false; } else if (rc == 0) { - virReportError(VIR_ERR_MIGRATE_UNSAFE, "%s", - _("Migration without shared storage is unsafe")); - return false; + unsafe = true; } if ((rc = virStorageFileIsClusterFS(src)) < 0) return false; else if (rc == 1) continue; - } else if (virStorageSourceGetActualType(disk->src) == VIR_STORAGE_TYPE_NETWORK) { + break; + case VIR_STORAGE_TYPE_NETWORK: /* But network disks are safe again. 
*/ continue; + + case VIR_STORAGE_TYPE_NONE: + case VIR_STORAGE_TYPE_BLOCK: + case VIR_STORAGE_TYPE_DIR: + case VIR_STORAGE_TYPE_VOLUME: + case VIR_STORAGE_TYPE_LAST: + break; + } + + if (unsafe) { + virReportError(VIR_ERR_MIGRATE_UNSAFE, "%s", + _("Migration without shared storage is unsafe")); + return false; } /* Our code elsewhere guarantees shared disks are either readonly (in -- 2.23.0

On 12/2/19 9:26 AM, Michal Privoznik wrote:
There are going to be more disk types that are considered unsafe with respect to migration. Therefore, move the error reporting call out of the if() body and rework the if-else combo into a switch().
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_migration.c | 25 ++++++++++++++++++++----- 1 file changed, 20 insertions(+), 5 deletions(-)
Reviewed-by: Cole Robinson <crobinso@redhat.com> - Cole

There is this class of PCI devices that act like disks: NVMe. Therefore, they are both PCI devices and disks. While we already have <hostdev/> (and can assign a NVMe device to a domain successfully) we don't have disk representation. There are three problems with PCI assignment in case of a NVMe device: 1) domains with <hostdev/> can't be migrated 2) NVMe device is assigned whole, there's no way to assign only a namespace 3) Because hypervisors see <hostdev/> they don't put block layer on top of it - users don't get all the fancy features like snapshots NVMe namespaces are way of splitting one continuous NVDIMM memory into smaller ones, effectively creating smaller NVMe-s (which can then be partitioned, LVMed, etc.) Because of all of this the following XML was chosen to model a NVMe device: <disk type='nvme' device='disk'> <driver name='qemu' type='raw'/> <source type='pci' managed='yes' namespace='1'> <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/> </source> <target dev='vda' bus='virtio'/> </disk> Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- docs/formatdomain.html.in | 57 +++++++++++++++++++++++-- docs/schemas/domaincommon.rng | 32 ++++++++++++++ tests/qemuxml2argvdata/disk-nvme.xml | 63 ++++++++++++++++++++++++++++ 3 files changed, 149 insertions(+), 3 deletions(-) create mode 100644 tests/qemuxml2argvdata/disk-nvme.xml diff --git a/docs/formatdomain.html.in b/docs/formatdomain.html.in index 6df4a8b26e..fe871d933f 100644 --- a/docs/formatdomain.html.in +++ b/docs/formatdomain.html.in @@ -2944,6 +2944,13 @@ </backingStore> <target dev='vdd' bus='virtio'/> </disk> + <disk type='nvme' device='disk'> + <driver name='qemu' type='raw'/> + <source type='pci' managed='yes' namespace='1'> + <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/> + </source> + <target dev='vde' bus='virtio'/> + </disk> </devices> ...</pre> @@ -2957,7 +2964,8 @@ Valid values are "file", "block", "dir" (<span class="since">since 0.7.5</span>), 
"network" (<span class="since">since 0.8.7</span>), or - "volume" (<span class="since">since 1.0.5</span>) + "volume" (<span class="since">since 1.0.5</span>), or + "nvme" (<span class="since">since 5.6.0</span>) and refer to the underlying source for the disk. <span class="since">Since 0.0.3</span> </dd> @@ -3140,6 +3148,43 @@ <span class="since">Since 1.0.5</span> </p> </dd> + <dt><code>nvme</code></dt> + <dd> + To specify the disk source for an NVMe disk the <code>source</code> + element has the following attributes: + <dl> + <dt><code>type</code></dt> + <dd>The type of address specified in the <code>address</code> + sub-element. Currently, only the <code>pci</code> value is + accepted. + </dd> + + <dt><code>managed</code></dt> + <dd>This attribute instructs libvirt to detach the NVMe + controller automatically on domain startup (<code>yes</code>) + or expect the controller to be detached by the system + administrator (<code>no</code>). + </dd> + + <dt><code>namespace</code></dt> + <dd>The namespace ID which should be assigned to the domain. + According to the NVMe standard, namespace numbers start from 1, + inclusive. + </dd> + </dl> + + The difference between <code><disk type='nvme'></code> + and <code><hostdev/></code> is that the latter is plain + host device assignment with all its limitations (e.g. no live + migration), while the former makes the hypervisor run the NVMe + disk through its block layer, thus enabling all + features provided by the layer (e.g. snapshots, domain + migration, etc.). Moreover, since the NVMe disk is unbound + from its PCI driver, the host kernel storage stack is not + involved (compared to passing, say, <code>/dev/nvme0n1</code> via + <code><disk type='block'></code>) and therefore lower + latencies can be achieved.
+ </dd> </dl> With "file", "block", and "volume", one or more optional sub-elements <code>seclabel</code>, <a href="#seclabel">described @@ -3302,11 +3347,17 @@ initiator IQN needed to access the source via mandatory attribute <code>name</code>. </dd> + <dt><code>address</code></dt> + <dd>For disk of type <code>nvme</code> this element + specifies the PCI address of the host NVMe + controller. + <span class="since">Since 5.6.0</span> + </dd> </dl> <p> - For a "file" or "volume" disk type which represents a cdrom or floppy - (the <code>device</code> attribute), it is possible to define + For a "file" or "volume" disk type which represents a cdrom or + floppy (the <code>device</code> attribute), it is possible to define policy what to do with the disk if the source file is not accessible. (NB, <code>startupPolicy</code> is not valid for "volume" disk unless the specified storage volume is of "file" type). This is done by the diff --git a/docs/schemas/domaincommon.rng b/docs/schemas/domaincommon.rng index a83c9ae7a5..4b0638a836 100644 --- a/docs/schemas/domaincommon.rng +++ b/docs/schemas/domaincommon.rng @@ -1603,6 +1603,7 @@ <ref name="diskSourceDir"/> <ref name="diskSourceNetwork"/> <ref name="diskSourceVolume"/> + <ref name="diskSourceNvme"/> </choice> </define> @@ -1918,6 +1919,37 @@ </optional> </define> + <define name="diskSourceNvme"> + <attribute name="type"> + <value>nvme</value> + </attribute> + <optional> + <element name="source"> + <attribute name="type"> + <value>pci</value> + </attribute> + <attribute name="namespace"> + <ref name="uint32"/> + </attribute> + <optional> + <attribute name="managed"> + <ref name="virYesNo"/> + </attribute> + </optional> + <element name="address"> + <ref name="pciaddress"/> + </element> + <ref name="diskSourceCommon"/> + <optional> + <ref name="storageStartupPolicy"/> + </optional> + <optional> + <ref name="encryption"/> + </optional> + </element> + </optional> + </define> + <define name="diskTarget"> <data type="string"> 
<param name="pattern">(ioemu:)?(fd|hd|sd|vd|xvd|ubd)[a-zA-Z0-9_]+</param> diff --git a/tests/qemuxml2argvdata/disk-nvme.xml b/tests/qemuxml2argvdata/disk-nvme.xml new file mode 100644 index 0000000000..66b2d8450e --- /dev/null +++ b/tests/qemuxml2argvdata/disk-nvme.xml @@ -0,0 +1,63 @@ +<domain type='qemu'> + <name>QEMUGuest1</name> + <uuid>c7a5fdbd-edaf-9455-926a-d65c16db1809</uuid> + <memory unit='KiB'>219136</memory> + <currentMemory unit='KiB'>219136</currentMemory> + <vcpu placement='static'>1</vcpu> + <os> + <type arch='i686' machine='pc'>hvm</type> + <boot dev='hd'/> + </os> + <clock offset='utc'/> + <on_poweroff>destroy</on_poweroff> + <on_reboot>restart</on_reboot> + <on_crash>destroy</on_crash> + <devices> + <emulator>/usr/bin/qemu-system-i686</emulator> + <disk type='nvme' device='disk'> + <driver name='qemu' type='raw'/> + <source type='pci' managed='yes' namespace='1'> + <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/> + </source> + <target dev='vda' bus='virtio'/> + <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/> + </disk> + <disk type='nvme' device='disk'> + <driver name='qemu' type='raw'/> + <source type='pci' managed='yes' namespace='2'> + <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/> + </source> + <target dev='vdb' bus='virtio'/> + <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/> + </disk> + <disk type='nvme' device='disk'> + <driver name='qemu' type='raw'/> + <source type='pci' managed='no' namespace='1'> + <address domain='0x0000' bus='0x02' slot='0x00' function='0x0'/> + </source> + <target dev='vdc' bus='virtio'/> + <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/> + </disk> + <disk type='nvme' device='disk'> + <driver name='qemu' type='qcow2' cache='none'/> + <source type='pci' managed='no' namespace='2'> + <address domain='0x0001' bus='0x02' slot='0x00' function='0x0'/> + <encryption format='luks'> + <secret type='passphrase' 
uuid='0a81f5b2-8403-7b23-c8d6-21ccc2f80d6f'/> + </encryption> + </source> + <target dev='vdd' bus='virtio'/> + <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/> + </disk> + <controller type='usb' index='0'> + <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/> + </controller> + <controller type='pci' index='0' model='pci-root'/> + <controller type='scsi' index='0' model='virtio-scsi'> + <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/> + </controller> + <input type='mouse' bus='ps2'/> + <input type='keyboard' bus='ps2'/> + <memballoon model='none'/> + </devices> +</domain> -- 2.23.0

On 12/2/19 9:26 AM, Michal Privoznik wrote:
There is this class of PCI devices that act like disks: NVMe. Therefore, they are both PCI devices and disks. While we already have <hostdev/> (and can assign an NVMe device to a domain successfully) we don't have a disk representation. There are three problems with PCI assignment in the case of an NVMe device:
1) domains with <hostdev/> can't be migrated
2) NVMe device is assigned whole, there's no way to assign only a namespace
3) Because hypervisors see <hostdev/> they don't put block layer on top of it - users don't get all the fancy features like snapshots
NVMe namespaces are a way of splitting one contiguous NVMe storage device into smaller ones, effectively creating smaller NVMe-s (which can then be partitioned, LVMed, etc.)
Because of all of this the following XML was chosen to model a NVMe device:
<disk type='nvme' device='disk'> <driver name='qemu' type='raw'/> <source type='pci' managed='yes' namespace='1'> <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/> </source> <target dev='vda' bus='virtio'/> </disk>
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- docs/formatdomain.html.in | 57 +++++++++++++++++++++++-- docs/schemas/domaincommon.rng | 32 ++++++++++++++ tests/qemuxml2argvdata/disk-nvme.xml | 63 ++++++++++++++++++++++++++++ 3 files changed, 149 insertions(+), 3 deletions(-) create mode 100644 tests/qemuxml2argvdata/disk-nvme.xml
diff --git a/docs/formatdomain.html.in b/docs/formatdomain.html.in index 6df4a8b26e..fe871d933f 100644 --- a/docs/formatdomain.html.in +++ b/docs/formatdomain.html.in @@ -2944,6 +2944,13 @@ </backingStore> <target dev='vdd' bus='virtio'/> </disk> + <disk type='nvme' device='disk'> + <driver name='qemu' type='raw'/> + <source type='pci' managed='yes' namespace='1'> + <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/> + </source> + <target dev='vde' bus='virtio'/> + </disk> </devices> ...</pre>
@@ -2957,7 +2964,8 @@ Valid values are "file", "block", "dir" (<span class="since">since 0.7.5</span>), "network" (<span class="since">since 0.8.7</span>), or - "volume" (<span class="since">since 1.0.5</span>) + "volume" (<span class="since">since 1.0.5</span>), or + "nvme" (<span class="since">since 5.6.0</span>)
6.0.0 or whatever version this will land in
and refer to the underlying source for the disk. <span class="since">Since 0.0.3</span> </dd> @@ -3140,6 +3148,43 @@ <span class="since">Since 1.0.5</span> </p> </dd> + <dt><code>nvme</code></dt> + <dd> + To specify disk source for NVMe disk the <code>source</code> + element has the following attributes: + <dl> + <dt><code>type</code></dt> + <dd>The type of address specified in <code>address</code> + sub-element. Currently, only <code>pci</code> value is + accepted. + </dd> + + <dt><code>managed</code></dt> + <dd>This attribute instructs libvirt to detach NVMe + controller automatically on domain startup (<code>yes</code>) + or expect the controller to be detached by system + administrator (<code>no</code>). + </dd> + + <dt><code>namespace</code></dt> + <dd>The namespace ID which should be assigned to the domain. + According to NVMe standard, namespace numbers start from 1, + including. + </dd> + </dl> + + The difference between <code><disk type='nvme'></code> + and <code><hostdev/></code> is that the latter is plain + host device assignment with all its limitations (e.g. no live + migration), while the former makes hypervisor to run the NVMe + disk through hypervisor's block layer thus enabling all + features provided by the layer (e.g. snapshots, domain + migration, etc.). Moreover, since the NVMe disk is unbinded + from its PCI driver, the host kernel storage stack is not + involved (compared to passing say <code>/dev/nvme0n1</code> via + <code><disk type='block'></code> and therefore lower + latencies can be achieved. + </dd> </dl> With "file", "block", and "volume", one or more optional sub-elements <code>seclabel</code>, <a href="#seclabel">described @@ -3302,11 +3347,17 @@ initiator IQN needed to access the source via mandatory attribute <code>name</code>. </dd> + <dt><code>address</code></dt> + <dd>For disk of type <code>nvme</code> this element + specifies the PCI address of the host NVMe + controller. + <span class="since">Since 5.6.0</span>
Same
+ </dd> </dl>
<p> - For a "file" or "volume" disk type which represents a cdrom or floppy - (the <code>device</code> attribute), it is possible to define + For a "file" or "volume" disk type which represents a cdrom or + floppy (the <code>device</code> attribute), it is possible to define
Stray change? Also, in the test XML you need to "s/qemu-system-i686/qemu-system-i386/" or you'll hit a weird error. And VIR_TEST_REGENERATE_OUTPUT is also busted; see my patches elsewhere on this list. Reviewed-by: Cole Robinson <crobinso@redhat.com> - Cole

On 12/9/19 11:55 PM, Cole Robinson wrote:
On 12/2/19 9:26 AM, Michal Privoznik wrote:
<p> - For a "file" or "volume" disk type which represents a cdrom or floppy - (the <code>device</code> attribute), it is possible to define + For a "file" or "volume" disk type which represents a cdrom or + floppy (the <code>device</code> attribute), it is possible to define
Stray change?
Oh right. I've realigned this area when adding the address description. But this change does not belong here.
Also, in the test XML you need to "s/qemu-system-i686/qemu-system-i386/" or you'll hit a weird error. And VIR_TEST_REGENERATE_OUTPUT is also busted, see my patches elsewhere on this list.
Yeah, I've noticed Dan posted patches after these. I've fixed that locally but never replied to this patch. Sorry.
Reviewed-by: Cole Robinson <crobinso@redhat.com>
Thanks, Michal

On 12/2/19 3:26 PM, Michal Privoznik wrote:
Because of all of this the following XML was chosen to model a NVMe device:
<disk type='nvme' device='disk'> <driver name='qemu' type='raw'/> <source type='pci' managed='yes' namespace='1'> <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/> </source> <target dev='vda' bus='virtio'/> </disk>
Last week I discussed this on IRC with Dan and Maxim (both CC'ed) and there was a suggestion to accept a /dev/nvmeXXX path instead of a PCI address. The reasoning was that there is a tool that Maxim wrote (alas, not merged into qemu/kvm yet) that acts like a standalone daemon which does the VFIO magic and then serves qemus connecting to it (this allows an NVMe disk to be shared between multiple qemus, which is currently not allowed due to a VFIO restriction). And if we accepted /dev/nvmeXXX here we could change the backend less invasively - we could use either qemu's -drive nvme://XXXX or the new tool. On the other hand, /dev/nvmeXXX (even though it may be a bit more user friendly) wouldn't work if the host kernel doesn't have the NVMe driver or if the disk is already detached - hence the PCI address as I have it here. Note that sysfs offers translations both ways [PCI address, namespace] <-> /dev/nvmeXXX so that shouldn't be a limitation. Thoughts? Michal

To simplify implementation, some restrictions are added. For instance, an NVMe disk can't go to any bus but virtio and has to be type of 'disk' and can't have startupPolicy set. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/conf/domain_conf.c | 95 ++++++++++++++++++++++++++ src/libvirt_private.syms | 1 + src/libxl/xen_xl.c | 1 + src/qemu/qemu_block.c | 2 + src/qemu/qemu_command.c | 1 + src/qemu/qemu_driver.c | 4 ++ src/qemu/qemu_migration.c | 5 ++ src/util/virstoragefile.c | 57 ++++++++++++++++ src/util/virstoragefile.h | 17 +++++ tests/qemuxml2xmloutdata/disk-nvme.xml | 1 + tests/qemuxml2xmltest.c | 1 + 11 files changed, 185 insertions(+) create mode 120000 tests/qemuxml2xmloutdata/disk-nvme.xml diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 9580884747..58879a66ba 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -5135,6 +5135,11 @@ virDomainDiskDefPostParse(virDomainDiskDefPtr disk, return -1; } + if (disk->src->type == VIR_STORAGE_TYPE_NVME) { + if (disk->src->nvme->managed == VIR_TRISTATE_BOOL_ABSENT) + disk->src->nvme->managed = VIR_TRISTATE_BOOL_YES; + } + if (disk->info.type == VIR_DOMAIN_DEVICE_ADDRESS_TYPE_NONE && virDomainDiskDefAssignAddress(xmlopt, disk, def) < 0) { return -1; @@ -9271,6 +9276,75 @@ virDomainDiskSourceNetworkParse(xmlNodePtr node, } +static int +virDomainDiskSourceNVMeParse(xmlNodePtr node, + xmlXPathContextPtr ctxt, + virStorageSourcePtr src) +{ + g_autoptr(virStorageSourceNVMeDef) nvme = NULL; + g_autofree char *type = NULL; + g_autofree char *namespace = NULL; + g_autofree char *managed = NULL; + xmlNodePtr address; + + nvme = g_new0(virStorageSourceNVMeDef, 1); + + if (!(type = virXMLPropString(node, "type"))) { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("missing 'type' attribute to disk source")); + return -1; + } + + if (STRNEQ(type, "pci")) { + virReportError(VIR_ERR_XML_ERROR, + _("unsupported source type '%s'"), + type); + return -1; + } + + if (!(namespace = 
virXMLPropString(node, "namespace"))) { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("missing 'namespace' attribute to disk source")); + return -1; + } + + if (virStrToLong_ull(namespace, NULL, 10, &nvme->namespace) < 0) { + virReportError(VIR_ERR_XML_ERROR, + _("malformed namespace '%s'"), + namespace); + return -1; + } + + /* NVMe namespaces start from 1 */ + if (nvme->namespace == 0) { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("NVMe namespace can't be zero")); + return -1; + } + + if ((managed = virXMLPropString(node, "managed"))) { + if ((nvme->managed = virTristateBoolTypeFromString(managed)) <= 0) { + virReportError(VIR_ERR_XML_ERROR, + _("malformed managed value '%s'"), + managed); + return -1; + } + } + + if (!(address = virXPathNode("./address", ctxt))) { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("NVMe disk source is missing address")); + return -1; + } + + if (virPCIDeviceAddressParseXML(address, &nvme->pciAddr) < 0) + return -1; + + src->nvme = g_steal_pointer(&nvme); + return 0; +} + + static int virDomainDiskSourcePRParse(xmlNodePtr node, xmlXPathContextPtr ctxt, @@ -9369,6 +9443,10 @@ virDomainStorageSourceParse(xmlNodePtr node, if (virDomainDiskSourcePoolDefParse(node, &src->srcpool) < 0) return -1; break; + case VIR_STORAGE_TYPE_NVME: + if (virDomainDiskSourceNVMeParse(node, ctxt, src) < 0) + return -1; + break; case VIR_STORAGE_TYPE_NONE: case VIR_STORAGE_TYPE_LAST: virReportError(VIR_ERR_INTERNAL_ERROR, @@ -24102,6 +24180,19 @@ virDomainDiskSourceFormatNetwork(virBufferPtr attrBuf, } +static void +virDomainDiskSourceNVMeFormat(virBufferPtr attrBuf, + virBufferPtr childBuf, + const virStorageSourceNVMeDef *nvme) +{ + virBufferAddLit(attrBuf, " type='pci'"); + virBufferAsprintf(attrBuf, " managed='%s'", + virTristateBoolTypeToString(nvme->managed)); + virBufferAsprintf(attrBuf, " namespace='%llu'", nvme->namespace); + virPCIDeviceAddressFormat(childBuf, nvme->pciAddr, false); +} + + static int 
virDomainDiskSourceFormatPrivateData(virBufferPtr buf, virStorageSourcePtr src, @@ -24178,6 +24269,10 @@ virDomainDiskSourceFormat(virBufferPtr buf, break; + case VIR_STORAGE_TYPE_NVME: + virDomainDiskSourceNVMeFormat(&attrBuf, &childBuf, src->nvme); + break; + case VIR_STORAGE_TYPE_NONE: case VIR_STORAGE_TYPE_LAST: virReportError(VIR_ERR_INTERNAL_ERROR, diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index 6e4ef0413b..d37d3867a6 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -3078,6 +3078,7 @@ virStorageSourceNetworkAssignDefaultPorts; virStorageSourceNew; virStorageSourceNewFromBacking; virStorageSourceNewFromBackingAbsolute; +virStorageSourceNVMeDefFree; virStorageSourceParseRBDColonString; virStorageSourcePoolDefFree; virStorageSourcePoolModeTypeFromString; diff --git a/src/libxl/xen_xl.c b/src/libxl/xen_xl.c index 396adf6dac..8e9770c367 100644 --- a/src/libxl/xen_xl.c +++ b/src/libxl/xen_xl.c @@ -1649,6 +1649,7 @@ xenFormatXLDiskSrc(virStorageSourcePtr src, char **srcstr) break; case VIR_STORAGE_TYPE_VOLUME: + case VIR_STORAGE_TYPE_NVME: case VIR_STORAGE_TYPE_NONE: case VIR_STORAGE_TYPE_LAST: break; diff --git a/src/qemu/qemu_block.c b/src/qemu/qemu_block.c index ada2c52947..8eeb9c3c57 100644 --- a/src/qemu/qemu_block.c +++ b/src/qemu/qemu_block.c @@ -1060,6 +1060,7 @@ qemuBlockStorageSourceGetBackendProps(virStorageSourcePtr src, break; case VIR_STORAGE_TYPE_VOLUME: + case VIR_STORAGE_TYPE_NVME: case VIR_STORAGE_TYPE_NONE: case VIR_STORAGE_TYPE_LAST: return NULL; @@ -2331,6 +2332,7 @@ qemuBlockStorageSourceCreateGetStorageProps(virStorageSourcePtr src, case VIR_STORAGE_TYPE_BLOCK: case VIR_STORAGE_TYPE_DIR: case VIR_STORAGE_TYPE_VOLUME: + case VIR_STORAGE_TYPE_NVME: return 0; case VIR_STORAGE_TYPE_NONE: diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index 3465d28b84..c61e38492a 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -1129,6 +1129,7 @@ 
qemuGetDriveSourceString(virStorageSourcePtr src, break; case VIR_STORAGE_TYPE_VOLUME: + case VIR_STORAGE_TYPE_NVME: case VIR_STORAGE_TYPE_NONE: case VIR_STORAGE_TYPE_LAST: break; diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index b5300241a8..72ee89db1f 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -14776,6 +14776,7 @@ qemuDomainSnapshotPrepareDiskExternalInactive(virDomainSnapshotDiskDefPtr snapdi case VIR_STORAGE_TYPE_DIR: case VIR_STORAGE_TYPE_VOLUME: + case VIR_STORAGE_TYPE_NVME: case VIR_STORAGE_TYPE_NONE: case VIR_STORAGE_TYPE_LAST: virReportError(VIR_ERR_INTERNAL_ERROR, @@ -14792,6 +14793,7 @@ qemuDomainSnapshotPrepareDiskExternalInactive(virDomainSnapshotDiskDefPtr snapdi case VIR_STORAGE_TYPE_NETWORK: case VIR_STORAGE_TYPE_DIR: case VIR_STORAGE_TYPE_VOLUME: + case VIR_STORAGE_TYPE_NVME: case VIR_STORAGE_TYPE_NONE: case VIR_STORAGE_TYPE_LAST: virReportError(VIR_ERR_INTERNAL_ERROR, @@ -14859,6 +14861,7 @@ qemuDomainSnapshotPrepareDiskExternalActive(virDomainSnapshotDiskDefPtr snapdisk case VIR_STORAGE_TYPE_DIR: case VIR_STORAGE_TYPE_VOLUME: + case VIR_STORAGE_TYPE_NVME: case VIR_STORAGE_TYPE_NONE: case VIR_STORAGE_TYPE_LAST: virReportError(VIR_ERR_INTERNAL_ERROR, @@ -14986,6 +14989,7 @@ qemuDomainSnapshotPrepareDiskInternal(virDomainDiskDefPtr disk, case VIR_STORAGE_TYPE_DIR: case VIR_STORAGE_TYPE_VOLUME: + case VIR_STORAGE_TYPE_NVME: case VIR_STORAGE_TYPE_NONE: case VIR_STORAGE_TYPE_LAST: virReportError(VIR_ERR_INTERNAL_ERROR, diff --git a/src/qemu/qemu_migration.c b/src/qemu/qemu_migration.c index e4760c6ad3..3705b263a6 100644 --- a/src/qemu/qemu_migration.c +++ b/src/qemu/qemu_migration.c @@ -227,6 +227,7 @@ qemuMigrationDstPrecreateDisk(virConnectPtr conn, case VIR_STORAGE_TYPE_BLOCK: case VIR_STORAGE_TYPE_DIR: + case VIR_STORAGE_TYPE_NVME: case VIR_STORAGE_TYPE_NONE: case VIR_STORAGE_TYPE_LAST: virReportError(VIR_ERR_INTERNAL_ERROR, @@ -1259,6 +1260,10 @@ qemuMigrationSrcIsSafe(virDomainDefPtr def, /* But network 
disks are safe again. */ continue; + case VIR_STORAGE_TYPE_NVME: + unsafe = true; + break; + case VIR_STORAGE_TYPE_NONE: case VIR_STORAGE_TYPE_BLOCK: case VIR_STORAGE_TYPE_DIR: diff --git a/src/util/virstoragefile.c b/src/util/virstoragefile.c index 77f885ef1d..b4ec9624be 100644 --- a/src/util/virstoragefile.c +++ b/src/util/virstoragefile.c @@ -57,6 +57,7 @@ VIR_ENUM_IMPL(virStorage, "dir", "network", "volume", + "nvme", ); VIR_ENUM_IMPL(virStorageFileFormat, @@ -2094,6 +2095,49 @@ virStoragePRDefCopy(virStoragePRDefPtr src) } +static virStorageSourceNVMeDefPtr +virStorageSourceNVMeDefCopy(const virStorageSourceNVMeDef *src) +{ + virStorageSourceNVMeDefPtr ret = NULL; + + ret = g_new0(virStorageSourceNVMeDef, 1); + + ret->namespace = src->namespace; + ret->managed = src->managed; + virPCIDeviceAddressCopy(&ret->pciAddr, &src->pciAddr); + return ret; +} + + +static bool +virStorageSourceNVMeDefIsEqual(const virStorageSourceNVMeDef *a, + const virStorageSourceNVMeDef *b) +{ + if (!a && !b) + return true; + + if (!a || !b) + return false; + + if (a->namespace != b->namespace || + a->managed != b->managed || + !virPCIDeviceAddressEqual(&a->pciAddr, &b->pciAddr)) + return false; + + return true; +} + + +void +virStorageSourceNVMeDefFree(virStorageSourceNVMeDefPtr def) +{ + if (!def) + return; + + VIR_FREE(def); +} + + virSecurityDeviceLabelDefPtr virStorageSourceGetSecurityLabelDef(virStorageSourcePtr src, const char *model) @@ -2292,6 +2336,9 @@ virStorageSourceCopy(const virStorageSource *src, !(def->pr = virStoragePRDefCopy(src->pr))) return NULL; + if (src->nvme) + def->nvme = virStorageSourceNVMeDefCopy(src->nvme); + if (virStorageSourceInitiatorCopy(&def->initiator, &src->initiator) < 0) return NULL; @@ -2350,6 +2397,10 @@ virStorageSourceIsSameLocation(virStorageSourcePtr a, } } + if (a->type == VIR_STORAGE_TYPE_NVME && + !virStorageSourceNVMeDefIsEqual(a->nvme, b->nvme)) + return false; + return true; } @@ -2432,6 +2483,9 @@ virStorageSourceIsLocalStorage(const 
virStorageSource *src) case VIR_STORAGE_TYPE_NETWORK: case VIR_STORAGE_TYPE_VOLUME: + /* While NVMe disks are local, they are not accessible via src->path. + * Therefore, we have to return false here. */ + case VIR_STORAGE_TYPE_NVME: case VIR_STORAGE_TYPE_LAST: case VIR_STORAGE_TYPE_NONE: return false; @@ -2517,6 +2571,7 @@ virStorageSourceClear(virStorageSourcePtr def) VIR_FREE(def->compat); virStorageEncryptionFree(def->encryption); virStoragePRDefFree(def->pr); + virStorageSourceNVMeDefFree(def->nvme); virStorageSourceSeclabelsClear(def); virStoragePermsFree(def->perms); VIR_FREE(def->timestamps); @@ -3773,6 +3828,7 @@ virStorageSourceUpdatePhysicalSize(virStorageSourcePtr src, /* We shouldn't get VOLUME, but the switch requires all cases */ case VIR_STORAGE_TYPE_VOLUME: + case VIR_STORAGE_TYPE_NVME: case VIR_STORAGE_TYPE_NONE: case VIR_STORAGE_TYPE_LAST: return -1; @@ -4229,6 +4285,7 @@ virStorageSourceIsRelative(virStorageSourcePtr src) case VIR_STORAGE_TYPE_NETWORK: case VIR_STORAGE_TYPE_VOLUME: + case VIR_STORAGE_TYPE_NVME: case VIR_STORAGE_TYPE_NONE: case VIR_STORAGE_TYPE_LAST: return false; diff --git a/src/util/virstoragefile.h b/src/util/virstoragefile.h index c357803e50..42ca49bc64 100644 --- a/src/util/virstoragefile.h +++ b/src/util/virstoragefile.h @@ -30,6 +30,7 @@ #include "virutil.h" #include "virsecret.h" #include "virenum.h" +#include "virpci.h" /* Minimum header size required to probe all known formats with * virStorageFileProbeFormat, or obtain metadata from a known format. 
@@ -51,6 +52,7 @@ typedef enum { VIR_STORAGE_TYPE_DIR, VIR_STORAGE_TYPE_NETWORK, VIR_STORAGE_TYPE_VOLUME, + VIR_STORAGE_TYPE_NVME, VIR_STORAGE_TYPE_LAST } virStorageType; @@ -230,6 +232,16 @@ struct _virStorageSourceInitiatorDef { char *iqn; /* Initiator IQN */ }; +typedef struct _virStorageSourceNVMeDef virStorageSourceNVMeDef; +typedef virStorageSourceNVMeDef *virStorageSourceNVMeDefPtr; +struct _virStorageSourceNVMeDef { + unsigned long long namespace; + int managed; /* enum virTristateBool */ + virPCIDeviceAddress pciAddr; + + /* Don't forget to update virStorageSourceNVMeDefCopy */ +}; + typedef struct _virStorageDriverData virStorageDriverData; typedef virStorageDriverData *virStorageDriverDataPtr; @@ -261,6 +273,8 @@ struct _virStorageSource { bool encryptionInherited; virStoragePRDefPtr pr; + virStorageSourceNVMeDefPtr nvme; /* type == VIR_STORAGE_TYPE_NVME */ + virStorageSourceInitiatorDef initiator; virObjectPtr privateData; @@ -416,6 +430,9 @@ bool virStoragePRDefIsManaged(virStoragePRDefPtr prd); bool virStorageSourceChainHasManagedPR(virStorageSourcePtr src); +void virStorageSourceNVMeDefFree(virStorageSourceNVMeDefPtr def); +G_DEFINE_AUTOPTR_CLEANUP_FUNC(virStorageSourceNVMeDef, virStorageSourceNVMeDefFree); + virSecurityDeviceLabelDefPtr virStorageSourceGetSecurityLabelDef(virStorageSourcePtr src, const char *model); diff --git a/tests/qemuxml2xmloutdata/disk-nvme.xml b/tests/qemuxml2xmloutdata/disk-nvme.xml new file mode 120000 index 0000000000..ea9eb267ac --- /dev/null +++ b/tests/qemuxml2xmloutdata/disk-nvme.xml @@ -0,0 +1 @@ +../qemuxml2argvdata/disk-nvme.xml \ No newline at end of file diff --git a/tests/qemuxml2xmltest.c b/tests/qemuxml2xmltest.c index 8b43f35f06..b11f086016 100644 --- a/tests/qemuxml2xmltest.c +++ b/tests/qemuxml2xmltest.c @@ -329,6 +329,7 @@ mymain(void) DO_TEST("disk-network-sheepdog", NONE); DO_TEST("disk-network-vxhs", NONE); DO_TEST("disk-network-tlsx509", NONE); + DO_TEST("disk-nvme", QEMU_CAPS_VIRTIO_SCSI, 
QEMU_CAPS_QCOW2_LUKS); DO_TEST("disk-scsi", QEMU_CAPS_SCSI_LSI, QEMU_CAPS_SCSI_MEGASAS, QEMU_CAPS_SCSI_MPTSAS1068, QEMU_CAPS_SCSI_DISK_WWN); DO_TEST("disk-virtio-scsi-reservations", -- 2.23.0

On 12/2/19 9:26 AM, Michal Privoznik wrote:
To simplify the implementation, some restrictions are added. For instance, an NVMe disk can't go on any bus but virtio, has to be of type 'disk', and can't have startupPolicy set.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Cole Robinson <crobinso@redhat.com>

Arguably the nvme->namespace == 0 check should be in a Validate callback, but it's extremely minor

- Cole

On 12/10/19 12:43 AM, Cole Robinson wrote:
On 12/2/19 9:26 AM, Michal Privoznik wrote:
To simplify the implementation, some restrictions are added. For instance, an NVMe disk can't go on any bus but virtio, has to be of type 'disk', and can't have startupPolicy set.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Cole Robinson <crobinso@redhat.com>
Arguably the nvme->namespace == 0 should be in a Validate callback, but it's extremely minor
Fixed. Initially I thought that namespace == 0 is invalid XML, but I can argue it both ways. And since you lean towards the Validate callback, I've moved the check there. Michal
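For the record, the moved check amounts to the following (an illustrative stand-in, not the actual libvirt validate-callback signature):

```c
#include <assert.h>

/* Illustrative stand-in for the validation logic: NVMe namespaces are
 * numbered starting from 1, so a namespace ID of 0 is rejected.
 * Returns 0 on success, -1 on invalid input (the real code also
 * reports an error via virReportError()). */
static int
nvme_validate_nsid(unsigned long long nsid)
{
    if (nsid == 0)
        return -1;
    return 0;
}
```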

This function will return true if there's a storage source of type VIR_STORAGE_TYPE_NVME, or false otherwise. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> ACKed-by: Peter Krempa <pkrempa@redhat.com> --- src/libvirt_private.syms | 1 + src/util/virstoragefile.c | 14 ++++++++++++++ src/util/virstoragefile.h | 2 ++ 3 files changed, 17 insertions(+) diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index d37d3867a6..a7f94feb0a 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -3057,6 +3057,7 @@ virStoragePRDefIsManaged; virStoragePRDefParseXML; virStorageSourceBackingStoreClear; virStorageSourceChainHasManagedPR; +virStorageSourceChainHasNVMe; virStorageSourceClear; virStorageSourceCopy; virStorageSourceFindByNodeName; diff --git a/src/util/virstoragefile.c b/src/util/virstoragefile.c index b4ec9624be..5a5a16aef7 100644 --- a/src/util/virstoragefile.c +++ b/src/util/virstoragefile.c @@ -2138,6 +2138,20 @@ virStorageSourceNVMeDefFree(virStorageSourceNVMeDefPtr def) } +bool +virStorageSourceChainHasNVMe(const virStorageSource *src) +{ + const virStorageSource *n; + + for (n = src; virStorageSourceIsBacking(n); n = n->backingStore) { + if (n->type == VIR_STORAGE_TYPE_NVME) + return true; + } + + return false; +} + + virSecurityDeviceLabelDefPtr virStorageSourceGetSecurityLabelDef(virStorageSourcePtr src, const char *model) diff --git a/src/util/virstoragefile.h b/src/util/virstoragefile.h index 42ca49bc64..39e50a989d 100644 --- a/src/util/virstoragefile.h +++ b/src/util/virstoragefile.h @@ -433,6 +433,8 @@ virStorageSourceChainHasManagedPR(virStorageSourcePtr src); void virStorageSourceNVMeDefFree(virStorageSourceNVMeDefPtr def); G_DEFINE_AUTOPTR_CLEANUP_FUNC(virStorageSourceNVMeDef, virStorageSourceNVMeDefFree); +bool virStorageSourceChainHasNVMe(const virStorageSource *src); + virSecurityDeviceLabelDefPtr virStorageSourceGetSecurityLabelDef(virStorageSourcePtr src, const char *model); -- 2.23.0

On 12/2/19 9:26 AM, Michal Privoznik wrote:
This function will return true if there's a storage source of type VIR_STORAGE_TYPE_NVME, or false otherwise.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> ACKed-by: Peter Krempa <pkrempa@redhat.com> --- src/libvirt_private.syms | 1 + src/util/virstoragefile.c | 14 ++++++++++++++ src/util/virstoragefile.h | 2 ++ 3 files changed, 17 insertions(+)
Reviewed-by: Cole Robinson <crobinso@redhat.com> - Cole

This function will return true if any of disks (or their backing chain) for given domain contains an NVMe disk. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> ACKed-by: Peter Krempa <pkrempa@redhat.com> --- src/conf/domain_conf.c | 14 ++++++++++++++ src/conf/domain_conf.h | 3 +++ src/libvirt_private.syms | 1 + 3 files changed, 18 insertions(+) diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 58879a66ba..b6e04db0f7 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -31467,6 +31467,20 @@ virDomainDefHasManagedPR(const virDomainDef *def) } +bool +virDomainDefHasNVMeDisk(const virDomainDef *def) +{ + size_t i; + + for (i = 0; i < def->ndisks; i++) { + if (virStorageSourceChainHasNVMe(def->disks[i]->src)) + return true; + } + + return false; +} + + bool virDomainDefHasVFIOHostdev(const virDomainDef *def) { diff --git a/src/conf/domain_conf.h b/src/conf/domain_conf.h index 49cb6970e6..520d2da0d2 100644 --- a/src/conf/domain_conf.h +++ b/src/conf/domain_conf.h @@ -3674,6 +3674,9 @@ virDomainDiskGetDetectZeroesMode(virDomainDiskDiscard discard, bool virDomainDefHasManagedPR(const virDomainDef *def); +bool +virDomainDefHasNVMeDisk(const virDomainDef *def); + bool virDomainDefHasVFIOHostdev(const virDomainDef *def); diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index a7f94feb0a..86ee9cbc53 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -297,6 +297,7 @@ virDomainDefHasManagedPR; virDomainDefHasMdevHostdev; virDomainDefHasMemballoon; virDomainDefHasMemoryHotplug; +virDomainDefHasNVMeDisk; virDomainDefHasUSB; virDomainDefHasVcpusOffline; virDomainDefHasVFIOHostdev; -- 2.23.0

On 12/2/19 9:26 AM, Michal Privoznik wrote:
This function will return true if any of the disks (or their backing chains) of the given domain contains an NVMe disk.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> ACKed-by: Peter Krempa <pkrempa@redhat.com> --- src/conf/domain_conf.c | 14 ++++++++++++++ src/conf/domain_conf.h | 3 +++ src/libvirt_private.syms | 1 + 3 files changed, 18 insertions(+)
Reviewed-by: Cole Robinson <crobinso@redhat.com> - Cole

This module will be used by virHostdevManager and is inspired by the virPCIDevice module. The two are very similar, except for what identifies an NVMe device: its PCI address AND namespace ID. This means that an NVMe device can appear in a domain multiple times, each time with a different namespace. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- po/POTFILES.in | 1 + src/libvirt_private.syms | 18 ++ src/util/Makefile.inc.am | 2 + src/util/virnvme.c | 447 +++++++++++++++++++++++++++++++++++++++ src/util/virnvme.h | 95 +++++++++ 5 files changed, 563 insertions(+) create mode 100644 src/util/virnvme.c create mode 100644 src/util/virnvme.h diff --git a/po/POTFILES.in b/po/POTFILES.in index debb51cd70..3d075201c4 100644 --- a/po/POTFILES.in +++ b/po/POTFILES.in @@ -277,6 +277,7 @@ @SRCDIR@/src/util/virnetlink.c @SRCDIR@/src/util/virnodesuspend.c @SRCDIR@/src/util/virnuma.c +@SRCDIR@/src/util/virnvme.c @SRCDIR@/src/util/virobject.c @SRCDIR@/src/util/virpci.c @SRCDIR@/src/util/virperf.c diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index 86ee9cbc53..aab2b49bdf 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -2664,6 +2664,24 @@ virNumaSetPagePoolSize; virNumaSetupMemoryPolicy; +# util/virnvme.h +virNVMeDeviceAddressGet; +virNVMeDeviceCopy; +virNVMeDeviceFree; +virNVMeDeviceListAdd; +virNVMeDeviceListCount; +virNVMeDeviceListCreateDetachList; +virNVMeDeviceListDel; +virNVMeDeviceListGet; +virNVMeDeviceListLookup; +virNVMeDeviceListLookupIndex; +virNVMeDeviceListNew; +virNVMeDeviceNew; +virNVMeDeviceUsedByClear; +virNVMeDeviceUsedByGet; +virNVMeDeviceUsedBySet; + + # util/virobject.h virClassForObject; virClassForObjectLockable; diff --git a/src/util/Makefile.inc.am b/src/util/Makefile.inc.am index 0855f152fd..838e9479a9 100644 --- a/src/util/Makefile.inc.am +++ b/src/util/Makefile.inc.am @@ -150,6 +150,8 @@ UTIL_SOURCES = \ util/virnetlink.h \ util/virnodesuspend.c \ util/virnodesuspend.h \ + util/virnvme.c \ + util/virnvme.h \
util/virkmod.c \ util/virkmod.h \ util/virnuma.c \ diff --git a/src/util/virnvme.c b/src/util/virnvme.c new file mode 100644 index 0000000000..b8179aa431 --- /dev/null +++ b/src/util/virnvme.c @@ -0,0 +1,447 @@ +/* + * virnvme.c: helper APIs for managing NVMe devices + * + * This library is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * This library is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this library. If not, see + * <http://www.gnu.org/licenses/>. + */ + +#include <config.h> + +#include "virnvme.h" +#include "virobject.h" +#include "virpci.h" +#include "viralloc.h" +#include "virlog.h" +#include "virstring.h" + +VIR_LOG_INIT("util.nvme"); +#define VIR_FROM_THIS VIR_FROM_NONE + +struct _virNVMeDevice { + virPCIDeviceAddress address; /* PCI address of controller */ + unsigned int namespace; /* Namespace ID */ + bool managed; + + char *drvname; + char *domname; +}; + + +struct _virNVMeDeviceList { + virObjectLockable parent; + + size_t count; + virNVMeDevicePtr *devs; +}; + + +static virClassPtr virNVMeDeviceListClass; + +static void virNVMeDeviceListDispose(void *obj); + +static int +virNVMeOnceInit(void) +{ + if (!VIR_CLASS_NEW(virNVMeDeviceList, virClassForObjectLockable())) + return -1; + + return 0; +} + +VIR_ONCE_GLOBAL_INIT(virNVMe); + + +virNVMeDevicePtr +virNVMeDeviceNew(const virPCIDeviceAddress *address, + unsigned long namespace, + bool managed) +{ + virNVMeDevicePtr dev = NULL; + + dev = g_new0(virNVMeDevice, 1); + + virPCIDeviceAddressCopy(&dev->address, 
address); + dev->namespace = namespace; + dev->managed = managed; + + return dev; +} + + +void +virNVMeDeviceFree(virNVMeDevicePtr dev) +{ + if (!dev) + return; + + virNVMeDeviceUsedByClear(dev); + VIR_FREE(dev); +} + + +virNVMeDevicePtr +virNVMeDeviceCopy(const virNVMeDevice *dev) +{ + virNVMeDevicePtr copy = NULL; + + copy = g_new0(virNVMeDevice, 1); + copy->drvname = g_strdup(dev->drvname); + copy->domname = g_strdup(dev->domname); + + virPCIDeviceAddressCopy(&copy->address, &dev->address); + copy->namespace = dev->namespace; + copy->managed = dev->managed; + + return copy; +} + + +const virPCIDeviceAddress * +virNVMeDeviceAddressGet(const virNVMeDevice *dev) +{ + return &dev->address; +} + + +void +virNVMeDeviceUsedByClear(virNVMeDevicePtr dev) +{ + VIR_FREE(dev->drvname); + VIR_FREE(dev->domname); +} + + +void +virNVMeDeviceUsedByGet(const virNVMeDevice *dev, + const char **drv, + const char **dom) +{ + *drv = dev->drvname; + *dom = dev->domname; +} + + +void +virNVMeDeviceUsedBySet(virNVMeDevicePtr dev, + const char *drv, + const char *dom) +{ + dev->drvname = g_strdup(drv); + dev->domname = g_strdup(dom); +} + + +virNVMeDeviceListPtr +virNVMeDeviceListNew(void) +{ + virNVMeDeviceListPtr list; + + if (virNVMeInitialize() < 0) + return NULL; + + if (!(list = virObjectLockableNew(virNVMeDeviceListClass))) + return NULL; + + return list; +} + + +static void +virNVMeDeviceListDispose(void *obj) +{ + virNVMeDeviceListPtr list = obj; + size_t i; + + for (i = 0; i < list->count; i++) + virNVMeDeviceFree(list->devs[i]); + + VIR_FREE(list->devs); +} + + +size_t +virNVMeDeviceListCount(const virNVMeDeviceList *list) +{ + return list->count; +} + + +int +virNVMeDeviceListAdd(virNVMeDeviceListPtr list, + const virNVMeDevice *dev) +{ + virNVMeDevicePtr tmp; + + if ((tmp = virNVMeDeviceListLookup(list, dev))) { + g_autofree char *addrStr = virPCIDeviceAddressAsString(&tmp->address); + virReportError(VIR_ERR_INTERNAL_ERROR, + _("NVMe device %s namespace %u is already on the
list"), + NULLSTR(addrStr), tmp->namespace); + return -1; + } + + if (!(tmp = virNVMeDeviceCopy(dev)) || + VIR_APPEND_ELEMENT(list->devs, list->count, tmp) < 0) { + virNVMeDeviceFree(tmp); + return -1; + } + + return 0; +} + + +int +virNVMeDeviceListDel(virNVMeDeviceListPtr list, + const virNVMeDevice *dev) +{ + ssize_t idx; + virNVMeDevicePtr tmp = NULL; + + if ((idx = virNVMeDeviceListLookupIndex(list, dev)) < 0) { + g_autofree char *addrStr = virPCIDeviceAddressAsString(&dev->address); + virReportError(VIR_ERR_INTERNAL_ERROR, + _("NVMe device %s namespace %u not found"), + NULLSTR(addrStr), dev->namespace); + return -1; + } + + tmp = list->devs[idx]; + VIR_DELETE_ELEMENT(list->devs, idx, list->count); + virNVMeDeviceFree(tmp); + return 0; +} + + +virNVMeDevicePtr +virNVMeDeviceListGet(virNVMeDeviceListPtr list, + size_t i) +{ + return i < list->count ? list->devs[i] : NULL; +} + + +virNVMeDevicePtr +virNVMeDeviceListLookup(virNVMeDeviceListPtr list, + const virNVMeDevice *dev) +{ + ssize_t idx; + + if ((idx = virNVMeDeviceListLookupIndex(list, dev)) < 0) + return NULL; + + return list->devs[idx]; +} + + +ssize_t +virNVMeDeviceListLookupIndex(virNVMeDeviceListPtr list, + const virNVMeDevice *dev) +{ + size_t i; + + if (!list) + return -1; + + for (i = 0; i < list->count; i++) { + virNVMeDevicePtr other = list->devs[i]; + + if (virPCIDeviceAddressEqual(&dev->address, &other->address) && + dev->namespace == other->namespace) + return i; + } + + return -1; +} + + +static virNVMeDevicePtr +virNVMeDeviceListLookupByPCIAddress(virNVMeDeviceListPtr list, + const virPCIDeviceAddress *address) +{ + size_t i; + + if (!list) + return NULL; + + for (i = 0; i < list->count; i++) { + virNVMeDevicePtr other = list->devs[i]; + + if (virPCIDeviceAddressEqual(address, &other->address)) + return other; + } + + return NULL; +} + + +static virPCIDevicePtr +virNVMeDeviceCreatePCIDevice(const virNVMeDevice *nvme) +{ + g_autoptr(virPCIDevice) pci = NULL; + + if (!(pci = 
virPCIDeviceNew(nvme->address.domain, + nvme->address.bus, + nvme->address.slot, + nvme->address.function))) + return NULL; + + /* NVMe devices must be bound to vfio */ + virPCIDeviceSetStubDriver(pci, VIR_PCI_STUB_DRIVER_VFIO); + virPCIDeviceSetManaged(pci, nvme->managed); + + return g_steal_pointer(&pci); +} + + +/** + * virNVMeDeviceListCreateDetachList: + * @activeList: list of active NVMe devices + * @toDetachList: list of NVMe devices to detach from the host + * + * This function creates a list of PCI devices which can then be + * reused by PCI device detach functions (e.g. + * virHostdevPreparePCIDevicesImpl()) as each PCI device from the + * returned list is initialized properly for detach. + * + * Basically, this just blindly collects unique PCI addresses + * from @toDetachList that don't appear on @activeList. + * + * Returns: a list on success, + * NULL otherwise. + */ +virPCIDeviceListPtr +virNVMeDeviceListCreateDetachList(virNVMeDeviceListPtr activeList, + virNVMeDeviceListPtr toDetachList) +{ + g_autoptr(virPCIDeviceList) pciDevices = NULL; + size_t i; + + if (!(pciDevices = virPCIDeviceListNew())) + return NULL; + + for (i = 0; i < toDetachList->count; i++) { + const virNVMeDevice *d = toDetachList->devs[i]; + g_autoptr(virPCIDevice) pci = NULL; + + /* If there is a NVMe device with the same PCI address on + * the activeList, the device is already detached. */ + if (virNVMeDeviceListLookupByPCIAddress(activeList, &d->address)) + continue; + + /* It may happen that we want to detach two namespaces + * from the same NVMe device. This will be represented as + * two different instances of virNVMeDevice, but + * obviously we want to put the PCI device on the detach + * list only once. 
*/ + if (virPCIDeviceListFindByIDs(pciDevices, + d->address.domain, + d->address.bus, + d->address.slot, + d->address.function)) + continue; + + if (!(pci = virNVMeDeviceCreatePCIDevice(d))) + return NULL; + + if (virPCIDeviceListAdd(pciDevices, pci) < 0) + return NULL; + + /* avoid freeing the device */ + pci = NULL; + } + + return g_steal_pointer(&pciDevices); +} + + +/** + * virNVMeDeviceListCreateReAttachList: + * @activeList: list of active NVMe devices + * @toReAttachList: list of devices to reattach to the host + * + * This is a counterpart to virNVMeDeviceListCreateDetachList. + * + * This function creates a list of PCI devices which can then be + * reused by PCI device reattach functions (e.g. + * virHostdevReAttachPCIDevicesImpl()) as each PCI device from + * the returned list is initialized properly for reattach. + * + * Basically, this just collects unique PCI addresses + * of devices that appear on @toReAttachList and are used + * exactly once (i.e. no other namespaces are used from the same + * NVMe device). For that purpose, this function needs to know + * the list of active NVMe devices (@activeList). + * + * Returns: a list on success, + * NULL otherwise. + */ +virPCIDeviceListPtr +virNVMeDeviceListCreateReAttachList(virNVMeDeviceListPtr activeList, + virNVMeDeviceListPtr toReAttachList) +{ + g_autoptr(virPCIDeviceList) pciDevices = NULL; + size_t i, j; + + if (!(pciDevices = virPCIDeviceListNew())) + return NULL; + + for (i = 0; i < toReAttachList->count; i++) { + const virNVMeDevice *d = toReAttachList->devs[i]; + g_autoptr(virPCIDevice) pci = NULL; + size_t nused = 0; + + /* Check if there is any other NVMe device with the same PCI address as + * @d. To simplify this, let's just count how many NVMe devices with + * the same PCI address there are on the @activeList.
*/ + for (j = 0; j < activeList->count; j++) { + virNVMeDevicePtr other = activeList->devs[j]; + + if (!virPCIDeviceAddressEqual(&d->address, &other->address)) + continue; + + nused++; + } + + /* Now, the following cases can happen: + * nused > 1 -> there are other NVMe devices active, do NOT reattach it + * nused == 1 -> we've found only @d on the @activeList, reattach it + * nused == 0 -> huh, wait, what? @d is NOT on the @activeList, how can + * we reattach it? + */ + + if (nused == 0) { + /* Shouldn't happen (TM) */ + g_autofree char *addrStr = virPCIDeviceAddressAsString(&d->address); + virReportError(VIR_ERR_INTERNAL_ERROR, + _("NVMe device %s namespace %u not found"), + NULLSTR(addrStr), d->namespace); + return NULL; + } else if (nused > 1) { + /* NVMe device is still in use */ + continue; + } + + /* nused == 1 -> reattach the device */ + if (!(pci = virNVMeDeviceCreatePCIDevice(d))) + return NULL; + + if (virPCIDeviceListAdd(pciDevices, pci) < 0) + return NULL; + + /* avoid freeing the device */ + pci = NULL; + } + + return g_steal_pointer(&pciDevices); +} diff --git a/src/util/virnvme.h b/src/util/virnvme.h new file mode 100644 index 0000000000..911a9d29f3 --- /dev/null +++ b/src/util/virnvme.h @@ -0,0 +1,95 @@ +/* + * virnvme.h: helper APIs for managing NVMe devices + * + * This library is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * This library is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this library. If not, see + * <http://www.gnu.org/licenses/>.
+ */ + +#pragma once + +#include "virpci.h" + +typedef struct _virNVMeDevice virNVMeDevice; +typedef virNVMeDevice *virNVMeDevicePtr; + +/* Note that this list is lockable, and in fact, it is caller's + * responsibility to acquire the lock and release it. The reason + * is that in a lot of cases the list must be locked between two + * API calls and therefore only caller knows when it is safe to + * finally release the lock. */ +typedef struct _virNVMeDeviceList virNVMeDeviceList; +typedef virNVMeDeviceList *virNVMeDeviceListPtr; + +virNVMeDevicePtr +virNVMeDeviceNew(const virPCIDeviceAddress *address, + unsigned long namespace, + bool managed); + +void +virNVMeDeviceFree(virNVMeDevicePtr dev); + +G_DEFINE_AUTOPTR_CLEANUP_FUNC(virNVMeDevice, virNVMeDeviceFree); + +virNVMeDevicePtr +virNVMeDeviceCopy(const virNVMeDevice *dev); + +const virPCIDeviceAddress * +virNVMeDeviceAddressGet(const virNVMeDevice *dev); + +void +virNVMeDeviceUsedByClear(virNVMeDevicePtr dev); + +void +virNVMeDeviceUsedByGet(const virNVMeDevice *dev, + const char **drv, + const char **dom); + +void +virNVMeDeviceUsedBySet(virNVMeDevicePtr dev, + const char *drv, + const char *dom); + +virNVMeDeviceListPtr +virNVMeDeviceListNew(void); + +size_t +virNVMeDeviceListCount(const virNVMeDeviceList *list); + +int +virNVMeDeviceListAdd(virNVMeDeviceListPtr list, + const virNVMeDevice *dev); + +int +virNVMeDeviceListDel(virNVMeDeviceListPtr list, + const virNVMeDevice *dev); + +virNVMeDevicePtr +virNVMeDeviceListGet(virNVMeDeviceListPtr list, + size_t i); + +virNVMeDevicePtr +virNVMeDeviceListLookup(virNVMeDeviceListPtr list, + const virNVMeDevice *dev); + +ssize_t +virNVMeDeviceListLookupIndex(virNVMeDeviceListPtr list, + const virNVMeDevice *dev); + +virPCIDeviceListPtr +virNVMeDeviceListCreateDetachList(virNVMeDeviceListPtr activeList, + virNVMeDeviceListPtr toDetachList); + +virPCIDeviceListPtr +virNVMeDeviceListCreateReAttachList(virNVMeDeviceListPtr activeList, + virNVMeDeviceListPtr toReAttachList); 
-- 2.23.0

On 12/2/19 9:26 AM, Michal Privoznik wrote:
This module will be used by virHostdevManager and is inspired by the virPCIDevice module. The two are very similar; they differ in what identifies an NVMe device: a PCI address AND a namespace ID. This means that an NVMe device can appear in a domain multiple times, each time with a different namespace.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- po/POTFILES.in | 1 + src/libvirt_private.syms | 18 ++ src/util/Makefile.inc.am | 2 + src/util/virnvme.c | 447 +++++++++++++++++++++++++++++++++++++++ src/util/virnvme.h | 95 +++++++++ 5 files changed, 563 insertions(+) create mode 100644 src/util/virnvme.c create mode 100644 src/util/virnvme.h
Reviewed-by: Cole Robinson <crobinso@redhat.com> - Cole

Now that we have virNVMeDevice module (introduced in previous commit), let's use it int virHostdev to track which NVMe devices are free to be used by a domain and which are taken. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/libvirt_private.syms | 5 + src/util/virhostdev.c | 331 +++++++++++++++++++++++++++++++++++++++ src/util/virhostdev.h | 37 +++++ src/util/virnvme.h | 2 + 4 files changed, 375 insertions(+) diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index aab2b49bdf..398f8c21ac 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -2146,18 +2146,23 @@ virHostdevPCINodeDeviceReAttach; virHostdevPCINodeDeviceReset; virHostdevPrepareDomainDevices; virHostdevPrepareMediatedDevices; +virHostdevPrepareNVMeDevices; +virHostdevPrepareOneNVMeDevice; virHostdevPreparePCIDevices; virHostdevPrepareSCSIDevices; virHostdevPrepareSCSIVHostDevices; virHostdevPrepareUSBDevices; virHostdevReAttachDomainDevices; virHostdevReAttachMediatedDevices; +virHostdevReAttachNVMeDevices; +virHostdevReAttachOneNVMeDevice; virHostdevReAttachPCIDevices; virHostdevReAttachSCSIDevices; virHostdevReAttachSCSIVHostDevices; virHostdevReAttachUSBDevices; virHostdevUpdateActiveDomainDevices; virHostdevUpdateActiveMediatedDevices; +virHostdevUpdateActiveNVMeDevices; virHostdevUpdateActivePCIDevices; virHostdevUpdateActiveSCSIDevices; virHostdevUpdateActiveUSBDevices; diff --git a/src/util/virhostdev.c b/src/util/virhostdev.c index 94a0185597..78e409732a 100644 --- a/src/util/virhostdev.c +++ b/src/util/virhostdev.c @@ -135,6 +135,7 @@ virHostdevManagerDispose(void *obj) virObjectUnref(hostdevMgr->activeSCSIHostdevs); virObjectUnref(hostdevMgr->activeSCSIVHostHostdevs); virObjectUnref(hostdevMgr->activeMediatedHostdevs); + virObjectUnref(hostdevMgr->activeNVMeHostdevs); VIR_FREE(hostdevMgr->stateDir); } @@ -165,6 +166,9 @@ virHostdevManagerNew(void) if (!(hostdevMgr->activeMediatedHostdevs = virMediatedDeviceListNew())) return NULL; + if 
(!(hostdevMgr->activeNVMeHostdevs = virNVMeDeviceListNew())) + return NULL; + if (privileged) { hostdevMgr->stateDir = g_strdup(HOSTDEV_STATE_DIR); @@ -2235,3 +2239,330 @@ virHostdevUpdateActiveDomainDevices(virHostdevManagerPtr mgr, return 0; } + + +static int +virHostdevGetNVMeDeviceList(virNVMeDeviceListPtr nvmeDevices, + virStorageSourcePtr src, + const char *drv_name, + const char *dom_name) +{ + virStorageSourcePtr n; + + for (n = src; virStorageSourceIsBacking(n); n = n->backingStore) { + g_autoptr(virNVMeDevice) dev = NULL; + const virStorageSourceNVMeDef *srcNVMe = n->nvme; + + if (n->type != VIR_STORAGE_TYPE_NVME) + continue; + + if (!(dev = virNVMeDeviceNew(&srcNVMe->pciAddr, + srcNVMe->namespace, + srcNVMe->managed))) + return -1; + + virNVMeDeviceUsedBySet(dev, drv_name, dom_name); + + if (virNVMeDeviceListAdd(nvmeDevices, dev) < 0) + return -1; + } + + return 0; +} + + +int +virHostdevPrepareOneNVMeDevice(virHostdevManagerPtr hostdev_mgr, + const char *drv_name, + const char *dom_name, + virStorageSourcePtr src) +{ + g_autoptr(virNVMeDeviceList) nvmeDevices = NULL; + g_autoptr(virPCIDeviceList) pciDevices = NULL; + const unsigned int pciFlags = 0; + virNVMeDevicePtr temp = NULL; + size_t i; + ssize_t lastGoodNVMeIdx = -1; + int ret = -1; + + if (!(nvmeDevices = virNVMeDeviceListNew())) + return -1; + + if (virHostdevGetNVMeDeviceList(nvmeDevices, src, drv_name, dom_name) < 0) + return -1; + + if (virNVMeDeviceListCount(nvmeDevices) == 0) + return 0; + + virObjectLock(hostdev_mgr->activeNVMeHostdevs); + + /* Firstly, let's check if all devices are free */ + for (i = 0; i < virNVMeDeviceListCount(nvmeDevices); i++) { + const virNVMeDevice *dev = virNVMeDeviceListGet(nvmeDevices, i); + const virPCIDeviceAddress *addr = NULL; + g_autofree char *addrStr = NULL; + const char *actual_drvname = NULL; + const char *actual_domname = NULL; + + temp = virNVMeDeviceListLookup(hostdev_mgr->activeNVMeHostdevs, dev); + + /* Not on the list means not used */ + if 
(!temp) + continue; + + virNVMeDeviceUsedByGet(temp, &actual_drvname, &actual_domname); + addr = virNVMeDeviceAddressGet(dev); + addrStr = virPCIDeviceAddressAsString(addr); + + virReportError(VIR_ERR_OPERATION_INVALID, + _("NVMe device %s already in use by driver %s domain %s"), + NULLSTR(addrStr), actual_drvname, actual_domname); + goto cleanup; + } + + if (!(pciDevices = virNVMeDeviceListCreateDetachList(hostdev_mgr->activeNVMeHostdevs, + nvmeDevices))) + goto cleanup; + + /* Let's check if all PCI devices are NVMe disks. */ + for (i = 0; i < virPCIDeviceListCount(pciDevices); i++) { + virPCIDevicePtr pci = virPCIDeviceListGet(pciDevices, i); + g_autofree char *drvPath = NULL; + g_autofree char *drvName = NULL; + int stub = VIR_PCI_STUB_DRIVER_NONE; + + if (virPCIDeviceGetDriverPathAndName(pci, &drvPath, &drvName) < 0) + goto cleanup; + + if (drvName) + stub = virPCIStubDriverTypeFromString(drvName); + + if (stub == VIR_PCI_STUB_DRIVER_VFIO || + STREQ_NULLABLE(drvName, "nvme")) + continue; + + VIR_WARN("Suspicious NVMe disk assignment. PCI device " + "%s is not an NVMe disk, it has %s driver", + virPCIDeviceGetName(pci), NULLSTR(drvName)); + } + + /* This looks like a good opportunity to merge inactive NVMe devices onto + * the active list. This, however, means that if something goes wrong we + * have to perform a rollback before returning. 
*/ + for (i = 0; i < virNVMeDeviceListCount(nvmeDevices); i++) { + temp = virNVMeDeviceListGet(nvmeDevices, i); + + if (virNVMeDeviceListAdd(hostdev_mgr->activeNVMeHostdevs, temp) < 0) + goto rollback; + + lastGoodNVMeIdx = i; + } + + if (virHostdevPreparePCIDevicesImpl(hostdev_mgr, + drv_name, dom_name, NULL, + pciDevices, NULL, 0, pciFlags) < 0) + goto rollback; + + ret = 0; + cleanup: + virObjectUnlock(hostdev_mgr->activeNVMeHostdevs); + return ret; + + rollback: + while (lastGoodNVMeIdx >= 0) { + temp = virNVMeDeviceListGet(nvmeDevices, lastGoodNVMeIdx); + + virNVMeDeviceListDel(hostdev_mgr->activeNVMeHostdevs, temp); + + lastGoodNVMeIdx--; + } + goto cleanup; +} + + +int +virHostdevPrepareNVMeDevices(virHostdevManagerPtr hostdev_mgr, + const char *drv_name, + const char *dom_name, + virDomainDiskDefPtr *disks, + size_t ndisks) +{ + size_t i; + ssize_t lastGoodDiskIdx = -1; + + for (i = 0; i < ndisks; i++) { + if (virHostdevPrepareOneNVMeDevice(hostdev_mgr, drv_name, + dom_name, disks[i]->src) < 0) + goto rollback; + + lastGoodDiskIdx = i; + } + + return 0; + + rollback: + while (lastGoodDiskIdx >= 0) { + if (virHostdevReAttachOneNVMeDevice(hostdev_mgr, drv_name, dom_name, + disks[lastGoodDiskIdx]->src) < 0) { + VIR_ERROR(_("Failed to reattach NVMe for disk target: %s"), + disks[lastGoodDiskIdx]->dst); + } + + lastGoodDiskIdx--; + } + + return -1; +} + + +int +virHostdevReAttachOneNVMeDevice(virHostdevManagerPtr hostdev_mgr, + const char *drv_name, + const char *dom_name, + virStorageSourcePtr src) +{ + g_autoptr(virNVMeDeviceList) nvmeDevices = NULL; + g_autoptr(virPCIDeviceList) pciDevices = NULL; + size_t i; + int ret = -1; + + if (!(nvmeDevices = virNVMeDeviceListNew())) + return -1; + + if (virHostdevGetNVMeDeviceList(nvmeDevices, src, drv_name, dom_name) < 0) + return -1; + + if (virNVMeDeviceListCount(nvmeDevices) == 0) + return 0; + + virObjectLock(hostdev_mgr->activeNVMeHostdevs); + + if (!(pciDevices = 
virNVMeDeviceListCreateReAttachList(hostdev_mgr->activeNVMeHostdevs, + nvmeDevices))) + goto cleanup; + + virHostdevReAttachPCIDevicesImpl(hostdev_mgr, + drv_name, dom_name, pciDevices, + NULL, 0, NULL); + + for (i = 0; i < virNVMeDeviceListCount(nvmeDevices); i++) { + virNVMeDevicePtr temp = virNVMeDeviceListGet(nvmeDevices, i); + + if (virNVMeDeviceListDel(hostdev_mgr->activeNVMeHostdevs, temp) < 0) + goto cleanup; + } + + ret = 0; + cleanup: + virObjectUnlock(hostdev_mgr->activeNVMeHostdevs); + return ret; +} + + +int +virHostdevReAttachNVMeDevices(virHostdevManagerPtr hostdev_mgr, + const char *drv_name, + const char *dom_name, + virDomainDiskDefPtr *disks, + size_t ndisks) +{ + size_t i; + int ret = 0; + + /* Contrary to virHostdevPrepareNVMeDevices, this is a best + * effort approach. Just iterate over all disks and try to + * reattach them. Don't stop at the first failure. */ + for (i = 0; i < ndisks; i++) { + if (virHostdevReAttachOneNVMeDevice(hostdev_mgr, drv_name, + dom_name, disks[i]->src) < 0) { + VIR_ERROR(_("Failed to reattach NVMe for disk target: %s"), + disks[i]->dst); + ret = -1; + } + } + + return ret; +} + + +int +virHostdevUpdateActiveNVMeDevices(virHostdevManagerPtr hostdev_mgr, + const char *drv_name, + const char *dom_name, + virDomainDiskDefPtr *disks, + size_t ndisks) +{ + g_autoptr(virNVMeDeviceList) nvmeDevices = NULL; + g_autoptr(virPCIDeviceList) pciDevices = NULL; + virNVMeDevicePtr temp = NULL; + size_t i; + ssize_t lastGoodNVMeIdx = -1; + ssize_t lastGoodPCIIdx = -1; + int ret = -1; + + if (!(nvmeDevices = virNVMeDeviceListNew())) + return -1; + + for (i = 0; i < ndisks; i++) { + if (virHostdevGetNVMeDeviceList(nvmeDevices, disks[i]->src, drv_name, dom_name) < 0) + return -1; + } + + if (virNVMeDeviceListCount(nvmeDevices) == 0) + return 0; + + virObjectLock(hostdev_mgr->activeNVMeHostdevs); + virObjectLock(hostdev_mgr->activePCIHostdevs); + virObjectLock(hostdev_mgr->inactivePCIHostdevs); + + if (!(pciDevices = 
virNVMeDeviceListCreateDetachList(hostdev_mgr->activeNVMeHostdevs, + nvmeDevices))) + goto cleanup; + + for (i = 0; i < virNVMeDeviceListCount(nvmeDevices); i++) { + temp = virNVMeDeviceListGet(nvmeDevices, i); + + if (virNVMeDeviceListAdd(hostdev_mgr->activeNVMeHostdevs, temp) < 0) + goto rollback; + + lastGoodNVMeIdx = i; + } + + for (i = 0; i < virPCIDeviceListCount(pciDevices); i++) { + virPCIDevicePtr actual = virPCIDeviceListGet(pciDevices, i); + + /* We must restore some attributes that were lost on daemon restart. */ + virPCIDeviceSetUnbindFromStub(actual, true); + if (virPCIDeviceSetUsedBy(actual, drv_name, dom_name) < 0) + goto rollback; + + if (virPCIDeviceListAddCopy(hostdev_mgr->activePCIHostdevs, actual) < 0) + goto rollback; + + lastGoodPCIIdx = i; + } + + ret = 0; + cleanup: + virObjectUnlock(hostdev_mgr->inactivePCIHostdevs); + virObjectUnlock(hostdev_mgr->activePCIHostdevs); + virObjectUnlock(hostdev_mgr->activeNVMeHostdevs); + return ret; + + rollback: + while (lastGoodNVMeIdx >= 0) { + temp = virNVMeDeviceListGet(nvmeDevices, lastGoodNVMeIdx); + + virNVMeDeviceListDel(hostdev_mgr->activeNVMeHostdevs, temp); + + lastGoodNVMeIdx--; + } + while (lastGoodPCIIdx >= 0) { + virPCIDevicePtr actual = virPCIDeviceListGet(pciDevices, i); + + virPCIDeviceListDel(hostdev_mgr->activePCIHostdevs, actual); + + lastGoodPCIIdx--; + } + goto cleanup; +} diff --git a/src/util/virhostdev.h b/src/util/virhostdev.h index 2d61c21e9d..ae84ed3d20 100644 --- a/src/util/virhostdev.h +++ b/src/util/virhostdev.h @@ -29,6 +29,7 @@ #include "virscsivhost.h" #include "conf/domain_conf.h" #include "virmdev.h" +#include "virnvme.h" typedef enum { VIR_HOSTDEV_STRICT_ACS_CHECK = (1 << 0), /* strict acs check */ @@ -53,6 +54,9 @@ struct _virHostdevManager { virSCSIDeviceListPtr activeSCSIHostdevs; virSCSIVHostDeviceListPtr activeSCSIVHostHostdevs; virMediatedDeviceListPtr activeMediatedHostdevs; + /* NVMe devices are PCI devices really, but one NVMe disk can + * have multiple 
namespaces. */ + virNVMeDeviceListPtr activeNVMeHostdevs; }; G_DEFINE_AUTOPTR_CLEANUP_FUNC(virHostdevManager, virObjectUnref); @@ -207,3 +211,36 @@ int virHostdevPCINodeDeviceReAttach(virHostdevManagerPtr mgr, int virHostdevPCINodeDeviceReset(virHostdevManagerPtr mgr, virPCIDevicePtr pci) ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2); + +int +virHostdevPrepareOneNVMeDevice(virHostdevManagerPtr hostdev_mgr, + const char *drv_name, + const char *dom_name, + virStorageSourcePtr src); + +int +virHostdevPrepareNVMeDevices(virHostdevManagerPtr hostdev_mgr, + const char *drv_name, + const char *dom_name, + virDomainDiskDefPtr *disks, + size_t ndisks); + +int +virHostdevReAttachOneNVMeDevice(virHostdevManagerPtr hostdev_mgr, + const char *drv_name, + const char *dom_name, + virStorageSourcePtr src); + +int +virHostdevReAttachNVMeDevices(virHostdevManagerPtr hostdev_mgr, + const char *drv_name, + const char *dom_name, + virDomainDiskDefPtr *disks, + size_t ndisks); + +int +virHostdevUpdateActiveNVMeDevices(virHostdevManagerPtr hostdev_mgr, + const char *drv_name, + const char *dom_name, + virDomainDiskDefPtr *disks, + size_t ndisks); diff --git a/src/util/virnvme.h b/src/util/virnvme.h index 911a9d29f3..db4d72ab9a 100644 --- a/src/util/virnvme.h +++ b/src/util/virnvme.h @@ -31,6 +31,8 @@ typedef virNVMeDevice *virNVMeDevicePtr; typedef struct _virNVMeDeviceList virNVMeDeviceList; typedef virNVMeDeviceList *virNVMeDeviceListPtr; +G_DEFINE_AUTOPTR_CLEANUP_FUNC(virNVMeDeviceList, virObjectUnref); + virNVMeDevicePtr virNVMeDeviceNew(const virPCIDeviceAddress *address, unsigned long namespace, -- 2.23.0

On 12/2/19 9:26 AM, Michal Privoznik wrote:
Now that we have the virNVMeDevice module (introduced in the previous commit), let's use it in virHostdev to track which NVMe devices are free to be used by a domain and which are taken.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/libvirt_private.syms | 5 + src/util/virhostdev.c | 331 +++++++++++++++++++++++++++++++++++++++ src/util/virhostdev.h | 37 +++++ src/util/virnvme.h | 2 + 4 files changed, 375 insertions(+)
Any mentions of 'rollback' make me nervous, but I couldn't find anything sketchy in there. Reviewed-by: Cole Robinson <crobinso@redhat.com> - Cole

The device configs (which are actually the same one config) come from a NVMe disk of mine. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> ACKed-by: Peter Krempa <pkrempa@redhat.com> --- tests/virpcimock.c | 3 +++ tests/virpcitestdata/0000-01-00.0.config | Bin 0 -> 4096 bytes tests/virpcitestdata/0000-02-00.0.config | Bin 0 -> 4096 bytes 3 files changed, 3 insertions(+) create mode 100644 tests/virpcitestdata/0000-01-00.0.config create mode 100644 tests/virpcitestdata/0000-02-00.0.config diff --git a/tests/virpcimock.c b/tests/virpcimock.c index cd6ae1cff6..6048118d5c 100644 --- a/tests/virpcimock.c +++ b/tests/virpcimock.c @@ -980,6 +980,7 @@ init_env(void) MAKE_PCI_DRIVER("iwlwifi", 0x8086, 0x0044); MAKE_PCI_DRIVER("i915", 0x8086, 0x0046, 0x8086, 0x0047); MAKE_PCI_DRIVER("vfio-pci", -1, -1); + MAKE_PCI_DRIVER("nvme", 0x1cc1, 0x8201); # define MAKE_PCI_DEVICE(Id, Vendor, Device, IommuGroup, ...) \ do { \ @@ -1015,6 +1016,8 @@ init_env(void) MAKE_PCI_DEVICE("0021:de:1f.1", 0x8086, 0x0047, 13, .physfn = "0021:de:1f.0"); /* Virtual Function */ + MAKE_PCI_DEVICE("0000:01:00.0", 0x1cc1, 0x8201, 14, .klass = 0x010802); + MAKE_PCI_DEVICE("0000:02:00.0", 0x1cc1, 0x8201, 15, .klass = 0x010802); } diff --git a/tests/virpcitestdata/0000-01-00.0.config b/tests/virpcitestdata/0000-01-00.0.config new file mode 100644 index 0000000000000000000000000000000000000000..f92455e2ac5701ce60a51ae19828658b80744399 GIT binary patch literal 4096 zcmeHIF;Buk6#lMP3WC_80!0h0vM?|ZHYb~)#L>jXKcF*<t0@ew#(&`84`6g`<KpZu zaP%LTxZu0C6dH+PAO@3rm)v{zz3*Nx-?guS#YUQHfGar$G8Om~evt*l6}THG3$%ls zbL8T+aGAkfSZ5AOg~nJxaQ{&~b`11hPvNqjks{E-76s`bTjV$zsdNdt2Zx}86r3!0 z5-k@njLH$yMaSt!;XCjcMN7{rhH)Lzgm%?1tR|ap5RC)?OfZuh+-MNn&bPwMnC1$Y zWtY532v8x%Zq%*)y_#9Aly`TwONPEx+$`iba#<~-a)litu!!pkegQ#S0so<=$gQVo zbjxR?KpRD$uKLK=emkc^$-Zgf<*MZ@;uWY8!!UdWrXoze;QR3q_ajXzAQg}b)NzRk z87VP9U%z%(HWJ0B|6JTCgnRh9Wz<V@F5}c=IPK9xUNln!?lk>Mr%9L>;{S2-*!p%x w=b5pKIZ+ec2|JnLT|5EZ*;+<Yfz>MDr9bc*RexLU6J#~1fK)@Fxm<1DBgMR{#J2 literal 0 
HcmV?d00001 diff --git a/tests/virpcitestdata/0000-02-00.0.config b/tests/virpcitestdata/0000-02-00.0.config new file mode 100644 index 0000000000000000000000000000000000000000..ebb44d8f69c91809b82e8d2669026dfff42c3100 GIT binary patch literal 4096 zcmeHIJ5Iwu5Pj>-$HXLdfc(IT4QW!Oh|*DEDG*U2(QpB%)6hF9Xc0G{-~cHpZ9zfJ z322bGLCP>|J5DSjlp;bw+F5C5_RZVz>a9KYO*YD;3~)tdAWH!g;g^|DT!A}LQllO0 zf<ukg!legyL7fFC5gKC!{{2_w#5T}-JA=b|MuI>KOBAGo6v%Nj66qpz7dAnM2{>Nx zI9e@W7?nb%gO1$~!w=vwj8>jg7)EtS6WUe7uo7>+ML1#rsDf3w!Hov7tz0X}jA<@| znO4!A1^^YZtw!BE*soP9<<j2nPSMZ{`E4z?rDikf6j#_0e3Q7Y;A`;P3iuB_MQ$@K zL$`cR3bc`bbvTZ_%x~vZDA})?c)4!b%Xk`9Vi*Rmz)Xah7kn=o;(nw_1*8H}fjX`* zB_kyU=<7E&%Z8$O^q-3wg>Vm(Pe#2&br`1}!)cEm@WPoIaHr{&J59pe0RNAZ%Qm+& w+Ruz#E{GcIPT1)j@8SvQ&et-M3anQFH~E3rsQUYQpCGGA1*8H}fj?2;8=mzw{{R30 literal 0 HcmV?d00001 -- 2.23.0

On 12/2/19 9:26 AM, Michal Privoznik wrote:
The device configs (which are actually one and the same config) come from an NVMe disk of mine.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> ACKed-by: Peter Krempa <pkrempa@redhat.com> --- tests/virpcimock.c | 3 +++ tests/virpcitestdata/0000-01-00.0.config | Bin 0 -> 4096 bytes tests/virpcitestdata/0000-02-00.0.config | Bin 0 -> 4096 bytes 3 files changed, 3 insertions(+) create mode 100644 tests/virpcitestdata/0000-01-00.0.config create mode 100644 tests/virpcitestdata/0000-02-00.0.config
Reviewed-by: Cole Robinson <crobinso@redhat.com> - Cole

Signed-off-by: Michal Privoznik <mprivozn@redhat.com> ACKed-by: Peter Krempa <pkrempa@redhat.com> --- tests/virhostdevtest.c | 97 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 97 insertions(+) diff --git a/tests/virhostdevtest.c b/tests/virhostdevtest.c index ed8e7e8fc1..39216edc67 100644 --- a/tests/virhostdevtest.c +++ b/tests/virhostdevtest.c @@ -48,6 +48,9 @@ VIR_LOG_INIT("tests.hostdevtest"); # define CHECK_PCI_LIST_COUNT(list, cnt) \ CHECK_LIST_COUNT(list, cnt, virPCIDeviceListCount) +# define CHECK_NVME_LIST_COUNT(list, cnt) \ + CHECK_LIST_COUNT(list, cnt, virNVMeDeviceListCount) + # define TEST_STATE_DIR abs_builddir "/hostdevmgr" static const char *drv_name = "test_driver"; static const char *dom_name = "test_domain"; @@ -57,6 +60,36 @@ static int nhostdevs = 3; static virDomainHostdevDefPtr hostdevs[] = {NULL, NULL, NULL}; static virPCIDevicePtr dev[] = {NULL, NULL, NULL}; static virHostdevManagerPtr mgr; +static const size_t ndisks = 3; +static virDomainDiskDefPtr disks[] = {NULL, NULL, NULL}; +static const char *diskXML[] = { + "<disk type='nvme' device='disk'>" + " <driver name='qemu' type='raw'/>" + " <source type='pci' managed='yes' namespace='1'>" + " <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>" + " </source>" + " <target dev='vda' bus='virtio'/>" + " <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>" + "</disk>", + + "<disk type='nvme' device='disk'>" + " <driver name='qemu' type='raw'/>" + " <source type='pci' managed='yes' namespace='2'>" + " <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>" + " </source>" + " <target dev='vdb' bus='virtio'/>" + " <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>" + "</disk>", + + "<disk type='nvme' device='disk'>" + " <driver name='qemu' type='raw'/>" + " <source type='pci' managed='no' namespace='1'>" + " <address domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>" + " </source>" + " <target dev='vdc' 
bus='virtio'/>" + " <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>" + "</disk>" +}; static void myCleanup(void) @@ -67,6 +100,9 @@ myCleanup(void) virDomainHostdevDefFree(hostdevs[i]); } + for (i = 0; i < ndisks; i++) + virDomainDiskDefFree(disks[i]); + if (mgr) { if (!getenv("LIBVIRT_SKIP_CLEANUP")) virFileDeleteTree(mgr->stateDir); @@ -75,6 +111,7 @@ myCleanup(void) virObjectUnref(mgr->activeUSBHostdevs); virObjectUnref(mgr->inactivePCIHostdevs); virObjectUnref(mgr->activeSCSIHostdevs); + virObjectUnref(mgr->activeNVMeHostdevs); VIR_FREE(mgr->stateDir); VIR_FREE(mgr); } @@ -107,6 +144,11 @@ myInit(void) virPCIDeviceSetStubDriver(dev[i], VIR_PCI_STUB_DRIVER_VFIO); } + for (i = 0; i < ndisks; i++) { + if (!(disks[i] = virDomainDiskDefParse(diskXML[i], NULL, NULL, 0))) + goto cleanup; + } + if (VIR_ALLOC(mgr) < 0) goto cleanup; if ((mgr->activePCIHostdevs = virPCIDeviceListNew()) == NULL) @@ -117,6 +159,8 @@ myInit(void) goto cleanup; if ((mgr->activeSCSIHostdevs = virSCSIDeviceListNew()) == NULL) goto cleanup; + if ((mgr->activeNVMeHostdevs = virNVMeDeviceListNew()) == NULL) + goto cleanup; mgr->stateDir = g_strdup(TEST_STATE_DIR); if (virFileMakePath(mgr->stateDir) < 0) goto cleanup; @@ -492,6 +536,58 @@ testVirHostdevOther(const void *opaque G_GNUC_UNUSED) return 0; } +static int +testNVMeDiskRoundtrip(const void *opaque G_GNUC_UNUSED) +{ + int ret = -1; + + /* Don't rely on a state that previous test cases might have + * left the manager in. Start with a clean slate. 
*/ + virHostdevReAttachPCIDevices(mgr, drv_name, dom_name, + hostdevs, nhostdevs, NULL); + + CHECK_NVME_LIST_COUNT(mgr->activeNVMeHostdevs, 0); + CHECK_PCI_LIST_COUNT(mgr->activePCIHostdevs, 0); + CHECK_PCI_LIST_COUNT(mgr->inactivePCIHostdevs, 0); + + /* Firstly, attach all NVMe disks */ + if (virHostdevPrepareNVMeDevices(mgr, drv_name, dom_name, disks, ndisks) < 0) + goto cleanup; + + CHECK_NVME_LIST_COUNT(mgr->activeNVMeHostdevs, 3); + CHECK_PCI_LIST_COUNT(mgr->activePCIHostdevs, 2); + CHECK_PCI_LIST_COUNT(mgr->inactivePCIHostdevs, 0); + + /* Now, try to detach the first one. */ + if (virHostdevReAttachNVMeDevices(mgr, drv_name, dom_name, disks, 1) < 0) + goto cleanup; + + CHECK_NVME_LIST_COUNT(mgr->activeNVMeHostdevs, 2); + CHECK_PCI_LIST_COUNT(mgr->activePCIHostdevs, 2); + CHECK_PCI_LIST_COUNT(mgr->inactivePCIHostdevs, 0); + + /* And the last one */ + if (virHostdevReAttachNVMeDevices(mgr, drv_name, dom_name, &disks[2], 1) < 0) + goto cleanup; + + CHECK_NVME_LIST_COUNT(mgr->activeNVMeHostdevs, 1); + CHECK_PCI_LIST_COUNT(mgr->activePCIHostdevs, 1); + CHECK_PCI_LIST_COUNT(mgr->inactivePCIHostdevs, 0); + + /* Finally, detach the middle one */ + if (virHostdevReAttachNVMeDevices(mgr, drv_name, dom_name, &disks[1], 1) < 0) + goto cleanup; + + CHECK_NVME_LIST_COUNT(mgr->activeNVMeHostdevs, 0); + CHECK_PCI_LIST_COUNT(mgr->activePCIHostdevs, 0); + CHECK_PCI_LIST_COUNT(mgr->inactivePCIHostdevs, 0); + + ret = 0; + cleanup: + return ret; +} + + # define FAKEROOTDIRTEMPLATE abs_builddir "/fakerootdir-XXXXXX" static int @@ -526,6 +622,7 @@ mymain(void) DO_TEST(testVirHostdevRoundtripManaged); DO_TEST(testVirHostdevRoundtripMixed); DO_TEST(testVirHostdevOther); + DO_TEST(testNVMeDiskRoundtrip); myCleanup(); -- 2.23.0

On 12/2/19 9:26 AM, Michal Privoznik wrote:
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> ACKed-by: Peter Krempa <pkrempa@redhat.com> --- tests/virhostdevtest.c | 97 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 97 insertions(+)
You'll need to adjust virDomainDiskDefParse usage to match the changed signature in git. Reviewed-by: Cole Robinson <crobinso@redhat.com> - Cole

The qemu driver has its own wrappers around virHostdev module (so that some arguments are filled in automatically). Extend these to include NVMe devices too. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> ACKed-by: Peter Krempa <pkrempa@redhat.com> --- src/qemu/qemu_hostdev.c | 49 ++++++++++++++++++++++++++++++++++++++--- src/qemu/qemu_hostdev.h | 10 +++++++++ 2 files changed, 56 insertions(+), 3 deletions(-) diff --git a/src/qemu/qemu_hostdev.c b/src/qemu/qemu_hostdev.c index ebbca817b8..5ab0217858 100644 --- a/src/qemu/qemu_hostdev.c +++ b/src/qemu/qemu_hostdev.c @@ -96,13 +96,28 @@ qemuHostdevUpdateActiveMediatedDevices(virQEMUDriverPtr driver, } +int +qemuHostdevUpdateActiveNVMeDisks(virQEMUDriverPtr driver, + virDomainDefPtr def) +{ + return virHostdevUpdateActiveNVMeDevices(driver->hostdevMgr, + QEMU_DRIVER_NAME, + def->name, + def->disks, + def->ndisks); +} + + int qemuHostdevUpdateActiveDomainDevices(virQEMUDriverPtr driver, virDomainDefPtr def) { - if (!def->nhostdevs) + if (!def->nhostdevs && !def->ndisks) return 0; + if (qemuHostdevUpdateActiveNVMeDisks(driver, def) < 0) + return -1; + if (qemuHostdevUpdateActivePCIDevices(driver, def) < 0) return -1; @@ -197,6 +212,17 @@ qemuHostdevPreparePCIDevicesCheckSupport(virDomainHostdevDefPtr *hostdevs, return true; } +int +qemuHostdevPrepareNVMeDisks(virQEMUDriverPtr driver, + const char *name, + virDomainDiskDefPtr *disks, + size_t ndisks) +{ + return virHostdevPrepareNVMeDevices(driver->hostdevMgr, + QEMU_DRIVER_NAME, + name, disks, ndisks); +} + int qemuHostdevPreparePCIDevices(virQEMUDriverPtr driver, const char *name, @@ -313,9 +339,12 @@ qemuHostdevPrepareDomainDevices(virQEMUDriverPtr driver, virQEMUCapsPtr qemuCaps, unsigned int flags) { - if (!def->nhostdevs) + if (!def->nhostdevs && !def->ndisks) return 0; + if (qemuHostdevPrepareNVMeDisks(driver, def->name, def->disks, def->ndisks) < 0) + return -1; + if (qemuHostdevPreparePCIDevices(driver, def->name, def->uuid, def->hostdevs, def->nhostdevs, 
qemuCaps, flags) < 0) @@ -340,6 +369,17 @@ qemuHostdevPrepareDomainDevices(virQEMUDriverPtr driver, return 0; } +void +qemuHostdevReAttachNVMeDisks(virQEMUDriverPtr driver, + const char *name, + virDomainDiskDefPtr *disks, + size_t ndisks) +{ + virHostdevReAttachNVMeDevices(driver->hostdevMgr, + QEMU_DRIVER_NAME, + name, disks, ndisks); +} + void qemuHostdevReAttachPCIDevices(virQEMUDriverPtr driver, const char *name, @@ -419,9 +459,12 @@ void qemuHostdevReAttachDomainDevices(virQEMUDriverPtr driver, virDomainDefPtr def) { - if (!def->nhostdevs) + if (!def->nhostdevs && !def->ndisks) return; + qemuHostdevReAttachNVMeDisks(driver, def->name, def->disks, + def->ndisks); + qemuHostdevReAttachPCIDevices(driver, def->name, def->hostdevs, def->nhostdevs); diff --git a/src/qemu/qemu_hostdev.h b/src/qemu/qemu_hostdev.h index 536069fe8a..735414b6aa 100644 --- a/src/qemu/qemu_hostdev.h +++ b/src/qemu/qemu_hostdev.h @@ -28,6 +28,8 @@ bool qemuHostdevNeedsVFIO(const virDomainHostdevDef *hostdev); bool qemuHostdevHostSupportsPassthroughVFIO(void); +int qemuHostdevUpdateActiveNVMeDisks(virQEMUDriverPtr driver, + virDomainDefPtr def); int qemuHostdevUpdateActiveMediatedDevices(virQEMUDriverPtr driver, virDomainDefPtr def); int qemuHostdevUpdateActivePCIDevices(virQEMUDriverPtr driver, @@ -39,6 +41,10 @@ int qemuHostdevUpdateActiveSCSIDevices(virQEMUDriverPtr driver, int qemuHostdevUpdateActiveDomainDevices(virQEMUDriverPtr driver, virDomainDefPtr def); +int qemuHostdevPrepareNVMeDisks(virQEMUDriverPtr driver, + const char *name, + virDomainDiskDefPtr *disks, + size_t ndisks); int qemuHostdevPreparePCIDevices(virQEMUDriverPtr driver, const char *name, const unsigned char *uuid, @@ -68,6 +74,10 @@ int qemuHostdevPrepareDomainDevices(virQEMUDriverPtr driver, virQEMUCapsPtr qemuCaps, unsigned int flags); +void qemuHostdevReAttachNVMeDisks(virQEMUDriverPtr driver, + const char *name, + virDomainDiskDefPtr *disks, + size_t ndisks); void qemuHostdevReAttachPCIDevices(virQEMUDriverPtr 
driver, const char *name, virDomainHostdevDefPtr *hostdevs, -- 2.23.0

On 12/2/19 9:26 AM, Michal Privoznik wrote:
The qemu driver has its own wrappers around virHostdev module (so that some arguments are filled in automatically). Extend these to include NVMe devices too.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> ACKed-by: Peter Krempa <pkrempa@redhat.com> --- src/qemu/qemu_hostdev.c | 49 ++++++++++++++++++++++++++++++++++++++--- src/qemu/qemu_hostdev.h | 10 +++++++++ 2 files changed, 56 insertions(+), 3 deletions(-)
Reviewed-by: Cole Robinson <crobinso@redhat.com> - Cole

We have this beautiful function that does crystal ball divination. The function is named qemuDomainGetMemLockLimitBytes() and it calculates the upper limit of how much locked memory is given guest going to need. The function bases its guess on devices defined for a domain. For instance, if there is a VFIO hostdev defined then it adds 1GiB to the guessed maximum. Since NVMe disks are pretty much VFIO hostdevs (but not quite), we have to do the same sorcery. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> ACKed-by: Peter Krempa <pkrempa@redhat.com> --- src/qemu/qemu_domain.c | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index 59e98de2d2..7b515b9520 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -11920,6 +11920,9 @@ getPPC64MemLockLimitBytes(virDomainDefPtr def) } } + if (virDomainDefHasNVMeDisk(def)) + usesVFIO = true; + memory = virDomainDefGetMemoryTotal(def); if (def->mem.max_memory) @@ -12018,6 +12021,7 @@ unsigned long long qemuDomainGetMemLockLimitBytes(virDomainDefPtr def) { unsigned long long memKB = 0; + bool usesVFIO = false; size_t i; /* prefer the hard limit */ @@ -12058,11 +12062,17 @@ qemuDomainGetMemLockLimitBytes(virDomainDefPtr def) for (i = 0; i < def->nhostdevs; i++) { if (virHostdevIsVFIODevice(def->hostdevs[i]) || virHostdevIsMdevDevice(def->hostdevs[i])) { - memKB = virDomainDefGetMemoryTotal(def) + 1024 * 1024; - goto done; + usesVFIO = true; + break; } } + if (virDomainDefHasNVMeDisk(def)) + usesVFIO = true; + + if (usesVFIO) + memKB = virDomainDefGetMemoryTotal(def) + 1024 * 1024; + done: return memKB << 10; } -- 2.23.0

On 12/2/19 9:26 AM, Michal Privoznik wrote:
We have this beautiful function that does crystal ball divination. The function is named qemuDomainGetMemLockLimitBytes() and it calculates the upper limit of how much locked memory a given guest is going to need. The function bases its guess on devices defined for a domain. For instance, if there is a VFIO hostdev defined then it adds 1GiB to the guessed maximum. Since NVMe disks are pretty much VFIO hostdevs (but not quite), we have to do the same sorcery.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
ACKed-by: Peter Krempa <pkrempa@redhat.com>
---
 src/qemu/qemu_domain.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)
For the patch as-is:

Reviewed-by: Cole Robinson <crobinso@redhat.com>

But...
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index 59e98de2d2..7b515b9520 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -11920,6 +11920,9 @@ getPPC64MemLockLimitBytes(virDomainDefPtr def)
         }
     }
+    if (virDomainDefHasNVMeDisk(def))
+        usesVFIO = true;
+
We have qemuDomainNeedsVFIO which helpfully captures these rules. Looks like the logic has already diverged for ppc here, because it doesn't consider IsMdev devices. Something for a future patch

- Cole
memory = virDomainDefGetMemoryTotal(def);
     if (def->mem.max_memory)
@@ -12018,6 +12021,7 @@ unsigned long long
 qemuDomainGetMemLockLimitBytes(virDomainDefPtr def)
 {
     unsigned long long memKB = 0;
+    bool usesVFIO = false;
     size_t i;
     /* prefer the hard limit */
@@ -12058,11 +12062,17 @@ qemuDomainGetMemLockLimitBytes(virDomainDefPtr def)
     for (i = 0; i < def->nhostdevs; i++) {
         if (virHostdevIsVFIODevice(def->hostdevs[i]) ||
             virHostdevIsMdevDevice(def->hostdevs[i])) {
-            memKB = virDomainDefGetMemoryTotal(def) + 1024 * 1024;
-            goto done;
+            usesVFIO = true;
+            break;
         }
     }
+    if (virDomainDefHasNVMeDisk(def))
+        usesVFIO = true;
+
+    if (usesVFIO)
+        memKB = virDomainDefGetMemoryTotal(def) + 1024 * 1024;
+
  done:
     return memKB << 10;
 }
- Cole
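The control flow above reduces to one simple rule. As a stand-alone illustration — the function name and the simplified signature below are hypothetical, not the libvirt API — the limit calculation could be sketched as:

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the rule the patch implements: when any VFIO-backed device
 * (VFIO hostdev, mdev, or now an NVMe disk) is present, the memlock
 * limit becomes guest memory plus a fixed 1 GiB of expected IOMMU/IO
 * overhead.  Returns 0 when no VFIO device is present, meaning the
 * caller falls back to its other heuristics. */
static unsigned long long
sketchMemLockLimitKiB(unsigned long long memTotalKiB,
                      bool hasVFIOHostdev,
                      bool hasMdevHostdev,
                      bool hasNVMeDisk)
{
    bool usesVFIO = hasVFIOHostdev || hasMdevHostdev || hasNVMeDisk;

    if (!usesVFIO)
        return 0;

    return memTotalKiB + 1024 * 1024; /* + 1 GiB, expressed in KiB */
}
```

The point of the refactoring in the patch is exactly this: the three device classes collapse into a single usesVFIO flag, so the 1 GiB bump is computed in one place instead of being duplicated per device type.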

If a domain has an NVMe disk configured, then we need to create /dev/vfio/* paths in the domain's namespace so that qemu can open them.

Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
 src/qemu/qemu_domain.c | 67 ++++++++++++++++++++++++++++++++----------
 1 file changed, 51 insertions(+), 16 deletions(-)

diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index 7b515b9520..70c4ee8e65 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -13430,16 +13430,29 @@ qemuDomainSetupDisk(virQEMUDriverConfigPtr cfg G_GNUC_UNUSED,
 {
     virStorageSourcePtr next;
     char *dst = NULL;
+    bool hasNVMe = false;
     int ret = -1;
 
     for (next = disk->src; virStorageSourceIsBacking(next); next = next->backingStore) {
-        if (!next->path || !virStorageSourceIsLocalStorage(next)) {
-            /* Not creating device. Just continue. */
-            continue;
-        }
+        if (next->type == VIR_STORAGE_TYPE_NVME) {
+            g_autofree char *nvmePath = NULL;
 
-        if (qemuDomainCreateDevice(next->path, data, false) < 0)
-            goto cleanup;
+            hasNVMe = true;
+
+            if (!(nvmePath = virPCIDeviceAddressGetIOMMUGroupDev(&next->nvme->pciAddr)))
+                goto cleanup;
+
+            if (qemuDomainCreateDevice(nvmePath, data, false) < 0)
+                goto cleanup;
+        } else {
+            if (!next->path || !virStorageSourceIsLocalStorage(next)) {
+                /* Not creating device. Just continue. */
+                continue;
+            }
+
+            if (qemuDomainCreateDevice(next->path, data, false) < 0)
+                goto cleanup;
+        }
     }
 
     /* qemu-pr-helper might require access to /dev/mapper/control. */
@@ -13447,6 +13460,10 @@ qemuDomainSetupDisk(virQEMUDriverConfigPtr cfg G_GNUC_UNUSED,
         qemuDomainCreateDevice(QEMU_DEVICE_MAPPER_CONTROL_PATH, data, true) < 0)
         goto cleanup;
 
+    if (hasNVMe &&
+        qemuDomainCreateDevice(QEMU_DEV_VFIO, data, false) < 0)
+        goto cleanup;
+
     ret = 0;
  cleanup:
     VIR_FREE(dst);
@@ -14461,19 +14478,32 @@ qemuDomainNamespaceSetupDisk(virDomainObjPtr vm,
                              virStorageSourcePtr src)
 {
     virStorageSourcePtr next;
-    const char **paths = NULL;
+    char **paths = NULL;
     size_t npaths = 0;
-    char *dmPath = NULL;
+    bool hasNVMe = false;
+    g_autofree char *dmPath = NULL;
+    g_autofree char *vfioPath = NULL;
     int ret = -1;
 
     for (next = src; virStorageSourceIsBacking(next); next = next->backingStore) {
-        if (virStorageSourceIsEmpty(next) ||
-            !virStorageSourceIsLocalStorage(next)) {
-            /* Not creating device. Just continue. */
-            continue;
+        g_autofree char *tmpPath = NULL;
+
+        if (next->type == VIR_STORAGE_TYPE_NVME) {
+            hasNVMe = true;
+
+            if (!(tmpPath = virPCIDeviceAddressGetIOMMUGroupDev(&next->nvme->pciAddr)))
+                goto cleanup;
+        } else {
+            if (virStorageSourceIsEmpty(next) ||
+                !virStorageSourceIsLocalStorage(next)) {
+                /* Not creating device. Just continue. */
+                continue;
+            }
+
+            tmpPath = g_strdup(next->path);
         }
 
-        if (VIR_APPEND_ELEMENT_COPY(paths, npaths, next->path) < 0)
+        if (VIR_APPEND_ELEMENT(paths, npaths, tmpPath) < 0)
             goto cleanup;
     }
 
@@ -14484,13 +14514,18 @@ qemuDomainNamespaceSetupDisk(virDomainObjPtr vm,
             goto cleanup;
     }
 
-    if (qemuDomainNamespaceMknodPaths(vm, paths, npaths) < 0)
+    if (hasNVMe) {
+        vfioPath = g_strdup(QEMU_DEV_VFIO);
+        if (VIR_APPEND_ELEMENT(paths, npaths, vfioPath) < 0)
+            goto cleanup;
+    }
+
+    if (qemuDomainNamespaceMknodPaths(vm, (const char **) paths, npaths) < 0)
         goto cleanup;
 
     ret = 0;
  cleanup:
-    VIR_FREE(dmPath);
-    VIR_FREE(paths);
+    virStringListFreeCount(paths, npaths);
     return ret;
 }
-- 
2.23.0

On 12/2/19 9:26 AM, Michal Privoznik wrote:
If a domain has an NVMe disk configured, then we need to create /dev/vfio/* paths in the domain's namespace so that qemu can open them.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
 src/qemu/qemu_domain.c | 67 ++++++++++++++++++++++++++++++++----------
 1 file changed, 51 insertions(+), 16 deletions(-)
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index 7b515b9520..70c4ee8e65 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -13430,16 +13430,29 @@ qemuDomainSetupDisk(virQEMUDriverConfigPtr cfg G_GNUC_UNUSED,
 {
     virStorageSourcePtr next;
     char *dst = NULL;
+    bool hasNVMe = false;
     int ret = -1;
     for (next = disk->src; virStorageSourceIsBacking(next); next = next->backingStore) {
-        if (!next->path || !virStorageSourceIsLocalStorage(next)) {
-            /* Not creating device. Just continue. */
-            continue;
-        }
+        if (next->type == VIR_STORAGE_TYPE_NVME) {
+            g_autofree char *nvmePath = NULL;
-        if (qemuDomainCreateDevice(next->path, data, false) < 0)
-            goto cleanup;
+            hasNVMe = true;
+
+            if (!(nvmePath = virPCIDeviceAddressGetIOMMUGroupDev(&next->nvme->pciAddr)))
+                goto cleanup;
+
+            if (qemuDomainCreateDevice(nvmePath, data, false) < 0)
+                goto cleanup;
+        } else {
+            if (!next->path || !virStorageSourceIsLocalStorage(next)) {
+                /* Not creating device. Just continue. */
+                continue;
+            }
+
+            if (qemuDomainCreateDevice(next->path, data, false) < 0)
+                goto cleanup;
+        }
     }
I don't see anything wrong with this patch. In the interest of getting this series into git, here is my:

Reviewed-by: Cole Robinson <crobinso@redhat.com>

But, I'm finding a lot of the 'next->type == VIR_STORAGE_TYPE_NVME' checks sprinkled around unfortunate, here and in later patches. It seems like there are two StorageSource cases here involving local filesystem paths:

1) Does src->path contain the <source> storage location? Answered by virStorageSourceIsLocalStorage
2) Does src have a local path that the VM needs to access? Until now this was always the same as #1 && src->path, but not any more.

So maybe add helpers like:

bool
virStorageSourceHasLocalResource(virStorageSourcePtr src)
{
    return src->path || src->type == VIR_STORAGE_TYPE_NVME;
}

const char *
virStorageSourceGetLocalResource(virStorageSourcePtr src)
{
    if (!virStorageSourceHasLocalResource(src))
        <error>

    if (src->type == NVME)
        <return VFIO path or error>

    return strdup(src->path);
}

I think virStorageSourceIsLocalStorage should be renamed too, because now it is ambiguous. Maybe something more explicit like SourceUsesPath, or SourceIsLocalPath. Naming sucks obviously, so suggestions welcome.

With those functions, the loop above becomes:

for (...) {
    g_autofree char *path = NULL;

    if (!HasLocalResource)
        continue;

    if (!(path = GetLocalResource(...)))
        <error>

    <do stuff with path>
}

The hasNVMe bit can then be replaced with virStorageSourceChainHasNVMe. Security driver usage will be simpler too, I think. But maybe I'm missing some complication

- Cole

On 12/10/19 9:16 PM, Cole Robinson wrote:
On 12/2/19 9:26 AM, Michal Privoznik wrote:
If a domain has an NVMe disk configured, then we need to create /dev/vfio/* paths in the domain's namespace so that qemu can open them.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
 src/qemu/qemu_domain.c | 67 ++++++++++++++++++++++++++++++++----------
 1 file changed, 51 insertions(+), 16 deletions(-)
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index 7b515b9520..70c4ee8e65 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -13430,16 +13430,29 @@ qemuDomainSetupDisk(virQEMUDriverConfigPtr cfg G_GNUC_UNUSED,
 {
     virStorageSourcePtr next;
     char *dst = NULL;
+    bool hasNVMe = false;
     int ret = -1;
     for (next = disk->src; virStorageSourceIsBacking(next); next = next->backingStore) {
-        if (!next->path || !virStorageSourceIsLocalStorage(next)) {
-            /* Not creating device. Just continue. */
-            continue;
-        }
+        if (next->type == VIR_STORAGE_TYPE_NVME) {
+            g_autofree char *nvmePath = NULL;
-        if (qemuDomainCreateDevice(next->path, data, false) < 0)
-            goto cleanup;
+            hasNVMe = true;
+
+            if (!(nvmePath = virPCIDeviceAddressGetIOMMUGroupDev(&next->nvme->pciAddr)))
+                goto cleanup;
+
+            if (qemuDomainCreateDevice(nvmePath, data, false) < 0)
+                goto cleanup;
+        } else {
+            if (!next->path || !virStorageSourceIsLocalStorage(next)) {
+                /* Not creating device. Just continue. */
+                continue;
+            }
+
+            if (qemuDomainCreateDevice(next->path, data, false) < 0)
+                goto cleanup;
+        }
     }
I don't see anything wrong with this patch. In the interest of getting this series into git, here is my:
Reviewed-by: Cole Robinson <crobinso@redhat.com>
But, I'm finding a lot of the 'next->type == VIR_STORAGE_TYPE_NVME' sprinkled around unfortunate, here and in later patches. It seems like there's two StorageSource cases here involving local filesystem paths:
1) Does src->path contain the <source> storage location? Answered by virStorageSourceIsLocalStorage 2) Does src have a local path that the VM needs to access? Until now this was always the same as #1 && src->path, but not any more.
So maybe add helpers like:
bool
virStorageSourceHasLocalResource(virStorageSourcePtr src)
{
    return src->path || src->type == VIR_STORAGE_TYPE_NVME;
}

const char *
virStorageSourceGetLocalResource(virStorageSourcePtr src)
{
    if (!virStorageSourceHasLocalResource(src))
        <error>

    if (src->type == NVME)
        <return VFIO path or error>

    return strdup(src->path);
}
I think virStorageSourceIsLocalStorage should be renamed too because now it is ambiguous. Maybe something more explicit like SourceUsesPath, or SourceIsLocalPath. Naming sucks obviously so suggestions welcome
With those functions, the loop above becomes:
for (...) {
    g_autofree char *path = NULL;

    if (!HasLocalResource)
        continue;

    if (!(path = GetLocalResource(...)))
        <error>

    <do stuff with path>
}
The hasNVMe bit can then be replaced with virStorageSourceChainHasNVMe
security driver usage will be simpler too I think. But maybe I'm missing some complication
I think you're right. I will save that for a follow-up series.

Michal

There are a couple of places where a domain with a VFIO device gets special treatment: in CGroups when enabling/disabling access to /dev/vfio/vfio, and when creating/removing nodes in the domain mount namespace. Well, an NVMe disk is a VFIO device too. Fortunately, we have the qemuDomainNeedsVFIO() function, which is the only place that needs adjustment.

Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
 src/qemu/qemu_domain.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index 70c4ee8e65..e81d986e9e 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -12865,7 +12865,8 @@ bool
 qemuDomainNeedsVFIO(const virDomainDef *def)
 {
     return virDomainDefHasVFIOHostdev(def) ||
-           virDomainDefHasMdevHostdev(def);
+           virDomainDefHasMdevHostdev(def) ||
+           virDomainDefHasNVMeDisk(def);
 }
-- 
2.23.0

On 12/2/19 9:26 AM, Michal Privoznik wrote:
There are a couple of places where a domain with a VFIO device gets special treatment: in CGroups when enabling/disabling access to /dev/vfio/vfio, and when creating/removing nodes in the domain mount namespace. Well, an NVMe disk is a VFIO device too. Fortunately, we have the qemuDomainNeedsVFIO() function, which is the only place that needs adjustment.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
Reviewed-by: Cole Robinson <crobinso@redhat.com>

- Cole

If a domain has an NVMe disk configured, then we need to allow it in the devices CGroup so that qemu can access it. There is one caveat though - if an NVMe disk is read-only, we need the CGroup to allow write too. This is because when opening the device, qemu does a couple of ioctl()-s which are considered writes.

Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
 src/qemu/qemu_cgroup.c | 97 ++++++++++++++++++++++++++++++------------
 1 file changed, 69 insertions(+), 28 deletions(-)

diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c
index d23318db8c..1d3994ecc9 100644
--- a/src/qemu/qemu_cgroup.c
+++ b/src/qemu/qemu_cgroup.c
@@ -118,10 +118,29 @@ qemuSetupImageCgroupInternal(virDomainObjPtr vm,
                              virStorageSourcePtr src,
                              bool forceReadonly)
 {
-    if (!src->path || !virStorageSourceIsLocalStorage(src)) {
-        VIR_DEBUG("Not updating cgroups for disk path '%s', type: %s",
-                  NULLSTR(src->path), virStorageTypeToString(src->type));
-        return 0;
+    g_autofree char *path = NULL;
+    bool readonly = src->readonly || forceReadonly;
+
+    if (src->type == VIR_STORAGE_TYPE_NVME) {
+        /* Even though disk is R/O we can't make it so in
+         * CGroups. QEMU will try to do some ioctl()-s over the
+         * device and such operations are considered R/W by the
+         * kernel */
+        readonly = false;
+
+        if (!(path = virPCIDeviceAddressGetIOMMUGroupDev(&src->nvme->pciAddr)))
+            return -1;
+
+        if (qemuSetupImagePathCgroup(vm, QEMU_DEV_VFIO, false) < 0)
+            return -1;
+    } else {
+        if (!src->path || !virStorageSourceIsLocalStorage(src)) {
+            VIR_DEBUG("Not updating cgroups for disk path '%s', type: %s",
+                      NULLSTR(src->path), virStorageTypeToString(src->type));
+            return 0;
+        }
+
+        path = g_strdup(src->path);
     }
 
     if (virStoragePRDefIsManaged(src->pr) &&
@@ -129,7 +148,7 @@ qemuSetupImageCgroupInternal(virDomainObjPtr vm,
         qemuSetupImagePathCgroup(vm, QEMU_DEVICE_MAPPER_CONTROL_PATH, false) < 0)
         return -1;
 
-    return qemuSetupImagePathCgroup(vm, src->path, src->readonly || forceReadonly);
+    return qemuSetupImagePathCgroup(vm, path, readonly);
 }
 
 
@@ -146,7 +165,10 @@ qemuTeardownImageCgroup(virDomainObjPtr vm,
                         virStorageSourcePtr src)
 {
     qemuDomainObjPrivatePtr priv = vm->privateData;
+    g_autofree char *path = NULL;
     int perms = VIR_CGROUP_DEVICE_RWM;
+    bool hasPR = false;
+    bool hasNVMe = false;
     size_t i;
     int ret;
 
@@ -154,41 +176,60 @@ qemuTeardownImageCgroup(virDomainObjPtr vm,
                                  VIR_CGROUP_CONTROLLER_DEVICES))
         return 0;
 
-    if (!src->path || !virStorageSourceIsLocalStorage(src)) {
-        VIR_DEBUG("Not updating cgroups for disk path '%s', type: %s",
-                  NULLSTR(src->path), virStorageTypeToString(src->type));
-        return 0;
-    }
+    for (i = 0; i < vm->def->ndisks; i++) {
+        virStorageSourcePtr diskSrc = vm->def->disks[i]->src;
 
-    if (virFileExists(QEMU_DEVICE_MAPPER_CONTROL_PATH)) {
-        for (i = 0; i < vm->def->ndisks; i++) {
-            virStorageSourcePtr diskSrc = vm->def->disks[i]->src;
+        if (src == diskSrc)
+            continue;
 
-            if (src == diskSrc)
-                continue;
+        if (virStoragePRDefIsManaged(diskSrc->pr))
+            hasPR = true;
 
-            if (virStoragePRDefIsManaged(diskSrc->pr))
-                break;
-        }
+        if (virStorageSourceChainHasNVMe(diskSrc))
+            hasNVMe = true;
+    }
 
-        if (i == vm->def->ndisks) {
-            VIR_DEBUG("Disabling device mapper control");
-            ret = virCgroupDenyDevicePath(priv->cgroup,
-                                          QEMU_DEVICE_MAPPER_CONTROL_PATH,
-                                          perms, true);
+    if (src->type == VIR_STORAGE_TYPE_NVME) {
+        if (!(path = virPCIDeviceAddressGetIOMMUGroupDev(&src->nvme->pciAddr)))
+            return -1;
+
+        if (!hasNVMe &&
+            !qemuDomainNeedsVFIO(vm->def)) {
+            ret = virCgroupDenyDevicePath(priv->cgroup, QEMU_DEV_VFIO, perms, true);
             virDomainAuditCgroupPath(vm, priv->cgroup, "deny",
-                                     QEMU_DEVICE_MAPPER_CONTROL_PATH,
+                                     QEMU_DEV_VFIO,
                                      virCgroupGetDevicePermsString(perms), ret);
             if (ret < 0)
-                return ret;
+                return -1;
         }
+    } else {
+        if (!src->path || !virStorageSourceIsLocalStorage(src)) {
+            VIR_DEBUG("Not updating cgroups for disk path '%s', type: %s",
+                      NULLSTR(src->path), virStorageTypeToString(src->type));
+            return 0;
+        }
+
+        path = g_strdup(src->path);
+    }
+
+    if (!hasPR &&
+        virFileExists(QEMU_DEVICE_MAPPER_CONTROL_PATH)) {
+        VIR_DEBUG("Disabling device mapper control");
+        ret = virCgroupDenyDevicePath(priv->cgroup,
+                                      QEMU_DEVICE_MAPPER_CONTROL_PATH,
+                                      perms, true);
+        virDomainAuditCgroupPath(vm, priv->cgroup, "deny",
+                                 QEMU_DEVICE_MAPPER_CONTROL_PATH,
+                                 virCgroupGetDevicePermsString(perms), ret);
+        if (ret < 0)
+            return ret;
     }
 
-    VIR_DEBUG("Deny path %s", src->path);
+    VIR_DEBUG("Deny path %s", path);
 
-    ret = virCgroupDenyDevicePath(priv->cgroup, src->path, perms, true);
+    ret = virCgroupDenyDevicePath(priv->cgroup, path, perms, true);
 
-    virDomainAuditCgroupPath(vm, priv->cgroup, "deny", src->path,
+    virDomainAuditCgroupPath(vm, priv->cgroup, "deny", path,
                              virCgroupGetDevicePermsString(perms), ret);
 
     /* If you're looking for a counter part to
-- 
2.23.0

On 12/2/19 9:26 AM, Michal Privoznik wrote:
If a domain has an NVMe disk configured, then we need to allow it in the devices CGroup so that qemu can access it. There is one caveat though - if an NVMe disk is read-only, we need the CGroup to allow write too. This is because when opening the device, qemu does a couple of ioctl()-s which are considered writes.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
 src/qemu/qemu_cgroup.c | 97 ++++++++++++++++++++++++++++++------------
 1 file changed, 69 insertions(+), 28 deletions(-)
A lot going on here, a bit tough to digest; the extra helpers I suggested should help here. But I didn't notice anything bad.

Reviewed-by: Cole Robinson <crobinso@redhat.com>

- Cole
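The read-only caveat from the commit message boils down to a single predicate. A minimal stand-alone sketch (hypothetical name, not libvirt code):

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the cgroup permission rule from the patch: a read-only
 * NVMe disk still needs write access at the devices-cgroup level,
 * because the ioctl()-s QEMU issues while setting up the VFIO device
 * are classified as writes by the kernel.  For every other disk type
 * the usual readonly/forceReadonly logic applies. */
static bool
sketchCgroupReadonly(bool srcReadonly, bool forceReadonly, bool isNVMe)
{
    bool readonly = srcReadonly || forceReadonly;

    if (isNVMe)
        readonly = false;

    return readonly;
}
```

This is exactly the `readonly = false` override at the top of the NVMe branch in qemuSetupImageCgroupInternal() above.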

This function is currently not called for any type of storage source that is not considered 'local' (as defined by virStorageSourceIsLocalStorage()). Well, NVMe disks are not 'local' from that point of view and therefore we will need to call this function more frequently.

Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
 src/security/security_apparmor.c | 25 ++++++++++++----
 src/security/security_dac.c      | 30 +++++++++++++++++++
 src/security/security_selinux.c  | 51 +++++++++++++++++++++++++++-----
 3 files changed, 92 insertions(+), 14 deletions(-)

diff --git a/src/security/security_apparmor.c b/src/security/security_apparmor.c
index 21560b2330..7cea788931 100644
--- a/src/security/security_apparmor.c
+++ b/src/security/security_apparmor.c
@@ -779,9 +779,8 @@ AppArmorSetSecurityImageLabel(virSecurityManagerPtr mgr,
                               virSecurityDomainImageLabelFlags flags G_GNUC_UNUSED)
 {
     virSecurityLabelDefPtr secdef;
-
-    if (!src->path || !virStorageSourceIsLocalStorage(src))
-        return 0;
+    g_autofree char *vfioGroupDev = NULL;
+    const char *path;
 
     secdef = virDomainDefGetSecurityLabelDef(def, SECURITY_APPARMOR_NAME);
     if (!secdef || !secdef->relabel)
@@ -790,15 +789,29 @@ AppArmorSetSecurityImageLabel(virSecurityManagerPtr mgr,
     if (!secdef->imagelabel)
         return 0;
 
+    if (src->type == VIR_STORAGE_TYPE_NVME) {
+        const virStorageSourceNVMeDef *nvme = src->nvme;
+
+        if (!(vfioGroupDev = virPCIDeviceAddressGetIOMMUGroupDev(&nvme->pciAddr)))
+            return -1;
+
+        path = vfioGroupDev;
+    } else {
+        if (!src->path || !virStorageSourceIsLocalStorage(src))
+            return 0;
+
+        path = src->path;
+    }
+
     /* if the device doesn't exist, error out */
-    if (!virFileExists(src->path)) {
+    if (!virFileExists(path)) {
         virReportError(VIR_ERR_INTERNAL_ERROR,
                        _("\'%s\' does not exist"),
-                       src->path);
+                       path);
         return -1;
     }
 
-    return reload_profile(mgr, def, src->path, true);
+    return reload_profile(mgr, def, path, true);
 }
 
 static int
diff --git a/src/security/security_dac.c b/src/security/security_dac.c
index a9a1fad5d7..411aa1b159 100644
--- a/src/security/security_dac.c
+++ b/src/security/security_dac.c
@@ -911,6 +911,19 @@ virSecurityDACSetImageLabelInternal(virSecurityManagerPtr mgr,
         return -1;
     }
 
+    /* This is not very clean. But so far we don't have NVMe
+     * storage pool backend so that its chownCallback would be
+     * called. And this place looks least offensive. */
+    if (src->type == VIR_STORAGE_TYPE_NVME) {
+        const virStorageSourceNVMeDef *nvme = src->nvme;
+        g_autofree char *vfioGroupDev = NULL;
+
+        if (!(vfioGroupDev = virPCIDeviceAddressGetIOMMUGroupDev(&nvme->pciAddr)))
+            return -1;
+
+        return virSecurityDACSetOwnership(mgr, NULL, vfioGroupDev, user, group, false);
+    }
+
     /* We can't do restore on shared resources safely. Not even
      * with refcounting implemented in XATTRs because if there
      * was a domain running with the feature turned off the
@@ -1017,6 +1030,23 @@ virSecurityDACRestoreImageLabelSingle(virSecurityManagerPtr mgr,
         }
     }
 
+    /* This is not very clean. But so far we don't have NVMe
+     * storage pool backend so that its chownCallback would be
+     * called. And this place looks least offensive. */
+    if (src->type == VIR_STORAGE_TYPE_NVME) {
+        const virStorageSourceNVMeDef *nvme = src->nvme;
+        g_autofree char *vfioGroupDev = NULL;
+
+        if (!(vfioGroupDev = virPCIDeviceAddressGetIOMMUGroupDev(&nvme->pciAddr)))
+            return -1;
+
+        /* Ideally, we would check if there is not another PCI
+         * device within domain def that is in the same IOMMU
+         * group. But we're not doing that for hostdevs yet. */
+
+        return virSecurityDACRestoreFileLabelInternal(mgr, NULL, vfioGroupDev, false);
+    }
+
     return virSecurityDACRestoreFileLabelInternal(mgr, src, NULL, true);
 }
diff --git a/src/security/security_selinux.c b/src/security/security_selinux.c
index e05ef7593e..26bbe55ce7 100644
--- a/src/security/security_selinux.c
+++ b/src/security/security_selinux.c
@@ -1729,9 +1729,8 @@ virSecuritySELinuxRestoreImageLabelSingle(virSecurityManagerPtr mgr,
 {
     virSecurityLabelDefPtr seclabel;
     virSecurityDeviceLabelDefPtr disk_seclabel;
-
-    if (!src->path || !virStorageSourceIsLocalStorage(src))
-        return 0;
+    g_autofree char *vfioGroupDev = NULL;
+    const char *path = src->path;
 
     seclabel = virDomainDefGetSecurityLabelDef(def, SECURITY_SELINUX_NAME);
     if (seclabel == NULL)
@@ -1763,9 +1762,16 @@ virSecuritySELinuxRestoreImageLabelSingle(virSecurityManagerPtr mgr,
      * ownership, because that kills access on the destination host which is
      * sub-optimal for the guest VM's I/O attempts :-) */
     if (migrated) {
-        int rc = virFileIsSharedFS(src->path);
-        if (rc < 0)
-            return -1;
+        int rc = 1;
+
+        if (virStorageSourceIsLocalStorage(src)) {
+            if (!src->path)
+                return 0;
+
+            if ((rc = virFileIsSharedFS(src->path)) < 0)
+                return -1;
+        }
+
         if (rc == 1) {
             VIR_DEBUG("Skipping image label restore on %s because FS is shared",
                       src->path);
@@ -1773,7 +1779,22 @@
         }
     }
 
-    return virSecuritySELinuxRestoreFileLabel(mgr, src->path, true);
+    /* This is not very clean. But so far we don't have NVMe
+     * storage pool backend so that its chownCallback would be
+     * called. And this place looks least offensive. */
+    if (src->type == VIR_STORAGE_TYPE_NVME) {
+        const virStorageSourceNVMeDef *nvme = src->nvme;
+
+        if (!(vfioGroupDev = virPCIDeviceAddressGetIOMMUGroupDev(&nvme->pciAddr)))
+            return -1;
+
+        /* Ideally, we would check if there is not another PCI
+         * device within domain def that is in the same IOMMU
+         * group. But we're not doing that for hostdevs yet. */
+        path = vfioGroupDev;
+    }
+
+    return virSecuritySELinuxRestoreFileLabel(mgr, path, true);
 }
 
 
@@ -1820,6 +1841,8 @@ virSecuritySELinuxSetImageLabelInternal(virSecurityManagerPtr mgr,
     char *use_label = NULL;
     bool remember;
     bool is_toplevel = parent == src || parent->externalDataStore == src;
+    g_autofree char *vfioGroupDev = NULL;
+    const char *path = src->path;
     int ret;
 
     if (!src->path || !virStorageSourceIsLocalStorage(src))
@@ -1872,7 +1895,19 @@ virSecuritySELinuxSetImageLabelInternal(virSecurityManagerPtr mgr,
         use_label = data->content_context;
     }
 
-    ret = virSecuritySELinuxSetFilecon(mgr, src->path, use_label, remember);
+    /* This is not very clean. But so far we don't have NVMe
+     * storage pool backend so that its chownCallback would be
+     * called. And this place looks least offensive. */
+    if (src->type == VIR_STORAGE_TYPE_NVME) {
+        const virStorageSourceNVMeDef *nvme = src->nvme;
+
+        if (!(vfioGroupDev = virPCIDeviceAddressGetIOMMUGroupDev(&nvme->pciAddr)))
+            return -1;
+
+        path = vfioGroupDev;
+    }
+
+    ret = virSecuritySELinuxSetFilecon(mgr, path, use_label, remember);
 
     if (ret == 1 && !disk_seclabel) {
         /* If we failed to set a label, but virt_use_nfs let us
-- 
2.23.0

On 12/2/19 9:26 AM, Michal Privoznik wrote:
This function is currently not called for any type of storage source that is not considered 'local' (as defined by virStorageSourceIsLocalStorage()). Well, NVMe disks are not 'local' from that point of view and therefore we will need to call this function more frequently.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
 src/security/security_apparmor.c | 25 ++++++++++++----
 src/security/security_dac.c      | 30 +++++++++++++++++++
 src/security/security_selinux.c  | 51 +++++++++++++++++++++++++++-----
 3 files changed, 92 insertions(+), 14 deletions(-)
diff --git a/src/security/security_apparmor.c b/src/security/security_apparmor.c
index 21560b2330..7cea788931 100644
--- a/src/security/security_apparmor.c
+++ b/src/security/security_apparmor.c
@@ -779,9 +779,8 @@ AppArmorSetSecurityImageLabel(virSecurityManagerPtr mgr,
                               virSecurityDomainImageLabelFlags flags G_GNUC_UNUSED)
 {
     virSecurityLabelDefPtr secdef;
-
-    if (!src->path || !virStorageSourceIsLocalStorage(src))
-        return 0;
+    g_autofree char *vfioGroupDev = NULL;
+    const char *path;
 
     secdef = virDomainDefGetSecurityLabelDef(def, SECURITY_APPARMOR_NAME);
     if (!secdef || !secdef->relabel)
@@ -790,15 +789,29 @@ AppArmorSetSecurityImageLabel(virSecurityManagerPtr mgr,
     if (!secdef->imagelabel)
         return 0;
 
+    if (src->type == VIR_STORAGE_TYPE_NVME) {
+        const virStorageSourceNVMeDef *nvme = src->nvme;
+
+        if (!(vfioGroupDev = virPCIDeviceAddressGetIOMMUGroupDev(&nvme->pciAddr)))
+            return -1;
+
+        path = vfioGroupDev;
+    } else {
+        if (!src->path || !virStorageSourceIsLocalStorage(src))
+            return 0;
+
+        path = src->path;
+    }
+
     /* if the device doesn't exist, error out */
-    if (!virFileExists(src->path)) {
+    if (!virFileExists(path)) {
         virReportError(VIR_ERR_INTERNAL_ERROR,
                        _("\'%s\' does not exist"),
-                       src->path);
+                       path);
         return -1;
     }
 
-    return reload_profile(mgr, def, src->path, true);
+    return reload_profile(mgr, def, path, true);
 }
 
 static int
diff --git a/src/security/security_dac.c b/src/security/security_dac.c
index a9a1fad5d7..411aa1b159 100644
--- a/src/security/security_dac.c
+++ b/src/security/security_dac.c
@@ -911,6 +911,19 @@ virSecurityDACSetImageLabelInternal(virSecurityManagerPtr mgr,
         return -1;
     }
 
+    /* This is not very clean. But so far we don't have NVMe
+     * storage pool backend so that its chownCallback would be
+     * called. And this place looks least offensive. */
+    if (src->type == VIR_STORAGE_TYPE_NVME) {
+        const virStorageSourceNVMeDef *nvme = src->nvme;
+        g_autofree char *vfioGroupDev = NULL;
+
+        if (!(vfioGroupDev = virPCIDeviceAddressGetIOMMUGroupDev(&nvme->pciAddr)))
+            return -1;
+
+        return virSecurityDACSetOwnership(mgr, NULL, vfioGroupDev, user, group, false);
+    }
+
     /* We can't do restore on shared resources safely. Not even
      * with refcounting implemented in XATTRs because if there
      * was a domain running with the feature turned off the
@@ -1017,6 +1030,23 @@ virSecurityDACRestoreImageLabelSingle(virSecurityManagerPtr mgr,
         }
     }
 
+    /* This is not very clean. But so far we don't have NVMe
+     * storage pool backend so that its chownCallback would be
+     * called. And this place looks least offensive. */
+    if (src->type == VIR_STORAGE_TYPE_NVME) {
+        const virStorageSourceNVMeDef *nvme = src->nvme;
+        g_autofree char *vfioGroupDev = NULL;
+
+        if (!(vfioGroupDev = virPCIDeviceAddressGetIOMMUGroupDev(&nvme->pciAddr)))
+            return -1;
+
+        /* Ideally, we would check if there is not another PCI
+         * device within domain def that is in the same IOMMU
+         * group. But we're not doing that for hostdevs yet. */
+
+        return virSecurityDACRestoreFileLabelInternal(mgr, NULL, vfioGroupDev, false);
+    }
+
     return virSecurityDACRestoreFileLabelInternal(mgr, src, NULL, true);
 }
I think my suggested helpers will simplify selinux and apparmor a bit. Not sure about DAC, but I don't fully understand the chownCallback + transaction stuff. Looks mechanically sound AFAICT.

Reviewed-by: Cole Robinson <crobinso@redhat.com>

- Cole
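One subtlety in the SELinux hunk above is the migration path: restore is skipped when the image lives on a shared filesystem, and with this patch a non-local source (such as NVMe, where rc defaults to 1) is skipped the same way. A stand-alone sketch of that decision (hypothetical name, simplified inputs — not libvirt code):

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the restore-skip rule in
 * virSecuritySELinuxRestoreImageLabelSingle(): after migration, never
 * touch labels on storage the destination host may still be using.
 * Shared-FS images are skipped, and non-local sources (NVMe) are
 * treated the same way because rc starts out as 1. */
static bool
sketchSkipRestoreOnMigration(bool migrated, bool isLocalStorage, bool onSharedFS)
{
    if (!migrated)
        return false;       /* normal teardown: restore the label */

    if (!isLocalStorage)
        return true;        /* e.g. NVMe: no shared-FS check possible, skip */

    return onSharedFS;      /* local file: skip only if the FS is shared */
}
```

This keeps the migration-safety property from the original code comment: restoring ownership on shared storage would kill I/O access for the guest already running on the destination host.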

This capability tracks if qemu is capable of:

  -drive file.driver=nvme

The feature was added in QEMU's commit of v2.12.0-rc0~104^2~2.

Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
 src/qemu/qemu_capabilities.c                       | 2 ++
 src/qemu/qemu_capabilities.h                       | 1 +
 tests/qemucapabilitiesdata/caps_2.12.0.aarch64.xml | 1 +
 tests/qemucapabilitiesdata/caps_2.12.0.ppc64.xml   | 1 +
 tests/qemucapabilitiesdata/caps_2.12.0.s390x.xml   | 1 +
 tests/qemucapabilitiesdata/caps_2.12.0.x86_64.xml  | 1 +
 tests/qemucapabilitiesdata/caps_3.0.0.ppc64.xml    | 1 +
 tests/qemucapabilitiesdata/caps_3.0.0.riscv32.xml  | 1 +
 tests/qemucapabilitiesdata/caps_3.0.0.riscv64.xml  | 1 +
 tests/qemucapabilitiesdata/caps_3.0.0.s390x.xml    | 1 +
 tests/qemucapabilitiesdata/caps_3.0.0.x86_64.xml   | 1 +
 tests/qemucapabilitiesdata/caps_3.1.0.ppc64.xml    | 1 +
 tests/qemucapabilitiesdata/caps_3.1.0.x86_64.xml   | 1 +
 tests/qemucapabilitiesdata/caps_4.0.0.aarch64.xml  | 1 +
 tests/qemucapabilitiesdata/caps_4.0.0.ppc64.xml    | 1 +
 tests/qemucapabilitiesdata/caps_4.0.0.riscv32.xml  | 1 +
 tests/qemucapabilitiesdata/caps_4.0.0.riscv64.xml  | 1 +
 tests/qemucapabilitiesdata/caps_4.0.0.s390x.xml    | 1 +
 tests/qemucapabilitiesdata/caps_4.0.0.x86_64.xml   | 1 +
 tests/qemucapabilitiesdata/caps_4.1.0.x86_64.xml   | 1 +
 tests/qemucapabilitiesdata/caps_4.2.0.aarch64.xml  | 1 +
 tests/qemucapabilitiesdata/caps_4.2.0.ppc64.xml    | 1 +
 tests/qemucapabilitiesdata/caps_4.2.0.s390x.xml    | 1 +
 tests/qemucapabilitiesdata/caps_4.2.0.x86_64.xml   | 1 +
 24 files changed, 25 insertions(+)

diff --git a/src/qemu/qemu_capabilities.c b/src/qemu/qemu_capabilities.c
index f65af5c228..4b3b4f2eff 100644
--- a/src/qemu/qemu_capabilities.c
+++ b/src/qemu/qemu_capabilities.c
@@ -552,6 +552,7 @@ VIR_ENUM_IMPL(virQEMUCaps,
               "arm-max-cpu",
               "blockdev-file-dynamic-auto-read-only",
               "savevm-monitor-nodes",
+              "drive-nvme",
 );
@@ -1399,6 +1400,7 @@ static struct virQEMUCapsStringFlags virQEMUCapsQMPSchemaQueries[] = {
     { "blockdev-add/arg-type/+file/drop-cache", QEMU_CAPS_MIGRATION_FILE_DROP_CACHE },
     { "blockdev-add/arg-type/+file/$dynamic-auto-read-only", QEMU_CAPS_BLOCK_FILE_AUTO_READONLY_DYNAMIC },
     { "human-monitor-command/$savevm-monitor-nodes", QEMU_CAPS_SAVEVM_MONITOR_NODES },
+    { "blockdev-add/arg-type/+nvme", QEMU_CAPS_DRIVE_NVME },
 };
 
 typedef struct _virQEMUCapsObjectTypeProps virQEMUCapsObjectTypeProps;
diff --git a/src/qemu/qemu_capabilities.h b/src/qemu/qemu_capabilities.h
index 3fd8bebe79..3a9b70c1cf 100644
--- a/src/qemu/qemu_capabilities.h
+++ b/src/qemu/qemu_capabilities.h
@@ -533,6 +533,7 @@ typedef enum { /* virQEMUCapsFlags grouping marker for syntax-check */
     QEMU_CAPS_ARM_MAX_CPU, /* max-arm-cpu type exists */
     QEMU_CAPS_BLOCK_FILE_AUTO_READONLY_DYNAMIC, /* the auto-read-only property of block backends for files is dynamic */
     QEMU_CAPS_SAVEVM_MONITOR_NODES, /* 'savevm' handles monitor-owned nodes properly */
+    QEMU_CAPS_DRIVE_NVME, /* -drive file.driver=nvme */
 
     QEMU_CAPS_LAST /* this must always be the last item */
 } virQEMUCapsFlags;
diff --git a/tests/qemucapabilitiesdata/caps_2.12.0.aarch64.xml b/tests/qemucapabilitiesdata/caps_2.12.0.aarch64.xml
index 5d1bfb798f..72c63533a3 100644
--- a/tests/qemucapabilitiesdata/caps_2.12.0.aarch64.xml
+++ b/tests/qemucapabilitiesdata/caps_2.12.0.aarch64.xml
@@ -154,6 +154,7 @@
   <flag name='iothread.poll-max-ns'/>
   <flag name='memory-backend-file.align'/>
   <flag name='arm-max-cpu'/>
+  <flag name='drive-nvme'/>
   <version>2012000</version>
   <kvmVersion>0</kvmVersion>
   <microcodeVersion>61700289</microcodeVersion>
diff --git a/tests/qemucapabilitiesdata/caps_2.12.0.ppc64.xml b/tests/qemucapabilitiesdata/caps_2.12.0.ppc64.xml
index 84179e8ef6..147c017fe3 100644
--- a/tests/qemucapabilitiesdata/caps_2.12.0.ppc64.xml
+++ b/tests/qemucapabilitiesdata/caps_2.12.0.ppc64.xml
@@ -151,6 +151,7 @@
   <flag name='memory-backend-memfd.hugetlb'/>
   <flag name='iothread.poll-max-ns'/>
   <flag name='memory-backend-file.align'/>
+  <flag name='drive-nvme'/>
   <version>2011090</version>
   <kvmVersion>0</kvmVersion>
   <microcodeVersion>42900289</microcodeVersion>
diff --git a/tests/qemucapabilitiesdata/caps_2.12.0.s390x.xml b/tests/qemucapabilitiesdata/caps_2.12.0.s390x.xml
index 4be4018d44..8ff0633724 100644
--- a/tests/qemucapabilitiesdata/caps_2.12.0.s390x.xml
+++ b/tests/qemucapabilitiesdata/caps_2.12.0.s390x.xml
@@ -123,6 +123,7 @@
   <flag name='memory-backend-file.align'/>
   <flag name='query-cpu-model-baseline'/>
   <flag name='query-cpu-model-comparison'/>
+  <flag name='drive-nvme'/>
   <version>2012000</version>
   <kvmVersion>0</kvmVersion>
   <microcodeVersion>39100289</microcodeVersion>
diff --git a/tests/qemucapabilitiesdata/caps_2.12.0.x86_64.xml b/tests/qemucapabilitiesdata/caps_2.12.0.x86_64.xml
index 81903cd493..85954d352e 100644
--- a/tests/qemucapabilitiesdata/caps_2.12.0.x86_64.xml
+++ b/tests/qemucapabilitiesdata/caps_2.12.0.x86_64.xml
@@ -195,6 +195,7 @@
   <flag name='iothread.poll-max-ns'/>
   <flag name='memory-backend-file.align'/>
   <flag name='x86-max-cpu'/>
+  <flag name='drive-nvme'/>
   <version>2011090</version>
   <kvmVersion>0</kvmVersion>
   <microcodeVersion>43100289</microcodeVersion>
diff --git a/tests/qemucapabilitiesdata/caps_3.0.0.ppc64.xml b/tests/qemucapabilitiesdata/caps_3.0.0.ppc64.xml
index 8714c3c1f4..baea300753 100644
--- a/tests/qemucapabilitiesdata/caps_3.0.0.ppc64.xml
+++ b/tests/qemucapabilitiesdata/caps_3.0.0.ppc64.xml
@@ -153,6 +153,7 @@
   <flag name='memory-backend-file.align'/>
   <flag name='bochs-display'/>
   <flag name='ramfb'/>
+  <flag name='drive-nvme'/>
   <version>2012050</version>
   <kvmVersion>0</kvmVersion>
   <microcodeVersion>42900239</microcodeVersion>
diff --git a/tests/qemucapabilitiesdata/caps_3.0.0.riscv32.xml b/tests/qemucapabilitiesdata/caps_3.0.0.riscv32.xml
index a31fc2a250..3b81ea11c3 100644
--- a/tests/qemucapabilitiesdata/caps_3.0.0.riscv32.xml
+++ b/tests/qemucapabilitiesdata/caps_3.0.0.riscv32.xml
@@ -93,6 +93,7 @@
   <flag name='iothread.poll-max-ns'/>
   <flag name='memory-backend-file.align'/>
   <flag name='ramfb'/>
+  <flag
name='drive-nvme'/> <version>3000000</version> <kvmVersion>0</kvmVersion> <microcodeVersion>0</microcodeVersion> diff --git a/tests/qemucapabilitiesdata/caps_3.0.0.riscv64.xml b/tests/qemucapabilitiesdata/caps_3.0.0.riscv64.xml index 6bf0e68725..615667d9b2 100644 --- a/tests/qemucapabilitiesdata/caps_3.0.0.riscv64.xml +++ b/tests/qemucapabilitiesdata/caps_3.0.0.riscv64.xml @@ -93,6 +93,7 @@ <flag name='iothread.poll-max-ns'/> <flag name='memory-backend-file.align'/> <flag name='ramfb'/> + <flag name='drive-nvme'/> <version>3000000</version> <kvmVersion>0</kvmVersion> <microcodeVersion>0</microcodeVersion> diff --git a/tests/qemucapabilitiesdata/caps_3.0.0.s390x.xml b/tests/qemucapabilitiesdata/caps_3.0.0.s390x.xml index 941369003b..34a4999294 100644 --- a/tests/qemucapabilitiesdata/caps_3.0.0.s390x.xml +++ b/tests/qemucapabilitiesdata/caps_3.0.0.s390x.xml @@ -126,6 +126,7 @@ <flag name='query-cpu-model-baseline'/> <flag name='query-cpu-model-comparison'/> <flag name='ramfb'/> + <flag name='drive-nvme'/> <version>3000000</version> <kvmVersion>0</kvmVersion> <microcodeVersion>39100239</microcodeVersion> diff --git a/tests/qemucapabilitiesdata/caps_3.0.0.x86_64.xml b/tests/qemucapabilitiesdata/caps_3.0.0.x86_64.xml index 92abf01e4e..3d5535f404 100644 --- a/tests/qemucapabilitiesdata/caps_3.0.0.x86_64.xml +++ b/tests/qemucapabilitiesdata/caps_3.0.0.x86_64.xml @@ -200,6 +200,7 @@ <flag name='x86-max-cpu'/> <flag name='bochs-display'/> <flag name='ramfb'/> + <flag name='drive-nvme'/> <version>3000000</version> <kvmVersion>0</kvmVersion> <microcodeVersion>43100239</microcodeVersion> diff --git a/tests/qemucapabilitiesdata/caps_3.1.0.ppc64.xml b/tests/qemucapabilitiesdata/caps_3.1.0.ppc64.xml index 5667860b44..0f64cfb2ba 100644 --- a/tests/qemucapabilitiesdata/caps_3.1.0.ppc64.xml +++ b/tests/qemucapabilitiesdata/caps_3.1.0.ppc64.xml @@ -157,6 +157,7 @@ <flag name='memory-backend-file.pmem'/> <flag name='overcommit'/> <flag name='bochs-display'/> + <flag 
name='drive-nvme'/> <version>3000091</version> <kvmVersion>0</kvmVersion> <microcodeVersion>42900240</microcodeVersion> diff --git a/tests/qemucapabilitiesdata/caps_3.1.0.x86_64.xml b/tests/qemucapabilitiesdata/caps_3.1.0.x86_64.xml index 5cdba54a11..c468ad0153 100644 --- a/tests/qemucapabilitiesdata/caps_3.1.0.x86_64.xml +++ b/tests/qemucapabilitiesdata/caps_3.1.0.x86_64.xml @@ -203,6 +203,7 @@ <flag name='x86-max-cpu'/> <flag name='bochs-display'/> <flag name='ramfb'/> + <flag name='drive-nvme'/> <version>3000092</version> <kvmVersion>0</kvmVersion> <microcodeVersion>43100240</microcodeVersion> diff --git a/tests/qemucapabilitiesdata/caps_4.0.0.aarch64.xml b/tests/qemucapabilitiesdata/caps_4.0.0.aarch64.xml index dacbadd208..9fc3c7e6f5 100644 --- a/tests/qemucapabilitiesdata/caps_4.0.0.aarch64.xml +++ b/tests/qemucapabilitiesdata/caps_4.0.0.aarch64.xml @@ -167,6 +167,7 @@ <flag name='migration-file-drop-cache'/> <flag name='ramfb'/> <flag name='arm-max-cpu'/> + <flag name='drive-nvme'/> <version>4000000</version> <kvmVersion>0</kvmVersion> <microcodeVersion>61700240</microcodeVersion> diff --git a/tests/qemucapabilitiesdata/caps_4.0.0.ppc64.xml b/tests/qemucapabilitiesdata/caps_4.0.0.ppc64.xml index cd860daa9f..e84dd557a4 100644 --- a/tests/qemucapabilitiesdata/caps_4.0.0.ppc64.xml +++ b/tests/qemucapabilitiesdata/caps_4.0.0.ppc64.xml @@ -171,6 +171,7 @@ <flag name='bochs-display'/> <flag name='migration-file-drop-cache'/> <flag name='machine.pseries.cap-ccf-assist'/> + <flag name='drive-nvme'/> <version>4000000</version> <kvmVersion>0</kvmVersion> <microcodeVersion>42900240</microcodeVersion> diff --git a/tests/qemucapabilitiesdata/caps_4.0.0.riscv32.xml b/tests/qemucapabilitiesdata/caps_4.0.0.riscv32.xml index a1a1a26ece..bc675fc727 100644 --- a/tests/qemucapabilitiesdata/caps_4.0.0.riscv32.xml +++ b/tests/qemucapabilitiesdata/caps_4.0.0.riscv32.xml @@ -168,6 +168,7 @@ <flag name='nbd-bitmap'/> <flag name='bochs-display'/> <flag 
name='migration-file-drop-cache'/> + <flag name='drive-nvme'/> <version>4000000</version> <kvmVersion>0</kvmVersion> <microcodeVersion>0</microcodeVersion> diff --git a/tests/qemucapabilitiesdata/caps_4.0.0.riscv64.xml b/tests/qemucapabilitiesdata/caps_4.0.0.riscv64.xml index 5bdc1e9451..1eeda32b7c 100644 --- a/tests/qemucapabilitiesdata/caps_4.0.0.riscv64.xml +++ b/tests/qemucapabilitiesdata/caps_4.0.0.riscv64.xml @@ -168,6 +168,7 @@ <flag name='nbd-bitmap'/> <flag name='bochs-display'/> <flag name='migration-file-drop-cache'/> + <flag name='drive-nvme'/> <version>4000000</version> <kvmVersion>0</kvmVersion> <microcodeVersion>0</microcodeVersion> diff --git a/tests/qemucapabilitiesdata/caps_4.0.0.s390x.xml b/tests/qemucapabilitiesdata/caps_4.0.0.s390x.xml index 54f7aba2dc..8b8acd5a33 100644 --- a/tests/qemucapabilitiesdata/caps_4.0.0.s390x.xml +++ b/tests/qemucapabilitiesdata/caps_4.0.0.s390x.xml @@ -134,6 +134,7 @@ <flag name='migration-file-drop-cache'/> <flag name='query-cpu-model-baseline'/> <flag name='query-cpu-model-comparison'/> + <flag name='drive-nvme'/> <version>4000000</version> <kvmVersion>0</kvmVersion> <microcodeVersion>39100240</microcodeVersion> diff --git a/tests/qemucapabilitiesdata/caps_4.0.0.x86_64.xml b/tests/qemucapabilitiesdata/caps_4.0.0.x86_64.xml index 90f8452710..c29be71658 100644 --- a/tests/qemucapabilitiesdata/caps_4.0.0.x86_64.xml +++ b/tests/qemucapabilitiesdata/caps_4.0.0.x86_64.xml @@ -208,6 +208,7 @@ <flag name='bochs-display'/> <flag name='migration-file-drop-cache'/> <flag name='ramfb'/> + <flag name='drive-nvme'/> <version>4000000</version> <kvmVersion>0</kvmVersion> <microcodeVersion>43100240</microcodeVersion> diff --git a/tests/qemucapabilitiesdata/caps_4.1.0.x86_64.xml b/tests/qemucapabilitiesdata/caps_4.1.0.x86_64.xml index 9f1675aa55..2bc9672063 100644 --- a/tests/qemucapabilitiesdata/caps_4.1.0.x86_64.xml +++ b/tests/qemucapabilitiesdata/caps_4.1.0.x86_64.xml @@ -213,6 +213,7 @@ <flag name='vhost-user-vga'/> <flag 
name='ramfb'/> <flag name='blockdev-file-dynamic-auto-read-only'/> + <flag name='drive-nvme'/> <version>4001000</version> <kvmVersion>0</kvmVersion> <microcodeVersion>43100241</microcodeVersion> diff --git a/tests/qemucapabilitiesdata/caps_4.2.0.aarch64.xml b/tests/qemucapabilitiesdata/caps_4.2.0.aarch64.xml index 6ed4829be6..588e682064 100644 --- a/tests/qemucapabilitiesdata/caps_4.2.0.aarch64.xml +++ b/tests/qemucapabilitiesdata/caps_4.2.0.aarch64.xml @@ -173,6 +173,7 @@ <flag name='arm-max-cpu'/> <flag name='blockdev-file-dynamic-auto-read-only'/> <flag name='savevm-monitor-nodes'/> + <flag name='drive-nvme'/> <version>4001050</version> <kvmVersion>0</kvmVersion> <microcodeVersion>61700242</microcodeVersion> diff --git a/tests/qemucapabilitiesdata/caps_4.2.0.ppc64.xml b/tests/qemucapabilitiesdata/caps_4.2.0.ppc64.xml index bc7042e7e1..9a480c4eb3 100644 --- a/tests/qemucapabilitiesdata/caps_4.2.0.ppc64.xml +++ b/tests/qemucapabilitiesdata/caps_4.2.0.ppc64.xml @@ -174,6 +174,7 @@ <flag name='vhost-user-vga'/> <flag name='machine.pseries.cap-ccf-assist'/> <flag name='blockdev-file-dynamic-auto-read-only'/> + <flag name='drive-nvme'/> <version>4001050</version> <kvmVersion>0</kvmVersion> <microcodeVersion>42900242</microcodeVersion> diff --git a/tests/qemucapabilitiesdata/caps_4.2.0.s390x.xml b/tests/qemucapabilitiesdata/caps_4.2.0.s390x.xml index 714634dcc3..505b3adcb6 100644 --- a/tests/qemucapabilitiesdata/caps_4.2.0.s390x.xml +++ b/tests/qemucapabilitiesdata/caps_4.2.0.s390x.xml @@ -133,6 +133,7 @@ <flag name='query-cpu-model-baseline'/> <flag name='query-cpu-model-comparison'/> <flag name='blockdev-file-dynamic-auto-read-only'/> + <flag name='drive-nvme'/> <version>4001050</version> <kvmVersion>0</kvmVersion> <microcodeVersion>39100242</microcodeVersion> diff --git a/tests/qemucapabilitiesdata/caps_4.2.0.x86_64.xml b/tests/qemucapabilitiesdata/caps_4.2.0.x86_64.xml index c7bac03f44..31302c9a7b 100644 --- a/tests/qemucapabilitiesdata/caps_4.2.0.x86_64.xml +++ 
b/tests/qemucapabilitiesdata/caps_4.2.0.x86_64.xml @@ -216,6 +216,7 @@ <flag name='ramfb'/> <flag name='blockdev-file-dynamic-auto-read-only'/> <flag name='savevm-monitor-nodes'/> + <flag name='drive-nvme'/> <version>4001092</version> <kvmVersion>0</kvmVersion> <microcodeVersion>43100242</microcodeVersion> -- 2.23.0

On 12/2/19 9:26 AM, Michal Privoznik wrote:
This capability tracks if qemu is capable of:
-drive file.driver=nvme
The feature was added in QEMU commit v2.12.0-rc0~104^2~2.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Cole Robinson <crobinso@redhat.com> - Cole

Now, that we have everything prepared, we can generate command line for NVMe disks. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_block.c | 25 +++++++- src/qemu/qemu_command.c | 3 + src/qemu/qemu_process.c | 7 +++ .../disk-nvme.x86_64-latest.args | 63 +++++++++++++++++++ tests/qemuxml2argvtest.c | 1 + 5 files changed, 98 insertions(+), 1 deletion(-) create mode 100644 tests/qemuxml2argvdata/disk-nvme.x86_64-latest.args diff --git a/src/qemu/qemu_block.c b/src/qemu/qemu_block.c index 8eeb9c3c57..1ff5a7f226 100644 --- a/src/qemu/qemu_block.c +++ b/src/qemu/qemu_block.c @@ -977,6 +977,25 @@ qemuBlockStorageSourceGetVvfatProps(virStorageSourcePtr src, } +static virJSONValuePtr +qemuBlockStorageSourceGetNVMeProps(virStorageSourcePtr src) +{ + const virStorageSourceNVMeDef *nvme = src->nvme; + g_autofree char *pciAddr = NULL; + virJSONValuePtr ret = NULL; + + if (!(pciAddr = virPCIDeviceAddressAsString(&nvme->pciAddr))) + return NULL; + + ignore_value(virJSONValueObjectCreate(&ret, + "s:driver", "nvme", + "s:device", pciAddr, + "U:namespace", nvme->namespace, + NULL)); + return ret; +} + + static int qemuBlockStorageSourceGetBlockdevGetCacheProps(virStorageSourcePtr src, virJSONValuePtr props) @@ -1059,8 +1078,12 @@ qemuBlockStorageSourceGetBackendProps(virStorageSourcePtr src, return NULL; break; - case VIR_STORAGE_TYPE_VOLUME: case VIR_STORAGE_TYPE_NVME: + if (!(fileprops = qemuBlockStorageSourceGetNVMeProps(src))) + return NULL; + break; + + case VIR_STORAGE_TYPE_VOLUME: case VIR_STORAGE_TYPE_NONE: case VIR_STORAGE_TYPE_LAST: return NULL; diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index c61e38492a..97d11c1d5f 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -1527,6 +1527,9 @@ qemuDiskSourceNeedsProps(virStorageSourcePtr src, src->haveTLS == VIR_TRISTATE_BOOL_YES) return true; + if (actualType == VIR_STORAGE_TYPE_NVME) + return true; + return false; } diff --git a/src/qemu/qemu_process.c 
b/src/qemu/qemu_process.c index a588ee25f8..fa507691ff 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -5284,6 +5284,13 @@ qemuProcessStartValidateDisks(virDomainObjPtr vm, _("PowerPC pseries machines do not support floppy device")); return -1; } + + if (src->type == VIR_STORAGE_TYPE_NVME && + !virQEMUCapsGet(qemuCaps, QEMU_CAPS_DRIVE_NVME)) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("NVMe disks are not supported with this QEMU binary")); + return -1; + } } return 0; diff --git a/tests/qemuxml2argvdata/disk-nvme.x86_64-latest.args b/tests/qemuxml2argvdata/disk-nvme.x86_64-latest.args new file mode 100644 index 0000000000..b4486086fa --- /dev/null +++ b/tests/qemuxml2argvdata/disk-nvme.x86_64-latest.args @@ -0,0 +1,63 @@ +LC_ALL=C \ +PATH=/bin \ +HOME=/tmp/lib/domain--1-QEMUGuest1 \ +USER=test \ +LOGNAME=test \ +XDG_DATA_HOME=/tmp/lib/domain--1-QEMUGuest1/.local/share \ +XDG_CACHE_HOME=/tmp/lib/domain--1-QEMUGuest1/.cache \ +XDG_CONFIG_HOME=/tmp/lib/domain--1-QEMUGuest1/.config \ +QEMU_AUDIO_DRV=none \ +/usr/bin/qemu-system-i686 \ +-name guest=QEMUGuest1,debug-threads=on \ +-S \ +-object secret,id=masterKey0,format=raw,\ +file=/tmp/lib/domain--1-QEMUGuest1/master-key.aes \ +-machine pc,accel=tcg,usb=off,dump-guest-core=off \ +-m 214 \ +-overcommit mem-lock=off \ +-smp 1,sockets=1,cores=1,threads=1 \ +-uuid c7a5fdbd-edaf-9455-926a-d65c16db1809 \ +-display none \ +-no-user-config \ +-nodefaults \ +-chardev socket,id=charmonitor,fd=1729,server,nowait \ +-mon chardev=charmonitor,id=monitor,mode=control \ +-rtc base=utc \ +-no-shutdown \ +-no-acpi \ +-boot strict=on \ +-device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 \ +-device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x3 \ +-blockdev '{"driver":"nvme","device":"0000:01:00.0","namespace":1,\ +"node-name":"libvirt-4-storage","auto-read-only":true,"discard":"unmap"}' \ +-blockdev '{"node-name":"libvirt-4-format","read-only":false,"driver":"raw",\ +"file":"libvirt-4-storage"}' \ 
+-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=libvirt-4-format,\ +id=virtio-disk0,bootindex=1 \ +-blockdev '{"driver":"nvme","device":"0000:01:00.0","namespace":2,\ +"node-name":"libvirt-3-storage","auto-read-only":true,"discard":"unmap"}' \ +-blockdev '{"node-name":"libvirt-3-format","read-only":false,"driver":"raw",\ +"file":"libvirt-3-storage"}' \ +-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=libvirt-3-format,\ +id=virtio-disk1 \ +-blockdev '{"driver":"nvme","device":"0000:02:00.0","namespace":1,\ +"node-name":"libvirt-2-storage","auto-read-only":true,"discard":"unmap"}' \ +-blockdev '{"node-name":"libvirt-2-format","read-only":false,"driver":"raw",\ +"file":"libvirt-2-storage"}' \ +-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=libvirt-2-format,\ +id=virtio-disk2 \ +-object secret,id=libvirt-1-format-luks-secret0,\ +data=9eao5F8qtkGt+seB1HYivWIxbtwUu6MQtg1zpj/oDtUsPr1q8wBYM91uEHCn6j/1,\ +keyid=masterKey0,iv=AAECAwQFBgcICQoLDA0ODw==,format=base64 \ +-blockdev '{"driver":"nvme","device":"0001:02:00.0","namespace":2,\ +"node-name":"libvirt-1-storage","cache":{"direct":true,"no-flush":false},\ +"auto-read-only":true,"discard":"unmap"}' \ +-blockdev '{"node-name":"libvirt-1-format","read-only":false,\ +"cache":{"direct":true,"no-flush":false},"driver":"qcow2",\ +"encrypt":{"format":"luks","key-secret":"libvirt-1-format-luks-secret0"},\ +"file":"libvirt-1-storage"}' \ +-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=libvirt-1-format,\ +id=virtio-disk3,write-cache=on \ +-sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,\ +resourcecontrol=deny \ +-msg timestamp=on diff --git a/tests/qemuxml2argvtest.c b/tests/qemuxml2argvtest.c index 86b8899921..440f47309f 100644 --- a/tests/qemuxml2argvtest.c +++ b/tests/qemuxml2argvtest.c @@ -1078,6 +1078,7 @@ mymain(void) driver.config->vxhsTLS = 0; VIR_FREE(driver.config->vxhsTLSx509certdir); DO_TEST("disk-no-boot", NONE); + DO_TEST_CAPS_LATEST("disk-nvme"); 
DO_TEST_PARSE_ERROR("disk-device-lun-type-invalid", QEMU_CAPS_VIRTIO_SCSI); DO_TEST_CAPS_LATEST_PARSE_ERROR("disk-attaching-partition-nosupport"); -- 2.23.0

On 12/2/19 9:26 AM, Michal Privoznik wrote:
Now that we have everything prepared, we can generate the command line for NVMe disks.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_block.c | 25 +++++++- src/qemu/qemu_command.c | 3 + src/qemu/qemu_process.c | 7 +++ .../disk-nvme.x86_64-latest.args | 63 +++++++++++++++++++ tests/qemuxml2argvtest.c | 1 + 5 files changed, 98 insertions(+), 1 deletion(-) create mode 100644 tests/qemuxml2argvdata/disk-nvme.x86_64-latest.args
Reviewed-by: Cole Robinson <crobinso@redhat.com> - Cole

Because it is HMP we're dealing with, there is nothing like an error class in the reply message, so we have to do some string comparison to guess whether the command failed. With NVMe disks, a whole new class of errors comes into play because qemu needs to initialize the IOMMU and VFIO for them. You can see all the messages it may produce in qemu_vfio_init_pci(). Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_monitor_text.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/src/qemu/qemu_monitor_text.c b/src/qemu/qemu_monitor_text.c index 9054682d60..6948a5bf90 100644 --- a/src/qemu/qemu_monitor_text.c +++ b/src/qemu/qemu_monitor_text.c @@ -75,6 +75,13 @@ int qemuMonitorTextAddDrive(qemuMonitorPtr mon, goto cleanup; } + if (strstr(reply, "IOMMU") || + strstr(reply, "VFIO")) { + virReportError(VIR_ERR_OPERATION_FAILED, "%s", + reply); + goto cleanup; + } + ret = 0; cleanup: -- 2.23.0

On 12/2/19 9:26 AM, Michal Privoznik wrote:
Because it is HMP we're dealing with, there is nothing like an error class in the reply message, so we have to do some string comparison to guess whether the command failed. With NVMe disks, a whole new class of errors comes into play because qemu needs to initialize the IOMMU and VFIO for them. You can see all the messages it may produce in qemu_vfio_init_pci().
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_monitor_text.c | 7 +++++++ 1 file changed, 7 insertions(+)
diff --git a/src/qemu/qemu_monitor_text.c b/src/qemu/qemu_monitor_text.c index 9054682d60..6948a5bf90 100644 --- a/src/qemu/qemu_monitor_text.c +++ b/src/qemu/qemu_monitor_text.c @@ -75,6 +75,13 @@ int qemuMonitorTextAddDrive(qemuMonitorPtr mon, goto cleanup; }
+ if (strstr(reply, "IOMMU") || + strstr(reply, "VFIO")) { + virReportError(VIR_ERR_OPERATION_FAILED, "%s", + reply); + goto cleanup; + } + ret = 0;
cleanup:
For the code:
Reviewed-by: Cole Robinson <crobinso@redhat.com>
Does the blockdev infrastructure have magic that turns the Props into a drive string? I don't see any -drive examples in the test output.
- Cole

On 12/10/19 9:50 PM, Cole Robinson wrote:
On 12/2/19 9:26 AM, Michal Privoznik wrote:
Because it is HMP we're dealing with, there is nothing like an error class in the reply message, so we have to do some string comparison to guess whether the command failed. With NVMe disks, a whole new class of errors comes into play because qemu needs to initialize the IOMMU and VFIO for them. You can see all the messages it may produce in qemu_vfio_init_pci().
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_monitor_text.c | 7 +++++++ 1 file changed, 7 insertions(+)
diff --git a/src/qemu/qemu_monitor_text.c b/src/qemu/qemu_monitor_text.c index 9054682d60..6948a5bf90 100644 --- a/src/qemu/qemu_monitor_text.c +++ b/src/qemu/qemu_monitor_text.c @@ -75,6 +75,13 @@ int qemuMonitorTextAddDrive(qemuMonitorPtr mon, goto cleanup; }
+ if (strstr(reply, "IOMMU") || + strstr(reply, "VFIO")) { + virReportError(VIR_ERR_OPERATION_FAILED, "%s", + reply); + goto cleanup; + } + ret = 0;
cleanup:
For the code:
Reviewed-by: Cole Robinson <crobinso@redhat.com>
Does the blockdev infrastructure have magic that turns the Props into a drive string? I don't see any -drive examples in the test output
Blockdev uses a -device + -blockdev pair and obsoletes -drive. For instance: libvirt.git $ grep "\-drive" $(git grep -l "\-blockdev" -- tests/qemuxml2argvdata/) None of the -blockdev files has -drive. Michal

On 12/12/19 4:19 AM, Michal Privoznik wrote:
On 12/10/19 9:50 PM, Cole Robinson wrote:
On 12/2/19 9:26 AM, Michal Privoznik wrote:
Because it is HMP we're dealing with, there is nothing like an error class in the reply message, so we have to do some string comparison to guess whether the command failed. With NVMe disks, a whole new class of errors comes into play because qemu needs to initialize the IOMMU and VFIO for them. You can see all the messages it may produce in qemu_vfio_init_pci().
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_monitor_text.c | 7 +++++++ 1 file changed, 7 insertions(+)
diff --git a/src/qemu/qemu_monitor_text.c b/src/qemu/qemu_monitor_text.c index 9054682d60..6948a5bf90 100644 --- a/src/qemu/qemu_monitor_text.c +++ b/src/qemu/qemu_monitor_text.c @@ -75,6 +75,13 @@ int qemuMonitorTextAddDrive(qemuMonitorPtr mon, goto cleanup; } + if (strstr(reply, "IOMMU") || + strstr(reply, "VFIO")) { + virReportError(VIR_ERR_OPERATION_FAILED, "%s", + reply); + goto cleanup; + } + ret = 0; cleanup:
For the code:
Reviewed-by: Cole Robinson <crobinso@redhat.com>
Does the blockdev infrastructure have magic that turns the Props into a drive string? I don't see any -drive examples in the test output
Blockdev uses a -device + -blockdev pair and obsoletes -drive. For instance:
libvirt.git $ grep "\-drive" $(git grep -l "\-blockdev" -- tests/qemuxml2argvdata/)
None of the -blockdev files has -drive.
Right. I think my confusion is that this is a pre-blockdev/-drive code path, but I didn't see any explicit -drive handling in the series, just -blockdev. So either 1) I missed the -drive changes, in which case there should probably be test output demoing the -drive output here, or 2) this feature is -blockdev only, in which case this is dead code. Or some other case that I'm missing :) - Cole

At the very beginning of the attach function the qemuDomainStorageSourceChainAccessAllow() is called which modifies CGroups, locks and seclabels for new disk and its backing chain. This must be followed by a counterpart which reverts back all the changes if something goes wrong. This boils down to calling qemuDomainStorageSourceChainAccessRevoke() which is done under 'error' label. But not all failure branches jump there. They just jump onto 'cleanup' label where no revoke is done. Such mistake is easy to do because 'cleanup' label does exist. Therefore, dissolve 'error' block in 'cleanup' and have everything jump onto 'cleanup' label. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_hotplug.c | 19 +++++++++---------- 1 file changed, 9 insertions(+), 10 deletions(-) diff --git a/src/qemu/qemu_hotplug.c b/src/qemu/qemu_hotplug.c index d4eacaf099..053523f8c5 100644 --- a/src/qemu/qemu_hotplug.c +++ b/src/qemu/qemu_hotplug.c @@ -683,13 +683,13 @@ qemuDomainAttachDiskGeneric(virQEMUDriverPtr driver, bool blockdev = virQEMUCapsGet(priv->qemuCaps, QEMU_CAPS_BLOCKDEV); if (qemuDomainStorageSourceChainAccessAllow(driver, vm, disk->src) < 0) - goto cleanup; + return -1; if (qemuAssignDeviceDiskAlias(vm->def, disk, priv->qemuCaps) < 0) - goto error; + goto cleanup; if (qemuDomainPrepareDiskSource(disk, priv, cfg) < 0) - goto error; + goto cleanup; if (blockdev) { if (disk->copy_on_read == VIR_TRISTATE_SWITCH_ON && @@ -706,13 +706,13 @@ qemuDomainAttachDiskGeneric(virQEMUDriverPtr driver, } if (!(devstr = qemuBuildDiskDeviceStr(vm->def, disk, 0, priv->qemuCaps))) - goto error; + goto cleanup; if (VIR_REALLOC_N(vm->def->disks, vm->def->ndisks + 1) < 0) - goto error; + goto cleanup; if (qemuHotplugAttachManagedPR(driver, vm, disk->src, QEMU_ASYNC_JOB_NONE) < 0) - goto error; + goto cleanup; qemuDomainObjEnterMonitor(driver, vm); @@ -748,7 +748,7 @@ qemuDomainAttachDiskGeneric(virQEMUDriverPtr driver, if (qemuDomainObjExitMonitor(driver, vm) < 0) { ret = 
-2; - goto error; + goto cleanup; } virDomainAuditDisk(vm, NULL, disk->src, "attach", true); @@ -757,6 +757,8 @@ qemuDomainAttachDiskGeneric(virQEMUDriverPtr driver, ret = 0; cleanup: + if (ret < 0) + ignore_value(qemuDomainStorageSourceChainAccessRevoke(driver, vm, disk->src)); qemuDomainSecretDiskDestroy(disk); return ret; @@ -773,9 +775,6 @@ qemuDomainAttachDiskGeneric(virQEMUDriverPtr driver, ret = -2; virDomainAuditDisk(vm, NULL, disk->src, "attach", false); - - error: - ignore_value(qemuDomainStorageSourceChainAccessRevoke(driver, vm, disk->src)); goto cleanup; } -- 2.23.0

On 12/2/19 9:26 AM, Michal Privoznik wrote:
At the very beginning of the attach function, qemuDomainStorageSourceChainAccessAllow() is called, which modifies CGroups, locks and seclabels for the new disk and its backing chain. This must be followed by a counterpart which reverts all the changes if something goes wrong. This boils down to calling qemuDomainStorageSourceChainAccessRevoke(), which is done under the 'error' label. But not all failure branches jump there; some jump to the 'cleanup' label, where no revoke is done. Such a mistake is easy to make because the 'cleanup' label does exist. Therefore, dissolve the 'error' block into 'cleanup' and have everything jump to the 'cleanup' label.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Cole Robinson <crobinso@redhat.com> - Cole

With NVMe disks, one can start a blockjob with an NVMe disk that is not visible in the domain XML (at least right away). Usually, it's fairly easy to override this limitation of qemuDomainGetMemLockLimitBytes() - for instance, for hostdevs we temporarily add the device to the domain def, let the function calculate the limit and then remove the device. But it's not so easy with virStorageSourcePtr - in some cases they aren't necessarily attached to a disk. And even if they are, it's done later in the process and, frankly, I find it too complicated to be able to use the simple trick we use with hostdevs. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_command.c | 2 +- src/qemu/qemu_domain.c | 46 ++++++++++++++++++++++++++--------------- src/qemu/qemu_domain.h | 6 ++++-- src/qemu/qemu_hotplug.c | 12 +++++------ tests/qemumemlocktest.c | 2 +- 5 files changed, 41 insertions(+), 27 deletions(-) diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index 97d11c1d5f..7cc6a55c75 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -10463,7 +10463,7 @@ qemuBuildCommandLine(virQEMUDriverPtr driver, /* In some situations, eg. VFIO passthrough, QEMU might need to lock a * significant amount of memory, so we need to set the limit accordingly */ - virCommandSetMaxMemLock(cmd, qemuDomainGetMemLockLimitBytes(def)); + virCommandSetMaxMemLock(cmd, qemuDomainGetMemLockLimitBytes(def, false)); if (virQEMUCapsGet(qemuCaps, QEMU_CAPS_MSG_TIMESTAMP) && cfg->logTimestamp) diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index e81d986e9e..be4aef049d 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -11875,12 +11875,14 @@ ppc64VFIODeviceIsNV2Bridge(const char *device) /** * getPPC64MemLockLimitBytes: * @def: domain definition + * @forceVFIO: force VFIO usage * * A PPC64 helper that calculates the memory locking limit in order for * the guest to operate properly.
*/ static unsigned long long -getPPC64MemLockLimitBytes(virDomainDefPtr def) +getPPC64MemLockLimitBytes(virDomainDefPtr def, + bool forceVFIO) { unsigned long long memKB = 0; unsigned long long baseLimit = 0; @@ -11971,7 +11973,7 @@ getPPC64MemLockLimitBytes(virDomainDefPtr def) passthroughLimit = maxMemory + 128 * (1ULL<<30) / 512 * nPCIHostBridges + 8192; - } else if (usesVFIO) { + } else if (usesVFIO || forceVFIO) { /* For regular (non-NVLink2 present) VFIO passthrough, the value * of passthroughLimit is: * @@ -12009,16 +12011,20 @@ getPPC64MemLockLimitBytes(virDomainDefPtr def) /** * qemuDomainGetMemLockLimitBytes: * @def: domain definition + * @forceVFIO: force VFIO calculation * * Calculate the memory locking limit that needs to be set in order for * the guest to operate properly. The limit depends on a number of factors, * including certain configuration options and less immediately apparent ones * such as the guest architecture or the use of certain devices. + * The @forceVFIO argument can be used to tell this function will use VFIO even + * though @def doesn't indicates so right now. * * Returns: the memory locking limit, or 0 if setting the limit is not needed */ unsigned long long -qemuDomainGetMemLockLimitBytes(virDomainDefPtr def) +qemuDomainGetMemLockLimitBytes(virDomainDefPtr def, + bool forceVFIO) { unsigned long long memKB = 0; bool usesVFIO = false; @@ -12039,7 +12045,7 @@ qemuDomainGetMemLockLimitBytes(virDomainDefPtr def) return VIR_DOMAIN_MEMORY_PARAM_UNLIMITED; if (ARCH_IS_PPC64(def->os.arch) && def->virtType == VIR_DOMAIN_VIRT_KVM) - return getPPC64MemLockLimitBytes(def); + return getPPC64MemLockLimitBytes(def, forceVFIO); /* For device passthrough using VFIO the guest memory and MMIO memory * regions need to be locked persistent in order to allow DMA. @@ -12059,18 +12065,20 @@ qemuDomainGetMemLockLimitBytes(virDomainDefPtr def) * * Note that this may not be valid for all platforms. 
  */
-    for (i = 0; i < def->nhostdevs; i++) {
-        if (virHostdevIsVFIODevice(def->hostdevs[i]) ||
-            virHostdevIsMdevDevice(def->hostdevs[i])) {
-            usesVFIO = true;
-            break;
+    if (!forceVFIO) {
+        for (i = 0; i < def->nhostdevs; i++) {
+            if (virHostdevIsVFIODevice(def->hostdevs[i]) ||
+                virHostdevIsMdevDevice(def->hostdevs[i])) {
+                usesVFIO = true;
+                break;
+            }
         }
-    }

-    if (virDomainDefHasNVMeDisk(def))
-        usesVFIO = true;
+        if (virDomainDefHasNVMeDisk(def))
+            usesVFIO = true;
+    }

-    if (usesVFIO)
+    if (usesVFIO || forceVFIO)
         memKB = virDomainDefGetMemoryTotal(def) + 1024 * 1024;

 done:
@@ -12081,9 +12089,12 @@ qemuDomainGetMemLockLimitBytes(virDomainDefPtr def)

 /**
  * qemuDomainAdjustMaxMemLock:
  * @vm: domain
+ * @forceVFIO: apply VFIO requirements even if vm's def doesn't require it
  *
  * Adjust the memory locking limit for the QEMU process associated to @vm, in
- * order to comply with VFIO or architecture requirements.
+ * order to comply with VFIO or architecture requirements. If @forceVFIO is
+ * true then the limit is changed even if nothing in @vm's definition indicates
+ * so.
  *
  * The limit will not be changed unless doing so is needed; the first time
  * the limit is changed, the original (default) limit is stored in @vm and
@@ -12093,12 +12104,13 @@ qemuDomainGetMemLockLimitBytes(virDomainDefPtr def)
  * Returns: 0 on success, <0 on failure
  */
 int
-qemuDomainAdjustMaxMemLock(virDomainObjPtr vm)
+qemuDomainAdjustMaxMemLock(virDomainObjPtr vm,
+                           bool forceVFIO)
 {
     unsigned long long bytes = 0;
     int ret = -1;

-    bytes = qemuDomainGetMemLockLimitBytes(vm->def);
+    bytes = qemuDomainGetMemLockLimitBytes(vm->def, forceVFIO);

     if (bytes) {
         /* If this is the first time adjusting the limit, save the current
@@ -12147,7 +12159,7 @@ qemuDomainAdjustMaxMemLockHostdev(virDomainObjPtr vm,
     int ret = 0;

     vm->def->hostdevs[vm->def->nhostdevs++] = hostdev;
-    if (qemuDomainAdjustMaxMemLock(vm) < 0)
+    if (qemuDomainAdjustMaxMemLock(vm, false) < 0)
         ret = -1;

     vm->def->hostdevs[--(vm->def->nhostdevs)] = NULL;
diff --git a/src/qemu/qemu_domain.h b/src/qemu/qemu_domain.h
index 65bd83aece..65d137462d 100644
--- a/src/qemu/qemu_domain.h
+++ b/src/qemu/qemu_domain.h
@@ -955,8 +955,10 @@ bool qemuDomainSupportsPCI(virDomainDefPtr def,

 void qemuDomainUpdateCurrentMemorySize(virDomainObjPtr vm);

-unsigned long long qemuDomainGetMemLockLimitBytes(virDomainDefPtr def);
-int qemuDomainAdjustMaxMemLock(virDomainObjPtr vm);
+unsigned long long qemuDomainGetMemLockLimitBytes(virDomainDefPtr def,
+                                                  bool forceVFIO);
+int qemuDomainAdjustMaxMemLock(virDomainObjPtr vm,
+                               bool forceVFIO);
 int qemuDomainAdjustMaxMemLockHostdev(virDomainObjPtr vm,
                                       virDomainHostdevDefPtr hostdev);
diff --git a/src/qemu/qemu_hotplug.c b/src/qemu/qemu_hotplug.c
index 053523f8c5..61deaffdd9 100644
--- a/src/qemu/qemu_hotplug.c
+++ b/src/qemu/qemu_hotplug.c
@@ -1646,7 +1646,7 @@ qemuDomainAttachHostPCIDevice(virQEMUDriverPtr driver,
     if (teardowndevice &&
         qemuDomainNamespaceTeardownHostdev(vm, hostdev) < 0)
         VIR_WARN("Unable to remove host device from /dev");
-    if (teardownmemlock && qemuDomainAdjustMaxMemLock(vm) < 0)
+    if (teardownmemlock && qemuDomainAdjustMaxMemLock(vm, false) < 0)
         VIR_WARN("Unable to reset maximum locked memory on hotplug fail");

     if (releaseaddr)
@@ -2383,7 +2383,7 @@ qemuDomainAttachMemory(virQEMUDriverPtr driver,
     if (virDomainMemoryInsert(vm->def, mem) < 0)
         goto cleanup;

-    if (qemuDomainAdjustMaxMemLock(vm) < 0)
+    if (qemuDomainAdjustMaxMemLock(vm, false) < 0)
         goto removedef;

     qemuDomainObjEnterMonitor(driver, vm);
@@ -2454,7 +2454,7 @@ qemuDomainAttachMemory(virQEMUDriverPtr driver,
     /* reset the mlock limit */
     virErrorPreserveLast(&orig_err);
-    ignore_value(qemuDomainAdjustMaxMemLock(vm));
+    ignore_value(qemuDomainAdjustMaxMemLock(vm, false));
     virErrorRestore(&orig_err);

     goto audit;
@@ -2857,7 +2857,7 @@ qemuDomainAttachMediatedDevice(virQEMUDriverPtr driver,
     ret = 0;
  cleanup:
     if (ret < 0) {
-        if (teardownmemlock && qemuDomainAdjustMaxMemLock(vm) < 0)
+        if (teardownmemlock && qemuDomainAdjustMaxMemLock(vm, false) < 0)
             VIR_WARN("Unable to reset maximum locked memory on hotplug fail");
         if (teardowncgroup && qemuTeardownHostdevCgroup(vm, hostdev) < 0)
             VIR_WARN("Unable to remove host device cgroup ACL on hotplug fail");
@@ -4356,7 +4356,7 @@ qemuDomainRemoveMemoryDevice(virQEMUDriverPtr driver,
     ignore_value(qemuProcessRefreshBalloonState(driver, vm, QEMU_ASYNC_JOB_NONE));

     /* decrease the mlock limit after memory unplug if necessary */
-    ignore_value(qemuDomainAdjustMaxMemLock(vm));
+    ignore_value(qemuDomainAdjustMaxMemLock(vm, false));

     return 0;
 }
@@ -4483,7 +4483,7 @@ qemuDomainRemoveHostDevice(virQEMUDriverPtr driver,
         qemuDomainRemovePCIHostDevice(driver, vm, hostdev);
         /* QEMU might no longer need to lock as much memory, eg. we just
          * detached the last VFIO device, so adjust the limit here */
-        if (qemuDomainAdjustMaxMemLock(vm) < 0)
+        if (qemuDomainAdjustMaxMemLock(vm, false) < 0)
             VIR_WARN("Failed to adjust locked memory limit");
         break;
     case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_USB:
diff --git a/tests/qemumemlocktest.c b/tests/qemumemlocktest.c
index ef3bfa0345..23a144a8ce 100644
--- a/tests/qemumemlocktest.c
+++ b/tests/qemumemlocktest.c
@@ -42,7 +42,7 @@ testCompareMemLock(const void *data)
         goto cleanup;
     }

-    ret = virTestCompareToULL(info->memlock, qemuDomainGetMemLockLimitBytes(def));
+    ret = virTestCompareToULL(info->memlock, qemuDomainGetMemLockLimitBytes(def, false));

 cleanup:
     virDomainDefFree(def);
--
2.23.0

On 12/2/19 9:26 AM, Michal Privoznik wrote:
With NVMe disks, one can start a blockjob with an NVMe disk that is not visible in the domain XML (at least not right away). Usually it's fairly easy to work around this limitation of qemuDomainGetMemLockLimitBytes() - for instance, for hostdevs we temporarily add the device to the domain def, let the function calculate the limit and then remove the device again. But it's not so easy with virStorageSourcePtr - in some cases the source is not necessarily attached to a disk at all. And even when it is, the attachment happens later in the process, and frankly I find it too complicated to reuse the simple trick we use with hostdevs.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Cole Robinson <crobinso@redhat.com> - Cole

Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
 src/qemu/qemu_domain.c  | 58 +++++++++++++++++++++++++++++++++++++++++
 src/qemu/qemu_hostdev.c | 22 ++++++++++++++++
 src/qemu/qemu_hostdev.h |  6 +++++
 3 files changed, 86 insertions(+)

diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index be4aef049d..7d3a5d07d7 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -10570,6 +10570,54 @@ typedef enum {
 } qemuDomainStorageSourceAccessFlags;

+static int
+qemuDomainStorageSourceAccessModifyNVMe(virQEMUDriverPtr driver,
+                                        virDomainObjPtr vm,
+                                        virStorageSourcePtr src,
+                                        bool revoke)
+{
+    bool revoke_maxmemlock = false;
+    bool revoke_hostdev = false;
+    int ret = -1;
+
+    if (!virStorageSourceChainHasNVMe(src))
+        return 0;
+
+    VIR_DEBUG("Modifying access for a NVMe disk src=%p revoke=%d",
+              src, revoke);
+
+    if (revoke) {
+        revoke_maxmemlock = true;
+        revoke_hostdev = true;
+        ret = 0;
+        goto revoke;
+    }
+
+    if (qemuDomainAdjustMaxMemLock(vm, true) < 0)
+        goto revoke;
+
+    revoke_maxmemlock = true;
+
+    if (qemuHostdevPrepareOneNVMeDisk(driver, vm->def->name, src) < 0)
+        goto revoke;
+
+    revoke_hostdev = true;
+
+    return 0;
+
+ revoke:
+    if (revoke_maxmemlock) {
+        if (qemuDomainAdjustMaxMemLock(vm, false) < 0)
+            VIR_WARN("Unable to change max memlock limit");
+    }
+
+    if (revoke_hostdev)
+        qemuHostdevReAttachOneNVMeDisk(driver, vm->def->name, src);
+
+    return ret;
+}
+
+
 /**
  * qemuDomainStorageSourceAccessModify:
  * @driver: qemu driver struct
@@ -10601,6 +10649,7 @@ qemuDomainStorageSourceAccessModify(virQEMUDriverPtr driver,
     bool revoke_cgroup = false;
     bool revoke_label = false;
     bool revoke_namespace = false;
+    bool revoke_nvme = false;
     bool revoke_lockspace = false;

     VIR_DEBUG("src='%s' readonly=%d force_ro=%d force_rw=%d revoke=%d chain=%d",
@@ -10618,6 +10667,7 @@ qemuDomainStorageSourceAccessModify(virQEMUDriverPtr driver,
         revoke_cgroup = true;
         revoke_label = true;
         revoke_namespace = true;
+        revoke_nvme = true;
         revoke_lockspace = true;
         ret = 0;
         goto revoke;
@@ -10628,6 +10678,11 @@ qemuDomainStorageSourceAccessModify(virQEMUDriverPtr driver,

     revoke_lockspace = true;

+    if (qemuDomainStorageSourceAccessModifyNVMe(driver, vm, src, false) < 0)
+        goto revoke;
+
+    revoke_nvme = true;
+
     /* When modifying access of existing @src namespace does not need update */
     if (!(flags & QEMU_DOMAIN_STORAGE_SOURCE_ACCESS_MODIFY_ACCESS)) {
         if (qemuDomainNamespaceSetupDisk(vm, src) < 0)
@@ -10678,6 +10733,9 @@ qemuDomainStorageSourceAccessModify(virQEMUDriverPtr driver,
             VIR_WARN("Unable to remove /dev entry for %s", srcstr);
     }

+    if (revoke_nvme)
+        qemuDomainStorageSourceAccessModifyNVMe(driver, vm, src, true);
+
     if (revoke_lockspace) {
         if (virDomainLockImageDetach(driver->lockManager, vm, src) < 0)
             VIR_WARN("Unable to release lock on %s", srcstr);
diff --git a/src/qemu/qemu_hostdev.c b/src/qemu/qemu_hostdev.c
index 5ab0217858..6ab052724a 100644
--- a/src/qemu/qemu_hostdev.c
+++ b/src/qemu/qemu_hostdev.c
@@ -212,6 +212,17 @@ qemuHostdevPreparePCIDevicesCheckSupport(virDomainHostdevDefPtr *hostdevs,
     return true;
 }

+int
+qemuHostdevPrepareOneNVMeDisk(virQEMUDriverPtr driver,
+                              const char *name,
+                              virStorageSourcePtr src)
+{
+    return virHostdevPrepareOneNVMeDevice(driver->hostdevMgr,
+                                          QEMU_DRIVER_NAME,
+                                          name,
+                                          src);
+}
+
 int
 qemuHostdevPrepareNVMeDisks(virQEMUDriverPtr driver,
                             const char *name,
@@ -369,6 +380,17 @@ qemuHostdevPrepareDomainDevices(virQEMUDriverPtr driver,
     return 0;
 }

+void
+qemuHostdevReAttachOneNVMeDisk(virQEMUDriverPtr driver,
+                               const char *name,
+                               virStorageSourcePtr src)
+{
+    virHostdevReAttachOneNVMeDevice(driver->hostdevMgr,
+                                    QEMU_DRIVER_NAME,
+                                    name,
+                                    src);
+}
+
 void
 qemuHostdevReAttachNVMeDisks(virQEMUDriverPtr driver,
                              const char *name,
diff --git a/src/qemu/qemu_hostdev.h b/src/qemu/qemu_hostdev.h
index 735414b6aa..23df925529 100644
--- a/src/qemu/qemu_hostdev.h
+++ b/src/qemu/qemu_hostdev.h
@@ -41,6 +41,9 @@ int qemuHostdevUpdateActiveSCSIDevices(virQEMUDriverPtr driver,
 int qemuHostdevUpdateActiveDomainDevices(virQEMUDriverPtr driver,
                                          virDomainDefPtr def);

+int qemuHostdevPrepareOneNVMeDisk(virQEMUDriverPtr driver,
+                                  const char *name,
+                                  virStorageSourcePtr src);
 int qemuHostdevPrepareNVMeDisks(virQEMUDriverPtr driver,
                                 const char *name,
                                 virDomainDiskDefPtr *disks,
@@ -74,6 +77,9 @@ int qemuHostdevPrepareDomainDevices(virQEMUDriverPtr driver,
                                     virQEMUCapsPtr qemuCaps,
                                     unsigned int flags);

+void qemuHostdevReAttachOneNVMeDisk(virQEMUDriverPtr driver,
+                                    const char *name,
+                                    virStorageSourcePtr src);
 void qemuHostdevReAttachNVMeDisks(virQEMUDriverPtr driver,
                                   const char *name,
                                   virDomainDiskDefPtr *disks,
--
2.23.0

On 12/2/19 9:26 AM, Michal Privoznik wrote:
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_domain.c | 58 +++++++++++++++++++++++++++++++++++++++++ src/qemu/qemu_hostdev.c | 22 ++++++++++++++++ src/qemu/qemu_hostdev.h | 6 +++++ 3 files changed, 86 insertions(+)
Reviewed-by: Cole Robinson <crobinso@redhat.com> - Cole

This is slightly more complicated because the NVMe disk source is not a
simple attribute of the <source/> element. The format in which the PCI
address and namespace ID are printed is the same one QEMU accepts:

  nvme://XXXX:XX:XX.X/X

Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
 tools/virsh-domain-monitor.c | 31 +++++++++++++++++++++++++------
 1 file changed, 25 insertions(+), 6 deletions(-)

diff --git a/tools/virsh-domain-monitor.c b/tools/virsh-domain-monitor.c
index c2f157ad6b..30b186ffd1 100644
--- a/tools/virsh-domain-monitor.c
+++ b/tools/virsh-domain-monitor.c
@@ -631,8 +631,8 @@ cmdDomblklist(vshControl *ctl, const vshCmd *cmd)
     for (i = 0; i < ndisks; i++) {
         ctxt->node = disks[i];

+        type = virXPathString("string(./@type)", ctxt);
         if (details) {
-            type = virXPathString("string(./@type)", ctxt);
             device = virXPathString("string(./@device)", ctxt);
             if (!type || !device) {
                 vshPrint(ctl, "unable to query block list details");
@@ -645,11 +645,30 @@ cmdDomblklist(vshControl *ctl, const vshCmd *cmd)
             vshError(ctl, "unable to query block list");
             goto cleanup;
         }
-        source = virXPathString("string(./source/@file"
-                                "|./source/@dev"
-                                "|./source/@dir"
-                                "|./source/@name"
-                                "|./source/@volume)", ctxt);
+
+        if (STREQ_NULLABLE(type, "nvme")) {
+            g_autofree char *namespace = NULL;
+            virPCIDeviceAddress addr = { 0 };
+            xmlNodePtr addrNode = NULL;
+
+            if (!(namespace = virXPathString("string(./source/@namespace)", ctxt)) ||
+                !(addrNode = virXPathNode("./source/address", ctxt)) ||
+                virPCIDeviceAddressParseXML(addrNode, &addr) < 0) {
+                vshError(ctl, "Unable to query NVMe disk address");
+                goto cleanup;
+            }
+
+            source = g_strdup_printf("nvme://%04x:%02x:%02x.%d/%s",
+                                     addr.domain, addr.bus, addr.slot,
+                                     addr.function, namespace);
+        } else {
+            source = virXPathString("string(./source/@file"
+                                    "|./source/@dev"
+                                    "|./source/@dir"
+                                    "|./source/@name"
+                                    "|./source/@volume)", ctxt);
+        }
+
         if (details) {
             if (vshTableRowAppend(table, type, device, target,
                                   NULLSTR_MINUS(source), NULL) < 0)
--
2.23.0

On 12/2/19 9:26 AM, Michal Privoznik wrote:
This is slightly more complicated because the NVMe disk source is not a simple attribute of the <source/> element. The format in which the PCI address and namespace ID are printed is the same one QEMU accepts:
nvme://XXXX:XX:XX.X/X
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Cole Robinson <crobinso@redhat.com> - Cole
participants (3):
- Cole Robinson
- Michal Privoznik
- Michal Prívozník