[libvirt] [PATCH 0/8] VDIO support part 3

This *almost* completes VFIO support. I still need to add a couple of xml/args test cases. Laine Stump (8): util: new function virPCIDeviceGetVFIOGroupDev security: update hostdev labelling functions for VFIO qemu: add VFIO devices to cgroup ACL qemu: add QEMU_CAPS_VFIO_PCI_BOOTINDEX qemu: use vfio-pci on commandline when appropriate util: new virCommandSetMax(MemLock|Processes|Files) qemu: use new virCommandSetMax(Processes|Files) qemu: set qemu process' RLIMIT_MEMLOCK when VFIO is used configure.ac | 2 +- src/libvirt_private.syms | 7 ++ src/qemu/qemu_capabilities.c | 7 ++ src/qemu/qemu_capabilities.h | 2 +- src/qemu/qemu_cgroup.c | 11 +++ src/qemu/qemu_command.c | 56 ++++++++++++--- src/qemu/qemu_hotplug.c | 23 +++++- src/qemu/qemu_process.c | 38 +--------- src/security/security_apparmor.c | 12 +++- src/security/security_dac.c | 27 +++++++- src/security/security_selinux.c | 24 ++++++- src/util/vircommand.c | 38 ++++++++++ src/util/vircommand.h | 4 ++ src/util/virpci.c | 34 +++++++++ src/util/virpci.h | 2 + src/util/virutil.c | 146 +++++++++++++++++++++++++++++++++++++++ src/util/virutil.h | 4 ++ 17 files changed, 382 insertions(+), 55 deletions(-) -- 1.7.11.7

Given a virPCIDevice, this function returns the path for the device that controls the vfio group the device belongs to, e.g. "/dev/vfio/15". --- src/libvirt_private.syms | 1 + src/util/virpci.c | 34 ++++++++++++++++++++++++++++++++++ src/util/virpci.h | 2 ++ 3 files changed, 37 insertions(+) diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index 33ec379..2a2c40e 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -1609,6 +1609,7 @@ virPCIDeviceGetReprobe; virPCIDeviceGetStubDriver; virPCIDeviceGetUnbindFromStub; virPCIDeviceGetUsedBy; +virPCIDeviceGetVFIOGroupDev; virPCIDeviceIsAssignable; virPCIDeviceListAdd; virPCIDeviceListCount; diff --git a/src/util/virpci.c b/src/util/virpci.c index 73f36d0..673568f 100644 --- a/src/util/virpci.c +++ b/src/util/virpci.c @@ -1726,6 +1726,40 @@ cleanup: return ret; } +/* virPCIDeviceGetVFIOGroupDev - return the name of the device used to + * control this PCI device's group (e.g. "/dev/vfio/15") + */ +char * +virPCIDeviceGetVFIOGroupDev(virPCIDevicePtr dev) +{ + char *devPath = NULL; + char *groupPath = NULL; + char *groupDev = NULL; + + if (virPCIFile(&devPath, dev->name, "iommu_group") < 0) + goto cleanup; + if (virFileIsLink(devPath) != 1) { + virReportError(VIR_ERR_INTERNAL_ERROR, + _("Invalid device %s iommu_group file %s is not a symlink"), + dev->name, devPath); + goto cleanup; + } + if (virFileResolveLink(devPath, &groupPath) < 0) { + virReportError(VIR_ERR_INTERNAL_ERROR, + _("Unable to resolve device %s iommu_group symlink %s"), + dev->name, devPath); + goto cleanup; + } + if (virAsprintf(&groupDev, "/dev/vfio/%s", basename(groupPath)) < 0) { + virReportOOMError(); + goto cleanup; + } +cleanup: + VIR_FREE(devPath); + VIR_FREE(groupPath); + return groupDev; +} + static int virPCIDeviceDownstreamLacksACS(virPCIDevicePtr dev) { diff --git a/src/util/virpci.h b/src/util/virpci.h index db0be35..3911b72 100644 --- a/src/util/virpci.h +++ b/src/util/virpci.h @@ -111,6 +111,8 @@ typedef int (*virPCIDeviceFileActor)(virPCIDevicePtr dev, int virPCIDeviceFileIterate(virPCIDevicePtr dev, virPCIDeviceFileActor actor, void *opaque); +char * +virPCIDeviceGetVFIOGroupDev(virPCIDevicePtr dev); int virPCIDeviceIsAssignable(virPCIDevicePtr dev, int strict_acs_check); -- 1.7.11.7

On 04/25/2013 11:57 AM, Laine Stump wrote:
Given a virPCIDevice, this function returns the path for the device that controls the vfio group the device belongs to, e.g. "/dev/vfio/15". ---
+ if (virAsprintf(&groupDev, "/dev/vfio/%s", basename(groupPath)) < 0) {
We shouldn't be using basename() from <libgen.h> - it has horrible portability problems, and is not thread-safe. Instead, use last_component() from gnulib's "dirname.h". [I really ought to add a syntax-check to cfg.mk and clean up the existing offenders]. ACK with that fixed. -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

On 04/25/2013 04:06 PM, Eric Blake wrote:
Given a virPCIDevice, this function returns the path for the device that controls the vfio group the device belongs to, e.g. "/dev/vfio/15". --- + if (virAsprintf(&groupDev, "/dev/vfio/%s", basename(groupPath)) < 0) { We shouldn't be using basename() from <libgen.h> - it has horrible
On 04/25/2013 11:57 AM, Laine Stump wrote: portability problems, and is not thread-safe. Instead, use last_component() from gnulib's "dirname.h". [I really ought to add a syntax-check to cfg.mk and clean up the existing offenders].
ACK with that fixed.
Okay. I fixed that. I'm going to send one message for each of the three series when I push them.

Legacy kvm style pci device assignment requires changes to the labelling of several sysfs files for each device, but for vfio device assignment, the only thing that needs to be relabelled/chowned is the "group" device for the group that contains the device to be assigned. --- src/security/security_apparmor.c | 12 +++++++++++- src/security/security_dac.c | 27 ++++++++++++++++++++++++--- src/security/security_selinux.c | 24 ++++++++++++++++++++++-- 3 files changed, 57 insertions(+), 6 deletions(-) diff --git a/src/security/security_apparmor.c b/src/security/security_apparmor.c index 122edd4..0aff794 100644 --- a/src/security/security_apparmor.c +++ b/src/security/security_apparmor.c @@ -831,7 +831,17 @@ AppArmorSetSecurityHostdevLabel(virSecurityManagerPtr mgr, if (!pci) goto done; - ret = virPCIDeviceFileIterate(pci, AppArmorSetSecurityPCILabel, ptr); + if (dev->source.subsys.u.pci.backend + == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_TYPE_VFIO) { + char *vfioGroupDev = virPCIDeviceGetVFIOGroupDev(pci); + + if (!vfioGroupDev) + goto done; + ret = AppArmorSetSecurityPCILabel(pci, vfioGroupDev, ptr); + VIR_FREE(vfioGroupDev); + } else { + ret = virPCIDeviceFileIterate(pci, AppArmorSetSecurityPCILabel, ptr); + } virPCIDeviceFree(pci); break; } diff --git a/src/security/security_dac.c b/src/security/security_dac.c index 8576081..5e00112 100644 --- a/src/security/security_dac.c +++ b/src/security/security_dac.c @@ -516,8 +516,19 @@ virSecurityDACSetSecurityHostdevLabel(virSecurityManagerPtr mgr, if (!pci) goto done; - ret = virPCIDeviceFileIterate(pci, virSecurityDACSetSecurityPCILabel, - params); + if (dev->source.subsys.u.pci.backend + == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_TYPE_VFIO) { + char *vfioGroupDev = virPCIDeviceGetVFIOGroupDev(pci); + + if (!vfioGroupDev) + goto done; + ret = virSecurityDACSetSecurityPCILabel(pci, vfioGroupDev, params); + VIR_FREE(vfioGroupDev); + } else { + ret = virPCIDeviceFileIterate(pci, virSecurityDACSetSecurityPCILabel, + params); + } + virPCIDeviceFree(pci); break; @@ -596,7 +607,17 @@ virSecurityDACRestoreSecurityHostdevLabel(virSecurityManagerPtr mgr, if (!pci) goto done; - ret = virPCIDeviceFileIterate(pci, virSecurityDACRestoreSecurityPCILabel, mgr); + if (dev->source.subsys.u.pci.backend + == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_TYPE_VFIO) { + char *vfioGroupDev = virPCIDeviceGetVFIOGroupDev(pci); + + if (!vfioGroupDev) + goto done; + ret = virSecurityDACRestoreSecurityPCILabel(pci, vfioGroupDev, mgr); + VIR_FREE(vfioGroupDev); + } else { + ret = virPCIDeviceFileIterate(pci, virSecurityDACRestoreSecurityPCILabel, mgr); + } virPCIDeviceFree(pci); break; diff --git a/src/security/security_selinux.c b/src/security/security_selinux.c index a8b74ee..a5b54cb 100644 --- a/src/security/security_selinux.c +++ b/src/security/security_selinux.c @@ -1342,7 +1342,17 @@ virSecuritySELinuxSetSecurityHostdevSubsysLabel(virDomainDefPtr def, if (!pci) goto done; - ret = virPCIDeviceFileIterate(pci, virSecuritySELinuxSetSecurityPCILabel, def); + if (dev->source.subsys.u.pci.backend + == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_TYPE_VFIO) { + char *vfioGroupDev = virPCIDeviceGetVFIOGroupDev(pci); + + if (!vfioGroupDev) + goto done; + ret = virSecuritySELinuxSetSecurityPCILabel(pci, vfioGroupDev, def); + VIR_FREE(vfioGroupDev); + } else { + ret = virPCIDeviceFileIterate(pci, virSecuritySELinuxSetSecurityPCILabel, def); + } virPCIDeviceFree(pci); break; @@ -1504,7 +1514,17 @@ virSecuritySELinuxRestoreSecurityHostdevSubsysLabel(virSecurityManagerPtr mgr, if (!pci) goto done; - ret = virPCIDeviceFileIterate(pci, virSecuritySELinuxRestoreSecurityPCILabel, mgr); + if (dev->source.subsys.u.pci.backend + == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_TYPE_VFIO) { + char *vfioGroupDev = virPCIDeviceGetVFIOGroupDev(pci); + + if (!vfioGroupDev) + goto done; + ret = virSecuritySELinuxRestoreSecurityPCILabel(pci, vfioGroupDev, mgr); + VIR_FREE(vfioGroupDev); + } else { + ret = virPCIDeviceFileIterate(pci, virSecuritySELinuxRestoreSecurityPCILabel, mgr); + } virPCIDeviceFree(pci); break; -- 1.7.11.7

On 04/25/2013 11:57 AM, Laine Stump wrote:
Legacy kvm style pci device assignment requires changes to the labelling of several sysfs files for each device, but for vfio device assignment, the only thing that needs to be relabelled/chowned is the "group" device for the group that contains the device to be assigned. --- src/security/security_apparmor.c | 12 +++++++++++- src/security/security_dac.c | 27 ++++++++++++++++++++++++--- src/security/security_selinux.c | 24 ++++++++++++++++++++++-- 3 files changed, 57 insertions(+), 6 deletions(-)
ACK -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

We don't know exactly the names of the VFIO devices that will be needed (and due to hotplug, we can't ever assume we won't need them at all), so we just add an ACL to allow any vfio device - they all have the major number 244 (/dev/vfio/vfio is 244,0, and the /dev/vfio/n devices are up from there). --- src/qemu/qemu_cgroup.c | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c index 891984a..ad2027d 100644 --- a/src/qemu/qemu_cgroup.c +++ b/src/qemu/qemu_cgroup.c @@ -44,6 +44,7 @@ static const char *const defaultDeviceACL[] = { }; #define DEVICE_PTY_MAJOR 136 #define DEVICE_SND_MAJOR 116 +#define DEVICE_VFIO_MAJOR 244 static int qemuSetupDiskPathAllow(virDomainDiskDefPtr disk, @@ -388,6 +389,16 @@ int qemuSetupCgroup(virQEMUDriverPtr driver, } } + rc = virCgroupAllowDeviceMajor(priv->cgroup, 'c', DEVICE_VFIO_MAJOR, + VIR_CGROUP_DEVICE_RW); + virDomainAuditCgroupMajor(vm, priv->cgroup, "allow", DEVICE_VFIO_MAJOR, + "vfio", "rw", rc == 0); + if (rc != 0) { + virReportSystemError(-rc, "%s", + _("unable to allow /dev/vfio/ devices")); + goto cleanup; + } + for (i = 0; deviceACL[i] != NULL ; i++) { if (access(deviceACL[i], F_OK) < 0) { VIR_DEBUG("Ignoring non-existant device %s", -- 1.7.11.7

On 04/25/2013 11:57 AM, Laine Stump wrote:
We don't know exactly the names of the VFIO devices that will be needed (and due to hotplug, we can't ever assume we won't need them at all), so we just add an ACL to allow any vfio device - they all have the major number 244 (/dev/vfio/vfio is 244,0, and the /dev/vfio/n devices are up from there). --- src/qemu/qemu_cgroup.c | 11 +++++++++++ 1 file changed, 11 insertions(+)
@@ -388,6 +389,16 @@ int qemuSetupCgroup(virQEMUDriverPtr driver, } }
+ rc = virCgroupAllowDeviceMajor(priv->cgroup, 'c', DEVICE_VFIO_MAJOR, + VIR_CGROUP_DEVICE_RW); + virDomainAuditCgroupMajor(vm, priv->cgroup, "allow", DEVICE_VFIO_MAJOR, + "vfio", "rw", rc == 0); + if (rc != 0) { + virReportSystemError(-rc, "%s", + _("unable to allow /dev/vfio/ devices")); + goto cleanup; + }
This matches what we have done for ptys and sound; but it's rather broad in that the access is now ALWAYS allowed. For ptys, always being enabled makes sense (you seldom start a guest without a pty, and even if we blindly grant the privilege to a guest that doesn't use a pty there's not much they could break by a rogue guest opening a pty device); but for vfio, it's tied to something that we don't want to allow rogue access to, when it can be avoided. Do we want to be a bit smarter and only add it to the file at the point when we start a guest with a vfio device and/or hotplug a vfio device, then revoke the cgroup privileges any time we hot-unplug the last vfio device? If so, it would be similar to how we add cgroup device ACL privileges for block disks at hotplug time, except that we'd need some domain tracking to count how many vfio devices are plugged at once, so we aren't revoking access until the last hot-unplug. I can live with this as-is (we at least audit that it happens), but wonder if it is worth respinning this patch to be smarter about when to allow vfio in the guest. -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

For some reason, the bootindex parameter wasn't included in early versions of vfio support (qemu 1.4) so we have to check for it separately from vfio itself. --- src/qemu/qemu_capabilities.c | 7 +++++++ src/qemu/qemu_capabilities.h | 2 +- 2 files changed, 8 insertions(+), 1 deletion(-) diff --git a/src/qemu/qemu_capabilities.c b/src/qemu/qemu_capabilities.c index d5df4b3..2acf535 100644 --- a/src/qemu/qemu_capabilities.c +++ b/src/qemu/qemu_capabilities.c @@ -224,6 +224,7 @@ VIR_ENUM_IMPL(virQEMUCaps, QEMU_CAPS_LAST, "nvram", /* 140 */ "pci-bridge", /* 141 */ "vfio-pci", /* 142 */ + "vfio-pci.bootindex", /* 143 */ ); struct _virQEMUCaps { @@ -1376,6 +1377,10 @@ static struct virQEMUCapsStringFlags virQEMUCapsObjectPropsPciAssign[] = { { "bootindex", QEMU_CAPS_PCI_BOOTINDEX }, }; +static struct virQEMUCapsStringFlags virQEMUCapsObjectPropsVfioPci[] = { + { "bootindex", QEMU_CAPS_VFIO_PCI_BOOTINDEX }, +}; + static struct virQEMUCapsStringFlags virQEMUCapsObjectPropsScsiDisk[] = { { "channel", QEMU_CAPS_SCSI_DISK_CHANNEL }, { "wwn", QEMU_CAPS_SCSI_DISK_WWN }, @@ -1422,6 +1427,8 @@ static struct virQEMUCapsObjectTypeProps virQEMUCapsObjectProps[] = { ARRAY_CARDINALITY(virQEMUCapsObjectPropsPciAssign) }, { "kvm-pci-assign", virQEMUCapsObjectPropsPciAssign, ARRAY_CARDINALITY(virQEMUCapsObjectPropsPciAssign) }, + { "vfio-pci", virQEMUCapsObjectPropsVfioPci, + ARRAY_CARDINALITY(virQEMUCapsObjectPropsVfioPci) }, { "scsi-disk", virQEMUCapsObjectPropsScsiDisk, ARRAY_CARDINALITY(virQEMUCapsObjectPropsScsiDisk) }, { "ide-drive", virQEMUCapsObjectPropsIDEDrive, diff --git a/src/qemu/qemu_capabilities.h b/src/qemu/qemu_capabilities.h index 49ee505..213f63c 100644 --- a/src/qemu/qemu_capabilities.h +++ b/src/qemu/qemu_capabilities.h @@ -181,7 +181,7 @@ enum virQEMUCapsFlags { QEMU_CAPS_DEVICE_NVRAM = 140, /* -global spapr-nvram.reg=xxxx */ QEMU_CAPS_DEVICE_PCI_BRIDGE = 141, /* -device pci-bridge */ QEMU_CAPS_DEVICE_VFIO_PCI = 142, /* -device vfio-pci */ - + QEMU_CAPS_VFIO_PCI_BOOTINDEX = 143, /* bootindex param for vfio-pci device */ QEMU_CAPS_LAST, /* this must always be the last item */ }; -- 1.7.11.7

On 04/25/2013 11:57 AM, Laine Stump wrote:
For some reason, the bootindex parameter wasn't included in early versions of vfio support (qemu 1.4) so we have to check for it separately from vfio itself. --- src/qemu/qemu_capabilities.c | 7 +++++++ src/qemu/qemu_capabilities.h | 2 +- 2 files changed, 8 insertions(+), 1 deletion(-)
ACK. -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

The device option for vfio-pci is nearly identical to that for pci-assign - only the configfd parameter isn't supported (or needed). Checking for presence of the bootindex parameter is done separately from constructing the commandline, similar to how it is done for pci-assign. --- src/qemu/qemu_command.c | 48 ++++++++++++++++++++++++++++++++++++++---------- src/qemu/qemu_hotplug.c | 13 ++++++++++++- 2 files changed, 50 insertions(+), 11 deletions(-) diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index 4f14dff..761291c 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -4345,14 +4345,19 @@ qemuBuildPCIHostdevDevStr(virDomainHostdevDefPtr dev, const char *configfd, { virBuffer buf = VIR_BUFFER_INITIALIZER; - virBufferAddLit(&buf, "pci-assign"); + if (dev->source.subsys.u.pci.backend + == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_TYPE_VFIO) { + virBufferAddLit(&buf, "vfio-pci"); + } else { + virBufferAddLit(&buf, "pci-assign"); + if (configfd && *configfd) + virBufferAsprintf(&buf, ",configfd=%s", configfd); + } virBufferAsprintf(&buf, ",host=%.2x:%.2x.%.1x", dev->source.subsys.u.pci.addr.bus, dev->source.subsys.u.pci.addr.slot, dev->source.subsys.u.pci.addr.function); virBufferAsprintf(&buf, ",id=%s", dev->info->alias); - if (configfd && *configfd) - virBufferAsprintf(&buf, ",configfd=%s", configfd); if (dev->info->bootIndex) virBufferAsprintf(&buf, ",bootindex=%d", dev->info->bootIndex); if (qemuBuildDeviceAddressStr(&buf, dev->info, qemuCaps) < 0) @@ -7850,12 +7855,23 @@ qemuBuildCommandLine(virConnectPtr conn, " supported for PCI and USB devices")); goto error; } else { - if (hostdev->source.subsys.type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI && - !virQEMUCapsGet(qemuCaps, QEMU_CAPS_PCI_BOOTINDEX)) { - virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", - _("booting from assigned PCI devices is not" - " supported with this version of qemu")); - goto error; + if (hostdev->source.subsys.type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI) { + if (hostdev->source.subsys.u.pci.backend + == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_TYPE_VFIO) { + if (!virQEMUCapsGet(qemuCaps, QEMU_CAPS_VFIO_PCI_BOOTINDEX)) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("booting from PCI devices assigned with VFIO " + "is not supported with this version of qemu")); + goto error; + } + } else { + if (!virQEMUCapsGet(qemuCaps, QEMU_CAPS_PCI_BOOTINDEX)) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("booting from assigned PCI devices is not" + " supported with this version of qemu")); + goto error; + } + } } if (hostdev->source.subsys.type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_USB && !virQEMUCapsGet(qemuCaps, QEMU_CAPS_USB_HOST_BOOTINDEX)) { @@ -7889,9 +7905,21 @@ qemuBuildCommandLine(virConnectPtr conn, /* PCI */ if (hostdev->mode == VIR_DOMAIN_HOSTDEV_MODE_SUBSYS && hostdev->source.subsys.type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI) { + + if ((hostdev->source.subsys.u.pci.backend + == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_TYPE_VFIO) && + !virQEMUCapsGet(qemuCaps, QEMU_CAPS_DEVICE_VFIO_PCI)) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("VFIO PCI device assignment is not supported by " + "this version of qemu")); + goto error; + } + if (virQEMUCapsGet(qemuCaps, QEMU_CAPS_DEVICE)) { char *configfd_name = NULL; - if (virQEMUCapsGet(qemuCaps, QEMU_CAPS_PCI_CONFIGFD)) { + if ((hostdev->source.subsys.u.pci.backend + != VIR_DOMAIN_HOSTDEV_PCI_BACKEND_TYPE_VFIO) && + virQEMUCapsGet(qemuCaps, QEMU_CAPS_PCI_CONFIGFD)) { int configfd = qemuOpenPCIConfig(hostdev); if (configfd >= 0) { diff --git a/src/qemu/qemu_hotplug.c b/src/qemu/qemu_hotplug.c index f6b2fc8..30c8bda 100644 --- a/src/qemu/qemu_hotplug.c +++ b/src/qemu/qemu_hotplug.c @@ -999,13 +999,24 @@ int qemuDomainAttachHostPciDevice(virQEMUDriverPtr driver, &hostdev, 1) < 0) return -1; + if ((hostdev->source.subsys.u.pci.backend + == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_TYPE_VFIO) && + !virQEMUCapsGet(priv->qemuCaps, QEMU_CAPS_DEVICE_VFIO_PCI)) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("VFIO PCI device assignment is not supported by " + "this version of qemu")); + goto error; + } + if (virQEMUCapsGet(priv->qemuCaps, QEMU_CAPS_DEVICE)) { if (qemuAssignDeviceHostdevAlias(vm->def, hostdev, -1) < 0) goto error; if (qemuDomainPCIAddressEnsureAddr(priv->pciaddrs, hostdev->info) < 0) goto error; releaseaddr = true; - if (virQEMUCapsGet(priv->qemuCaps, QEMU_CAPS_PCI_CONFIGFD)) { + if ((hostdev->source.subsys.u.pci.backend + != VIR_DOMAIN_HOSTDEV_PCI_BACKEND_TYPE_VFIO) && + virQEMUCapsGet(priv->qemuCaps, QEMU_CAPS_PCI_CONFIGFD)) { configfd = qemuOpenPCIConfig(hostdev); if (configfd >= 0) { if (virAsprintf(&configfd_name, "fd-%s", -- 1.7.11.7

On 04/25/2013 11:57 AM, Laine Stump wrote:
The device option for vfio-pci is nearly identical to that for pci-assign - only the configfd parameter isn't supported (or needed).
Checking for presence of the bootindex parameter is done separately from constructing the commandline, similar to how it is done for pci-assign. --- src/qemu/qemu_command.c | 48 ++++++++++++++++++++++++++++++++++++++---------- src/qemu/qemu_hotplug.c | 13 ++++++++++++- 2 files changed, 50 insertions(+), 11 deletions(-)
@@ -7850,12 +7855,23 @@ qemuBuildCommandLine(virConnectPtr conn, " supported for PCI and USB devices")); goto error; } else { - if (hostdev->source.subsys.type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI && - !virQEMUCapsGet(qemuCaps, QEMU_CAPS_PCI_BOOTINDEX)) { - virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", - _("booting from assigned PCI devices is not" - " supported with this version of qemu")); - goto error; + if (hostdev->source.subsys.type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI) { + if (hostdev->source.subsys.u.pci.backend + == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_TYPE_VFIO) { + if (!virQEMUCapsGet(qemuCaps, QEMU_CAPS_VFIO_PCI_BOOTINDEX)) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("booting from PCI devices assigned with VFIO " + "is not supported with this version of qemu"));
Line break after space...
+ goto error; + } + } else { + if (!virQEMUCapsGet(qemuCaps, QEMU_CAPS_PCI_BOOTINDEX)) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("booting from assigned PCI devices is not" + " supported with this version of qemu"));
...line break before space. Looks a bit inconsistent, but the end result is the same. ACK. -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

On 04/25/2013 05:07 PM, Eric Blake wrote:
On 04/25/2013 11:57 AM, Laine Stump wrote:
The device option for vfio-pci is nearly identical to that for pci-assign - only the configfd parameter isn't supported (or needed).
Checking for presence of the bootindex parameter is done separately from constructing the commandline, similar to how it is done for pci-assign. --- src/qemu/qemu_command.c | 48 ++++++++++++++++++++++++++++++++++++++---------- src/qemu/qemu_hotplug.c | 13 ++++++++++++- 2 files changed, 50 insertions(+), 11 deletions(-)
@@ -7850,12 +7855,23 @@ qemuBuildCommandLine(virConnectPtr conn, " supported for PCI and USB devices")); goto error; } else { - if (hostdev->source.subsys.type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI && - !virQEMUCapsGet(qemuCaps, QEMU_CAPS_PCI_BOOTINDEX)) { - virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", - _("booting from assigned PCI devices is not" - " supported with this version of qemu")); - goto error; + if (hostdev->source.subsys.type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI) { + if (hostdev->source.subsys.u.pci.backend + == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_TYPE_VFIO) { + if (!virQEMUCapsGet(qemuCaps, QEMU_CAPS_VFIO_PCI_BOOTINDEX)) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("booting from PCI devices assigned with VFIO " + "is not supported with this version of qemu")); Line break after space...
+ goto error; + } + } else { + if (!virQEMUCapsGet(qemuCaps, QEMU_CAPS_PCI_BOOTINDEX)) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("booting from assigned PCI devices is not" + " supported with this version of qemu")); ...line break before space. Looks a bit inconsistent, but the end result is the same.
I fixed those all up so they have the space before the line break.

This patch adds two sets of functions: 1) lower level functions that will immediately set the RLIMIT_MEMLOCK. RLIMIT_NPROC, or RLIMIT_NOFILE of either the current process (using setrlimit()) or any other process (using prlimit()). 2) functions for virCommand* that will setup a virCommand object to set those limits at a later time just after it has forked a new process, but before it execs the new program. configure.ac has prlimit and setrlimit added to the list of functions to check for, and the low level functions are hopefully properly "unsupported" error) on all platforms. You'll notice some ugly typecasting around pid_t; this is all an attempt to follow the formula given to me by Eric for getting it to compile without warnings on all platforms. --- configure.ac | 2 +- src/libvirt_private.syms | 6 ++ src/util/vircommand.c | 38 ++++++++++++ src/util/vircommand.h | 4 ++ src/util/virutil.c | 146 +++++++++++++++++++++++++++++++++++++++++++++++ src/util/virutil.h | 4 ++ 6 files changed, 199 insertions(+), 1 deletion(-) diff --git a/configure.ac b/configure.ac index 89dae3d..23c24d2 100644 --- a/configure.ac +++ b/configure.ac @@ -194,7 +194,7 @@ dnl Availability of various common functions (non-fatal if missing), dnl and various less common threadsafe functions AC_CHECK_FUNCS_ONCE([cfmakeraw geteuid getgid getgrnam_r getmntent_r \ getpwuid_r getuid initgroups kill mmap newlocale posix_fallocate \ - posix_memalign regexec sched_getaffinity setns symlink]) + posix_memalign prlimit regexec sched_getaffinity setns setrlimit symlink]) dnl Availability of pthread functions (if missing, win32 threading is dnl assumed). Because of $LIB_PTHREAD, we cannot use AC_CHECK_FUNCS_ONCE. diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index 2a2c40e..c988c21 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -1185,6 +1185,9 @@ virCommandSetErrorFD; virCommandSetGID; virCommandSetInputBuffer; virCommandSetInputFD; +virCommandSetMaxFiles; +virCommandSetMaxMemLock; +virCommandSetMaxProcesses; virCommandSetOutputBuffer; virCommandSetOutputFD; virCommandSetPidFile; @@ -1904,6 +1907,9 @@ virSetBlocking; virSetCloseExec; virSetDeviceUnprivSGIO; virSetInherit; +virSetMaxFiles; +virSetMaxMemLock; +virSetMaxProcesses; virSetNonBlock; virSetUIDGID; virSetUIDGIDWithCaps; diff --git a/src/util/vircommand.c b/src/util/vircommand.c index ac56a63..d0def8c 100644 --- a/src/util/vircommand.c +++ b/src/util/vircommand.c @@ -107,6 +107,10 @@ struct _virCommand { char *pidfile; bool reap; + unsigned long long maxMemLock; + unsigned int maxProcesses; + unsigned int maxFiles; + uid_t uid; gid_t gid; unsigned long long capabilities; @@ -598,6 +602,13 @@ virExec(virCommandPtr cmd) goto fork_error; } + if (virSetMaxMemLock(-1, cmd->maxMemLock) < 0) + goto fork_error; + if (virSetMaxProcesses(-1, cmd->maxProcesses) < 0) + goto fork_error; + if (virSetMaxFiles(-1, cmd->maxFiles) < 0) + goto fork_error; + if (cmd->hook) { VIR_DEBUG("Run hook %p %p", cmd->hook, cmd->opaque); ret = cmd->hook(cmd->opaque); @@ -958,6 +969,33 @@ virCommandSetUID(virCommandPtr cmd, uid_t uid) cmd->uid = uid; } +void +virCommandSetMaxMemLock(virCommandPtr cmd, unsigned long long bytes) +{ + if (!cmd || cmd->has_error) + return; + + cmd->maxMemLock = bytes; +} + +void +virCommandSetMaxProcesses(virCommandPtr cmd, unsigned int procs) +{ + if (!cmd || cmd->has_error) + return; + + cmd->maxProcesses = procs; +} + +void +virCommandSetMaxFiles(virCommandPtr cmd, unsigned int files) +{ + if (!cmd || cmd->has_error) + return; + + cmd->maxFiles = files; +} + /** * virCommandClearCaps: * @cmd: the command to modify diff --git a/src/util/vircommand.h b/src/util/vircommand.h index 6c13795..18568fe 100644 --- a/src/util/vircommand.h +++ b/src/util/vircommand.h @@ -65,6 +65,10 @@ void virCommandSetGID(virCommandPtr cmd, gid_t gid); void virCommandSetUID(virCommandPtr cmd, uid_t uid); +void virCommandSetMaxMemLock(virCommandPtr cmd, unsigned long long bytes); +void virCommandSetMaxProcesses(virCommandPtr cmd, unsigned int procs); +void virCommandSetMaxFiles(virCommandPtr cmd, unsigned int files); + void virCommandClearCaps(virCommandPtr cmd); void virCommandAllowCap(virCommandPtr cmd, diff --git a/src/util/virutil.c b/src/util/virutil.c index b9de33c..058a069 100644 --- a/src/util/virutil.c +++ b/src/util/virutil.c @@ -38,6 +38,10 @@ #include <sys/types.h> #include <sys/ioctl.h> #include <sys/wait.h> +#if HAVE_SETRLIMIT +# include <sys/time.h> +# include <sys/resource.h> +#endif #if HAVE_MMAP # include <sys/mman.h> #endif @@ -2992,6 +2996,148 @@ virGetGroupName(gid_t gid ATTRIBUTE_UNUSED) } #endif /* HAVE_GETPWUID_R */ +#if HAVE_SETRLIMIT + +# if HAVE_PRLIMIT +static int +virPrLimit(pid_t pid, int resource, struct rlimit *rlim) +{ + return prlimit(pid, resource, rlim, NULL); +} +# else +static int +virPrLimit(pid_t pid ATTRIBUTE_UNUSED, + int resource ATTRIBUTE_UNUSED, + struct rlimit *rlim ATTRIBUTE_UNUSED) +{ + virReportSystemError(ENOSYS, "%s", _("Not supported on this platform")); + return -1; +} +# endif + +int +virSetMaxMemLock(pid_t pid, unsigned long long bytes) +{ + struct rlimit rlim; + + if (bytes == 0) + return 0; + + rlim.rlim_cur = rlim.rlim_max = bytes; + if (pid == (pid_t)-1) { + if (setrlimit(RLIMIT_MEMLOCK, &rlim) < 0) { + virReportSystemError(errno, + _("cannot limit locked memory to %llu"), + bytes); + return -1; + } + } else { + if (virPrLimit(pid, RLIMIT_MEMLOCK, &rlim) < 0) { + virReportSystemError(errno, + _("cannot limit locked memory " + "of process %lld to %llu"), + (long long int)pid, bytes); + return -1; + } + } + return 0; +} + +int +virSetMaxProcesses(pid_t pid, unsigned int procs) +{ + struct rlimit rlim; + + if (procs == 0) + return 0; + + rlim.rlim_cur = rlim.rlim_max = procs; + if (pid == (pid_t)-1) { + if (setrlimit(RLIMIT_NPROC, &rlim) < 0) { + virReportSystemError(errno, + _("cannot limit number of subprocesses to %u"), + procs); + return -1; + } + } else { + if (virPrLimit(pid, RLIMIT_NPROC, &rlim) < 0) { + virReportSystemError(errno, + _("cannot limit number of subprocesses " + "of process %lld to %u"), + (long long int)pid, procs); + return -1; + } + } + return 0; +} + +int +virSetMaxFiles(pid_t pid, unsigned int files) +{ + struct rlimit rlim; + + if (files == 0) + return 0; + + /* Max number of opened files is one greater than actual limit. See + * man setrlimit. + * + * NB: That indicates to me that we would want the following code + * to say "files - 1", but the original of this code in + * qemu_process.c also had files + 1, so this preserves current + * behavior. + */ + rlim.rlim_cur = rlim.rlim_max = files + 1; + if (pid == (pid_t)-1) { + if (setrlimit(RLIMIT_NOFILE, &rlim) < 0) { + virReportSystemError(errno, + _("cannot limit number of open files to %u"), + files); + return -1; + } + } else { + if (virPrLimit(pid, RLIMIT_NOFILE, &rlim) < 0) { + virReportSystemError(errno, + _("cannot limit number of open files " + "of process %lld to %u"), + (long long int)pid, files); + return -1; + } + } + return 0; +} +#else +int +virSetMaxMemLock(pid_t pid ATTRIBUTE_UNUSED, unsigned long long bytes) +{ + if (bytes == 0) + return 0; + + virReportSystemError(ENOSYS, "%s", _("Not supported on this platform")); + return -1; +} + +int +virSetMaxProcesses(pid_t pid ATTRIBUTE_UNUSED, unsigned int procs) +{ + if (procs == 0) + return 0; + + virReportSystemError(ENOSYS, "%s", _("Not supported on this platform")); + return -1; +} + +int +virSetMaxFiles(pid_t pid ATTRIBUTE_UNUSED, unsigned int files) +{ + if (files == 0) + return 0; + + virReportSystemError(ENOSYS, "%s", _("Not supported on this platform")); + return -1; +} +#endif + #if WITH_CAPNG /* Set the real and effective uid and gid to the given values, while * maintaining the capabilities indicated by bits in @capBits. Return diff --git a/src/util/virutil.h b/src/util/virutil.h index 39033db..8b258c9 100644 --- a/src/util/virutil.h +++ b/src/util/virutil.h @@ -53,6 +53,10 @@ int virSetCloseExec(int fd) ATTRIBUTE_RETURN_CHECK; int virPipeReadUntilEOF(int outfd, int errfd, char **outbuf, char **errbuf); +int virSetMaxMemLock(pid_t pid, unsigned long long bytes); +int virSetMaxProcesses(pid_t pid, unsigned int procs); +int virSetMaxFiles(pid_t pid, unsigned int files); + int virSetUIDGID(uid_t uid, gid_t gid); int virSetUIDGIDWithCaps(uid_t uid, gid_t gid, unsigned long long capBits, bool clearExistingCaps); -- 1.7.11.7

On Thu, Apr 25, 2013 at 01:57:56PM -0400, Laine Stump wrote:
diff --git a/src/util/virutil.c b/src/util/virutil.c index b9de33c..058a069 100644 --- a/src/util/virutil.c +++ b/src/util/virutil.c @@ -38,6 +38,10 @@ #include <sys/types.h> #include <sys/ioctl.h> #include <sys/wait.h> +#if HAVE_SETRLIMIT +# include <sys/time.h> +# include <sys/resource.h> +#endif #if HAVE_MMAP # include <sys/mman.h> #endif @@ -2992,6 +2996,148 @@ virGetGroupName(gid_t gid ATTRIBUTE_UNUSED) } #endif /* HAVE_GETPWUID_R */
+#if HAVE_SETRLIMIT + +# if HAVE_PRLIMIT +static int +virPrLimit(pid_t pid, int resource, struct rlimit *rlim) +{ + return prlimit(pid, resource, rlim, NULL); +} +# else +static int +virPrLimit(pid_t pid ATTRIBUTE_UNUSED, + int resource ATTRIBUTE_UNUSED, + struct rlimit *rlim ATTRIBUTE_UNUSED) +{ + virReportSystemError(ENOSYS, "%s", _("Not supported on this platform")); + return -1; +} +# endif + +int +virSetMaxMemLock(pid_t pid, unsigned long long bytes) +{ + struct rlimit rlim; + + if (bytes == 0) + return 0; + + rlim.rlim_cur = rlim.rlim_max = bytes; + if (pid == (pid_t)-1) { + if (setrlimit(RLIMIT_MEMLOCK, &rlim) < 0) { + virReportSystemError(errno, + _("cannot limit locked memory to %llu"), + bytes); + return -1; + } + } else { + if (virPrLimit(pid, RLIMIT_MEMLOCK, &rlim) < 0) { + virReportSystemError(errno, + _("cannot limit locked memory " + "of process %lld to %llu"), + (long long int)pid, bytes); + return -1; + } + } + return 0; +} + +int +virSetMaxProcesses(pid_t pid, unsigned int procs) +{ + struct rlimit rlim; + + if (procs == 0) + return 0; + + rlim.rlim_cur = rlim.rlim_max = procs; + if (pid == (pid_t)-1) { + if (setrlimit(RLIMIT_NPROC, &rlim) < 0) { + virReportSystemError(errno, + _("cannot limit number of subprocesses to %u"), + procs); + return -1; + } + } else { + if (virPrLimit(pid, RLIMIT_NPROC, &rlim) < 0) { + virReportSystemError(errno, + _("cannot limit number of subprocesses " + "of process %lld to %u"), + (long long int)pid, procs); + return -1; + } + } + return 0; +} + +int +virSetMaxFiles(pid_t pid, unsigned int files) +{ + struct rlimit rlim; + + if (files == 0) + return 0; + + /* Max number of opened files is one greater than actual limit. See + * man setrlimit. + * + * NB: That indicates to me that we would want the following code + * to say "files - 1", but the original of this code in + * qemu_process.c also had files + 1, so this preserves current + * behavior. + */ + rlim.rlim_cur = rlim.rlim_max = files + 1; + if (pid == (pid_t)-1) { + if (setrlimit(RLIMIT_NOFILE, &rlim) < 0) { + virReportSystemError(errno, + _("cannot limit number of open files to %u"), + files); + return -1; + } + } else { + if (virPrLimit(pid, RLIMIT_NOFILE, &rlim) < 0) { + virReportSystemError(errno, + _("cannot limit number of open files " + "of process %lld to %u"), + (long long int)pid, files); + return -1; + } + } + return 0; +} +#else +int +virSetMaxMemLock(pid_t pid ATTRIBUTE_UNUSED, unsigned long long bytes) +{ + if (bytes == 0) + return 0; + + virReportSystemError(ENOSYS, "%s", _("Not supported on this platform")); + return -1; +} + +int +virSetMaxProcesses(pid_t pid ATTRIBUTE_UNUSED, unsigned int procs) +{ + if (procs == 0) + return 0; + + virReportSystemError(ENOSYS, "%s", _("Not supported on this platform")); + return -1; +} + +int +virSetMaxFiles(pid_t pid ATTRIBUTE_UNUSED, unsigned int files) +{ + if (files == 0) + return 0; + + virReportSystemError(ENOSYS, "%s", _("Not supported on this platform")); + return -1; +} +#endif
Since these functions all take pid_t as their first arg, they should all be named virProcessXXXXX and be located in virprocess.{c,h} rather than virutil.{c,h} Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On 04/25/2013 03:27 PM, Daniel P. Berrange wrote:
Since these functions all take pid_t as their first arg, they should all be named virProcessXXXXX and be located in virprocess.{c,h} rather than virutil.{c,h}
Yes, of course! The thought "Move it somewhere else!" was trying to come to my mind while I was writing it, but I didn't give myself time enough to think.

On 04/25/2013 11:57 AM, Laine Stump wrote:
This patch adds two sets of functions:
1) lower level functions that will immediately set the RLIMIT_MEMLOCK. RLIMIT_NPROC, or RLIMIT_NOFILE of either the current process (using setrlimit()) or any other process (using prlimit()).
2) functions for virCommand* that will setup a virCommand object to set those limits at a later time just after it has forked a new process, but before it execs the new program.
configure.ac has prlimit and setrlimit added to the list of functions to check for, and the low level functions are hopefully properly "unsupported" error) on all platforms.
You'll notice some ugly typecasting around pid_t; this is all an attempt to follow the formula given to me by Eric for getting it to compile without warnings on all platforms. --- configure.ac | 2 +- src/libvirt_private.syms | 6 ++ src/util/vircommand.c | 38 ++++++++++++ src/util/vircommand.h | 4 ++ src/util/virutil.c | 146 +++++++++++++++++++++++++++++++++++++++++++++++ src/util/virutil.h | 4 ++ 6 files changed, 199 insertions(+), 1 deletion(-)
@@ -598,6 +602,13 @@ virExec(virCommandPtr cmd) goto fork_error; }
+ if (virSetMaxMemLock(-1, cmd->maxMemLock) < 0) + goto fork_error; + if (virSetMaxProcesses(-1, cmd->maxProcesses) < 0) + goto fork_error; + if (virSetMaxFiles(-1, cmd->maxFiles) < 0) + goto fork_error;
Hmm. uid_t and gid_t can validly be 0, hence the sentinel has to be -1. But pid_t can never be 0 (it starts with 1), so it is easier if you let 0 be the sentinel that means current process.
+#if HAVE_SETRLIMIT + +# if HAVE_PRLIMIT +static int +virPrLimit(pid_t pid, int resource, struct rlimit *rlim) +{ + return prlimit(pid, resource, rlim, NULL);
This doesn't report errors;
+} +# else +static int +virPrLimit(pid_t pid ATTRIBUTE_UNUSED, + int resource ATTRIBUTE_UNUSED, + struct rlimit *rlim ATTRIBUTE_UNUSED) +{ + virReportSystemError(ENOSYS, "%s", _("Not supported on this platform")); + return -1;
while this does. Either the caller should always be responsible for reporting the errors, or you should fix the HAVE_PRLIMIT version to call virReportSystemError() as needed. Looking below at your callers, I go for the latter - simplify the non-prlimit variant to just be: errno = ENOSYS; return -1; [1]...
+} +# endif + +int +virSetMaxMemLock(pid_t pid, unsigned long long bytes) +{ + struct rlimit rlim; + + if (bytes == 0) + return 0; + + rlim.rlim_cur = rlim.rlim_max = bytes; + if (pid == (pid_t)-1) {
Overkill; since there is never a process id of 0, that makes a sentinel that is much easier to check for. (You did get the cast right, per our IRC conversation; only I didn't realize why you needed the cast in the first place - pid_t -1 is the process id that represents the entire process group of the init(1) process).
+ if (setrlimit(RLIMIT_MEMLOCK, &rlim) < 0) {
RLIMIT_MEMLOCK is not portable. POSIX lists only the following: http://pubs.opengroup.org/onlinepubs/9699919799/functions/setrlimit.html RLIMIT_CORE RLIMIT_CPU RLIMIT_DATA RLIMIT_FSIZE RLIMIT_NOFILE RLIMIT_STACK RLIMIT_AS The rest are extensions, so you'll need to wrap this code in #ifdef RLIMIT_MEMLOCK and provide a fallback when it is not present. You need #ifdef instead of #if (POSIX guarantees that all RLIMIT_* extensions in <sys/resource.h> will be macros, but does not guarantee which, if any, of those macros will have the value 0), but at least you don't need to touch configure.ac for this particular probe.
+ virReportSystemError(errno, + _("cannot limit locked memory to %llu"), + bytes); + return -1; + } + } else { + if (virPrLimit(pid, RLIMIT_MEMLOCK, &rlim) < 0) {
prlimit(0, ...) is identical to prlimit(getpid(), ...), which in turn is identical to setrlimit(...). Then again, since setrlimit() is more portable than prlimit, I agree with you doing the differentiation on the sentinel pid yourself instead of blindly trying virPrLimit().
+ virReportSystemError(errno, + _("cannot limit locked memory " + "of process %lld to %llu"), + (long long int)pid, bytes); + return -1;
...[1] since here we always report any failure.
+ } + } + return 0; +} + +int +virSetMaxProcesses(pid_t pid, unsigned int procs) +{ + struct rlimit rlim; + + if (procs == 0) + return 0; + + rlim.rlim_cur = rlim.rlim_max = procs; + if (pid == (pid_t)-1) { + if (setrlimit(RLIMIT_NPROC, &rlim) < 0) {
Another non-portable limit, needing #ifdef treatment. ...
+ rlim.rlim_cur = rlim.rlim_max = files + 1; + if (pid == (pid_t)-1) { + if (setrlimit(RLIMIT_NOFILE, &rlim) < 0) {
POSIX requires this one. You can play brave and blindly use it (since we already guarded things by an outer HAVE_SETRLIMIT (about the only platform where we might have problems would be cygwin, which doesn't yet provide all of the posix-mandated limits; but I don't have my cygwin machine booted at the moment to check which limits it lacks), or you could play it consistent and do the same thing here as for the other two extension limits. At any rate, I'll do a cygwin build after we freeze for 1.0.5 but before the final release, to clean up any mess we may have made in our assumptions here.
+#else +int +virSetMaxMemLock(pid_t pid ATTRIBUTE_UNUSED, unsigned long long bytes) +{ + if (bytes == 0) + return 0; + + virReportSystemError(ENOSYS, "%s", _("Not supported on this platform")); + return -1;
Nice that you let the call ignore unchanged settings, and only error for impossible-to-fulfill limits.
+++ b/src/util/virutil.h @@ -53,6 +53,10 @@ int virSetCloseExec(int fd) ATTRIBUTE_RETURN_CHECK; int virPipeReadUntilEOF(int outfd, int errfd, char **outbuf, char **errbuf);
+int virSetMaxMemLock(pid_t pid, unsigned long long bytes); +int virSetMaxProcesses(pid_t pid, unsigned int procs); +int virSetMaxFiles(pid_t pid, unsigned int files);
And of course fix Dan's comments about putting these in the right file. -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

These were previously being set in a custom hook function, but now that virCommand directly supports setting them, we can eliminate that part of the hook and call the APIs directly. --- src/qemu/qemu_process.c | 38 ++------------------------------------ 1 file changed, 2 insertions(+), 36 deletions(-) diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 925939d..f12d7d5 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -25,8 +25,6 @@ #include <unistd.h> #include <signal.h> #include <sys/stat.h> -#include <sys/time.h> -#include <sys/resource.h> #if defined(__linux__) # include <linux/capability.h> #elif defined(__FreeBSD__) @@ -2453,37 +2451,6 @@ qemuProcessPrepareChardevDevice(virDomainDefPtr def ATTRIBUTE_UNUSED, } -static int -qemuProcessLimits(virQEMUDriverConfigPtr cfg) -{ - struct rlimit rlim; - - if (cfg->maxProcesses > 0) { - rlim.rlim_cur = rlim.rlim_max = cfg->maxProcesses; - if (setrlimit(RLIMIT_NPROC, &rlim) < 0) { - virReportSystemError(errno, - _("cannot limit number of processes to %d"), - cfg->maxProcesses); - return -1; - } - } - - if (cfg->maxFiles > 0) { - /* Max number of opened files is one greater than - * actual limit. See man setrlimit */ - rlim.rlim_cur = rlim.rlim_max = cfg->maxFiles + 1; - if (setrlimit(RLIMIT_NOFILE, &rlim) < 0) { - virReportSystemError(errno, - _("cannot set max opened files to %d"), - cfg->maxFiles); - return -1; - } - } - - return 0; -} - - struct qemuProcessHookData { virConnectPtr conn; virDomainObjPtr vm; @@ -2526,9 +2493,6 @@ static int qemuProcessHook(void *data) if (virSecurityManagerClearSocketLabel(h->driver->securityManager, h->vm->def) < 0) goto cleanup; - if (qemuProcessLimits(h->cfg) < 0) - goto cleanup; - /* This must take place before exec(), so that all QEMU * memory allocation is on the correct NUMA node */ @@ -3697,6 +3661,8 @@ int qemuProcessStart(virConnectPtr conn, } virCommandSetPreExecHook(cmd, qemuProcessHook, &hookData); + virCommandSetMaxProcesses(cmd, cfg->maxProcesses); + virCommandSetMaxFiles(cmd, cfg->maxFiles); VIR_DEBUG("Setting up security labelling"); if (virSecurityManagerSetChildProcessLabel(driver->securityManager, -- 1.7.11.7

On 04/25/2013 11:57 AM, Laine Stump wrote:
These were previously being set in a custom hook function, but now that virCommand directly supports setting them, we can eliminate that part of the hook and call the APIs directly. --- src/qemu/qemu_process.c | 38 ++------------------------------------ 1 file changed, 2 insertions(+), 36 deletions(-)
fun diffstat! ACK. -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

VFIO requires all of the guest's memory and IO space to be lockable in RAM. The domain's max_balloon is the maximum amount of memory the domain can have (in KiB). We add a generous 1GiB to that for IO space (still much better than KVM device assignment, where the KVM module actually *ignores* the process limits and locks everything anyway), and convert from KiB to bytes. In the case of hotplug, we are changing the limit for the already existing qemu process (prlimit() is used under the hood), and for regular commandline additions of vfio devices, we schedule a call to setrlimit() that will happen after the qemu process is forked. --- src/qemu/qemu_command.c | 22 +++++++++++++++------- src/qemu/qemu_hotplug.c | 24 +++++++++++++++++------- 2 files changed, 32 insertions(+), 14 deletions(-) diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index 761291c..80a0dec 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -7906,13 +7906,21 @@ qemuBuildCommandLine(virConnectPtr conn, if (hostdev->mode == VIR_DOMAIN_HOSTDEV_MODE_SUBSYS && hostdev->source.subsys.type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI) { - if ((hostdev->source.subsys.u.pci.backend - == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_TYPE_VFIO) && - !virQEMUCapsGet(qemuCaps, QEMU_CAPS_DEVICE_VFIO_PCI)) { - virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", - _("VFIO PCI device assignment is not supported by " - "this version of qemu")); - goto error; + if (hostdev->source.subsys.u.pci.backend + == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_TYPE_VFIO) { + if (!virQEMUCapsGet(qemuCaps, QEMU_CAPS_DEVICE_VFIO_PCI)) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("VFIO PCI device assignment is not " + "supported by this version of qemu")); + goto error; + } + /* VFIO requires all of the guest's memory to be + * locked resident, plus some amount for IO + * space. Alex Williamson suggested adding 1GiB for IO + * space just to be safe (some finer tuning might be + * nice, though). + */ + virCommandSetMaxMemLock(cmd, ((unsigned long long)def->mem.max_balloon + (1024 * 1024)) * 1024); } if (virQEMUCapsGet(qemuCaps, QEMU_CAPS_DEVICE)) { diff --git a/src/qemu/qemu_hotplug.c b/src/qemu/qemu_hotplug.c index 30c8bda..af0b8a6 100644 --- a/src/qemu/qemu_hotplug.c +++ b/src/qemu/qemu_hotplug.c @@ -999,13 +999,23 @@ int qemuDomainAttachHostPciDevice(virQEMUDriverPtr driver, &hostdev, 1) < 0) return -1; - if ((hostdev->source.subsys.u.pci.backend - == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_TYPE_VFIO) && - !virQEMUCapsGet(priv->qemuCaps, QEMU_CAPS_DEVICE_VFIO_PCI)) { - virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", - _("VFIO PCI device assignment is not supported by " - "this version of qemu")); - goto error; + if (hostdev->source.subsys.u.pci.backend + == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_TYPE_VFIO) { + if (!virQEMUCapsGet(priv->qemuCaps, QEMU_CAPS_DEVICE_VFIO_PCI)) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("VFIO PCI device assignment is not " + "supported by this version of qemu")); + goto error; + } + /* VFIO requires all of the guest's memory to be locked + * resident, plus some amount for IO space. Alex Williamson + * suggested adding 1GiB for IO space just to be safe (some + * finer tuning might be nice, though). + * In this case, the guest's memory may already be locked, but + * it doesn't hurt to "change" the limit to the same value. + */ + virSetMaxMemLock(vm->pid, + ((unsigned long long)vm->def->mem.max_balloon + (1024 * 1024)) * 1024); } if (virQEMUCapsGet(priv->qemuCaps, QEMU_CAPS_DEVICE)) { -- 1.7.11.7

On 04/25/2013 11:57 AM, Laine Stump wrote:
VFIO requires all of the guest's memory and IO space to be lockable in RAM. The domain's max_balloon is the maximum amount of memory the domain can have (in KiB). We add a generous 1GiB to that for IO space (still much better than KVM device assignment, where the KVM module actually *ignores* the process limits and locks everything anyway), and convert from KiB to bytes.
In the case of hotplug, we are changing the limit for the already existing qemu process (prlimit() is used under the hood), and for regular commandline additions of vfio devices, we schedule a call to setrlimit() that will happen after the qemu process is forked. --- src/qemu/qemu_command.c | 22 +++++++++++++++------- src/qemu/qemu_hotplug.c | 24 +++++++++++++++++------- 2 files changed, 32 insertions(+), 14 deletions(-)
+ /* VFIO requires all of the guest's memory to be + * locked resident, plus some amount for IO + * space. Alex Williamson suggested adding 1GiB for IO + * space just to be safe (some finer tuning might be + * nice, though). + */ + virCommandSetMaxMemLock(cmd, ((unsigned long long)def->mem.max_balloon + (1024 * 1024)) * 1024);
Worth wrapping the long line, perhaps at the + operator? Or even using a temporary: unsigned long long mem_kbytes = def->mem.max_balloon + (1024 * 1024); virCommandSetMaxMemLock(cmd, mem_kbytes * 1024);
+ virSetMaxMemLock(vm->pid, + ((unsigned long long)vm->def->mem.max_balloon + (1024 * 1024)) * 1024);
and again. Line wrapping is trivial, though, so: ACK. -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

On 04/25/2013 06:14 PM, Eric Blake wrote:
On 04/25/2013 11:57 AM, Laine Stump wrote:
VFIO requires all of the guest's memory and IO space to be lockable in RAM. The domain's max_balloon is the maximum amount of memory the domain can have (in KiB). We add a generous 1GiB to that for IO space (still much better than KVM device assignment, where the KVM module actually *ignores* the process limits and locks everything anyway), and convert from KiB to bytes.
In the case of hotplug, we are changing the limit for the already existing qemu process (prlimit() is used under the hood), and for regular commandline additions of vfio devices, we schedule a call to setrlimit() that will happen after the qemu process is forked. --- src/qemu/qemu_command.c | 22 +++++++++++++++------- src/qemu/qemu_hotplug.c | 24 +++++++++++++++++------- 2 files changed, 32 insertions(+), 14 deletions(-)
+ /* VFIO requires all of the guest's memory to be + * locked resident, plus some amount for IO + * space. Alex Williamson suggested adding 1GiB for IO + * space just to be safe (some finer tuning might be + * nice, though). + */ + virCommandSetMaxMemLock(cmd, ((unsigned long long)def->mem.max_balloon + (1024 * 1024)) * 1024); Worth wrapping the long line, perhaps at the + operator? Or even using a temporary: unsigned long long mem_kbytes = def->mem.max_balloon + (1024 * 1024); virCommandSetMaxMemLock(cmd, mem_kbytes * 1024);
I did the latter.
+ virSetMaxMemLock(vm->pid, + ((unsigned long long)vm->def->mem.max_balloon + (1024 * 1024)) * 1024); and again.
Same here.
Line wrapping is trivial, though, so:
ACK.

These should be squashed in with the patch that adds commandline handling of vfio (they would fail at any earlier time). --- .../qemuxml2argv-hostdev-vfio.args | 5 +++ .../qemuxml2argvdata/qemuxml2argv-hostdev-vfio.xml | 33 +++++++++++++++++ .../qemuxml2argv-net-hostdev-vfio.args | 6 ++++ .../qemuxml2argv-net-hostdev-vfio.xml | 41 ++++++++++++++++++++++ tests/qemuxml2argvtest.c | 6 ++++ tests/qemuxml2xmltest.c | 2 ++ 6 files changed, 93 insertions(+) create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-hostdev-vfio.args create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-hostdev-vfio.xml create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-net-hostdev-vfio.args create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-net-hostdev-vfio.xml diff --git a/tests/qemuxml2argvdata/qemuxml2argv-hostdev-vfio.args b/tests/qemuxml2argvdata/qemuxml2argv-hostdev-vfio.args new file mode 100644 index 0000000..e6e42de --- /dev/null +++ b/tests/qemuxml2argvdata/qemuxml2argv-hostdev-vfio.args @@ -0,0 +1,5 @@ +LC_ALL=C PATH=/bin HOME=/home/test USER=test LOGNAME=test /usr/bin/qemu -S -M \ +pc -m 214 -smp 1 -nographic -nodefconfig -nodefaults -monitor \ +unix:/tmp/test-monitor,server,nowait -no-acpi -boot c -usb -hda \ +/dev/HostVG/QEMUGuest2 -device vfio-pci,host=06:12.5,id=hostdev0,\ +bus=pci.0,addr=0x3 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4 diff --git a/tests/qemuxml2argvdata/qemuxml2argv-hostdev-vfio.xml b/tests/qemuxml2argvdata/qemuxml2argv-hostdev-vfio.xml new file mode 100644 index 0000000..8daa53a --- /dev/null +++ b/tests/qemuxml2argvdata/qemuxml2argv-hostdev-vfio.xml @@ -0,0 +1,33 @@ +<domain type='qemu'> + <name>QEMUGuest2</name> + <uuid>c7a5fdbd-edaf-9466-926a-d65c16db1809</uuid> + <memory unit='KiB'>219100</memory> + <currentMemory unit='KiB'>219100</currentMemory> + <vcpu placement='static'>1</vcpu> + <os> + <type arch='i686' machine='pc'>hvm</type> + <boot dev='hd'/> + </os> + <clock offset='utc'/> + <on_poweroff>destroy</on_poweroff> + <on_reboot>restart</on_reboot> + <on_crash>destroy</on_crash> + <devices> + <emulator>/usr/bin/qemu</emulator> + <disk type='block' device='disk'> + <source dev='/dev/HostVG/QEMUGuest2'/> + <target dev='hda' bus='ide'/> + <address type='drive' controller='0' bus='0' target='0' unit='0'/> + </disk> + <controller type='usb' index='0'/> + <controller type='ide' index='0'/> + <controller type='pci' index='0' model='pci-root'/> + <hostdev mode='subsystem' type='pci' managed='yes'> + <driver name='vfio'/> + <source> + <address domain='0x0000' bus='0x06' slot='0x12' function='0x5'/> + </source> + </hostdev> + <memballoon model='virtio'/> + </devices> +</domain> diff --git a/tests/qemuxml2argvdata/qemuxml2argv-net-hostdev-vfio.args b/tests/qemuxml2argvdata/qemuxml2argv-net-hostdev-vfio.args new file mode 100644 index 0000000..da5886e --- /dev/null +++ b/tests/qemuxml2argvdata/qemuxml2argv-net-hostdev-vfio.args @@ -0,0 +1,6 @@ +LC_ALL=C PATH=/bin HOME=/home/test USER=test LOGNAME=test /usr/bin/qemu -S \ +-M pc -m 214 -smp 1 -nographic -nodefconfig -nodefaults -monitor \ +unix:/tmp/test-monitor,server,nowait -no-acpi -boot c \ +-usb -hda /dev/HostVG/QEMUGuest1 \ +-device vfio-pci,host=03:07.1,id=hostdev0,bus=pci.0,addr=0x3 \ +-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4 diff --git a/tests/qemuxml2argvdata/qemuxml2argv-net-hostdev-vfio.xml b/tests/qemuxml2argvdata/qemuxml2argv-net-hostdev-vfio.xml new file mode 100644 index 0000000..90419a4 --- /dev/null +++ b/tests/qemuxml2argvdata/qemuxml2argv-net-hostdev-vfio.xml @@ -0,0 +1,41 @@ +<domain type='qemu'> + <name>QEMUGuest1</name> + <uuid>c7a5fdbd-edaf-9455-926a-d65c16db1809</uuid> + <memory unit='KiB'>219136</memory> + <currentMemory unit='KiB'>219136</currentMemory> + <vcpu placement='static'>1</vcpu> + <os> + <type arch='i686' machine='pc'>hvm</type> + <boot dev='hd'/> + </os> + <clock offset='utc'/> + <on_poweroff>destroy</on_poweroff> + <on_reboot>restart</on_reboot> + <on_crash>destroy</on_crash> + <devices> + <emulator>/usr/bin/qemu</emulator> + <disk type='block' device='disk'> + <source dev='/dev/HostVG/QEMUGuest1'/> + <target dev='hda' bus='ide'/> + <address type='drive' controller='0' bus='0' target='0' unit='0'/> + </disk> + <controller type='usb' index='0'/> + <controller type='ide' index='0'/> + <controller type='pci' index='0' model='pci-root'/> + <interface type='hostdev' managed='yes'> + <mac address='00:11:22:33:44:55'/> + <driver name='vfio'/> + <source> + <address type='pci' domain='0x0002' bus='0x03' slot='0x07' function='0x1'/> + </source> + <vlan> + <tag id='42'/> + </vlan> + <virtualport type='802.1Qbg'> + <parameters managerid='11' typeid='1193047' typeidversion='2' instanceid='09b11c53-8b5c-4eeb-8f00-d84eaa0aaa4f'/> + </virtualport> + <model type='rtl8139'/> + </interface> + <memballoon model='virtio'/> + </devices> +</domain> diff --git a/tests/qemuxml2argvtest.c b/tests/qemuxml2argvtest.c index 25fa5b7..7c4d1ce 100644 --- a/tests/qemuxml2argvtest.c +++ b/tests/qemuxml2argvtest.c @@ -672,6 +672,9 @@ mymain(void) DO_TEST("net-mcast", NONE); DO_TEST("net-hostdev", QEMU_CAPS_PCIDEVICE, QEMU_CAPS_DEVICE, QEMU_CAPS_NODEFCONFIG); + DO_TEST("net-hostdev-vfio", + QEMU_CAPS_PCIDEVICE, QEMU_CAPS_DEVICE, QEMU_CAPS_NODEFCONFIG, + QEMU_CAPS_DEVICE_VFIO_PCI); DO_TEST("serial-vc", NONE); DO_TEST("serial-pty", NONE); @@ -834,6 +837,9 @@ mymain(void) DO_TEST("hostdev-pci-address", QEMU_CAPS_PCIDEVICE); DO_TEST("hostdev-pci-address-device", QEMU_CAPS_PCIDEVICE, QEMU_CAPS_DEVICE, QEMU_CAPS_NODEFCONFIG); + DO_TEST("hostdev-vfio", + QEMU_CAPS_PCIDEVICE, QEMU_CAPS_DEVICE, QEMU_CAPS_NODEFCONFIG, + QEMU_CAPS_DEVICE_VFIO_PCI); DO_TEST("pci-rom", QEMU_CAPS_PCIDEVICE, QEMU_CAPS_DEVICE, QEMU_CAPS_NODEFCONFIG, QEMU_CAPS_PCI_ROMBAR); diff --git a/tests/qemuxml2xmltest.c b/tests/qemuxml2xmltest.c index 0c50482..a81cfcf 100644 --- a/tests/qemuxml2xmltest.c +++ b/tests/qemuxml2xmltest.c @@ -208,6 +208,7 @@ mymain(void) DO_TEST("net-eth-ifname"); DO_TEST("net-virtio-network-portgroup"); DO_TEST("net-hostdev"); + DO_TEST("net-hostdev-vfio"); DO_TEST("net-openvswitch"); DO_TEST("sound"); DO_TEST("sound-device"); @@ -230,6 +231,7 @@ mymain(void) DO_TEST("hostdev-usb-address"); DO_TEST("hostdev-pci-address"); + DO_TEST("hostdev-vfio"); DO_TEST("pci-rom"); DO_TEST("encrypted-disk"); -- 1.7.11.7

On 04/25/2013 12:37 PM, Laine Stump wrote:
These should be squashed in with the patch that adds commandline handling of vfio (they would fail at any earlier time). --- .../qemuxml2argv-hostdev-vfio.args | 5 +++ .../qemuxml2argvdata/qemuxml2argv-hostdev-vfio.xml | 33 +++++++++++++++++ .../qemuxml2argv-net-hostdev-vfio.args | 6 ++++ .../qemuxml2argv-net-hostdev-vfio.xml | 41 ++++++++++++++++++++++ tests/qemuxml2argvtest.c | 6 ++++ tests/qemuxml2xmltest.c | 2 ++ 6 files changed, 93 insertions(+) create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-hostdev-vfio.args create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-hostdev-vfio.xml create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-net-hostdev-vfio.args create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-net-hostdev-vfio.xml
ACK. -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

On 04/25/2013 01:57 PM, Laine Stump wrote:
This *almost* completes VFIO support. I still need to add a couple of xml/args test cases.
I actually added the test cases as a separate patch in this series, but then squashed them into the earlier patch that constructs a qemu commandline with vfio support before pushing
Laine Stump (8): util: new function virPCIDeviceGetVFIOGroupDev security: update hostdev labelling functions for VFIO
qemu: add QEMU_CAPS_VFIO_PCI_BOOTINDEX qemu: use vfio-pci on commandline when appropriate
I pushed the above 4 patches, but didn't push the ones below
[PATCH 3/8] qemu: add VFIO devices to cgroup ACL
Eric questioned whether 3/8 was doing the right thing, and I'd like to get danpb's opinion.
[PATCH 6/8] util: new virCommandSetMax(MemLock|Processes|Files) [PATCH 7/8] qemu: use new virCommandSetMax(Processes|Files) [PATCH 8/8] qemu: set qemu process' RLIMIT_MEMLOCK when VFIO is used
Eric pointed out several problems with 6/8, which I've addressed, and 7/8 and 8/8 depend on 6/8 so I can't push them yet. To make things easier, I'm going to close out this thread and repost a new thread with just these 4 remaining patches. If they are ACKed and DV wants to freeze before I wake up, please feel free to push them. Thanks for all the quick reviews!
participants (3)
-
Daniel P. Berrange
-
Eric Blake
-
Laine Stump