[libvirt] [RFC PATCH] hostdev: add support for "managed='detach'"

Suggested by Alex Williamson.

If you plan to assign a GPU to a virtual machine, but that GPU happens to be the host system console, you likely want it to start out using the host driver (so that boot messages/etc will be displayed), then later have the host driver replaced with vfio-pci for assignment to the virtual machine.

However, in at least some cases (e.g. Intel i915) once the device has been detached from the host driver and attached to vfio-pci, attempts to reattach to the host driver only lead to "grief" (ask Alex for details). This means that simply using "managed='yes'" in libvirt won't work.

And if you set "managed='no'" in libvirt then either you have to manually run virsh nodedev-detach prior to the first start of the guest, or you have to have a management application intelligent enough to know that it should detach from the host driver, but never reattach to it.

This patch makes it simple/automatic to deal with such a case - it adds a third "managed" mode for assigned PCI devices, called "detach". It will detach ("unbind" in driver parlance) the device from the host driver prior to assigning it to the guest, but when the guest is finished with the device, will leave it bound to vfio-pci. This allows re-using the device for another guest, without requiring initial out-of-band intervention to unbind the host driver.
---
I'm sending this with the "RFC" tag because I'm concerned it might be considered "feature creep" by some (although I think it makes at least as much sense as "managed='yes'") and also because, even once (if) it is ACKed, I wouldn't want to push it until abologna is finished hacking around with the driver bind/unbind code - he has enough grief to deal with without me causing a bunch of merge conflicts :-)

 docs/formatdomain.html.in | 29 ++++++++++++-----
 docs/schemas/basictypes.rng | 8 +++++
 docs/schemas/domaincommon.rng | 4 +--
 docs/schemas/network.rng | 2 +-
 src/conf/domain_conf.c | 21 +++++++-----
 src/conf/domain_conf.h | 2 +-
 src/conf/network_conf.c | 18 +++++------
 src/conf/network_conf.h | 5 +--
 src/libvirt_private.syms | 2 ++
 src/network/bridge_driver.c | 2 +-
 src/qemu/qemu_parse_command.c | 4 +--
 src/util/virhostdev.c | 10 +++---
 src/util/virpci.c | 14 +++++---
 src/util/virpci.h | 14 ++++++--
 src/xenconfig/xen_common.c | 3 +-
 src/xenconfig/xen_sxpr.c | 9 +++---
 tests/networkxml2xmlin/hostdev-managed-detach.xml | 10 ++++++
 tests/networkxml2xmlout/hostdev-managed-detach.xml | 10 ++++++
 tests/networkxml2xmltest.c | 1 +
 .../qemuxml2argv-hostdev-managed-detach.args | 22 +++++++++++++
 .../qemuxml2argv-hostdev-managed-detach.xml | 30 ++++++++++++++++++
 tests/qemuxml2argvtest.c | 3 ++
 .../qemuxml2xmlout-hostdev-managed-detach.xml | 37 ++++++++++++++++++++++
 tests/qemuxml2xmltest.c | 1 +
 tests/virhostdevtest.c | 8 ++---
 25 files changed, 215 insertions(+), 54 deletions(-)
 create mode 100644 tests/networkxml2xmlin/hostdev-managed-detach.xml
 create mode 100644 tests/networkxml2xmlout/hostdev-managed-detach.xml
 create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-hostdev-managed-detach.args
 create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-hostdev-managed-detach.xml
 create mode 100644 tests/qemuxml2xmloutdata/qemuxml2xmlout-hostdev-managed-detach.xml
diff --git a/docs/formatdomain.html.in b/docs/formatdomain.html.in
index 41f2488..a732c1c 100644
--- a/docs/formatdomain.html.in
+++ b/docs/formatdomain.html.in
@@ -3323,15 +3323,28 @@ hot-unplugged.
</dd> <dt>pci</dt> - <dd>For PCI devices, when <code>managed</code> is "yes" it is - detached from the host before being passed on to the guest - and reattached to the host after the guest exits. If + <dd>For PCI devices, when <code>managed</code> is "yes" it + is detached from the host driver (and attached to the + appropriate driver for the planned type of device + assignment, e.g. vfio-pci for VFIO, pci-stub for legacy + kvm, or pciback for Xen device passthrough) before being + passed to the guest, and reattached to the host driver + after the guest exits. If <code>managed</code> is + "detach", then the device is detached from the host driver + before passing to the guest, but is left attached to + vfio-pci or pci-stub when the guest exits. If <code>managed</code> is omitted or "no", the user is - responsible to call <code>virNodeDeviceDetachFlags</code> - (or <code>virsh nodedev-detach</code> before starting the guest - or hot-plugging the device and <code>virNodeDeviceReAttach</code> - (or <code>virsh nodedev-reattach</code>) after hot-unplug or - stopping the guest. + responsible for assuring that the device is attached to + the proper driver for the desired device assignment mode + before starting the guest or hot-plugging the device. This + can be done in the libvirt API with + <code>virNodeDeviceDetachFlags</code> (or by running + <code>virsh nodedev-detach</code>). Likewise, once the + guest is finished using the device, it will be up to the + user to re-attach it to the host driver (if that is + desired), or example by using + libvirt's <code>virNodeDeviceReAttach</code> API or + running <code>virsh nodedev-reattach</code>. </dd> <dt>scsi</dt> <dd>For SCSI devices, user is responsible to make sure the device diff --git a/docs/schemas/basictypes.rng b/docs/schemas/basictypes.rng index a83063a..50ba9c7 100644 --- a/docs/schemas/basictypes.rng +++ b/docs/schemas/basictypes.rng @@ -88,6 +88,14 @@ </optional> </define> + <define name="virPCIManagedMode"> + <choice> + <value>no</value> + <value>yes</value> + <value>detach</value> + </choice> + </define> + <!-- a 6 byte MAC address in ASCII-hex format, eg "12:34:56:78:9A:BC" --> <!-- The lowest bit of the 1st byte is the "multicast" bit. 
a --> <!-- uniMacAddr requires that bit to be 0, and a multiMacAddr --> diff --git a/docs/schemas/domaincommon.rng b/docs/schemas/domaincommon.rng index 8c6287d..3701f4f 100644 --- a/docs/schemas/domaincommon.rng +++ b/docs/schemas/domaincommon.rng @@ -2241,7 +2241,7 @@ </attribute> <optional> <attribute name="managed"> - <ref name="virYesNo"/> + <ref name="virPCIManagedMode"/> </attribute> </optional> <interleave> @@ -3756,7 +3756,7 @@ </optional> <optional> <attribute name="managed"> - <ref name="virYesNo"/> + <ref name="virPCIManagedMode"/> </attribute> </optional> <choice> diff --git a/docs/schemas/network.rng b/docs/schemas/network.rng index 4edb6eb..f5e049b 100644 --- a/docs/schemas/network.rng +++ b/docs/schemas/network.rng @@ -112,7 +112,7 @@ <optional> <attribute name="managed"> - <ref name="virYesNo"/> + <ref name="virPCIManagedMode"/> </attribute> </optional> <interleave> diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 7d68096..2332b69 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -39,6 +39,7 @@ #include "viruuid.h" #include "virbuffer.h" #include "virlog.h" +#include "virpci.h" #include "nwfilter_conf.h" #include "storage_conf.h" #include "virstoragefile.h" @@ -5599,9 +5600,11 @@ virDomainHostdevDefParseXMLSubsys(xmlNodePtr node, * element that might be (pure hostdev, or higher level device * (e.g. <interface>) with type='hostdev') */ - if ((managed = virXMLPropString(node, "managed")) != NULL) { - if (STREQ(managed, "yes")) - def->managed = true; + if ((managed = virXMLPropString(node, "managed")) && + (def->managed = virPCIManagedModeTypeFromString(managed)) < 0) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("unknown managed mode '%s'"), managed); + goto error; } sgio = virXMLPropString(node, "sgio"); @@ -19808,8 +19811,9 @@ virDomainActualNetDefFormat(virBufferPtr buf, virBufferAsprintf(buf, "<actual type='%s'", typeStr); if (type == VIR_DOMAIN_NET_TYPE_HOSTDEV) { virDomainHostdevDefPtr hostdef = virDomainNetGetActualHostdev(def); - if (hostdef && hostdef->managed) - virBufferAddLit(buf, " managed='yes'"); + if (hostdef && (hostdef->managed != VIR_PCI_MANAGED_NO)) + virBufferAsprintf(buf, " managed='%s'", + virPCIManagedModeTypeToString(hostdef->managed)); } if (def->trustGuestRxFilters) virBufferAsprintf(buf, " trustGuestRxFilters='%s'", @@ -19979,8 +19983,9 @@ virDomainNetDefFormat(virBufferPtr buf, } virBufferAsprintf(buf, "<interface type='%s'", typeStr); - if (hostdef && hostdef->managed) - virBufferAddLit(buf, " managed='yes'"); + if (hostdef && (hostdef->managed != VIR_PCI_MANAGED_NO)) + virBufferAsprintf(buf, " managed='%s'", + virPCIManagedModeTypeToString(hostdef->managed)); if (def->trustGuestRxFilters) virBufferAsprintf(buf, " trustGuestRxFilters='%s'", virTristateBoolTypeToString(def->trustGuestRxFilters)); @@ -21430,7 +21435,7 @@ virDomainHostdevDefFormat(virBufferPtr buf, mode, type); if (def->mode == VIR_DOMAIN_HOSTDEV_MODE_SUBSYS) { virBufferAsprintf(buf, " managed='%s'", - def->managed ? 
"yes" : "no"); + virPCIManagedModeTypeToString(def->managed)); if (def->source.subsys.type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_SCSI && scsisrc->sgio) diff --git a/src/conf/domain_conf.h b/src/conf/domain_conf.h index 85c4f55..42c7d36 100644 --- a/src/conf/domain_conf.h +++ b/src/conf/domain_conf.h @@ -543,7 +543,7 @@ struct _virDomainHostdevDef { virDomainDeviceDef parent; /* higher level Def containing this */ int mode; /* enum virDomainHostdevMode */ int startupPolicy; /* enum virDomainStartupPolicy */ - bool managed; + int managed; /* enum virPCIManagedMode */ bool missing; bool readonly; bool shareable; diff --git a/src/conf/network_conf.c b/src/conf/network_conf.c index 4fb2e2a..a40f3e0 100644 --- a/src/conf/network_conf.c +++ b/src/conf/network_conf.c @@ -1,7 +1,7 @@ /* * network_conf.c: network XML handling * - * Copyright (C) 2006-2015 Red Hat, Inc. + * Copyright (C) 2006-2016 Red Hat, Inc. * Copyright (C) 2006-2008 Daniel P. Berrange * * This library is free software; you can redistribute it and/or @@ -43,6 +43,7 @@ #include "virbuffer.h" #include "c-ctype.h" #include "virfile.h" +#include "virpci.h" #include "virstring.h" #define VIR_FROM_THIS VIR_FROM_NETWORK @@ -1825,10 +1826,11 @@ virNetworkForwardDefParseXML(const char *networkName, VIR_FREE(type); } - forwardManaged = virXPathString("string(./@managed)", ctxt); - if (forwardManaged != NULL && - STRCASEEQ(forwardManaged, "yes")) { - def->managed = true; + if ((forwardManaged = virXPathString("string(./@managed)", ctxt)) && + (def->managed = virPCIManagedModeTypeFromString(forwardManaged)) < 0) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("unknown managed mode '%s'"), forwardManaged); + goto cleanup; } forwardDriverName = virXPathString("string(./driver/@name)", ctxt); @@ -2723,10 +2725,8 @@ virNetworkDefFormatBuf(virBufferPtr buf, virBufferEscapeString(buf, " dev='%s'", dev); virBufferAsprintf(buf, " mode='%s'", mode); if (def->forward.type == VIR_NETWORK_FORWARD_HOSTDEV) { - if (def->forward.managed) - virBufferAddLit(buf, " managed='yes'"); - else - virBufferAddLit(buf, " managed='no'"); + virBufferAsprintf(buf, " managed='%s'", + virPCIManagedModeTypeToString(def->forward.managed)); } shortforward = !(def->forward.nifs || def->forward.npfs || VIR_SOCKET_ADDR_VALID(&def->forward.addr.start) diff --git a/src/conf/network_conf.h b/src/conf/network_conf.h index b72257b..3b86781 100644 --- a/src/conf/network_conf.h +++ b/src/conf/network_conf.h @@ -1,7 +1,7 @@ /* * network_conf.h: network XML handling * - * Copyright (C) 2006-2015 Red Hat, Inc. + * Copyright (C) 2006-2016 Red Hat, Inc. * Copyright (C) 2006-2008 Daniel P. Berrange * * This library is free software; you can redistribute it and/or @@ -187,7 +187,8 @@ typedef struct _virNetworkForwardDef virNetworkForwardDef; typedef virNetworkForwardDef *virNetworkForwardDefPtr; struct _virNetworkForwardDef { int type; /* One of virNetworkForwardType constants */ - bool managed; /* managed attribute for hostdev mode */ + int managed; /* managed attribute for hostdev mode */ + /* enum virPCIManagedMode */ int driverName; /* enum virNetworkForwardDriverNameType */ /* If there are multiple forward devices (i.e. 
a pool of diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index 55c3047..3dd00ec 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -2010,6 +2010,8 @@ virPCIGetVirtualFunctionIndex; virPCIGetVirtualFunctionInfo; virPCIGetVirtualFunctions; virPCIIsVirtualFunction; +virPCIManagedModeTypeFromString; +virPCIManagedModeTypeToString; virPCIStubDriverTypeFromString; virPCIStubDriverTypeToString; diff --git a/src/network/bridge_driver.c b/src/network/bridge_driver.c index 01c2ed6..bd1f57c 100644 --- a/src/network/bridge_driver.c +++ b/src/network/bridge_driver.c @@ -4085,7 +4085,7 @@ networkAllocateActualDevice(virDomainDefPtr dom, iface->data.network.actual->data.hostdev.def.parent.data.net = iface; iface->data.network.actual->data.hostdev.def.info = &iface->info; iface->data.network.actual->data.hostdev.def.mode = VIR_DOMAIN_HOSTDEV_MODE_SUBSYS; - iface->data.network.actual->data.hostdev.def.managed = netdef->forward.managed ? 1 : 0; + iface->data.network.actual->data.hostdev.def.managed = netdef->forward.managed; iface->data.network.actual->data.hostdev.def.source.subsys.type = dev->type; iface->data.network.actual->data.hostdev.def.source.subsys.u.pci.addr = dev->device.pci; diff --git a/src/qemu/qemu_parse_command.c b/src/qemu/qemu_parse_command.c index 60e3d69..060bf66 100644 --- a/src/qemu/qemu_parse_command.c +++ b/src/qemu/qemu_parse_command.c @@ -1192,7 +1192,7 @@ qemuParseCommandLinePCI(const char *val) } def->mode = VIR_DOMAIN_HOSTDEV_MODE_SUBSYS; - def->managed = true; + def->managed = VIR_PCI_MANAGED_YES; def->source.subsys.type = VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI; def->source.subsys.u.pci.addr.bus = bus; def->source.subsys.u.pci.addr.slot = slot; @@ -1255,7 +1255,7 @@ qemuParseCommandLineUSB(const char *val) } def->mode = VIR_DOMAIN_HOSTDEV_MODE_SUBSYS; - def->managed = false; + def->managed = VIR_PCI_MANAGED_NO; def->source.subsys.type = VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_USB; if (*end == '.') { usbsrc->bus = first; diff --git a/src/util/virhostdev.c b/src/util/virhostdev.c index 098207e..4baf884 100644 --- a/src/util/virhostdev.c +++ b/src/util/virhostdev.c @@ -1,6 +1,6 @@ /* virhostdev.c: hostdev management * - * Copyright (C) 2006-2007, 2009-2015 Red Hat, Inc. + * Copyright (C) 2006-2007, 2009-2016 Red Hat, Inc. * Copyright (C) 2006 Daniel P. Berrange * Copyright (C) 2014 SUSE LINUX Products GmbH, Nuernberg, Germany. * @@ -592,7 +592,7 @@ virHostdevPreparePCIDevices(virHostdevManagerPtr hostdev_mgr, for (i = 0; i < virPCIDeviceListCount(pcidevs); i++) { virPCIDevicePtr dev = virPCIDeviceListGet(pcidevs, i); - if (virPCIDeviceGetManaged(dev)) { + if (virPCIDeviceGetManaged(dev) != VIR_PCI_MANAGED_NO) { VIR_DEBUG("Detaching managed PCI device %s", virPCIDeviceGetName(dev)); if (virPCIDeviceDetach(dev, @@ -726,7 +726,7 @@ virHostdevPreparePCIDevices(virHostdevManagerPtr hostdev_mgr, for (i = 0; i < virPCIDeviceListCount(pcidevs); i++) { virPCIDevicePtr dev = virPCIDeviceListGet(pcidevs, i); - if (virPCIDeviceGetManaged(dev)) { + if (virPCIDeviceGetManaged(dev) == VIR_PCI_MANAGED_YES) { VIR_DEBUG("Reattaching managed PCI device %s", virPCIDeviceGetName(dev)); ignore_value(virPCIDeviceReattach(dev, @@ -756,7 +756,7 @@ virHostdevReattachPCIDevice(virPCIDevicePtr dev, virHostdevManagerPtr mgr) /* If the device is not managed and was attached to guest * successfully, it must have been inactive. 
*/ - if (!virPCIDeviceGetManaged(dev)) { + if (virPCIDeviceGetManaged(dev) != VIR_PCI_MANAGED_YES) { VIR_DEBUG("Adding unmanaged PCI device %s to inactive list", virPCIDeviceGetName(dev)); if (virPCIDeviceListAdd(mgr->inactivePCIHostdevs, dev) < 0) @@ -1312,7 +1312,7 @@ virHostdevPrepareSCSIHostDevices(virDomainHostdevDefPtr hostdev, virSCSIDevicePtr scsi; int ret = -1; - if (hostdev->managed) { + if (hostdev->managed != VIR_PCI_MANAGED_NO) { virReportError(VIR_ERR_XML_ERROR, "%s", _("SCSI host device doesn't support managed mode")); goto cleanup; diff --git a/src/util/virpci.c b/src/util/virpci.c index 1854318..7164824 100644 --- a/src/util/virpci.c +++ b/src/util/virpci.c @@ -1,7 +1,7 @@ /* * virpci.c: helper APIs for managing host PCI devices * - * Copyright (C) 2009-2015 Red Hat, Inc. + * Copyright (C) 2009-2016 Red Hat, Inc. * * This library is free software; you can redistribute it and/or * modify it under the terms of the GNU Lesser General Public @@ -62,6 +62,11 @@ VIR_ENUM_IMPL(virPCIStubDriver, VIR_PCI_STUB_DRIVER_LAST, "vfio-pci", /* VFIO */ ); +VIR_ENUM_IMPL(virPCIManagedMode, VIR_PCI_MANAGED_LAST, + "no", + "yes", + "detach"); + struct _virPCIDevice { virPCIDeviceAddress address; @@ -77,7 +82,7 @@ struct _virPCIDevice { unsigned int pci_pm_cap_pos; bool has_flr; bool has_pm_reset; - bool managed; + virPCIManagedMode managed; virPCIStubDriver stubDriver; @@ -1702,12 +1707,13 @@ virPCIDeviceGetName(virPCIDevicePtr dev) return dev->name; } -void virPCIDeviceSetManaged(virPCIDevicePtr dev, bool managed) +void virPCIDeviceSetManaged(virPCIDevicePtr dev, + virPCIManagedMode managed) { dev->managed = managed; } -bool +virPCIManagedMode virPCIDeviceGetManaged(virPCIDevicePtr dev) { return dev->managed; diff --git a/src/util/virpci.h b/src/util/virpci.h index 55329c8..1969e29 100644 --- a/src/util/virpci.h +++ b/src/util/virpci.h @@ -53,6 +53,16 @@ typedef enum { VIR_ENUM_DECL(virPCIStubDriver); typedef enum { + VIR_PCI_MANAGED_NO, + VIR_PCI_MANAGED_YES, + VIR_PCI_MANAGED_DETACH, + + VIR_PCI_MANAGED_LAST +} virPCIManagedMode; + +VIR_ENUM_DECL(virPCIManagedMode); + +typedef enum { VIR_PCIE_LINK_SPEED_NA = 0, VIR_PCIE_LINK_SPEED_25, VIR_PCIE_LINK_SPEED_5, @@ -98,8 +108,8 @@ int virPCIDeviceReset(virPCIDevicePtr dev, virPCIDeviceListPtr inactiveDevs); void virPCIDeviceSetManaged(virPCIDevice *dev, - bool managed); -bool virPCIDeviceGetManaged(virPCIDevice *dev); + virPCIManagedMode managed); +virPCIManagedMode virPCIDeviceGetManaged(virPCIDevice *dev); void virPCIDeviceSetStubDriver(virPCIDevicePtr dev, virPCIStubDriver driver); virPCIStubDriver virPCIDeviceGetStubDriver(virPCIDevicePtr dev); diff --git a/src/xenconfig/xen_common.c b/src/xenconfig/xen_common.c index 828c8e9..db4745c 100644 --- a/src/xenconfig/xen_common.c +++ b/src/xenconfig/xen_common.c @@ -32,6 +32,7 @@ #include "virerror.h" #include "virconf.h" #include "viralloc.h" +#include "virpci.h" #include "viruuid.h" #include "count-one-bits.h" #include "xenxs_private.h" @@ -461,7 +462,7 @@ xenParsePCI(virConfPtr conf, virDomainDefPtr def) if (!(hostdev = virDomainHostdevDefAlloc())) return -1; - hostdev->managed = false; + hostdev->managed = VIR_PCI_MANAGED_NO; hostdev->source.subsys.type = VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI; hostdev->source.subsys.u.pci.addr.domain = domainID; hostdev->source.subsys.u.pci.addr.bus = busID; diff --git a/src/xenconfig/xen_sxpr.c b/src/xenconfig/xen_sxpr.c index fdfec2b..51b32a4 100644 --- a/src/xenconfig/xen_sxpr.c +++ b/src/xenconfig/xen_sxpr.c @@ -1,7 +1,7 @@ /* * xen_sxpr.c: Xen SEXPR parsing 
functions * - * Copyright (C) 2010-2014 Red Hat, Inc. + * Copyright (C) 2010-2014, 2016 Red Hat, Inc. * Copyright (C) 2011 Univention GmbH * Copyright (C) 2005 Anthony Liguori <aliguori@us.ibm.com> * @@ -32,6 +32,7 @@ #include "virerror.h" #include "virconf.h" #include "viralloc.h" +#include "virpci.h" #include "verify.h" #include "viruuid.h" #include "virlog.h" @@ -1114,7 +1115,7 @@ xenParseSxprPCI(virDomainDefPtr def, goto error; dev->mode = VIR_DOMAIN_HOSTDEV_MODE_SUBSYS; - dev->managed = false; + dev->managed = VIR_PCI_MANAGED_NO; dev->source.subsys.type = VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI; dev->source.subsys.u.pci.addr.domain = domainID; dev->source.subsys.u.pci.addr.bus = busID; @@ -2002,7 +2003,7 @@ xenFormatSxprOnePCI(virDomainHostdevDefPtr def, virBufferPtr buf, int detach) { - if (def->managed) { + if (def->managed != VIR_PCI_MANAGED_NO) { virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", _("managed PCI devices not supported with XenD")); return -1; @@ -2062,7 +2063,7 @@ xenFormatSxprAllPCI(virDomainDefPtr def, for (i = 0; i < def->nhostdevs; i++) { if (def->hostdevs[i]->mode == VIR_DOMAIN_HOSTDEV_MODE_SUBSYS && def->hostdevs[i]->source.subsys.type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI) { - if (def->hostdevs[i]->managed) { + if (def->hostdevs[i]->managed != VIR_PCI_MANAGED_NO) { virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", _("managed PCI devices not supported with XenD")); return -1; diff --git a/tests/networkxml2xmlin/hostdev-managed-detach.xml b/tests/networkxml2xmlin/hostdev-managed-detach.xml new file mode 100644 index 0000000..3a5fac4 --- /dev/null +++ b/tests/networkxml2xmlin/hostdev-managed-detach.xml @@ -0,0 +1,10 @@ +<network> + <name>hostdev</name> + <uuid>81ff0d90-c91e-6742-64da-4a736edb9a9b</uuid> + <forward mode='hostdev' managed='detach'> + <address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x1'/> + <address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x2'/> + <address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x3'/> + <address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x4'/> + </forward> +</network> diff --git a/tests/networkxml2xmlout/hostdev-managed-detach.xml b/tests/networkxml2xmlout/hostdev-managed-detach.xml new file mode 100644 index 0000000..3a5fac4 --- /dev/null +++ b/tests/networkxml2xmlout/hostdev-managed-detach.xml @@ -0,0 +1,10 @@ +<network> + <name>hostdev</name> + <uuid>81ff0d90-c91e-6742-64da-4a736edb9a9b</uuid> + <forward mode='hostdev' managed='detach'> + <address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x1'/> + <address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x2'/> + <address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x3'/> + <address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x4'/> + </forward> +</network> diff --git a/tests/networkxml2xmltest.c b/tests/networkxml2xmltest.c index 8d60aa8..ec9a47f 100644 --- a/tests/networkxml2xmltest.c +++ b/tests/networkxml2xmltest.c @@ -108,6 +108,7 @@ mymain(void) DO_TEST("openvswitch-net"); DO_TEST_FULL("passthrough-pf", VIR_NETWORK_XML_INACTIVE); DO_TEST("hostdev"); + DO_TEST("hostdev-managed-detach"); DO_TEST_FULL("hostdev-pf", VIR_NETWORK_XML_INACTIVE); DO_TEST("passthrough-address-crash"); DO_TEST("nat-network-explicit-flood"); diff --git a/tests/qemuxml2argvdata/qemuxml2argv-hostdev-managed-detach.args b/tests/qemuxml2argvdata/qemuxml2argv-hostdev-managed-detach.args new file mode 100644 index 0000000..9e37ed5 --- /dev/null +++ 
b/tests/qemuxml2argvdata/qemuxml2argv-hostdev-managed-detach.args @@ -0,0 +1,22 @@ +LC_ALL=C \ +PATH=/bin \ +HOME=/home/test \ +USER=test \ +LOGNAME=test \ +QEMU_AUDIO_DRV=none \ +/usr/bin/qemu \ +-name QEMUGuestManagedUnbind \ +-S \ +-M pc \ +-m 214 \ +-smp 1 \ +-uuid c7a5fdbd-edaf-9466-926a-d65c16db1809 \ +-nographic \ +-nodefconfig \ +-nodefaults \ +-monitor unix:/tmp/lib/domain--1-QEMUGuestManagedUnbi/monitor.sock,server,nowait \ +-no-acpi \ +-boot c \ +-usb \ +-device vfio-pci,host=06:12.5,id=hostdev0,bus=pci.0,addr=0x3 \ +-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4 diff --git a/tests/qemuxml2argvdata/qemuxml2argv-hostdev-managed-detach.xml b/tests/qemuxml2argvdata/qemuxml2argv-hostdev-managed-detach.xml new file mode 100644 index 0000000..2d12d98 --- /dev/null +++ b/tests/qemuxml2argvdata/qemuxml2argv-hostdev-managed-detach.xml @@ -0,0 +1,30 @@ +<domain type='qemu'> + <name>QEMUGuestManagedUnbind</name> + <uuid>c7a5fdbd-edaf-9466-926a-d65c16db1809</uuid> + <memory unit='KiB'>219100</memory> + <currentMemory unit='KiB'>219100</currentMemory> + <vcpu placement='static'>1</vcpu> + <os> + <type arch='x86_64' machine='pc'>hvm</type> + <boot dev='hd'/> + </os> + <clock offset='utc'/> + <on_poweroff>destroy</on_poweroff> + <on_reboot>restart</on_reboot> + <on_crash>destroy</on_crash> + <devices> + <emulator>/usr/bin/qemu</emulator> + <controller type='usb' index='0'/> + <controller type='ide' index='0'/> + <controller type='pci' index='0' model='pci-root'/> + <input type='mouse' bus='ps2'/> + <input type='keyboard' bus='ps2'/> + <hostdev mode='subsystem' type='pci' managed='detach'> + <driver name='vfio'/> + <source> + <address domain='0x0000' bus='0x06' slot='0x12' function='0x5'/> + </source> + </hostdev> + <memballoon model='virtio'/> + </devices> +</domain> diff --git a/tests/qemuxml2argvtest.c b/tests/qemuxml2argvtest.c index e15da37..433e3a7 100644 --- a/tests/qemuxml2argvtest.c +++ b/tests/qemuxml2argvtest.c @@ -1278,6 +1278,9 @@ mymain(void) DO_TEST_FAILURE("hostdev-vfio-multidomain", QEMU_CAPS_PCIDEVICE, QEMU_CAPS_NODEFCONFIG, QEMU_CAPS_DEVICE_VFIO_PCI); + DO_TEST("hostdev-managed-detach", + QEMU_CAPS_PCIDEVICE, QEMU_CAPS_NODEFCONFIG, + QEMU_CAPS_DEVICE_VFIO_PCI); DO_TEST("pci-rom", QEMU_CAPS_PCIDEVICE, QEMU_CAPS_NODEFCONFIG, QEMU_CAPS_PCI_ROMBAR); diff --git a/tests/qemuxml2xmloutdata/qemuxml2xmlout-hostdev-managed-detach.xml b/tests/qemuxml2xmloutdata/qemuxml2xmlout-hostdev-managed-detach.xml new file mode 100644 index 0000000..e639e91 --- /dev/null +++ b/tests/qemuxml2xmloutdata/qemuxml2xmlout-hostdev-managed-detach.xml @@ -0,0 +1,37 @@ +<domain type='qemu'> + <name>QEMUGuestManagedUnbind</name> + <uuid>c7a5fdbd-edaf-9466-926a-d65c16db1809</uuid> + <memory unit='KiB'>219100</memory> + <currentMemory unit='KiB'>219100</currentMemory> + <vcpu placement='static'>1</vcpu> + <os> + <type arch='x86_64' machine='pc'>hvm</type> + <boot dev='hd'/> + </os> + <clock offset='utc'/> + <on_poweroff>destroy</on_poweroff> + <on_reboot>restart</on_reboot> + <on_crash>destroy</on_crash> + <devices> + <emulator>/usr/bin/qemu</emulator> + <controller type='usb' index='0'> + <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/> + </controller> + <controller type='ide' index='0'> + <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/> + </controller> + <controller type='pci' index='0' model='pci-root'/> + <input type='mouse' bus='ps2'/> + <input type='keyboard' bus='ps2'/> + <hostdev mode='subsystem' type='pci' managed='detach'> + <driver 
name='vfio'/> + <source> + <address domain='0x0000' bus='0x06' slot='0x12' function='0x5'/> + </source> + <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/> + </hostdev> + <memballoon model='virtio'> + <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/> + </memballoon> + </devices> +</domain> diff --git a/tests/qemuxml2xmltest.c b/tests/qemuxml2xmltest.c index 0735677..980c468 100644 --- a/tests/qemuxml2xmltest.c +++ b/tests/qemuxml2xmltest.c @@ -484,6 +484,7 @@ mymain(void) DO_TEST("hostdev-usb-address"); DO_TEST("hostdev-pci-address"); DO_TEST("hostdev-vfio"); + DO_TEST("hostdev-managed-detach"); DO_TEST("pci-rom"); DO_TEST("pci-serial-dev-chardev"); diff --git a/tests/virhostdevtest.c b/tests/virhostdevtest.c index 5eb2e7e..af175ca 100644 --- a/tests/virhostdevtest.c +++ b/tests/virhostdevtest.c @@ -166,7 +166,7 @@ testVirHostdevPreparePCIHostdevs_unmanaged(void) size_t active_count, inactive_count, i; for (i = 0; i < nhostdevs; i++) - hostdevs[i]->managed = false; + hostdevs[i]->managed = VIR_PCI_MANAGED_NO; active_count = virPCIDeviceListCount(mgr->activePCIHostdevs); inactive_count = virPCIDeviceListCount(mgr->inactivePCIHostdevs); @@ -225,7 +225,7 @@ testVirHostdevReAttachPCIHostdevs_unmanaged(void) size_t active_count, inactive_count, i; for (i = 0; i < nhostdevs; i++) { - if (hostdevs[i]->managed != false) { + if (hostdevs[i]->managed != VIR_PCI_MANAGED_NO) { VIR_DEBUG("invalid test"); return -1; } @@ -259,7 +259,7 @@ testVirHostdevPreparePCIHostdevs_managed(void) size_t active_count, inactive_count, i; for (i = 0; i < nhostdevs; i++) - hostdevs[i]->managed = true; + hostdevs[i]->managed = VIR_PCI_MANAGED_YES; active_count = virPCIDeviceListCount(mgr->activePCIHostdevs); inactive_count = virPCIDeviceListCount(mgr->inactivePCIHostdevs); @@ -310,7 +310,7 @@ testVirHostdevReAttachPCIHostdevs_managed(void) size_t active_count, inactive_count, i; for (i = 0; i < nhostdevs; i++) { - if (hostdevs[i]->managed != true) { + if (hostdevs[i]->managed != VIR_PCI_MANAGED_YES) { VIR_DEBUG("invalid test"); return -1; } -- 2.5.0

On Mon, Mar 14, 2016 at 03:41:48PM -0400, Laine Stump wrote:
Suggested by Alex Williamson.
If you plan to assign a GPU to a virtual machine, but that GPU happens to be the host system console, you likely want it to start out using the host driver (so that boot messages/etc will be displayed), then later have the host driver replaced with vfio-pci for assignment to the virtual machine.
However, in at least some cases (e.g. Intel i915) once the device has been detached from the host driver and attached to vfio-pci, attempts to reattach to the host driver only lead to "grief" (ask Alex for details). This means that simply using "managed='yes'" in libvirt won't work.
And if you set "managed='no'" in libvirt then either you have to manually run virsh nodedev-detach prior to the first start of the guest, or you have to have a management application intelligent enough to know that it should detach from the host driver, but never reattach to it.
This patch makes it simple/automatic to deal with such a case - it adds a third "managed" mode for assigned PCI devices, called "detach". It will detach ("unbind" in driver parlance) the device from the host driver prior to assigning it to the guest, but when the guest is finished with the device, will leave it bound to vfio-pci. This allows re-using the device for another guest, without requiring initial out-of-band intervention to unbind the host driver.
You say that managed=yes causes pain upon re-attachment and that apps should use managed=detach to avoid it, but how do management apps know which devices are going to cause pain ? Libvirt isn't providing any info on whether a particular device id needs to use managed=yes vs managed=detach, and we don't want to be asking the user to choose between modes in openstack/ovirt IMHO. I think that's a fundamental problem with inventing a new value for managed here.

Can you provide more details about the problems with detaching ? Is this inherent to all VGA cards, or is it specific to the Intel i915, or specific to a kernel version or something else ?

I feel like this is something where libvirt should "do the right thing", since that's really what managed=yes is all about. eg, if we have managed=yes and we see an i915, we should automatically skip re-attach for that device.

Regards,
Daniel

-- 
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On 03/15/2016 01:00 PM, Daniel P. Berrange wrote:
On Mon, Mar 14, 2016 at 03:41:48PM -0400, Laine Stump wrote:
Suggested by Alex Williamson.
If you plan to assign a GPU to a virtual machine, but that GPU happens to be the host system console, you likely want it to start out using the host driver (so that boot messages/etc will be displayed), then later have the host driver replaced with vfio-pci for assignment to the virtual machine.
However, in at least some cases (e.g. Intel i915) once the device has been detached from the host driver and attached to vfio-pci, attempts to reattach to the host driver only lead to "grief" (ask Alex for details). This means that simply using "managed='yes'" in libvirt won't work.
And if you set "managed='no'" in libvirt then either you have to manually run virsh nodedev-detach prior to the first start of the guest, or you have to have a management application intelligent enough to know that it should detach from the host driver, but never reattach to it.
This patch makes it simple/automatic to deal with such a case - it adds a third "managed" mode for assigned PCI devices, called "detach". It will detach ("unbind" in driver parlance) the device from the host driver prior to assigning it to the guest, but when the guest is finished with the device, will leave it bound to vfio-pci. This allows re-using the device for another guest, without requiring initial out-of-band intervention to unbind the host driver.

You say that managed=yes causes pain upon re-attachment and that apps should use managed=detach to avoid it, but how do management apps know which devices are going to cause pain ? Libvirt isn't providing any info on whether a particular device id needs to use managed=yes vs managed=detach, and we don't want to be asking the user to choose between modes in openstack/ovirt IMHO. I think that's a fundamental problem with inventing a new value for managed here.
My suspicion is that in many/most cases users don't actually need for the device to be re-bound to the host driver after the guest is finished with it, because they're only going to use the device to assign to a different guest anyway. But because managed='yes' is what's supplied and is the easiest way to get it setup for assignment to a guest, that's what they use.

As a matter of fact, all this extra churn of changing the driver back and forth for devices that are only actually used when they're bound to vfio-pci just wastes time, and makes it more likely that libvirt and its users will reveal and get caught up in the effects of some strange kernel driver loading/unloading bug (there was recently a bug reported like this; unfortunately the BZ record had customer info in it, so it's not publicly accessible :-( )

So beyond making this behavior available only when absolutely necessary, I think it is useful in other cases, at the user's discretion (and as I implied above, I think that if they understood the function and the tradeoffs, most people would choose to use managed='detach' rather than managed='yes')

(alternately, we could come back to the discussion of having persistent nodedevice config, with one of the configurables being which devices should be bound to vfio-pci when libvirtd is started. Did we maybe even talk about exactly that in the past? I can't remember... That would of course preclude the use case where someone 1) normally wanted to use the device for the host, but 2) occasionally wanted to use it for a guest, after which 3) they were well aware that they would need to reboot the host before they could use the device on the host again. I know, I know - "odd edge cases", and in particular "odd edge cases only encountered by people who know other ways of working around the problem" :-))
Can you provide more details about the problems with detaching ?
Is this inherent to all VGA cards, or is it specific to the Intel i915, or specific to a kernel version or something else ?
I feel like this is something where libvirt should "do the right thing", since that's really what managed=yes is all about.
eg, if we have managed=yes and we see an i915, we should automatically skip re-attach for that device.
Alex can give a much better description of that than I can (I had told git to Cc him on the original patch, but it seems it didn't do that; I'm trying again). But what if there is such a behavior now for a certain set of VGA cards, and it gets fixed in the future? Would we continue to force avoiding re-attach for the device? I understand the allure of always doing the right thing without requiring config (and the dislike of adding new seemingly esoteric options), but I don't know that libvirt has (or can get) the necessary info to make the correct decision in all cases.

On Tue, 15 Mar 2016 14:21:35 -0400 Laine Stump <laine@laine.org> wrote:
On 03/15/2016 01:00 PM, Daniel P. Berrange wrote:
On Mon, Mar 14, 2016 at 03:41:48PM -0400, Laine Stump wrote:
Suggested by Alex Williamson.
If you plan to assign a GPU to a virtual machine, but that GPU happens to be the host system console, you likely want it to start out using the host driver (so that boot messages/etc will be displayed), then later have the host driver replaced with vfio-pci for assignment to the virtual machine.
However, in at least some cases (e.g. Intel i915) once the device has been detached from the host driver and attached to vfio-pci, attempts to reattach to the host driver only lead to "grief" (ask Alex for details). This means that simply using "managed='yes'" in libvirt won't work.
And if you set "managed='no'" in libvirt then either you have to manually run virsh nodedev-detach prior to the first start of the guest, or you have to have a management application intelligent enough to know that it should detach from the host driver, but never reattach to it.
This patch makes it simple/automatic to deal with such a case - it adds a third "managed" mode for assigned PCI devices, called "detach". It will detach ("unbind" in driver parlance) the device from the host driver prior to assigning it to the guest, but when the guest is finished with the device, will leave it bound to vfio-pci. This allows re-using the device for another guest, without requiring initial out-of-band intervention to unbind the host driver.

You say that managed=yes causes pain upon re-attachment and that apps should use managed=detach to avoid it, but how do management apps know which devices are going to cause pain ? Libvirt isn't providing any info on whether a particular device id needs to use managed=yes vs managed=detach, and we don't want to be asking the user to choose between modes in openstack/ovirt IMHO. I think that's a fundamental problem with inventing a new value for managed here.
My suspicion is that in many/most cases users don't actually need for the device to be re-bound to the host driver after the guest is finished with it, because they're only going to use the device to assign to a different guest anyway. But because managed='yes' is what's supplied and is the easiest way to get it setup for assignment to a guest, that's what they use.
As a matter of fact, all this extra churn of changing the driver back and forth for devices that are only actually used when they're bound to vfio-pci just wastes time, and makes it more likely that libvirt and its users will reveal and get caught up in the effects of some strange kernel driver loading/unloading bug (there was recently a bug reported like this; unfortunately the BZ record had customer info in it, so it's not publicly accessible :-( )
So beyond making this behavior available only when absolutely necessary, I think it is useful in other cases, at the user's discretion (and as I implied above, I think that if they understood the function and the tradeoffs, most people would choose to use managed='detach' rather than managed='yes')
(alternately, we could come back to the discussion of having persistent nodedevice config, with one of the configurables being which devices should be bound to vfio-pci when libvirtd is started. Did we maybe even talk about exactly that in the past? I can't remember... That would of course preclude the use case where someone 1) normally wanted to use the device for the host, but 2) occasionally wanted to use it for a guest, after which 3) they were well aware that they would need to reboot the host before they could use the device on the host again. I know, I know - "odd edge cases", and in particular "odd edge cases only encountered by people who know other ways of working around the problem" :-))
Can you provide more details about the problems with detaching ?
Is this inherent to all VGA cards, or is it specific to the Intel i915, or specific to a kernel version or something else ?
I feel like this is something where libvirt should "do the right thing", since that's really what managed=yes is all about.
eg, if we have managed=yes and we see an i915, we should automatically skip re-attach for that device.
Alex can give a much better description of that than I can (I had told git to Cc him on the original patch, but it seems it didn't do that; I'm trying again). But what if there is such a behavior now for a certain set of VGA cards, and it gets fixed in the future? Would we continue to force avoiding re-attach for the device? I understand the allure of always doing the right thing without requiring config (and the dislike of adding new seemingly esoteric options), but I don't know that libvirt has (or can get) the necessary info to make the correct decision in all cases.
I agree, blacklisting VGA devices or any other specific device types or host drivers is bound to be the wrong thing to do for someone or at some point in time. I think if we look at the way devices are typically used for device assignment, we'd probably see that they're used exclusively for device assignment or exclusively for the host. My guess is that it's a much less common scenario that a user actually wants to steal a device from the host only while a VM is using it. It is done though, I know of folks that steal an audio device from the host when they run their game VM and give it back when shutdown. I don't know that it's possible for libvirt to always just do the right thing here, it involves inferring the intentions of the user.

So here are the types of things we're dealing with that made me suggest this idea to Laine; in the i915 scenario, the Intel graphics device (IGD) is typically the primary host graphics. If we want to assign it to a VM, obviously at some point it needs to move to vfio-pci, but do we know that the user has an alternate console configured or do they go headless when that happens? If they go headless then they probably don't want to use kernel boot options and blacklisting to prevent i915 from claiming the device or getting it attached to pci-stub or vfio-pci. Often that's not even enough since efifb or vesafb might try to claim resources of the device even if the PCI driver is prevented from doing so. In such a case, it's pretty convenient that the user can just set managed='yes' and the device gets snatched away from the host when the VM starts... but then the i915 driver sometimes barfs when the VM is shut down and i915 takes back the device. The host is left in a mostly unusable state. Yes, the user could do a nodedev-detach before starting the VM and yes, the i915 driver issue may just be temporary, but this isn't the first time this has happened.

As Laine mentioned, we've seen another customer issue where a certain NIC is left in an inconsistent state, sometimes, when returned to the host. They have absolutely no use for this NIC on the host, so this was mostly a pointless operation anyway. In this case we had to use a pci-stub.ids option to prevent the host NIC driver from touching the devices since there was really no easy way to set managed='no' and pre-bind the devices to vfio-pci in their ovirt/openstack environment. NICs usually fare better at repeated attach/detach scenarios thanks to physical hotplug support, but it's really a question of how robust the driver is. For instance, how many people are out there hotplugging $10 Realtek NICs vs multi-hundred dollar enterprise class NICs? Has anyone ever done physical hotplug of a graphics card, sound cards or USB controller? Even in the scenario I mention above where the user thinks they want to bounce their audio device back and forth between VM and host, there's actually a fixed array of alsa cards in the kernel and unbinding a device just leaks that slot. :-\

We also have nvidia.ko, which not only messes with the device to the point where it may or may not work in the VM, but the vendor doesn't support dynamically unbinding devices. It will still do it, but it kinda forgets to tell Xorg to stop using the device. We generally recommend folks doing GPU assignment to avoid the host driver altogether, nouveau and radeon sometimes don't even like to do the unbind. i915 is actually doing better than average in this case and the fact that it's typically the primary graphics sort of breaks that rule anyway.

So we have all sorts of driver issues that are sure to come and go over time and all sorts of use cases that seem difficult to predict. If we know we're in a ovirt/openstack environment, managed='detach' might actually be a more typical use case than managed='yes'. It still leaves a gap that we hope the host driver doesn't do anything bad when it initializes the device and hope that it releases the device cleanly, but it's probably better than tempting fate by unnecessarily bouncing it back and forth between drivers.

Thanks,
Alex
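[For reference, a rough sketch of the manual workflow that managed='detach' would automate, using virsh; the nodedev name below is just a placeholder taken from the test files in this patch:]

  # one-time, before the first guest start: unbind the device from its host
  # driver and hand it to vfio-pci (this is the step managed='detach' keeps)
  virsh nodedev-detach pci_0000_06_12_5 --driver vfio

  # ...start and stop guests that use the device as often as needed...

  # only if the device should ever go back to the host driver; with the
  # problem drivers discussed above this is exactly the step to skip
  virsh nodedev-reattach pci_0000_06_12_5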

On Tue, 2016-03-15 at 13:31 -0600, Alex Williamson wrote:
So we have all sorts of driver issues that are sure to come and go over time and all sorts of use cases that seem difficult to predict. If we know we're in a ovirt/openstack environment, managed='detach' might actually be a more typical use case than managed='yes'. It still leaves a gap that we hope the host driver doesn't do anything bad when it initializes the device and hope that it releases the device cleanly, but it's probably better than tempting fate by unnecessarily bouncing it back and forth between drivers.
Is sharing a hostdev between multiple guests more solid in general? Eg. if I have g1 and g2, both configured to use the host's GPU, can I start up g1, shut it down, start up g2 and expect things to just work? Hopefully that's the case because the device would go through a more complete set up / tear down cycle.

Anyway, after reading your explanation I'm wondering if we shouldn't always recommend a setup where devices that are going to be assigned to guests are just never bound to any host driver, as that sounds like it would have the most chances of working reliably.

IIUC, the pci-stubs.ids kernel parameter you mentioned above does exactly that. Maybe blacklisting the host driver as well might be a good idea? Anything else a user would need to do? Would the user or management layer not be able to configure something like that in a oVirt / OpenStack environment? What should we change to make it possible or easier?

That would give us a setup we can rely on, and cover all use cases you mentioned except that of someone assigning his laptop's GPU to a guest and needing the console to be available before the guest has started up because no other access to the machine is available. But in that case, even with managed='detach', the user would need to blindly restart the machine after guest shutdown, wouldn't he?

Cheers.

-- 
Andrea Bolognani
Software Engineer - Virtualization Team

Sorry, apparently missed this reply previously.

On Wed, 16 Mar 2016 11:19:38 +0100 Andrea Bolognani <abologna@redhat.com> wrote:
On Tue, 2016-03-15 at 13:31 -0600, Alex Williamson wrote:
So we have all sorts of driver issues that are sure to come and go over time and all sorts of use cases that seem difficult to predict. If we know we're in a ovirt/openstack environment, managed='detach' might actually be a more typical use case than managed='yes'. It still leaves a gap that we hope the host driver doesn't do anything bad when it initializes the device and hope that it releases the device cleanly, but it's probably better than tempting fate by unnecessarily bouncing it back and forth between drivers.
Is sharing a hostdev between multiple guests more solid in general? Eg. if I have g1 and g2, both configured to use the host's GPU, can I start up g1, shut it down, start up g2 and expect things to just work? Hopefully that's the case because the device would go through a more complete set up / tear down cycle.
Yes, this should work.
Anyway, after reading your explanation I'm wondering if we shouldn't always recommend a setup where devices that are going to be assigned to guests are just never bound to any host driver, as that sounds like it would have the most chances of working reliably.
IIUC, the pci-stubs.ids kernel parameter you mentioned above does exactly that. Maybe blacklisting the host driver as well might be a good idea? Anything else a user would need to do? Would the user or management layer not be able to configure something like that in a oVirt / OpenStack environment? What should we change to make it possible or easier?
Both pci-stub.ids and blacklisting have some serious problems. A common thread on the vfio-users list is that someone has two identical devices and wants to use one for the host and one for a VM and can't figure out how to make that work. The only solution I know is to escalate to initramfs scripts that are smarter about which device to pick. Maybe this implies that we need some initramfs integration for driver_override so that we can force a particular driver for a particular device, but that fails to acknowledge that a user may want the device to use the default driver, up until the point that they don't.

Blacklisting has similar issues, but worse. Not all SR-IOV drivers are partitioned such that there's a separate driver for PF vs VF, quite a few use the same driver for both. So immediately blacklisting doesn't work for a large class of drivers. In general it's even more broad than pci-stub.ids, for instance I might not want snd-hda-intel to bind to the GPU audio device used for a VM, but I definitely want snd-hda-intel binding to my primary audio device. I've also found that marking i915 for blacklist does absolutely nothing... go figure.

BTW, even if we do either of these, do we still need managed='yes' since the device will be bound to either pci-stub or nothing and needs to get to vfio-pci? I don't recall if libvirt accepts those cases or not. If not, then we're unnecessarily bouncing back and forth between drivers even if we take a pci-stub/blacklist approach unless we inject another layer that actually binds them to vfio-pci.

Let's take it one step further, what if we made an initramfs script that would set "vfio-pci" for the driver_override for a user specified list of devices. Recall that driver_override means that only the driver with the matching name can bind to the device. So now we boot up with the user defined set of devices without drivers. What happens in either the managed='yes' or 'no' scenarios? Logically this seems like exactly when we want to use 'detach'.
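[For readers unfamiliar with it, driver_override is a per-device sysfs knob available since roughly kernel 3.16; a minimal sketch, with a placeholder PCI address:]

  # allow only vfio-pci to bind this one device from now on
  echo vfio-pci > /sys/bus/pci/devices/0000:01:00.0/driver_override
  # release it from whatever driver grabbed it at boot, if any
  echo 0000:01:00.0 > /sys/bus/pci/devices/0000:01:00.0/driver/unbind
  # re-probe; the driver core now matches the device only against vfio-pci
  echo 0000:01:00.0 > /sys/bus/pci/drivers_probe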
That would give us a setup we can rely on, and cover all use cases you mentioned except that of someone assigning his laptop's GPU to a guest and needing the console to be available before the guest has started up because no other access to the machine is available. But in that case, even with managed='detach', the user would need to blindly restart the machine after guest shutdown, wouldn't he?
I don't want to make this issue about any one particular driver because I certainly hope that we'll eventually fix that one driver. The problem is more that this is not an uncommon scenario. In one of my previous replies I listed all of the driver issues that I know about. In some cases we can't fix them because the driver is proprietary, in others we just don't have the bandwidth. Users have far more devices at their disposal to test than we do, so they're likely going to continue to run into these issues. Yes, managed='no' is an alternative, but if we go down the path of saying 'well that feature sounds like foo+bar and we already do both of those, so we don't need that feature', then we have to start asking why do we even have managed='yes'? Why do we have an autostart feature, we can clearly start a VM and it's the app's problem when to call that, etc. It seems useful to me, but I can understand the concern about feature bloat. Thanks, Alex

On Fri, 2016-03-18 at 11:03 -0600, Alex Williamson wrote:
Anyway, after reading your explanation I'm wondering if we shouldn't always recommend a setup where devices that are going to be assigned to guests are just never bound to any host driver, as that sounds like it would have the most chances of working reliably. IIUC, the pci-stubs.ids kernel parameter you mentioned above does exactly that. Maybe blacklisting the host driver as well might be a good idea? Anything else a user would need to do? Would the user or management layer not be able to configure something like that in a oVirt / OpenStack environment? What should we change to make it possible or easier? Both pci-stub.ids and blacklisting have some serious problems. A common thread on the vfio-users list is that someone has two identical devices and wants to use one for the host and one for a VM and can't figure out how to make that work.
Right, because pci-stub.ids works on 'vendor product' and not on 'domain:bus:slot.function', so if you have two or more identical devices it's all or nothing.
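[A minimal illustration of that limitation, with a placeholder vendor:device ID:]

  # kernel command line: pci-stub claims *every* device with this ID
  pci-stub.ids=8086:10fb
  # there is no per-address form, so two identical NICs cannot be split
  # between host use and guest assignment this way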
The only solution I know is to escalate to initramfs scripts that are smarter about which device to pick. Maybe this implies that we need some initramfs integration for driver_override so that we can force a particular driver for a particular device, but that fails to acknowledge that a user may want the device to use the default driver, up until the point that they don't.
Is that really something we want to encourage? See below.
Blacklisting has similar issues, but worse. Not all SR-IOV drivers are partitioned such that there's a separate driver for PF vs VF, quite a few use the same driver for both. So immediately blacklisting doesn't work for a large class of drivers. In general it's even more broad than pci-stub.ids, for instance I might not want snd-hda-intel to bind to the GPU audio device used for a VM, but I definitely want snd-hda-intel binding to my primary audio device. I've also found that marking i915 for blacklist does absolutely nothing... go figure.
Of course driver blacklisting is even more coarse-grained than pci-stub.ids, so not suitable in a lot of situations. Got it.
BTW, even if we do either of these, do we still need managed='yes' since the device will be bound to either pci-stub or nothing and needs to get to vfio-pci? I don't recall if libvirt accepts those cases or not. If not, then we're unnecessarily bouncing back and forth between drivers even if we take a pci-stub/blacklist approach unless we inject another layer that actually binds them to vfio-pci.
Well, if a device is bound to pci-stub and we want to use VFIO passthrough we're always going to have to unbind it from pci-stub and bind it to vfio-pci, aren't we?

Shouldn't that be safe anyway? As long as the device is never bound to a host driver...
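[That hand-off, done by hand with a placeholder address and ID, would look something like this:]

  # release the device from pci-stub...
  echo 0000:03:00.0 > /sys/bus/pci/drivers/pci-stub/unbind
  # ...then either teach vfio-pci the device ID so it picks the device up:
  echo "8086 10fb" > /sys/bus/pci/drivers/vfio-pci/new_id
  # ...or, per device, use driver_override and re-probe:
  # echo vfio-pci > /sys/bus/pci/devices/0000:03:00.0/driver_override
  # echo 0000:03:00.0 > /sys/bus/pci/drivers_probe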
Let's take it one step further, what if we made an initramfs script that would set "vfio-pci" for the driver_override for a user specified list of devices. Recall that driver_override means that only the driver with the matching name can bind to the device. So now we boot up with the user defined set of devices without drivers. What happens in either the managed='yes' or 'no' scenarios? Logically this seems like exactly when we want to use 'detach'.
If managed='no', libvirt will not attempt to rebind the device to the host driver after detaching it from the guest, so we're good. If managed='yes', libvirt *will* attempt to rebind it to the host driver, but due to driver_override it will end up being bound to vfio-pci anyway: still safe.

This looks like the perfect solution for people who want to dedicate some host devices to VFIO passthrough exclusively... Of course it requires implementing the smart initramfs bits.

Could this be controlled by a kernel parameter? So you can just add something like vfio-pci.devices=0000:02:00.0,0000:03:00.0 to your bootloader configuration and be sure that the devices you've listed will never be touched by the host.

Note that I have no idea how much work implementing something like the above would be, or whether it would even be possible :)
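[As far as I know no such per-address module parameter exists today (vfio-pci's existing ids= option is ID-based, like pci-stub's), so here is a rough sketch of the same idea done with a udev rule instead; the file name and address are made up:]

  # /etc/udev/rules.d/99-vfio-override.rules (hypothetical)
  ACTION=="add", SUBSYSTEM=="pci", KERNEL=="0000:02:00.0", \
      ATTR{driver_override}="vfio-pci"

[The usual caveat applies: a driver built into the kernel or loaded from the initramfs may already have bound the device by the time udev processes the event, so an initramfs hook remains the more robust place to set this.]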
That would give us a setup we can rely on, and cover all use cases you mentioned except that of someone assigning his laptop's GPU to a guest and needing the console to be available before the guest has started up because no other access to the machine is available. But in that case, even with managed='detach', the user would need to blindly restart the machine after guest shutdown, wouldn't he? I don't want to make this issue about any one particular driver because I certainly hope that we'll eventually fix that one driver.
Of course :)
The problem is more that this is not an uncommon scenario. In one of my previous replies I listed all of the driver issues that I know about. In some cases we can't fix them because the driver is proprietary, in others we just don't have the bandwidth. Users have far more devices at their disposal to test than we do, so they're likely going to continue to run into these issues.
Summing up the issues you listed in that message, along with my thoughts:

* stealing audio device from host when running a game in a guest, giving it back afterwards
  - desktop use case
  - seems to be working fine with managed='yes'
  - if something goes wrong when reattaching to the host, the user can probably still go about his business; rebooting the host is going to be annoying but not a huge deal
  - limited number of in-kernel slots means you can only do this a number of times before it stops working; again, having to reboot the host once every so-many guest boots is probably not a deal breaker
  - managed='detach' is not helpful here, and neither is the proposal above

* NIC reserved for guest use
  - both server and desktop use case
  - may or may not handle going back and forth between the host and the guest
  - managed='detach' would still prevent most issues, assuming the host driver detaches cleanly
  - best would be if the host never touched the device
  - a solution like the one outlined above seems perfect
  - for oVirt / OpenStack, the management layer could take care of updating the bootloader configuration
  - for pure libvirt, that responsibility would fall on the user
  - should virt-manager etc. rather default to managed='detach' when assigning a NIC to a guest?

* GPU with binary driver
  - both server and desktop use case
  - doesn't handle dynamic unbind gracefully
  - best would be if the host never touched the device
  - a solution like the one outlined above seems perfect
  - for oVirt / OpenStack, the management layer could take care of updating the bootloader configuration
  - for pure libvirt, that responsibility would fall on the user
  - managed='detach' can prevent some failures, but not those that happen on host driver unbind

* Primary Intel GPU
  - desktop use case (any reason to do this for servers?)
  - will probably fail on reattach
  - managed='detach' would prevent host crashes and the like
  - must assume the user has some other way to connect to the host, because at some point they're going to be unable to use the GPU from the host anyway

Have I missed anything?
Yes, managed='no' is an alternative, but if we go down the path of saying 'well, that feature sounds like foo+bar and we already do both of those, so we don't need that feature', then we have to start asking why we even have managed='yes'. Why do we have an autostart feature, when we can clearly start a VM and it's the app's problem to decide when to call it, etc.? It seems useful to me, but I can understand the concern about feature bloat.
If we've added redundant features in the past, then that's all the more reason to avoid adding even more now ;) Thanks for taking the time to reply in such detail and to provide concrete examples; it's been really helpful for me! Cheers. -- Andrea Bolognani Software Engineer - Virtualization Team

On Tue, 22 Mar 2016 12:54:12 +0100 Andrea Bolognani <abologna@redhat.com> wrote:
On Fri, 2016-03-18 at 11:03 -0600, Alex Williamson wrote:
Anyway, after reading your explanation I'm wondering if we shouldn't always recommend a setup where devices that are going to be assigned to guests are just never bound to any host driver, as that sounds like it would have the best chance of working reliably. IIUC, the pci-stub.ids kernel parameter you mentioned above does exactly that. Maybe blacklisting the host driver as well might be a good idea? Anything else a user would need to do? Would the user or management layer not be able to configure something like that in an oVirt / OpenStack environment? What should we change to make it possible or easier? Both pci-stub.ids and blacklisting have some serious problems. A common thread on the vfio-users list is that someone has two identical devices and wants to use one for the host and one for a VM and can't figure out how to make that work.
Right, because pci-stub.ids works on 'vendor product' and not on 'domain:bus:slot.function', so if you have two or more identical devices it's all or nothing.
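For concreteness, the parameter takes vendor:device pairs, so something like the following (the IDs shown are only examples) claims every matching device, with no way to single out one of two identical cards:

  pci-stub.ids=10de:13c2,10de:0fbb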
The only solution I know is to escalate to initramfs scripts that are smarter about which device to pick. Maybe this implies that we need some initramfs integration for driver_override so that we can force a particular driver for a particular device, but that fails to acknowledge that a user may want the device to use the default driver, up until the point that they don't.
Is that really something we want to encourage? See below.
Blacklisting has similar issues, but worse. Not all SR-IOV drivers are partitioned such that there's a separate driver for PF vs VF, quite a few use the same driver for both. So immediately blacklisting doesn't work for a large class of drivers. In general it's even more broad than pci-stub.ids, for instance I might not want snd-hda-intel to bind to the GPU audio device used for a VM, but I definitely want snd-hda-intel binding to my primary audio device. I've also found that marking i915 for blacklist does absolutely nothing... go figure.
Of course driver blacklisting is even more coarse-grained than pci-stub.ids, so not suitable in a lot of situations. Got it.
BTW, even if we do either of these, do we still need managed='yes' since the device will be bound to either pci-stub or nothing and needs to get to vfio-pci? I don't recall if libvirt accepts those cases or not. If not, then we're unnecessarily bouncing back and forth between drivers even if we take a pci-stub/blacklist approach unless we inject another layer that actually binds them to vfio-pci.
Well, if a device is bound to pci-stub and we want to use VFIO passthrough we're always going to have to unbind it from pci-stub and bind it to vfio-pci, aren't we?
Shouldn't that be safe anyway? As long as the device is never bound to a host driver...
Let's take it one step further, what if we made an initramfs script that would set "vfio-pci" for the driver_override for a user specified list of devices. Recall that driver_override means that only the driver with the matching name can bind to the device. So now we boot up with the user defined set of devices without drivers. What happens in either the managed='yes' or 'no' scenarios? Logically this seems like exactly when we want to use 'detach'.
If managed='no', libvirt will not attempt to rebind the device to the host driver after detaching it from the guest, so we're good.
If managed='yes', libvirt *will* attempt to rebind it to the host driver, but due to driver_override it will end up being bound to vfio-pci anyway: still safe.
This looks like the perfect solution for people who want to dedicate some host devices to VFIO passthrough exclusively... Of course it requires implementing the smart initramfs bits.
Could this be controlled by a kernel parameter? So you can just add something like
vfio-pci.devices=0000:02:00.0,0000:03:00.0
to your bootloader configuration and be sure that the devices you've listed will never be touched by the host.
Note that I have no idea how much work implementing something like the above would be, or whether it would even be possible :)
vfio-pci is typically a module, device assignment and userspace drivers are very niche as far as the kernel is concerned and the driver is complex enough that building it statically into the kernel has some disadvantages. So adding module options to vfio-pci doesn't really help. We sort of tried this with adding the vfio-pci.ids module option so we'd have parity with pci-stub, but it didn't really work because users need to go through enough hoops to get vfio-pci loaded early that they might as well just use the driver_override option. Live and learn, pci-stub.ids is still generally more useful than the equivalent vfio-pci option because pci-stub is (or should be) built statically into the kernel. Also, a fun property that we get to deal with in PCI space is that bus numbers are under the control of the OS, so if you boot with pci=assign-buses, your device address may change. That means that specifying a device by address is not as clean of a solution as you might think. It seems like this quickly devolves into a question of whether the kernel is even the right place to do this, if you have a properly modular kernel then userspace really has the control of how modules are loaded and can intervene to exclude certain devices. The kink is that we need to do that early in boot, which circles right back around to the initramfs scripts, but afaik those are generally managed by each distro with only loose similarities, so we'd need to hope that we contribute something that becomes a defacto standard.
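The vfio-pci.ids option mentioned here is typically wired up through modprobe configuration rather than the kernel command line; a sketch, with example IDs and an arbitrary file name:

  # /etc/modprobe.d/vfio.conf
  options vfio-pci ids=10de:13c2,10de:0fbb
  # vfio-pci is a module, so it still has to be loaded before the competing
  # host driver for this to have any effect; that is the early-loading hoop
  # described above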
That would give us a setup we can rely on, and cover all use cases you mentioned except that of someone assigning his laptop's GPU to a guest and needing the console to be available before the guest has started up because no other access to the machine is available. But in that case, even with managed='detach', the user would need to blindly restart the machine after guest shutdown, wouldn't he? I don't want to make this issue about any one particular driver because I certainly hope that we'll eventually fix that one driver.
Of course :)
The problem is more that this is not an uncommon scenario. In one of my previous replies I listed all of the driver issues that I know about. In some cases we can't fix them because the driver is proprietary, in others we just don't have the bandwidth. Users have far more devices at their disposal to test than we do, so they're likely going to continue to run into these issues.
Summing up the issues you listed in that message, along with my thoughts:
* stealing audio device from host when running a game in a guest, giving it back afterwards
- desktop use case
- seems to be working fine with managed='yes'
- if something goes wrong when reattaching to the host, the user can probably still go about his business; rebooting the host is going to be annoying but not a huge deal
- limited number of in-kernel slots means you can only do this a number of times before it stops working; again, having to reboot the host once every so-many guest boots is probably not a deal breaker
- managed='detach' is not helpful here, and neither is the proposal above
* NIC reserved for guest use
- both server and desktop use case
- may or may not handle going back and forth between the host and the guest
- managed='detach' would still prevent most issues, assuming the host driver detaches cleanly
- best would be if the host never touched the device
- a solution like the one outlined above seems perfect
- for oVirt / OpenStack, the management layer could take care of updating the bootloader configuration
- for pure libvirt, that responsibility would fall on the user
- should virt-manager etc. rather default to managed='detach' when assigning a NIC to a guest?
* GPU with binary driver
- both server and desktop use case
- doesn't handle dynamic unbind gracefully
- best would be if the host never touched the device
- a solution like the one outlined above seems perfect
- for oVirt / OpenStack, the management layer could take care of updating the bootloader configuration
- for pure libvirt, that responsibility would fall on the user
- managed='detach' can prevent some failures, but not those that happen on host driver unbind
* Primary Intel GPU
- desktop use case (any reason to do this for servers?)
- will probably fail on reattach
- managed='detach' would prevent host crashes and the like
- must assume the user has some other way to connect to the host, because at some point they're going to be unable to use the GPU from the host anyway
Have I missed anything?
I think the only thing missing is that libvirt sort of becomes a natural management point for an assigned device because it's specified with some management characteristics. Users set managed='yes', assuming libvirt will do the right thing and because the managed='no' path requires them to come up with their own solution. So maybe we just need to create some framework for users to be able to take that managed='no' path without feeling like they're stepping off a cliff into their own ad hoc management scripts. Thanks, Alex

On Tue, 2016-03-22 at 09:04 -0600, Alex Williamson wrote:
Could this be controlled by a kernel parameter? So you can just add something like vfio-pci.devices=0000:02:00.0,0000:03:00.0 to your bootloader configuration and be sure that the devices you've listed will never be touched by the host. Note that I have no idea how much work implementing something like the above would be, or whether it would even be possible :) vfio-pci is typically a module, device assignment and userspace drivers are very niche as far as the kernel is concerned and the driver is complex enough that building it statically into the kernel has some disadvantages. So adding module options to vfio-pci doesn't really help. We sort of tried this with adding the vfio-pci.ids module option so we'd have parity with pci-stub, but it didn't really work because users need to go through enough hoops to get vfio-pci loaded early that they might as well just use the driver_override option. Live and learn, pci-stub.ids is still generally more useful than the equivalent vfio-pci option because pci-stub is (or should be) built statically into the kernel.
I see. But it doesn't necessarily have to be a kernel module option, we just need the information to get to initramfs somehow. I believe a bunch of arguments, such as "splash", are already processed this way.
Also, a fun property that we get to deal with in PCI space is that bus numbers are under the control of the OS, so if you boot with pci=assign-buses, your device address may change. That means that specifying a device by address is not as clean of a solution as you might think.
That sounds... Problematic. Wouldn't that mean that the guest XML itself might suddenly become invalid? When we configure a device for PCI passthrough, we identify it using domain:bus:slot.function.
It seems like this quickly devolves into a question of whether the kernel is even the right place to do this, if you have a properly modular kernel then userspace really has the control of how modules are loaded and can intervene to exclude certain devices. The kink is that we need to do that early in boot, which circles right back around to the initramfs scripts, but afaik those are generally managed by each distro with only loose similarities, so we'd need to hope that we contribute something that becomes a defacto standard.
Yeah, once you get in initramfs territory you can hardly count on standardization. IIRC some distros even provide *several* alternative tools for the job :) Then again, it seems quite weird to me that some sort of mechanism for setting driver_override during early boot is not available... Wouldn't it make the most sense for it to be set before the kernel has had a chance to bind devices to the default drivers?
The problem is more that this is not an uncommon scenario. In one of my previous replies I listed all of the driver issues that I know about. In some cases we can't fix them because the driver is proprietary, in others we just don't have the bandwidth. Users have far more devices at their disposal to test than we do, so they're likely going to continue to run into these issues.
Summing up the issues you listed in that message, along with my thoughts:
* stealing audio device from host when running a game in a guest, giving it back afterwards - desktop use case - seems to be working fine with managed='yes' - if something goes wrong when reattaching to the host, the user can probably still go about his business; rebooting the host is going to be annoying but not a huge deal - limited number of in-kernel slots means you can only do this a number of times before it stops working; again, having to reboot the host once every so-many guest boots is probably not a deal breaker - managed='detach' is not helpful here, and neither is the proposal above
* NIC reserved for guest use - both server and desktop use case - may or may not handle going back and forth between the host and the guest - managed='detach' would still prevent most issues, assuming the host driver detaches cleanly - best would be if the host never touched the device - a solution like the one outlined above seems perfect - for oVirt / OpenStack, the management layer could take care of updating the bootloader configuration - for pure libvirt, that responsibility would fall on the user - should virt-manager etc. rather default to managed='detach' when assigning a NIC to a guest?
* GPU with binary driver - both server and desktop use case - doesn't handle dynamic unbind gracefully - best would be if the host never touched the device - a solution like the one outlined above seems perfect - for oVirt / OpenStack, the management layer could take care of updating the bootloader configuration - for pure libvirt, that responsibility would fall on the user - managed='detach' can prevent some failures, but not those that happen on host driver unbind
* Primary Intel GPU - desktop use case (any reason to do this for servers?) - will probably fail on reattach - managed='detach' would prevent host crashes and the like - must assume the user has some other way to connect to the host, because at some point they're going to be unable to use the GPU from the host anyway
Have I missed anything?
I think the only thing missing is that libvirt sort of becomes a natural management point for an assigned device because it's specified with some management characteristics. Users set managed='yes', assuming libvirt will do the right thing
People who assume that managed='yes' will always do the right thing are probably going to be disappointed no matter what :)
and because the managed='no' path requires them to come up with their own solution. So maybe we just need to create some framework for users to be able to take that managed='no' path without feeling like they're stepping off a cliff into their own ad hoc management scripts.
That's what I'm thinking. If, as argued above, the "proper" solution is in most cases to prevent devices from being bound to the host driver in the first place, then libvirt enters the picture way too late to be able to take care of it. If we can figure out a way to give users the ability to mark a bunch of devices as "reserved for assignment" and have early boot set driver_override to vfio-pci for us, we'll make life easier for people currently relying on pci-stub.ids as well. The simple stuff, that already works fine today with managed='yes', will of course keep working. Does this sound sensible / doable? Cheers. -- Andrea Bolognani Software Engineer - Virtualization Team
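To make that slightly more concrete, the early-boot piece would not have to be much more than the loop below; where the list of reserved addresses lives and how the hook gets into the initramfs is exactly the part that still needs designing (the file name and addresses are made up):

  # run before host drivers are loaded, so nothing but vfio-pci can ever bind
  while read dev; do
      echo vfio-pci > "/sys/bus/pci/devices/$dev/driver_override"
  done < /etc/reserved-for-assignment.conf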

On Tue, 22 Mar 2016 19:04:31 +0100 Andrea Bolognani <abologna@redhat.com> wrote:
On Tue, 2016-03-22 at 09:04 -0600, Alex Williamson wrote:
Could this be controlled by a kernel parameter? So you can just add something like vfio-pci.devices=0000:02:00.0,0000:03:00.0 to your bootloader configuration and be sure that the devices you've listed will never be touched by the host. Note that I have no idea how much work implementing something like the above would be, or whether it would even be possible :) vfio-pci is typically a module, device assignment and userspace drivers are very niche as far as the kernel is concerned and the driver is complex enough that building it statically into the kernel has some disadvantages. So adding module options to vfio-pci doesn't really help. We sort of tried this with adding the vfio-pci.ids module option so we'd have parity with pci-stub, but it didn't really work because users need to go through enough hoops to get vfio-pci loaded early that they might as well just use the driver_override option. Live and learn, pci-stub.ids is still generally more useful than the equivalent vfio-pci option because pci-stub is (or should be) built statically into the kernel.
I see. But it doesn't necessarily have to be a kernel module option, we just need the information to get to initramfs somehow. I believe a bunch of arguments, such as "splash", are already processed this way.
Also, a fun property that we get to deal with in PCI space is that bus numbers are under the control of the OS, so if you boot with pci=assign-buses, your device address may change. That means that specifying a device by address is not as clean of a solution as you might think.
That sounds... Problematic.
Wouldn't that mean that the guest XML itself might suddenly become invalid? When we configure a device for PCI passthrough, we identify it using domain:bus:slot.function.
Well, barring configuration or commandline changes, PCI device addresses are generally stable. To do better we'd need to provide device paths like ACPI does in the DMAR table. ACPI specifies the base bus number of the root bridge, so that address is predictable. It then gives the path to the device from one of those stable bus numbers. For instance in this topology:

  -[0000:00]-+-00.0
             +-01.0-[01]--+-00.0
             |            \-00.1

Rather than specifying 0000:01:00.0 it would be [0000:00],01.0,00.0. Therefore if the OS renumbers the subordinate bus behind 01.0, the path is still valid. The support issues induced by needing to come up with something like that are probably greater than the rare occasion that device addresses change.
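Incidentally, sysfs already exposes the bridge chain leading to a device, which gives a feel for the kind of topology-relative description meant here (addresses taken from the example above):

  $ readlink -f /sys/bus/pci/devices/0000:01:00.0
  /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0

The components of that path still embed bus numbers, though, so it only illustrates the shape of the problem rather than solving it.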
It seems like this quickly devolves into a question of whether the kernel is even the right place to do this, if you have a properly modular kernel then userspace really has the control of how modules are loaded and can intervene to exclude certain devices. The kink is that we need to do that early in boot, which circles right back around to the initramfs scripts, but afaik those are generally managed by each distro with only loose similarities, so we'd need to hope that we contribute something that becomes a defacto standard.
Yeah, once you get in initramfs territory you can hardly count on standardization. IIRC some distros even provide *several* alternative tools for the job :)
Then again, it seems quite weird to me that some sort of mechanism for setting driver_override during early boot is not available... Wouldn't it make the most sense for it to be set before the kernel has had a chance to bind devices to the default drivers?
Of course. There are mechanisms available, the one I typically use is to exploit modprobe.d's install script option and load a small script into the initramfs which sets driver_override before a known conflicting driver is loaded. This falls into site specific hacking though.
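For the record, that kind of site-specific hack looks more or less like the sketch below; the driver name, device address and script path are all just examples, and both files have to be pulled into the initramfs to help during early boot:

  # /etc/modprobe.d/gpu-audio-override.conf
  install snd_hda_intel /usr/local/sbin/gpu-audio-override.sh; /sbin/modprobe --ignore-install snd_hda_intel

  # /usr/local/sbin/gpu-audio-override.sh
  #!/bin/sh
  # reserve the GPU's audio function before snd_hda_intel gets a chance to bind it
  echo vfio-pci > /sys/bus/pci/devices/0000:01:00.1/driver_override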
The problem is more that this is not an uncommon scenario. In one of my previous replies I listed all of the driver issues that I know about. In some cases we can't fix them because the driver is proprietary, in others we just don't have the bandwidth. Users have far more devices at their disposal to test than we do, so they're likely going to continue to run into these issues.
Summing up the issues you listed in that message, along with my thoughts:
* stealing audio device from host when running a game in a guest, giving it back afterwards - desktop use case - seems to be working fine with managed='yes' - if something goes wrong when reattaching to the host, the user can probably still go about his business; rebooting the host is going to be annoying but not a huge deal - limited number of in-kernel slots means you can only do this a number of times before it stops working; again, having to reboot the host once every so-many guest boots is probably not a deal breaker - managed='detach' is not helpful here, and neither is the proposal above
* NIC reserved for guest use - both server and desktop use case - may or may not handle going back and forth between the host and the guest - managed='detach' would still prevent most issues, assuming the host driver detaches cleanly - best would be if the host never touched the device - a solution like the one outlined above seems perfect - for oVirt / OpenStack, the management layer could take care of updating the bootloader configuration - for pure libvirt, that responsibility would fall on the user - should virt-manager etc. rather default to managed='detach' when assigning a NIC to a guest?
* GPU with binary driver - both server and desktop use case - doesn't handle dynamic unbind gracefully - best would be if the host never touched the device - a solution like the one outlined above seems perfect - for oVirt / OpenStack, the management layer could take care of updating the bootloader configuration - for pure libvirt, that responsibility would fall on the user - managed='detach' can prevent some failures, but not those that happen on host driver unbind
* Primary Intel GPU - desktop use case (any reason to do this for servers?) - will probably fail on reattach - managed='detach' would prevent host crashes and the like - must assume the user has some other way to connect to the host, because at some point they're going to be unable to use the GPU from the host anyway
Have I missed anything?
I think the only thing missing is that libvirt sort of becomes a natural management point for an assigned device because it's specified with some management characteristics. Users set managed='yes', assuming libvirt will do the right thing
People who assume that managed='yes' will always do the right thing are probably going to be disappointed no matter what :)
and because the managed='no' path requires them to come up with their own solution. So maybe we just need to create some framework for users to be able to take that managed='no' path without feeling like they're stepping off a cliff into their own ad hoc management scripts.
That's what I'm thinking.
If, as argued above, the "proper" solution is in most cases to prevent devices from being bound to the host driver in the first place, then libvirt enters the picture way too late to be able to take care of it.
If we can figure out a way to give users the ability to mark a bunch of devices as "reserved for assignment" and have early boot set driver_override to vfio-pci for us, we'll make life easier for people currently relying on pci-stub.ids as well.
The simple stuff, that already works fine today with managed='yes', will of course keep working.
Does this sound sensible / doable?
Sure, it's "just" a matter of knowing what package to start hacking on and proposing patches to the downstreams. We could effectively also duplicate the functionality of managed='detach' with libvirt hook scripts, but they also lack any sort of standard framework. Last I looked they still only called out to a monolithic script, rather than making site local modifications easy with some sort of libvirt-hooks.d directory. Thanks, Alex

On Tue, Mar 15, 2016 at 02:21:35PM -0400, Laine Stump wrote:
On 03/15/2016 01:00 PM, Daniel P. Berrange wrote:
Suggested by Alex Williamson.
If you plan to assign a GPU to a virtual machine, but that GPU happens to be the host system console, you likely want it to start out using the host driver (so that boot messages/etc will be displayed), then later have the host driver replaced with vfio-pci for assignment to the virtual machine.
However, in at least some cases (e.g. Intel i915) once the device has been detached from the host driver and attached to vfio-pci, attempts to reattach to the host driver only lead to "grief" (ask Alex for details). This means that simply using "managed='yes'" in libvirt won't work.
And if you set "managed='no'" in libvirt then either you have to manually run virsh nodedev-detach prior to the first start of the guest, or you have to have a management application intelligent enough to know that it should detach from the host driver, but never reattach to it.
This patch makes it simple/automatic to deal with such a case - it adds a third "managed" mode for assigned PCI devices, called "detach". It will detach ("unbind" in driver parlance) the device from the host driver prior to assigning it to the guest, but when the guest is finished with the device, will leave it bound to vfio-pci. This allows re-using the device for another guest, without requiring initial out-of-band intervention to unbind the host driver. You say that managed=yes causes pain upon re-attachment and that apps should use managed=detach to avoid it, but how do management apps know which devices are going to cause pain ? Libvirt isn't
providing any info on whether a particular device id needs to use managed=yes vs managed=detach, and we don't want to be asking the user to choose between modes in openstack/ovirt IMHO. I think that's a fundamental problem with inventing a new value for managed here.
My suspicion is that in many/most cases users don't actually need for the device to be re-bound to the host driver after the guest is finished with it, because they're only going to use the device to assign to a different guest anyway. But because managed='yes' is what's supplied and is the easiest way to get it setup for assignment to a guest, that's what they use.
As a matter of fact, all this extra churn of changing the driver back and forth for devices that are only actually used when they're bound to vfio-pci just wastes time, and makes it more likely that libvirt and its users will reveal and get caught up in the effects of some strange kernel driver loading/unloading bug (there was recently a bug reported like this; unfortunately the BZ record had customer info in it, so it's not publicly accessible :-( )
So beyond making this behavior available only when absolutely necessary, I think it is useful in other cases, at the user's discretion (and as I implied above, I think that if they understood the function and the tradeoffs, most people would choose to use managed='detach' rather than managed='yes')
IIUC, in managed=yes mode we explicitly track whether the device was originally attached to a host device driver. ie we only re-attach the device to the host when guest shuts down, if it was attached to the host at guest startup. We already have a virNodeDeviceDetach() API that can be used to detach a device from the host driver explicitly. So applications can in fact already achieve what you describe in terms of managed=detach, by simply calling virNodeDeviceDetach() prior to starting the guest with cold plugged PCI devices / hotplugging the PCI device. IOW, even if we think applications should be using managed=detach, they can already do so via existing libvirt APIs. Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On Thu, 17 Mar 2016 17:32:08 +0000 "Daniel P. Berrange" <berrange@redhat.com> wrote:
On Tue, Mar 15, 2016 at 02:21:35PM -0400, Laine Stump wrote:
On 03/15/2016 01:00 PM, Daniel P. Berrange wrote:
Suggested by Alex Williamson.
If you plan to assign a GPU to a virtual machine, but that GPU happens to be the host system console, you likely want it to start out using the host driver (so that boot messages/etc will be displayed), then later have the host driver replaced with vfio-pci for assignment to the virtual machine.
However, in at least some cases (e.g. Intel i915) once the device has been detached from the host driver and attached to vfio-pci, attempts to reattach to the host driver only lead to "grief" (ask Alex for details). This means that simply using "managed='yes'" in libvirt won't work.
And if you set "managed='no'" in libvirt then either you have to manually run virsh nodedev-detach prior to the first start of the guest, or you have to have a management application intelligent enough to know that it should detach from the host driver, but never reattach to it.
This patch makes it simple/automatic to deal with such a case - it adds a third "managed" mode for assigned PCI devices, called "detach". It will detach ("unbind" in driver parlance) the device from the host driver prior to assigning it to the guest, but when the guest is finished with the device, will leave it bound to vfio-pci. This allows re-using the device for another guest, without requiring initial out-of-band intervention to unbind the host driver. You say that managed=yes causes pain upon re-attachment and that apps should use managed=detach to avoid it, but how do management apps know which devices are going to cause pain ? Libvirt isn't
providing any info on whether a particular device id needs to use managed=yes vs managed=detach, and we don't want to be asking the user to choose between modes in openstack/ovirt IMHO. I think that's a fundamental problem with inventing a new value for managed here.
My suspicion is that in many/most cases users don't actually need for the device to be re-bound to the host driver after the guest is finished with it, because they're only going to use the device to assign to a different guest anyway. But because managed='yes' is what's supplied and is the easiest way to get it setup for assignment to a guest, that's what they use.
As a matter of fact, all this extra churn of changing the driver back and forth for devices that are only actually used when they're bound to vfio-pci just wastes time, and makes it more likely that libvirt and its users will reveal and get caught up in the effects of some strange kernel driver loading/unloading bug (there was recently a bug reported like this; unfortunately the BZ record had customer info in it, so it's not publicly accessible :-( )
So beyond making this behavior available only when absolutely necessary, I think it is useful in other cases, at the user's discretion (and as I implied above, I think that if they understood the function and the tradeoffs, most people would choose to use managed='detach' rather than managed='yes')
IIUC, in managed=yes mode we explicitly track whether the device was originally attached to a host device driver. ie we only re-attach the device to the host when guest shuts down, if it was attached to the host at guest startup.
We already have a virNodeDeviceDetach() API that can be used to detach a device from the host driver explicitly.
So applications can in fact already achieve what you describe in terms of managed=detach, by simply calling virNodeDeviceDetach() prior to starting the guest with cold plugged PCI devices / hotplugging the PCI device.
IOW, even if we think applications should be using managed=detach, they can already do so via existing libvirt APIs.
Agreed, that was never in doubt, but now you've required a management step before the VM can be started. We're basically requiring users to write their own scripting or modify kernel commandlines simply to avoid libvirt trying to rebind a device at shutdown. We're leaving it as the users' problem how to handle autostart VMs that prefer this behavior. It can already be done; the question is whether it's worthwhile to provide an easier path for doing it. Thanks, Alex

On Thu, Mar 17, 2016 at 11:52:14AM -0600, Alex Williamson wrote:
On Thu, 17 Mar 2016 17:32:08 +0000 "Daniel P. Berrange" <berrange@redhat.com> wrote:
On Tue, Mar 15, 2016 at 02:21:35PM -0400, Laine Stump wrote:
On 03/15/2016 01:00 PM, Daniel P. Berrange wrote:
Suggested by Alex Williamson.
If you plan to assign a GPU to a virtual machine, but that GPU happens to be the host system console, you likely want it to start out using the host driver (so that boot messages/etc will be displayed), then later have the host driver replaced with vfio-pci for assignment to the virtual machine.
However, in at least some cases (e.g. Intel i915) once the device has been detached from the host driver and attached to vfio-pci, attempts to reattach to the host driver only lead to "grief" (ask Alex for details). This means that simply using "managed='yes'" in libvirt won't work.
And if you set "managed='no'" in libvirt then either you have to manually run virsh nodedev-detach prior to the first start of the guest, or you have to have a management application intelligent enough to know that it should detach from the host driver, but never reattach to it.
This patch makes it simple/automatic to deal with such a case - it adds a third "managed" mode for assigned PCI devices, called "detach". It will detach ("unbind" in driver parlance) the device from the host driver prior to assigning it to the guest, but when the guest is finished with the device, will leave it bound to vfio-pci. This allows re-using the device for another guest, without requiring initial out-of-band intervention to unbind the host driver. You say that managed=yes causes pain upon re-attachment and that apps should use managed=detach to avoid it, but how do management apps know which devices are going to cause pain ? Libvirt isn't
providing any info on whether a particular device id needs to use managed=yes vs managed=detach, and we don't want to be asking the user to choose between modes in openstack/ovirt IMHO. I think that's a fundamental problem with inventing a new value for managed here.
My suspicion is that in many/most cases users don't actually need for the device to be re-bound to the host driver after the guest is finished with it, because they're only going to use the device to assign to a different guest anyway. But because managed='yes' is what's supplied and is the easiest way to get it setup for assignment to a guest, that's what they use.
As a matter of fact, all this extra churn of changing the driver back and forth for devices that are only actually used when they're bound to vfio-pci just wastes time, and makes it more likely that libvirt and its users will reveal and get caught up in the effects of some strange kernel driver loading/unloading bug (there was recently a bug reported like this; unfortunately the BZ record had customer info in it, so it's not publicly accessible :-( )
So beyond making this behavior available only when absolutely necessary, I think it is useful in other cases, at the user's discretion (and as I implied above, I think that if they understood the function and the tradeoffs, most people would choose to use managed='detach' rather than managed='yes')
IIUC, in managed=yes mode we explicitly track whether the device was originally attached to a host device driver. ie we only re-attach the device to the host when guest shuts down, if it was attached to the host at guest startup.
We already have a virNodeDeviceDetach() API that can be used to detach a device from the host driver explicitly.
So applications can in fact already achieve what you describe in terms of managed=detach, by simply calling virNodeDeviceDetach() prior to starting the guest with cold plugged PCI devices / hotplugging the PCI device.
IOW, even if we think applications should be using managed=detach, they can already do so via existing libvirt APIs.
Agreed, that was never in doubt, but now you've required a management step before the VM can be started. We're basically requiring users to write their own scripting or modify kernel commandlines simply to avoid libvirt trying to rebind a device at shutdown. We're leaving it as the users' problem how to handle autostart VMs that prefer this behavior. It can already be done, the question is whether it's worthwhile to make this easier path to do it. Thanks,
I don't think it is a significant burden really. Apps which want this blacklisted forever likely want to set up the modprobe blacklist anyway to stop the initial bind at boot up and instead permanently reserve the device. This stops the device being used at startup - eg if we have a bunch of NICs to be given to guests, you don't want the host OS to automatically configure them and give them IP addresses on the host before we start guests. So pre-reserving devices at the host OS level is really what you want to do with data center / cloud management apps like oVirt / OpenStack at least. They could easily use the virNodeDeviceDetach API at the time they decide to assign a device to a guest though. For ad-hoc usage such as with desktop virt, I think users would typically want to have PCI devices re-assigned to the host at shutdown for the most part. If apps change to use managed=detach then they will only work with new libvirt, whereas if they change to use the virNodeDeviceDetach() API they'll work with all historic versions of libvirt that support PCI assignment. So I really don't see a compelling reason to introduce a managed=detach option explicitly. Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
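Spelled out with virsh, the flow described above amounts to the following; the node device and guest names are examples, and the guest XML would carry managed='no':

  # one-time (or boot-time) reservation, instead of relying on managed='yes'
  virsh nodedev-detach pci_0000_02_00_0 --driver vfio

  # start the guest as often as needed; nothing is rebound at shutdown
  virsh start guest-with-nic

  # only if the host should ever get the device back
  virsh nodedev-reattach pci_0000_02_00_0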

On Thu, 17 Mar 2016 17:59:53 +0000 "Daniel P. Berrange" <berrange@redhat.com> wrote:
On Thu, Mar 17, 2016 at 11:52:14AM -0600, Alex Williamson wrote:
On Thu, 17 Mar 2016 17:32:08 +0000 "Daniel P. Berrange" <berrange@redhat.com> wrote:
On Tue, Mar 15, 2016 at 02:21:35PM -0400, Laine Stump wrote:
On 03/15/2016 01:00 PM, Daniel P. Berrange wrote:
Suggested by Alex Williamson.
If you plan to assign a GPU to a virtual machine, but that GPU happens to be the host system console, you likely want it to start out using the host driver (so that boot messages/etc will be displayed), then later have the host driver replaced with vfio-pci for assignment to the virtual machine.
However, in at least some cases (e.g. Intel i915) once the device has been detached from the host driver and attached to vfio-pci, attempts to reattach to the host driver only lead to "grief" (ask Alex for details). This means that simply using "managed='yes'" in libvirt won't work.
And if you set "managed='no'" in libvirt then either you have to manually run virsh nodedev-detach prior to the first start of the guest, or you have to have a management application intelligent enough to know that it should detach from the host driver, but never reattach to it.
This patch makes it simple/automatic to deal with such a case - it adds a third "managed" mode for assigned PCI devices, called "detach". It will detach ("unbind" in driver parlance) the device from the host driver prior to assigning it to the guest, but when the guest is finished with the device, will leave it bound to vfio-pci. This allows re-using the device for another guest, without requiring initial out-of-band intervention to unbind the host driver. You say that managed=yes causes pain upon re-attachment and that apps should use managed=detach to avoid it, but how do management apps know which devices are going to cause pain ? Libvirt isn't
providing any info on whether a particular device id needs to use managed=yes vs managed=detach, and we don't want to be asking the user to choose between modes in openstack/ovirt IMHO. I think that's a fundamental problem with inventing a new value for managed here.
My suspicion is that in many/most cases users don't actually need for the device to be re-bound to the host driver after the guest is finished with it, because they're only going to use the device to assign to a different guest anyway. But because managed='yes' is what's supplied and is the easiest way to get it setup for assignment to a guest, that's what they use.
As a matter of fact, all this extra churn of changing the driver back and forth for devices that are only actually used when they're bound to vfio-pci just wastes time, and makes it more likely that libvirt and its users will reveal and get caught up in the effects of some strange kernel driver loading/unloading bug (there was recently a bug reported like this; unfortunately the BZ record had customer info in it, so it's not publicly accessible :-( )
So beyond making this behavior available only when absolutely necessary, I think it is useful in other cases, at the user's discretion (and as I implied above, I think that if they understood the function and the tradeoffs, most people would choose to use managed='detach' rather than managed='yes')
IIUC, in managed=yes mode we explicitly track whether the device was originally attached to a host device driver. ie we only re-attach the device to the host when guest shuts down, if it was attached to the host at guest startup.
We already have a virNodeDeviceDetach() API that can be used to detach a device from the host driver explicitly.
So applications can in fact already achieve what you describe in terms of managed=detach, by simply calling virNodeDeviceDetach() prior to starting the guest with cold plugged PCI devices / hotplugging the PCI device.
IOW, even if we think applications should be using managed=detach, they can already do so via existing libvirt APIs.
Agreed, that was never in doubt, but now you've required a management step before the VM can be started. We're basically requiring users to write their own scripting or modify kernel commandlines simply to avoid libvirt trying to rebind a device at shutdown. We're leaving it as the users' problem how to handle autostart VMs that prefer this behavior. It can already be done, the question is whether it's worthwhile to make this easier path to do it. Thanks,
I don't think it is a significant burden really. Apps which want this blacklisted forever likely want to set up the modprobe blacklist anyway to stop the initial bind at boot up and instead permanently reserve the device. This stops the device being used at startup - eg if we have a bunch of NICs to be given to guests, you don't want the host OS to automatically configure them and give them IP addresses on the host before we start guests. So pre-reserving devices at the host OS level is really what you want to do with data center / cloud management apps like oVirt / OpenStack at least. They could easily use the virNodeDeviceDetach API at the time they decide to assign a device to a guest though.
modprobe blacklist assumes that all devices managed by a given driver are reserved for VM use. That's very often not the case. Even with SR-IOV VFs, several vendors use the same driver for PF and VF, so that's just a poor solution. For GPU assignment we often recommend using pci-stub.ids on the kernel commandline to pre-load the pci-stub driver with the PCI vendor and device IDs to claim, so as to prevent host drivers from attaching, but that also assumes that you want to use everything matching those IDs for a VM, which users will quickly find fault with. Additionally, using either solution assumes that the device will be left entirely alone otherwise, which is also not true. If I blacklist i915 or use pci-stub.ids to make pci-stub claim it, then efifb or vesafb is more than happy to make use of it, so it's actually cleaner to let i915 grab the device and unbind it when ready. And of course there's the issue of assuming that the device can go without drivers, which may leave your user running a headless system. This is really not as simplistic an issue as it may seem. Thanks, Alex
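A quick way to check what, if anything, has actually claimed a given device at any point in this dance (the address and output below are illustrative only):

  $ lspci -nnk -s 01:00.1
  01:00.1 Audio device [0403]: NVIDIA Corporation GM204 High Definition Audio Controller [10de:0fbb]
          Kernel driver in use: snd_hda_intel
          Kernel modules: snd_hda_intel

Something like efifb scribbling on a framebuffer will not show up as the bound driver here, which is part of what makes the "just keep host drivers away" approach messier than it sounds.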

On Thu, Mar 17, 2016 at 12:18:49PM -0600, Alex Williamson wrote:
On Thu, 17 Mar 2016 17:59:53 +0000 "Daniel P. Berrange" <berrange@redhat.com> wrote:
I don't think it is a significant burden really. Apps which want this blacklisted forever likely want to setup the modprobe blacklist anyway to stop the initial bind at boot up and instead permanently reserve the device. This stops the device being used at startup - eg if we have a bunch of NICs to be given to guests, you don't want the host OS to automatically configure them and give them IP addresses on the host before we start guests. So pre-reserving devices at the host OS level is really want you want todo with data center / cloud management apps like oVirt / OpenStack at least. They could easily use the virNodeDeviceDetach API at the time they decide to assign a device to a guest though.
modprobe blacklist assumes that all devices managed by a given driver are reserved for VM use. That's very often not the case. Even with SR-IOV VFs, several vendors use the same driver for PF and VF, so that's just a poor solution. For GPU assignment we often recommend using pci-stub.ids on the kernel commandline to pre-load the pci-stub driver with PCI vendor and device IDs to claim to prevent host drivers from attaching, but that also assumes that you want to use everything matching those IDs for a VM, which users will quickly find fault with. Additionally, using either solution assumes that the device will be left entirely alone otherwise, which is also not true. If I blacklist i915 or using pci-stub.ids to make pci-stub claim it, then efifb or vesafb is more than happy to make use of it, so it's actually cleaner to let i915 grab the device and unbind it when ready. And of course the issue of assuming that the device can go without drivers, which may make your user run a headless system. This is really not the simplistic issue that it may seem. Thanks,
The issues you describe point towards the need for a better blacklisting facility at the OS level IMHO. eg a way to tell the kernel module to not automatically bind drivers for devices with a particular device ID. Combined with the virNodeDeviceDetach() API usage, that still feels like a better solution than adding a managed=detach flag to libvirt. Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On 03/17/2016 02:32 PM, Daniel P. Berrange wrote:
On Thu, Mar 17, 2016 at 12:18:49PM -0600, Alex Williamson wrote:
On Thu, 17 Mar 2016 17:59:53 +0000 "Daniel P. Berrange" <berrange@redhat.com> wrote:
I don't think it is a significant burden really. Apps which want this blacklisted forever likely want to setup the modprobe blacklist anyway to stop the initial bind at boot up and instead permanently reserve the device. This stops the device being used at startup - eg if we have a bunch of NICs to be given to guests, you don't want the host OS to automatically configure them and give them IP addresses on the host before we start guests. So pre-reserving devices at the host OS level is really want you want todo with data center / cloud management apps like oVirt / OpenStack at least. They could easily use the virNodeDeviceDetach API at the time they decide to assign a device to a guest though. modprobe blacklist assumes that all devices managed by a given driver are reserved for VM use. That's very often not the case. Even with SR-IOV VFs, several vendors use the same driver for PF and VF, so that's just a poor solution. For GPU assignment we often recommend using pci-stub.ids on the kernel commandline to pre-load the pci-stub driver with PCI vendor and device IDs to claim to prevent host drivers from attaching, but that also assumes that you want to use everything matching those IDs for a VM, which users will quickly find fault with. Additionally, using either solution assumes that the device will be left entirely alone otherwise, which is also not true. If I blacklist i915 or using pci-stub.ids to make pci-stub claim it, then efifb or vesafb is more than happy to make use of it, so it's actually cleaner to let i915 grab the device and unbind it when ready. And of course the issue of assuming that the device can go without drivers, which may make your user run a headless system. This is really not the simplistic issue that it may seem. Thanks, The issues you describe point towards the need for a better blacklisting facility at the OS level IMHO. eg a way to tell the kernel module to not automatically bind drivers for devices with a particular device ID.
To paraphrase an old saying: this is a nail, and libvirt is my hammer :-) My first response to Alex when he asked about this feature was to ask if he couldn't configure it somehow outside libvirt, and his response (similar to above) made it fairly plain that you really can't, at least not without adding custom startup scripts. Since the barrier to entry for the kernel is much higher than for libvirt, this seems like a way to actually get something that works.
Combined with the virNodeDeviceDetach() API usage, that still feels like a better solution than adding a managed=detach flag to libvirt.
Again, assuming that there is a higher level management application that is calling libvirt, maybe so. For anyone simply using virsh to start/stop guests, it's a bit more cumbersome (but you could claim that isn't a typical user, so that's okay). But the same thing could be said about managed='yes' - why have that at all when it could be handled purely with libvirt's nodedevice API? (and would eliminate the ambiguity of error situations where, e.g. a device had been successfully assigned to a guest, then detached from the guest, but libvirt failed to rebind the device to the host driver - the device *has* been successfully detached from the guest though, so should we report success or failure?). (Seriously, I think we should consider being more discouraging about use of managed='yes').

On Thu, Mar 17, 2016 at 05:37:28PM -0400, Laine Stump wrote:
On 03/17/2016 02:32 PM, Daniel P. Berrange wrote:
On Thu, Mar 17, 2016 at 12:18:49PM -0600, Alex Williamson wrote:
On Thu, 17 Mar 2016 17:59:53 +0000 "Daniel P. Berrange" <berrange@redhat.com> wrote:
I don't think it is a significant burden really. Apps which want this blacklisted forever likely want to setup the modprobe blacklist anyway to stop the initial bind at boot up and instead permanently reserve the device. This stops the device being used at startup - eg if we have a bunch of NICs to be given to guests, you don't want the host OS to automatically configure them and give them IP addresses on the host before we start guests. So pre-reserving devices at the host OS level is really want you want todo with data center / cloud management apps like oVirt / OpenStack at least. They could easily use the virNodeDeviceDetach API at the time they decide to assign a device to a guest though. modprobe blacklist assumes that all devices managed by a given driver are reserved for VM use. That's very often not the case. Even with SR-IOV VFs, several vendors use the same driver for PF and VF, so that's just a poor solution. For GPU assignment we often recommend using pci-stub.ids on the kernel commandline to pre-load the pci-stub driver with PCI vendor and device IDs to claim to prevent host drivers from attaching, but that also assumes that you want to use everything matching those IDs for a VM, which users will quickly find fault with. Additionally, using either solution assumes that the device will be left entirely alone otherwise, which is also not true. If I blacklist i915 or using pci-stub.ids to make pci-stub claim it, then efifb or vesafb is more than happy to make use of it, so it's actually cleaner to let i915 grab the device and unbind it when ready. And of course the issue of assuming that the device can go without drivers, which may make your user run a headless system. This is really not the simplistic issue that it may seem. Thanks, The issues you describe point towards the need for a better blacklisting facility at the OS level IMHO. eg a way to tell the kernel module to not automatically bind drivers for devices with a particular device ID.
To paraphrase an old saying: this is a nail, and libvirt is my hammer :-) My first response to Alex when he asked about this feature was to ask if he couldn't configure it somehow outside libvirt, and his response (similar to above) made it fairly plain that you really can't, at least not without adding custom startup scripts. Since the barrier to entry for the kernel is much higher than for libvirt, this seems like a way to actually get something that works.
Looking at this from the POV of openstack, we would want to mark devices as detached from the host independently of assigning them to guests, since we build up a whitelist of assignable devices at initial startup. So I'd expect OpenStack would simply call virNodeDeviceDetach() for each device in its whitelist at startup. If we consider how we deal with guest domain hotplug, we have the concept of VIR_DOMAIN_MODIFY_LIVE vs CONFIG. If there were a way to blacklist devices based on address, then it would be natural to extend libvirt so that you could do virNodeDeviceDetach(dev, VIR_DOMAIN_MODIFY_CONFIG); to ensure it was permanently disabled from attachment to host drivers, instead of just for the current point in time. This is using libvirt as a hammer, but we still need underlying OS support in some manner to actually implement it. Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On 18/03/16 10:15 +0000, Daniel P. Berrange wrote:
On Thu, Mar 17, 2016 at 05:37:28PM -0400, Laine Stump wrote:
On 03/17/2016 02:32 PM, Daniel P. Berrange wrote:
On Thu, Mar 17, 2016 at 12:18:49PM -0600, Alex Williamson wrote:
On Thu, 17 Mar 2016 17:59:53 +0000 "Daniel P. Berrange" <berrange@redhat.com> wrote:
I don't think it is a significant burden really. Apps which want this blacklisted forever likely want to setup the modprobe blacklist anyway to stop the initial bind at boot up and instead permanently reserve the device. This stops the device being used at startup - eg if we have a bunch of NICs to be given to guests, you don't want the host OS to automatically configure them and give them IP addresses on the host before we start guests. So pre-reserving devices at the host OS level is really want you want todo with data center / cloud management apps like oVirt / OpenStack at least. They could easily use the virNodeDeviceDetach API at the time they decide to assign a device to a guest though. modprobe blacklist assumes that all devices managed by a given driver are reserved for VM use. That's very often not the case. Even with SR-IOV VFs, several vendors use the same driver for PF and VF, so that's just a poor solution. For GPU assignment we often recommend using pci-stub.ids on the kernel commandline to pre-load the pci-stub driver with PCI vendor and device IDs to claim to prevent host drivers from attaching, but that also assumes that you want to use everything matching those IDs for a VM, which users will quickly find fault with. Additionally, using either solution assumes that the device will be left entirely alone otherwise, which is also not true. If I blacklist i915 or using pci-stub.ids to make pci-stub claim it, then efifb or vesafb is more than happy to make use of it, so it's actually cleaner to let i915 grab the device and unbind it when ready. And of course the issue of assuming that the device can go without drivers, which may make your user run a headless system. This is really not the simplistic issue that it may seem. Thanks, The issues you describe point towards the need for a better blacklisting facility at the OS level IMHO. eg a way to tell the kernel module to not automatically bind drivers for devices with a particular device ID.
To paraphrase an old saying: this is a nail, and libvirt is my hammer :-) My first response to Alex when he asked about this feature was to ask if he couldn't configure it somehow outside libvirt, and his response (similar to above) made it fairly plain that you really can't, at least not without adding custom startup scripts. Since the barrier to entry for the kernel is much higher than for libvirt, this seems like a way to actually get something that works.
Looking at this from the POV of OpenStack, we would want to mark devices as detached from the host independently of assigning them to guests, since we build up a whitelist of assignable devices at initial startup.
So I'd expect OpenStack would simply call virNodeDeviceDetach() for each device in its whitelist at startup. If we consider how we deal with guest domain hotplug, we have the concept of VIR_DOMAIN_MODIFY_LIVE vs CONFIG. If there were a way to blacklist devices based on address, then it would be natural to extend libvirt so that you could do
virNodeDeviceDetach(dev, VIR_DOMAIN_MODIFY_CONFIG);
to ensure it was permanently disabled from attachment to host drivers, instead of just for the current point in time. This is using libvirt as a hammer, but we still need underlying OS support in some manner to actually implement it.
Joining this discussion from oVirt's perspective. oVirt uses managed="no" mode. What we do now is call virNodeDeviceDetachFlags() when a VM is started and virNodeDeviceReAttach() when it's destroyed. This is because we handle permissions of /dev/vfio/* ourselves, and it allows for finer granularity (e.g. skipping PCI_HEADER_TYPE 0 devices or hot(un)plugging a device).

To deal with GPUs or any other device with a problematic driver, we currently advise users to use pci-stub.ids, and we plan to make this functionality available from our UI. We have to modify the kernel command line anyway, as users' modifications of it are not supported and we need to enable the IOMMU and possibly expose workarounds for machines with issues (allow_unsafe_interrupts, pci=realloc for SR-IOV).

After reading this thread, we are considering not explicitly reattaching the devices. If a user wants to reclaim a device, it will be possible to reattach it explicitly from the UI. Given the system configuration mentioned above, I don't believe there is a need for managed="detach" in management applications. It should be the management application's developer who decides whether the devices will be detached at boot or at VM start, and the user's decision whether they need to reattach them later. That being said, I am not sure whether people using libvirt directly expect to have devices "reserved" for assignment.

mpolednik
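For clarity, a rough sketch of the lifecycle described above (this is not oVirt code; the helper names are invented), with the reattach on VM teardown made optional so a problematic device can simply stay bound to vfio-pci:

    #include <libvirt/libvirt.h>

    /* Called when a VM that uses the device is about to start. */
    static int vm_starting(virConnectPtr conn, const char *devname)
    {
        virNodeDevicePtr dev = virNodeDeviceLookupByName(conn, devname);
        int ret = -1;

        if (dev) {
            ret = virNodeDeviceDetachFlags(dev, "vfio", 0);
            virNodeDeviceFree(dev);
        }
        return ret;
    }

    /* Called when the VM is destroyed; reattach only if explicitly asked. */
    static int vm_destroyed(virConnectPtr conn, const char *devname, int reattach)
    {
        virNodeDevicePtr dev;
        int ret;

        if (!reattach)      /* problematic driver: leave it bound to vfio-pci */
            return 0;

        if (!(dev = virNodeDeviceLookupByName(conn, devname)))
            return -1;
        ret = virNodeDeviceReAttach(dev);
        virNodeDeviceFree(dev);
        return ret;
    }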

Hi,
For ad-hoc usage such as with desktop virt, I think users would typically want to have PCI devices re-assigned to the host at shutdown for the most part.
Unfortunately that tends not to work very well for some kinds of devices. I have an explicit "managed=no" in my configs, otherwise I risk the igd (intel gfx) hanging the system when it is assigned back to the host.
If apps change to use managed=detach then they will only work with new libvirt, whereas if they change to use the virNodeDeviceDetach() API they'll work with all historic versions of libvirt that support PCI assignment.
My "app" is "virsh edit" ;) So, I have to do an explicit "virsh nodedev-detach" before I can start the guest. Yes, it's annonying. But guest failing to start is better than host crashing on guest shutdown in case I forget to do it. And, no, I don't want do that in the boot scripts somewhere because I might want to use the igd on the host. "managed=detach" would be a very nice solution to that problem. cheers, Gerd

On Fri, 2016-03-18 at 12:23 +0100, Gerd Hoffmann wrote:
For ad-hoc usage such as with desktop virt, I think users would typically want to have PCI devices re-assigned to the host at shutdown for the most part.

Unfortunately that tends not to work very well for some kinds of devices. I have an explicit "managed=no" in my configs, otherwise I risk the igd (intel gfx) hanging the system when it is assigned back to the host.

If apps change to use managed=detach then they will only work with new libvirt, whereas if they change to use the virNodeDeviceDetach() API they'll work with all historic versions of libvirt that support PCI assignment.

My "app" is "virsh edit" ;) So, I have to do an explicit "virsh nodedev-detach" before I can start the guest. Yes, it's annoying. But a guest failing to start is better than the host crashing on guest shutdown in case I forget to do it. And, no, I don't want to do that in the boot scripts somewhere, because I might want to use the igd on the host. "managed=detach" would be a very nice solution to that problem.
So what do you do after shutting down the guest? Your host ends up having no usable GPU, so you have to access it using some other means (e.g. ssh) and reboot it, right?

My point is that if your devices are dedicated to guest assignment, you're better off if they're never bound to the host driver to begin with - if the facilities one can use to ensure that is the case are not flexible or user-friendly enough, we should improve them.

If you're instead working mostly on bare metal (e.g. your laptop) and you want to, every once in a while, assign your card reader or some other secondary peripheral to a guest, then managed='yes' should be what you're looking for. But, in that case, doing something like assigning your primary and only GPU to a guest is a behaviour that falls entirely into the "hacks" category, and libvirt should not grow code dedicated to supporting it. Setting the device to managed='no' and detaching it from the host manually, in this case, is the equivalent of saying "I really know what I'm doing" :)

Cheers.

-- Andrea Bolognani Software Engineer - Virtualization Team

Hi,
So what do you do after shutting down the guest? Your host ends up having no usable GPU, so you have to access it using some other means (e.g. ssh) and reboot it, right?
If I need a console, yes, I have to reboot. Usually I just start another guest though.
My point is that if your devices are dedicated to guest assignment, you're better off if they're never bound to the host driver to begin with - if the facilities one can use to ensure that is the case are not flexible or user-friendly enough, we should improve them.
Well, this is a development machine, so it switches roles depending on what I'm working on. So, no, it isn't igd device assignment all the time, sometimes I need the igd for the host machine to test stuff.
If you're instead working mostly on bare metal (e.g. your laptop) and you want to, every once in a while, assign your card reader or some other secondary peripheral to a guest, then managed='yes' should be what you're looking for.
Depends. I actually do that with a usb card reader now and then. With usb you don't have the option to permanently detach the device. And it is annoying at times, because assigning it back to the host implies hotplug events for the device and your desktop popping up with "Oh, and here I have a new device for you to use" - even if you don't want to use it on the host.
But, in that case, doing something like assigning your primary and only GPU to a guest is a behaviour that falls entirely into the "hacks" category, and libvirt should not grow code dedicated to supporting it.
/me goes write "declare as unsupported hack" on a yellow note sticker. I guess I'll find uses for that in the future ;) cheers, Gerd

On 03/14/2016 03:41 PM, Laine Stump wrote:
Suggested by Alex Williamson.
If you plan to assign a GPU to a virtual machine, but that GPU happens to be the host system console, you likely want it to start out using the host driver (so that boot messages/etc will be displayed), then later have the host driver replaced with vfio-pci for assignment to the virtual machine.
However, in at least some cases (e.g. Intel i915) once the device has been detached from the host driver and attached to vfio-pci, attempts to reattach to the host driver only lead to "grief" (ask Alex for details). This means that simply using "managed='yes'" in libvirt won't work.
And if you set "managed='no'" in libvirt then either you have to manually run virsh nodedev-detach prior to the first start of the guest, or you have to have a management application intelligent enough to know that it should detach from the host driver, but never reattach to it.
This patch makes it simple/automatic to deal with such a case - it adds a third "managed" mode for assigned PCI devices, called "detach". It will detach ("unbind" in driver parlance) the device from the host driver prior to assigning it to the guest, but when the guest is finished with the device, will leave it bound to vfio-pci. This allows re-using the device for another guest, without requiring initial out-of-band intervention to unbind the host driver. ---
I'm sending this with the "RFC" tag because I'm concerned it might be considered "feature creep" by some (although I think it makes at least as much sense as "managed='yes'") and also because, even once (if) it is ACKed, I wouldn't want to push it until abologna is finished hacking around with the driver bind/unbind code - he has enough grief to deal with without me causing a bunch of merge conflicts :-)
[...] Rather than burying this in one of the 3 other conversations that have been taking place, I'll respond at the top level. I have been following the conversation, but not at any great depth... Going to leave the "should we have managed='detach'" conversation alone (at least for now).

Not that I want to necessarily dip my toes into these shark-infested waters; however, one thing keeps gnawing at my fingers from hanging onto the don't-get-involved-in-this-one ledge ;-)... That one thing is that the problem, to me, is less libvirt's ability to manage whether the devices are or aren't detached from the host, but rather that lower layers (e.g. the kernel) aren't happy over the frequency of such requests. Thrashing for any system isn't fun, but it's a lot easier to tell someone else to stop doing it since it hurts when you do that. It's tough to find a happy medium between forcing the user to detach rather than letting libvirt manage it, and letting libvirt be the culprit causing angst for the lower layers.

So, would it work to have some intermediary handle this thrashing by creating some sort of "blob" that will accept responsibility for reattaching the device to the host "at some point in time" as long as no one else has requested to use it? Why not add an attribute (e.g. delay='n'), only valid when managed='yes', to the device, meaning that rather than immediately reattaching the device to the host when the guest is destroyed, libvirt will delay the reattach by 'n' seconds? That way someone who knows they're going to have a device used by multiple guests, thrashing heavily in the detach -> reattach -> detach -> reattach -> etc. loop, would be able to make use of an optimization of sorts that just places the device back in the inactive list (as if it were detached manually), then starts a thread that will reawaken when a timer fires to handle the reattach. The thread would be destroyed in the event that something comes along and uses the device (e.g. places it back into the active list).

The caveats that come quickly to mind are that devices that were being managed could be left on the inactive list if the daemon dies or is restarted, but I think that is detectable at restart so it may not be totally bad. Also, failure to reattach is left in some thread which has no one to message other than the libvirtd log files. Both would have to be noted with any description of this. Not all devices will want this delay logic, and I think it's been pointed out that there is a "known list" of them. In the long run it allows some control by a user to decide how much rope they'd like to have to hang themselves.

John

Not sure if Gerd, Alex, and Martin are regular libvir-list readers, so I did CC them just in case, so that it's easier for them to respond if they so desire since they were a part of the discussions in this thread.
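A rough, self-contained sketch of the delayed-reattach idea proposed above - this is not libvirt code, all names are invented, and a real implementation would live inside libvirtd's hostdev handling; it only illustrates the timer-plus-claim-flag mechanism:

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <unistd.h>

    typedef struct {
        const char *addr;       /* e.g. "0000:00:02.0" (made up) */
        unsigned int delay;     /* seconds to wait before reattaching */
        bool claimed;           /* set if another guest grabs the device */
        pthread_mutex_t lock;
    } PendingReattach;

    static void host_reattach(const char *addr)
    {
        /* stand-in for the real "rebind to host driver" operation */
        printf("reattaching %s to its host driver\n", addr);
    }

    static void *reattach_worker(void *opaque)
    {
        PendingReattach *p = opaque;

        sleep(p->delay);

        pthread_mutex_lock(&p->lock);
        if (!p->claimed)
            host_reattach(p->addr);   /* nobody reclaimed it: give it back */
        else
            printf("%s was claimed again, leaving it bound to vfio-pci\n",
                   p->addr);
        pthread_mutex_unlock(&p->lock);
        return NULL;
    }

    int main(void)
    {
        PendingReattach p = { "0000:00:02.0", 3, false,
                              PTHREAD_MUTEX_INITIALIZER };
        pthread_t tid;

        /* guest just released the device: start the delayed reattach */
        pthread_create(&tid, NULL, reattach_worker, &p);

        /* another guest requests the device before the timer fires */
        sleep(1);
        pthread_mutex_lock(&p.lock);
        p.claimed = true;
        pthread_mutex_unlock(&p.lock);

        pthread_join(tid, NULL);
        return 0;
    }

Build with: cc -pthread delay.c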

On 03/22/2016 07:03 PM, John Ferlan wrote:
On 03/14/2016 03:41 PM, Laine Stump wrote:
Suggested by Alex Williamson.
If you plan to assign a GPU to a virtual machine, but that GPU happens to be the host system console, you likely want it to start out using the host driver (so that boot messages/etc will be displayed), then later have the host driver replaced with vfio-pci for assignment to the virtual machine.
However, in at least some cases (e.g. Intel i915) once the device has been detached from the host driver and attached to vfio-pci, attempts to reattach to the host driver only lead to "grief" (ask Alex for details). This means that simply using "managed='yes'" in libvirt won't work.
And if you set "managed='no'" in libvirt then either you have to manually run virsh nodedev-detach prior to the first start of the guest, or you have to have a management application intelligent enough to know that it should detach from the host driver, but never reattach to it.
This patch makes it simple/automatic to deal with such a case - it adds a third "managed" mode for assigned PCI devices, called "detach". It will detach ("unbind" in driver parlance) the device from the host driver prior to assigning it to the guest, but when the guest is finished with the device, will leave it bound to vfio-pci. This allows re-using the device for another guest, without requiring initial out-of-band intervention to unbind the host driver. ---
I'm sending this with the "RFC" tag because I'm concerned it might be considered "feature creep" by some (although I think it makes at least as much sense as "managed='yes'") and also because, even once (if) it is ACKed, I wouldn't want to push it until abologna is finished hacking around with the driver bind/unbind code - he has enough grief to deal with without me causing a bunch of merge conflicts :-)
[...]
Rather than burying this in one of the 3 other conversations that have been taking place, I'll respond at the top level. I have been following the conversation, but not at any great depth... Going to leave the "should we have managed='detach'" conversation alone (at least for now).
Not that I want to necessarily dip my toes into these shark-infested waters; however, one thing keeps gnawing at my fingers from hanging onto the don't-get-involved-in-this-one ledge ;-)... That one thing is that the problem, to me, is less libvirt's ability to manage whether the devices are or aren't detached from the host, but rather that lower layers (e.g. the kernel) aren't happy over the frequency of such requests.
That is definitely one of the reasons we need to do it less. My conclusion from all this discussion has been:

1) lower layers are woefully lacking in ways to specify what driver to use for a specific device at boot time, making it essentially impossible, for example, to have devices that will only be used for vfio assignment to a guest be bound to the vfio-pci driver straight from boot time. Instead they must be bound to pci-stub at boot, then later re-bound to vfio-pci by "something, who knows, shuddup and do it yerself!". And even that only works if you want *all* devices with the same vendor/device id to be treated in the exact same manner (and blacklisting a particular driver is just hopelessly useless);

2) (1) isn't necessarily because the people in the lower layers are being unhelpful jerks about it, it's because the way things are organized in the kernel and its boot process makes it really difficult to do.
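For reference, the later re-binding to vfio-pci boils down to a few sysfs writes. A stripped-down sketch of one way to do it - this is not libvirt's virpci.c, the device address is hypothetical, and it assumes the vfio-pci module is already loaded, a driver_override-capable kernel (3.16 or newer), and root privileges:

    #include <stdio.h>

    static int write_str(const char *path, const char *val)
    {
        FILE *f = fopen(path, "w");

        if (!f)
            return -1;
        if (fputs(val, f) < 0) {
            fclose(f);
            return -1;
        }
        return fclose(f) == 0 ? 0 : -1;
    }

    int main(void)
    {
        const char *addr = "0000:00:02.0";   /* hypothetical device address */
        char path[256];

        /* tell the PCI core which driver should claim the device next */
        snprintf(path, sizeof(path),
                 "/sys/bus/pci/devices/%s/driver_override", addr);
        if (write_str(path, "vfio-pci") < 0)
            return 1;

        /* detach from whatever host driver currently owns it */
        snprintf(path, sizeof(path),
                 "/sys/bus/pci/devices/%s/driver/unbind", addr);
        write_str(path, addr);

        /* re-probe: driver_override makes vfio-pci pick it up this time */
        return write_str("/sys/bus/pci/drivers_probe", addr) == 0 ? 0 : 1;
    }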
Thrashing for any system isn't fun, but it's a lot easier to tell someone else to stop doing it since it hurts when you do that. Tough to find a happy medium between force user to detach rather than let libvirt manage and letting libvirt be the culprit causing angst for the lower layers.
So, would it work to have some intermediary handle this thrashing by creating some sort of "blob" that will accept responsibility for reattaching the device to the host "at some point in time" as long as no one else has requested to use it?
Why not add an attribute (e.g. delay='n'), only valid when managed='yes', to the device, meaning that rather than immediately reattaching the device to the host when the guest is destroyed, libvirt will delay the reattach by 'n' seconds? That way someone who knows they're going to have a device used by multiple guests, thrashing heavily in the detach -> reattach -> detach -> reattach -> etc. loop, would be able to make use of an optimization of sorts that just places the device back in the inactive list (as if it were detached manually), then starts a thread that will reawaken when a timer fires to handle the reattach. The thread would be destroyed in the event that something comes along and uses the device (e.g. places it back into the active list).
This sounds dangerous for two reasons:

1) someone may expect that they'll be able to use the device on the host if it's not in use by a guest, and this will be true only at times that are undeterminable from the outside;

2) rather than causing a failure immediately when the guest terminates (or deletes the device), the failure will occur at some later time, which could serve to obscure its source.

It's an interesting idea from the "this would be cool to code" point of view, but I don't really see a practical use for it.
The caveats that come quickly to mind are that devices that were being managed could be left on the inactive list if the daemon dies or is restarted, but I think that is detectable at restart so it may not be totally bad. Also, failure to reattach is left in some thread which has no one to message other than the libvirtd log files. Both would have to be noted with any description of this.
Not all devices will want this delay logic and I think it's been pointed out that there is a "known list" of them. In the long run it allows some control by a user to decide how much rope they'd like to have to hang themselves.
John
Not sure if Gerd, Alex, and Martin are regular libvir-list readers, so I did CC them just in case, so that it's easier for them to respond if they so desire since they were a part of the discussions in this thread.
participants (7):
- Alex Williamson
- Andrea Bolognani
- Daniel P. Berrange
- Gerd Hoffmann
- John Ferlan
- Laine Stump
- Martin Polednik