[libvirt] [PATCH 0/5] Add device assignment related node device methods

Hi, The following patches implement three new node device methods that are needed for KVM PCI device assignment. The dettach method unbinds the device from its device driver and binds it to the pci-stub driver, if available. This ensures the device is not being used by the host. The reattach method undos this operation. It unbinds the device from pci-stub and attempts to re-bind it to the original device driver. This allows the device to be used by the host again. The reset method is intended to be used on a dettached device before it is used by the guest or host. This operation may in future affect other devices, so it is intended that all devices to be assigned to a guest are first dettached before any device is reset. A writeup of some more background details are available here: http://marc.info/?l=kvm&m=123454366317045 Cheers, Mark.

Add implementations of dettach, reattach and reset for PCI devices. Background to this code can be found here: http://marc.info/?l=kvm&m=123454366317045 Some notes: * pci-stub was first introduced in 2.6.29; if it's not available, then dettach can only unbind the device from the host driver * remove_id is queued up for 2.6.30; in it's absence reattach cannot re-bind the device to the original driver * If the device supports Function Level Reset, we just don't bother doing the reset - KVM will do this before and after a guest runs * Currently, if a reset would affect another device on the same bus, or another function on the same multi-function device, then we just refuse to do the reset. In future, we could allow the reset as long as it doesn't affect a device in use by the host or assigned to another guest * The device reset code is similar in concept to some of Xen's code Signed-off-by: Mark McLoughlin <markmc@redhat.com> --- ChangeLog | 10 + configure.in | 6 + po/POTFILES.in | 1 + src/Makefile.am | 1 + src/libvirt_private.syms | 8 + src/pci.c | 774 ++++++++++++++++++++++++++++++++++++++++++++++ src/pci.h | 42 +++ 7 files changed, 842 insertions(+), 0 deletions(-) create mode 100644 src/pci.c create mode 100644 src/pci.h diff --git a/ChangeLog b/ChangeLog index 79f91e4..c08c66d 100644 --- a/ChangeLog +++ b/ChangeLog @@ -1,3 +1,13 @@ +Tue Feb 24 21:55:03 GMT 2009 Mark McLoughlin <markmc@redhat.com> + + * src/pci.[ch]: add implementations of dettach, reattach + and reset implementations for PCI devices + + * configure.in: look for modprobe + + * src/Makefile.am, src/libvirt_private.syms, + po/POTFILES.in: administrivia + Tue Feb 24 14:55:28 GMT 2009 Mark McLoughlin <markmc@redhat.com> * tests/nodedevxml2xmltest.c: Add a test to check node diff --git a/configure.in b/configure.in index 72a64dd..a5b8208 100644 --- a/configure.in +++ b/configure.in @@ -128,6 +128,8 @@ AC_PATH_PROG([UDEVADM], [udevadm], [], [/sbin:/usr/sbin:/usr/local/sbin:$PATH]) AC_PATH_PROG([UDEVSETTLE], [udevsettle], [], [/sbin:/usr/sbin:/usr/local/sbin:$PATH]) +AC_PATH_PROG([MODPROBE], [modprobe], [], + [/sbin:/usr/sbin:/usr/local/sbin:$PATH]) AC_DEFINE_UNQUOTED([DNSMASQ],["$DNSMASQ"], [Location or name of the dnsmasq program]) @@ -141,6 +143,10 @@ if test -n "$UDEVSETTLE"; then AC_DEFINE_UNQUOTED([UDEVSETTLE],["$UDEVSETTLE"], [Location or name of the udevsettle program]) fi +if test -n "$MODPROBE"; then + AC_DEFINE_UNQUOTED([MODPROBE],["$MODPROBE"], + [Location or name of the modprobe program]) +fi dnl Specific dir for HTML output ? AC_ARG_WITH([html-dir], [AC_HELP_STRING([--with-html-dir=path], diff --git a/po/POTFILES.in b/po/POTFILES.in index 7461f93..237d462 100644 --- a/po/POTFILES.in +++ b/po/POTFILES.in @@ -18,6 +18,7 @@ src/node_device_conf.c src/nodeinfo.c src/openvz_conf.c src/openvz_driver.c +src/pci.c src/proxy_internal.c src/qemu_conf.c src/qemu_driver.c diff --git a/src/Makefile.am b/src/Makefile.am index 3a798d2..a636e8a 100644 --- a/src/Makefile.am +++ b/src/Makefile.am @@ -48,6 +48,7 @@ UTIL_SOURCES = \ iptables.c iptables.h \ logging.c logging.h \ memory.c memory.h \ + pci.c pci.h \ qparams.c qparams.h \ threads.c threads.h \ threads-pthread.h \ diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index fdda478..c8c08ae 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -229,6 +229,14 @@ virNodeDeviceObjUnlock; virNodeDeviceAssignDef; +# pci.h +pciGetDevice; +pciFreeDevice; +pciDettachDevice; +pciReAttachDevice; +pciResetDevice; + + # qparams.h qparam_get_query; qparam_query_parse; diff --git a/src/pci.c b/src/pci.c new file mode 100644 index 0000000..f32e296 --- /dev/null +++ b/src/pci.c @@ -0,0 +1,774 @@ +/* + * Copyright (C) 2009 Red Hat, Inc. + * + * This library is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * This library is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this library; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * Authors: + * Mark McLoughlin <markmc@redhat.com> + */ + +#include <config.h> + +#include "pci.h" + +#include <dirent.h> +#include <fcntl.h> +#include <inttypes.h> +#include <limits.h> +#include <stdio.h> +#include <string.h> +#include <sys/types.h> +#include <sys/stat.h> +#include <unistd.h> + +#include "logging.h" +#include "memory.h" +#include "util.h" +#include "virterror_internal.h" + +#define PCI_SYSFS "/sys/bus/pci/" +#define PCI_ID_LEN 10 /* "XXXX XXXX" */ +#define PCI_ADDR_LEN 13 /* "XXXX:XX:XX.X" */ + +struct _pciDevice { + virConnectPtr conn; + + unsigned vendor; + unsigned product; + unsigned domain; + unsigned bus; + unsigned slot; + unsigned function; + + char name[PCI_ADDR_LEN]; /* domain:bus:slot.function */ + char id[PCI_ID_LEN]; /* product vendor */ + char path[PATH_MAX]; + int fd; + + unsigned initted; + unsigned pcie_cap_pos; + unsigned pci_pm_cap_pos; + unsigned has_flr : 1; + unsigned has_pm_reset : 1; +}; + +/* For virReportOOMError() and virReportSystemError() */ +#define VIR_FROM_THIS VIR_FROM_NONE + +#define pciReportError(conn, code, fmt...) \ + virReportErrorHelper(conn, VIR_FROM_NONE, code, __FILE__, \ + __FUNCTION__, __LINE__, fmt) + +/* Specifications referenced in comments: + * PCI30 - PCI Local Bus Specification 3.0 + * PCIe20 - PCI Express Base Specification 2.0 + * BR12 - PCI-to-PCI Bridge Architecture Specification 1.2 + * PM12 - PCI Bus Power Management Interface Specification 1.2 + * ECN_AF - Advanced Capabilities for Conventional PCI ECN + */ + +/* Type 0 config space header length; PCI30 Section 6.1 Configuration Space Organization */ +#define PCI_CONF_LEN 0x100 +#define PCI_CONF_HEADER_LEN 0x40 + +/* PCI30 6.2.1 */ +#define PCI_HEADER_TYPE 0x0e /* Header type */ +#define PCI_HEADER_TYPE_BRIDGE 0x1 +#define PCI_HEADER_TYPE_MASK 0x7f +#define PCI_HEADER_TYPE_MULTI 0x80 + +/* PCI30 6.2.1 Device Identification */ +#define PCI_CLASS_DEVICE 0x0a /* Device class */ + +/* Class Code for bridge; PCI30 D.7 Base Class 06h */ +#define PCI_CLASS_BRIDGE_PCI 0x0604 + +/* PCI30 6.2.3 Device Status */ +#define PCI_STATUS 0x06 /* 16 bits */ +#define PCI_STATUS_CAP_LIST 0x10 /* Support Capability List */ + +/* PCI30 6.7 Capabilities List */ +#define PCI_CAPABILITY_LIST 0x34 /* Offset of first capability list entry */ + +/* PM12 3.2.1 Capability Identifier */ +#define PCI_CAP_ID_PM 0x01 /* Power Management */ +/* PCI30 H Capability IDs */ +#define PCI_CAP_ID_EXP 0x10 /* PCI Express */ +/* ECN_AF 6.x.1.1 Capability ID for AF */ +#define PCI_CAP_ID_AF 0x13 /* Advanced Features */ + +/* PCIe20 7.8.3 Device Capabilities Register (Offset 04h) */ +#define PCI_EXP_DEVCAP 0x4 /* Device capabilities */ +#define PCI_EXP_DEVCAP_FLR (1<<28) /* Function Level Reset */ + +/* Header type 1 BR12 3.2 PCI-to-PCI Bridge Configuration Space Header Format */ +#define PCI_PRIMARY_BUS 0x18 /* BR12 3.2.5.2 Primary bus number */ +#define PCI_SECONDARY_BUS 0x19 /* BR12 3.2.5.3 Secondary bus number */ +#define PCI_SUBORDINATE_BUS 0x1a /* BR12 3.2.5.4 Highest bus number behind the bridge */ +#define PCI_BRIDGE_CONTROL 0x3e +/* BR12 3.2.5.18 Bridge Control Register */ +#define PCI_BRIDGE_CTL_RESET 0x40 /* Secondary bus reset */ + +/* PM12 3.2.4 Power Management Control/Status (Offset = 4) */ +#define PCI_PM_CTRL 4 /* PM control and status register */ +#define PCI_PM_CTRL_STATE_MASK 0x3 /* Current power state (D0 to D3) */ +#define PCI_PM_CTRL_STATE_D0 0x0 /* D0 state */ +#define PCI_PM_CTRL_STATE_D3hot 0x3 /* D3 state */ +#define PCI_PM_CTRL_NO_SOFT_RESET 0x4 /* No reset for D3hot->D0 */ + +/* ECN_AF 6.x.1 Advanced Features Capability Structure */ +#define PCI_AF_CAP 0x3 /* Advanced features capabilities */ +#define PCI_AF_CAP_FLR 0x2 /* Function Level Reset */ + +static int +pciOpenConfig(pciDevice *dev) +{ + int fd; + + if (dev->fd > 0) + return 0; + + fd = open(dev->path, O_RDWR); + if (fd < 0) { + char ebuf[1024]; + VIR_WARN(_("Failed to open config space file '%s': %s"), + dev->path, virStrerror(errno, ebuf, sizeof(ebuf))); + return -1; + } + VIR_DEBUG("%s %s: opened %s", dev->id, dev->name, dev->path); + dev->fd = fd; + return 0; +} + +static int +pciRead(pciDevice *dev, unsigned pos, uint8_t *buf, unsigned buflen) +{ + memset(buf, 0, buflen); + + if (pciOpenConfig(dev) < 0) + return -1; + + if (pread(dev->fd, buf, buflen, pos) < 0) { + char ebuf[1024]; + VIR_WARN(_("Failed to read from '%s' : %s"), dev->path, + virStrerror(errno, ebuf, sizeof(ebuf))); + return -1; + } + return 0; +} + +static uint8_t +pciRead8(pciDevice *dev, unsigned pos) +{ + uint8_t buf; + pciRead(dev, pos, &buf, sizeof(buf)); + return buf; +} + +static uint16_t +pciRead16(pciDevice *dev, unsigned pos) +{ + uint8_t buf[2]; + pciRead(dev, pos, &buf[0], sizeof(buf)); + return (buf[0] << 0) | (buf[1] << 8); +} + +static uint32_t +pciRead32(pciDevice *dev, unsigned pos) +{ + uint8_t buf[4]; + pciRead(dev, pos, &buf[0], sizeof(buf)); + return (buf[0] << 0) | (buf[1] << 8) | (buf[2] << 16) | (buf[3] << 24); +} + +static int +pciWrite(pciDevice *dev, unsigned pos, uint8_t *buf, unsigned buflen) +{ + if (pciOpenConfig(dev) < 0) + return -1; + + if (pwrite(dev->fd, buf, buflen, pos) < 0) { + char ebuf[1024]; + VIR_WARN(_("Failed to write to '%s' : %s"), dev->path, + virStrerror(errno, ebuf, sizeof(ebuf))); + return -1; + } + return 0; +} + +static void +pciWrite16(pciDevice *dev, unsigned pos, uint16_t val) +{ + uint8_t buf[2] = { (val >> 0), (val >> 8) }; + pciWrite(dev, pos, &buf[0], sizeof(buf)); +} + +static void +pciWrite32(pciDevice *dev, unsigned pos, uint32_t val) +{ + uint8_t buf[4] = { (val >> 0), (val >> 8), (val >> 16), (val >> 14) }; + pciWrite(dev, pos, &buf[0], sizeof(buf)); +} + +typedef int (*pciIterPredicate)(pciDevice *, pciDevice *); + +/* Iterate over available PCI devices calling @predicate + * to compare each one to @dev. + * Return -1 on error since we don't want to assume it is + * safe to reset if there is an error. + */ +static int +pciIterDevices(pciIterPredicate predicate, pciDevice *dev, pciDevice **matched) +{ + DIR *dir; + struct dirent *entry; + + *matched = NULL; + + VIR_DEBUG("%s %s: iterating over " PCI_SYSFS "devices", dev->id, dev->name); + + dir = opendir(PCI_SYSFS "devices"); + if (!dir) { + VIR_WARN0("Failed to open " PCI_SYSFS "devices"); + return -1; + } + + while ((entry = readdir(dir))) { + unsigned domain, bus, slot, function; + pciDevice *try; + + /* Ignore '.' and '..' */ + if (entry->d_name[0] == '.') + continue; + + if (sscanf(entry->d_name, "%x:%x:%x.%x", &domain, &bus, &slot, &function) < 4) { + VIR_WARN("Unusual entry in " PCI_SYSFS "devices: %s", entry->d_name); + continue; + } + + try = pciGetDevice(dev->conn, 0, 0, domain, bus, slot, function); + if (!try) + return -1; + + if (predicate(try, dev)) { + VIR_DEBUG("%s %s: iter matched on %s", dev->id, dev->name, try->name); + *matched = try; + break; + } + pciFreeDevice(try); + } + closedir(dir); + return 0; +} + +static uint8_t +pciFindCapabilityOffset(pciDevice *dev, unsigned capability) +{ + uint16_t status; + uint8_t pos; + + status = pciRead16(dev, PCI_STATUS); + if (!(status & PCI_STATUS_CAP_LIST)) + return 0; + + pos = pciRead8(dev, PCI_CAPABILITY_LIST); + + /* Zero indicates last capability, capabilities can't + * be in the config space header and 0xff is returned + * by the kernel if we don't have access to this region + * + * Note: we're not handling loops or extended + * capabilities here. + */ + while (pos >= PCI_CONF_HEADER_LEN && pos != 0xff) { + uint8_t capid = pciRead8(dev, pos); + if (capid == capability) { + VIR_DEBUG("%s %s: found cap 0x%.2x at 0x%.2x", + dev->id, dev->name, capability, pos); + return pos; + } + + pos = pciRead8(dev, pos + 1); + } + + VIR_DEBUG("%s %s: failed to find cap 0x%.2x", dev->id, dev->name, capability); + + return 0; +} + +static unsigned +pciDetectFunctionLevelReset(pciDevice *dev) +{ + uint16_t caps; + uint8_t pos; + + /* The PCIe Function Level Reset capability allows + * individual device functions to be reset without + * affecting any other functions on the device or + * any other devices on the bus. This is only common + * on SR-IOV NICs at the moment. + */ + if (dev->pcie_cap_pos) { + caps = pciRead16(dev, dev->pcie_cap_pos + PCI_EXP_DEVCAP); + if (caps & PCI_EXP_DEVCAP_FLR) { + VIR_DEBUG("%s %s: detected PCIe FLR capability", dev->id, dev->name); + return 1; + } + } + + /* The PCI AF Function Level Reset capability is + * the same thing, except for conventional PCI + * devices. This is not common yet. + */ + pos = pciFindCapabilityOffset(dev, PCI_CAP_ID_AF); + if (pos) { + caps = pciRead16(dev, pos + PCI_AF_CAP); + if (caps & PCI_AF_CAP_FLR) { + VIR_DEBUG("%s %s: detected PCI FLR capability", dev->id, dev->name); + return 1; + } + } + + VIR_DEBUG("%s %s: no FLR capability found", dev->id, dev->name); + + return 0; +} + +/* Require the device has the PCI Power Management capability + * and that a D3hot->D0 transition will results in a full + * internal reset, not just a soft reset. + */ +static unsigned +pciDetectPowerManagementReset(pciDevice *dev) +{ + if (dev->pci_pm_cap_pos) { + uint32_t ctl; + + /* require the NO_SOFT_RESET bit is clear */ + ctl = pciRead32(dev, dev->pci_pm_cap_pos + PCI_PM_CTRL); + if (!(ctl & PCI_PM_CTRL_NO_SOFT_RESET)) { + VIR_DEBUG("%s %s: detected PM reset capability", dev->id, dev->name); + return 1; + } + } + + VIR_DEBUG("%s %s: no PM reset capability found", dev->id, dev->name); + + return 0; +} + +/* Any devices other than the one supplied on the same domain/bus ? */ +static int +pciSharesBus(pciDevice *a, pciDevice *b) +{ + return + a->domain == b->domain && + a->bus == b->bus && + (a->slot != b->slot || + a->function != b->function); +} + +static int +pciBusContainsOtherDevices(pciDevice *dev) +{ + pciDevice *matched = NULL; + if (pciIterDevices(pciSharesBus, dev, &matched) < 0) + return 1; + if (!matched) + return 0; + pciFreeDevice(matched); + return 1; +} + +/* Any other functions on this device ? */ +static int +pciSharesDevice(pciDevice *a, pciDevice *b) +{ + return + a->domain == b->domain && + a->bus == b->bus && + a->slot == b->slot && + a->function != b->function; +} + +static int +pciDeviceContainsOtherFunctions(pciDevice *dev) +{ + pciDevice *matched = NULL; + if (pciIterDevices(pciSharesDevice, dev, &matched) < 0) + return 1; + if (!matched) + return 0; + pciFreeDevice(matched); + return 1; +} + +/* Is @a the parent of @b ? */ +static int +pciIsParent(pciDevice *a, pciDevice *b) +{ + uint16_t device_class; + uint8_t header_type, secondary, subordinate; + + if (a->domain != b->domain) + return 0; + + /* Is it a bridge? */ + device_class = pciRead16(a, PCI_CLASS_DEVICE); + if (device_class != PCI_CLASS_BRIDGE_PCI) + return 0; + + /* Is it a plane? */ + header_type = pciRead8(a, PCI_HEADER_TYPE); + if ((header_type & PCI_HEADER_TYPE_MASK) != PCI_HEADER_TYPE_BRIDGE) + return 0; + + secondary = pciRead8(a, PCI_SECONDARY_BUS); + subordinate = pciRead8(a, PCI_SUBORDINATE_BUS); + + VIR_DEBUG("%s %s: found parent device %s\n", b->id, b->name, a->name); + + /* No, it's superman! */ + return (b->bus >= secondary && b->bus <= subordinate); +} + +static pciDevice * +pciGetParentDevice(pciDevice *dev) +{ + pciDevice *parent = NULL; + pciIterDevices(pciIsParent, dev, &parent); + return parent; +} + +/* Secondary Bus Reset is our sledgehammer - it resets all + * devices behind a bus. + */ +static int +pciTrySecondaryBusReset(pciDevice *dev) +{ + pciDevice *parent; + uint8_t config_space[PCI_CONF_LEN]; + uint16_t ctl; + int ret = -1; + + /* For now, we just refuse to do a secondary bus reset + * if there are other devices/functions behind the bus. + * In future, we could allow it so long as those devices + * are not in use by the host or other guests. + */ + if (pciBusContainsOtherDevices(dev)) { + VIR_WARN("Other devices on bus with %s, not doing bus reset", + dev->name); + return -1; + } + + /* Find the parent bus */ + parent = pciGetParentDevice(dev); + if (!parent) { + VIR_WARN("Failed to find parent device for %s", dev->name); + return -1; + } + + VIR_DEBUG("%s %s: doing a secondary bus reset", dev->id, dev->name); + + /* Save and restore the device's config space; we only do this + * for the supplied device since we refuse to do a reset if there + * are multiple devices/functions + */ + if (pciRead(dev, 0, config_space, PCI_CONF_LEN) < 0) { + VIR_WARN("Failed to save PCI config space for %s", dev->name); + goto out; + } + + /* Read the control register, set the reset flag, wait 200ms, + * unset the reset flag and wait 200ms. + */ + ctl = pciRead16(dev, PCI_BRIDGE_CONTROL); + + pciWrite16(parent, PCI_BRIDGE_CONTROL, ctl | PCI_BRIDGE_CTL_RESET); + + usleep(200 * 1000); /* sleep 200ms */ + + pciWrite16(parent, PCI_BRIDGE_CONTROL, ctl); + + usleep(200 * 1000); /* sleep 200ms */ + + if (pciWrite(dev, 0, config_space, PCI_CONF_LEN) < 0) + VIR_WARN("Failed to restore PCI config space for %s", dev->name); + + ret = 0; +out: + pciFreeDevice(parent); + return ret; +} + +/* Power management reset attempts to reset a device using a + * D-state transition from D3hot to D0. Note, in detect_pm_reset() + * above we require the device supports a full internal reset. + */ +static int +pciTryPowerManagementReset(pciDevice *dev) +{ + uint8_t config_space[PCI_CONF_LEN]; + uint32_t ctl; + + if (!dev->pci_pm_cap_pos) + return -1; + + /* For now, we just refuse to do a power management reset + * if there are other functions on this device. + * In future, we could allow it so long as those functions + * are not in use by the host or other guests. + */ + if (pciDeviceContainsOtherFunctions(dev)) { + VIR_WARN("%s contains other functions, not resetting", dev->name); + return -1; + } + + /* Save and restore the device's config space. */ + if (pciRead(dev, 0, &config_space[0], PCI_CONF_LEN) < 0) { + VIR_WARN("Failed to save PCI config space for %s", dev->name); + return -1; + } + + VIR_DEBUG("%s %s: doing a power management reset", dev->id, dev->name); + + ctl = pciRead32(dev, dev->pci_pm_cap_pos + PCI_PM_CTRL); + ctl &= ~PCI_PM_CTRL_STATE_MASK; + + pciWrite32(dev, dev->pci_pm_cap_pos + PCI_PM_CTRL, ctl|PCI_PM_CTRL_STATE_D3hot); + + usleep(10 * 1000); /* sleep 10ms */ + + pciWrite32(dev, dev->pci_pm_cap_pos + PCI_PM_CTRL, ctl|PCI_PM_CTRL_STATE_D0); + + usleep(10 * 1000); /* sleep 10ms */ + + if (pciWrite(dev, 0, &config_space[0], PCI_CONF_LEN) < 0) + VIR_WARN("Failed to restore PCI config space for %s", dev->name); + + return 0; +} + +static int +pciInitDevice(pciDevice *dev) +{ + if (pciOpenConfig(dev) < 0) { + virReportSystemError(dev->conn, errno, + _("Failed to open config space file '%s'"), + dev->path); + return -1; + } + + dev->pcie_cap_pos = pciFindCapabilityOffset(dev, PCI_CAP_ID_EXP); + dev->pci_pm_cap_pos = pciFindCapabilityOffset(dev, PCI_CAP_ID_PM); + dev->has_flr = pciDetectFunctionLevelReset(dev); + dev->has_pm_reset = pciDetectPowerManagementReset(dev); + dev->initted = 1; + return 0; +} + +int +pciResetDevice(pciDevice *dev) +{ + int ret = -1; + + if (!dev->initted && pciInitDevice(dev) < 0) + return -1; + + /* KVM will perform FLR when starting and stopping + * a guest, so there is no need for us to do it here. + */ + if (dev->has_flr) + return 0; + + /* Bus reset is not an option with the root bus */ + if (dev->bus != 0) + ret = pciTrySecondaryBusReset(dev); + + /* Next best option is a PCI power management reset */ + if (ret < 0 && dev->has_pm_reset) + ret = pciTryPowerManagementReset(dev); + + if (ret < 0) + pciReportError(dev->conn, VIR_ERR_NO_SUPPORT, + _("No PCI reset capability available for %s"), + dev->name); + return ret; +} + +int +pciDettachDevice(pciDevice *dev) +{ + char path[PATH_MAX]; + + /* Try loading the pci-stub module if it isn't already loaded; + * return an error if there is no pci-stub available. + */ + if (!virFileExists(PCI_SYSFS "drivers/pci-stub")) { + const char *const modprobeargv[] = { MODPROBE, "pci-stub", NULL }; + + if (virRun(dev->conn, modprobeargv, NULL) < 0) { + char ebuf[1024]; + VIR_WARN(_("modprobe pci-stub failed: %s"), + virStrerror(errno, ebuf, sizeof ebuf)); + } + } + + if (!virFileExists(PCI_SYSFS "drivers/pci-stub")) { + VIR_WARN(_("pci-stub driver not available, cannot bind device %s to it"), + dev->name); + } else { + /* Add the PCI device ID to pci-stub's dynamic ID table; + * this is needed to allow us to bind the device to pci-stub. + * Note: if the device is not currently bound to any driver, + * pci-stub will immediately be bound to the device. Also, note + * that if a new device with this ID is hotplugged, or if a probe + * is triggered for such a device, it will also be immediately + * bound by pci-stub. + */ + if (virFileWriteStr(PCI_SYSFS "drivers/pci-stub/new_id", dev->id) < 0) { + virReportSystemError(dev->conn, errno, + _("Failed to add PCI device ID '%s' to pci-stub/new_id"), + dev->id); + return -1; + } + } + + /* If the device is already bound to a driver, unbind it. + * Note, this will have rather unpleasant side effects if this + * PCI device happens to be IDE controller for the disk hosting + * your root filesystem. + */ + snprintf(path, sizeof(path), + PCI_SYSFS "devices/%s/driver/unbind", dev->name); + if (virFileExists(path) && virFileWriteStr(path, dev->name) < 0) { + virReportSystemError(dev->conn, errno, + _("Failed to unbind PCI device '%s'"), dev->name); + return -1; + } + + if (virFileExists(PCI_SYSFS "drivers/pci-stub")) { + /* If the device isn't already bound to pci-stub, try binding it now. + */ + snprintf(path, sizeof(path), PCI_SYSFS "devices/%s/driver", dev->name); + if (!virFileLinkPointsTo(path, PCI_SYSFS "drivers/pci-stub") && + virFileWriteStr(PCI_SYSFS "drivers/pci-stub/bind", dev->name) < 0) { + virReportSystemError(dev->conn, errno, + _("Failed to bind PCI device '%s' to pci-stub"), + dev->name); + return -1; + } + + /* If 'remove_id' exists, remove the device id from pci-stub's dynamic + * ID table so that 'drivers_probe' works below. + */ + if (virFileExists(PCI_SYSFS "drivers/pci-stub/remove_id") && + virFileWriteStr(PCI_SYSFS "drivers/pci-stub/remove_id", dev->id) < 0) { + virReportSystemError(dev->conn, errno, + _("Failed to remove PCI ID '%s' with pci-stub/remove_id"), + dev->id); + return -1; + } + } + + return 0; +} + +int +pciReAttachDevice(pciDevice *dev) +{ + if (virFileExists(PCI_SYSFS "drivers/pci-stub")) { + char path[PATH_MAX]; + + /* If the device is bound to pci-stub, unbind it. + */ + snprintf(path, sizeof(path), PCI_SYSFS "devices/%s/driver", dev->name); + if (virFileLinkPointsTo(path, PCI_SYSFS "drivers/pci-stub") && + virFileWriteStr(PCI_SYSFS "drivers/pci-stub/unbind", dev->name) < 0) { + virReportSystemError(dev->conn, errno, + _("Failed to bind PCI device '%s' to pci-stub"), + dev->name); + return -1; + } + } + + /* Trigger a re-probe of the device is not in pci-stub's dynamic + * ID table. If pci-stub is available, but 'remove_id' isn't + * available, then re-probing would just cause the device to be + * re-bound to pci-stub. + */ + if (!virFileExists(PCI_SYSFS "drivers/pci-stub") || + virFileExists(PCI_SYSFS "drivers/pci-stub/remove_id")) { + if (virFileWriteStr(PCI_SYSFS "drivers_probe", dev->name) < 0) { + virReportSystemError(dev->conn, errno, + _("Failed to trigger a re-probe for PCI device '%s'"), + dev->name); + return -1; + } + } + + return 0; +} + +pciDevice * +pciGetDevice(virConnectPtr conn, + unsigned vendor, + unsigned product, + unsigned domain, + unsigned bus, + unsigned slot, + unsigned function) +{ + pciDevice *dev; + + VIR_DEBUG("%.4x:%.4x %.4x:%.2x:%.2x.%.1x: initializing", + vendor, product, domain, bus, slot, function); + + if (VIR_ALLOC(dev) < 0) { + virReportOOMError(dev->conn); + return NULL; + } + + dev->vendor = vendor; + dev->product = product; + dev->domain = domain; + dev->bus = bus; + dev->slot = slot; + dev->function = function; + + snprintf(dev->name, sizeof(dev->name), "%.4x:%.2x:%.2x.%.1x", + dev->domain, dev->bus, dev->slot, dev->function); + snprintf(dev->id, sizeof(dev->id), "%.4x %.4x", + dev->vendor, dev->product); + snprintf(dev->path, sizeof(dev->path), + PCI_SYSFS "devices/%s/config", dev->name); + + /* These objects are short-lived, so this is just + * a safe hack for error handling convenience */ + dev->conn = conn; + + return dev; +} + +void +pciFreeDevice(pciDevice *dev) +{ + VIR_DEBUG("%.4x:%.4x %.4x:%.2x:%.2x.%.1x: freeing", + dev->vendor, dev->product, dev->domain, dev->bus, dev->slot, dev->function); + if (dev->fd) + close(dev->fd); + VIR_FREE(dev); +} diff --git a/src/pci.h b/src/pci.h new file mode 100644 index 0000000..6e45ccf --- /dev/null +++ b/src/pci.h @@ -0,0 +1,42 @@ +/* + * Copyright (C) 2009 Red Hat, Inc. + * + * This library is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * This library is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this library; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * Authors: + * Mark McLoughlin <markmc@redhat.com> + */ + +#ifndef __VIR_PCI_H__ +#define __VIR_PCI_H__ + +#include <config.h> +#include "internal.h" + +typedef struct _pciDevice pciDevice; + +pciDevice *pciGetDevice (virConnectPtr conn, + unsigned vendor, + unsigned product, + unsigned domain, + unsigned bus, + unsigned slot, + unsigned function); +void pciFreeDevice (pciDevice *dev); +int pciDettachDevice (pciDevice *dev); +int pciReAttachDevice (pciDevice *dev); +int pciResetDevice (pciDevice *dev); + +#endif /* __VIR_PCI_H__ */ -- 1.6.0.6

Before starting a guest virNodeDeviceAttach() is intended to be called on all node devices to be assigned to a guest, followed by virNodeDeviceReset() on those devices. Once the guest has been shutdown, virNodeDeviceReset() followed by virNodeDeviceReAttach() should be called in order to make the device available to the host again. This patch merely adds the APIs and stubs out the driver implementations. Signed-off-by: Mark McLoughlin <markmc@redhat.com> --- ChangeLog | 14 +++++ include/libvirt/libvirt.h | 4 + include/libvirt/libvirt.h.in | 4 + src/driver.h | 13 ++++ src/libvirt.c | 129 ++++++++++++++++++++++++++++++++++++++++++ src/libvirt_public.syms | 3 + src/lxc_driver.c | 3 + src/openvz_driver.c | 3 + src/qemu_driver.c | 3 + src/test.c | 3 + src/uml_driver.c | 3 + 11 files changed, 182 insertions(+), 0 deletions(-) diff --git a/ChangeLog b/ChangeLog index c08c66d..1bd29ed 100644 --- a/ChangeLog +++ b/ChangeLog @@ -1,3 +1,17 @@ +Tue Feb 24 21:58:22 GMT 2009 Mark McLoughlin <markmc@redhat.com> + + * include/libvirt/libvirt.h: add virNodeDeviceDettach(), + virNodeDeviceReAttach() and virNodeDeviceReset(). + + * src/libvirt.c, src/libvirt_public.syms: implement the + public API + + * src/driver.h: add to the driver API + + * src/lxc_driver.c, src/openvz_driver.c, + src/qemu_driver.c, src/test.c, src/uml_driver.c: stub + out the driver implementations. + Tue Feb 24 21:55:03 GMT 2009 Mark McLoughlin <markmc@redhat.com> * src/pci.[ch]: add implementations of dettach, reattach diff --git a/include/libvirt/libvirt.h b/include/libvirt/libvirt.h index e32d40b..cf39488 100644 --- a/include/libvirt/libvirt.h +++ b/include/libvirt/libvirt.h @@ -1053,6 +1053,10 @@ char * virNodeDeviceGetXMLDesc (virNodeDevicePtr dev, int virNodeDeviceRef (virNodeDevicePtr dev); int virNodeDeviceFree (virNodeDevicePtr dev); +int virNodeDeviceDettach (virNodeDevicePtr dev); +int virNodeDeviceReAttach (virNodeDevicePtr dev); +int virNodeDeviceReset (virNodeDevicePtr dev); + /* * Domain Event Notification */ diff --git a/include/libvirt/libvirt.h.in b/include/libvirt/libvirt.h.in index af97bfd..06361de 100644 --- a/include/libvirt/libvirt.h.in +++ b/include/libvirt/libvirt.h.in @@ -1053,6 +1053,10 @@ char * virNodeDeviceGetXMLDesc (virNodeDevicePtr dev, int virNodeDeviceRef (virNodeDevicePtr dev); int virNodeDeviceFree (virNodeDevicePtr dev); +int virNodeDeviceDettach (virNodeDevicePtr dev); +int virNodeDeviceReAttach (virNodeDevicePtr dev); +int virNodeDeviceReset (virNodeDevicePtr dev); + /* * Domain Event Notification */ diff --git a/src/driver.h b/src/driver.h index 32d4257..ffb95dc 100644 --- a/src/driver.h +++ b/src/driver.h @@ -313,6 +313,16 @@ typedef virDomainPtr unsigned long flags, int retcode); +typedef int + (*virDrvNodeDeviceDettach) + (virNodeDevicePtr dev); +typedef int + (*virDrvNodeDeviceReAttach) + (virNodeDevicePtr dev); +typedef int + (*virDrvNodeDeviceReset) + (virNodeDevicePtr dev); + /** * _virDriver: * @@ -387,6 +397,9 @@ struct _virDriver { virDrvDomainEventDeregister domainEventDeregister; virDrvDomainMigratePrepare2 domainMigratePrepare2; virDrvDomainMigrateFinish2 domainMigrateFinish2; + virDrvNodeDeviceDettach nodeDeviceDettach; + virDrvNodeDeviceReAttach nodeDeviceReAttach; + virDrvNodeDeviceReset nodeDeviceReset; }; typedef int diff --git a/src/libvirt.c b/src/libvirt.c index 038a1ac..bc444dd 100644 --- a/src/libvirt.c +++ b/src/libvirt.c @@ -7286,6 +7286,135 @@ virNodeDeviceRef(virNodeDevicePtr dev) return 0; } +/** + * virNodeDeviceAttach: + * @dev: pointer to the node device + * + * Dettach the node device from the node itself so that it may be + * assigned to a guest domain. + * + * Depending on the hypervisor, this may involve operations such + * as unbinding any device drivers from the device, binding the + * device to a dummy device driver and resetting the device. + * + * If the device is currently in use by the node, this method may + * fail. + * + * Once the device is not assigned to any guest, it may be re-attached + * to the node using the virNodeDeviceReattach() method. + */ +int +virNodeDeviceDettach(virNodeDevicePtr dev) +{ + DEBUG("dev=%p, conn=%p", dev, dev ? dev->conn : NULL); + + virResetLastError(); + + if (!VIR_IS_CONNECTED_NODE_DEVICE(dev)) { + virLibNodeDeviceError(NULL, VIR_ERR_INVALID_NODE_DEVICE, __FUNCTION__); + return (-1); + } + + if (dev->conn->driver->nodeDeviceDettach) { + int ret; + ret = dev->conn->driver->nodeDeviceDettach (dev); + if (ret < 0) + goto error; + return ret; + } + + virLibConnError(dev->conn, VIR_ERR_NO_SUPPORT, __FUNCTION__); + +error: + /* Copy to connection error object for back compatability */ + virSetConnError(dev->conn); + return (-1); +} + +/** + * virNodeDeviceReAttach: + * @dev: pointer to the node device + * + * Re-attach a previously dettached node device to the node so that it + * may be used by the node again. + * + * Depending on the hypervisor, this may involve operations such + * as resetting the device, unbinding it from a dummy device driver + * and binding it to its appropriate driver. + * + * If the device is currently in use by a guest, this method may fail. + */ +int +virNodeDeviceReAttach(virNodeDevicePtr dev) +{ + DEBUG("dev=%p, conn=%p", dev, dev ? dev->conn : NULL); + + virResetLastError(); + + if (!VIR_IS_CONNECTED_NODE_DEVICE(dev)) { + virLibNodeDeviceError(NULL, VIR_ERR_INVALID_NODE_DEVICE, __FUNCTION__); + return (-1); + } + + if (dev->conn->driver->nodeDeviceReAttach) { + int ret; + ret = dev->conn->driver->nodeDeviceReAttach (dev); + if (ret < 0) + goto error; + return ret; + } + + virLibConnError(dev->conn, VIR_ERR_NO_SUPPORT, __FUNCTION__); + +error: + /* Copy to connection error object for back compatability */ + virSetConnError(dev->conn); + return (-1); +} + +/** + * virNodeDeviceReset: + * @dev: pointer to the node device + * + * Reset a previously dettached node device to the node before or + * after assigning it to a guest. + * + * The exact reset semantics depends on the hypervisor and device + * type but, for example, KVM will attempt to reset PCI devices with + * a Function Level Reset, Secondary Bus Reset or a Power Management + * D-State reset. + * + * If the reset will affect other devices which are currently in use, + * this function may fail. + */ +int +virNodeDeviceReset(virNodeDevicePtr dev) +{ + DEBUG("dev=%p, conn=%p", dev, dev ? dev->conn : NULL); + + virResetLastError(); + + if (!VIR_IS_CONNECTED_NODE_DEVICE(dev)) { + virLibNodeDeviceError(NULL, VIR_ERR_INVALID_NODE_DEVICE, __FUNCTION__); + return (-1); + } + + if (dev->conn->driver->nodeDeviceReset) { + int ret; + ret = dev->conn->driver->nodeDeviceReset (dev); + if (ret < 0) + goto error; + return ret; + } + + virLibConnError(dev->conn, VIR_ERR_NO_SUPPORT, __FUNCTION__); + +error: + /* Copy to connection error object for back compatability */ + virSetConnError(dev->conn); + return (-1); +} + /* * Domain Event Notification diff --git a/src/libvirt_public.syms b/src/libvirt_public.syms index 1422e4c..8114073 100644 --- a/src/libvirt_public.syms +++ b/src/libvirt_public.syms @@ -251,6 +251,9 @@ LIBVIRT_0.6.1 { global: virFreeError; virSaveLastError; + virNodeDeviceDettach; + virNodeDeviceReAttach; + virNodeDeviceReset; } LIBVIRT_0.6.0; # .... define new API here using predicted next version number .... diff --git a/src/lxc_driver.c b/src/lxc_driver.c index aa417a9..28f2c6f 100644 --- a/src/lxc_driver.c +++ b/src/lxc_driver.c @@ -1455,6 +1455,9 @@ static virDriver lxcDriver = { NULL, /* domainEventDeregister */ NULL, /* domainMigratePrepare2 */ NULL, /* domainMigrateFinish2 */ + NULL, /* nodeDeviceAttach */ + NULL, /* nodeDeviceReAttach */ + NULL, /* nodeDeviceReset */ }; static virStateDriver lxcStateDriver = { diff --git a/src/openvz_driver.c b/src/openvz_driver.c index e004819..b4d9128 100644 --- a/src/openvz_driver.c +++ b/src/openvz_driver.c @@ -1325,6 +1325,9 @@ static virDriver openvzDriver = { NULL, /* domainEventDeregister */ NULL, /* domainMigratePrepare2 */ NULL, /* domainMigrateFinish2 */ + NULL, /* nodeDeviceAttach */ + NULL, /* nodeDeviceReAttach */ + NULL, /* nodeDeviceReset */ }; int openvzRegister(void) { diff --git a/src/qemu_driver.c b/src/qemu_driver.c index a8a2ae7..c2dfc71 100644 --- a/src/qemu_driver.c +++ b/src/qemu_driver.c @@ -4542,6 +4542,9 @@ static virDriver qemuDriver = { qemudDomainEventDeregister, /* domainEventDeregister */ qemudDomainMigratePrepare2, /* domainMigratePrepare2 */ qemudDomainMigrateFinish2, /* domainMigrateFinish2 */ + NULL, /* nodeDeviceDettach */ + NULL, /* nodeDeviceReAttach */ + NULL, /* nodeDeviceReset */ }; diff --git a/src/test.c b/src/test.c index 226fe2e..833faf0 100644 --- a/src/test.c +++ b/src/test.c @@ -3527,6 +3527,9 @@ static virDriver testDriver = { testDomainEventDeregister, /* domainEventDeregister */ NULL, /* domainMigratePrepare2 */ NULL, /* domainMigrateFinish2 */ + NULL, /* nodeDeviceAttach */ + NULL, /* nodeDeviceReAttach */ + NULL, /* nodeDeviceReset */ }; static virNetworkDriver testNetworkDriver = { diff --git a/src/uml_driver.c b/src/uml_driver.c index c5a06a2..aec22ec 100644 --- a/src/uml_driver.c +++ b/src/uml_driver.c @@ -1885,6 +1885,9 @@ static virDriver umlDriver = { NULL, /* domainEventUnregister */ NULL, /* domainMigratePrepare2 */ NULL, /* domainMigrateFinish2 */ + NULL, /* nodeDeviceAttach */ + NULL, /* nodeDeviceReAttach */ + NULL, /* nodeDeviceReset */ }; -- 1.6.0.6

Nothing to see here, just a bunch of plumbing. Signed-off-by: Mark McLoughlin <markmc@redhat.com> --- ChangeLog | 5 ++ qemud/remote.c | 78 ++++++++++++++++++++++++++++++++++++ qemud/remote_dispatch_args.h | 3 + qemud/remote_dispatch_prototypes.h | 21 ++++++++++ qemud/remote_dispatch_table.h | 15 +++++++ qemud/remote_protocol.c | 27 ++++++++++++ qemud/remote_protocol.h | 24 +++++++++++ qemud/remote_protocol.x | 17 +++++++- src/remote_internal.c | 72 +++++++++++++++++++++++++++++++++ 9 files changed, 261 insertions(+), 1 deletions(-) diff --git a/ChangeLog b/ChangeLog index 1bd29ed..dd8241d 100644 --- a/ChangeLog +++ b/ChangeLog @@ -1,3 +1,8 @@ +Tue Feb 24 22:02:32 GMT 2009 Mark McLoughlin <markmc@redhat.com> + + * qemud/remote*, src/remote_internal.c: plumb the new + APIs in the remote driver + Tue Feb 24 21:58:22 GMT 2009 Mark McLoughlin <markmc@redhat.com> * include/libvirt/libvirt.h: add virNodeDeviceDettach(), diff --git a/qemud/remote.c b/qemud/remote.c index 78dda42..bb4cdb0 100644 --- a/qemud/remote.c +++ b/qemud/remote.c @@ -4172,6 +4172,84 @@ remoteDispatchNodeDeviceListCaps (struct qemud_server *server ATTRIBUTE_UNUSED, } +static int +remoteDispatchNodeDeviceDettach (struct qemud_server *server ATTRIBUTE_UNUSED, + struct qemud_client *client ATTRIBUTE_UNUSED, + virConnectPtr conn, + remote_error *rerr, + remote_node_device_dettach_args *args, + void *ret ATTRIBUTE_UNUSED) +{ + virNodeDevicePtr dev; + CHECK_CONN(client); + + dev = virNodeDeviceLookupByName(conn, args->name); + if (dev == NULL) { + remoteDispatchFormatError(rerr, "%s", _("node_device not found")); + return -1; + } + + if (virNodeDeviceDettach(dev) == -1) { + remoteDispatchConnError(rerr, conn); + return -1; + } + + return 0; +} + + +static int +remoteDispatchNodeDeviceReAttach (struct qemud_server *server ATTRIBUTE_UNUSED, + struct qemud_client *client ATTRIBUTE_UNUSED, + virConnectPtr conn, + remote_error *rerr, + remote_node_device_re_attach_args *args, + void *ret ATTRIBUTE_UNUSED) +{ + virNodeDevicePtr dev; + CHECK_CONN(client); + + dev = virNodeDeviceLookupByName(conn, args->name); + if (dev == NULL) { + remoteDispatchFormatError(rerr, "%s", _("node_device not found")); + return -1; + } + + if (virNodeDeviceReAttach(dev) == -1) { + remoteDispatchConnError(rerr, conn); + return -1; + } + + return 0; +} + + +static int +remoteDispatchNodeDeviceReset (struct qemud_server *server ATTRIBUTE_UNUSED, + struct qemud_client *client ATTRIBUTE_UNUSED, + virConnectPtr conn, + remote_error *rerr, + remote_node_device_reset_args *args, + void *ret ATTRIBUTE_UNUSED) +{ + virNodeDevicePtr dev; + CHECK_CONN(client); + + dev = virNodeDeviceLookupByName(conn, args->name); + if (dev == NULL) { + remoteDispatchFormatError(rerr, "%s", _("node_device not found")); + return -1; + } + + if (virNodeDeviceReset(dev) == -1) { + remoteDispatchConnError(rerr, conn); + return -1; + } + + return 0; +} + + /************************** * Async Events **************************/ diff --git a/qemud/remote_dispatch_args.h b/qemud/remote_dispatch_args.h index a19ab79..03a7937 100644 --- a/qemud/remote_dispatch_args.h +++ b/qemud/remote_dispatch_args.h @@ -99,3 +99,6 @@ remote_node_device_get_parent_args val_remote_node_device_get_parent_args; remote_node_device_num_of_caps_args val_remote_node_device_num_of_caps_args; remote_node_device_list_caps_args val_remote_node_device_list_caps_args; + remote_node_device_dettach_args val_remote_node_device_dettach_args; + remote_node_device_re_attach_args val_remote_node_device_re_attach_args; + remote_node_device_reset_args val_remote_node_device_reset_args; diff --git a/qemud/remote_dispatch_prototypes.h b/qemud/remote_dispatch_prototypes.h index 3ffb164..4188c6a 100644 --- a/qemud/remote_dispatch_prototypes.h +++ b/qemud/remote_dispatch_prototypes.h @@ -520,6 +520,13 @@ static int remoteDispatchNetworkUndefine( remote_error *err, remote_network_undefine_args *args, void *ret); +static int remoteDispatchNodeDeviceDettach( + struct qemud_server *server, + struct qemud_client *client, + virConnectPtr conn, + remote_error *err, + remote_node_device_dettach_args *args, + void *ret); static int remoteDispatchNodeDeviceDumpXml( struct qemud_server *server, struct qemud_client *client, @@ -555,6 +562,20 @@ static int remoteDispatchNodeDeviceNumOfCaps( remote_error *err, remote_node_device_num_of_caps_args *args, remote_node_device_num_of_caps_ret *ret); +static int remoteDispatchNodeDeviceReAttach( + struct qemud_server *server, + struct qemud_client *client, + virConnectPtr conn, + remote_error *err, + remote_node_device_re_attach_args *args, + void *ret); +static int remoteDispatchNodeDeviceReset( + struct qemud_server *server, + struct qemud_client *client, + virConnectPtr conn, + remote_error *err, + remote_node_device_reset_args *args, + void *ret); static int remoteDispatchNodeGetCellsFreeMemory( struct qemud_server *server, struct qemud_client *client, diff --git a/qemud/remote_dispatch_table.h b/qemud/remote_dispatch_table.h index 60f0e1c..98be9f3 100644 --- a/qemud/remote_dispatch_table.h +++ b/qemud/remote_dispatch_table.h @@ -592,3 +592,18 @@ .args_filter = (xdrproc_t) xdr_remote_node_device_list_caps_args, .ret_filter = (xdrproc_t) xdr_remote_node_device_list_caps_ret, }, +{ /* NodeDeviceDettach => 118 */ + .fn = (dispatch_fn) remoteDispatchNodeDeviceDettach, + .args_filter = (xdrproc_t) xdr_remote_node_device_dettach_args, + .ret_filter = (xdrproc_t) xdr_void, +}, +{ /* NodeDeviceReAttach => 119 */ + .fn = (dispatch_fn) remoteDispatchNodeDeviceReAttach, + .args_filter = (xdrproc_t) xdr_remote_node_device_re_attach_args, + .ret_filter = (xdrproc_t) xdr_void, +}, +{ /* NodeDeviceReset => 120 */ + .fn = (dispatch_fn) remoteDispatchNodeDeviceReset, + .args_filter = (xdrproc_t) xdr_remote_node_device_reset_args, + .ret_filter = (xdrproc_t) xdr_void, +}, diff --git a/qemud/remote_protocol.c b/qemud/remote_protocol.c index 249614a..b872469 100644 --- a/qemud/remote_protocol.c +++ b/qemud/remote_protocol.c @@ -2166,6 +2166,33 @@ xdr_remote_node_device_list_caps_ret (XDR *xdrs, remote_node_device_list_caps_re } bool_t +xdr_remote_node_device_dettach_args (XDR *xdrs, remote_node_device_dettach_args *objp) +{ + + if (!xdr_remote_nonnull_string (xdrs, &objp->name)) + return FALSE; + return TRUE; +} + +bool_t +xdr_remote_node_device_re_attach_args (XDR *xdrs, remote_node_device_re_attach_args *objp) +{ + + if (!xdr_remote_nonnull_string (xdrs, &objp->name)) + return FALSE; + return TRUE; +} + +bool_t +xdr_remote_node_device_reset_args (XDR *xdrs, remote_node_device_reset_args *objp) +{ + + if (!xdr_remote_nonnull_string (xdrs, &objp->name)) + return FALSE; + return TRUE; +} + +bool_t xdr_remote_domain_events_register_ret (XDR *xdrs, remote_domain_events_register_ret *objp) { diff --git a/qemud/remote_protocol.h b/qemud/remote_protocol.h index 912d8e3..e73e5da 100644 --- a/qemud/remote_protocol.h +++ b/qemud/remote_protocol.h @@ -1211,6 +1211,21 @@ struct remote_node_device_list_caps_ret { }; typedef struct remote_node_device_list_caps_ret remote_node_device_list_caps_ret; +struct remote_node_device_dettach_args { + remote_nonnull_string name; +}; +typedef struct remote_node_device_dettach_args remote_node_device_dettach_args; + +struct remote_node_device_re_attach_args { + remote_nonnull_string name; +}; +typedef struct remote_node_device_re_attach_args remote_node_device_re_attach_args; + +struct remote_node_device_reset_args { + remote_nonnull_string name; +}; +typedef struct remote_node_device_reset_args remote_node_device_reset_args; + struct remote_domain_events_register_ret { int cb_registered; }; @@ -1348,6 +1363,9 @@ enum remote_procedure { REMOTE_PROC_NODE_DEVICE_GET_PARENT = 115, REMOTE_PROC_NODE_DEVICE_NUM_OF_CAPS = 116, REMOTE_PROC_NODE_DEVICE_LIST_CAPS = 117, + REMOTE_PROC_NODE_DEVICE_DETTACH = 118, + REMOTE_PROC_NODE_DEVICE_RE_ATTACH = 119, + REMOTE_PROC_NODE_DEVICE_RESET = 120, }; typedef enum remote_procedure remote_procedure; @@ -1574,6 +1592,9 @@ extern bool_t xdr_remote_node_device_num_of_caps_args (XDR *, remote_node_devic extern bool_t xdr_remote_node_device_num_of_caps_ret (XDR *, remote_node_device_num_of_caps_ret*); extern bool_t xdr_remote_node_device_list_caps_args (XDR *, remote_node_device_list_caps_args*); extern bool_t xdr_remote_node_device_list_caps_ret (XDR *, remote_node_device_list_caps_ret*); +extern bool_t xdr_remote_node_device_dettach_args (XDR *, remote_node_device_dettach_args*); +extern bool_t xdr_remote_node_device_re_attach_args (XDR *, remote_node_device_re_attach_args*); +extern bool_t xdr_remote_node_device_reset_args (XDR *, remote_node_device_reset_args*); extern bool_t xdr_remote_domain_events_register_ret (XDR *, remote_domain_events_register_ret*); extern bool_t xdr_remote_domain_events_deregister_ret (XDR *, remote_domain_events_deregister_ret*); extern bool_t xdr_remote_domain_event_ret (XDR *, remote_domain_event_ret*); @@ -1779,6 +1800,9 @@ extern bool_t xdr_remote_node_device_num_of_caps_args (); extern bool_t xdr_remote_node_device_num_of_caps_ret (); extern bool_t xdr_remote_node_device_list_caps_args (); extern bool_t xdr_remote_node_device_list_caps_ret (); +extern bool_t xdr_remote_node_device_dettach_args (); +extern bool_t xdr_remote_node_device_re_attach_args (); +extern bool_t xdr_remote_node_device_reset_args (); extern bool_t xdr_remote_domain_events_register_ret (); extern bool_t xdr_remote_domain_events_deregister_ret (); extern bool_t xdr_remote_domain_event_ret (); diff --git a/qemud/remote_protocol.x b/qemud/remote_protocol.x index 2a6035b..8f76064 100644 --- a/qemud/remote_protocol.x +++ b/qemud/remote_protocol.x @@ -1068,6 +1068,18 @@ struct remote_node_device_list_caps_ret { remote_nonnull_string names<REMOTE_NODE_DEVICE_CAPS_LIST_MAX>; }; +struct remote_node_device_dettach_args { + remote_nonnull_string name; +}; + +struct remote_node_device_re_attach_args { + remote_nonnull_string name; +}; + +struct remote_node_device_reset_args { + remote_nonnull_string name; +}; + /** * Events Register/Deregister: @@ -1223,7 +1235,10 @@ enum remote_procedure { REMOTE_PROC_NODE_DEVICE_DUMP_XML = 114, REMOTE_PROC_NODE_DEVICE_GET_PARENT = 115, REMOTE_PROC_NODE_DEVICE_NUM_OF_CAPS = 116, - REMOTE_PROC_NODE_DEVICE_LIST_CAPS = 117 + REMOTE_PROC_NODE_DEVICE_LIST_CAPS = 117, + REMOTE_PROC_NODE_DEVICE_DETTACH = 118, + REMOTE_PROC_NODE_DEVICE_RE_ATTACH = 119, + REMOTE_PROC_NODE_DEVICE_RESET = 120 }; /* Custom RPC structure. */ diff --git a/src/remote_internal.c b/src/remote_internal.c index eda6177..0f651f4 100644 --- a/src/remote_internal.c +++ b/src/remote_internal.c @@ -4812,6 +4812,75 @@ done: return rv; } +static int +remoteNodeDeviceDettach (virNodeDevicePtr dev) +{ + int rv = -1; + remote_node_device_dettach_args args; + struct private_data *priv = dev->conn->privateData; + + remoteDriverLock(priv); + + args.name = dev->name; + + if (call (dev->conn, priv, 0, REMOTE_PROC_NODE_DEVICE_DETTACH, + (xdrproc_t) xdr_remote_node_device_dettach_args, (char *) &args, + (xdrproc_t) xdr_void, (char *) NULL) == -1) + goto done; + + rv = 0; + +done: + remoteDriverUnlock(priv); + return rv; +} + +static int +remoteNodeDeviceReAttach (virNodeDevicePtr dev) +{ + int rv = -1; + remote_node_device_re_attach_args args; + struct private_data *priv = dev->conn->privateData; + + remoteDriverLock(priv); + + args.name = dev->name; + + if (call (dev->conn, priv, 0, REMOTE_PROC_NODE_DEVICE_RE_ATTACH, + (xdrproc_t) xdr_remote_node_device_re_attach_args, (char *) &args, + (xdrproc_t) xdr_void, (char *) NULL) == -1) + goto done; + + rv = 0; + +done: + remoteDriverUnlock(priv); + return rv; +} + +static int +remoteNodeDeviceReset (virNodeDevicePtr dev) +{ + int rv = -1; + remote_node_device_reset_args args; + struct private_data *priv = dev->conn->privateData; + + remoteDriverLock(priv); + + args.name = dev->name; + + if (call (dev->conn, priv, 0, REMOTE_PROC_NODE_DEVICE_RESET, + (xdrproc_t) xdr_remote_node_device_reset_args, (char *) &args, + (xdrproc_t) xdr_void, (char *) NULL) == -1) + goto done; + + rv = 0; + +done: + remoteDriverUnlock(priv); + return rv; +} + /*----------------------------------------------------------------------*/ @@ -6741,6 +6810,9 @@ static virDriver driver = { .domainEventDeregister = remoteDomainEventDeregister, .domainMigratePrepare2 = remoteDomainMigratePrepare2, .domainMigrateFinish2 = remoteDomainMigrateFinish2, + .nodeDeviceDettach = remoteNodeDeviceDettach, + .nodeDeviceReAttach = remoteNodeDeviceReAttach, + .nodeDeviceReset = remoteNodeDeviceReset, }; static virNetworkDriver network_driver = { -- 1.6.0.6

Implement the new methods in the QEMU driver by parsing the node device XML to obtain the PCI device details and then calling the appropriate PCI utility function. Signed-off-by: Mark McLoughlin <markmc@redhat.com> --- ChangeLog | 5 ++ src/qemu_driver.c | 130 +++++++++++++++++++++++++++++++++++++++++++++++++++- 2 files changed, 132 insertions(+), 3 deletions(-) diff --git a/ChangeLog b/ChangeLog index dd8241d..62a286f 100644 --- a/ChangeLog +++ b/ChangeLog @@ -1,3 +1,8 @@ +Tue Feb 24 22:08:15 GMT 2009 Mark McLoughlin <markmc@redhat.com> + + * src/qemu_driver.c: implement the new methods in the + QEMU driver. + Tue Feb 24 22:02:32 GMT 2009 Mark McLoughlin <markmc@redhat.com> * qemud/remote*, src/remote_internal.c: plumb the new diff --git a/src/qemu_driver.c b/src/qemu_driver.c index c2dfc71..7b13c5e 100644 --- a/src/qemu_driver.c +++ b/src/qemu_driver.c @@ -68,6 +68,8 @@ #include "memory.h" #include "uuid.h" #include "domain_conf.h" +#include "node_device_conf.h" +#include "pci.h" #define VIR_FROM_THIS VIR_FROM_QEMU @@ -4470,6 +4472,128 @@ cleanup: return dom; } +static int +qemudNodeDeviceGetPciInfo (virNodeDevicePtr dev, + unsigned *vendor, + unsigned *product, + unsigned *domain, + unsigned *bus, + unsigned *slot, + unsigned *function) +{ + virNodeDeviceDefPtr def = NULL; + virNodeDevCapsDefPtr cap; + char *xml = NULL; + int ret = -1; + + xml = virNodeDeviceGetXMLDesc(dev, 0); + if (!xml) + goto out; + + def = virNodeDeviceDefParseString(dev->conn, xml); + if (!def) + goto out; + + cap = def->caps; + while (cap) { + if (cap->type == VIR_NODE_DEV_CAP_PCI_DEV) { + *vendor = cap->data.pci_dev.vendor; + *product = cap->data.pci_dev.product; + *domain = cap->data.pci_dev.domain; + *bus = cap->data.pci_dev.bus; + *slot = cap->data.pci_dev.slot; + *function = cap->data.pci_dev.function; + break; + } + + cap = cap->next; + } + + if (!cap) { + qemudReportError(dev->conn, NULL, NULL, VIR_ERR_INVALID_ARG, + _("device %s is not a PCI device"), dev->name); + goto out; + } + + ret = 0; +out: + virNodeDeviceDefFree(def); + VIR_FREE(xml); + return ret; +} + +static int +qemudNodeDeviceDettach (virNodeDevicePtr dev) +{ + pciDevice *pci; + unsigned vendor, product, domain, bus, slot, function; + int ret = -1; + + if (qemudNodeDeviceGetPciInfo(dev, &vendor, &product, + &domain, &bus, &slot, &function) < 0) + return -1; + + pci = pciGetDevice(dev->conn, vendor, product, domain, bus, slot, function); + if (!pci) + return -1; + + if (pciDettachDevice(pci) < 0) + goto out; + + ret = 0; +out: + pciFreeDevice(pci); + return ret; +} + +static int +qemudNodeDeviceReAttach (virNodeDevicePtr dev) +{ + pciDevice *pci; + unsigned vendor, product, domain, bus, slot, function; + int ret = -1; + + if (qemudNodeDeviceGetPciInfo(dev, &vendor, &product, + &domain, &bus, &slot, &function) < 0) + return -1; + + pci = pciGetDevice(dev->conn, vendor, product, domain, bus, slot, function); + if (!pci) + return -1; + + if (pciReAttachDevice(pci) < 0) + goto out; + + ret = 0; +out: + pciFreeDevice(pci); + return ret; +} + +static int +qemudNodeDeviceReset (virNodeDevicePtr dev) +{ + pciDevice *pci; + unsigned vendor, product, domain, bus, slot, function; + int ret = -1; + + if (qemudNodeDeviceGetPciInfo(dev, &vendor, &product, + &domain, &bus, &slot, &function) < 0) + return -1; + + pci = pciGetDevice(dev->conn, vendor, product, domain, bus, slot, function); + if (!pci) + return -1; + + if (pciResetDevice(pci) < 0) + goto out; + + ret = 0; +out: + pciFreeDevice(pci); + return ret; +} + static virDriver qemuDriver = { VIR_DRV_QEMU, "QEMU", @@ -4542,9 +4666,9 @@ static virDriver qemuDriver = { qemudDomainEventDeregister, /* domainEventDeregister */ qemudDomainMigratePrepare2, /* domainMigratePrepare2 */ qemudDomainMigrateFinish2, /* domainMigrateFinish2 */ - NULL, /* nodeDeviceDettach */ - NULL, /* nodeDeviceReAttach */ - NULL, /* nodeDeviceReset */ + qemudNodeDeviceDettach, /* nodeDeviceDettach */ + qemudNodeDeviceReAttach, /* nodeDeviceReAttach */ + qemudNodeDeviceReset, /* nodeDeviceReset */ }; -- 1.6.0.6

Add virsh commands corresponding to each of the new methods. Signed-off-by: Mark McLoughlin <markmc@redhat.com> --- ChangeLog | 5 ++ src/virsh.c | 126 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 131 insertions(+), 0 deletions(-) diff --git a/ChangeLog b/ChangeLog index 62a286f..4f2c125 100644 --- a/ChangeLog +++ b/ChangeLog @@ -1,3 +1,8 @@ +Tue Feb 24 22:09:40 GMT 2009 Mark McLoughlin <markmc@redhat.com> + + * src/virsh.c: add new commands for each of the new + node device methods + Tue Feb 24 22:08:15 GMT 2009 Mark McLoughlin <markmc@redhat.com> * src/qemu_driver.c: implement the new methods in the diff --git a/src/virsh.c b/src/virsh.c index 298dde0..8ae79c5 100644 --- a/src/virsh.c +++ b/src/virsh.c @@ -4433,6 +4433,129 @@ cmdNodeDeviceDumpXML (vshControl *ctl, const vshCmd *cmd) } /* + * "nodedev-dettach" command + */ +static const vshCmdInfo info_node_device_dettach[] = { + {"help", gettext_noop("dettach node device its device driver")}, + {"desc", gettext_noop("Dettach node device its device driver before assigning to a domain.")}, + {NULL, NULL} +}; + + +static const vshCmdOptDef opts_node_device_dettach[] = { + {"device", VSH_OT_DATA, VSH_OFLAG_REQ, gettext_noop("device key")}, + {NULL, 0, 0, NULL} +}; + +static int +cmdNodeDeviceDettach (vshControl *ctl, const vshCmd *cmd) +{ + const char *name; + virNodeDevicePtr device; + int ret = TRUE; + + if (!vshConnectionUsability(ctl, ctl->conn, TRUE)) + return FALSE; + if (!(name = vshCommandOptString(cmd, "device", NULL))) + return FALSE; + if (!(device = virNodeDeviceLookupByName(ctl->conn, name))) { + vshError(ctl, FALSE, "%s '%s'", _("Could not find matching device"), name); + return FALSE; + } + + if (virNodeDeviceDettach(device) == 0) { + vshPrint(ctl, _("Device %s dettached\n"), name); + } else { + vshError(ctl, FALSE, _("Failed to dettach device %s"), name); + ret = FALSE; + } + virNodeDeviceFree(device); + return ret; +} + +/* + * "nodedev-reattach" command + */ +static const vshCmdInfo info_node_device_reattach[] = { + {"help", gettext_noop("reattach node device its device driver")}, + {"desc", gettext_noop("Dettach node device its device driver before assigning to a domain.")}, + {NULL, NULL} +}; + + +static const vshCmdOptDef opts_node_device_reattach[] = { + {"device", VSH_OT_DATA, VSH_OFLAG_REQ, gettext_noop("device key")}, + {NULL, 0, 0, NULL} +}; + +static int +cmdNodeDeviceReAttach (vshControl *ctl, const vshCmd *cmd) +{ + const char *name; + virNodeDevicePtr device; + int ret = TRUE; + + if (!vshConnectionUsability(ctl, ctl->conn, TRUE)) + return FALSE; + if (!(name = vshCommandOptString(cmd, "device", NULL))) + return FALSE; + if (!(device = virNodeDeviceLookupByName(ctl->conn, name))) { + vshError(ctl, FALSE, "%s '%s'", _("Could not find matching device"), name); + return FALSE; + } + + if (virNodeDeviceReAttach(device) == 0) { + vshPrint(ctl, _("Device %s re-attached\n"), name); + } else { + vshError(ctl, FALSE, _("Failed to re-attach device %s"), name); + ret = FALSE; + } + virNodeDeviceFree(device); + return ret; +} + +/* + * "nodedev-reset" command + */ +static const vshCmdInfo info_node_device_reset[] = { + {"help", gettext_noop("reset node device")}, + {"desc", gettext_noop("Reset node device before or after assigning to a domain.")}, + {NULL, NULL} +}; + + +static const vshCmdOptDef opts_node_device_reset[] = { + {"device", VSH_OT_DATA, VSH_OFLAG_REQ, gettext_noop("device key")}, + {NULL, 0, 0, NULL} +}; + +static int +cmdNodeDeviceReset (vshControl *ctl, const vshCmd *cmd) +{ + const char *name; + virNodeDevicePtr device; + int ret = TRUE; + + if (!vshConnectionUsability(ctl, ctl->conn, TRUE)) + return FALSE; + if (!(name = vshCommandOptString(cmd, "device", NULL))) + return FALSE; + if (!(device = virNodeDeviceLookupByName(ctl->conn, name))) { + vshError(ctl, FALSE, "%s '%s'", _("Could not find matching device"), name); + return FALSE; + } + + if (virNodeDeviceReset(device) == 0) { + vshPrint(ctl, _("Device %s reset\n"), name); + } else { + vshError(ctl, FALSE, _("Failed to reset device %s"), name); + ret = FALSE; + } + virNodeDeviceFree(device); + return ret; +} + +/* * "hostkey" command */ static const vshCmdInfo info_hostname[] = { @@ -5576,6 +5699,9 @@ static const vshCmdDef commands[] = { {"nodedev-list", cmdNodeListDevices, opts_node_list_devices, info_node_list_devices}, {"nodedev-dumpxml", cmdNodeDeviceDumpXML, opts_node_device_dumpxml, info_node_device_dumpxml}, + {"nodedev-dettach", cmdNodeDeviceDettach, opts_node_device_dettach, info_node_device_dettach}, + {"nodedev-reattach", cmdNodeDeviceReAttach, opts_node_device_reattach, info_node_device_reattach}, + {"nodedev-reset", cmdNodeDeviceReset, opts_node_device_reset, info_node_device_reset}, {"pool-autostart", cmdPoolAutostart, opts_pool_autostart, info_pool_autostart}, {"pool-build", cmdPoolBuild, opts_pool_build, info_pool_build}, -- 1.6.0.6

On Tue, Feb 24, 2009 at 10:21:48PM +0000, Mark McLoughlin wrote:
Before starting a guest virNodeDeviceAttach() is intended to be called on all node devices to be assigned to a guest, followed by virNodeDeviceReset() on those devices.
Once the guest has been shutdown, virNodeDeviceReset() followed by virNodeDeviceReAttach() should be called in order to make the device available to the host again.
This patch merely adds the APIs and stubs out the driver implementations.
While I can see a point in providing public APIs to attach/detach drivers to devices - because we need this for Xen driver PCI passthrough, I'm not sure theres a compelling need for exposing a reset function, because both Xen & your KVM impl are quite happy doing the resets themselves. I think the attach/detach functions should be in the nodedev driver too, because they're not really part of the HV functionality. On modern Linux kernels, both Xen & KVM (and any other users) have the same pci-stub.ko code for managed driver binding. On older Xen kernels, there is the functionally equivalent pci-back.ko. A similar capability is really needed for USB devices, to disconnect them from any host USB driver, and that's not HV specific either. Daniel -- |: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :| |: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

On Wed, 2009-02-25 at 17:51 +0000, Daniel P. Berrange wrote:
On Tue, Feb 24, 2009 at 10:21:48PM +0000, Mark McLoughlin wrote:
Before starting a guest virNodeDeviceAttach() is intended to be called on all node devices to be assigned to a guest, followed by virNodeDeviceReset() on those devices.
Once the guest has been shutdown, virNodeDeviceReset() followed by virNodeDeviceReAttach() should be called in order to make the device available to the host again.
This patch merely adds the APIs and stubs out the driver implementations.
While I can see a point in providing public APIs to attach/detach drivers to devices - because we need this for Xen driver PCI passthrough, I'm not sure theres a compelling need for exposing a reset function, because both Xen & your KVM impl are quite happy doing the resets themselves.
The idea with the reset function is that calling reset is a way for the app to query whether this is an assignable device - e.g. if the user chooses a given NIC to pass through in one of the early screens in virt-manager, we can give a "you can't assign that device" error at that point rather than just having the guest fail to start up much later on. It needs to be a separate API from dettach() because reset() may succeed (in future) if e.g. you dettach() all the functions on a multi-function device before calling reset() on one of those functions.
I think the attach/detach functions should be in the nodedev driver too, because they're not really part of the HV functionality. On modern Linux kernels, both Xen & KVM (and any other users) have the same pci-stub.ko code for managed driver binding. On older Xen kernels, there is the functionally equivalent pci-back.ko.
Sure, but the logic to choose between pci-stub vs. pci-back is HV specific, right? I'm sure more of this could be re-factored to be in common for the Xen implementation, but it makes sense to me to do that when we actually do the Xen implementation. The code should be very easy to re-factor.
A similar capability is really needed for USB devices, to disconnect them from any host USB driver, and that's not HV specific either.
Yep, I haven't investigated the USB side of things - looks like we need remove_id support in USB sysfs too. Cheers, Mark.

On Wed, Feb 25, 2009 at 07:16:28PM +0000, Mark McLoughlin wrote:
On Wed, 2009-02-25 at 17:51 +0000, Daniel P. Berrange wrote:
On Tue, Feb 24, 2009 at 10:21:48PM +0000, Mark McLoughlin wrote:
Before starting a guest virNodeDeviceAttach() is intended to be called on all node devices to be assigned to a guest, followed by virNodeDeviceReset() on those devices.
Once the guest has been shutdown, virNodeDeviceReset() followed by virNodeDeviceReAttach() should be called in order to make the device available to the host again.
This patch merely adds the APIs and stubs out the driver implementations.
While I can see a point in providing public APIs to attach/detach drivers to devices - because we need this for Xen driver PCI passthrough, I'm not sure theres a compelling need for exposing a reset function, because both Xen & your KVM impl are quite happy doing the resets themselves.
The idea with the reset function is that calling reset is a way for the app to query whether this is an assignable device - e.g. if the user chooses a given NIC to pass through in one of the early screens in virt-manager, we can give a "you can't assign that device" error at that point rather than just having the guest fail to start up much later on.
It needs to be a separate API from dettach() because reset() may succeed (in future) if e.g. you dettach() all the functions on a multi-function device before calling reset() on one of those functions.
I think the attach/detach functions should be in the nodedev driver too, because they're not really part of the HV functionality. On modern Linux kernels, both Xen & KVM (and any other users) have the same pci-stub.ko code for managed driver binding. On older Xen kernels, there is the functionally equivalent pci-back.ko.
Sure, but the logic to choose between pci-stub vs. pci-back is HV specific, right?
No, its really kernel version depedant. On old Xen kernels it is always pci-back. On mainline KVM / pv_ops Xen kernels, it'll always be pci-stub There's no need for Xen's pci-back, now the generic pci-stub exists. Daniel -- |: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :| |: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

On Wed, 2009-02-25 at 19:42 +0000, Daniel P. Berrange wrote:
I think the attach/detach functions should be in the nodedev driver too, because they're not really part of the HV functionality. On modern Linux kernels, both Xen & KVM (and any other users) have the same pci-stub.ko code for managed driver binding. On older Xen kernels, there is the functionally equivalent pci-back.ko.
Sure, but the logic to choose between pci-stub vs. pci-back is HV specific, right?
No, its really kernel version depedant. On old Xen kernels it is always pci-back. On mainline KVM / pv_ops Xen kernels, it'll always be pci-stub There's no need for Xen's pci-back, now the generic pci-stub exists.
Okay, that does makes some sense. Sorry I didn't do the re-factoring in the set I just posted. Cheers, Mark.

On Wed, Feb 25, 2009 at 07:16:28PM +0000, Mark McLoughlin wrote:
On Wed, 2009-02-25 at 17:51 +0000, Daniel P. Berrange wrote:
On Tue, Feb 24, 2009 at 10:21:48PM +0000, Mark McLoughlin wrote:
Before starting a guest virNodeDeviceAttach() is intended to be called on all node devices to be assigned to a guest, followed by virNodeDeviceReset() on those devices.
Once the guest has been shutdown, virNodeDeviceReset() followed by virNodeDeviceReAttach() should be called in order to make the device available to the host again.
This patch merely adds the APIs and stubs out the driver implementations.
While I can see a point in providing public APIs to attach/detach drivers to devices - because we need this for Xen driver PCI passthrough, I'm not sure theres a compelling need for exposing a reset function, because both Xen & your KVM impl are quite happy doing the resets themselves.
The idea with the reset function is that calling reset is a way for the app to query whether this is an assignable device - e.g. if the user chooses a given NIC to pass through in one of the early screens in virt-manager, we can give a "you can't assign that device" error at that point rather than just having the guest fail to start up much later on.
I wonder if we should generalize that beyond just host device, to cover VM device hotplug in general, passing in the same XML doc you'd use for the subsequent virDomainAttachDevice() call. virDomainCanAttachDevice(virDomainPtr dom, const char *xml); Though, obviously that won't help for scenarios before a virDomainPtr exists. So might also want a way to ask whether a device is likely to be supported before creating a VM virConnectSupportsDevice(virConnectPtr conn, const char *xml) But then maybe this is getting into overkill, and just rename the reset function to int virNodeDeviceAssignable(virNodeDevicePtr dev); so we're not explicitly saying 'reset' is the test we're doing, even though that may be the internal impl Daniel -- |: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :| |: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

On Wed, 2009-02-25 at 19:58 +0000, Daniel P. Berrange wrote:
While I can see a point in providing public APIs to attach/detach drivers to devices - because we need this for Xen driver PCI passthrough, I'm not sure theres a compelling need for exposing a reset function, because both Xen & your KVM impl are quite happy doing the resets themselves.
The idea with the reset function is that calling reset is a way for the app to query whether this is an assignable device - e.g. if the user chooses a given NIC to pass through in one of the early screens in virt-manager, we can give a "you can't assign that device" error at that point rather than just having the guest fail to start up much later on.
I wonder if we should generalize that beyond just host device, to cover VM device hotplug in general, passing in the same XML doc you'd use for the subsequent virDomainAttachDevice() call.
virDomainCanAttachDevice(virDomainPtr dom, const char *xml);
Though, obviously that won't help for scenarios before a virDomainPtr exists. So might also want a way to ask whether a device is likely to be supported before creating a VM
virConnectSupportsDevice(virConnectPtr conn, const char *xml)
But then maybe this is getting into overkill, and just rename the reset function to
int virNodeDeviceAssignable(virNodeDevicePtr dev);
so we're not explicitly saying 'reset' is the test we're doing, even though that may be the internal impl
I see what you're getting at - we may in future want have an "is assignable" API that goes beyond just "does reset" work and it would be a pity to have to add yet another API. However, "is assignable" in the context of a hotplugging into a running guest is an operation on a given domain - e.g. if you're hotplugging a device on the same bus as a device previously assigned to the same guest, then that's allowed. "reset" seems to me to be a pretty sane operation to expose - especially for what we're now calling "non-managed" assigned devices. It doesn't cover all cases, but it's an step in the right direction. Cheers, Mark.

On Tue, Feb 24, 2009 at 10:21:47PM +0000, Mark McLoughlin wrote:
Add implementations of dettach, reattach and reset for PCI devices.
Background to this code can be found here:
http://marc.info/?l=kvm&m=123454366317045
Some notes:
* pci-stub was first introduced in 2.6.29; if it's not available, then dettach can only unbind the device from the host driver
* remove_id is queued up for 2.6.30; in it's absence reattach cannot re-bind the device to the original driver
* If the device supports Function Level Reset, we just don't bother doing the reset - KVM will do this before and after a guest runs
* Currently, if a reset would affect another device on the same bus, or another function on the same multi-function device, then we just refuse to do the reset. In future, we could allow the reset as long as it doesn't affect a device in use by the host or assigned to another guest
* The device reset code is similar in concept to some of Xen's code
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
I can't claim to understand all the gory PCI code in here, but a few API level suggestions. Also, would it be worth (or possible) using libpciaccess.so for this instead of poking Linux sysfs files directly ?
+static int +pciRead(pciDevice *dev, unsigned pos, uint8_t *buf, unsigned buflen) +{ + memset(buf, 0, buflen); + + if (pciOpenConfig(dev) < 0) + return -1; + + if (pread(dev->fd, buf, buflen, pos) < 0) { + char ebuf[1024]; + VIR_WARN(_("Failed to read from '%s' : %s"), dev->path, + virStrerror(errno, ebuf, sizeof(ebuf))); + return -1; + } + return 0; +}
I think this needs to check for a possible short read too. Perhaps better off doing a seek and then saferead(), or defining a safepread() function.
+static int +pciWrite(pciDevice *dev, unsigned pos, uint8_t *buf, unsigned buflen) +{ + if (pciOpenConfig(dev) < 0) + return -1; + + if (pwrite(dev->fd, buf, buflen, pos) < 0) { + char ebuf[1024]; + VIR_WARN(_("Failed to write to '%s' : %s"), dev->path, + virStrerror(errno, ebuf, sizeof(ebuf))); + return -1; + } + return 0; +}
Likewise here needs to validate actual num of bytes written
+int +pciDettachDevice(pciDevice *dev) +{ + char path[PATH_MAX]; + + /* Try loading the pci-stub module if it isn't already loaded; + * return an error if there is no pci-stub available. + */ + if (!virFileExists(PCI_SYSFS "drivers/pci-stub")) { + const char *const modprobeargv[] = { MODPROBE, "pci-stub", NULL }; + + if (virRun(dev->conn, modprobeargv, NULL) < 0) { + char ebuf[1024]; + VIR_WARN(_("modprobe pci-stub failed: %s"), + virStrerror(errno, ebuf, sizeof ebuf)); + } + } + + if (!virFileExists(PCI_SYSFS "drivers/pci-stub")) { + VIR_WARN(_("pci-stub driver not available, cannot bind device %s to it"), + dev->name); + } else { + /* Add the PCI device ID to pci-stub's dynamic ID table; + * this is needed to allow us to bind the device to pci-stub. + * Note: if the device is not currently bound to any driver, + * pci-stub will immediately be bound to the device. Also, note + * that if a new device with this ID is hotplugged, or if a probe + * is triggered for such a device, it will also be immediately + * bound by pci-stub. + */ + if (virFileWriteStr(PCI_SYSFS "drivers/pci-stub/new_id", dev->id) < 0) { + virReportSystemError(dev->conn, errno, + _("Failed to add PCI device ID '%s' to pci-stub/new_id"), + dev->id); + return -1; + } + } + + /* If the device is already bound to a driver, unbind it. + * Note, this will have rather unpleasant side effects if this + * PCI device happens to be IDE controller for the disk hosting + * your root filesystem. + */ + snprintf(path, sizeof(path), + PCI_SYSFS "devices/%s/driver/unbind", dev->name); + if (virFileExists(path) && virFileWriteStr(path, dev->name) < 0) { + virReportSystemError(dev->conn, errno, + _("Failed to unbind PCI device '%s'"), dev->name); + return -1; + } + + if (virFileExists(PCI_SYSFS "drivers/pci-stub")) { + /* If the device isn't already bound to pci-stub, try binding it now. + */ + snprintf(path, sizeof(path), PCI_SYSFS "devices/%s/driver", dev->name); + if (!virFileLinkPointsTo(path, PCI_SYSFS "drivers/pci-stub") && + virFileWriteStr(PCI_SYSFS "drivers/pci-stub/bind", dev->name) < 0) { + virReportSystemError(dev->conn, errno, + _("Failed to bind PCI device '%s' to pci-stub"), + dev->name); + return -1; + } + + /* If 'remove_id' exists, remove the device id from pci-stub's dynamic + * ID table so that 'drivers_probe' works below. + */ + if (virFileExists(PCI_SYSFS "drivers/pci-stub/remove_id") && + virFileWriteStr(PCI_SYSFS "drivers/pci-stub/remove_id", dev->id) < 0) { + virReportSystemError(dev->conn, errno, + _("Failed to remove PCI ID '%s' with pci-stub/remove_id"), + dev->id); + return -1; + } + } + + return 0; +} + +int +pciReAttachDevice(pciDevice *dev) +{ + if (virFileExists(PCI_SYSFS "drivers/pci-stub")) { + char path[PATH_MAX]; + + /* If the device is bound to pci-stub, unbind it. + */ + snprintf(path, sizeof(path), PCI_SYSFS "devices/%s/driver", dev->name); + if (virFileLinkPointsTo(path, PCI_SYSFS "drivers/pci-stub") && + virFileWriteStr(PCI_SYSFS "drivers/pci-stub/unbind", dev->name) < 0) { + virReportSystemError(dev->conn, errno, + _("Failed to bind PCI device '%s' to pci-stub"), + dev->name); + return -1; + } + } + + /* Trigger a re-probe of the device is not in pci-stub's dynamic + * ID table. If pci-stub is available, but 'remove_id' isn't + * available, then re-probing would just cause the device to be + * re-bound to pci-stub. + */ + if (!virFileExists(PCI_SYSFS "drivers/pci-stub") || + virFileExists(PCI_SYSFS "drivers/pci-stub/remove_id")) { + if (virFileWriteStr(PCI_SYSFS "drivers_probe", dev->name) < 0) { + virReportSystemError(dev->conn, errno, + _("Failed to trigger a re-probe for PCI device '%s'"), + dev->name); + return -1; + } + } + + return 0; +}
These functions really need to take the driver name as an argument, so we can pass 'pci-stub' for KVM (or modern pv_ops Xen dom0), or 'pciback' for old pre-pv_ops Xen dom0.
+pciDevice *pciGetDevice (virConnectPtr conn, + unsigned vendor, + unsigned product, + unsigned domain, + unsigned bus, + unsigned slot, + unsigned function); +void pciFreeDevice (pciDevice *dev); +int pciDettachDevice (pciDevice *dev); +int pciReAttachDevice (pciDevice *dev); +int pciResetDevice (pciDevice *dev);
I prefer it to pass the virConnectPtr conn into each of these calls rather than storing it in the pciDevice struct. It'd also be good to remove the vendor/product ID here, and have this code lookup them up from te address info, since the domain XML format only provides the caller with the domain/bus/slot/function address info. Daniel -- |: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :| |: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

On Wed, 2009-02-25 at 15:54 +0000, Daniel P. Berrange wrote:
On Tue, Feb 24, 2009 at 10:21:47PM +0000, Mark McLoughlin wrote:
Add implementations of dettach, reattach and reset for PCI devices.
Background to this code can be found here:
http://marc.info/?l=kvm&m=123454366317045
Some notes:
* pci-stub was first introduced in 2.6.29; if it's not available, then dettach can only unbind the device from the host driver
* remove_id is queued up for 2.6.30; in it's absence reattach cannot re-bind the device to the original driver
* If the device supports Function Level Reset, we just don't bother doing the reset - KVM will do this before and after a guest runs
* Currently, if a reset would affect another device on the same bus, or another function on the same multi-function device, then we just refuse to do the reset. In future, we could allow the reset as long as it doesn't affect a device in use by the host or assigned to another guest
* The device reset code is similar in concept to some of Xen's code
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
I can't claim to understand all the gory PCI code in here, but a few API level suggestions.
Also, would it be worth (or possible) using libpciaccess.so for this instead of poking Linux sysfs files directly ?
Ah, yes - I knew I forgot another note! Originally the code used libpci, but since that doesn't handle malloc failure (well, you can handle it, but only by exiting) I implemented it directly. libpciaccess looks much saner, so we should use that. It doesn't take much code to implement it, though, so I don't consider switching to the library to be very high priority.
+static int +pciRead(pciDevice *dev, unsigned pos, uint8_t *buf, unsigned buflen) +{ + memset(buf, 0, buflen); + + if (pciOpenConfig(dev) < 0) + return -1; + + if (pread(dev->fd, buf, buflen, pos) < 0) { + char ebuf[1024]; + VIR_WARN(_("Failed to read from '%s' : %s"), dev->path, + virStrerror(errno, ebuf, sizeof(ebuf))); + return -1; + } + return 0; +}
I think this needs to check for a possible short read too. Perhaps better off doing a seek and then saferead(), or defining a safepread() function.
+static int +pciWrite(pciDevice *dev, unsigned pos, uint8_t *buf, unsigned buflen) +{ + if (pciOpenConfig(dev) < 0) + return -1; + + if (pwrite(dev->fd, buf, buflen, pos) < 0) { + char ebuf[1024]; + VIR_WARN(_("Failed to write to '%s' : %s"), dev->path, + virStrerror(errno, ebuf, sizeof(ebuf))); + return -1; + } + return 0; +}
Likewise here needs to validate actual num of bytes written
Yep, except we shouldn't see such failures with sysfs - e.g. the likes of libpci doesn't handle such failures. Suggest it's not a big issue and fix it by switching to libpciaccess
+int +pciDettachDevice(pciDevice *dev) +{ + char path[PATH_MAX]; + + /* Try loading the pci-stub module if it isn't already loaded; + * return an error if there is no pci-stub available. + */ + if (!virFileExists(PCI_SYSFS "drivers/pci-stub")) { + const char *const modprobeargv[] = { MODPROBE, "pci-stub", NULL }; + + if (virRun(dev->conn, modprobeargv, NULL) < 0) { + char ebuf[1024]; + VIR_WARN(_("modprobe pci-stub failed: %s"), + virStrerror(errno, ebuf, sizeof ebuf)); + } + } + + if (!virFileExists(PCI_SYSFS "drivers/pci-stub")) { + VIR_WARN(_("pci-stub driver not available, cannot bind device %s to it"), + dev->name); + } else { + /* Add the PCI device ID to pci-stub's dynamic ID table; + * this is needed to allow us to bind the device to pci-stub. + * Note: if the device is not currently bound to any driver, + * pci-stub will immediately be bound to the device. Also, note + * that if a new device with this ID is hotplugged, or if a probe + * is triggered for such a device, it will also be immediately + * bound by pci-stub. + */ + if (virFileWriteStr(PCI_SYSFS "drivers/pci-stub/new_id", dev->id) < 0) { + virReportSystemError(dev->conn, errno, + _("Failed to add PCI device ID '%s' to pci-stub/new_id"), + dev->id); + return -1; + } + } + + /* If the device is already bound to a driver, unbind it. + * Note, this will have rather unpleasant side effects if this + * PCI device happens to be IDE controller for the disk hosting + * your root filesystem. + */ + snprintf(path, sizeof(path), + PCI_SYSFS "devices/%s/driver/unbind", dev->name); + if (virFileExists(path) && virFileWriteStr(path, dev->name) < 0) { + virReportSystemError(dev->conn, errno, + _("Failed to unbind PCI device '%s'"), dev->name); + return -1; + } + + if (virFileExists(PCI_SYSFS "drivers/pci-stub")) { + /* If the device isn't already bound to pci-stub, try binding it now. + */ + snprintf(path, sizeof(path), PCI_SYSFS "devices/%s/driver", dev->name); + if (!virFileLinkPointsTo(path, PCI_SYSFS "drivers/pci-stub") && + virFileWriteStr(PCI_SYSFS "drivers/pci-stub/bind", dev->name) < 0) { + virReportSystemError(dev->conn, errno, + _("Failed to bind PCI device '%s' to pci-stub"), + dev->name); + return -1; + } + + /* If 'remove_id' exists, remove the device id from pci-stub's dynamic + * ID table so that 'drivers_probe' works below. + */ + if (virFileExists(PCI_SYSFS "drivers/pci-stub/remove_id") && + virFileWriteStr(PCI_SYSFS "drivers/pci-stub/remove_id", dev->id) < 0) { + virReportSystemError(dev->conn, errno, + _("Failed to remove PCI ID '%s' with pci-stub/remove_id"), + dev->id); + return -1; + } + } + + return 0; +} + +int +pciReAttachDevice(pciDevice *dev) +{ + if (virFileExists(PCI_SYSFS "drivers/pci-stub")) { + char path[PATH_MAX]; + + /* If the device is bound to pci-stub, unbind it. + */ + snprintf(path, sizeof(path), PCI_SYSFS "devices/%s/driver", dev->name); + if (virFileLinkPointsTo(path, PCI_SYSFS "drivers/pci-stub") && + virFileWriteStr(PCI_SYSFS "drivers/pci-stub/unbind", dev->name) < 0) { + virReportSystemError(dev->conn, errno, + _("Failed to bind PCI device '%s' to pci-stub"), + dev->name); + return -1; + } + } + + /* Trigger a re-probe of the device is not in pci-stub's dynamic + * ID table. If pci-stub is available, but 'remove_id' isn't + * available, then re-probing would just cause the device to be + * re-bound to pci-stub. + */ + if (!virFileExists(PCI_SYSFS "drivers/pci-stub") || + virFileExists(PCI_SYSFS "drivers/pci-stub/remove_id")) { + if (virFileWriteStr(PCI_SYSFS "drivers_probe", dev->name) < 0) { + virReportSystemError(dev->conn, errno, + _("Failed to trigger a re-probe for PCI device '%s'"), + dev->name); + return -1; + } + } + + return 0; +}
These functions really need to take the driver name as an argument, so we can pass 'pci-stub' for KVM (or modern pv_ops Xen dom0), or 'pciback' for old pre-pv_ops Xen dom0.
Okay.
+pciDevice *pciGetDevice (virConnectPtr conn, + unsigned vendor, + unsigned product, + unsigned domain, + unsigned bus, + unsigned slot, + unsigned function); +void pciFreeDevice (pciDevice *dev); +int pciDettachDevice (pciDevice *dev); +int pciReAttachDevice (pciDevice *dev); +int pciResetDevice (pciDevice *dev);
I prefer it to pass the virConnectPtr conn into each of these calls rather than storing it in the pciDevice struct.
Okay.
It'd also be good to remove the vendor/product ID here, and have this code lookup them up from te address info, since the domain XML format only provides the caller with the domain/bus/slot/function address info.
Okay. I'll follow up with a new series with those changes. Cheers, Mark.
participants (2)
-
Daniel P. Berrange
-
Mark McLoughlin