[libvirt] [RFC PATCH 00/16] Introduce vGPU mdev framework to libvirt

Finally. It's here. This is the initial suggestion on how libvirt might interract with the mdev framework, currently only focussing on the non-managed devices, i.e. those pre-created by the user, since that will be revisited once we all settled on how the XML should look like, given we might not want to use the sysfs path directly as an attribute in the domain XML. My proposal on the XML is the following: <hostdev mode='subsystem' type='mdev'> <source> <!-- this is the host's physical device address --> <address domain='0x0000' bus='0x00' slot='0x00' function='0x00'> <uuid>vGPU_UUID<uuid> <source> <!-- target PCI address can be omitted to assign it automatically --> </hostdev> So the mediated device is identified by the physical parent device visible on the host and a UUID which allows us to construct the sysfs path by ourselves, which we then put on the QEMU's command line. A few remarks if you actually happen to have a machine to test this on: - right now the mediated devices are one-time use only, i.e. they have to be recreated before every machine boot - I wouldn't recommend assigning multiple vGPUs to a single domain Once this series is sorted out, we can then continue with 'managed=yes' where as Laine pointed out [1], we need to figure out how exactly should the management layer hint libvirt which vGPU type should be used for device instantiation. [1] https://www.redhat.com/archives/libvir-list/2017-January/msg00287.html #pleaseshareyourfeedback Thanks, Erik Erik Skultety (16): util: Introduce new module virmdev conf: Introduce new hostdev device type mdev docs: Update RNG schema to reflect the new hostdev type mdev conf: Adjust the domain parser to work with mdevs Adjust the formatter to reflect the new hostdev type mdev security: dac: Enable labeling of vfio mediated devices security: selinux: Enable labeling of vfio mediated devices conf: Enable cold-plug of a mediated device qemu: Assign PCI addresses for mediated devices as well hostdev: Maintain a driver list of active mediated devices hostdev: Introduce a reattach method for mediated devices qemu: cgroup: Adjust cgroups' logic to allow mediated devices qemu: namespace: Hook up the discovery of mdevs into the namespace code qemu: Format mdevs on the qemu command line test: Add some test cases for our test suite regarding the mdevs docs: Document the new hostdev device type 'mdev' docs/formatdomain.html.in | 40 ++- docs/schemas/domaincommon.rng | 17 + po/POTFILES.in | 1 + src/Makefile.am | 1 + src/conf/domain_conf.c | 81 ++++- src/conf/domain_conf.h | 10 + src/libvirt_private.syms | 19 ++ src/qemu/qemu_cgroup.c | 35 ++ src/qemu/qemu_command.c | 49 +++ src/qemu/qemu_command.h | 5 + src/qemu/qemu_domain.c | 13 + src/qemu/qemu_domain_address.c | 12 +- src/qemu/qemu_hostdev.c | 37 ++ src/qemu/qemu_hostdev.h | 8 + src/qemu/qemu_hotplug.c | 2 + src/security/security_apparmor.c | 3 + src/security/security_dac.c | 56 +++ src/security/security_selinux.c | 55 +++ src/util/virhostdev.c | 179 +++++++++- src/util/virhostdev.h | 16 + src/util/virmdev.c | 375 +++++++++++++++++++++ src/util/virmdev.h | 85 +++++ tests/domaincapsschemadata/full.xml | 1 + ...qemuxml2argv-hostdev-mdev-unmanaged-no-uuid.xml | 37 ++ .../qemuxml2argv-hostdev-mdev-unmanaged.args | 25 ++ .../qemuxml2argv-hostdev-mdev-unmanaged.xml | 38 +++ tests/qemuxml2argvtest.c | 6 + .../qemuxml2xmlout-hostdev-mdev-unmanaged.xml | 41 +++ tests/qemuxml2xmltest.c | 1 + 29 files changed, 1239 insertions(+), 9 deletions(-) create mode 100644 src/util/virmdev.c create mode 100644 src/util/virmdev.h create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-hostdev-mdev-unmanaged-no-uuid.xml create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-hostdev-mdev-unmanaged.args create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-hostdev-mdev-unmanaged.xml create mode 100644 tests/qemuxml2xmloutdata/qemuxml2xmlout-hostdev-mdev-unmanaged.xml -- 2.10.2

Beside creation, disposal, getter, and setter methods the module exports methods to work with lists of mediated devices. Signed-off-by: Erik Skultety <eskultet@redhat.com> --- po/POTFILES.in | 1 + src/Makefile.am | 1 + src/libvirt_private.syms | 17 +++ src/util/virmdev.c | 375 +++++++++++++++++++++++++++++++++++++++++++++++ src/util/virmdev.h | 85 +++++++++++ 5 files changed, 479 insertions(+) create mode 100644 src/util/virmdev.c create mode 100644 src/util/virmdev.h diff --git a/po/POTFILES.in b/po/POTFILES.in index 365ea66..53c674e 100644 --- a/po/POTFILES.in +++ b/po/POTFILES.in @@ -218,6 +218,7 @@ src/util/virlease.c src/util/virlockspace.c src/util/virlog.c src/util/virmacmap.c +src/util/virmdev.c src/util/virnetdev.c src/util/virnetdevbandwidth.c src/util/virnetdevbridge.c diff --git a/src/Makefile.am b/src/Makefile.am index 2f32d41..6510e1d 100644 --- a/src/Makefile.am +++ b/src/Makefile.am @@ -186,6 +186,7 @@ UTIL_SOURCES = \ util/viruuid.c util/viruuid.h \ util/virxdrdefs.h \ util/virxml.c util/virxml.h \ + util/virmdev.c util/virmdev.h \ $(NULL) EXTRA_DIST += $(srcdir)/util/keymaps.csv $(srcdir)/util/virkeycode-mapgen.py diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index 8e994c7..4047945 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -1960,6 +1960,23 @@ virMacMapNew; virMacMapRemove; virMacMapWriteFile; +# util/virmdev.h +virMediatedDeviceFree; +virMediatedDeviceGetIOMMUGroupDev; +virMediatedDeviceGetPath; +virMediatedDeviceGetUsedBy; +virMediatedDeviceListAdd; +virMediatedDeviceListCount; +virMediatedDeviceListDel; +virMediatedDeviceListFind; +virMediatedDeviceListGet; +virMediatedDeviceListNew; +virMediatedDeviceListSteal; +virMediatedDeviceListStealIndex; +virMediatedDeviceNew; +virMediatedDeviceSetUsedBy; + + # util/virnetdev.h virNetDevAddMulti; diff --git a/src/util/virmdev.c b/src/util/virmdev.c new file mode 100644 index 0000000..0b70798 --- /dev/null +++ b/src/util/virmdev.c @@ -0,0 +1,375 @@ +/* + * virmdev.c: helper APIs for managing host MDEV devices + * + * Copyright (C) 2017-2018 Red Hat, Inc. + * + * This library is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * This library is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this library. If not, see + * <http://www.gnu.org/licenses/>. + * + * Authors: + * Erik Skultety <eskultet@redhat.com> + */ + +#include <config.h> + +#include <fcntl.h> +#include <inttypes.h> +#include <limits.h> +#include <stdio.h> +#include <string.h> +#include <sys/types.h> +#include <sys/stat.h> +#include <unistd.h> +#include <stdlib.h> + +#include "virmdev.h" +#include "dirname.h" +#include "virlog.h" +#include "viralloc.h" +#include "vircommand.h" +#include "virerror.h" +#include "virfile.h" +#include "virkmod.h" +#include "virstring.h" +#include "virutil.h" +#include "viruuid.h" +#include "virhostdev.h" + +VIR_LOG_INIT("util.mdev"); + +struct _virMediatedDevice { + char *path; /* sysfs path */ + bool managed; + + char *used_by_drvname; + char *used_by_domname; +}; + +struct _virMediatedDeviceList { + virObjectLockable parent; + + size_t count; + virMediatedDevicePtr *devs; +}; + + +/* For virReportOOMError() and virReportSystemError() */ +#define VIR_FROM_THIS VIR_FROM_NONE + +static virClassPtr virMediatedDeviceListClass; + +static void virMediatedDeviceListDispose(void *obj); + +static int virMediatedOnceInit(void) +{ + if (!(virMediatedDeviceListClass = virClassNew(virClassForObjectLockable(), + "virMediatedDeviceList", + sizeof(virMediatedDeviceList), + virMediatedDeviceListDispose))) + return -1; + + return 0; +} + +VIR_ONCE_GLOBAL_INIT(virMediated) + +#ifdef __linux__ +# define MDEV_SYSFS "/sys/class/mdev_bus/" + +virMediatedDevicePtr +virMediatedDeviceNew(virPCIDeviceAddressPtr pciaddr, const char *uuidstr) +{ + virMediatedDevicePtr dev = NULL, ret = NULL; + char *pcistr = NULL; + + if (virAsprintf(&pcistr, "%.4x:%.2x:%.2x.%.1x", pciaddr->domain, + pciaddr->bus, pciaddr->slot, pciaddr->function) < 0) + return NULL; + + if (VIR_ALLOC(dev) < 0) + goto cleanup; + + if (virAsprintf(&dev->path, MDEV_SYSFS "%s/%s", pcistr, uuidstr) < 0) + goto cleanup; + + ret = dev; + dev = NULL; + + cleanup: + VIR_FREE(pcistr); + virMediatedDeviceFree(dev); + return ret; +} + + +void +virMediatedDeviceFree(virMediatedDevicePtr dev) +{ + if (!dev) + return; + VIR_FREE(dev->path); + VIR_FREE(dev->used_by_drvname); + VIR_FREE(dev->used_by_domname); + VIR_FREE(dev); +} + + +const char * +virMediatedDeviceGetPath(virMediatedDevicePtr dev) +{ + return dev->path; +} + + +/* Returns an absolute canonicalized path to the device used to control the + * mediated device's IOMMU group (e.g. "/dev/vfio/15") + */ +char * +virMediatedDeviceGetIOMMUGroupDev(virMediatedDevicePtr dev) +{ + char *resultpath = NULL; + char *iommu_linkpath = NULL; + char *vfio_path = NULL; + + if (virAsprintf(&iommu_linkpath, "%s/iommu_group", dev->path) < 0) + return NULL; + + if (virFileIsLink(iommu_linkpath) != 1) { + virReportError(VIR_ERR_INTERNAL_ERROR, + _("IOMMU group file %s is not a symlink"), + iommu_linkpath); + goto cleanup; + } + + if (virFileResolveLink(iommu_linkpath, &resultpath) < 0) { + virReportError(VIR_ERR_INTERNAL_ERROR, + _("Unable to resolve IOMMU group symlink %s"), + iommu_linkpath); + goto cleanup; + } + + if (virAsprintf(&vfio_path, "/dev/vfio/%s", last_component(resultpath)) < 0) + goto cleanup; + + cleanup: + VIR_FREE(resultpath); + VIR_FREE(iommu_linkpath); + return vfio_path; +} + + +void +virMediatedDeviceGetUsedBy(virMediatedDevicePtr dev, + char **drvname, char **domname) +{ + *drvname = dev->used_by_drvname; + *domname = dev->used_by_domname; +} + + +int +virMediatedDeviceSetUsedBy(virMediatedDevicePtr dev, + const char *drvname, + const char *domname) +{ + VIR_FREE(dev->used_by_drvname); + VIR_FREE(dev->used_by_domname); + if (VIR_STRDUP(dev->used_by_drvname, drvname) < 0) + return -1; + if (VIR_STRDUP(dev->used_by_domname, domname) < 0) + return -1; + + return 0; +} + + +#else +static const char *unsupported = N_("not supported on non-linux platforms"); + +virMediatedDevicePtr +virMediatedDeviceNew(virPCIDeviceAddressPtr pciaddr ATTRIBUTE_UNUSED, + const char *uuidstr ATTRIBUTE_UNUSED) +{ + virRerportError(VIR_ERR_INTERNAL_ERROR, "%s", _(unsupported)); + return NULL; +} + + +char * +virMediatedDeviceGetIOMMUGroupDev(virMediatedDevicePtr dev ATTRIBUTE_UNUSED) +{ + virRerportError(VIR_ERR_INTERNAL_ERROR, "%s", _(unsupported)); + return NULL; +} + + +void +virMediatedDeviceFree(virMediatedDevicePtr dev ATTRIBUTE_UNUSED) +{ + virRerportError(VIR_ERR_INTERNAL_ERROR, "%s", _(unsupported)); +} + + +const char * +virMediatedDeviceGetPath(virMediatedDevicePtr dev ATTRIBUTE_UNUSED) +{ + virRerportError(VIR_ERR_INTERNAL_ERROR, "%s", _(unsupported)); + return NULL; +} + + +void +virMediatedDeviceGetUsedBy(virMediatedDevicePtr dev, + char **drvname, char **domname) +{ + virRerportError(VIR_ERR_INTERNAL_ERROR, "%s", _(unsupported)); + *drvname = NULL; + *domname = NULL; +} + + +int +virMediatedDeviceSetUsedBy(virMediatedDevicePtr dev ATTRIBUTE_UNUSED, + const char *drvname ATTRIBUTE_UNUSED, + const char *domname ATTRIBUTE_UNUSED) +{ + virRerportError(VIR_ERR_INTERNAL_ERROR, "%s", _(unsupported)); + return -1; +} +#endif /* __linux__ */ + +virMediatedDeviceListPtr +virMediatedDeviceListNew(void) +{ + virMediatedDeviceListPtr list; + + if (virMediatedInitialize() < 0) + return NULL; + + if (!(list = virObjectLockableNew(virMediatedDeviceListClass))) + return NULL; + + return list; +} + + +static void +virMediatedDeviceListDispose(void *obj) +{ + virMediatedDeviceListPtr list = obj; + size_t i; + + for (i = 0; i < list->count; i++) { + virMediatedDeviceFree(list->devs[i]); + list->devs[i] = NULL; + } + + list->count = 0; + VIR_FREE(list->devs); +} + + +int +virMediatedDeviceListAdd(virMediatedDeviceListPtr list, + virMediatedDevicePtr dev) +{ + if (virMediatedDeviceListFind(list, dev)) { + virReportError(VIR_ERR_INTERNAL_ERROR, + _("Device %s is already in use"), dev->path); + return -1; + } + return VIR_APPEND_ELEMENT(list->devs, list->count, dev); +} + + +virMediatedDevicePtr +virMediatedDeviceListGet(virMediatedDeviceListPtr list, + int idx) +{ + if (idx >= list->count) + return NULL; + if (idx < 0) + return NULL; + + return list->devs[idx]; +} + + +size_t +virMediatedDeviceListCount(virMediatedDeviceListPtr list) +{ + return list->count; +} + + +virMediatedDevicePtr +virMediatedDeviceListStealIndex(virMediatedDeviceListPtr list, + int idx) +{ + virMediatedDevicePtr ret; + + if (idx < 0 || idx >= list->count) + return NULL; + + ret = list->devs[idx]; + VIR_DELETE_ELEMENT(list->devs, idx, list->count); + return ret; +} + + +virMediatedDevicePtr +virMediatedDeviceListSteal(virMediatedDeviceListPtr list, + virMediatedDevicePtr dev) +{ + int idx = virMediatedDeviceListFindIndex(list, dev); + + return virMediatedDeviceListStealIndex(list, idx); +} + + +void +virMediatedDeviceListDel(virMediatedDeviceListPtr list, + virMediatedDevicePtr dev) +{ + virMediatedDevicePtr ret = virMediatedDeviceListSteal(list, dev); + virMediatedDeviceFree(ret); +} + + +int +virMediatedDeviceListFindIndex(virMediatedDeviceListPtr list, + virMediatedDevicePtr dev) +{ + size_t i; + + for (i = 0; i < list->count; i++) { + virMediatedDevicePtr other = list->devs[i]; + if (STREQ(other->path, dev->path)) + return i; + } + return -1; +} + + +virMediatedDevicePtr +virMediatedDeviceListFind(virMediatedDeviceListPtr list, + virMediatedDevicePtr dev) +{ + int idx; + + if ((idx = virMediatedDeviceListFindIndex(list, dev)) >= 0) + return list->devs[idx]; + else + return NULL; +} diff --git a/src/util/virmdev.h b/src/util/virmdev.h new file mode 100644 index 0000000..c4d97f3 --- /dev/null +++ b/src/util/virmdev.h @@ -0,0 +1,85 @@ +/* + * virmdev.h: helper APIs for managing host mediated devices + * + * Copyright (C) 2017-2018 Red Hat, Inc. + * + * This library is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * This library is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this library. If not, see + * <http://www.gnu.org/licenses/>. + * + * Authors: + * Erik Skultety <eskultet@redhat.com> + */ + +#ifndef __VIR_MDEV_H__ +# define __VIR_MDEV_H__ + +# include "internal.h" +# include "virobject.h" +# include "virutil.h" +# include "virpci.h" + +typedef struct _virMediatedDevice virMediatedDevice; +typedef virMediatedDevice *virMediatedDevicePtr; +typedef struct _virMediatedDeviceAddress virMediatedDeviceAddress; +typedef virMediatedDeviceAddress *virMediatedDeviceAddressPtr; +typedef struct _virMediatedDeviceList virMediatedDeviceList; +typedef virMediatedDeviceList *virMediatedDeviceListPtr; + +typedef int (*virMediatedDeviceCallback)(virMediatedDevicePtr dev, + const char *path, void *opaque); + +virMediatedDevicePtr virMediatedDeviceNew(virPCIDeviceAddressPtr addr, + const char *uuidstr); + +virMediatedDevicePtr virMediatedDeviceCopy(virMediatedDevicePtr dev); + +void virMediatedDeviceFree(virMediatedDevicePtr dev); + +const char *virMediatedDeviceGetPath(virMediatedDevicePtr dev); + +void virMediatedDeviceGetUsedBy(virMediatedDevicePtr dev, + char **drvname, char **domname); + +int virMediatedDeviceSetUsedBy(virMediatedDevicePtr dev, + const char *drvname, + const char *domname); + +char *virMediatedDeviceGetIOMMUGroupDev(virMediatedDevicePtr dev); + +virMediatedDeviceListPtr virMediatedDeviceListNew(void); + +int virMediatedDeviceListAdd(virMediatedDeviceListPtr list, + virMediatedDevicePtr dev); + +virMediatedDevicePtr virMediatedDeviceListGet(virMediatedDeviceListPtr list, + int idx); + +size_t virMediatedDeviceListCount(virMediatedDeviceListPtr list); + +virMediatedDevicePtr virMediatedDeviceListSteal(virMediatedDeviceListPtr list, + virMediatedDevicePtr dev); + +virMediatedDevicePtr virMediatedDeviceListStealIndex(virMediatedDeviceListPtr list, + int idx); + +void virMediatedDeviceListDel(virMediatedDeviceListPtr list, + virMediatedDevicePtr dev); + +virMediatedDevicePtr virMediatedDeviceListFind(virMediatedDeviceListPtr list, + virMediatedDevicePtr dev); + +int virMediatedDeviceListFindIndex(virMediatedDeviceListPtr list, + virMediatedDevicePtr dev); + +#endif /* __VIR_MDEV_H__ */ -- 2.10.2

On 06.02.2017 13:19, Erik Skultety wrote:
Beside creation, disposal, getter, and setter methods the module exports methods to work with lists of mediated devices.
Signed-off-by: Erik Skultety <eskultet@redhat.com> --- po/POTFILES.in | 1 + src/Makefile.am | 1 + src/libvirt_private.syms | 17 +++ src/util/virmdev.c | 375 +++++++++++++++++++++++++++++++++++++++++++++++ src/util/virmdev.h | 85 +++++++++++ 5 files changed, 479 insertions(+) create mode 100644 src/util/virmdev.c create mode 100644 src/util/virmdev.h
diff --git a/src/util/virmdev.c b/src/util/virmdev.c new file mode 100644 index 0000000..0b70798 --- /dev/null +++ b/src/util/virmdev.c @@ -0,0 +1,375 @@ +/* + * virmdev.c: helper APIs for managing host MDEV devices + * + * Copyright (C) 2017-2018 Red Hat, Inc. + * + * This library is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * This library is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this library. If not, see + * <http://www.gnu.org/licenses/>. + * + * Authors: + * Erik Skultety <eskultet@redhat.com> + */ + +#include <config.h> + +#include <fcntl.h> +#include <inttypes.h> +#include <limits.h> +#include <stdio.h> +#include <string.h> +#include <sys/types.h> +#include <sys/stat.h> +#include <unistd.h> +#include <stdlib.h> + +#include "virmdev.h" +#include "dirname.h" +#include "virlog.h" +#include "viralloc.h" +#include "vircommand.h" +#include "virerror.h" +#include "virfile.h" +#include "virkmod.h" +#include "virstring.h" +#include "virutil.h" +#include "viruuid.h" +#include "virhostdev.h" + +VIR_LOG_INIT("util.mdev"); + +struct _virMediatedDevice { + char *path; /* sysfs path */ + bool managed; + + char *used_by_drvname; + char *used_by_domname; +}; + +struct _virMediatedDeviceList { + virObjectLockable parent; + + size_t count; + virMediatedDevicePtr *devs; +}; + + +/* For virReportOOMError() and virReportSystemError() */ +#define VIR_FROM_THIS VIR_FROM_NONE + +static virClassPtr virMediatedDeviceListClass; + +static void virMediatedDeviceListDispose(void *obj); + +static int virMediatedOnceInit(void) +{ + if (!(virMediatedDeviceListClass = virClassNew(virClassForObjectLockable(), + "virMediatedDeviceList", + sizeof(virMediatedDeviceList), + virMediatedDeviceListDispose))) + return -1; + + return 0; +} + +VIR_ONCE_GLOBAL_INIT(virMediated) + +#ifdef __linux__ +# define MDEV_SYSFS "/sys/class/mdev_bus/"
Some functions in this block look not-linux specific at all. For instance there's nothing linux specific about virMediatedDeviceFree(), GetPath() and so on. Usually, if running on unsupported environment, erroring out in New() or Create() methods is good. The rest of the code is not platform/environment/stellar constellation specific.
+ +virMediatedDevicePtr +virMediatedDeviceNew(virPCIDeviceAddressPtr pciaddr, const char *uuidstr) +{ + virMediatedDevicePtr dev = NULL, ret = NULL; + char *pcistr = NULL; + + if (virAsprintf(&pcistr, "%.4x:%.2x:%.2x.%.1x", pciaddr->domain, + pciaddr->bus, pciaddr->slot, pciaddr->function) < 0) + return NULL; + + if (VIR_ALLOC(dev) < 0) + goto cleanup; + + if (virAsprintf(&dev->path, MDEV_SYSFS "%s/%s", pcistr, uuidstr) < 0) + goto cleanup; + + ret = dev; + dev = NULL; + + cleanup: + VIR_FREE(pcistr); + virMediatedDeviceFree(dev); + return ret; +} + + +void +virMediatedDeviceFree(virMediatedDevicePtr dev) +{ + if (!dev) + return; + VIR_FREE(dev->path); + VIR_FREE(dev->used_by_drvname); + VIR_FREE(dev->used_by_domname); + VIR_FREE(dev); +} + + +const char * +virMediatedDeviceGetPath(virMediatedDevicePtr dev) +{ + return dev->path; +} + + +/* Returns an absolute canonicalized path to the device used to control the + * mediated device's IOMMU group (e.g. "/dev/vfio/15") + */ +char * +virMediatedDeviceGetIOMMUGroupDev(virMediatedDevicePtr dev) +{ + char *resultpath = NULL; + char *iommu_linkpath = NULL; + char *vfio_path = NULL; + + if (virAsprintf(&iommu_linkpath, "%s/iommu_group", dev->path) < 0) + return NULL; + + if (virFileIsLink(iommu_linkpath) != 1) { + virReportError(VIR_ERR_INTERNAL_ERROR, + _("IOMMU group file %s is not a symlink"), + iommu_linkpath); + goto cleanup; + } + + if (virFileResolveLink(iommu_linkpath, &resultpath) < 0) { + virReportError(VIR_ERR_INTERNAL_ERROR, + _("Unable to resolve IOMMU group symlink %s"), + iommu_linkpath); + goto cleanup; + } + + if (virAsprintf(&vfio_path, "/dev/vfio/%s", last_component(resultpath)) < 0) + goto cleanup; + + cleanup: + VIR_FREE(resultpath); + VIR_FREE(iommu_linkpath); + return vfio_path; +} + + +void +virMediatedDeviceGetUsedBy(virMediatedDevicePtr dev, + char **drvname, char **domname)
s/char **/const char **/ as we want to let caller know to not free the values.
+{ + *drvname = dev->used_by_drvname; + *domname = dev->used_by_domname; +} + +
+ + +virMediatedDevicePtr +virMediatedDeviceListGet(virMediatedDeviceListPtr list, + int idx)
s/int/ssize_t/ Not that it would matter too much, but I had to point something else too :-)
+{ + if (idx >= list->count) + return NULL; + if (idx < 0) + return NULL; + + return list->devs[idx]; +} + +
Michal

A mediated device will be identified by the host PCI address of the parent physical device which is backing the given mediated device, and a UUID of the user pre-created mediated device. The data necessary to identify a mediated device can be easily extended in the future, once we need to enable managed='yes' in which case a hint from the upper management layer about which mediated device type (e.g. vGPU type) should an instance be created on. Signed-off-by: Erik Skultety <eskultet@redhat.com> --- src/conf/domain_conf.c | 7 ++++++- src/conf/domain_conf.h | 10 ++++++++++ src/qemu/qemu_cgroup.c | 5 +++++ src/qemu/qemu_domain.c | 1 + src/qemu/qemu_hotplug.c | 2 ++ src/security/security_apparmor.c | 3 +++ src/security/security_dac.c | 2 ++ src/security/security_selinux.c | 2 ++ tests/domaincapsschemadata/full.xml | 1 + 9 files changed, 32 insertions(+), 1 deletion(-) diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index c06b128..38ffc95 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -649,7 +649,8 @@ VIR_ENUM_IMPL(virDomainHostdevSubsys, VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_LAST, "usb", "pci", "scsi", - "scsi_host") + "scsi_host", + "mdev") VIR_ENUM_IMPL(virDomainHostdevSubsysPCIBackend, VIR_DOMAIN_HOSTDEV_PCI_BACKEND_TYPE_LAST, @@ -6453,6 +6454,8 @@ virDomainHostdevDefParseXMLSubsys(xmlNodePtr node, if (virDomainHostdevSubsysSCSIVHostDefParseXML(sourcenode, def) < 0) goto error; break; + case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_MDEV: + break; default: virReportError(VIR_ERR_CONFIG_UNSUPPORTED, @@ -13281,6 +13284,7 @@ virDomainHostdevDefParseXML(virDomainXMLOptionPtr xmlopt, } break; case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_USB: + case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_MDEV: case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_LAST: break; } @@ -14172,6 +14176,7 @@ virDomainHostdevMatchSubsys(virDomainHostdevDefPtr a, return 1; else return 0; + case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_MDEV: case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_LAST: return 0; } diff --git a/src/conf/domain_conf.h b/src/conf/domain_conf.h index 507ace8..3a1009a 100644 --- a/src/conf/domain_conf.h +++ b/src/conf/domain_conf.h @@ -54,6 +54,7 @@ # include "virgic.h" # include "virperf.h" # include "virtypedparam.h" +# include "virpci.h" /* forward declarations of all device types, required by * virDomainDeviceDef @@ -295,6 +296,7 @@ typedef enum { VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI, VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_SCSI, VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_SCSI_HOST, + VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_MDEV, VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_LAST } virDomainHostdevSubsysType; @@ -369,6 +371,13 @@ struct _virDomainHostdevSubsysSCSI { } u; }; +typedef struct _virDomainHostdevSubsysMediatedDev virDomainHostdevSubsysMediatedDev; +typedef virDomainHostdevSubsysMediatedDev *virDomainHostdevSubsysMediatedDevPtr; +struct _virDomainHostdevSubsysMediatedDev { + virPCIDeviceAddress addr; /* parent device's host address */ + char uuidstr[VIR_UUID_STRING_BUFLEN]; /* mediated device's uuid string */ +}; + typedef enum { VIR_DOMAIN_HOSTDEV_SUBSYS_SCSI_HOST_PROTOCOL_TYPE_NONE, VIR_DOMAIN_HOSTDEV_SUBSYS_SCSI_HOST_PROTOCOL_TYPE_VHOST, @@ -394,6 +403,7 @@ struct _virDomainHostdevSubsys { virDomainHostdevSubsysPCI pci; virDomainHostdevSubsysSCSI scsi; virDomainHostdevSubsysSCSIVHost scsi_host; + virDomainHostdevSubsysMediatedDev mdev; } u; }; diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c index 6c90d46..0902624 100644 --- a/src/qemu/qemu_cgroup.c +++ b/src/qemu/qemu_cgroup.c @@ -433,6 +433,9 @@ qemuSetupHostdevCgroup(virDomainObjPtr vm, break; } + case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_MDEV: + break; + case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_LAST: break; } @@ -501,6 +504,8 @@ qemuTeardownHostdevCgroup(virDomainObjPtr vm, case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_SCSI_HOST: /* nothing to tear down for scsi_host */ break; + case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_MDEV: + break; case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_LAST: break; } diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index c6ce090..3006d78 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -6901,6 +6901,7 @@ qemuDomainGetHostdevPath(virDomainHostdevDefPtr dev, break; } + case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_MDEV: case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_LAST: break; } diff --git a/src/qemu/qemu_hotplug.c b/src/qemu/qemu_hotplug.c index 57ecc02..27ece86 100644 --- a/src/qemu/qemu_hotplug.c +++ b/src/qemu/qemu_hotplug.c @@ -3847,6 +3847,8 @@ qemuDomainRemoveHostDevice(virQEMUDriverPtr driver, case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_SCSI_HOST: qemuDomainRemoveSCSIVHostDevice(driver, vm, hostdev); break; + case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_MDEV: + break; case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_LAST: break; } diff --git a/src/security/security_apparmor.c b/src/security/security_apparmor.c index 2c33abb..8395efc 100644 --- a/src/security/security_apparmor.c +++ b/src/security/security_apparmor.c @@ -931,6 +931,9 @@ AppArmorSetSecurityHostdevLabel(virSecurityManagerPtr mgr, break; } + case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_MDEV: + break; + case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_LAST: ret = 0; break; diff --git a/src/security/security_dac.c b/src/security/security_dac.c index 6721917..ecce1d3 100644 --- a/src/security/security_dac.c +++ b/src/security/security_dac.c @@ -964,6 +964,7 @@ virSecurityDACSetHostdevLabel(virSecurityManagerPtr mgr, break; } + case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_MDEV: case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_LAST: ret = 0; break; @@ -1119,6 +1120,7 @@ virSecurityDACRestoreHostdevLabel(virSecurityManagerPtr mgr, break; } + case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_MDEV: case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_LAST: ret = 0; break; diff --git a/src/security/security_selinux.c b/src/security/security_selinux.c index e22de06..e152c72 100644 --- a/src/security/security_selinux.c +++ b/src/security/security_selinux.c @@ -1782,6 +1782,7 @@ virSecuritySELinuxSetHostdevSubsysLabel(virSecurityManagerPtr mgr, break; } + case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_MDEV: case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_LAST: ret = 0; break; @@ -2009,6 +2010,7 @@ virSecuritySELinuxRestoreHostdevSubsysLabel(virSecurityManagerPtr mgr, break; } + case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_MDEV: case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_LAST: ret = 0; break; diff --git a/tests/domaincapsschemadata/full.xml b/tests/domaincapsschemadata/full.xml index 6abd499..6b43069 100644 --- a/tests/domaincapsschemadata/full.xml +++ b/tests/domaincapsschemadata/full.xml @@ -88,6 +88,7 @@ <value>pci</value> <value>scsi</value> <value>scsi_host</value> + <value>mdev</value> </enum> <enum name='capsType'> <value>storage</value> -- 2.10.2

On 06.02.2017 13:19, Erik Skultety wrote:
A mediated device will be identified by the host PCI address of the parent physical device which is backing the given mediated device, and a UUID of the user pre-created mediated device. The data necessary to identify a mediated device can be easily extended in the future, once we need to enable managed='yes' in which case a hint from the upper management layer about which mediated device type (e.g. vGPU type) should an instance be created on.
Signed-off-by: Erik Skultety <eskultet@redhat.com> --- src/conf/domain_conf.c | 7 ++++++- src/conf/domain_conf.h | 10 ++++++++++ src/qemu/qemu_cgroup.c | 5 +++++ src/qemu/qemu_domain.c | 1 + src/qemu/qemu_hotplug.c | 2 ++ src/security/security_apparmor.c | 3 +++ src/security/security_dac.c | 2 ++ src/security/security_selinux.c | 2 ++ tests/domaincapsschemadata/full.xml | 1 + 9 files changed, 32 insertions(+), 1 deletion(-)
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index c06b128..38ffc95 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -649,7 +649,8 @@ VIR_ENUM_IMPL(virDomainHostdevSubsys, VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_LAST, "usb", "pci", "scsi", - "scsi_host") + "scsi_host", + "mdev")
VIR_ENUM_IMPL(virDomainHostdevSubsysPCIBackend, VIR_DOMAIN_HOSTDEV_PCI_BACKEND_TYPE_LAST, @@ -6453,6 +6454,8 @@ virDomainHostdevDefParseXMLSubsys(xmlNodePtr node, if (virDomainHostdevSubsysSCSIVHostDefParseXML(sourcenode, def) < 0) goto error; break; + case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_MDEV: + break;
default: virReportError(VIR_ERR_CONFIG_UNSUPPORTED, @@ -13281,6 +13284,7 @@ virDomainHostdevDefParseXML(virDomainXMLOptionPtr xmlopt, } break; case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_USB: + case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_MDEV: case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_LAST: break; } @@ -14172,6 +14176,7 @@ virDomainHostdevMatchSubsys(virDomainHostdevDefPtr a, return 1; else return 0; + case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_MDEV: case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_LAST: return 0; } diff --git a/src/conf/domain_conf.h b/src/conf/domain_conf.h index 507ace8..3a1009a 100644 --- a/src/conf/domain_conf.h +++ b/src/conf/domain_conf.h @@ -54,6 +54,7 @@ # include "virgic.h" # include "virperf.h" # include "virtypedparam.h" +# include "virpci.h"
/* forward declarations of all device types, required by * virDomainDeviceDef @@ -295,6 +296,7 @@ typedef enum { VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI, VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_SCSI, VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_SCSI_HOST, + VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_MDEV,
VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_LAST } virDomainHostdevSubsysType; @@ -369,6 +371,13 @@ struct _virDomainHostdevSubsysSCSI { } u; };
+typedef struct _virDomainHostdevSubsysMediatedDev virDomainHostdevSubsysMediatedDev; +typedef virDomainHostdevSubsysMediatedDev *virDomainHostdevSubsysMediatedDevPtr; +struct _virDomainHostdevSubsysMediatedDev { + virPCIDeviceAddress addr; /* parent device's host address */ + char uuidstr[VIR_UUID_STRING_BUFLEN]; /* mediated device's uuid string */
Either this or VIR_UUID_BUFLEN for storing UUID in raw format, which is what we typically do. But I don't care that much.
+}; +
Michal

To keep the domain XML as much platform agnostic as possible, do not expose an element/attribute which would contain path directly to the syfs filesystem which the mediated devices are build upon. Instead, identify each mediated device by the parent physical device and a UUID which is an optional element, but only if managed='yes' which is not implemented yet. Signed-off-by: Erik Skultety <eskultet@redhat.com> --- docs/schemas/domaincommon.rng | 17 +++++++++++++++++ 1 file changed, 17 insertions(+) diff --git a/docs/schemas/domaincommon.rng b/docs/schemas/domaincommon.rng index cc6e0d0..087ca82 100644 --- a/docs/schemas/domaincommon.rng +++ b/docs/schemas/domaincommon.rng @@ -3984,6 +3984,7 @@ <ref name="hostdevsubsysusb"/> <ref name="hostdevsubsysscsi"/> <ref name="hostdevsubsyshost"/> + <ref name="hostdevsubsysmdev"/> </choice> </define> @@ -4134,6 +4135,22 @@ </element> </define> + <define name="hostdevsubsysmdev"> + <attribute name="type"> + <value>mdev</value> + </attribute> + <element name="source"> + <element name="address"> + <ref name="pciaddress"/> + </element> + <optional> + <element name="uuid"> + <ref name="UUID"/> + </element> + </optional> + </element> + </define> + <define name="hostdevcapsstorage"> <attribute name="type"> <value>storage</value> -- 2.10.2

On 06.02.2017 13:19, Erik Skultety wrote:
To keep the domain XML as much platform agnostic as possible, do not expose an element/attribute which would contain path directly to the syfs filesystem which the mediated devices are build upon. Instead, identify each mediated device by the parent physical device and a UUID which is an optional element, but only if managed='yes' which is not implemented yet.
Signed-off-by: Erik Skultety <eskultet@redhat.com> --- docs/schemas/domaincommon.rng | 17 +++++++++++++++++ 1 file changed, 17 insertions(+)
This one, 04/16, 05/16 and 16/16 should be merged together. Michal

As for the parser, the uuid element is optional and a UUID will be generated automatically if missing unless the device is unmanaged (default) in which case the element is mandatory, otherwise libvirt wouldn't have means to identify the device uniquely. Signed-off-by: Erik Skultety <eskultet@redhat.com> --- src/conf/domain_conf.c | 45 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 45 insertions(+) diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 38ffc95..f7bdd7a 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -6326,6 +6326,49 @@ virDomainHostdevSubsysSCSIVHostDefParseXML(xmlNodePtr sourcenode, return ret; } +static int +virDomainHostdevSubsysMediatedDevDefParseXML(virDomainHostdevDefPtr def, + xmlXPathContextPtr ctxt) +{ + int ret = -1; + unsigned char uuid[VIR_UUID_BUFLEN] = {0}; + char *uuidxml = NULL; + xmlNodePtr node = NULL; + + virDomainHostdevSubsysMediatedDevPtr mdevsrc = &def->source.subsys.u.mdev; + virPCIDeviceAddressPtr addr = &mdevsrc->addr; + + node = virXPathNode("./source/address", ctxt); + if (virPCIDeviceAddressParseXML(node, addr) < 0) + return -1; + + uuidxml = virXPathString("string(./source/uuid)", ctxt); + if (!uuidxml && !def->managed) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("uuid element is mandatory for unmanaged devices")); + goto cleanup; + } + + if (uuidxml) { + if (virUUIDParse(uuidxml, uuid) < 0) { + virReportError(VIR_ERR_INTERNAL_ERROR, + "%s", _("malformed uuid element")); + goto cleanup; + } + } else { + if (virUUIDGenerate(uuid)) { + virReportError(VIR_ERR_INTERNAL_ERROR, + "%s", _("Failed to generate UUID")); + goto cleanup; + } + } + + virUUIDFormat(uuid, mdevsrc->uuidstr); + ret = 0; + cleanup: + VIR_FREE(uuidxml); + return ret; +} static int virDomainHostdevDefParseXMLSubsys(xmlNodePtr node, @@ -6455,6 +6498,8 @@ virDomainHostdevDefParseXMLSubsys(xmlNodePtr node, goto error; break; case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_MDEV: + if (virDomainHostdevSubsysMediatedDevDefParseXML(def, ctxt) < 0) + goto error; break; default: -- 2.10.2

Signed-off-by: Erik Skultety <eskultet@redhat.com> --- src/conf/domain_conf.c | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index f7bdd7a..9663350 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -21140,6 +21140,7 @@ virDomainHostdevDefFormatSubsys(virBufferPtr buf, virDomainHostdevSubsysPCIPtr pcisrc = &def->source.subsys.u.pci; virDomainHostdevSubsysSCSIPtr scsisrc = &def->source.subsys.u.scsi; virDomainHostdevSubsysSCSIVHostPtr hostsrc = &def->source.subsys.u.scsi_host; + virDomainHostdevSubsysMediatedDevPtr mdevsrc = &def->source.subsys.u.mdev; virDomainHostdevSubsysSCSIHostPtr scsihostsrc = &scsisrc->u.host; virDomainHostdevSubsysSCSIiSCSIPtr iscsisrc = &scsisrc->u.iscsi; @@ -21244,6 +21245,15 @@ virDomainHostdevDefFormatSubsys(virBufferPtr buf, break; case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_SCSI_HOST: break; + case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_MDEV: + if (virPCIDeviceAddressFormat(buf, mdevsrc->addr, + includeTypeInAddr) != 0) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("PCI address Formatting failed")); + return -1; + } + virBufferAsprintf(buf, "<uuid>%s</uuid>\n", mdevsrc->uuidstr); + break; default: virReportError(VIR_ERR_INTERNAL_ERROR, _("unexpected hostdev type %d"), -- 2.10.2

Label the VFIO IOMMU devices under /dev/vfio/ referenced by the symlinks in the sysfs (e.g. /sys/class/mdev_bus/<uuid>/iommu_group) which what qemu actually gets formatted on the command line. Signed-off-by: Erik Skultety <eskultet@redhat.com> --- src/security/security_dac.c | 58 +++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 56 insertions(+), 2 deletions(-) diff --git a/src/security/security_dac.c b/src/security/security_dac.c index ecce1d3..93cb66f 100644 --- a/src/security/security_dac.c +++ b/src/security/security_dac.c @@ -33,6 +33,7 @@ #include "virfile.h" #include "viralloc.h" #include "virlog.h" +#include "virmdev.h" #include "virpci.h" #include "virusb.h" #include "virscsi.h" @@ -856,6 +857,15 @@ virSecurityDACSetHostLabel(virSCSIVHostDevicePtr dev ATTRIBUTE_UNUSED, static int +virSecurityDACSetMediatedDevLabel(virMediatedDevicePtr dev ATTRIBUTE_UNUSED, + const char *file, + void *opaque) +{ + return virSecurityDACSetHostdevLabelHelper(file, opaque); +} + + +static int virSecurityDACSetHostdevLabel(virSecurityManagerPtr mgr, virDomainDefPtr def, virDomainHostdevDefPtr dev, @@ -867,7 +877,9 @@ virSecurityDACSetHostdevLabel(virSecurityManagerPtr mgr, virDomainHostdevSubsysPCIPtr pcisrc = &dev->source.subsys.u.pci; virDomainHostdevSubsysSCSIPtr scsisrc = &dev->source.subsys.u.scsi; virDomainHostdevSubsysSCSIVHostPtr hostsrc = &dev->source.subsys.u.scsi_host; + virDomainHostdevSubsysMediatedDevPtr mdevsrc = &dev->source.subsys.u.mdev; int ret = -1; + virMediatedDevicePtr mdev = NULL; if (!priv->dynamicOwnership) return 0; @@ -964,13 +976,26 @@ virSecurityDACSetHostdevLabel(virSecurityManagerPtr mgr, break; } - case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_MDEV: + case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_MDEV: { + char *vfio_dev = NULL; + if (!(mdev = virMediatedDeviceNew(&mdevsrc->addr, mdevsrc->uuidstr))) + goto done; + + if (!(vfio_dev = virMediatedDeviceGetIOMMUGroupDev(mdev))) + goto done; + + ret = virSecurityDACSetMediatedDevLabel(mdev, vfio_dev, &cbdata); + VIR_FREE(vfio_dev); + break; + } + case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_LAST: ret = 0; break; } done: + virMediatedDeviceFree(mdev); return ret; } @@ -1018,6 +1043,15 @@ virSecurityDACRestoreHostLabel(virSCSIVHostDevicePtr dev ATTRIBUTE_UNUSED, return virSecurityDACRestoreFileLabel(priv, file); } +static int +virSecurityDACRestoreMediatedDevLabel(virMediatedDevicePtr dev ATTRIBUTE_UNUSED, + const char *file, + void *opaque) +{ + virSecurityManagerPtr mgr = opaque; + virSecurityDACDataPtr priv = virSecurityManagerGetPrivateData(mgr); + return virSecurityDACRestoreFileLabel(priv, file); +} static int virSecurityDACRestoreHostdevLabel(virSecurityManagerPtr mgr, @@ -1032,6 +1066,7 @@ virSecurityDACRestoreHostdevLabel(virSecurityManagerPtr mgr, virDomainHostdevSubsysPCIPtr pcisrc = &dev->source.subsys.u.pci; virDomainHostdevSubsysSCSIPtr scsisrc = &dev->source.subsys.u.scsi; virDomainHostdevSubsysSCSIVHostPtr hostsrc = &dev->source.subsys.u.scsi_host; + virDomainHostdevSubsysMediatedDevPtr mdevsrc = &dev->source.subsys.u.mdev; int ret = -1; secdef = virDomainDefGetSecurityLabelDef(def, SECURITY_DAC_NAME); @@ -1120,7 +1155,26 @@ virSecurityDACRestoreHostdevLabel(virSecurityManagerPtr mgr, break; } - case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_MDEV: + case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_MDEV: { + char *vfiodev = NULL; + virMediatedDevicePtr mdev = virMediatedDeviceNew(&mdevsrc->addr, + mdevsrc->uuidstr); + + if (!mdev) + goto done; + + if (!(vfiodev = virMediatedDeviceGetIOMMUGroupDev(mdev))) { + virMediatedDeviceFree(mdev); + goto done; + } + + ret = virSecurityDACRestoreMediatedDevLabel(mdev, vfiodev, mgr); + + VIR_FREE(vfiodev); + virMediatedDeviceFree(mdev); + break; + } + case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_LAST: ret = 0; break; -- 2.10.2

Label the VFIO IOMMU devices under /dev/vfio/ referenced by the symlinks in the sysfs (e.g. /sys/class/mdev_bus/<uuid>/iommu_group) which what qemu actually gets formatted on the command line. Signed-off-by: Erik Skultety <eskultet@redhat.com> --- src/security/security_selinux.c | 57 +++++++++++++++++++++++++++++++++++++++-- 1 file changed, 55 insertions(+), 2 deletions(-) diff --git a/src/security/security_selinux.c b/src/security/security_selinux.c index e152c72..4f6b098 100644 --- a/src/security/security_selinux.c +++ b/src/security/security_selinux.c @@ -36,6 +36,7 @@ #include "virerror.h" #include "viralloc.h" #include "virlog.h" +#include "virmdev.h" #include "virpci.h" #include "virusb.h" #include "virscsi.h" @@ -1686,6 +1687,13 @@ virSecuritySELinuxSetHostLabel(virSCSIVHostDevicePtr dev ATTRIBUTE_UNUSED, } static int +virSecuritySELinuxSetMediatedDevLabel(virMediatedDevicePtr dev ATTRIBUTE_UNUSED, + const char *file, void *opaque) +{ + return virSecuritySELinuxSetHostdevLabelHelper(file, opaque); +} + +static int virSecuritySELinuxSetHostdevSubsysLabel(virSecurityManagerPtr mgr, virDomainDefPtr def, virDomainHostdevDefPtr dev, @@ -1696,7 +1704,9 @@ virSecuritySELinuxSetHostdevSubsysLabel(virSecurityManagerPtr mgr, virDomainHostdevSubsysPCIPtr pcisrc = &dev->source.subsys.u.pci; virDomainHostdevSubsysSCSIPtr scsisrc = &dev->source.subsys.u.scsi; virDomainHostdevSubsysSCSIVHostPtr hostsrc = &dev->source.subsys.u.scsi_host; + virDomainHostdevSubsysMediatedDevPtr mdevsrc = &dev->source.subsys.u.mdev; virSecuritySELinuxCallbackData data = {.mgr = mgr, .def = def}; + virMediatedDevicePtr mdev = NULL; int ret = -1; @@ -1782,13 +1792,26 @@ virSecuritySELinuxSetHostdevSubsysLabel(virSecurityManagerPtr mgr, break; } - case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_MDEV: + case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_MDEV: { + char *vfio_dev = NULL; + if (!(mdev = virMediatedDeviceNew(&mdevsrc->addr, mdevsrc->uuidstr))) + goto done; + + if (!(vfio_dev = virMediatedDeviceGetIOMMUGroupDev(mdev))) + goto done; + + ret = virSecuritySELinuxSetMediatedDevLabel(mdev, vfio_dev, &data); + VIR_FREE(vfio_dev); + break; + } + case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_LAST: ret = 0; break; } done: + virMediatedDeviceFree(mdev); return ret; } @@ -1918,6 +1941,16 @@ virSecuritySELinuxRestoreHostLabel(virSCSIVHostDevicePtr dev ATTRIBUTE_UNUSED, } static int +virSecuritySELinuxRestoreMediatedDevLabel(virMediatedDevicePtr dev ATTRIBUTE_UNUSED, + const char *file, + void *opaque) +{ + virSecurityManagerPtr mgr = opaque; + + return virSecuritySELinuxRestoreFileLabel(mgr, file); +} + +static int virSecuritySELinuxRestoreHostdevSubsysLabel(virSecurityManagerPtr mgr, virDomainHostdevDefPtr dev, const char *vroot) @@ -1927,6 +1960,7 @@ virSecuritySELinuxRestoreHostdevSubsysLabel(virSecurityManagerPtr mgr, virDomainHostdevSubsysPCIPtr pcisrc = &dev->source.subsys.u.pci; virDomainHostdevSubsysSCSIPtr scsisrc = &dev->source.subsys.u.scsi; virDomainHostdevSubsysSCSIVHostPtr hostsrc = &dev->source.subsys.u.scsi_host; + virDomainHostdevSubsysMediatedDevPtr mdevsrc = &dev->source.subsys.u.mdev; int ret = -1; /* Like virSecuritySELinuxRestoreImageLabelInt() for a networked @@ -2010,7 +2044,26 @@ virSecuritySELinuxRestoreHostdevSubsysLabel(virSecurityManagerPtr mgr, break; } - case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_MDEV: + case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_MDEV: { + char *vfiodev = NULL; + virMediatedDevicePtr mdev = virMediatedDeviceNew(&mdevsrc->addr, + mdevsrc->uuidstr); + + if (!mdev) + goto done; + + if (!(vfiodev = virMediatedDeviceGetIOMMUGroupDev(mdev))) { + virMediatedDeviceFree(mdev); + goto done; + } + + ret = virSecuritySELinuxRestoreMediatedDevLabel(mdev, vfiodev, mgr); + + VIR_FREE(vfiodev); + virMediatedDeviceFree(mdev); + break; + } + case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_LAST: ret = 0; break; -- 2.10.2

This merely introduces virDomainHostdevMatchSubsysMediatedDev method that is supposed to check whether device being cold-plugged does not already exist in the domain configuration. Signed-off-by: Erik Skultety <eskultet@redhat.com> --- src/conf/domain_conf.c | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 9663350..aa1bd68 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -14192,6 +14192,24 @@ virDomainHostdevMatchSubsysSCSIiSCSI(virDomainHostdevDefPtr first, } static int +virDomainHostdevMatchSubsysMediatedDev(virDomainHostdevDefPtr first, + virDomainHostdevDefPtr second) +{ + virDomainHostdevSubsysMediatedDevPtr first_mdevsrc = + &first->source.subsys.u.mdev; + virDomainHostdevSubsysMediatedDevPtr second_mdevsrc = + &second->source.subsys.u.mdev; + + if (STREQ(first_mdevsrc->uuidstr, second_mdevsrc->uuidstr) && + first_mdevsrc->addr.domain == second_mdevsrc->addr.domain && + first_mdevsrc->addr.bus == second_mdevsrc->addr.bus && + first_mdevsrc->addr.slot == second_mdevsrc->addr.slot && + first_mdevsrc->addr.function == second_mdevsrc->addr.function) + return 1; + return 0; +} + +static int virDomainHostdevMatchSubsys(virDomainHostdevDefPtr a, virDomainHostdevDefPtr b) { @@ -14222,6 +14240,7 @@ virDomainHostdevMatchSubsys(virDomainHostdevDefPtr a, else return 0; case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_MDEV: + return virDomainHostdevMatchSubsysMediatedDev(a, b); case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_LAST: return 0; } -- 2.10.2

On 06.02.2017 13:19, Erik Skultety wrote:
This merely introduces virDomainHostdevMatchSubsysMediatedDev method that is supposed to check whether device being cold-plugged does not already exist in the domain configuration.
Signed-off-by: Erik Skultety <eskultet@redhat.com> --- src/conf/domain_conf.c | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+)
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 9663350..aa1bd68 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -14192,6 +14192,24 @@ virDomainHostdevMatchSubsysSCSIiSCSI(virDomainHostdevDefPtr first, }
static int +virDomainHostdevMatchSubsysMediatedDev(virDomainHostdevDefPtr first, + virDomainHostdevDefPtr second) +{ + virDomainHostdevSubsysMediatedDevPtr first_mdevsrc = + &first->source.subsys.u.mdev; + virDomainHostdevSubsysMediatedDevPtr second_mdevsrc = + &second->source.subsys.u.mdev;
Had you gone with a/b, a_src/b_src this would nicely fit onto two lines.
+ + if (STREQ(first_mdevsrc->uuidstr, second_mdevsrc->uuidstr) && + first_mdevsrc->addr.domain == second_mdevsrc->addr.domain && + first_mdevsrc->addr.bus == second_mdevsrc->addr.bus && + first_mdevsrc->addr.slot == second_mdevsrc->addr.slot && + first_mdevsrc->addr.function == second_mdevsrc->addr.function)
Huh, we don't have virPCIAddrMatch or virPCIAddrCompare. :(
+ return 1; + return 0; +} + +static int virDomainHostdevMatchSubsys(virDomainHostdevDefPtr a, virDomainHostdevDefPtr b) { @@ -14222,6 +14240,7 @@ virDomainHostdevMatchSubsys(virDomainHostdevDefPtr a, else return 0; case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_MDEV: + return virDomainHostdevMatchSubsysMediatedDev(a, b); case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_LAST: return 0; }
Michal

So far, the official support is for x86_64 arch guests so unless a different device API than vfio-pci is available let's assume PCI address assignment. Once a different device API is introduced, we'll actually have to start checking for the device API in the sysfs in order to assign a correct type of address. Signed-off-by: Erik Skultety <eskultet@redhat.com> --- src/qemu/qemu_domain_address.c | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/src/qemu/qemu_domain_address.c b/src/qemu/qemu_domain_address.c index 70482f2..71a2e5c 100644 --- a/src/qemu/qemu_domain_address.c +++ b/src/qemu/qemu_domain_address.c @@ -618,9 +618,11 @@ qemuDomainDeviceCalculatePCIConnectFlags(virDomainDeviceDefPtr dev, virPCIDeviceAddressPtr hostAddr = &hostdev->source.subsys.u.pci.addr; if (hostdev->mode != VIR_DOMAIN_HOSTDEV_MODE_SUBSYS || - hostdev->source.subsys.type != VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI) { + (hostdev->source.subsys.type != + VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI && + hostdev->source.subsys.type != + VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_MDEV)) return 0; - } if (pciFlags == pcieFlags) { /* This arch/qemu only supports legacy PCI, so there @@ -642,6 +644,9 @@ qemuDomainDeviceCalculatePCIConnectFlags(virDomainDeviceDefPtr dev, return pcieFlags; } + if (hostdev->source.subsys.type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_MDEV) + return pcieFlags; + if (!(pciDev = virPCIDeviceNew(hostAddr->domain, hostAddr->bus, hostAddr->slot, @@ -1730,7 +1735,8 @@ qemuDomainAssignDevicePCISlots(virDomainDefPtr def, if (def->hostdevs[i]->mode != VIR_DOMAIN_HOSTDEV_MODE_SUBSYS) continue; if (def->hostdevs[i]->source.subsys.type != VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI && - def->hostdevs[i]->source.subsys.type != VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_SCSI_HOST) + def->hostdevs[i]->source.subsys.type != VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_SCSI_HOST && + def->hostdevs[i]->source.subsys.type != VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_MDEV) continue; if (qemuDomainPCIAddressReserveNextAddr(addrs, -- 2.10.2

Keep track of the assigned mediated devices the same way we do it for the rest of hostdevs. Signed-off-by: Erik Skultety <eskultet@redhat.com> --- src/libvirt_private.syms | 1 + src/qemu/qemu_hostdev.c | 22 +++++++++ src/qemu/qemu_hostdev.h | 4 ++ src/util/virhostdev.c | 126 ++++++++++++++++++++++++++++++++++++++++++++++- src/util/virhostdev.h | 9 ++++ 5 files changed, 161 insertions(+), 1 deletion(-) diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index 4047945..8921a53 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -1707,6 +1707,7 @@ virHostdevPCINodeDeviceDetach; virHostdevPCINodeDeviceReAttach; virHostdevPCINodeDeviceReset; virHostdevPrepareDomainDevices; +virHostdevPrepareMediatedDevices; virHostdevPreparePCIDevices; virHostdevPrepareSCSIDevices; virHostdevPrepareSCSIVHostDevices; diff --git a/src/qemu/qemu_hostdev.c b/src/qemu/qemu_hostdev.c index 7cd49e4..45b731c 100644 --- a/src/qemu/qemu_hostdev.c +++ b/src/qemu/qemu_hostdev.c @@ -305,6 +305,24 @@ qemuHostdevPrepareSCSIVHostDevices(virQEMUDriverPtr driver, } int +qemuHostdevPrepareMediatedDevices(virQEMUDriverPtr driver, + const char *name, + virDomainHostdevDefPtr *hostdevs, + int nhostdevs) +{ + virHostdevManagerPtr hostdev_mgr = driver->hostdevMgr; + + if (!qemuHostdevHostSupportsPassthroughVFIO()) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("host doesn't support VFIO PCI interface")); + return -1; + } + + return virHostdevPrepareMediatedDevices(hostdev_mgr, QEMU_DRIVER_NAME, + name, hostdevs, nhostdevs); +} + +int qemuHostdevPrepareDomainDevices(virQEMUDriverPtr driver, virDomainDefPtr def, virQEMUCapsPtr qemuCaps, @@ -330,6 +348,10 @@ qemuHostdevPrepareDomainDevices(virQEMUDriverPtr driver, def->hostdevs, def->nhostdevs) < 0) return -1; + if (qemuHostdevPrepareMediatedDevices(driver, def->name, + def->hostdevs, def->nhostdevs) < 0) + return -1; + return 0; } diff --git a/src/qemu/qemu_hostdev.h b/src/qemu/qemu_hostdev.h index 74a7d4f..9399241 100644 --- a/src/qemu/qemu_hostdev.h +++ b/src/qemu/qemu_hostdev.h @@ -59,6 +59,10 @@ int qemuHostdevPrepareSCSIVHostDevices(virQEMUDriverPtr driver, const char *name, virDomainHostdevDefPtr *hostdevs, int nhostdevs); +int qemuHostdevPrepareMediatedDevices(virQEMUDriverPtr driver, + const char *name, + virDomainHostdevDefPtr *hostdevs, + int nhostdevs); int qemuHostdevPrepareDomainDevices(virQEMUDriverPtr driver, virDomainDefPtr def, virQEMUCapsPtr qemuCaps, diff --git a/src/util/virhostdev.c b/src/util/virhostdev.c index 86ca8e0..7691eaa 100644 --- a/src/util/virhostdev.c +++ b/src/util/virhostdev.c @@ -1,6 +1,6 @@ /* virhostdev.c: hostdev management * - * Copyright (C) 2006-2007, 2009-2016 Red Hat, Inc. + * Copyright (C) 2006-2007, 2009-2017 Red Hat, Inc. * Copyright (C) 2006 Daniel P. Berrange * Copyright (C) 2014 SUSE LINUX Products GmbH, Nuernberg, Germany. * @@ -147,6 +147,7 @@ virHostdevManagerDispose(void *obj) virObjectUnref(hostdevMgr->activeUSBHostdevs); virObjectUnref(hostdevMgr->activeSCSIHostdevs); virObjectUnref(hostdevMgr->activeSCSIVHostHostdevs); + virObjectUnref(hostdevMgr->activeMediatedHostdevs); VIR_FREE(hostdevMgr->stateDir); } @@ -174,6 +175,9 @@ virHostdevManagerNew(void) if (!(hostdevMgr->activeSCSIVHostHostdevs = virSCSIVHostDeviceListNew())) goto error; + if (!(hostdevMgr->activeMediatedHostdevs = virMediatedDeviceListNew())) + goto error; + if (privileged) { if (VIR_STRDUP(hostdevMgr->stateDir, HOSTDEV_STATE_DIR) < 0) goto error; @@ -1595,6 +1599,126 @@ virHostdevPrepareSCSIVHostDevices(virHostdevManagerPtr mgr, return -1; } +static int +virHostdevMarkMediatedDevices(virHostdevManagerPtr mgr, + const char *drvname, + const char *domname, + virMediatedDeviceListPtr list) +{ + size_t i, j; + unsigned int count; + virMediatedDevicePtr tmp; + + virObjectLock(mgr->activeMediatedHostdevs); + count = virMediatedDeviceListCount(list); + + for (i = 0; i < count; i++) { + virMediatedDevicePtr mdev = virMediatedDeviceListGet(list, i); + if ((tmp = virMediatedDeviceListFind(mgr->activeMediatedHostdevs, + mdev))) { + char *tmp_drvname; + char *tmp_domname; + + virMediatedDeviceGetUsedBy(tmp, &tmp_drvname, &tmp_domname); + virReportError(VIR_ERR_OPERATION_INVALID, + _("Mediated device %s is in use by " + "driver %s, domain %s"), + virMediatedDeviceGetPath(tmp), + tmp_drvname, tmp_domname); + goto error; + } + + virMediatedDeviceSetUsedBy(mdev, drvname, domname); + VIR_DEBUG("Adding %s to activeMediatedHostdevs", + virMediatedDeviceGetPath(mdev)); + /* + * The caller is responsible to steal these mdev devices + * from the virMediatedDeviceList that passed in on success, + * perform rollback on failure. + */ + if (virMediatedDeviceListAdd(mgr->activeMediatedHostdevs, mdev) < 0) + goto error; + } + + virObjectUnlock(mgr->activeMediatedHostdevs); + return 0; + + error: + for (j = 0; j < i; j++) { + tmp = virMediatedDeviceListGet(list, i); + virMediatedDeviceListSteal(mgr->activeMediatedHostdevs, tmp); + } + virObjectUnlock(mgr->activeMediatedHostdevs); + return -1; +} + +int +virHostdevPrepareMediatedDevices(virHostdevManagerPtr mgr, + const char *drv_name, + const char *dom_name, + virDomainHostdevDefPtr *hostdevs, + int nhostdevs) +{ + size_t i; + int ret = -1; + virMediatedDeviceListPtr list; + virMediatedDevicePtr tmp; + + if (!nhostdevs) + return 0; + + /* To prevent situation where Mediated device is assigned to two domains + * we need to keep a list of currently assigned Mediated devices. + * This is done in several loops which cannot be joined into one big + * loop. See virHostdevPreparePCIDevices() + */ + if (!(list = virMediatedDeviceListNew())) + goto cleanup; + + /* Loop 1: build temporary list + */ + for (i = 0; i < nhostdevs; i++) { + virDomainHostdevDefPtr hostdev = hostdevs[i]; + virDomainHostdevSubsysMediatedDevPtr mdevsrc = &hostdev->source.subsys.u.mdev; + virMediatedDevicePtr mdev; + + if (hostdev->mode != VIR_DOMAIN_HOSTDEV_MODE_SUBSYS) + continue; + if (hostdev->source.subsys.type != VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_MDEV) + continue; + + if (!(mdev = virMediatedDeviceNew(&mdevsrc->addr, mdevsrc->uuidstr))) + goto cleanup; + + if (virMediatedDeviceListAdd(list, mdev) < 0) { + virMediatedDeviceFree(mdev); + goto cleanup; + } + } + + /* Mark devices in temporary list as used by @dom_name + * and add them do driver list. However, if something goes + * wrong, perform rollback. + */ + if (virHostdevMarkMediatedDevices(mgr, drv_name, dom_name, list) < 0) + goto cleanup; + + /* Loop 2: Temporary list was successfully merged with + * driver list, so steal all items to avoid freeing them + * in cleanup label. + */ + while (virMediatedDeviceListCount(list) > 0) { + tmp = virMediatedDeviceListGet(list, 0); + virMediatedDeviceListSteal(list, tmp); + } + + ret = 0; + + cleanup: + virObjectUnref(list); + return ret; +} + void virHostdevReAttachUSBDevices(virHostdevManagerPtr mgr, const char *drv_name, diff --git a/src/util/virhostdev.h b/src/util/virhostdev.h index 4c1fea3..b077089 100644 --- a/src/util/virhostdev.h +++ b/src/util/virhostdev.h @@ -31,6 +31,7 @@ # include "virusb.h" # include "virscsi.h" # include "virscsivhost.h" +# include "virmdev.h" # include "domain_conf.h" typedef enum { @@ -55,6 +56,7 @@ struct _virHostdevManager { virUSBDeviceListPtr activeUSBHostdevs; virSCSIDeviceListPtr activeSCSIHostdevs; virSCSIVHostDeviceListPtr activeSCSIVHostHostdevs; + virMediatedDeviceListPtr activeMediatedHostdevs; }; virHostdevManagerPtr virHostdevManagerGetDefault(void); @@ -96,6 +98,13 @@ virHostdevPrepareSCSIVHostDevices(virHostdevManagerPtr hostdev_mgr, virDomainHostdevDefPtr *hostdevs, int nhostdevs) ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2) ATTRIBUTE_NONNULL(3); +int +virHostdevPrepareMediatedDevices(virHostdevManagerPtr hostdev_mgr, + const char *drv_name, + const char *dom_name, + virDomainHostdevDefPtr *hostdevs, + int nhostdevs) + ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2) ATTRIBUTE_NONNULL(3); void virHostdevReAttachPCIDevices(virHostdevManagerPtr hostdev_mgr, const char *drv_name, -- 2.10.2

The name "reattach" does not really reflect the truth behind mediated devices, since these are purely software devices that do not need to be plugged back into the host in any way, however this patch pushes for naming consistency for methods where the logic behind the underlying operation stays the same except that in case of mdevs the operation itself is effectively a NO-OP. Signed-off-by: Erik Skultety <eskultet@redhat.com> --- src/libvirt_private.syms | 1 + src/qemu/qemu_hostdev.c | 15 ++++++++++++++ src/qemu/qemu_hostdev.h | 4 ++++ src/util/virhostdev.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++++ src/util/virhostdev.h | 7 +++++++ 5 files changed, 80 insertions(+) diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index 8921a53..00fb2c2 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -1713,6 +1713,7 @@ virHostdevPrepareSCSIDevices; virHostdevPrepareSCSIVHostDevices; virHostdevPrepareUSBDevices; virHostdevReAttachDomainDevices; +virHostdevReAttachMediatedDevices; virHostdevReAttachPCIDevices; virHostdevReAttachSCSIDevices; virHostdevReAttachSCSIVHostDevices; diff --git a/src/qemu/qemu_hostdev.c b/src/qemu/qemu_hostdev.c index 45b731c..6a7232f 100644 --- a/src/qemu/qemu_hostdev.c +++ b/src/qemu/qemu_hostdev.c @@ -419,6 +419,18 @@ qemuHostdevReAttachSCSIVHostDevices(virQEMUDriverPtr driver, } void +qemuHostdevReAttachMediatedDevices(virQEMUDriverPtr driver, + const char *name, + virDomainHostdevDefPtr *hostdevs, + int nhostdevs) +{ + virHostdevManagerPtr hostdev_mgr = driver->hostdevMgr; + + virHostdevReAttachMediatedDevices(hostdev_mgr, QEMU_DRIVER_NAME, + name, hostdevs, nhostdevs); +} + +void qemuHostdevReAttachDomainDevices(virQEMUDriverPtr driver, virDomainDefPtr def) { @@ -436,4 +448,7 @@ qemuHostdevReAttachDomainDevices(virQEMUDriverPtr driver, qemuHostdevReAttachSCSIVHostDevices(driver, def->name, def->hostdevs, def->nhostdevs); + + qemuHostdevReAttachMediatedDevices(driver, def->name, def->hostdevs, + def->nhostdevs); } diff --git a/src/qemu/qemu_hostdev.h b/src/qemu/qemu_hostdev.h index 9399241..0096497 100644 --- a/src/qemu/qemu_hostdev.h +++ b/src/qemu/qemu_hostdev.h @@ -84,6 +84,10 @@ void qemuHostdevReAttachSCSIVHostDevices(virQEMUDriverPtr driver, const char *name, virDomainHostdevDefPtr *hostdevs, int nhostdevs); +void qemuHostdevReAttachMediatedDevices(virQEMUDriverPtr driver, + const char *name, + virDomainHostdevDefPtr *hostdevs, + int nhostdevs); void qemuHostdevReAttachDomainDevices(virQEMUDriverPtr driver, virDomainDefPtr def); diff --git a/src/util/virhostdev.c b/src/util/virhostdev.c index 7691eaa..520b711 100644 --- a/src/util/virhostdev.c +++ b/src/util/virhostdev.c @@ -1913,6 +1913,59 @@ virHostdevReAttachSCSIVHostDevices(virHostdevManagerPtr mgr, virObjectUnlock(mgr->activeSCSIVHostHostdevs); } +void +virHostdevReAttachMediatedDevices(virHostdevManagerPtr mgr, + const char *drv_name, + const char *dom_name, + virDomainHostdevDefPtr *hostdevs, + int nhostdevs) +{ + char *used_by_drvname = NULL; + char *used_by_domname = NULL; + virDomainHostdevDefPtr hostdev = NULL; + virDomainHostdevSubsysMediatedDevPtr mdevsrc = NULL; + size_t i; + + if (nhostdevs == 0) + return; + + virObjectLock(mgr->activeMediatedHostdevs); + for (i = 0; i < nhostdevs; i++) { + virMediatedDevicePtr mdev, tmp; + + hostdev = hostdevs[i]; + mdevsrc = &hostdev->source.subsys.u.mdev; + + if (hostdev->mode != VIR_DOMAIN_HOSTDEV_MODE_SUBSYS || + hostdev->source.subsys.type != VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_MDEV) + continue; + + if (!(mdev = virMediatedDeviceNew(&mdevsrc->addr, mdevsrc->uuidstr))) { + VIR_WARN("Failed to reattach mediated device %s attached to " + "domain %s", mdevsrc->uuidstr, dom_name); + continue; + } + + /* Remove from the list only mdevs assigned to @drv_name/@dom_name */ + + tmp = virMediatedDeviceListFind(mgr->activeMediatedHostdevs, mdev); + virMediatedDeviceFree(mdev); + + /* skip inactive devices */ + if (!tmp) + continue; + + virMediatedDeviceGetUsedBy(tmp, &used_by_drvname, &used_by_domname); + if (STREQ_NULLABLE(drv_name, used_by_drvname) && + STREQ_NULLABLE(dom_name, used_by_domname)) { + VIR_DEBUG("Removing %s dom=%s from activeMediatedHostdevs", + mdevsrc->uuidstr, dom_name); + virMediatedDeviceListDel(mgr->activeMediatedHostdevs, tmp); + } + } + virObjectUnlock(mgr->activeMediatedHostdevs); +} + int virHostdevPCINodeDeviceDetach(virHostdevManagerPtr mgr, virPCIDevicePtr pci) diff --git a/src/util/virhostdev.h b/src/util/virhostdev.h index b077089..d0875d8 100644 --- a/src/util/virhostdev.h +++ b/src/util/virhostdev.h @@ -134,6 +134,13 @@ virHostdevReAttachSCSIVHostDevices(virHostdevManagerPtr hostdev_mgr, virDomainHostdevDefPtr *hostdevs, int nhostdevs) ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2) ATTRIBUTE_NONNULL(3); +void +virHostdevReAttachMediatedDevices(virHostdevManagerPtr hostdev_mgr, + const char *drv_name, + const char *dom_name, + virDomainHostdevDefPtr *hostdevs, + int nhostdevs) + ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2) ATTRIBUTE_NONNULL(3); int virHostdevUpdateActivePCIDevices(virHostdevManagerPtr mgr, virDomainHostdevDefPtr *hostdevs, -- 2.10.2

As goes for all the other hostdev device types, grant the qemu process access to /dev/vfio/<mediated_device_iommu_group>. Signed-off-by: Erik Skultety <eskultet@redhat.com> --- src/qemu/qemu_cgroup.c | 30 ++++++++++++++++++++++++++++++ 1 file changed, 30 insertions(+) diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c index 0902624..405d6c1 100644 --- a/src/qemu/qemu_cgroup.c +++ b/src/qemu/qemu_cgroup.c @@ -318,6 +318,23 @@ qemuSetupHostSCSIVHostDeviceCgroup(virSCSIVHostDevicePtr dev ATTRIBUTE_UNUSED, return ret; } +static int +qemuSetupHostMediatedDeviceCgroup(virMediatedDevicePtr dev ATTRIBUTE_UNUSED, + const char *path, + void *opaque) +{ + virDomainObjPtr vm = opaque; + qemuDomainObjPrivatePtr priv = vm->privateData; + int ret = -1; + + VIR_DEBUG("Process path '%s' for mediated device", path); + ret = virCgroupAllowDevicePath(priv->cgroup, path, + VIR_CGROUP_DEVICE_RW, false); + virDomainAuditCgroupPath(vm, priv->cgroup, "allow", path, "rw", ret == 0); + + return ret; +} + int qemuSetupHostdevCgroup(virDomainObjPtr vm, virDomainHostdevDefPtr dev) @@ -328,10 +345,12 @@ qemuSetupHostdevCgroup(virDomainObjPtr vm, virDomainHostdevSubsysPCIPtr pcisrc = &dev->source.subsys.u.pci; virDomainHostdevSubsysSCSIPtr scsisrc = &dev->source.subsys.u.scsi; virDomainHostdevSubsysSCSIVHostPtr hostsrc = &dev->source.subsys.u.scsi_host; + virDomainHostdevSubsysMediatedDevPtr mdevsrc = &dev->source.subsys.u.mdev; virPCIDevicePtr pci = NULL; virUSBDevicePtr usb = NULL; virSCSIDevicePtr scsi = NULL; virSCSIVHostDevicePtr host = NULL; + virMediatedDevicePtr mdev = NULL; char *path = NULL; /* currently this only does something for PCI devices using vfio @@ -434,6 +453,16 @@ qemuSetupHostdevCgroup(virDomainObjPtr vm, } case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_MDEV: + if (!(mdev = virMediatedDeviceNew(&mdevsrc->addr, + mdevsrc->uuidstr))) + goto cleanup; + + if (!(path = virMediatedDeviceGetIOMMUGroupDev(mdev))) + goto cleanup; + + if (qemuSetupHostMediatedDeviceCgroup(mdev, path, vm) < 0) + goto cleanup; + break; case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_LAST: @@ -447,6 +476,7 @@ qemuSetupHostdevCgroup(virDomainObjPtr vm, virUSBDeviceFree(usb); virSCSIDeviceFree(scsi); virSCSIVHostDeviceFree(host); + virMediatedDeviceFree(mdev); VIR_FREE(path); return ret; } -- 2.10.2

Again, as for all the other hostdev device types, make sure that the /dev/vfio/<mdev_iommu_group> device will be added to the qemu namespace. Signed-off-by: Erik Skultety <eskultet@redhat.com> --- src/qemu/qemu_domain.c | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index 3006d78..afc7d07 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -6824,10 +6824,12 @@ qemuDomainGetHostdevPath(virDomainHostdevDefPtr dev, virDomainHostdevSubsysPCIPtr pcisrc = &dev->source.subsys.u.pci; virDomainHostdevSubsysSCSIPtr scsisrc = &dev->source.subsys.u.scsi; virDomainHostdevSubsysSCSIVHostPtr hostsrc = &dev->source.subsys.u.scsi_host; + virDomainHostdevSubsysMediatedDevPtr mdevsrc = &dev->source.subsys.u.mdev; virPCIDevicePtr pci = NULL; virUSBDevicePtr usb = NULL; virSCSIDevicePtr scsi = NULL; virSCSIVHostDevicePtr host = NULL; + virMediatedDevicePtr mdev = NULL; char *tmpPath = NULL; bool freeTmpPath = false; @@ -6902,6 +6904,15 @@ qemuDomainGetHostdevPath(virDomainHostdevDefPtr dev, } case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_MDEV: + if (!(mdev = virMediatedDeviceNew(&mdevsrc->addr, + mdevsrc->uuidstr))) + goto cleanup; + + if (!(tmpPath = virMediatedDeviceGetIOMMUGroupDev(mdev))) + goto cleanup; + + freeTmpPath = true; + break; case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_LAST: break; } @@ -6922,6 +6933,7 @@ qemuDomainGetHostdevPath(virDomainHostdevDefPtr dev, virUSBDeviceFree(usb); virSCSIDeviceFree(scsi); virSCSIVHostDeviceFree(host); + virMediatedDeviceFree(mdev); if (freeTmpPath) VIR_FREE(tmpPath); return ret; -- 2.10.2

Format the mediated devices on the qemu command line as -device vfio-pci,sysfsdev='/path/to/device/in/syfs'. Signed-off-by: Erik Skultety <eskultet@redhat.com> --- src/qemu/qemu_command.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++++ src/qemu/qemu_command.h | 5 +++++ 2 files changed, 54 insertions(+) diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index 1396661..51a9298 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -57,6 +57,7 @@ #include "virscsi.h" #include "virnuma.h" #include "virgic.h" +#include "virmdev.h" #if defined(__linux__) # include <linux/capability.h> #endif @@ -5135,6 +5136,35 @@ qemuBuildChrChardevStr(virLogManagerPtr logManager, return ret; } +char * +qemuBuildHostdevMediatedDevStr(const virDomainDef *def, + virDomainHostdevDefPtr dev, + virQEMUCapsPtr qemuCaps) +{ + virBuffer buf = VIR_BUFFER_INITIALIZER; + virDomainHostdevSubsysMediatedDevPtr mdevsrc = &dev->source.subsys.u.mdev; + virMediatedDevicePtr mdev = NULL; + char *ret = NULL; + + if (!(mdev = virMediatedDeviceNew(&mdevsrc->addr, mdevsrc->uuidstr))) + goto cleanup; + + virBufferAddLit(&buf, "vfio-pci"); + virBufferAsprintf(&buf, ",sysfsdev=%s", virMediatedDeviceGetPath(mdev)); + + if (qemuBuildDeviceAddressStr(&buf, def, dev->info, qemuCaps) < 0) + goto cleanup; + + if (virBufferCheckError(&buf) < 0) + goto cleanup; + + ret = virBufferContentAndReset(&buf); + + cleanup: + virBufferFreeAndReset(&buf); + virMediatedDeviceFree(mdev); + return ret; +} static int qemuBuildHostdevCommandLine(virCommandPtr cmd, @@ -5323,6 +5353,25 @@ qemuBuildHostdevCommandLine(virCommandPtr cmd, VIR_FREE(devstr); } } + + /* MDEV */ + if (hostdev->mode == VIR_DOMAIN_HOSTDEV_MODE_SUBSYS && + subsys->type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_MDEV) { + + if (!virQEMUCapsGet(qemuCaps, QEMU_CAPS_DEVICE_VFIO_PCI)) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("VFIO PCI device assignment is not " + "supported by this version of qemu")); + return -1; + } + + virCommandAddArg(cmd, "-device"); + if (!(devstr = + qemuBuildHostdevMediatedDevStr(def, hostdev, qemuCaps))) + return -1; + virCommandAddArg(cmd, devstr); + VIR_FREE(devstr); + } } return 0; diff --git a/src/qemu/qemu_command.h b/src/qemu/qemu_command.h index 3bcfdc6..e50b6f4 100644 --- a/src/qemu/qemu_command.h +++ b/src/qemu/qemu_command.h @@ -170,6 +170,11 @@ qemuBuildSCSIVHostHostdevDevStr(const virDomainDef *def, virQEMUCapsPtr qemuCaps, char *vhostfdName); +char * +qemuBuildHostdevMediatedDevStr(const virDomainDef *def, + virDomainHostdevDefPtr dev, + virQEMUCapsPtr qemuCaps); + char *qemuBuildRedirdevDevStr(const virDomainDef *def, virDomainRedirdevDefPtr dev, virQEMUCapsPtr qemuCaps); -- 2.10.2

For now, focus only on unmanaged devices, thus also testing whether the uuid element is present or not, since in case of unmanaged mediated devices it must be present. Signed-off-by: Erik Skultety <eskultet@redhat.com> --- ...qemuxml2argv-hostdev-mdev-unmanaged-no-uuid.xml | 37 +++++++++++++++++++ .../qemuxml2argv-hostdev-mdev-unmanaged.args | 25 +++++++++++++ .../qemuxml2argv-hostdev-mdev-unmanaged.xml | 38 ++++++++++++++++++++ tests/qemuxml2argvtest.c | 6 ++++ .../qemuxml2xmlout-hostdev-mdev-unmanaged.xml | 41 ++++++++++++++++++++++ tests/qemuxml2xmltest.c | 1 + 6 files changed, 148 insertions(+) create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-hostdev-mdev-unmanaged-no-uuid.xml create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-hostdev-mdev-unmanaged.args create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-hostdev-mdev-unmanaged.xml create mode 100644 tests/qemuxml2xmloutdata/qemuxml2xmlout-hostdev-mdev-unmanaged.xml diff --git a/tests/qemuxml2argvdata/qemuxml2argv-hostdev-mdev-unmanaged-no-uuid.xml b/tests/qemuxml2argvdata/qemuxml2argv-hostdev-mdev-unmanaged-no-uuid.xml new file mode 100644 index 0000000..0fda187 --- /dev/null +++ b/tests/qemuxml2argvdata/qemuxml2argv-hostdev-mdev-unmanaged-no-uuid.xml @@ -0,0 +1,37 @@ +<domain type='qemu'> + <name>QEMUGuest2</name> + <uuid>c7a5fdbd-edaf-9455-926a-d65c16db1809</uuid> + <memory unit='KiB'>219136</memory> + <currentMemory unit='KiB'>219136</currentMemory> + <vcpu placement='static'>1</vcpu> + <os> + <type arch='i686' machine='pc'>hvm</type> + <boot dev='hd'/> + </os> + <clock offset='utc'/> + <on_poweroff>destroy</on_poweroff> + <on_reboot>restart</on_reboot> + <on_crash>destroy</on_crash> + <devices> + <emulator>/usr/bin/qemu</emulator> + <disk type='block' device='disk'> + <driver name='qemu' type='raw'/> + <source dev='/dev/HostVG/QEMUGuest2'/> + <target dev='hda' bus='ide'/> + <address type='drive' controller='0' bus='0' target='0' unit='0'/> + </disk> + <controller type='usb' index='0'> + </controller> + <controller type='pci' index='0' model='pci-root'/> + <controller type='ide' index='0'> + </controller> + <input type='mouse' bus='ps2'/> + <input type='keyboard' bus='ps2'/> + <hostdev mode='subsystem' type='mdev' managed='no'> + <source> + <address domain='0x0000' bus='0x00' slot='0x03' function='0x0'/> + </source> + </hostdev> + <memballoon model='none'/> + </devices> +</domain> diff --git a/tests/qemuxml2argvdata/qemuxml2argv-hostdev-mdev-unmanaged.args b/tests/qemuxml2argvdata/qemuxml2argv-hostdev-mdev-unmanaged.args new file mode 100644 index 0000000..be0017e --- /dev/null +++ b/tests/qemuxml2argvdata/qemuxml2argv-hostdev-mdev-unmanaged.args @@ -0,0 +1,25 @@ +LC_ALL=C \ +PATH=/bin \ +HOME=/home/test \ +USER=test \ +LOGNAME=test \ +QEMU_AUDIO_DRV=none \ +/usr/bin/qemu \ +-name QEMUGuest2 \ +-S \ +-M pc \ +-m 214 \ +-smp 1,sockets=1,cores=1,threads=1 \ +-uuid c7a5fdbd-edaf-9455-926a-d65c16db1809 \ +-nographic \ +-nodefconfig \ +-nodefaults \ +-monitor unix:/tmp/lib/domain--1-QEMUGuest2/monitor.sock,server,nowait \ +-no-acpi \ +-boot c \ +-usb \ +-drive file=/dev/HostVG/QEMUGuest2,format=raw,if=none,id=drive-ide0-0-0 \ +-device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 \ +-device vfio-pci,\ +sysfsdev=/sys/class/mdev_bus/0000:00:03.0/53764d0e-85a0-42b4-af5c-2046b460b1dc,\ +bus=pci.0,addr=0x3 diff --git a/tests/qemuxml2argvdata/qemuxml2argv-hostdev-mdev-unmanaged.xml b/tests/qemuxml2argvdata/qemuxml2argv-hostdev-mdev-unmanaged.xml new file mode 100644 index 0000000..025429d --- /dev/null +++ b/tests/qemuxml2argvdata/qemuxml2argv-hostdev-mdev-unmanaged.xml @@ -0,0 +1,38 @@ +<domain type='qemu'> + <name>QEMUGuest2</name> + <uuid>c7a5fdbd-edaf-9455-926a-d65c16db1809</uuid> + <memory unit='KiB'>219136</memory> + <currentMemory unit='KiB'>219136</currentMemory> + <vcpu placement='static'>1</vcpu> + <os> + <type arch='i686' machine='pc'>hvm</type> + <boot dev='hd'/> + </os> + <clock offset='utc'/> + <on_poweroff>destroy</on_poweroff> + <on_reboot>restart</on_reboot> + <on_crash>destroy</on_crash> + <devices> + <emulator>/usr/bin/qemu</emulator> + <disk type='block' device='disk'> + <driver name='qemu' type='raw'/> + <source dev='/dev/HostVG/QEMUGuest2'/> + <target dev='hda' bus='ide'/> + <address type='drive' controller='0' bus='0' target='0' unit='0'/> + </disk> + <controller type='usb' index='0'> + </controller> + <controller type='pci' index='0' model='pci-root'/> + <controller type='ide' index='0'> + </controller> + <input type='mouse' bus='ps2'/> + <input type='keyboard' bus='ps2'/> + <hostdev mode='subsystem' type='mdev' managed='no'> + <source> + <address domain='0x0000' bus='0x00' slot='0x03' function='0x0'/> + <uuid>53764d0e-85a0-42b4-af5c-2046b460b1dc</uuid> + </source> + </hostdev> + <memballoon model='none'/> + </devices> +</domain> diff --git a/tests/qemuxml2argvtest.c b/tests/qemuxml2argvtest.c index 3532cb5..12ed800 100644 --- a/tests/qemuxml2argvtest.c +++ b/tests/qemuxml2argvtest.c @@ -1469,6 +1469,12 @@ mymain(void) DO_TEST("hostdev-vfio-multidomain", QEMU_CAPS_NODEFCONFIG, QEMU_CAPS_DEVICE_VFIO_PCI, QEMU_CAPS_HOST_PCI_MULTIDOMAIN); + DO_TEST("hostdev-mdev-unmanaged", + QEMU_CAPS_NODEFCONFIG, + QEMU_CAPS_DEVICE_VFIO_PCI); + DO_TEST_PARSE_ERROR("hostdev-mdev-unmanaged-no-uuid", + QEMU_CAPS_NODEFCONFIG, + QEMU_CAPS_DEVICE_VFIO_PCI); DO_TEST_FAILURE("hostdev-vfio-multidomain", QEMU_CAPS_NODEFCONFIG, QEMU_CAPS_DEVICE_VFIO_PCI); DO_TEST("pci-rom", diff --git a/tests/qemuxml2xmloutdata/qemuxml2xmlout-hostdev-mdev-unmanaged.xml b/tests/qemuxml2xmloutdata/qemuxml2xmlout-hostdev-mdev-unmanaged.xml new file mode 100644 index 0000000..baa8b78 --- /dev/null +++ b/tests/qemuxml2xmloutdata/qemuxml2xmlout-hostdev-mdev-unmanaged.xml @@ -0,0 +1,41 @@ +<domain type='qemu'> + <name>QEMUGuest2</name> + <uuid>c7a5fdbd-edaf-9455-926a-d65c16db1809</uuid> + <memory unit='KiB'>219136</memory> + <currentMemory unit='KiB'>219136</currentMemory> + <vcpu placement='static'>1</vcpu> + <os> + <type arch='i686' machine='pc'>hvm</type> + <boot dev='hd'/> + </os> + <clock offset='utc'/> + <on_poweroff>destroy</on_poweroff> + <on_reboot>restart</on_reboot> + <on_crash>destroy</on_crash> + <devices> + <emulator>/usr/bin/qemu</emulator> + <disk type='block' device='disk'> + <driver name='qemu' type='raw'/> + <source dev='/dev/HostVG/QEMUGuest2'/> + <target dev='hda' bus='ide'/> + <address type='drive' controller='0' bus='0' target='0' unit='0'/> + </disk> + <controller type='usb' index='0'> + <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/> + </controller> + <controller type='pci' index='0' model='pci-root'/> + <controller type='ide' index='0'> + <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/> + </controller> + <input type='mouse' bus='ps2'/> + <input type='keyboard' bus='ps2'/> + <hostdev mode='subsystem' type='mdev' managed='no'> + <source> + <address domain='0x0000' bus='0x00' slot='0x03' function='0x0'/> + <uuid>53764d0e-85a0-42b4-af5c-2046b460b1dc</uuid> + </source> + <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/> + </hostdev> + <memballoon model='none'/> + </devices> +</domain> diff --git a/tests/qemuxml2xmltest.c b/tests/qemuxml2xmltest.c index 4f3b09a..740025d 100644 --- a/tests/qemuxml2xmltest.c +++ b/tests/qemuxml2xmltest.c @@ -550,6 +550,7 @@ mymain(void) DO_TEST("hostdev-usb-address", NONE); DO_TEST("hostdev-pci-address", NONE); DO_TEST("hostdev-vfio", NONE); + DO_TEST("hostdev-mdev-unmanaged", NONE); DO_TEST("pci-rom", NONE); DO_TEST("pci-serial-dev-chardev", NONE); -- 2.10.2

Signed-off-by: Erik Skultety <eskultet@redhat.com> --- docs/formatdomain.html.in | 40 ++++++++++++++++++++++++++++++++++++---- 1 file changed, 36 insertions(+), 4 deletions(-) diff --git a/docs/formatdomain.html.in b/docs/formatdomain.html.in index 77d8e71..10cef31 100644 --- a/docs/formatdomain.html.in +++ b/docs/formatdomain.html.in @@ -3765,6 +3765,20 @@ </devices> ...</pre> + <p>or:</p> + +<pre> + ... + <devices> + <hostdev mode='subsystem' type='mdev'> + <source> + <address domain='0x0000' bus='0x06' slot='0x02' function='0x0'/> + <uuid>c2177883-f1bb-47f0-914d-32a22e3a8804</uuid> + </source> + </hostdev> + </devices> + ...</pre> + <dl> <dt><code>hostdev</code></dt> <dd>The <code>hostdev</code> element is the main container for describing @@ -3809,12 +3823,20 @@ <code>type</code> passes all LUNs presented by a single HBA to the guest. </dd> + <dt><code>mdev</code></dt> + <dd>For mediated devices (<span class="since">Since 3.1.0</span>) + the <code>managed</code> attribute controls whether the mediated + device to be assigned is pre-created by the user or shall be created + automatically by libvirt using the information provided by the + <code>source</code> element. Currently, only the value "no" is + supported which is the same as omitting the attribute completely. + </dd> </dl> <p> - Note: The <code>managed</code> attribute is only used with PCI devices - and is ignored by all the other device types, thus setting - <code>managed</code> explicitly with other than PCI device has the same - effect as omitting it. + Note: The <code>managed</code> attribute is only used with PCI and + mediated devices and is ignored by all the other device types, thus + setting <code>managed</code> explicitly with other than a PCI or a + mediated device has the same effect as omitting it. </p> </dd> <dt><code>source</code></dt> @@ -3879,6 +3901,16 @@ is the vhost_scsi wwpn (16 hexadecimal digits with a prefix of "naa.") established in the host configfs. </dd> + <dt><code>mdev</code></dt> + <dd>Mediated devices (<span class="since">Since 3.1.0</span>) are + described by the PCI <code>address</code> of the backing physical + parent device. Additionally, the mediated device is further identified + by the <code>uuid</code> element. The element is optional if + <code>managed="yes"</code>, in which case the UUID shall be + generated automatically along with creation with the mediated + device. However, when <code>managed="no"</code> then <code>uuid</code> + element is mandatory. + </dd> </dl> </dd> <dt><code>vendor</code>, <code>product</code></dt> -- 2.10.2

On 06.02.2017 13:19, Erik Skultety wrote:
Finally. It's here. This is the initial suggestion on how libvirt might interract with the mdev framework, currently only focussing on the non-managed devices, i.e. those pre-created by the user, since that will be revisited once we all settled on how the XML should look like, given we might not want to use the sysfs path directly as an attribute in the domain XML. My proposal on the XML is the following:
<hostdev mode='subsystem' type='mdev'> <source> <!-- this is the host's physical device address --> <address domain='0x0000' bus='0x00' slot='0x00' function='0x00'> <uuid>vGPU_UUID<uuid> <source> <!-- target PCI address can be omitted to assign it automatically --> </hostdev>
So the mediated device is identified by the physical parent device visible on the host and a UUID which allows us to construct the sysfs path by ourselves, which we then put on the QEMU's command line.
A few remarks if you actually happen to have a machine to test this on: - right now the mediated devices are one-time use only, i.e. they have to be recreated before every machine boot - I wouldn't recommend assigning multiple vGPUs to a single domain
Once this series is sorted out, we can then continue with 'managed=yes' where as Laine pointed out [1], we need to figure out how exactly should the management layer hint libvirt which vGPU type should be used for device instantiation.
[1] https://www.redhat.com/archives/libvir-list/2017-January/msg00287.html
#pleaseshareyourfeedback
Thanks, Erik
Erik Skultety (16): util: Introduce new module virmdev conf: Introduce new hostdev device type mdev docs: Update RNG schema to reflect the new hostdev type mdev conf: Adjust the domain parser to work with mdevs Adjust the formatter to reflect the new hostdev type mdev security: dac: Enable labeling of vfio mediated devices security: selinux: Enable labeling of vfio mediated devices conf: Enable cold-plug of a mediated device qemu: Assign PCI addresses for mediated devices as well hostdev: Maintain a driver list of active mediated devices hostdev: Introduce a reattach method for mediated devices qemu: cgroup: Adjust cgroups' logic to allow mediated devices qemu: namespace: Hook up the discovery of mdevs into the namespace code qemu: Format mdevs on the qemu command line test: Add some test cases for our test suite regarding the mdevs docs: Document the new hostdev device type 'mdev'
docs/formatdomain.html.in | 40 ++- docs/schemas/domaincommon.rng | 17 + po/POTFILES.in | 1 + src/Makefile.am | 1 + src/conf/domain_conf.c | 81 ++++- src/conf/domain_conf.h | 10 + src/libvirt_private.syms | 19 ++ src/qemu/qemu_cgroup.c | 35 ++ src/qemu/qemu_command.c | 49 +++ src/qemu/qemu_command.h | 5 + src/qemu/qemu_domain.c | 13 + src/qemu/qemu_domain_address.c | 12 +- src/qemu/qemu_hostdev.c | 37 ++ src/qemu/qemu_hostdev.h | 8 + src/qemu/qemu_hotplug.c | 2 + src/security/security_apparmor.c | 3 + src/security/security_dac.c | 56 +++ src/security/security_selinux.c | 55 +++ src/util/virhostdev.c | 179 +++++++++- src/util/virhostdev.h | 16 + src/util/virmdev.c | 375 +++++++++++++++++++++ src/util/virmdev.h | 85 +++++ tests/domaincapsschemadata/full.xml | 1 + ...qemuxml2argv-hostdev-mdev-unmanaged-no-uuid.xml | 37 ++ .../qemuxml2argv-hostdev-mdev-unmanaged.args | 25 ++ .../qemuxml2argv-hostdev-mdev-unmanaged.xml | 38 +++ tests/qemuxml2argvtest.c | 6 + .../qemuxml2xmlout-hostdev-mdev-unmanaged.xml | 41 +++ tests/qemuxml2xmltest.c | 1 + 29 files changed, 1239 insertions(+), 9 deletions(-) create mode 100644 src/util/virmdev.c create mode 100644 src/util/virmdev.h create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-hostdev-mdev-unmanaged-no-uuid.xml create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-hostdev-mdev-unmanaged.args create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-hostdev-mdev-unmanaged.xml create mode 100644 tests/qemuxml2xmloutdata/qemuxml2xmlout-hostdev-mdev-unmanaged.xml
I'm no expert in mdevs, but from code POV these look solid. Michal

On Mon, 6 Feb 2017 13:19:42 +0100 Erik Skultety <eskultet@redhat.com> wrote:
Finally. It's here. This is the initial suggestion on how libvirt might interract with the mdev framework, currently only focussing on the non-managed devices, i.e. those pre-created by the user, since that will be revisited once we all settled on how the XML should look like, given we might not want to use the sysfs path directly as an attribute in the domain XML. My proposal on the XML is the following:
<hostdev mode='subsystem' type='mdev'> <source> <!-- this is the host's physical device address --> <address domain='0x0000' bus='0x00' slot='0x00' function='0x00'> <uuid>vGPU_UUID<uuid> <source> <!-- target PCI address can be omitted to assign it automatically --> </hostdev>
So the mediated device is identified by the physical parent device visible on the host and a UUID which allows us to construct the sysfs path by ourselves, which we then put on the QEMU's command line.
Based on your test code, I think you're creating something like this: -device vfio-pci,sysfsdev=/sys/class/mdev_bus/0000:00:03.0/53764d0e-85a0-42b4-af5c-2046b460b1dc That would explain the need for the parent device address, but that's an entirely self inflicted requirement. For a managed="no" scenarios, we shouldn't need the parent, we can get to the mdev device via /sys/bus/mdev/devices/53764d0e-85a0-42b4-af5c-2046b460b1dc. So it seems that the UUID should be the only required source element for managed="no". For managed="yes", it seems like the parent device is still an optional field. The most important thing that libvirt needs to know when creating a mdev device for a VM is the mdev type name. The parent device should be an optional field to help higher level management tools deal with placement of the device for locality or load balancing. Also, we can't assume that the parent device is a PCI device, the sample mtty driver already breaks this assumption. Also, grep'ing through the patches, I don't see that the "device_api" file is being used to test that the mdev device actually exports the vfio-pci API before making use of it with the QEMU vfio-pci driver. We don't yet have any examples to the contrary, but non vfio-pci mdev devices are in development. Just like we can't assume the parent device type, we can't assume the API of an mdev device to the user. Thanks, Alex
A few remarks if you actually happen to have a machine to test this on: - right now the mediated devices are one-time use only, i.e. they have to be recreated before every machine boot - I wouldn't recommend assigning multiple vGPUs to a single domain
Once this series is sorted out, we can then continue with 'managed=yes' where as Laine pointed out [1], we need to figure out how exactly should the management layer hint libvirt which vGPU type should be used for device instantiation.
[1] https://www.redhat.com/archives/libvir-list/2017-January/msg00287.html
#pleaseshareyourfeedback
Thanks, Erik
Erik Skultety (16): util: Introduce new module virmdev conf: Introduce new hostdev device type mdev docs: Update RNG schema to reflect the new hostdev type mdev conf: Adjust the domain parser to work with mdevs Adjust the formatter to reflect the new hostdev type mdev security: dac: Enable labeling of vfio mediated devices security: selinux: Enable labeling of vfio mediated devices conf: Enable cold-plug of a mediated device qemu: Assign PCI addresses for mediated devices as well hostdev: Maintain a driver list of active mediated devices hostdev: Introduce a reattach method for mediated devices qemu: cgroup: Adjust cgroups' logic to allow mediated devices qemu: namespace: Hook up the discovery of mdevs into the namespace code qemu: Format mdevs on the qemu command line test: Add some test cases for our test suite regarding the mdevs docs: Document the new hostdev device type 'mdev'
docs/formatdomain.html.in | 40 ++- docs/schemas/domaincommon.rng | 17 + po/POTFILES.in | 1 + src/Makefile.am | 1 + src/conf/domain_conf.c | 81 ++++- src/conf/domain_conf.h | 10 + src/libvirt_private.syms | 19 ++ src/qemu/qemu_cgroup.c | 35 ++ src/qemu/qemu_command.c | 49 +++ src/qemu/qemu_command.h | 5 + src/qemu/qemu_domain.c | 13 + src/qemu/qemu_domain_address.c | 12 +- src/qemu/qemu_hostdev.c | 37 ++ src/qemu/qemu_hostdev.h | 8 + src/qemu/qemu_hotplug.c | 2 + src/security/security_apparmor.c | 3 + src/security/security_dac.c | 56 +++ src/security/security_selinux.c | 55 +++ src/util/virhostdev.c | 179 +++++++++- src/util/virhostdev.h | 16 + src/util/virmdev.c | 375 +++++++++++++++++++++ src/util/virmdev.h | 85 +++++ tests/domaincapsschemadata/full.xml | 1 + ...qemuxml2argv-hostdev-mdev-unmanaged-no-uuid.xml | 37 ++ .../qemuxml2argv-hostdev-mdev-unmanaged.args | 25 ++ .../qemuxml2argv-hostdev-mdev-unmanaged.xml | 38 +++ tests/qemuxml2argvtest.c | 6 + .../qemuxml2xmlout-hostdev-mdev-unmanaged.xml | 41 +++ tests/qemuxml2xmltest.c | 1 + 29 files changed, 1239 insertions(+), 9 deletions(-) create mode 100644 src/util/virmdev.c create mode 100644 src/util/virmdev.h create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-hostdev-mdev-unmanaged-no-uuid.xml create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-hostdev-mdev-unmanaged.args create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-hostdev-mdev-unmanaged.xml create mode 100644 tests/qemuxml2xmloutdata/qemuxml2xmlout-hostdev-mdev-unmanaged.xml

On Mon, Feb 06, 2017 at 09:33:14AM -0700, Alex Williamson wrote:
On Mon, 6 Feb 2017 13:19:42 +0100 Erik Skultety <eskultet@redhat.com> wrote:
Finally. It's here. This is the initial suggestion on how libvirt might interract with the mdev framework, currently only focussing on the non-managed devices, i.e. those pre-created by the user, since that will be revisited once we all settled on how the XML should look like, given we might not want to use the sysfs path directly as an attribute in the domain XML. My proposal on the XML is the following:
<hostdev mode='subsystem' type='mdev'> <source> <!-- this is the host's physical device address --> <address domain='0x0000' bus='0x00' slot='0x00' function='0x00'> <uuid>vGPU_UUID<uuid> <source> <!-- target PCI address can be omitted to assign it automatically --> </hostdev>
So the mediated device is identified by the physical parent device visible on the host and a UUID which allows us to construct the sysfs path by ourselves, which we then put on the QEMU's command line.
Based on your test code, I think you're creating something like this:
-device vfio-pci,sysfsdev=/sys/class/mdev_bus/0000:00:03.0/53764d0e-85a0-42b4-af5c-2046b460b1dc
That would explain the need for the parent device address, but that's an entirely self inflicted requirement. For a managed="no" scenarios, we shouldn't need the parent, we can get to the mdev device via /sys/bus/mdev/devices/53764d0e-85a0-42b4-af5c-2046b460b1dc. So it
True, for managed="no" would this path be a nice optimization.
seems that the UUID should be the only required source element for managed="no".
For managed="yes", it seems like the parent device is still an optional
The reason I went with the parent address element (and purposely neglecting the sample mtty driver) was that I assumed any modern mdev capable HW would be accessible through the PCI bus on the host. Also I wanted to explicitly hint libvirt as much as possible which parent device a vGPU device instance should be created on in case there are more than one of them, rather then scanning sysfs for a suitable parent which actually supports the given vGPU type.
field. The most important thing that libvirt needs to know when creating a mdev device for a VM is the mdev type name. The parent device should be an optional field to help higher level management tools deal with placement of the device for locality or load balancing. Also, we can't assume that the parent device is a PCI device, the sample mtty driver already breaks this assumption.
Since we need to assume non-PCI devices and we still need to enable management to hint libvirt about the parent to utilize load balancing and stuff, I've come up with the following adjustments/ideas on how to reflect that in the XML: - still use the address element but use it with the 'type' attribute [1] (still breaks the sample mtty driver though) while making the element truly optional if I'm going to be outvoted in favor of scanning the directory for a suitable parent device on our own, rather than requiring the user to provide that - providing either an attribute or a standalone element for the parent device name, like a string version of the PCI address or whatever form the parent device comes in (doesn't break the mtty driver but I don't quite like this) - providing a path element/attribute to sysfs pointing to the parent device which I'm afraid is what Daniel is not in favor of libvirt doing So, this is what I've so far come up with in terms of hinting libvirt about the parent device, do you have any input on this, maybe some more ideas on how we should identify the parent device?
Also, grep'ing through the patches, I don't see that the "device_api"
Yep, this was also on purpose since as you write below, right now the only functioning mdev devices we have to work with are vfio-pci capable only, so with this RFC I wanted to gather some feedback on whether I'm moving the right direction in the first place. So yeah, I thought this could be added at any point later. [1] http://libvirt.org/formatdomain.html#elementsAddress Erik
file is being used to test that the mdev device actually exports the vfio-pci API before making use of it with the QEMU vfio-pci driver. We don't yet have any examples to the contrary, but non vfio-pci mdev devices are in development. Just like we can't assume the parent device type, we can't assume the API of an mdev device to the user. Thanks,
Alex
A few remarks if you actually happen to have a machine to test this on: - right now the mediated devices are one-time use only, i.e. they have to be recreated before every machine boot - I wouldn't recommend assigning multiple vGPUs to a single domain
Once this series is sorted out, we can then continue with 'managed=yes' where as Laine pointed out [1], we need to figure out how exactly should the management layer hint libvirt which vGPU type should be used for device instantiation.
[1] https://www.redhat.com/archives/libvir-list/2017-January/msg00287.html

On Tue, 7 Feb 2017 17:26:51 +0100 Erik Skultety <eskultet@redhat.com> wrote:
On Mon, Feb 06, 2017 at 09:33:14AM -0700, Alex Williamson wrote:
On Mon, 6 Feb 2017 13:19:42 +0100 Erik Skultety <eskultet@redhat.com> wrote:
Finally. It's here. This is the initial suggestion on how libvirt might interract with the mdev framework, currently only focussing on the non-managed devices, i.e. those pre-created by the user, since that will be revisited once we all settled on how the XML should look like, given we might not want to use the sysfs path directly as an attribute in the domain XML. My proposal on the XML is the following:
<hostdev mode='subsystem' type='mdev'> <source> <!-- this is the host's physical device address --> <address domain='0x0000' bus='0x00' slot='0x00' function='0x00'> <uuid>vGPU_UUID<uuid> <source> <!-- target PCI address can be omitted to assign it automatically --> </hostdev>
So the mediated device is identified by the physical parent device visible on the host and a UUID which allows us to construct the sysfs path by ourselves, which we then put on the QEMU's command line.
Based on your test code, I think you're creating something like this:
-device vfio-pci,sysfsdev=/sys/class/mdev_bus/0000:00:03.0/53764d0e-85a0-42b4-af5c-2046b460b1dc
That would explain the need for the parent device address, but that's an entirely self inflicted requirement. For a managed="no" scenarios, we shouldn't need the parent, we can get to the mdev device via /sys/bus/mdev/devices/53764d0e-85a0-42b4-af5c-2046b460b1dc. So it
True, for managed="no" would this path be a nice optimization.
seems that the UUID should be the only required source element for managed="no".
For managed="yes", it seems like the parent device is still an optional
The reason I went with the parent address element (and purposely neglecting the sample mtty driver) was that I assumed any modern mdev capable HW would be accessible through the PCI bus on the host. Also I wanted to explicitly hint libvirt as much as possible which parent device a vGPU device instance should be created on in case there are more than one of them, rather then scanning sysfs for a suitable parent which actually supports the given vGPU type.
field. The most important thing that libvirt needs to know when creating a mdev device for a VM is the mdev type name. The parent device should be an optional field to help higher level management tools deal with placement of the device for locality or load balancing. Also, we can't assume that the parent device is a PCI device, the sample mtty driver already breaks this assumption.
Since we need to assume non-PCI devices and we still need to enable management to hint libvirt about the parent to utilize load balancing and stuff, I've come up with the following adjustments/ideas on how to reflect that in the XML: - still use the address element but use it with the 'type' attribute [1] (still breaks the sample mtty driver though) while making the element truly optional if I'm going to be outvoted in favor of scanning the directory for a suitable parent device on our own, rather than requiring the user to provide that
- providing either an attribute or a standalone element for the parent device name, like a string version of the PCI address or whatever form the parent device comes in (doesn't break the mtty driver but I don't quite like this)
- providing a path element/attribute to sysfs pointing to the parent device which I'm afraid is what Daniel is not in favor of libvirt doing
So, this is what I've so far come up with in terms of hinting libvirt about the parent device, do you have any input on this, maybe some more ideas on how we should identify the parent device?
IMO, if we cannot account for the mtty sample driver, we're doing it wrong. I suppose we can leave it unspecified how one selects a parent device for the mtty driver, but it should be possible to expand the syntax to include it. So I think that means that when the parent address is provided, the parent address type needs to be specified as PCI. So... <hostdev mode='subsystem' type='mdev'> This needs to encompass the device API or else the optional VM address cannot be resolved. Perhaps model='vfio-pci' here? Seems similar to how we specify the device type for PCI controllers where we have multiple options: <hostdev mode='subsystem' type='mdev' model='vfio-pci'> <source> For managed='no', I don't see that anything other than the mdev UUID is useful. <uuid>MDEV_UUID</uuid> If libvirt gets into the business of creating mdev devices and we call that managed='yes', then the mdev type to create is required. I don't know whether there's anything similar we can steal syntax from: <type>"nvidia-11"</type> That's pretty horrible, needs some xml guru love. We need to provide for specifying a parent, but we can't assume the type and address format of the parent. If we say the parent is a string, then we don't care, libvirt simply matches the /sys/clas/mdev_bus/$STRING/ device. If we want libvirt to interpret the parent address (I can't figure what value this adds, but aiui raw strings are frowned upon), then the address type would need to be specified. So instead of: <address domain='0x0000' bus='0x00' slot='0x00' function='0x00'> Perhaps: <address type='pci' domain='0x0000' bus='0x00'...> If we want to support mtty here, we'd need a different type, but at least we have the possibility to define that (though if parent is simply a string then even mtty works). <source> The type of the optional <address> here would be based on the "model" of the hostdev. </hostdev> Clearly I'm less than a novice at defining xml tags, this is just what I think needs to be there.
Also, grep'ing through the patches, I don't see that the "device_api"
Yep, this was also on purpose since as you write below, right now the only functioning mdev devices we have to work with are vfio-pci capable only, so with this RFC I wanted to gather some feedback on whether I'm moving the right direction in the first place. So yeah, I thought this could be added at any point later.
Hmm, I think that might be painting us into a corner for the future.
I see here that we already do the <address type='pci'...> thing, so using that as the parent source element fits. The VM <address> though is optional, so we can't rely on queues from that to tell us what mdev type to expect, which implies we do need something like the model='vfio-pci' to determine the VM address format. Thanks, Alex

On 07/02/17 12:29 -0700, Alex Williamson wrote:
On Tue, 7 Feb 2017 17:26:51 +0100 Erik Skultety <eskultet@redhat.com> wrote:
On Mon, Feb 06, 2017 at 09:33:14AM -0700, Alex Williamson wrote:
On Mon, 6 Feb 2017 13:19:42 +0100 Erik Skultety <eskultet@redhat.com> wrote:
Finally. It's here. This is the initial suggestion on how libvirt might interract with the mdev framework, currently only focussing on the non-managed devices, i.e. those pre-created by the user, since that will be revisited once we all settled on how the XML should look like, given we might not want to use the sysfs path directly as an attribute in the domain XML. My proposal on the XML is the following:
<hostdev mode='subsystem' type='mdev'> <source> <!-- this is the host's physical device address --> <address domain='0x0000' bus='0x00' slot='0x00' function='0x00'> <uuid>vGPU_UUID<uuid> <source> <!-- target PCI address can be omitted to assign it automatically --> </hostdev>
So the mediated device is identified by the physical parent device visible on the host and a UUID which allows us to construct the sysfs path by ourselves, which we then put on the QEMU's command line.
Based on your test code, I think you're creating something like this:
-device vfio-pci,sysfsdev=/sys/class/mdev_bus/0000:00:03.0/53764d0e-85a0-42b4-af5c-2046b460b1dc
That would explain the need for the parent device address, but that's an entirely self inflicted requirement. For a managed="no" scenarios, we shouldn't need the parent, we can get to the mdev device via /sys/bus/mdev/devices/53764d0e-85a0-42b4-af5c-2046b460b1dc. So it
True, for managed="no" would this path be a nice optimization.
seems that the UUID should be the only required source element for managed="no".
For managed="yes", it seems like the parent device is still an optional
The reason I went with the parent address element (and purposely neglecting the sample mtty driver) was that I assumed any modern mdev capable HW would be accessible through the PCI bus on the host. Also I wanted to explicitly hint libvirt as much as possible which parent device a vGPU device instance should be created on in case there are more than one of them, rather then scanning sysfs for a suitable parent which actually supports the given vGPU type.
field. The most important thing that libvirt needs to know when creating a mdev device for a VM is the mdev type name. The parent device should be an optional field to help higher level management tools deal with placement of the device for locality or load balancing. Also, we can't assume that the parent device is a PCI device, the sample mtty driver already breaks this assumption.
Since we need to assume non-PCI devices and we still need to enable management to hint libvirt about the parent to utilize load balancing and stuff, I've come up with the following adjustments/ideas on how to reflect that in the XML: - still use the address element but use it with the 'type' attribute [1] (still breaks the sample mtty driver though) while making the element truly optional if I'm going to be outvoted in favor of scanning the directory for a suitable parent device on our own, rather than requiring the user to provide that
- providing either an attribute or a standalone element for the parent device name, like a string version of the PCI address or whatever form the parent device comes in (doesn't break the mtty driver but I don't quite like this)
- providing a path element/attribute to sysfs pointing to the parent device which I'm afraid is what Daniel is not in favor of libvirt doing
So, this is what I've so far come up with in terms of hinting libvirt about the parent device, do you have any input on this, maybe some more ideas on how we should identify the parent device?
IMO, if we cannot account for the mtty sample driver, we're doing it wrong. I suppose we can leave it unspecified how one selects a parent device for the mtty driver, but it should be possible to expand the syntax to include it. So I think that means that when the parent address is provided, the parent address type needs to be specified as PCI. So...
<hostdev mode='subsystem' type='mdev'>
This needs to encompass the device API or else the optional VM address cannot be resolved. Perhaps model='vfio-pci' here? Seems similar to how we specify the device type for PCI controllers where we have multiple options:
<hostdev mode='subsystem' type='mdev' model='vfio-pci'>
<source>
For managed='no', I don't see that anything other than the mdev UUID is useful.
<uuid>MDEV_UUID</uuid>
If libvirt gets into the business of creating mdev devices and we call that managed='yes', then the mdev type to create is required. I don't know whether there's anything similar we can steal syntax from:
<type>"nvidia-11"</type>
That's pretty horrible, needs some xml guru love.
We need to provide for specifying a parent, but we can't assume the
From higher level perspective, I believe it would be "good enough" for most of the cases to only specify the type. Libvirt will anyway have to be able to enumerate the devices for listAllDevices afaik.
My wish would be specifying <hostdev mode='subsystem' type='mdev'> <type>nvidia-11</type> </hostdev> unless the user has specific requests or some other decision (mmio numa placement) takes place. We would additionally need (allocated instances/max instances of that type) in listAllDevices to account for the specific assignment possibility. I'm not sure what the decision was wrt type naming, can 2 different cards have similarly named type with different meaning? Sorry if I've missed some obvious reason why this can't work. mpolednik
type and address format of the parent. If we say the parent is a string, then we don't care, libvirt simply matches the /sys/clas/mdev_bus/$STRING/ device. If we want libvirt to interpret the parent address (I can't figure what value this adds, but aiui raw strings are frowned upon), then the address type would need to be specified. So instead of:
<address domain='0x0000' bus='0x00' slot='0x00' function='0x00'>
Perhaps:
<address type='pci' domain='0x0000' bus='0x00'...>
If we want to support mtty here, we'd need a different type, but at least we have the possibility to define that (though if parent is simply a string then even mtty works).
<source>
The type of the optional <address> here would be based on the "model" of the hostdev.
</hostdev>
Clearly I'm less than a novice at defining xml tags, this is just what I think needs to be there.
Also, grep'ing through the patches, I don't see that the "device_api"
Yep, this was also on purpose since as you write below, right now the only functioning mdev devices we have to work with are vfio-pci capable only, so with this RFC I wanted to gather some feedback on whether I'm moving the right direction in the first place. So yeah, I thought this could be added at any point later.
Hmm, I think that might be painting us into a corner for the future.
I see here that we already do the <address type='pci'...> thing, so using that as the parent source element fits.
The VM <address> though is optional, so we can't rely on queues from that to tell us what mdev type to expect, which implies we do need something like the model='vfio-pci' to determine the VM address format. Thanks,
Alex
-- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list

On Tue, 14 Feb 2017 16:50:14 +0100 Martin Polednik <mpolednik@redhat.com> wrote:
On 07/02/17 12:29 -0700, Alex Williamson wrote:
On Tue, 7 Feb 2017 17:26:51 +0100 Erik Skultety <eskultet@redhat.com> wrote:
On Mon, Feb 06, 2017 at 09:33:14AM -0700, Alex Williamson wrote:
On Mon, 6 Feb 2017 13:19:42 +0100 Erik Skultety <eskultet@redhat.com> wrote:
Finally. It's here. This is the initial suggestion on how libvirt might interract with the mdev framework, currently only focussing on the non-managed devices, i.e. those pre-created by the user, since that will be revisited once we all settled on how the XML should look like, given we might not want to use the sysfs path directly as an attribute in the domain XML. My proposal on the XML is the following:
<hostdev mode='subsystem' type='mdev'> <source> <!-- this is the host's physical device address --> <address domain='0x0000' bus='0x00' slot='0x00' function='0x00'> <uuid>vGPU_UUID<uuid> <source> <!-- target PCI address can be omitted to assign it automatically --> </hostdev>
So the mediated device is identified by the physical parent device visible on the host and a UUID which allows us to construct the sysfs path by ourselves, which we then put on the QEMU's command line.
Based on your test code, I think you're creating something like this:
-device vfio-pci,sysfsdev=/sys/class/mdev_bus/0000:00:03.0/53764d0e-85a0-42b4-af5c-2046b460b1dc
That would explain the need for the parent device address, but that's an entirely self inflicted requirement. For a managed="no" scenarios, we shouldn't need the parent, we can get to the mdev device via /sys/bus/mdev/devices/53764d0e-85a0-42b4-af5c-2046b460b1dc. So it
True, for managed="no" would this path be a nice optimization.
seems that the UUID should be the only required source element for managed="no".
For managed="yes", it seems like the parent device is still an optional
The reason I went with the parent address element (and purposely neglecting the sample mtty driver) was that I assumed any modern mdev capable HW would be accessible through the PCI bus on the host. Also I wanted to explicitly hint libvirt as much as possible which parent device a vGPU device instance should be created on in case there are more than one of them, rather then scanning sysfs for a suitable parent which actually supports the given vGPU type.
field. The most important thing that libvirt needs to know when creating a mdev device for a VM is the mdev type name. The parent device should be an optional field to help higher level management tools deal with placement of the device for locality or load balancing. Also, we can't assume that the parent device is a PCI device, the sample mtty driver already breaks this assumption.
Since we need to assume non-PCI devices and we still need to enable management to hint libvirt about the parent to utilize load balancing and stuff, I've come up with the following adjustments/ideas on how to reflect that in the XML: - still use the address element but use it with the 'type' attribute [1] (still breaks the sample mtty driver though) while making the element truly optional if I'm going to be outvoted in favor of scanning the directory for a suitable parent device on our own, rather than requiring the user to provide that
- providing either an attribute or a standalone element for the parent device name, like a string version of the PCI address or whatever form the parent device comes in (doesn't break the mtty driver but I don't quite like this)
- providing a path element/attribute to sysfs pointing to the parent device which I'm afraid is what Daniel is not in favor of libvirt doing
So, this is what I've so far come up with in terms of hinting libvirt about the parent device, do you have any input on this, maybe some more ideas on how we should identify the parent device?
IMO, if we cannot account for the mtty sample driver, we're doing it wrong. I suppose we can leave it unspecified how one selects a parent device for the mtty driver, but it should be possible to expand the syntax to include it. So I think that means that when the parent address is provided, the parent address type needs to be specified as PCI. So...
<hostdev mode='subsystem' type='mdev'>
This needs to encompass the device API or else the optional VM address cannot be resolved. Perhaps model='vfio-pci' here? Seems similar to how we specify the device type for PCI controllers where we have multiple options:
<hostdev mode='subsystem' type='mdev' model='vfio-pci'>
<source>
For managed='no', I don't see that anything other than the mdev UUID is useful.
<uuid>MDEV_UUID</uuid>
If libvirt gets into the business of creating mdev devices and we call that managed='yes', then the mdev type to create is required. I don't know whether there's anything similar we can steal syntax from:
<type>"nvidia-11"</type>
That's pretty horrible, needs some xml guru love.
We need to provide for specifying a parent, but we can't assume the
From higher level perspective, I believe it would be "good enough" for most of the cases to only specify the type. Libvirt will anyway have to be able to enumerate the devices for listAllDevices afaik.
My wish would be specifying <hostdev mode='subsystem' type='mdev'> <type>nvidia-11</type> </hostdev> unless the user has specific requests or some other decision (mmio numa placement) takes place.
Yes, the <type> is the minimum information necessary for libvirt to create the mdev device itself. A <source> section could add optional placement information. Note though that without an nvidia-11 type device on the system to query, the xml doesn't tell us what sort of device this creates in the VM. We could assume that it's vfio-pci, but designing in an assumption isn't a great idea. So, as above, some mechanism to make the xml self contained, such as specifying the model as vfio-pci, helps avoid that assumption and allows us to know the format for expressing the VM <address>
We would additionally need (allocated instances/max instances of that type) in listAllDevices to account for the specific assignment possibility.
mdev devices support an available_instances per mdev type that is dynamically updated as devices are created. The interaction of available_instances between different types is going to require some heuristics to understand. Some vendors may not support heterogeneous types, others may pull from a common pool of resources, where each type may consume resources from that pool at different rates.
I'm not sure what the decision was wrt type naming, can 2 different cards have similarly named type with different meaning?
We don't deal in similarities, each type ID is unique and it's up to the mdev vendor driver to make sure that an "nvidia-11" on and M60 card is software equivalent to an "nvidia-11" on an M10 card. If they're not equivalent, the type ID will be different. Something we may want to consider eventually is whether we want/need to deal with compatibility strings. For instance, NVIDIA seems to be tying the type ID strongly to specific implementations, an nvidia-11 may only be available on an M60 card. An M10 card may offer an nvidia-21 type with similar capabilities. There may be a need to express an mdev device as compatible with various type IDs for hardware availability, at the risk of exposing slight variations to the VM. This could also make placement easier for vendor drivers that only support homogeneous mdev devices, "I prefer an mdev ID of type 'nvidia-11', but will accept one of type 'nvidia-12,nvidia-21'". Thanks, Alex

On 14/02/17 09:58 -0700, Alex Williamson wrote:
On Tue, 14 Feb 2017 16:50:14 +0100 Martin Polednik <mpolednik@redhat.com> wrote:
On 07/02/17 12:29 -0700, Alex Williamson wrote:
On Tue, 7 Feb 2017 17:26:51 +0100 Erik Skultety <eskultet@redhat.com> wrote:
On Mon, Feb 06, 2017 at 09:33:14AM -0700, Alex Williamson wrote:
On Mon, 6 Feb 2017 13:19:42 +0100 Erik Skultety <eskultet@redhat.com> wrote:
Finally. It's here. This is the initial suggestion on how libvirt might interract with the mdev framework, currently only focussing on the non-managed devices, i.e. those pre-created by the user, since that will be revisited once we all settled on how the XML should look like, given we might not want to use the sysfs path directly as an attribute in the domain XML. My proposal on the XML is the following:
<hostdev mode='subsystem' type='mdev'> <source> <!-- this is the host's physical device address --> <address domain='0x0000' bus='0x00' slot='0x00' function='0x00'> <uuid>vGPU_UUID<uuid> <source> <!-- target PCI address can be omitted to assign it automatically --> </hostdev>
So the mediated device is identified by the physical parent device visible on the host and a UUID which allows us to construct the sysfs path by ourselves, which we then put on the QEMU's command line.
Based on your test code, I think you're creating something like this:
-device vfio-pci,sysfsdev=/sys/class/mdev_bus/0000:00:03.0/53764d0e-85a0-42b4-af5c-2046b460b1dc
That would explain the need for the parent device address, but that's an entirely self inflicted requirement. For a managed="no" scenarios, we shouldn't need the parent, we can get to the mdev device via /sys/bus/mdev/devices/53764d0e-85a0-42b4-af5c-2046b460b1dc. So it
True, for managed="no" would this path be a nice optimization.
seems that the UUID should be the only required source element for managed="no".
For managed="yes", it seems like the parent device is still an optional
The reason I went with the parent address element (and purposely neglecting the sample mtty driver) was that I assumed any modern mdev capable HW would be accessible through the PCI bus on the host. Also I wanted to explicitly hint libvirt as much as possible which parent device a vGPU device instance should be created on in case there are more than one of them, rather then scanning sysfs for a suitable parent which actually supports the given vGPU type.
field. The most important thing that libvirt needs to know when creating a mdev device for a VM is the mdev type name. The parent device should be an optional field to help higher level management tools deal with placement of the device for locality or load balancing. Also, we can't assume that the parent device is a PCI device, the sample mtty driver already breaks this assumption.
Since we need to assume non-PCI devices and we still need to enable management to hint libvirt about the parent to utilize load balancing and stuff, I've come up with the following adjustments/ideas on how to reflect that in the XML: - still use the address element but use it with the 'type' attribute [1] (still breaks the sample mtty driver though) while making the element truly optional if I'm going to be outvoted in favor of scanning the directory for a suitable parent device on our own, rather than requiring the user to provide that
- providing either an attribute or a standalone element for the parent device name, like a string version of the PCI address or whatever form the parent device comes in (doesn't break the mtty driver but I don't quite like this)
- providing a path element/attribute to sysfs pointing to the parent device which I'm afraid is what Daniel is not in favor of libvirt doing
So, this is what I've so far come up with in terms of hinting libvirt about the parent device, do you have any input on this, maybe some more ideas on how we should identify the parent device?
IMO, if we cannot account for the mtty sample driver, we're doing it wrong. I suppose we can leave it unspecified how one selects a parent device for the mtty driver, but it should be possible to expand the syntax to include it. So I think that means that when the parent address is provided, the parent address type needs to be specified as PCI. So...
<hostdev mode='subsystem' type='mdev'>
This needs to encompass the device API or else the optional VM address cannot be resolved. Perhaps model='vfio-pci' here? Seems similar to how we specify the device type for PCI controllers where we have multiple options:
<hostdev mode='subsystem' type='mdev' model='vfio-pci'>
<source>
For managed='no', I don't see that anything other than the mdev UUID is useful.
<uuid>MDEV_UUID</uuid>
If libvirt gets into the business of creating mdev devices and we call that managed='yes', then the mdev type to create is required. I don't know whether there's anything similar we can steal syntax from:
<type>"nvidia-11"</type>
That's pretty horrible, needs some xml guru love.
We need to provide for specifying a parent, but we can't assume the
From higher level perspective, I believe it would be "good enough" for most of the cases to only specify the type. Libvirt will anyway have to be able to enumerate the devices for listAllDevices afaik.
My wish would be specifying <hostdev mode='subsystem' type='mdev'> <type>nvidia-11</type> </hostdev> unless the user has specific requests or some other decision (mmio numa placement) takes place.
Yes, the <type> is the minimum information necessary for libvirt to create the mdev device itself. A <source> section could add optional placement information. Note though that without an nvidia-11 type device on the system to query, the xml doesn't tell us what sort of device this creates in the VM. We could assume that it's vfio-pci, but designing in an assumption isn't a great idea. So, as above, some mechanism to make the xml self contained, such as specifying the model as vfio-pci, helps avoid that assumption and allows us to know the format for expressing the VM <address>
As long as libvirt provides means to determine the model via device listing (listAllDevices), OK.
We would additionally need (allocated instances/max instances of that type) in listAllDevices to account for the specific assignment possibility.
mdev devices support an available_instances per mdev type that is dynamically updated as devices are created. The interaction of available_instances between different types is going to require some heuristics to understand. Some vendors may not support heterogeneous types, others may pull from a common pool of resources, where each type may consume resources from that pool at different rates.
Given common pool semantics, will we be able to calculate how many of each type will be available in the pool if we were to instantiate certain type? Example: available types: type_a: 4 devices (each consumes 1 "slot") type_b: 1 device (each consumes 4 "slots") total "slots": 4 we know that creating type_a device prevents any more type_b devices to be created. Does NVIDIA or AMD use the common pool?
I'm not sure what the decision was wrt type naming, can 2 different cards have similarly named type with different meaning?
We don't deal in similarities, each type ID is unique and it's up to the mdev vendor driver to make sure that an "nvidia-11" on and M60 card is software equivalent to an "nvidia-11" on an M10 card. If they're not equivalent, the type ID will be different. Something we may want to consider eventually is whether we want/need to deal with compatibility strings. For instance, NVIDIA seems to be tying the type ID strongly to specific implementations, an nvidia-11 may only be available on an M60 card. An M10 card may offer an nvidia-21 type with similar capabilities. There may be a need to express an mdev device as compatible with various type IDs for hardware availability, at the risk of exposing slight variations to the VM. This could also make placement easier for vendor drivers that only support homogeneous mdev devices, "I prefer an mdev ID of type 'nvidia-11', but will accept one of type 'nvidia-12,nvidia-21'". Thanks,
I like the idea of libvirt being able to select one of specified types, we have to bear in mind that it'll slightly complicate the XML: <mdev_types> <type>nvidia-11</type> <type>nvidia-21</type> </mdev_types> That luckily shouldn't be problem for libvirt or management software. On the other hand, the type equivalence will require some kind of labeling on the management side -- user defines "mygpu" as "vgpu with type nvidia-11 or nvidia-21" unless libvirt commits to a maintaining a database with capability-equivalent types for devices (which, given the generic-ness of the mdev, doesn't seem like a good idea).
Alex

On Wed, Feb 15, 2017 at 09:50:03AM +0100, Martin Polednik wrote:
On 14/02/17 09:58 -0700, Alex Williamson wrote:
On Tue, 14 Feb 2017 16:50:14 +0100 Martin Polednik <mpolednik@redhat.com> wrote:
On 07/02/17 12:29 -0700, Alex Williamson wrote:
On Tue, 7 Feb 2017 17:26:51 +0100 Erik Skultety <eskultet@redhat.com> wrote:
On Mon, Feb 06, 2017 at 09:33:14AM -0700, Alex Williamson wrote:
On Mon, 6 Feb 2017 13:19:42 +0100 Erik Skultety <eskultet@redhat.com> wrote:
> Finally. It's here. This is the initial suggestion on how libvirt might > interract with the mdev framework, currently only focussing on the non-managed > devices, i.e. those pre-created by the user, since that will be revisited once > we all settled on how the XML should look like, given we might not want to use > the sysfs path directly as an attribute in the domain XML. My proposal on the > XML is the following: > > <hostdev mode='subsystem' type='mdev'> > <source> > <!-- this is the host's physical device address --> > <address domain='0x0000' bus='0x00' slot='0x00' function='0x00'> > <uuid>vGPU_UUID<uuid> > <source> > <!-- target PCI address can be omitted to assign it automatically --> > </hostdev> > > So the mediated device is identified by the physical parent device visible on > the host and a UUID which allows us to construct the sysfs path by ourselves, > which we then put on the QEMU's command line.
Based on your test code, I think you're creating something like this:
-device vfio-pci,sysfsdev=/sys/class/mdev_bus/0000:00:03.0/53764d0e-85a0-42b4-af5c-2046b460b1dc
That would explain the need for the parent device address, but that's an entirely self inflicted requirement. For a managed="no" scenarios, we shouldn't need the parent, we can get to the mdev device via /sys/bus/mdev/devices/53764d0e-85a0-42b4-af5c-2046b460b1dc. So it
True, for managed="no" would this path be a nice optimization.
seems that the UUID should be the only required source element for managed="no".
For managed="yes", it seems like the parent device is still an optional
The reason I went with the parent address element (and purposely neglecting the sample mtty driver) was that I assumed any modern mdev capable HW would be accessible through the PCI bus on the host. Also I wanted to explicitly hint libvirt as much as possible which parent device a vGPU device instance should be created on in case there are more than one of them, rather then scanning sysfs for a suitable parent which actually supports the given vGPU type.
field. The most important thing that libvirt needs to know when creating a mdev device for a VM is the mdev type name. The parent device should be an optional field to help higher level management tools deal with placement of the device for locality or load balancing. Also, we can't assume that the parent device is a PCI device, the sample mtty driver already breaks this assumption.
Since we need to assume non-PCI devices and we still need to enable management to hint libvirt about the parent to utilize load balancing and stuff, I've come up with the following adjustments/ideas on how to reflect that in the XML: - still use the address element but use it with the 'type' attribute [1] (still breaks the sample mtty driver though) while making the element truly optional if I'm going to be outvoted in favor of scanning the directory for a suitable parent device on our own, rather than requiring the user to provide that
- providing either an attribute or a standalone element for the parent device name, like a string version of the PCI address or whatever form the parent device comes in (doesn't break the mtty driver but I don't quite like this)
- providing a path element/attribute to sysfs pointing to the parent device which I'm afraid is what Daniel is not in favor of libvirt doing
So, this is what I've so far come up with in terms of hinting libvirt about the parent device, do you have any input on this, maybe some more ideas on how we should identify the parent device?
IMO, if we cannot account for the mtty sample driver, we're doing it wrong. I suppose we can leave it unspecified how one selects a parent device for the mtty driver, but it should be possible to expand the syntax to include it. So I think that means that when the parent address is provided, the parent address type needs to be specified as PCI. So...
<hostdev mode='subsystem' type='mdev'>
This needs to encompass the device API or else the optional VM address cannot be resolved. Perhaps model='vfio-pci' here? Seems similar to how we specify the device type for PCI controllers where we have multiple options:
<hostdev mode='subsystem' type='mdev' model='vfio-pci'>
<source>
For managed='no', I don't see that anything other than the mdev UUID is useful.
<uuid>MDEV_UUID</uuid>
If libvirt gets into the business of creating mdev devices and we call that managed='yes', then the mdev type to create is required. I don't know whether there's anything similar we can steal syntax from:
<type>"nvidia-11"</type>
That's pretty horrible, needs some xml guru love.
We need to provide for specifying a parent, but we can't assume the
From higher level perspective, I believe it would be "good enough" for most of the cases to only specify the type. Libvirt will anyway have to be able to enumerate the devices for listAllDevices afaik.
My wish would be specifying <hostdev mode='subsystem' type='mdev'> <type>nvidia-11</type> </hostdev> unless the user has specific requests or some other decision (mmio numa placement) takes place.
Yes, the <type> is the minimum information necessary for libvirt to create the mdev device itself. A <source> section could add optional placement information. Note though that without an nvidia-11 type device on the system to query, the xml doesn't tell us what sort of device this creates in the VM. We could assume that it's vfio-pci, but designing in an assumption isn't a great idea. So, as above, some mechanism to make the xml self contained, such as specifying the model as vfio-pci, helps avoid that assumption and allows us to know the format for expressing the VM <address>
As long as libvirt provides means to determine the model via device listing (listAllDevices), OK.
Yes, libvirt will provide means expose this information.
We would additionally need (allocated instances/max instances of that type) in listAllDevices to account for the specific assignment possibility.
mdev devices support an available_instances per mdev type that is dynamically updated as devices are created. The interaction of available_instances between different types is going to require some heuristics to understand. Some vendors may not support heterogeneous types, others may pull from a common pool of resources, where each type may consume resources from that pool at different rates.
Given common pool semantics, will we be able to calculate how many of each type will be available in the pool if we were to instantiate certain type? Example:
available types: type_a: 4 devices (each consumes 1 "slot") type_b: 1 device (each consumes 4 "slots") total "slots": 4
Well, if we could assume that the number of instances for a specific type would always be a power of 2 and the resources are distributed in that manner, then it's simple, you're allocating a resources that a more resource-demanding type would need to instantiate a single device, so you'll end up with one less device for each more resource-demanding type recursively. However, that is a strong assumption to make, so I'm not sure, it's possible that available instances, which only updates once you instantiated a specific type, is the only thing we should rely on.
we know that creating type_a device prevents any more type_b devices to be created.
Does NVIDIA or AMD use the common pool?
I'm not sure what the decision was wrt type naming, can 2 different cards have similarly named type with different meaning?
We don't deal in similarities, each type ID is unique and it's up to the mdev vendor driver to make sure that an "nvidia-11" on and M60 card is software equivalent to an "nvidia-11" on an M10 card. If they're not equivalent, the type ID will be different. Something we may want to consider eventually is whether we want/need to deal with compatibility strings. For instance, NVIDIA seems to be tying the type ID strongly to specific implementations, an nvidia-11 may only be available on an M60 card. An M10 card may offer an nvidia-21 type with similar capabilities. There may be a need to express an mdev device as compatible with various type IDs for hardware availability, at the risk of exposing slight variations to the VM. This could also make placement easier for vendor drivers that only support homogeneous mdev devices, "I prefer an mdev ID of type 'nvidia-11', but will accept one of type 'nvidia-12,nvidia-21'". Thanks,
I like the idea of libvirt being able to select one of specified types, we have to bear in mind that it'll slightly complicate the XML:
<mdev_types> <type>nvidia-11</type> <type>nvidia-21</type> </mdev_types>
^^ are you referring to nodedev XML or domain XML, because in case of a domain, there should be only one type per <hostdev type='mdev'>. There is also the ongoing question what's the best way to approach creation of mdev with libvirt and we have to be very careful with that so it won't bite us back in the future. However, for 7.4 the priority is to accept a pre-created device and to provide means in the nodedev driver to list all existing mdev devices and their corresponding parent devices.
That luckily shouldn't be problem for libvirt or management software. On the other hand, the type equivalence will require some kind of labeling on the management side -- user defines "mygpu" as "vgpu with type nvidia-11 or nvidia-21" unless libvirt commits to a maintaining a database with capability-equivalent types for devices (which, given the generic-ness of the mdev, doesn't seem like a good idea).
Libvirt definitely shouldn't be handling type compatibility-related issues. As Alex pointed out, this should be vendor driver's responsibility. There's also Intel's KVMGT which has a different approach to it's type IDs. IIUC they based their type IDs on the fraction of actual resources used, i.e. type _1 consumes the whole HW _2 consumes half, etc. but this is a question for Alex as he's been playing with it for some time. Anyhow, from my understanding Intel's types look more generic, thus more compatible with different HW revisions, if so, then in that case by dealing with the type compatibility, libvirt would be tailoring its logic to a specific vendor's use whereas I think libvirt should only focus on interacting with the mdev framework using the data it's got from the user. IOW new mdev-capable HW will be coming out which would in turn just bring more types to deal with. If the vendor driver won't be willing to accept any other type than just the set it's exporting, then I think the management may want to try to compensate for this with the information it can query from libvirt. Erik
Alex

On Wed, 15 Feb 2017 14:43:22 +0100 Erik Skultety <eskultet@redhat.com> wrote:
On Wed, Feb 15, 2017 at 09:50:03AM +0100, Martin Polednik wrote:
On 14/02/17 09:58 -0700, Alex Williamson wrote:
On Tue, 14 Feb 2017 16:50:14 +0100 Martin Polednik <mpolednik@redhat.com> wrote:
On 07/02/17 12:29 -0700, Alex Williamson wrote:
On Tue, 7 Feb 2017 17:26:51 +0100 Erik Skultety <eskultet@redhat.com> wrote:
On Mon, Feb 06, 2017 at 09:33:14AM -0700, Alex Williamson wrote: > On Mon, 6 Feb 2017 13:19:42 +0100 > Erik Skultety <eskultet@redhat.com> wrote: > > > Finally. It's here. This is the initial suggestion on how libvirt might > > interract with the mdev framework, currently only focussing on the non-managed > > devices, i.e. those pre-created by the user, since that will be revisited once > > we all settled on how the XML should look like, given we might not want to use > > the sysfs path directly as an attribute in the domain XML. My proposal on the > > XML is the following: > > > > <hostdev mode='subsystem' type='mdev'> > > <source> > > <!-- this is the host's physical device address --> > > <address domain='0x0000' bus='0x00' slot='0x00' function='0x00'> > > <uuid>vGPU_UUID<uuid> > > <source> > > <!-- target PCI address can be omitted to assign it automatically --> > > </hostdev> > > > > So the mediated device is identified by the physical parent device visible on > > the host and a UUID which allows us to construct the sysfs path by ourselves, > > which we then put on the QEMU's command line. > > Based on your test code, I think you're creating something like this: > > -device vfio-pci,sysfsdev=/sys/class/mdev_bus/0000:00:03.0/53764d0e-85a0-42b4-af5c-2046b460b1dc > > That would explain the need for the parent device address, but that's > an entirely self inflicted requirement. For a managed="no" scenarios, > we shouldn't need the parent, we can get to the mdev device > via /sys/bus/mdev/devices/53764d0e-85a0-42b4-af5c-2046b460b1dc. So it
True, for managed="no" would this path be a nice optimization.
> seems that the UUID should be the only required source element for > managed="no". > > For managed="yes", it seems like the parent device is still an optional
The reason I went with the parent address element (and purposely neglecting the sample mtty driver) was that I assumed any modern mdev capable HW would be accessible through the PCI bus on the host. Also I wanted to explicitly hint libvirt as much as possible which parent device a vGPU device instance should be created on in case there are more than one of them, rather then scanning sysfs for a suitable parent which actually supports the given vGPU type.
> field. The most important thing that libvirt needs to know when > creating a mdev device for a VM is the mdev type name. The parent > device should be an optional field to help higher level management > tools deal with placement of the device for locality or load balancing. > Also, we can't assume that the parent device is a PCI device, the > sample mtty driver already breaks this assumption.
Since we need to assume non-PCI devices and we still need to enable management to hint libvirt about the parent to utilize load balancing and stuff, I've come up with the following adjustments/ideas on how to reflect that in the XML: - still use the address element but use it with the 'type' attribute [1] (still breaks the sample mtty driver though) while making the element truly optional if I'm going to be outvoted in favor of scanning the directory for a suitable parent device on our own, rather than requiring the user to provide that
- providing either an attribute or a standalone element for the parent device name, like a string version of the PCI address or whatever form the parent device comes in (doesn't break the mtty driver but I don't quite like this)
- providing a path element/attribute to sysfs pointing to the parent device which I'm afraid is what Daniel is not in favor of libvirt doing
So, this is what I've so far come up with in terms of hinting libvirt about the parent device, do you have any input on this, maybe some more ideas on how we should identify the parent device?
IMO, if we cannot account for the mtty sample driver, we're doing it wrong. I suppose we can leave it unspecified how one selects a parent device for the mtty driver, but it should be possible to expand the syntax to include it. So I think that means that when the parent address is provided, the parent address type needs to be specified as PCI. So...
<hostdev mode='subsystem' type='mdev'>
This needs to encompass the device API or else the optional VM address cannot be resolved. Perhaps model='vfio-pci' here? Seems similar to how we specify the device type for PCI controllers where we have multiple options:
<hostdev mode='subsystem' type='mdev' model='vfio-pci'>
<source>
For managed='no', I don't see that anything other than the mdev UUID is useful.
<uuid>MDEV_UUID</uuid>
If libvirt gets into the business of creating mdev devices and we call that managed='yes', then the mdev type to create is required. I don't know whether there's anything similar we can steal syntax from:
<type>"nvidia-11"</type>
That's pretty horrible, needs some xml guru love.
We need to provide for specifying a parent, but we can't assume the
From higher level perspective, I believe it would be "good enough" for most of the cases to only specify the type. Libvirt will anyway have to be able to enumerate the devices for listAllDevices afaik.
My wish would be specifying <hostdev mode='subsystem' type='mdev'> <type>nvidia-11</type> </hostdev> unless the user has specific requests or some other decision (mmio numa placement) takes place.
Yes, the <type> is the minimum information necessary for libvirt to create the mdev device itself. A <source> section could add optional placement information. Note though that without an nvidia-11 type device on the system to query, the xml doesn't tell us what sort of device this creates in the VM. We could assume that it's vfio-pci, but designing in an assumption isn't a great idea. So, as above, some mechanism to make the xml self contained, such as specifying the model as vfio-pci, helps avoid that assumption and allows us to know the format for expressing the VM <address>
As long as libvirt provides means to determine the model via device listing (listAllDevices), OK.
Yes, libvirt will provide means expose this information.
We would additionally need (allocated instances/max instances of that type) in listAllDevices to account for the specific assignment possibility.
mdev devices support an available_instances per mdev type that is dynamically updated as devices are created. The interaction of available_instances between different types is going to require some heuristics to understand. Some vendors may not support heterogeneous types, others may pull from a common pool of resources, where each type may consume resources from that pool at different rates.
Given common pool semantics, will we be able to calculate how many of each type will be available in the pool if we were to instantiate certain type? Example:
available types: type_a: 4 devices (each consumes 1 "slot") type_b: 1 device (each consumes 4 "slots") total "slots": 4
Well, if we could assume that the number of instances for a specific type would always be a power of 2 and the resources are distributed in that manner, then it's simple, you're allocating a resources that a more resource-demanding type would need to instantiate a single device, so you'll end up with one less device for each more resource-demanding type recursively. However, that is a strong assumption to make, so I'm not sure, it's possible that available instances, which only updates once you instantiated a specific type, is the only thing we should rely on.
Agree, and vendors can change how they manage this at any time. For instance if I boot one version of the kernel, i915 gives me: i915-GVTg_V4_1 i915-GVTg_V4_2 i915-GVTg_V4_4 If I boot another, I get: i915-GVTg_V4_1 i915-GVTg_V4_2 i915-GVTg_V4_5 i915-GVTg_V4_7 Now we don't have evenly divisible numbers. If I create a type _1 device, available_instances still says I can create one type _5 or _7. It's perhaps best for libvirt to just look at the current state and not try to predict the future.
we know that creating type_a device prevents any more type_b devices to be created.
Does NVIDIA or AMD use the common pool?
AMD isn't a player here yet, Intel and NVIDIA have vGPUs, IBM has a model under development for S390 channel I/O. The only thing you can rely on is available_instances per mdev type at a given point of time. How available_instances chances when we start creating devices is vendor specific and may change at any time.
I'm not sure what the decision was wrt type naming, can 2 different cards have similarly named type with different meaning?
We don't deal in similarities, each type ID is unique and it's up to the mdev vendor driver to make sure that an "nvidia-11" on and M60 card is software equivalent to an "nvidia-11" on an M10 card. If they're not equivalent, the type ID will be different. Something we may want to consider eventually is whether we want/need to deal with compatibility strings. For instance, NVIDIA seems to be tying the type ID strongly to specific implementations, an nvidia-11 may only be available on an M60 card. An M10 card may offer an nvidia-21 type with similar capabilities. There may be a need to express an mdev device as compatible with various type IDs for hardware availability, at the risk of exposing slight variations to the VM. This could also make placement easier for vendor drivers that only support homogeneous mdev devices, "I prefer an mdev ID of type 'nvidia-11', but will accept one of type 'nvidia-12,nvidia-21'". Thanks,
I like the idea of libvirt being able to select one of specified types, we have to bear in mind that it'll slightly complicate the XML:
<mdev_types> <type>nvidia-11</type> <type>nvidia-21</type> </mdev_types>
^^ are you referring to nodedev XML or domain XML, because in case of a domain, there should be only one type per <hostdev type='mdev'>. There is also the ongoing question what's the best way to approach creation of mdev with libvirt and we have to be very careful with that so it won't bite us back in the future. However, for 7.4 the priority is to accept a pre-created device and to provide means in the nodedev driver to list all existing mdev devices and their corresponding parent devices.
Sorry if I confused the topic with some sort of compatibility listing. I agree that for a <hostdev> there needs to be a single type if libvirt is to create an instance of that type. Any notion of compatible or secondary acceptable types is lower priority than the necessary basic behaviors, I'm just trying to plan ahead for later extensions that might be useful. Perhaps any notion of compatibility lives in user define lists above libvirt.
That luckily shouldn't be problem for libvirt or management software. On the other hand, the type equivalence will require some kind of labeling on the management side -- user defines "mygpu" as "vgpu with type nvidia-11 or nvidia-21" unless libvirt commits to a maintaining a database with capability-equivalent types for devices (which, given the generic-ness of the mdev, doesn't seem like a good idea).
Libvirt definitely shouldn't be handling type compatibility-related issues. As Alex pointed out, this should be vendor driver's responsibility. There's also Intel's KVMGT which has a different approach to it's type IDs. IIUC they based their type IDs on the fraction of actual resources used, i.e. type _1 consumes the whole HW _2 consumes half, etc. but this is a question for Alex as he's been playing with it for some time. Anyhow, from my understanding Intel's types look more generic, thus more compatible with different HW revisions, if so, then in that case by dealing with the type compatibility, libvirt would be tailoring its logic to a specific vendor's use whereas I think libvirt should only focus on interacting with the mdev framework using the data it's got from the user. IOW new mdev-capable HW will be coming out which would in turn just bring more types to deal with. If the vendor driver won't be willing to accept any other type than just the set it's exporting, then I think the management may want to try to compensate for this with the information it can query from libvirt.
AIUI, looking at the example Intel mdev types I list above, that "V4" indicates a Broadwell class GPU. So don't be fooled into thinking Intel is making some sort of generic device that it can represent on any platform. I'd expect a Skylake system to export similar types, but with a "V5" component to the name. NVIDIA naming is just more opaque, possibly changing not only across generations but across implementations. I fully agree that it's the vendor's responsibility to maintain that a given type is compatible wherever it is exposed and libvirt's first priority is to focus on specifying a single type in the xml and working towards instantiating an mdev of that type. Libvirt should never assume that anything other than that single, exact type is compatible or sufficient for the VM. I did plant the seed above about whether user defined compatibility lists might be useful, it seems like something we should keep in mind, but at a lower priority than any sort of initial support. As Erik suggested in a separate discussion, perhaps any notion of user defined compatibility happens at a management layer above libvirt. Migration also needs to be considered when we think about compatible devices. Compatibility likely only refers to the point at which we instantiate the VM, if we were ever to support migration of an mdev device, the target and source would need to be identical. Yet more reasons for libvirt to leave compatibility to higher layers of management tools. Thanks, Alex

On Mon, Feb 06, 2017 at 01:19:42PM +0100, Erik Skultety wrote:
Finally. It's here. This is the initial suggestion on how libvirt might interract with the mdev framework, currently only focussing on the non-managed devices, i.e. those pre-created by the user, since that will be revisited once we all settled on how the XML should look like, given we might not want to use the sysfs path directly as an attribute in the domain XML. My proposal on the XML is the following:
<hostdev mode='subsystem' type='mdev'> <source> <!-- this is the host's physical device address --> <address domain='0x0000' bus='0x00' slot='0x00' function='0x00'> <uuid>vGPU_UUID<uuid> <source> <!-- target PCI address can be omitted to assign it automatically --> </hostdev>
So the mediated device is identified by the physical parent device visible on the host and a UUID which allows us to construct the sysfs path by ourselves, which we then put on the QEMU's command line.
A few remarks if you actually happen to have a machine to test this on: - right now the mediated devices are one-time use only, i.e. they have to be recreated before every machine boot - I wouldn't recommend assigning multiple vGPUs to a single domain
Once this series is sorted out, we can then continue with 'managed=yes' where as Laine pointed out [1], we need to figure out how exactly should the management layer hint libvirt which vGPU type should be used for device instantiation.
You seem to be suggesting that managed=yes with mdev devices would cause create / delete of a mdev device from a specified parent. This is rather different semantics from what managed=yes does with PCI device assignment today. There the managed=yes flag is just about controlling host device driver attachment. ie whether libvirt will manually bind to vfio.ko, or expect the admin to have bound it to vfio.ko before hand. I think it is important to keep that concept as is for mdev too. While we're thinking of mdev purely in terms of KVM + vfio usage, it wouldn't suprise me if there ended up being non-KVM based use cases for mdev. It isn't clear to me that auto-creation of mdev devices as a concept even belongs in the domain XML neccessarily. Looking at two similar areas. For SRIOV NICs, in the domain XML you either specify an explicit VF to use, or you reference a libvirt virtual network. The latter takes care of dynamically providing VFs to VMs. For NPIV, IIRC, the domain XML works similarly either taking an explicit vHBA, or referencing a storage pool to get one more dynamically. Before we even consider auto-creation though, I think we need to have manual creation designed & integrated in the node device APIs. So in terms of the domain XML, I think the only think we need to provide is the address of the pre-existing mdev device to be used. In this case "address" means the UUID. We should not need anything about the parent device AFAICT. Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://entangle-photo.org -o- http://search.cpan.org/~danberr/ :|

On Mon, 6 Feb 2017 16:44:37 +0000 "Daniel P. Berrange" <berrange@redhat.com> wrote:
On Mon, Feb 06, 2017 at 01:19:42PM +0100, Erik Skultety wrote:
Finally. It's here. This is the initial suggestion on how libvirt might interract with the mdev framework, currently only focussing on the non-managed devices, i.e. those pre-created by the user, since that will be revisited once we all settled on how the XML should look like, given we might not want to use the sysfs path directly as an attribute in the domain XML. My proposal on the XML is the following:
<hostdev mode='subsystem' type='mdev'> <source> <!-- this is the host's physical device address --> <address domain='0x0000' bus='0x00' slot='0x00' function='0x00'> <uuid>vGPU_UUID<uuid> <source> <!-- target PCI address can be omitted to assign it automatically --> </hostdev>
So the mediated device is identified by the physical parent device visible on the host and a UUID which allows us to construct the sysfs path by ourselves, which we then put on the QEMU's command line.
A few remarks if you actually happen to have a machine to test this on: - right now the mediated devices are one-time use only, i.e. they have to be recreated before every machine boot - I wouldn't recommend assigning multiple vGPUs to a single domain
Once this series is sorted out, we can then continue with 'managed=yes' where as Laine pointed out [1], we need to figure out how exactly should the management layer hint libvirt which vGPU type should be used for device instantiation.
You seem to be suggesting that managed=yes with mdev devices would cause create / delete of a mdev device from a specified parent.
This is rather different semantics from what managed=yes does with PCI device assignment today. There the managed=yes flag is just about controlling host device driver attachment. ie whether libvirt will manually bind to vfio.ko, or expect the admin to have bound it to vfio.ko before hand. I think it is important to keep that concept as is for mdev too.
While we're thinking of mdev purely in terms of KVM + vfio usage, it wouldn't suprise me if there ended up being non-KVM based use cases for mdev.
It isn't clear to me that auto-creation of mdev devices as a concept even belongs in the domain XML neccessarily.
Looking at two similar areas. For SRIOV NICs, in the domain XML you either specify an explicit VF to use, or you reference a libvirt virtual network. The latter takes care of dynamically providing VFs to VMs. For NPIV, IIRC, the domain XML works similarly either taking an explicit vHBA, or referencing a storage pool to get one more dynamically.
Nit, there are other constraints of SR-IOV which I think are over simplifying this analogy. With SR-IOV, we can't dynamically instantiate new VFs individually. The process there requires that we set the number of VFs we need and enable them. Changing that number of VFs requires that all existing VFs on that PF are removed and recreated. So, does libvirt work the way it does with SR-IOV devices because that's the optimal way for users to make use of those VFs, or does it behave that way because it must to follow the constraints of the device? I think libvirt handles VFs much like it does PFs because it has no other choice. Here we do have a choice. Individual mdev devices can be created and destroyed. The only dependency between mdev devices is how creating one affects the availability of mdev types remaining on the parent device. It would really be a shame to not take advantage of the fact that the underlying device creation has advanced so far from SR-IOV and lump it into the same sort of management. My impression is that user management of creating SR-IOV VFs via module options or self defined scripts is a stumbling point that libvirt could help to address here.
Before we even consider auto-creation though, I think we need to have manual creation designed & integrated in the node device APIs.
So in terms of the domain XML, I think the only think we need to provide is the address of the pre-existing mdev device to be used. In this case "address" means the UUID. We should not need anything about the parent device AFAICT.
Yep, agree. Thanks, Alex

On Mon, Feb 06, 2017 at 04:44:37PM +0000, Daniel P. Berrange wrote:
On Mon, Feb 06, 2017 at 01:19:42PM +0100, Erik Skultety wrote:
Finally. It's here. This is the initial suggestion on how libvirt might interract with the mdev framework, currently only focussing on the non-managed devices, i.e. those pre-created by the user, since that will be revisited once we all settled on how the XML should look like, given we might not want to use the sysfs path directly as an attribute in the domain XML. My proposal on the XML is the following:
<hostdev mode='subsystem' type='mdev'> <source> <!-- this is the host's physical device address --> <address domain='0x0000' bus='0x00' slot='0x00' function='0x00'> <uuid>vGPU_UUID<uuid> <source> <!-- target PCI address can be omitted to assign it automatically --> </hostdev>
So the mediated device is identified by the physical parent device visible on the host and a UUID which allows us to construct the sysfs path by ourselves, which we then put on the QEMU's command line.
A few remarks if you actually happen to have a machine to test this on: - right now the mediated devices are one-time use only, i.e. they have to be recreated before every machine boot - I wouldn't recommend assigning multiple vGPUs to a single domain
Once this series is sorted out, we can then continue with 'managed=yes' where as Laine pointed out [1], we need to figure out how exactly should the management layer hint libvirt which vGPU type should be used for device instantiation.
You seem to be suggesting that managed=yes with mdev devices would cause create / delete of a mdev device from a specified parent.
This is rather different semantics from what managed=yes does with PCI device assignment today. There the managed=yes flag is just about controlling host device driver attachment. ie whether libvirt will manually bind to vfio.ko, or expect the admin to have bound it to vfio.ko before hand. I think it is important to keep that concept as is for mdev too.
If the managed attribute was used with other devices beside PCI, then sure, we should keep the concept, however, since only PCI devices support it and now we have another device type that potentially might have a use for such an attribute I think it's perfectly reasonable to alter the logic behind that attribute in favor of the new possibilities to device management which mdev framework is providing us with which in this case is dynamic creation and removal of a mediated device.
While we're thinking of mdev purely in terms of KVM + vfio usage, it wouldn't suprise me if there ended up being non-KVM based use cases for mdev.
It isn't clear to me that auto-creation of mdev devices as a concept even belongs in the domain XML neccessarily.
Looking at two similar areas. For SRIOV NICs, in the domain XML you either specify an explicit VF to use, or you reference a libvirt virtual network. The latter takes care of dynamically providing VFs to VMs. For NPIV, IIRC, the domain XML works similarly either taking an explicit vHBA, or referencing a storage pool to get one more dynamically.
Before we even consider auto-creation though, I think we need to have manual creation designed & integrated in the node device APIs.
Yes, integrating mdev into the nodedev driver this way ^^ is definitely planned.
So in terms of the domain XML, I think the only think we need to provide is the address of the pre-existing mdev device to be used. In this case "address" means the UUID. We should not need anything about the parent device AFAICT.
Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://entangle-photo.org -o- http://search.cpan.org/~danberr/ :|

On Tue, Feb 07, 2017 at 05:48:23PM +0100, Erik Skultety wrote:
On Mon, Feb 06, 2017 at 04:44:37PM +0000, Daniel P. Berrange wrote:
On Mon, Feb 06, 2017 at 01:19:42PM +0100, Erik Skultety wrote:
Finally. It's here. This is the initial suggestion on how libvirt might interract with the mdev framework, currently only focussing on the non-managed devices, i.e. those pre-created by the user, since that will be revisited once we all settled on how the XML should look like, given we might not want to use the sysfs path directly as an attribute in the domain XML. My proposal on the XML is the following:
<hostdev mode='subsystem' type='mdev'> <source> <!-- this is the host's physical device address --> <address domain='0x0000' bus='0x00' slot='0x00' function='0x00'> <uuid>vGPU_UUID<uuid> <source> <!-- target PCI address can be omitted to assign it automatically --> </hostdev>
So the mediated device is identified by the physical parent device visible on the host and a UUID which allows us to construct the sysfs path by ourselves, which we then put on the QEMU's command line.
A few remarks if you actually happen to have a machine to test this on: - right now the mediated devices are one-time use only, i.e. they have to be recreated before every machine boot - I wouldn't recommend assigning multiple vGPUs to a single domain
Once this series is sorted out, we can then continue with 'managed=yes' where as Laine pointed out [1], we need to figure out how exactly should the management layer hint libvirt which vGPU type should be used for device instantiation.
You seem to be suggesting that managed=yes with mdev devices would cause create / delete of a mdev device from a specified parent.
This is rather different semantics from what managed=yes does with PCI device assignment today. There the managed=yes flag is just about controlling host device driver attachment. ie whether libvirt will manually bind to vfio.ko, or expect the admin to have bound it to vfio.ko before hand. I think it is important to keep that concept as is for mdev too.
If the managed attribute was used with other devices beside PCI, then sure, we should keep the concept, however, since only PCI devices support it and now we have another device type that potentially might have a use for such an attribute I think it's perfectly reasonable to alter the logic behind that attribute in favor of the new possibilities to device management which mdev framework is providing us with which in this case is dynamic creation and removal of a mediated device.
No we really shouldn't use one attribute to overload completely different semantics. As I say, we may well find we want to implement the existing PCI semantics for mdev devices too. If we want to auto-create, that should be a different attribute eg 'autocreate=yes|no' Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://entangle-photo.org -o- http://search.cpan.org/~danberr/ :|
participants (5)
-
Alex Williamson
-
Daniel P. Berrange
-
Erik Skultety
-
Martin Polednik
-
Michal Privoznik