[libvirt] [PATCH REPOST v2 0/3] Add callback mech for qemu and nodedev

Repost: http://www.redhat.com/archives/libvir-list/2017-February/msg00501.html
to update to top of branch as of commit id '5ad03b9db2'

John Ferlan (3):
  nodedev: Add driver callback mechanism to add/remove devices
  qemu: Use nodedev callback mechanism to get nodedev data
  qemu: Add configuration variable to control nodedev enumeration

 src/check-aclrules.pl              |   3 +-
 src/check-driverimpls.pl           |   1 +
 src/conf/node_device_conf.c        | 112 ++++++++++++++++++++++++++++++++++++-
 src/conf/node_device_conf.h        |  18 ++++++
 src/libvirt_private.syms           |   3 +
 src/node_device/node_device_udev.c |  29 ++++++++++
 src/qemu/libvirtd_qemu.aug         |   1 +
 src/qemu/qemu.conf                 |  19 +++++++
 src/qemu/qemu_conf.c               |  53 ++++++++++++++++++
 src/qemu/qemu_conf.h               |  17 ++++++
 src/qemu/qemu_driver.c             |  46 +++++++++++++++
 src/qemu/test_libvirtd_qemu.aug.in |  1 +
 12 files changed, 301 insertions(+), 2 deletions(-)

-- 
2.9.3

Add a callback mechanism as a side door of sorts to the event mgmt
functions that are triggered when a node_device object is added or
removed. This includes a mechanism to enumerate already added devices
for those stateInitialize functions called after the initial
nodedevRegister is called. If there is no enumeration callback, a
failure is returned, since the caller would seem to need it.

Signed-off-by: John Ferlan <jferlan@redhat.com>
---
 src/check-aclrules.pl              |   3 +-
 src/check-driverimpls.pl           |   1 +
 src/conf/node_device_conf.c        | 112 ++++++++++++++++++++++++++++++++++++-
 src/conf/node_device_conf.h        |  18 ++++++
 src/libvirt_private.syms           |   3 +
 src/node_device/node_device_udev.c |  29 ++++++++++
 6 files changed, 164 insertions(+), 2 deletions(-)

diff --git a/src/check-aclrules.pl b/src/check-aclrules.pl
index 8739cda..161d954 100755
--- a/src/check-aclrules.pl
+++ b/src/check-aclrules.pl
@@ -243,7 +243,8 @@ while (<>) {
     } elsif (/^(?:static\s+)?(vir(?:\w+)?Driver)\s+/) {
         if ($1 ne "virNWFilterCallbackDriver" &&
             $1 ne "virNWFilterTechDriver" &&
-            $1 ne "virDomainConfNWFilterDriver") {
+            $1 ne "virDomainConfNWFilterDriver" &&
+            $1 ne "virNodeDeviceCallbackDriver") {
             $intable = 1;
             $table = $1;
         }
diff --git a/src/check-driverimpls.pl b/src/check-driverimpls.pl
index e320558..b707fc8 100755
--- a/src/check-driverimpls.pl
+++ b/src/check-driverimpls.pl
@@ -70,6 +70,7 @@ while (<>) {
     } elsif (/^(?:static\s+)?(vir(?:\w+)?Driver)\s+/) {
         next if $1 eq "virNWFilterCallbackDriver" ||
                 $1 eq "virNWFilterTechDriver" ||
+                $1 eq "virNodeDeviceCallbackDriver" ||
                 $1 eq "virConnectDriver";
         $intable = 1;
         $table = $1;
diff --git a/src/conf/node_device_conf.c b/src/conf/node_device_conf.c
index c802840..061f013 100644
--- a/src/conf/node_device_conf.c
+++ b/src/conf/node_device_conf.c
@@ -278,6 +278,99 @@ void virNodeDeviceDefFree(virNodeDeviceDefPtr def)
     VIR_FREE(def);
 }
+
+/* Provide callback infrastructure that is expected to be called when
+ * devices are added/removed from a node_device subsystem that supports
+ * the callback mechanism. This provides a way for drivers to register
+ * to be notified when a node_device is added/removed. */
+virNodedevEnumerateAddDevices virNodedevEnumerateAddDevicesCb = NULL;
+int nodeDeviceCallbackDriver;
+#define MAX_CALLBACK_DRIVER 10
+static virNodeDeviceCallbackDriverPtr nodeDeviceDrvArray[MAX_CALLBACK_DRIVER];
+
+
+/**
+ * @cb: Driver function to be called.
+ *
+ * Set a callback at driver initialization to provide a callback to a
+ * driver specific function that can handle the enumeration of the existing
+ * devices and the addition of those devices for the registering driver.
+ *
+ * The initial node_device enumeration is done prior to other drivers, thus
+ * this provides a mechanism to load the existing data since the functions
+ * virNodeDeviceAssignDef and virNodeDeviceObjRemove would typically
+ * only be called when a new device is added/removed after the initial
+ * enumeration. The registering driver will need to handle duplication
+ * of data.
+ */
+void
+virNodeDeviceConfEnumerateInit(virNodedevEnumerateAddDevices cb)
+{
+    virNodedevEnumerateAddDevicesCb = cb;
+}
+
+
+/**
+ * @cbd: Driver callback pointers to add/remove devices
+ *
+ * Register a callback function in the registering driver to be called
+ * when devices are added or removed. Additionally, ensure the initial
+ * enumeration of the devices is completed, taking care to do it after
+ * setting the callbacks.
+ *
+ * Returns address of enumerate callback on success, NULL on failure
+ */
+virNodedevEnumerateAddDevices
+virNodeDeviceRegisterCallbackDriver(virNodeDeviceCallbackDriverPtr cbd)
+{
+    if (nodeDeviceCallbackDriver >= MAX_CALLBACK_DRIVER) {
+        virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
+                       _("no space available for another callback driver"));
+        return NULL;
+    }
+
+    if (!cbd->nodeDeviceAdd || !cbd->nodeDeviceRemove) {
+        virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
+                       _("callback requires both add and remove functions"));
+        return NULL;
+    }
+
+    if (!virNodedevEnumerateAddDevicesCb) {
+        virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
+                       _("no enumeration callback API is registered"));
+        return NULL;
+    }
+
+    nodeDeviceDrvArray[nodeDeviceCallbackDriver++] = cbd;
+
+    return virNodedevEnumerateAddDevicesCb;
+}
+
+
+/**
+ * @cbd: Driver callback pointers to add/remove devices
+ *
+ * Clear the callback function. It'll be up to the calling driver to
+ * clear its own data properly.
+ */
+void
+virNodeDeviceUnregisterCallbackDriver(virNodeDeviceCallbackDriverPtr cbd)
+{
+    size_t i = 0;
+
+    while (i < nodeDeviceCallbackDriver && nodeDeviceDrvArray[i] != cbd)
+        i++;
+
+    if (i < nodeDeviceCallbackDriver) {
+        memmove(&nodeDeviceDrvArray[i], &nodeDeviceDrvArray[i+1],
+                (nodeDeviceCallbackDriver - i - 1) *
+                sizeof(nodeDeviceDrvArray[i]));
+        nodeDeviceDrvArray[i] = NULL;
+        nodeDeviceCallbackDriver--;
+    }
+}
+
+
 void virNodeDeviceObjFree(virNodeDeviceObjPtr dev)
 {
     if (!dev)
@@ -304,6 +397,7 @@ void virNodeDeviceObjListFree(virNodeDeviceObjListPtr devs)
 virNodeDeviceObjPtr virNodeDeviceAssignDef(virNodeDeviceObjListPtr devs,
                                            virNodeDeviceDefPtr def)
 {
+    size_t i;
     virNodeDeviceObjPtr device;
 
     if ((device = virNodeDeviceFindByName(devs, def->name))) {
@@ -330,6 +424,13 @@ virNodeDeviceObjPtr virNodeDeviceAssignDef(virNodeDeviceObjListPtr devs,
     }
     device->def = def;
 
+    /* Call any registered drivers that want to be notified of a new device */
+    for (i = 0; i < nodeDeviceCallbackDriver; i++) {
+        if (nodeDeviceDrvArray[i]->nodeDeviceAdd(def, false) < 0)
+            VIR_WARN("Failed to add device name '%s' parent '%s' for "
+                     "register callback %zu", def->name, def->parent, i);
+    }
+
     return device;
 }
 
@@ -337,13 +438,22 @@ virNodeDeviceObjPtr virNodeDeviceAssignDef(virNodeDeviceObjListPtr devs,
 void virNodeDeviceObjRemove(virNodeDeviceObjListPtr devs,
                             virNodeDeviceObjPtr *dev)
 {
-    size_t i;
+    size_t i, j;
 
     virNodeDeviceObjUnlock(*dev);
 
     for (i = 0; i < devs->count; i++) {
         virNodeDeviceObjLock(*dev);
         if (devs->objs[i] == *dev) {
+            /* Call any registered drivers that want to be notified of a
+             * removed device */
+            for (j = 0; j < nodeDeviceCallbackDriver; j++) {
+                if (nodeDeviceDrvArray[j]->nodeDeviceRemove((*dev)->def) < 0) {
+                    VIR_WARN("failed to remove name='%s' parent='%s' from "
+                             "callback driver, continuing",
+                             (*dev)->def->name, (*dev)->def->parent);
+                }
+            }
             virNodeDeviceObjUnlock(*dev);
             virNodeDeviceObjFree(devs->objs[i]);
             *dev = NULL;
diff --git a/src/conf/node_device_conf.h b/src/conf/node_device_conf.h
index 8213c27..b1d878a 100644
--- a/src/conf/node_device_conf.h
+++ b/src/conf/node_device_conf.h
@@ -234,6 +234,24 @@ struct _virNodeDeviceDef {
     virNodeDevCapsDefPtr caps; /* optional device capabilities */
 };
 
+/* Callback mechanism to add/remove node devices by name/parent
+ * to a target driver that cares to know */
+typedef int (*virNodeDeviceAdd)(virNodeDeviceDefPtr def, bool enumerate);
+typedef int (*virNodeDeviceRemove)(virNodeDeviceDefPtr def);
+
+typedef struct _virNodeDeviceCallbackDriver virNodeDeviceCallbackDriver;
+typedef virNodeDeviceCallbackDriver *virNodeDeviceCallbackDriverPtr;
+struct _virNodeDeviceCallbackDriver {
+    const char *name;
+    virNodeDeviceAdd nodeDeviceAdd;
+    virNodeDeviceRemove nodeDeviceRemove;
+};
+
+typedef int (*virNodedevEnumerateAddDevices)(virNodeDeviceAdd deviceAddCb);
+void virNodeDeviceConfEnumerateInit(virNodedevEnumerateAddDevices cb);
+
+virNodedevEnumerateAddDevices virNodeDeviceRegisterCallbackDriver(virNodeDeviceCallbackDriverPtr);
+void virNodeDeviceUnregisterCallbackDriver(virNodeDeviceCallbackDriverPtr);
 
 typedef struct _virNodeDeviceObj virNodeDeviceObj;
 typedef virNodeDeviceObj *virNodeDeviceObjPtr;
diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms
index e6ccd69..f426c76 100644
--- a/src/libvirt_private.syms
+++ b/src/libvirt_private.syms
@@ -695,6 +695,7 @@ virNodeDevCapsDefFree;
 virNodeDevCapTypeFromString;
 virNodeDevCapTypeToString;
 virNodeDeviceAssignDef;
+virNodeDeviceConfEnumerateInit;
 virNodeDeviceDefFormat;
 virNodeDeviceDefFree;
 virNodeDeviceDefParseFile;
@@ -711,6 +712,8 @@ virNodeDeviceObjListFree;
 virNodeDeviceObjLock;
 virNodeDeviceObjRemove;
 virNodeDeviceObjUnlock;
+virNodeDeviceRegisterCallbackDriver;
+virNodeDeviceUnregisterCallbackDriver;
 
 
 # conf/node_device_event.h
diff --git a/src/node_device/node_device_udev.c b/src/node_device/node_device_udev.c
index 1016075..52e5e97 100644
--- a/src/node_device/node_device_udev.c
+++ b/src/node_device/node_device_udev.c
@@ -1529,6 +1529,33 @@ static int udevPCITranslateInit(bool privileged ATTRIBUTE_UNUSED)
     return 0;
 }
 
+
+/*
+ * @deviceAddCb: Callback routine for adding a device
+ *
+ * Enumerate all known devices calling the add device callback function
+ *
+ * Returns 0 on success, -1 on failure
+ */
+static int
+udevEnumerateAddDevices(virNodeDeviceAdd deviceAddCb)
+{
+    size_t i;
+    int ret = 0;
+
+    nodeDeviceLock();
+    for (i = 0; i < driver->devs.count && ret >= 0; i++) {
+        virNodeDeviceObjPtr obj = driver->devs.objs[i];
+        virNodeDeviceObjLock(obj);
+        ret = deviceAddCb(obj->def, true);
+        virNodeDeviceObjUnlock(obj);
+    }
+    nodeDeviceUnlock();
+
+    return ret;
+}
+
+
 static int nodeStateInitialize(bool privileged,
                                virStateInhibitCallback callback ATTRIBUTE_UNUSED,
                                void *opaque ATTRIBUTE_UNUSED)
@@ -1606,6 +1633,8 @@ static int nodeStateInitialize(bool privileged,
     if (udevEnumerateDevices(udev) != 0)
         goto cleanup;
 
+    virNodeDeviceConfEnumerateInit(udevEnumerateAddDevices);
+
     ret = 0;
 
  cleanup:
-- 
2.9.3
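For illustration, a minimal sketch of the consumer pattern this patch expects, using the names introduced above; the "foo" driver and its handlers are hypothetical stand-ins, and the following qemu patch shows the real consumer:

    /* Hypothetical consumer of the callback mechanism; the handlers would
     * record or forget the device name in the driver's own data. */
    static int fooNodeDeviceAdd(virNodeDeviceDefPtr def, bool enumerate)
    {
        VIR_DEBUG("add %s (enumerate=%d)", def->name, enumerate);
        return 0;   /* record def->name in the driver's own table */
    }

    static int fooNodeDeviceRemove(virNodeDeviceDefPtr def)
    {
        VIR_DEBUG("remove %s", def->name);
        return 0;   /* forget def->name again */
    }

    static virNodeDeviceCallbackDriver fooCallbackDriver = {
        .name = "foo",
        .nodeDeviceAdd = fooNodeDeviceAdd,
        .nodeDeviceRemove = fooNodeDeviceRemove,
    };

    static int fooStateInitialize(void)
    {
        virNodedevEnumerateAddDevices enumCb;

        /* Register first so no add/remove event can race the enumeration,
         * then replay the devices that already exist. */
        if (!(enumCb = virNodeDeviceRegisterCallbackDriver(&fooCallbackDriver)))
            return -1;
        return enumCb(fooNodeDeviceAdd);
    }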

Using the nodedev driver callback mechanism, allow qemu to be notified
whenever a nodedev is added or removed. Keep track of the node devices
in a hash table rather than requiring a connection in order to get
specific information about a node device that qemu would eventually
like to keep track of.

Signed-off-by: John Ferlan <jferlan@redhat.com>
---
 src/qemu/qemu_conf.c   | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 src/qemu/qemu_conf.h   | 14 ++++++++++++++
 src/qemu/qemu_driver.c | 41 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 103 insertions(+)

diff --git a/src/qemu/qemu_conf.c b/src/qemu/qemu_conf.c
index b5b0645..476179f 100644
--- a/src/qemu/qemu_conf.c
+++ b/src/qemu/qemu_conf.c
@@ -1619,3 +1619,51 @@ qemuGetDomainHupageMemPath(const virDomainDef *def,
 
     return 0;
 }
+
+
+/**
+ * @driver: qemu driver pointer
+ * @def: Node device definition
+ * @enumerate: boolean for enumeration indication
+ *
+ * Taking the @def from the node device driver, add the device to
+ * the qemu node device hash table. The @enumerate is true when
+ * this callback is called from the node device enumeration and false
+ * when called when the @def was added to the node device.
+ *
+ * Returns -1 on failure, 0 on adding device name, 1 on already added name
+ */
+int
+qemuNodeDeviceEntryAdd(virQEMUDriverPtr driver,
+                       virNodeDeviceDefPtr def,
+                       bool enumerate)
+{
+    VIR_DEBUG("Attempt to add name='%s', parent='%s' enumerate=%d",
+              def->name, def->parent, enumerate);
+
+    if (!virHashLookup(driver->nodeDevices, def->name)) {
+        if (virHashAddEntry(driver->nodeDevices, def->name, NULL) < 0)
+            return -1;
+        return 0;
+    }
+    return 1;
+}
+
+
+/**
+ * @driver: qemu driver pointer
+ * @def: node device definition
+ *
+ * Remove the definition from qemu's node device hash table.
+ *
+ * Returns 0 on success, -1 on failure
+ */
+int
+qemuNodeDeviceEntryRemove(virQEMUDriverPtr driver,
+                          virNodeDeviceDefPtr def)
+{
+    VIR_DEBUG("Attempt to remove name='%s' parent='%s'",
+              def->name, def->parent);
+
+    return virHashRemoveEntry(driver->nodeDevices, def->name);
+}
diff --git a/src/qemu/qemu_conf.h b/src/qemu/qemu_conf.h
index e585f81..d47c9cc 100644
--- a/src/qemu/qemu_conf.h
+++ b/src/qemu/qemu_conf.h
@@ -32,6 +32,7 @@
 # include "network_conf.h"
 # include "domain_conf.h"
 # include "snapshot_conf.h"
+# include "node_device_conf.h"
 # include "domain_event.h"
 # include "virthread.h"
 # include "security/security_manager.h"
@@ -255,6 +256,9 @@ struct _virQEMUDriver {
     virHashTablePtr sharedDevices;
 
     /* Immutable pointer, self-locking APIs */
+    virHashTablePtr nodeDevices;
+
+    /* Immutable pointer, self-locking APIs */
     virPortAllocatorPtr remotePorts;
 
     /* Immutable pointer, self-locking APIs */
@@ -351,4 +355,14 @@ int qemuGetDomainHupageMemPath(const virDomainDef *def,
                                virQEMUDriverConfigPtr cfg,
                                unsigned long long pagesize,
                                char **memPath);
+
+void qemuNodeDeviceEntryFree(void *payload, const void *name);
+
+int qemuNodeDeviceEntryAdd(virQEMUDriverPtr driver,
+                           virNodeDeviceDefPtr def,
+                           bool enumerate);
+
+int qemuNodeDeviceEntryRemove(virQEMUDriverPtr driver,
+                              virNodeDeviceDefPtr def);
+
 #endif /* __QEMUD_CONF_H */
diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
index 6e1e3d4..546fea7 100644
--- a/src/qemu/qemu_driver.c
+++ b/src/qemu/qemu_driver.c
@@ -197,6 +197,28 @@ static virNWFilterCallbackDriver qemuCallbackDriver = {
 };
 
 
+static int
+qemuNodeDeviceAdd(virNodeDeviceDefPtr def,
+                  bool enumerate)
+{
+    return qemuNodeDeviceEntryAdd(qemu_driver, def, enumerate);
+}
+
+
+static int
+qemuNodeDeviceRemove(virNodeDeviceDefPtr def)
+{
+    return qemuNodeDeviceEntryRemove(qemu_driver, def);
+}
+
+
+static virNodeDeviceCallbackDriver qemuNodedevCallbackDriver = {
+    .name = QEMU_DRIVER_NAME,
+    .nodeDeviceAdd = qemuNodeDeviceAdd,
+    .nodeDeviceRemove = qemuNodeDeviceRemove,
+};
+
+
 struct qemuAutostartData {
     virQEMUDriverPtr driver;
     virConnectPtr conn;
@@ -638,6 +660,7 @@ qemuStateInitialize(bool privileged,
     char *driverConf = NULL;
     virConnectPtr conn = NULL;
     virQEMUDriverConfigPtr cfg;
+    virNodedevEnumerateAddDevices nodedevEnumCb;
     uid_t run_uid = -1;
     gid_t run_gid = -1;
     char *hugepagePath = NULL;
@@ -776,6 +799,23 @@ qemuStateInitialize(bool privileged,
     if (!(qemu_driver->sharedDevices = virHashCreate(30, qemuSharedDeviceEntryFree)))
         goto error;
 
+    /* Create a hash table to keep track of node devices by name */
+    if (!(qemu_driver->nodeDevices = virHashCreate(100, NULL)))
+        goto error;
+
+    /* Set up a callback mechanism with the node device conf code to get
+     * called whenever a node device is added or removed. */
+    if (!(nodedevEnumCb =
+          virNodeDeviceRegisterCallbackDriver(&qemuNodedevCallbackDriver)))
+        goto error;
+
+    /* Setting the add/remove callback first ensures that there is no
+     * window of opportunity for a device to be added after enumeration
+     * is complete, but before the callback is in place. So, set the
+     * callback first, then do the enumeration. */
+    if (nodedevEnumCb(qemuNodeDeviceAdd) < 0)
+        goto error;
+
     if (qemuMigrationErrorInit(qemu_driver) < 0)
         goto error;
@@ -1075,6 +1115,7 @@ qemuStateCleanup(void)
         return -1;
 
     virNWFilterUnRegisterCallbackDriver(&qemuCallbackDriver);
+    virNodeDeviceUnregisterCallbackDriver(&qemuNodedevCallbackDriver);
     virThreadPoolFree(qemu_driver->workerPool);
     virObjectUnref(qemu_driver->config);
     virObjectUnref(qemu_driver->hostdevMgr);
-- 
2.9.3

Add a 'nodedev_enumeration' qemu configuration variable to control
whether the node device enumeration and add/remove event notification
mechanism is enabled. This ensures that only those environments that
desire to utilize a domain vHBA incur the (slight) overhead of adding
the device hash table to qemu and of managing add/remove events for
specific scsi_host and scsi_target devices.

Signed-off-by: John Ferlan <jferlan@redhat.com>
---
 src/qemu/libvirtd_qemu.aug         |  1 +
 src/qemu/qemu.conf                 | 19 +++++++++++++++++++
 src/qemu/qemu_conf.c               |  5 +++++
 src/qemu/qemu_conf.h               |  3 +++
 src/qemu/qemu_driver.c             | 33 +++++++++++++++++++--------------
 src/qemu/test_libvirtd_qemu.aug.in |  1 +
 6 files changed, 48 insertions(+), 14 deletions(-)

diff --git a/src/qemu/libvirtd_qemu.aug b/src/qemu/libvirtd_qemu.aug
index 82bae9e..ef759ce 100644
--- a/src/qemu/libvirtd_qemu.aug
+++ b/src/qemu/libvirtd_qemu.aug
@@ -92,6 +92,7 @@ module Libvirtd_qemu =
    let device_entry = bool_entry "mac_filter"
                  | bool_entry "relaxed_acs_check"
                  | bool_entry "allow_disk_format_probing"
+                 | bool_entry "nodedev_enumeration"
                  | str_entry "lock_manager"
 
    let rpc_entry = int_entry "max_queued"
diff --git a/src/qemu/qemu.conf b/src/qemu/qemu.conf
index 9f990c2..712473e 100644
--- a/src/qemu/qemu.conf
+++ b/src/qemu/qemu.conf
@@ -531,6 +531,25 @@
 #allow_disk_format_probing = 1
 
 
+# In order for a properly configured qemu domain to be able to passthrough
+# NPIV vHBA LUNs to the guest, qemu will need to enumerate all the node
+# devices and then handle the add and remove events. In order to enable
+# qemu to handle this, set this parameter to 1 and restart libvirtd. This
+# enables communication between qemu and udev node device event management
+# to provide the functionality. When a domain is started or reconnected
+# with a vHBA controller configured in the domain XML or when a vHBA
+# controller is hot-plugged, the qemu driver will handle the "scsi_hostN"
+# device creation and removal events by hot-(un)plugging "scsi_targetB_T_L"
+# (Bus, Target, Lun) Direct-Access LUNs to/from the guest.
+#
+# This is not required for qemu domain environments that utilize the Storage
+# Pool NPIV vHBAs to define LUNs in the domain XML.
+#
+# Defaults to 0.
+#
+#nodedev_enumeration = 1
+
+
 # In order to prevent accidentally starting two domains that
 # share one writable disk, libvirt offers two approaches for
 # locking files. The first one is sanlock, the other one,
diff --git a/src/qemu/qemu_conf.c b/src/qemu/qemu_conf.c
index 476179f..5f480d5 100644
--- a/src/qemu/qemu_conf.c
+++ b/src/qemu/qemu_conf.c
@@ -349,6 +349,8 @@ virQEMUDriverConfigPtr virQEMUDriverConfigNew(bool privileged)
         goto error;
 #endif
 
+    cfg->nodeDeviceEnumeration = false;
+
     return cfg;
 
  error:
@@ -712,6 +714,9 @@ int virQEMUDriverConfigLoadFile(virQEMUDriverConfigPtr cfg,
         goto cleanup;
     if (virConfGetValueBool(conf, "allow_disk_format_probing", &cfg->allowDiskFormatProbing) < 0)
         goto cleanup;
+    if (virConfGetValueBool(conf, "nodedev_enumeration",
+                            &cfg->nodeDeviceEnumeration) < 0)
+        goto cleanup;
     if (virConfGetValueBool(conf, "set_process_name", &cfg->setProcessName) < 0)
         goto cleanup;
     if (virConfGetValueUInt(conf, "max_processes", &cfg->maxProcesses) < 0)
diff --git a/src/qemu/qemu_conf.h b/src/qemu/qemu_conf.h
index d47c9cc..10ccb54 100644
--- a/src/qemu/qemu_conf.h
+++ b/src/qemu/qemu_conf.h
@@ -198,6 +198,9 @@ struct _virQEMUDriverConfig {
     unsigned int glusterDebugLevel;
 
     char *memoryBackingDir;
+
+    /* Node device enumeration enabled */
+    bool nodeDeviceEnumeration;
 };
 
 /* Main driver state */
diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
index 546fea7..db0cf87 100644
--- a/src/qemu/qemu_driver.c
+++ b/src/qemu/qemu_driver.c
@@ -799,22 +799,27 @@ qemuStateInitialize(bool privileged,
     if (!(qemu_driver->sharedDevices = virHashCreate(30, qemuSharedDeviceEntryFree)))
         goto error;
 
-    /* Create a hash table to keep track of node devices by name */
-    if (!(qemu_driver->nodeDevices = virHashCreate(100, NULL)))
-        goto error;
+    /* If node device enumeration is enabled, create a hash table to
+     * keep track of node devices by name and set up a callback mechanism
+     * with the node device conf code to get called whenever a node device
+     * is added or removed. */
+    if (cfg->nodeDeviceEnumeration) {
+        if (!(qemu_driver->nodeDevices = virHashCreate(100, NULL)))
+            goto error;
 
-    /* Set up a callback mechanism with the node device conf code to get
-     * called whenever a node device is added or removed. */
-    if (!(nodedevEnumCb =
-          virNodeDeviceRegisterCallbackDriver(&qemuNodedevCallbackDriver)))
-        goto error;
+        /* Set up a callback mechanism with the node device conf code to get
+         * called whenever a node device is added or removed. */
+        if (!(nodedevEnumCb =
+              virNodeDeviceRegisterCallbackDriver(&qemuNodedevCallbackDriver)))
+            goto error;
 
-    /* Setting the add/remove callback first ensures that there is no
-     * window of opportunity for a device to be added after enumeration
-     * is complete, but before the callback is in place. So, set the
-     * callback first, then do the enumeration. */
-    if (nodedevEnumCb(qemuNodeDeviceAdd) < 0)
-        goto error;
+        /* Setting the add/remove callback first ensures that there is no
+         * window of opportunity for a device to be added after enumeration
+         * is complete, but before the callback is in place. So, set the
+         * callback first, then do the enumeration. */
+        if (nodedevEnumCb(qemuNodeDeviceAdd) < 0)
+            goto error;
+    }
 
     if (qemuMigrationErrorInit(qemu_driver) < 0)
         goto error;
diff --git a/src/qemu/test_libvirtd_qemu.aug.in b/src/qemu/test_libvirtd_qemu.aug.in
index 6f03898..9b29fff 100644
--- a/src/qemu/test_libvirtd_qemu.aug.in
+++ b/src/qemu/test_libvirtd_qemu.aug.in
@@ -73,6 +73,7 @@ module Test_libvirtd_qemu =
 { "mac_filter" = "1" }
 { "relaxed_acs_check" = "1" }
 { "allow_disk_format_probing" = "1" }
+{ "nodedev_enumeration" = "1" }
 { "lock_manager" = "lockd" }
 { "max_queued" = "0" }
 { "keepalive_interval" = "5" }
-- 
2.9.3
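As a usage sketch, assuming the option name added by this patch, an admin who wants qemu to track node devices would set the following and restart libvirtd:

    # /etc/libvirt/qemu.conf (excerpt)
    nodedev_enumeration = 1

With the default of 0, the qemu driver skips creating the hash table and never registers the nodedev callbacks.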

On Tue, Feb 21, 2017 at 11:33:25AM -0500, John Ferlan wrote:
Repost: http://www.redhat.com/archives/libvir-list/2017-February/msg00501.html
to update to top of branch as of commit id '5ad03b9db2'
BTW, could you include the full cover letter in each new version rather than making people follow links all the way back to v1 to find info about the patch series goals.

IIUC, the intention here is that we automatically create NPIV devices when starting guests and delete them when stopping guests. I can see some appeal in this, but at the same time I'm not convinced we should add such a feature.

AFAICT, the node device APIs already allow a management application to achieve the same end goal without needing this integration. Yes, it would simplify usage of NPIV on the surface, but the cost of doing this is that it embeds a specific usage policy for NPIV in the libvirt code and makes error handling harder. In particular it is possible to get into a situation where a VM fails to start and we're also unable to clear up the NPIV device we just auto-created. Now this could be said to apply to pretty much everything we do during guest startup, but in most cases the failure is harmless or gets auto-cleaned up by the kernel (i.e. the tap devices get auto-deleted when the FD is closed, SELinux labels get reset next time a VM wants that file, locks are released when we close the virtlockd file handle, etc.). NPIV is significantly more complicated and more likely to hit failure scenarios due to the fact that it involves interactions with off-node hardware resources.

Is there some aspect of NPIV mgmt that can only be achieved if libvirt is explicitly managing the device lifecycle during VM start/stop, as opposed to having the mgmt app manage it?

If OpenStack were to provide NPIV support I think it'd probably end up dealing with device setup explicitly via the node device APIs rather than relying on libvirt to create/delete them. That way it can track the lifecycle of NPIV devices explicitly, and if it is not possible to delete them at the time of QEMU shutdown for some reason, it can easily arrange to delete them later.

Overall I think one of the more successful aspects of libvirt's design has been the way we minimise the addition of usage policy decisions, in favour of providing mechanisms that applications can use to implement policies. This has had a cost in that applications need to do more work themselves, but on balance I still think it is a win to avoid adding policy-driven features to libvirt.

A key question is just where "autocreation/deletion of NPIV devices" falls on the line between mechanism & policy, since the line is not entirely black & white. I tend towards it being policy though, since it is just providing a less general-purpose way to do something that can be achieved already via the node device APIs.

Regards, Daniel
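To make the "mgmt app uses the node device APIs" alternative concrete, a rough sketch follows; the parent scsi_host name and the WWNN/WWPN values are made-up examples and error handling is trimmed:

    #include <stdio.h>
    #include <libvirt/libvirt.h>

    /* Illustrative only: an application creates the NPIV vHBA itself before
     * starting the guest, and tears it down (or retries later) afterwards. */
    static virNodeDevicePtr
    createVHBA(virConnectPtr conn)
    {
        const char *xml =
            "<device>"
            "  <parent>scsi_host5</parent>"
            "  <capability type='scsi_host'>"
            "    <capability type='fc_host'>"
            "      <wwnn>20000000c9831b4b</wwnn>"
            "      <wwpn>10000000c9831b4b</wwpn>"
            "    </capability>"
            "  </capability>"
            "</device>";

        return virNodeDeviceCreateXML(conn, xml, 0);
    }

    static void
    destroyVHBA(virNodeDevicePtr vhba)
    {
        /* If this fails at guest shutdown, the app still knows the device
         * name and can arrange to retry the cleanup later. */
        if (virNodeDeviceDestroy(vhba) < 0)
            fprintf(stderr, "failed to destroy %s\n", virNodeDeviceGetName(vhba));
        virNodeDeviceFree(vhba);
    }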

On 02/21/2017 12:03 PM, Daniel P. Berrange wrote:
On Tue, Feb 21, 2017 at 11:33:25AM -0500, John Ferlan wrote:
Repost: http://www.redhat.com/archives/libvir-list/2017-February/msg00501.html
to update to top of branch as of commit id '5ad03b9db2'
BTW, could you include the full cover letter in each new version rather than making people follow links all the way back to v1 to find info about the patch series goals.
OK - I'll try to remember.
IIUC, the intention here is that we automatically create NPIV devices when starting guests and delete them when stopping guests. I can see some appeal in this, but at the same time I'm not convinced we should add such a feature.
A bit more than that - create the vHBA and assign the LUNs to the guest as they are discovered and remove them as they are removed (events from udev). This is a mechanism/idea from Paolo. The RHV team would be the primary consumer and IIRC they don't use storage pools.
AFAICT, the node device APIs already allow a management application to achieve the same end goal without needing this integration. Yes, it would simplify usage of NPIV on the surface, but the cost of doing this is that it embeds a specific usage policy for NPIV in the libvirt code and makes error handling harder. In particular it is possible to get into a situation where a VM fails to start and we're also unable to clear up the NPIV device we just auto-created. Now this could be said to apply to pretty much everything we do during guest startup, but in most cases the failure is harmless or gets auto-cleaned up by the kernel (i.e. the tap devices get auto-deleted when the FD is closed, SELinux labels get reset next time a VM wants that file, locks are released when we close the virtlockd file handle, etc.). NPIV is significantly more complicated and more likely to hit failure scenarios due to the fact that it involves interactions with off-node hardware resources.
I agree with your points. The "purpose" of libvirt taking care of it would be to let libvirt handle all those nasty and odd failure or integration issues - including migration. Of course from a libvirt perspective, I'd rather take the 'scsi_hostX' vHBA and just pass that through to QEMU directly to allow it (or the guest) to find the LUNs, but that's pushing the problem the other way.

I said early on that this is something that could be done by the upper layers that would be able to receive the add/remove LUN events whether they created a storage pool just for that purpose or they created the vHBA themselves. It's probably even in the bz's on this.
Is there some aspect of NPIV mgmt that can only be achieved if libvirt is explicitly managing the device lifecycle during VM start/stop, as opposed to having the mgmt app manage it ?
Not really, beyond the upper layers not needing to handle anything other than creating the vHBA for the domain and letting libvirt handle the rest.
If OpenStack were to provide NPIV support I think it'd probably end up dealing with device setup explicitly via the node device APIs rather than relying on libvirt to create/delete them. That way it can track the lifecycle of NPIV devices explicitly, and if it is not possible to delete them at the time of QEMU shutdown for some reason, it can easily arrange to delete them later.
Overall I think one of the more successful aspects of libvirt's design has been the way we minimise the addition of usage policy decisions, in favour of providing mechanisms that applications can use to implement policies. This has had a cost in that applications need to do more work themselves, but on balance I still think it is a win to avoid adding policy-driven features to libvirt.
A key question is just where "autocreation/deletion of NPIV devices" falls on the line between mechanism & policy, since the line is not entirely black & white. I tend towards it being policy though, since it is just providing a less general-purpose way to do something that can be achieved already via the node device APIs.
Regards, Daniel
I understand - to a degree I guess I had assumed some of these types of discussions had taken place among those that wanted the feature added. One other good thing that's come out of these changes is a bit more testing for vHBA creation via nodedev/storage pool and quite a bit of code cleanup once/if most of the patches I posted earlier in the week are accepted.

John

(FWIW: I'll have limited access to email over the next couple of days...)

On Tue, Feb 21, 2017 at 10:45:05PM -0500, John Ferlan wrote:
On 02/21/2017 12:03 PM, Daniel P. Berrange wrote:
On Tue, Feb 21, 2017 at 11:33:25AM -0500, John Ferlan wrote:
Repost: http://www.redhat.com/archives/libvir-list/2017-February/msg00501.html
to update to top of branch as of commit id '5ad03b9db2'
BTW, could you include the full cover letter in each new version rather than making people follow links all the way back to v1 to find info about the patch series goals.
OK - I'll try to remember.
IIUC, the intention here is that we automatically create NPIV devices when starting guests and delete them when stopping guests. I can see some appeal in this, but at the same time I'm not convinced we should add such a feature.
A bit more than that - create the vHBA and assign the LUNs to the guest as they are discovered and remove them as they are removed (events from udev). This is a mechanism/idea from Paolo. The RHV team would be the primary consumer and IIRC they don't use storage pools.
AFAICT, the node device APIs already allow a management application to achieve the same end goal without needing this integration. Yes, it would simplify usage of NPIV on the surface, but the cost of doing this is that it embeds a specific usage policy for NPIV in the libvirt code and makes error handling harder. In particular it is possible to get into a situation where a VM fails to start and we're also unable to clear up the NPIV device we just auto-created. Now this could be said to apply to pretty much everything we do during guest startup, but in most cases the failure is harmless or gets auto-cleaned up by the kernel (i.e. the tap devices get auto-deleted when the FD is closed, SELinux labels get reset next time a VM wants that file, locks are released when we close the virtlockd file handle, etc.). NPIV is significantly more complicated and more likely to hit failure scenarios due to the fact that it involves interactions with off-node hardware resources.
I agree with your points. The "purpose" of libvirt taking care of it would be to let libvirt handle all those nasty and odd failure or integration issues - including migration. Of course from a libvirt perspective, I'd rather take the 'scsi_hostX' vHBA and just pass that through to QEMU directly to allow it (or the guest) to find the LUNs, but that's pushing the problem the other way.
I said early on that this is something that could be done by the upper layers that would be able to receive the add/remove LUN events whether they created a storage pool just for that purpose or they created the vHBA themselves. It's probably even in the bz's on this.
Yeah, I'm curious where the desire to assign individual LUNs comes from - that makes me even less comfortable with this idea. It means we are taking a single device from libvirt's POV - a HBA - and turning it into many devices from the guest's POV, which means for a single device we'd need to track multiple SCSI addresses surely, and we'd not know how many addresses we need to track up front either. I'd really strongly prefer we track individual devices given to the guest explicitly.

I'd also suggest just assigning the whole vHBA, but we can still make it possible for a mgmt app to dynamically add/remove LUNs from a vHBA. Our node device APIs can report events when devices appear/disappear so an app can create a vHBA and monitor for LUN changes & update the guest.

That said I'm curious how you'd handle removal, because surely by the time you see a LUN disappear at the vHBA, the guest is already broken as it's got an orphaned device. This seems like another reason to just assign the vHBA itself to let the guest deal with LUN removal more directly.

Regards, Daniel

On 22/02/2017 11:05, Daniel P. Berrange wrote:
I'd also suggest just assigning the whole vHBA
What do you mean by "assigning the whole vHBA"? There is no concept of a vHBA as e.g. a PCI virtual function.
, but we can still make it possible for a mgmt app to dynamically add/remove LUNs from a vHBA.
Our node device APIs can report events when devices appear/disappear so an app can create a vHBA and monitor for LUN changes & update the guest. That said I'm curious how you'd handle removal, because surely by the time you see a LUN disappear at the vHBA, the guest is already broken as it's got an orphaned device. This seems like another reason to just assign the vHBA itself to let the guest deal with LUN removal more directly.
This is the same on bare metal: the device is orphaned at the time you unplug it. If the guest is using it and doesn't have a fallback, bad things happen. If it is not using it, or if it's using multipath, things ought to work just fine. Paolo

On Wed, Feb 22, 2017 at 02:38:32PM +0100, Paolo Bonzini wrote:
On 22/02/2017 11:05, Daniel P. Berrange wrote:
I'd also suggest just assigning the whole vHBA
What do you mean by "assigning the whole vHBA"? There is no concept of a vHBA as e.g. a PCI virtual function.
I thought we had a generic SCSI passthrough capability that could attach to the SCSI vHBA in the host, or am I misremembering?

Regards, Daniel

On 22/02/2017 14:46, Daniel P. Berrange wrote:
On Wed, Feb 22, 2017 at 02:38:32PM +0100, Paolo Bonzini wrote:
On 22/02/2017 11:05, Daniel P. Berrange wrote:
I'd also suggest just assigning the whole vHBA
What do you mean by "assigning the whole vHBA"? There is no concept of a vHBA as e.g. a PCI virtual function.
I thought we had a generic SCSI passthrough capability that could attach to the SCSI vHBA in the host, or am I mis-remembering ?
No, the SCSI passthrough capability attaches to a single device.

With Fibre Channel, zoning allows you to configure the set of visible disks for each machine (or more accurately for each WWPN and/or WWNN) at the storage fabric level. So the idea is to just configure the WWPN/WWNN and let libvirt handle the discovery of disks underneath. The alternative is to do it in both oVirt and Nova; though, as mentioned elsewhere in the thread, libvirt could still provide a udev wrapper.

Paolo

On Wed, Feb 22, 2017 at 03:10:00PM +0100, Paolo Bonzini wrote:
On 22/02/2017 14:46, Daniel P. Berrange wrote:
On Wed, Feb 22, 2017 at 02:38:32PM +0100, Paolo Bonzini wrote:
On 22/02/2017 11:05, Daniel P. Berrange wrote:
I'd also suggest just assigning the whole vHBA
What do you mean by "assigning the whole vHBA"? There is no concept of a vHBA as e.g. a PCI virtual function.
I thought we had a generic SCSI passthrough capability that could attach to the SCSI vHBA in the host, or am I mis-remembering ?
No, the SCSI passthrough capability attaches to a single device.
With Fibre Channel, zoning allows you to configure the set of visible disks for each machine (or more accurately for each WWPN and/or WWNN) at the storage fabric level. So the idea is to just configure the WWPN/WWNN and let libvirt handle the discovery of disks underneath. The alternative is to do it in both oVirt and Nova; though, as mentioned elsewhere in the thread, libvirt could still provide a udev wrapper.
What do you mean by "provide a udev wrapper"?

Regards, Daniel

On 22/02/2017 15:11, Daniel P. Berrange wrote:
On Wed, Feb 22, 2017 at 03:10:00PM +0100, Paolo Bonzini wrote:
On 22/02/2017 14:46, Daniel P. Berrange wrote:
On Wed, Feb 22, 2017 at 02:38:32PM +0100, Paolo Bonzini wrote:
On 22/02/2017 11:05, Daniel P. Berrange wrote:
I'd also suggest just assigning the whole vHBA
What do you mean by "assigning the whole vHBA"? There is no concept of a vHBA as e.g. a PCI virtual function.
I thought we had a generic SCSI passthrough capability that could attach to the SCSI vHBA in the host, or am I mis-remembering ?
No, the SCSI passthrough capability attaches to a single device.
With Fibre Channel, zoning allows you to configure the set of visible disks for each machine (or more accurately for each WWPN and/or WWNN) at the storage fabric level. So the idea is to just configure the WWPN/WWNN and let libvirt handle the discovery of disks underneath. The alternative is to do it in both oVirt and Nova; though, as mentioned elsewhere in the thread, libvirt could still provide a udev wrapper.
What do you mean by "provide a udev wrapper" ?
Expose udev events in a way that is friendly to someone who is consuming the nodedev API. Thanks, Paolo

On Wed, Feb 22, 2017 at 03:13:05PM +0100, Paolo Bonzini wrote:
On 22/02/2017 15:11, Daniel P. Berrange wrote:
On Wed, Feb 22, 2017 at 03:10:00PM +0100, Paolo Bonzini wrote:
On 22/02/2017 14:46, Daniel P. Berrange wrote:
On Wed, Feb 22, 2017 at 02:38:32PM +0100, Paolo Bonzini wrote:
On 22/02/2017 11:05, Daniel P. Berrange wrote:
I'd also suggest just assigning the whole vHBA
What do you mean by "assigning the whole vHBA"? There is no concept of a vHBA as e.g. a PCI virtual function.
I thought we had a generic SCSI passthrough capability that could attach to the SCSI vHBA in the host, or am I mis-remembering ?
No, the SCSI passthrough capability attaches to a single device.
With Fibre Channel, zoning allows you to configure the set of visible disks for each machine (or more accurately for each WWPN and/or WWNN) at the storage fabric level. So the idea is to just configure the WWPN/WWNN and let libvirt handle the discovery of disks underneath. The alternative is to do it in both oVirt and Nova; though, as mentioned elsewhere in the thread, libvirt could still provide a udev wrapper.
What do you mean by "provide a udev wrapper" ?
Expose udev events in a way that is friendly to someone who is consuming the nodedev API.
We've got that support already. Apps can register to receive events whenever a virNodeDevice is added or removed from a host, as well as if its config changes.

Regards, Daniel
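For completeness, a rough sketch of that event registration from an application's point of view; the connection URI and the printf handling are just placeholders, and a real application would integrate this with its own event loop:

    #include <stdio.h>
    #include <libvirt/libvirt.h>

    /* Print node device lifecycle events as they arrive. */
    static void
    lifecycleCb(virConnectPtr conn, virNodeDevicePtr dev,
                int event, int detail, void *opaque)
    {
        (void) conn; (void) detail; (void) opaque;
        printf("nodedev %s %s\n", virNodeDeviceGetName(dev),
               event == VIR_NODE_DEVICE_EVENT_CREATED ? "created" : "deleted");
    }

    int main(void)
    {
        virConnectPtr conn;
        int cbid;

        if (virEventRegisterDefaultImpl() < 0)
            return 1;
        if (!(conn = virConnectOpen("qemu:///system")))
            return 1;

        cbid = virConnectNodeDeviceEventRegisterAny(conn, NULL,
                   VIR_NODE_DEVICE_EVENT_ID_LIFECYCLE,
                   VIR_NODE_DEVICE_EVENT_CALLBACK(lifecycleCb), NULL, NULL);
        if (cbid < 0)
            return 1;

        /* Dispatch events until the loop fails. */
        while (virEventRunDefaultImpl() == 0)
            ;

        virConnectNodeDeviceEventDeregisterAny(conn, cbid);
        virConnectClose(conn);
        return 0;
    }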