[libvirt] [RFC PATCHv2 0/2] Implement tiered driver loading

This patchset implements a tiered driver loading system. I split the hypervisor drivers out into their own tier, which is loaded after the other drivers. This has the net effect of ensuring that things like secrets, networks, etc., are initialized and auto-started before any hypervisors, such as QEMU, LXC, etc. This resolves the race condition currently present when starting libvirtd while domains are running, which happens when restarting libvirtd after having started at least one domain.

This patchset also works without my config driver patchset, which is about to be submitted. Without the config driver patchset, however, RBD storage pools using CephX authentication cannot be auto-started due to a circular dependency between the QEMU and storage drivers. This may also affect other storage backends, but I currently only have the capacity to test with RBD and file-backed storage pools. The reason this interferes with RBD storage pools is that the storage driver currently has a hard-coded connection to QEMU in order to look up secrets. After this patchset, the QEMU driver is not loaded until after the storage driver has completed its initialization and auto-start routines, which breaks that secret lookup. Any pool type that does not need data from outside its base storage pool definition should continue to auto-start and is no longer affected by the race condition. I have verified that file-backed storage pools auto-start fine after this patchset and no longer hit the race condition.

For anyone who is not familiar with the race condition I mention above, the basic description is that upon restarting libvirtd, any running QEMU domains using storage volumes are killed randomly due to their associated storage pool not yet being online. This is due to storage pool auto-start not having completed prior to QEMU initialization. In my prior testing, I found that this race condition affected at least one domain approximately 40% of the time. I sent this information to the mailing list back on 06DEC2013, if anyone is interested in going back and re-reading my description.

I would appreciate any comments and suggestions about this patchset. It works for me on 4 machines running three different Linux distributions (Arch Linux, Gentoo, and CentOS), so I would imagine that it should work most anywhere.

Adam Walters (2):
  driver: Implement new state driver field
  libvirt: Implement tiered driver loading

 src/check-driverimpls.pl                |  1 +
 src/driver.h                            |  7 +++++
 src/interface/interface_backend_netcf.c |  1 +
 src/libvirt.c                           | 45 ++++++++++++++++++++-------------
 src/libxl/libxl_driver.c                |  1 +
 src/lxc/lxc_driver.c                    |  1 +
 src/network/bridge_driver.c             |  1 +
 src/node_device/node_device_hal.c       |  1 +
 src/node_device/node_device_udev.c      |  1 +
 src/nwfilter/nwfilter_driver.c          |  1 +
 src/qemu/qemu_driver.c                  |  1 +
 src/remote/remote_driver.c              |  1 +
 src/secret/secret_driver.c              |  1 +
 src/storage/storage_driver.c            |  1 +
 src/uml/uml_driver.c                    |  1 +
 src/xen/xen_driver.c                    |  1 +
 16 files changed, 48 insertions(+), 18 deletions(-)

-- 
1.8.5.2
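For context, the secret lookup that creates the circular dependency looks roughly like the snippet below. This is only an illustration using the public libvirt API, not code from the storage driver or from this patchset; the connection URI and the usage ID are hypothetical examples.

/* Illustration only: the kind of lookup an RBD backend needs when starting
 * a CephX-authenticated pool.  Because the connection is hard-coded to a
 * QEMU URI, the pool cannot be auto-started until the QEMU driver is up,
 * which is the circular dependency described above. */
#include <stdlib.h>
#include <libvirt/libvirt.h>

static unsigned char *
lookup_ceph_key(const char *usage, size_t *keylen)
{
    virConnectPtr conn = virConnectOpen("qemu:///system"); /* hard-coded hypervisor URI */
    virSecretPtr secret;
    unsigned char *key = NULL;

    if (!conn)
        return NULL;

    secret = virSecretLookupByUsage(conn, VIR_SECRET_USAGE_TYPE_CEPH, usage);
    if (secret) {
        key = virSecretGetValue(secret, keylen, 0); /* caller frees the key */
        virSecretFree(secret);
    }
    virConnectClose(conn);
    return key;
}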

This implements a new field in the virStateDriver struct. In order to prevent possible compilation issues, this patch also sets the new field in all of the existing drivers. Additionally, a change was needed to check-driverimpls.pl to prevent a 'make check' failure. Other than those two items, all remaining changes are a single-line addition to each driver source file to define the new field.

Signed-off-by: Adam Walters <adam@pandorasboxen.com>
---
 src/check-driverimpls.pl                | 1 +
 src/driver.h                            | 7 +++++++
 src/interface/interface_backend_netcf.c | 1 +
 src/libxl/libxl_driver.c                | 1 +
 src/lxc/lxc_driver.c                    | 1 +
 src/network/bridge_driver.c             | 1 +
 src/node_device/node_device_hal.c       | 1 +
 src/node_device/node_device_udev.c      | 1 +
 src/nwfilter/nwfilter_driver.c          | 1 +
 src/qemu/qemu_driver.c                  | 1 +
 src/remote/remote_driver.c              | 1 +
 src/secret/secret_driver.c              | 1 +
 src/storage/storage_driver.c            | 1 +
 src/uml/uml_driver.c                    | 1 +
 src/xen/xen_driver.c                    | 1 +
 15 files changed, 21 insertions(+)

diff --git a/src/check-driverimpls.pl b/src/check-driverimpls.pl
index 17e2b48..f726403 100755
--- a/src/check-driverimpls.pl
+++ b/src/check-driverimpls.pl
@@ -37,6 +37,7 @@ while (<>) {
 
         next if $api eq "no";
         next if $api eq "name";
+        next if $api eq "stateDrvType";
 
         my $suffix = $impl;
         my $prefix = $impl;
diff --git a/src/driver.h b/src/driver.h
index 5f4cd8d..943e1b1 100644
--- a/src/driver.h
+++ b/src/driver.h
@@ -1834,6 +1834,12 @@ typedef int
 typedef int
 (*virDrvStateStop)(void);
 
+typedef enum {
+    VIR_DRV_STATE_DRV_LIBVIRT = 1,
+    VIR_DRV_STATE_DRV_HYPERVISOR = 2,
+    VIR_DRV_STATE_DRV_LAST,
+} virDrvStateDrvType;
+
 typedef struct _virStateDriver virStateDriver;
 typedef virStateDriver *virStateDriverPtr;
 
@@ -1844,6 +1850,7 @@ struct _virStateDriver {
     virDrvStateCleanup stateCleanup;
     virDrvStateReload stateReload;
     virDrvStateStop stateStop;
+    virDrvStateDrvType stateDrvType;
 };
 
 # endif
diff --git a/src/interface/interface_backend_netcf.c b/src/interface/interface_backend_netcf.c
index c525ca9..7c80fbe 100644
--- a/src/interface/interface_backend_netcf.c
+++ b/src/interface/interface_backend_netcf.c
@@ -1186,6 +1186,7 @@ static virStateDriver interfaceStateDriver = {
     .stateInitialize = netcfStateInitialize,
     .stateCleanup = netcfStateCleanup,
     .stateReload = netcfStateReload,
+    .stateDrvType = VIR_DRV_STATE_DRV_LIBVIRT,
 };
 
 int netcfIfaceRegister(void) {
diff --git a/src/libxl/libxl_driver.c b/src/libxl/libxl_driver.c
index fc0efa2..826fd81 100644
--- a/src/libxl/libxl_driver.c
+++ b/src/libxl/libxl_driver.c
@@ -4420,6 +4420,7 @@ static virStateDriver libxlStateDriver = {
     .stateAutoStart = libxlStateAutoStart,
     .stateCleanup = libxlStateCleanup,
     .stateReload = libxlStateReload,
+    .stateDrvType = VIR_DRV_STATE_DRV_HYPERVISOR,
 };
 
 
diff --git a/src/lxc/lxc_driver.c b/src/lxc/lxc_driver.c
index 982f3fc..6fb3cf8 100644
--- a/src/lxc/lxc_driver.c
+++ b/src/lxc/lxc_driver.c
@@ -5440,6 +5440,7 @@ static virStateDriver lxcStateDriver = {
     .stateAutoStart = lxcStateAutoStart,
     .stateCleanup = lxcStateCleanup,
     .stateReload = lxcStateReload,
+    .stateDrvType = VIR_DRV_STATE_DRV_HYPERVISOR,
 };
 
 int lxcRegister(void)
diff --git a/src/network/bridge_driver.c b/src/network/bridge_driver.c
index 0b43a67..e437fde 100644
--- a/src/network/bridge_driver.c
+++ b/src/network/bridge_driver.c
@@ -3134,6 +3134,7 @@ static virStateDriver networkStateDriver = {
     .stateAutoStart = networkStateAutoStart,
     .stateCleanup = networkStateCleanup,
     .stateReload = networkStateReload,
+    .stateDrvType = VIR_DRV_STATE_DRV_LIBVIRT,
 };
 
 int networkRegister(void) {
diff --git a/src/node_device/node_device_hal.c b/src/node_device/node_device_hal.c
index fafd520..15cdecb 100644
--- a/src/node_device/node_device_hal.c
+++ b/src/node_device/node_device_hal.c
@@ -827,6 +827,7 @@ static virStateDriver halStateDriver = {
     .stateInitialize = nodeStateInitialize, /* 0.5.0 */
     .stateCleanup = nodeStateCleanup, /* 0.5.0 */
     .stateReload = nodeStateReload, /* 0.5.0 */
+    .stateDrvType = VIR_DRV_STATE_DRV_LIBVIRT,
 };
 
 int
diff --git a/src/node_device/node_device_udev.c b/src/node_device/node_device_udev.c
index 5d49968..91f8e9d 100644
--- a/src/node_device/node_device_udev.c
+++ b/src/node_device/node_device_udev.c
@@ -1816,6 +1816,7 @@ static virStateDriver udevStateDriver = {
     .stateInitialize = nodeStateInitialize, /* 0.7.3 */
     .stateCleanup = nodeStateCleanup, /* 0.7.3 */
     .stateReload = nodeStateReload, /* 0.7.3 */
+    .stateDrvType = VIR_DRV_STATE_DRV_LIBVIRT,
 };
 
 int udevNodeRegister(void)
diff --git a/src/nwfilter/nwfilter_driver.c b/src/nwfilter/nwfilter_driver.c
index d21dd82..f29e723 100644
--- a/src/nwfilter/nwfilter_driver.c
+++ b/src/nwfilter/nwfilter_driver.c
@@ -702,6 +702,7 @@ static virStateDriver stateDriver = {
     .stateInitialize = nwfilterStateInitialize,
     .stateCleanup = nwfilterStateCleanup,
     .stateReload = nwfilterStateReload,
+    .stateDrvType = VIR_DRV_STATE_DRV_LIBVIRT,
 };
 
 
diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
index 6e21267..899ddd8 100644
--- a/src/qemu/qemu_driver.c
+++ b/src/qemu/qemu_driver.c
@@ -16621,6 +16621,7 @@ static virStateDriver qemuStateDriver = {
     .stateCleanup = qemuStateCleanup,
     .stateReload = qemuStateReload,
     .stateStop = qemuStateStop,
+    .stateDrvType = VIR_DRV_STATE_DRV_HYPERVISOR,
 };
 
 int qemuRegister(void) {
diff --git a/src/remote/remote_driver.c b/src/remote/remote_driver.c
index ca86e3c..c032529 100644
--- a/src/remote/remote_driver.c
+++ b/src/remote/remote_driver.c
@@ -7300,6 +7300,7 @@ static virNWFilterDriver nwfilter_driver = {
 static virStateDriver state_driver = {
     .name = "Remote",
     .stateInitialize = remoteStateInitialize,
+    .stateDrvType = VIR_DRV_STATE_DRV_LIBVIRT,
 };
 
 #endif
diff --git a/src/secret/secret_driver.c b/src/secret/secret_driver.c
index 9f7f946..f3aea8f 100644
--- a/src/secret/secret_driver.c
+++ b/src/secret/secret_driver.c
@@ -1189,6 +1189,7 @@ static virStateDriver stateDriver = {
     .stateInitialize = secretStateInitialize,
     .stateCleanup = secretStateCleanup,
     .stateReload = secretStateReload,
+    .stateDrvType = VIR_DRV_STATE_DRV_LIBVIRT,
 };
 
 int
diff --git a/src/storage/storage_driver.c b/src/storage/storage_driver.c
index c83aa8a..d753e34 100644
--- a/src/storage/storage_driver.c
+++ b/src/storage/storage_driver.c
@@ -2640,6 +2640,7 @@ static virStateDriver stateDriver = {
     .stateAutoStart = storageStateAutoStart,
     .stateCleanup = storageStateCleanup,
     .stateReload = storageStateReload,
+    .stateDrvType = VIR_DRV_STATE_DRV_LIBVIRT,
 };
 
 int storageRegister(void)
diff --git a/src/uml/uml_driver.c b/src/uml/uml_driver.c
index 89afefe..82d207a 100644
--- a/src/uml/uml_driver.c
+++ b/src/uml/uml_driver.c
@@ -2904,6 +2904,7 @@ static virStateDriver umlStateDriver = {
     .stateAutoStart = umlStateAutoStart,
     .stateCleanup = umlStateCleanup,
     .stateReload = umlStateReload,
+    .stateDrvType = VIR_DRV_STATE_DRV_HYPERVISOR,
 };
 
 int umlRegister(void) {
diff --git a/src/xen/xen_driver.c b/src/xen/xen_driver.c
index c45d980..3e5d860 100644
--- a/src/xen/xen_driver.c
+++ b/src/xen/xen_driver.c
@@ -276,6 +276,7 @@ static virStateDriver state_driver = {
     .name = "Xen",
     .stateInitialize = xenUnifiedStateInitialize,
     .stateCleanup = xenUnifiedStateCleanup,
+    .stateDrvType = VIR_DRV_STATE_DRV_HYPERVISOR,
 };
 
 /*----- Dispatch functions. -----*/
-- 
1.8.5.2
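As an illustration of how a driver would use the new field (not part of the patch; the "Foo" driver and its callbacks are hypothetical names, while the struct fields are the ones from driver.h after this change), a backend driver would register itself along these lines:

/* Hypothetical example driver declaring its tier via the new field. */
static int fooStateInitialize(bool privileged,
                              virStateInhibitCallback callback,
                              void *opaque)
{
    /* load configs, detect pre-existing state, ... */
    return 0;
}

static void fooStateAutoStart(void)
{
    /* start anything marked for auto-start */
}

static virStateDriver fooStateDriver = {
    .name = "Foo",
    .stateInitialize = fooStateInitialize,
    .stateAutoStart = fooStateAutoStart,
    .stateDrvType = VIR_DRV_STATE_DRV_LIBVIRT, /* loaded before the hypervisor tier */
};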

This utilizes the new state driver field to implement a tiered driver loading system. The two currently defined classes of driver are "libvirt" and "hypervisor" drivers. Hypervisor drivers are fairly self-explanatory: they provide domain services. Libvirt drivers are backend drivers, such as the secret and storage drivers. The way the tiered loading is implemented makes it simple to add additional tiers as needed. The tiers, defined in driver.h, are loaded in numeric order, starting with the lowest-numbered tier. Each tier's drivers are all initialized, then all auto-started, before moving on to the next tier. This allows some interdependency between drivers within a tier, similar to the current free-for-all loading, while ensuring that auto-start is complete for one tier before the next begins.

By loading drivers in this manner, there is little-to-no need to hard-code driver load ordering or create a fully dynamic dependency-based loading algorithm, and we still gain the benefits of a more orderly load order, which helps minimize the possibility of a race condition during startup. If a race condition is discovered in the future, the number of tiers can be changed to mitigate it without requiring too much effort.

Signed-off-by: Adam Walters <adam@pandorasboxen.com>
---
 src/libvirt.c | 45 +++++++++++++++++++++++++++------------------
 1 file changed, 27 insertions(+), 18 deletions(-)

diff --git a/src/libvirt.c b/src/libvirt.c
index c15e29a..d8d1723 100644
--- a/src/libvirt.c
+++ b/src/libvirt.c
@@ -729,31 +729,40 @@ virStateInitialize(bool privileged,
                    void *opaque)
 {
     size_t i;
+    size_t stateDriverTier;
 
     if (virInitialize() < 0)
         return -1;
 
-    for (i = 0; i < virStateDriverTabCount; i++) {
-        if (virStateDriverTab[i]->stateInitialize) {
-            VIR_DEBUG("Running global init for %s state driver",
-                      virStateDriverTab[i]->name);
-            if (virStateDriverTab[i]->stateInitialize(privileged,
-                                                      callback,
-                                                      opaque) < 0) {
-                virErrorPtr err = virGetLastError();
-                VIR_ERROR(_("Initialization of %s state driver failed: %s"),
-                          virStateDriverTab[i]->name,
-                          err && err->message ? err->message : _("Unknown problem"));
-                return -1;
+    for (stateDriverTier = 0; stateDriverTier < VIR_DRV_STATE_DRV_LAST; stateDriverTier++) {
+        for (i = 0; i < virStateDriverTabCount; i++) {
+            if (virStateDriverTab[i]->stateDrvType != stateDriverTier)
+                continue;
+
+            if (virStateDriverTab[i]->stateInitialize) {
+                VIR_DEBUG("Running global init for %s state driver",
+                          virStateDriverTab[i]->name);
+                if (virStateDriverTab[i]->stateInitialize(privileged,
+                                                          callback,
+                                                          opaque) < 0) {
+                    virErrorPtr err = virGetLastError();
+                    VIR_ERROR(_("Initialization of %s state driver failed: %s"),
+                              virStateDriverTab[i]->name,
+                              err && err->message ? err->message : _("Unknown problem"));
+                    return -1;
+                }
             }
         }
-    }
 
-    for (i = 0; i < virStateDriverTabCount; i++) {
-        if (virStateDriverTab[i]->stateAutoStart) {
-            VIR_DEBUG("Running global auto start for %s state driver",
-                      virStateDriverTab[i]->name);
-            virStateDriverTab[i]->stateAutoStart();
+        for (i = 0; i < virStateDriverTabCount; i++) {
+            if (virStateDriverTab[i]->stateDrvType != stateDriverTier)
+                continue;
+
+            if (virStateDriverTab[i]->stateAutoStart) {
+                VIR_DEBUG("Running global auto start for %s state driver",
+                          virStateDriverTab[i]->name);
+                virStateDriverTab[i]->stateAutoStart();
+            }
         }
     }
     return 0;
-- 
1.8.5.2
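As an aside on the "number of tiers can be changed" point above: adding a tier would, under this scheme, only mean extending the enum in driver.h and tagging the relevant drivers. A hypothetical example (VIR_DRV_STATE_DRV_STORAGE is not part of this patch):

/* Hypothetical extension of virDrvStateDrvType: a dedicated storage tier
 * loaded between the backend and hypervisor tiers.  The loops in
 * virStateInitialize() need no changes, since they iterate up to
 * VIR_DRV_STATE_DRV_LAST. */
typedef enum {
    VIR_DRV_STATE_DRV_LIBVIRT = 1,
    VIR_DRV_STATE_DRV_STORAGE = 2,     /* hypothetical new tier */
    VIR_DRV_STATE_DRV_HYPERVISOR = 3,
    VIR_DRV_STATE_DRV_LAST,
} virDrvStateDrvType;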

On Thu, Jan 23, 2014 at 03:06:16PM -0500, Adam Walters wrote:
This patchset implements a tiered driver loading system. I split the hypervisor drivers out into their own tier, which is loaded after the other drivers. This has the net effect of ensuring that things like secrets, networks, etc., are initialized and auto-started before any hypervisors, such as QEMU, LXC, etc. This resolves the race condition currently present when starting libvirtd while domains are running, which happens when restarting libvirtd after having started at least one domain.
[snip]
For anyone who is not familiar with the race condition I mention above, the basic description is that upon restarting libvirtd, any running QEMU domains using storage volumes are killed randomly due to their associated storage pool not yet being online. This is due to storage pool auto-start not having completed prior to QEMU initialization. In my prior testing, I found that this race condition affected at least one domain approximately 40% of the time. I sent this information to the mailing list back on 06DEC2013, if anyone is interested in going back and re-reading my description.
Ok, so the current sequence is /supposed/ to be

  1 StateInitialize
    1.1 StorageInitialize
      1.1.1 Load all configs
      1.1.2 Detect currently active pools
    1.2 QEMUInitialize
      1.2.1 Load all configs
      1.2.2 Detect currently active guests
  2 StateAutostart
    2.1 StorageAutostart
      2.1.1 Start any inactive pools marked autostart
    2.2 QEMUAutostart
      2.2.1 Start any inactive guests marked autostart

Looking at the storage driver code though, I see a flaw in that 1.1.2 does not actually exist. The checking for existing active storage pools is only done in step 2.1.1 as part of autostart, which as you say is clearly too late.

Also the checking for existing active storage pools looks like it only works for storage pools which actually have some persistent kernel / userspace state outside of libvirt. ie iSCSI pools will remain active even when libvirt is not running, since the iSCSI initiator is outside scope of libvirt. If we have a RBD pool, however, which is using librbd instead of the FUSE driver there's no persistent state to detect. The issue here is that the storage driver is not correctly keeping track of what storage pools were active prior to restart. ie any active storage pools before restart, should still be active after restart, *regardless* of the autostart flag.

So I understand why your proposed split of driver loading works to solve this, but even with this it only solves the startup ordering problem. There's still the issue that we're reliant on the autostart flag being set. If there's a guest with an RBD volume against a pool that isn't set to autostart, we're still doomed.

The storage drivers are just plain broken wrt libvirt restarts because they're not correctly recording their state and restoring back to the exact same state after restart.

Daniel
-- 
|: http://berrange.com       -o-  http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org        -o-  http://virt-manager.org :|
|: http://autobuild.org      -o-  http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o-  http://live.gnome.org/gtk-vnc :|

On Tue, Jan 28, 2014 at 8:05 AM, Daniel P. Berrange <berrange@redhat.com> wrote:
On Thu, Jan 23, 2014 at 03:06:16PM -0500, Adam Walters wrote:
This patchset implements a tiered driver loading system. I split the hypervisor drivers out into their own tier, which is loaded after the other drivers. This has the net effect of ensuring that things like secrets, networks, etc., are initialized and auto-started before any hypervisors, such as QEMU, LXC, etc. This resolves the race condition currently present when starting libvirtd while domains are running, which happens when restarting libvirtd after having started at least one domain.
[snip]
For anyone who is not familiar with the race condition I mention above, the basic description is that upon restarting libvirtd, any running QEMU domains using storage volumes are killed randomly due to their associated storage pool not yet being online. This is due to storage pool auto-start not having completed prior to QEMU initialization. In my prior testing, I found that this race condition affected at least one domain approximately 40% of the time. I sent this information to the mailing list back on 06DEC2013, if anyone is interested in going back and re-reading my description.
Ok, so the current sequence is /supposed/ to be
  1 StateInitialize
    1.1 StorageInitialize
      1.1.1 Load all configs
      1.1.2 Detect currently active pools
    1.2 QEMUInitialize
      1.2.1 Load all configs
      1.2.2 Detect currently active guests
  2 StateAutostart
    2.1 StorageAutostart
      2.1.1 Start any inactive pools marked autostart
    2.2 QEMUAutostart
      2.2.1 Start any inactive guests marked autostart
Looking at the storage driver code though, I see a flaw in that 1.1.2 does not actually exist. The checking for existing active storage pools is only done in step 2.1.1 as part of autostart, which as you say is clearly too late.
Also the checking for existing active storage pools looks like it only works for storage pools which actually have some persistent kernel / userspace state outside of libvirt. ie iSCSI pools will remain active even when libvirt is not running, since the iSCSI initiator is outside scope of libvirt. If we have a RBD pool, however, which is using librbd instead of the FUSE driver there's no persistent state to detect. The issue here is that the storage driver is not correctly keeping track of what storage pools were active prior to restart. ie any active storage pools before restart, should still be active after restart, *regardless* of the autostart flag.
So I understand why your proposed split of driver loading works to solve this, but even with this it only solves the startup ordering problem. There's still the issue that we're reliant on the autostart flag being set. If there's a guest with an RBD volume against a pool that isn't set to autostart, we're still doomed.
I hadn't really thought about that piece, as I always set my storage pools to auto-start. I'm still not sure a race condition wouldn't remain if the storage driver continues to use a hard-coded connection to 'qemu:///', though. I'm not familiar enough with the libvirt code to know exactly at which point the QEMU driver can accept a connection to its URI and provide access to information about networks, secrets, etc. Note that I was also able to reproduce the race condition with a file-backed storage pool, so even taking out the semi-exotic RBD piece, I think there would still be problems if 1.1.2 were added into the code.

I could certainly look into implementing something to store state information for storage pools; I'll likely look toward the existing domain code for an idea of how to implement it. That issue does bleed into my other patchset that implements the 'config:///' URI, which we had some discussion over prior to the fixup and resubmit.
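Something along these lines is roughly what I have in mind — purely a sketch; the state directory and helper names below are hypothetical and not taken from the existing storage driver code:

/* Hypothetical sketch: mark a pool active when it starts, and check that
 * mark during stateInitialize so active pools are restored regardless of
 * the autostart flag.  Paths and names are made up for illustration. */
#include <stdio.h>
#include <unistd.h>

#define POOL_STATE_DIR "/var/run/libvirt/storage"  /* hypothetical location */

static void
poolSaveRunningState(const char *poolName)
{
    char path[1024];
    FILE *fp;

    snprintf(path, sizeof(path), POOL_STATE_DIR "/%s.state", poolName);
    if ((fp = fopen(path, "w"))) {
        fprintf(fp, "<poolstate active='yes'/>\n");
        fclose(fp);
    }
}

static int
poolWasRunning(const char *poolName)
{
    char path[1024];

    snprintf(path, sizeof(path), POOL_STATE_DIR "/%s.state", poolName);
    return access(path, R_OK) == 0;  /* present => pool was active before restart */
}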
The storage drivers are just plain broken wrt libvirt restarts because they're not correctly recording their state and restoring back to the exact same state after restart.