[libvirt] [PATCH 0/3] nodedev driver initialization adjustments

While running the Avocado test suite on my laptop/host (f26), I ran into a few instances where if the timing is just right the libvirtd restart will "hang" and ends up not being stoppable - leaving the libvirtd <defunct> eventually once a kill -9 is run. I tracked this into a pci_system_init() call that never returns. Since I was mucking in the code - I also figured we could move device enumeration into its own thread to "speed" startup. John Ferlan (3): nodedev: Move device enumumeration out of nodeStateInitialize nodedev: Move the udevPCITranslate{Init|Deinit} nodedev: Move the udevPCITranslateInit call src/conf/virnodedeviceobj.h | 1 + src/node_device/node_device_udev.c | 107 ++++++++++++++++++++++++------------- 2 files changed, 70 insertions(+), 38 deletions(-) -- 2.13.6

Let's move the udevEnumerateDevices into a thread to "speed up" the initialization process. If the enumeration fails we can set the Quit flag to ensure that udevEventHandleCallback will not run. Signed-off-by: John Ferlan <jferlan@redhat.com> --- src/node_device/node_device_udev.c | 27 +++++++++++++++++++++++++-- 1 file changed, 25 insertions(+), 2 deletions(-) diff --git a/src/node_device/node_device_udev.c b/src/node_device/node_device_udev.c index 1e1b71742..e0fca6159 100644 --- a/src/node_device/node_device_udev.c +++ b/src/node_device/node_device_udev.c @@ -1891,6 +1891,25 @@ udevSetupSystemDev(void) } +static void +nodeStateInitializeEnumerate(void *opaque) +{ + struct udev *udev = opaque; + udevEventDataPtr priv = driver->privateData; + + /* Populate with known devices */ + if (udevEnumerateDevices(udev) != 0) + goto error; + + return; + + error: + virObjectLock(priv); + priv->threadQuit = true; + virObjectUnlock(priv); +} + + static int udevPCITranslateInit(bool privileged ATTRIBUTE_UNUSED) { @@ -1922,6 +1941,7 @@ nodeStateInitialize(bool privileged, { udevEventDataPtr priv = NULL; struct udev *udev = NULL; + virThread enumThread; if (VIR_ALLOC(driver) < 0) return -1; @@ -2002,9 +2022,12 @@ nodeStateInitialize(bool privileged, if (udevSetupSystemDev() != 0) goto cleanup; - /* Populate with known devices */ - if (udevEnumerateDevices(udev) != 0) + if (virThreadCreate(&enumThread, false, nodeStateInitializeEnumerate, + udev) < 0) { + virReportSystemError(errno, "%s", + _("failed to create udev enumerate thread")); goto cleanup; + } return 0; -- 2.13.6

On Sat, Dec 09, 2017 at 12:29:12PM -0500, John Ferlan wrote:
Let's move the udevEnumerateDevices into a thread to "speed up" the initialization process. If the enumeration fails we can set the Quit flag to ensure that udevEventHandleCallback will not run.
Signed-off-by: John Ferlan <jferlan@redhat.com> --- src/node_device/node_device_udev.c | 27 +++++++++++++++++++++++++-- 1 file changed, 25 insertions(+), 2 deletions(-)
Speedup's okay (even though I don't expect anything significant), but if something goes wrong with udev, we start into a sort of crippled state where you're still able to start machines, but only if these don't make use of NPIV in a sense that a vHBA would need to be created, or managed direct-assigned PCI devices. Prior to this change if enumeration failed, the daemon terminated - I think this is an improvement. Reviewed-by: Erik Skultety <eskultet@redhat.com>

On 12/14/2017 05:42 AM, Erik Skultety wrote:
On Sat, Dec 09, 2017 at 12:29:12PM -0500, John Ferlan wrote:
Let's move the udevEnumerateDevices into a thread to "speed up" the initialization process. If the enumeration fails we can set the Quit flag to ensure that udevEventHandleCallback will not run.
Signed-off-by: John Ferlan <jferlan@redhat.com> --- src/node_device/node_device_udev.c | 27 +++++++++++++++++++++++++-- 1 file changed, 25 insertions(+), 2 deletions(-)
Speedup's okay (even though I don't expect anything significant), but if something goes wrong with udev, we start into a sort of crippled state where you're still able to start machines, but only if these don't make use of NPIV in a sense that a vHBA would need to be created, or managed direct-assigned PCI devices. Prior to this change if enumeration failed, the daemon terminated - I think this is an improvement.
Reviewed-by: Erik Skultety <eskultet@redhat.com>
So as a "next step" - do you think it'd be worthwhile to actually implement the nodeStateReload functionality? I think we have enough "capability" now between separate threads for Enumeration and HandleCallback. Might be a bit tricky, but also a way to 'restart' a stopped HandleCallback thread. I will hold off on pushing for a bit to ensure someone else doesn't have extreme agita over the enumerate thread though. Tks - John

On Thu, Dec 14, 2017 at 08:25:34AM -0500, John Ferlan wrote:
On 12/14/2017 05:42 AM, Erik Skultety wrote:
On Sat, Dec 09, 2017 at 12:29:12PM -0500, John Ferlan wrote:
Let's move the udevEnumerateDevices into a thread to "speed up" the initialization process. If the enumeration fails we can set the Quit flag to ensure that udevEventHandleCallback will not run.
Signed-off-by: John Ferlan <jferlan@redhat.com> --- src/node_device/node_device_udev.c | 27 +++++++++++++++++++++++++-- 1 file changed, 25 insertions(+), 2 deletions(-)
Speedup's okay (even though I don't expect anything significant), but if something goes wrong with udev, we start into a sort of crippled state where you're still able to start machines, but only if these don't make use of NPIV in a sense that a vHBA would need to be created, or managed direct-assigned PCI devices. Prior to this change if enumeration failed, the daemon terminated - I think this is an improvement.
Reviewed-by: Erik Skultety <eskultet@redhat.com>
So as a "next step" - do you think it'd be worthwhile to actually implement the nodeStateReload functionality? I think we have enough
I wasn't going to suggest that, since there are many other thing on the plate that need attention, but since you mentioned it, yes, I think it would be worthwhile to have that kind of functionality. On the other hand, since there's the resonating incentive on reworking our architecture, we also need to think with respect to that whether such a change is still going to make sense even after we make any architectural changes or many things are going to """solve""" by the nature of it. With that said and not giving not much thought to it, I don't think StateReload feature would be one of such things. Erik

Move the functions to the top - about to change where/when the Init helper gets called away from the main StateInitialize. Signed-off-by: John Ferlan <jferlan@redhat.com> --- src/node_device/node_device_udev.c | 74 +++++++++++++++++++------------------- 1 file changed, 37 insertions(+), 37 deletions(-) diff --git a/src/node_device/node_device_udev.c b/src/node_device/node_device_udev.c index e0fca6159..ca5b47767 100644 --- a/src/node_device/node_device_udev.c +++ b/src/node_device/node_device_udev.c @@ -126,6 +126,43 @@ udevEventDataNew(void) } +static int +udevPCITranslateInit(bool privileged ATTRIBUTE_UNUSED) +{ +#if defined __s390__ || defined __s390x_ + /* On s390(x) system there is no PCI bus. + * Therefore there is nothing to initialize here. */ +#else + int rc; + + if ((rc = pci_system_init()) != 0) { + /* Ignore failure as non-root; udev is not as helpful in that + * situation, but a non-privileged user won't benefit much + * from udev in the first place. */ + if (errno != ENOENT && (privileged || errno != EACCES)) { + virReportSystemError(rc, "%s", + _("Failed to initialize libpciaccess")); + return -1; + } + } +#endif + return 0; +} + + +static void +udevPCITranslateDeinit(void) +{ +#if defined __s390__ || defined __s390x_ + /* Nothing was initialized, nothing needs to be cleaned up */ +#else + /* pci_system_cleanup returns void */ + pci_system_cleanup(); +#endif + return; +} + + static bool udevHasDeviceProperty(struct udev_device *dev, const char *key) @@ -1627,19 +1664,6 @@ udevEnumerateDevices(struct udev *udev) } -static void -udevPCITranslateDeinit(void) -{ -#if defined __s390__ || defined __s390x_ - /* Nothing was initialized, nothing needs to be cleaned up */ -#else - /* pci_system_cleanup returns void */ - pci_system_cleanup(); -#endif - return; -} - - static int nodeStateCleanup(void) { @@ -1911,30 +1935,6 @@ nodeStateInitializeEnumerate(void *opaque) static int -udevPCITranslateInit(bool privileged ATTRIBUTE_UNUSED) -{ -#if defined __s390__ || defined __s390x_ - /* On s390(x) system there is no PCI bus. - * Therefore there is nothing to initialize here. */ -#else - int rc; - - if ((rc = pci_system_init()) != 0) { - /* Ignore failure as non-root; udev is not as helpful in that - * situation, but a non-privileged user won't benefit much - * from udev in the first place. */ - if (errno != ENOENT && (privileged || errno != EACCES)) { - virReportSystemError(rc, "%s", - _("Failed to initialize libpciaccess")); - return -1; - } - } -#endif - return 0; -} - - -static int nodeStateInitialize(bool privileged, virStateInhibitCallback callback ATTRIBUTE_UNUSED, void *opaque ATTRIBUTE_UNUSED) -- 2.13.6

If the timing is "just right", there is a possibility that the udev nodeStateInitialize conflicts with another systemd thread running an lspci command leaving both waiting for "something", but resulting in a hung libvirtd (and hung lspci thread) from which the only recovery is a reboot because killing either thread is impossible and results in a defunct libvirtd process if a SIGKILL is performed. In order to avoid this let's move where the PCI initialization is done to be where it's actually needed. Ensure we only perform the initialization once via a driver bool. Likewise, during cleanup ensure we only call udevPCITranslateDeinit once the initialization is successful. At least a failure for this driver won't hang out the rest of the the libvirt event loop. May not make certain things usable though. Still a libvirtd restart is far easier than a host reboot. Signed-off-by: John Ferlan <jferlan@redhat.com> --- src/conf/virnodedeviceobj.h | 1 + src/node_device/node_device_udev.c | 16 ++++++++++++---- 2 files changed, 13 insertions(+), 4 deletions(-) diff --git a/src/conf/virnodedeviceobj.h b/src/conf/virnodedeviceobj.h index 87f908369..b5f96f206 100644 --- a/src/conf/virnodedeviceobj.h +++ b/src/conf/virnodedeviceobj.h @@ -41,6 +41,7 @@ struct _virNodeDeviceDriverState { virNodeDeviceObjListPtr devs; /* currently-known devices */ void *privateData; /* driver-specific private data */ bool privileged; /* whether we run in privileged mode */ + bool initPCI; /* Set when PCI thread completed */ /* Immutable pointer, self-locking APIs */ virObjectEventStatePtr nodeDeviceEventState; diff --git a/src/node_device/node_device_udev.c b/src/node_device/node_device_udev.c index ca5b47767..9cbba8562 100644 --- a/src/node_device/node_device_udev.c +++ b/src/node_device/node_device_udev.c @@ -551,12 +551,21 @@ udevProcessPCI(struct udev_device *device, virPCIEDeviceInfoPtr pci_express = NULL; virPCIDevicePtr pciDev = NULL; int ret = -1; + int rc = 0; char *p; bool privileged; nodeDeviceLock(); privileged = driver->privileged; + + if (!driver->initPCI) { + rc = udevPCITranslateInit(driver->privileged); + driver->initPCI = true; + } + nodeDeviceUnlock(); + if (rc < 0) + goto cleanup; if (udevGetUintProperty(device, "PCI_CLASS", &pci_dev->class, 16) < 0) goto cleanup; @@ -1681,6 +1690,9 @@ nodeStateCleanup(void) virThreadJoin(&priv->th); } + if (driver->initPCI) + udevPCITranslateDeinit(); + virObjectUnref(priv); virObjectUnref(driver->nodeDeviceEventState); @@ -1688,7 +1700,6 @@ nodeStateCleanup(void) virMutexDestroy(&driver->lock); VIR_FREE(driver); - udevPCITranslateDeinit(); return 0; } @@ -1962,9 +1973,6 @@ nodeStateInitialize(bool privileged, driver->privateData = priv; driver->nodeDeviceEventState = virObjectEventStateNew(); - if (udevPCITranslateInit(privileged) < 0) - goto cleanup; - udev = udev_new(); if (!udev) { virReportError(VIR_ERR_INTERNAL_ERROR, "%s", -- 2.13.6

On Sat, Dec 09, 2017 at 12:29:14PM -0500, John Ferlan wrote:
If the timing is "just right", there is a possibility that the udev nodeStateInitialize conflicts with another systemd thread running an lspci command leaving both waiting for "something", but resulting in a hung libvirtd (and hung lspci thread) from which the only recovery is a reboot because killing either thread is impossible and results in a defunct libvirtd process if a SIGKILL is performed.
In order to avoid this let's move where the PCI initialization is done to be where it's actually needed. Ensure we only perform the initialization once via a driver bool. Likewise, during cleanup ensure we only call udevPCITranslateDeinit once the initialization is successful.
At least a failure for this driver won't hang out the rest of the the libvirt event loop. May not make certain things usable though. Still a libvirtd restart is far easier than a host reboot.
Is there a BZ for this or can you at least share what steps are necessary to have a chance of hitting this issue? I'm asking because it sounds like we should file a BZ against udev as well (possibly kernel) and a thorough investigation of where the deadlock happens is necessary because I don't see a any guarantee that just with a simple logic movement (and adding a trigger condition) we can make disappear a race outside of our scope for good. On the other hand, having to choose between a hung process requiring a host restart and a hung worker thread requiring a service restart, I'd obviously opt for the latter. So I'd say the next steps depend on how frequently and under what circumstances (specific host devices, kernel version, etc.) this happens, because to me it sounds odd how systemd and libpciaccess clash here. Erik

On 12/14/2017 08:19 AM, Erik Skultety wrote:
On Sat, Dec 09, 2017 at 12:29:14PM -0500, John Ferlan wrote:
If the timing is "just right", there is a possibility that the udev nodeStateInitialize conflicts with another systemd thread running an lspci command leaving both waiting for "something", but resulting in a hung libvirtd (and hung lspci thread) from which the only recovery is a reboot because killing either thread is impossible and results in a defunct libvirtd process if a SIGKILL is performed.
In order to avoid this let's move where the PCI initialization is done to be where it's actually needed. Ensure we only perform the initialization once via a driver bool. Likewise, during cleanup ensure we only call udevPCITranslateDeinit once the initialization is successful.
At least a failure for this driver won't hang out the rest of the the libvirt event loop. May not make certain things usable though. Still a libvirtd restart is far easier than a host reboot.
Is there a BZ for this or can you at least share what steps are necessary to have a chance of hitting this issue? I'm asking because it sounds like we should file a BZ against udev as well (possibly kernel) and a thorough investigation of where the deadlock happens is necessary because I don't see a any guarantee that just with a simple logic movement (and adding a trigger condition) we can make disappear a race outside of our scope for good. On the other hand, having to choose between a hung process requiring a host restart and a hung worker thread requiring a service restart, I'd obviously opt for the latter. So I'd say the next steps depend on how frequently and under what circumstances (specific host devices, kernel version, etc.) this happens, because to me it sounds odd how systemd and libpciaccess clash here.
Erik
w/r/t: reproducing Have you ever set up virt-test or avocado-vt? Have a bit of patience to retry the same test multiple times only to have it trigger once when the moon, stars, and sun align in the perfect order during the 14th hour of the 3rd day of the 11th month? ;-) Seriously though, it's not very reproducible - I tried multiple ways without libvirtd involved. However, when it does reproduce, then things are hosed w/r/t a defunct libvirtd from which only a reboot resolves. By moving to a separate thread at least I can restart libvirtd and have a somewhat crippled environment that doesn't include nodedev. w/r/t: bug report Well writing that report could be a challenge say nothing about the indeterminable wait for a resolution to said bug. At this point, I'm not sure if it's Fedora related, udev related, systemd related, or libvirtd related. I'd have to get back into that state without this patch in order to attempt to gather/recall more information that could be even useful for said bug report. I don't mind holding off on these last 2 patches, but by posting them I was kind of hoping there might be someone out there who saw the same thing and might be able to give me some ideas related to helping debug or resolve. John

On Thu, Dec 14, 2017 at 08:36:30AM -0500, John Ferlan wrote:
On 12/14/2017 08:19 AM, Erik Skultety wrote:
On Sat, Dec 09, 2017 at 12:29:14PM -0500, John Ferlan wrote:
If the timing is "just right", there is a possibility that the udev nodeStateInitialize conflicts with another systemd thread running an lspci command leaving both waiting for "something", but resulting in a hung libvirtd (and hung lspci thread) from which the only recovery is a reboot because killing either thread is impossible and results in a defunct libvirtd process if a SIGKILL is performed.
In order to avoid this let's move where the PCI initialization is done to be where it's actually needed. Ensure we only perform the initialization once via a driver bool. Likewise, during cleanup ensure we only call udevPCITranslateDeinit once the initialization is successful.
At least a failure for this driver won't hang out the rest of the the libvirt event loop. May not make certain things usable though. Still a libvirtd restart is far easier than a host reboot.
Is there a BZ for this or can you at least share what steps are necessary to have a chance of hitting this issue? I'm asking because it sounds like we should file a BZ against udev as well (possibly kernel) and a thorough investigation of where the deadlock happens is necessary because I don't see a any guarantee that just with a simple logic movement (and adding a trigger condition) we can make disappear a race outside of our scope for good. On the other hand, having to choose between a hung process requiring a host restart and a hung worker thread requiring a service restart, I'd obviously opt for the latter. So I'd say the next steps depend on how frequently and under what circumstances (specific host devices, kernel version, etc.) this happens, because to me it sounds odd how systemd and libpciaccess clash here.
Erik
w/r/t: reproducing
Have you ever set up virt-test or avocado-vt? Have a bit of patience to retry the same test multiple times only to have it trigger once when the moon, stars, and sun align in the perfect order during the 14th hour of the 3rd day of the 11th month? ;-)
Nope and to be honest you made a good job in discouraging me to even try doing that.
Seriously though, it's not very reproducible - I tried multiple ways without libvirtd involved. However, when it does reproduce, then things are hosed w/r/t a defunct libvirtd from which only a reboot resolves. By moving to a separate thread at least I can restart libvirtd and have a somewhat crippled environment that doesn't include nodedev.
w/r/t: bug report
Well writing that report could be a challenge say nothing about the indeterminable wait for a resolution to said bug. At this point, I'm not sure if it's Fedora related, udev related, systemd related, or libvirtd related. I'd have to get back into that state without this patch in order to attempt to gather/recall more information that could be even useful for said bug report.
Fair point.
I don't mind holding off on these last 2 patches, but by posting them I was kind of hoping there might be someone out there who saw the same thing and might be able to give me some ideas related to helping debug or resolve.
I see, well, provided the issue is such a toughie to hit as you describe, I think unless we have a solid reproducer (it's a shame a don't know how to use systemtap probes they're designed for such anomalies) and know exactly what's going on, we should defer merging patches 2,3/3.
John
participants (2)
-
Erik Skultety
-
John Ferlan