[libvirt] [PATCH v2 0/5] Workaround mdev uevent race affecting node device driver

v1 here https://www.redhat.com/archives/libvir-list/2017-June/msg00826.html This series addresses [1]. It builds on top of [2], series which introduces a few small fixes and improvements. Even though that one hasn't gotten a review yet, you can fetch and build this from my github branch [3]. Changes since v1: - moved the device handling routine into a detached worker thread, so the event loop wouldn't block - adjusted virFileAccess according to the reviewer's comments Notes: Merging the nodedev obj's refactor storing the list of nodedev objects in a self-lockable object made it possible to drop the driver locks at certain spots either completely or at least rearrange them according to the needs. The worker thread therefore doesn't need to hold the driver lock while waiting on the device's attributes to appear (which would be wrong anyway, since it would still block the event loop) and only acquires the driver lock when doing sanity checks on the udev monitor. [1] https://bugzilla.redhat.com/show_bug.cgi?id=1463285 [2] https://www.redhat.com/archives/libvir-list/2017-July/msg01077.html [3] https://github.com/eskultety/libvirt/commits/mdev-nodedev-next Erik Skultety (5): nodedev: Introduce udevCheckMonitorFD helper function udev: Split udevEventHandleCallback in two functions udev: Convert udevEventHandleThread to an actual thread routine util: Introduce virFileWaitForAccess nodedev: Work around the uevent race by hooking up virFileWaitForAccess src/libvirt_private.syms | 1 + src/node_device/node_device_udev.c | 194 +++++++++++++++++++++++++++++++------ src/util/virfile.c | 29 ++++++ src/util/virfile.h | 2 + 4 files changed, 194 insertions(+), 32 deletions(-) -- 2.13.3

We need to perform some sanity checks on the udev monitor before every use so that we know nothing changed in the meantime. The reason for moving the code to a separate function is to be able to perform the same check from a worker thread that will replace the udevEventHandleCallback in terms of poking udev for device events. Signed-off-by: Erik Skultety <eskultet@redhat.com> --- src/node_device/node_device_udev.c | 56 +++++++++++++++++++++++--------------- 1 file changed, 34 insertions(+), 22 deletions(-) diff --git a/src/node_device/node_device_udev.c b/src/node_device/node_device_udev.c index dc9ed9e25..0b4e22f29 100644 --- a/src/node_device/node_device_udev.c +++ b/src/node_device/node_device_udev.c @@ -1616,42 +1616,54 @@ udevHandleOneDevice(struct udev_device *device) } -static void -udevEventHandleCallback(int watch ATTRIBUTE_UNUSED, - int fd, - int events ATTRIBUTE_UNUSED, - void *data ATTRIBUTE_UNUSED) +/* the caller must be holding the driver lock prior to calling this function */ +static bool +udevCheckMonitorFD(struct udev_monitor *udev_monitor, int fd) { - struct udev_device *device = NULL; - struct udev_monitor *udev_monitor = NULL; - int udev_fd = -1; + int real_fd = -1; - nodeDeviceLock(); - if (!(udev_monitor = DRV_STATE_UDEV_MONITOR(driver))) - goto error; + /* sanity check that the monitor socket hasn't changed */ + real_fd = udev_monitor_get_fd(udev_monitor); - udev_fd = udev_monitor_get_fd(udev_monitor); - if (fd != udev_fd) { + if (real_fd != fd) { udevPrivate *priv = driver->privateData; virReportError(VIR_ERR_INTERNAL_ERROR, _("File descriptor returned by udev %d does not " "match node device file descriptor %d"), - fd, udev_fd); + real_fd, fd); - /* this is a non-recoverable error, let's remove the handle, so that we - * don't get in here again because of some spurious behaviour and report - * the same error multiple times + /* this is a non-recoverable error, thus the event handle has to be + * removed, so that we don't get into the handler again because of some + * spurious behaviour */ virEventRemoveHandle(priv->watch); - goto error; + return false; } - device = udev_monitor_receive_device(udev_monitor); - if (device == NULL) { + return true; +} + + +static void +udevEventHandleCallback(int watch ATTRIBUTE_UNUSED, + int fd, + int events ATTRIBUTE_UNUSED, + void *data ATTRIBUTE_UNUSED) +{ + struct udev_device *device = NULL; + struct udev_monitor *udev_monitor = NULL; + + nodeDeviceLock(); + udev_monitor = DRV_STATE_UDEV_MONITOR(driver); + + if (!udevCheckMonitorFD(udev_monitor, fd)) + goto unlock; + + if (!(device = udev_monitor_receive_device(udev_monitor))) { virReportError(VIR_ERR_INTERNAL_ERROR, "%s", _("udev_monitor_receive_device returned NULL")); - goto error; + goto unlock; } nodeDeviceUnlock(); @@ -1661,7 +1673,7 @@ udevEventHandleCallback(int watch ATTRIBUTE_UNUSED, udev_device_unref(device); return; - error: + unlock: nodeDeviceUnlock(); goto cleanup; } -- 2.13.3

This patch splits udevEventHandleCallback in two (introduces udevEventHandleThread) in order to be later able to refactor the latter to actually become a detached thread which will wait some time for the kernel to create the whole sysfs tree for a device as we cannot do that in the event loop directly. Signed-off-by: Erik Skultety <eskultet@redhat.com> --- src/node_device/node_device_udev.c | 53 ++++++++++++++++++++++++++------------ 1 file changed, 36 insertions(+), 17 deletions(-) diff --git a/src/node_device/node_device_udev.c b/src/node_device/node_device_udev.c index 0b4e22f29..9e3a0f9d7 100644 --- a/src/node_device/node_device_udev.c +++ b/src/node_device/node_device_udev.c @@ -1646,36 +1646,55 @@ udevCheckMonitorFD(struct udev_monitor *udev_monitor, int fd) static void +udevEventHandleThread(void *opaque ATTRIBUTE_UNUSED) +{ + struct udev_device *device = NULL; + struct udev_monitor *udev_monitor = NULL; + int fd = (intptr_t) opaque; + + nodeDeviceLock(); + udev_monitor = DRV_STATE_UDEV_MONITOR(driver); + + if (!udevCheckMonitorFD(udev_monitor, fd)) + goto unlock; + + device = udev_monitor_receive_device(udev_monitor); + if (device == NULL) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("udev_monitor_receive_device returned NULL")); + goto unlock; + } + nodeDeviceUnlock(); + udevHandleOneDevice(device); + + cleanup: + udev_device_unref(device); + return; + + unlock: + nodeDeviceUnlock(); + goto cleanup; +} + + +static void udevEventHandleCallback(int watch ATTRIBUTE_UNUSED, int fd, int events ATTRIBUTE_UNUSED, void *data ATTRIBUTE_UNUSED) { - struct udev_device *device = NULL; struct udev_monitor *udev_monitor = NULL; nodeDeviceLock(); udev_monitor = DRV_STATE_UDEV_MONITOR(driver); - if (!udevCheckMonitorFD(udev_monitor, fd)) - goto unlock; - - if (!(device = udev_monitor_receive_device(udev_monitor))) { - virReportError(VIR_ERR_INTERNAL_ERROR, "%s", - _("udev_monitor_receive_device returned NULL")); - goto unlock; + if (!udevCheckMonitorFD(udev_monitor, fd)) { + nodeDeviceUnlock(); + return; } - nodeDeviceUnlock(); - udevHandleOneDevice(device); - - cleanup: - udev_device_unref(device); - return; - unlock: - nodeDeviceUnlock(); - goto cleanup; + udevEventHandleThread((void *)(intptr_t) fd); } -- 2.13.3

Adjust udevEventHandleThread to be a proper thread routine running in an infinite loop handling devices. Also introduce udevEventThreadData private structure. Every time there's and incoming event from udev, udevEventHandleCallback only increments the number of events queuing on the monitor and signals the worker thread to handle them. Signed-off-by: Erik Skultety <eskultet@redhat.com> --- src/node_device/node_device_udev.c | 131 ++++++++++++++++++++++++++++++------- 1 file changed, 108 insertions(+), 23 deletions(-) diff --git a/src/node_device/node_device_udev.c b/src/node_device/node_device_udev.c index 9e3a0f9d7..e05819fe2 100644 --- a/src/node_device/node_device_udev.c +++ b/src/node_device/node_device_udev.c @@ -59,6 +59,54 @@ struct _udevPrivate { bool privileged; }; +typedef struct _udevEventThreadData udevEventThreadData; +typedef udevEventThreadData *udevEventThreadDataPtr; + +struct _udevEventThreadData { + virMutex lock; + virCond threadCond; + + bool threadQuit; + int monitor_fd; + unsigned long long nevents; /* number of udev events queuing on monitor */ +}; + + +static udevEventThreadDataPtr +udevEventThreadDataNew(int fd) +{ + udevEventThreadDataPtr ret = NULL; + + if (VIR_ALLOC_QUIET(ret) < 0) + return NULL; + + if (virMutexInit(&ret->lock) < 0) + goto cleanup; + + if (virCondInit(&ret->threadCond) < 0) + goto cleanup_mutex; + + ret->monitor_fd = fd; + + return ret; + + cleanup_mutex: + virMutexDestroy(&ret->lock); + + cleanup: + VIR_FREE(ret); + return NULL; +} + + +static void +udevEventThreadDataFree(udevEventThreadDataPtr data) +{ + virMutexDestroy(&data->lock); + virCondDestroy(&data->threadCond); + VIR_FREE(data); +} + static bool udevHasDeviceProperty(struct udev_device *dev, @@ -1646,34 +1694,53 @@ udevCheckMonitorFD(struct udev_monitor *udev_monitor, int fd) static void -udevEventHandleThread(void *opaque ATTRIBUTE_UNUSED) +udevEventHandleThread(void *opaque) { + udevEventThreadDataPtr privateData = opaque; struct udev_device *device = NULL; struct udev_monitor *udev_monitor = NULL; - int fd = (intptr_t) opaque; - nodeDeviceLock(); - udev_monitor = DRV_STATE_UDEV_MONITOR(driver); + /* continue rather than break from the loop on non-fatal errors */ + while (1) { + virMutexLock(&privateData->lock); + while (privateData->nevents == 0 && !privateData->threadQuit) { + if (virCondWait(&privateData->threadCond, &privateData->lock)) { + virReportSystemError(errno, "%s", + _("failed to wait on condition")); + goto cleanup; + } + } - if (!udevCheckMonitorFD(udev_monitor, fd)) - goto unlock; + privateData->nevents--; + virMutexUnlock(&privateData->lock); - device = udev_monitor_receive_device(udev_monitor); - if (device == NULL) { - virReportError(VIR_ERR_INTERNAL_ERROR, "%s", - _("udev_monitor_receive_device returned NULL")); - goto unlock; + nodeDeviceLock(); + udev_monitor = DRV_STATE_UDEV_MONITOR(driver); + + if (!udevCheckMonitorFD(udev_monitor, privateData->monitor_fd)) { + nodeDeviceUnlock(); + goto cleanup; + } + + device = udev_monitor_receive_device(udev_monitor); + nodeDeviceUnlock(); + + if (!device) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("udev_monitor_receive_device returned NULL")); + goto next; + } + + udevHandleOneDevice(device); + + next: + udev_device_unref(device); } - nodeDeviceUnlock(); - udevHandleOneDevice(device); cleanup: udev_device_unref(device); + udevEventThreadDataFree(privateData); return; - - unlock: - nodeDeviceUnlock(); - goto cleanup; } @@ -1681,20 +1748,28 @@ static void udevEventHandleCallback(int watch ATTRIBUTE_UNUSED, int fd, int events ATTRIBUTE_UNUSED, - void *data ATTRIBUTE_UNUSED) + void *opaque) { struct udev_monitor *udev_monitor = NULL; + udevEventThreadDataPtr threadData = opaque; nodeDeviceLock(); udev_monitor = DRV_STATE_UDEV_MONITOR(driver); - if (!udevCheckMonitorFD(udev_monitor, fd)) { + virMutexLock(&threadData->lock); + threadData->threadQuit = true; + virCondSignal(&threadData->threadCond); + virMutexUnlock(&threadData->lock); + nodeDeviceUnlock(); return; } nodeDeviceUnlock(); - udevEventHandleThread((void *)(intptr_t) fd); + virMutexLock(&threadData->lock); + threadData->nevents++; + virCondSignal(&threadData->threadCond); + virMutexUnlock(&threadData->lock); } @@ -1821,6 +1896,9 @@ nodeStateInitialize(bool privileged, { udevPrivate *priv = NULL; struct udev *udev = NULL; + int monitor_fd = -1; + virThread th; + udevEventThreadDataPtr threadData = NULL; if (VIR_ALLOC(priv) < 0) return -1; @@ -1881,6 +1959,14 @@ nodeStateInitialize(bool privileged, 128 * 1024 * 1024); #endif + monitor_fd = udev_monitor_get_fd(priv->udev_monitor); + if (!(threadData = udevEventThreadDataNew(monitor_fd)) || + virThreadCreate(&th, false, udevEventHandleThread, threadData) < 0) { + virReportSystemError(errno, "%s", + _("failed to create udev handling thread")); + goto cleanup; + } + /* We register the monitor with the event callback so we are * notified by udev of device changes before we enumerate existing * devices because libvirt will simply recreate the device if we @@ -1889,9 +1975,8 @@ nodeStateInitialize(bool privileged, * enumeration. The alternative is to register the callback after * we enumerate, in which case we will fail to create any devices * that appear while the enumeration is taking place. */ - priv->watch = virEventAddHandle(udev_monitor_get_fd(priv->udev_monitor), - VIR_EVENT_HANDLE_READABLE, - udevEventHandleCallback, NULL, NULL); + priv->watch = virEventAddHandle(monitor_fd, VIR_EVENT_HANDLE_READABLE, + udevEventHandleCallback, threadData, NULL); if (priv->watch == -1) goto unlock; -- 2.13.3

Since we have a number of places where we workaround timing issues with devices, attributes (files in general) not being available at the time of processing them by calling usleep in a loop for a fixed number of tries, we could as well have a utility function that would do that. Therefore we won't have to duplicate this ugly workaround even more. Signed-off-by: Erik Skultety <eskultet@redhat.com> --- src/libvirt_private.syms | 1 + src/util/virfile.c | 29 +++++++++++++++++++++++++++++ src/util/virfile.h | 2 ++ 3 files changed, 32 insertions(+) diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index 37b815c06..fa018693b 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -1706,6 +1706,7 @@ virFileStripSuffix; virFileTouch; virFileUnlock; virFileUpdatePerm; +virFileWaitForAccess; virFileWrapperFdClose; virFileWrapperFdFree; virFileWrapperFdNew; diff --git a/src/util/virfile.c b/src/util/virfile.c index 2f28e83f4..15b46bcdd 100644 --- a/src/util/virfile.c +++ b/src/util/virfile.c @@ -4166,3 +4166,32 @@ virFileReadValueString(char **value, const char *format, ...) VIR_FREE(str); return ret; } + + +/** + * virFileWaitForAccess: + * @path: absolute path to a sysfs attribute (can be a symlink) + * @ms: how long to wait (in milliseconds) + * @tries: how many times should we try to wait for @path to become accessible + * + * Checks the existence of @path. In case the file defined by @path + * doesn't exist, we wait for it to appear in @ms milliseconds (for up to + * @tries attempts). + * + * Returns 0 on success, -1 on error, setting errno appropriately. + */ +int +virFileWaitForAccess(const char *path, size_t ms, size_t tries) +{ + errno = 0; + + /* wait for @path to be accessible in @ms milliseconds, up to @tries */ + while (tries-- > 0 && !virFileExists(path)) { + if (tries == 0 || errno != ENOENT) + return -1; + + usleep(ms * 1000); + } + + return 0; +} diff --git a/src/util/virfile.h b/src/util/virfile.h index 57ceb8072..42630ebb5 100644 --- a/src/util/virfile.h +++ b/src/util/virfile.h @@ -347,6 +347,8 @@ int virFileReadValueScaledInt(unsigned long long *value, const char *format, ... int virFileReadValueString(char **value, const char *format, ...) ATTRIBUTE_FMT_PRINTF(2, 3); +int virFileWaitForAccess(const char *path, size_t ms, size_t tries); + int virFileInData(int fd, int *inData, -- 2.13.3

If we find ourselves in the situation that the 'add' uevent has been fired earlier than the sysfs tree for a device was created, we should use the best-effort approach and give kernel some predetermined amount of time, thus waiting for the attributes to be ready rather than discarding the device from our device list forever. If those don't appear in the given time frame, we need to move on, since libvirt can't wait indefinitely. Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1463285 Signed-off-by: Erik Skultety <eskultet@redhat.com> --- src/node_device/node_device_udev.c | 16 +++++++++++++++- 1 file changed, 15 insertions(+), 1 deletion(-) diff --git a/src/node_device/node_device_udev.c b/src/node_device/node_device_udev.c index e05819fe2..b1a61f9f5 100644 --- a/src/node_device/node_device_udev.c +++ b/src/node_device/node_device_udev.c @@ -1166,9 +1166,23 @@ udevProcessMediatedDevice(struct udev_device *dev, char *canonicalpath = NULL; virNodeDevCapMdevPtr data = &def->caps->data.mdev; - if (virAsprintf(&linkpath, "%s/mdev_type", udev_device_get_syspath(dev)) < 0) + /* Because of a kernel uevent race, we might get the 'add' event prior to + * the sysfs tree being ready, so any attempt to access any sysfs attribute + * would result in ENOENT and us dropping the device, so let's work around + * it by waiting for the attributes to become available. + */ + + if (virAsprintf(&linkpath, "%s/mdev_type", + udev_device_get_syspath(dev)) < 0) goto cleanup; + if (virFileWaitForAccess(linkpath, 1, 100) < 0) { + virReportSystemError(errno, + _("failed to wait for file '%s' to appear"), + linkpath); + goto cleanup; + } + if (virFileResolveLink(linkpath, &canonicalpath) < 0) { virReportSystemError(errno, _("failed to resolve '%s'"), linkpath); goto cleanup; -- 2.13.3

On Fri, Jul 28, 2017 at 09:43:57AM +0200, Erik Skultety wrote:
v1 here https://www.redhat.com/archives/libvir-list/2017-June/msg00826.html
This series addresses [1]. It builds on top of [2], series which introduces a few small fixes and improvements. Even though that one hasn't gotten a review yet, you can fetch and build this from my github branch [3].
Changes since v1: - moved the device handling routine into a detached worker thread, so the event loop wouldn't block - adjusted virFileAccess according to the reviewer's comments
Notes: Merging the nodedev obj's refactor storing the list of nodedev objects in a self-lockable object made it possible to drop the driver locks at certain spots either completely or at least rearrange them according to the needs. The worker thread therefore doesn't need to hold the driver lock while waiting on the device's attributes to appear (which would be wrong anyway, since it would still block the event loop) and only acquires the driver lock when doing sanity checks on the udev monitor.
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1463285 [2] https://www.redhat.com/archives/libvir-list/2017-July/msg01077.html [3] https://github.com/eskultety/libvirt/commits/mdev-nodedev-next
Erik Skultety (5): nodedev: Introduce udevCheckMonitorFD helper function udev: Split udevEventHandleCallback in two functions udev: Convert udevEventHandleThread to an actual thread routine util: Introduce virFileWaitForAccess nodedev: Work around the uevent race by hooking up virFileWaitForAccess
src/libvirt_private.syms | 1 + src/node_device/node_device_udev.c | 194 +++++++++++++++++++++++++++++++------ src/util/virfile.c | 29 ++++++ src/util/virfile.h | 2 + 4 files changed, 194 insertions(+), 32 deletions(-)
SNACK, I was pointed out that the worker thread spams the log with an unexpected error("monitor received NULL"), so it looks like there's a race somewhere. I checked the event counting, that looks okay, except that there's quite a few events coming on the fd during daemon's init phase and for most of them, there's nothing coming out of the monitor actually. I need to investigate more. Erik
participants (1)
-
Erik Skultety