[libvirt] [PATCH v4 0/7] Work around the kernel mdev uevent race in nodedev

v3 here: https://www.redhat.com/archives/libvir-list/2017-August/msg00703.html since v3: - some minor cosmetic changes related to comments from the original patches 1-2 - everything else (provided it didn't cause merge conflicts) is unchanged Erik Skultety (7): nodedev: Introduce udevEventCheckMonitorFD helper function nodedev: udev: Lock the driver in udevEventCheckMonitorFd udev: Split udevEventHandleCallback in two functions udev: Convert udevEventHandleThread to an actual thread routine nodedev: Disable/re-enable polling on the udev fd util: Introduce virFileWaitForExists nodedev: Work around the uevent race by hooking up virFileWaitForAccess src/libvirt_private.syms | 1 + src/node_device/node_device_udev.c | 194 +++++++++++++++++++++++++++++++++---- src/util/virfile.c | 31 ++++++ src/util/virfile.h | 2 + 4 files changed, 209 insertions(+), 19 deletions(-) -- 2.13.5

We need to perform some sanity checks on the udev monitor before every use so that we know nothing changed in the meantime. The reason for moving the code to a separate function is to be able to perform the same check from a worker thread that will replace the udevEventHandleCallback in terms of poking udev for device events. Signed-off-by: Erik Skultety <eskultet@redhat.com> --- src/node_device/node_device_udev.c | 40 +++++++++++++++++++++++++------------- 1 file changed, 27 insertions(+), 13 deletions(-) diff --git a/src/node_device/node_device_udev.c b/src/node_device/node_device_udev.c index f4177455c..fe21ad7df 100644 --- a/src/node_device/node_device_udev.c +++ b/src/node_device/node_device_udev.c @@ -1615,14 +1615,10 @@ udevHandleOneDevice(struct udev_device *device) } -static void -udevEventHandleCallback(int watch ATTRIBUTE_UNUSED, - int fd, - int events ATTRIBUTE_UNUSED, - void *data ATTRIBUTE_UNUSED) +static bool +udevEventCheckMonitorFD(struct udev_monitor *udev_monitor, + int fd) { - struct udev_device *device = NULL; - struct udev_monitor *udev_monitor = DRV_STATE_UDEV_MONITOR(driver); int udev_fd = -1; udev_fd = udev_monitor_get_fd(udev_monitor); @@ -1641,21 +1637,39 @@ udevEventHandleCallback(int watch ATTRIBUTE_UNUSED, virEventRemoveHandle(priv->watch); priv->watch = -1; - goto cleanup; + return false; + } + + return true; +} + + +static void +udevEventHandleCallback(int watch ATTRIBUTE_UNUSED, + int fd, + int events ATTRIBUTE_UNUSED, + void *data ATTRIBUTE_UNUSED) +{ + struct udev_device *device = NULL; + struct udev_monitor *udev_monitor = NULL; + + udev_monitor = DRV_STATE_UDEV_MONITOR(driver); + + if (!udevEventCheckMonitorFD(udev_monitor, fd)) { + nodeDeviceUnlock(); + return; } device = udev_monitor_receive_device(udev_monitor); - if (device == NULL) { + + if (!device) { virReportError(VIR_ERR_INTERNAL_ERROR, "%s", _("udev_monitor_receive_device returned NULL")); - goto cleanup; + return; } udevHandleOneDevice(device); - - cleanup: udev_device_unref(device); - return; } -- 2.13.5

On 09/18/2017 12:34 PM, Erik Skultety wrote:
We need to perform some sanity checks on the udev monitor before every use so that we know nothing changed in the meantime. The reason for moving the code to a separate function is to be able to perform the same check from a worker thread that will replace the udevEventHandleCallback in terms of poking udev for device events.
Signed-off-by: Erik Skultety <eskultet@redhat.com> --- src/node_device/node_device_udev.c | 40 +++++++++++++++++++++++++------------- 1 file changed, 27 insertions(+), 13 deletions(-)
diff --git a/src/node_device/node_device_udev.c b/src/node_device/node_device_udev.c index f4177455c..fe21ad7df 100644 --- a/src/node_device/node_device_udev.c +++ b/src/node_device/node_device_udev.c @@ -1615,14 +1615,10 @@ udevHandleOneDevice(struct udev_device *device) }
-static void -udevEventHandleCallback(int watch ATTRIBUTE_UNUSED, - int fd, - int events ATTRIBUTE_UNUSED, - void *data ATTRIBUTE_UNUSED) +static bool +udevEventCheckMonitorFD(struct udev_monitor *udev_monitor, + int fd) { - struct udev_device *device = NULL; - struct udev_monitor *udev_monitor = DRV_STATE_UDEV_MONITOR(driver); int udev_fd = -1;
udev_fd = udev_monitor_get_fd(udev_monitor); @@ -1641,21 +1637,39 @@ udevEventHandleCallback(int watch ATTRIBUTE_UNUSED, virEventRemoveHandle(priv->watch); priv->watch = -1;
- goto cleanup; + return false; + } + + return true; +} + + +static void +udevEventHandleCallback(int watch ATTRIBUTE_UNUSED, + int fd, + int events ATTRIBUTE_UNUSED, + void *data ATTRIBUTE_UNUSED) +{ + struct udev_device *device = NULL; + struct udev_monitor *udev_monitor = NULL; + + udev_monitor = DRV_STATE_UDEV_MONITOR(driver); + + if (!udevEventCheckMonitorFD(udev_monitor, fd)) { + nodeDeviceUnlock();
Me thinks this particular line belongs in the next patch... of course that means the { } won't be necessary here. Reviewed-by: John Ferlan <jferlan@redhat.com> John
+ return; }
device = udev_monitor_receive_device(udev_monitor); - if (device == NULL) { + + if (!device) { virReportError(VIR_ERR_INTERNAL_ERROR, "%s", _("udev_monitor_receive_device returned NULL")); - goto cleanup; + return; }
udevHandleOneDevice(device); - - cleanup: udev_device_unref(device); - return; }

On Wed, Sep 20, 2017 at 08:52:26AM -0400, John Ferlan wrote:
On 09/18/2017 12:34 PM, Erik Skultety wrote:
We need to perform some sanity checks on the udev monitor before every use so that we know nothing changed in the meantime. The reason for moving the code to a separate function is to be able to perform the same check from a worker thread that will replace the udevEventHandleCallback in terms of poking udev for device events.
Signed-off-by: Erik Skultety <eskultet@redhat.com> --- src/node_device/node_device_udev.c | 40 +++++++++++++++++++++++++------------- 1 file changed, 27 insertions(+), 13 deletions(-)
diff --git a/src/node_device/node_device_udev.c b/src/node_device/node_device_udev.c index f4177455c..fe21ad7df 100644 --- a/src/node_device/node_device_udev.c +++ b/src/node_device/node_device_udev.c @@ -1615,14 +1615,10 @@ udevHandleOneDevice(struct udev_device *device) }
-static void -udevEventHandleCallback(int watch ATTRIBUTE_UNUSED, - int fd, - int events ATTRIBUTE_UNUSED, - void *data ATTRIBUTE_UNUSED) +static bool +udevEventCheckMonitorFD(struct udev_monitor *udev_monitor, + int fd) { - struct udev_device *device = NULL; - struct udev_monitor *udev_monitor = DRV_STATE_UDEV_MONITOR(driver); int udev_fd = -1;
udev_fd = udev_monitor_get_fd(udev_monitor); @@ -1641,21 +1637,39 @@ udevEventHandleCallback(int watch ATTRIBUTE_UNUSED, virEventRemoveHandle(priv->watch); priv->watch = -1;
- goto cleanup; + return false; + } + + return true; +} + + +static void +udevEventHandleCallback(int watch ATTRIBUTE_UNUSED, + int fd, + int events ATTRIBUTE_UNUSED, + void *data ATTRIBUTE_UNUSED) +{ + struct udev_device *device = NULL; + struct udev_monitor *udev_monitor = NULL; + + udev_monitor = DRV_STATE_UDEV_MONITOR(driver); + + if (!udevEventCheckMonitorFD(udev_monitor, fd)) { + nodeDeviceUnlock();
Me thinks this particular line belongs in the next patch... of course that means the { } won't be necessary here.
Do-oh! Right, I'll move it - that's just a result of 4+ rebases since I couldn't make up my mind on how to split it in the least disruptive way so that I'd make the reviewer's job much easier. Thanks, Erik

The udev monitor is not an immutable resource, so better protect every access to it. Signed-off-by: Erik Skultety <eskultet@redhat.com> --- src/node_device/node_device_udev.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/src/node_device/node_device_udev.c b/src/node_device/node_device_udev.c index fe21ad7df..6c7f887ed 100644 --- a/src/node_device/node_device_udev.c +++ b/src/node_device/node_device_udev.c @@ -1615,6 +1615,7 @@ udevHandleOneDevice(struct udev_device *device) } +/* the caller must be holding the driver lock prior to calling this function */ static bool udevEventCheckMonitorFD(struct udev_monitor *udev_monitor, int fd) @@ -1653,6 +1654,7 @@ udevEventHandleCallback(int watch ATTRIBUTE_UNUSED, struct udev_device *device = NULL; struct udev_monitor *udev_monitor = NULL; + nodeDeviceLock(); udev_monitor = DRV_STATE_UDEV_MONITOR(driver); if (!udevEventCheckMonitorFD(udev_monitor, fd)) { @@ -1661,6 +1663,7 @@ udevEventHandleCallback(int watch ATTRIBUTE_UNUSED, } device = udev_monitor_receive_device(udev_monitor); + nodeDeviceUnlock(); if (!device) { virReportError(VIR_ERR_INTERNAL_ERROR, "%s", -- 2.13.5

On 09/18/2017 12:34 PM, Erik Skultety wrote:
The udev monitor is not an immutable resource, so better protect every access to it.
Signed-off-by: Erik Skultety <eskultet@redhat.com> --- src/node_device/node_device_udev.c | 3 +++ 1 file changed, 3 insertions(+)
Of course this one will have the obvious adjustment from merge conflict of the previous patch... Reviewed-by: John Ferlan <jferlan@redhat.com> John

This patch splits udevEventHandleCallback in two (introduces udevEventHandleThread) in order to be later able to refactor the latter to actually become a detached thread which will wait some time for the kernel to create the whole sysfs tree for a device as we cannot do that in the event loop directly. Signed-off-by: Erik Skultety <eskultet@redhat.com> --- src/node_device/node_device_udev.c | 27 +++++++++++++++++++++++---- 1 file changed, 23 insertions(+), 4 deletions(-) diff --git a/src/node_device/node_device_udev.c b/src/node_device/node_device_udev.c index 6c7f887ed..e144472f1 100644 --- a/src/node_device/node_device_udev.c +++ b/src/node_device/node_device_udev.c @@ -1646,13 +1646,11 @@ udevEventCheckMonitorFD(struct udev_monitor *udev_monitor, static void -udevEventHandleCallback(int watch ATTRIBUTE_UNUSED, - int fd, - int events ATTRIBUTE_UNUSED, - void *data ATTRIBUTE_UNUSED) +udevEventHandleThread(void *opaque) { struct udev_device *device = NULL; struct udev_monitor *udev_monitor = NULL; + int fd = (intptr_t) opaque; nodeDeviceLock(); udev_monitor = DRV_STATE_UDEV_MONITOR(driver); @@ -1676,6 +1674,27 @@ udevEventHandleCallback(int watch ATTRIBUTE_UNUSED, } +static void +udevEventHandleCallback(int watch ATTRIBUTE_UNUSED, + int fd, + int events ATTRIBUTE_UNUSED, + void *data ATTRIBUTE_UNUSED) +{ + struct udev_monitor *udev_monitor = NULL; + + nodeDeviceLock(); + udev_monitor = DRV_STATE_UDEV_MONITOR(driver); + + if (!udevEventCheckMonitorFD(udev_monitor, fd)) { + nodeDeviceUnlock(); + return; + } + nodeDeviceUnlock(); + + udevEventHandleThread((void *)(intptr_t) fd); +} + + /* DMI is intel-compatible specific */ #if defined(__x86_64__) || defined(__i386__) || defined(__amd64__) static void -- 2.13.5

On 09/18/2017 12:34 PM, Erik Skultety wrote:
This patch splits udevEventHandleCallback in two (introduces udevEventHandleThread) in order to be later able to refactor the latter to actually become a detached thread which will wait some time for the kernel to create the whole sysfs tree for a device as we cannot do that in the event loop directly.
Signed-off-by: Erik Skultety <eskultet@redhat.com> --- src/node_device/node_device_udev.c | 27 +++++++++++++++++++++++---- 1 file changed, 23 insertions(+), 4 deletions(-)
Reviewed-by: John Ferlan <jferlan@redhat.com> John

Adjust udevEventHandleThread to be a proper thread routine running in an infinite loop handling devices. Also introduce udevEventThreadData private structure. Every time there's and incoming event from udev, udevEventHandleCallback only increments the number of events queuing on the monitor and signals the worker thread to handle them. Signed-off-by: Erik Skultety <eskultet@redhat.com> --- src/node_device/node_device_udev.c | 125 +++++++++++++++++++++++++++++++------ 1 file changed, 107 insertions(+), 18 deletions(-) diff --git a/src/node_device/node_device_udev.c b/src/node_device/node_device_udev.c index e144472f1..96760d1e4 100644 --- a/src/node_device/node_device_udev.c +++ b/src/node_device/node_device_udev.c @@ -59,6 +59,54 @@ struct _udevPrivate { bool privileged; }; +typedef struct _udevEventThreadData udevEventThreadData; +typedef udevEventThreadData *udevEventThreadDataPtr; + +struct _udevEventThreadData { + virMutex lock; + virCond threadCond; + + bool threadQuit; + int monitor_fd; + unsigned long long nevents; /* number of udev events queuing on monitor */ +}; + + +static udevEventThreadDataPtr +udevEventThreadDataNew(int fd) +{ + udevEventThreadDataPtr ret = NULL; + + if (VIR_ALLOC_QUIET(ret) < 0) + return NULL; + + if (virMutexInit(&ret->lock) < 0) + goto cleanup; + + if (virCondInit(&ret->threadCond) < 0) + goto cleanup_mutex; + + ret->monitor_fd = fd; + + return ret; + + cleanup_mutex: + virMutexDestroy(&ret->lock); + + cleanup: + VIR_FREE(ret); + return NULL; +} + + +static void +udevEventThreadDataFree(udevEventThreadDataPtr data) +{ + virMutexDestroy(&data->lock); + virCondDestroy(&data->threadCond); + VIR_FREE(data); +} + static bool udevHasDeviceProperty(struct udev_device *dev, @@ -1648,29 +1696,51 @@ udevEventCheckMonitorFD(struct udev_monitor *udev_monitor, static void udevEventHandleThread(void *opaque) { + udevEventThreadDataPtr privateData = opaque; struct udev_device *device = NULL; struct udev_monitor *udev_monitor = NULL; - int fd = (intptr_t) opaque; - nodeDeviceLock(); - udev_monitor = DRV_STATE_UDEV_MONITOR(driver); + /* continue rather than break from the loop on non-fatal errors */ + while (1) { + virMutexLock(&privateData->lock); + while (privateData->nevents == 0 && !privateData->threadQuit) { + if (virCondWait(&privateData->threadCond, &privateData->lock)) { + virReportSystemError(errno, "%s", + _("handler failed to wait on condition")); + goto cleanup; + } + } - if (!udevEventCheckMonitorFD(udev_monitor, fd)) { + privateData->nevents--; + virMutexUnlock(&privateData->lock); + + nodeDeviceLock(); + udev_monitor = DRV_STATE_UDEV_MONITOR(driver); + + if (!udevEventCheckMonitorFD(udev_monitor, privateData->monitor_fd)) { + nodeDeviceUnlock(); + goto cleanup; + } + + device = udev_monitor_receive_device(udev_monitor); nodeDeviceUnlock(); - return; - } - device = udev_monitor_receive_device(udev_monitor); - nodeDeviceUnlock(); + if (!device) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("udev_monitor_receive_device returned NULL")); + goto next; + } + + udevHandleOneDevice(device); - if (!device) { - virReportError(VIR_ERR_INTERNAL_ERROR, "%s", - _("udev_monitor_receive_device returned NULL")); - return; + next: + udev_device_unref(device); } - udevHandleOneDevice(device); + cleanup: udev_device_unref(device); + udevEventThreadDataFree(privateData); + return; } @@ -1678,20 +1748,29 @@ static void udevEventHandleCallback(int watch ATTRIBUTE_UNUSED, int fd, int events ATTRIBUTE_UNUSED, - void *data ATTRIBUTE_UNUSED) + void *opaque) { struct udev_monitor *udev_monitor = NULL; + udevEventThreadDataPtr threadData = opaque; nodeDeviceLock(); udev_monitor = DRV_STATE_UDEV_MONITOR(driver); if (!udevEventCheckMonitorFD(udev_monitor, fd)) { + virMutexLock(&threadData->lock); + threadData->threadQuit = true; + virCondSignal(&threadData->threadCond); + virMutexUnlock(&threadData->lock); + nodeDeviceUnlock(); return; } nodeDeviceUnlock(); - udevEventHandleThread((void *)(intptr_t) fd); + virMutexLock(&threadData->lock); + threadData->nevents++; + virCondSignal(&threadData->threadCond); + virMutexUnlock(&threadData->lock); } @@ -1818,6 +1897,9 @@ nodeStateInitialize(bool privileged, { udevPrivate *priv = NULL; struct udev *udev = NULL; + int monitor_fd = -1; + virThread th; + udevEventThreadDataPtr threadData = NULL; if (VIR_ALLOC(priv) < 0) return -1; @@ -1878,6 +1960,14 @@ nodeStateInitialize(bool privileged, 128 * 1024 * 1024); #endif + monitor_fd = udev_monitor_get_fd(priv->udev_monitor); + if (!(threadData = udevEventThreadDataNew(monitor_fd)) || + virThreadCreate(&th, false, udevEventHandleThread, threadData) < 0) { + virReportSystemError(errno, "%s", + _("failed to create udev handling thread")); + goto cleanup; + } + /* We register the monitor with the event callback so we are * notified by udev of device changes before we enumerate existing * devices because libvirt will simply recreate the device if we @@ -1886,9 +1976,8 @@ nodeStateInitialize(bool privileged, * enumeration. The alternative is to register the callback after * we enumerate, in which case we will fail to create any devices * that appear while the enumeration is taking place. */ - priv->watch = virEventAddHandle(udev_monitor_get_fd(priv->udev_monitor), - VIR_EVENT_HANDLE_READABLE, - udevEventHandleCallback, NULL, NULL); + priv->watch = virEventAddHandle(monitor_fd, VIR_EVENT_HANDLE_READABLE, + udevEventHandleCallback, threadData, NULL); if (priv->watch == -1) goto unlock; -- 2.13.5

On 09/18/2017 12:34 PM, Erik Skultety wrote:
Adjust udevEventHandleThread to be a proper thread routine running in an infinite loop handling devices. Also introduce udevEventThreadData private structure. Every time there's and incoming event from udev, udevEventHandleCallback only increments the number of events queuing on the monitor and signals the worker thread to handle them.
Signed-off-by: Erik Skultety <eskultet@redhat.com> --- src/node_device/node_device_udev.c | 125 +++++++++++++++++++++++++++++++------ 1 file changed, 107 insertions(+), 18 deletions(-)
diff --git a/src/node_device/node_device_udev.c b/src/node_device/node_device_udev.c index e144472f1..96760d1e4 100644 --- a/src/node_device/node_device_udev.c +++ b/src/node_device/node_device_udev.c @@ -59,6 +59,54 @@ struct _udevPrivate { bool privileged; };
+typedef struct _udevEventThreadData udevEventThreadData; +typedef udevEventThreadData *udevEventThreadDataPtr; + +struct _udevEventThreadData { + virMutex lock; + virCond threadCond; + + bool threadQuit; + int monitor_fd; + unsigned long long nevents; /* number of udev events queuing on monitor */ +}; + + +static udevEventThreadDataPtr +udevEventThreadDataNew(int fd) +{ + udevEventThreadDataPtr ret = NULL; + + if (VIR_ALLOC_QUIET(ret) < 0) + return NULL; + + if (virMutexInit(&ret->lock) < 0) + goto cleanup; + + if (virCondInit(&ret->threadCond) < 0) + goto cleanup_mutex; + + ret->monitor_fd = fd; + + return ret; + + cleanup_mutex: + virMutexDestroy(&ret->lock); + + cleanup: + VIR_FREE(ret); + return NULL; +} + + +static void +udevEventThreadDataFree(udevEventThreadDataPtr data) +{ + virMutexDestroy(&data->lock); + virCondDestroy(&data->threadCond); + VIR_FREE(data); +} +
static bool udevHasDeviceProperty(struct udev_device *dev, @@ -1648,29 +1696,51 @@ udevEventCheckMonitorFD(struct udev_monitor *udev_monitor, static void udevEventHandleThread(void *opaque) { + udevEventThreadDataPtr privateData = opaque; struct udev_device *device = NULL; struct udev_monitor *udev_monitor = NULL; - int fd = (intptr_t) opaque;
- nodeDeviceLock(); - udev_monitor = DRV_STATE_UDEV_MONITOR(driver); + /* continue rather than break from the loop on non-fatal errors */ ^^ This comment could go to [1]
+ while (1) { + virMutexLock(&privateData->lock); + while (privateData->nevents == 0 && !privateData->threadQuit) { + if (virCondWait(&privateData->threadCond, &privateData->lock)) { + virReportSystemError(errno, "%s", + _("handler failed to wait on condition"));
Is a virMutexUnlock required before eventually calling virMutexDestroy...
+ goto cleanup; + } + }
- if (!udevEventCheckMonitorFD(udev_monitor, fd)) { + privateData->nevents--; + virMutexUnlock(&privateData->lock);
If we get here, then either nevents > 0 || threadQuit == true, but we don't check for threadQuit here before the fetch/check of monitor_fd, e.g. the reason for threadQuit = true, so although the following udev_monitor check "works", the question thus becomes is it necessary if threadQuit == true? I suppose it could be, but we could also jump to cleanup if threadQuit == true || !udevEventCheckMonitorFD
+ + nodeDeviceLock(); + udev_monitor = DRV_STATE_UDEV_MONITOR(driver); + + if (!udevEventCheckMonitorFD(udev_monitor, privateData->monitor_fd)) {
This accesses privateData->monitor_fd without the mutex. So we don't have too many lock/unlock - consider a local @monitor_fd which is fetched while the lock is held.
+ nodeDeviceUnlock(); + goto cleanup; + } + + device = udev_monitor_receive_device(udev_monitor); nodeDeviceUnlock(); - return; - }
- device = udev_monitor_receive_device(udev_monitor); - nodeDeviceUnlock();
[1] could move the comment here since that's what I believe it's meant to describe...
+ if (!device) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("udev_monitor_receive_device returned NULL"));
Perhaps a VIR_WARN? Doesn't perhaps really matter, but it's not an error it's just a condition we didn't expect that we're continuing on...
+ goto next;
This should just be a continue; instead of needing next... Not clear what happens if udev_device_unref(NULL) is called.
+ } + + udevHandleOneDevice(device);
- if (!device) { - virReportError(VIR_ERR_INTERNAL_ERROR, "%s", - _("udev_monitor_receive_device returned NULL")); - return; + next: + udev_device_unref(device); }
- udevHandleOneDevice(device); + cleanup: udev_device_unref(device);
Should this be: if (device) udev_device_unref(device) I think the cleanups are obvious, so Reviewed-by: John Ferlan <jferlan@redhat.com> John
+ udevEventThreadDataFree(privateData); + return; }
@@ -1678,20 +1748,29 @@ static void udevEventHandleCallback(int watch ATTRIBUTE_UNUSED, int fd, int events ATTRIBUTE_UNUSED, - void *data ATTRIBUTE_UNUSED) + void *opaque) { struct udev_monitor *udev_monitor = NULL; + udevEventThreadDataPtr threadData = opaque;
nodeDeviceLock(); udev_monitor = DRV_STATE_UDEV_MONITOR(driver);
if (!udevEventCheckMonitorFD(udev_monitor, fd)) { + virMutexLock(&threadData->lock); + threadData->threadQuit = true; + virCondSignal(&threadData->threadCond); + virMutexUnlock(&threadData->lock); + nodeDeviceUnlock(); return; } nodeDeviceUnlock();
- udevEventHandleThread((void *)(intptr_t) fd); + virMutexLock(&threadData->lock); + threadData->nevents++; + virCondSignal(&threadData->threadCond); + virMutexUnlock(&threadData->lock); }
@@ -1818,6 +1897,9 @@ nodeStateInitialize(bool privileged, { udevPrivate *priv = NULL; struct udev *udev = NULL; + int monitor_fd = -1; + virThread th; + udevEventThreadDataPtr threadData = NULL;
if (VIR_ALLOC(priv) < 0) return -1; @@ -1878,6 +1960,14 @@ nodeStateInitialize(bool privileged, 128 * 1024 * 1024); #endif
+ monitor_fd = udev_monitor_get_fd(priv->udev_monitor); + if (!(threadData = udevEventThreadDataNew(monitor_fd)) || + virThreadCreate(&th, false, udevEventHandleThread, threadData) < 0) { + virReportSystemError(errno, "%s", + _("failed to create udev handling thread")); + goto cleanup; + } + /* We register the monitor with the event callback so we are * notified by udev of device changes before we enumerate existing * devices because libvirt will simply recreate the device if we @@ -1886,9 +1976,8 @@ nodeStateInitialize(bool privileged, * enumeration. The alternative is to register the callback after * we enumerate, in which case we will fail to create any devices * that appear while the enumeration is taking place. */ - priv->watch = virEventAddHandle(udev_monitor_get_fd(priv->udev_monitor), - VIR_EVENT_HANDLE_READABLE, - udevEventHandleCallback, NULL, NULL); + priv->watch = virEventAddHandle(monitor_fd, VIR_EVENT_HANDLE_READABLE, + udevEventHandleCallback, threadData, NULL); if (priv->watch == -1) goto unlock;

[...]
+ while (1) { + virMutexLock(&privateData->lock); + while (privateData->nevents == 0 && !privateData->threadQuit) { + if (virCondWait(&privateData->threadCond, &privateData->lock)) { + virReportSystemError(errno, "%s", + _("handler failed to wait on condition"));
Is a virMutexUnlock required before eventually calling virMutexDestroy...
It is, but even an unlock wouldn't be enough, since there's concurrent access, even if everything goes smooth, the handle callback could be already waiting in queue to lock the mutex, which is too an undefined behaviour. So I'll change it so that the thread doesn't do any thread-local data cleanup (except for mutex unlock) and I'm going to add the free callback to virEventAddHandle, so that the data will be freed eventually. The interesting thing about this is that we never actually check the errno when we continue with the error path on virCondWait failure, most likely because it's veeery unlikely that such a thing would happen. If you look at the errnos it can return, you'll be able to see that a) none of the errors are really applicable to our usecases, therefore almost impossible to happen and b) one of them indicates that the mutex was still acquired, therefore needs to be unlocked, while the other one indicates that the state is unrecoverable, so the mutex couldn't have been acquired, but if you have a quick look throughout the code, we always unlock in this case (which is possibly undefined). But I'll stick with the unlock for the sake of consistency, you see how quickly the problem changes to something else :).
+ goto cleanup; + } + }
- if (!udevEventCheckMonitorFD(udev_monitor, fd)) { + privateData->nevents--; + virMutexUnlock(&privateData->lock);
If we get here, then either nevents > 0 || threadQuit == true, but we don't check for threadQuit here before the fetch/check of monitor_fd, e.g. the reason for threadQuit = true, so although the following udev_monitor check "works", the question thus becomes is it necessary if threadQuit == true? I suppose it could be, but we could also jump to
You'd have to check for threadQuit==true everywhere to avoid any potential unnecessary work. Also, threadQuit == true only when udevEventCheckMonitorFD failed in udevEventHandleCallback, so when the thread invokes udevEventCheckMonitorFD, it will fail equally, thus nothing else will be done and that's because we can't recover from the previous udev monitor fd error.
cleanup if threadQuit == true || !udevEventCheckMonitorFD
+ + nodeDeviceLock(); + udev_monitor = DRV_STATE_UDEV_MONITOR(driver); + + if (!udevEventCheckMonitorFD(udev_monitor, privateData->monitor_fd)) {
This accesses privateData->monitor_fd without the mutex. So we don't have too many lock/unlock - consider a local @monitor_fd which is fetched while the lock is held.
True, but @monitor_fd is defacto a constant, noone is touching it since it's defined during thread creation.
+ nodeDeviceUnlock(); + goto cleanup; + } + + device = udev_monitor_receive_device(udev_monitor); nodeDeviceUnlock(); - return; - }
- device = udev_monitor_receive_device(udev_monitor); - nodeDeviceUnlock();
[1] could move the comment here since that's what I believe it's meant to describe...
+ if (!device) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("udev_monitor_receive_device returned NULL"));
Perhaps a VIR_WARN? Doesn't perhaps really matter, but it's not an error it's just a condition we didn't expect that we're continuing on...
That means that when the user sets global logging level to ERROR, and say they're not seeing their devices to be updated the way they expect, they're not going to see any signs of a potential failure, so I'm keeping it as an error.
+ goto next;
This should just be a continue; instead of needing next... Not clear what happens if udev_device_unref(NULL) is called.
Very true :).
+ } + + udevHandleOneDevice(device);
- if (!device) { - virReportError(VIR_ERR_INTERNAL_ERROR, "%s", - _("udev_monitor_receive_device returned NULL")); - return; + next: + udev_device_unref(device); }
- udevHandleOneDevice(device); + cleanup: udev_device_unref(device);
Should this be:
if (device) udev_device_unref(device)
I think the cleanups are obvious, so
Reviewed-by: John Ferlan <jferlan@redhat.com>
Thanks, but there are more changes needed, so another version will be posted.
John
+ udevEventThreadDataFree(privateData); + return; }
@@ -1678,20 +1748,29 @@ static void udevEventHandleCallback(int watch ATTRIBUTE_UNUSED, int fd, int events ATTRIBUTE_UNUSED, - void *data ATTRIBUTE_UNUSED) + void *opaque) { struct udev_monitor *udev_monitor = NULL; + udevEventThreadDataPtr threadData = opaque;
nodeDeviceLock(); udev_monitor = DRV_STATE_UDEV_MONITOR(driver);
if (!udevEventCheckMonitorFD(udev_monitor, fd)) { + virMutexLock(&threadData->lock); + threadData->threadQuit = true; + virCondSignal(&threadData->threadCond); + virMutexUnlock(&threadData->lock); + nodeDeviceUnlock(); return; } nodeDeviceUnlock();
- udevEventHandleThread((void *)(intptr_t) fd); + virMutexLock(&threadData->lock); + threadData->nevents++; + virCondSignal(&threadData->threadCond); + virMutexUnlock(&threadData->lock); }
@@ -1818,6 +1897,9 @@ nodeStateInitialize(bool privileged, { udevPrivate *priv = NULL; struct udev *udev = NULL; + int monitor_fd = -1; + virThread th; + udevEventThreadDataPtr threadData = NULL;
if (VIR_ALLOC(priv) < 0) return -1; @@ -1878,6 +1960,14 @@ nodeStateInitialize(bool privileged, 128 * 1024 * 1024); #endif
+ monitor_fd = udev_monitor_get_fd(priv->udev_monitor); + if (!(threadData = udevEventThreadDataNew(monitor_fd)) || + virThreadCreate(&th, false, udevEventHandleThread, threadData) < 0) { + virReportSystemError(errno, "%s", + _("failed to create udev handling thread")); + goto cleanup; + } + /* We register the monitor with the event callback so we are * notified by udev of device changes before we enumerate existing * devices because libvirt will simply recreate the device if we @@ -1886,9 +1976,8 @@ nodeStateInitialize(bool privileged, * enumeration. The alternative is to register the callback after * we enumerate, in which case we will fail to create any devices * that appear while the enumeration is taking place. */ - priv->watch = virEventAddHandle(udev_monitor_get_fd(priv->udev_monitor), - VIR_EVENT_HANDLE_READABLE, - udevEventHandleCallback, NULL, NULL); + priv->watch = virEventAddHandle(monitor_fd, VIR_EVENT_HANDLE_READABLE, + udevEventHandleCallback, threadData, NULL); if (priv->watch == -1) goto unlock;

On Thu, Oct 05, 2017 at 02:54:36PM +0200, Erik Skultety wrote:
[...]
+ while (1) { + virMutexLock(&privateData->lock); + while (privateData->nevents == 0 && !privateData->threadQuit) { + if (virCondWait(&privateData->threadCond, &privateData->lock)) { + virReportSystemError(errno, "%s", + _("handler failed to wait on condition"));
Is a virMutexUnlock required before eventually calling virMutexDestroy...
It is, but even an unlock wouldn't be enough, since there's concurrent access, even if everything goes smooth, the handle callback could be already waiting in queue to lock the mutex, which is too an undefined behaviour. So I'll change it so that the thread doesn't do any thread-local data cleanup (except for mutex unlock) and I'm going to add the free callback to virEventAddHandle, so that the data will be freed eventually.
Aand of course ^this is wrong since someone might be holding the lock during the time we try to remove the handle or the thread might be simply waiting for the kernel (its ultimate purpose) and the next time it touches the mutex, well, not good, so that wouldn't fly either. I have to give it more thought,' especially how and when the thread is decommissioned. Erik

The event loop may get scheduled earlier than the udev event handler thread which means that it would keep invoking the handler callback with "new" events, while in fact it's most likely still the same event which the handler thread hasn't managed to remove from the socket queue yet. This is due to unwanted increments of the number of events queuing on the socket and causes the handler thread to spam the logs with an error about libudev returning NULL (errno == EAGAIN). Signed-off-by: Erik Skultety <eskultet@redhat.com> --- src/node_device/node_device_udev.c | 17 +++++++++++++++++ 1 file changed, 17 insertions(+) diff --git a/src/node_device/node_device_udev.c b/src/node_device/node_device_udev.c index 96760d1e4..70e15ffb8 100644 --- a/src/node_device/node_device_udev.c +++ b/src/node_device/node_device_udev.c @@ -1696,6 +1696,7 @@ udevEventCheckMonitorFD(struct udev_monitor *udev_monitor, static void udevEventHandleThread(void *opaque) { + udevPrivate *priv = NULL; udevEventThreadDataPtr privateData = opaque; struct udev_device *device = NULL; struct udev_monitor *udev_monitor = NULL; @@ -1715,6 +1716,7 @@ udevEventHandleThread(void *opaque) virMutexUnlock(&privateData->lock); nodeDeviceLock(); + priv = driver->privateData; udev_monitor = DRV_STATE_UDEV_MONITOR(driver); if (!udevEventCheckMonitorFD(udev_monitor, privateData->monitor_fd)) { @@ -1725,6 +1727,9 @@ udevEventHandleThread(void *opaque) device = udev_monitor_receive_device(udev_monitor); nodeDeviceUnlock(); + /* Re-enable polling for new events on the @udev_monitor */ + virEventUpdateHandle(priv->watch, VIR_EVENT_HANDLE_READABLE); + if (!device) { virReportError(VIR_ERR_INTERNAL_ERROR, "%s", _("udev_monitor_receive_device returned NULL")); @@ -1750,10 +1755,12 @@ udevEventHandleCallback(int watch ATTRIBUTE_UNUSED, int events ATTRIBUTE_UNUSED, void *opaque) { + udevPrivate *priv = NULL; struct udev_monitor *udev_monitor = NULL; udevEventThreadDataPtr threadData = opaque; nodeDeviceLock(); + priv = driver->privateData; udev_monitor = DRV_STATE_UDEV_MONITOR(driver); if (!udevEventCheckMonitorFD(udev_monitor, fd)) { @@ -1771,6 +1778,16 @@ udevEventHandleCallback(int watch ATTRIBUTE_UNUSED, threadData->nevents++; virCondSignal(&threadData->threadCond); virMutexUnlock(&threadData->lock); + + /* Due to scheduling, the eventloop might poll the udev monitor before the + * handler thread actually removes the data from the socket, thus causing it + * to spam errors about data not being ready yet, because + * udevEventHandleCallback would be invoked multiple times incrementing the + * counter of events queuing for a single event. + * Therefore we need to disable polling on the fd until the thread actually + * removes the data from the socket. + */ + virEventUpdateHandle(priv->watch, 0); } -- 2.13.5

On 09/18/2017 12:34 PM, Erik Skultety wrote:
The event loop may get scheduled earlier than the udev event handler thread which means that it would keep invoking the handler callback with "new" events, while in fact it's most likely still the same event which the handler thread hasn't managed to remove from the socket queue yet. This is due to unwanted increments of the number of events queuing on the socket and causes the handler thread to spam the logs with an error about libudev returning NULL (errno == EAGAIN).
Signed-off-by: Erik Skultety <eskultet@redhat.com> --- src/node_device/node_device_udev.c | 17 +++++++++++++++++ 1 file changed, 17 insertions(+)
diff --git a/src/node_device/node_device_udev.c b/src/node_device/node_device_udev.c index 96760d1e4..70e15ffb8 100644 --- a/src/node_device/node_device_udev.c +++ b/src/node_device/node_device_udev.c @@ -1696,6 +1696,7 @@ udevEventCheckMonitorFD(struct udev_monitor *udev_monitor, static void udevEventHandleThread(void *opaque) { + udevPrivate *priv = NULL; udevEventThreadDataPtr privateData = opaque; struct udev_device *device = NULL; struct udev_monitor *udev_monitor = NULL; @@ -1715,6 +1716,7 @@ udevEventHandleThread(void *opaque) virMutexUnlock(&privateData->lock);
nodeDeviceLock(); + priv = driver->privateData; udev_monitor = DRV_STATE_UDEV_MONITOR(driver);
if (!udevEventCheckMonitorFD(udev_monitor, privateData->monitor_fd)) { @@ -1725,6 +1727,9 @@ udevEventHandleThread(void *opaque) device = udev_monitor_receive_device(udev_monitor); nodeDeviceUnlock();
+ /* Re-enable polling for new events on the @udev_monitor */ + virEventUpdateHandle(priv->watch, VIR_EVENT_HANDLE_READABLE); +
I think this should only be done when privateData->nevents == 0? If we have multiple events to read, then calling virEventPollUpdateHandle, (eventually) for every pass through the loop seems like a bit of overkill especially if udevEventHandleCallback turns right around and disables it again. Also fortunately there isn't more than one udev thread sending the events since you access the priv->watch without the driver lock... Conversely perhaps we only disable if events > 1... Then again, how does one get to 2 events queued if we're disabling as soon as we increment nevents? Wouldn't this totally obviate the need for nevents? I would think it would be overkill to disable/enable for just 1 event. If multiple are queued, then yeah we're starting to get behind... Of course that could effect the re-enable when events == 1 (or some MAGIC_CONSTANT_NUMBER_THRESHOLD_TO_DISABLE_POLLING) IIRC from the previous trip down this rabbit hole (I did briefly go back and read the comments) that what you're trying to avoid is the following message spamming the error logs because of a scheduling mishap... Thus perhaps the following message could only be displayed if nevents > 1 and nothing was found knowing that we have this "timing problem" between udev, this thread, and the fd polling scanner of when a device should be "found"... and then reset nevents == 0 with the appropriate lock. Curious on your thoughts before an R-B/ACK though. John
if (!device) { virReportError(VIR_ERR_INTERNAL_ERROR, "%s", _("udev_monitor_receive_device returned NULL")); @@ -1750,10 +1755,12 @@ udevEventHandleCallback(int watch ATTRIBUTE_UNUSED, int events ATTRIBUTE_UNUSED, void *opaque) { + udevPrivate *priv = NULL; struct udev_monitor *udev_monitor = NULL; udevEventThreadDataPtr threadData = opaque;
nodeDeviceLock(); + priv = driver->privateData; udev_monitor = DRV_STATE_UDEV_MONITOR(driver);
if (!udevEventCheckMonitorFD(udev_monitor, fd)) { @@ -1771,6 +1778,16 @@ udevEventHandleCallback(int watch ATTRIBUTE_UNUSED, threadData->nevents++; virCondSignal(&threadData->threadCond); virMutexUnlock(&threadData->lock); + + /* Due to scheduling, the eventloop might poll the udev monitor before the + * handler thread actually removes the data from the socket, thus causing it + * to spam errors about data not being ready yet, because + * udevEventHandleCallback would be invoked multiple times incrementing the + * counter of events queuing for a single event. + * Therefore we need to disable polling on the fd until the thread actually + * removes the data from the socket. + */ + virEventUpdateHandle(priv->watch, 0); }

[...]
nodeDeviceLock(); + priv = driver->privateData; udev_monitor = DRV_STATE_UDEV_MONITOR(driver);
if (!udevEventCheckMonitorFD(udev_monitor, privateData->monitor_fd)) { @@ -1725,6 +1727,9 @@ udevEventHandleThread(void *opaque) device = udev_monitor_receive_device(udev_monitor); nodeDeviceUnlock();
+ /* Re-enable polling for new events on the @udev_monitor */ + virEventUpdateHandle(priv->watch, VIR_EVENT_HANDLE_READABLE); +
I think this should only be done when privateData->nevents == 0? If we have multiple events to read, then calling virEventPollUpdateHandle, (eventually) for every pass through the loop seems like a bit of overkill especially if udevEventHandleCallback turns right around and disables it again.
Also fortunately there isn't more than one udev thread sending the events since you access the priv->watch without the driver lock...
Conversely perhaps we only disable if events > 1... Then again, how does one get to 2 events queued if we're disabling as soon as we increment
Very good point, technically events would still get queued, we just wouldn't check and yes, we would process 1 event at a time. Not optimal, but if you look at the original code and compare it with this one performance-wise (and I hope I haven't missed anything that would render everything I write next a complete rubbish), the time complexity hasn't changed, the space complexity hasn't changed, what changed is code complexity which makes the code a bit slower due to the excessive locking and toggling the FD polling back and forth. So essentially you've got the same thing as you had before..but asynchronous. However, yes, the usage of @nevents is completely useless now (haven't realized that immediately, thanks) and a simple signalling should suffice. So how could we make it faster though? I thought more about the idea you shared in one of the previous reviews, letting the thread actually pull all the data from the monitor, to which I IIRC replied something in the sense that the event counting mechanism wouldn't allow that and it would break. Okay, let's drop the event counting. What if we now let both the udev handler thread and the event loop "poll" the file descriptor, IOW let the event loop polling the monitor fd, thus invoking udevHandleCallback which would in turn keep signalling the handler thread that there are some data. The difference now in the handler thread would be that it wouldn't blindly trust the callback about the data, because of the scheduling issue, it would keep poking the monitor itself until it gets either EAGAIN or EWOULDBLOCK from recvmsg() called in libudev, which would then be the signal to start waiting on the condition. The next an event appears, a signal from udevHandleCallback would finally have a meaning and wouldn't be ignored. This way, it's actually the handler thread who's 'the boss', as most of the signals from udevHandleCallback would be lost/NOPs. Would you agree to such an approach? Erik
nevents? Wouldn't this totally obviate the need for nevents?
I would think it would be overkill to disable/enable for just 1 event. If multiple are queued, then yeah we're starting to get behind... Of course that could effect the re-enable when events == 1 (or some MAGIC_CONSTANT_NUMBER_THRESHOLD_TO_DISABLE_POLLING)
This magic threshold approach still wouldn't work because you don't know anything about the event at that point, you'd have to process them to actually find out whether that's a legitimate event or it's an already noted event that hasn't been processed yet, so it would still mess with your counters.
IIRC from the previous trip down this rabbit hole (I did briefly go back and read the comments) that what you're trying to avoid is the following message spamming the error logs because of a scheduling mishap... Thus perhaps the following message could only be displayed if nevents > 1 and nothing was found knowing that we have this "timing problem" between udev, this thread, and the fd polling scanner of when a device should be "found"... and then reset nevents == 0 with the appropriate lock.
Curious on your thoughts before an R-B/ACK though.
John
if (!device) { virReportError(VIR_ERR_INTERNAL_ERROR, "%s", _("udev_monitor_receive_device returned NULL")); @@ -1750,10 +1755,12 @@ udevEventHandleCallback(int watch ATTRIBUTE_UNUSED, int events ATTRIBUTE_UNUSED, void *opaque) { + udevPrivate *priv = NULL; struct udev_monitor *udev_monitor = NULL; udevEventThreadDataPtr threadData = opaque;
nodeDeviceLock(); + priv = driver->privateData; udev_monitor = DRV_STATE_UDEV_MONITOR(driver);
if (!udevEventCheckMonitorFD(udev_monitor, fd)) { @@ -1771,6 +1778,16 @@ udevEventHandleCallback(int watch ATTRIBUTE_UNUSED, threadData->nevents++; virCondSignal(&threadData->threadCond); virMutexUnlock(&threadData->lock); + + /* Due to scheduling, the eventloop might poll the udev monitor before the + * handler thread actually removes the data from the socket, thus causing it + * to spam errors about data not being ready yet, because + * udevEventHandleCallback would be invoked multiple times incrementing the + * counter of events queuing for a single event. + * Therefore we need to disable polling on the fd until the thread actually + * removes the data from the socket. + */ + virEventUpdateHandle(priv->watch, 0); }

On 09/28/2017 06:00 AM, Erik Skultety wrote:
[...]
nodeDeviceLock(); + priv = driver->privateData; udev_monitor = DRV_STATE_UDEV_MONITOR(driver);
if (!udevEventCheckMonitorFD(udev_monitor, privateData->monitor_fd)) { @@ -1725,6 +1727,9 @@ udevEventHandleThread(void *opaque) device = udev_monitor_receive_device(udev_monitor); nodeDeviceUnlock();
+ /* Re-enable polling for new events on the @udev_monitor */ + virEventUpdateHandle(priv->watch, VIR_EVENT_HANDLE_READABLE); +
I think this should only be done when privateData->nevents == 0? If we have multiple events to read, then calling virEventPollUpdateHandle, (eventually) for every pass through the loop seems like a bit of overkill especially if udevEventHandleCallback turns right around and disables it again.
Also fortunately there isn't more than one udev thread sending the events since you access the priv->watch without the driver lock...
Conversely perhaps we only disable if events > 1... Then again, how does one get to 2 events queued if we're disabling as soon as we increment
Very good point, technically events would still get queued, we just wouldn't check and yes, we would process 1 event at a time. Not optimal, but if you look at the original code and compare it with this one performance-wise (and I hope I haven't missed anything that would render everything I write next a complete rubbish), the time complexity hasn't changed, the space complexity hasn't changed, what changed is code complexity which makes the code a bit slower due to the excessive locking and toggling the FD polling back and forth. So essentially you've got the same thing as you had before..but asynchronous. However, yes, the usage of @nevents is completely useless now (haven't realized that immediately, thanks) and a simple signalling should suffice.
Having it "slower" is necessarily bad ;-) That gives some of the other slower buggers a chance to fill in the details we need. Throwing the control back to udev quicker could aid in that too.
So how could we make it faster though? I thought more about the idea you shared in one of the previous reviews, letting the thread actually pull all the data from the monitor, to which I IIRC replied something in the sense that the event counting mechanism wouldn't allow that and it would break. Okay, let's drop the event counting. What if we now let both the udev handler thread and the event loop "poll" the file descriptor, IOW let the event loop polling the monitor fd, thus invoking udevHandleCallback which would in turn keep signalling the handler thread that there are some data. The difference now in the handler thread would be that it wouldn't blindly trust the callback about the data, because of the scheduling issue, it would keep poking the monitor itself until it gets either EAGAIN or EWOULDBLOCK from recvmsg() called in libudev, which would then be the signal to start waiting on the condition. The next an event appears, a signal from udevHandleCallback would finally have a meaning and wouldn't be ignored.
Is making it faster really a goal? It's preferable that it works consistently I think. The various errno possibilities and the desire to avoid the "udev_monitor_receive_device returned NULL" message processing because we had too many "cooks in the kitchen" trying to determine whether a new device was really available or was this just another notification for something we're already processing. Also, not that I expect the udev code to change, but if a new errno is added then we may have to keep up... Always a fear especially if we're using the errno to dictate our algorithm.
This way, it's actually the handler thread who's 'the boss', as most of the signals from udevHandleCallback would be lost/NOPs.
Would you agree to such an approach?
At some point we get too fancy and think too hard about a problem letting it consume us. I think when I presented my idea - I wasn't really understanding the details of how we consume the udev data. Now that I have a bit more of a clue - perhaps the original idea wasn't so good. If we receive multiple notifications that a device is ready to be processed even though we could be processing it - how are we to truly know whether we missed one or we really got it and udev was just reminding us again. I'm not against looking at a different mechanism - the question then becomes from your perspective is it worth it? John
Erik
nevents? Wouldn't this totally obviate the need for nevents?
I would think it would be overkill to disable/enable for just 1 event. If multiple are queued, then yeah we're starting to get behind... Of course that could effect the re-enable when events == 1 (or some MAGIC_CONSTANT_NUMBER_THRESHOLD_TO_DISABLE_POLLING)
This magic threshold approach still wouldn't work because you don't know anything about the event at that point, you'd have to process them to actually find out whether that's a legitimate event or it's an already noted event that hasn't been processed yet, so it would still mess with your counters.
IIRC from the previous trip down this rabbit hole (I did briefly go back and read the comments) that what you're trying to avoid is the following message spamming the error logs because of a scheduling mishap... Thus perhaps the following message could only be displayed if nevents > 1 and nothing was found knowing that we have this "timing problem" between udev, this thread, and the fd polling scanner of when a device should be "found"... and then reset nevents == 0 with the appropriate lock.
Curious on your thoughts before an R-B/ACK though.
John
if (!device) { virReportError(VIR_ERR_INTERNAL_ERROR, "%s", _("udev_monitor_receive_device returned NULL")); @@ -1750,10 +1755,12 @@ udevEventHandleCallback(int watch ATTRIBUTE_UNUSED, int events ATTRIBUTE_UNUSED, void *opaque) { + udevPrivate *priv = NULL; struct udev_monitor *udev_monitor = NULL; udevEventThreadDataPtr threadData = opaque;
nodeDeviceLock(); + priv = driver->privateData; udev_monitor = DRV_STATE_UDEV_MONITOR(driver);
if (!udevEventCheckMonitorFD(udev_monitor, fd)) { @@ -1771,6 +1778,16 @@ udevEventHandleCallback(int watch ATTRIBUTE_UNUSED, threadData->nevents++; virCondSignal(&threadData->threadCond); virMutexUnlock(&threadData->lock); + + /* Due to scheduling, the eventloop might poll the udev monitor before the + * handler thread actually removes the data from the socket, thus causing it + * to spam errors about data not being ready yet, because + * udevEventHandleCallback would be invoked multiple times incrementing the + * counter of events queuing for a single event. + * Therefore we need to disable polling on the fd until the thread actually + * removes the data from the socket. + */ + virEventUpdateHandle(priv->watch, 0); }

On Fri, Sep 29, 2017 at 09:46:48AM -0400, John Ferlan wrote:
On 09/28/2017 06:00 AM, Erik Skultety wrote:
[...]
nodeDeviceLock(); + priv = driver->privateData; udev_monitor = DRV_STATE_UDEV_MONITOR(driver);
if (!udevEventCheckMonitorFD(udev_monitor, privateData->monitor_fd)) { @@ -1725,6 +1727,9 @@ udevEventHandleThread(void *opaque) device = udev_monitor_receive_device(udev_monitor); nodeDeviceUnlock();
+ /* Re-enable polling for new events on the @udev_monitor */ + virEventUpdateHandle(priv->watch, VIR_EVENT_HANDLE_READABLE); +
I think this should only be done when privateData->nevents == 0? If we have multiple events to read, then calling virEventPollUpdateHandle, (eventually) for every pass through the loop seems like a bit of overkill especially if udevEventHandleCallback turns right around and disables it again.
Also fortunately there isn't more than one udev thread sending the events since you access the priv->watch without the driver lock...
Conversely perhaps we only disable if events > 1... Then again, how does one get to 2 events queued if we're disabling as soon as we increment
Very good point, technically events would still get queued, we just wouldn't check and yes, we would process 1 event at a time. Not optimal, but if you look at the original code and compare it with this one performance-wise (and I hope I haven't missed anything that would render everything I write next a complete rubbish), the time complexity hasn't changed, the space complexity hasn't changed, what changed is code complexity which makes the code a bit slower due to the excessive locking and toggling the FD polling back and forth. So essentially you've got the same thing as you had before..but asynchronous. However, yes, the usage of @nevents is completely useless now (haven't realized that immediately, thanks) and a simple signalling should suffice.
Having it "slower" is necessarily bad ;-) That gives some of the other slower buggers a chance to fill in the details we need. Throwing the control back to udev quicker could aid in that too.
I'm sorry, but I don't follow, how do you hand control back over to something you can't control? udev is not paused in any way during the phase libvirt is processing the events, so it keeps pushing new events to the socket queue, until it can't push any more.
So how could we make it faster though? I thought more about the idea you shared in one of the previous reviews, letting the thread actually pull all the data from the monitor, to which I IIRC replied something in the sense that the event counting mechanism wouldn't allow that and it would break. Okay, let's drop the event counting. What if we now let both the udev handler thread and the event loop "poll" the file descriptor, IOW let the event loop polling the monitor fd, thus invoking udevHandleCallback which would in turn keep signalling the handler thread that there are some data. The difference now in the handler thread would be that it wouldn't blindly trust the callback about the data, because of the scheduling issue, it would keep poking the monitor itself until it gets either EAGAIN or EWOULDBLOCK from recvmsg() called in libudev, which would then be the signal to start waiting on the condition. The next an event appears, a signal from udevHandleCallback would finally have a meaning and wouldn't be ignored.
Is making it faster really a goal? It's preferable that it works
"I would think it would be overkill to disable/enable for just 1 event." I was trying to come up with something faster based on that statement.
consistently I think. The various errno possibilities and the desire to avoid the "udev_monitor_receive_device returned NULL" message processing because we had too many "cooks in the kitchen" trying to determine whether a new device was really available or was this just another notification for something we're already processing.
Also, not that I expect the udev code to change, but if a new errno is added then we may have to keep up... Always a fear especially if we're
Not necessarily, that would mean that either libudev replaces recvmsg with something else or that recvmsg starts returning more/different errnos to signal that there are no data to be pulled - since there are already 2 such errnos, it's highly unlikely IMHO. Unless libudev changes substantially, in terms of error signaling, I think we can safely ignore the other errnos as the actual outcome for libvirt is that libudev failed because of some socket error, but since that can either come from kernel or from libudev itself, the best thing we can do is shrug our shoulders and say "libudev failed for some reason, but we can't tell why".
using the errno to dictate our algorithm.
This way, it's actually the handler thread who's 'the boss', as most of the signals from udevHandleCallback would be lost/NOPs.
Would you agree to such an approach?
At some point we get too fancy and think too hard about a problem letting it consume us. I think when I presented my idea - I wasn't really understanding the details of how we consume the udev data. Now that I have a bit more of a clue - perhaps the original idea wasn't so good. If we receive multiple notifications that a device is ready to be processed even though we could be processing it - how are we to truly know whether we missed one or we really got it and udev was just reminding us again.
udev doesn't remind us of anything, it's our event loop that does - if we don't pre-process the events somehow, we can't tell them apart and even if we did, it would be utterly unreliable since you'd have to include some kind of nonce and a hash to be really sure that it's a duplicate event and not a device that was un-plugged and re-plugged again in a very short period of time. So, this issue is not a matter of our communication with udev, it's about how good can we manage our discover-pull-process event algorithm.
I'm not against looking at a different mechanism - the question then becomes from your perspective is it worth it?
I think the ultimate question is how fast do we want to deliver a fix. Of course I'm open to explore more approaches, but we can do that even with a "preliminary" fix included (even though, honestly, once this is merged, I doubt anyone will find time to do any kind of experiments unless there's another breakage to fix in this code). So, based on what I just wrote, I'm inclined to say no, it really wouldn't be of much worth. Even though I came with another proposal, I personally still like the previous version with one-by-one event version, since despite (perhaps) being a bit slower, it's a more transparent and consistent (minus the event counter) solution than the one I proposed the last time. Erik

On 10/02/2017 03:19 AM, Erik Skultety wrote:
On Fri, Sep 29, 2017 at 09:46:48AM -0400, John Ferlan wrote:
On 09/28/2017 06:00 AM, Erik Skultety wrote:
[...]
nodeDeviceLock(); + priv = driver->privateData; udev_monitor = DRV_STATE_UDEV_MONITOR(driver);
if (!udevEventCheckMonitorFD(udev_monitor, privateData->monitor_fd)) { @@ -1725,6 +1727,9 @@ udevEventHandleThread(void *opaque) device = udev_monitor_receive_device(udev_monitor); nodeDeviceUnlock();
+ /* Re-enable polling for new events on the @udev_monitor */ + virEventUpdateHandle(priv->watch, VIR_EVENT_HANDLE_READABLE); +
I think this should only be done when privateData->nevents == 0? If we have multiple events to read, then calling virEventPollUpdateHandle, (eventually) for every pass through the loop seems like a bit of overkill especially if udevEventHandleCallback turns right around and disables it again.
Also fortunately there isn't more than one udev thread sending the events since you access the priv->watch without the driver lock...
Conversely perhaps we only disable if events > 1... Then again, how does one get to 2 events queued if we're disabling as soon as we increment
Very good point, technically events would still get queued, we just wouldn't check and yes, we would process 1 event at a time. Not optimal, but if you look at the original code and compare it with this one performance-wise (and I hope I haven't missed anything that would render everything I write next a complete rubbish), the time complexity hasn't changed, the space complexity hasn't changed, what changed is code complexity which makes the code a bit slower due to the excessive locking and toggling the FD polling back and forth. So essentially you've got the same thing as you had before..but asynchronous. However, yes, the usage of @nevents is completely useless now (haven't realized that immediately, thanks) and a simple signalling should suffice.
Having it "slower" is necessarily bad ;-) That gives some of the other slower buggers a chance to fill in the details we need. Throwing the control back to udev quicker could aid in that too.
I'm sorry, but I don't follow, how do you hand control back over to something you can't control? udev is not paused in any way during the phase libvirt is processing the events, so it keeps pushing new events to the socket queue, until it can't push any more.
Hmmm... If the problem we're chasing is that some other consumer isn't processing "fast enough" for our needs (for whatever reason), then by perhaps making the libvirt implementation a bit slower because now we have to "wait" for the thread to be signaled that something's there, then perhaps, just maybe, with any good fortune the thing that was causing the problem before would have "just enough time" to complete such that we'll never have to enter the timing loop in subsequent patches. Hopefully that helps explain.
So how could we make it faster though? I thought more about the idea you shared in one of the previous reviews, letting the thread actually pull all the data from the monitor, to which I IIRC replied something in the sense that the event counting mechanism wouldn't allow that and it would break. Okay, let's drop the event counting. What if we now let both the udev handler thread and the event loop "poll" the file descriptor, IOW let the event loop polling the monitor fd, thus invoking udevHandleCallback which would in turn keep signalling the handler thread that there are some data. The difference now in the handler thread would be that it wouldn't blindly trust the callback about the data, because of the scheduling issue, it would keep poking the monitor itself until it gets either EAGAIN or EWOULDBLOCK from recvmsg() called in libudev, which would then be the signal to start waiting on the condition. The next an event appears, a signal from udevHandleCallback would finally have a meaning and wouldn't be ignored.
Is making it faster really a goal? It's preferable that it works
"I would think it would be overkill to disable/enable for just 1 event." I was trying to come up with something faster based on that statement.
True and understood... But can the thread rely on two factors - "A" we were told there was an event and "B" we process until we get the NULL (w/ EAGAIN) from the udev_monitor_receive_device. If so, then having the thread process until empty is perhaps the way out of the disable/enable for just 1 event and keeping track of the count of events. If not then I think I agree with the sentiment later in this response of one-by-one version. If we go with process until NULL type logic, would the thread need some sort of wakeup timeout to ensure we don't miss something because of some scheduling thing? I don't think so, but had to ask anyway.
consistently I think. The various errno possibilities and the desire to avoid the "udev_monitor_receive_device returned NULL" message processing because we had too many "cooks in the kitchen" trying to determine whether a new device was really available or was this just another notification for something we're already processing.
Also, not that I expect the udev code to change, but if a new errno is added then we may have to keep up... Always a fear especially if we're
Not necessarily, that would mean that either libudev replaces recvmsg with something else or that recvmsg starts returning more/different errnos to signal that there are no data to be pulled - since there are already 2 such errnos, it's highly unlikely IMHO. Unless libudev changes substantially, in terms of error signaling, I think we can safely ignore the other errnos as the actual outcome for libvirt is that libudev failed because of some socket error, but since that can either come from kernel or from libudev itself, the best thing we can do is shrug our shoulders and say "libudev failed for some reason, but we can't tell why".
OK - fair enough... Just exhausting all options/excuses!
using the errno to dictate our algorithm.
This way, it's actually the handler thread who's 'the boss', as most of the signals from udevHandleCallback would be lost/NOPs.
Would you agree to such an approach?
At some point we get too fancy and think too hard about a problem letting it consume us. I think when I presented my idea - I wasn't really understanding the details of how we consume the udev data. Now that I have a bit more of a clue - perhaps the original idea wasn't so good. If we receive multiple notifications that a device is ready to be processed even though we could be processing it - how are we to truly know whether we missed one or we really got it and udev was just reminding us again.
udev doesn't remind us of anything, it's our event loop that does - if we don't pre-process the events somehow, we can't tell them apart and even if we did, it would be utterly unreliable since you'd have to include some kind of nonce and a hash to be really sure that it's a duplicate event and not a device that was un-plugged and re-plugged again in a very short period of time. So, this issue is not a matter of our communication with udev, it's about how good can we manage our discover-pull-process event algorithm.
OK - that's just me and my shortcut interpretation of how we're notified.
I'm not against looking at a different mechanism - the question then becomes from your perspective is it worth it?
I think the ultimate question is how fast do we want to deliver a fix. Of course I'm open to explore more approaches, but we can do that even with a "preliminary" fix included (even though, honestly, once this is merged, I doubt anyone will find time to do any kind of experiments unless there's another breakage to fix in this code). So, based on what I just wrote, I'm inclined to say no, it really wouldn't be of much worth.
Can we win the race to fix the root-cause? I think we've explored our alternatives thoroughly and am fine with libvirt making an adjustment since we cannot rely on when something else is fixed. There are others that have a different opinion. Still while it's MDEV this time, the last time it was VHBA, and the next time it will be something different - guaranteed. If we can provide a way to avoid our customers from having to deal with "intermittent" and "unexplained" failures because we're waiting for someone else to fix the problem, then I say we fix it and move on. Especially when we can understand the problem. If eventually they fix their problem, then our solution gathers cobwebs, but so what.
Even though I came with another proposal, I personally still like the previous version with one-by-one event version, since despite (perhaps) being a bit slower, it's a more transparent and consistent (minus the event counter) solution than the one I proposed the last time.
Erik
I don't have that version paged into short term memory ;-)... If you have a thread that processes events until udev_monitor_receive_device returns NULL, then does that do the trick? Regardless of event counting? John

[...]
Even though I came with another proposal, I personally still like the previous version with one-by-one event version, since despite (perhaps) being a bit slower, it's a more transparent and consistent (minus the event counter) solution than the one I proposed the last time.
Erik
I don't have that version paged into short term memory ;-)... If you have a thread that processes events until udev_monitor_receive_device returns NULL, then does that do the trick? Regardless of event counting?
Well, so far it's just an idea, I haven't tried it yet, it's just it appeared to me like it could. I'll try and get back with results. Erik

Since we have a number of places where we workaround timing issues with devices, attributes (files in general) not being available at the time of processing them by calling usleep in a loop for a fixed number of tries, we could as well have a utility function that would do that. Therefore we won't have to duplicate this ugly workaround even more. Signed-off-by: Erik Skultety <eskultet@redhat.com> --- src/libvirt_private.syms | 1 + src/util/virfile.c | 31 +++++++++++++++++++++++++++++++ src/util/virfile.h | 2 ++ 3 files changed, 34 insertions(+) diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index d12db9b3f..57a5ea300 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -1730,6 +1730,7 @@ virFileStripSuffix; virFileTouch; virFileUnlock; virFileUpdatePerm; +virFileWaitForExists; virFileWrapperFdClose; virFileWrapperFdFree; virFileWrapperFdNew; diff --git a/src/util/virfile.c b/src/util/virfile.c index 2f28e83f4..01cc61e98 100644 --- a/src/util/virfile.c +++ b/src/util/virfile.c @@ -4166,3 +4166,34 @@ virFileReadValueString(char **value, const char *format, ...) VIR_FREE(str); return ret; } + + +/** + * virFileWaitForExists: + * @path: absolute path to a sysfs attribute (can be a symlink) + * @ms: how long to wait (in milliseconds) + * @tries: how many times should we try to wait for @path to become accessible + * + * Checks the existence of @path. In case the file defined by @path + * doesn't exist, we wait for it to appear in @ms milliseconds (for up to + * @tries attempts). + * + * Returns 0 on success, -1 on error, setting errno appropriately. + */ +int +virFileWaitForExists(const char *path, + size_t ms, + size_t tries) +{ + errno = 0; + + /* wait for @path to be accessible in @ms milliseconds, up to @tries */ + while (tries-- > 0 && !virFileExists(path)) { + if (tries == 0 || errno != ENOENT) + return -1; + + usleep(ms * 1000); + } + + return 0; +} diff --git a/src/util/virfile.h b/src/util/virfile.h index 57ceb8072..310a8e7f6 100644 --- a/src/util/virfile.h +++ b/src/util/virfile.h @@ -347,6 +347,8 @@ int virFileReadValueScaledInt(unsigned long long *value, const char *format, ... int virFileReadValueString(char **value, const char *format, ...) ATTRIBUTE_FMT_PRINTF(2, 3); +int virFileWaitForExists(const char *path, size_t ms, size_t tries); + int virFileInData(int fd, int *inData, -- 2.13.5

On 09/18/2017 12:34 PM, Erik Skultety wrote:
Since we have a number of places where we workaround timing issues with devices, attributes (files in general) not being available at the time of processing them by calling usleep in a loop for a fixed number of tries, we could as well have a utility function that would do that. Therefore we won't have to duplicate this ugly workaround even more.
Signed-off-by: Erik Skultety <eskultet@redhat.com> --- src/libvirt_private.syms | 1 + src/util/virfile.c | 31 +++++++++++++++++++++++++++++++ src/util/virfile.h | 2 ++ 3 files changed, 34 insertions(+)
Reviewed-by: John Ferlan <jferlan@redhat.com> John

If we find ourselves in the situation that the 'add' uevent has been fired earlier than the sysfs tree for a device was created, we should use the best-effort approach and give kernel some predetermined amount of time, thus waiting for the attributes to be ready rather than discarding the device from our device list forever. If those don't appear in the given time frame, we need to move on, since libvirt can't wait indefinitely. Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1463285 Signed-off-by: Erik Skultety <eskultet@redhat.com> --- src/node_device/node_device_udev.c | 16 +++++++++++++++- 1 file changed, 15 insertions(+), 1 deletion(-) diff --git a/src/node_device/node_device_udev.c b/src/node_device/node_device_udev.c index 70e15ffb8..2f63256a3 100644 --- a/src/node_device/node_device_udev.c +++ b/src/node_device/node_device_udev.c @@ -1166,9 +1166,23 @@ udevProcessMediatedDevice(struct udev_device *dev, char *canonicalpath = NULL; virNodeDevCapMdevPtr data = &def->caps->data.mdev; - if (virAsprintf(&linkpath, "%s/mdev_type", udev_device_get_syspath(dev)) < 0) + /* Because of a kernel uevent race, we might get the 'add' event prior to + * the sysfs tree being ready, so any attempt to access any sysfs attribute + * would result in ENOENT and us dropping the device, so let's work around + * it by waiting for the attributes to become available. + */ + + if (virAsprintf(&linkpath, "%s/mdev_type", + udev_device_get_syspath(dev)) < 0) goto cleanup; + if (virFileWaitForExists(linkpath, 1, 100) < 0) { + virReportSystemError(errno, + _("failed to wait for file '%s' to appear"), + linkpath); + goto cleanup; + } + if (virFileResolveLink(linkpath, &canonicalpath) < 0) { virReportSystemError(errno, _("failed to resolve '%s'"), linkpath); goto cleanup; -- 2.13.5

On 09/18/2017 12:34 PM, Erik Skultety wrote:
If we find ourselves in the situation that the 'add' uevent has been fired earlier than the sysfs tree for a device was created, we should use the best-effort approach and give kernel some predetermined amount of time, thus waiting for the attributes to be ready rather than discarding the device from our device list forever. If those don't appear in the given time frame, we need to move on, since libvirt can't wait indefinitely.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1463285
Signed-off-by: Erik Skultety <eskultet@redhat.com> --- src/node_device/node_device_udev.c | 16 +++++++++++++++- 1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/src/node_device/node_device_udev.c b/src/node_device/node_device_udev.c index 70e15ffb8..2f63256a3 100644 --- a/src/node_device/node_device_udev.c +++ b/src/node_device/node_device_udev.c @@ -1166,9 +1166,23 @@ udevProcessMediatedDevice(struct udev_device *dev, char *canonicalpath = NULL; virNodeDevCapMdevPtr data = &def->caps->data.mdev;
- if (virAsprintf(&linkpath, "%s/mdev_type", udev_device_get_syspath(dev)) < 0) + /* Because of a kernel uevent race, we might get the 'add' event prior to + * the sysfs tree being ready, so any attempt to access any sysfs attribute + * would result in ENOENT and us dropping the device, so let's work around + * it by waiting for the attributes to become available. + */ + + if (virAsprintf(&linkpath, "%s/mdev_type", + udev_device_get_syspath(dev)) < 0) goto cleanup;
+ if (virFileWaitForExists(linkpath, 1, 100) < 0) { + virReportSystemError(errno, + _("failed to wait for file '%s' to appear"), + linkpath); + goto cleanup; + } +
So the linkpath gets created after the canonicalpath... and we don't have to check that right? Considering what I pointed out in my previous review - I wouldn't be able to use virFileWaitForExists, but a similar loop would be possible. For NPIV the file exists, it's the contents that are bogus momentarily. Other consumers waiting that I looked at usually wait on some sort of open() and usleep() when ENOENT is returned (there's multiple examples if you search on ulseep). Wonder if any of those could utilize something like this... I know patches welcome ;-) Reviewed-by: John Ferlan <jferlan@redhat.com>
if (virFileResolveLink(linkpath, &canonicalpath) < 0) { virReportSystemError(errno, _("failed to resolve '%s'"), linkpath); goto cleanup;

On Wed, Sep 20, 2017 at 10:33:14AM -0400, John Ferlan wrote:
On 09/18/2017 12:34 PM, Erik Skultety wrote:
If we find ourselves in the situation that the 'add' uevent has been fired earlier than the sysfs tree for a device was created, we should use the best-effort approach and give kernel some predetermined amount of time, thus waiting for the attributes to be ready rather than discarding the device from our device list forever. If those don't appear in the given time frame, we need to move on, since libvirt can't wait indefinitely.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1463285
Signed-off-by: Erik Skultety <eskultet@redhat.com> --- src/node_device/node_device_udev.c | 16 +++++++++++++++- 1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/src/node_device/node_device_udev.c b/src/node_device/node_device_udev.c index 70e15ffb8..2f63256a3 100644 --- a/src/node_device/node_device_udev.c +++ b/src/node_device/node_device_udev.c @@ -1166,9 +1166,23 @@ udevProcessMediatedDevice(struct udev_device *dev, char *canonicalpath = NULL; virNodeDevCapMdevPtr data = &def->caps->data.mdev;
- if (virAsprintf(&linkpath, "%s/mdev_type", udev_device_get_syspath(dev)) < 0) + /* Because of a kernel uevent race, we might get the 'add' event prior to + * the sysfs tree being ready, so any attempt to access any sysfs attribute + * would result in ENOENT and us dropping the device, so let's work around + * it by waiting for the attributes to become available. + */ + + if (virAsprintf(&linkpath, "%s/mdev_type", + udev_device_get_syspath(dev)) < 0) goto cleanup;
+ if (virFileWaitForExists(linkpath, 1, 100) < 0) { + virReportSystemError(errno, + _("failed to wait for file '%s' to appear"), + linkpath); + goto cleanup; + } +
So the linkpath gets created after the canonicalpath... and we don't
Well, I would assume so, I believe that the kernel wouldn't create symlinks first and the canonical paths second, such a design would be sooo broken. Thanks, Erik
have to check that right?
Considering what I pointed out in my previous review - I wouldn't be able to use virFileWaitForExists, but a similar loop would be possible. For NPIV the file exists, it's the contents that are bogus momentarily.
Other consumers waiting that I looked at usually wait on some sort of open() and usleep() when ENOENT is returned (there's multiple examples if you search on ulseep). Wonder if any of those could utilize something like this... I know patches welcome ;-)
Reviewed-by: John Ferlan <jferlan@redhat.com>
if (virFileResolveLink(linkpath, &canonicalpath) < 0) { virReportSystemError(errno, _("failed to resolve '%s'"), linkpath); goto cleanup;
participants (2)
-
Erik Skultety
-
John Ferlan