[libvirt] [PATCH v3 0/6] Work around the kernel mdev uevent race in nodedev

v2 here: https://www.redhat.com/archives/libvir-list/2017-July/msg01268.html Since v2: - added patch 4/6 that fixes the issue with the handler thread spamming logs with "udev_monitor_receive_device returned NULL" -> the event loop callback now disables polling on the udev monitor's fd every time there's a new event, leaving the responsibility for re-enabling it back to the handler thread once it had removed the corresponding data from the socket. Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1463285 Erik Skultety (6): nodedev: Introduce udevCheckMonitorFD helper function udev: Split udevEventHandleCallback in two functions udev: Convert udevEventHandleThread to an actual thread routine nodedev: Disable/re-enable polling on the udev fd util: Introduce virFileWaitForAccess nodedev: Work around the uevent race by hooking up virFileWaitForAccess src/libvirt_private.syms | 1 + src/node_device/node_device_udev.c | 208 ++++++++++++++++++++++++++++++++----- src/util/virfile.c | 29 ++++++ src/util/virfile.h | 2 + 4 files changed, 214 insertions(+), 26 deletions(-) -- 2.13.3

We need to perform some sanity checks on the udev monitor before every use so that we know nothing changed in the meantime. The reason for moving the code to a separate function is to be able to perform the same check from a worker thread that will replace the udevEventHandleCallback in terms of poking udev for device events. Signed-off-by: Erik Skultety <eskultet@redhat.com> --- src/node_device/node_device_udev.c | 57 ++++++++++++++++++++++++++------------ 1 file changed, 39 insertions(+), 18 deletions(-) diff --git a/src/node_device/node_device_udev.c b/src/node_device/node_device_udev.c index f4177455c..465d272ba 100644 --- a/src/node_device/node_device_udev.c +++ b/src/node_device/node_device_udev.c @@ -1615,47 +1615,68 @@ udevHandleOneDevice(struct udev_device *device) } -static void -udevEventHandleCallback(int watch ATTRIBUTE_UNUSED, - int fd, - int events ATTRIBUTE_UNUSED, - void *data ATTRIBUTE_UNUSED) +/* the caller must be holding the driver lock prior to calling this function */ +static bool +udevCheckMonitorFD(struct udev_monitor *udev_monitor, int fd) { - struct udev_device *device = NULL; - struct udev_monitor *udev_monitor = DRV_STATE_UDEV_MONITOR(driver); - int udev_fd = -1; + int real_fd = -1; - udev_fd = udev_monitor_get_fd(udev_monitor); - if (fd != udev_fd) { + /* sanity check that the monitor socket hasn't changed */ + real_fd = udev_monitor_get_fd(udev_monitor); + + if (real_fd != fd) { udevPrivate *priv = driver->privateData; virReportError(VIR_ERR_INTERNAL_ERROR, _("File descriptor returned by udev %d does not " "match node device file descriptor %d"), - fd, udev_fd); + real_fd, fd); - /* this is a non-recoverable error, let's remove the handle, so that we - * don't get in here again because of some spurious behaviour and report - * the same error multiple times + /* this is a non-recoverable error, thus the event handle has to be + * removed, so that we don't get into the handler again because of some + * spurious behaviour */ virEventRemoveHandle(priv->watch); priv->watch = -1; - goto cleanup; + return false; } - device = udev_monitor_receive_device(udev_monitor); - if (device == NULL) { + return true; +} + + +static void +udevEventHandleCallback(int watch ATTRIBUTE_UNUSED, + int fd, + int events ATTRIBUTE_UNUSED, + void *data ATTRIBUTE_UNUSED) +{ + struct udev_device *device = NULL; + struct udev_monitor *udev_monitor = NULL; + + nodeDeviceLock(); + udev_monitor = DRV_STATE_UDEV_MONITOR(driver); + + if (!udevCheckMonitorFD(udev_monitor, fd)) + goto unlock; + + if (!(device = udev_monitor_receive_device(udev_monitor))) { virReportError(VIR_ERR_INTERNAL_ERROR, "%s", _("udev_monitor_receive_device returned NULL")); - goto cleanup; + goto unlock; } + nodeDeviceUnlock(); udevHandleOneDevice(device); cleanup: udev_device_unref(device); return; + + unlock: + nodeDeviceUnlock(); + goto cleanup; } -- 2.13.3

On 08/24/2017 07:23 AM, Erik Skultety wrote:
We need to perform some sanity checks on the udev monitor before every use so that we know nothing changed in the meantime. The reason for moving the code to a separate function is to be able to perform the same check from a worker thread that will replace the udevEventHandleCallback in terms of poking udev for device events.
Signed-off-by: Erik Skultety <eskultet@redhat.com> --- src/node_device/node_device_udev.c | 57 ++++++++++++++++++++++++++------------ 1 file changed, 39 insertions(+), 18 deletions(-)
diff --git a/src/node_device/node_device_udev.c b/src/node_device/node_device_udev.c index f4177455c..465d272ba 100644 --- a/src/node_device/node_device_udev.c +++ b/src/node_device/node_device_udev.c @@ -1615,47 +1615,68 @@ udevHandleOneDevice(struct udev_device *device) }
-static void -udevEventHandleCallback(int watch ATTRIBUTE_UNUSED, - int fd, - int events ATTRIBUTE_UNUSED, - void *data ATTRIBUTE_UNUSED) +/* the caller must be holding the driver lock prior to calling this function */ +static bool +udevCheckMonitorFD(struct udev_monitor *udev_monitor, int fd)
One line for each argument... I think in keeping with other functions - this should be named 'udevEventCheckMonitorFD'
{ - struct udev_device *device = NULL; - struct udev_monitor *udev_monitor = DRV_STATE_UDEV_MONITOR(driver); - int udev_fd = -1; + int real_fd = -1;
- udev_fd = udev_monitor_get_fd(udev_monitor); - if (fd != udev_fd) { + /* sanity check that the monitor socket hasn't changed */ + real_fd = udev_monitor_get_fd(udev_monitor); + + if (real_fd != fd) { udevPrivate *priv = driver->privateData;
virReportError(VIR_ERR_INTERNAL_ERROR, _("File descriptor returned by udev %d does not " "match node device file descriptor %d"), - fd, udev_fd); + real_fd, fd);
- /* this is a non-recoverable error, let's remove the handle, so that we - * don't get in here again because of some spurious behaviour and report - * the same error multiple times + /* this is a non-recoverable error, thus the event handle has to be + * removed, so that we don't get into the handler again because of some + * spurious behaviour */ virEventRemoveHandle(priv->watch); priv->watch = -1;
Hmmm... thinking a couple patches later - this would seem to be a good candidate for something that udevEventThreadDataFree could do (as long as it held the appropriate lock of course). It's almost too bad we couldn't somehow "pretend" to have a restart... different problem though!
- goto cleanup; + return false; }
- device = udev_monitor_receive_device(udev_monitor); - if (device == NULL) { + return true; +} + + +static void +udevEventHandleCallback(int watch ATTRIBUTE_UNUSED, + int fd, + int events ATTRIBUTE_UNUSED, + void *data ATTRIBUTE_UNUSED) +{ + struct udev_device *device = NULL; + struct udev_monitor *udev_monitor = NULL; + + nodeDeviceLock(); + udev_monitor = DRV_STATE_UDEV_MONITOR(driver);
Technically there's a couple of things going on here - including the addition of the nodeDevice{Lock|Unlock}. Probably would have been best to make the split first, then add the Lock/Unlock afterwards (or vice versa). Still once I get to patch 3, I began wondering how the udev interaction works.
+ + if (!udevCheckMonitorFD(udev_monitor, fd)) + goto unlock;> + + if (!(device = udev_monitor_receive_device(udev_monitor))) { virReportError(VIR_ERR_INTERNAL_ERROR, "%s", _("udev_monitor_receive_device returned NULL"));
I almost wonder if it would be better to have this be a ReportSystemError w/ errno... Not that udev docs give any guidance there, but maybe it'd help.
- goto cleanup; + goto unlock;
I know this is "existing"; however, if !device, then what's the purpose of calling udev_device_unref(NULL)? In fact there's one path in udevGetDMIData which actually checks != NULL before calling - although it has no reason to since I see no way for it to be NULL at cleanup (again a different issue). Also, wouldn't nodeDeviceLock/Unlock wrapping only be necessary within udevCheckMonitorFD? Why would the udev call need to be wrapped as well? John
}
+ nodeDeviceUnlock(); udevHandleOneDevice(device);
cleanup: udev_device_unref(device); return; + + unlock: + nodeDeviceUnlock(); + goto cleanup; }

[...]
+ + if (!udevCheckMonitorFD(udev_monitor, fd)) + goto unlock;> + + if (!(device = udev_monitor_receive_device(udev_monitor))) { virReportError(VIR_ERR_INTERNAL_ERROR, "%s", _("udev_monitor_receive_device returned NULL"));
I almost wonder if it would be better to have this be a ReportSystemError w/ errno... Not that udev docs give any guidance there, but maybe it'd help.
Not really, that would imply that we can rely on libudev setting the errno, but the call can fail in multiple ways the most of which are unrelated to any system call and given the poor interface libudev has :/ we don't have much of a choice (but yeah, the error is rather uninformative, but since you have no control over it, it's the best we have IMO).
- goto cleanup; + goto unlock;
I know this is "existing"; however, if !device, then what's the purpose of calling udev_device_unref(NULL)? In fact there's one path in udevGetDMIData which actually checks != NULL before calling - although it has no reason to since I see no way for it to be NULL at cleanup (again a different issue).
Also, wouldn't nodeDeviceLock/Unlock wrapping only be necessary within udevCheckMonitorFD? Why would the udev call need to be wrapped as well?
So, IMHO, again, I'm still not convinced about the whole file descriptor changing under our hands idea (unrelated - was it actually the correct wording?). But since the general consensus was to keep the sanity checks, I moved the @fd extraction within the locks. Now, if we presume such thing can happen, you don't want anyone touching the driver structure in the meantime, otherwise you're left with a dangling pointer @udev_monitor and bad things would happen. This way, you don't have to worry about the driver's structure getting changed, possibly invalidating the file descriptor (not that we were, but if we really would/should...). Erik

This patch splits udevEventHandleCallback in two (introduces udevEventHandleThread) in order to be later able to refactor the latter to actually become a detached thread which will wait some time for the kernel to create the whole sysfs tree for a device as we cannot do that in the event loop directly. Signed-off-by: Erik Skultety <eskultet@redhat.com> --- src/node_device/node_device_udev.c | 54 ++++++++++++++++++++++++++------------ 1 file changed, 37 insertions(+), 17 deletions(-) diff --git a/src/node_device/node_device_udev.c b/src/node_device/node_device_udev.c index 465d272ba..fe943c78b 100644 --- a/src/node_device/node_device_udev.c +++ b/src/node_device/node_device_udev.c @@ -1647,36 +1647,56 @@ udevCheckMonitorFD(struct udev_monitor *udev_monitor, int fd) static void +udevEventHandleThread(void *opaque ATTRIBUTE_UNUSED) +{ + struct udev_device *device = NULL; + struct udev_monitor *udev_monitor = NULL; + int fd = (intptr_t) opaque; + + nodeDeviceLock(); + udev_monitor = DRV_STATE_UDEV_MONITOR(driver); + + if (!udevCheckMonitorFD(udev_monitor, fd)) + goto unlock; + + device = udev_monitor_receive_device(udev_monitor); + if (device == NULL) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("udev_monitor_receive_device returned NULL")); + goto unlock; + } + + nodeDeviceUnlock(); + udevHandleOneDevice(device); + + cleanup: + udev_device_unref(device); + return; + + unlock: + nodeDeviceUnlock(); + goto cleanup; +} + + +static void udevEventHandleCallback(int watch ATTRIBUTE_UNUSED, int fd, int events ATTRIBUTE_UNUSED, void *data ATTRIBUTE_UNUSED) { - struct udev_device *device = NULL; struct udev_monitor *udev_monitor = NULL; nodeDeviceLock(); udev_monitor = DRV_STATE_UDEV_MONITOR(driver); - if (!udevCheckMonitorFD(udev_monitor, fd)) - goto unlock; - - if (!(device = udev_monitor_receive_device(udev_monitor))) { - virReportError(VIR_ERR_INTERNAL_ERROR, "%s", - _("udev_monitor_receive_device returned NULL")); - goto unlock; + if (!udevCheckMonitorFD(udev_monitor, fd)) { + nodeDeviceUnlock(); + return; } - nodeDeviceUnlock(); - udevHandleOneDevice(device); - - cleanup: - udev_device_unref(device); - return; - unlock: - nodeDeviceUnlock(); - goto cleanup; + udevEventHandleThread((void *)(intptr_t) fd); } -- 2.13.3

On 08/24/2017 07:23 AM, Erik Skultety wrote:
This patch splits udevEventHandleCallback in two (introduces udevEventHandleThread) in order to be later able to refactor the latter to actually become a detached thread which will wait some time for the kernel to create the whole sysfs tree for a device as we cannot do that in the event loop directly.
Signed-off-by: Erik Skultety <eskultet@redhat.com> --- src/node_device/node_device_udev.c | 54 ++++++++++++++++++++++++++------------ 1 file changed, 37 insertions(+), 17 deletions(-)
diff --git a/src/node_device/node_device_udev.c b/src/node_device/node_device_udev.c index 465d272ba..fe943c78b 100644 --- a/src/node_device/node_device_udev.c +++ b/src/node_device/node_device_udev.c @@ -1647,36 +1647,56 @@ udevCheckMonitorFD(struct udev_monitor *udev_monitor, int fd)
static void +udevEventHandleThread(void *opaque ATTRIBUTE_UNUSED)
It's not ATTRIBUTE_UNUSED
+{ + struct udev_device *device = NULL; + struct udev_monitor *udev_monitor = NULL; + int fd = (intptr_t) opaque;
^^^ opaque!
+ + nodeDeviceLock(); + udev_monitor = DRV_STATE_UDEV_MONITOR(driver); + + if (!udevCheckMonitorFD(udev_monitor, fd)) + goto unlock; + + device = udev_monitor_receive_device(udev_monitor); + if (device == NULL) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("udev_monitor_receive_device returned NULL")); + goto unlock; + } + + nodeDeviceUnlock(); + udevHandleOneDevice(device); + + cleanup: + udev_device_unref(device); + return; + + unlock: + nodeDeviceUnlock(); + goto cleanup; +} + + +static void udevEventHandleCallback(int watch ATTRIBUTE_UNUSED, int fd, int events ATTRIBUTE_UNUSED, void *data ATTRIBUTE_UNUSED) { - struct udev_device *device = NULL; struct udev_monitor *udev_monitor = NULL;
nodeDeviceLock(); udev_monitor = DRV_STATE_UDEV_MONITOR(driver);
- if (!udevCheckMonitorFD(udev_monitor, fd)) - goto unlock; - - if (!(device = udev_monitor_receive_device(udev_monitor))) { - virReportError(VIR_ERR_INTERNAL_ERROR, "%s", - _("udev_monitor_receive_device returned NULL")); - goto unlock; + if (!udevCheckMonitorFD(udev_monitor, fd)) { + nodeDeviceUnlock(); + return; }
The above sequence will be done twice - once here and repeated again in HandleThread. I understand why, but as I get to patch 3 - I'm not sure things are going to work as expected. John
- nodeDeviceUnlock(); - udevHandleOneDevice(device); - - cleanup: - udev_device_unref(device); - return;
- unlock: - nodeDeviceUnlock(); - goto cleanup; + udevEventHandleThread((void *)(intptr_t) fd); }

On Mon, Aug 28, 2017 at 11:04:35AM -0400, John Ferlan wrote:
On 08/24/2017 07:23 AM, Erik Skultety wrote:
This patch splits udevEventHandleCallback in two (introduces udevEventHandleThread) in order to be later able to refactor the latter to actually become a detached thread which will wait some time for the kernel to create the whole sysfs tree for a device as we cannot do that in the event loop directly.
Signed-off-by: Erik Skultety <eskultet@redhat.com> --- src/node_device/node_device_udev.c | 54 ++++++++++++++++++++++++++------------ 1 file changed, 37 insertions(+), 17 deletions(-)
diff --git a/src/node_device/node_device_udev.c b/src/node_device/node_device_udev.c index 465d272ba..fe943c78b 100644 --- a/src/node_device/node_device_udev.c +++ b/src/node_device/node_device_udev.c @@ -1647,36 +1647,56 @@ udevCheckMonitorFD(struct udev_monitor *udev_monitor, int fd)
static void +udevEventHandleThread(void *opaque ATTRIBUTE_UNUSED)
It's not ATTRIBUTE_UNUSED
+{ + struct udev_device *device = NULL; + struct udev_monitor *udev_monitor = NULL; + int fd = (intptr_t) opaque;
^^^ opaque!
Right, good catch, thanks.
+ + nodeDeviceLock(); + udev_monitor = DRV_STATE_UDEV_MONITOR(driver); + + if (!udevCheckMonitorFD(udev_monitor, fd)) + goto unlock; + + device = udev_monitor_receive_device(udev_monitor); + if (device == NULL) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("udev_monitor_receive_device returned NULL")); + goto unlock; + } + + nodeDeviceUnlock(); + udevHandleOneDevice(device); + + cleanup: + udev_device_unref(device); + return; + + unlock: + nodeDeviceUnlock(); + goto cleanup; +} + + +static void udevEventHandleCallback(int watch ATTRIBUTE_UNUSED, int fd, int events ATTRIBUTE_UNUSED, void *data ATTRIBUTE_UNUSED) { - struct udev_device *device = NULL; struct udev_monitor *udev_monitor = NULL;
nodeDeviceLock(); udev_monitor = DRV_STATE_UDEV_MONITOR(driver);
- if (!udevCheckMonitorFD(udev_monitor, fd)) - goto unlock; - - if (!(device = udev_monitor_receive_device(udev_monitor))) { - virReportError(VIR_ERR_INTERNAL_ERROR, "%s", - _("udev_monitor_receive_device returned NULL")); - goto unlock; + if (!udevCheckMonitorFD(udev_monitor, fd)) { + nodeDeviceUnlock(); + return; }
The above sequence will be done twice - once here and repeated again in HandleThread. I understand why, but as I get to patch 3 - I'm not sure things are going to work as expected.
Okay, let's continue with the discussion in patch 3 comments :). Erik

Adjust udevEventHandleThread to be a proper thread routine running in an infinite loop handling devices. Also introduce udevEventThreadData private structure. Every time there's and incoming event from udev, udevEventHandleCallback only increments the number of events queuing on the monitor and signals the worker thread to handle them. Signed-off-by: Erik Skultety <eskultet@redhat.com> --- src/node_device/node_device_udev.c | 132 ++++++++++++++++++++++++++++++------- 1 file changed, 108 insertions(+), 24 deletions(-) diff --git a/src/node_device/node_device_udev.c b/src/node_device/node_device_udev.c index fe943c78b..444e5be4d 100644 --- a/src/node_device/node_device_udev.c +++ b/src/node_device/node_device_udev.c @@ -59,6 +59,54 @@ struct _udevPrivate { bool privileged; }; +typedef struct _udevEventThreadData udevEventThreadData; +typedef udevEventThreadData *udevEventThreadDataPtr; + +struct _udevEventThreadData { + virMutex lock; + virCond threadCond; + + bool threadQuit; + int monitor_fd; + unsigned long long nevents; /* number of udev events queuing on monitor */ +}; + + +static udevEventThreadDataPtr +udevEventThreadDataNew(int fd) +{ + udevEventThreadDataPtr ret = NULL; + + if (VIR_ALLOC_QUIET(ret) < 0) + return NULL; + + if (virMutexInit(&ret->lock) < 0) + goto cleanup; + + if (virCondInit(&ret->threadCond) < 0) + goto cleanup_mutex; + + ret->monitor_fd = fd; + + return ret; + + cleanup_mutex: + virMutexDestroy(&ret->lock); + + cleanup: + VIR_FREE(ret); + return NULL; +} + + +static void +udevEventThreadDataFree(udevEventThreadDataPtr data) +{ + virMutexDestroy(&data->lock); + virCondDestroy(&data->threadCond); + VIR_FREE(data); +} + static bool udevHasDeviceProperty(struct udev_device *dev, @@ -1647,35 +1695,53 @@ udevCheckMonitorFD(struct udev_monitor *udev_monitor, int fd) static void -udevEventHandleThread(void *opaque ATTRIBUTE_UNUSED) +udevEventHandleThread(void *opaque) { + udevEventThreadDataPtr privateData = opaque; struct udev_device *device = NULL; struct udev_monitor *udev_monitor = NULL; - int fd = (intptr_t) opaque; - nodeDeviceLock(); - udev_monitor = DRV_STATE_UDEV_MONITOR(driver); + /* continue rather than break from the loop on non-fatal errors */ + while (1) { + virMutexLock(&privateData->lock); + while (privateData->nevents == 0 && !privateData->threadQuit) { + if (virCondWait(&privateData->threadCond, &privateData->lock)) { + virReportSystemError(errno, "%s", + _("handler failed to wait on condition")); + goto cleanup; + } + } - if (!udevCheckMonitorFD(udev_monitor, fd)) - goto unlock; + privateData->nevents--; + virMutexUnlock(&privateData->lock); - device = udev_monitor_receive_device(udev_monitor); - if (device == NULL) { - virReportError(VIR_ERR_INTERNAL_ERROR, "%s", - _("udev_monitor_receive_device returned NULL")); - goto unlock; + nodeDeviceLock(); + udev_monitor = DRV_STATE_UDEV_MONITOR(driver); + + if (!udevCheckMonitorFD(udev_monitor, privateData->monitor_fd)) { + nodeDeviceUnlock(); + goto cleanup; + } + + device = udev_monitor_receive_device(udev_monitor); + nodeDeviceUnlock(); + + if (!device) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("udev_monitor_receive_device returned NULL")); + goto next; + } + + udevHandleOneDevice(device); + + next: + udev_device_unref(device); } - nodeDeviceUnlock(); - udevHandleOneDevice(device); - cleanup: udev_device_unref(device); + udevEventThreadDataFree(privateData); return; - - unlock: - nodeDeviceUnlock(); - goto cleanup; } @@ -1683,20 +1749,28 @@ static void udevEventHandleCallback(int watch ATTRIBUTE_UNUSED, int fd, int events ATTRIBUTE_UNUSED, - void *data ATTRIBUTE_UNUSED) + void *opaque) { struct udev_monitor *udev_monitor = NULL; + udevEventThreadDataPtr threadData = opaque; nodeDeviceLock(); udev_monitor = DRV_STATE_UDEV_MONITOR(driver); - if (!udevCheckMonitorFD(udev_monitor, fd)) { + virMutexLock(&threadData->lock); + threadData->threadQuit = true; + virCondSignal(&threadData->threadCond); + virMutexUnlock(&threadData->lock); + nodeDeviceUnlock(); return; } nodeDeviceUnlock(); - udevEventHandleThread((void *)(intptr_t) fd); + virMutexLock(&threadData->lock); + threadData->nevents++; + virCondSignal(&threadData->threadCond); + virMutexUnlock(&threadData->lock); } @@ -1823,6 +1897,9 @@ nodeStateInitialize(bool privileged, { udevPrivate *priv = NULL; struct udev *udev = NULL; + int monitor_fd = -1; + virThread th; + udevEventThreadDataPtr threadData = NULL; if (VIR_ALLOC(priv) < 0) return -1; @@ -1883,6 +1960,14 @@ nodeStateInitialize(bool privileged, 128 * 1024 * 1024); #endif + monitor_fd = udev_monitor_get_fd(priv->udev_monitor); + if (!(threadData = udevEventThreadDataNew(monitor_fd)) || + virThreadCreate(&th, false, udevEventHandleThread, threadData) < 0) { + virReportSystemError(errno, "%s", + _("failed to create udev handling thread")); + goto cleanup; + } + /* We register the monitor with the event callback so we are * notified by udev of device changes before we enumerate existing * devices because libvirt will simply recreate the device if we @@ -1891,9 +1976,8 @@ nodeStateInitialize(bool privileged, * enumeration. The alternative is to register the callback after * we enumerate, in which case we will fail to create any devices * that appear while the enumeration is taking place. */ - priv->watch = virEventAddHandle(udev_monitor_get_fd(priv->udev_monitor), - VIR_EVENT_HANDLE_READABLE, - udevEventHandleCallback, NULL, NULL); + priv->watch = virEventAddHandle(monitor_fd, VIR_EVENT_HANDLE_READABLE, + udevEventHandleCallback, threadData, NULL); if (priv->watch == -1) goto unlock; -- 2.13.3

On 08/24/2017 07:23 AM, Erik Skultety wrote:
Adjust udevEventHandleThread to be a proper thread routine running in an infinite loop handling devices. Also introduce udevEventThreadData private structure. Every time there's and incoming event from udev, udevEventHandleCallback only increments the number of events queuing on the monitor and signals the worker thread to handle them.
Signed-off-by: Erik Skultety <eskultet@redhat.com> --- src/node_device/node_device_udev.c | 132 ++++++++++++++++++++++++++++++------- 1 file changed, 108 insertions(+), 24 deletions(-)
Once we get here I'm not sure I understand the udev interaction. I guess it's not "crystal clear" that the socket relationship is a queue of device events. I also haven't studied the udev code nor have I been working through this as much as you have lately - so perhaps something you've uncovered will help educate me in this manner. Still here's my thoughts and a small sliver of investigative data... So far there's been a 1-to-1 interaction between libvirt and udev event. With this change there could be an n-to-1 interaction - as in receive @n devices. Up to this point the interaction has been: 1. udev event 2. validate fd/udev_monitor 3. call udev_monitor_receive_device to get @device 4. process @device 5. unref @device 6. return to udev After this point 1. udev event 2. validate fd/udev_monitor 3. increase event count 4. signal condition to thread to wakeup 5. return to udev asynchronously in a loop 1. wakeup on condition and for each 'event' refcnt 2. decrease event refcnt 3. validate fd/udev_monitor 4. call udev_monitor_receive_device to get @device 5. process @device 6. unref @device If there's more than one event, does udev_monitor_receive_device guarantee the order? Are we sure udev buffers/queues in this manner? That is since we're not grabbing @device before we return to udev when the event is signaled, is there any guarantee from udev that *the same* device will still exist when we do get around to making our call? My concern being how "long" is the data guaranteed to be there. As I read the subsequent patch - I'm thinking perhaps we really want to keep that 1-to-1 relationship. Could the reason that you're getting those messages be because we're now calling udev out of the order it expected? Do we know why we're getting a NULL return? I found: https://sourcecodebrowser.com/udev/141/libudev-monitor_8c.html which seems to indicate a number of reasons to return NULL... Maybe udev is processing another device and while doing so it blocks anyone calling udev_monitor_receive_device. Conversely while processing an event that same block wouldn't be present because the expectation is that the called function will call udev_monitor_receive_device. I think perhaps the processing thread is fine to have; however, instead of processing the event it's processing an @device that was managed in udevEventHandleCallback and put onto some "worker queue". At least that way we know we have a refcnt'd udev_device and we can process in an expected order. John
diff --git a/src/node_device/node_device_udev.c b/src/node_device/node_device_udev.c index fe943c78b..444e5be4d 100644 --- a/src/node_device/node_device_udev.c +++ b/src/node_device/node_device_udev.c @@ -59,6 +59,54 @@ struct _udevPrivate { bool privileged; };
+typedef struct _udevEventThreadData udevEventThreadData; +typedef udevEventThreadData *udevEventThreadDataPtr; + +struct _udevEventThreadData { + virMutex lock; + virCond threadCond; + + bool threadQuit; + int monitor_fd; + unsigned long long nevents; /* number of udev events queuing on monitor */ +}; + + +static udevEventThreadDataPtr +udevEventThreadDataNew(int fd) +{ + udevEventThreadDataPtr ret = NULL; + + if (VIR_ALLOC_QUIET(ret) < 0) + return NULL; + + if (virMutexInit(&ret->lock) < 0) + goto cleanup; + + if (virCondInit(&ret->threadCond) < 0) + goto cleanup_mutex; + + ret->monitor_fd = fd; + + return ret; + + cleanup_mutex: + virMutexDestroy(&ret->lock); + + cleanup: + VIR_FREE(ret); + return NULL; +} + + +static void +udevEventThreadDataFree(udevEventThreadDataPtr data) +{ + virMutexDestroy(&data->lock); + virCondDestroy(&data->threadCond); + VIR_FREE(data); +} +
static bool udevHasDeviceProperty(struct udev_device *dev, @@ -1647,35 +1695,53 @@ udevCheckMonitorFD(struct udev_monitor *udev_monitor, int fd)
static void -udevEventHandleThread(void *opaque ATTRIBUTE_UNUSED) +udevEventHandleThread(void *opaque) { + udevEventThreadDataPtr privateData = opaque; struct udev_device *device = NULL; struct udev_monitor *udev_monitor = NULL; - int fd = (intptr_t) opaque;
- nodeDeviceLock(); - udev_monitor = DRV_STATE_UDEV_MONITOR(driver); + /* continue rather than break from the loop on non-fatal errors */ + while (1) { + virMutexLock(&privateData->lock); + while (privateData->nevents == 0 && !privateData->threadQuit) { + if (virCondWait(&privateData->threadCond, &privateData->lock)) { + virReportSystemError(errno, "%s", + _("handler failed to wait on condition")); + goto cleanup; + } + }
- if (!udevCheckMonitorFD(udev_monitor, fd)) - goto unlock; + privateData->nevents--; + virMutexUnlock(&privateData->lock);
- device = udev_monitor_receive_device(udev_monitor); - if (device == NULL) { - virReportError(VIR_ERR_INTERNAL_ERROR, "%s", - _("udev_monitor_receive_device returned NULL")); - goto unlock; + nodeDeviceLock(); + udev_monitor = DRV_STATE_UDEV_MONITOR(driver); + + if (!udevCheckMonitorFD(udev_monitor, privateData->monitor_fd)) { + nodeDeviceUnlock(); + goto cleanup; + } + + device = udev_monitor_receive_device(udev_monitor); + nodeDeviceUnlock(); + + if (!device) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("udev_monitor_receive_device returned NULL")); + goto next; + } + + udevHandleOneDevice(device); + + next: + udev_device_unref(device); }
- nodeDeviceUnlock(); - udevHandleOneDevice(device); - cleanup: udev_device_unref(device); + udevEventThreadDataFree(privateData); return; - - unlock: - nodeDeviceUnlock(); - goto cleanup; }
@@ -1683,20 +1749,28 @@ static void udevEventHandleCallback(int watch ATTRIBUTE_UNUSED, int fd, int events ATTRIBUTE_UNUSED, - void *data ATTRIBUTE_UNUSED) + void *opaque) { struct udev_monitor *udev_monitor = NULL; + udevEventThreadDataPtr threadData = opaque;
nodeDeviceLock(); udev_monitor = DRV_STATE_UDEV_MONITOR(driver); - if (!udevCheckMonitorFD(udev_monitor, fd)) { + virMutexLock(&threadData->lock); + threadData->threadQuit = true; + virCondSignal(&threadData->threadCond); + virMutexUnlock(&threadData->lock); + nodeDeviceUnlock(); return; } nodeDeviceUnlock();
- udevEventHandleThread((void *)(intptr_t) fd); + virMutexLock(&threadData->lock); + threadData->nevents++; + virCondSignal(&threadData->threadCond); + virMutexUnlock(&threadData->lock); }
@@ -1823,6 +1897,9 @@ nodeStateInitialize(bool privileged, { udevPrivate *priv = NULL; struct udev *udev = NULL; + int monitor_fd = -1; + virThread th; + udevEventThreadDataPtr threadData = NULL;
if (VIR_ALLOC(priv) < 0) return -1; @@ -1883,6 +1960,14 @@ nodeStateInitialize(bool privileged, 128 * 1024 * 1024); #endif
+ monitor_fd = udev_monitor_get_fd(priv->udev_monitor); + if (!(threadData = udevEventThreadDataNew(monitor_fd)) || + virThreadCreate(&th, false, udevEventHandleThread, threadData) < 0) { + virReportSystemError(errno, "%s", + _("failed to create udev handling thread")); + goto cleanup; + } + /* We register the monitor with the event callback so we are * notified by udev of device changes before we enumerate existing * devices because libvirt will simply recreate the device if we @@ -1891,9 +1976,8 @@ nodeStateInitialize(bool privileged, * enumeration. The alternative is to register the callback after * we enumerate, in which case we will fail to create any devices * that appear while the enumeration is taking place. */ - priv->watch = virEventAddHandle(udev_monitor_get_fd(priv->udev_monitor), - VIR_EVENT_HANDLE_READABLE, - udevEventHandleCallback, NULL, NULL); + priv->watch = virEventAddHandle(monitor_fd, VIR_EVENT_HANDLE_READABLE, + udevEventHandleCallback, threadData, NULL); if (priv->watch == -1) goto unlock;

On Mon, Aug 28, 2017 at 12:26:08PM -0400, John Ferlan wrote:
On 08/24/2017 07:23 AM, Erik Skultety wrote:
Adjust udevEventHandleThread to be a proper thread routine running in an infinite loop handling devices. Also introduce udevEventThreadData private structure. Every time there's and incoming event from udev, udevEventHandleCallback only increments the number of events queuing on the monitor and signals the worker thread to handle them.
Signed-off-by: Erik Skultety <eskultet@redhat.com> --- src/node_device/node_device_udev.c | 132 ++++++++++++++++++++++++++++++------- 1 file changed, 108 insertions(+), 24 deletions(-)
Once we get here I'm not sure I understand the udev interaction. I guess it's not "crystal clear" that the socket relationship is a queue of device events. I also haven't studied the udev code nor have I been working through this as much as you have lately - so perhaps something you've uncovered will help educate me in this manner. Still here's my thoughts and a small sliver of investigative data...
So far there's been a 1-to-1 interaction between libvirt and udev event. With this change there could be an n-to-1 interaction - as in receive @n devices.
Up to this point the interaction has been:
1. udev event 2. validate fd/udev_monitor 3. call udev_monitor_receive_device to get @device 4. process @device 5. unref @device 6. return to udev
After this point
1. udev event 2. validate fd/udev_monitor 3. increase event count 4. signal condition to thread to wakeup 5. return to udev
asynchronously in a loop
1. wakeup on condition and for each 'event' refcnt 2. decrease event refcnt 3. validate fd/udev_monitor 4. call udev_monitor_receive_device to get @device 5. process @device 6. unref @device
If there's more than one event, does udev_monitor_receive_device guarantee the order? Are we sure udev buffers/queues in this manner?
This should be fairly deterministic as it's not that much of a udev problem as it is a kernel problem (since we're talking about a netlink socket which is internally represented by a 'struct socket' structure in the kernel which in fact implements socket buffers, more specifically - very important - queues). So, if kernel didn't guarantee the order of packets/events on the socket, you'd experience the same problem with calling udev_monitor_receive_device, which basically only does recvmsg(), directly from udevEventHandleCallback. IOW if that was the case, udev itself could get easily confused by the kernel events about a device being deleted that had not been previously added etc. So, yes, the order has to be guaranteed and if it is not, then it's a kernel bug.
That is since we're not grabbing @device before we return to udev when the event is signaled, is there any guarantee from udev that *the same* device will still exist when we do get around to making our call? My concern being how "long" is the data guaranteed to be there.
As I said, unless the socket buffer size limit has been reached, in which case ENOSPC was returned by libudev - I had a BZ on this: https://bugzilla.redhat.com/show_bug.cgi?id=1450960
As I read the subsequent patch - I'm thinking perhaps we really want to keep that 1-to-1 relationship. Could the reason that you're getting those messages be because we're now calling udev out of the order it expected? Do we know why we're getting a NULL return? I found:
Yeah, I was getting EAGAIN and it wasn't until that point that I realized that the scheduler came into play, IOW the main event loop thread polls the file descriptor list, invoking the corresponding handlers, which means that udevEventHandleCallback just increased the overall number of events waiting to be processed by the handler thread, except that since no one actually extracted the data out of the socket and the thread hadn't been scheduled in the meantime it's still the same data that the main event loop thread finds waiting on the socket the next time it polls the @fd, therefore incrementing the event counter multiple times for the same event. Naturally, when the handler thread finally removes the event from the monitor, not knowing the counter had incremented for this event multiple times, it will fail to remove any more events from the monitor, as none had arrived in the meantime, therefore the EAGAIN.
https://sourcecodebrowser.com/udev/141/libudev-monitor_8c.html
which seems to indicate a number of reasons to return NULL... Maybe udev is processing another device and while doing so it blocks anyone calling udev_monitor_receive_device. Conversely while processing an event that same block wouldn't be present because the expectation is that the called function will call udev_monitor_receive_device.
If this was the case, sooner or later we'd hit the problem anyway.
I think perhaps the processing thread is fine to have; however, instead of processing the event it's processing an @device that was managed in udevEventHandleCallback and put onto some "worker queue". At least that
This would work nicely as well, I just went with a simpler solution where I didn't have to create another device queue and only offloaded the whole processing onto the handler thread, so it's just a matter taste :). Erik
way we know we have a refcnt'd udev_device and we can process in an expected order.
John
diff --git a/src/node_device/node_device_udev.c b/src/node_device/node_device_udev.c index fe943c78b..444e5be4d 100644 --- a/src/node_device/node_device_udev.c +++ b/src/node_device/node_device_udev.c @@ -59,6 +59,54 @@ struct _udevPrivate { bool privileged; };
+typedef struct _udevEventThreadData udevEventThreadData; +typedef udevEventThreadData *udevEventThreadDataPtr; + +struct _udevEventThreadData { + virMutex lock; + virCond threadCond; + + bool threadQuit; + int monitor_fd; + unsigned long long nevents; /* number of udev events queuing on monitor */ +}; + + +static udevEventThreadDataPtr +udevEventThreadDataNew(int fd) +{ + udevEventThreadDataPtr ret = NULL; + + if (VIR_ALLOC_QUIET(ret) < 0) + return NULL; + + if (virMutexInit(&ret->lock) < 0) + goto cleanup; + + if (virCondInit(&ret->threadCond) < 0) + goto cleanup_mutex; + + ret->monitor_fd = fd; + + return ret; + + cleanup_mutex: + virMutexDestroy(&ret->lock); + + cleanup: + VIR_FREE(ret); + return NULL; +} + + +static void +udevEventThreadDataFree(udevEventThreadDataPtr data) +{ + virMutexDestroy(&data->lock); + virCondDestroy(&data->threadCond); + VIR_FREE(data); +} +
static bool udevHasDeviceProperty(struct udev_device *dev, @@ -1647,35 +1695,53 @@ udevCheckMonitorFD(struct udev_monitor *udev_monitor, int fd)
static void -udevEventHandleThread(void *opaque ATTRIBUTE_UNUSED) +udevEventHandleThread(void *opaque) { + udevEventThreadDataPtr privateData = opaque; struct udev_device *device = NULL; struct udev_monitor *udev_monitor = NULL; - int fd = (intptr_t) opaque;
- nodeDeviceLock(); - udev_monitor = DRV_STATE_UDEV_MONITOR(driver); + /* continue rather than break from the loop on non-fatal errors */ + while (1) { + virMutexLock(&privateData->lock); + while (privateData->nevents == 0 && !privateData->threadQuit) { + if (virCondWait(&privateData->threadCond, &privateData->lock)) { + virReportSystemError(errno, "%s", + _("handler failed to wait on condition")); + goto cleanup; + } + }
- if (!udevCheckMonitorFD(udev_monitor, fd)) - goto unlock; + privateData->nevents--; + virMutexUnlock(&privateData->lock);
- device = udev_monitor_receive_device(udev_monitor); - if (device == NULL) { - virReportError(VIR_ERR_INTERNAL_ERROR, "%s", - _("udev_monitor_receive_device returned NULL")); - goto unlock; + nodeDeviceLock(); + udev_monitor = DRV_STATE_UDEV_MONITOR(driver); + + if (!udevCheckMonitorFD(udev_monitor, privateData->monitor_fd)) { + nodeDeviceUnlock(); + goto cleanup; + } + + device = udev_monitor_receive_device(udev_monitor); + nodeDeviceUnlock(); + + if (!device) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("udev_monitor_receive_device returned NULL")); + goto next; + } + + udevHandleOneDevice(device); + + next: + udev_device_unref(device); }
- nodeDeviceUnlock(); - udevHandleOneDevice(device); - cleanup: udev_device_unref(device); + udevEventThreadDataFree(privateData); return; - - unlock: - nodeDeviceUnlock(); - goto cleanup; }
@@ -1683,20 +1749,28 @@ static void udevEventHandleCallback(int watch ATTRIBUTE_UNUSED, int fd, int events ATTRIBUTE_UNUSED, - void *data ATTRIBUTE_UNUSED) + void *opaque) { struct udev_monitor *udev_monitor = NULL; + udevEventThreadDataPtr threadData = opaque;
nodeDeviceLock(); udev_monitor = DRV_STATE_UDEV_MONITOR(driver); - if (!udevCheckMonitorFD(udev_monitor, fd)) { + virMutexLock(&threadData->lock); + threadData->threadQuit = true; + virCondSignal(&threadData->threadCond); + virMutexUnlock(&threadData->lock); + nodeDeviceUnlock(); return; } nodeDeviceUnlock();
- udevEventHandleThread((void *)(intptr_t) fd); + virMutexLock(&threadData->lock); + threadData->nevents++; + virCondSignal(&threadData->threadCond); + virMutexUnlock(&threadData->lock); }
@@ -1823,6 +1897,9 @@ nodeStateInitialize(bool privileged, { udevPrivate *priv = NULL; struct udev *udev = NULL; + int monitor_fd = -1; + virThread th; + udevEventThreadDataPtr threadData = NULL;
if (VIR_ALLOC(priv) < 0) return -1; @@ -1883,6 +1960,14 @@ nodeStateInitialize(bool privileged, 128 * 1024 * 1024); #endif
+ monitor_fd = udev_monitor_get_fd(priv->udev_monitor); + if (!(threadData = udevEventThreadDataNew(monitor_fd)) || + virThreadCreate(&th, false, udevEventHandleThread, threadData) < 0) { + virReportSystemError(errno, "%s", + _("failed to create udev handling thread")); + goto cleanup; + } + /* We register the monitor with the event callback so we are * notified by udev of device changes before we enumerate existing * devices because libvirt will simply recreate the device if we @@ -1891,9 +1976,8 @@ nodeStateInitialize(bool privileged, * enumeration. The alternative is to register the callback after * we enumerate, in which case we will fail to create any devices * that appear while the enumeration is taking place. */ - priv->watch = virEventAddHandle(udev_monitor_get_fd(priv->udev_monitor), - VIR_EVENT_HANDLE_READABLE, - udevEventHandleCallback, NULL, NULL); + priv->watch = virEventAddHandle(monitor_fd, VIR_EVENT_HANDLE_READABLE, + udevEventHandleCallback, threadData, NULL); if (priv->watch == -1) goto unlock;

On 08/29/2017 08:49 AM, Erik Skultety wrote:
On Mon, Aug 28, 2017 at 12:26:08PM -0400, John Ferlan wrote:
On 08/24/2017 07:23 AM, Erik Skultety wrote:
Adjust udevEventHandleThread to be a proper thread routine running in an infinite loop handling devices. Also introduce udevEventThreadData private structure. Every time there's and incoming event from udev, udevEventHandleCallback only increments the number of events queuing on the monitor and signals the worker thread to handle them.
Signed-off-by: Erik Skultety <eskultet@redhat.com> --- src/node_device/node_device_udev.c | 132 ++++++++++++++++++++++++++++++------- 1 file changed, 108 insertions(+), 24 deletions(-)
Let me take my head out of the VxHS sandbox ;-) and peek into this other quagmire!
Once we get here I'm not sure I understand the udev interaction. I guess it's not "crystal clear" that the socket relationship is a queue of device events. I also haven't studied the udev code nor have I been working through this as much as you have lately - so perhaps something you've uncovered will help educate me in this manner. Still here's my thoughts and a small sliver of investigative data...
So far there's been a 1-to-1 interaction between libvirt and udev event. With this change there could be an n-to-1 interaction - as in receive @n devices.
Up to this point the interaction has been:
1. udev event 2. validate fd/udev_monitor 3. call udev_monitor_receive_device to get @device 4. process @device 5. unref @device 6. return to udev
After this point
1. udev event 2. validate fd/udev_monitor 3. increase event count 4. signal condition to thread to wakeup 5. return to udev
asynchronously in a loop
1. wakeup on condition and for each 'event' refcnt 2. decrease event refcnt 3. validate fd/udev_monitor 4. call udev_monitor_receive_device to get @device 5. process @device 6. unref @device
If there's more than one event, does udev_monitor_receive_device guarantee the order? Are we sure udev buffers/queues in this manner?
This should be fairly deterministic as it's not that much of a udev problem as it is a kernel problem (since we're talking about a netlink socket which is internally represented by a 'struct socket' structure in the kernel which in fact implements socket buffers, more specifically - very important - queues). So, if kernel didn't guarantee the order of packets/events on the socket, you'd experience the same problem with calling udev_monitor_receive_device, which basically only does recvmsg(), directly from udevEventHandleCallback. IOW if that was the case, udev itself could get easily confused by the kernel events about a device being deleted that had not been previously added etc. So, yes, the order has to be guaranteed and if it is not, then it's a kernel bug.
Ah... OK great - I hadn't dug into that code and figured you'd been looking more in depth. I think as part of the commit message for this (or whatever this evolves into), the a description of how the udev processing works could help. At the very least something that says, kernel udev keeps messages in a queue on the socket to allow pulling off the messages by udev_monitor_receive_device. If too many messages stockpile, ENOSPC is generated, while if we try to pull something off but nothing exists, then EAGAIN is generated. Since the socket has a remote end (libvirt) that "should" be notified when data arrives (or something like that).
That is since we're not grabbing @device before we return to udev when the event is signaled, is there any guarantee from udev that *the same* device will still exist when we do get around to making our call? My concern being how "long" is the data guaranteed to be there.
As I said, unless the socket buffer size limit has been reached, in which case ENOSPC was returned by libudev - I had a BZ on this: https://bugzilla.redhat.com/show_bug.cgi?id=1450960
Ah yes, fix the race to no space problem by adding more space...
As I read the subsequent patch - I'm thinking perhaps we really want to keep that 1-to-1 relationship. Could the reason that you're getting those messages be because we're now calling udev out of the order it expected? Do we know why we're getting a NULL return? I found:
Yeah, I was getting EAGAIN and it wasn't until that point that I realized that the scheduler came into play, IOW the main event loop thread polls the file descriptor list, invoking the corresponding handlers, which means that udevEventHandleCallback just increased the overall number of events waiting to be processed by the handler thread, except that since no one actually extracted the data out of the socket and the thread hadn't been scheduled in the meantime it's still the same data that the main event loop thread finds waiting on the socket the next time it polls the @fd, therefore incrementing the event counter multiple times for the same event. Naturally, when the handler thread finally removes the event from the monitor, not knowing the counter had incremented for this event multiple times, it will fail to remove any more events from the monitor, as none had arrived in the meantime, therefore the EAGAIN.
Seems to me then 'nevents' counting is pointless other than to indicate that you got "some" event that needs to be processed. When the thread wakes up or is scheduled, then it processes all it can find (locked of course) and then clears the flag indicating the presence of events. Of course, that blocks the Callback function from setting the flag that there's an event. Thus still having a somewhat similar problem as you're trying to fix but also introducing the problem that the thread wakes up and locks private data, event is signaled again indicating the same event waiting, but it cannot get the private data lock, meanwhile the thread processes all events, releases the lock. This allows the other thread to gain the lock, set the new event bit, but since we've already processed it - when the thread wakes up it finds no event and generates the bogus message. IIUC, then patch4 resolves this by disabling the main event loop from sending events until we've processed them. Thus should patch4 be merged with this one? In any case, that's some nice detective work - I can only imagine this was *really* frustrating to dig into! Some day it'll be "fixed" and we won't trip into the problem, but until then, I now understand things a lot better...
https://sourcecodebrowser.com/udev/141/libudev-monitor_8c.html
which seems to indicate a number of reasons to return NULL... Maybe udev is processing another device and while doing so it blocks anyone calling udev_monitor_receive_device. Conversely while processing an event that same block wouldn't be present because the expectation is that the called function will call udev_monitor_receive_device.
If this was the case, sooner or later we'd hit the problem anyway.
I think perhaps the processing thread is fine to have; however, instead of processing the event it's processing an @device that was managed in udevEventHandleCallback and put onto some "worker queue". At least that
This would work nicely as well, I just went with a simpler solution where I didn't have to create another device queue and only offloaded the whole processing onto the handler thread, so it's just a matter taste :).
Yep - of course perhaps that could be a "next" step for some future fortunate sole that falls into some other issue... At least with a few more tidbits of history someone may have a chance at figure it out. Still even with our own device queue - we're still at the mercy of the speed and processing logic of the udev code, so it may not help in any way other than providing a few more cycles for files to magically appear and be properly filled while making sure libvirt's udev event processing logic isn't a bottle neck. All we do is get an event, get the device, add the device to our own queue, and return. Then it's up to that queue mgmt logic to appropriately message issue somewhere. I'm not requiring our own device queue, but just typing my thoughts. John
Erik
way we know we have a refcnt'd udev_device and we can process in an expected order.
John
diff --git a/src/node_device/node_device_udev.c b/src/node_device/node_device_udev.c index fe943c78b..444e5be4d 100644 --- a/src/node_device/node_device_udev.c +++ b/src/node_device/node_device_udev.c @@ -59,6 +59,54 @@ struct _udevPrivate { bool privileged; };
+typedef struct _udevEventThreadData udevEventThreadData; +typedef udevEventThreadData *udevEventThreadDataPtr; + +struct _udevEventThreadData { + virMutex lock; + virCond threadCond; + + bool threadQuit; + int monitor_fd; + unsigned long long nevents; /* number of udev events queuing on monitor */ +}; + + +static udevEventThreadDataPtr +udevEventThreadDataNew(int fd) +{ + udevEventThreadDataPtr ret = NULL; + + if (VIR_ALLOC_QUIET(ret) < 0) + return NULL; + + if (virMutexInit(&ret->lock) < 0) + goto cleanup; + + if (virCondInit(&ret->threadCond) < 0) + goto cleanup_mutex; + + ret->monitor_fd = fd; + + return ret; + + cleanup_mutex: + virMutexDestroy(&ret->lock); + + cleanup: + VIR_FREE(ret); + return NULL; +} + + +static void +udevEventThreadDataFree(udevEventThreadDataPtr data) +{ + virMutexDestroy(&data->lock); + virCondDestroy(&data->threadCond); + VIR_FREE(data); +} +
static bool udevHasDeviceProperty(struct udev_device *dev, @@ -1647,35 +1695,53 @@ udevCheckMonitorFD(struct udev_monitor *udev_monitor, int fd)
static void -udevEventHandleThread(void *opaque ATTRIBUTE_UNUSED) +udevEventHandleThread(void *opaque) { + udevEventThreadDataPtr privateData = opaque; struct udev_device *device = NULL; struct udev_monitor *udev_monitor = NULL; - int fd = (intptr_t) opaque;
- nodeDeviceLock(); - udev_monitor = DRV_STATE_UDEV_MONITOR(driver); + /* continue rather than break from the loop on non-fatal errors */ + while (1) { + virMutexLock(&privateData->lock); + while (privateData->nevents == 0 && !privateData->threadQuit) { + if (virCondWait(&privateData->threadCond, &privateData->lock)) { + virReportSystemError(errno, "%s", + _("handler failed to wait on condition")); + goto cleanup; + } + }
- if (!udevCheckMonitorFD(udev_monitor, fd)) - goto unlock; + privateData->nevents--; + virMutexUnlock(&privateData->lock);
- device = udev_monitor_receive_device(udev_monitor); - if (device == NULL) { - virReportError(VIR_ERR_INTERNAL_ERROR, "%s", - _("udev_monitor_receive_device returned NULL")); - goto unlock; + nodeDeviceLock(); + udev_monitor = DRV_STATE_UDEV_MONITOR(driver); + + if (!udevCheckMonitorFD(udev_monitor, privateData->monitor_fd)) { + nodeDeviceUnlock(); + goto cleanup; + } + + device = udev_monitor_receive_device(udev_monitor); + nodeDeviceUnlock(); + + if (!device) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("udev_monitor_receive_device returned NULL")); + goto next; + } + + udevHandleOneDevice(device); + + next: + udev_device_unref(device); }
- nodeDeviceUnlock(); - udevHandleOneDevice(device); - cleanup: udev_device_unref(device); + udevEventThreadDataFree(privateData); return; - - unlock: - nodeDeviceUnlock(); - goto cleanup; }
@@ -1683,20 +1749,28 @@ static void udevEventHandleCallback(int watch ATTRIBUTE_UNUSED, int fd, int events ATTRIBUTE_UNUSED, - void *data ATTRIBUTE_UNUSED) + void *opaque) { struct udev_monitor *udev_monitor = NULL; + udevEventThreadDataPtr threadData = opaque;
nodeDeviceLock(); udev_monitor = DRV_STATE_UDEV_MONITOR(driver); - if (!udevCheckMonitorFD(udev_monitor, fd)) { + virMutexLock(&threadData->lock); + threadData->threadQuit = true; + virCondSignal(&threadData->threadCond); + virMutexUnlock(&threadData->lock); + nodeDeviceUnlock(); return; } nodeDeviceUnlock();
- udevEventHandleThread((void *)(intptr_t) fd); + virMutexLock(&threadData->lock); + threadData->nevents++; + virCondSignal(&threadData->threadCond); + virMutexUnlock(&threadData->lock); }
@@ -1823,6 +1897,9 @@ nodeStateInitialize(bool privileged, { udevPrivate *priv = NULL; struct udev *udev = NULL; + int monitor_fd = -1; + virThread th; + udevEventThreadDataPtr threadData = NULL;
if (VIR_ALLOC(priv) < 0) return -1; @@ -1883,6 +1960,14 @@ nodeStateInitialize(bool privileged, 128 * 1024 * 1024); #endif
+ monitor_fd = udev_monitor_get_fd(priv->udev_monitor); + if (!(threadData = udevEventThreadDataNew(monitor_fd)) || + virThreadCreate(&th, false, udevEventHandleThread, threadData) < 0) { + virReportSystemError(errno, "%s", + _("failed to create udev handling thread")); + goto cleanup; + } + /* We register the monitor with the event callback so we are * notified by udev of device changes before we enumerate existing * devices because libvirt will simply recreate the device if we @@ -1891,9 +1976,8 @@ nodeStateInitialize(bool privileged, * enumeration. The alternative is to register the callback after * we enumerate, in which case we will fail to create any devices * that appear while the enumeration is taking place. */ - priv->watch = virEventAddHandle(udev_monitor_get_fd(priv->udev_monitor), - VIR_EVENT_HANDLE_READABLE, - udevEventHandleCallback, NULL, NULL); + priv->watch = virEventAddHandle(monitor_fd, VIR_EVENT_HANDLE_READABLE, + udevEventHandleCallback, threadData, NULL); if (priv->watch == -1) goto unlock;

On Thu, Aug 31, 2017 at 12:23:09PM -0400, John Ferlan wrote:
On 08/29/2017 08:49 AM, Erik Skultety wrote:
On Mon, Aug 28, 2017 at 12:26:08PM -0400, John Ferlan wrote:
On 08/24/2017 07:23 AM, Erik Skultety wrote:
Adjust udevEventHandleThread to be a proper thread routine running in an infinite loop handling devices. Also introduce udevEventThreadData private structure. Every time there's and incoming event from udev, udevEventHandleCallback only increments the number of events queuing on the monitor and signals the worker thread to handle them.
Signed-off-by: Erik Skultety <eskultet@redhat.com> --- src/node_device/node_device_udev.c | 132 ++++++++++++++++++++++++++++++------- 1 file changed, 108 insertions(+), 24 deletions(-)
Let me take my head out of the VxHS sandbox ;-) and peek into this other quagmire!
Once we get here I'm not sure I understand the udev interaction. I guess it's not "crystal clear" that the socket relationship is a queue of device events. I also haven't studied the udev code nor have I been working through this as much as you have lately - so perhaps something you've uncovered will help educate me in this manner. Still here's my thoughts and a small sliver of investigative data...
So far there's been a 1-to-1 interaction between libvirt and udev event. With this change there could be an n-to-1 interaction - as in receive @n devices.
Up to this point the interaction has been:
1. udev event 2. validate fd/udev_monitor 3. call udev_monitor_receive_device to get @device 4. process @device 5. unref @device 6. return to udev
After this point
1. udev event 2. validate fd/udev_monitor 3. increase event count 4. signal condition to thread to wakeup 5. return to udev
asynchronously in a loop
1. wakeup on condition and for each 'event' refcnt 2. decrease event refcnt 3. validate fd/udev_monitor 4. call udev_monitor_receive_device to get @device 5. process @device 6. unref @device
If there's more than one event, does udev_monitor_receive_device guarantee the order? Are we sure udev buffers/queues in this manner?
This should be fairly deterministic as it's not that much of a udev problem as it is a kernel problem (since we're talking about a netlink socket which is internally represented by a 'struct socket' structure in the kernel which in fact implements socket buffers, more specifically - very important - queues). So, if kernel didn't guarantee the order of packets/events on the socket, you'd experience the same problem with calling udev_monitor_receive_device, which basically only does recvmsg(), directly from udevEventHandleCallback. IOW if that was the case, udev itself could get easily confused by the kernel events about a device being deleted that had not been previously added etc. So, yes, the order has to be guaranteed and if it is not, then it's a kernel bug.
Ah... OK great - I hadn't dug into that code and figured you'd been looking more in depth. I think as part of the commit message for this (or whatever this evolves into), the a description of how the udev processing works could help. At the very least something that says, kernel udev keeps messages in a queue on the socket to allow pulling off the messages by udev_monitor_receive_device. If too many messages stockpile, ENOSPC is generated, while if we try to pull something off but nothing exists, then EAGAIN is generated. Since the socket has a remote end (libvirt) that "should" be notified when data arrives (or something like that).
OK, commit message is going to be enhanced, but probably not this one, rather the one for patch 4/6. ...
Seems to me then 'nevents' counting is pointless other than to indicate that you got "some" event that needs to be processed.
When the thread wakes up or is scheduled, then it processes all it can find (locked of course) and then clears the flag indicating the presence of events. Of course, that blocks the Callback function from setting the flag that there's an event.
Thus still having a somewhat similar problem as you're trying to fix but also introducing the problem that the thread wakes up and locks private data, event is signaled again indicating the same event waiting, but it cannot get the private data lock, meanwhile the thread processes all events, releases the lock. This allows the other thread to gain the
It doesn't process everything it can, it only processes up to N events it already knows about, so if there are 2 events and you're about to process the second one, lock the private data, decrement the events, process it and then either wait if the handle callback wasn't waiting on the lock to increment @nevents again or process any new events signaled by the handle callback. So the important thing is that the problem you're describing cannot happen, since the worker thread *does* rely on the actual number of events (the ones it knows about) waiting on the monitor, it's not a pointless counter.
lock, set the new event bit, but since we've already processed it - when the thread wakes up it finds no event and generates the bogus message.
IIUC, then patch4 resolves this by disabling the main event loop from
To be precise, patch 4 only takes into account the existence of a scheduler :). While I could merge 4/6 into this one, I'd rather not do it, simply because I tried to make all the changes as gradual as possible and I'm afraid that if I merged it, the change and the meaning of 4/6 might easily get overlooked, which is quite the opposite of what I was trying to achieve, I wanted to make the change and the corresponding issue explicit, in case someone hits a similar issue in the future.
sending events until we've processed them. Thus should patch4 be merged with this one?
In any case, that's some nice detective work - I can only imagine this was *really* frustrating to dig into! Some day it'll be "fixed" and we won't trip into the problem, but until then, I now understand things a lot better...
Well, honestly, it could have been much worse given libudev's codebase, since there are exits that do not set any errno, they just return NULL straight away and since I'm still about to find out where does libudev actually outputs the error messages, as I'm pretty sure it's not within our context as I've been told, trying to debug a race by investigating libudev's internals and exit points is not something I would describe as fun and would recommend in any way. Erik

The event loop may get scheduled earlier than the udev event handler thread which means that it would keep invoking the handler callback with "new" events, while in fact it's most likely still the same event which the handler thread hasn't managed to remove from the socket queue yet. This is due to unwanted increments of the number of events queuing on the socket and causes the handler thread to spam the logs with an error about libudev returning NULL (errno == EAGAIN). Signed-off-by: Erik Skultety <eskultet@redhat.com> --- src/node_device/node_device_udev.c | 17 +++++++++++++++++ 1 file changed, 17 insertions(+) diff --git a/src/node_device/node_device_udev.c b/src/node_device/node_device_udev.c index 444e5be4d..e3a647e3d 100644 --- a/src/node_device/node_device_udev.c +++ b/src/node_device/node_device_udev.c @@ -1697,6 +1697,7 @@ udevCheckMonitorFD(struct udev_monitor *udev_monitor, int fd) static void udevEventHandleThread(void *opaque) { + udevPrivate *priv = NULL; udevEventThreadDataPtr privateData = opaque; struct udev_device *device = NULL; struct udev_monitor *udev_monitor = NULL; @@ -1716,6 +1717,7 @@ udevEventHandleThread(void *opaque) virMutexUnlock(&privateData->lock); nodeDeviceLock(); + priv = driver->privateData; udev_monitor = DRV_STATE_UDEV_MONITOR(driver); if (!udevCheckMonitorFD(udev_monitor, privateData->monitor_fd)) { @@ -1726,6 +1728,9 @@ udevEventHandleThread(void *opaque) device = udev_monitor_receive_device(udev_monitor); nodeDeviceUnlock(); + /* Re-enable polling for new events on the @udev_monitor */ + virEventUpdateHandle(priv->watch, VIR_EVENT_HANDLE_READABLE); + if (!device) { virReportError(VIR_ERR_INTERNAL_ERROR, "%s", _("udev_monitor_receive_device returned NULL")); @@ -1751,10 +1756,12 @@ udevEventHandleCallback(int watch ATTRIBUTE_UNUSED, int events ATTRIBUTE_UNUSED, void *opaque) { + udevPrivate *priv = NULL; struct udev_monitor *udev_monitor = NULL; udevEventThreadDataPtr threadData = opaque; nodeDeviceLock(); + priv = driver->privateData; udev_monitor = DRV_STATE_UDEV_MONITOR(driver); if (!udevCheckMonitorFD(udev_monitor, fd)) { virMutexLock(&threadData->lock); @@ -1771,6 +1778,16 @@ udevEventHandleCallback(int watch ATTRIBUTE_UNUSED, threadData->nevents++; virCondSignal(&threadData->threadCond); virMutexUnlock(&threadData->lock); + + /* Due to scheduling, the eventloop might poll the udev monitor before the + * handler thread actually removes the data from the socket, thus causing it + * to spam errors about data not being ready yet, because + * udevEventHandleCallback would be invoked multiple times incrementing the + * counter of events queuing for a single event. + * Therefore we need to disable polling on the fd until the thread actually + * removes the data from the socket. + */ + virEventUpdateHandle(priv->watch, 0); } -- 2.13.3

On 08/24/2017 07:23 AM, Erik Skultety wrote:
The event loop may get scheduled earlier than the udev event handler thread which means that it would keep invoking the handler callback with "new" events, while in fact it's most likely still the same event which the handler thread hasn't managed to remove from the socket queue yet. This is due to unwanted increments of the number of events queuing on the socket and causes the handler thread to spam the logs with an error about libudev returning NULL (errno == EAGAIN).
Signed-off-by: Erik Skultety <eskultet@redhat.com> --- src/node_device/node_device_udev.c | 17 +++++++++++++++++ 1 file changed, 17 insertions(+)
And by disabling the polling couldn't we miss an event? That would be really bad if the event after the one we miss relies on the event we missed creating something that the subsequent one would need (if that makes sense). John
diff --git a/src/node_device/node_device_udev.c b/src/node_device/node_device_udev.c index 444e5be4d..e3a647e3d 100644 --- a/src/node_device/node_device_udev.c +++ b/src/node_device/node_device_udev.c @@ -1697,6 +1697,7 @@ udevCheckMonitorFD(struct udev_monitor *udev_monitor, int fd) static void udevEventHandleThread(void *opaque) { + udevPrivate *priv = NULL; udevEventThreadDataPtr privateData = opaque; struct udev_device *device = NULL; struct udev_monitor *udev_monitor = NULL; @@ -1716,6 +1717,7 @@ udevEventHandleThread(void *opaque) virMutexUnlock(&privateData->lock);
nodeDeviceLock(); + priv = driver->privateData; udev_monitor = DRV_STATE_UDEV_MONITOR(driver);
if (!udevCheckMonitorFD(udev_monitor, privateData->monitor_fd)) { @@ -1726,6 +1728,9 @@ udevEventHandleThread(void *opaque) device = udev_monitor_receive_device(udev_monitor); nodeDeviceUnlock();
+ /* Re-enable polling for new events on the @udev_monitor */ + virEventUpdateHandle(priv->watch, VIR_EVENT_HANDLE_READABLE); + if (!device) { virReportError(VIR_ERR_INTERNAL_ERROR, "%s", _("udev_monitor_receive_device returned NULL")); @@ -1751,10 +1756,12 @@ udevEventHandleCallback(int watch ATTRIBUTE_UNUSED, int events ATTRIBUTE_UNUSED, void *opaque) { + udevPrivate *priv = NULL; struct udev_monitor *udev_monitor = NULL; udevEventThreadDataPtr threadData = opaque;
nodeDeviceLock(); + priv = driver->privateData; udev_monitor = DRV_STATE_UDEV_MONITOR(driver); if (!udevCheckMonitorFD(udev_monitor, fd)) { virMutexLock(&threadData->lock); @@ -1771,6 +1778,16 @@ udevEventHandleCallback(int watch ATTRIBUTE_UNUSED, threadData->nevents++; virCondSignal(&threadData->threadCond); virMutexUnlock(&threadData->lock); + + /* Due to scheduling, the eventloop might poll the udev monitor before the + * handler thread actually removes the data from the socket, thus causing it + * to spam errors about data not being ready yet, because + * udevEventHandleCallback would be invoked multiple times incrementing the + * counter of events queuing for a single event. + * Therefore we need to disable polling on the fd until the thread actually + * removes the data from the socket. + */ + virEventUpdateHandle(priv->watch, 0); }

On Mon, Aug 28, 2017 at 12:38:08PM -0400, John Ferlan wrote:
On 08/24/2017 07:23 AM, Erik Skultety wrote:
The event loop may get scheduled earlier than the udev event handler thread which means that it would keep invoking the handler callback with "new" events, while in fact it's most likely still the same event which the handler thread hasn't managed to remove from the socket queue yet. This is due to unwanted increments of the number of events queuing on the socket and causes the handler thread to spam the logs with an error about libudev returning NULL (errno == EAGAIN).
Signed-off-by: Erik Skultety <eskultet@redhat.com> --- src/node_device/node_device_udev.c | 17 +++++++++++++++++ 1 file changed, 17 insertions(+)
And by disabling the polling couldn't we miss an event? That would be really bad if the event after the one we miss relies on the event we missed creating something that the subsequent one would need (if that makes sense).
No, we just don't poll on the @fd for a moment, but the data has kept being delivered to the udev monitor and waits for us to be extracted. Erik

Since we have a number of places where we workaround timing issues with devices, attributes (files in general) not being available at the time of processing them by calling usleep in a loop for a fixed number of tries, we could as well have a utility function that would do that. Therefore we won't have to duplicate this ugly workaround even more. Signed-off-by: Erik Skultety <eskultet@redhat.com> --- src/libvirt_private.syms | 1 + src/util/virfile.c | 29 +++++++++++++++++++++++++++++ src/util/virfile.h | 2 ++ 3 files changed, 32 insertions(+) diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index 2149b11b7..b9fb54a79 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -1724,6 +1724,7 @@ virFileStripSuffix; virFileTouch; virFileUnlock; virFileUpdatePerm; +virFileWaitForAccess; virFileWrapperFdClose; virFileWrapperFdFree; virFileWrapperFdNew; diff --git a/src/util/virfile.c b/src/util/virfile.c index 2f28e83f4..15b46bcdd 100644 --- a/src/util/virfile.c +++ b/src/util/virfile.c @@ -4166,3 +4166,32 @@ virFileReadValueString(char **value, const char *format, ...) VIR_FREE(str); return ret; } + + +/** + * virFileWaitForAccess: + * @path: absolute path to a sysfs attribute (can be a symlink) + * @ms: how long to wait (in milliseconds) + * @tries: how many times should we try to wait for @path to become accessible + * + * Checks the existence of @path. In case the file defined by @path + * doesn't exist, we wait for it to appear in @ms milliseconds (for up to + * @tries attempts). + * + * Returns 0 on success, -1 on error, setting errno appropriately. + */ +int +virFileWaitForAccess(const char *path, size_t ms, size_t tries) +{ + errno = 0; + + /* wait for @path to be accessible in @ms milliseconds, up to @tries */ + while (tries-- > 0 && !virFileExists(path)) { + if (tries == 0 || errno != ENOENT) + return -1; + + usleep(ms * 1000); + } + + return 0; +} diff --git a/src/util/virfile.h b/src/util/virfile.h index 57ceb8072..42630ebb5 100644 --- a/src/util/virfile.h +++ b/src/util/virfile.h @@ -347,6 +347,8 @@ int virFileReadValueScaledInt(unsigned long long *value, const char *format, ... int virFileReadValueString(char **value, const char *format, ...) ATTRIBUTE_FMT_PRINTF(2, 3); +int virFileWaitForAccess(const char *path, size_t ms, size_t tries); + int virFileInData(int fd, int *inData, -- 2.13.3

On 08/24/2017 07:23 AM, Erik Skultety wrote:
Since we have a number of places where we workaround timing issues with devices, attributes (files in general) not being available at the time of processing them by calling usleep in a loop for a fixed number of tries, we could as well have a utility function that would do that. Therefore we won't have to duplicate this ugly workaround even more.
Signed-off-by: Erik Skultety <eskultet@redhat.com> --- src/libvirt_private.syms | 1 + src/util/virfile.c | 29 +++++++++++++++++++++++++++++ src/util/virfile.h | 2 ++ 3 files changed, 32 insertions(+)
diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index 2149b11b7..b9fb54a79 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -1724,6 +1724,7 @@ virFileStripSuffix; virFileTouch; virFileUnlock; virFileUpdatePerm; +virFileWaitForAccess; virFileWrapperFdClose; virFileWrapperFdFree; virFileWrapperFdNew; diff --git a/src/util/virfile.c b/src/util/virfile.c index 2f28e83f4..15b46bcdd 100644 --- a/src/util/virfile.c +++ b/src/util/virfile.c @@ -4166,3 +4166,32 @@ virFileReadValueString(char **value, const char *format, ...) VIR_FREE(str); return ret; } + + +/** + * virFileWaitForAccess: + * @path: absolute path to a sysfs attribute (can be a symlink) + * @ms: how long to wait (in milliseconds) + * @tries: how many times should we try to wait for @path to become accessible + * + * Checks the existence of @path. In case the file defined by @path + * doesn't exist, we wait for it to appear in @ms milliseconds (for up to + * @tries attempts). + * + * Returns 0 on success, -1 on error, setting errno appropriately. + */ +int +virFileWaitForAccess(const char *path, size_t ms, size_t tries)
Multiple lines per argument NIT: WaitForExists() John
+{ + errno = 0; + + /* wait for @path to be accessible in @ms milliseconds, up to @tries */ + while (tries-- > 0 && !virFileExists(path)) { + if (tries == 0 || errno != ENOENT) + return -1; + + usleep(ms * 1000); + } + + return 0; +} diff --git a/src/util/virfile.h b/src/util/virfile.h index 57ceb8072..42630ebb5 100644 --- a/src/util/virfile.h +++ b/src/util/virfile.h @@ -347,6 +347,8 @@ int virFileReadValueScaledInt(unsigned long long *value, const char *format, ... int virFileReadValueString(char **value, const char *format, ...) ATTRIBUTE_FMT_PRINTF(2, 3);
+int virFileWaitForAccess(const char *path, size_t ms, size_t tries); +
int virFileInData(int fd, int *inData,

If we find ourselves in the situation that the 'add' uevent has been fired earlier than the sysfs tree for a device was created, we should use the best-effort approach and give kernel some predetermined amount of time, thus waiting for the attributes to be ready rather than discarding the device from our device list forever. If those don't appear in the given time frame, we need to move on, since libvirt can't wait indefinitely. Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1463285 Signed-off-by: Erik Skultety <eskultet@redhat.com> --- src/node_device/node_device_udev.c | 16 +++++++++++++++- 1 file changed, 15 insertions(+), 1 deletion(-) diff --git a/src/node_device/node_device_udev.c b/src/node_device/node_device_udev.c index e3a647e3d..96a87f4ab 100644 --- a/src/node_device/node_device_udev.c +++ b/src/node_device/node_device_udev.c @@ -1166,9 +1166,23 @@ udevProcessMediatedDevice(struct udev_device *dev, char *canonicalpath = NULL; virNodeDevCapMdevPtr data = &def->caps->data.mdev; - if (virAsprintf(&linkpath, "%s/mdev_type", udev_device_get_syspath(dev)) < 0) + /* Because of a kernel uevent race, we might get the 'add' event prior to + * the sysfs tree being ready, so any attempt to access any sysfs attribute + * would result in ENOENT and us dropping the device, so let's work around + * it by waiting for the attributes to become available. + */ + + if (virAsprintf(&linkpath, "%s/mdev_type", + udev_device_get_syspath(dev)) < 0) goto cleanup; + if (virFileWaitForAccess(linkpath, 1, 100) < 0) { + virReportSystemError(errno, + _("failed to wait for file '%s' to appear"), + linkpath); + goto cleanup; + } + if (virFileResolveLink(linkpath, &canonicalpath) < 0) { virReportSystemError(errno, _("failed to resolve '%s'"), linkpath); goto cleanup; -- 2.13.3

On 08/24/2017 07:23 AM, Erik Skultety wrote:
If we find ourselves in the situation that the 'add' uevent has been fired earlier than the sysfs tree for a device was created, we should use the best-effort approach and give kernel some predetermined amount of time, thus waiting for the attributes to be ready rather than discarding the device from our device list forever. If those don't appear in the given time frame, we need to move on, since libvirt can't wait indefinitely.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1463285
Signed-off-by: Erik Skultety <eskultet@redhat.com> --- src/node_device/node_device_udev.c | 16 +++++++++++++++- 1 file changed, 15 insertions(+), 1 deletion(-)
IIRC - I pointed out to you that this is eerily familiar to something that happens in the vHBA code w/r/t to wwnn/wwpn files. Except that the files exist, but have a -1 in them which is totally bogus. Then some magic thing happens and the real wwnn/wwpn is placed into the file, but libvirt already looked and failed. When I tried to work around this the decision was to let it be and call it a kernel / udev bug. https://www.redhat.com/archives/libvir-list/2016-June/msg02213.html and Daniel's answer https://www.redhat.com/archives/libvir-list/2016-July/msg00912.html Although yes, with the other changes in place one would think having a wait is no big deal. Still are you guaranteed that once the file exists that the data within the file is valid? In the vHBA case it wasn't and that led to issues. I'd "use this" processing instead of the hack I proposed as well seeing as it doesn't seem kernel/udev fixing issues such as these is on any priority list /-{ John
diff --git a/src/node_device/node_device_udev.c b/src/node_device/node_device_udev.c index e3a647e3d..96a87f4ab 100644 --- a/src/node_device/node_device_udev.c +++ b/src/node_device/node_device_udev.c @@ -1166,9 +1166,23 @@ udevProcessMediatedDevice(struct udev_device *dev, char *canonicalpath = NULL; virNodeDevCapMdevPtr data = &def->caps->data.mdev;
- if (virAsprintf(&linkpath, "%s/mdev_type", udev_device_get_syspath(dev)) < 0) + /* Because of a kernel uevent race, we might get the 'add' event prior to + * the sysfs tree being ready, so any attempt to access any sysfs attribute + * would result in ENOENT and us dropping the device, so let's work around + * it by waiting for the attributes to become available. + */ + + if (virAsprintf(&linkpath, "%s/mdev_type", + udev_device_get_syspath(dev)) < 0) goto cleanup;
+ if (virFileWaitForAccess(linkpath, 1, 100) < 0) { + virReportSystemError(errno, + _("failed to wait for file '%s' to appear"), + linkpath); + goto cleanup; + } + if (virFileResolveLink(linkpath, &canonicalpath) < 0) { virReportSystemError(errno, _("failed to resolve '%s'"), linkpath); goto cleanup;

On Mon, Aug 28, 2017 at 12:40:44PM -0400, John Ferlan wrote:
On 08/24/2017 07:23 AM, Erik Skultety wrote:
If we find ourselves in the situation that the 'add' uevent has been fired earlier than the sysfs tree for a device was created, we should use the best-effort approach and give kernel some predetermined amount of time, thus waiting for the attributes to be ready rather than discarding the device from our device list forever. If those don't appear in the given time frame, we need to move on, since libvirt can't wait indefinitely.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1463285
Signed-off-by: Erik Skultety <eskultet@redhat.com> --- src/node_device/node_device_udev.c | 16 +++++++++++++++- 1 file changed, 15 insertions(+), 1 deletion(-)
IIRC - I pointed out to you that this is eerily familiar to something that happens in the vHBA code w/r/t to wwnn/wwpn files. Except that the files exist, but have a -1 in them which is totally bogus. Then some magic thing happens and the real wwnn/wwpn is placed into the file, but libvirt already looked and failed. When I tried to work around this the decision was to let it be and call it a kernel / udev bug.
https://www.redhat.com/archives/libvir-list/2016-June/msg02213.html
and Daniel's answer
https://www.redhat.com/archives/libvir-list/2016-July/msg00912.html
Although yes, with the other changes in place one would think having a wait is no big deal.
Still are you guaranteed that once the file exists that the data within the file is valid? In the vHBA case it wasn't and that led to issues.
Yes, I recall you pointing me to this issue before and you're right that if the data is bogus, we can't do much about that, except that in this case, I'm only relying on the existence of the file/dir, because I need its name to determine the mediated device's type, not its content, which arguably makes it a different problem. Erik
I'd "use this" processing instead of the hack I proposed as well seeing as it doesn't seem kernel/udev fixing issues such as these is on any priority list /-{
Exactly ^this :(

On Thu, Aug 24, 2017 at 01:23:26PM +0200, Erik Skultety wrote:
v2 here: https://www.redhat.com/archives/libvir-list/2017-July/msg01268.html
Since v2: - added patch 4/6 that fixes the issue with the handler thread spamming logs with "udev_monitor_receive_device returned NULL" -> the event loop callback now disables polling on the udev monitor's fd every time there's a new event, leaving the responsibility for re-enabling it back to the handler thread once it had removed the corresponding data from the socket.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1463285
ping Erik

On 09/18/2017 01:53 AM, Erik Skultety wrote:
On Thu, Aug 24, 2017 at 01:23:26PM +0200, Erik Skultety wrote:
v2 here: https://www.redhat.com/archives/libvir-list/2017-July/msg01268.html
Since v2: - added patch 4/6 that fixes the issue with the handler thread spamming logs with "udev_monitor_receive_device returned NULL" -> the event loop callback now disables polling on the udev monitor's fd every time there's a new event, leaving the responsibility for re-enabling it back to the handler thread once it had removed the corresponding data from the socket.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1463285
ping
I think I had assumed there was going to be a v4... Was I incorrect and you would prefer to keep this series as is presented here? John
Erik
-- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list

On Mon, Sep 18, 2017 at 08:47:52AM -0400, John Ferlan wrote:
On 09/18/2017 01:53 AM, Erik Skultety wrote:
On Thu, Aug 24, 2017 at 01:23:26PM +0200, Erik Skultety wrote:
v2 here: https://www.redhat.com/archives/libvir-list/2017-July/msg01268.html
Since v2: - added patch 4/6 that fixes the issue with the handler thread spamming logs with "udev_monitor_receive_device returned NULL" -> the event loop callback now disables polling on the udev monitor's fd every time there's a new event, leaving the responsibility for re-enabling it back to the handler thread once it had removed the corresponding data from the socket.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1463285
ping
I think I had assumed there was going to be a v4... Was I incorrect and you would prefer to keep this series as is presented here?
Oh, I managed to somehow forget about the comments in patches 1-2 as I fixed them locally and then solely focused on our technical discussion in patches 3-6. No problem, I'll resend a v4 adjusting the first two patches, since there weren't any semantic-related points, rather than points regarding the whole idea in general. Erik
John
Erik
-- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
participants (2)
-
Erik Skultety
-
John Ferlan