Re: [libvirt] [PATCH 5/5] nodedev: Protect every access to udev_monitor by locking the driver

17 Aug 2017

On Tue, Aug 01, 2017 at 04:13:55PM -0400, John Ferlan wrote:
...
On 07/26/2017 02:44 AM, Erik Skultety wrote:
...
Since @udev_monitor isn't immutable within the driver, we need to
protect every access to it by locking the driver, so that no spurious
action changing the pointer while one thread is actively using it might
occur. This patch just takes some precaution measures.
Signed-off-by: Erik Skultety <eskultet@redhat.com>
---
 src/node_device/node_device_udev.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/src/node_device/node_device_udev.c b/src/node_device/node_device_udev.c
index 7ecb330df..0643a5845 100644
--- a/src/node_device/node_device_udev.c
+++ b/src/node_device/node_device_udev.c
@@ -1623,9 +1623,13 @@ udevEventHandleCallback(int watch ATTRIBUTE_UNUSED,
                         void *data ATTRIBUTE_UNUSED)
 {
     struct udev_device *device = NULL;
-    struct udev_monitor *udev_monitor = DRV_STATE_UDEV_MONITOR(driver);
+    struct udev_monitor *udev_monitor = NULL;
     int udev_fd = -1;
+    nodeDeviceLock();
+    if (!(udev_monitor = DRV_STATE_UDEV_MONITOR(driver)))
+        goto error;
+
Originally I thought maybe we should provide some sort of error here,
but I convinced myself perhaps not - although I wouldn't be opposed to a
VIR_WARN or something...
My second thought was should we add some sort of EventRemoveHandle type
operation? But once we get the lock, the only way udev_monitor would be
NULL is if nodeStateCleanup is run which should already have removed the
handle, so probably not. But then it struck me, if that's the case, then
the nodeDeviceLock could fail rather miserably because driver could be
NULL... (e.g. virMutexLock(&driver->lock);)
Still not quite clear how a path such as this could be triggered other
than perhaps the last dying breaths of the daemon - unless something
that the *unref does causes any events to be flushed. Maybe a "delayed"
It doesn't, libudev just closes the socket, and because we already removed the
event handle, the only possible entry point I see is when there's an event to
handle, but the daemon's going down and preemption causes the daemon to grab
the lock first, is that what you're referring to as 'delayed' perhaps?
...
event - who knows. Still, if we wait on the lock, we could end up in a
state where @driver doesn't exist or anything it was referencing has
been freed...
So I think we need :
if (!driver)
        return;
to follow how nodeStateCleanup will bail...
then (considering my thoughts in patch 2)
Add to node_device_udev.h (and use it in Cleanup too, so we don't have
to create a local @priv variable):
#define DRV_STATE_WATCH(ds) (((udevPrivate *)((ds)->privateData))->watch)
nodeDeviceLock();
    if (!driver || !driver->privateData ||
        DRV_STATE_WATCH(driver) == -1 ||
        !(udev_monitor = DRV_STATE_UDEV_MONITOR(driver))) {
        nodeDeviceUnlock();
        return;
    }
Anyhow, looking at ^this as the potential addition to our best effort to
terminate gracefully, I suddenly don't feel any motivation to pursue this
further, especially since both locking a non-existent mutex and destroying a
locked mutex results into an undefined behaviour according to the
documentation, so even despite our hard effort of checking whether the driver
still exists and whether it's internals are still intact, it still might go
haywire during termination, who knows. Therefore I opt for dropping patches 3
and 5 from the series as I no longer see it as worth the effort.

Thank you for your input.
Erik
...
then the rest of the code.
If you're "waiting" to get the lock while Cleanup is running, then
remove the chance that something that Cleanup does could cause awful
things to happen in this code.
In cleanup, once nodeDeviceUnlock() is run, but before virMutexDestroy,
we could grab the lock. That would cause virMutexDestroy
(pthread_mutex_destroy) to fail, but we don't check that. That just
becomes a side effect of using an event function and taking out the
mutex I suppose.
Oh and I should point out that it's possible that virMutexDestroy does
run, thus our getting the lock does nothing, but one would think that
the subsequent condition checks wouldn't be racing too hard against the
VIR_FREE(driver) in the other thread.
The impossible part of this race is that how does one tell if something
is waiting on the mutex other than getting a failure when we destroy it,
but since we don't manage failures for destroy, we have no way to be
sure. Still in the long run, both threads are heading to a very quick
death, so does it really matter.
John
(muttering to myself how much I hate lock races).
...
udev_fd = udev_monitor_get_fd(udev_monitor);
     if (fd != udev_fd) {
         udevPrivate *priv = driver->privateData;
@@ -1640,22 +1644,26 @@ udevEventHandleCallback(int watch ATTRIBUTE_UNUSED,
          * the same error multiple times
          */
         virEventRemoveHandle(priv->watch);
-
-        goto cleanup;
+        goto error;
     }
device = udev_monitor_receive_device(udev_monitor);
     if (device == NULL) {
         virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
                        _("udev_monitor_receive_device returned NULL"));
-        goto cleanup;
+        goto error;
     }
+    nodeDeviceUnlock();
     udevHandleOneDevice(device);
cleanup:
     udev_device_unref(device);
     return;
+
+ error:
+    nodeDeviceUnlock();
+    goto cleanup;
 }