[PATCH 0/2] Warn on pthread errors

libvirt currently exhibits undefined behavior due to pthread mutex misuse, e.g. destroying a locked mutex or attempting to unlock an already unlocked mutex. Add a warning if such a case is detected, so we can start on fixing the issues. Tim Wiederhake (2): virMutex: Warn on error DO NOT MERGE: virMutex: Fail loudly src/util/virthread.c | 18 +++++++++++++++--- 1 file changed, 15 insertions(+), 3 deletions(-) -- 2.43.0

`pthread_mutex_destroy`, `pthread_mutex_lock` and `pthread_mutex_unlock` return an error code that is currently ignored. Add debug information if one of these operations failed, e.g. when there is an attempt to destroy a still locked mutex or unlock an already unlocked mutex. Both scenarios are considered undefined behavior. Signed-off-by: Tim Wiederhake <twiederh@redhat.com> --- src/util/virthread.c | 15 ++++++++++++--- 1 file changed, 12 insertions(+), 3 deletions(-) diff --git a/src/util/virthread.c b/src/util/virthread.c index 5422bb74fd..14116a2221 100644 --- a/src/util/virthread.c +++ b/src/util/virthread.c @@ -35,7 +35,10 @@ #include "viralloc.h" #include "virthreadjob.h" +#include "virlog.h" +#define VIR_FROM_THIS VIR_FROM_THREAD +VIR_LOG_INIT("util.thread"); int virOnce(virOnceControl *once, virOnceFunc init) { @@ -83,17 +86,23 @@ int virMutexInitRecursive(virMutex *m) void virMutexDestroy(virMutex *m) { - pthread_mutex_destroy(&m->lock); + if (pthread_mutex_destroy(&m->lock)) { + VIR_WARN("Failed to destroy mutex=%p", m); + } } void virMutexLock(virMutex *m) { - pthread_mutex_lock(&m->lock); + if (pthread_mutex_lock(&m->lock)) { + VIR_WARN("Failed to lock mutex=%p", m); + } } void virMutexUnlock(virMutex *m) { - pthread_mutex_unlock(&m->lock); + if (pthread_mutex_unlock(&m->lock)) { + VIR_WARN("Failed to unlock mutex=%p", m); + } } virLockGuard virLockGuardLock(virMutex *m) -- 2.43.0

On 7/3/24 14:44, Tim Wiederhake wrote:
`pthread_mutex_destroy`, `pthread_mutex_lock` and `pthread_mutex_unlock` return an error code that is currently ignored.
Add debug information if one of these operations failed, e.g. when there is an attempt to destroy a still locked mutex or unlock an already unlocked mutex. Both scenarios are considered undefined behavior.
Signed-off-by: Tim Wiederhake <twiederh@redhat.com> --- src/util/virthread.c | 15 ++++++++++++--- 1 file changed, 12 insertions(+), 3 deletions(-)
diff --git a/src/util/virthread.c b/src/util/virthread.c index 5422bb74fd..14116a2221 100644 --- a/src/util/virthread.c +++ b/src/util/virthread.c @@ -35,7 +35,10 @@
#include "viralloc.h" #include "virthreadjob.h" +#include "virlog.h"
+#define VIR_FROM_THIS VIR_FROM_THREAD +VIR_LOG_INIT("util.thread");
int virOnce(virOnceControl *once, virOnceFunc init) { @@ -83,17 +86,23 @@ int virMutexInitRecursive(virMutex *m)
void virMutexDestroy(virMutex *m) { - pthread_mutex_destroy(&m->lock); + if (pthread_mutex_destroy(&m->lock)) { + VIR_WARN("Failed to destroy mutex=%p", m); + } }
void virMutexLock(virMutex *m) { - pthread_mutex_lock(&m->lock); + if (pthread_mutex_lock(&m->lock)) { + VIR_WARN("Failed to lock mutex=%p", m); + } }
void virMutexUnlock(virMutex *m) { - pthread_mutex_unlock(&m->lock); + if (pthread_mutex_unlock(&m->lock)) { + VIR_WARN("Failed to unlock mutex=%p", m); + } }
virLockGuard virLockGuardLock(virMutex *m)
I'm a bit worried that without additional info (e.g. stacktrace) this info is going to be useless. Though, even stacktrace might be not enough :( But this can at least let us know there's something suspicious going on. Reviewed-by: Michal Privoznik <mprivozn@redhat.com> Michal

On Wed, Jul 03, 2024 at 02:44:37PM +0200, Tim Wiederhake wrote:
`pthread_mutex_destroy`, `pthread_mutex_lock` and `pthread_mutex_unlock` return an error code that is currently ignored.
Add debug information if one of these operations failed, e.g. when there is an attempt to destroy a still locked mutex or unlock an already unlocked mutex. Both scenarios are considered undefined behavior.
Signed-off-by: Tim Wiederhake <twiederh@redhat.com> --- src/util/virthread.c | 15 ++++++++++++--- 1 file changed, 12 insertions(+), 3 deletions(-)
diff --git a/src/util/virthread.c b/src/util/virthread.c index 5422bb74fd..14116a2221 100644 --- a/src/util/virthread.c +++ b/src/util/virthread.c @@ -35,7 +35,10 @@
#include "viralloc.h" #include "virthreadjob.h" +#include "virlog.h"
+#define VIR_FROM_THIS VIR_FROM_THREAD +VIR_LOG_INIT("util.thread");
int virOnce(virOnceControl *once, virOnceFunc init) { @@ -83,17 +86,23 @@ int virMutexInitRecursive(virMutex *m)
void virMutexDestroy(virMutex *m) { - pthread_mutex_destroy(&m->lock); + if (pthread_mutex_destroy(&m->lock)) { + VIR_WARN("Failed to destroy mutex=%p", m); + } }
void virMutexLock(virMutex *m) { - pthread_mutex_lock(&m->lock); + if (pthread_mutex_lock(&m->lock)) { + VIR_WARN("Failed to lock mutex=%p", m); + } }
void virMutexUnlock(virMutex *m) { - pthread_mutex_unlock(&m->lock); + if (pthread_mutex_unlock(&m->lock)) { + VIR_WARN("Failed to unlock mutex=%p", m); + } }
I'd be surprised if these lock/unlock warnings ever trigger, since IIUC they would need us to be using an error checking mutex, not a regular mutex. IOW, aren't these just adding condition test overhead + unreachable code to the lock calls ? The 2nd patch shows failures in the destroy calls IIUC. With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On Thu, 2024-07-04 at 08:10 +0100, Daniel P. Berrangé wrote:
On Wed, Jul 03, 2024 at 02:44:37PM +0200, Tim Wiederhake wrote:
`pthread_mutex_destroy`, `pthread_mutex_lock` and `pthread_mutex_unlock` return an error code that is currently ignored.
Add debug information if one of these operations failed, e.g. when there is an attempt to destroy a still locked mutex or unlock an already unlocked mutex. Both scenarios are considered undefined behavior.
Signed-off-by: Tim Wiederhake <twiederh@redhat.com> --- src/util/virthread.c | 15 ++++++++++++--- 1 file changed, 12 insertions(+), 3 deletions(-)
diff --git a/src/util/virthread.c b/src/util/virthread.c index 5422bb74fd..14116a2221 100644 --- a/src/util/virthread.c +++ b/src/util/virthread.c @@ -35,7 +35,10 @@ #include "viralloc.h" #include "virthreadjob.h" +#include "virlog.h" +#define VIR_FROM_THIS VIR_FROM_THREAD +VIR_LOG_INIT("util.thread"); int virOnce(virOnceControl *once, virOnceFunc init) { @@ -83,17 +86,23 @@ int virMutexInitRecursive(virMutex *m) void virMutexDestroy(virMutex *m) { - pthread_mutex_destroy(&m->lock); + if (pthread_mutex_destroy(&m->lock)) { + VIR_WARN("Failed to destroy mutex=%p", m); + } } void virMutexLock(virMutex *m) { - pthread_mutex_lock(&m->lock); + if (pthread_mutex_lock(&m->lock)) { + VIR_WARN("Failed to lock mutex=%p", m); + } } void virMutexUnlock(virMutex *m) { - pthread_mutex_unlock(&m->lock); + if (pthread_mutex_unlock(&m->lock)) { + VIR_WARN("Failed to unlock mutex=%p", m); + } }
I'd be surprised if these lock/unlock warnings ever trigger, since IIUC they would need us to be using an error checking mutex, not a regular mutex. IOW, aren't these just adding condition test overhead + unreachable code to the lock calls ?
The 2nd patch shows failures in the destroy calls IIUC.
With regards, Daniel
I have looked more closely into the issue now. pthread_mutex_lock and pthread_mutex_unlock do indeed not return a non-zero value over us not using error checking mutexes. During my last attempt at fixing the issues I had a patch that would count lockings and unlockings of mutexes explicitly, and I believe I recall seeing problems in that area as well. Sadly, I cannot reproduce that now, at least not reliably: Ignoring the warnings for pthread_mutex_destroy, virnetdaemontest does seem to trigger my "number of locks == number of unlocks" check in about 3 out of 10.000 runs. And sometimes with a frequency of 1 in 10. Sometimes not at all. In any case: I do not consider the checks for locking / unlocking dead code. So far I have been using the test suite to check for obvious issues, but I cannot rule out that libvirt itself has race conditions too. I would advocate for merging this patch as is, and add a patch to enable error checking for the mutexes. Regards, Tim

On Thu, Jul 18, 2024 at 03:50:27PM +0200, Tim Wiederhake wrote:
On Thu, 2024-07-04 at 08:10 +0100, Daniel P. Berrangé wrote:
On Wed, Jul 03, 2024 at 02:44:37PM +0200, Tim Wiederhake wrote:
`pthread_mutex_destroy`, `pthread_mutex_lock` and `pthread_mutex_unlock` return an error code that is currently ignored.
Add debug information if one of these operations failed, e.g. when there is an attempt to destroy a still locked mutex or unlock an already unlocked mutex. Both scenarios are considered undefined behavior.
Signed-off-by: Tim Wiederhake <twiederh@redhat.com> --- src/util/virthread.c | 15 ++++++++++++--- 1 file changed, 12 insertions(+), 3 deletions(-)
diff --git a/src/util/virthread.c b/src/util/virthread.c index 5422bb74fd..14116a2221 100644 --- a/src/util/virthread.c +++ b/src/util/virthread.c @@ -35,7 +35,10 @@ #include "viralloc.h" #include "virthreadjob.h" +#include "virlog.h" +#define VIR_FROM_THIS VIR_FROM_THREAD +VIR_LOG_INIT("util.thread"); int virOnce(virOnceControl *once, virOnceFunc init) { @@ -83,17 +86,23 @@ int virMutexInitRecursive(virMutex *m) void virMutexDestroy(virMutex *m) { - pthread_mutex_destroy(&m->lock); + if (pthread_mutex_destroy(&m->lock)) { + VIR_WARN("Failed to destroy mutex=%p", m); + } } void virMutexLock(virMutex *m) { - pthread_mutex_lock(&m->lock); + if (pthread_mutex_lock(&m->lock)) { + VIR_WARN("Failed to lock mutex=%p", m); + } } void virMutexUnlock(virMutex *m) { - pthread_mutex_unlock(&m->lock); + if (pthread_mutex_unlock(&m->lock)) { + VIR_WARN("Failed to unlock mutex=%p", m); + } }
I'd be surprised if these lock/unlock warnings ever trigger, since IIUC they would need us to be using an error checking mutex, not a regular mutex. IOW, aren't these just adding condition test overhead + unreachable code to the lock calls ?
The 2nd patch shows failures in the destroy calls IIUC.
I have looked more closely into the issue now. pthread_mutex_lock and pthread_mutex_unlock do indeed not return a non-zero value over us not using error checking mutexes.
During my last attempt at fixing the issues I had a patch that would count lockings and unlockings of mutexes explicitly, and I believe I recall seeing problems in that area as well. Sadly, I cannot reproduce that now, at least not reliably: Ignoring the warnings for pthread_mutex_destroy, virnetdaemontest does seem to trigger my "number of locks == number of unlocks" check in about 3 out of 10.000 runs. And sometimes with a frequency of 1 in 10. Sometimes not at all. In any case: I do not consider the checks for locking / unlocking dead code.
So far I have been using the test suite to check for obvious issues, but I cannot rule out that libvirt itself has race conditions too.
But if the checks never fire, due to the mutex type, this isn't helping us diagnose anything surely ?
I would advocate for merging this patch as is, and add a patch to enable error checking for the mutexes.
Error checking mutexes aren't something we can easily enable, because they break any code which needs to hold a mutex across fork() and unlock in the child. IOW, we would need two different types of mutex and pick which to use, where. With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

This is just to demonstrate that libvirt currently exhibits undefined behavior due to pthread mutex misuse. With this patch applied, several libvirt tests fail due to the triggered `abort()` calls: * cputest * qemuagenttest * qemucapabilitiestest * qemumigparamstest * qemuhotplugtest * qemumonitorjsontest * qemusecuritytest * qemuxmlconftest Signed-off-by: Tim Wiederhake <twiederh@redhat.com> --- src/util/virthread.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/src/util/virthread.c b/src/util/virthread.c index 14116a2221..dc51144d48 100644 --- a/src/util/virthread.c +++ b/src/util/virthread.c @@ -88,6 +88,7 @@ void virMutexDestroy(virMutex *m) { if (pthread_mutex_destroy(&m->lock)) { VIR_WARN("Failed to destroy mutex=%p", m); + abort(); } } @@ -95,6 +96,7 @@ void virMutexLock(virMutex *m) { if (pthread_mutex_lock(&m->lock)) { VIR_WARN("Failed to lock mutex=%p", m); + abort(); } } @@ -102,6 +104,7 @@ void virMutexUnlock(virMutex *m) { if (pthread_mutex_unlock(&m->lock)) { VIR_WARN("Failed to unlock mutex=%p", m); + abort(); } } -- 2.43.0

On 7/3/24 14:44, Tim Wiederhake wrote:
This is just to demonstrate that libvirt currently exhibits undefined behavior due to pthread mutex misuse. With this patch applied, several libvirt tests fail due to the triggered `abort()` calls: * cputest * qemuagenttest * qemucapabilitiestest * qemumigparamstest * qemuhotplugtest * qemumonitorjsontest * qemusecuritytest * qemuxmlconftest
IIUC, the problem is just in tests, not libvirt code itself.
Signed-off-by: Tim Wiederhake <twiederh@redhat.com> --- src/util/virthread.c | 3 +++ 1 file changed, 3 insertions(+)
Michal

On 7/3/24 14:44, Tim Wiederhake wrote:
libvirt currently exhibits undefined behavior due to pthread mutex misuse, e.g. destroying a locked mutex or attempting to unlock an already unlocked mutex.
Add a warning if such a case is detected, so we can start on fixing the issues.
I expect this to be even harder to fix than to detect. BTW: in the past I used to play with: https://github.com/dbadapt/mutrace which looked like it could detect some locking problems. I think I even had some patches locally for easier debugging of libvirt. But I can't find that repo anymore. Oh, and then there's DRD tool in valgrind, but I can't find the right combination of arguments to make it produce anything usable. Michal
participants (3)
-
Daniel P. Berrangé
-
Michal Prívozník
-
Tim Wiederhake