[PATCH 0/2] tests: Don't destroy locked mutexes

I've rewritten our virMutexes to check for failures [1] and saw couple of problems. In most cases we are destroying a locked mutex which these patches aim to fix. Then we have virLogLock() which is locked in virFork() and then unlocked from child (a different process) which is problematic. Pthread doesn't like it when one thread locks a mutex only so that another one unlocks it. Semaphores don't care. Anyway, leave virLogMutex aside, patches here fix problems at hand. BTW: commits v8.0.0-267-g39ac285c6b and v8.0.0-236-ga7201789ab were spotted because of this branch of mine. 1: https://gitlab.com/MichalPrivoznik/libvirt/-/commits/tests_mutext/ Michal Prívozník (2): tests: Track if @vm was created by qemuMonitorCommonTestNew() tests: Destroy domain object properly tests/qemuhotplugtest.c | 13 +++++-------- tests/qemumonitortestutils.c | 7 ++++++- tests/qemusecuritytest.c | 3 ++- tests/qemuxml2argvtest.c | 22 +++++++++++++--------- 4 files changed, 26 insertions(+), 19 deletions(-) -- 2.34.1

There are two ways how a domain object can appear in _qemuMonitorTest structure. Either it was created by qemuMonitorCommonTestNew() or caller created the object and we are just leasing it. But this also means we have to release the domain object differently. If the domain object was created by us, then we must unlock and unref it (which is what virDomainObjEndAPI()) does, but if it just a leased object then we have to refrain from unlocking it because this would create lock/unlock imbalance. Moreover, caller might want to keep the object locked. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- tests/qemumonitortestutils.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/tests/qemumonitortestutils.c b/tests/qemumonitortestutils.c index 86300da68a..bb7a8e1191 100644 --- a/tests/qemumonitortestutils.c +++ b/tests/qemumonitortestutils.c @@ -84,6 +84,7 @@ struct _qemuMonitorTest { qemuMonitorTestItem **items; virDomainObj *vm; + bool vm_created; GHashTable *qapischema; }; @@ -406,7 +407,10 @@ qemuMonitorTestFree(qemuMonitorTest *test) g_object_unref(test->eventThread); - virObjectUnref(test->vm); + if (test->vm_created) + virDomainObjEndAPI(&test->vm); + else + virObjectUnref(test->vm); if (test->started) virThreadJoin(&test->thread); @@ -1025,6 +1029,7 @@ qemuMonitorCommonTestNew(virDomainXMLOption *xmlopt, return NULL; if (!(test->vm->def = virDomainDefNew(xmlopt))) return NULL; + test->vm_created = true; } if (virNetSocketNewListenUNIX(path, 0700, geteuid(), getegid(), -- 2.34.1

In a lot of cases we do plain virObjectUnref() with the domain object lock held. While this looks okay, what really happens under the hood is that we destroy a locked mutex. But with a help from virDomainObjEndAPI() we can do the right thing and unlock the mutex before destroying it. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- tests/qemuhotplugtest.c | 13 +++++-------- tests/qemusecuritytest.c | 3 ++- tests/qemuxml2argvtest.c | 22 +++++++++++++--------- 3 files changed, 20 insertions(+), 18 deletions(-) diff --git a/tests/qemuhotplugtest.c b/tests/qemuhotplugtest.c index 263a92425c..de136f7987 100644 --- a/tests/qemuhotplugtest.c +++ b/tests/qemuhotplugtest.c @@ -278,7 +278,7 @@ testQemuHotplug(const void *data) goto cleanup; if (test->vm) { - vm = test->vm; + vm = g_steal_pointer(&test->vm); if (!vm->def) { VIR_TEST_VERBOSE("test skipped due to failure of dependent test"); goto cleanup; @@ -352,12 +352,9 @@ testQemuHotplug(const void *data) /* don't dispose test monitor with VM */ if (priv) priv->mon = NULL; - if (keep) { - test->vm = vm; - } else { - virObjectUnref(vm); - test->vm = NULL; - } + if (keep) + test->vm = g_steal_pointer(&vm); + virDomainObjEndAPI(&vm); virDomainDeviceDefFree(dev); return ((ret < 0 && fail) || (!ret && !fail)) ? 0 : -1; } @@ -397,7 +394,7 @@ testQemuHotplugCpuDataFree(struct testQemuHotplugCpuData *data) priv = data->vm->privateData; priv->mon = NULL; - virObjectUnref(data->vm); + virDomainObjEndAPI(&data->vm); } if (data->mon) { diff --git a/tests/qemusecuritytest.c b/tests/qemusecuritytest.c index 924c625a4c..2a7641dc83 100644 --- a/tests/qemusecuritytest.c +++ b/tests/qemusecuritytest.c @@ -86,7 +86,7 @@ static int testDomain(const void *opaque) { const struct testData *data = opaque; - g_autoptr(virDomainObj) vm = NULL; + virDomainObj *vm = NULL; g_autoptr(GHashTable) notRestored = virHashNew(NULL); size_t i; int ret = -1; @@ -126,6 +126,7 @@ testDomain(const void *opaque) ret = 0; cleanup: + virDomainObjEndAPI(&vm); g_unsetenv(ENVVAR); freePaths(); return ret; diff --git a/tests/qemuxml2argvtest.c b/tests/qemuxml2argvtest.c index 6cf35a0ebf..794d76c776 100644 --- a/tests/qemuxml2argvtest.c +++ b/tests/qemuxml2argvtest.c @@ -642,11 +642,12 @@ testCompareXMLToArgvValidateSchema(virQEMUDriver *drv, unsigned int flags) { g_auto(GStrv) args = NULL; - g_autoptr(virDomainObj) vm = NULL; + virDomainObj *vm = NULL; qemuDomainObjPrivate *priv = NULL; GHashTable *schema = NULL; g_autoptr(virCommand) cmd = NULL; unsigned int parseFlags = info->parseFlags; + int ret = -1; /* comment out with line comment to enable schema checking for non _CAPS tests if (!info->schemafile) @@ -669,29 +670,32 @@ testCompareXMLToArgvValidateSchema(virQEMUDriver *drv, return 0; if (!(vm = virDomainObjNew(driver.xmlopt))) - return -1; + goto cleanup; parseFlags |= VIR_DOMAIN_DEF_PARSE_INACTIVE; if (!(vm->def = virDomainDefParseFile(info->infile, driver.xmlopt, NULL, parseFlags))) - return -1; + goto cleanup; priv = vm->privateData; if (virBitmapParse("0-3", &priv->autoNodeset, 4) < 0) - return -1; + goto cleanup; if (!(cmd = testCompareXMLToArgvCreateArgs(drv, vm, migrateURI, info, flags))) - return -1; + goto cleanup; if (virCommandGetArgList(cmd, &args) < 0) - return -1; + goto cleanup; if (testCompareXMLToArgvValidateSchemaCommand(args, schema) < 0) - return -1; + goto cleanup; - return 0; + ret = 0; + cleanup: + virDomainObjEndAPI(&vm); + return ret; } @@ -885,7 +889,7 @@ testCompareXMLToArgv(const void *data) cleanup: virDomainChrSourceDefClear(&monitor_chr); - virObjectUnref(vm); + virDomainObjEndAPI(&vm); virIdentitySetCurrent(NULL); virSetConnectSecret(NULL); virSetConnectStorage(NULL); -- 2.34.1

On Mon, Feb 14, 2022 at 02:32:55PM +0100, Michal Privoznik wrote:
I've rewritten our virMutexes to check for failures [1] and saw couple of problems. In most cases we are destroying a locked mutex which these patches aim to fix. Then we have virLogLock() which is locked in virFork() and then unlocked from child (a different process) which is problematic. Pthread doesn't like it when one thread locks a mutex only so that another one unlocks it. Semaphores don't care.
Despite behaviuor being undefined, in practice it works with any impl we have in supported platforms. The key problem we're addressing is that the child process needs to inherit a known state for the mutex. Whether that state is locked or unlocked doesn't matter, as long as it has a bulletproof guarantee of what that status is. We call virLogLock before fork() to guarantee a locked state, by blocking on any other threads in the parent that might be temporarily holding it. The pthread_atfork() handler can be used todo the same trick if you don't control the place where fork() is called. The man page explicitly says https://linux.die.net/man/3/pthread_atfork "The expected usage is that the prepare handler acquires all mutex locks and the other two fork handlers release them." I understand why error checking mutexes reject this but AFAIK what we do is the only possible solution that exists, other than never touching any mutexes at all in the child which is not viable for us. Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On 2/14/22 15:01, Daniel P. Berrangé wrote:
On Mon, Feb 14, 2022 at 02:32:55PM +0100, Michal Privoznik wrote:
I've rewritten our virMutexes to check for failures [1] and saw couple of problems. In most cases we are destroying a locked mutex which these patches aim to fix. Then we have virLogLock() which is locked in virFork() and then unlocked from child (a different process) which is problematic. Pthread doesn't like it when one thread locks a mutex only so that another one unlocks it. Semaphores don't care.
Despite behaviuor being undefined, in practice it works with any impl we have in supported platforms.
Indeed, that's why I haven't sent the patch that rewrites virLogMutex.
The key problem we're addressing is that the child process needs to inherit a known state for the mutex. Whether that state is locked or unlocked doesn't matter, as long as it has a bulletproof guarantee of what that status is.
We call virLogLock before fork() to guarantee a locked state, by blocking on any other threads in the parent that might be temporarily holding it.
Again, I understand that. It's just that the following code returns an error: #include <pthread.h> #include <stdlib.h> #include <unistd.h> #include <stdio.h> #include <string.h> int main(int argc, char *argv[]) { pthread_mutex_t l; pthread_mutexattr_t attr; pthread_mutexattr_init(&attr); pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK); pthread_mutex_init(&l, &attr); pthread_mutex_lock(&l); if (fork() == 0) { /* child */ int rc; if ((rc = pthread_mutex_unlock(&l)) != 0) { fprintf(stderr, "unlock error: %s\n", strerror(rc)); abort(); } } return 0; } Switching to a semaphore works again. I'm not advocating for that though, it's needless because apparently pthread does the right thing on all platforms we care about.
The pthread_atfork() handler can be used todo the same trick if you don't control the place where fork() is called. The man page explicitly says
https://linux.die.net/man/3/pthread_atfork
"The expected usage is that the prepare handler acquires all mutex locks and the other two fork handlers release them."
Which is a terrible advice, let me say. This assumes that there is a global list of mutexes (which there certainly is not in projects of our size), but more importantly - this is so fragile to lock ordering than anything else. Michal

On Mon, Feb 14, 2022 at 03:27:51PM +0100, Michal Prívozník wrote:
On 2/14/22 15:01, Daniel P. Berrangé wrote:
On Mon, Feb 14, 2022 at 02:32:55PM +0100, Michal Privoznik wrote:
I've rewritten our virMutexes to check for failures [1] and saw couple of problems. In most cases we are destroying a locked mutex which these patches aim to fix. Then we have virLogLock() which is locked in virFork() and then unlocked from child (a different process) which is problematic. Pthread doesn't like it when one thread locks a mutex only so that another one unlocks it. Semaphores don't care.
Despite behaviuor being undefined, in practice it works with any impl we have in supported platforms.
Indeed, that's why I haven't sent the patch that rewrites virLogMutex.
The key problem we're addressing is that the child process needs to inherit a known state for the mutex. Whether that state is locked or unlocked doesn't matter, as long as it has a bulletproof guarantee of what that status is.
We call virLogLock before fork() to guarantee a locked state, by blocking on any other threads in the parent that might be temporarily holding it.
Again, I understand that. It's just that the following code returns an error:
#include <pthread.h> #include <stdlib.h> #include <unistd.h> #include <stdio.h> #include <string.h>
int main(int argc, char *argv[]) { pthread_mutex_t l; pthread_mutexattr_t attr;
pthread_mutexattr_init(&attr); pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
pthread_mutex_init(&l, &attr);
pthread_mutex_lock(&l); if (fork() == 0) { /* child */ int rc;
if ((rc = pthread_mutex_unlock(&l)) != 0) { fprintf(stderr, "unlock error: %s\n", strerror(rc)); abort(); } }
return 0; }
Switching to a semaphore works again. I'm not advocating for that though, it's needless because apparently pthread does the right thing on all platforms we care about.
I was thinking an alternative was to call 'pthread_mutex_init' in the child to blindly re-initialize the mutex, since we know the parent had synchronize already. Downside though is that we could leak resources if mutexes have some associated OS state we don't know about. The 'robust' mutex type looked quite attractive because it says PTHREAD_MUTEX_ROBUST If a mutex is initialized with the PTHREAD_MUTEX_ROBUST attribute and its owner dies without unlocking it, any future attempts to call pthread_mutex_lock(3) on this mutex will succeed and return EOWNERDEAD to indicate that the original owner no longer exists and the mutex is in an inconsistent state. Usually after EOWNERDEAD is returned, the next owner should call pthread_mu‐ tex_consistent(3) on the acquired mutex to make it con‐ sistent again before using it any further. IOW we could pthread_mutex_lock(&m) pid = fork() if (pid == 0) { pthread_mutex_lock(&m) pthread_mutex_consistent(&m) } else { pthread_mutex_unlock(&m) } but in practice that doesn't work. The lock() call in the child deadlocks instead of returning EOWNERDEAD. IOW, from POV of a robust mutex the parent thread & child process are the same, but from POV of an error chekcing mutex they are different. Yay for consistency :-(
The pthread_atfork() handler can be used todo the same trick if you don't control the place where fork() is called. The man page explicitly says
https://linux.die.net/man/3/pthread_atfork
"The expected usage is that the prepare handler acquires all mutex locks and the other two fork handlers release them."
Which is a terrible advice, let me say. This assumes that there is a global list of mutexes (which there certainly is not in projects of our size), but more importantly - this is so fragile to lock ordering than anything else.
Yeah lock ordering would be a trainwreck. Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
participants (3)
-
Daniel P. Berrangé
-
Michal Privoznik
-
Michal Prívozník