[libvirt] [PATCH 0/2] Avoid the thread race condition

https://bugzilla.redhat.com/show_bug.cgi?id=866524

The two patches fix the same bug in different ways. 1/2 is the approach Daniel suggested in the bug; 2/2 is an alternative that locks the whole virConnect object. The reason for posting 2/2 is that I think 1/2 doesn't fix the root problem, though I believe it does fix this bug: a thread can still get the lock before/during doRemoteClose and run into a similar situation. It just isn't the case now, because remoteClientCloseFunc doesn't make any changes that could race with the later doRemoteClose. However, I'm not sure whether locking the whole object could cause unexpected results.

Osier Yang (2):
  remote: Avoid the thread race condition
  Lock the whole virConnect Object when disposing to avoid the thread race

 src/datatypes.c            |    4 ++--
 src/remote/remote_driver.c |    4 ++++
 2 files changed, 6 insertions(+), 2 deletions(-)

Regards,
Osier

From: Daniel P. Berrange <berrange@redhat.com>

https://bugzilla.redhat.com/show_bug.cgi?id=866524

Since the virConnect object is not locked as a whole when doing virConnectDispose, another thread can still get the lock, which causes the race.

Detected by valgrind:

==23687== Invalid read of size 4
==23687==    at 0x38BAA091EC: pthread_mutex_lock (pthread_mutex_lock.c:61)
==23687==    by 0x3FBA919E36: remoteClientCloseFunc (remote_driver.c:337)
==23687==    by 0x3FBA936BF2: virNetClientCloseLocked (virnetclient.c:688)
==23687==    by 0x3FBA9390D8: virNetClientIncomingEvent (virnetclient.c:1859)
==23687==    by 0x3FBA851AAE: virEventPollRunOnce (event_poll.c:485)
==23687==    by 0x3FBA850846: virEventRunDefaultImpl (event.c:247)
==23687==    by 0x40CD61: vshEventLoop (virsh.c:2128)
==23687==    by 0x3FBA8626F8: virThreadHelper (threads-pthread.c:161)
==23687==    by 0x38BAA077F0: start_thread (pthread_create.c:301)
==23687==    by 0x33F68E570C: clone (clone.S:115)
==23687==  Address 0x4ca94e0 is 144 bytes inside a block of size 312 free'd
==23687==    at 0x4A0595D: free (vg_replace_malloc.c:366)
==23687==    by 0x3FBA8588B8: virFree (memory.c:309)
==23687==    by 0x3FBA86AAFC: virObjectUnref (virobject.c:145)
==23687==    by 0x3FBA8EA767: virConnectClose (libvirt.c:1458)
==23687==    by 0x40C8B8: vshDeinit (virsh.c:2584)
==23687==    by 0x41071E: main (virsh.c:3022)

The above race is caused by the eventLoop thread trying to handle the net client event by calling the callback set by:

    virNetClientSetCloseCallback(priv->client,
                                 remoteClientCloseFunc,
                                 conn, NULL);

i.e. remoteClientCloseFunc, which locks/unlocks the virConnect object.

This patch fixes the bug by setting the callback to NULL in doRemoteClose.
---
 src/remote/remote_driver.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/src/remote/remote_driver.c b/src/remote/remote_driver.c
index e5d4af3..5cc7e32 100644
--- a/src/remote/remote_driver.c
+++ b/src/remote/remote_driver.c
@@ -1003,6 +1003,10 @@ doRemoteClose(virConnectPtr conn, struct private_data *priv)
     virObjectUnref(priv->tls);
     priv->tls = NULL;
+    virNetClientSetCloseCallback(priv->client,
+                                 NULL,
+                                 NULL,
+                                 NULL);
     virNetClientClose(priv->client);
     virObjectUnref(priv->client);
     priv->client = NULL;
--
1.7.7.6
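For anyone following the thread without the remote driver in front of them, here is the shape of the bug and of this fix, reduced to a self-contained sketch. The names (conn_t, net_client_t, client_fire_close and so on) are hypothetical stand-ins, not the real libvirt API: the connection registers a close callback on its network client, and that callback locks the connection, so if the event loop can still fire the callback while the connection is being freed, it locks freed memory; clearing the callback before the client is closed removes that window.

#include <pthread.h>
#include <stdio.h>

/* Simplified stand-ins for virConnect / virNetClient (hypothetical types). */
typedef struct conn conn_t;

typedef struct {
    void (*close_cb)(conn_t *opaque);   /* callback fired when the connection drops */
    conn_t *opaque;
} net_client_t;

struct conn {
    pthread_mutex_t lock;
    net_client_t *client;
};

/* Analogue of remoteClientCloseFunc: it locks the connection it was given. */
static void client_close_func(conn_t *conn)
{
    pthread_mutex_lock(&conn->lock);
    printf("close callback ran against conn %p\n", (void *)conn);
    pthread_mutex_unlock(&conn->lock);
}

/* What the event loop does when the socket is seen to close. */
static void client_fire_close(net_client_t *client)
{
    if (client->close_cb)
        client->close_cb(client->opaque);
}

/* Analogue of doRemoteClose with the fix applied: clear the callback
 * *before* the client is closed and the connection freed, so the event
 * loop can no longer call back into memory that is about to go away. */
static void conn_close(conn_t *conn)
{
    conn->client->close_cb = NULL;   /* the added virNetClientSetCloseCallback(..., NULL, NULL, NULL) */
    conn->client->opaque = NULL;
    /* ... close and unref the client, then free the connection ... */
}

int main(void)
{
    conn_t conn = { PTHREAD_MUTEX_INITIALIZER, NULL };
    net_client_t client = { client_close_func, &conn };

    conn.client = &client;

    client_fire_close(&client);   /* before close: callback runs, conn still valid */
    conn_close(&conn);
    client_fire_close(&client);   /* after close: no callback left, no use-after-free */
    return 0;
}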

On Wed, Dec 05, 2012 at 10:48:43PM +0800, Osier Yang wrote:
From: Daniel P. Berrange <berrange@redhat.com>
https://bugzilla.redhat.com/show_bug.cgi?id=866524
Since the virConnect object is not locked as a whole when doing virConnectDispose, another thread can still get the lock, which causes the race.
Detected by valgrind:
==23687== Invalid read of size 4
==23687==    at 0x38BAA091EC: pthread_mutex_lock (pthread_mutex_lock.c:61)
==23687==    by 0x3FBA919E36: remoteClientCloseFunc (remote_driver.c:337)
==23687==    by 0x3FBA936BF2: virNetClientCloseLocked (virnetclient.c:688)
==23687==    by 0x3FBA9390D8: virNetClientIncomingEvent (virnetclient.c:1859)
==23687==    by 0x3FBA851AAE: virEventPollRunOnce (event_poll.c:485)
==23687==    by 0x3FBA850846: virEventRunDefaultImpl (event.c:247)
==23687==    by 0x40CD61: vshEventLoop (virsh.c:2128)
==23687==    by 0x3FBA8626F8: virThreadHelper (threads-pthread.c:161)
==23687==    by 0x38BAA077F0: start_thread (pthread_create.c:301)
==23687==    by 0x33F68E570C: clone (clone.S:115)
==23687==  Address 0x4ca94e0 is 144 bytes inside a block of size 312 free'd
==23687==    at 0x4A0595D: free (vg_replace_malloc.c:366)
==23687==    by 0x3FBA8588B8: virFree (memory.c:309)
==23687==    by 0x3FBA86AAFC: virObjectUnref (virobject.c:145)
==23687==    by 0x3FBA8EA767: virConnectClose (libvirt.c:1458)
==23687==    by 0x40C8B8: vshDeinit (virsh.c:2584)
==23687==    by 0x41071E: main (virsh.c:3022)
The above race is caused by the eventLoop thread trying to handle the net client event by calling the callback set by:

    virNetClientSetCloseCallback(priv->client,
                                 remoteClientCloseFunc,
                                 conn, NULL);

i.e. remoteClientCloseFunc, which locks/unlocks the virConnect object.

This patch fixes the bug by setting the callback to NULL in doRemoteClose.
---
 src/remote/remote_driver.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)
diff --git a/src/remote/remote_driver.c b/src/remote/remote_driver.c
index e5d4af3..5cc7e32 100644
--- a/src/remote/remote_driver.c
+++ b/src/remote/remote_driver.c
@@ -1003,6 +1003,10 @@ doRemoteClose(virConnectPtr conn, struct private_data *priv)
     virObjectUnref(priv->tls);
     priv->tls = NULL;
+    virNetClientSetCloseCallback(priv->client,
+                                 NULL,
+                                 NULL,
+                                 NULL);
     virNetClientClose(priv->client);
     virObjectUnref(priv->client);
     priv->client = NULL;
ACK, this avoids the race condition of concurrent access. The only other way to avoid it would be to have the remote driver hold an extra reference on the virConnectPtr, but that creates a circular reference, meaning the connection could never be freed.

Daniel
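To spell out the circular-reference point with a toy example (hypothetical structs, not libvirt code): if the connection held a reference on the client, and the client's close-callback opaque in turn held a reference back on the connection, then the application dropping its last reference would still leave both counts above zero, so neither dispose function could ever run.

#include <stdio.h>

/* Two hypothetical refcounted objects that reference each other. */
struct client;

struct conn {
    int refs;
    struct client *client;    /* conn holds a ref on its client */
};

struct client {
    int refs;
    struct conn *cb_opaque;   /* close-callback opaque: a ref back on the conn */
};

static void conn_unref(struct conn *c)
{
    if (--c->refs == 0)
        printf("conn disposed\n");   /* never reached in this sketch */
}

int main(void)
{
    struct conn conn = { 1, NULL };      /* application's reference */
    struct client client = { 1, NULL };  /* conn's reference (taken below) */

    conn.client = &client;               /* conn -> client */
    client.cb_opaque = &conn;
    conn.refs++;                         /* client -> conn: the "extra reference" */

    conn_unref(&conn);                   /* application drops its last ref: 2 -> 1 */
    printf("conn refs = %d, client refs = %d: neither object can ever be disposed\n",
           conn.refs, client.refs);
    return 0;
}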

https://bugzilla.redhat.com/show_bug.cgi?id=866524

Since the virConnect object is not locked as a whole when doing virConnectDispose, another thread can still get the lock, which causes the race.

Detected by valgrind:

==23687== Invalid read of size 4
==23687==    at 0x38BAA091EC: pthread_mutex_lock (pthread_mutex_lock.c:61)
==23687==    by 0x3FBA919E36: remoteClientCloseFunc (remote_driver.c:337)
==23687==    by 0x3FBA936BF2: virNetClientCloseLocked (virnetclient.c:688)
==23687==    by 0x3FBA9390D8: virNetClientIncomingEvent (virnetclient.c:1859)
==23687==    by 0x3FBA851AAE: virEventPollRunOnce (event_poll.c:485)
==23687==    by 0x3FBA850846: virEventRunDefaultImpl (event.c:247)
==23687==    by 0x40CD61: vshEventLoop (virsh.c:2128)
==23687==    by 0x3FBA8626F8: virThreadHelper (threads-pthread.c:161)
==23687==    by 0x38BAA077F0: start_thread (pthread_create.c:301)
==23687==    by 0x33F68E570C: clone (clone.S:115)
==23687==  Address 0x4ca94e0 is 144 bytes inside a block of size 312 free'd
==23687==    at 0x4A0595D: free (vg_replace_malloc.c:366)
==23687==    by 0x3FBA8588B8: virFree (memory.c:309)
==23687==    by 0x3FBA86AAFC: virObjectUnref (virobject.c:145)
==23687==    by 0x3FBA8EA767: virConnectClose (libvirt.c:1458)
==23687==    by 0x40C8B8: vshDeinit (virsh.c:2584)
==23687==    by 0x41071E: main (virsh.c:3022)

The above race is caused by the eventLoop thread trying to handle the net client event by calling the callback set by:

    virNetClientSetCloseCallback(priv->client,
                                 remoteClientCloseFunc,
                                 conn, NULL);

i.e. remoteClientCloseFunc, which locks/unlocks the virConnect object.

This patch fixes it by locking the whole virConnect object when disposing it.
---
 src/datatypes.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/datatypes.c b/src/datatypes.c
index c0ed3a2..8648ecb 100644
--- a/src/datatypes.c
+++ b/src/datatypes.c
@@ -128,6 +128,8 @@ virConnectDispose(void *obj)
 {
     virConnectPtr conn = obj;
 
+    virMutexLock(&conn->lock);
+
     if (conn->networkDriver)
         conn->networkDriver->close(conn);
     if (conn->interfaceDriver)
@@ -143,8 +145,6 @@ virConnectDispose(void *obj)
     if (conn->driver)
         conn->driver->close(conn);
 
-    virMutexLock(&conn->lock);
-
     if (conn->closeFreeCallback)
         conn->closeFreeCallback(conn->closeOpaque);
--
1.7.7.6

On Wed, Dec 05, 2012 at 10:48:44PM +0800, Osier Yang wrote:
https://bugzilla.redhat.com/show_bug.cgi?id=866524
Since the virConnect object is not locked as a whole when doing virConnectDispose, another thread can still get the lock, which causes the race.
This patch fixes it by locking the whole virConnect object when disposing it.
No, this is wrong. If two threads are using an object, they must each be holding a reference on the object. virConnectDispose is the *last* thing to happen to an object. If a 2nd thread is still using an object when virConnectDispose runs, then by definition, it has forgotten to hold a reference count. Locking cannot save you at this point - the 2nd thread will end up waiting on a mutex that has been destroyed.

NACK

Daniel
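The rule being stated here, as a minimal sketch outside libvirt (a hypothetical object_t in the general style of virObject, not the real implementation): every thread that may touch the object takes its own reference first, and dispose runs only when the last reference is dropped, so by construction no other thread can still be inside the object when dispose executes and no locking is needed there. A mutex taken in dispose would be destroyed and freed together with the object anyway.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* A minimal refcounted object: dispose happens only after the last unref. */
typedef struct {
    int refs;                 /* protected by lock */
    pthread_mutex_t lock;
    char *payload;
} object_t;

static object_t *object_new(void)
{
    object_t *obj = calloc(1, sizeof(*obj));
    obj->refs = 1;
    pthread_mutex_init(&obj->lock, NULL);
    obj->payload = strdup("state");
    return obj;
}

static void object_ref(object_t *obj)
{
    pthread_mutex_lock(&obj->lock);
    obj->refs++;
    pthread_mutex_unlock(&obj->lock);
}

static void object_unref(object_t *obj)
{
    pthread_mutex_lock(&obj->lock);
    int last = (--obj->refs == 0);
    pthread_mutex_unlock(&obj->lock);

    if (last) {
        /* Dispose: no other thread holds a reference, so no locking here;
         * the mutex itself is destroyed along with the object. */
        free(obj->payload);
        pthread_mutex_destroy(&obj->lock);
        free(obj);
    }
}

/* A worker thread only uses the object through a reference taken for it. */
static void *worker(void *opaque)
{
    object_t *obj = opaque;
    printf("worker sees payload: %s\n", obj->payload);
    object_unref(obj);        /* drop the worker's reference when done */
    return NULL;
}

int main(void)
{
    object_t *obj = object_new();
    pthread_t tid;

    object_ref(obj);                       /* reference owned by the worker */
    pthread_create(&tid, NULL, worker, obj);

    object_unref(obj);                     /* main drops its ref; the worker's ref keeps obj alive */
    pthread_join(tid, NULL);
    return 0;
}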

On 12/06/2012 00:23, Daniel P. Berrange wrote:
On Wed, Dec 05, 2012 at 10:48:44PM +0800, Osier Yang wrote:
https://bugzilla.redhat.com/show_bug.cgi?id=866524
Since the virConnect object is not locked as a whole when doing virConnectDispose, another thread can still get the lock, which causes the race.
This patch fixes it by locking the whole virConnect object when disposing it.
No, this is wrong. If two threads are using an object, they must each be holding a reference on the object. virConnectDispose is the *last* thing to happen to an object. If a 2nd thread is still using an object when virConnectDispose runs, then by definition, it has forgotten to hold a reference count.
Er, right.

Locking cannot save you at this point - the 2nd thread will end up waiting on a mutex that has been destroyed.
Okay, I did the wrong thing; pushed your version. Thanks.

Regards,
Osier