[libvirt] [PATCH] remote: Prevent race when closing a connection

A race condition can occur when virConnectClose is called parallel to the execution of the connection close callback in remoteClientCloseFunc. The race happens if the connection object is destroyed (including the mutex) while remoteClientCloseFunc is waiting for the connection mutex. After the destruction of the (non error checking) mutex remoteClientCloseFunc starts to process the callbacks. However the operations can occur against a freed (or even worse, reallocated) object. Another issue is that the closeFreeCallback is invoked even if it's NULL (this is the case for virsh). The solution is to clean out the callback pointers in virConnectDispose before destroying the mutex. This way remoteClientCloseFunc will return immediately after passing virMutexLock, thus avoiding potential data corruption. There's still the slight chance that the concluding virMutexUnlock could do harm on the freed connection object. This could be fixed using an error checking mutex which however has a much broader scope and impact. Signed-off-by: Viktor Mihajlovski <mihajlov@linux.vnet.ibm.com> --- src/datatypes.c | 8 +++++++- src/remote/remote_driver.c | 3 ++- 2 files changed, 9 insertions(+), 2 deletions(-) diff --git a/src/datatypes.c b/src/datatypes.c index b04e100..2358bdf 100644 --- a/src/datatypes.c +++ b/src/datatypes.c @@ -146,8 +146,14 @@ virConnectDispose(void *obj) virMutexLock(&conn->lock); - if (conn->closeFreeCallback) + if (conn->closeCallback) + conn->closeCallback = NULL; + + if (conn->closeFreeCallback) { conn->closeFreeCallback(conn->closeOpaque); + conn->closeFreeCallback = NULL; + conn->closeOpaque = NULL; + } virResetError(&conn->err); diff --git a/src/remote/remote_driver.c b/src/remote/remote_driver.c index 3721af9..885120e 100644 --- a/src/remote/remote_driver.c +++ b/src/remote/remote_driver.c @@ -358,7 +358,8 @@ static void remoteClientCloseFunc(virNetClientPtr client ATTRIBUTE_UNUSED, closeCallback(conn, reason, closeOpaque); virMutexLock(&conn->lock); conn->closeDispatch = false; - if (conn->closeUnregisterCount != closeUnregisterCount) + if (conn->closeUnregisterCount != closeUnregisterCount && + closeFreeCallback) closeFreeCallback(closeOpaque); } virMutexUnlock(&conn->lock); -- 1.7.9.5

On 03/14/2013 06:26 AM, Viktor Mihajlovski wrote:
A race condition can occur when virConnectClose is called parallel to the execution of the connection close callback in remoteClientCloseFunc.
The race happens if the connection object is destroyed (including the mutex) while remoteClientCloseFunc is waiting for the connection mutex. After the destruction of the (non error checking) mutex remoteClientCloseFunc starts to process the callbacks. However the operations can occur against a freed (or even worse, reallocated) object. Another issue is that the closeFreeCallback is invoked even if it's NULL (this is the case for virsh).
The solution is to clean out the callback pointers in virConnectDispose before destroying the mutex. This way remoteClientCloseFunc will return immediately after passing virMutexLock, thus avoiding potential data corruption. There's still the slight chance that the concluding virMutexUnlock could do harm on the freed connection object.
See my question below...
This could be fixed using an error checking mutex which however has a much broader scope and impact.
I'd like Dan to weigh in on this one, because he probably has a better insight into the proper way to break the race. But here are some smaller comments:
Signed-off-by: Viktor Mihajlovski <mihajlov@linux.vnet.ibm.com> --- src/datatypes.c | 8 +++++++- src/remote/remote_driver.c | 3 ++- 2 files changed, 9 insertions(+), 2 deletions(-)
diff --git a/src/datatypes.c b/src/datatypes.c index b04e100..2358bdf 100644 --- a/src/datatypes.c +++ b/src/datatypes.c @@ -146,8 +146,14 @@ virConnectDispose(void *obj)
virMutexLock(&conn->lock);
- if (conn->closeFreeCallback) + if (conn->closeCallback) + conn->closeCallback = NULL;
The if is pointless. Just blindly set conn->closeCallback to NULL.
+ + if (conn->closeFreeCallback) { conn->closeFreeCallback(conn->closeOpaque); + conn->closeFreeCallback = NULL; + conn->closeOpaque = NULL;
Clearing conn->closeOpaque is pointless; it is only ever used depending on conn->closeFreeCallback, and leaving it non-NULL doesn't hurt.
+ }
virResetError(&conn->err);
diff --git a/src/remote/remote_driver.c b/src/remote/remote_driver.c index 3721af9..885120e 100644 --- a/src/remote/remote_driver.c +++ b/src/remote/remote_driver.c @@ -358,7 +358,8 @@ static void remoteClientCloseFunc(virNetClientPtr client ATTRIBUTE_UNUSED, closeCallback(conn, reason, closeOpaque); virMutexLock(&conn->lock); conn->closeDispatch = false; - if (conn->closeUnregisterCount != closeUnregisterCount) + if (conn->closeUnregisterCount != closeUnregisterCount && + closeFreeCallback) closeFreeCallback(closeOpaque); } virMutexUnlock(&conn->lock);
...Wouldn't it be better to stash a copy of the callback pointer while the mutex is held, but avoid calling the callback until after the mutex is unlocked? Something like: <TYPE> cb = NULL; void* opaque; virMutexLock(&conn->lock); conn->closeDispatch = false; if (conn->closeUnregisterCount != closeUnregisterCount) { cb = closeFreeCallback; opaque = closeOpaque; } virMutexUnlock(&conn->lock); if (cb) cb(opaque); -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

On 03/15/2013 12:28 AM, Eric Blake wrote:
- if (conn->closeFreeCallback) + if (conn->closeCallback) + conn->closeCallback = NULL;
The if is pointless. Just blindly set conn->closeCallback to NULL.
agreed
+ + if (conn->closeFreeCallback) { conn->closeFreeCallback(conn->closeOpaque); + conn->closeFreeCallback = NULL; + conn->closeOpaque = NULL;
Clearing conn->closeOpaque is pointless; it is only ever used depending on conn->closeFreeCallback, and leaving it non-NULL doesn't hurt.
I know, and didn't do it initially, but then wanted to make it common with the callback deregistering code. And a small portion of paranoia doesn't hurt as I have come to learn.
...Wouldn't it be better to stash a copy of the callback pointer while the mutex is held, but avoid calling the callback until after the mutex is unlocked? Something like:
<TYPE> cb = NULL; void* opaque; virMutexLock(&conn->lock); conn->closeDispatch = false; if (conn->closeUnregisterCount != closeUnregisterCount) { cb = closeFreeCallback; opaque = closeOpaque; } virMutexUnlock(&conn->lock); if (cb) cb(opaque);
maybe, but this is again common to the other places where the freeing callback is invoked, i.e. within the lock. Waiting for Dan's comments... -- Mit freundlichen Grüßen/Kind Regards Viktor Mihajlovski IBM Deutschland Research & Development GmbH Vorsitzender des Aufsichtsrats: Martina Köderitz Geschäftsführung: Dirk Wittkopp Sitz der Gesellschaft: Böblingen Registergericht: Amtsgericht Stuttgart, HRB 243294

On Thu, Mar 14, 2013 at 01:26:55PM +0100, Viktor Mihajlovski wrote:
A race condition can occur when virConnectClose is called parallel to the execution of the connection close callback in remoteClientCloseFunc.
The race happens if the connection object is destroyed (including the mutex) while remoteClientCloseFunc is waiting for the connection mutex. After the destruction of the (non error checking) mutex remoteClientCloseFunc starts to process the callbacks. However the operations can occur against a freed (or even worse, reallocated) object. Another issue is that the closeFreeCallback is invoked even if it's NULL (this is the case for virsh).
The solution is to clean out the callback pointers in virConnectDispose before destroying the mutex. This way remoteClientCloseFunc will return immediately after passing virMutexLock, thus avoiding potential data corruption. There's still the slight chance that the concluding virMutexUnlock could do harm on the freed connection object. This could be fixed using an error checking mutex which however has a much broader scope and impact.
No, this really isn't solving the problem. The virConnectDipose function is the last thing to run on an object. Once virConnectDispose is running absolutely nothing else may safely use that object pointer. The thread that is not in virConnectDispose here is missing a reference on the object, to prevent it being destroyed while it is still in use. so NACk to this patch, it doesn't fix the problem, merely makes a SEGV slightly less likely. Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On 03/15/2013 10:33 AM, Daniel P. Berrange wrote:
No, this really isn't solving the problem. The virConnectDipose function is the last thing to run on an object. Once virConnectDispose is running absolutely nothing else may safely use that object pointer. The thread that is not in virConnectDispose here is missing a reference on the object, to prevent it being destroyed while it is still in use.
so NACk to this patch, it doesn't fix the problem, merely makes a SEGV slightly less likely.
Daniel
I understand your objection and I have already tested a patch which increments the object ref counter when registering a close callback. The unfortunate thing is that the close callback isn't guaranteed to be deregistered (causing a virsh leaked reference complaint). As my brain kicks in while I'm typing, this is probably the way to go ... -- Mit freundlichen Grüßen/Kind Regards Viktor Mihajlovski IBM Deutschland Research & Development GmbH Vorsitzender des Aufsichtsrats: Martina Köderitz Geschäftsführung: Dirk Wittkopp Sitz der Gesellschaft: Böblingen Registergericht: Amtsgericht Stuttgart, HRB 243294

On Fri, Mar 15, 2013 at 10:50:08AM +0100, Viktor Mihajlovski wrote:
On 03/15/2013 10:33 AM, Daniel P. Berrange wrote:
No, this really isn't solving the problem. The virConnectDipose function is the last thing to run on an object. Once virConnectDispose is running absolutely nothing else may safely use that object pointer. The thread that is not in virConnectDispose here is missing a reference on the object, to prevent it being destroyed while it is still in use.
so NACk to this patch, it doesn't fix the problem, merely makes a SEGV slightly less likely.
Daniel
I understand your objection and I have already tested a patch which increments the object ref counter when registering a close callback. The unfortunate thing is that the close callback isn't guaranteed to be deregistered (causing a virsh leaked reference complaint). As my brain kicks in while I'm typing, this is probably the way to go ...
With the existing domain event callbacks we hold a reference on the connection for as long as the callback is set. The app is required to unregister the callbacks prior to closing the connection. So the same approach is fine for the close callback. To prevent the leak in virsh, virsh should be de-registering the callback Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
participants (3)
-
Daniel P. Berrange
-
Eric Blake
-
Viktor Mihajlovski