[libvirt] Race/broken mutex when closing a connection

I received a report of a virsh coredump that happened on a heavy loaded system. Debugging the corefile I saw that under certain circumstances the mutex for a connection can be destroyed while another thread is waiting for the lock on that connection. Specifically this seems to happen when vshDeinit is called after the completion of vshRunCommand and at the same time the event loop is receiving a client disconnect. vshDeinit calls virConnectClose which unrefs the connection object destroying the lock mutex before releasing the memory and nulling the callback pointers. At the same time the event loop will call remoteClientCloseFunc which waits for the mutex while the connection is unrefed. It will obtain the (now invalid but non-error checking) mutex, copy the closeCallback pointer (shorty before the connection object is destroyed) and will die upon trying to call closeFreeCallback (NULL by now). First I would like to see whether someone finds an obvious flaw in my reasoning. Then I'd would like to discuss how to get around it. One thought would be to increase the refcount of the connection object in remoteClientCloseFunc before calling the closeCallback. -- Mit freundlichen Grüßen/Kind Regards Viktor Mihajlovski IBM Deutschland Research & Development GmbH Vorsitzender des Aufsichtsrats: Martina Köderitz Geschäftsführung: Dirk Wittkopp Sitz der Gesellschaft: Böblingen Registergericht: Amtsgericht Stuttgart, HRB 243294

On 03/12/2013 06:15 PM, Viktor Mihajlovski wrote:
At the same time the event loop will call remoteClientCloseFunc which waits for the mutex while the connection is unrefed. It will obtain the (now invalid but non-error checking) mutex, copy the closeCallback pointer (shorty before the connection object is destroyed) and will die upon trying to call closeFreeCallback (NULL by now).
Minor correction: the closeFreeCallback was NULL from the beginning but that was actually a blessing, because it indirectly helped to uncover this race.
One thought would be to increase the refcount of the connection object in remoteClientCloseFunc before calling the closeCallback.
Increasing the refcount there is too late. I have experimentally increased the refcount before registering remoteClientCloseFunc and decreased it on deregistering, but that leads to the dreaded virsh "leaked reference(s)" message because virConnectClose is always the first to happen. My next attempt would be to reset the closeCallback in virConnectDispose but I need to sleep over it first :-). -- Mit freundlichen Grüßen/Kind Regards Viktor Mihajlovski IBM Deutschland Research & Development GmbH Vorsitzender des Aufsichtsrats: Martina Köderitz Geschäftsführung: Dirk Wittkopp Sitz der Gesellschaft: Böblingen Registergericht: Amtsgericht Stuttgart, HRB 243294
participants (1)
-
Viktor Mihajlovski