On 10.08.2020 20:40, Daniel Henrique Barboza wrote:
On 7/23/20 7:14 AM, Nikolay Shirokovskiy wrote:
> We are dropping the only reference here so that the event loop thread
> is going to be exited synchronously. In order to avoid deadlocks we
> need to unlock the VM so that any handler being called can finish
> execution and thus the event loop thread can finish too.
>
> Signed-off-by: Nikolay Shirokovskiy <nshirokovskiy(a)virtuozzo.com>
> ---
> src/qemu/qemu_domain.c | 18 ++++++++++++++----
> 1 file changed, 14 insertions(+), 4 deletions(-)
>
> diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
> index 5b22eb2..82b3d11 100644
> --- a/src/qemu/qemu_domain.c
> +++ b/src/qemu/qemu_domain.c
> @@ -1637,11 +1637,21 @@ void
> qemuDomainObjStopWorker(virDomainObjPtr dom)
> {
> qemuDomainObjPrivatePtr priv = dom->privateData;
> + virEventThread *eventThread;
> - if (priv->eventThread) {
> - g_object_unref(priv->eventThread);
> - priv->eventThread = NULL;
> - }
> + if (!priv->eventThread)
> + return;
> +
> + /*
> + * We are dropping the only reference here so that the event loop thread
> + * is going to be exited synchronously. In order to avoid deadlocks we
> + * need to unlock the VM so that any handler being called can finish
> + * execution and thus the event loop thread can finish too.
> + */
> + eventThread = g_steal_pointer(&priv->eventThread);
> + virObjectUnlock(dom);
> + g_object_unref(eventThread);
> + virObjectLock(dom);
I understand the intention of the code thanks to the comment just before
it, but this is not robust. This unlock -> do stuff -> lock pattern only
works if the caller of qemuDomainObjStopWorker() is holding the mutex.
I see that this is the case for qemuDomainObjStopWorkerIter(), but
qemuDomainObjStopWorker() is also called by qemuProcessStop(). qemuProcessStop()
does not do any mutex lock/unlock itself, so you'll be assuming that all
callers of qemuProcessStop() hold the mutex before calling it. One of them is
qemuProcessInit(), which calls qemuProcessStop() in the 'stop' jump, and there
is no mutex lock beforehand either.
Now we're assuming that all callers of qemuProcessInit() hold the mutex
beforehand. In a quick glance at the code I saw at least 2 instances where
this isn't the case, so then we'll need to assume that the callers of those
functions hold the mutex. So on and so forth.
Given that this code only makes sense when called from
qemuDomainObjStopWorkerIter(), I'd suggest removing the lock/unlock from this
function (turning it into just a call to qemuDomainObjStopWorker()) and moving
them inside the body of qemuDomainObjStopWorker(), locking and unlocking the
mutex within the scope of the same function.
Hi, Daniel.

Actually all callers of qemuProcessStop() hold the lock. Moreover, they even
hold the job condition, and calling qemuProcessStop() without the lock/job
condition would be a mistake AFAIU (qemuConnectDomainXMLToNative is the only
exception I see that holds the lock but not the job condition, but that
function creates a new vm object that is not shared with other threads).
I understand your concern, but there are precedents. Take
virDomainObjListRemove, for example. The lock requirements are documented for
virDomainObjListRemove and I can do the same for qemuDomainObjStopWorker too,
but it looks like over-documenting to me.

Nikolay