
On 10/23/2015 10:17 PM, Wang Yufei wrote:
On 2015/10/23 23:32, John Ferlan wrote:
On 10/13/2015 02:36 AM, Wang Yufei wrote:
On 2015/10/2 20:17, John Ferlan wrote:
On 09/26/2015 08:18 AM, Wang Yufei wrote:
We shut down a VM A via the qemu agent; meanwhile an agent EOF for VM A arrived, so there's a chance that a deadlock occurred:
qemuProcessHandleAgentEOF in main thread:
    A) priv->agent = NULL;        /* A happened before B */
       qemuAgentClose(agent);     /* deadlocks: the agent lock is held by the worker thread */

qemuDomainObjExitAgent (called from qemuDomainShutdownFlags) in worker thread:
    B) hasRefs = virObjectUnref(priv->agent);   /* priv->agent is NULL, returns false */
       if (hasRefs)
           virObjectUnlock(priv->agent);        /* the agent lock is never released here */
So I close the agent first, then set priv->agent to NULL, to fix the deadlock.
Signed-off-by: Wang Yufei <james.wangyufei@huawei.com>
Reviewed-by: Ren Guannan <renguannan@huawei.com>
---
 src/qemu/qemu_process.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)
Interesting - this is the exact opposite of commit id '1020a504' from Michal over 3 years ago.
However, a bit of digging into the claim from the commit message drove me to commit id '362d04779c' which removes the domain object lock that was the basis of the initial patch.
While I'm not an expert in the vm agent code, I do note that the HandleAgentEOF code checks for !priv->agent and priv->beingDestroyed, while the ExitAgent code doesn't necessarily (or directly) check whether priv->agent is still valid (IOW: that nothing else, such as the EOF handler, has already removed it).
So, while I don't discount that the patch works - I'm wondering whether the smarts/logic should be built into ExitAgent to check for !priv->agent (and/or something else) that would indicate we're already on the path of shutdown.
John
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
index f2586a1..8c9622e 100644
--- a/src/qemu/qemu_process.c
+++ b/src/qemu/qemu_process.c
@@ -150,11 +150,10 @@ qemuProcessHandleAgentEOF(qemuAgentPtr agent,
         goto unlock;
     }
 
+    qemuAgentClose(agent);
     priv->agent = NULL;
     virObjectUnlock(vm);
-
-    qemuAgentClose(agent);
     return;
 
  unlock:
Thank you for your reply.
First, we should get the logic right. In my opinion, we should call qemuAgentClose first and then set priv->agent to NULL, just like the logic in qemuProcessStop:
    if (priv->agent) {
        qemuAgentClose(priv->agent);
        priv->agent = NULL;
        priv->agentError = false;
    }
Based on that logic, we then pick the right locking order, which is what I have shown in my patch.
Should the smarts/logic be built into ExitAgent to check for !priv->agent (and/or something else) that would indicate we're already on the path of shutdown?
The answer is yes, we have to: in ExitAgent we use the agent lock, which may already have been released (and the agent freed) by another thread, e.g. via qemuAgentClose, so we have to check the agent first to make sure it's safe to access.
It doesn't feel right that moving qemuAgentClose inside the section where the vm lock is held would fix the issue. Or does virCondSignal (e.g. pthread_cond_signal or pthread_cond_broadcast) release the mutex that would be held by the thread that did the EnterAgent?
So let's go back to your scenario again where A happens before B... As I see it, the bug is in B, which accesses and unreferences priv->agent *without* first getting the vm lock (like other instances), assuming that priv->agent could still be valid (or the same) between the time the vm was unlocked in EnterAgent and the point we have reached in ExitAgent.
So the bug would then seem to be in the EnterAgent, run Agent command, ExitAgent sequence. Since we release the vm lock during EnterAgent, "anything" that could alter the priv->agent setting while the lock is not held needs to be handled.
Thus perhaps the fix should be (in all callers):

    agent = priv->agent;
    EnterAgent(vm);
    /* run agent command */
    ExitAgent(vm, agent);
where ExitAgent then has:

    virObjectLock(obj);
    if (priv->agent != agent) {
        VIR_DEBUG("priv->agent=%p differs from agent=%p",
                  priv->agent, agent);
        virObjectUnlock(agent);
        virObjectUnlock(obj);
        return;
    }

    if ((hasRefs = virObjectUnref(priv->agent)))
        virObjectUnlock(priv->agent);

    VIR_DEBUG("Exited agent (agent=%p vm=%p name=%s hasRefs=%d)",
              priv->agent, obj, obj->def->name, hasRefs);

    priv->agentStart = 0;
    if (!hasRefs)
        priv->agent = NULL;
Does this seem reasonable?
John
Let's look at the lock order if we modify it in your way:

Thread B:
    acquire agent lock (qemuDomainObjEnterAgent)
    acquire vm lock (qemuDomainObjExitAgent, still holding the agent lock)

Thread A:
    acquire vm lock (qemuProcessHandleAgentEOF)
    acquire agent lock (qemuAgentClose, still holding the vm lock)

Boom, deadlock! Am I right? We must always acquire the locks in the same order.
First off - I've removed the fix you proposed from my environment...

If Thread B has the agent-lock and gets the vm-lock, then Thread A cannot get the vm-lock until Thread B releases it. So how/when does Thread A get the agent lock (qemuAgentClose)? Only once Thread B is done with it, during virDomainObjEndAPI. So where's the deadlock? By that time, Thread B has also released the agent-lock.

Conversely, if Thread A (EOF) gets the vm-lock, then Thread B cannot get it until Thread A releases it. Once Thread A releases it and Thread B gets it, Thread B can determine whether priv->agent is the same (or still valid). In the EOF code you "could" have the agent-lock on entrance, but you need the vm-lock in order to clear the agent out of the vm, calling qemuAgentClose with the agent passed from the caller (either a shutdown or some sort of catastrophic failure).

So without any changes we currently have the following:

EnterAgent:
    w/ vm-lock, get agent-lock
    vm-unlock

ExitAgent:
    w/ agent-lock, unref/check agent refs
    if there are refs, unlock-agent
    get vm-lock
    adjust vm agent related fields

EOF:
    (may have agent-lock)
    acquire vm-lock
    check vm agent fields
    clear vm->priv->agent
    unlock-vm
    qemuAgentClose(agent)

I think the current ordering is correct; however, the problem is that ExitAgent *assumes* priv->agent is valid without getting the vm-lock. As you pointed out in your commit message, if EOF runs first, it clears priv->agent (with the vm-lock held). If we drop back even further: once we release the vm-lock after EnterAgent, we should not assume priv->agent is still valid...

Probably easier if I post some patches to continue the debate from there.

John