On 10/23/2015 10:17 PM, Wang Yufei wrote:
On 2015/10/23 23:32, John Ferlan wrote:
>
>
> On 10/13/2015 02:36 AM, Wang Yufei wrote:
>> On 2015/10/2 20:17, John Ferlan wrote:
>>
>>>
>>>
>>> On 09/26/2015 08:18 AM, Wang Yufei wrote:
>>>> We shut down a VM A via the qemu agent; meanwhile, an agent EOF
>>>> for VM A happened, and there's a chance that a deadlock occurred:
>>>>
>>>> qemuProcessHandleAgentEOF in main thread
>>>> A) priv->agent = NULL; //A happened before B
>>>>
>>>> // deadlock when we try to get the agent lock, which is held by the worker thread
>>>> qemuAgentClose(agent);
>>>>
>>>> qemuDomainObjExitAgent called by qemuDomainShutdownFlags in worker thread
>>>> B) hasRefs = virObjectUnref(priv->agent); // priv->agent is NULL, returns false
>>>>
>>>> if (hasRefs)
>>>>     virObjectUnlock(priv->agent); // agent lock will not be released here
>>>>
>>>> So I close the agent first, then set priv->agent to NULL to fix the
>>>> deadlock.
>>>>
>>>> Signed-off-by: Wang Yufei <james.wangyufei(a)huawei.com>
>>>> Reviewed-by: Ren Guannan <renguannan(a)huawei.com>
>>>> ---
>>>> src/qemu/qemu_process.c | 3 +--
>>>> 1 file changed, 1 insertion(+), 2 deletions(-)
>>>>
>>>
>>> Interesting - this is the exact opposite of commit id '1020a504' from
>>> Michal over 3 years ago.
>>>
>>> However, a bit of digging into the claim from the commit message drove
>>> me to commit id '362d04779c', which removes the domain object lock that
>>> was the basis of the initial patch.
>>>
>>> While I'm not an expert in the vmagent code, I do note that the
>>> HandleAgentEOF code checks for !priv->agent and priv->beingDestroyed,
>>> while the ExitAgent code doesn't necessarily (or directly) check whether
>>> the priv->agent is still valid (IOW: that nothing else has removed it
>>> already like the EOF).
>>>
>>> So, while I don't discount that the patch works - I'm wondering whether
>>> the smarts/logic should be built into ExitAgent to check for
>>> !priv->agent (and/or something else) that would indicate we're already
>>> on the path of shutdown.
>>>
>>> John
>>>
>>>
>>>> diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
>>>> index f2586a1..8c9622e 100644
>>>> --- a/src/qemu/qemu_process.c
>>>> +++ b/src/qemu/qemu_process.c
>>>> @@ -150,11 +150,10 @@ qemuProcessHandleAgentEOF(qemuAgentPtr agent,
>>>> goto unlock;
>>>> }
>>>>
>>>> + qemuAgentClose(agent);
>>>> priv->agent = NULL;
>>>>
>>>> virObjectUnlock(vm);
>>>> -
>>>> - qemuAgentClose(agent);
>>>> return;
>>>>
>>>> unlock:
>>>>
>>>
>>>
>>
>> Thank you for your reply.
>>
>> First, we should consider the right logic. In my opinion, we should call
>> qemuAgentClose first, then set priv->agent to NULL, just like the logic
>> in qemuProcessStop:
>>
>> if (priv->agent) {
>>     qemuAgentClose(priv->agent);
>>     priv->agent = NULL;
>>     priv->agentError = false;
>> }
>>
>> Based on the right logic, we can consider the right locking, which I have
>> shown in my patch.
>>
>> Should the smarts/logic be built into ExitAgent to check for !priv->agent
>> (and/or something else) that would indicate we're already on the path of
>> shutdown?
>>
>> The answer is yes, we have to, because in ExitAgent we use the agent lock,
>> which may be released by another thread (e.g. in qemuAgentClose), so we
>> have to check the agent first to make sure it's safe to access.
>>
>>
>
> It doesn't feel right that moving qemuAgentClose to inside where the vm
> lock is held will fix the issue. Or does the virCondSignal (e.g.
> pthread_cond_signal or pthread_cond_broadcast) release the mutex that
> would be held by the thread that did the EnterAgent?
>
> So let's go back to your scenario again where A happens before B... As
> I see it, the bug is in B which has accesses and unreferences
> priv->agent *without* first getting the vm lock (like other instances),
> and assuming that priv->agent could still be valid (or the same) between
> the time the vm was unlocked in EnterAgent and the point at which we are
> at in ExitAgent.
>
> So the bug would then seem to be in the EnterAgent, run Agent command,
> ExitAgent sequence. Since we release the vm lock during EnterAgent,
> "anything" that could alter the priv->agent setting while the lock is
> not held needs to be handled.
>
> Thus perhaps the fix should be (in all callers)
>
>     agent = priv->agent;
>     EnterAgent(vm);
>     ... *Agent* Command ...
>     ExitAgent(vm, agent);
>
> Where ExitAgent then has:
>
>     virObjectLock(obj);
>     if (priv->agent != agent) {
>         VIR_DEBUG("priv->agent=%p differs from agent=%p",
>                   priv->agent, agent);
>         virObjectUnlock(agent);
>         virObjectUnlock(obj);
>         return;
>     }
>
>     if ((hasRefs = virObjectUnref(priv->agent)))
>         virObjectUnlock(priv->agent);
>
>     VIR_DEBUG("Exited agent (agent=%p vm=%p name=%s hasRefs=%d)",
>               priv->agent, obj, obj->def->name, hasRefs);
>
>     priv->agentStart = 0;
>     if (!hasRefs)
>         priv->agent = NULL;
>
>
> Does this seem reasonable?
>
> John
>
Let's look at the lock ordering if we modify it your way:

Thread B
    acquire agent lock (qemuDomainObjEnterAgent)
    acquire vm lock (qemuDomainObjExitAgent, still holding the agent lock)
Thread A
    acquire vm lock (qemuProcessHandleAgentEOF)
    acquire agent lock (qemuAgentClose, still holding the vm lock)

Boom, deadlock! Am I right? We must always acquire the locks in the same order.
First off - I've removed the fix you proposed from my environment...
If Thread B has the agent-lock and gets vm-lock, then Thread A cannot
get vm-lock until Thread B releases.
So how/when does Thread A get the agent lock (qemuAgentClose)? Well, once
Thread B is done with it during virDomainObjEndAPI. So where's the
deadlock? By that time, Thread B has also released the agent-lock.
Conversely, if Thread A (EOF) gets the vm-lock, then Thread B cannot get
it until Thread A releases. Once Thread A releases and Thread B gets it,
then it can determine whether the priv->agent is the same (or valid).
In the EOF code you "could" hold the agent-lock on entrance, but you
need the vm-lock in order to clear the agent out of the vm.
qemuAgentClose is then called with the agent passed in from the caller
(either on a shutdown or on some sort of catastrophic failure).
So without any changes we currently have the following:

EnterAgent w/ vm-lock
    get agent-lock
    vm-unlock
ExitAgent w/ agent-lock
    unref/check agent refs
    if there are refs
        unlock-agent
    get vm-lock
    adjust vm agent related fields
EOF (may have agent-lock)
    acquire vm-lock
    check vm agent fields
    clear vm->priv->agent
    unlock-vm
    AgentClose(agent)
I think the current ordering is correct; however, the problem is that
ExitAgent *assumes* priv->agent is valid without getting the vm-lock.
As you pointed out in your commit message, if EOF runs, it clears
priv->agent (with the vm-lock held). Dropping back even further: once we
release the vm-lock after EnterAgent, we should not 'assume' that
priv->agent is still valid...
Probably easier if I post some patches to continue debate from there.
John