[libvirt] [PATCH] qemu_agent: fix deadlock in qemuProcessHandleAgentEOF

We shutdown a VM A by qemu agent,meanwhile an agent EOF of VM A happened, there's a chance that deadlock occurred: qemuProcessHandleAgentEOF in main thread A) priv->agent = NULL; //A happened before B //deadlock when we get agent lock which's held by worker thread qemuAgentClose(agent); qemuDomainObjExitAgent called by qemuDomainShutdownFlags in worker thread B) hasRefs = virObjectUnref(priv->agent); //priv->agent is NULL, return false if (hasRefs) virObjectUnlock(priv->agent); //agent lock will not be released here So I close agent first, then set priv->agent NULL to fix the deadlock. Signed-off-by: Wang Yufei <james.wangyufei@huawei.com> Reviewed-by: Ren Guannan <renguannan@huawei.com> --- src/qemu/qemu_process.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index f2586a1..8c9622e 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -150,11 +150,10 @@ qemuProcessHandleAgentEOF(qemuAgentPtr agent, goto unlock; } + qemuAgentClose(agent); priv->agent = NULL; virObjectUnlock(vm); - - qemuAgentClose(agent); return; unlock: -- 1.8.3.4

On 09/26/2015 08:18 AM, Wang Yufei wrote:
We shutdown a VM A by qemu agent,meanwhile an agent EOF of VM A happened, there's a chance that deadlock occurred:
qemuProcessHandleAgentEOF in main thread A) priv->agent = NULL; //A happened before B
//deadlock when we get agent lock which's held by worker thread qemuAgentClose(agent);
qemuDomainObjExitAgent called by qemuDomainShutdownFlags in worker thread B) hasRefs = virObjectUnref(priv->agent); //priv->agent is NULL, return false
if (hasRefs) virObjectUnlock(priv->agent); //agent lock will not be released here
So I close agent first, then set priv->agent NULL to fix the deadlock.
Signed-off-by: Wang Yufei <james.wangyufei@huawei.com> Reviewed-by: Ren Guannan <renguannan@huawei.com> --- src/qemu/qemu_process.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)
Interesting - this is the exact opposite of commit id '1020a504' from Michal over 3 years ago. However, a bit of digging into the claim from the commit message drove me to commit id '362d04779c' which removes the domain object lock that was the basis of the initial patch. While I'm not an expert in the vmagent code, I do note that the HandleAgentEOF code checks for !priv->agent and priv->beingDestroyed, while the ExitAgent code doesn't necessarily (or directly) check whether the priv->agent is still valid (IOW: that nothing else has removed it already like the EOF). So, while I don't discount that the patch works - I'm wondering whether the smarts/logic should be built into ExitAgent to check for !priv->agent (and or something else) that would indicate we're already on the path of shutdown. John
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index f2586a1..8c9622e 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -150,11 +150,10 @@ qemuProcessHandleAgentEOF(qemuAgentPtr agent, goto unlock; }
+ qemuAgentClose(agent); priv->agent = NULL;
virObjectUnlock(vm); - - qemuAgentClose(agent); return;
unlock:

On 2015/10/2 20:17, John Ferlan wrote:
On 09/26/2015 08:18 AM, Wang Yufei wrote:
We shutdown a VM A by qemu agent,meanwhile an agent EOF of VM A happened, there's a chance that deadlock occurred:
qemuProcessHandleAgentEOF in main thread A) priv->agent = NULL; //A happened before B
//deadlock when we get agent lock which's held by worker thread qemuAgentClose(agent);
qemuDomainObjExitAgent called by qemuDomainShutdownFlags in worker thread B) hasRefs = virObjectUnref(priv->agent); //priv->agent is NULL, return false
if (hasRefs) virObjectUnlock(priv->agent); //agent lock will not be released here
So I close agent first, then set priv->agent NULL to fix the deadlock.
Signed-off-by: Wang Yufei <james.wangyufei@huawei.com> Reviewed-by: Ren Guannan <renguannan@huawei.com> --- src/qemu/qemu_process.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)
Interesting - this is the exact opposite of commit id '1020a504' from Michal over 3 years ago.
However, a bit of digging into the claim from the commit message drove me to commit id '362d04779c' which removes the domain object lock that was the basis of the initial patch.
While I'm not an expert in the vmagent code, I do note that the HandleAgentEOF code checks for !priv->agent and priv->beingDestroyed, while the ExitAgent code doesn't necessarily (or directly) check whether the priv->agent is still valid (IOW: that nothing else has removed it already like the EOF).
So, while I don't discount that the patch works - I'm wondering whether the smarts/logic should be built into ExitAgent to check for !priv->agent (and or something else) that would indicate we're already on the path of shutdown.
John
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index f2586a1..8c9622e 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -150,11 +150,10 @@ qemuProcessHandleAgentEOF(qemuAgentPtr agent, goto unlock; }
+ qemuAgentClose(agent); priv->agent = NULL;
virObjectUnlock(vm); - - qemuAgentClose(agent); return;
unlock:
.
Thank you for your reply. At first, we should consider about the right logic. In my oppinion, we should call qemuAgentClose at first, then we set priv->agent NULL, just like the logic in qemuProcessStop. if (priv->agent) { qemuAgentClose(priv->agent); priv->agent = NULL; priv->agentError = false; } Base on the right logic, we consider about the right lock which I have shown in my patch. Whether the smarts/logic should be built into ExitAgent to check for !priv->agent (and or something else) that would indicate we're already on the path of shutdown? The answer is yes, we have to, because in ExitAgent we use agent lock which may be released by other thread like qemuAgentClose, we have to check the agent first to make sure it's safe to visit.

On 10/13/2015 02:36 AM, Wang Yufei wrote:
On 2015/10/2 20:17, John Ferlan wrote:
On 09/26/2015 08:18 AM, Wang Yufei wrote:
We shutdown a VM A by qemu agent,meanwhile an agent EOF of VM A happened, there's a chance that deadlock occurred:
qemuProcessHandleAgentEOF in main thread A) priv->agent = NULL; //A happened before B
//deadlock when we get agent lock which's held by worker thread qemuAgentClose(agent);
qemuDomainObjExitAgent called by qemuDomainShutdownFlags in worker thread B) hasRefs = virObjectUnref(priv->agent); //priv->agent is NULL, return false
if (hasRefs) virObjectUnlock(priv->agent); //agent lock will not be released here
So I close agent first, then set priv->agent NULL to fix the deadlock.
Signed-off-by: Wang Yufei <james.wangyufei@huawei.com> Reviewed-by: Ren Guannan <renguannan@huawei.com> --- src/qemu/qemu_process.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)
Interesting - this is the exact opposite of commit id '1020a504' from Michal over 3 years ago.
However, a bit of digging into the claim from the commit message drove me to commit id '362d04779c' which removes the domain object lock that was the basis of the initial patch.
While I'm not an expert in the vmagent code, I do note that the HandleAgentEOF code checks for !priv->agent and priv->beingDestroyed, while the ExitAgent code doesn't necessarily (or directly) check whether the priv->agent is still valid (IOW: that nothing else has removed it already like the EOF).
So, while I don't discount that the patch works - I'm wondering whether the smarts/logic should be built into ExitAgent to check for !priv->agent (and or something else) that would indicate we're already on the path of shutdown.
John
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index f2586a1..8c9622e 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -150,11 +150,10 @@ qemuProcessHandleAgentEOF(qemuAgentPtr agent, goto unlock; }
+ qemuAgentClose(agent); priv->agent = NULL;
virObjectUnlock(vm); - - qemuAgentClose(agent); return;
unlock:
.
Thank you for your reply.
At first, we should consider about the right logic. In my oppinion, we should call qemuAgentClose at first, then we set priv->agent NULL, just like the logic in qemuProcessStop.
if (priv->agent) { qemuAgentClose(priv->agent); priv->agent = NULL; priv->agentError = false; }
Base on the right logic, we consider about the right lock which I have shown in my patch.
Whether the smarts/logic should be built into ExitAgent to check for !priv->agent (and or something else) that would indicate we're already on the path of shutdown?
The answer is yes, we have to, because in ExitAgent we use agent lock which may be released by other thread like qemuAgentClose, we have to check the agent first to make sure it's safe to visit.
It doesn't feel right that moving qemuAgentClose to inside where the vm lock is held will fix the issue. Or does the virCondSignal (e.g. pthread_cond_signal or pthread_cond_broadcast) release the mutex that would be held by the thread that did the EnterAgent? So let's go back to your scenario again where A happens before B... As I see it, the bug is in B which has accesses and unreferences priv->agent *without* first getting the vm lock (like other instances), and assuming that priv->agent could still be valid (or the same) between the time the vm was unlocked in EnterAgent and the point at which we are at in ExitAgent. So the bug would then seem to be in the EnterAgent, run Agent command, ExitAgent sequence. Since we release the vm lock during EnterAgent, "anything" that could alter the priv->agent setting while the lock is not held needs to be handled. Thus perhaps the fix should be (in all callers) agent = priv->agent; EnterAgent(vm); *Agent* Command ExitAgent(vm, agent) Where in ExitAgent then has: virObjectLock(obj); if (priv->agent != agent) VIR_DEBUG("priv->agent=%p differs from agent=%p" priv->agent, agent); virObjectUnlock(agent); return; } if ((hasRefs = virObjectUnref(priv->agent))) virObjectUnlock(priv->agent); VIR_DEBUG("Exited agent (agent=%p vm=%p name=%s hasRefs=%d)", priv->agent, obj, obj->def->name, hasRefs); priv->agentStart = 0; if (!hasRefs) priv->agent = NULL; Does this seem reasonable? John

On 2015/10/23 23:32, John Ferlan wrote:
On 10/13/2015 02:36 AM, Wang Yufei wrote:
On 2015/10/2 20:17, John Ferlan wrote:
On 09/26/2015 08:18 AM, Wang Yufei wrote:
We shutdown a VM A by qemu agent,meanwhile an agent EOF of VM A happened, there's a chance that deadlock occurred:
qemuProcessHandleAgentEOF in main thread A) priv->agent = NULL; //A happened before B
//deadlock when we get agent lock which's held by worker thread qemuAgentClose(agent);
qemuDomainObjExitAgent called by qemuDomainShutdownFlags in worker thread B) hasRefs = virObjectUnref(priv->agent); //priv->agent is NULL, return false
if (hasRefs) virObjectUnlock(priv->agent); //agent lock will not be released here
So I close agent first, then set priv->agent NULL to fix the deadlock.
Signed-off-by: Wang Yufei <james.wangyufei@huawei.com> Reviewed-by: Ren Guannan <renguannan@huawei.com> --- src/qemu/qemu_process.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)
Interesting - this is the exact opposite of commit id '1020a504' from Michal over 3 years ago.
However, a bit of digging into the claim from the commit message drove me to commit id '362d04779c' which removes the domain object lock that was the basis of the initial patch.
While I'm not an expert in the vmagent code, I do note that the HandleAgentEOF code checks for !priv->agent and priv->beingDestroyed, while the ExitAgent code doesn't necessarily (or directly) check whether the priv->agent is still valid (IOW: that nothing else has removed it already like the EOF).
So, while I don't discount that the patch works - I'm wondering whether the smarts/logic should be built into ExitAgent to check for !priv->agent (and or something else) that would indicate we're already on the path of shutdown.
John
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index f2586a1..8c9622e 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -150,11 +150,10 @@ qemuProcessHandleAgentEOF(qemuAgentPtr agent, goto unlock; }
+ qemuAgentClose(agent); priv->agent = NULL;
virObjectUnlock(vm); - - qemuAgentClose(agent); return;
unlock:
.
Thank you for your reply.
At first, we should consider about the right logic. In my oppinion, we should call qemuAgentClose at first, then we set priv->agent NULL, just like the logic in qemuProcessStop.
if (priv->agent) { qemuAgentClose(priv->agent); priv->agent = NULL; priv->agentError = false; }
Base on the right logic, we consider about the right lock which I have shown in my patch.
Whether the smarts/logic should be built into ExitAgent to check for !priv->agent (and or something else) that would indicate we're already on the path of shutdown?
The answer is yes, we have to, because in ExitAgent we use agent lock which may be released by other thread like qemuAgentClose, we have to check the agent first to make sure it's safe to visit.
It doesn't feel right that moving qemuAgentClose to inside where the vm lock is held will fix the issue. Or does the virCondSignal (e.g. pthread_cond_signal or pthread_cond_broadcast) release the mutex that would be held by the thread that did the EnterAgent?
So let's go back to your scenario again where A happens before B... As I see it, the bug is in B which has accesses and unreferences priv->agent *without* first getting the vm lock (like other instances), and assuming that priv->agent could still be valid (or the same) between the time the vm was unlocked in EnterAgent and the point at which we are at in ExitAgent.
So the bug would then seem to be in the EnterAgent, run Agent command, ExitAgent sequence. Since we release the vm lock during EnterAgent, "anything" that could alter the priv->agent setting while the lock is not held needs to be handled.
Thus perhaps the fix should be (in all callers)
agent = priv->agent; EnterAgent(vm); *Agent* Command ExitAgent(vm, agent)
Where in ExitAgent then has:
virObjectLock(obj); if (priv->agent != agent) VIR_DEBUG("priv->agent=%p differs from agent=%p" priv->agent, agent); virObjectUnlock(agent); return; }
if ((hasRefs = virObjectUnref(priv->agent))) virObjectUnlock(priv->agent);
VIR_DEBUG("Exited agent (agent=%p vm=%p name=%s hasRefs=%d)", priv->agent, obj, obj->def->name, hasRefs);
priv->agentStart = 0; if (!hasRefs) priv->agent = NULL;
Does this seem reasonable?
John
.
Let's see the order of lock if we modify it in your way: Thread B require agent lock (qemuDomainObjEnterAgent) require vm lock (qemuDomainObjExitAgent, still hold agent lock) Thread A require vm lock (qemuProcessHandleAgentEOF) require agent lock (qemuAgentClose, still hold vm lock) Bomb, dead lock! Am I right? We must keep lock require in the same order.

On 10/23/2015 10:17 PM, Wang Yufei wrote:
On 2015/10/23 23:32, John Ferlan wrote:
On 10/13/2015 02:36 AM, Wang Yufei wrote:
On 2015/10/2 20:17, John Ferlan wrote:
On 09/26/2015 08:18 AM, Wang Yufei wrote:
We shutdown a VM A by qemu agent,meanwhile an agent EOF of VM A happened, there's a chance that deadlock occurred:
qemuProcessHandleAgentEOF in main thread A) priv->agent = NULL; //A happened before B
//deadlock when we get agent lock which's held by worker thread qemuAgentClose(agent);
qemuDomainObjExitAgent called by qemuDomainShutdownFlags in worker thread B) hasRefs = virObjectUnref(priv->agent); //priv->agent is NULL, return false
if (hasRefs) virObjectUnlock(priv->agent); //agent lock will not be released here
So I close agent first, then set priv->agent NULL to fix the deadlock.
Signed-off-by: Wang Yufei <james.wangyufei@huawei.com> Reviewed-by: Ren Guannan <renguannan@huawei.com> --- src/qemu/qemu_process.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)
Interesting - this is the exact opposite of commit id '1020a504' from Michal over 3 years ago.
However, a bit of digging into the claim from the commit message drove me to commit id '362d04779c' which removes the domain object lock that was the basis of the initial patch.
While I'm not an expert in the vmagent code, I do note that the HandleAgentEOF code checks for !priv->agent and priv->beingDestroyed, while the ExitAgent code doesn't necessarily (or directly) check whether the priv->agent is still valid (IOW: that nothing else has removed it already like the EOF).
So, while I don't discount that the patch works - I'm wondering whether the smarts/logic should be built into ExitAgent to check for !priv->agent (and or something else) that would indicate we're already on the path of shutdown.
John
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index f2586a1..8c9622e 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -150,11 +150,10 @@ qemuProcessHandleAgentEOF(qemuAgentPtr agent, goto unlock; }
+ qemuAgentClose(agent); priv->agent = NULL;
virObjectUnlock(vm); - - qemuAgentClose(agent); return;
unlock:
.
Thank you for your reply.
At first, we should consider about the right logic. In my oppinion, we should call qemuAgentClose at first, then we set priv->agent NULL, just like the logic in qemuProcessStop.
if (priv->agent) { qemuAgentClose(priv->agent); priv->agent = NULL; priv->agentError = false; }
Base on the right logic, we consider about the right lock which I have shown in my patch.
Whether the smarts/logic should be built into ExitAgent to check for !priv->agent (and or something else) that would indicate we're already on the path of shutdown?
The answer is yes, we have to, because in ExitAgent we use agent lock which may be released by other thread like qemuAgentClose, we have to check the agent first to make sure it's safe to visit.
It doesn't feel right that moving qemuAgentClose to inside where the vm lock is held will fix the issue. Or does the virCondSignal (e.g. pthread_cond_signal or pthread_cond_broadcast) release the mutex that would be held by the thread that did the EnterAgent?
So let's go back to your scenario again where A happens before B... As I see it, the bug is in B which has accesses and unreferences priv->agent *without* first getting the vm lock (like other instances), and assuming that priv->agent could still be valid (or the same) between the time the vm was unlocked in EnterAgent and the point at which we are at in ExitAgent.
So the bug would then seem to be in the EnterAgent, run Agent command, ExitAgent sequence. Since we release the vm lock during EnterAgent, "anything" that could alter the priv->agent setting while the lock is not held needs to be handled.
Thus perhaps the fix should be (in all callers)
agent = priv->agent; EnterAgent(vm); *Agent* Command ExitAgent(vm, agent)
Where in ExitAgent then has:
virObjectLock(obj); if (priv->agent != agent) VIR_DEBUG("priv->agent=%p differs from agent=%p" priv->agent, agent); virObjectUnlock(agent); return; }
if ((hasRefs = virObjectUnref(priv->agent))) virObjectUnlock(priv->agent);
VIR_DEBUG("Exited agent (agent=%p vm=%p name=%s hasRefs=%d)", priv->agent, obj, obj->def->name, hasRefs);
priv->agentStart = 0; if (!hasRefs) priv->agent = NULL;
Does this seem reasonable?
John
.
Let's see the order of lock if we modify it in your way: Thread B require agent lock (qemuDomainObjEnterAgent) require vm lock (qemuDomainObjExitAgent, still hold agent lock)
Thread A require vm lock (qemuProcessHandleAgentEOF) require agent lock (qemuAgentClose, still hold vm lock)
Bomb, dead lock! Am I right? We must keep lock require in the same order.
First off - I've removed the fix you proposed from my environment... If Thread B has the agent-lock and gets vm-lock, then Thread A cannot get vm-lock until Thread B releases. So how/when does Thread A get to agent lock (qemuAgentClose)? Well once Thread B is done with it during virDomainObjEndAPI. So where's the deadlock? By that time, Thread B has also released the agent-lock. Conversely, if Thread A (EOF) gets the vm-lock, then Thread B cannot get it until Thread A releases. Once Thread A releases and Thread B gets it, then it can determine whether the priv->agent is the same (or valid). In the EOF code you "could" have the agent-lock on entrance, but you need the vm-lock in order to clear out the agent from the vm. Calling qemuAgentClose with the agent passed from the caller (either a shutdown or some sort of catastrophic failure). So without any changes we currently have the following: EnterAgent w/ vm-lock get agent-lock vm-unlock ExitAgent w/ agent-lock unref/check agent refs if there are refs unlock-agent get vm-lock adjust vm agent related fields EOF (may have agent-lock) acquire vm-lock check vm agent fields clear vm->priv->agent unlock-vm AgentClose(agent) I think the current ordering is correct; however, the problem is the ExitAgent *assumes* priv->agent is valid without getting the vm-lock. As you pointed out in your commit message, if EOF runs, it clears priv->agent (with the vm-lock). If we drop back even further, once we release the vm-lock after EnterAgent, should not 'assume' priv->agent is valid... Probably easier if I post some patches to continue debate from there. John

On 10/02/2015 08:17 AM, John Ferlan wrote:
On 09/26/2015 08:18 AM, Wang Yufei wrote:
We shutdown a VM A by qemu agent,meanwhile an agent EOF of VM A happened, there's a chance that deadlock occurred:
qemuProcessHandleAgentEOF in main thread A) priv->agent = NULL; //A happened before B
//deadlock when we get agent lock which's held by worker thread qemuAgentClose(agent);
qemuDomainObjExitAgent called by qemuDomainShutdownFlags in worker thread B) hasRefs = virObjectUnref(priv->agent); //priv->agent is NULL, return false
if (hasRefs) virObjectUnlock(priv->agent); //agent lock will not be released here
So I close agent first, then set priv->agent NULL to fix the deadlock.
Signed-off-by: Wang Yufei <james.wangyufei@huawei.com> Reviewed-by: Ren Guannan <renguannan@huawei.com> --- src/qemu/qemu_process.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)
Interesting - this is the exact opposite of commit id '1020a504' from Michal over 3 years ago.
However, a bit of digging into the claim from the commit message drove me to commit id '362d04779c' which removes the domain object lock that was the basis of the initial patch.
While I'm not an expert in the vmagent code, I do note that the HandleAgentEOF code checks for !priv->agent and priv->beingDestroyed, while the ExitAgent code doesn't necessarily (or directly) check whether the priv->agent is still valid (IOW: that nothing else has removed it already like the EOF).
So, while I don't discount that the patch works - I'm wondering whether the smarts/logic should be built into ExitAgent to check for !priv->agent (and or something else) that would indicate we're already on the path of shutdown.
John
Ironically after spinning a few cycles and generating another patch, it seems my initial instincts were correct with respect to commit 362d04779c removing the reason/cause for 1020a504 and thus moving the qemuAgentClose is the correct option. Of course I'm not sure what caused me to doubt my first thoughts and start down the path of trying different ways to resolve this. I think I somehow got it stuck in my head that AgentDestroy still was taking out a lock. Oh well, I did learn something at least with respect to how this code works - so that part is good. I can also answer my own question with respect whether ExitAgent needs to check priv->agent and/or priv->beingDestroyed. With this patch, the EOF code will take out the vm-lock, then attempt to take out the agent-lock, but will be 'stuck' waiting for the AgentExit code to run. Once ExitAgent is called, it will remove it's reference and unlock. I don't believe there's a way for the !hasRefs to be true, so I suppose it could be removed... I'll hold off on pushing to allow pkrempa to make any comments. John
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index f2586a1..8c9622e 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -150,11 +150,10 @@ qemuProcessHandleAgentEOF(qemuAgentPtr agent, goto unlock; }
+ qemuAgentClose(agent); priv->agent = NULL;
virObjectUnlock(vm); - - qemuAgentClose(agent); return;
unlock:
-- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list

On 10/27/2015 05:45 PM, John Ferlan wrote:
On 10/02/2015 08:17 AM, John Ferlan wrote:
On 09/26/2015 08:18 AM, Wang Yufei wrote:
We shutdown a VM A by qemu agent,meanwhile an agent EOF of VM A happened, there's a chance that deadlock occurred:
qemuProcessHandleAgentEOF in main thread A) priv->agent = NULL; //A happened before B
//deadlock when we get agent lock which's held by worker thread qemuAgentClose(agent);
qemuDomainObjExitAgent called by qemuDomainShutdownFlags in worker thread B) hasRefs = virObjectUnref(priv->agent); //priv->agent is NULL, return false
if (hasRefs) virObjectUnlock(priv->agent); //agent lock will not be released here
So I close agent first, then set priv->agent NULL to fix the deadlock.
Signed-off-by: Wang Yufei <james.wangyufei@huawei.com> Reviewed-by: Ren Guannan <renguannan@huawei.com> --- src/qemu/qemu_process.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)
Interesting - this is the exact opposite of commit id '1020a504' from Michal over 3 years ago.
However, a bit of digging into the claim from the commit message drove me to commit id '362d04779c' which removes the domain object lock that was the basis of the initial patch.
While I'm not an expert in the vmagent code, I do note that the HandleAgentEOF code checks for !priv->agent and priv->beingDestroyed, while the ExitAgent code doesn't necessarily (or directly) check whether the priv->agent is still valid (IOW: that nothing else has removed it already like the EOF).
So, while I don't discount that the patch works - I'm wondering whether the smarts/logic should be built into ExitAgent to check for !priv->agent (and or something else) that would indicate we're already on the path of shutdown.
John
Ironically after spinning a few cycles and generating another patch, it seems my initial instincts were correct with respect to commit 362d04779c removing the reason/cause for 1020a504 and thus moving the qemuAgentClose is the correct option.
Of course I'm not sure what caused me to doubt my first thoughts and start down the path of trying different ways to resolve this. I think I somehow got it stuck in my head that AgentDestroy still was taking out a lock. Oh well, I did learn something at least with respect to how this code works - so that part is good. I can also answer my own question with respect whether ExitAgent needs to check priv->agent and/or priv->beingDestroyed.
With this patch, the EOF code will take out the vm-lock, then attempt to take out the agent-lock, but will be 'stuck' waiting for the AgentExit code to run. Once ExitAgent is called, it will remove it's reference and unlock. I don't believe there's a way for the !hasRefs to be true, so I suppose it could be removed...
I'll hold off on pushing to allow pkrempa to make any comments.
John
Hmm.. Seems I got busy doing a couple of other things and forgot about this... So, I'd also like to add some "history" to the commit message (just in case). I have the following queued up - does it make sense? And of course, have I missed some other corner/edge case? John This essentially reverts commit id '1020a504'. It's also of note that commit id '362d0477' notes a possible/rare deadlock similar to what was seen in the monitor in commit id '25f582e3'. However, it seems interceding changes including commit id 'd960d06f' should remove the deadlock issue. With this change, if EOF is called: Get VM lock Check if !priv->agent || priv->beingDestroyed, then unlock VM Call qemuAgentClose Unlock VM When qemuAgentClose is called Get Agent lock If Agent->fd open, close it Unlock Agent Unref Agent qemuDomainObjEnterAgent Enter with VM lock Get Agent lock Increase Agent refcnt Unlock VM After running agent command, calling qemuDomainObjExitAgent Enter with Agent lock Unref Agent If not last reference, unlock Agent Get VM lock If we were in the middle of an EnterAgent, call Agent command, and ExitAgent sequence and the EOF code is triggered, then the EOF code can get the VM lock, make it's checks against !priv->agent || priv->beingDestroyed, and call qemuAgentClose. The CloseAgent would wait to get agent lock. The other thread then will eventually call ExitAgent, release the Agent lock and unref the Agent. Once ExitAgent releases the Agent lock, AgentClose will get the Agent Agent lock, close the fd, unlock the agent, and unref the agent. The final unref would cause deletion of the agent.
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index f2586a1..8c9622e 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -150,11 +150,10 @@ qemuProcessHandleAgentEOF(qemuAgentPtr agent, goto unlock; }
+ qemuAgentClose(agent); priv->agent = NULL;
virObjectUnlock(vm); - - qemuAgentClose(agent); return;
unlock:
-- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
-- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list

On 2015/10/28 5:45, John Ferlan wrote:
On 10/02/2015 08:17 AM, John Ferlan wrote:
On 09/26/2015 08:18 AM, Wang Yufei wrote:
We shutdown a VM A by qemu agent,meanwhile an agent EOF of VM A happened, there's a chance that deadlock occurred:
qemuProcessHandleAgentEOF in main thread A) priv->agent = NULL; //A happened before B
//deadlock when we get agent lock which's held by worker thread qemuAgentClose(agent);
qemuDomainObjExitAgent called by qemuDomainShutdownFlags in worker thread B) hasRefs = virObjectUnref(priv->agent); //priv->agent is NULL, return false
if (hasRefs) virObjectUnlock(priv->agent); //agent lock will not be released here
So I close agent first, then set priv->agent NULL to fix the deadlock.
Signed-off-by: Wang Yufei <james.wangyufei@huawei.com> Reviewed-by: Ren Guannan <renguannan@huawei.com> --- src/qemu/qemu_process.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)
Interesting - this is the exact opposite of commit id '1020a504' from Michal over 3 years ago.
However, a bit of digging into the claim from the commit message drove me to commit id '362d04779c' which removes the domain object lock that was the basis of the initial patch.
While I'm not an expert in the vmagent code, I do note that the HandleAgentEOF code checks for !priv->agent and priv->beingDestroyed, while the ExitAgent code doesn't necessarily (or directly) check whether the priv->agent is still valid (IOW: that nothing else has removed it already like the EOF).
So, while I don't discount that the patch works - I'm wondering whether the smarts/logic should be built into ExitAgent to check for !priv->agent (and or something else) that would indicate we're already on the path of shutdown.
John
Ironically after spinning a few cycles and generating another patch, it seems my initial instincts were correct with respect to commit 362d04779c removing the reason/cause for 1020a504 and thus moving the qemuAgentClose is the correct option.
Of course I'm not sure what caused me to doubt my first thoughts and start down the path of trying different ways to resolve this. I think I somehow got it stuck in my head that AgentDestroy still was taking out a lock. Oh well, I did learn something at least with respect to how this code works - so that part is good. I can also answer my own question with respect whether ExitAgent needs to check priv->agent and/or priv->beingDestroyed.
With this patch, the EOF code will take out the vm-lock, then attempt to take out the agent-lock, but will be 'stuck' waiting for the AgentExit code to run. Once ExitAgent is called, it will remove it's reference and unlock. I don't believe there's a way for the !hasRefs to be true, so I suppose it could be removed...
I'll hold off on pushing to allow pkrempa to make any comments.
John
Thanks all! At last we do something correct, that's enough.
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index f2586a1..8c9622e 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -150,11 +150,10 @@ qemuProcessHandleAgentEOF(qemuAgentPtr agent, goto unlock; }
+ qemuAgentClose(agent); priv->agent = NULL;
virObjectUnlock(vm); - - qemuAgentClose(agent); return;
unlock:
-- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
.
participants (2)
-
John Ferlan
-
Wang Yufei