[libvirt] [PATCH] qemu: Resolve agent deadlock

If we have a shutdown of a VM by a qemu agent while an agent EOF occurs at relatively the same time, it's possible that a deadlock occurs depending on what happens first. Assume qemuProcessHandleAgentEOF runs first, with the vm->lock it clears priv->agent, then clears the vm->lock. Later, an agent call such as qemuDomainShutdownFlags will call qemuDomainObjExitAgent in a worker thread while holding the agent lock. It will then reference the priv->agent field:

    hasRefs = virObjectUnref(priv->agent);

in order to determine whether there are references. However, since priv->agent is NULL, this returns false, so the subsequent:

    if (hasRefs)
        virObjectUnlock(priv->agent);

will not unlock the agent. Eventually the EOF thread will attempt to call qemuAgentClose, which will try to lock the agent, but cannot since it's still locked.

This patch resolves the issue by keeping a copy of the priv->agent address prior to releasing the vm->lock in qemuDomainObjEnterAgent. Then, rather than assuming priv->agent is valid, use the copy of the agent pointer for the subsequent agent call and qemuDomainObjExitAgent. Within qemuDomainObjExitAgent, use the agent copy in order to release the agent lock, then get the vm->lock again and decide whether to clear priv->agent based on whether any references still exist.

Signed-off-by: John Ferlan <jferlan@redhat.com>
---
Recent upstream patch:

  http://www.redhat.com/archives/libvir-list/2015-September/msg00971.html

attempted to undo commit id '1020a504', which appears to be the wrong approach since it would still hold the vm->lock during qemuAgentClose.

 src/qemu/qemu_domain.c | 11 +++---
 src/qemu/qemu_domain.h |  4 +--
 src/qemu/qemu_driver.c | 90 +++++++++++++++++++++++++++++++++-----------------
 3 files changed, 67 insertions(+), 38 deletions(-)

diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index 890d8ed..a908df8 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -1868,19 +1868,20 @@ qemuDomainObjEnterAgent(virDomainObjPtr obj)
  * Should be paired with an earlier qemuDomainObjEnterAgent() call
  */
 void
-qemuDomainObjExitAgent(virDomainObjPtr obj)
+qemuDomainObjExitAgent(virDomainObjPtr obj,
+                       qemuAgentPtr agent)
 {
     qemuDomainObjPrivatePtr priv = obj->privateData;
     bool hasRefs;
 
-    hasRefs = virObjectUnref(priv->agent);
+    hasRefs = virObjectUnref(agent);
 
     if (hasRefs)
-        virObjectUnlock(priv->agent);
+        virObjectUnlock(agent);
 
     virObjectLock(obj);
-    VIR_DEBUG("Exited agent (agent=%p vm=%p name=%s)",
-              priv->agent, obj, obj->def->name);
+    VIR_DEBUG("Exited agent (agent=%p vm=%p name=%s hasRefs=%d)",
+              agent, obj, obj->def->name, hasRefs);
 
     priv->agentStart = 0;
     if (!hasRefs)
diff --git a/src/qemu/qemu_domain.h b/src/qemu/qemu_domain.h
index 64cd7e1..def31a9 100644
--- a/src/qemu/qemu_domain.h
+++ b/src/qemu/qemu_domain.h
@@ -296,8 +296,8 @@ int qemuDomainObjEnterMonitorAsync(virQEMUDriverPtr driver,
 
 void qemuDomainObjEnterAgent(virDomainObjPtr obj)
     ATTRIBUTE_NONNULL(1);
-void qemuDomainObjExitAgent(virDomainObjPtr obj)
-    ATTRIBUTE_NONNULL(1);
+void qemuDomainObjExitAgent(virDomainObjPtr obj, qemuAgentPtr agent)
+    ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2);
 
 
 void qemuDomainObjEnterRemote(virDomainObjPtr obj)
diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
index a2cc002..b8c9ff7 100644
--- a/src/qemu/qemu_driver.c
+++ b/src/qemu/qemu_driver.c
@@ -1994,9 +1994,10 @@ static int qemuDomainShutdownFlags(virDomainPtr dom, unsigned int flags)
         qemuDomainSetFakeReboot(driver, vm, isReboot);
 
     if (useAgent) {
+        qemuAgentPtr agent = priv->agent;
         qemuDomainObjEnterAgent(vm);
-        ret = qemuAgentShutdown(priv->agent, agentFlag);
-        qemuDomainObjExitAgent(vm);
+        ret = qemuAgentShutdown(agent, agentFlag);
+        qemuDomainObjExitAgent(vm, agent);
     }
 
     /* If we are not enforced to use just an agent, try ACPI
@@ -2087,9 +2088,10 @@ qemuDomainReboot(virDomainPtr dom, unsigned int flags)
         qemuDomainSetFakeReboot(driver, vm, isReboot);
 
     if (useAgent) {
+        qemuAgentPtr agent = priv->agent;
         qemuDomainObjEnterAgent(vm);
-        ret = qemuAgentShutdown(priv->agent, agentFlag);
-        qemuDomainObjExitAgent(vm);
+        ret = qemuAgentShutdown(agent, agentFlag);
+        qemuDomainObjExitAgent(vm, agent);
     }
 
     /* If we are not enforced to use just an agent, try ACPI
@@ -4864,6 +4866,7 @@ qemuDomainSetVcpusFlags(virDomainPtr dom, unsigned int nvcpus,
     qemuAgentCPUInfoPtr cpuinfo = NULL;
     int ncpuinfo;
     qemuDomainObjPrivatePtr priv;
+    qemuAgentPtr agent;
     size_t i;
     virCgroupPtr cgroup_temp = NULL;
     char *mem_mask = NULL;
@@ -4924,6 +4927,7 @@ qemuDomainSetVcpusFlags(virDomainPtr dom, unsigned int nvcpus,
     }
 
     if (flags & VIR_DOMAIN_VCPU_GUEST) {
+
         if (!qemuDomainAgentAvailable(vm, true))
             goto endjob;
 
@@ -4935,19 +4939,20 @@ qemuDomainSetVcpusFlags(virDomainPtr dom, unsigned int nvcpus,
             goto endjob;
         }
 
+        agent = priv->agent;
         qemuDomainObjEnterAgent(vm);
-        ncpuinfo = qemuAgentGetVCPUs(priv->agent, &cpuinfo);
-        qemuDomainObjExitAgent(vm);
+        ncpuinfo = qemuAgentGetVCPUs(agent, &cpuinfo);
+        qemuDomainObjExitAgent(vm, agent);
 
-        if (ncpuinfo < 0)
+        if (ncpuinfo < 0 || agent != priv->agent)
             goto endjob;
 
         if (qemuAgentUpdateCPUInfo(nvcpus, cpuinfo, ncpuinfo) < 0)
             goto endjob;
 
         qemuDomainObjEnterAgent(vm);
-        ret = qemuAgentSetVCPUs(priv->agent, cpuinfo, ncpuinfo);
-        qemuDomainObjExitAgent(vm);
+        ret = qemuAgentSetVCPUs(agent, cpuinfo, ncpuinfo);
+        qemuDomainObjExitAgent(vm, agent);
 
         if (ret < 0)
             goto endjob;
@@ -5515,6 +5520,8 @@ qemuDomainGetVcpusFlags(virDomainPtr dom, unsigned int flags)
         goto cleanup;
 
     if (flags & VIR_DOMAIN_VCPU_GUEST) {
+        qemuAgentPtr agent;
+
         if (!virDomainObjIsActive(vm)) {
             virReportError(VIR_ERR_INVALID_ARG, "%s",
                            _("vCPU count provided by the guest agent can only be "
@@ -5534,9 +5541,10 @@ qemuDomainGetVcpusFlags(virDomainPtr dom, unsigned int flags)
             goto endjob;
         }
 
+        agent = priv->agent;
         qemuDomainObjEnterAgent(vm);
-        ncpuinfo = qemuAgentGetVCPUs(priv->agent, &cpuinfo);
-        qemuDomainObjExitAgent(vm);
+        ncpuinfo = qemuAgentGetVCPUs(agent, &cpuinfo);
+        qemuDomainObjExitAgent(vm, agent);
 
  endjob:
         qemuDomainObjEndJob(driver, vm);
@@ -13513,14 +13521,16 @@ qemuDomainSnapshotFSFreeze(virQEMUDriverPtr driver ATTRIBUTE_UNUSED,
                            unsigned int nmountpoints)
 {
     qemuDomainObjPrivatePtr priv = vm->privateData;
+    qemuAgentPtr agent;
     int frozen;
 
     if (!qemuDomainAgentAvailable(vm, true))
         return -1;
 
+    agent = priv->agent;
     qemuDomainObjEnterAgent(vm);
-    frozen = qemuAgentFSFreeze(priv->agent, mountpoints, nmountpoints);
-    qemuDomainObjExitAgent(vm);
+    frozen = qemuAgentFSFreeze(agent, mountpoints, nmountpoints);
+    qemuDomainObjExitAgent(vm, agent);
 
     return frozen < 0 ? -2 : frozen;
 }
@@ -13532,19 +13542,21 @@ qemuDomainSnapshotFSThaw(virQEMUDriverPtr driver ATTRIBUTE_UNUSED,
                          bool report)
 {
     qemuDomainObjPrivatePtr priv = vm->privateData;
+    qemuAgentPtr agent;
     int thawed;
     virErrorPtr err = NULL;
 
     if (!qemuDomainAgentAvailable(vm, report))
         return -1;
 
+    agent = priv->agent;
     qemuDomainObjEnterAgent(vm);
     if (!report)
         err = virSaveLastError();
-    thawed = qemuAgentFSThaw(priv->agent);
+    thawed = qemuAgentFSThaw(agent);
     if (!report)
         virSetError(err);
-    qemuDomainObjExitAgent(vm);
+    qemuDomainObjExitAgent(vm, agent);
 
     virFreeError(err);
 
@@ -18146,6 +18158,7 @@ qemuDomainPMSuspendForDuration(virDomainPtr dom,
 {
     virQEMUDriverPtr driver = dom->conn->privateData;
     qemuDomainObjPrivatePtr priv;
+    qemuAgentPtr agent;
     virDomainObjPtr vm;
     int ret = -1;
 
@@ -18218,9 +18231,10 @@ qemuDomainPMSuspendForDuration(virDomainPtr dom,
         goto endjob;
     }
 
+    agent = priv->agent;
     qemuDomainObjEnterAgent(vm);
-    ret = qemuAgentSuspend(priv->agent, target);
-    qemuDomainObjExitAgent(vm);
+    ret = qemuAgentSuspend(agent, target);
+    qemuDomainObjExitAgent(vm, agent);
 
  endjob:
     qemuDomainObjEndJob(driver, vm);
@@ -18309,6 +18323,7 @@ qemuDomainQemuAgentCommand(virDomainPtr domain,
     int ret = -1;
     char *result = NULL;
     qemuDomainObjPrivatePtr priv;
+    qemuAgentPtr agent;
 
     virCheckFlags(0, NULL);
 
@@ -18338,9 +18353,10 @@ qemuDomainQemuAgentCommand(virDomainPtr domain,
         goto endjob;
     }
 
+    agent = priv->agent;
     qemuDomainObjEnterAgent(vm);
-    ret = qemuAgentArbitraryCommand(priv->agent, cmd, &result, timeout);
-    qemuDomainObjExitAgent(vm);
+    ret = qemuAgentArbitraryCommand(agent, cmd, &result, timeout);
+    qemuDomainObjExitAgent(vm, agent);
 
     if (ret < 0)
         VIR_FREE(result);
@@ -18411,6 +18427,7 @@ qemuDomainFSTrim(virDomainPtr dom,
     virDomainObjPtr vm;
     int ret = -1;
     qemuDomainObjPrivatePtr priv;
+    qemuAgentPtr agent;
 
     virCheckFlags(0, -1);
 
@@ -18447,9 +18464,10 @@ qemuDomainFSTrim(virDomainPtr dom,
         goto endjob;
     }
 
+    agent = priv->agent;
     qemuDomainObjEnterAgent(vm);
-    ret = qemuAgentFSTrim(priv->agent, minimum);
-    qemuDomainObjExitAgent(vm);
+    ret = qemuAgentFSTrim(agent, minimum);
+    qemuDomainObjExitAgent(vm, agent);
 
  endjob:
     qemuDomainObjEndJob(driver, vm);
@@ -18600,6 +18618,7 @@ qemuDomainGetTime(virDomainPtr dom,
     virQEMUDriverPtr driver = dom->conn->privateData;
     virDomainObjPtr vm = NULL;
     qemuDomainObjPrivatePtr priv;
+    qemuAgentPtr agent;
     int ret = -1;
     int rv;
 
@@ -18625,9 +18644,10 @@ qemuDomainGetTime(virDomainPtr dom,
     if (!qemuDomainAgentAvailable(vm, true))
         goto endjob;
 
+    agent = priv->agent;
     qemuDomainObjEnterAgent(vm);
-    rv = qemuAgentGetTime(priv->agent, seconds, nseconds);
-    qemuDomainObjExitAgent(vm);
+    rv = qemuAgentGetTime(agent, seconds, nseconds);
+    qemuDomainObjExitAgent(vm, agent);
 
     if (rv < 0)
         goto endjob;
@@ -18650,6 +18670,7 @@ qemuDomainSetTime(virDomainPtr dom,
 {
     virQEMUDriverPtr driver = dom->conn->privateData;
     qemuDomainObjPrivatePtr priv;
+    qemuAgentPtr agent;
     virDomainObjPtr vm;
     bool rtcSync = flags & VIR_DOMAIN_TIME_SYNC;
     int ret = -1;
@@ -18689,9 +18710,10 @@ qemuDomainSetTime(virDomainPtr dom,
     if (!qemuDomainAgentAvailable(vm, true))
         goto endjob;
 
+    agent = priv->agent;
     qemuDomainObjEnterAgent(vm);
-    rv = qemuAgentSetTime(priv->agent, seconds, nseconds, rtcSync);
-    qemuDomainObjExitAgent(vm);
+    rv = qemuAgentSetTime(agent, seconds, nseconds, rtcSync);
+    qemuDomainObjExitAgent(vm, agent);
 
     if (rv < 0)
         goto endjob;
@@ -19647,6 +19669,7 @@ qemuDomainGetFSInfo(virDomainPtr dom,
 {
     virQEMUDriverPtr driver = dom->conn->privateData;
     qemuDomainObjPrivatePtr priv;
+    qemuAgentPtr agent;
     virDomainObjPtr vm;
     int ret = -1;
 
@@ -19672,9 +19695,10 @@ qemuDomainGetFSInfo(virDomainPtr dom,
     if (!qemuDomainAgentAvailable(vm, true))
         goto endjob;
 
+    agent = priv->agent;
     qemuDomainObjEnterAgent(vm);
-    ret = qemuAgentGetFSInfo(priv->agent, info, vm->def);
-    qemuDomainObjExitAgent(vm);
+    ret = qemuAgentGetFSInfo(agent, info, vm->def);
+    qemuDomainObjExitAgent(vm, agent);
 
  endjob:
     qemuDomainObjEndJob(driver, vm);
@@ -19692,6 +19716,7 @@ qemuDomainInterfaceAddresses(virDomainPtr dom,
 {
     virQEMUDriverPtr driver = dom->conn->privateData;
     qemuDomainObjPrivatePtr priv = NULL;
+    qemuAgentPtr agent;
     virDomainObjPtr vm = NULL;
     int ret = -1;
 
@@ -19723,9 +19748,10 @@ qemuDomainInterfaceAddresses(virDomainPtr dom,
     if (!qemuDomainAgentAvailable(vm, true))
         goto endjob;
 
+    agent = priv->agent;
     qemuDomainObjEnterAgent(vm);
-    ret = qemuAgentGetInterfaces(priv->agent, ifaces);
-    qemuDomainObjExitAgent(vm);
+    ret = qemuAgentGetInterfaces(agent, ifaces);
+    qemuDomainObjExitAgent(vm, agent);
 
  endjob:
     qemuDomainObjEndJob(driver, vm);
@@ -19850,6 +19876,7 @@ qemuDomainSetUserPassword(virDomainPtr dom,
 {
     virQEMUDriverPtr driver = dom->conn->privateData;
     qemuDomainObjPrivatePtr priv;
+    qemuAgentPtr agent;
     virDomainObjPtr vm;
     int ret = -1;
     int rv;
@@ -19876,10 +19903,11 @@ qemuDomainSetUserPassword(virDomainPtr dom,
     if (!qemuDomainAgentAvailable(vm, true))
        goto endjob;
 
+    agent = priv->agent;
     qemuDomainObjEnterAgent(vm);
-    rv = qemuAgentSetUserPassword(priv->agent, user, password,
+    rv = qemuAgentSetUserPassword(agent, user, password,
                                   flags & VIR_DOMAIN_PASSWORD_ENCRYPTED);
-    qemuDomainObjExitAgent(vm);
+    qemuDomainObjExitAgent(vm, agent);
 
     if (rv < 0)
         goto endjob;
-- 
2.1.0
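To make the interleaving described in the commit message concrete, the following hypothetical, self-contained sketch models the same pattern with plain pthreads instead of libvirt's virObject API. The fakeAgent type, the privAgent pointer and the refs field are illustrative stand-ins, not libvirt code; built with "cc -pthread", the program hangs at pthread_join(), mirroring the reported hang in qemuAgentClose().

/* Hypothetical stand-alone model of the race (not libvirt code). */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

typedef struct {
    pthread_mutex_t lock;
    int refs;
} fakeAgent;

static pthread_mutex_t vmLock = PTHREAD_MUTEX_INITIALIZER;
static fakeAgent agentObj = { PTHREAD_MUTEX_INITIALIZER, 2 };
static fakeAgent *privAgent = &agentObj;      /* stands in for priv->agent */

/* Models qemuProcessHandleAgentEOF: clear the shared pointer under the vm
 * lock, then do the close work, which needs the agent lock. */
static void *
eofThread(void *opaque)
{
    (void)opaque;
    pthread_mutex_lock(&vmLock);
    privAgent = NULL;                         /* priv->agent = NULL */
    pthread_mutex_unlock(&vmLock);

    pthread_mutex_lock(&agentObj.lock);       /* "qemuAgentClose": blocks forever */
    agentObj.refs--;
    pthread_mutex_unlock(&agentObj.lock);
    return NULL;
}

int
main(void)
{
    pthread_t eof;
    int hasRefs = 0;

    /* The worker has already "entered" the agent: it holds the agent lock
     * and one extra reference, as qemuDomainObjEnterAgent would. */
    pthread_mutex_lock(&agentObj.lock);

    pthread_create(&eof, NULL, eofThread, NULL);
    sleep(1);                                 /* let the EOF handler win the race */

    /* ExitAgent as originally written: unref through the shared pointer,
     * which the EOF handler already cleared, so hasRefs stays false and
     * the agent lock is never released. */
    if (privAgent)
        hasRefs = --privAgent->refs > 0;
    if (hasRefs)
        pthread_mutex_unlock(&privAgent->lock);

    pthread_join(eof, NULL);                  /* deadlock: never returns */
    printf("not reached\n");
    return 0;
}

With the approach taken in the posted patch, the worker would unref and unlock a saved local pointer instead of the shared one, so the unlock always happens and the EOF thread can proceed.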

On Mon, Oct 26, 2015 at 11:06:21 -0400, John Ferlan wrote:
If we have a shutdown of a VM by a qemu agent while an agent EOF occurs at relatively the same time, it's possible that a deadlock occurs depending on what happens first.
Assume qemuProcessHandleAgentEOF runs first, with the vm->lock it clears priv->agent, then clears the vm->lock.
Couldn't we make sure that qemuProcessHandleAgentEOF clears priv->agent if and only if it removed the last reference? [...]
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index 890d8ed..a908df8 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -1868,19 +1868,20 @@ qemuDomainObjEnterAgent(virDomainObjPtr obj)
  * Should be paired with an earlier qemuDomainObjEnterAgent() call
  */
 void
-qemuDomainObjExitAgent(virDomainObjPtr obj)
+qemuDomainObjExitAgent(virDomainObjPtr obj,
+                       qemuAgentPtr agent)
You would not need to modify this prototype ...
 {
     qemuDomainObjPrivatePtr priv = obj->privateData;
     bool hasRefs;
 
-    hasRefs = virObjectUnref(priv->agent);
+    hasRefs = virObjectUnref(agent);
 
     if (hasRefs)
-        virObjectUnlock(priv->agent);
+        virObjectUnlock(agent);
 
     virObjectLock(obj);
-    VIR_DEBUG("Exited agent (agent=%p vm=%p name=%s)",
-              priv->agent, obj, obj->def->name);
+    VIR_DEBUG("Exited agent (agent=%p vm=%p name=%s hasRefs=%d)",
+              agent, obj, obj->def->name, hasRefs);
 
     priv->agentStart = 0;
     if (!hasRefs)
 void qemuDomainObjEnterRemote(virDomainObjPtr obj)
diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
index a2cc002..b8c9ff7 100644
--- a/src/qemu/qemu_driver.c
+++ b/src/qemu/qemu_driver.c
@@ -1994,9 +1994,10 @@ static int qemuDomainShutdownFlags(virDomainPtr dom, unsigned int flags)
         qemuDomainSetFakeReboot(driver, vm, isReboot);
 
     if (useAgent) {
+        qemuAgentPtr agent = priv->agent;
         qemuDomainObjEnterAgent(vm);
-        ret = qemuAgentShutdown(priv->agent, agentFlag);
-        qemuDomainObjExitAgent(vm);
+        ret = qemuAgentShutdown(agent, agentFlag);
+        qemuDomainObjExitAgent(vm, agent);
... and could avoid this rather ugly code here. The result would be IMO the same.

Peter

On 10/26/2015 11:37 AM, Peter Krempa wrote:
On Mon, Oct 26, 2015 at 11:06:21 -0400, John Ferlan wrote:
If we have a shutdown of a VM by a qemu agent while an agent EOF occurs at relatively the same time, it's possible that a deadlock occurs depending on what happens first.
Assume qemuProcessHandleAgentEOF runs first, with the vm->lock it clears priv->agent, then clears the vm->lock.
Couldn't we make sure that qemuProcessHandleAgentEOF clears priv->agent if and only if it removed the last reference?
reference to? agent? vm? via?

qemuConnectAgent/qemuAgentOpen takes out a reference to the agent. That is un-referenced by qemuAgentClose. The EnterAgent takes out a reference which is un-referenced during ExitAgent. Adding a new reference just for EOF processing doesn't solve the problem.

EOF doesn't do an agent-unlock, it does a vm-lock/unlock, and calls qemuAgentClose which will attempt an agent-lock. The agent may be locked by some other thread and it needs to wait until that reference is cleared.

Adding some sort of unref-agent check logic in EOF similar to ExitAgent feels like it'll cause other issues. It seems "backwards" to remove the last reference and avoid everything else that qemuAgentClose does.

I think logically if some sort of unref-agent check logic was added, then qemuAgentClose could not be called from either EOF or ExitAgent.

From EOF, if the current thread was the last one, then the agent is freed, so qemuAgentClose shouldn't be called. If the current thread wasn't the last one, then we'd have to wait for the last reference check in ExitAgent, but once that is done, the agent is freed and thus qemuAgentClose shouldn't be called.
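As a rough illustration of the pairing described above, here is a toy refcounting sketch (not libvirt's virObject implementation; toyAgent and the helper names are made up): the "open" and "enter" paths each take a reference, the "exit" and "close" paths each drop one, and whichever drop turns out to be the last is where the object actually goes away, which is why an extra unref in the EOF handler, by itself, does not remove the need for the close work.

/* Toy refcount model (not libvirt code); names are illustrative only. */
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    int refs;
} toyAgent;

static toyAgent *
toyAgentOpen(void)                    /* roughly: qemuAgentOpen takes ref #1 */
{
    toyAgent *agent = calloc(1, sizeof(*agent));
    if (agent)
        agent->refs = 1;
    return agent;
}

static void
toyAgentRef(toyAgent *agent)          /* roughly: EnterAgent takes another ref */
{
    agent->refs++;
}

static int
toyAgentUnref(toyAgent *agent)        /* returns nonzero if references remain */
{
    if (--agent->refs > 0)
        return 1;
    free(agent);                      /* the last unref frees the object */
    return 0;
}

int
main(void)
{
    toyAgent *agent = toyAgentOpen();            /* refs = 1 */
    if (!agent)
        return 1;

    toyAgentRef(agent);                          /* EnterAgent: refs = 2 */
    printf("during agent call, refs=%d\n", agent->refs);
    toyAgentUnref(agent);                        /* ExitAgent:  refs = 1 */

    if (!toyAgentUnref(agent))                   /* "close" drops the open ref */
        printf("last reference dropped, object freed\n");
    return 0;
}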
[...]
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index 890d8ed..a908df8 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -1868,19 +1868,20 @@ qemuDomainObjEnterAgent(virDomainObjPtr obj)
  * Should be paired with an earlier qemuDomainObjEnterAgent() call
  */
 void
-qemuDomainObjExitAgent(virDomainObjPtr obj)
+qemuDomainObjExitAgent(virDomainObjPtr obj,
+                       qemuAgentPtr agent)
You would not need to modify this prototype ...
 {
     qemuDomainObjPrivatePtr priv = obj->privateData;
     bool hasRefs;
 
-    hasRefs = virObjectUnref(priv->agent);
+    hasRefs = virObjectUnref(agent);
 
     if (hasRefs)
-        virObjectUnlock(priv->agent);
+        virObjectUnlock(agent);
 
     virObjectLock(obj);
-    VIR_DEBUG("Exited agent (agent=%p vm=%p name=%s)",
-              priv->agent, obj, obj->def->name);
+    VIR_DEBUG("Exited agent (agent=%p vm=%p name=%s hasRefs=%d)",
+              agent, obj, obj->def->name, hasRefs);
 
     priv->agentStart = 0;
     if (!hasRefs)
 void qemuDomainObjEnterRemote(virDomainObjPtr obj)
diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
index a2cc002..b8c9ff7 100644
--- a/src/qemu/qemu_driver.c
+++ b/src/qemu/qemu_driver.c
@@ -1994,9 +1994,10 @@ static int qemuDomainShutdownFlags(virDomainPtr dom, unsigned int flags)
         qemuDomainSetFakeReboot(driver, vm, isReboot);
 
     if (useAgent) {
+        qemuAgentPtr agent = priv->agent;
         qemuDomainObjEnterAgent(vm);
-        ret = qemuAgentShutdown(priv->agent, agentFlag);
-        qemuDomainObjExitAgent(vm);
+        ret = qemuAgentShutdown(agent, agentFlag);
+        qemuDomainObjExitAgent(vm, agent);
... and could avoid this rather ugly code here. The result would be IMO the same.
So IYO is it reasonable to access priv->agent even though we don't own the vm-lock once EnterAgent returns?

John

On Mon, Oct 26, 2015 at 17:28:57 -0400, John Ferlan wrote:
On 10/26/2015 11:37 AM, Peter Krempa wrote:
On Mon, Oct 26, 2015 at 11:06:21 -0400, John Ferlan wrote:
If we have a shutdown of a VM by a qemu agent while an agent EOF occurs at relatively the same time, it's possible that a deadlock occurs depending on what happens first.
Assume qemuProcessHandleAgentEOF runs first, with the vm->lock it clears priv->agent, then clears the vm->lock.
Couldn't we make sure that qemuProcessHandleAgentEOF clears priv->agent if and only if it removed the last reference?
reference to? agent? vm? via?
Since all of this is referring to the agent I was referring to it too.
qemuConnectAgent/qemuAgentOpen takes out a reference to the agent. That is un-referenced by qemuAgentClose. The EnterAgent takes out a reference which is un-referenced during ExitAgent. Adding a new reference just for EOF processing doesn't solve the problem.
EOF doesn't do an agent-unlock, it does a vm-lock/unlock, and calls qemuAgentClose which will attempt an agent-lock. The agent may be locked by some other thread and it needs to wait until that reference is cleared.
Adding some sort of unref-agent check logic in EOF similar to ExitAgent feels like it'll cause other issues. It seems "backwards" to remove the last reference and avoid everything else that qemuAgentClose does.
I think logically if some sort of unref-agent check logic was added, then qemuAgentClose could not be called from either EOF or ExitAgent.
From EOF, if the current thread was the last one, then the agent is freed, so qemuAgentClose shouldn't be called. If the current thread wasn't the last one, then we'd have to wait for the last reference check in ExitAgent, but once that is done, the agent is freed and thus qemuAgentClose shouldn't be called.
I wanted to point out that since we do have the 'EnterAgent' and 'ExitAgent' helpers, they should be used to do any kind of logic required to do the locking and it should not be necessary at any point to copy the pointer to priv->agent. Otherwise it creates a really ugly usage pattern which requires a separate variable and basically defeats the Enter/Exit pattern we use anywhere else.

Doing so probably will increase the complexity of either the helpers or the closing function, but the complexity will not be exposed in a repeated pattern through the code. Since we already have the helpers, we should use them and not clutter the rest of the code.

One other option worth checking is moving the stuff happening in qemuAgentClose into the destructor for the agent object (qemuAgentDispose) which would then auto-call it in the case where you remove the last reference. I didn't check thoroughly enough though to see whether it's possible.

Peter
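A hypothetical sketch of that last idea, again with a toy refcounted object rather than virObject (toyObject, toyAgentDispose and the printf bodies are made up, and the list of teardown steps is an assumption): the work qemuAgentClose performs explicitly today would live in a dispose callback, so it runs automatically wherever the last reference happens to be dropped, and neither the EOF handler nor ExitAgent needs to know whether it was the last user.

/* Toy sketch of dispose-on-last-unref (not libvirt code). */
#include <stdio.h>
#include <stdlib.h>

typedef struct toyObject toyObject;
struct toyObject {
    int refs;
    void (*dispose)(toyObject *obj);   /* analogous to a virClass dispose callback */
};

static void
toyObjectUnref(toyObject *obj)
{
    if (--obj->refs > 0)
        return;
    if (obj->dispose)
        obj->dispose(obj);             /* last reference: run the "close" work */
    free(obj);
}

static void
toyAgentDispose(toyObject *obj)
{
    /* Assumption: close the socket, remove the fd watch, free buffers ...
     * i.e. whatever qemuAgentClose does explicitly today. */
    printf("agent %p disposed\n", (void *)obj);
}

int
main(void)
{
    toyObject *agent = calloc(1, sizeof(*agent));
    if (!agent)
        return 1;
    agent->refs = 2;                   /* the "open" ref plus one EnterAgent ref */
    agent->dispose = toyAgentDispose;

    toyObjectUnref(agent);             /* EOF handler drops the open ref */
    toyObjectUnref(agent);             /* ExitAgent drops the last ref:
                                        * dispose runs here automatically */
    return 0;
}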

On 10/27/2015 01:00 AM, Peter Krempa wrote:
On Mon, Oct 26, 2015 at 17:28:57 -0400, John Ferlan wrote:
On 10/26/2015 11:37 AM, Peter Krempa wrote:
On Mon, Oct 26, 2015 at 11:06:21 -0400, John Ferlan wrote:
If we have a shutdown of a VM by a qemu agent while an agent EOF occurs at relatively the same time, it's possible that a deadlock occurs depending on what happens first.
Assume qemuProcessHandleAgentEOF runs first, with the vm->lock it clears priv->agent, then clears the vm->lock.
Couldn't we make sure that qemuProcessHandleAgentEOF clears priv->agent if and only if it removed the last reference?
reference to? agent? vm? via?
Since all of this is referring to the agent I was referring to it too.
Difficult to be 100% certain from just your comment - no context and no suggestion for a way to handle this. There is a reference on the vm once an agent is connected...
qemuConnectAgent/qemuAgentOpen takes out a reference to the agent. That is un-referenced by qemuAgentClose. The EnterAgent takes out a reference which is un-referenced during ExitAgent. Adding a new reference just for EOF processing doesn't solve the problem.
EOF doesn't do an agent-unlock, it does a vm-lock/unlock, and calls qemuAgentClose which will attempt an agent-lock. The agent may be locked by some other thread and it needs to wait until that reference is cleared.
Adding some sort of unref-agent check logic in EOF similar to ExitAgent feels like it'll cause other issues. It seems "backwards" to remove the last reference and avoid everything else that qemuAgentClose does.
I think logically if some sort of unref-agent check logic was added, then qemuAgentClose could not be called from either EOF or ExitAgent.
From EOF, if the current thread was the last one, then the agent is freed, so qemuAgentClose shouldn't be called. If the current thread wasn't the last one, then we'd have to wait for the last reference check in ExitAgent, but once that is done, the agent is freed and thus qemuAgentClose shouldn't be called.
I wanted to point out that since we do have the 'EnterAgent' and 'ExitAgent' helpers, they should be used to do any kind of logic required to do the locking and it should not be necessary at any point to copy the pointer to priv->agent. Otherwise it creates a really ugly usage pattern which requires a separate variable and basically defeats the Enter/Exit pattern we use anywhere else.
Doing so probably will increase the complexity of either the helpers or the closing function, but the complexity will not be exposed in a repeated pattern through the code. Since we already have the helpers, we should use them and not clutter the rest of the code.
Fair enough... I guess I find it "odd" to access a structure after we've removed the lock on it, while other code/threads can modify that field. Since the only access is the one thing that was locked/reffed (e.g. priv->agent, or in the Enter/Exit Monitor cases priv->mon), then for 99.9% of the time it's fine. The 0.1% is a timing window where priv->agent (or priv->mon) could be set to NULL by an EOF some time after EnterAgent has increased the ref on priv->agent.

Looking at the ExitMonitor code compared to the ExitAgent code, it would seem logically ExitMonitor could have the same issue - that is, priv->mon could be set to NULL in its EOF path during qemuProcessStop, but that code is headed towards qemuProcessKill and there are many more instructions that need to be executed before perhaps an ExitMonitor would run in a separate thread.

NB: I had only a brief look at the ExitMonitor/EOFMonitor logic. If somehow the EOF thread gets all the way to the "if (priv->mon) qemuMonitorClose(priv->mon); priv->mon = NULL;" code before the ExitMonitor thread can ObjectUnref(priv->mon), then it seems we'd have a similar scenario (unlikely, but possible).
One other option worth checking is moving the stuff happening in qemuAgentClose into the destructor for the agent object (qemuAgentDispose) which would then auto-call it in the case where you remove the last reference. I didn't check thoroughly enough though to see whether it's possible.
That seems risky, IMO... It also doesn't solve the issue where the vm's priv->agent gets set to NULL by the EOF code and the ExitAgent code cannot Unlock.

So if the problem is that EOF clears priv->agent before calling qemuAgentClose(), then what happens if we remove (or move) that clearing? Well, qemuAgentClose will attempt to get the agent-lock, but needs to wait for the unlock. When ExitAgent is active it executes its Unref and Unlock, but will never hit the !hasRefs case. Once it unlocks the agent, the qemuAgentClose code does its thing and upon return could set priv->agent = NULL (inside a vm-lock/unlock).

However, that leaves a window where the agent is freed, but priv->agent isn't NULL. That's a problem during qemuDomainSetVcpusFlags, where the agent is accessed twice without checking the second time whether the agent was still available. Although that's resolvable with another AgentAvailable check and perhaps some logic to ascertain during EOF if we were in an inAgent Enter/Exit window.

John
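For the qemuDomainSetVcpusFlags case mentioned above, the guard boils down to re-checking the saved pointer against priv->agent between the two agent calls; the posted patch does exactly that with its "agent != priv->agent" test. Below is a toy, self-contained sketch of that staleness check (the toyAgent type and the two call helpers are made up, not libvirt code).

/* Toy staleness-check sketch (not libvirt code). */
#include <stdio.h>
#include <stddef.h>

typedef struct {
    int id;
} toyAgent;

static toyAgent agentInstance = { 1 };
static toyAgent *privAgent = &agentInstance;   /* stands in for priv->agent */

static int firstAgentCall(toyAgent *agent)  { return agent ? 0 : -1; }
static int secondAgentCall(toyAgent *agent) { return agent ? 0 : -1; }

int
main(void)
{
    toyAgent *agent = privAgent;       /* copy taken before the first call */

    if (firstAgentCall(agent) < 0)
        return 1;

    /* Simulate an EOF arriving between the two calls: the shared pointer is
     * cleared (and in the real code the agent would be torn down). */
    privAgent = NULL;

    /* Staleness check before reusing the saved pointer for a second call. */
    if (agent != privAgent) {
        fprintf(stderr, "agent went away between calls, bailing out\n");
        return 1;
    }

    return secondAgentCall(agent) < 0;
}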