[libvirt] [PATCH 00/10] qemu: agent: bug fix and improvements

Patch 1 fixes the bug I dealt with. Patches 2-3 are a form of protection from similar bugs. Patches 5-6 drop the error flag from the agent altogether, so now there are fewer places for bugs to lurk. The rest of the patches clean up the loop function. I think 1-3 can be dropped if 6 is approved; it is just that 6 can be considered controversial.

Nikolay Shirokovskiy (10):
  qemu: agent: unset error flag in EOF handler
  qemu: agent: don't set error after agent cleanup
  qemu: agent: clear error flag upon init
  qemu: agent: handle agent connection errors in one place
  qemu: agent: straighten up failed agent start case
  qemu: agent: drop agent error flag (6)
  qemu: agent: simplify io loop
  qemu: agent: exit early if error is already set
  qemu: agent: fix loop error messages
  qemu: agent: drop redundant referencing in loop

 src/qemu/qemu_agent.c        | 183 ++++++++++++++++++-------------------
 src/qemu/qemu_agent.h        |   2 -
 src/qemu/qemu_domain.c       |  41 +++++-----
 src/qemu/qemu_domain.h       |   1 -
 src/qemu/qemu_driver.c       |   9 +--
 src/qemu/qemu_migration.c    |  13 +--
 src/qemu/qemu_process.c      |  90 ++++-----------------
 tests/qemumonitortestutils.c |   1 -
 8 files changed, 119 insertions(+), 221 deletions(-)

--
1.8.3.1

Usually on domain shutdown the event comes first from qemu and the flag is unset in processSerialChangedEvent. However, there is a race between this function and qemuProcessHandleAgentEOF because the former is executed in a thread pool and the latter synchronously. qemuProcessHandleAgentEOF does not unset the flag, thus if it was triggered before, it remains set after domain shutdown/start and the qemu agent becomes unavailable.
---
 src/qemu/qemu_process.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
index 31c8453..a581f96 100644
--- a/src/qemu/qemu_process.c
+++ b/src/qemu/qemu_process.c
@@ -151,6 +151,7 @@ qemuProcessHandleAgentEOF(qemuAgentPtr agent,
 
     qemuAgentClose(agent);
     priv->agent = NULL;
+    priv->agentError = false;
 
     virObjectUnlock(vm);
     return;
-- 
1.8.3.1

On 10/04/2016 02:56 AM, Nikolay Shirokovskiy wrote:
Usually on domain shutdown the event comes first from qemu and the flag is unset in processSerialChangedEvent. However, there is a race between this function and qemuProcessHandleAgentEOF because the former is executed in a thread pool and the latter synchronously. qemuProcessHandleAgentEOF does not unset the flag, thus if it was triggered before, it remains set after domain shutdown/start and the qemu agent becomes unavailable.
What error occurred that would be avoided by clearing the agentError flag unconditionally? Considering that agentError being set is primarily used by "new connection" request processing (e.g. callers to qemuDomainAgentAvailable) in order to ascertain whether they should actually attempt the connection or fail because of some condition that would eventually lead to failure. I would think EOF from qemu_monitor for the qemu_agent would be one of those conditions *before* the domain state is not RUNNING or the priv->agent has been fully removed.

I think what I'm missing is what sequence of events led to this condition. How was it reproduced?
---
 src/qemu/qemu_process.c | 1 +
 1 file changed, 1 insertion(+)
So I understand "eventually" your goal is to remove agentError in patch 6, but you need to lead down the path to get there. So let's see if I can follow along...

The issue you chased is related to domain shutdown processing "order". The purpose up to this point in time for 'agentError' is to ensure callers of qemuDomainAgentAvailable() will fail before trying to use the agent. The qemuDomainAgentAvailable checks domain active, agentError, and whether the priv->agent is available.

So the contention is that during shutdown the EOF event firing at about the same time as the processSerialChangedEvent has caused an issue, but it's not 100% clear just from the description what the sequence of events and the issue are. So let's consider why these things happen:

1. The 'eofNotify' comes from the qemuMonitorIO function - ostensibly if the guest is shutting down

2. The 'processSerialChangedEvent' seems to occur for multiple reasons - startup, shutdown, some sort of channel failure (unix socket or pty). If I'm reading things correctly, this path is only involved when VSERPORT is being used.

So, is this a case of guest startup where the initial agent connect/startup occurs (VSERPORT) and a guest shutdown racing? Or is this purely a shutdown ordering thing?
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
index 31c8453..a581f96 100644
--- a/src/qemu/qemu_process.c
+++ b/src/qemu/qemu_process.c
@@ -151,6 +151,7 @@ qemuProcessHandleAgentEOF(qemuAgentPtr agent,
qemuAgentClose(agent); priv->agent = NULL;
Because this is cleared, qemuDomainAgentAvailable will fail because the agent is not present, instead of failing with "there was some error". That would seem to make the following clearing "safe", since we know we'll have a failure anyway, but...
+ priv->agentError = false;
How does this alone solve the problem you saw? AFAICT it just changes *where* the error occurs and the error message displayed.

Conversely, if for some reason either

1. qemuConnectAgent failed and is being "flagged" in
   1a. processSerialChangedEvent (for initial CONNECTED failure when QEMU_CAPS_VSERPORT_CHANGE)
   1b. qemuMigrationFinish (for migration target)
   1c. qemuProcessReconnect (for libvirtd restart)
   1d. qemuProcessLaunch (initial domain startup before QEMU_CAPS_VSERPORT_CHANGE)
   1e. qemuProcessAttach (for qemu-attach to existing qemu process)

or

2. qemuProcessHandleAgentError - some sort of error and we're stopping any further agent communication.

then you're potentially clearing agentError and could conceivably cause "extra" debugging steps to be taken only to figure out the reason it was because of some error rather than the "next" connection failing and indicating the failure was because of some previous error (it's a startup race for some other thread desiring to use the agent).
virObjectUnlock(vm); return;
So given all that - I'm not 100% clear how this really fixes the issue you describe. Although, I have to wonder if the bug you're chasing is perhaps that qemuConnectAgent doesn't clear agentError at the right time, as patch 3 attempts to fix.

After reading through patch 4 - I think patch 4 and a slightly modified patch 3 will fix what you're seeing, but I'd only know for sure given more details about what was seen.

John

On 26.10.2016 22:48, John Ferlan wrote:
On 10/04/2016 02:56 AM, Nikolay Shirokovskiy wrote:
Usually on domain shutdown the event comes first from qemu and the flag is unset in processSerialChangedEvent. However, there is a race between this function and qemuProcessHandleAgentEOF because the former is executed in a thread pool and the latter synchronously. qemuProcessHandleAgentEOF does not unset the flag, thus if it was triggered before, it remains set after domain shutdown/start and the qemu agent becomes unavailable.
What error occurred that would be avoided by clearing the agentError flag unconditionally? Considering that agentError being set is primarily
In short, the qemu agent becomes unavailable after shutdown. The long error case description is:

1. An agent error occurred and the error flag is set.
2. The domain is shut down. Shutdown follows a path on which the error flag stays set. (Actually the error flag can be set in the process of shutdown, which is I guess the case that I observed.)
3. The domain is started; on any attempt to use the agent you get "QEMU guest agent is not available due to an error".

As nobody clears the error flag on start up, it stays set. Obviously we don't need to clear it on the first domain start after daemon start, because qemuDomainObjPrivate is allocated and cleared to zeros. But restarts rely upon the assumption that this flag is cleared on shutdown. And this is not true. There are shutdown paths where the flag is not cleared. This patch addresses one of these paths:

1. Shutdown is started; in the process VSERPORT_CHANGE is delivered in the IO loop thread. However, it is handled in a different thread (from driver->workerPool).
2. In the same IO loop thread we get EOF from the agent and synchronously execute qemuProcessHandleAgentEOF. It closes the agent and sets priv->agent to NULL, but forgets to clear agentError.
3. Now we execute the VSERPORT_CHANGE handler (processSerialChangedEvent). It is delayed because it is asynchronous to VSERPORT_CHANGE itself. As priv->agent is NULL, the handler does not touch the agentError flag.

I guess usually the time gap between VSERPORT_CHANGE and EOF is large enough that step 3 is executed before step 2 and agentError is cleared upon shutdown.

Basically it is enough to clear the flag on start up (patch 3), but I thought that just in case we would want to clear the flag on agent cleanup too.
used by "new connection" request processing (e.g. callers to qemuDomainAgentAvailable) in order to ascertain whether they should actually attempt the connection or fail because of some condition that would eventually lead to failure. I would think EOF from qemu_monitor for the qemu_agent would be one of those conditions *before* the domain state is not RUNNING or the priv->agent has been fully removed.
I don't really understand the part about 'before the domain state is not RUNNING'.
From what I've seen in the code, agentError has 2 meanings; they are described in patch 5.
I think what I'm missing is what sequence of events led to this condition. How was it reproduced?
---
 src/qemu/qemu_process.c | 1 +
 1 file changed, 1 insertion(+)
So I understand "eventually" your goal is to remove agentError in patch 6, but you need to lead down the path to get there. So let's see if I can follow along...
My original intention was to fix the bug that I tried to explain above. But in the process I found another path by which a similar error can occur (patch 2). Then I realized that we can protect ourselves from similar bugs in just one place (patch 3). At this point I had looked into the code long enough to realize that one doesn't need the flag at all, and decided that a detailed explanation of patches 1 and 2 would help me convince others to get rid of the flag.
The issue you chased is related to domain shutdown processing "order".
The purpose up to this point in time for 'agentError' is to ensure callers of qemuDomainAgentAvailable() will fail before trying to use the agent. The qemuDomainAgentAvailable checks domain active, agentError, and whether the priv->agent is available.
So the contention is during shutdown the EOF event firing at about the same time as the processSerialChangedEvent has caused an issue, but it's not 100% clear just from the description the sequence of events and the issue. So let's consider why these things happen:
1. The 'eofNotify' comes from the qemuMonitorIO function - ostensibly if the guest is shutting down
2. The 'processSerialChangedEvent' seems to occur for multiple reasons - startup, shutdown, some sort of channel failure (unix socket or pty). If I'm reading things correctly, this path is only involved when VSERPORT is being used.
So, is this a case of guest startup where the initial agent connect/startup occurs (VSERPORT) and a guest shutdown racing? Or is this purely a shutdown ordering thing?
Purely shutdown, plus an error in the IO loop from the agent before or during shutdown.
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
index 31c8453..a581f96 100644
--- a/src/qemu/qemu_process.c
+++ b/src/qemu/qemu_process.c
@@ -151,6 +151,7 @@ qemuProcessHandleAgentEOF(qemuAgentPtr agent,
qemuAgentClose(agent); priv->agent = NULL;
Because this is cleared, then qemuDomainAgentAvailable will fail because the agent is not present instead of failing with there was some error. That would seem to make the following clearing "safe" since we know we'll have a failure any way, but...
+ priv->agentError = false;
How does this alone solve the problem you saw? AFAICT it just changes *where* the error occurs and the error message displayed.
The place that is affected is the following domain start up and agent use.
Conversely, if for some reason either
1. qemuConnectAgent failed and is being "flagged" in
   1a. processSerialChangedEvent (for initial CONNECTED failure when QEMU_CAPS_VSERPORT_CHANGE)
   1b. qemuMigrationFinish (for migration target)
   1c. qemuProcessReconnect (for libvirtd restart)
   1d. qemuProcessLaunch (initial domain startup before QEMU_CAPS_VSERPORT_CHANGE)
   1e. qemuProcessAttach (for qemu-attach to existing qemu process)
or
2. qemuProcessHandleAgentError - some sort of error and we're stopping any further agent communication.
then you're potentially clearing agentError and could conceivably cause "extra" debugging steps to be taken only to figure out the reason it was because of some error rather than the "next" connection failing and indicating the failure was because of some previous error (it's a startup race for some other thread desiring to use the agent).
As to case 1, AFAIU you are talking about the following case:

1. The agent started (qemuAgentOpen is ok).
2. For some reason we have to fail qemuConnectAgent (virSecurityManagerClearSocketLabel failed, for example) and set the error flag.
3. EOF is triggered and clears the error flag (in the new version).

This is not possible. qemuConnectAgent will eventually remove the agent handle from the IO loop before setting the flag on failure, and EOF cannot be delivered. This subtle case is another reason why we should throw away this flag. It adds too much complexity to the analysis. Patch 5 unbinds the flag from this case, the case of the agent failing to start up.

As to case 2, this is one of the patch targets. If there is an error set by qemuProcessHandleAgentError, we need to clean it up when the agent is cleaned up.
virObjectUnlock(vm); return;
So given all that - I'm not 100% clear how this really fixes the issue you describe. Although, I have to wonder if the bug you're chasing is perhaps that qemuConnectAgent doesn't clear agentError at the right time, as patch 3 attempts to fix.
Yep. The flag is not cleared. And it can be addressed in two ways: first, clear it on initialization to protect from "unclean" cleanup; second, make cleanup "clean".
After reading through patch 4 - I think Patch 4 and a slightly modified patch 3 will fix what you're seeing, but I'd only know for sure given more details about what was seen.
Patch 4 is pure refactoring IIRC and doesn't address the issue.

Nikolay

On 27.10.2016 11:21, Nikolay Shirokovskiy wrote:
On 26.10.2016 22:48, John Ferlan wrote:
On 10/04/2016 02:56 AM, Nikolay Shirokovskiy wrote:
Usually on domain shutdown the event comes first from qemu and the flag is unset in processSerialChangedEvent. However, there is a race between this function and qemuProcessHandleAgentEOF because the former is executed in a thread pool and the latter synchronously. qemuProcessHandleAgentEOF does not unset the flag, thus if it was triggered before, it remains set after domain shutdown/start and the qemu agent becomes unavailable.
What error occurred that would be avoided by clearing the agentError flag unconditionally? Considering that agentError being set is primarily
In short, the qemu agent becomes unavailable after shutdown. The long error case description is:

1. An agent error occurred and the error flag is set.
2. The domain is shut down. Shutdown follows a path on which the error flag stays set. (Actually the error flag can be set in the process of shutdown, which is I guess the case that I observed.)
3. The domain is started; on any attempt to use the agent you get "QEMU guest agent is not available due to an error".

As nobody clears the error flag on start up, it stays set. Obviously we don't need to clear it on the first domain start after daemon start, because qemuDomainObjPrivate is allocated and cleared to zeros. But restarts rely upon the assumption that this flag is cleared on shutdown. And this is not true. There are shutdown paths where the flag is not cleared. This patch addresses one of these paths:

1. Shutdown is started; in the process VSERPORT_CHANGE is delivered in the IO loop thread. However, it is handled in a different thread (from driver->workerPool).
2. In the same IO loop thread we get EOF from the agent and synchronously execute qemuProcessHandleAgentEOF. It closes the agent and sets priv->agent to NULL, but forgets to clear agentError.
3. Now we execute the VSERPORT_CHANGE handler (processSerialChangedEvent). It is delayed because it is asynchronous to VSERPORT_CHANGE itself. As priv->agent is NULL, the handler does not touch the agentError flag.

I guess usually the time gap between VSERPORT_CHANGE and EOF is large enough that step 3 is executed before step 2 and agentError is cleared upon shutdown.

Basically it is enough to clear the flag on start up (patch 3), but I thought that just in case we would want to clear the flag on agent cleanup too.
used by "new connection" request processing (e.g. callers to qemuDomainAgentAvailable) in order to ascertain whether they should actually attempt the connection or fail because of some condition that would eventually lead to failure. I would think EOF from qemu_monitor for the qemu_agent would be one of those conditions *before* the domain state is not RUNNING or the priv->agent has been fully removed.
I don't really understand the part about 'before the domain state is not RUNNING'.
From what I've seen in the code, agentError has 2 meanings; they are described in patch 5.
I think what I'm missing is what sequence of events led to this condition? How was it reproduced.
--- src/qemu/qemu_process.c | 1 + 1 file changed, 1 insertion(+)
So I understand "eventually" your goal is to remove agentError in patch 6, but you need to lead down the path to get there. So let's see if I can follow along...
My original intention was to fix the bug that I tried to explain above. But in the process I found another path by which a similar error can occur (patch 2). Then I realized that we can protect ourselves from similar bugs in just one place (patch 3). At this point I had looked into the code long enough to realize that one doesn't need the flag at all, and decided that a detailed explanation of patches 1 and 2 would help me convince others to get rid of the flag.
The issue you chased is related to domain shutdown processing "order".
The purpose up to this point in time for 'agentError' is to ensure callers of qemuDomainAgentAvailable() will fail before trying to use the agent. The qemuDomainAgentAvailable checks domain active, agentError, and whether the priv->agent is available.
So the contention is during shutdown the EOF event firing at about the same time as the processSerialChangedEvent has caused an issue, but it's not 100% clear just from the description the sequence of events and the issue. So let's consider why these things happen:
1. The 'eofNotify' comes from the qemuMonitorIO function - ostensibly if the guest is shutting down
2. The 'processSerialChangedEvent' seems to occur for multiple reasons - startup, shutdown, some sort of channel failure (unix socket or pty). If I'm reading things correctly, this path is only involved when VSERPORT is being used.
So, is this a case of guest startup where the initial agent connect/startup occurs (VSERPORT) and a guest shutdown racing? Or is this purely a shutdown ordering thing?
Purely shutdown. And error in io loop from agent before shutdown or in the process.
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
index 31c8453..a581f96 100644
--- a/src/qemu/qemu_process.c
+++ b/src/qemu/qemu_process.c
@@ -151,6 +151,7 @@ qemuProcessHandleAgentEOF(qemuAgentPtr agent,
qemuAgentClose(agent); priv->agent = NULL;
Because this is cleared, then qemuDomainAgentAvailable will fail because the agent is not present instead of failing with there was some error. That would seem to make the following clearing "safe" since we know we'll have a failure any way, but...
+ priv->agentError = false;
How does this alone solve the problem you saw. AFAICT it just changes *where* the error occurs and the error message displayed.
The place that is affected is the following domain start up and agent use.
Conversely, if for some reason for either
1. qemuConnectAgent failed and is being "flagged" in
   1a. processSerialChangedEvent (for initial CONNECTED failure when QEMU_CAPS_VSERPORT_CHANGE)
   1b. qemuMigrationFinish (for migration target)
   1c. qemuProcessReconnect (for libvirtd restart)
   1d. qemuProcessLaunch (initial domain startup before QEMU_CAPS_VSERPORT_CHANGE)
   1e. qemuProcessAttach (for qemu-attach to existing qemu process)
or
2. qemuProcessHandleAgentError - some sort of error and we're stopping any further agent communication.
then you're potentially clearing agentError and could conceivably cause "extra" debugging steps to be taken only to figure out the reason it was because of some error rather than the "next" connection failing and indicating the failure was because of some previous error (it's a startup race for some other thread desiring to use the agent).
As to case 1, AFAIU you are talking about the following case:

1. The agent started (qemuAgentOpen is ok).
2. For some reason we have to fail qemuConnectAgent (virSecurityManagerClearSocketLabel failed, for example) and set the error flag.
3. EOF is triggered and clears the error flag (in the new version).

This is not possible. qemuConnectAgent will eventually remove the agent handle from the IO loop before setting the flag on failure, and EOF cannot be delivered.
I was not correct. IO loop event dispatching drops the IO loop lock while delivering events, thus EOF can be delivered. So we have 2 cases for the race when the agent is started and immediately gets EOF.

A. qemuConnectAgent wins:
1. virSecurityManagerClearSocketLabel fails, priv->agent stays NULL, the error is set.
2. qemuProcessHandleAgentEOF exits early on the priv->agent check.

B. qemuProcessHandleAgentEOF wins:
1. qemuProcessHandleAgentEOF exits early on the priv->agent check. (Looks like the debug message will be misleading in this case.)
2. qemuConnectAgent proceeds as usual.

In both cases, clearing the error flag in qemuProcessHandleAgentEOF does not change anything.
This subtle case is another reason why we should throw away this flag. It adds too much complexity to the analysis. Patch 5 unbinds the flag from this case, the case of the agent failing to start up.
As to case 2, this is one of the patch targets. If there is an error set by qemuProcessHandleAgentError, we need to clean it up when the agent is cleaned up.
virObjectUnlock(vm); return;
So given all that - I'm not 100% clear how this really fixes the issue you say. Although, I have to wonder if the bug you're chasing is perhaps that qemuConnectAgent doesn't clear agentError at the right time as patch 3 attempts to fix.
Yep. The flag is not cleared. And it can be addressed both ways: clear on initialization to protect from "unclean" clean up, second - make clean up "clean".
After reading through patch 4 - I think Patch 4 and a slightly modified patch 3 will fix what you're seeing, but I'd only know for sure given more details about what was seen.
Patch 4 is pure refactoring IIRC and doesn't address the issue.
Nikolay
--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list

If there is an error event after the EOF event, then after the agent is cleaned up on EOF the error flag will be set back and remain set after the next domain start up, making the agent unavailable. Thus let's check before setting this flag on the error event.
---
 src/qemu/qemu_process.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
index a581f96..3637f4b 100644
--- a/src/qemu/qemu_process.c
+++ b/src/qemu/qemu_process.c
@@ -180,7 +180,8 @@ qemuProcessHandleAgentError(qemuAgentPtr agent ATTRIBUTE_UNUSED,
 
     priv = vm->privateData;
 
-    priv->agentError = true;
+    if (priv->agent)
+        priv->agentError = true;
 
     virObjectUnlock(vm);
 }
-- 
1.8.3.1

On 10/04/2016 02:56 AM, Nikolay Shirokovskiy wrote:
If there is an error event after the EOF event, then after the agent is cleaned up on EOF the error flag will be set back and remain set after the next domain start up, making the agent unavailable. Thus let's check before setting this flag on the error event.
---
 src/qemu/qemu_process.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
index a581f96..3637f4b 100644
--- a/src/qemu/qemu_process.c
+++ b/src/qemu/qemu_process.c
@@ -180,7 +180,8 @@ qemuProcessHandleAgentError(qemuAgentPtr agent ATTRIBUTE_UNUSED,
priv = vm->privateData;
-    priv->agentError = true;
+    if (priv->agent)
+        priv->agentError = true;
A NULL priv->agent happens for the following reasons:

1. qemuDomainObjExitAgent - when the last reference is removed... I would hope we're not falling into this path for what you've seen...

2. processSerialChangedEvent - after a qemuAgentClose for the VSERPORT callback, and agentError is set to false... (so that seems to be happening properly)

3. qemuProcessHandleAgentEOF (discussed in patch 1) - where I don't believe you should be clearing agentError

4. qemuConnectAgent has a failure after qemuAgentOpen creates the 'agent', but before it sets priv->agent = agent.

5. qemuProcessStop - stops a started agent *and* clears the agentError flag (so that seems to be happening properly).

w/r/t #4... there are windows in qemuConnectAgent where some error after qemuAgentOpen *succeeds* could trigger this function to be called before priv->agent is filled in. Thus by only setting it when priv->agent is set, we could miss some error that occurred. While the "next" call could also fail and eventually set the error, I'm not convinced that only setting it when priv->agent is set is right.

Long way of saying NAK on this one.

John
virObjectUnlock(vm); }

On 26.10.2016 22:49, John Ferlan wrote:
On 10/04/2016 02:56 AM, Nikolay Shirokovskiy wrote:
If there is an error event after the EOF event, then after the agent is cleaned up on EOF the error flag will be set back and remain set after the next domain start up, making the agent unavailable. Thus let's check before setting this flag on the error event.
---
 src/qemu/qemu_process.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
index a581f96..3637f4b 100644
--- a/src/qemu/qemu_process.c
+++ b/src/qemu/qemu_process.c
@@ -180,7 +180,8 @@ qemuProcessHandleAgentError(qemuAgentPtr agent ATTRIBUTE_UNUSED,
priv = vm->privateData;
-    priv->agentError = true;
+    if (priv->agent)
+        priv->agentError = true;
A NULL priv->agent happens for the following reasons:
1. qemuDomainObjExitAgent - when the last reference is removed... I would hope we're not falling into this path for what you've seen...
2. processSerialChangedEvent - after a qemuAgentClose for VSERPORT callback and agentError is set to false... (so that seems to be happening properly)
3. qemuProcessHandleAgentEOF (discussed in patch 1) - where I don't believe you should be clearing agentError
4. qemuConnectAgent has a failure after qemuAgentOpen creates the 'agent', but before sets priv->agent = agent.
5. qemuProcessStop - stops a started agent *and* clears agentError flag (so that seems to be happening properly).
w/r/t #4... there are windows in qemuConnectAgent where some error after qemuAgentOpen *succeeds* could trigger this function to be called before priv->agent is filled in.
I cannot agree. The places you mentioned and qemuProcessHandleAgentError are serialized through the domain lock. So if the agent start up is ok and the error is in the IO loop, we do not break anything - by the time qemuProcessHandleAgentError gets a chance to handle the error, priv->agent is not NULL. On the other hand, this patch cures a situation similar to the one described in patch 1:

1. Shutdown.
2. We get VSERPORT.
3. We get ERROR from the agent and set the error flag.
4. Again we have problems on the next domain start and agent usage, just as described in patch 1.

Actually this is more like a protection, as on step 2 we close the agent and eventually remove the handle from the IO loop, and this operation is synchronous (no more events will be delivered).
Thus by only setting it when priv->agent is set, we could miss some error that occurred. While the "next" call could also fail and eventually set the error, I'm not convinced that only setting it when priv->agent is set is right.
Long way of saying NAK on this one.
John
virObjectUnlock(vm); }

On 27.10.2016 11:58, Nikolay Shirokovskiy wrote:
On 26.10.2016 22:49, John Ferlan wrote:
On 10/04/2016 02:56 AM, Nikolay Shirokovskiy wrote:
If there is an error event after the EOF event, then after the agent is cleaned up on EOF the error flag will be set back and remain set after the next domain start up, making the agent unavailable. Thus let's check before setting this flag on the error event.
---
 src/qemu/qemu_process.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
index a581f96..3637f4b 100644
--- a/src/qemu/qemu_process.c
+++ b/src/qemu/qemu_process.c
@@ -180,7 +180,8 @@ qemuProcessHandleAgentError(qemuAgentPtr agent ATTRIBUTE_UNUSED,
priv = vm->privateData;
-    priv->agentError = true;
+    if (priv->agent)
+        priv->agentError = true;
A NULL priv->agent happens for the following reasons:
1. qemuDomainObjExitAgent - when the last reference is removed... I would hope we're not falling into this path for what you've seen...
2. processSerialChangedEvent - after a qemuAgentClose for VSERPORT callback and agentError is set to false... (so that seems to be happening properly)
3. qemuProcessHandleAgentEOF (discussed in patch 1) - where I don't believe you should be clearing agentError
4. qemuConnectAgent has a failure after qemuAgentOpen creates the 'agent', but before sets priv->agent = agent.
5. qemuProcessStop - stops a started agent *and* clears agentError flag (so that seems to be happening properly).
w/r/t #4... there are windows in qemuConnectAgent where some error after qemuAgentOpen *succeeds* could trigger this function to be called before priv->agent is filled in.
I cannot agree. The places you mentioned and qemuProcessHandleAgentError are serialized through the domain lock. So if the agent start up is ok and the error is in the IO loop, we do not break anything - by the time qemuProcessHandleAgentError gets a chance to handle the error, priv->agent is not NULL. On the other hand, this patch cures a situation similar to the one described in patch 1:

1. Shutdown.
2. We get VSERPORT.
3. We get ERROR from the agent and set the error flag.
4. Again we have problems on the next domain start and agent usage, just as described in patch 1.

Actually this is more like a protection, as on step 2 we close the agent and eventually remove the handle from the IO loop, and this operation is synchronous (no more events will be delivered).
Ohh, I failed again with my reasoning... The lock is dropped during qemuAgentOpen in qemuConnectAgent. So let's consider it again. Yes, if we set the error in qemuProcessHandleAgentError conditionally, we can miss an agent error upon startup etc. But hopefully the next time we try to use the agent we get a timeout/new error. On the other hand, the error can be delivered after priv->agent is closed and set to NULL on shutdown. In this case, if we set the error flag unconditionally, we get problems on the next domain start and agent use, just as in the case of patch 1. Eventually patch 3 makes patches 1 and 2 unnecessary. I just wanted to keep a correct value in the error flag, just in case. As to what the meaning of this flag and its correct value are - patch 5 explains it.
Thus by only setting it when priv->agent is set, we could miss some error that occurred. While the "next" call could also fail and eventually set the error, I'm not convinced that only setting it when priv->agent is set is right.
Long way of saying NAK on this one.
John
virObjectUnlock(vm); }

There were a few errors in the code where this flag was not cleared upon monitor cleanup. All of them can be fixed by just resetting this flag upon agent monitor initialization.
---
 src/qemu/qemu_process.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
index 3637f4b..42f7f84 100644
--- a/src/qemu/qemu_process.c
+++ b/src/qemu/qemu_process.c
@@ -210,6 +210,8 @@ qemuConnectAgent(virQEMUDriverPtr driver, virDomainObjPtr vm)
     qemuAgentPtr agent = NULL;
     virDomainChrDefPtr config = qemuFindAgentConfig(vm->def);
 
+    priv->agentError = false;
+
     if (!config)
         return 0;
-- 
1.8.3.1

On 10/04/2016 02:56 AM, Nikolay Shirokovskiy wrote:
There were a few errors in the code when this flag was not cleared upon monitor cleanup. All of them could be fixed just resetting this flag upon agent monitor initialization.
We should fix the places where the flag wasn't cleared properly then.
--- src/qemu/qemu_process.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 3637f4b..42f7f84 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -210,6 +210,8 @@ qemuConnectAgent(virQEMUDriverPtr driver, virDomainObjPtr vm) qemuAgentPtr agent = NULL; virDomainChrDefPtr config = qemuFindAgentConfig(vm->def);
+ priv->agentError = false; +
This idea actually may have some merit and may actually be the cause of the bug you saw, but not right here.
if (!config) return 0;
I think perhaps a better place would be some time after we ascertain that "priv->agent" is not already set, since the only way to set it is as a result of this function.

My first inclination was in that VSERPORT_CHANGE check and return 0 (e.g. if called from qemuMigrationFinish, qemuProcessReconnect, qemuProcessLaunch, or qemuProcessAttach). But then I wasn't sure if there would be a race between setting 'state' to VIR_DOMAIN_CHR_DEVICE_STATE_CONNECTED in processSerialChangedEvent and one of the 4 other paths.

So, I think perhaps the better place would be after the virObjectRef(vm) prior to calling qemuAgentOpen and unlocking the vm. IOW: We have the lock, alter priv->agentError. We're about to start anyway... That way, we'd know we're about to unconditionally clear the agentError just as if we were starting the first time, reconnecting to a running domain, migrating, or attaching.

Thus, ignoring patches 1 and 2 for a moment and considering this patch alone, we'd know for sure the agentError was cleared before a "real" startup and after a qemuProcessStop. Thus clearing before we start seems to be right (although it makes me wonder how it could have been set).

Personally, I'd start with applying patch 4 and then trying to reproduce the condition... Have some sort of "marker" here that checks if agentError is true and then complains. If we can figure out how we get into this function with agentError being set, then I think that'll really help me understand the order of events...

John

On 26.10.2016 22:53, John Ferlan wrote:
On 10/04/2016 02:56 AM, Nikolay Shirokovskiy wrote:
There were a few errors in the code when this flag was not cleared upon monitor cleanup. All of them could be fixed just resetting this flag upon agent monitor initialization.
We should fix the places where the flag wasn't cleared properly then.
Well, the first 2 patches are exactly for this purpose.
--- src/qemu/qemu_process.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 3637f4b..42f7f84 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -210,6 +210,8 @@ qemuConnectAgent(virQEMUDriverPtr driver, virDomainObjPtr vm) qemuAgentPtr agent = NULL; virDomainChrDefPtr config = qemuFindAgentConfig(vm->def);
+ priv->agentError = false; +
This idea actually may have some merit and may actually be the cause of the bug you saw, but not right here.
if (!config) return 0;
I think perhaps a better place would be some time after we ascertain that "priv->agent" is not already set since the only way to set it is as a result of this function.
My first inclination was in that VSERPORT_CHANGE check and return 0 (e.g. if called from qemuMigrationFinish, qemuProcessReconnect, qemuProcessLaunch, or qemuProcessAttach).
But then I wasn't sure if there would be a race between setting 'state' to VIR_DOMAIN_CHR_DEVICE_STATE_CONNECTED in processSerialChangedEvent and one of the 4 other paths.
So, I think perhaps the better place would be after the virObjectRef(vm) prior to calling qemuAgentOpen and unlocking the vm. IOW: We have the lock, alter priv->agentError. We're about to start anyway...
If we fail to clean up the error flag correctly on shutdown (patches 1 and 2 are dropped), then cleaning it up in this place will be too late. For example, in case we have QEMU_CAPS_VSERPORT_CHANGE, then until the serial port event arrives we will be in an error state while the agent is really just unconnected. Resetting the error flag in this function is really a hack, a protection from errors like the ones in patch 1 or 2. After thinking a while I would even put the resetting before the "if (!config) return 0;" check. Consider the next situation:

1. shutdown, and the error flag is set after shutdown as described earlier
2. the config is changed so that the agent channel is removed
3. domain start
4. now, as the flag is set, on an attempt to use the agent we get the same "QEMU guest agent is not available due to an error" message while the agent is just not configured.
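A standalone sketch of the placement being argued for here, with the reset moved above the config check. All names are mocked up for illustration and are not libvirt's actual API:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Mock of the relevant bits of qemuDomainObjPrivate. */
typedef struct {
    bool agent;       /* stands in for priv->agent != NULL */
    bool agentError;  /* stands in for priv->agentError */
} DomainPriv;

/* Mock of the entry of qemuConnectAgent with the reset placed before
 * the "no agent channel configured" check, per the scenario above. */
static int connect_agent(DomainPriv *priv, bool has_config)
{
    priv->agentError = false;   /* clear even when the channel was removed */

    if (!has_config)
        return 0;

    priv->agent = true;         /* pretend the socket opened fine */
    return 0;
}

/* Mock of what a qemuDomainAgentAvailable-style check would report. */
static const char *agent_state(const DomainPriv *priv)
{
    if (priv->agentError)
        return "error";
    return priv->agent ? "connected" : "not configured";
}
```

With the reset below the config check instead, a flag left over from step 1 would make `agent_state()` report "error" after step 3, even though the channel was removed in step 2.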
That way, we'd know we're about to unconditionally clear the agentError just as if we were starting the first time, reconnecting to a running domain, migrating, or attaching.
Thus ignoring patch 1 and 2 for a moment and considering this patch alone, we'd know for sure the agentError was cleared before a "real" startup and after a qemuProcessStop. Thus clearing before we start seems to be right (although it makes me wonder how it could have been set).
Personally, I'd start with applying patch 4 and then trying to reproduce the condition... Have some sort of "marker" here that checks if agentError is true then complain. If we can figure out how we get into this function with agentError being set, then I think that'll really help me understand the order of events...
John

---
 src/qemu/qemu_driver.c    |  8 +------
 src/qemu/qemu_migration.c | 13 ++---------
 src/qemu/qemu_process.c   | 58 ++++++++++++-----------------------------------
 3 files changed, 17 insertions(+), 62 deletions(-)

diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
index 12ddbc0..edff973 100644
--- a/src/qemu/qemu_driver.c
+++ b/src/qemu/qemu_driver.c
@@ -4397,7 +4397,6 @@ processSerialChangedEvent(virQEMUDriverPtr driver,
     virObjectEventPtr event = NULL;
     virDomainDeviceDef dev;
     qemuDomainObjPrivatePtr priv = vm->privateData;
-    int rc;
 
     if (connected)
         newstate = VIR_DOMAIN_CHR_DEVICE_STATE_CONNECTED;
@@ -4450,13 +4449,8 @@ processSerialChangedEvent(virQEMUDriverPtr driver,
     if (STREQ_NULLABLE(dev.data.chr->target.name, "org.qemu.guest_agent.0")) {
         if (newstate == VIR_DOMAIN_CHR_DEVICE_STATE_CONNECTED) {
-            if (!priv->agent) {
-                if ((rc = qemuConnectAgent(driver, vm)) == -2)
+            if (!priv->agent && qemuConnectAgent(driver, vm) < 0)
                     goto endjob;
-
-                if (rc < 0)
-                    priv->agentError = true;
-            }
         } else {
             if (priv->agent) {
                 qemuAgentClose(priv->agent);
diff --git a/src/qemu/qemu_migration.c b/src/qemu/qemu_migration.c
index e2ca330..0a02236 100644
--- a/src/qemu/qemu_migration.c
+++ b/src/qemu/qemu_migration.c
@@ -6171,7 +6171,6 @@ qemuMigrationFinish(virQEMUDriverPtr driver,
     unsigned short port;
     unsigned long long timeReceived = 0;
     virObjectEventPtr event;
-    int rc;
     qemuDomainJobInfoPtr jobInfo = NULL;
     bool inPostCopy = false;
     bool doKill = true;
@@ -6244,16 +6243,8 @@ qemuMigrationFinish(virQEMUDriverPtr driver,
                                          QEMU_ASYNC_JOB_MIGRATION_IN) < 0)
             goto endjob;
 
-        if ((rc = qemuConnectAgent(driver, vm)) < 0) {
-            if (rc == -2)
-                goto endjob;
-
-            VIR_WARN("Cannot connect to QEMU guest agent for %s",
-                     vm->def->name);
-            virResetLastError();
-            priv->agentError = true;
-        }
-
+        if (qemuConnectAgent(driver, vm) < 0)
+            goto endjob;
 
         if (flags & VIR_MIGRATE_PERSIST_DEST) {
             if (qemuMigrationPersist(driver, vm, mig, !v3proto) < 0) {
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
index 42f7f84..d7c9ce3 100644
--- a/src/qemu/qemu_process.c
+++ b/src/qemu/qemu_process.c
@@ -206,7 +206,6 @@ int
 qemuConnectAgent(virQEMUDriverPtr driver, virDomainObjPtr vm)
 {
     qemuDomainObjPrivatePtr priv = vm->privateData;
-    int ret = -1;
     qemuAgentPtr agent = NULL;
     virDomainChrDefPtr config = qemuFindAgentConfig(vm->def);
 
@@ -252,8 +251,7 @@ qemuConnectAgent(virQEMUDriverPtr driver, virDomainObjPtr vm)
         qemuAgentClose(agent);
         virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
                        _("guest crashed while connecting to the guest agent"));
-        ret = -2;
-        goto cleanup;
+        return -1;
     }
 
     if (virSecurityManagerClearSocketLabel(driver->securityManager,
@@ -264,18 +262,16 @@ qemuConnectAgent(virQEMUDriverPtr driver, virDomainObjPtr vm)
         goto cleanup;
     }
 
-    priv->agent = agent;
-
-    if (priv->agent == NULL) {
-        VIR_INFO("Failed to connect agent for %s", vm->def->name);
-        goto cleanup;
-    }
-
-    ret = 0;
-
- cleanup:
-    return ret;
+    if (!priv->agent) {
+        VIR_WARN("Cannot connect to QEMU guest agent for %s", vm->def->name);
+        priv->agentError = true;
+        virResetLastError();
+    }
+
+    return 0;
 }
 
 
@@ -3249,7 +3245,6 @@ qemuProcessReconnect(void *opaque)
     int reason;
     virQEMUDriverConfigPtr cfg;
     size_t i;
-    int ret;
     unsigned int stopFlags = 0;
     bool jobStarted = false;
     virCapsPtr caps = NULL;
@@ -3393,16 +3388,8 @@ qemuProcessReconnect(void *opaque)
     if (qemuProcessUpdateDevices(driver, obj) < 0)
         goto error;
 
-    /* Failure to connect to agent shouldn't be fatal */
-    if ((ret = qemuConnectAgent(driver, obj)) < 0) {
-        if (ret == -2)
-            goto error;
-
-        VIR_WARN("Cannot connect to QEMU guest agent for %s",
-                 obj->def->name);
-        virResetLastError();
-        priv->agentError = true;
-    }
+    if (qemuConnectAgent(driver, obj) < 0)
+        goto error;
 
     /* update domain state XML with possibly updated state in virDomainObj */
     if (virDomainSaveStatus(driver->xmlopt, cfg->stateDir, obj, driver->caps) < 0)
@@ -5488,16 +5475,8 @@ qemuProcessLaunch(virConnectPtr conn,
     if (qemuProcessWaitForMonitor(driver, vm, asyncJob, priv->qemuCaps,
                                   logCtxt) < 0)
         goto cleanup;
 
-    /* Failure to connect to agent shouldn't be fatal */
-    if ((rv = qemuConnectAgent(driver, vm)) < 0) {
-        if (rv == -2)
-            goto cleanup;
-
-        VIR_WARN("Cannot connect to QEMU guest agent for %s",
-                 vm->def->name);
-        virResetLastError();
-        priv->agentError = true;
-    }
+    if (qemuConnectAgent(driver, vm) < 0)
+        goto cleanup;
 
     VIR_DEBUG("Detecting if required emulator features are present");
     if (!qemuProcessVerifyGuestCPU(driver, vm, asyncJob))
@@ -6176,7 +6155,6 @@ int qemuProcessAttach(virConnectPtr conn ATTRIBUTE_UNUSED,
     virQEMUDriverConfigPtr cfg = virQEMUDriverGetConfig(driver);
     virCapsPtr caps = NULL;
     bool active = false;
-    int ret;
 
     VIR_DEBUG("Beginning VM attach process");
 
@@ -6303,16 +6281,8 @@ int qemuProcessAttach(virConnectPtr conn ATTRIBUTE_UNUSED,
     if (qemuProcessWaitForMonitor(driver, vm, QEMU_ASYNC_JOB_NONE,
                                   priv->qemuCaps, NULL) < 0)
         goto error;
 
-    /* Failure to connect to agent shouldn't be fatal */
-    if ((ret = qemuConnectAgent(driver, vm)) < 0) {
-        if (ret == -2)
-            goto error;
-
-        VIR_WARN("Cannot connect to QEMU guest agent for %s",
-                 vm->def->name);
-        virResetLastError();
-        priv->agentError = true;
-    }
+    if (qemuConnectAgent(driver, vm) < 0)
+        goto error;
 
     VIR_DEBUG("Detecting VCPU PIDs");
     if (qemuDomainRefreshVcpuInfo(driver, vm, QEMU_ASYNC_JOB_NONE, false) < 0)
-- 
1.8.3.1

There's no commit message... You're altering qemuConnectAgent to return -1 on the only error and perform the VIR_WARN plus agentError = true on other "soft" failures. Not exactly nothing going on! On 10/04/2016 02:56 AM, Nikolay Shirokovskiy wrote:
--- src/qemu/qemu_driver.c | 8 +------ src/qemu/qemu_migration.c | 13 ++--------- src/qemu/qemu_process.c | 58 ++++++++++++----------------------------------- 3 files changed, 17 insertions(+), 62 deletions(-)
This idea seems to have merit too - something that I've thought now that I've read the first 3 patches... I think you should have started with this patch, it probably would have made the others easier to think about. In fact, I'm curious whether, with just this change and the suggestion in patch 3 for clearing agentError, you can reproduce the bug you're trying to fix/resolve without any of the other patches.
diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index 12ddbc0..edff973 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -4397,7 +4397,6 @@ processSerialChangedEvent(virQEMUDriverPtr driver, virObjectEventPtr event = NULL; virDomainDeviceDef dev; qemuDomainObjPrivatePtr priv = vm->privateData; - int rc;
if (connected) newstate = VIR_DOMAIN_CHR_DEVICE_STATE_CONNECTED; @@ -4450,13 +4449,8 @@ processSerialChangedEvent(virQEMUDriverPtr driver,
if (STREQ_NULLABLE(dev.data.chr->target.name, "org.qemu.guest_agent.0")) { if (newstate == VIR_DOMAIN_CHR_DEVICE_STATE_CONNECTED) { - if (!priv->agent) { - if ((rc = qemuConnectAgent(driver, vm)) == -2) + if (!priv->agent && qemuConnectAgent(driver, vm) < 0)
FWIW: qemuConnectAgent returns 0 "if (priv->agent)", so there's one less check here too. Could just be "if (qemuConnectAgent(driver, vm) < 0)"
goto endjob;
The indention for this is off (remove leading 4 spaces)
- - if (rc < 0) - priv->agentError = true; - } } else { if (priv->agent) { qemuAgentClose(priv->agent); diff --git a/src/qemu/qemu_migration.c b/src/qemu/qemu_migration.c index e2ca330..0a02236 100644 --- a/src/qemu/qemu_migration.c +++ b/src/qemu/qemu_migration.c @@ -6171,7 +6171,6 @@ qemuMigrationFinish(virQEMUDriverPtr driver, unsigned short port; unsigned long long timeReceived = 0; virObjectEventPtr event; - int rc; qemuDomainJobInfoPtr jobInfo = NULL; bool inPostCopy = false; bool doKill = true; @@ -6244,16 +6243,8 @@ qemuMigrationFinish(virQEMUDriverPtr driver, QEMU_ASYNC_JOB_MIGRATION_IN) < 0) goto endjob;
- if ((rc = qemuConnectAgent(driver, vm)) < 0) { - if (rc == -2) - goto endjob; - - VIR_WARN("Cannot connect to QEMU guest agent for %s", - vm->def->name); - virResetLastError(); - priv->agentError = true; - } - + if (qemuConnectAgent(driver, vm) < 0) + goto endjob;
if (flags & VIR_MIGRATE_PERSIST_DEST) { if (qemuMigrationPersist(driver, vm, mig, !v3proto) < 0) { diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 42f7f84..d7c9ce3 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -206,7 +206,6 @@ int qemuConnectAgent(virQEMUDriverPtr driver, virDomainObjPtr vm) { qemuDomainObjPrivatePtr priv = vm->privateData; - int ret = -1; qemuAgentPtr agent = NULL; virDomainChrDefPtr config = qemuFindAgentConfig(vm->def);
@@ -252,8 +251,7 @@ qemuConnectAgent(virQEMUDriverPtr driver, virDomainObjPtr vm) qemuAgentClose(agent); virReportError(VIR_ERR_INTERNAL_ERROR, "%s", _("guest crashed while connecting to the guest agent")); - ret = -2; - goto cleanup; + return -1; }
if (virSecurityManagerClearSocketLabel(driver->securityManager, @@ -264,18 +262,16 @@ qemuConnectAgent(virQEMUDriverPtr driver, virDomainObjPtr vm) goto cleanup; }
- priv->agent = agent;
- if (priv->agent == NULL) { - VIR_INFO("Failed to connect agent for %s", vm->def->name);
Interesting: a "marker" of sorts to know that qemuAgentOpen "failed" is that we'd get an Informational "Failed to connect..." followed shortly thereafter by a Warning "Cannot connect..." (depending of course on one's message display severity level).

I think if you "restore" this without the goto cleanup (below) and of course the removal of the {}, then at least message parity is achieved... I suppose it could move up into the "if (agent == NULL)" too, but then it could be given even though an Error is reported.

It's not that important, but it does leave some breadcrumbs. Again I'd like to see the breadcrumbs with just this patch applied to reproduce what you're chasing in patches 1-4...

John
- goto cleanup; - } - - ret = 0; - cleanup: - return ret; + if (!priv->agent) {
+ VIR_WARN("Cannot connect to QEMU guest agent for %s", vm->def->name); + priv->agentError = true; + virResetLastError(); + } + + return 0; }
@@ -3249,7 +3245,6 @@ qemuProcessReconnect(void *opaque) int reason; virQEMUDriverConfigPtr cfg; size_t i; - int ret; unsigned int stopFlags = 0; bool jobStarted = false; virCapsPtr caps = NULL; @@ -3393,16 +3388,8 @@ qemuProcessReconnect(void *opaque) if (qemuProcessUpdateDevices(driver, obj) < 0) goto error;
- /* Failure to connect to agent shouldn't be fatal */ - if ((ret = qemuConnectAgent(driver, obj)) < 0) { - if (ret == -2) - goto error; - - VIR_WARN("Cannot connect to QEMU guest agent for %s", - obj->def->name); - virResetLastError(); - priv->agentError = true; - } + if (qemuConnectAgent(driver, obj) < 0) + goto error;
/* update domain state XML with possibly updated state in virDomainObj */ if (virDomainSaveStatus(driver->xmlopt, cfg->stateDir, obj, driver->caps) < 0) @@ -5488,16 +5475,8 @@ qemuProcessLaunch(virConnectPtr conn, if (qemuProcessWaitForMonitor(driver, vm, asyncJob, priv->qemuCaps, logCtxt) < 0) goto cleanup;
- /* Failure to connect to agent shouldn't be fatal */ - if ((rv = qemuConnectAgent(driver, vm)) < 0) { - if (rv == -2) - goto cleanup; - - VIR_WARN("Cannot connect to QEMU guest agent for %s", - vm->def->name); - virResetLastError(); - priv->agentError = true; - } + if (qemuConnectAgent(driver, vm) < 0) + goto cleanup;
VIR_DEBUG("Detecting if required emulator features are present"); if (!qemuProcessVerifyGuestCPU(driver, vm, asyncJob)) @@ -6176,7 +6155,6 @@ int qemuProcessAttach(virConnectPtr conn ATTRIBUTE_UNUSED, virQEMUDriverConfigPtr cfg = virQEMUDriverGetConfig(driver); virCapsPtr caps = NULL; bool active = false; - int ret;
VIR_DEBUG("Beginning VM attach process");
@@ -6303,16 +6281,8 @@ int qemuProcessAttach(virConnectPtr conn ATTRIBUTE_UNUSED, if (qemuProcessWaitForMonitor(driver, vm, QEMU_ASYNC_JOB_NONE, priv->qemuCaps, NULL) < 0) goto error;
- /* Failure to connect to agent shouldn't be fatal */ - if ((ret = qemuConnectAgent(driver, vm)) < 0) { - if (ret == -2) - goto error; - - VIR_WARN("Cannot connect to QEMU guest agent for %s", - vm->def->name); - virResetLastError(); - priv->agentError = true; - } + if (qemuConnectAgent(driver, vm) < 0) + goto error;
VIR_DEBUG("Detecting VCPU PIDs"); if (qemuDomainRefreshVcpuInfo(driver, vm, QEMU_ASYNC_JOB_NONE, false) < 0)

On 26.10.2016 22:57, John Ferlan wrote:
There's no commit message...
This is simple refactoring.
You're altering qemuConnectAgent to return -1 on the only error and perform the VIR_WARN plus agentError = true on other "soft" failures.
Not exactly nothing going on!
This is what already present in code, I've just moved soft failures in qemuConnectAgent.
On 10/04/2016 02:56 AM, Nikolay Shirokovskiy wrote:
--- src/qemu/qemu_driver.c | 8 +------ src/qemu/qemu_migration.c | 13 ++--------- src/qemu/qemu_process.c | 58 ++++++++++++----------------------------------- 3 files changed, 17 insertions(+), 62 deletions(-)
This idea seems to have merit too - something that I've thought now that I've read the first 3 patches...
I think you should have started with this patch, it probably would have made the others easier to think about. In fact, I'm curious if with just this change and the suggestion in patch 3 for clearing agentError whether you can reproduced the bug you're trying to fix/resolve without any of the other patches.
diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index 12ddbc0..edff973 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -4397,7 +4397,6 @@ processSerialChangedEvent(virQEMUDriverPtr driver, virObjectEventPtr event = NULL; virDomainDeviceDef dev; qemuDomainObjPrivatePtr priv = vm->privateData; - int rc;
if (connected) newstate = VIR_DOMAIN_CHR_DEVICE_STATE_CONNECTED; @@ -4450,13 +4449,8 @@ processSerialChangedEvent(virQEMUDriverPtr driver,
if (STREQ_NULLABLE(dev.data.chr->target.name, "org.qemu.guest_agent.0")) { if (newstate == VIR_DOMAIN_CHR_DEVICE_STATE_CONNECTED) { - if (!priv->agent) { - if ((rc = qemuConnectAgent(driver, vm)) == -2) + if (!priv->agent && qemuConnectAgent(driver, vm) < 0)
FWIW: qemuConnectAgent returns 0 "if (priv->agent)", so there's one less check here too. Could just be "if (qemuConnectAgent(driver, vm) < 0)"
It depends on patch 3. If we clear the error early in qemuConnectAgent then we'd better check before calling this function.
goto endjob;
The indention for this is off (remove leading 4 spaces)
- - if (rc < 0) - priv->agentError = true; - } } else { if (priv->agent) { qemuAgentClose(priv->agent); diff --git a/src/qemu/qemu_migration.c b/src/qemu/qemu_migration.c index e2ca330..0a02236 100644 --- a/src/qemu/qemu_migration.c +++ b/src/qemu/qemu_migration.c @@ -6171,7 +6171,6 @@ qemuMigrationFinish(virQEMUDriverPtr driver, unsigned short port; unsigned long long timeReceived = 0; virObjectEventPtr event; - int rc; qemuDomainJobInfoPtr jobInfo = NULL; bool inPostCopy = false; bool doKill = true; @@ -6244,16 +6243,8 @@ qemuMigrationFinish(virQEMUDriverPtr driver, QEMU_ASYNC_JOB_MIGRATION_IN) < 0) goto endjob;
- if ((rc = qemuConnectAgent(driver, vm)) < 0) { - if (rc == -2) - goto endjob; - - VIR_WARN("Cannot connect to QEMU guest agent for %s", - vm->def->name); - virResetLastError(); - priv->agentError = true; - } - + if (qemuConnectAgent(driver, vm) < 0) + goto endjob;
if (flags & VIR_MIGRATE_PERSIST_DEST) { if (qemuMigrationPersist(driver, vm, mig, !v3proto) < 0) { diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 42f7f84..d7c9ce3 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -206,7 +206,6 @@ int qemuConnectAgent(virQEMUDriverPtr driver, virDomainObjPtr vm) { qemuDomainObjPrivatePtr priv = vm->privateData; - int ret = -1; qemuAgentPtr agent = NULL; virDomainChrDefPtr config = qemuFindAgentConfig(vm->def);
@@ -252,8 +251,7 @@ qemuConnectAgent(virQEMUDriverPtr driver, virDomainObjPtr vm) qemuAgentClose(agent); virReportError(VIR_ERR_INTERNAL_ERROR, "%s", _("guest crashed while connecting to the guest agent")); - ret = -2; - goto cleanup; + return -1; }
if (virSecurityManagerClearSocketLabel(driver->securityManager, @@ -264,18 +262,16 @@ qemuConnectAgent(virQEMUDriverPtr driver, virDomainObjPtr vm) goto cleanup; }
- priv->agent = agent;
- if (priv->agent == NULL) { - VIR_INFO("Failed to connect agent for %s", vm->def->name);
Interesting a "marker" of sorts to know the virAgentOpen "failed" is that we'd get an Informational "Failed to connect..." followed shortly thereafter by a Warning "Cannot connect..." (depending of course on one's message display severity level).
I think if you "restore" this without the goto cleanup (below) and of course the removal of the {}, then at least message parity is achieved... I suppose it could move up into the "if (agent == NULL)" too, but then it could be given even though an Error is reported.
It's not that important, but it does leave some breadcrumbs.
Agree. Even though I'm still wondering how useful this message is, since this patch is intended to be refactoring, let's keep this message.
Again I'd like to see the breadcrumbs with just this patch applied to reproduced what you're chasing in patches 1-4...
Sorry, not possible. The situation I've described in patch 1 is based on analysis. However, after applying patches 1-3 the bug has not occurred in our test system anymore. Nikolay

On 10/27/2016 07:34 AM, Nikolay Shirokovskiy wrote:
On 26.10.2016 22:57, John Ferlan wrote:
There's no commit message...
This is simple refactoring.
Not really... A simple refactor would have kept the -2 logic. What's been done here is to let qemuConnectAgent make the decision on "hard" or "soft" error and only use < 0 as failure
You're altering qemuConnectAgent to return -1 on the only error and perform the VIR_WARN plus agentError = true on other "soft" failures.
Not exactly nothing going on!
This is what already present in code, I've just moved soft failures in qemuConnectAgent.
On 10/04/2016 02:56 AM, Nikolay Shirokovskiy wrote:
--- src/qemu/qemu_driver.c | 8 +------ src/qemu/qemu_migration.c | 13 ++--------- src/qemu/qemu_process.c | 58 ++++++++++++----------------------------------- 3 files changed, 17 insertions(+), 62 deletions(-)
This idea seems to have merit too - something that I've thought now that I've read the first 3 patches...
I think you should have started with this patch, it probably would have made the others easier to think about. In fact, I'm curious if with just this change and the suggestion in patch 3 for clearing agentError whether you can reproduced the bug you're trying to fix/resolve without any of the other patches.
diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index 12ddbc0..edff973 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -4397,7 +4397,6 @@ processSerialChangedEvent(virQEMUDriverPtr driver, virObjectEventPtr event = NULL; virDomainDeviceDef dev; qemuDomainObjPrivatePtr priv = vm->privateData; - int rc;
if (connected) newstate = VIR_DOMAIN_CHR_DEVICE_STATE_CONNECTED; @@ -4450,13 +4449,8 @@ processSerialChangedEvent(virQEMUDriverPtr driver,
if (STREQ_NULLABLE(dev.data.chr->target.name, "org.qemu.guest_agent.0")) { if (newstate == VIR_DOMAIN_CHR_DEVICE_STATE_CONNECTED) { - if (!priv->agent) { - if ((rc = qemuConnectAgent(driver, vm)) == -2) + if (!priv->agent && qemuConnectAgent(driver, vm) < 0)
FWIW: qemuConnectAgent returns 0 "if (priv->agent)", so there's one less check here too. Could just be "if (qemuConnectAgent(driver, vm) < 0)"
It depends on patch 3. If we clear the error early in qemuConnectAgent then we'd better check before calling this function.
Right, my mind was already there, or more succinctly said - I was trying to consider this as the first patch, as I think it'll make the reasoning a bit more clear. Since the "issue" has been described now as a pure shutdown and restart after error problem, without needing to consider the migration, reconnect, and attach paths separately, it's easier to focus on.

So my feeling now, after reading and thinking a bit about everything so far, is that this patch could go in (mostly) as is. That way there are then 2 places to set agentError=true and two places to set agentError=false (excluding the original allocation and reconnect paths).

I don't think we can "combine" the Error and EOF paths. They're separate. What happens if some day someone uses the VSERPORT logic to somehow restart an agent after an error, after a close, and before a shutdown? IOW: Some kind of error recovery logic.

In order to handle the issue where agentError is set, but we cannot start again (e.g. the have-error, do-shutdown, do-start path) - add an unconditional priv->agentError = false in qemuConnectAgent before the virSecurityManagerSetDaemonSocketLabel call. The other agentError = false settings can then be removed. So we're left with 2 places to set the error and one place to clear it.

I'll continue to consider this, but figured I'd get out a response to all this sooner rather than later. I'm less convinced mucking with combining EOF and Error is a good idea in patches 7-10, but I haven't thought too much about it either.

John
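The shape being proposed here - soft failures flagged inside qemuConnectAgent, with a single unconditional clearing site just before the connection attempt - can be sketched in isolation like this. These are mock names for illustration, not the real function:

```c
#include <assert.h>
#include <stdbool.h>

/* Mock of the relevant bits of qemuDomainObjPrivate. */
typedef struct {
    bool agent;       /* stands in for priv->agent != NULL */
    bool agentError;  /* stands in for priv->agentError */
} DomainPriv;

/* Mock qemuConnectAgent per the suggestion above: one clearing site,
 * a soft failure raises the flag but still returns 0 ("shouldn't be
 * fatal"); only a hard failure would return -1 (not modelled here). */
static int connect_agent(DomainPriv *priv, bool open_ok)
{
    priv->agentError = false;      /* the single clearing site */

    if (!open_ok) {
        priv->agentError = true;   /* soft failure: warn and carry on */
        return 0;
    }

    priv->agent = true;
    return 0;
}
```

With this layout there are two places that can raise the flag (the soft-failure branch here plus the error event handler) but only one place that clears it, which is the "2 set, 1 clear" accounting described above.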
goto endjob;
The indention for this is off (remove leading 4 spaces)
- - if (rc < 0) - priv->agentError = true; - } } else { if (priv->agent) { qemuAgentClose(priv->agent); diff --git a/src/qemu/qemu_migration.c b/src/qemu/qemu_migration.c index e2ca330..0a02236 100644 --- a/src/qemu/qemu_migration.c +++ b/src/qemu/qemu_migration.c @@ -6171,7 +6171,6 @@ qemuMigrationFinish(virQEMUDriverPtr driver, unsigned short port; unsigned long long timeReceived = 0; virObjectEventPtr event; - int rc; qemuDomainJobInfoPtr jobInfo = NULL; bool inPostCopy = false; bool doKill = true; @@ -6244,16 +6243,8 @@ qemuMigrationFinish(virQEMUDriverPtr driver, QEMU_ASYNC_JOB_MIGRATION_IN) < 0) goto endjob;
- if ((rc = qemuConnectAgent(driver, vm)) < 0) { - if (rc == -2) - goto endjob; - - VIR_WARN("Cannot connect to QEMU guest agent for %s", - vm->def->name); - virResetLastError(); - priv->agentError = true; - } - + if (qemuConnectAgent(driver, vm) < 0) + goto endjob;
if (flags & VIR_MIGRATE_PERSIST_DEST) { if (qemuMigrationPersist(driver, vm, mig, !v3proto) < 0) { diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 42f7f84..d7c9ce3 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -206,7 +206,6 @@ int qemuConnectAgent(virQEMUDriverPtr driver, virDomainObjPtr vm) { qemuDomainObjPrivatePtr priv = vm->privateData; - int ret = -1; qemuAgentPtr agent = NULL; virDomainChrDefPtr config = qemuFindAgentConfig(vm->def);
@@ -252,8 +251,7 @@ qemuConnectAgent(virQEMUDriverPtr driver, virDomainObjPtr vm) qemuAgentClose(agent); virReportError(VIR_ERR_INTERNAL_ERROR, "%s", _("guest crashed while connecting to the guest agent")); - ret = -2; - goto cleanup; + return -1; }
if (virSecurityManagerClearSocketLabel(driver->securityManager, @@ -264,18 +262,16 @@ qemuConnectAgent(virQEMUDriverPtr driver, virDomainObjPtr vm) goto cleanup; }
- priv->agent = agent;
- if (priv->agent == NULL) { - VIR_INFO("Failed to connect agent for %s", vm->def->name);
Interesting a "marker" of sorts to know the virAgentOpen "failed" is that we'd get an Informational "Failed to connect..." followed shortly thereafter by a Warning "Cannot connect..." (depending of course on one's message display severity level).
I think if you "restore" this without the goto cleanup (below) and of course the removal of the {}, then at least message parity is achieved... I suppose it could move up into the "if (agent == NULL)" too, but then it could be given even though an Error is reported.
It's not that important, but it does leave some breadcrumbs.
Agree. Even though I'm still wondering how useful this message is as this patch is intended to be refactoring let's keep this message.
Again I'd like to see the breadcrumbs with just this patch applied to reproduced what you're chasing in patches 1-4...
Sorry, not possible. The situation I've described in patch 1 is based on analysis. However after applying the patches 1-3 the bug is not occured in our test system anymoree.
Nikolay

On 27.10.2016 18:51, John Ferlan wrote:
On 10/27/2016 07:34 AM, Nikolay Shirokovskiy wrote:
On 26.10.2016 22:57, John Ferlan wrote:
There's no commit message...
This is simple refactoring.
Not really...
A simple refactor would have kept the -2 logic.
Ohh, my bad English. I meant "just" (mere) refactoring, not "simple". However, as a refactoring step this patch is not complex from my POV.
On 28.10.2016 14:42, John Ferlan wrote:
On 10/28/2016 03:37 AM, Nikolay Shirokovskiy wrote:
On 27.10.2016 18:51, John Ferlan wrote:
On 10/27/2016 07:34 AM, Nikolay Shirokovskiy wrote:
On 26.10.2016 22:57, John Ferlan wrote:
There's no commit message...
This is simple refactoring.
Not really...
A simple refactor would have kept the -2 logic.
Ohh, my bad english. I mean just(mere) refactoring. Not simple. However as a refactoring step this patch is not complex from my POV.
Commit messages don't have to be a long story, but they shouldn't be empty. If you're relying on purely the 60 character commit message to convey what's been done, then the changes would have to be similarly short.
Part of your refactor altered logic and that's something that needs to be mentioned.
What's been done here is to let qemuConnectAgent make the decision on "hard" or "soft" error and only use < 0 as failure
You're altering qemuConnectAgent to return -1 on the only error and perform the VIR_WARN plus agentError = true on other "soft" failures.
Not exactly nothing going on!
This is what already present in code, I've just moved soft failures in qemuConnectAgent.
On 10/04/2016 02:56 AM, Nikolay Shirokovskiy wrote:
--- src/qemu/qemu_driver.c | 8 +------ src/qemu/qemu_migration.c | 13 ++--------- src/qemu/qemu_process.c | 58 ++++++++++++----------------------------------- 3 files changed, 17 insertions(+), 62 deletions(-)
This idea seems to have merit too - something that I've thought now that I've read the first 3 patches...
I think you should have started with this patch, it probably would have made the others easier to think about. In fact, I'm curious if with just this change and the suggestion in patch 3 for clearing agentError whether you can reproduced the bug you're trying to fix/resolve without any of the other patches.
diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index 12ddbc0..edff973 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -4397,7 +4397,6 @@ processSerialChangedEvent(virQEMUDriverPtr driver, virObjectEventPtr event = NULL; virDomainDeviceDef dev; qemuDomainObjPrivatePtr priv = vm->privateData; - int rc;
if (connected) newstate = VIR_DOMAIN_CHR_DEVICE_STATE_CONNECTED; @@ -4450,13 +4449,8 @@ processSerialChangedEvent(virQEMUDriverPtr driver,
if (STREQ_NULLABLE(dev.data.chr->target.name, "org.qemu.guest_agent.0")) { if (newstate == VIR_DOMAIN_CHR_DEVICE_STATE_CONNECTED) { - if (!priv->agent) { - if ((rc = qemuConnectAgent(driver, vm)) == -2) + if (!priv->agent && qemuConnectAgent(driver, vm) < 0)
FWIW: qemuConnectAgent returns 0 "if (priv->agent)", so there's one less check here too. Could just be "if (qemuConnectAgent(driver, vm) < 0)"
It depends on patch 3. If we clear the error early in qemuConnectAgent then we'd better check before calling this function.
Right my mind was already there or more succinctly said - I was trying to consider this as the first patch as I think it'll make the reasoning a bit more clear.
Agree.
Since the "issue" has been described now as a pure shutdown and restart after error problem without needing to consider the migration, reconnect, and attach separately - it's easier to focus on.
So my feeling now after reading and thinking a bit about everything so far is that this patch could go in (mostly) as is. That way there are then 2 places to set agentError=true and two places to set agentError=false (excluding the original allocation and reconnect paths).
I don't think we can "combine" the Error and EOF paths. They're separate. What happens if some day someone uses the VSERPORT logic to
Right now there is no value in the separation AFAIU, just a lot of code that is difficult to follow.
EOF and Error are different. That's why there's two different callbacks. The agent implementation has 'chosen' to just make Error essentially fatal for all future attempts while the guest is running. I seem to recall recent discussion regarding why there's a separate Error handler and the thought that some sort of error recover could be done, but I cannot find it in the last 3 months of archives on a simple title search. It could have been a conversation at KVM Forum too.
Could someone come along and alter the current reality - perhaps. Your "choice" is remove a bunch of code that someone would then have to add back later. I'm erring on the side of caution - having separate Error and EOF does work - what doesn't work right is the 'agentError' logic. But we can fix that.
The point is that we can drop the error flag altogether if we make the current state explicit. And the current state is that we don't need to differentiate EOF and Error.
As for difficult to follow - that's perhaps my other hot button issue - it's not self documenting and that's frustrating. Some will tout usage of git log history to understand how something's supposed to work. I land more on the side of - if you learn something about code you're changing, then leave entrails in the code to hopefully help the next person reading the code and not the git log's. It's a delicate balance for my penchant to be too verbose vs. providing sufficient information. I'll sometimes leave a comment somewhere that means nothing to someone else, but causes me to remember why I did something (like the pensieve in the Harry Potter movies ;-)).
It is true. Some cases are not obvious from code only. But in this case we have just pieces of code that add nothing but complexity AFAIU. And if the strategy is keeping the code even if it has no value then complexity will grow further and further... If there is value - then let's make it explicit, or if nothing can be done in this attempt at least document the code. But after looking for a while at this code I realized - it can be simpler.
somehow restart an agent after an error, after a close, and before a shutdown. IOW: Some kind of error recovery logic.
Don't see how having separate EOF and Error will help in this case.
Check the callers of qemuDomainAgentAvailable and imagine a day where we can take an error and do something smarter... If there's multiple "things" an Agent can do and we disable them all just because of an error that's kind of frustrating. So maybe someone comes along and classifies the calls into groups (state, vcpu, FS, time, interface, security). Then let's say we have some sort of error from a vcpu call - does that necessarily mean we should disable getting/setting time? Do we know enough about the error. The vmagent is not something I focus on, so I don't have the answer to whether we could make something recoverable. I also know there's a Linux agent and Windows agent - so
An error in the io loop makes the agent non-functional, no matter what the command is. It's just a case of a broken channel; communication is impossible until a new attempt to start communication. So I still don't understand why we need eof/error.
In any case, I don't think we should disable error processing just because it's difficult to follow...
No no, the point was no value *and* difficult to follow. Difficult to follow alone cannot be the argument )) Ok. After giving it more thought I think I'm not right on this. We *do* want to know whether it is EOF or Error. For example on domain shutdown we can receive EOF on the channel, and I'd want to report it as 'not connected' rather than an io error. I'll make a new series.
As an aside - another "interesting" part of the VSERPORT processing... qemuAgentConnect will avoid running opening the channel if VSERPORT capability is set *and* config->state != CONNECTED. However, the VSERPORT processing code in processSerialChangedEvent will only work if the target.name is "org.qemu.guest_agent.0"... Sure that's our default, but nonetheless interesting about two different checks.
In order to handle the issue where agentError is set, but we cannot start again (e.g. the have error, do shutdown, do start path) - add an unconditional priv->agentError = false before in qemuConnectAgent before the virSecurityManagerSetDaemonSocketLabel call. The other agentError = false settings can then be removed.
Cannot agree. For modern qemu the agent will be in error state on startup until the serial port appears. It should be in the 'not connected' state.
It's semantics... the error someone will get until VSERPORT runs is "agent unavailable due to error"... Once the serial port magic happens, it could then be running, error, or not connected. But this would only be for the case where guest running, vmagent error, guest shutdown, and guest start (until the serial port is connected).
And this will be quite misleading in the logs. An agent error is a major error condition: AFAIK something is wrong with the channel to the qemu agent, and qemu or the guest driver needs to be fixed...
FWIW: One way around that is in the modern (processSerialChangedEvent) code rather than clearing agentError only if priv->agent was set, you could also unconditionally set it after/before. IOW: DISCONNECTED was received. A possible downside to that option is it's not
I think it's fragile. The serial disconnected event and getting an error condition on the channel are in a race.
architecturally clear to me what conditions other than grep-ing thru the JSON return data in qemuMonitorJSONExtractChardevInfo what could cause DISCONNECTED. Does it only occur at shutdown? Or could it occur when say some network/socket error causes a disconnect. In which case, a running guest/agent could have an error, get a disconnect, but clear agentError 'unexpectedly'... Does the agent logic at some point in time send a second CONNECTED if for some reason a network is restored. Again, I'm not that familiar with the inner workings there, so I err on the side of caution and not unconditionally clear.
John
So we're left with 2 places to set error and one place to clear. I'll continue to consider this, but figured I'd get out a response to all this sooner than later.
I'm less convinced mucking with combining EOF and Error is a good idea in patches 7-10, but I haven't thought too much about it either.
John
goto endjob;
The indentation for this is off (remove leading 4 spaces)
- - if (rc < 0) - priv->agentError = true; - } } else { if (priv->agent) { qemuAgentClose(priv->agent); diff --git a/src/qemu/qemu_migration.c b/src/qemu/qemu_migration.c index e2ca330..0a02236 100644 --- a/src/qemu/qemu_migration.c +++ b/src/qemu/qemu_migration.c @@ -6171,7 +6171,6 @@ qemuMigrationFinish(virQEMUDriverPtr driver, unsigned short port; unsigned long long timeReceived = 0; virObjectEventPtr event; - int rc; qemuDomainJobInfoPtr jobInfo = NULL; bool inPostCopy = false; bool doKill = true; @@ -6244,16 +6243,8 @@ qemuMigrationFinish(virQEMUDriverPtr driver, QEMU_ASYNC_JOB_MIGRATION_IN) < 0) goto endjob;
- if ((rc = qemuConnectAgent(driver, vm)) < 0) { - if (rc == -2) - goto endjob; - - VIR_WARN("Cannot connect to QEMU guest agent for %s", - vm->def->name); - virResetLastError(); - priv->agentError = true; - } - + if (qemuConnectAgent(driver, vm) < 0) + goto endjob;
if (flags & VIR_MIGRATE_PERSIST_DEST) { if (qemuMigrationPersist(driver, vm, mig, !v3proto) < 0) { diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 42f7f84..d7c9ce3 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -206,7 +206,6 @@ int qemuConnectAgent(virQEMUDriverPtr driver, virDomainObjPtr vm) { qemuDomainObjPrivatePtr priv = vm->privateData; - int ret = -1; qemuAgentPtr agent = NULL; virDomainChrDefPtr config = qemuFindAgentConfig(vm->def);
@@ -252,8 +251,7 @@ qemuConnectAgent(virQEMUDriverPtr driver, virDomainObjPtr vm) qemuAgentClose(agent); virReportError(VIR_ERR_INTERNAL_ERROR, "%s", _("guest crashed while connecting to the guest agent")); - ret = -2; - goto cleanup; + return -1; }
if (virSecurityManagerClearSocketLabel(driver->securityManager, @@ -264,18 +262,16 @@ qemuConnectAgent(virQEMUDriverPtr driver, virDomainObjPtr vm) goto cleanup; }
- priv->agent = agent;
- if (priv->agent == NULL) { - VIR_INFO("Failed to connect agent for %s", vm->def->name);
Interesting a "marker" of sorts to know the virAgentOpen "failed" is that we'd get an Informational "Failed to connect..." followed shortly thereafter by a Warning "Cannot connect..." (depending of course on one's message display severity level).
I think if you "restore" this without the goto cleanup (below) and of course the removal of the {}, then at least message parity is achieved... I suppose it could move up into the "if (agent == NULL)" too, but then it could be given even though an Error is reported.
It's not that important, but it does leave some breadcrumbs.
Agree. Even though I'm still wondering how useful this message is, as this patch is intended to be refactoring, let's keep this message.
Again I'd like to see the breadcrumbs with just this patch applied to reproduced what you're chasing in patches 1-4...
Sorry, not possible. The situation I've described in patch 1 is based on analysis. However, after applying patches 1-3 the bug has not occurred in our test system anymore.
Nikolay

agentError is used for 2 different cases: 1. agent monitor is failed to start 2. io error in agent monitor Actually to check for the first case we don't need the flag at all. NULL agent is always error for old qemu (QEMU_CAPS_VSERPORT_CHANGE is not supported), while for modern qemu it is an error only if channel is in connected state. --- src/qemu/qemu_domain.c | 37 +++++++++++++++++++++++-------------- src/qemu/qemu_process.c | 1 - 2 files changed, 23 insertions(+), 15 deletions(-) diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index 9b1a32e..cd788c8 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -5046,6 +5046,7 @@ qemuDomainAgentAvailable(virDomainObjPtr vm, bool reportError) { qemuDomainObjPrivatePtr priv = vm->privateData; + virDomainChrDefPtr config; if (virDomainObjGetState(vm, NULL) != VIR_DOMAIN_RUNNING) { if (reportError) { @@ -5062,22 +5063,30 @@ qemuDomainAgentAvailable(virDomainObjPtr vm, } return false; } - if (!priv->agent) { - if (qemuFindAgentConfig(vm->def)) { - if (reportError) { - virReportError(VIR_ERR_AGENT_UNRESPONSIVE, "%s", - _("QEMU guest agent is not connected")); - } - return false; - } else { - if (reportError) { - virReportError(VIR_ERR_ARGUMENT_UNSUPPORTED, "%s", - _("QEMU guest agent is not configured")); - } - return false; + + if (priv->agent) + return true; + + config = qemuFindAgentConfig(vm->def); + + if (!config) { + if (reportError) { + virReportError(VIR_ERR_ARGUMENT_UNSUPPORTED, "%s", + _("QEMU guest agent is not configured")); } + } else if (virQEMUCapsGet(priv->qemuCaps, QEMU_CAPS_VSERPORT_CHANGE) && + config->state != VIR_DOMAIN_CHR_DEVICE_STATE_CONNECTED) { + if (reportError) { + virReportError(VIR_ERR_AGENT_UNRESPONSIVE, "%s", + _("QEMU guest agent is not connected")); + } + } else if (reportError) { + virReportError(VIR_ERR_AGENT_UNRESPONSIVE, "%s", + _("QEMU guest agent is not " + "available due to an error")); } - return true; + + return false; } diff --git a/src/qemu/qemu_process.c 
b/src/qemu/qemu_process.c index d7c9ce3..78d530f 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -267,7 +267,6 @@ qemuConnectAgent(virQEMUDriverPtr driver, virDomainObjPtr vm) cleanup: if (!priv->agent) { VIR_WARN("Cannot connect to QEMU guest agent for %s", vm->def->name); - priv->agentError = true; virResetLastError(); } -- 1.8.3.1

On 10/04/2016 02:56 AM, Nikolay Shirokovskiy wrote:
agentError is used for 2 different cases:
1. agent monitor is failed to start
A non-guest-fatal failure in qemuConnectAgent when first trying to connect to the agent
2. io error in agent monitor
I/O error with a running agent, resulting in qemuProcessHandleAgentError being called. Most of the above would seemingly be a reason for the removal of agentError (which needs to be a separate patch)...
Actually, to check for the first case we don't need the flag at all. A NULL agent is always an error for old qemu (QEMU_CAPS_VSERPORT_CHANGE is not supported), while for modern qemu it is an error only if the channel is in the connected state.
I suppose this explanation could be added too, but I'd have to see it in the context of what I've learned so far in a "new" series. In any case, none of the above text seems to have anything to do with the change in reporting the errors at startup.
--- src/qemu/qemu_domain.c | 37 +++++++++++++++++++++++-------------- src/qemu/qemu_process.c | 1 - 2 files changed, 23 insertions(+), 15 deletions(-)
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index 9b1a32e..cd788c8 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -5046,6 +5046,7 @@ qemuDomainAgentAvailable(virDomainObjPtr vm, bool reportError) { qemuDomainObjPrivatePtr priv = vm->privateData; + virDomainChrDefPtr config;
if (virDomainObjGetState(vm, NULL) != VIR_DOMAIN_RUNNING) { if (reportError) { @@ -5062,22 +5063,30 @@ qemuDomainAgentAvailable(virDomainObjPtr vm, } return false; } - if (!priv->agent) { - if (qemuFindAgentConfig(vm->def)) { - if (reportError) { - virReportError(VIR_ERR_AGENT_UNRESPONSIVE, "%s", - _("QEMU guest agent is not connected")); - } - return false; - } else { - if (reportError) { - virReportError(VIR_ERR_ARGUMENT_UNSUPPORTED, "%s", - _("QEMU guest agent is not configured")); - } - return false; + + if (priv->agent) + return true; + + config = qemuFindAgentConfig(vm->def); + + if (!config) { + if (reportError) { + virReportError(VIR_ERR_ARGUMENT_UNSUPPORTED, "%s", + _("QEMU guest agent is not configured")); }
I'm OK through here, but the next two I'm not so sure I agree with.
+ } else if (virQEMUCapsGet(priv->qemuCaps, QEMU_CAPS_VSERPORT_CHANGE) && + config->state != VIR_DOMAIN_CHR_DEVICE_STATE_CONNECTED) {
What does the manner of startup (either legacy or VSERPORT) have to do with which error should be reported?
+ if (reportError) { + virReportError(VIR_ERR_AGENT_UNRESPONSIVE, "%s", + _("QEMU guest agent is not connected")); + } + } else if (reportError) { + virReportError(VIR_ERR_AGENT_UNRESPONSIVE, "%s", + _("QEMU guest agent is not " + "available due to an error"));
I know you're setting up for the next set of patches (to remove the agentError flag), but I'm not sure yet whether using VSERPORT change configuration is the right way. Unless of course my assumption from patch 1 review is true... This is really a startup/shutdown race... John
} - return true; + + return false; }
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index d7c9ce3..78d530f 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -267,7 +267,6 @@ qemuConnectAgent(virQEMUDriverPtr driver, virDomainObjPtr vm) cleanup: if (!priv->agent) { VIR_WARN("Cannot connect to QEMU guest agent for %s", vm->def->name); - priv->agentError = true;
Wait, what, why? You just added this in the previous patch and I see no reason for removing it now. This would need a separate patch and reason. John
virResetLastError(); }

On 26.10.2016 23:04, John Ferlan wrote:
On 10/04/2016 02:56 AM, Nikolay Shirokovskiy wrote:
agentError is used for two different cases:
1. the agent monitor failed to start
A non-guest-fatal failure in qemuConnectAgent when first trying to connect to the agent.
2. an I/O error in the agent monitor
An I/O error with a running agent, resulting in qemuProcessHandleAgentError being called.
Most of the above would seemingly be a reason for the removal of the agentError (which needs to be a separate patch)...
Removal is done in two steps. This is step 1: drop the flag usage for case 1.
Actually, to check for the first case we don't need the flag at all. A NULL agent is always an error for old qemu (QEMU_CAPS_VSERPORT_CHANGE is not supported), while for modern qemu it is an error only if the channel is in the connected state.
I suppose this explanation could be added too, but I'd have to see it in the context of what I've learned so far in a "new" series.
In any case, none of the above text seems to have anything to do with the change in reporting the errors at startup.
The 'failed agent start case' is case 1; we drop the flag usage for this case. How can the text have nothing to do with the change?
--- src/qemu/qemu_domain.c | 37 +++++++++++++++++++++++-------------- src/qemu/qemu_process.c | 1 - 2 files changed, 23 insertions(+), 15 deletions(-)
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index 9b1a32e..cd788c8 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -5046,6 +5046,7 @@ qemuDomainAgentAvailable(virDomainObjPtr vm, bool reportError) { qemuDomainObjPrivatePtr priv = vm->privateData; + virDomainChrDefPtr config;
if (virDomainObjGetState(vm, NULL) != VIR_DOMAIN_RUNNING) { if (reportError) { @@ -5062,22 +5063,30 @@ qemuDomainAgentAvailable(virDomainObjPtr vm, } return false; } - if (!priv->agent) { - if (qemuFindAgentConfig(vm->def)) { - if (reportError) { - virReportError(VIR_ERR_AGENT_UNRESPONSIVE, "%s", - _("QEMU guest agent is not connected")); - } - return false; - } else { - if (reportError) { - virReportError(VIR_ERR_ARGUMENT_UNSUPPORTED, "%s", - _("QEMU guest agent is not configured")); - } - return false; + + if (priv->agent) + return true; + + config = qemuFindAgentConfig(vm->def); + + if (!config) { + if (reportError) { + virReportError(VIR_ERR_ARGUMENT_UNSUPPORTED, "%s", + _("QEMU guest agent is not configured")); }
I'm OK through here, but the next two I'm not so sure I agree with.
+ } else if (virQEMUCapsGet(priv->qemuCaps, QEMU_CAPS_VSERPORT_CHANGE) && + config->state != VIR_DOMAIN_CHR_DEVICE_STATE_CONNECTED) {
What does the manner of startup (either legacy or VSERPORT) have to do with which error should be reported?
Ok. Let me explain it further.

Case 1. Legacy:
1. domain start (migration finish, etc.)
2. qemuConnectAgent is called and fails
=> we need to report 'agent error'

Case 2. VSERPORT:
1. domain start
2. qemuConnectAgent exits early
=> in between we need to report 'not connected'
3. VSERPORT_CHANGE arrives
4. qemuConnectAgent is called and fails
=> now we need to report 'agent error'
+ if (reportError) { + virReportError(VIR_ERR_AGENT_UNRESPONSIVE, "%s", + _("QEMU guest agent is not connected")); + } + } else if (reportError) { + virReportError(VIR_ERR_AGENT_UNRESPONSIVE, "%s", + _("QEMU guest agent is not " + "available due to an error"));
I know you're setting up for the next set of patches (to remove the agentError flag), but I'm not sure yet whether using VSERPORT change configuration is the right way.
Unless of course my assumption from patch 1 review is true... This is really a startup/shutdown race...
John
} - return true; + + return false; }
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index d7c9ce3..78d530f 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -267,7 +267,6 @@ qemuConnectAgent(virQEMUDriverPtr driver, virDomainObjPtr vm) cleanup: if (!priv->agent) { VIR_WARN("Cannot connect to QEMU guest agent for %s", vm->def->name); - priv->agentError = true;
Wait, what, why? you just added this in the previous patch and I see no reason for removing it now.
After the above changes to qemuDomainAgentAvailable, this flag's role of reporting agent startup errors (case 1 from the commit message) has disappeared, so we don't need to set the flag here anymore. We didn't add this flag in the previous patch; we just moved it into a single place.
This would need a separate patch and reason.
I think this is a matter of one patch: we drop use case 1 from the flag. Nikolay

Let's take a closer look at qemuAgentIO. In the case of an error we stop listening to any events besides error and EOF, then set the last error so that all subsequent loop invocations do very little:

1. if it is an error, just call the error callback once more (the current callback is a no-op for subsequent invocations);
2. if it is an EOF, call the EOF callback, which will eventually close the monitor.

So why wait for EOF? Let's just close the monitor on error. Now we can drop the error flag and deal with the NULL monitor case only (qemuDomainAgentAvailable stays correct).
---
 src/qemu/qemu_domain.c  |  8 --------
 src/qemu/qemu_domain.h  |  1 -
 src/qemu/qemu_driver.c  |  1 -
 src/qemu/qemu_process.c | 19 ++-----------------
 4 files changed, 2 insertions(+), 27 deletions(-)

diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index cd788c8..b6756b1 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -5055,14 +5055,6 @@ qemuDomainAgentAvailable(virDomainObjPtr vm,
         }
         return false;
     }
-    if (priv->agentError) {
-        if (reportError) {
-            virReportError(VIR_ERR_AGENT_UNRESPONSIVE, "%s",
-                           _("QEMU guest agent is not "
-                             "available due to an error"));
-        }
-        return false;
-    }
 
     if (priv->agent)
         return true;
diff --git a/src/qemu/qemu_domain.h b/src/qemu/qemu_domain.h
index 521531b..257bfcb 100644
--- a/src/qemu/qemu_domain.h
+++ b/src/qemu/qemu_domain.h
@@ -177,7 +177,6 @@ struct _qemuDomainObjPrivate {
     unsigned long long monStart;
 
     qemuAgentPtr agent;
-    bool agentError;
     unsigned long long agentStart;
 
     bool gotShutdown;
diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
index edff973..b6fb1df 100644
--- a/src/qemu/qemu_driver.c
+++ b/src/qemu/qemu_driver.c
@@ -4455,7 +4455,6 @@ processSerialChangedEvent(virQEMUDriverPtr driver,
         if (priv->agent) {
             qemuAgentClose(priv->agent);
             priv->agent = NULL;
-            priv->agentError = false;
         }
     }
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
index 78d530f..3da1bd5 100644
--- a/src/qemu/qemu_process.c
+++ b/src/qemu/qemu_process.c
@@ -151,7 +151,6 @@ qemuProcessHandleAgentEOF(qemuAgentPtr agent,
 
     qemuAgentClose(agent);
     priv->agent = NULL;
-    priv->agentError = false;
 
     virObjectUnlock(vm);
     return;
@@ -169,21 +168,10 @@ qemuProcessHandleAgentEOF(qemuAgentPtr agent,
  * allowed
  */
 static void
-qemuProcessHandleAgentError(qemuAgentPtr agent ATTRIBUTE_UNUSED,
+qemuProcessHandleAgentError(qemuAgentPtr agent,
                             virDomainObjPtr vm)
 {
-    qemuDomainObjPrivatePtr priv;
-
-    VIR_DEBUG("Received error from agent on %p '%s'", vm, vm->def->name);
-
-    virObjectLock(vm);
-
-    priv = vm->privateData;
-
-    if (priv->agent)
-        priv->agentError = true;
-
-    virObjectUnlock(vm);
+    qemuProcessHandleAgentEOF(agent, vm);
 }
 
 static void qemuProcessHandleAgentDestroy(qemuAgentPtr agent,
@@ -209,8 +197,6 @@ qemuConnectAgent(virQEMUDriverPtr driver, virDomainObjPtr vm)
     qemuAgentPtr agent = NULL;
     virDomainChrDefPtr config = qemuFindAgentConfig(vm->def);
 
-    priv->agentError = false;
-
     if (!config)
         return 0;
 
@@ -5915,7 +5901,6 @@ void qemuProcessStop(virQEMUDriverPtr driver,
     if (priv->agent) {
         qemuAgentClose(priv->agent);
         priv->agent = NULL;
-        priv->agentError = false;
     }
 
     if (priv->mon) {
-- 
1.8.3.1
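The net effect of this patch can be modeled with a tiny standalone sketch (all names below are invented; this is not libvirt code): once the error callback delegates to the EOF handler, agent usability collapses to a NULL-pointer check and no separate error flag is needed.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model of the patch: on an I/O error the agent is torn down
 * exactly as on EOF, so "is the agent usable?" is answered by the
 * agent pointer alone. */
struct agentPriv {
    void *agent;                  /* NULL once the monitor is closed */
};

static void
handleAgentEOF(struct agentPriv *priv)
{
    priv->agent = NULL;           /* stands in for qemuAgentClose() */
}

static void
handleAgentError(struct agentPriv *priv)
{
    handleAgentEOF(priv);         /* error now takes the same path */
}

static bool
agentUsable(const struct agentPriv *priv)
{
    return priv->agent != NULL;   /* no agentError flag to consult */
}
```

The design point is that one teardown path means one source of truth; a stale boolean that must be reset in several places (the bug patch 1 fixed) simply cannot exist.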

On 10/04/2016 02:56 AM, Nikolay Shirokovskiy wrote:
Let's take a closer look at qemuAgentIO. In the case of error we stop listening to any events besides error and eof. Then set last error so that all next loop invocations do very little:
1. if it is an error then just call error callback once more. Current callback is noop for subsequent invocations.
2. if it is an eof then call eof callback, it will close monitor eventually.
So why waiting for eof? Let's just close monitor on error. Now we can drop error flag and deal with NULL monitor case only (qemuDomainAgentAvailable stays correct). --- src/qemu/qemu_domain.c | 8 -------- src/qemu/qemu_domain.h | 1 - src/qemu/qemu_driver.c | 1 - src/qemu/qemu_process.c | 19 ++----------------- 4 files changed, 2 insertions(+), 27 deletions(-)
While we're not "reading" anything, why continue to send and wait for more stuff if, before sending (e.g. in qemuDomainAgentAvailable callers), we could detect that there was a failure and thus shouldn't even attempt the send? Wouldn't the EOF tell us that we're all done processing whatever was sent our way before we got the first error?

Not so sure about this one. I'd have to think more about things in light of what's being chased.

John
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index cd788c8..b6756b1 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -5055,14 +5055,6 @@ qemuDomainAgentAvailable(virDomainObjPtr vm, } return false; } - if (priv->agentError) { - if (reportError) { - virReportError(VIR_ERR_AGENT_UNRESPONSIVE, "%s", - _("QEMU guest agent is not " - "available due to an error")); - } - return false; - }
if (priv->agent) return true; diff --git a/src/qemu/qemu_domain.h b/src/qemu/qemu_domain.h index 521531b..257bfcb 100644 --- a/src/qemu/qemu_domain.h +++ b/src/qemu/qemu_domain.h @@ -177,7 +177,6 @@ struct _qemuDomainObjPrivate { unsigned long long monStart;
qemuAgentPtr agent; - bool agentError; unsigned long long agentStart;
bool gotShutdown; diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index edff973..b6fb1df 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -4455,7 +4455,6 @@ processSerialChangedEvent(virQEMUDriverPtr driver, if (priv->agent) { qemuAgentClose(priv->agent); priv->agent = NULL; - priv->agentError = false; } }
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 78d530f..3da1bd5 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -151,7 +151,6 @@ qemuProcessHandleAgentEOF(qemuAgentPtr agent,
qemuAgentClose(agent); priv->agent = NULL; - priv->agentError = false;
virObjectUnlock(vm); return; @@ -169,21 +168,10 @@ qemuProcessHandleAgentEOF(qemuAgentPtr agent, * allowed */ static void -qemuProcessHandleAgentError(qemuAgentPtr agent ATTRIBUTE_UNUSED, +qemuProcessHandleAgentError(qemuAgentPtr agent, virDomainObjPtr vm) { - qemuDomainObjPrivatePtr priv; - - VIR_DEBUG("Received error from agent on %p '%s'", vm, vm->def->name); - - virObjectLock(vm); - - priv = vm->privateData; - - if (priv->agent) - priv->agentError = true; - - virObjectUnlock(vm); + qemuProcessHandleAgentEOF(agent, vm); }
static void qemuProcessHandleAgentDestroy(qemuAgentPtr agent, @@ -209,8 +197,6 @@ qemuConnectAgent(virQEMUDriverPtr driver, virDomainObjPtr vm) qemuAgentPtr agent = NULL; virDomainChrDefPtr config = qemuFindAgentConfig(vm->def);
- priv->agentError = false; - if (!config) return 0;
@@ -5915,7 +5901,6 @@ void qemuProcessStop(virQEMUDriverPtr driver, if (priv->agent) { qemuAgentClose(priv->agent); priv->agent = NULL; - priv->agentError = false; }
if (priv->mon) {

On 26.10.2016 23:18, John Ferlan wrote:
On 10/04/2016 02:56 AM, Nikolay Shirokovskiy wrote:
Let's take a closer look at qemuAgentIO. In the case of error we stop listening to any events besides error and eof. Then set last error so that all next loop invocations do very little:
1. if it is an error then just call error callback once more. Current callback is noop for subsequent invocations.
2. if it is an eof then call eof callback, it will close monitor eventually.
So why waiting for eof? Let's just close monitor on error. Now we can drop error flag and deal with NULL monitor case only (qemuDomainAgentAvailable stays correct). --- src/qemu/qemu_domain.c | 8 -------- src/qemu/qemu_domain.h | 1 - src/qemu/qemu_driver.c | 1 - src/qemu/qemu_process.c | 19 ++----------------- 4 files changed, 2 insertions(+), 27 deletions(-)
While we're not "reading" anything, why continue to send and wait for more stuff if, before sending (e.g. in qemuDomainAgentAvailable callers), we could detect that there was a failure and thus shouldn't even attempt the send?
I did not suggest that. I don't want to drop the error condition; I just want to drop the error flag that checks for this condition. I mean, if we close the monitor on error just as on EOF, then the priv->agent pointer check will be enough. The commit message explains that there is no sense in keeping the monitor after an error anyway.
Wouldn't the EOF tell us that we're all done processing whatever was sent our way before we got the first error?
We just don't use EOF this way; that would be some kind of half-close from the agent.
Not so sure about this one. I'd have to think more about things in light of what's being chased.
I want to drop the error flag now that we have discovered the kind of problems it can cause.
John
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index cd788c8..b6756b1 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -5055,14 +5055,6 @@ qemuDomainAgentAvailable(virDomainObjPtr vm, } return false; } - if (priv->agentError) { - if (reportError) { - virReportError(VIR_ERR_AGENT_UNRESPONSIVE, "%s", - _("QEMU guest agent is not " - "available due to an error")); - } - return false; - }
if (priv->agent) return true; diff --git a/src/qemu/qemu_domain.h b/src/qemu/qemu_domain.h index 521531b..257bfcb 100644 --- a/src/qemu/qemu_domain.h +++ b/src/qemu/qemu_domain.h @@ -177,7 +177,6 @@ struct _qemuDomainObjPrivate { unsigned long long monStart;
qemuAgentPtr agent; - bool agentError; unsigned long long agentStart;
bool gotShutdown; diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index edff973..b6fb1df 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -4455,7 +4455,6 @@ processSerialChangedEvent(virQEMUDriverPtr driver, if (priv->agent) { qemuAgentClose(priv->agent); priv->agent = NULL; - priv->agentError = false; } }
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 78d530f..3da1bd5 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -151,7 +151,6 @@ qemuProcessHandleAgentEOF(qemuAgentPtr agent,
qemuAgentClose(agent); priv->agent = NULL; - priv->agentError = false;
virObjectUnlock(vm); return; @@ -169,21 +168,10 @@ qemuProcessHandleAgentEOF(qemuAgentPtr agent, * allowed */ static void -qemuProcessHandleAgentError(qemuAgentPtr agent ATTRIBUTE_UNUSED, +qemuProcessHandleAgentError(qemuAgentPtr agent, virDomainObjPtr vm) { - qemuDomainObjPrivatePtr priv; - - VIR_DEBUG("Received error from agent on %p '%s'", vm, vm->def->name); - - virObjectLock(vm); - - priv = vm->privateData; - - if (priv->agent) - priv->agentError = true; - - virObjectUnlock(vm); + qemuProcessHandleAgentEOF(agent, vm); }
static void qemuProcessHandleAgentDestroy(qemuAgentPtr agent, @@ -209,8 +197,6 @@ qemuConnectAgent(virQEMUDriverPtr driver, virDomainObjPtr vm) qemuAgentPtr agent = NULL; virDomainChrDefPtr config = qemuFindAgentConfig(vm->def);
- priv->agentError = false; - if (!config) return 0;
@@ -5915,7 +5901,6 @@ void qemuProcessStop(virQEMUDriverPtr driver, if (priv->agent) { qemuAgentClose(priv->agent); priv->agent = NULL; - priv->agentError = false; }
if (priv->mon) {

Now we don't need to differentiate error and eof cases in the loop function. So let's simplify it radically using goto instead of flags. --- src/qemu/qemu_agent.c | 185 ++++++++++++++++++------------------------- src/qemu/qemu_agent.h | 2 - src/qemu/qemu_process.c | 22 +---- tests/qemumonitortestutils.c | 1 - 4 files changed, 83 insertions(+), 127 deletions(-) diff --git a/src/qemu/qemu_agent.c b/src/qemu/qemu_agent.c index ec8d47e..43d78c9 100644 --- a/src/qemu/qemu_agent.c +++ b/src/qemu/qemu_agent.c @@ -595,8 +595,8 @@ static void qemuAgentIO(int watch, int fd, int events, void *opaque) { qemuAgentPtr mon = opaque; - bool error = false; - bool eof = false; + void (*errorNotify)(qemuAgentPtr, virDomainObjPtr); + virDomainObjPtr vm; virObjectRef(mon); /* lock access to the monitor and protect fd */ @@ -606,122 +606,95 @@ qemuAgentIO(int watch, int fd, int events, void *opaque) #endif if (mon->fd != fd || mon->watch != watch) { - if (events & (VIR_EVENT_HANDLE_HANGUP | VIR_EVENT_HANDLE_ERROR)) - eof = true; virReportError(VIR_ERR_INTERNAL_ERROR, _("event from unexpected fd %d!=%d / watch %d!=%d"), mon->fd, fd, mon->watch, watch); - error = true; - } else if (mon->lastError.code != VIR_ERR_OK) { - if (events & (VIR_EVENT_HANDLE_HANGUP | VIR_EVENT_HANDLE_ERROR)) - eof = true; - error = true; - } else { - if (events & VIR_EVENT_HANDLE_WRITABLE) { - if (mon->connectPending) { - if (qemuAgentIOConnect(mon) < 0) - error = true; - } else { - if (qemuAgentIOWrite(mon) < 0) - error = true; - } - events &= ~VIR_EVENT_HANDLE_WRITABLE; - } - - if (!error && - events & VIR_EVENT_HANDLE_READABLE) { - int got = qemuAgentIORead(mon); - events &= ~VIR_EVENT_HANDLE_READABLE; - if (got < 0) { - error = true; - } else if (got == 0) { - eof = true; - } else { - /* Ignore hangup/error events if we read some data, to - * give time for that data to be consumed */ - events = 0; - - if (qemuAgentIOProcess(mon) < 0) - error = true; - } - } - - if (!error && - events & 
VIR_EVENT_HANDLE_HANGUP) { - virReportError(VIR_ERR_INTERNAL_ERROR, "%s", - _("End of file from agent monitor")); - eof = true; - events &= ~VIR_EVENT_HANDLE_HANGUP; - } - - if (!error && !eof && - events & VIR_EVENT_HANDLE_ERROR) { - virReportError(VIR_ERR_INTERNAL_ERROR, "%s", - _("Invalid file descriptor while waiting for monitor")); - eof = true; - events &= ~VIR_EVENT_HANDLE_ERROR; - } - if (!error && events) { - virReportError(VIR_ERR_INTERNAL_ERROR, - _("Unhandled event %d for monitor fd %d"), - events, mon->fd); - error = true; - } + goto error; } - if (error || eof) { - if (mon->lastError.code != VIR_ERR_OK) { - /* Already have an error, so clear any new error */ - virResetLastError(); + if (mon->lastError.code != VIR_ERR_OK) + goto error; + + if (events & VIR_EVENT_HANDLE_WRITABLE) { + if (mon->connectPending) { + if (qemuAgentIOConnect(mon) < 0) + goto error; } else { - virErrorPtr err = virGetLastError(); - if (!err) - virReportError(VIR_ERR_INTERNAL_ERROR, "%s", - _("Error while processing monitor IO")); - virCopyLastError(&mon->lastError); - virResetLastError(); + if (qemuAgentIOWrite(mon) < 0) + goto error; } + events &= ~VIR_EVENT_HANDLE_WRITABLE; + } - VIR_DEBUG("Error on monitor %s", NULLSTR(mon->lastError.message)); - /* If IO process resulted in an error & we have a message, - * then wakeup that waiter */ - if (mon->msg && !mon->msg->finished) { - mon->msg->finished = 1; - virCondSignal(&mon->notify); - } + if (events & VIR_EVENT_HANDLE_READABLE) { + int got = qemuAgentIORead(mon); + + if (got <= 0) + goto error; + + if (qemuAgentIOProcess(mon) < 0) + goto error; + + /* Ignore hangup/error events if we read some data, to + * give time for that data to be consumed */ + events = 0; + } + + if (events & VIR_EVENT_HANDLE_HANGUP) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("End of file from agent monitor")); + goto error; + } + + if (events & VIR_EVENT_HANDLE_ERROR) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("Invalid file 
descriptor while waiting for monitor")); + goto error; + } + + if (events) { + virReportError(VIR_ERR_INTERNAL_ERROR, + _("Unhandled event %d for monitor fd %d"), + events, mon->fd); + goto error; } qemuAgentUpdateWatch(mon); + virObjectUnlock(mon); + virObjectUnref(mon); + return; - /* We have to unlock to avoid deadlock against command thread, - * but is this safe ? I think it is, because the callback - * will try to acquire the virDomainObjPtr mutex next */ - if (eof) { - void (*eofNotify)(qemuAgentPtr, virDomainObjPtr) - = mon->cb->eofNotify; - virDomainObjPtr vm = mon->vm; - - /* Make sure anyone waiting wakes up now */ - virCondSignal(&mon->notify); - virObjectUnlock(mon); - virObjectUnref(mon); - VIR_DEBUG("Triggering EOF callback"); - (eofNotify)(mon, vm); - } else if (error) { - void (*errorNotify)(qemuAgentPtr, virDomainObjPtr) - = mon->cb->errorNotify; - virDomainObjPtr vm = mon->vm; - - /* Make sure anyone waiting wakes up now */ - virCondSignal(&mon->notify); - virObjectUnlock(mon); - virObjectUnref(mon); - VIR_DEBUG("Triggering error callback"); - (errorNotify)(mon, vm); + error: + if (mon->lastError.code != VIR_ERR_OK) { + /* Already have an error, so clear any new error */ + virResetLastError(); } else { - virObjectUnlock(mon); - virObjectUnref(mon); + virErrorPtr err = virGetLastError(); + if (!err) + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("Error while processing monitor IO")); + virCopyLastError(&mon->lastError); + virResetLastError(); } + + VIR_DEBUG("Error on monitor %s", NULLSTR(mon->lastError.message)); + /* If IO process resulted in an error & we have a message, + * then wakeup that waiter */ + if (mon->msg && !mon->msg->finished) { + mon->msg->finished = 1; + virCondSignal(&mon->notify); + } + + qemuAgentUpdateWatch(mon); + + errorNotify = mon->cb->errorNotify; + vm = mon->vm; + + /* Make sure anyone waiting wakes up now */ + virCondSignal(&mon->notify); + virObjectUnlock(mon); + virObjectUnref(mon); + (errorNotify)(mon, vm); } @@ 
-732,9 +705,9 @@ qemuAgentOpen(virDomainObjPtr vm, { qemuAgentPtr mon; - if (!cb || !cb->eofNotify) { + if (!cb || !cb->errorNotify) { virReportError(VIR_ERR_INTERNAL_ERROR, "%s", - _("EOF notify callback must be supplied")); + _("error notify callback must be supplied")); return NULL; } diff --git a/src/qemu/qemu_agent.h b/src/qemu/qemu_agent.h index 6dd9c70..76e8772 100644 --- a/src/qemu/qemu_agent.h +++ b/src/qemu/qemu_agent.h @@ -36,8 +36,6 @@ typedef qemuAgentCallbacks *qemuAgentCallbacksPtr; struct _qemuAgentCallbacks { void (*destroy)(qemuAgentPtr mon, virDomainObjPtr vm); - void (*eofNotify)(qemuAgentPtr mon, - virDomainObjPtr vm); void (*errorNotify)(qemuAgentPtr mon, virDomainObjPtr vm); }; diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 3da1bd5..fe7682e 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -128,12 +128,12 @@ extern virQEMUDriverPtr qemu_driver; * performed */ static void -qemuProcessHandleAgentEOF(qemuAgentPtr agent, - virDomainObjPtr vm) +qemuProcessHandleAgentError(qemuAgentPtr agent, + virDomainObjPtr vm) { qemuDomainObjPrivatePtr priv; - VIR_DEBUG("Received EOF from agent on %p '%s'", vm, vm->def->name); + VIR_DEBUG("Received error from agent on %p '%s'", vm, vm->def->name); virObjectLock(vm); @@ -145,7 +145,7 @@ qemuProcessHandleAgentEOF(qemuAgentPtr agent, } if (priv->beingDestroyed) { - VIR_DEBUG("Domain is being destroyed, agent EOF is expected"); + VIR_DEBUG("Domain is being destroyed, agent error is expected"); goto unlock; } @@ -161,19 +161,6 @@ qemuProcessHandleAgentEOF(qemuAgentPtr agent, } -/* - * This is invoked when there is some kind of error - * parsing data to/from the agent. 
The VM can continue - * to run, but no further agent commands will be - * allowed - */ -static void -qemuProcessHandleAgentError(qemuAgentPtr agent, - virDomainObjPtr vm) -{ - qemuProcessHandleAgentEOF(agent, vm); -} - static void qemuProcessHandleAgentDestroy(qemuAgentPtr agent, virDomainObjPtr vm) { @@ -185,7 +172,6 @@ static void qemuProcessHandleAgentDestroy(qemuAgentPtr agent, static qemuAgentCallbacks agentCallbacks = { .destroy = qemuProcessHandleAgentDestroy, - .eofNotify = qemuProcessHandleAgentEOF, .errorNotify = qemuProcessHandleAgentError, }; diff --git a/tests/qemumonitortestutils.c b/tests/qemumonitortestutils.c index c86a27a..d9b2726 100644 --- a/tests/qemumonitortestutils.c +++ b/tests/qemumonitortestutils.c @@ -752,7 +752,6 @@ qemuMonitorTestAgentNotify(qemuAgentPtr agent ATTRIBUTE_UNUSED, static qemuAgentCallbacks qemuMonitorTestAgentCallbacks = { - .eofNotify = qemuMonitorTestAgentNotify, .errorNotify = qemuMonitorTestAgentNotify, }; -- 1.8.3.1
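The control-flow shape this rewrite adopts, successive event checks that each bail out to a single error label with a fallback message, can be sketched in standalone form. The event constants and the handleEvents helper below are invented stand-ins, not the real qemuAgentIO:

```c
#include <stddef.h>

/* Invented stand-ins for the poll event bits the agent loop handles. */
enum { EV_READABLE = 1, EV_HANGUP = 2, EV_ERROR = 4 };

/* Condensed model of the reworked loop: every failure jumps to one
 * shared label, which supplies a fallback message, instead of
 * juggling separate error/eof flags. readGot models the return
 * value of the read step. */
static int
handleEvents(int events, int readGot, const char **msg)
{
    *msg = NULL;

    if (events & EV_READABLE) {
        if (readGot <= 0)         /* read error and EOF share one path now */
            goto error;
        events = 0;               /* got data: ignore hangup/error for now */
    }

    if (events & EV_HANGUP) {
        *msg = "End of file from agent monitor";
        goto error;
    }

    if (events & EV_ERROR) {
        *msg = "Invalid file descriptor while waiting for monitor";
        goto error;
    }

    return 0;

 error:
    if (*msg == NULL)
        *msg = "Error while processing monitor IO";
    return -1;
}
```

Note how the "data read masks pending hangup/error bits" rule from the original code survives the rewrite: clearing events after a successful read makes the later checks no-ops for that invocation.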

On 10/04/2016 02:56 AM, Nikolay Shirokovskiy wrote:
Now we don't need to differentiate error and eof cases in the loop function. So let's simplify it radically using goto instead of flags. --- src/qemu/qemu_agent.c | 185 ++++++++++++++++++------------------------- src/qemu/qemu_agent.h | 2 - src/qemu/qemu_process.c | 22 +---- tests/qemumonitortestutils.c | 1 - 4 files changed, 83 insertions(+), 127 deletions(-)
I would think it's understood that the genesis of this comes from qemuMonitorIO (it tripped me up in my patch 1 review too, but still the same result)... Part of me believes whatever happens here could be considered for qemuMonitorIO too, but that's perhaps a bigger hurdle to jump. I'm not sure why we want to make error and EOF be the same thing. I'm stopping here...

John
diff --git a/src/qemu/qemu_agent.c b/src/qemu/qemu_agent.c index ec8d47e..43d78c9 100644 --- a/src/qemu/qemu_agent.c +++ b/src/qemu/qemu_agent.c @@ -595,8 +595,8 @@ static void qemuAgentIO(int watch, int fd, int events, void *opaque) { qemuAgentPtr mon = opaque; - bool error = false; - bool eof = false; + void (*errorNotify)(qemuAgentPtr, virDomainObjPtr); + virDomainObjPtr vm;
virObjectRef(mon); /* lock access to the monitor and protect fd */ @@ -606,122 +606,95 @@ qemuAgentIO(int watch, int fd, int events, void *opaque) #endif
if (mon->fd != fd || mon->watch != watch) { - if (events & (VIR_EVENT_HANDLE_HANGUP | VIR_EVENT_HANDLE_ERROR)) - eof = true; virReportError(VIR_ERR_INTERNAL_ERROR, _("event from unexpected fd %d!=%d / watch %d!=%d"), mon->fd, fd, mon->watch, watch); - error = true; - } else if (mon->lastError.code != VIR_ERR_OK) { - if (events & (VIR_EVENT_HANDLE_HANGUP | VIR_EVENT_HANDLE_ERROR)) - eof = true; - error = true; - } else { - if (events & VIR_EVENT_HANDLE_WRITABLE) { - if (mon->connectPending) { - if (qemuAgentIOConnect(mon) < 0) - error = true; - } else { - if (qemuAgentIOWrite(mon) < 0) - error = true; - } - events &= ~VIR_EVENT_HANDLE_WRITABLE; - } - - if (!error && - events & VIR_EVENT_HANDLE_READABLE) { - int got = qemuAgentIORead(mon); - events &= ~VIR_EVENT_HANDLE_READABLE; - if (got < 0) { - error = true; - } else if (got == 0) { - eof = true; - } else { - /* Ignore hangup/error events if we read some data, to - * give time for that data to be consumed */ - events = 0; - - if (qemuAgentIOProcess(mon) < 0) - error = true; - } - } - - if (!error && - events & VIR_EVENT_HANDLE_HANGUP) { - virReportError(VIR_ERR_INTERNAL_ERROR, "%s", - _("End of file from agent monitor")); - eof = true; - events &= ~VIR_EVENT_HANDLE_HANGUP; - } - - if (!error && !eof && - events & VIR_EVENT_HANDLE_ERROR) { - virReportError(VIR_ERR_INTERNAL_ERROR, "%s", - _("Invalid file descriptor while waiting for monitor")); - eof = true; - events &= ~VIR_EVENT_HANDLE_ERROR; - } - if (!error && events) { - virReportError(VIR_ERR_INTERNAL_ERROR, - _("Unhandled event %d for monitor fd %d"), - events, mon->fd); - error = true; - } + goto error; }
- if (error || eof) { - if (mon->lastError.code != VIR_ERR_OK) { - /* Already have an error, so clear any new error */ - virResetLastError(); + if (mon->lastError.code != VIR_ERR_OK) + goto error; + + if (events & VIR_EVENT_HANDLE_WRITABLE) { + if (mon->connectPending) { + if (qemuAgentIOConnect(mon) < 0) + goto error; } else { - virErrorPtr err = virGetLastError(); - if (!err) - virReportError(VIR_ERR_INTERNAL_ERROR, "%s", - _("Error while processing monitor IO")); - virCopyLastError(&mon->lastError); - virResetLastError(); + if (qemuAgentIOWrite(mon) < 0) + goto error; } + events &= ~VIR_EVENT_HANDLE_WRITABLE; + }
On 26.10.2016 23:48, John Ferlan wrote:
On 10/04/2016 02:56 AM, Nikolay Shirokovskiy wrote:
Now we don't need to differentiate between the error and EOF cases in the loop function, so let's simplify it radically by using goto instead of flags. --- src/qemu/qemu_agent.c | 185 ++++++++++++++++++------------------------- src/qemu/qemu_agent.h | 2 - src/qemu/qemu_process.c | 22 +---- tests/qemumonitortestutils.c | 1 - 4 files changed, 83 insertions(+), 127 deletions(-)
I would think it's understood that the genesis of this comes from qemuMonitorIO (it tripped me up in my patch 1 review too, but still the same result)...
Part of me believes whatever happens here could be considered for qemuMonitorIO too, but that's perhaps a bigger hurdle to jump.
I'm not sure why we want to make Error and EOF be the same thing.
I would not put it that way. It simply doesn't matter to us whether it is EOF or an error: the causes differ, but the result is the same. So why keep this convoluted logic in the event loop? Just compare qemuAgentIO before and after the patch... Nikolay
diff --git a/src/qemu/qemu_agent.c b/src/qemu/qemu_agent.c index ec8d47e..43d78c9 100644 --- a/src/qemu/qemu_agent.c +++ b/src/qemu/qemu_agent.c @@ -595,8 +595,8 @@ static void qemuAgentIO(int watch, int fd, int events, void *opaque) { qemuAgentPtr mon = opaque; - bool error = false; - bool eof = false; + void (*errorNotify)(qemuAgentPtr, virDomainObjPtr); + virDomainObjPtr vm;
virObjectRef(mon); /* lock access to the monitor and protect fd */ @@ -606,122 +606,95 @@ qemuAgentIO(int watch, int fd, int events, void *opaque) #endif
if (mon->fd != fd || mon->watch != watch) { - if (events & (VIR_EVENT_HANDLE_HANGUP | VIR_EVENT_HANDLE_ERROR)) - eof = true; virReportError(VIR_ERR_INTERNAL_ERROR, _("event from unexpected fd %d!=%d / watch %d!=%d"), mon->fd, fd, mon->watch, watch); - error = true; - } else if (mon->lastError.code != VIR_ERR_OK) { - if (events & (VIR_EVENT_HANDLE_HANGUP | VIR_EVENT_HANDLE_ERROR)) - eof = true; - error = true; - } else { - if (events & VIR_EVENT_HANDLE_WRITABLE) { - if (mon->connectPending) { - if (qemuAgentIOConnect(mon) < 0) - error = true; - } else { - if (qemuAgentIOWrite(mon) < 0) - error = true; - } - events &= ~VIR_EVENT_HANDLE_WRITABLE; - } - - if (!error && - events & VIR_EVENT_HANDLE_READABLE) { - int got = qemuAgentIORead(mon); - events &= ~VIR_EVENT_HANDLE_READABLE; - if (got < 0) { - error = true; - } else if (got == 0) { - eof = true; - } else { - /* Ignore hangup/error events if we read some data, to - * give time for that data to be consumed */ - events = 0; - - if (qemuAgentIOProcess(mon) < 0) - error = true; - } - } - - if (!error && - events & VIR_EVENT_HANDLE_HANGUP) { - virReportError(VIR_ERR_INTERNAL_ERROR, "%s", - _("End of file from agent monitor")); - eof = true; - events &= ~VIR_EVENT_HANDLE_HANGUP; - } - - if (!error && !eof && - events & VIR_EVENT_HANDLE_ERROR) { - virReportError(VIR_ERR_INTERNAL_ERROR, "%s", - _("Invalid file descriptor while waiting for monitor")); - eof = true; - events &= ~VIR_EVENT_HANDLE_ERROR; - } - if (!error && events) { - virReportError(VIR_ERR_INTERNAL_ERROR, - _("Unhandled event %d for monitor fd %d"), - events, mon->fd); - error = true; - } + goto error; }
- if (error || eof) { - if (mon->lastError.code != VIR_ERR_OK) { - /* Already have an error, so clear any new error */ - virResetLastError(); + if (mon->lastError.code != VIR_ERR_OK) + goto error; + + if (events & VIR_EVENT_HANDLE_WRITABLE) { + if (mon->connectPending) { + if (qemuAgentIOConnect(mon) < 0) + goto error; } else { - virErrorPtr err = virGetLastError(); - if (!err) - virReportError(VIR_ERR_INTERNAL_ERROR, "%s", - _("Error while processing monitor IO")); - virCopyLastError(&mon->lastError); - virResetLastError(); + if (qemuAgentIOWrite(mon) < 0) + goto error; } + events &= ~VIR_EVENT_HANDLE_WRITABLE; + }
- VIR_DEBUG("Error on monitor %s", NULLSTR(mon->lastError.message)); - /* If IO process resulted in an error & we have a message, - * then wakeup that waiter */ - if (mon->msg && !mon->msg->finished) { - mon->msg->finished = 1; - virCondSignal(&mon->notify); - } + if (events & VIR_EVENT_HANDLE_READABLE) { + int got = qemuAgentIORead(mon); + + if (got <= 0) + goto error; + + if (qemuAgentIOProcess(mon) < 0) + goto error; + + /* Ignore hangup/error events if we read some data, to + * give time for that data to be consumed */ + events = 0; + } + + if (events & VIR_EVENT_HANDLE_HANGUP) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("End of file from agent monitor")); + goto error; + } + + if (events & VIR_EVENT_HANDLE_ERROR) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("Invalid file descriptor while waiting for monitor")); + goto error; + } + + if (events) { + virReportError(VIR_ERR_INTERNAL_ERROR, + _("Unhandled event %d for monitor fd %d"), + events, mon->fd); + goto error; }
qemuAgentUpdateWatch(mon); + virObjectUnlock(mon); + virObjectUnref(mon); + return;
- /* We have to unlock to avoid deadlock against command thread, - * but is this safe ? I think it is, because the callback - * will try to acquire the virDomainObjPtr mutex next */ - if (eof) { - void (*eofNotify)(qemuAgentPtr, virDomainObjPtr) - = mon->cb->eofNotify; - virDomainObjPtr vm = mon->vm; - - /* Make sure anyone waiting wakes up now */ - virCondSignal(&mon->notify); - virObjectUnlock(mon); - virObjectUnref(mon); - VIR_DEBUG("Triggering EOF callback"); - (eofNotify)(mon, vm); - } else if (error) { - void (*errorNotify)(qemuAgentPtr, virDomainObjPtr) - = mon->cb->errorNotify; - virDomainObjPtr vm = mon->vm; - - /* Make sure anyone waiting wakes up now */ - virCondSignal(&mon->notify); - virObjectUnlock(mon); - virObjectUnref(mon); - VIR_DEBUG("Triggering error callback"); - (errorNotify)(mon, vm); + error: + if (mon->lastError.code != VIR_ERR_OK) { + /* Already have an error, so clear any new error */ + virResetLastError(); } else { - virObjectUnlock(mon); - virObjectUnref(mon); + virErrorPtr err = virGetLastError(); + if (!err) + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("Error while processing monitor IO")); + virCopyLastError(&mon->lastError); + virResetLastError(); } + + VIR_DEBUG("Error on monitor %s", NULLSTR(mon->lastError.message)); + /* If IO process resulted in an error & we have a message, + * then wakeup that waiter */ + if (mon->msg && !mon->msg->finished) { + mon->msg->finished = 1; + virCondSignal(&mon->notify); + } + + qemuAgentUpdateWatch(mon); + + errorNotify = mon->cb->errorNotify; + vm = mon->vm; + + /* Make sure anyone waiting wakes up now */ + virCondSignal(&mon->notify); + virObjectUnlock(mon); + virObjectUnref(mon); + (errorNotify)(mon, vm); }
@@ -732,9 +705,9 @@ qemuAgentOpen(virDomainObjPtr vm, { qemuAgentPtr mon;
- if (!cb || !cb->eofNotify) { + if (!cb || !cb->errorNotify) { virReportError(VIR_ERR_INTERNAL_ERROR, "%s", - _("EOF notify callback must be supplied")); + _("error notify callback must be supplied")); return NULL; }
diff --git a/src/qemu/qemu_agent.h b/src/qemu/qemu_agent.h index 6dd9c70..76e8772 100644 --- a/src/qemu/qemu_agent.h +++ b/src/qemu/qemu_agent.h @@ -36,8 +36,6 @@ typedef qemuAgentCallbacks *qemuAgentCallbacksPtr; struct _qemuAgentCallbacks { void (*destroy)(qemuAgentPtr mon, virDomainObjPtr vm); - void (*eofNotify)(qemuAgentPtr mon, - virDomainObjPtr vm); void (*errorNotify)(qemuAgentPtr mon, virDomainObjPtr vm); }; diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 3da1bd5..fe7682e 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -128,12 +128,12 @@ extern virQEMUDriverPtr qemu_driver; * performed */ static void -qemuProcessHandleAgentEOF(qemuAgentPtr agent, - virDomainObjPtr vm) +qemuProcessHandleAgentError(qemuAgentPtr agent, + virDomainObjPtr vm) { qemuDomainObjPrivatePtr priv;
- VIR_DEBUG("Received EOF from agent on %p '%s'", vm, vm->def->name); + VIR_DEBUG("Received error from agent on %p '%s'", vm, vm->def->name);
virObjectLock(vm);
@@ -145,7 +145,7 @@ qemuProcessHandleAgentEOF(qemuAgentPtr agent, }
if (priv->beingDestroyed) { - VIR_DEBUG("Domain is being destroyed, agent EOF is expected"); + VIR_DEBUG("Domain is being destroyed, agent error is expected"); goto unlock; }
@@ -161,19 +161,6 @@ qemuProcessHandleAgentEOF(qemuAgentPtr agent, }
-/* - * This is invoked when there is some kind of error - * parsing data to/from the agent. The VM can continue - * to run, but no further agent commands will be - * allowed - */ -static void -qemuProcessHandleAgentError(qemuAgentPtr agent, - virDomainObjPtr vm) -{ - qemuProcessHandleAgentEOF(agent, vm); -} - static void qemuProcessHandleAgentDestroy(qemuAgentPtr agent, virDomainObjPtr vm) { @@ -185,7 +172,6 @@ static void qemuProcessHandleAgentDestroy(qemuAgentPtr agent,
static qemuAgentCallbacks agentCallbacks = { .destroy = qemuProcessHandleAgentDestroy, - .eofNotify = qemuProcessHandleAgentEOF, .errorNotify = qemuProcessHandleAgentError, };
diff --git a/tests/qemumonitortestutils.c b/tests/qemumonitortestutils.c index c86a27a..d9b2726 100644 --- a/tests/qemumonitortestutils.c +++ b/tests/qemumonitortestutils.c @@ -752,7 +752,6 @@ qemuMonitorTestAgentNotify(qemuAgentPtr agent ATTRIBUTE_UNUSED,
static qemuAgentCallbacks qemuMonitorTestAgentCallbacks = { - .eofNotify = qemuMonitorTestAgentNotify, .errorNotify = qemuMonitorTestAgentNotify, };

If we caught an I/O error in a previous loop invocation, there is no need to report an 'unexpected fd' error only to drop it later. Let's also call the error callback only once. --- src/qemu/qemu_agent.c | 28 +++++++++++++--------------- 1 file changed, 13 insertions(+), 15 deletions(-) diff --git a/src/qemu/qemu_agent.c b/src/qemu/qemu_agent.c index 43d78c9..5dc39b6 100644 --- a/src/qemu/qemu_agent.c +++ b/src/qemu/qemu_agent.c @@ -597,6 +597,7 @@ qemuAgentIO(int watch, int fd, int events, void *opaque) qemuAgentPtr mon = opaque; void (*errorNotify)(qemuAgentPtr, virDomainObjPtr); virDomainObjPtr vm; + virErrorPtr err; virObjectRef(mon); /* lock access to the monitor and protect fd */ @@ -605,6 +606,13 @@ qemuAgentIO(int watch, int fd, int events, void *opaque) VIR_DEBUG("Agent %p I/O on watch %d fd %d events %d", mon, watch, fd, events); #endif + /* this is not interesting at all */ + if (mon->lastError.code != VIR_ERR_OK) { + virObjectUnlock(mon); + virObjectUnref(mon); + return; + } + if (mon->fd != fd || mon->watch != watch) { virReportError(VIR_ERR_INTERNAL_ERROR, _("event from unexpected fd %d!=%d / watch %d!=%d"), @@ -612,9 +620,6 @@ qemuAgentIO(int watch, int fd, int events, void *opaque) goto error; } - if (mon->lastError.code != VIR_ERR_OK) - goto error; - if (events & VIR_EVENT_HANDLE_WRITABLE) { if (mon->connectPending) { if (qemuAgentIOConnect(mon) < 0) @@ -665,19 +670,12 @@ qemuAgentIO(int watch, int fd, int events, void *opaque) return; error: - if (mon->lastError.code != VIR_ERR_OK) { - /* Already have an error, so clear any new error */ - virResetLastError(); - } else { - virErrorPtr err = virGetLastError(); - if (!err) - virReportError(VIR_ERR_INTERNAL_ERROR, "%s", - _("Error while processing monitor IO")); - virCopyLastError(&mon->lastError); - virResetLastError(); - } + if (!(err = virGetLastError())) + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("Error while processing monitor IO")); + virCopyLastError(&mon->lastError); + virResetLastError(); - 
VIR_DEBUG("Error on monitor %s", NULLSTR(mon->lastError.message)); /* If IO process resulted in an error & we have a message, * then wakeup that waiter */ if (mon->msg && !mon->msg->finished) { -- 1.8.3.1

--- src/qemu/qemu_agent.c | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/src/qemu/qemu_agent.c b/src/qemu/qemu_agent.c index 5dc39b6..36bc7c7 100644 --- a/src/qemu/qemu_agent.c +++ b/src/qemu/qemu_agent.c @@ -634,9 +634,15 @@ qemuAgentIO(int watch, int fd, int events, void *opaque) if (events & VIR_EVENT_HANDLE_READABLE) { int got = qemuAgentIORead(mon); - if (got <= 0) + if (got < 0) goto error; + if (got == 0) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("EOF from agent")); + goto error; + } + if (qemuAgentIOProcess(mon) < 0) goto error; @@ -647,13 +653,13 @@ qemuAgentIO(int watch, int fd, int events, void *opaque) if (events & VIR_EVENT_HANDLE_HANGUP) { virReportError(VIR_ERR_INTERNAL_ERROR, "%s", - _("End of file from agent monitor")); + _("Write hangup from agent")); goto error; } if (events & VIR_EVENT_HANDLE_ERROR) { virReportError(VIR_ERR_INTERNAL_ERROR, "%s", - _("Invalid file descriptor while waiting for monitor")); + _("Write error to agent")); goto error; } -- 1.8.3.1

The caller should take care of it; in this case the caller is the default event loop implementation. This could be complicated because the error callback can trigger unreferencing of the monitor. But first, that happens at the end anyway, after we have already dropped our reference; second, the current default loop implementation defers deleting handles, and has done so for ages because this implementation is a public API [1]. So we hold a reference while we are in this call. [1] https://www.redhat.com/archives/libvir-list/2016-September/msg01005.html --- src/qemu/qemu_agent.c | 4 ---- 1 file changed, 4 deletions(-) diff --git a/src/qemu/qemu_agent.c b/src/qemu/qemu_agent.c index 36bc7c7..6aaacfd 100644 --- a/src/qemu/qemu_agent.c +++ b/src/qemu/qemu_agent.c @@ -599,7 +599,6 @@ qemuAgentIO(int watch, int fd, int events, void *opaque) virDomainObjPtr vm; virErrorPtr err; - virObjectRef(mon); /* lock access to the monitor and protect fd */ virObjectLock(mon); #if DEBUG_IO @@ -609,7 +608,6 @@ qemuAgentIO(int watch, int fd, int events, void *opaque) /* this is not interesting at all */ if (mon->lastError.code != VIR_ERR_OK) { virObjectUnlock(mon); - virObjectUnref(mon); return; } @@ -672,7 +670,6 @@ qemuAgentIO(int watch, int fd, int events, void *opaque) qemuAgentUpdateWatch(mon); virObjectUnlock(mon); - virObjectUnref(mon); return; error: @@ -697,7 +694,6 @@ qemuAgentIO(int watch, int fd, int events, void *opaque) /* Make sure anyone waiting wakes up now */ virCondSignal(&mon->notify); virObjectUnlock(mon); - virObjectUnref(mon); (errorNotify)(mon, vm); } -- 1.8.3.1

ping On 04.10.2016 09:56, Nikolay Shirokovskiy wrote:
participants (2): John Ferlan, Nikolay Shirokovskiy