On 26.10.2016 22:48, John Ferlan wrote:
> On 10/04/2016 02:56 AM, Nikolay Shirokovskiy wrote:
>> Usually on domain shutdown the event comes first from qemu
>> and the flag is unset in processSerialChangedEvent. However
>> there is a race between this function and qemuProcessHandleAgentEOF
>> because the former is executed in a thread pool and the latter
>> is executed synchronously. qemuProcessHandleAgentEOF does not unset
>> the flag, thus if it was triggered before, it remains set after domain
>> shutdown/start and the qemu agent becomes unavailable.
> What error occurred that would be avoided by clearing the agentError
> flag unconditionally? Considering that agentError being set is primarily
In short, the qemu agent becomes unavailable after shutdown.

The long description of the error case is:

1. An agent error occurs and the error flag is set.
2. The domain is shut down. Shutdown follows a path on which the error
flag stays set. (Actually the error flag can also be set in the process
of shutdown, which is, I guess, the case that I observed.)
3. The domain is started; any attempt to use the agent fails with
"QEMU guest agent is not available due to an error".

As nobody clears the error flag on startup, it stays set. Obviously we
don't need to clear it on the first domain start after daemon start
because qemuDomainObjPrivate is allocated and zeroed.
But subsequent restarts rely on the assumption that the flag is cleared
on shutdown, and this is not true: there are shutdown paths where the
flag is not cleared.
This patch addresses one of these paths:

1. Shutdown starts; in the process VSERPORT_CHANGE is delivered in the
IO loop thread. However, it is handled in a different thread (from
driver->workerPool).
2. In the same IO loop thread we get EOF from the agent and synchronously
execute qemuProcessHandleAgentEOF. It closes the agent and sets
priv->agent to NULL but forgets to clear agentError.
3. Now the VSERPORT_CHANGE handler (processSerialChangedEvent) runs.
It is delayed because it is asynchronous to VSERPORT_CHANGE itself.
As priv->agent is NULL, the handler does not touch the agentError flag.

I guess usually the time gap between VSERPORT_CHANGE and EOF is large
enough that step 3 is executed before step 2 and agentError is cleared
upon shutdown.
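For reference, without this patch the EOF handler looks roughly like this
(condensed from memory; only the agent/flag handling is kept, the early
bail-out checks are paraphrased):

    static void
    qemuProcessHandleAgentEOF(qemuAgentPtr agent,
                              virDomainObjPtr vm)
    {
        qemuDomainObjPrivatePtr priv;

        virObjectLock(vm);
        priv = vm->privateData;

        /* ... return early if the agent is already gone or the
         * domain is being destroyed ... */

        qemuAgentClose(agent);
        priv->agent = NULL;
        /* priv->agentError is left untouched here, so if it was set
         * before (or during) shutdown it survives into the next
         * domain start */

        virObjectUnlock(vm);
    }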
Basically it is enough to clear the flag on startup (patch 3), but I
thought that, just in case, we would want to clear the flag on agent
cleanup too.
used by "new connection" request processing (e.g. callers
to
qemuDomainAgentAvailable) in order to ascertain whether they should
actually attempt the connection or fail because of some condition that
would eventually lead to failure. I would think EOF from qemu_monitor
for the qemu_agent would be one of those conditions *before* the domain
state is not RUNNING or the priv->agent has been fully removed.
I don't really understand the "*before* the domain state is not RUNNING"
part. From what I've seen in the code, agentError has two meanings; they
are described in patch 5.
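For reference, the consumer of the flag is qemuDomainAgentAvailable;
roughly (paraphrased, error reporting trimmed):

    bool
    qemuDomainAgentAvailable(virDomainObjPtr vm,
                             bool reportError)
    {
        qemuDomainObjPrivatePtr priv = vm->privateData;

        /* 1. the domain must be running */
        if (virDomainObjGetState(vm, NULL) != VIR_DOMAIN_RUNNING)
            return false;

        /* 2. a previously recorded agent error makes the agent
         * unavailable: "QEMU guest agent is not available due to
         * an error" */
        if (priv->agentError)
            return false;

        /* 3. the agent must actually be connected */
        if (!priv->agent)
            return false;

        return true;
    }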
> I think what I'm missing is what sequence of events led to this
> condition? How was it reproduced?
>> ---
>>  src/qemu/qemu_process.c | 1 +
>>  1 file changed, 1 insertion(+)
>>
> So I understand "eventually" your goal is to remove agentError in patch
> 6, but you need to lead down the path to get there. So let's see if I
> can follow along...
My original intention was to fix the bug that I tried to
explain above. But in the process I found another path by which a
similar error can occur (patch 2). Then I realized that
we can protect ourselves from similar bugs in just one place (patch 3).
At that point I had looked into the code long enough to realize that
the flag is not needed at all, and decided that a detailed explanation
in patches 1 and 2 would help me convince others to get rid of the flag.
> The issue you chased is related to domain shutdown processing "order".
> The purpose up to this point in time for 'agentError' is to ensure
> callers of qemuDomainAgentAvailable() will fail before trying to use the
> agent. The qemuDomainAgentAvailable checks domain active, agentError,
> and whether the priv->agent is available.
> So the contention is during shutdown the EOF event firing at about the
> same time as the processSerialChangedEvent has caused an issue, but it's
> not 100% clear just from the description the sequence of events and the
> issue. So let's consider why these things happen:
> 1. The 'eofNotify' comes from the qemuMonitorIO function - ostensibly if
>    the guest is shutting down
> 2. The 'processSerialChangedEvent' seems to occur for multiple reasons -
>    startup, shutdown, some sort of channel failure (unix socket or pty).
>    If I'm reading things correctly, this path is only involved when
>    VSERPORT is being used.
> So, is this a case of guest startup where the initial agent
> connect/startup occurs (VSERPORT) and a guest shutdown racing? Or is
> this purely a shutdown ordering thing?
Purely shutdown, plus an agent error delivered in the IO loop before or
during the shutdown.
>> diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
>> index 31c8453..a581f96 100644
>> --- a/src/qemu/qemu_process.c
>> +++ b/src/qemu/qemu_process.c
>> @@ -151,6 +151,7 @@ qemuProcessHandleAgentEOF(qemuAgentPtr agent,
>>  
>>      qemuAgentClose(agent);
>>      priv->agent = NULL;
> Because this is cleared, then qemuDomainAgentAvailable will fail because
> the agent is not present instead of failing with there was some error.
> That would seem to make the following clearing "safe" since we know
> we'll have a failure any way, but...
>> +    priv->agentError = false;
> How does this alone solve the problem you saw? AFAICT it just changes
> *where* the error occurs and the error message displayed.
The place that is affected is the next domain startup and the agent use
after it.
> Conversely, if for some reason for either
> 1. qemuConnectAgent failed and is being "flagged" in
>    1a. processSerialChangedEvent (for initial CONNECTED failure when
>        QEMU_CAPS_VSERPORT_CHANGE)
>    1b. qemuMigrationFinish (for migration target)
>    1c. qemuProcessReconnect (for libvirtd restart)
>    1d. qemuProcessLaunch (initial domain startup before
>        QEMU_CAPS_VSERPORT_CHANGE)
>    1e. qemuProcessAttach (for qemu-attach to existing qemu process)
> or
> 2. qemuProcessHandleAgentError - some sort of error and we're stopping
> any further agent communication.
> then you're potentially clearing agentError and could conceivably cause
> "extra" debugging steps to be taken only to figure out the reason it was
> because of some error rather than the "next" connection failing and
> indicating the failure was because of some previous error (it's a
> startup race for some other thread desiring to use the agent).
As to case 1, AFAIU you are talking about the following scenario:

1. The agent is started (qemuAgentOpen succeeds).
2. For some reason qemuConnectAgent has to fail (for example,
virSecurityManagerClearSocketLabel fails) and sets the error flag.
3. EOF is triggered and clears the error flag (in the new version).

This is not possible. qemuConnectAgent removes the agent handle from the
IO loop before setting the flag on failure, so the EOF cannot be delivered
afterwards (see the sketch below).
This subtle case is another reason why we should throw away this flag:
it adds too much complexity to the analysis. Patch 5 unbinds the flag from
this case, the case of the agent failing to start up.
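A condensed sketch of the relevant ordering inside qemuConnectAgent
(heavily paraphrased, details and other failure branches elided):

    /* on the qemuConnectAgent failure path: */
    agent = qemuAgentOpen(vm, ...);        /* agent fd is registered in the IO loop */

    if (virSecurityManagerClearSocketLabel(driver->securityManager,
                                           vm->def) < 0) {
        qemuAgentClose(agent);             /* fd is removed from the IO loop here */
        goto cleanup;                      /* ... and only afterwards, in cleanup: */
    }
    /* ... */
 cleanup:
    priv->agentError = true;               /* so an agent EOF can no longer race in */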
As to case 2: this is exactly one of the targets of the patch. If the flag
was set by qemuProcessHandleAgentError, we need to clear it when the agent
itself is cleaned up.
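That handler is tiny; it does nothing but record the error, roughly:

    static void
    qemuProcessHandleAgentError(qemuAgentPtr agent ATTRIBUTE_UNUSED,
                                virDomainObjPtr vm)
    {
        qemuDomainObjPrivatePtr priv;

        virObjectLock(vm);
        priv = vm->privateData;
        priv->agentError = true;   /* nothing on the EOF path ever resets this */
        virObjectUnlock(vm);
    }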
>>  
>>      virObjectUnlock(vm);
>>      return;
>>  
> So given all that - I'm not 100% clear how this really fixes the issue
> you say. Although, I have to wonder if the bug you're chasing is
> perhaps that qemuConnectAgent doesn't clear agentError at the right time
> as patch 3 attempts to fix.
Yep. The flag is not cleared, and it can be addressed in both ways:
first, clear it on initialization to protect against an "unclean" cleanup;
second, make the cleanup itself "clean".
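The first option amounts to something along these lines in
qemuConnectAgent (just an illustration of the idea, not the exact
patch 3 hunk):

    /* at the point where a fresh agent connection is (re)established */
    priv->agentError = false;   /* forget any error left over from a previous
                                 * run or an earlier failed connection attempt */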
> After reading through patch 4 - I think Patch 4 and a slightly modified
> patch 3 will fix what you're seeing, but I'd only know for sure given
> more details about what was seen.
Patch 4 is pure refactoring IIRC and doesn't address the issue.
Nikolay