Thanks for the thorough review Michal, clearly it needs more work. In
particular, I was not yet aware of the private xml save/restore stuff.
Thanks for the pointers in that regard. But I have one particular
question below:
On Fri, 2019-11-08 at 11:46 +0100, Michal Privoznik wrote:
1: It's not that simple. Firstly, you need to grab an agent job
(QEMU_AGENT_JOB_MODIFY ideally) to make sure no other thread is in
the
middle of talking to the agent. Secondly, some domains may not have
an
agent configured at all (in which case, priv->agent is going to be
NULL). And lastly, we need to track the timeout set in domain
private
data so that if libvirtd restarts the same timeout is preserved. But
changing private data means you have to acquire domain job too.
Is this really true? I'm obviously still not intimately familiar with
the libvirt 'job' stuff, but my understanding of jobs is that it is a
method to avoid multiple threads talking to the qemu monitor (or agent,
for agent jobs) at the same time. The data integrity of the structs
themselves are protected by mutex. But a mutex is insufficient to
mediate access to the monitor/agent because waiting for a response may
involve a sleep, and the mutex will be released while the thread
sleeps. So this 'job' system was added on top to indicate that the
monitor/agent was in use and other threads need to wait until the job
is complete before they can interact with the monitor or agent. Is my
understanding correct so far?
In this situation, we're not talking to the qemu monitor or the agent,
we're merely changing some struct data. So from a data-integrity point
of view, we simply need to hold the mutex lock, right? (We do hold the
mutex lock for the domain from calling qemuDomainObjFromDomain()). So
it doesn't seem strictly necessary to acquire a job here (either for
the agent, or for the domain/monitor).
One argument I can see for acquiring a job here is that it ensures that
any private data is not changed while another thread is in the middle
of a monitor query. And that's potentially a compelling justification
for acquiring a job. But in this circumstance, I don't see that it
gains us much.
For example, let's say Thread A has already acquired an agent job for
getHostname(). This thread will call qemuAgentCommand() and pass the
value of mon->timeout for the 'seconds' argument. Inside this function,
it will further call qemuAgentGuestSync(), which again uses the mon-
timeout value to determine how long it should wait for a response. It
sends the guest-sync command to the guest agent, and then waits for a
response (releasing the mutex). While it is waiting, let's say that
Thread B calls SetResponseTimeout(). Let's further imagine that
SetResponseTimeout() doesn't acquire a job (as in my patch). It can
acquire the agent mutex (since the other thread released it while
sleeping) and safely change the mon->timeout value immediately. At some
point, Thread A wakes up, handles the guest-sync response and issues
the 'guest-get-host-name' command to the agent. It uses the value
passed as an argument to qemuAgentCommand() to determine how long to
wait for a response from the agent. So it will not use the new value
set by thread B, but will use the old value that was set when it was
first called. (But I think an argument could be made that it would be
fine to use this new timeout value as well.)
This behavior all seems fine to me. Am I misunderstanding something
fundamentally here? Or is there just a general policy that all data
modification should be protected by a job, regardless of whether we're
using the qemu monitor or agent? If so, perhaps that could be better
documented.
One final concern: in your fixup commit, you acquire both a normal
domain job and an agent job (in the case where the agent is active).
But this is something I've been working to remove from libvirt after a
discussion with Daniel Berrange in
https://bugzilla.redhat.com/show_bug.cgi?id=1759566