Hi, it seems the discussion has stalled so let me give the oVirt
perspective here to (hopefully) revive it. The place where we need to
configure the timeout the most is for periodic queries of the guest
state. As opposed to that, setting the timeout e.g. on shutdown command
does make sense (instead of blocking), but it is not that crucial to
have a great degree of flexibility.
Several more comments below.
On Fri, Oct 04, 2019 at 10:13:11AM +0100, Daniel P. Berrangé wrote:
On Fri, Oct 04, 2019 at 08:14:39AM +0200, Peter Krempa wrote:
> On Thu, Oct 03, 2019 at 16:52:12 -0500, Jonathon Jongsma wrote:
> > Some layered products such as oVirt have requested a way to avoid being
> > blocked by guest agent commands when querying a loaded vm. For example,
> > many guest agent commands are polled periodically to monitor changes,
> > and rather than blocking the calling process, they'd prefer to simply
> > time out when an agent query is taking too long.
> >
> > This patch adds a way for the user to specify a custom agent timeout
> > that is applied to all agent commands.
> >
> > One special case to note here is the 'guest-sync' command.
'guest-sync'
> > is issued internally prior to calling any other command. (For example,
> > when libvirt wants to call 'guest-get-fsinfo', we first call
> > 'guest-sync' and then call 'guest-get-fsinfo').
> >
> > Previously, the 'guest-sync' command used a 5-second timeout
> > (VIR_DOMAIN_QEMU_AGENT_COMMAND_DEFAULT), whereas the actual command that
> > followed always blocked indefinitely
> > (VIR_DOMAIN_QEMU_AGENT_COMMAND_BLOCK). As part of this patch, if a
> > custom timeout is specified that is shorter than
> > 5 seconds, this new timeout also used for 'guest-sync'. If there is
no
> > custom timeout or if the custom timeout is longer than 5 seconds, we
> > will continue to use the 5-second timeout.
> >
> > See
https://bugzilla.redhat.com/show_bug.cgi?id=1705426 for additional
details.
> >
> > Signed-off-by: Jonathon Jongsma <jjongsma(a)redhat.com>
> > @@ -2737,3 +2730,19 @@ qemuAgentGetTimezone(qemuAgentPtr mon,
> >
> > return 0;
> > }
> > +
> > +int
> > +qemuAgentSetTimeout(qemuAgentPtr mon,
> > + int timeout)
> > +{
> > + if (timeout < VIR_DOMAIN_QEMU_AGENT_COMMAND_MIN) {
> > + virReportError(VIR_ERR_INVALID_ARG,
> > + _("guest agent timeout '%d' is "
> > + "less than the minimum '%d'"),
> > + timeout, VIR_DOMAIN_QEMU_AGENT_COMMAND_MIN);
>
> This error is misleading as -1 and -2 are special values and not actual
> timeout.
>
> > + return -1;
> > + }
> > +
> > + mon->timeout = timeout;
> > + return 0;
> > +}
>
> [...]
>
> > diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
> > index 1e041a8bac..09251cc9e2 100644
> > --- a/src/qemu/qemu_driver.c
> > +++ b/src/qemu/qemu_driver.c
> > @@ -23434,6 +23434,29 @@ qemuDomainGetGuestInfo(virDomainPtr dom,
> > return ret;
> > }
> >
> > +static int
> > +qemuDomainQemuAgentSetTimeout(virDomainPtr dom,
> > + int timeout)
> > +{
> > + virDomainObjPtr vm = NULL;
> > + qemuAgentPtr agent;
> > + int ret = -1;
> > +
> > + if (!(vm = qemuDomObjFromDomain(dom)))
> > + goto cleanup;
> > +
> > + if (virDomainQemuAgentSetTimeoutEnsureACL(dom->conn, vm->def) <
0)
> > + goto cleanup;
> > +
> > + agent = qemuDomainObjEnterAgent(vm);
>
> You must acquire a job on @vm if you want to call this:
>
> /*
> * obj must be locked before calling
> *
> * To be called immediately before any QEMU agent API call.
> * Must have already called qemuDomainObjBeginAgentJob() or
> * qemuDomainObjBeginJobWithAgent() and checked that the VM is
> * still active.
> *
> * To be followed with qemuDomainObjExitAgent() once complete
> */
> qemuAgentPtr
> qemuDomainObjEnterAgent(virDomainObjPtr obj)
>
>
> > + ret = qemuAgentSetTimeout(agent, timeout);
> > + qemuDomainObjExitAgent(vm, agent);
>
> Also this API is inherently racy if you have two clients setting the
> timeout and it will influence calls of a different client.
As pointed in other part of the thread this is not a problem, rather
it's a fact about libvirt. You assume the management app knows what it
is doing. And if there are more management apps or users meddling with
libvirt there's always the risk of stepping on each others toes.
>
> IMO the only reasonable approach is to add new APIs which have a
> 'timeout' parameter for any agent API which requires tunable timeout to
> prevent any races and unexpected behaviour.
>
> Other possibility may be to add a qemu config file option for this but
> that is not really dynamic.
I guess the key question is whether we actually have a compelling
use case to vary the timeout on a per-command basis.
If not, then we could do fine with a global config that is either
recorded in the domain XML, or in the global QEMU config.
What I strive for is to keep the allowed timeout to minimum. Especially
because we have to query all the machines sequentially for the lack of
threading (either in libvirt or in oVirt). One machine taking too long
to respond can delay the queries for all the other machines.
The possible causes of slowness are
- Host is overloaded so the guest is not being scheduled
- Guest is overloaded so the agent is not being scheduled
This are generally not a problem for keeping the timeout to minimum. We
can expect that the next time we issue the query all will go well. And
if guest/host is overloaded for a long term there's not much we can do
about it (waiting longer for the command to complete won't help).
- The agent has crash/deadlocked
This reminds me a different issue which is there's no API to find out
whether the agent is connected or not. One either has to "give it a
shot" and handle failure, or check the domain XML (which is
inefficient).
- The agent is maliciously not responding
This is a interesting point, but I would put it in the same group as
"guest overloaded" from our POV.
- The command genuinely takes a long time to perform its action
Now this is a genuine problem. Again from our POV it makes sense to keep
long-running commands separate from the group of quick commands and run
those in a separate loop with higher timeout but less often.
The first 4 of those are fine with a global timeout either on guest or
in the driver.
Only the last one really pushes for having this per-public API.
Looking commands, ones I think can take a long time are
- FS trim - this can run for many minutes if there's alot to trim
- FS freeze - this might need to wait for apps to quiesce their I/O IIUC
- Guest exec - it can run any command
We don't really care about the timeout here. At least not at the moment.
The latter isn't used in libvirt, can be be run via the QGA
passthrough
api in libvirt-qemu.so
So sadly I think we genuinely do have a need for per-commad timeouts,
for at least some of the API calls. I don't think all of them need it
though.
I though we could likely add a qemu.conf setting for the globak timeout,
but then add a timeout to individual APIs in specific cases where we
know they can genuinely take a very long time to execute.
I would be against any timeout that cannot be changed via API. This
would not give us the flexibility we require.
Hope this helps,
Tomas
--
Tomáš Golembiovský <tgolembi(a)redhat.com>