Re: [libvirt] [RFC] Add API to change qemu agent response timeout

Friday, 4 October 2019

On Fri, Oct 04, 2019 at 08:14:39AM +0200, Peter Krempa wrote:
...
 On Thu, Oct 03, 2019 at 16:52:12 -0500, Jonathon Jongsma wrote:
 > Some layered products such as oVirt have requested a way to avoid being
 > blocked by guest agent commands when querying a loaded vm. For example,
 > many guest agent commands are polled periodically to monitor changes,
 > and rather than blocking the calling process, they'd prefer to simply
 > time out when an agent query is taking too long.
 > 
 > This patch adds a way for the user to specify a custom agent timeout
 > that is applied to all agent commands.
 > 
 > One special case to note here is the 'guest-sync' command.
'guest-sync'
 > is issued internally prior to calling any other command. (For example,
 > when libvirt wants to call 'guest-get-fsinfo', we first call
 > 'guest-sync' and then call 'guest-get-fsinfo').
 > 
 > Previously, the 'guest-sync' command used a 5-second timeout
 > (VIR_DOMAIN_QEMU_AGENT_COMMAND_DEFAULT), whereas the actual command that
 > followed always blocked indefinitely
 > (VIR_DOMAIN_QEMU_AGENT_COMMAND_BLOCK). As part of this patch, if a
 > custom timeout is specified that is shorter than
 > 5 seconds,  this new timeout also used for 'guest-sync'. If there is no
 > custom timeout or if the custom timeout is longer than 5 seconds, we
 > will continue to use the 5-second timeout.
 > 
 > See https://bugzilla.redhat.com/show_bug.cgi?id=1705426 for additional details.
 > 
 > Signed-off-by: Jonathon Jongsma <jjongsma(a)redhat.com&gt; 

...
 > @@ -2737,3 +2730,19 @@ qemuAgentGetTimezone(qemuAgentPtr mon,
 >  
 >      return 0;
 >  }
 > +
 > +int
 > +qemuAgentSetTimeout(qemuAgentPtr mon,
 > +                    int timeout)
 > +{
 > +    if (timeout < VIR_DOMAIN_QEMU_AGENT_COMMAND_MIN) {
 > +        virReportError(VIR_ERR_INVALID_ARG,
 > +                       _("guest agent timeout '%d' is "
 > +                         "less than the minimum '%d'"),
 > +                       timeout, VIR_DOMAIN_QEMU_AGENT_COMMAND_MIN);

 This error is misleading as -1 and -2 are special values and not actual
 timeout.

 > +        return -1;
 > +    }
 > +
 > +    mon->timeout = timeout;
 > +    return 0;
 > +}

 [...]

 > diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
 > index 1e041a8bac..09251cc9e2 100644
 > --- a/src/qemu/qemu_driver.c
 > +++ b/src/qemu/qemu_driver.c
 > @@ -23434,6 +23434,29 @@ qemuDomainGetGuestInfo(virDomainPtr dom,
 >      return ret;
 >  }
 >  
 > +static int
 > +qemuDomainQemuAgentSetTimeout(virDomainPtr dom,
 > +                              int timeout)
 > +{
 > +    virDomainObjPtr vm = NULL;
 > +    qemuAgentPtr agent;
 > +    int ret = -1;
 > +
 > +    if (!(vm = qemuDomObjFromDomain(dom)))
 > +        goto cleanup;
 > +
 > +    if (virDomainQemuAgentSetTimeoutEnsureACL(dom->conn, vm->def) < 0)
 > +        goto cleanup;
 > +
 > +    agent = qemuDomainObjEnterAgent(vm);

 You must acquire a job on @vm if you want to call this:

 /*
  * obj must be locked before calling
  *
  * To be called immediately before any QEMU agent API call.
  * Must have already called qemuDomainObjBeginAgentJob() or
  * qemuDomainObjBeginJobWithAgent() and checked that the VM is
  * still active.
  *
  * To be followed with qemuDomainObjExitAgent() once complete
  */
 qemuAgentPtr
 qemuDomainObjEnterAgent(virDomainObjPtr obj)

 > +    ret = qemuAgentSetTimeout(agent, timeout);
 > +    qemuDomainObjExitAgent(vm, agent);

 Also this API is inherently racy if you have two clients setting the
 timeout and it will influence calls of a different client.

 IMO the only reasonable approach is to add new APIs which have a
 'timeout' parameter for any agent API which requires tunable timeout to
 prevent any races and unexpected behaviour.

 Other possibility may be to add a qemu config file option for this but
 that is not really dynamic. 
I guess the key question is whether we actually have a compelling
use case to vary the timeout on a per-command basis.

If not, then we could do fine with a global config that is either
recorded in the domain XML, or in the global QEMU config.

The possible causes of slowness are

 - Host is overloaded so the guest is not being scheduled
 - Guest is overloaded so the agent is not being scheduled
 - The agent has crash/deadlocked
 - The agent is maliciously not responding
 - The command genuinely takes a long time to perform its action

The first 4 of those are fine with a global timeout either on guest or
in the driver.

Only the last one really pushes for having this per-public API.

Looking commands, ones I think can take a long time are

 - FS trim - this can run for many minutes if there's alot to trim
 - FS freeze - this might need to wait for apps to quiesce their I/O IIUC
 - Guest exec - it can run any command

The latter isn't used in libvirt, can be be run via the QGA passthrough
api in libvirt-qemu.so

So sadly I think we genuinely do have a need for per-commad timeouts,
for at least some of the API calls. I don't think all of them need it
though.

I though we could likely add a qemu.conf setting for the globak timeout,
but then add a timeout to individual APIs in specific cases where we
know they can genuinely take a very long time to execute.

We must also ensure we NEVER block any regular libvirt APIs when a
guest agent comamnd is running.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

Re: [libvirt] [RFC] Add API to change qemu agent response timeout