On Fri, Oct 04, 2019 at 08:14:39AM +0200, Peter Krempa wrote:
On Thu, Oct 03, 2019 at 16:52:12 -0500, Jonathon Jongsma wrote:
> Some layered products such as oVirt have requested a way to avoid being
> blocked by guest agent commands when querying a loaded vm. For example,
> many guest agent commands are polled periodically to monitor changes,
> and rather than blocking the calling process, they'd prefer to simply
> time out when an agent query is taking too long.
>
> This patch adds a way for the user to specify a custom agent timeout
> that is applied to all agent commands.
>
> One special case to note here is the 'guest-sync' command.
'guest-sync'
> is issued internally prior to calling any other command. (For example,
> when libvirt wants to call 'guest-get-fsinfo', we first call
> 'guest-sync' and then call 'guest-get-fsinfo').
>
> Previously, the 'guest-sync' command used a 5-second timeout
> (VIR_DOMAIN_QEMU_AGENT_COMMAND_DEFAULT), whereas the actual command that
> followed always blocked indefinitely
> (VIR_DOMAIN_QEMU_AGENT_COMMAND_BLOCK). As part of this patch, if a
> custom timeout is specified that is shorter than
> 5 seconds, this new timeout also used for 'guest-sync'. If there is no
> custom timeout or if the custom timeout is longer than 5 seconds, we
> will continue to use the 5-second timeout.
>
> See
https://bugzilla.redhat.com/show_bug.cgi?id=1705426 for additional details.
>
> Signed-off-by: Jonathon Jongsma <jjongsma(a)redhat.com>
> @@ -2737,3 +2730,19 @@ qemuAgentGetTimezone(qemuAgentPtr mon,
>
> return 0;
> }
> +
> +int
> +qemuAgentSetTimeout(qemuAgentPtr mon,
> + int timeout)
> +{
> + if (timeout < VIR_DOMAIN_QEMU_AGENT_COMMAND_MIN) {
> + virReportError(VIR_ERR_INVALID_ARG,
> + _("guest agent timeout '%d' is "
> + "less than the minimum '%d'"),
> + timeout, VIR_DOMAIN_QEMU_AGENT_COMMAND_MIN);
This error is misleading as -1 and -2 are special values and not actual
timeout.
> + return -1;
> + }
> +
> + mon->timeout = timeout;
> + return 0;
> +}
[...]
> diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
> index 1e041a8bac..09251cc9e2 100644
> --- a/src/qemu/qemu_driver.c
> +++ b/src/qemu/qemu_driver.c
> @@ -23434,6 +23434,29 @@ qemuDomainGetGuestInfo(virDomainPtr dom,
> return ret;
> }
>
> +static int
> +qemuDomainQemuAgentSetTimeout(virDomainPtr dom,
> + int timeout)
> +{
> + virDomainObjPtr vm = NULL;
> + qemuAgentPtr agent;
> + int ret = -1;
> +
> + if (!(vm = qemuDomObjFromDomain(dom)))
> + goto cleanup;
> +
> + if (virDomainQemuAgentSetTimeoutEnsureACL(dom->conn, vm->def) < 0)
> + goto cleanup;
> +
> + agent = qemuDomainObjEnterAgent(vm);
You must acquire a job on @vm if you want to call this:
/*
* obj must be locked before calling
*
* To be called immediately before any QEMU agent API call.
* Must have already called qemuDomainObjBeginAgentJob() or
* qemuDomainObjBeginJobWithAgent() and checked that the VM is
* still active.
*
* To be followed with qemuDomainObjExitAgent() once complete
*/
qemuAgentPtr
qemuDomainObjEnterAgent(virDomainObjPtr obj)
> + ret = qemuAgentSetTimeout(agent, timeout);
> + qemuDomainObjExitAgent(vm, agent);
Also this API is inherently racy if you have two clients setting the
timeout and it will influence calls of a different client.
IMO the only reasonable approach is to add new APIs which have a
'timeout' parameter for any agent API which requires tunable timeout to
prevent any races and unexpected behaviour.
Other possibility may be to add a qemu config file option for this but
that is not really dynamic.
I guess the key question is whether we actually have a compelling
use case to vary the timeout on a per-command basis.
If not, then we could do fine with a global config that is either
recorded in the domain XML, or in the global QEMU config.
The possible causes of slowness are
- Host is overloaded so the guest is not being scheduled
- Guest is overloaded so the agent is not being scheduled
- The agent has crash/deadlocked
- The agent is maliciously not responding
- The command genuinely takes a long time to perform its action
The first 4 of those are fine with a global timeout either on guest or
in the driver.
Only the last one really pushes for having this per-public API.
Looking commands, ones I think can take a long time are
- FS trim - this can run for many minutes if there's alot to trim
- FS freeze - this might need to wait for apps to quiesce their I/O IIUC
- Guest exec - it can run any command
The latter isn't used in libvirt, can be be run via the QGA passthrough
api in libvirt-qemu.so
So sadly I think we genuinely do have a need for per-commad timeouts,
for at least some of the API calls. I don't think all of them need it
though.
I though we could likely add a qemu.conf setting for the globak timeout,
but then add a timeout to individual APIs in specific cases where we
know they can genuinely take a very long time to execute.
We must also ensure we NEVER block any regular libvirt APIs when a
guest agent comamnd is running.
Regards,
Daniel
--
|:
https://berrange.com -o-
https://www.flickr.com/photos/dberrange :|
|:
https://libvirt.org -o-
https://fstop138.berrange.com :|
|:
https://entangle-photo.org -o-
https://www.instagram.com/dberrange :|