Hi,
On 8/19/19 6:16 AM, Ján Tomko wrote:
On Sun, Aug 18, 2019 at 06:45:29PM -0300, Daniel Henrique Barboza
wrote:
> For some architectures and setups, device removal can take
> longer than the default 5 seconds. This results in commands
> such as 'virsh setvcpus' to fire timeout messages even if
> the actual operation happened in the guest, causing confusion
> for the user.
>
The commit that introduced this error message:
commit e3229f6e4461cd1721dc68a32e16ab1718ae716e
qemu: hotplug: Add support for VCPU unplug
specifically says that we treat this differently than regular device
detach:
As the new code is using device_del all the implications of using it
are present. Contrary to the device deletion code, the vcpu deletion
code fails if the unplug request is not executed in time.
Technically, we already did execute the unplug request so we lie to the
user saying "operation failed".
Maybe we can revisit the decision? [cc-ing pkrempa who added this]
I have thought about making setvcpus asynchronous when it is a
device_del operation. This would be more code (perhaps a new command
to do that? Or a --asynchronous option?), and we can set user expectations
properly by making it clear that this is a unplug request command, and the
user will need to check the result personally in the guest.
But then, if we are prepared to tell the user "go check yourself" we can
simply
change the current timeout message to say "vcpu unplug [...] timeout,
check unplug status in the guest". This would be clearer, and the user
wouldn't automatically assume that timeout == operation failed.
Another thing we can do, instead of exposing the option to the user (which
has a good potential for disaster - hence why I added a minimal value), is
to simply set the timeout to a greater value (10? 15?) and be done with it.
If we set the timeout to 15 seconds and change the timeout message to
let the user know "we don't know, you'll need to check", like I said
above,
we have more resilience and will not alarm the user if a timeout still
occurs.
We'll also avoid exposing the timeout configuration like I'm doing here.
> This patch adds a new qemu.conf parameter called
'unplug_timeout'
> to handle these cases. If left unset, the current default
> timeout is used. To avoid user 'experimentation' with small
> timeouts, the current timeout is also the minimal value
> allowed.
>
The reason for this timeout is that we originally promised something
that we cannot deliver - a synchronous device detach API, while the
operation itself is asynchronous. I'm not a fan of exposing it and
making it configurable.
> Signed-off-by: Daniel Henrique Barboza <danielhb413(a)gmail.com>
> ---
> src/qemu/libvirtd_qemu.aug | 3 +++
> src/qemu/qemu.conf | 4 ++++
> src/qemu/qemu_conf.c | 26 ++++++++++++++++++++++++++
> src/qemu/qemu_conf.h | 5 +++++
> src/qemu/qemu_driver.c | 2 ++
> src/qemu/test_libvirtd_qemu.aug.in | 1 +
> 6 files changed, 41 insertions(+)
>
[...]
> diff --git a/src/qemu/qemu_conf.h b/src/qemu/qemu_conf.h
> index 0cbddd7a9c..29824e4e35 100644
> --- a/src/qemu/qemu_conf.h
> +++ b/src/qemu/qemu_conf.h
> @@ -214,6 +214,8 @@ struct _virQEMUDriverConfig {
> gid_t swtpm_group;
>
> char **capabilityfilters;
> +
> + unsigned int unplugTimeout;
> };
>
> /* Main driver state */
> @@ -294,6 +296,9 @@ struct _virQEMUDriver {
>
> /* Immutable pointer, self-locking APIs */
> virHashAtomicPtr migrationErrors;
> +
> + /* Immutable value */
> + unsigned int unplugTimeout;
> };
>
Why store this value twice?
I wanted the value to be available at the driver object, but saw that the
parsing of the reading file put stuff in config.
However, just realized that we get to the cfg via qemu_driver->config
(as long as the lock is being held, which I think it's the case for all
unplug
operations).
In case we still want this configurable timeout solution, I'll fix this
in the
next spin.
Thanks,
DHB
Jano