On Mon, Jul 10, 2023 at 11:57:34AM +0200, Boris Fiuczynski wrote:
> On 7/5/23 4:47 PM, Daniel P. Berrangé wrote:
> > On Wed, Jul 05, 2023 at 04:27:46PM +0200, Boris Fiuczynski wrote:
> > > On 7/5/23 3:08 PM, Daniel P. Berrangé wrote:
> > > > On Wed, Jul 05, 2023 at 02:46:03PM +0200, Claudio Imbrenda wrote:
> > > > > On Wed, 5 Jul 2023 13:26:32 +0100
> > > > > Daniel P. Berrangé <berrange(a)redhat.com> wrote:
> > > > >
> > > > > [...]
> > > > >
> > > > > > > > I rather think mgmt apps need to explicitly opt-in to async teardown,
> > > > > > > > so they're aware that they need to take account of delayed RAM
> > > > > > > > availability in their accounting / guest placement logic.
> > > > > > >
> > > > > > > what would you think about enabling it by default only for guests that
> > > > > > > are capable to run in Secure Execution mode?
> > > > > >
> > > > > > IIUC, that's basically /all/ guests if running on new enough hardware
> > > > > > with prot_virt=1 enabled on the host OS, so will still present challenges
> > > > > > to mgmt apps needing to be aware of this behaviour AFAICS.
> > > > >
> > > > > I think there is some fencing still? I don't think it's automatic
> > > >
> > > > IIUC, the following sequence is possible
> > > >
> > > >   1. Start QEMU with -m 500G
> > > >      -> QEMU spawns async teardown helper process
> > > >   2. Stop QEMU
> > > >      -> Async teardown helper process remains running while
> > > >         kernel releases RAM
> > > >   3. Start QEMU with -m 500G
> > > >      -> Fails with ENOMEM
> > > >   ...time passes...
> > > >   4. Async teardown helper finally terminates
> > > >      -> The full original 500G is only now released for use
> > > >
> > > > Basically if you can't do
> > > >
> > > >   while true
> > > >   do
> > > >     virsh start $guest
> > > >     virsh stop $guest
> > > >   done
> > > >
> > > > then it is a change in libvirt API semantics, and so will require
> > > > explicit opt-in from the mgmt app to use this feature.
> > >
> > > What is your expectation if libvirt ["virsh stop $guest"] fails to wait for
> > > qemu to terminate e.g. after 20+ minutes. I think that libvirt does have a
> > > timeout trying to stop qemu and then gives up.
> > > Wouldn't you encounter the same problem that way?
> >
> > Yes, that would be a bug. We've tried to address these in the past.
> > For example, when there are PCI host devs assigned, the kernel takes
> > quite a bit longer to terminate QEMU. In that case, we extended the
> > timeout we wait for QEMU to exit.
> >
> > Essentially the idea is that when 'virsh destroy' returns we want the
> > caller to have a strong guarantee that all resources are released.
> > IOW, if it sees an error code the expectation is that QEMU has suffered
> > a serious problem - such as stuck in an uninterruptible sleep in kernel
> > space. We don't want the caller to see errors in "normal" scenarios.
> so the idea is to extend the wait until QEMU terminates?
> What is your proposal how to fix the bug?
There is no bug currently.
If virDomainDestroy returns success, then the caller is guaranteed
that QEMU has gone and all resources are released.
If virDomainDestroy returns failure, then the QEMU may or may not
be gone. They can call virDomainDestroy again, or monitor for the
domain lifecycle events to discover when it has finally gone and
all resources are released.
To be more amenable to mgmt apps, we want virDomainDestroy to
return success as frequently as is practical. If there are some
scenarios where we timeout because QEMU is too slow, then that's
not a bug, just a less desirable outcome.
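To illustrate the caller-side contract described above, here is a minimal Python sketch. The names destroy() and is_running() are hypothetical stand-ins for virDomainDestroy and a domain state query / lifecycle event, not real libvirt bindings:

```python
import time

# Hypothetical sketch of the mgmt-app pattern described above.
# destroy() stands in for virDomainDestroy: True means QEMU is gone
# and all resources are released; False means it may or may not be.
# is_running() stands in for a domain state query or lifecycle event.
def ensure_destroyed(destroy, is_running, poll=1.0, max_wait=600):
    """Return True once the domain is gone and resources are released."""
    if destroy():                  # success => strong guarantee, done
        return True
    deadline = time.time() + max_wait
    while time.time() < deadline:
        if not is_running():       # domain finally gone
            return True
        time.sleep(poll)
        destroy()                  # calling destroy again is safe
    return False                   # QEMU is likely stuck in the kernel
```

The point of the sketch is that a failed destroy is not fatal to the caller: it simply keeps polling (or waits for the lifecycle event) until the guarantee holds.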
> We had a scenario with a 2TB guest running NOT in Secure Execution mode
> whose termination resulted in libvirt giving up on terminating the guest
> after 40 seconds (10s SIGTERM and 30s SIGKILL) and systemd was able to
> "kill" the QEMU process after about 140s.
When you say systemd killed the process, do you mean this was when
libvirt talks to systemd to invoke "TerminateMachine" ? If so then
presumably virDomainDestroy would have returned success which is OK.
Or am I mis-understanding what you refer to here ?
> We could add additional time depending on the guest memory size BUT with
> Secure Execution the timeout would need to be increased by a two-digit
> factor. Also for libvirt it is not possible to detect if the guest is in
> Secure Execution mode.
What component is causing this 2 orders of magnitude delay in shutting
down a guest ? If the host can't tell if Secure Execution mode is
enabled or not, why would any code path be different & slower ?
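For illustration, the memory-scaled timeout Boris suggests might look like the sketch below. All numbers (base wait, per-64GiB increment, cap) are invented for the example, not libvirt's actual values:

```python
# Hypothetical sketch of a memory-scaled shutdown timeout, in the
# spirit of libvirt's extended wait for assigned PCI host devices.
# The base wait, per-64GiB increment and cap are invented numbers.
def shutdown_timeout_secs(mem_gib, base=30, per_64gib=10, cap=300):
    """Base SIGKILL wait plus extra time per 64 GiB of guest RAM."""
    return min(base + (mem_gib // 64) * per_64gib, cap)
```

Even with such scaling, a two-digit multiplier for Secure Execution guests would blow past any sane cap, which is why the thread keeps coming back to async teardown rather than ever-longer timeouts.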
> I also assume that timeouts of +1h are not acceptable. Wouldn't a long
> timeout cause other trouble like stalling "virsh list" run in parallel?
Well a 1 hour timeout is pretty insane, even with the async teardown
that's terrible as RAM is unable to be used for any new guest for
an incredibly long time.
AFAIR, 'virsh list' should not be stalled by virDomainDestroy, as we
release the exclusive locks during the wait loop.
With regards,
Daniel
--
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|