Re: [libvirt PATCH 06/80] qemu: Keep domain running on dst on failed post-copy migration

11 May 2022

On Wed, May 11, 2022 at 01:03:43PM +0200, Peter Krempa wrote:
...
On Wed, May 11, 2022 at 11:39:29 +0100, Daniel P. Berrangé wrote:
...
On Wed, May 11, 2022 at 10:48:10AM +0200, Peter Krempa wrote:
...
On Tue, May 10, 2022 at 17:20:27 +0200, Jiri Denemark wrote:
...
There's no need to artificially pause a domain when post-copy fails. The
virtual CPUs may continue running, only the guest tasks that decide to
read a page which has not been migrated yet will get blocked.
IMO not pausing the VM is a policy decision (same way as pausing it was
though) and should be user-configurable at migration start.
I can see that users might want to prevent a half-broken VM from
executing until it gets the attention needed to fix it, even when it's safe
from a "theoretical" standpoint.
It isn't even safe from a theoretical standpoint though.
Consider 2 processes in a guest that are communicating with each
other. One gets blocked on a page read due to broken post-copy, but
we leave the guest running. The other process sees no progress
from the blocked process and/or hits a timeout and throws an
error. As a result the guest application workload ends up
completely dead, even if we later recover the post-copy
migration.
IMO you have to deal with this scenario in a reduced scope anyways when
opting into using post-copy.
Each page transfer is vastly slower than the comparable access into
memory, so if the 'timeout' portion is implied to be on the same order
of magnitude as memory access latency then your software is going to have
a very bad time when being migrated in post-copy mode. If the link gets
congested ... then it's even worse.
Those are likely very different orders of magnitude though. A "slow"
page access in post-copy is $LOW seconds. A blocked process due to
a broken post-copy connection is potentially $HIGH minutes long if
the infra takes a long time to fix.

A page access taking seconds rather than microseconds isn't
going to trip up many app level timeouts IMHO.

A process blocked for many minutes is highly likely to trigger
app level timeouts.
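
To make the difference concrete, here is a rough sketch of the two
cases, not anything from libvirt or QEMU. The function name
peer_outcome, the "worker" thread standing in for the blocked guest
process, and all the durations are assumptions picked for illustration
(and scaled down so it runs quickly). The peer waits on a reply with an
app-level timeout; a short stall goes unnoticed, a long one trips it.

```python
import queue
import threading
import time


def peer_outcome(stall_s: float, timeout_s: float) -> str:
    """Return "ok" if the stalled worker replies before the peer's
    timeout expires, otherwise "timeout"."""
    replies: queue.Queue = queue.Queue()

    def worker() -> None:
        # Stand-in for a guest process blocked on a page that has not
        # been migrated yet; the sleep models how long it stays blocked.
        time.sleep(stall_s)
        replies.put("done")

    # Daemon thread so a still-sleeping worker does not keep the
    # interpreter alive after the peer has already given up.
    threading.Thread(target=worker, daemon=True).start()

    try:
        # The peer process waiting for progress, with an app-level timeout.
        replies.get(timeout=timeout_s)
        return "ok"
    except queue.Empty:
        return "timeout"


if __name__ == "__main__":
    APP_TIMEOUT = 0.5  # hypothetical app-level timeout, scaled down

    # Stall far below the timeout: analogous to a slow post-copy page access.
    print(peer_outcome(0.05, APP_TIMEOUT))

    # Stall far above the timeout: analogous to a broken post-copy connection.
    print(peer_outcome(1.0, APP_TIMEOUT))
```

Recovering the "worker" later doesn't help in the second case, the
peer has already errored out by then.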

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|