Re: [libvirt PATCH v2 81/81] RFC: qemu: Keep vCPUs paused while migration is in postcopy-paused

Monday, 6 June 2022

On Wed, Jun 01, 2022 at 02:50:21PM +0200, Jiri Denemark wrote:
...
 QEMU keeps guest CPUs running even in postcopy-paused migration state so
 that processes that already have all memory pages they need migrated to
 the destination can keep running. However, this behavior might bring
 unexpected delays in interprocess communication as some processes will
 be stopped until migration is recovered and their memory pages migrated.
 So let's make sure all guest CPUs are paused while postcopy migration is
 paused.
 ---

 Notes:
     Version 2:
     - new patch

     - this patch does not currently work as QEMU cannot handle "stop"
       QMP command while in postcopy-paused state... the monitor just
       hangs (see https://gitlab.com/qemu-project/qemu/-/issues/1052 )
     - an ideal solution of the QEMU bug would be if QEMU itself paused
       the CPUs for us and we just got notified about it via QMP events
     - but Peter Xu thinks this behavior is actually worse than keeping
       vCPUs running

I'd like to know what the rationale is here?

We've got a long history knowing the behaviour and impact when
pausing a VM as a whole. Of course some apps may have timeouts
that are hit if the paused time was too long, but overall this
scenario is not that different from a bare metal machine doing
suspend-to-ram. Application impact is limited & predictable and
generally well understood.

I don't think we can say the same about the behaviour & impact
on the guest OS if we selectively block execution of random
CPUs.  An OS where a certain physical CPU simply stops executing
is not a normal scenario that any application or OS is designed
to expect. I think the chance of the guest OS or application
breaking in a non-recoverable way is high. IOW, we might perform
post-copy recovery and all might look well from host POV, but
the guest OS/app is nonetheless broken.

The overriding goal for migration has to be to minimize the
danger to the guest OS and its applications, and I think that's
only viable if either the guest OS is running all CPUs or no
CPUs.
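
To make the "all CPUs or no CPUs" rule concrete, here is a minimal
sketch in Python. It is only a model of the policy: the enum and both
function names are hypothetical and are not libvirt or QEMU API. The
state strings mirror QEMU's migration status names, and the choice to
pause everything in postcopy-paused follows what the patch proposes.

```python
from enum import Enum


class MigrationState(Enum):
    # Illustrative subset of migration states; not an API binding.
    ACTIVE = "active"
    POSTCOPY_ACTIVE = "postcopy-active"
    POSTCOPY_PAUSED = "postcopy-paused"
    COMPLETED = "completed"


def vcpus_should_run(state):
    """Return True if every vCPU should run, False if every vCPU should be paused."""
    # Assumption: only a broken post-copy transport warrants pausing the guest.
    return state is not MigrationState.POSTCOPY_PAUSED


def apply_policy(state, vcpu_running):
    """Return new per-vCPU running flags, identical for every vCPU.

    The input flags may be mixed (some vCPUs blocked on missing pages);
    the output never is, which is the point of the policy.
    """
    run = vcpus_should_run(state)
    return [run for _ in vcpu_running]
```

For example, apply_policy(MigrationState.POSTCOPY_PAUSED, [True, False, True])
yields a list in which every entry is False, so the guest never sees a
state where only some of its CPUs execute.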

The length of outage for a CPU when post-copy transport is broken
is potentially orders of magnitude larger than the temporary
blockage while fetching a memory page asynchronously. The latter
is obviously not good for real-time sensitive apps, but most apps
and OSes will cope with CPUs being stalled for hundreds of milliseconds.
That isn't the case if CPUs get stalled for minutes, or even hours,
at a time due to a broken network link needing admin recovery work
in the host infra.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
