On Thu, Sep 30, 2021 at 16:17:44 -0400, Laine Stump wrote:
On 9/30/21 1:09 PM, Laurent Vivier wrote:
> If we want to save a snapshot of a VM to a file, we used to follow these
> steps:
>
> 1- stop the VM:
> (qemu) stop
>
> 2- migrate the VM to a file:
> (qemu) migrate "exec:cat > snapshot"
>
> 3- resume the VM:
> (qemu) cont
>
> After that we can restore the snapshot with:
> qemu-system-x86_64 ... -incoming "exec:cat snapshot"
> (qemu) cont
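For comparison, the same three steps can be driven over QMP instead of the HMP monitor. This is a minimal Python sketch that assumes a hypothetical `send` hook, which writes one QMP command to the monitor socket and returns the parsed reply. `save_to_file` and `wait_for_migration` are illustrative names, not QEMU or libvirt functions. HMP `migrate` waits for completion by default, but QMP `migrate` returns immediately, so the sketch polls `query-migrate` before issuing `cont`.

```python
import time
from typing import Any, Callable, Dict

# Hypothetical transport hook: writes one QMP command to the monitor socket
# and returns the parsed JSON reply.
Send = Callable[[Dict[str, Any]], Dict[str, Any]]


def wait_for_migration(send: Send, poll_s: float = 0.1) -> str:
    """Poll query-migrate until the migration reaches a final state."""
    while True:
        status = send({"execute": "query-migrate"})["return"].get("status")
        if status in ("completed", "failed", "cancelled"):
            return status
        time.sleep(poll_s)


def save_to_file(send: Send, path: str, poll_s: float = 0.1) -> str:
    """Steps 1-3: stop, migrate the paused guest into a file, cont."""
    send({"execute": "stop"})
    # QMP "migrate" returns at once, unlike the HMP command, so poll for the end.
    send({"execute": "migrate", "arguments": {"uri": f"exec:cat > {path}"}})
    status = wait_for_migration(send, poll_s)
    send({"execute": "cont"})  # resume whether or not the save succeeded
    return status
```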
These are the basics of what libvirt does for a snapshot, and steps 1+2 are
what it does for a "managedsave" (where it saves the snapshot to disk and
then terminates the qemu process, for later re-animation).
In those cases, it seems like this new parameter could work for us - instead
of explicitly pausing the guest prior to migrating it to disk, we would set
this new parameter to on, then directly migrate-to-disk (relying on qemu to
do the pause). Care will need to be taken to ensure that error recovery
behaves the same, though.
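Here is a rough sketch of how the managedsave path might look with the new parameter. It assumes that the QMP spelling of "pause-vm" is a boolean `pause-vm` argument to `migrate-set-parameters`, which is inferred from the HMP command in the patch and not confirmed. It also assumes a hypothetical `send` transport hook. On failure it resumes the guest, mirroring the stop-based flow. Whether that recovery is still sufficient once QEMU itself did the pause is the open question.

```python
import time
from typing import Any, Callable, Dict

# Hypothetical transport hook: sends one QMP command, returns the parsed reply.
Send = Callable[[Dict[str, Any]], Dict[str, Any]]


def managed_save(send: Send, path: str, poll_s: float = 0.1) -> bool:
    """Save the guest to `path` and quit QEMU; resume the guest on failure."""
    # ASSUMPTION: QMP form of the "pause-vm" parameter proposed in this patch.
    send({"execute": "migrate-set-parameters",
          "arguments": {"pause-vm": True}})
    # No explicit "stop": QEMU is expected to pause once the card is unplugged.
    send({"execute": "migrate", "arguments": {"uri": f"exec:cat > {path}"}})
    while True:
        status = send({"execute": "query-migrate"})["return"].get("status")
        if status in ("completed", "failed", "cancelled"):
            break
        time.sleep(poll_s)
    if status == "completed":
        send({"execute": "quit"})  # state is on disk; terminate QEMU
        return True
    # Same recovery as the stop-based flow: resume (and, per the patch,
    # re-plug the primary card) so the guest is not left paused.
    send({"execute": "cont"})
    return False
```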
Yup, see below ...
There are a couple of cases when libvirt apparently *doesn't* pause the
guest during the migrate-to-disk, both having to do with saving a coredump
of the guest. Since I really have no idea of how common/important that is
In most cases when doing a coredump the guest is paused because of an
emulation/guest error.
One example where the VM is not paused is a 'live' snapshot. It wastes
disk space and is not commonly used, though.
Where it might become interesting is with the 'background-snapshot'
migration flag. Ideally, failover will be fixed to work properly with
that one too. In such a case we don't want to pause the VM (but we have to:
AFAIK, background-snapshot migration can't yet be done as part of a
'transaction', so we need to pause the VM to kick off the migration
(memory snapshot) and then snapshot the disks).
(or even if my assessment of the code is correct), I'm Cc'ing this patch to
libvir-list to make sure it catches the attention of someone who knows the
answers and implications.
Well, Cc'ing relevant patches to libvirt is always good, especially if
we'll need to adapt the code to support the new feature.
> But when failover is configured, it doesn't work anymore.
>
> As failover needs to ask the guest OS to unplug the card,
> the machine cannot be paused.
>
> This patch introduces a new migration parameter, "pause-vm", that
> asks the migration to pause the VM during the migration startup
> phase after the card is unplugged.
Is there a time limit to this? If guest interaction is required it might
take unbounded time.
In case of snapshots the expectation from the user is that the state
capture happens "reasonably" immediately after issuing the command. If
we introduce a possibly unbounded wait time, it will need a
re-imagining of the snapshot workflow and the feature will need to be an
opt-in.
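As an illustration of the concern only, the following sketches one way a management layer could bound the wait. It polls `query-migrate` and issues `migrate_cancel` if the migration is still in QEMU's `wait-unplug` state after a caller-chosen deadline, measured from the start of the call. The timeout policy, the function name and the `send` hook are hypothetical. Neither QEMU nor libvirt is claimed to behave this way.

```python
import time
from typing import Any, Callable, Dict

# Hypothetical transport hook: sends one QMP command, returns the parsed reply.
Send = Callable[[Dict[str, Any]], Dict[str, Any]]

FINAL_STATES = ("completed", "failed", "cancelled")


def wait_with_unplug_deadline(send: Send, unplug_timeout_s: float,
                              poll_s: float = 0.1) -> str:
    """Poll query-migrate, cancelling if the guest is slow to unplug the card.

    The deadline is a hypothetical policy, counted from the start of the call.
    """
    deadline = time.monotonic() + unplug_timeout_s
    cancel_sent = False
    while True:
        status = send({"execute": "query-migrate"})["return"].get("status")
        if status in FINAL_STATES:
            return status
        # "wait-unplug" is reported while failover waits for the guest OS.
        if (status == "wait-unplug" and not cancel_sent
                and time.monotonic() >= deadline):
            send({"execute": "migrate_cancel"})
            cancel_sent = True  # keep polling until QEMU reports "cancelled"
        time.sleep(poll_s)
```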
>
> Once the migration is done, we only need to resume the VM with
> "cont" and the card is plugged back:
>
> 1- set the parameter:
> (qemu) migrate_set_parameter pause-vm on
>
> 2- migrate the VM to a file:
> (qemu) migrate "exec:cat > snapshot"
>
> The primary failover card (VFIO) is unplugged and the VM is paused.
>
> 3- resume the VM:
> (qemu) cont
>
> The VM restarts and the primary failover card is plugged back in.
>
> The VM state sent in the migration stream is "paused", which means that
> when the snapshot is loaded, or when the stream is sent to a destination
> QEMU, the VM needs to be resumed manually.
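The proposed flow, including the manual resume on the restore side, might look like this over QMP. This is a sketch only. The `pause-vm` argument to `migrate-set-parameters` is assumed to be the QMP form of the HMP command above. `src` and `dst` are hypothetical transport hooks for the source QEMU and for a QEMU started with `-incoming "exec:cat snapshot"`. The sketch also assumes that `query-migrate` on the destination reports the incoming stream's status.

```python
import time
from typing import Any, Callable, Dict

# Hypothetical transport hook: sends one QMP command, returns the parsed reply.
Send = Callable[[Dict[str, Any]], Dict[str, Any]]


def wait_migration(send: Send, poll_s: float) -> str:
    """Poll query-migrate until the migration reaches a final state."""
    while True:
        status = send({"execute": "query-migrate"})["return"].get("status")
        if status in ("completed", "failed", "cancelled"):
            return status
        time.sleep(poll_s)


def snapshot_with_pause_vm(src: Send, path: str, poll_s: float = 0.1) -> str:
    """Steps 1-3 of the proposed flow on the source QEMU."""
    # ASSUMPTION: QMP spelling of "migrate_set_parameter pause-vm on".
    src({"execute": "migrate-set-parameters", "arguments": {"pause-vm": True}})
    src({"execute": "migrate", "arguments": {"uri": f"exec:cat > {path}"}})
    status = wait_migration(src, poll_s)
    src({"execute": "cont"})  # per the patch: resumes and re-plugs the card
    return status


def restore(dst: Send, poll_s: float = 0.1) -> str:
    """Wait for the incoming stream, then resume the guest by hand."""
    status = wait_migration(dst, poll_s)
    if status == "completed":
        dst({"execute": "cont"})  # the stream records "paused"
    return status
```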
This is not a problem; libvirt already deals with this internally
anyway.