On Sat, May 07, 2022 at 03:42:53PM +0200, Claudio Fontana wrote:
> This is v8 of the multifd save prototype, which fixes a few bugs,
> adds a few more code splits, and records the number of channels
> as well as the compression algorithm, so the restore command is
> more user-friendly.
>
> It is now possible to just say:
>
> virsh save mydomain /mnt/saves/mysave --parallel
>
> virsh restore /mnt/saves/mysave --parallel
>
> and things work with the default of 2 channels, no compression.
>
> It is of course also possible to say:
>
> virsh save mydomain /mnt/saves/mysave --parallel \
>     --parallel-connections 16 --parallel-compression zstd
>
> virsh restore /mnt/saves/mysave --parallel
>
> and things also work fine, due to channels and compression
> being stored in the main save file.
>
> For the sake of people following along, the above commands will
> result in the creation of multiple files:
>
> /mnt/saves/mysave
> /mnt/saves/mysave.0
> /mnt/saves/mysave.1
> ...
> /mnt/saves/mysave.n
>
> where 'n' is the number of threads used.

Overall I'm not very happy with the approach of doing any of this
on the libvirt side.

Backing up, we know that QEMU can directly save to disk faster than
libvirt can. We mitigated a lot of that overhead with previous patches
to increase the pipe buffer size, but some still remains due to the
extra copies inherent in handing this off to libvirt.

Using multifd on the libvirt side, IIUC, gets us better performance
than QEMU can manage if doing a non-multifd write to file directly,
but we still have the extra copies in there due to the hand off
to libvirt. If QEMU were directly capable of writing to
disk with multifd, it should beat us again.

As a result of how we integrate with QEMU multifd, we're taking the
approach of saving the state across multiple files, because it is
easier than trying to get multiple threads writing to the same file.

It could be solved by using file range locking on the save file,
e.g. a thread can reserve say 500 MB of space, fill it up, and then
reserve another 500 MB, etc. It is a bit tedious though, and
won't align nicely: e.g. a 1 GB huge page would be 1 GB plus a few
bytes of QEMU RAM save state header.

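As a minimal sketch of that reservation scheme (illustrative C only,
not libvirt code; the chunk size, names and helpers are all invented
for this example), writer threads could carve disjoint ranges out of
one file from a shared atomic offset and fill them with pwrite(),
needing no locks at all:

  #include <stdatomic.h>
  #include <stdint.h>
  #include <unistd.h>

  #define CHUNK_SIZE (500ULL * 1024 * 1024)   /* 500 MB reservation unit */

  static _Atomic uint64_t next_offset;        /* shared by writer threads */

  /* Reserve the next free CHUNK_SIZE byte range in the file. */
  static uint64_t reserve_chunk(void)
  {
      return atomic_fetch_add(&next_offset, CHUNK_SIZE);
  }

  /* Each writer fills its reserved range, then reserves another. */
  static int writer_loop(int fd, const char *buf, size_t len)
  {
      size_t done = 0;
      while (done < len) {
          uint64_t base = reserve_chunk();
          size_t n = len - done < CHUNK_SIZE ? len - done : CHUNK_SIZE;
          /* pwrite() lets threads write disjoint ranges concurrently */
          if (pwrite(fd, buf + done, n, base) != (ssize_t)n)
              return -1;
          done += n;
      }
      return 0;
  }

This removes the need for multiple files, but not the alignment
problem noted above: page boundaries still won't line up with the
reservation boundaries.
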
The other downside of multiple files is that it complicates life
for both libvirt and apps using libvirt. They need to be aware of
multiple files and move them around together. This is not as simple
as it might sound. For example, IIRC OpenStack would upload a saved
state image into a glance bucket for later use. Well, now it needs
multiple distinct buckets and has to keep track of them all. It also
means we're forced to use the same concurrency level when restoring,
which is not necessarily desirable if the host environment is
different when restoring, i.e. the original host might have had
8 CPUs, but the new host might only have 4 available, or vice versa.

I know it is appealing to do something on the libvirt side, because
it is quicker than getting an enhancement into a new QEMU release. We
have been down this route before though, with the migration support
in libvirt, when we introduced tunnelled live migration in order to
work around QEMU's inability to do TLS encryption. I very much regret
that we ever did this, because tunnelled migration was inherently
limited, so for example it failed to work with multifd, and failed
to work with NBD based disk migration. In the end I did what I should
have done at the beginning and just added TLS support to QEMU, making
tunnelled migration obsolete, except we still have to carry the code
around in libvirt indefinitely due to apps using it.

So I'm very concerned that history will repeat itself here and
leave us with a long term burden for a solution that turns out
to be an evolutionary dead end.

I like the idea of parallel saving, but I really think we need to
implement this directly in QEMU, not libvirt. As previously
mentioned, I think QEMU needs to get a 'file' migration protocol,
along with the ability to directly map RAM segments into fixed
positions in the file (a rough sketch of that mapping follows the
list below). The benefits are many:

- It will save & restore faster because we're eliminating data
  copies that libvirt imposes via the iohelper

- It is simple for libvirt & mgmt apps as we still only
  have one file to manage

- It is space efficient because if a guest dirties a
  memory page, we just overwrite the existing contents
  at the fixed location in the file, instead of appending
  new contents to the file

- It will restore faster too because we only restore
  each memory page once, due to always overwriting the
  file in-place when the guest dirtied a page during save

- It can save and restore with differing numbers of threads,
  and can even dynamically change the number of threads
  in the middle of the save/restore operation

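To make the fixed-position mapping concrete, here is a rough sketch
(hypothetical C, not actual QEMU code; the layout constants and
helper names are invented for illustration). Each guest RAM page
owns a file slot derived purely from its page index, so a re-dirtied
page is overwritten in place, and the save and restore thread counts
are fully independent of each other:

  #include <stdint.h>
  #include <unistd.h>

  #define SAVE_PAGE_SIZE 4096ULL
  #define SAVE_RAM_BASE  (1ULL * 1024 * 1024)  /* RAM area after a header */

  /* A page's slot in the file never moves, however often it is dirtied. */
  static inline uint64_t page_file_offset(uint64_t page_index)
  {
      return SAVE_RAM_BASE + page_index * SAVE_PAGE_SIZE;
  }

  /* Save side: any number of threads may overwrite disjoint slots. */
  static int save_page(int fd, uint64_t page_index, const void *page)
  {
      ssize_t n = pwrite(fd, page, SAVE_PAGE_SIZE,
                         page_file_offset(page_index));
      return n == (ssize_t)SAVE_PAGE_SIZE ? 0 : -1;
  }

  /* Restore side: each page is read exactly once, from its fixed slot,
   * with a thread count chosen independently of the save side. */
  static int load_page(int fd, uint64_t page_index, void *page)
  {
      ssize_t n = pread(fd, page, SAVE_PAGE_SIZE,
                        page_file_offset(page_index));
      return n == (ssize_t)SAVE_PAGE_SIZE ? 0 : -1;
  }

Because the slot is a pure function of the page index, no writer
ever needs to coordinate with another, which is what makes the
dynamic thread count change possible.
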
As David G has pointed out, the implementation is not trivial on the
QEMU side, but from what I understand of the migration code, it is
certainly viable. Most importantly, I think it puts us in a
better position for long term feature enhancements later, by
taking the middle man (libvirt) out of the equation and letting
QEMU directly know what medium it is saving/restoring to/from.

With regards,
Daniel
--
|: https://berrange.com      -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|