On Thu, Apr 21, 2022 at 08:06:40PM +0200, Claudio Fontana wrote:
> On 4/21/22 7:08 PM, Daniel P. Berrangé wrote:
> > On Thu, Apr 14, 2022 at 09:54:16AM +0200, Claudio Fontana wrote:
> >> RFC, starting point for discussion.
> >>
> >> Sketch API changes to allow parallel saves, and open up
> >> an implementation for QEMU to leverage multifd migration to files,
> >> with optional multifd compression.
> >>
> >> This allows us to improve save times for huge VMs.
> >>
> >> The idea is to issue commands like:
> >>
> >>   virsh save domain /path/savevm --parallel --parallel-connections 2
> >>
> >> and have libvirt start a multifd migration to:
> >>
> >>   /path/savevm   : main migration connection
> >>   /path/savevm.1 : multifd channel 1
> >>   /path/savevm.2 : multifd channel 2
> >
> > At a conceptual level the idea would be to still have a single file,
> > but have threads writing to different regions of it. I don't think
> > that's possible with multifd though, as it doesn't partition RAM
> > up between threads, it just hands out pages on demand. So if one
> > thread happens to be quicker it'll send more RAM than another
> > thread. Also we're basically capturing the migration RAM, and the
> > multifd channels have control info, in addition to the RAM pages.
> >
> > That makes me wonder, actually: are the multifd streams unidirectional
> > or bidirectional? Our saving-to-a-file logic relies on the streams
> > being unidirectional.
> Unidirectional. In the meantime I completed an actual libvirt prototype
> that works (only did the save part, not the restore yet).
> >
> > You've got me thinking, however, whether we can take QEMU out of
> > the loop entirely for saving RAM.
> >
> > IIUC with the 'x-ignore-shared' migration capability QEMU will skip
> > saving of the RAM region entirely (well, technically any region marked
> > as 'shared', which I guess can cover more things).
> Heh, I have no idea about this.
> >
> > If the QEMU process is configured with file-backed shared
> > memory, or memfd, I wonder if we can take advantage of this,
> > e.g.:
> >
> >   1. pause the VM
> >   2. write the libvirt header to save.img
> >   3. sendfile(qemus-memfd, save.img-fd) to copy the entire
> >      RAM after the header
> I don't understand this point very much... if the RAM is already
> backed by a file, why are we sending this again?

It is a file pointing to hugetlbfs or tmpfs. It is still actually
RAM, but we exposed it to QEMU via a file, which QEMU then mmap'd.
We don't do this by default, but anyone with large (many GB) VMs
is increasingly likely to be relying on huge pages to optimize
their VM performance.
In our current save scheme we have (at least) two copies going
on: QEMU copies from RAM into the FD it uses for migrate, and the
libvirt IO helper copies from the FD into the file. This involves
multiple threads, and multiple userspace/kernel switches and data
copies. You've been trying to eliminate the 2nd copy in userspace.

If we take advantage of the scenario where QEMU RAM is backed by a
tmpfs/hugetlbfs file, we can potentially eliminate both copies
in userspace. The kernel can be told to copy directly from the
hugetlbfs file into the disk file.
> >   4. QMP migrate with x-ignore-shared to copy device
> >      state after RAM
> >
> > Probably can do the same on restore too.
> Do I understand correctly that you suggest constantly updating the
> RAM file at runtime?
> Given the compute nature of the workload, I'd think this would slow
> things down.

No, it is no different to what we do today. I'm just saying we let
the kernel copy straight from QEMU's RAM backing file into the
dest file at the time of save, so we do *nothing* in userspace
in either libvirt or QEMU.
With regards,
Daniel
--
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|