* Daniel P. Berrangé (berrange(a)redhat.com) wrote:
> On Mon, Apr 25, 2022 at 01:33:41PM +0100, Dr. David Alan Gilbert wrote:
> > * Daniel P. Berrangé (berrange(a)redhat.com) wrote:
> > > On Mon, Apr 25, 2022 at 12:04:37PM +0100, Dr. David Alan Gilbert wrote:
> > > > * Daniel P. Berrangé (berrange(a)redhat.com) wrote:
> > > > > I'm worried that we could be taking ourselves down a dead-end by
> > > > > trying to optimize on the libvirt side, because we've got a
> > > > > mismatch between the QMP APIs we're using and the intent of
> > > > > QEMU.
> > > > >
> > > > > The QEMU migration APIs were designed around streaming to a
> > > > > remote instance, and we're essentially playing games to use
> > > > > them as a way to write to local storage.
> > > >
> > > > Yes.
> > > >
> > > > > The RAM pages we're saving are of course page aligned in QEMU
> > > > > because they are mapped RAM. We lose/throw away the page
> > > > > alignment because we're sending them over a FD, potentially
> > > > > adding in metadata headers to identify which location
> > > > > the RAM block came from.
> > > > >
> > > > > QEMU has APIs for doing async I/O to local storage using
> > > > > O_DIRECT, via the BlockDev layer. QEMU can even use this
> > > > > for saving state via the loadvm/savevm monitor commands
> > > > > for internal snapshots. This is not accessible via the
> > > > > normal migration QMP command though.
> > > > >
> > > > >
> > > > > I feel to give ourselves the best chance of optimizing the
> > > > > save/restore, we need to get QEMU to have full knowledge of
> > > > > what is going on, and get libvirt out of the picture almost
> > > > > entirely.
> > > > >
> > > > > If QEMU knows that the migration source/target is a random
> > > > > access file, rather than a stream, then it will not have
> > > > > to attach any headers to identify RAM pages. It can just
> > > > > read/write them directly at a fixed offset in the file.
> > > > > It can even do this while the CPU is running, just overwriting
> > > > > the previously written page on disk if the contents changed.
> > > > >
> > > > > This would mean the save image is a fixed size exactly
> > > > > matching the RAM size, plus libvirt header and vmstate.
> > > > > Right now if we save a live snapshot, the save image can
> > > > > be almost arbitrarily large, since we'll save the same
> > > > > RAM page over & over again if the VM is modifying the
> > > > > content.
> > > > >
> > > > > I think we need to introduce an explicit 'file:' protocol
> > > > > for the migrate command, that is backed by the blockdev APIs
> > > > > so it can do O_DIRECT and non-blocking AIO. For the 'fd:'
> > > > > protocol, we need to be able to tell QEMU whether the 'fd'
> > > > > is a stream or a regular file, so it can choose between the
> > > > > regular send/recv APIs, vs the Blockdev APIs (maybe we can
> > > > > auto-detect with fstat()). If we do this, then multifd
> > > > > doesn't end up needing multiple save files on disk, all
> > > > > the threads can be directly writing to the same file, just
> > > > > at the relevant offsets on disk to match the RAM page
> > > > > location.
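The fstat() auto-detect idea above can be sketched outside QEMU: a regular file is seekable, so pages can land headerless at fixed offsets, while a stream has to carry per-page metadata. A minimal illustration (the record format and function names here are invented, not QEMU's):

```python
# Illustrative sketch only (not QEMU code): pick a save strategy based
# on what the destination fd actually is, as reported by fstat().
import os
import stat
import struct

PAGE = 4096

def save_pages(fd, pages):
    """pages: dict mapping RAM page index -> PAGE-sized bytes."""
    if stat.S_ISREG(os.fstat(fd).st_mode):
        # Regular file: random access, so no headers are needed --
        # a page's offset in the file *is* its identity, and a dirtied
        # page simply overwrites its own slot.
        for idx, data in pages.items():
            os.pwrite(fd, data, idx * PAGE)
        return "file"
    # Stream (pipe/socket): sequential only, so every page must carry
    # a header saying which location it belongs to (made-up format).
    for idx, data in pages.items():
        os.write(fd, struct.pack("<Q", idx) + data)
    return "stream"
```

On the file path the image size stays bounded by the RAM size even for a live save; on the stream path a frequently dirtied page costs header plus page on every resend.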
> > > >
> > > > Hmm so what I'm not sure of is whether it makes sense to use the normal
> > > > migration flow/code for this or not; and you're suggesting a few
> > > > possibly contradictory things.
> > > >
> > > > Adding a file: protocol would be pretty easy (whether it went via
> > > > the blockdev layer or not); getting it to be more efficient is the
> > > > tricky part, because we've got loads of levels of stream abstraction in
> > > > the RAM save code:
> > > >     QEMUFile->channel->OS
> > > > but then if you want to enforce alignment you somehow have to make that
> > > > go all the way down.
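Concretely, "make that go all the way down" means that once the bottom of the stack is O_DIRECT, every layer above may only hand it buffers, lengths, and file offsets that are multiples of the required alignment. A toy illustration of that contract (all names invented, not QEMU code):

```python
# Toy illustration of the contract an O_DIRECT-backed bottom layer
# would impose on everything above it. Not QEMU code.
import os

ALIGN = 4096  # typical O_DIRECT alignment requirement

def align_up(n, a=ALIGN):
    # round n up to the next multiple of a (a is a power of two)
    return (n + a - 1) & ~(a - 1)

class AlignedWriter:
    """Bottom layer: refuses I/O that an O_DIRECT fd would reject."""
    def __init__(self, fd):
        self.fd = fd

    def pwrite(self, data, offset):
        if offset % ALIGN or len(data) % ALIGN:
            raise ValueError("misaligned I/O would fail with O_DIRECT")
        return os.pwrite(self.fd, data, offset)

def write_record(w, payload, offset):
    # An upper layer (think QEMUFile) must pad every record so the
    # next write still lands on an aligned boundary.
    padded = payload + b"\0" * (align_up(len(payload)) - len(payload))
    w.pwrite(padded, offset)
    return offset + len(padded)  # next aligned offset
```

The point of the sketch is that the alignment requirement is not a property of one layer: the padding decision lives in the upper layer while the hard failure lives at the bottom, so the rule has to be threaded through every level in between.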
> > >
> > > The QIOChannel stuff doesn't add buffering, so I wasn't worried
> > > about alignment there.
> > >
> > > QEMUFile has optional buffering which would mess with alignment,
> > > but we could turn that off potentially for the RAM transfer, if
> > > using multifd.
> >
> > The problem isn't whether they add buffering or not; the problem is you
> > now need to add a mechanism to ask for alignment.
> >
> > > I'm confident the performance on the QEMU side though could
> > > exceed what's viable with libvirt's iohelper today, as we
> > > would definitely be eliminating 1 copy and many context switches.
> >
> > Yes but you get that just from adding a simple file: (or fd:) mode
> > without trying to do anything clever with alignment or rewriting the
> > same offset.
>
> I don't think so, as libvirt supports O_DIRECT today to avoid
> trashing the host cache when saving VMs. So to be able to
> offload libvirt's work to QEMU, O_DIRECT is a prerequisite.
I guess you could O_DIRECT it from a buffer in QEMUFile or the channel.
So we do need the alignment support at the very least. Rewriting
at the same offset isn't mandatory, but I think it'd make multifd
saner if trying to have all threads work on the same file.
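A sketch of why one shared file stays sane with parallel writers: pwrite() carries its own offset, so multifd-style threads never contend on a shared file position and never append past their preallocated slots (illustrative code, not QEMU's; the two-thread split is arbitrary):

```python
# Sketch: two writer threads sharing one file, each filling its own
# fixed, page-aligned slot with os.pwrite(). Not QEMU code.
import os
import threading

PAGE = 4096
NPAGES = 4  # pages handled per writer thread

def write_pages(fd, base, fill):
    page = bytes([fill]) * PAGE
    for i in range(NPAGES):
        # pwrite() takes an explicit offset: no shared file position,
        # no locking between threads, no appending.
        os.pwrite(fd, page, base + i * PAGE)

def save(path):
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o600)
    threads = [
        threading.Thread(target=write_pages, args=(fd, 0, 0xAA)),
        threading.Thread(target=write_pages, args=(fd, NPAGES * PAGE, 0xBB)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    size = os.fstat(fd).st_size  # exactly the sum of the slots
    os.close(fd)
    return size
```

Because each thread's region is disjoint and fixed in advance, the resulting file size is deterministic no matter how the threads interleave.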
Thinking on the fly, you'd need some non-trivial changes:
  a) A section entry in the format to say 'align to ... n bytes'
     (easyish)
  b) A way to allocate a location in the file to a RAMBlock
     [ We already have a bitmap address, so that might do, but
       you need to make it interact with the existing file, so it might
       be easier to do the allocate and record it ]
  c) A way to tell the layers below, while writing RAM, that it
     needs to go at a given location.
  d) A clean way for a..c to happen only in this case.
  e) Hmm, RAM size changes/hotplug/virtio-mem.
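Points (a)-(c) amount to a small layout allocator: hand each RAMBlock a page-aligned slot in the file and record the mapping so the write path (and later the load path) knows where every page lives. A hypothetical sketch, with made-up names:

```python
# Hypothetical sketch of points (a)-(c): a trivial file-layout
# allocator giving every RAMBlock a page-aligned slot. Not QEMU code.
ALIGN = 4096

def align_up(n, a=ALIGN):
    return (n + a - 1) & ~(a - 1)

class FileLayout:
    def __init__(self, header_size):
        # (a) everything after the header starts on an aligned boundary
        self.next_free = align_up(header_size)
        self.slots = {}  # (b) RAMBlock name -> byte offset of its slot

    def alloc(self, name, size):
        off = self.next_free
        self.slots[name] = off
        self.next_free = align_up(off + size)
        return off

    def page_offset(self, name, page_index, page_size=ALIGN):
        # (c) where a given page of a given block lands in the file
        return self.slots[name] + page_index * page_size
```

This bump allocator sidesteps point (e) entirely: a block that grows after layout is fixed would need a fresh slot or an indirection table, which is exactly why resize/hotplug/virtio-mem is the awkward case.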
Dave
--
Dr. David Alan Gilbert / dgilbert(a)redhat.com / Manchester, UK