* Daniel P. Berrangé (berrange(a)redhat.com) wrote:
> On Mon, Apr 25, 2022 at 01:33:41PM +0100, Dr. David Alan Gilbert wrote:
> > * Daniel P. Berrangé (berrange(a)redhat.com) wrote:
> > > On Mon, Apr 25, 2022 at 12:04:37PM +0100, Dr. David Alan Gilbert wrote:
> > > > * Daniel P. Berrangé (berrange(a)redhat.com) wrote:
> > > > > I'm worried that we could be taking ourselves down a dead-end by
> > > > > trying to optimize on the libvirt side, because we've got a
> > > > > mismatch between the QMP APIs we're using and the intent of
> > > > > QEMU.
> > > > >
> > > > > The QEMU migration APIs were designed around streaming to a
> > > > > remote instance, and we're essentially playing games to use
> > > > > them as a way to write to local storage.
> > > >
> > > > Yes.
> > > >
> > > > > The RAM pages we're saving are of course page aligned in QEMU
> > > > > because they are mapped RAM. We lose/throw away the page
> > > > > alignment because we're sending them over a FD, potentially
> > > > > adding in metadata headers to identify which location
> > > > > the RAM block came from.
> > > > >
> > > > > QEMU has APIs for doing async I/O to local storage using
> > > > > O_DIRECT, via the BlockDev layer. QEMU can even use this
> > > > > for saving state via the loadvm/savevm monitor commands
> > > > > for internal snapshots. This is not accessible via the
> > > > > normal migration QMP command though.
> > > > >
> > > > >
> > > > > I feel to give ourselves the best chance of optimizing the
> > > > > save/restore, we need to get QEMU to have full knowledge of
> > > > > what is going on, and get libvirt out of the picture almost
> > > > > entirely.
> > > > >
> > > > > If QEMU knows that the migration source/target is a random
> > > > > access file, rather than a stream, then it will not have
> > > > > to attach any headers to identify RAM pages. It can just
> > > > > read/write them directly at a fixed offset in the file.
> > > > > It can even do this while the CPU is running, just overwriting
> > > > > the previously written page on disk if the contents changed.
> > > > >
> > > > > This would mean the save image is a fixed size exactly
> > > > > matching the RAM size, plus libvirt header and vmstate.
> > > > > Right now if we save a live snapshot, the save image can
> > > > > be almost arbitrarily large, since we'll save the same
> > > > > RAM page over & over again if the VM is modifying the
> > > > > content.
> > > > >
> > > > > I think we need to introduce an explicit 'file:' protocol
> > > > > for the migrate command, that is backed by the blockdev APIs
> > > > > so it can do O_DIRECT and non-blocking AIO. For the 'fd:'
> > > > > protocol, we need to be able to tell QEMU whether the 'fd'
> > > > > is a stream or a regular file, so it can choose between the
> > > > > regular send/recv APIs, vs the Blockdev APIs (maybe we can
> > > > > auto-detect with fstat()). If we do this, then multifd
> > > > > doesn't end up needing multiple save files on disk, all
> > > > > the threads can be directly writing to the same file, just
> > > > > at the relevant offsets on disk to match the RAM page
> > > > > location.
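The fstat() auto-detect idea above can be sketched outside QEMU: a regular file is seekable, so pages can land headerless at fixed offsets, while a stream has to carry per-page metadata. A minimal illustration (the record format and function names here are invented, not QEMU's):

```python
# Illustrative sketch only (not QEMU code): pick a save strategy based
# on what the destination fd actually is, as reported by fstat().
import os
import stat
import struct

PAGE = 4096

def save_pages(fd, pages):
    """pages: dict mapping RAM page index -> PAGE-sized bytes."""
    if stat.S_ISREG(os.fstat(fd).st_mode):
        # Regular file: random access, so no headers are needed --
        # a page's offset in the file *is* its identity, and a dirtied
        # page simply overwrites its own slot.
        for idx, data in pages.items():
            os.pwrite(fd, data, idx * PAGE)
        return "file"
    # Stream (pipe/socket): sequential only, so every page must carry
    # a header saying which location it belongs to (made-up format).
    for idx, data in pages.items():
        os.write(fd, struct.pack("<Q", idx) + data)
    return "stream"
```

On the file path the image size stays bounded by the RAM size even for a live save; on the stream path a frequently dirtied page costs header plus page on every resend.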
> > > >
> > > > Hmm so what I'm not sure of is whether it makes sense to use the normal
> > > > migration flow/code for this or not; and you're suggesting a few
> > > > possibly contradictory things.
> > > >
> > > > Adding a file: protocol would be pretty easy (whether it went via
> > > > the blockdev layer or not); getting it to be more efficient is the
> > > > tricky part, because we've got loads of levels of stream abstraction in
> > > > the RAM save code:
> > > >     QEMUFile->channel->OS
> > > > but then if you want to enforce alignment you somehow have to make that
> > > > go all the way down.
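Concretely, "make that go all the way down" means that once the bottom of the stack is O_DIRECT, every layer above may only hand it buffers, lengths, and file offsets that are multiples of the required alignment. A toy illustration of that contract (all names invented, not QEMU code):

```python
# Toy illustration of the contract an O_DIRECT-backed bottom layer
# would impose on everything above it. Not QEMU code.
import os

ALIGN = 4096  # typical O_DIRECT alignment requirement

def align_up(n, a=ALIGN):
    # round n up to the next multiple of a (a is a power of two)
    return (n + a - 1) & ~(a - 1)

class AlignedWriter:
    """Bottom layer: refuses I/O that an O_DIRECT fd would reject."""
    def __init__(self, fd):
        self.fd = fd

    def pwrite(self, data, offset):
        if offset % ALIGN or len(data) % ALIGN:
            raise ValueError("misaligned I/O would fail with O_DIRECT")
        return os.pwrite(self.fd, data, offset)

def write_record(w, payload, offset):
    # An upper layer (think QEMUFile) must pad every record so the
    # next write still lands on an aligned boundary.
    padded = payload + b"\0" * (align_up(len(payload)) - len(payload))
    w.pwrite(padded, offset)
    return offset + len(padded)  # next aligned offset
```

The point of the sketch is that the alignment requirement is not a property of one layer: the padding decision lives in the upper layer while the hard failure lives at the bottom, so the rule has to be threaded through every level in between.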
> > >
> > > The QIOChannel stuff doesn't add buffering, so I wasn't worried
> > > about alignment there.
> > >
> > > QEMUFile has optional buffering which would mess with alignment,
> > > but we could turn that off potentially for the RAM transfer, if
> > > using multifd.
> >
> > The problem isn't whether they add buffering or not; the problem is you
> > now need to add a mechanism to ask for alignment.
> >
> > > I'm confident the performance on the QEMU side though could
> > > exceed what's viable with libvirt's iohelper today, as we
> > > would definitely be eliminating 1 copy and many context switches.
> >
> > Yes but you get that just from adding a simple file: (or fd:) mode
> > without trying to do anything clever with alignment or rewriting the
> > same offset.
>
> I don't think so, as libvirt supports O_DIRECT today to avoid
> trashing the host cache when saving VMs. So to be able to
> offload libvirt's work to QEMU, O_DIRECT is a prerequisite.
I guess you could O_DIRECT it from a buffer in QEMUFile or the channel.
So we do need the alignment support at the very least. Rewriting
at the same offset isn't mandatory, but I think it'd make multifd
saner if trying to have all threads work on the same file.
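A sketch of why one shared file stays sane with parallel writers: pwrite() carries its own offset, so multifd-style threads never contend on a shared file position and never append past their preallocated slots (illustrative code, not QEMU's; the two-thread split is arbitrary):

```python
# Sketch: two writer threads sharing one file, each filling its own
# fixed, page-aligned slot with os.pwrite(). Not QEMU code.
import os
import threading

PAGE = 4096
NPAGES = 4  # pages handled per writer thread

def write_pages(fd, base, fill):
    page = bytes([fill]) * PAGE
    for i in range(NPAGES):
        # pwrite() takes an explicit offset: no shared file position,
        # no locking between threads, no appending.
        os.pwrite(fd, page, base + i * PAGE)

def save(path):
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o600)
    threads = [
        threading.Thread(target=write_pages, args=(fd, 0, 0xAA)),
        threading.Thread(target=write_pages, args=(fd, NPAGES * PAGE, 0xBB)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    size = os.fstat(fd).st_size  # exactly the sum of the slots
    os.close(fd)
    return size
```

Because each thread's region is disjoint and fixed in advance, the resulting file size is deterministic no matter how the threads interleave.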
Thinking on the fly, you'd need some non-trivial changes:
  a) A section entry in the format to say 'align to ... n bytes'
     (easyish)
  b) A way to allocate a location in the file to a RAMBlock
     [ We already have a bitmap address, so that might do, but
       you need to make it interact with the existing file, so it might
       be easier to do the allocate and record it ]
  c) A way to tell the layers below, while writing RAM, that it
     needs to go at a given location.
  d) A clean way for a..c to happen only in this case.
  e) Hmm, RAM size changes/hotplug/virtio-mem.
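Points (a)-(c) amount to a small layout allocator: hand each RAMBlock a page-aligned slot in the file and record the mapping so the write path (and later the load path) knows where every page lives. A hypothetical sketch, with made-up names:

```python
# Hypothetical sketch of points (a)-(c): a trivial file-layout
# allocator giving every RAMBlock a page-aligned slot. Not QEMU code.
ALIGN = 4096

def align_up(n, a=ALIGN):
    return (n + a - 1) & ~(a - 1)

class FileLayout:
    def __init__(self, header_size):
        # (a) everything after the header starts on an aligned boundary
        self.next_free = align_up(header_size)
        self.slots = {}  # (b) RAMBlock name -> byte offset of its slot

    def alloc(self, name, size):
        off = self.next_free
        self.slots[name] = off
        self.next_free = align_up(off + size)
        return off

    def page_offset(self, name, page_index, page_size=ALIGN):
        # (c) where a given page of a given block lands in the file
        return self.slots[name] + page_index * page_size
```

This bump allocator sidesteps point (e) entirely: a block that grows after layout is fixed would need a fresh slot or an indirection table, which is exactly why resize/hotplug/virtio-mem is the awkward case.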
Dave
--
Dr. David Alan Gilbert / dgilbert(a)redhat.com / Manchester, UK