On Mon, Apr 25, 2022 at 03:25:43PM +0100, Dr. David Alan Gilbert wrote:
* Daniel P. Berrangé (berrange@redhat.com) wrote:
On Mon, Apr 25, 2022 at 01:33:41PM +0100, Dr. David Alan Gilbert wrote:
* Daniel P. Berrangé (berrange@redhat.com) wrote:
On Mon, Apr 25, 2022 at 12:04:37PM +0100, Dr. David Alan Gilbert wrote:
* Daniel P. Berrangé (berrange@redhat.com) wrote:
I think we need to introduce an explicit 'file:' protocol for the migrate command, backed by the blockdev APIs so it can do O_DIRECT and non-blocking AIO. For the 'fd:' protocol, we need to be able to tell QEMU whether the 'fd' is a stream or a regular file, so it can choose between the regular send/recv APIs and the blockdev APIs (maybe we can auto-detect with fstat()). If we do this, then multifd doesn't end up needing multiple save files on disk; all the threads can be directly writing to the same file, just at the relevant offsets on disk to match the RAM page location.
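As a sketch of the fstat() auto-detection idea (plain POSIX, not actual QEMU code), a helper like this could classify the fd:

#include <stdbool.h>
#include <sys/stat.h>

/*
 * Hypothetical helper, not QEMU code: decide whether a migration fd
 * supports random access (regular file / block device) or is
 * stream-only (pipe, socket, ...), to pick between the send/recv
 * path and a positioned-I/O path.
 */
static bool fd_is_seekable_file(int fd)
{
    struct stat st;

    if (fstat(fd, &st) < 0) {
        return false;   /* be conservative on error: treat as a stream */
    }
    return S_ISREG(st.st_mode) || S_ISBLK(st.st_mode);
}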
Hmm, so what I'm not sure of is whether it makes sense to use the normal migration flow/code for this or not; you're suggesting a few possibly contradictory things.
Adding a file: protocol would be pretty easy (whether it went via the blockdev layer or not); getting it to be more efficient is the tricky part, because we've got loads of levels of stream abstraction in the RAM save code (QEMUFile -> channel -> OS), and if you want to enforce alignment you somehow have to make that requirement go all the way down.
The QIOChannel stuff doesn't add buffering, so I wasn't worried about alignment there.
QEMUFile has optional buffering, which would mess with alignment, but we could potentially turn that off for the RAM transfer if using multifd.
The problem isn't whether they add buffering or not; the problem is that you now need a mechanism for asking for alignment.
I'm confident, though, that performance on the QEMU side could exceed what's viable with libvirt's iohelper today, as we would definitely be eliminating one copy and many context switches.
Yes, but you get that just from adding a simple file: (or fd:) mode without trying to do anything clever with alignment or rewriting the same offset.
I don't think so, as libvirt supports O_DIRECT today to avoid trashing the host cache when saving VMs. So to be able to offload libvirt's work to QEMU, O_DIRECT support is a prerequisite.
I guess you could O_DIRECT it from a buffer in QEMUFile or the channel.
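A minimal sketch of that bounce-buffer idea, assuming a 4 KiB alignment requirement (the real value is device/filesystem specific and would need to be queried):

#define _GNU_SOURCE             /* O_DIRECT on Linux */
#include <fcntl.h>              /* caller opens with O_WRONLY|O_DIRECT */
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define DIO_ALIGN 4096          /* assumed alignment; device specific */

/* Sketch only: stage 'len' bytes through an aligned bounce buffer and
 * write them to an fd opened with O_DIRECT. O_DIRECT requires buffer
 * address, file offset and transfer size to all be aligned, hence the
 * zero padding (a final ftruncate() could trim the file afterwards). */
static int write_odirect(int fd, const void *data, size_t len)
{
    size_t padded = (len + DIO_ALIGN - 1) & ~(size_t)(DIO_ALIGN - 1);
    void *bounce;
    ssize_t rc;

    if (posix_memalign(&bounce, DIO_ALIGN, padded)) {
        return -1;
    }
    memset(bounce, 0, padded);
    memcpy(bounce, data, len);
    rc = write(fd, bounce, padded);
    free(bounce);
    return rc == (ssize_t)padded ? 0 : -1;
}

Of course the memcpy() here is exactly the extra copy under discussion; pushing alignment knowledge down to where RAM pages already sit in suitably aligned memory is what would let it be skipped.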
So we do need the alignment support at the very least. Rewriting at the same offset isn't mandatory, but I think it'd make multifd saner if we're trying to have all threads work on the same file.
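On the "all threads on the same file" point: pwrite() takes an explicit offset and never touches the shared file position, so multifd-style threads can write disjoint regions of one fd without locking. A hypothetical sketch of such a thread body (names and layout invented for illustration):

#include <pthread.h>
#include <stdint.h>
#include <unistd.h>

#define RAM_PAGE_SIZE 4096      /* illustrative; use the real page size */

/* Hypothetical work item for one multifd-style writer thread. */
struct ram_slice {
    int fd;                     /* shared save-file descriptor */
    const uint8_t *ram;         /* base of the RAM block in memory */
    off_t file_base;            /* where this RAM block starts on disk */
    size_t first_page, npages;  /* this thread's slice of the block */
};

/* pthread_create() start routine; error handling elided for brevity. */
static void *ram_writer(void *opaque)
{
    struct ram_slice *s = opaque;

    for (size_t i = 0; i < s->npages; i++) {
        size_t page = s->first_page + i;
        /* pwrite() uses an explicit offset and does not move the
         * shared file position, so disjoint slices need no locking. */
        pwrite(s->fd, s->ram + page * RAM_PAGE_SIZE, RAM_PAGE_SIZE,
               s->file_base + (off_t)(page * RAM_PAGE_SIZE));
    }
    return NULL;
}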
Thinking on the fly, you'd need some non-trivial changes:
a) A section entry in the format to say 'align to ... n bytes' (easyish)
Yep
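As an illustration of what such a record could do (the format here is invented, not QEMU's migration stream format), the writer would emit a marker plus the alignment value, then zero-pad the stream so the next byte lands on the requested boundary:

#include <stdint.h>
#include <stdio.h>

#define SEC_ALIGN_MARKER 0xEA   /* invented marker byte for this sketch */

/* Sketch: emit an 'align to <align> bytes' record into a save file,
 * then pad with zeros so the following data starts aligned. A real
 * format would also pin down the endianness of 'align'. */
static int put_align_record(FILE *f, uint32_t align)
{
    uint8_t marker = SEC_ALIGN_MARKER;
    uint8_t zeros[512] = {0};

    fwrite(&marker, 1, 1, f);
    fwrite(&align, sizeof(align), 1, f);

    long pos = ftell(f);
    size_t pad = (align - (uint64_t)pos % align) % align;
    while (pad > 0) {
        size_t n = pad < sizeof(zeros) ? pad : sizeof(zeros);
        if (fwrite(zeros, 1, n, f) != n) {
            return -1;
        }
        pad -= n;
    }
    return 0;
}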
b) A way to allocate a location in the file to a RAMBlock [We already have a bitmap address, so that might do, but you need to make it interact with the existing file, so it might be easier to do the allocation and record it]
IIUC, the migration protocol first serializes all RAM, and then serializes the VMstate for devices. When libvirt creates a save image for a VM it has its own header + XML dump and then appends the migrate stream. So we get a layout of

+------------------+
| libvirt header   |
+------------------+
| libvirt XML      |
| ...              |
+------------------+
| migration stream |
| ...              |
+------------------+

The 'migration stream' is opaque as far as libvirt is concerned, but we happen to know that from QEMU's POV it expands to

+------------------+
| RAM stream       |
| ...              |
+------------------+
| vmstate          |
| ...              |
+------------------+

where 'RAM stream' is a stream of RAM block headers and RAM block contents, for every page. In the suggestion above, my desire is to achieve this layout:

+------------------+
| libvirt header   |
+------------------+
| libvirt XML      |
| ...              |
+------------------+
| RAM              |
| ...              |
+------------------+
| vmstate          |
| ...              |
+------------------+

The key difference is 'RAM' instead of 'RAM stream'. Libvirt would have to tell QEMU what offset in the file it is permitted to start at - say 16 MB. 'RAM' would be a 1:1 mapping of the guest RAM, simply offset by that 16 MB. IOW, I'm assuming that a 4 GB RAM VM would have its RAM written starting from offset 16 MB and ending at 4 GB + 16 MB.

I'm thinking stuff like virtio-mem / RAM hotplug makes life harder though, as there can be many distinct blocks of RAM contributing to a guest's memory map, and we need to be able to declare an ordering of them for mapping to the file.
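A sketch of the "allocate and record it" idea for multiple RAM blocks (constants and names here are illustrative, not QEMU's): walk the blocks in a fixed, agreed order and assign each an aligned base offset after the libvirt-reserved region.

#include <stdint.h>

#define FILE_HDR_RESERVED (16ULL << 20)   /* illustrative 16 MB for headers */
#define BLOCK_ALIGN       (2ULL << 20)    /* illustrative per-block alignment */

/* Hypothetical record of where one RAM block lives in the save file. */
struct ramblock_map {
    const char *idstr;    /* e.g. "pc.ram", "virtio-mem0" (illustrative) */
    uint64_t size;        /* block size in bytes */
    uint64_t file_base;   /* assigned offset within the save file */
};

/* Assign consecutive aligned file offsets in one deterministic order,
 * so the saving and restoring sides agree on the mapping. */
static void allocate_file_offsets(struct ramblock_map *blocks, int n)
{
    uint64_t next = FILE_HDR_RESERVED;

    for (int i = 0; i < n; i++) {
        blocks[i].file_base = next;
        next += (blocks[i].size + BLOCK_ALIGN - 1) & ~(BLOCK_ALIGN - 1);
    }
}

The resulting table would itself need to be serialized near the start of the stream, so the restoring side can locate each block; that is the "record it" part.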
c) A way to tell the layers below, while writing RAM, that it needs to go at a given location.
Yes, QEMUFile would need some new APIs accepting offsets, and we would need a way to report whether a given implementation supports random access vs streaming only.
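For illustration, such an API surface might look something like this (hypothetical signatures, sketched against POSIX types; not existing QEMUFile functions):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

typedef struct QEMUFile QEMUFile;   /* QEMU's internal stream type */

/* Hypothetical additions, not the existing QEMUFile API. */

/* Probe: does the underlying channel support positioned I/O? */
bool qemu_file_is_seekable(QEMUFile *f);

/* Write/read 'len' bytes at absolute offset 'pos' without disturbing
 * the sequential stream position used by the rest of the migration. */
ssize_t qemu_put_buffer_at(QEMUFile *f, const uint8_t *buf,
                           size_t len, off_t pos);
ssize_t qemu_get_buffer_at(QEMUFile *f, uint8_t *buf,
                           size_t len, off_t pos);

On a regular file these could map down to pwrite()/pread(); a stream-only channel would report false from the probe, and callers would fall back to the existing sequential path.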
d) A clean way for (a)..(c) to happen only in this case.
e) Hmm, RAM size changes/hotplug/virtio-mem
This is "easy" from libvirt POV, since none of those things are possible to be invoked during save/restore/migrate. Hard(er) for QEMU since it doesn't place that restriction on users in the general case. With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|