On Wed, May 11, 2022 at 01:47:13PM +0200, Claudio Fontana wrote:
> On 5/11/22 10:27 AM, Christophe Marie Francois Dupont de Dinechin wrote:
> >
> >
> >> On 10 May 2022, at 20:38, Daniel P. Berrangé <berrange(a)redhat.com> wrote:
> >>
> >> On Sat, May 07, 2022 at 03:42:53PM +0200, Claudio Fontana wrote:
> >>> This is v8 of the multifd save prototype, which fixes a few bugs,
> >>> adds a few more code splits, and records the number of channels
> >>> as well as the compression algorithm, so the restore command is
> >>> more user-friendly.
> >>>
> >>> It is now possible to just say:
> >>>
> >>>   virsh save mydomain /mnt/saves/mysave --parallel
> >>>
> >>>   virsh restore /mnt/saves/mysave --parallel
> >>>
> >>> and things work with the default of 2 channels, no compression.
> >>>
> >>> It is also possible to say, of course:
> >>>
> >>>   virsh save mydomain /mnt/saves/mysave --parallel \
> >>>       --parallel-connections 16 --parallel-compression zstd
> >>>
> >>>   virsh restore /mnt/saves/mysave --parallel
> >>>
> >>> and things also work fine, due to the channels and compression
> >>> being stored in the main save file.
> >>
> >> For the sake of people following along, the above commands will
> >> result in creation of multiple files:
> >>
> >>   /mnt/saves/mysave
> >>   /mnt/saves/mysave.0
> >>   /mnt/saves/mysave.1
> >>   ....
> >>   /mnt/saves/mysave.n
> >>
> >> Where 'n' is the number of threads used.
> >>
> >> Overall I'm not very happy with the approach of doing any of this
> >> on the libvirt side.
> >>
> >> Backing up, we know that QEMU can directly save to disk faster than
> >> libvirt can. We mitigated a lot of that overhead with previous patches
> >> to increase the pipe buffer size, but some still remains due to the
> >> extra copies inherent in handing this off to libvirt.
> >>
> >> Using multifd on the libvirt side, IIUC, gets us better performance
> >> than QEMU can manage if doing a non-multifd write to file directly,
> >> but we still have the extra copies in there due to the hand off
> >> to libvirt. If QEMU were directly capable of writing to
> >> disk with multifd, it should beat us again.
> >>
> >> As a result of how we integrate with QEMU multifd, we're taking the
> >> approach of saving the state across multiple files, because it is
> >> easier than trying to get multiple threads writing to the same file.
> >> It could be solved by using file range locking on the save file,
> >> e.g. a thread can reserve say 500 MB of space, fill it up, and then
> >> reserve another 500 MB, etc, etc. It is a bit tedious though, and
> >> won't align nicely: e.g. a 1 GB huge page would be 1 GB + a few
> >> bytes of QEMU RAM save state header.
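To make the range-reservation idea concrete, here is a minimal sketch
(illustrative names, not actual libvirt or QEMU code): each writer
thread atomically claims the next fixed-size range of the shared save
file and fills it with pwrite(), so writers never overlap and never
contend on a shared file position:

  #include <stdatomic.h>
  #include <stdint.h>
  #include <unistd.h>

  #define CHUNK_SIZE (500ULL << 20)        /* 500 MiB per reservation */

  static _Atomic uint64_t next_off;        /* shared allocation cursor */

  /* Atomically claim the start of the next free CHUNK_SIZE range. */
  static uint64_t reserve_range(void)
  {
      return atomic_fetch_add(&next_off, CHUNK_SIZE);
  }

  /* Fill a reserved range (assumes len <= CHUNK_SIZE); pwrite() takes
   * an explicit offset, so threads never share or seek a file position. */
  static ssize_t write_reserved(int fd, const void *buf, size_t len)
  {
      return pwrite(fd, buf, len, reserve_range());
  }

This is exactly the tedious bookkeeping described above: every
partially filled chunk wastes space or needs an index, and the ranges
still don't line up with page-sized RAM blocks in the stream.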
>
> I am not familiar enough to know if this approach would work with
> multifd without breaking the existing format; maybe David could
> answer this.
> >
> > First, I do not understand why you would write things that are
> > not page-aligned to start with? (As an aside, I don't know
> > how any dirty tracking would work if you do not keep things
> > page-aligned).
>
> Yes, alignment is one issue I encountered, and one that in my view
> would _still_ need to be solved whatever we put inside QEMU in the
> future, as it also breaks any attempt to be more efficient (using
> alternative APIs to read/write etc), and is the reason why iohelper
> is still needed in my patchset at all for the main file, causing one
> extra copy for the main channel.
>
> The libvirt header, including metadata, domain xml etc, that wraps
> the QEMU VM ends at an arbitrary address, e.g.:
>
> 00000000: 4c69 6276 6972 7451 656d 7564 5361 7665  LibvirtQemudSave
> 00000010: 0300 0000 5b13 0100 0100 0000 0000 0000  ....[...........
> 00000020: 3613 0000 0200 0000 0000 0000 0000 0000  6...............
> 00000030: 0000 0000 0000 0000 0000 0000 0000 0000  ................
> 00000040: 0000 0000 0000 0000 0000 0000 0000 0000  ................
> 00000050: 0000 0000 0000 0000 0000 0000 3c64 6f6d  ............<dom
> 00000060: 6169 6e20 7479 7065 3d27 6b76 6d27 3e0a  ain type='kvm'>.
> ...
> 000113a0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
> 000113b0: 0000 0000 0000 0051 4556 4d00 0000 0307  .......QEVM.....
> 000113c0: 0000 000d 7063 2d69 3434 3066 782d 362e  ....pc-i440fx-6.
> 000113d0: 3201 0000 0003 0372 616d 0000 0000 0000  2......ram......
> 000113e0: 0004 0000 0008 c00c 2004 0670 632e 7261  ........ ..pc.ra
> 000113f0: 6d00 0000 08c0 0000 0014 2f72 6f6d 4065  m........./rom@e
> 00011400: 7463 2f61 6370 692f 7461 626c 6573 0000  tc/acpi/tables..
> 00011410: 0000 0002 0000 0770 632e 6269 6f73 0000  .......pc.bios..
> 00011420: 0000 0004 0000 1f30 3030 303a 3030 3a30  .......0000:00:0
> 00011430: 322e 302f 7669 7274 696f 2d6e 6574 2d70  2.0/virtio-net-p
> 00011440: 6369 2e72 6f6d 0000 0000 0004 0000 0670  ci.rom.........p
> 00011450: 632e 726f 6d00 0000 0000 0200 0015 2f72  c.rom........./r
> 00011460: 6f6d 4065 7463 2f74 6162 6c65 2d6c 6f61  om@etc/table-loa
> 00011470: 6465 7200 0000 0000 0010 0012 2f72 6f6d  der........./rom
> 00011480: 4065 7463 2f61 6370 692f 7273 6470 0000  @etc/acpi/rsdp..
> 00011490: 0000 0000 1000 0000 0000 0000 0010 7e00  ..............~.
> 000114a0: 0000 0302 0000 0003 0000 0000 0000 2002  .............. .
> 000114b0: 0670 632e 7261 6d00 0000 0000 0000 3022  .pc.ram.......0"
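(For reference, the dump above is plain xxd output over the save file
from the earlier example, e.g.:

  xxd /mnt/saves/mysave | less

The QEVM magic that opens the QEMU stream lands at offset 0x113b7,
which is not page-aligned.)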
>
> In my view, at the minimum we have to start by adding enough padding
> before starting the QEMU VM (the QEVM magic) to be at a page-aligned
> address.
>
> I would add one patch to this effect to my prototype, as this should
> not be very controversial I think.

We already add padding before the QEMU migration stream begins, but
we're just doing a fixed 64kb. The intent was to allow us to edit
the embedded XML. We could easily round this up to a sensible
boundary if needed.
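For illustration only (not the actual libvirt code), rounding the end
of the header up to a power-of-two boundary could look like:

  /* Round off up to the next multiple of align (a power of two). */
  static unsigned long long align_up(unsigned long long off,
                                     unsigned long long align)
  {
      return (off + align - 1) & ~(align - 1);
  }

  /* With the dump above: align_up(0x113b7, 4096) == 0x12000 puts the
   * QEVM magic on a 4 KiB page boundary; 1 GiB huge pages would need
   * align_up(0x113b7, 1ULL << 30). */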
With regards,
Daniel
--
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|