On 5/11/22 2:02 PM, Daniel P. Berrangé wrote:
On Wed, May 11, 2022 at 01:52:05PM +0200, Claudio Fontana wrote:
> On 5/11/22 11:51 AM, Daniel P. Berrangé wrote:
>> On Wed, May 11, 2022 at 09:26:10AM +0200, Claudio Fontana wrote:
>>> Hi Daniel,
>>>
>>> thanks for looking at this,
>>>
>>> On 5/10/22 8:38 PM, Daniel P. Berrangé wrote:
>>>> On Sat, May 07, 2022 at 03:42:53PM +0200, Claudio Fontana wrote:
>>>>> This is v8 of the multifd save prototype, which fixes a few bugs,
>>>>> adds a few more code splits, and records the number of channels
>>>>> as well as the compression algorithm, so the restore command is
>>>>> more user-friendly.
>>>>>
>>>>> It is now possible to just say:
>>>>>
>>>>> virsh save mydomain /mnt/saves/mysave --parallel
>>>>>
>>>>> virsh restore /mnt/saves/mysave --parallel
>>>>>
>>>>> and things work with the default of 2 channels, no compression.
>>>>>
>>>>> It is also possible to say of course:
>>>>>
>>>>> virsh save mydomain /mnt/saves/mysave --parallel
>>>>> --parallel-connections 16 --parallel-compression zstd
>>>>>
>>>>> virsh restore /mnt/saves/mysave --parallel
>>>>>
>>>>> and things also work fine, due to channels and compression
>>>>> being stored in the main save file.
>>>>
>>>> For the sake of people following along, the above commands will
>>>> result in creation of multiple files
>>>>
>>>> /mnt/saves/mysave
>>>> /mnt/saves/mysave.0
>>>
>>> just minor correction, there is no .0
>>
>> Heh, off-by-1
>>
>>>
>>>> /mnt/saves/mysave.1
>>>> ....
>>>> /mnt/saves/mysave.n
>>>>
>>>> Where 'n' is the number of threads used.
>>>>
>>>> Overall I'm not very happy with the approach of doing any of this
>>>> on the libvirt side.
>>>
>>>
>>> Ok I understand your concern.
>>>
>>>>
>>>> Backing up, we know that QEMU can directly save to disk faster than
>>>> libvirt can. We mitigated a lot of that overhead with previous patches
>>>> to increase the pipe buffer size, but some still remains due to the
>>>> extra copies inherent in handing this off to libvirt.
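
(Side note: the pipe-buffer enlargement mentioned above boils down to an
fcntl(F_SETPIPE_SZ) call on the pipe; a minimal sketch of the idea, not the
actual libvirt code, follows.)

/* Minimal sketch: enlarge a pipe buffer with F_SETPIPE_SZ (Linux-specific).
 * The real libvirt patches are more involved than this. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fds[2];

    if (pipe(fds) < 0)
        return 1;

    /* Ask for a 1 MiB buffer; the kernel may round it up or clamp it. */
    int sz = fcntl(fds[1], F_SETPIPE_SZ, 1024 * 1024);
    if (sz < 0)
        perror("F_SETPIPE_SZ");
    else
        printf("pipe buffer is now %d bytes\n", sz);

    close(fds[0]);
    close(fds[1]);
    return 0;
}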
>>>
>>> Right;
>>> still the performance we get is insufficient for the use case we are
>>> trying to address,
>>> even without libvirt in the picture.
>>>
>>> Instead, with parallel save + compression we can make the numbers add up.
>>> For parallel save using multifd, the overhead of libvirt is negligible.
>>>
>>>>
>>>> Using multifd on the libvirt side, IIUC, gets us better performance
>>>> than QEMU can manage if doing non-multifd write to file directly,
>>>> but we still have the extra copies in there due to the hand off
>>>> to libvirt. If QEMU were to be directly capable to writing to
>>>> disk with multifd, it should beat us again.
>>>
>>> Hmm I am thinking about this point, and at first glance I don't
>>> think this is 100% accurate;
>>>
>>> if we do parallel save like in this series with multifd,
>>> the overhead of libvirt is almost non-existent in my view
>>> compared with doing it with qemu only, skipping libvirt;
>>> it is limited to the one iohelper for the main channel
>>> (which is the smallest of the transfers),
>>> and maybe this could be removed as well.
>>
>> Libvirt adds overhead due to the multiple data copies in
>> the save process. Using multifd doesn't get rid of this
>> overhead, it merely distributes the overhead across many
>> CPUs. The overall wallclock time is reduced but in aggregate
>> the CPUs still have the same amount of total work to do
>> copying data around.
>>
>> I don't recall the scale of the libvirt overhead that remains
>> after the pipe buffer optimizations, but whatever is left is
>> still taking up host CPU time that could be used for other guests.
>>
>> It also just occurred to me that currently our save/restore
>> approach is bypassing all resource limits applied to the
>> guest. eg block I/O rate limits, CPU affinity controls,
>> etc, because most of the work is done in the iohelper.
>> If we had this done in QEMU, then the save/restore process
>> is confined by the existing CPU affinity / I/O limits
>> applied to the guest. This means we would not negatively
>> impact other co-hosted guests to the same extent.
>>
>>> This is because even without libvirt in the picture, we
>>> are still migrating to a socket, and something needs to
>>> transfer data from that socket to a file. At that point
>>> I think both libvirt and a custom made script are in the
>>> same position.
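
(To make the "extra copy" concrete: whatever sits between the migration
socket and the file ends up running roughly the loop below, whether it is
libvirt's iohelper or a custom script; every byte crosses a userspace
bounce buffer. A simplified sketch, names made up:)

/* Simplified sketch of relaying a migration stream from a socket to a
 * file; the userspace bounce buffer is the copy being discussed. */
#include <unistd.h>

static int relay_stream(int sock_fd, int file_fd)
{
    static char buf[1 << 20];   /* 1 MiB bounce buffer */
    ssize_t n;

    while ((n = read(sock_fd, buf, sizeof(buf))) > 0) {
        ssize_t off = 0;
        while (off < n) {
            ssize_t w = write(file_fd, buf + off, n - off);
            if (w < 0)
                return -1;      /* caller reports the error */
            off += w;
        }
    }
    return n < 0 ? -1 : 0;
}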
>>
>> If QEMU had explicit support for a "file" backend, there
>> would be no socket involved at all. QEMU would be copying
>> guest RAM directly to a file with no intermediate steps.
>> If QEMU mmap'd the save state file, then saving of the
>> guest RAM could even possibly reduce to a mere 'memcpy()'
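
(A purely hypothetical illustration of the "reduce to a memcpy()" idea;
none of the names below are QEMU APIs:)

/* Hypothetical sketch: map a region of the save file and copy a guest
 * RAM block straight into it. file_off must be page-aligned. */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static int save_ram_block(int fd, off_t file_off, const void *ram, size_t len)
{
    if (ftruncate(fd, file_off + len) < 0)
        return -1;

    void *dst = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                     fd, file_off);
    if (dst == MAP_FAILED)
        return -1;

    memcpy(dst, ram, len);      /* guest RAM -> page cache, no socket */
    munmap(dst, len);
    return 0;
}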
>
> Agree, but still, to align with your requirement to have only one file,
> libvirt would need to add some padding after the libvirt header and
> before the QEMU VM starts in the file,
> so that the QEMU VM starts at a block-friendly address.
That's trivial, as we already add padding in this place.

That's great, I love when things are simple.
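
(For illustration, "block-friendly" here just means rounding the header
length up to the I/O block size before the QEMU stream begins; the 64 KiB
alignment below is an arbitrary example, not libvirt's actual layout:)

/* Hypothetical sketch: pad the save file after the libvirt header so the
 * QEMU migration stream starts at a block-aligned offset.
 * STREAM_ALIGN is an example value, not a libvirt constant. */
#include <unistd.h>

#define STREAM_ALIGN (64 * 1024)
#define ALIGN_UP(x, a) (((x) + (a) - 1) / (a) * (a))

static off_t pad_after_header(int fd, off_t header_len)
{
    off_t start = ALIGN_UP(header_len, (off_t)STREAM_ALIGN);

    /* Extend the file so the gap is zero-filled padding, and return the
     * offset where the QEMU stream should begin. */
    if (ftruncate(fd, start) < 0)
        return -1;
    return start;
}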
If indeed we want to remove the copy in libvirt (which also means fsyncing
explicitly elsewhere, since the iohelper would no longer be there to do it
for us on image creation), and QEMU gains "file" protocol support for
migration, do we plan to have libvirt and QEMU both open the file for
writing concurrently, with QEMU opening it O_DIRECT?
The alternative would be to have libvirt open the file with O_DIRECT, write
its own metadata in a new, O_DIRECT-friendly format, and then pass the fd to
QEMU to migrate to, with QEMU sending its new O_DIRECT-friendly stream there.
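
(Worth keeping in mind for either variant: with O_DIRECT the buffer
address, the length and the file offset all have to be block-aligned,
which is why the metadata would need an O_DIRECT-friendly format in the
first place. A minimal sketch of what the libvirt-side write could look
like; all names are hypothetical:)

/* Hypothetical sketch: open the save file with O_DIRECT, write the libvirt
 * metadata from a block-aligned buffer padded to a block-aligned length,
 * and return the fd so it could then be handed to QEMU. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE 4096          /* example; should match the device */

static int write_header_direct(const char *path, const void *hdr,
                               size_t hdr_len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0600);
    if (fd < 0)
        return -1;

    size_t padded = (hdr_len + BLOCK_SIZE - 1) / BLOCK_SIZE * BLOCK_SIZE;
    void *buf;

    if (posix_memalign(&buf, BLOCK_SIZE, padded) != 0) {
        close(fd);
        return -1;
    }
    memset(buf, 0, padded);
    memcpy(buf, hdr, hdr_len);

    /* Aligned address, aligned length, offset 0: O_DIRECT is satisfied. */
    if (write(fd, buf, padded) != (ssize_t)padded) {
        free(buf);
        close(fd);
        return -1;
    }
    free(buf);
    return fd;    /* the QEMU stream would continue at offset 'padded' */
}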
In any case, the expectation here is to have a new "file://pathname" or
"file://fdname" migration target as an added feature in QEMU, where QEMU
would write a new O_DIRECT-friendly stream directly into the file, taking
care of both optional parallelization and compression.
Is that the gist of it? It seems like a lot of work; I am just trying to
roughly figure out the boundaries of this.
Thanks,
Claudio