Re: [libvirt RFC] add API for parallel Saves (not for committing)

Friday, 22 April 2022

On 4/22/22 10:19 AM, Daniel P. Berrangé wrote:
...
 On Thu, Apr 21, 2022 at 08:06:40PM +0200, Claudio Fontana wrote:
> On 4/21/22 7:08 PM, Daniel P. Berrangé wrote:
>> On Thu, Apr 14, 2022 at 09:54:16AM +0200, Claudio Fontana wrote:
>>> RFC, starting point for discussion.
>>>
>>> Sketch API changes to allow parallel Saves, and open up
>>> an implementation for QEMU to leverage multifd migration to files,
>>> with optional multifd compression.
>>>
>>> This allows to improve save times for huge VMs.
>>>
>>> The idea is to issue commands like:
>>>
>>> virsh save domain /path/savevm --parallel --parallel-connections 2
>>>
>>> and have libvirt start a multifd migration to:
>>>
>>> /path/savevm   : main migration connection
>>> /path/savevm.1 : multifd channel 1
>>> /path/savevm.2 : multifd channel 2
>>
>> At a conceptual level the idea would be to still have a single file,
>> but have threads writing to different regions of it. I don't think
>> that's possible with multifd though, as it doesn't partition RAM
>> up between threads, it just hands out pages on demand. So if one
>> thread happens to be quicker it'll send more RAM than another
>> thread. Also we're basically capturing the migration RAM, and the
>> multifd channels have control info, in addition to the RAM pages.
>>
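The point about on-demand hand-out can be illustrated with a small deterministic simulation. This is only a sketch of the scheduling idea, not QEMU's multifd code: the function name `hand_out_pages` and the per-page costs are hypothetical, and each page simply goes to whichever channel becomes free first.

```python
import heapq


def hand_out_pages(num_pages, cost_per_page):
    """Simulate on-demand page hand-out across channels.

    cost_per_page[i] is the (hypothetical) time channel i needs to send
    one page. Returns how many pages each channel ended up sending.
    """
    counts = [0] * len(cost_per_page)
    # Heap of (time at which the channel is free again, channel index).
    free_at = [(0, ch) for ch in range(len(cost_per_page))]
    heapq.heapify(free_at)
    for _ in range(num_pages):
        now, ch = heapq.heappop(free_at)   # next channel to become idle
        counts[ch] += 1
        heapq.heappush(free_at, (now + cost_per_page[ch], ch))
    return counts


if __name__ == "__main__":
    # Channel 0 is twice as fast as channel 1.
    print(hand_out_pages(300, [1, 2]))
```

Because the split depends on relative channel speed, the amount of RAM each channel carries is not known in advance, which is why fixed per-thread regions of a single file do not map onto this scheme directly.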
>> That makes me wonder actually, are the multifd streams unidirectional
>> or bidirectional ?  Our saving to a file logic, relies on the streams
>> being unidirectional.
>
>
> Unidirectional. In the meantime I completed an actual libvirt prototype that works
> (only did the save part, not the restore yet).
>
>
>>
>> You've got me thinking, however, whether we can take QEMU out of
>> the loop entirely for saving RAM.
>>
>> IIUC with 'x-ignore-shared' migration capability QEMU will skip
>> saving of RAM region entirely (well technically any region marked
>> as 'shared', which I guess can cover more things). 
>
> Heh I have no idea about this.
>
>>
>> If the QEMU process is configured with a file backed shared
>> memory, or memfd, I wonder if we can take advantage of this.
>> eg
>>
>>   1. pause the VM
>>   1. write the libvirt header to save.img
>>   2. sendfile(qemus-memfd, save.img-fd)  to copy the entire
>>      RAM after header
>
> I don't understand this point very much... if the ram is already
> backed by file why are we sending this again..?

 It is a file pointing to hugepagefs or tmpfs. It is still actually
 RAM, but we exposed it to QEMU via a file, which QEMU then mmap'd.

 We don't do this by default, but anyone with large (many GB) VMs
 is increasingly likely to be relying on huge pages to optimize
 their VM performance.

From what I could observe, I'd say it depends on the specific scenario:
how much memory we have to work with, and the general compromise between cpu, memory, disk,
... all of which is subject to cost optimization.

...

 In our current save scheme we have (at least) 2 copies going
 on. QEMU copies from RAM into the FD it uses for migrate.
 libvirt IO helper copies from the FD into the file. This involves
 multiple threads and multiple userspace/kernel switches and data
 copies.  You've been trying to eliminate the 2nd copy in userspace.

I've been trying to eliminate the 2nd copy in userspace, but this is just aspect 1) I
have in mind. It is good but gives only so much, and for huge VMs things fall apart when
reaching the file cache thrashing problem.

Aspect 2) in my mind is the file cache thrashing that the kernel gets into, which I think
is the reason we need O_DIRECT at all with huge VMs. O_DIRECT creates a lot of
complications (ie we are kinda forced to have a helper anyway to ensure block-aligned
source and destination addresses and length) and gives suboptimal performance.
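To make the alignment constraint concrete, here is a minimal sketch of what an O_DIRECT writer has to take care of. It is not libvirt's iohelper: the 4096-byte block size, the function names, and the choice to pad with zeros and then trim with ftruncate are all assumptions for illustration. It is Linux-only, and some filesystems reject O_DIRECT.

```python
import mmap
import os

BLOCK = 4096  # assumed alignment; the real value depends on device and filesystem


def padded_length(length, block=BLOCK):
    """Round length up to the next multiple of block."""
    return (length + block - 1) // block * block


def aligned_buffer(data, block=BLOCK):
    """Copy data into a zero-filled anonymous mmap.

    Anonymous mappings are page-aligned, which satisfies the source
    address requirement; the length is padded to a whole number of blocks.
    """
    buf = mmap.mmap(-1, max(padded_length(len(data), block), block))
    buf[:len(data)] = data
    return buf


def write_direct(path, data, block=BLOCK):
    """Write data to path with O_DIRECT, then trim the zero padding."""
    buf = aligned_buffer(data, block)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_DIRECT, 0o600)
    try:
        written = os.write(fd, buf)        # offset 0, aligned address and length
        if written != len(buf):
            raise OSError("short O_DIRECT write")
        os.ftruncate(fd, len(data))        # drop the padding again
    finally:
        os.close(fd)
        buf.close()
```

A plain `write()` of an arbitrary byte string would typically fail with EINVAL under O_DIRECT, which is the reason an intermediate helper that re-buffers the stream is hard to avoid.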

In my understanding, this is what the following attempted to solve:

https://lwn.net/Articles/806980/

which seemed more promising to me, but unfortunately the implementation apparently went
to /dev/null.

There was also posix_fadvise POSIX_FADV_NOREUSE, which I think in practice is a very
clunky API, and which also got lost.

Aspect 3) is a practical solution that I already prototyped and that yields very good
results in practice: make better use of the resources we have. We have a certain number
of cpus assigned to run VMs, and the save/restore operations we need happen with a
suspended guest, so we can put those cpus to good use and reduce the problem size by
leveraging multifd and compression, which come for free from qemu.

I think that as long as the file cache issue remains unsolved, we are stuck with O_DIRECT,
so we are stuck with a helper, and at that point we can easily have a

multifd-helper

that reuses the code from iohelper, and performs O_DIRECT writes of the compressed streams
to multiple files in parallel.
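A rough sketch of the shape such a helper could take, under stated assumptions: this is not the prototype's code, the names `channel_paths`, `drain` and `save_parallel` are hypothetical, and plain buffered writes stand in for the O_DIRECT writes to keep the example short. The file naming follows the /path/savevm, /path/savevm.1, /path/savevm.2 scheme from the RFC.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def channel_paths(base, channels):
    """base for the main stream, base.1 .. base.N for the multifd channels."""
    return [base] + [f"{base}.{i}" for i in range(1, channels + 1)]


def drain(stream_fd, path, chunk=1 << 20):
    """Copy everything readable from stream_fd into path; return bytes written."""
    total = 0
    with open(path, "wb") as out:
        while True:
            data = os.read(stream_fd, chunk)
            if not data:                   # EOF: the sender closed its end
                break
            out.write(data)
            total += len(data)
    return total


def save_parallel(stream_fds, base):
    """Drain each stream into its own file, one worker thread per stream.

    stream_fds[0] is taken to be the main migration stream, the rest are
    the multifd channels. Returns a {path: bytes_written} mapping.
    """
    paths = channel_paths(base, len(stream_fds) - 1)
    with ThreadPoolExecutor(max_workers=len(paths)) as pool:
        sizes = list(pool.map(drain, stream_fds, paths))
    return dict(zip(paths, sizes))
```

Since each stream is unidirectional, every worker only needs to read until EOF and never has to answer the sender.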

...

 If we take advantage of scenario where QEMU RAM is backed by a
 tmpfs/hugepagefs file, we can potentially eliminate both copies
 in userspace. The kernel can be told to copy direct from the
 hugepagefs file into the disk file.

Interesting, but we still incur the file cache thrashing as we write, right?

...

>>   3. QMP migrate with x-ignore-shared to copy device
>>      state after RAM
>>
>> Probably can do the same on restore too.
>
>
> Do I understand correctly that you suggest to constantly update the RAM to file at
> runtime?
> Given the compute nature of the workload, I'd think this would slow things down.

 No, no different to what we do today. I'm just saying we let
 the kernel copy straight from QEMU's RAM backing file into
 the dest file, at time of save, so we do *nothing* in userspace
 in either libvirt or QEMU.

 With regards,
 Daniel
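
The header-then-sendfile steps discussed above could be sketched roughly as follows. This is an illustration under assumptions, not libvirt code: `save_image` is a hypothetical name, the caller is assumed to have already paused the VM and to hold a file descriptor for QEMU's RAM backing file (memfd, tmpfs or hugepagefs), and the final QMP migrate step with x-ignore-shared for device state is not shown. Linux-only.

```python
import os


def save_image(header, ram_fd, ram_size, out_path):
    """Write header to out_path, then have the kernel append guest RAM.

    ram_fd is a descriptor for the file backing guest RAM. The copy is
    done with sendfile(), so the RAM contents never pass through a
    userspace buffer. Returns the total number of bytes in the image.
    """
    out_fd = os.open(out_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(out_fd, header)
        offset = 0
        while offset < ram_size:
            # sendfile may copy less than requested, so loop until done.
            sent = os.sendfile(out_fd, ram_fd, offset, ram_size - offset)
            if sent == 0:                  # backing file shorter than expected
                break
            offset += sent
        return len(header) + offset
    finally:
        os.close(out_fd)
```

Note that the destination is still written through the page cache here, so this sketch removes the userspace copies but does not by itself answer the file cache thrashing concern raised in the thread.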
