Thanks for the details and recommendations, Daniel!
On 8/2/2022 11:19 AM, Daniel P. Berrangé wrote:
On Mon, Aug 01, 2022 at 11:03:49AM -0500, Praveen K Paladugu wrote:
> Folks,
>
> We are implementing Live Migration support in "ch" driver of Libvirt.
> I'd like to confirm if the approach we have chosen would be accepted
> upstream once implemented.
>
>
> Our immediate goal is to implement "Hypervisor Native" + "Managed Direct"
> mode of migration. "Hypervisor Native" here refers to the VMM (ch) being
> responsible for the data flow. This is in contrast to TUNNELED migration,
> where data is sent over the libvirt RPC.
Avoiding TUNNELLED migration is a very good idea. This was a short-term
hack to work around the lack of TLS support in QEMU. It is more efficient
to have TLS natively integrated in the hypervisor layer than in libvirt.
IOW, "Hypervisor native" is a good choice.
>
> "Managed Direct" referring to virsh client responsible for control flow
> between source and dest hosts. The libvirtd daemons on source and
> destination do not have to communicate with each other. These modes are
> described further at
>
https://libvirt.org/migration.html#network-data-transports.
I'd caution that I think 'managed direct' migration leaves you with
fewer nice options for ensuring resilience of the migration.
IOW, if the client application goes away, I think it'll be harder
for the libvirt CH driver to recover from that scenario.
Also, if a client app is using the DigitalOcean 'go-libvirt' API
instead of our 'libvirt-go-module' API, things are even more
limited, since the 'go-libvirt' API speaks directly to the RPC
protocol, bypassing the libvirt.so logic related to migration
process steps.
With the peer-to-peer mode, migration can carry on even if the
client app goes away, since the client app isn't a part of the
control loop.
So overall, I'd encourage peer-to-peer migration as the preferable
option, unless you can hand off absolutely everything to the CH
code and not have libvirt involved in orchestrating the migration
steps at all?
Makes sense to prioritize peer-to-peer migration. Our current project is
an internship and has strict time constraints. As we are well under way
with the "Managed Direct" mode, we will finish it and focus on the
peer-to-peer migration mode right after.
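
Just to make sure we are describing the same split, this is the client-side
distinction as we understand it, as a minimal sketch against the public
libvirt API (error handling omitted; flags as described at
https://libvirt.org/migration.html):

#include <libvirt/libvirt.h>

/* Managed direct: the client holds a connection to both daemons and
 * drives the control flow itself. */
virDomainPtr
migrateManagedDirect(virDomainPtr dom, virConnectPtr dstConn)
{
    return virDomainMigrate(dom, dstConn,
                            VIR_MIGRATE_LIVE,
                            NULL,   /* keep the domain name */
                            NULL,   /* let libvirt pick the migration URI */
                            0);     /* no bandwidth cap */
}

/* Peer-to-peer: the client only talks to the source daemon, which then
 * connects to the destination daemon and runs the control loop itself. */
int
migratePeerToPeer(virDomainPtr dom, const char *dstDaemonURI)
{
    return virDomainMigrateToURI(dom, dstDaemonURI,
                                 VIR_MIGRATE_LIVE | VIR_MIGRATE_PEER2PEER,
                                 NULL,  /* keep the domain name */
                                 0);    /* no bandwidth cap */
}

As you point out, in the second form the migration survives the client
going away, since the source libvirtd owns the whole control loop.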
> At the moment, Cloud-Hypervisor supports receiving migration data only on
> Unix Domain Sockets. Also, Cloud-Hypervisor does not encrypt the VM data
> while sending.
Hmm, that's quite limiting.
>
> We are considering forking "socat" processes as documented at
> https://github.com/cloud-hypervisor/cloud-hypervisor/blob/main/docs/live_....
> The socat processes will be forked in the "Prepare" and "Perform" phases
> on the Destination and Source hosts respectively.
>
> I couldn't find any existing implementation in libvirt to connect Domain
> Sockets on different hosts. Please let me know if you'd recommend a
> different approach from forking socat processes to connect Domain Sockets on
> source and dest hosts to allow Live VM Migration.
I think building something around socat will get you going quickly, but
ultimately be harmful over the long term.
Makes sense. We were also concerned about long-term maintenance, so we
wanted to check on this mailing list. As there isn't a better mechanism to
connect domain sockets on the source and dest hosts, we will finish up the
"socat"-based implementation and get it to work end-to-end.
Our experience with QEMU has been that to maximise performance you need
the lowest level in full control. These days QEMU can open multiple TCP
connections concurrently from multiple threads, so that throughput isn't
limited by the data copy performance of a single CPU. It also has the
ability to take advantage of kernel features like zerocopy. Use of a socat
proxy is going to add many data copies to the transport, which can only
harm your performance.
So my recommendation would be to invest time in first extending CH so
that it natively supports opening TCP connections, and then take advantage
of that in libvirt from the start. You then have the right basic foundation
on which to add stuff like TLS, zerocopy, multi-connection, and more.
Again, thanks for the details and the recommendation. Enabling TCP
connections and other low-level features in cloud-hypervisor isn't
something we can tackle within our current time constraints. But we will
follow up with the cloud-hypervisor community and open a tracking issue
for this work.
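
For that tracking issue, it may help to reference what the equivalent
client-visible knobs look like for the QEMU driver today, since that is
effectively the end state you describe. A sketch against the public libvirt
API; the flags and the parallel-connections parameter are the existing
QEMU-oriented ones and are shown only to illustrate where the ch driver
could eventually get to.

#include <libvirt/libvirt.h>

int
migrateParallelZerocopy(virDomainPtr dom, const char *dstDaemonURI)
{
    virTypedParameterPtr params = NULL;
    int nparams = 0;
    int maxparams = 0;
    int ret;

    /* Ask for several concurrent migration connections so throughput is
     * not limited by a single CPU copying data. */
    if (virTypedParamsAddInt(&params, &nparams, &maxparams,
                             VIR_MIGRATE_PARAM_PARALLEL_CONNECTIONS, 4) < 0)
        return -1;

    ret = virDomainMigrateToURI3(dom, dstDaemonURI, params, nparams,
                                 VIR_MIGRATE_LIVE |
                                 VIR_MIGRATE_PEER2PEER |
                                 VIR_MIGRATE_PARALLEL |
                                 VIR_MIGRATE_ZEROCOPY);

    virTypedParamsFree(params, nparams);
    return ret;
}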
With regards,
Daniel
--
Regards,
Praveen K Paladugu