On 11/02/2013 04:27 PM, Daniel P. Berrange wrote:
On Sat, Nov 02, 2013 at 12:56:37AM +0000, Christian Benvenuti (benve) wrote:
> Hello,
> based on the thread below, it seems that the most logical way to
> add support for
>
> container live migration
>
> to libvirt is to integrate the latter with CRIU.
> If I understand it correctly, Daniel's suggestion below would be
> that of
> - 1st converting CRIU to a library
> - 2nd making libvirt use that library to C/R the container/s
>
> CRIU has recently announced support for
>
> CRIU as a service
>
> and the reason why they opted for a service instead of a library [1] seems
> to be associated with a use case they had:
>
> ability for an application to invoke a self-dump C/R
>
> In libvirt's case it would not be the container that asks for a self-dump;
> libvirt itself would orchestrate it.
>
> In light of the new CRIU as a service feature, is libvirt's preference still
> that of using a library? Would a service be equally good?
I can't easily answer this question, without delving into the technical
details of it. The big question is how this would fit into libvirt's
architecture. I'm not fixed on any single approach - just thought that
having it as a library might be the easiest way to integrate - we tend
to prefer APIs over forking external programs wherever available, since
they're often easier to do good error reporting with, and lower overhead
to use.
There's an RPC API for CRIU described at http://criu.org/RPC . It requires
the CRIU service to be up and running as a daemon. Would that fit the libvirt
architecture?
Making the whole of CRIU available as a .so seems useless -- it would
only be suitable for programs written in C/C++ _and_ running as root.
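To make the "service" option concrete, here is a minimal sketch of the request/response pattern a client would follow. It assumes, going by the page at http://criu.org/RPC, that the service is reached over a UNIX socket of type SOCK_SEQPACKET. The real requests are protobuf messages defined by CRIU; they are replaced here by opaque bytes, and serve_once(), call() and demo() are hypothetical names used only for illustration.

```python
import socket
import threading


def serve_once(conn):
    """Stand-in for the service side: read one request, send one reply."""
    request = conn.recv(4096)  # SOCK_SEQPACKET preserves message boundaries
    conn.send(b"ack:" + request)


def call(conn, payload):
    """Send one request and block until the response arrives."""
    conn.send(payload)
    return conn.recv(4096)


def demo():
    # A connected pair stands in for connecting to the daemon's socket path.
    client, server = socket.socketpair(socket.AF_UNIX, socket.SOCK_SEQPACKET)
    with client, server:
        worker = threading.Thread(target=serve_once, args=(server,))
        worker.start()
        reply = call(client, b"opaque-request")
        worker.join()
        return reply
```

With a real daemon the client would connect() to the service's socket path instead of using socketpair(), and the payload would be a serialized request message.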
Currently libvirtd spawns libvirt_lxc which then spawns the actual
container. The libvirt_lxc process is daemonized and is set as the
parent of the container init process, and has a UNIX socket back
to libvirtd. The libvirt_lxc process also owns the master side of two
pseudo TTYs, one in the host's /dev/pts instance, and one in the guest's
/dev/pts instance.
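As a small illustration of what "owning the master side" of a pseudo TTY means, the sketch below opens a PTY pair and shows that whatever is written to the slave end (the container's console) is read from the master end (held by libvirt_lxc). The function name pty_round_trip() is hypothetical, and the example runs in a single process only to keep it short.

```python
import os


def pty_round_trip(message):
    """Write to the slave end of a new PTY and read it back from the master."""
    master_fd, slave_fd = os.openpty()
    try:
        os.write(slave_fd, message)      # what a container process would print
        return os.read(master_fd, 1024)  # what the master-side owner receives
    finally:
        os.close(slave_fd)
        os.close(master_fd)
```

Only the holder of master_fd can see this traffic, which is why a restored container needs some way to hand a master descriptor back to its controller.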
What I'm fuzzy on is where the C/R would best take place, e.g. would it
cover the container processes only, or the container processes + libvirt_lxc?
In the former case, I'm not sure how the libvirt_lxc process would get
back a handle to the master PTY. In the latter case, I'm not sure how
libvirtd would re-establish the UNIX socket connection to libvirt_lxc.
I've heard that there are some problems with CRIU and UNIX sockets when
only one end of the socket is being C/R'd.
Well, we seem to have the same problem with systemd (I don't remember the
details, though) -- the container's systemd has a UNIX connection to the host's.
If there's a way to distinguish these sockets from each other, we can
declare an API for CRIU extensions, that would help it to dump and restore
the connection state.
A similar issue exists with device files -- if a task has some /dev/foo
file opened, we (in theory) can ask an external .so to dump its state
and restore it back. This scenario does have a way to "distinguish these
from one another" -- the device major:minor pair. Can you suggest how to do
it with UNIX connections?
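For reference, the major:minor pair mentioned above can be read from a device node as sketched below. The second helper shows one candidate key for a UNIX socket end, its inode number as reported by fstat(); that choice is an assumption made for illustration, not something proposed in this thread. Both function names are hypothetical.

```python
import os
import socket
import stat


def device_id(path):
    """Return the (major, minor) pair of a character or block device node."""
    st = os.stat(path)
    if not (stat.S_ISCHR(st.st_mode) or stat.S_ISBLK(st.st_mode)):
        raise ValueError(f"{path} is not a device node")
    return os.major(st.st_rdev), os.minor(st.st_rdev)


def socket_inode(sock):
    """Return the inode number behind a socket (assumed identification key)."""
    return os.fstat(sock.fileno()).st_ino
```

An inode number only identifies a socket end on a running host, so anything keyed on it would have to be resolved at dump time rather than stored as a stable name.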
I'm also curious how CRIU deals with network interfaces. When libvirt
starts a container, if using an SRIOV NIC with multiple virtual
functions, then libvirt will select a random function to assign to
the container at startup. We can't assume this same function is still
available when restoring the container - we'd likely need to select
a different virtual function to give to the container as its "eth0".
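The selection problem can be sketched as follows. This is only an illustration under stated assumptions: virtual functions are represented by plain integers, and pick_virtual_function() is a hypothetical helper, not a libvirt API.

```python
import random


def pick_virtual_function(all_vfs, in_use, rng=random):
    """Pick a random virtual function that no other guest currently holds."""
    free = sorted(set(all_vfs) - set(in_use))
    if not free:
        raise RuntimeError("no free virtual function on this host")
    return rng.choice(free)
```

Because the choice depends on what is free on the restoring host at that moment, the function recorded at dump time cannot simply be reused.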
Well, by default CRIU tries to dump all the information about the net devices
found within the net namespace and (!) re-_creates_ these devices on restore,
then configures them. However, there's a notion of "external" devices, which
are assumed to be already available inside the net namespace; CRIU does not
create them, it only configures them.
So, when the container's init is spawned, CRIU calls an external script to set
the container up. In this script one can create the desired "external" net
devices and move them into the respective net namespace. Later CRIU would
configure the devices.
The only problem with this right now is that a device is classified as
"external" at the dump stage, and right now only the OpenVZ-specific venet is
treated as such. We probably need some way to tell CRIU that any device is
"external".
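A minimal sketch of what such an external script might compute is shown below. It assumes that CRIU passes the stage name in an environment variable called CRTOOLS_SCRIPT_ACTION and the restored init's PID in CRTOOLS_INIT_PID, and that the relevant stage is named "setup-namespaces"; those names are assumptions to be checked against the CRIU documentation. The helper names and the device name are hypothetical. The commands are only built, not executed.

```python
def move_into_netns_cmd(device, init_pid):
    """Build the iproute2 command that moves a device into a task's netns."""
    return ["ip", "link", "set", "dev", device, "netns", str(init_pid)]


def commands_for_action(env, devices):
    """Return the commands to run for the current action, if any."""
    # Assumed variable and action names; see the lead-in above.
    if env.get("CRTOOLS_SCRIPT_ACTION") != "setup-namespaces":
        return []
    init_pid = int(env["CRTOOLS_INIT_PID"])
    return [move_into_netns_cmd(dev, init_pid) for dev in devices]
```

A real script would call commands_for_action(os.environ, ["veth1"]) and pass each resulting command to subprocess.run(cmd, check=True) with sufficient privileges.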
Similarly with SELinux, we dynamically assign a unique MCS level to
containers when starting them. This will again need to be newly
allocated at restore time and will almost certainly differ from the
previous MCS level.
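The dynamic allocation can be sketched as below, assuming the common label form of sensitivity s0 plus two distinct categories drawn from c0..c1023. allocate_mcs_level() and its parameters are hypothetical and are not libvirt's implementation.

```python
import random


def allocate_mcs_level(in_use, categories=1024, rng=random, max_tries=1000):
    """Pick a random MCS level of the form s0:cLOW,cHIGH that is not in use."""
    for _ in range(max_tries):
        low, high = sorted(rng.sample(range(categories), 2))
        level = f"s0:c{low},c{high}"
        if level not in in_use:
            return level
    raise RuntimeError("could not find a free MCS level")
```

Since the level is drawn at random against whatever is in use on the restoring host, any label captured at dump time would have to be remapped rather than restored verbatim.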
We haven't done anything to support SELinux in CRIU yet. Can you point to some
docs describing what libvirt does with it?
I've not looked at CRIU in enough detail to know if it copes with stuff
like this or not.
Well, I haven't looked at libvirt in enough detail either :) but we're
willing to, so if we can work out an acceptable API between CRIU
and libvirt, I think we can start some development in that area.
> Is there anyone actively working on or looking at this (libvirt+CRIU)?
Not that I am aware of.
Daniel
Thanks,
Pavel