FYI, this is a mail I just sent to containers(a)lists.linux-foundation.org
where all the kernel container developers hang out.
Daniel
----- Forwarded message from "Daniel P. Berrange" <berrange(a)redhat.com> -----
Date: Wed, 17 Sep 2008 16:06:35 +0100
From: "Daniel P. Berrange" <berrange(a)redhat.com>
To: containers(a)lists.linux-foundation.org
Subject: An introduction to libvirt's LXC (LinuX Container) support
This is a short^H^H^H^H^H long mail to introduce / walk-through some
recent developments in libvirt to support native Linux hosted
container virtualization using the kernel capabilities the people
on this list have been adding in recent releases. We've been working
on this for a few months now, but not really publicised it before
now, and I figure the people working on container virt extensions
for Linux might be interested in how it is being used.
For those who aren't familiar with libvirt, it provides a stable API
for managing virtualization hosts and their guests. It started with
a Xen driver, and over time has evolved to add support for QEMU, KVM,
OpenVZ and most recently of all a driver we're calling "LXC" short
for "LinuX Containers". The key is that no matter what hypervisor
you are using, there is a consistent set of APIs, and standardized
configuration format for userspace management applications in the
host (and remote secure RPC to the host).
The LXC driver is the result of a combined effort from a number of
people in the libvirt community, most notably Dave Leskovec contributed
the original code, and Dan Smith now leads development along with my
own contributions to its architecture to better integrate with libvirt.
We have a couple of goals in this work. Overall, libvirt wants to be
the defacto standard, open source management API for all virtualization
platforms and native Linux virtualization capabilities are a strong
focus. The LXC driver is attempting to provide a general purpose
management solution for two container virt use cases:
- Application workload isolation
- Virtual private servers
In the first use case we want to provide the ability to run an
application in the primary host OS with partial restrictions on its
resource / service access. It will still run with the same root
directory as the host OS, but its filesystem namespace may have
some additional private mount points present. It may have a
private network namespace to restrict its connectivity, and it
will ultimately have restrictions on its resource usage (eg
memory, CPU time, CPU affinity, I/O bandwidth).
In the second use case, we want to provide a completely virtualized
operating system in the container (running the host kernel of
course), akin to the capabilities of OpenVZ / Linux-VServer. The
container will have a totally private root filesystem, private
networking namespace, whatever other namespace isolation the
kernel provides, and again resource restrictions. Some people
like to think of this as 'a better chroot than chroot'.
In terms of technical implementation, at its core is direct usage
of the new clone() flags. By default all containers get created
with CLONE_NEWPID, CLONE_NEWNS, CLONE_NEWUTS, CLONE_NEWUSER, and
CLONE_NEWIPC. If private network config was requested they also
get CLONE_NEWNET.
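On a suitably recent kernel, each of these namespaces is visible from userspace as an entry under /proc/$PID/ns, which gives a quick way to verify the flags took effect; comparing the entries for a host PID and a container PID shows which namespaces differ. A minimal look at our own shell:

```shell
# Each namespace created via the CLONE_NEW* flags shows up as an entry
# under /proc/$PID/ns on recent kernels; here we simply list our own.
# (Comparing these entries between a host process and a container process
# is a quick check that the clone() flags actually took effect.)
ls -l /proc/self/ns
```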
For the workload isolation case, after creating the container we
just add a number of filesystem mounts in the container's private
FS namespace. In the VPS case, we'll do a pivot_root() onto the
new root directory, and then add any extra filesystem mounts the
container config requested.
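For illustration, here is a hedged sketch (not libvirt's actual code) of that pivot_root() sequence done by hand from the shell; the path is an example matching the demo later on, and DRYRUN=1 prints each privileged step instead of executing it, since the real thing must run as root inside the container's mount namespace:

```shell
# Hedged sketch of the VPS-case root switch, not libvirt's implementation.
# DRYRUN=1 (the default here) prints each step; clear it to run for real
# as root inside the new mount namespace.
DRYRUN=${DRYRUN:-1}
run() { if [ -n "$DRYRUN" ]; then echo "+ $*"; else "$@"; fi; }

NEWROOT=/root/mycontainer               # example path

run mount --bind "$NEWROOT" "$NEWROOT"  # new root must itself be a mount point
run cd "$NEWROOT"
run mkdir -p .oldroot
run pivot_root . .oldroot               # swap / for $NEWROOT in this namespace
run umount -l /.oldroot                 # detach the old root
run mount -t proc proc /proc            # plus any extra mounts from the config
```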
The stdin/out/err of the process leader in the container is bound
to the slave end of a pseudo-TTY, libvirt owning the master end
so it can provide a virtual text console into the guest container.
Once the basic container setup is complete, libvirt execs the
so-called 'init' process. Things are set up such that when the
'init' process exits, the container is terminated / cleaned up.
On the host side, the libvirt LXC driver creates what we call a
'controller' process for each container. This is done with a small
binary /usr/libexec/libvirt_lxc. This is the process which owns the
master end of the pseudo-TTY, along with a second pseudo-TTY pair.
When the host admin wants to interact with the container, they use
the command 'virsh console CONTAINER-NAME'. The LXC controller
process takes care of forwarding I/O between the two slave PTYs,
one slave opened by virsh console, the other being the container's
stdin/out/err. If you kill the controller, then the container
also dies. Basically you can think of the libvirt_lxc controller
as serving the equivalent purpose to the 'qemu' command for full
machine virtualization - it provides the interface between host
and guest, in this case just the container setup, and access to
text console - perhaps more in the future.
For networking, libvirt provides two core concepts
- Shared physical device. A bridge containing one of your
physical network interfaces on the host, along with one or
more of the guest vnet interfaces. So the container appears
as if it's directly on the LAN.
- Virtual network. A bridge containing only guest vnet
interfaces, and NO physical device from the host. IPtables
and forwarding provide routed (+ optionally NATed)
connectivity to the LAN for guests.
The latter use case is particularly useful for machines without
a permanent wired ethernet - eg laptops using wifi - as it lets
guests talk to each other even when there's no active host network.
Both of these network setups are fully supported in the LXC driver
in the presence of a suitably new host kernel.
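For concreteness, a virtual network of the second kind can be described to libvirt with a small XML document; the name, bridge and addresses below are made-up examples for illustration, not libvirt defaults:

```shell
# Hedged example of a routed/NATed virtual network definition. The name,
# bridge and address range are illustrative choices, not defaults.
cat > mynet.xml <<EOF
<network>
  <name>mynet</name>
  <forward mode='nat'/>
  <bridge name='virbr1'/>
  <ip address='192.168.150.1' netmask='255.255.255.0'>
    <dhcp>
      <range start='192.168.150.2' end='192.168.150.254'/>
    </dhcp>
  </ip>
</network>
EOF
# With libvirtd running, load and start it with:
#   virsh net-define mynet.xml && virsh net-start mynet
```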
That's a 100ft overview and the current functionality is working
quite well from an architectural/technical point of view, but there
is plenty more work we still need to do to provide a system which
is mature enough for real world production deployment.
- Integration with cgroups. Although I talked about resource
restrictions, we've not implemented any of this yet. In the
most immediate timeframe we want to use cgroups' device
ACL support to prevent the container having any ability to
access device nodes other than the usual suspects of
/dev/{null,full,zero,console}, and possibly /dev/urandom.
The other important one is to provide a memory cap across
the entire container. CPU based resource control is lower
priority at the moment.
- Efficient query of resource utilization. We need to be able
to get the cumulative CPU time of all the processes inside
the container, without having to iterate over every PIDs'
/proc/$PID/stat file. I'm not sure how we'll do this yet.
We want to get this data for all CPUs in aggregate, and per-CPU.
- devpts virtualization. libvirt currently just bind mounts the
host's /dev/pts into the container. Clearly this isn't a
serious implementation. We've been monitoring the devpts namespace
patches and these look like they will provide the capabilities
we need for the full virtual private server use case
- network sysfs virtualization. libvirt can't currently use the
CLONE_NEWNET flag on most Linux distros, since currently released
kernels have this capability conflicting with SYSFS in Kconfig.
Again, we're looking forward to seeing this addressed in the next
kernel.
- UID/GID virtualization. While we spawn all containers as root,
applications inside the container may switch to unprivileged
UIDs. We don't (necessarily) want users in the host with
equivalent UIDs to be able to kill processes inside the
container. It would also be desirable to allow unprivileged
users to create containers without needing root on the host,
but allowing them to be root & any other user inside their
container. I'm not aware of anyone working on this kind of
thing yet - is anyone?
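To illustrate the resource-utilization item above, the naive per-PID iteration we would like to avoid looks roughly like this, summing over our own shell here as a stand-in for a container's PID list:

```shell
# Naive cumulative-CPU accounting: sum utime+stime (in clock ticks) from
# /proc/$PID/stat for every PID in the container. Here the "container"
# is just our own shell, as a stand-in.
total=0
for pid in $$; do
  stat=$(cat /proc/$pid/stat)
  rest=${stat#*) }      # drop "pid (comm) "; comm may contain spaces
  set -- $rest          # now $1 is the state field, utime is ${12}, stime ${13}
  total=$((total + ${12} + ${13}))
done
echo "cumulative CPU ticks: $total"
```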
There are probably more things Dan Smith is thinking of, but that
list is a good starting point.
Finally, a 30 second overview of actually using LXC with
libvirt to create a simple VPS using busybox in its root fs...
- Create a simple chroot environment using busybox
mkdir /root/mycontainer
mkdir /root/mycontainer/bin
mkdir /root/mycontainer/sbin
cp /sbin/busybox /root/mycontainer/sbin
for cmd in sh ls chdir chmod rm cat vi
do
ln -s /sbin/busybox /root/mycontainer/bin/$cmd
done
cat > /root/mycontainer/sbin/init <<EOF
#!/sbin/busybox sh
sh
EOF
chmod +x /root/mycontainer/sbin/init
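Before handing the tree to libvirt, it can be sanity-checked with plain chroot (assuming busybox is statically linked; the check needs root, so it is skipped gracefully otherwise):

```shell
# Optional sanity check of the busybox tree before giving it to libvirt.
# Assumes a statically linked busybox; needs root, so skip if we aren't.
if [ "$(id -u)" = 0 ] && [ -x /root/mycontainer/sbin/busybox ]; then
  result=$(chroot /root/mycontainer /sbin/busybox sh -c 'echo chroot OK')
else
  result="skipped (needs root and the busybox tree from the steps above)"
fi
echo "$result"
```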
- Create a simple libvirt configuration file for the
container, defining the root filesystem, the network
connection (bridged to br0 in this case), and the
path to the 'init' binary (defaults to /sbin/init if
omitted)
# cat > mycontainer.xml <<EOF
<domain type='lxc'>
<name>mycontainer</name>
<memory>500000</memory>
<os>
<type>exe</type>
<init>/sbin/init</init>
</os>
<devices>
<filesystem type='mount'>
<source dir='/root/mycontainer'/>
<target dir='/'/>
</filesystem>
<interface type='bridge'>
<source bridge='br0'/>
<mac address='00:11:22:34:34:34'/>
</interface>
<console type='pty' />
</devices>
</domain>
EOF
- Load the configuration into libvirt
# virsh --connect lxc:/// define mycontainer.xml
# virsh --connect lxc:/// list --inactive
Id Name State
----------------------------------
- mycontainer shutdown
- Start the VM and query some information about it
# virsh --connect lxc:/// start mycontainer
# virsh --connect lxc:/// list
Id Name State
----------------------------------
28407 mycontainer running
# virsh --connect lxc:/// dominfo mycontainer
Id: 28407
Name: mycontainer
UUID: 8369f1ac-7e46-e869-4ca5-759d51478066
OS Type: exe
State: running
CPU(s): 1
Max memory: 500000 kB
Used memory: 500000 kB
NB. the CPU/memory info here is not enforced yet.
- Interact with the container
# virsh --connect lxc:/// console mycontainer
NB, Ctrl+] to exit when done
- Query the live config - eg to discover what PTY its
console is connected to
# virsh --connect lxc:/// dumpxml mycontainer
<domain type='lxc' id='28407'>
<name>mycontainer</name>
<uuid>8369f1ac-7e46-e869-4ca5-759d51478066</uuid>
<memory>500000</memory>
<currentMemory>500000</currentMemory>
<vcpu>1</vcpu>
<os>
<type arch='i686'>exe</type>
<init>/sbin/init</init>
</os>
<clock offset='utc'/>
<on_poweroff>destroy</on_poweroff>
<on_reboot>restart</on_reboot>
<on_crash>destroy</on_crash>
<devices>
<filesystem type='mount'>
<source dir='/root/mycontainer'/>
<target dir='/'/>
</filesystem>
<console type='pty' tty='/dev/pts/22'>
<source path='/dev/pts/22'/>
<target port='0'/>
</console>
</devices>
</domain>
- Shutdown the container
# virsh --connect lxc:/// destroy mycontainer
There is lots more I could say, but hopefully this serves as
a useful introduction to the LXC work in libvirt and how it
is making use of the kernel's container based virtualization
support. For those interested in finding out more, all the
source is in the libvirt CVS repo, the files being those
named src/lxc_conf.c, src/lxc_container.c, src/lxc_controller.c
and src/lxc_driver.c.
http://libvirt.org/downloads.html
or via the GIT mirror of our CVS repo
git clone git://git.et.redhat.com/libvirt.git
Regards,
Daniel
--
|: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|
_______________________________________________
Containers mailing list
Containers(a)lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers