[libvirt] [FW: An introduction to libvirt's LXC (LinuX Container) support]
by Daniel P. Berrange
FYI, this is a mail I just sent to containers(a)lists.linux-foundation.org,
where all the kernel container developers hang out.
Daniel
----- Forwarded message from "Daniel P. Berrange" <berrange(a)redhat.com> -----
> Date: Wed, 17 Sep 2008 16:06:35 +0100
> From: "Daniel P. Berrange" <berrange(a)redhat.com>
> To: containers(a)lists.linux-foundation.org
> Subject: An introduction to libvirt's LXC (LinuX Container) support
>
> This is a short^H^H^H^H^H long mail to introduce / walk-through some
> recent developments in libvirt to support native Linux hosted
> container virtualization using the kernel capabilities the people
> on this list have been adding in recent releases. We've been working
> on this for a few months now, but haven't really publicised it until
> now, and I figure the people working on container virt extensions
> for Linux might be interested in how it is being used.
>
> For those who aren't familiar with libvirt, it provides a stable API
> for managing virtualization hosts and their guests. It started with
> a Xen driver, and over time has evolved to add support for QEMU, KVM,
> OpenVZ and most recently of all a driver we're calling "LXC" short
> for "LinuX Containers". The key is that no matter what hypervisor
> you are using, there is a consistent set of APIs and a standardized
> configuration format for userspace management applications in the
> host (and remote secure RPC to the host).
>
> The LXC driver is the result of a combined effort from a number of
> people in the libvirt community. Most notably, Dave Leskovec contributed
> the original code, and Dan Smith now leads its development, with my
> own contributions to the architecture to better integrate it with libvirt.
>
> We have a couple of goals in this work. Overall, libvirt wants to be
> the de facto standard, open source management API for all virtualization
> platforms and native Linux virtualization capabilities are a strong
> focus. The LXC driver is attempting to provide a general purpose
> management solution for two container virt use cases:
>
> - Application workload isolation
> - Virtual private servers
>
> In the first use case we want to provide the ability to run an
> application in the primary host OS with partial restrictions on its
> resource / service access. It will still run with the same root
> directory as the host OS, but its filesystem namespace may have
> some additional private mount points present. It may have a
> private network namespace to restrict its connectivity, and it
> will ultimately have restrictions on its resource usage (e.g.
> memory, CPU time, CPU affinity, I/O bandwidth).
>
> In the second use case, we want to provide a completely virtualized
> operating system in the container (running the host kernel of
> course), akin to the capabilities of OpenVZ / Linux-VServer. The
> container will have a totally private root filesystem, private
> networking namespace, whatever other namespace isolation the
> kernel provides, and again resource restrictions. Some people
> like to think of this as 'a better chroot than chroot'.
>
> In terms of technical implementation, at its core is direct usage
> of the new clone() flags. By default all containers get created
> with CLONE_NEWPID, CLONE_NEWNS, CLONE_NEWUTS, CLONE_NEWUSER, and
> CLONE_NEWIPC. If private network config was requested they also
> get CLONE_NEWNET.
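>
> As a rough illustration of the idea - this is a minimal sketch, not the
> actual libvirt code, and the stack size and error handling are
> simplified assumptions - creating a namespaced child with clone()
> looks roughly like this:
>
>   /* Sketch only: clone a child into fresh PID/mount/UTS/user/IPC
>    * namespaces, mirroring the default flag set described above. */
>   #define _GNU_SOURCE
>   #include <sched.h>
>   #include <signal.h>
>   #include <stdio.h>
>   #include <stdlib.h>
>   #include <sys/wait.h>
>   #include <unistd.h>
>
>   static int container_child(void *arg)
>   {
>       /* We are now PID 1 inside the new PID namespace */
>       printf("child sees itself as pid %d\n", (int)getpid());
>       return 0;
>   }
>
>   int main(void)
>   {
>       size_t stacksz = 64 * 1024;
>       char *stack = malloc(stacksz);
>       int flags = CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWUTS |
>                   CLONE_NEWUSER | CLONE_NEWIPC | SIGCHLD;
>       pid_t pid;
>
>       if (!stack)
>           return 1;
>       /* the stack grows down on x86, so pass the top of the allocation */
>       pid = clone(container_child, stack + stacksz, flags, NULL);
>       if (pid < 0) {
>           perror("clone");
>           return 1;
>       }
>       waitpid(pid, NULL, 0);
>       free(stack);
>       return 0;
>   }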
>
> For the workload isolation case, after creating the container we
> just add a number of filesystem mounts in the container's private
> FS namespace. In the VPS case, we'll do a pivot_root() onto the
> new root directory, and then add any extra filesystem mounts the
> container config requested.
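>
> For the VPS case the root switch is conceptually the usual pivot_root
> dance. A minimal sketch, assuming we are already inside the private
> mount namespace; paths and error handling are simplified and this is
> not the exact libvirt sequence:
>
>   #include <sys/mount.h>
>   #include <sys/stat.h>
>   #include <sys/syscall.h>
>   #include <unistd.h>
>
>   static int enter_new_root(const char *new_root)
>   {
>       /* pivot_root wants new_root to be a mount point, so bind it */
>       if (mount(new_root, new_root, NULL, MS_BIND, NULL) < 0)
>           return -1;
>       if (chdir(new_root) < 0)
>           return -1;
>       /* park the old root on a subdirectory, then detach it */
>       if (mkdir(".oldroot", 0700) < 0)   /* tolerate EEXIST in real code */
>           return -1;
>       if (syscall(SYS_pivot_root, ".", ".oldroot") < 0)
>           return -1;
>       if (chdir("/") < 0)
>           return -1;
>       return umount2("/.oldroot", MNT_DETACH);
>   }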
>
> The stdin/out/err of the process leader in the container is bound
> to the slave end of a pseudo-TTY, with libvirt owning the master end
> so it can provide a virtual text console into the guest container.
> Once the basic container setup is complete, libvirt execs the
> so-called 'init' process. Things are thus set up such that when the
> 'init' process exits, the container is terminated / cleaned up.
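>
> The console plumbing amounts to handing the container leader a slave
> PTY for its standard file descriptors. A stripped-down sketch of that
> step (not the actual code from src/lxc_container.c, and skipping the
> controlling-terminal details):
>
>   #define _GNU_SOURCE
>   #include <fcntl.h>
>   #include <stdlib.h>
>   #include <unistd.h>
>
>   static int exec_container_init(const char *init_path)
>   {
>       int master = posix_openpt(O_RDWR | O_NOCTTY);
>       if (master < 0 || grantpt(master) < 0 || unlockpt(master) < 0)
>           return -1;
>
>       int slave = open(ptsname(master), O_RDWR);
>       if (slave < 0)
>           return -1;
>
>       /* inside the container: new session, slave PTY as stdin/out/err */
>       setsid();
>       dup2(slave, STDIN_FILENO);
>       dup2(slave, STDOUT_FILENO);
>       dup2(slave, STDERR_FILENO);
>
>       execl(init_path, init_path, (char *)NULL);
>       return -1;   /* only reached if exec failed */
>   }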
>
> On the host side, the libvirt LXC driver creates what we call a
> 'controller' process for each container. This is done with a small
> binary /usr/libexec/libvirt_lxc. This is the process which owns the
> master end of the pseudo-TTY, along with a second pseudo-TTY pair.
> When the host admin wants to interact with the container, they use
> the command 'virsh console CONTAINER-NAME'. The LXC controller
> process takes care of forwarding I/O between the two slave PTYs,
> one slave opened by virsh console, the other being the container's
> stdin/out/err. If you kill the controller, then the container
> also dies. Basically you can think of the libvirt_lxc controller
> as serving the equivalent purpose to the 'qemu' command for full
> machine virtualization - it provides the interface between host
> and guest, in this case just the container setup and access to
> the text console - perhaps more in the future.
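>
> The forwarding part of the controller is essentially a small poll()
> loop copying bytes between the two descriptors. A sketch of the idea
> (assumed fd names, no error recovery, not the libvirt_lxc code itself):
>
>   #include <poll.h>
>   #include <unistd.h>
>
>   static void forward_console_io(int fd_a, int fd_b)
>   {
>       char buf[1024];
>       struct pollfd fds[2] = {
>           { .fd = fd_a, .events = POLLIN },
>           { .fd = fd_b, .events = POLLIN },
>       };
>
>       for (;;) {
>           if (poll(fds, 2, -1) < 0)
>               return;
>           for (int i = 0; i < 2; i++) {
>               if (!(fds[i].revents & POLLIN))
>                   continue;
>               ssize_t n = read(fds[i].fd, buf, sizeof(buf));
>               if (n <= 0)
>                   return;
>               /* relay to the opposite end */
>               write(i == 0 ? fd_b : fd_a, buf, n);
>           }
>       }
>   }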
>
> For networking, libvirt provides two core concepts
>
> - Shared physical device. A bridge containing one of your
> physical network interfaces on the host, along with one or
> more of the guest vnet interfaces. So the container appears
> as if it's directly on the LAN.
>
> - Virtual network. A bridge containing only guest vnet
> interfaces, and NO physical device from the host. iptables
> and forwarding provide routed (+ optionally NATed)
> connectivity to the LAN for guests.
>
> The latter use case is particularly useful for machines without
> a permanent wired ethernet connection - e.g. laptops using WiFi - as
> it lets guests talk to each other even when there's no active host
> network. Both of these network setups are fully supported in the LXC
> driver in the presence of a suitably new host kernel.
>
> That's a 100ft overview and the current functionality is working
> quite well from an architectural/technical point of view, but there
> is plenty more work we still need to do to provide a system which
> is mature enough for real world production deployment.
>
> - Integration with cgroups. Although I talked about resource
> restrictions, we've not implemented any of this yet. In the
> most immediate timeframe we want to use cgroups' device
> ACL support to prevent the container having any ability to
> access device nodes other than the usual suspects of
> /dev/{null,full,zero,console}, and possibly /dev/urandom
> (see the sketch just after this list). The other important
> one is to provide a memory cap across the entire container.
> CPU based resource control is lower priority at the moment.
>
> - Efficient query of resource utilization. We need to be able
> to get the cumulative CPU time of all the processes inside
> the container, without having to iterate over every PID's
> /proc/$PID/stat file. I'm not sure how we'll do this yet.
> We want this data both summed across all CPUs and per-CPU.
>
> - devpts virtualization. libvirt currently just bind mounts the
> host's /dev/pts into the container. Clearly this isn't a
> serious impl. We've been monitoring the devpts namespace
> patches and these look like they will provide the capabilities
> we need for the full virtual private server use case.
>
> - network sysfs virtualization. libvirt can't currently use the
> CLONE_NEWNET flag in most Linux distros, since the currently
> released kernel has this capability conflicting with SYSFS in
> Kconfig. Again we're looking forward to seeing this addressed
> in the next kernel.
>
> - UID/GID virtualization. While we spawn all containers as root,
> applications inside the container may switch to unprivileged
> UIDs. We don't (necessarily) want users in the host with
> equivalent UIDs to be able to kill processes inside the
> container. It would also be desirable to allow unprivileged
> users to create containers without needing root on the host,
> while allowing them to be root & any other user inside their
> container. I'm not aware of anyone working on this kind of
> thing yet - is anyone?
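>
> To make the device ACL item above concrete, here is a sketch of how
> the cgroup device controller is driven - whitelist rules written into
> devices.deny / devices.allow. The cgroup directory path is an
> assumption (it depends where the host mounts the cgroup filesystem),
> and this is illustrative code, not libvirt's:
>
>   #include <fcntl.h>
>   #include <stdio.h>
>   #include <string.h>
>   #include <unistd.h>
>
>   static int cgroup_write(const char *path, const char *val)
>   {
>       int fd = open(path, O_WRONLY);
>       if (fd < 0)
>           return -1;
>       ssize_t r = write(fd, val, strlen(val));
>       close(fd);
>       return r < 0 ? -1 : 0;
>   }
>
>   static int restrict_devices(const char *cgroup_dir)
>   {
>       /* the usual suspects: null, zero, full, random, urandom,
>        * tty, console, ptmx (char major:minor, rwm = read/write/mknod) */
>       static const char *allowed[] = {
>           "c 1:3 rwm", "c 1:5 rwm", "c 1:7 rwm", "c 1:8 rwm",
>           "c 1:9 rwm", "c 5:0 rwm", "c 5:1 rwm", "c 5:2 rwm",
>       };
>       char path[256];
>       unsigned i;
>
>       /* start from "deny everything" ... */
>       snprintf(path, sizeof(path), "%s/devices.deny", cgroup_dir);
>       if (cgroup_write(path, "a") < 0)
>           return -1;
>       /* ... then punch holes for the allowed nodes */
>       snprintf(path, sizeof(path), "%s/devices.allow", cgroup_dir);
>       for (i = 0; i < sizeof(allowed) / sizeof(allowed[0]); i++)
>           if (cgroup_write(path, allowed[i]) < 0)
>               return -1;
>       return 0;
>   }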
>
> There are probably more things Dan Smith is thinking of, but that
> list is a good starting point.
>
> Finally, a 30 second overview of actually using LXC with
> libvirt to create a simple VPS using busybox in its root fs...
>
> - Create a simple chroot environment using busybox
>
> mkdir /root/mycontainer
> mkdir /root/mycontainer/bin
> mkdir /root/mycontainer/sbin
> cp /sbin/busybox /root/mycontainer/sbin
> for cmd in sh ls chdir chmod rm cat vi
> do
> ln -s ../sbin/busybox /root/mycontainer/bin/$cmd
> done
> cat > /root/mycontainer/sbin/init <<EOF
> #!/sbin/busybox
> sh
> EOF
> chmod +x /root/mycontainer/sbin/init
>
>
> - Create a simple libvirt configuration file for the
> container, defining the root filesystem, the network
> connection (bridged to br0 in this case), and the
> path to the 'init' binary (defaults to /sbin/init if
> omitted)
>
> # cat > mycontainer.xml <<EOF
> <domain type='lxc'>
>   <name>mycontainer</name>
>   <memory>500000</memory>
>   <os>
>     <type>exe</type>
>     <init>/sbin/init</init>
>   </os>
>   <devices>
>     <filesystem type='mount'>
>       <source dir='/root/mycontainer'/>
>       <target dir='/'/>
>     </filesystem>
>     <interface type='bridge'>
>       <source bridge='br0'/>
>       <mac address='00:11:22:34:34:34'/>
>     </interface>
>     <console type='pty'/>
>   </devices>
> </domain>
> EOF
>
> - Load the configuration into libvirt
>
> # virsh --connect lxc:/// define mycontainer.xml
> # virsh --connect lxc:/// list --inactive
>  Id    Name                 State
> ----------------------------------
>   -    mycontainer          shutdown
>
>
>
> - Start the VM and query some information about it
>
> # virsh --connect lxc:/// start mycontainer
> # virsh --connect lxc:/// list
>  Id    Name                 State
> ----------------------------------
> 28407  mycontainer          running
>
> # virsh --connect lxc:/// dominfo mycontainer
> Id:             28407
> Name:           mycontainer
> UUID:           8369f1ac-7e46-e869-4ca5-759d51478066
> OS Type:        exe
> State:          running
> CPU(s):         1
> Max memory:     500000 kB
> Used memory:    500000 kB
>
>
> NB: the CPU/memory info here is not enforced yet.
>
> - Interact with the container
>
> # virsh --connect lxc:/// console mycontainer
>
> NB: press Ctrl+] to exit when done
>
> - Query the live config - e.g. to discover which PTY its
> console is connected to
>
>
> # virsh --connect lxc:/// dumpxml mycontainer
> <domain type='lxc' id='28407'>
>   <name>mycontainer</name>
>   <uuid>8369f1ac-7e46-e869-4ca5-759d51478066</uuid>
>   <memory>500000</memory>
>   <currentMemory>500000</currentMemory>
>   <vcpu>1</vcpu>
>   <os>
>     <type arch='i686'>exe</type>
>     <init>/sbin/init</init>
>   </os>
>   <clock offset='utc'/>
>   <on_poweroff>destroy</on_poweroff>
>   <on_reboot>restart</on_reboot>
>   <on_crash>destroy</on_crash>
>   <devices>
>     <filesystem type='mount'>
>       <source dir='/root/mycontainer'/>
>       <target dir='/'/>
>     </filesystem>
>     <console type='pty' tty='/dev/pts/22'>
>       <source path='/dev/pts/22'/>
>       <target port='0'/>
>     </console>
>   </devices>
> </domain>
>
> - Shut down the container
>
> # virsh --connect lxc:/// destroy mycontainer
>
> There is lots more I could say, but hopefully this serves as
> a useful introduction to the LXC work in libvirt and how it
> is making use of the kernel's container-based virtualization
> support. For those interested in finding out more, all the
> source is in the libvirt CVS repo, the files being those
> named src/lxc_conf.c, src/lxc_container.c, src/lxc_controller.c
> and src/lxc_driver.c.
>
> http://libvirt.org/downloads.html
>
> or via the GIT mirror of our CVS repo
>
> git clone git://git.et.redhat.com/libvirt.git
>
> Regards,
> Daniel
> --
> |: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :|
> |: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :|
> |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
> |: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|
> _______________________________________________
> Containers mailing list
> Containers(a)lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/containers
>
----- End forwarded message -----
--
|: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|
[libvirt] [PATCH] Improved error messages in XM conf module
by Richard W.M. Jones
The attached patch improves error handling in the XM config file
parser (src/conf.c).
Currently it has a custom error function called virConfError which has
three problems. Firstly, the conf argument is ignored and therefore
pointless to even pass. Secondly, the function takes a line number
parameter (for reporting the line number where parsing failed), but
this is swallowed and not printed in error messages. Thirdly, and
most importantly, the name of the file where the error occurs is not
printed by default unless the caller happens to print it.
If there is an _empty_ file in /etc/xen we get this error:
# virsh list --all
libvir: error : failed to read configuration file /etc/xen/foobar
but if the spurious file under /etc/xen is non-empty, like a script,
you get completely anonymous errors such as:
libvir: error : configuration file syntax error: expecting an assignment
or:
libvir: error : configuration file syntax error: expecting a value
The patch fixes this by printing out the filename and line number if
these are available from the parser context (and the parser context is
passed to virConfError instead of the unused virConfPtr). With this
patch you'll get errors for the second case like this:
# virsh list --inactive
libvir: error : /etc/xen/foobar:1: expecting a value
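For illustration, the shape of the change is roughly as below. This is a
minimal sketch, not the actual patch: the context struct and field names
are assumptions, and the real code reports through libvirt's internal
error machinery rather than fprintf.

  #include <stdio.h>

  typedef struct {
      const char *filename;    /* NULL when parsing from memory */
      int line;                /* current line in the parser */
  } ConfParserCtxt;            /* stand-in name, not the real libvirt type */

  static void
  confError(ConfParserCtxt *ctxt, const char *msg)
  {
      if (ctxt && ctxt->filename)
          fprintf(stderr, "libvir: error : %s:%d: %s\n",
                  ctxt->filename, ctxt->line, msg);
      else
          fprintf(stderr, "libvir: error : %s\n", msg);
  }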
Rich.
--
Richard Jones, Emerging Technologies, Red Hat http://et.redhat.com/~rjones
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine. Supports Linux and Windows.
http://et.redhat.com/~rjones/virt-df/
[libvirt] [PATCH 0/3]: Cleanup LVM pool
by Chris Lalancette
All,
The following patch series is a set of (simple) cleanups for the
storage_backend_logical stuff. Most of it is uncontroversial, except for one
bit in the 3'rd patch (which I'll point out there). Please review.
Thanks,
--
Chris Lalancette
[libvirt] [PATCH] rename "blocked" to "idle"
by John Levon
<movement> can we please please rename "blocked" to "idle"
<movement> literally /everybody/ is confused by that
...
<dansmith> I'm not opposed to changing it, mind you, because I feel the
indirect pain of explaining it to others too :)
<movement> actually, normally, the question is "my guest isn't running"
<dansmith> yeah
regards
john
Index: src/virsh.c
===================================================================
RCS file: /data/cvs/libvirt/src/virsh.c,v
retrieving revision 1.163
diff -r1.163 virsh.c
6275c6275
< return gettext_noop("blocked");
---
> return gettext_noop("idle");
6297c6297
< return gettext_noop("blocked");
---
> return gettext_noop("idle");
[libvirt] [PATCH] Determine kvm max vcpus via version number
by Cole Robinson
The attached patch is a slimmed down version of a patch
I posted a while back. This expands qemu help message
parsing to look for a kvm version number, which can be
used to determine maximum supported vcpus.
A kvmVersion field is added to the qemu_driver structure,
and a check to determine the version is added to the
libvirtd start up routine. If the kvm version isn't found
(say if kvm isn't installed), kvmVersion is set to 0.
This is against Guido Gunther's patch "maxVCPU runtime
detection": his method takes precendence in the code
if it's available.
Comments welcome.
Thanks,
Cole
diff --git a/src/qemu_conf.c b/src/qemu_conf.c
index 22a99ec..f79137f 100644
--- a/src/qemu_conf.c
+++ b/src/qemu_conf.c
@@ -418,13 +418,14 @@ virCapsPtr qemudCapsInit(void) {
int qemudExtractVersionInfo(const char *qemu,
unsigned int *retversion,
+ unsigned int *retkvmversion,
unsigned int *retflags) {
const char *const qemuarg[] = { qemu, "-help", NULL };
const char *const qemuenv[] = { "LC_ALL=C", NULL };
pid_t child;
int newstdout = -1;
- int ret = -1, status;
- unsigned int major, minor, micro;
+ int ret = -1, scanret = -1, status;
+ unsigned int major, minor, micro, kvmver;
unsigned int version;
unsigned int flags = 0;
@@ -443,10 +444,12 @@ int qemudExtractVersionInfo(const char *qemu,
if (len < 0)
goto cleanup2;
- if (sscanf(help, "QEMU PC emulator version %u.%u.%u",
> - &major, &minor, &micro) != 3) {
+ scanret = sscanf(help, "QEMU PC emulator version %u.%u.%u (kvm-%u",
> + &major, &minor, &micro, &kvmver);
+ if (scanret == 3)
+ kvmver = 0;
+ else if (scanret != 4)
goto cleanup2;
- }
version = (major * 1000 * 1000) + (minor * 1000) + micro;
@@ -465,6 +468,8 @@ int qemudExtractVersionInfo(const char *qemu,
if (retversion)
*retversion = version;
+ if (retkvmversion)
+ *retkvmversion = kvmver;
if (retflags)
*retflags = flags;
@@ -472,6 +477,7 @@ int qemudExtractVersionInfo(const char *qemu,
qemudDebug("Version %d %d %d Cooked version: %d, with flags ? %d",
major, minor, micro, version, flags);
+ qemudDebug("KVM version: %d", kvmver);
cleanup2:
VIR_FREE(help);
@@ -500,28 +506,34 @@ rewait:
return ret;
}
-int qemudExtractVersion(virConnectPtr conn,
- struct qemud_driver *driver) {
+int qemudExtractVersion(struct qemud_driver *driver) {
const char *binary;
struct stat sb;
+ struct utsname ut;
if (driver->qemuVersion > 0)
return 0;
+ uname (&ut);
+
if ((binary = virCapabilitiesDefaultGuestEmulator(driver->caps,
"hvm",
+ ut.machine,
+ "kvm")) == NULL &&
+ (binary = virCapabilitiesDefaultGuestEmulator(driver->caps,
+ "hvm",
"i686",
"qemu")) == NULL)
return -1;
if (stat(binary, &sb) < 0) {
- qemudReportError(conn, NULL, NULL, VIR_ERR_INTERNAL_ERROR,
- _("Cannot find QEMU binary %s: %s"), binary,
- strerror(errno));
return -1;
}
- if (qemudExtractVersionInfo(binary, &driver->qemuVersion, NULL) < 0) {
+ if (qemudExtractVersionInfo(binary,
+ &driver->qemuVersion,
+ &driver->kvmVersion,
+ NULL) < 0) {
return -1;
}
diff --git a/src/qemu_conf.h b/src/qemu_conf.h
index 88dfade..8d429fb 100644
--- a/src/qemu_conf.h
+++ b/src/qemu_conf.h
@@ -50,6 +50,7 @@ enum qemud_cmd_flags {
/* Main driver state */
struct qemud_driver {
unsigned int qemuVersion;
+ unsigned int kvmVersion;
int nextvmid;
virDomainObjPtr domains;
@@ -83,10 +84,10 @@ int qemudLoadDriverConfig(struct qemud_driver *driver,
virCapsPtr qemudCapsInit (void);
-int qemudExtractVersion (virConnectPtr conn,
- struct qemud_driver *driver);
+int qemudExtractVersion (struct qemud_driver *driver);
int qemudExtractVersionInfo (const char *qemu,
unsigned int *version,
+ unsigned int *kvmversion,
unsigned int *flags);
int qemudBuildCommandLine (virConnectPtr conn,
diff --git a/src/qemu_driver.c b/src/qemu_driver.c
index 0c04da6..d1b9d7c 100644
--- a/src/qemu_driver.c
+++ b/src/qemu_driver.c
@@ -240,6 +240,12 @@ qemudStartup(void) {
if ((qemu_driver->caps = qemudCapsInit()) == NULL)
goto out_of_memory;
+ // Dependent on capabilities being initialized
+ if (qemudExtractVersion(qemu_driver) < 0) {
+ qemudShutdown();
+ return -1;
+ }
+
if (qemudLoadDriverConfig(qemu_driver, driverConf) < 0) {
qemudShutdown();
return -1;
@@ -923,7 +929,7 @@ static int qemudStartVMDaemon(virConnectPtr conn,
}
if (qemudExtractVersionInfo(vm->def->emulator,
- NULL,
+ NULL, NULL,
&qemuCmdFlags) < 0) {
qemudReportError(conn, NULL, NULL, VIR_ERR_INTERNAL_ERROR,
_("Cannot determine QEMU argv syntax %s"),
@@ -1793,11 +1799,20 @@ static const char *qemudGetType(virConnectPtr conn ATTRIBUTE_UNUSED) {
}
-static int kvmGetMaxVCPUs(void) {
- int maxvcpus = 1;
+static int kvmGetMaxVCPUs(virConnectPtr conn) {
+
+ struct qemud_driver *driver = (struct qemud_driver *)conn->privateData;
+ int maxvcpus = 1, r, fd;
+
+ // KVM-30 added support for up to 4 vcpus
+ // KVM-62 raised this to 16
+ if (driver->kvmVersion < 30)
+ maxvcpus = 1;
+ else if (driver->kvmVersion < 62)
+ maxvcpus = 4;
+ else
+ maxvcpus = 16;
- int r, fd;
-
fd = open(KVM_DEVICE, O_RDONLY);
if (fd < 0) {
qemudLog(QEMUD_WARN, _("Unable to open " KVM_DEVICE ": %s\n"), strerror(errno));
@@ -1820,10 +1835,8 @@ static int qemudGetMaxVCPUs(virConnectPtr conn, const char *type) {
if (STRCASEEQ(type, "qemu"))
return 16;
- /* XXX future KVM will support SMP. Need to probe
- kernel to figure out KVM module version i guess */
if (STRCASEEQ(type, "kvm"))
- return kvmGetMaxVCPUs();
+ return kvmGetMaxVCPUs(conn);
if (STRCASEEQ(type, "kqemu"))
return 1;
@@ -1993,7 +2006,7 @@ static virDomainPtr qemudDomainLookupByName(virConnectPtr conn,
static int qemudGetVersion(virConnectPtr conn, unsigned long *version) {
struct qemud_driver *driver = (struct qemud_driver *)conn->privateData;
- if (qemudExtractVersion(conn, driver) < 0)
+ if (qemudExtractVersion(driver) < 0)
return -1;
*version = qemu_driver->qemuVersion;
@@ -3035,7 +3048,7 @@ static int qemudDomainChangeEjectableMedia(virDomainPtr dom,
}
if (qemudExtractVersionInfo(vm->def->emulator,
- NULL,
+ NULL, NULL,
&qemuCmdFlags) < 0) {
qemudReportError(dom->conn, dom, NULL, VIR_ERR_INTERNAL_ERROR,
_("Cannot determine QEMU argv syntax %s"),
[libvirt] [PATCH] Don't remove devel files in spec
by Cole Robinson
The second iteration of the spec file enhancements
didn't fully remove some pieces that were dependent
on the devel package switch. The attached patch fixes
'make rpm' to work again.
Thanks,
Cole
[libvirt] cpu flags
by Ben Guthro
Hi,
We're finding that we are going to be needing the cpu flags (as reported
in /proc/cpuinfo)
...specifically to find out if we are on a VMX-enabled machine.
So - off I went looking into this for a patch to submit upstream.
Unfortunately, I ran into some questions which need answering before I
really proceed with this.
It seems to me that this info would best be parsed in src/nodeinfo.c.
This is where other cpuinfo things are parsed... and stored in the
nodeinfo struct.
Perhaps we store this as a bitmask-encoded int, as defined in
/usr/include/asm/cpufeature.h, and tack this onto the end of said struct.
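As a rough sketch of the parsing side (independent of how the result would
be exposed through the API, and not an actual libvirt patch), checking for
the hardware virt flags could look like this:

  #include <stdio.h>
  #include <string.h>

  /* Scan /proc/cpuinfo for the x86 hardware virtualization flags. */
  static int host_has_hw_virt(void)
  {
      char line[1024];
      FILE *fp = fopen("/proc/cpuinfo", "r");
      int found = 0;

      if (!fp)
          return 0;
      while (fgets(line, sizeof(line), fp)) {
          if (strncmp(line, "flags", 5) != 0)
              continue;
          if (strstr(line, " vmx") || strstr(line, " svm")) {
              found = 1;    /* Intel VT-x or AMD-V present */
              break;
          }
      }
      fclose(fp);
      return found;
  }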
My concern is that adding to the nodeinfo struct breaks the API - such
that the structs will be different sizes between versions.
Also - this seems to be x86-specific. Are we primarily destined for x86?
Or would this type of change be unacceptable due to not working on PPC,
for example?
Thoughts?
Ben