On Thu, Oct 16, 2008 at 02:07:20PM +0100, Daniel P. Berrange wrote:
On Fri, Oct 03, 2008 at 08:40:24AM -0700, Dan Smith wrote:
> This patch adds code to the controller to set up a cgroup named after the
> domain name, set the memory limit, and restrict devices. It also
> adds bits to lxc_driver to properly clean up the cgroup on domain death.
The device whitelisting is all very nice, but we completely forgot / ignored
the fact that there's nothing stopping a container mounting the cgroups
device controller and giving itself the device access we just took away :-)
The kernel code says
/*
 * Modify the whitelist using allow/deny rules.
 * CAP_SYS_ADMIN is needed for this.  It's at least separate from CAP_MKNOD
 * so we can give a container CAP_MKNOD to let it create devices but not
 * modify the whitelist.
 * It seems likely we'll want to add a CAP_CONTAINER capability to allow
 * us to also grant CAP_SYS_ADMIN to containers without giving away the
 * device whitelist controls, but for now we'll stick with CAP_SYS_ADMIN
 *
 * Taking rules away is always allowed (given CAP_SYS_ADMIN).  Granting
 * new access is only allowed if you're in the top-level cgroup, or your
 * parent cgroup has the access you're asking for.
 */
That last paragraph actually suggests another possible approach which won't
require messing with capabilities.
Consider a hierarchy of cgroups
  $ROOT-CGROUP
    |
   ...
    |
    +- libvirtd
         |
         +- lxc
             |
             +- $VM-NAME
             +- $VM-NAME
             +- $VM-NAME
libvirtd itself can be in any cgroup, be it the root one or a child of
the root. If libvirtd is in the root it can grant whatever it likes to
cgroups for guests. If libvirtd is not in the root, then assuming its parent
has access to the device it can also grant devices as needed.
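
To make the rule from that comment concrete, here is a toy model of it in
Python. This is only a sketch of my reading of the comment, not the kernel
code: the Cgroup class, its method names and the plain-string device entries
are all invented for illustration, and real whitelist entries carry a type,
major:minor numbers (with wildcards) and access bits.

```python
class Cgroup:
    """Toy stand-in for a devices cgroup (hypothetical, not a kernel API)."""

    def __init__(self, name, parent=None, allowed=()):
        self.name = name
        self.parent = parent          # None means the top-level cgroup
        self.allowed = set(allowed)   # simplified whitelist of device strings

    def deny(self, dev):
        # Taking access away is always allowed.
        self.allowed.discard(dev)

    def allow(self, dev):
        # Granting is allowed in the top-level cgroup, or when the parent
        # cgroup already has the access being asked for.
        if self.parent is not None and dev not in self.parent.allowed:
            raise PermissionError(
                "%s: parent %s lacks %s" % (self.name, self.parent.name, dev))
        self.allowed.add(dev)


# libvirtd in a child of the root: grants work only for what the parent has.
root = Cgroup("root", allowed={"c 1:3 rwm"})
libvirtd = Cgroup("libvirtd", parent=root)
libvirtd.allow("c 1:3 rwm")       # fine, root has it
```

In this model, asking for a device the parent does not hold raises
PermissionError, which is the property the rest of this mail relies on.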
To prevent an LXC container from giving itself device access we need to
make sure either that it doesn't have CAP_SYS_ADMIN, or that its
parent cgroup doesn't have access. Fortunately we already have a cgroup
in the hierarchy between libvirtd's cgroup & the one the VM sits in,
ie the cgroup named after the driver - 'lxc' in this case. No process
ever lives in this group - we're just using it for filesystem namespace
uniqueness.
So if that source code comment is correct, all we need to do is set a
deny-all rule in that intermediate 'lxc' cgroup, and then containers
will not be able to get access back, even if they have CAP_SYS_ADMIN.
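
As a rough sketch of what setting that rule amounts to: the devices
controller exposes a devices.deny file in each cgroup directory, and writing
"a" to it removes all entries from the whitelist. The helper name and the
example path below are made up for illustration, the real mount point depends
on the host, and the actual change would of course be done in C inside the
driver.

```python
import os


def deny_all_devices(cgroup_dir):
    """Write the deny-all rule to devices.deny in the given cgroup directory.

    cgroup_dir is assumed to be the directory of the intermediate 'lxc'
    cgroup under wherever the devices controller is mounted.
    """
    path = os.path.join(cgroup_dir, "devices.deny")
    with open(path, "w") as f:
        f.write("a")   # 'a' matches all device types, numbers and access modes
    return path
```

Usage would look something like deny_all_devices("/cgroup/libvirtd/lxc"),
where that path is purely hypothetical. Whether the kernel then really
refuses later grants from inside the container still depends on the comment
above matching the actual behaviour.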
Daniel
--
|: Red Hat, Engineering, London   -o-   http://people.redhat.com/berrange/ :|
|: http://libvirt.org  -o-  http://virt-manager.org  -o-  http://ovirt.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|