[libvirt] [PATCH 1/2] lxc: Add virCgroupSetOwner()

Add a new helper function to change the permissions of a control group. This function is needed for user namespaces, we need to chmod() the cgroup to the initial uid/gid such that systemd is allowed to use the cgroup. Signed-off-by: Richard Weinberger <richard@nod.at> --- src/libvirt_private.syms | 1 + src/lxc/lxc_cgroup.c | 7 +++++++ src/util/vircgroup.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++ src/util/vircgroup.h | 2 ++ 4 files changed, 56 insertions(+) diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index 2c9536a..40e72f2 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -1056,6 +1056,7 @@ virCgroupSetMemory; virCgroupSetMemoryHardLimit; virCgroupSetMemorySoftLimit; virCgroupSetMemSwapHardLimit; +virCgroupSetOwner; virCgroupSupportsCpuBW; diff --git a/src/lxc/lxc_cgroup.c b/src/lxc/lxc_cgroup.c index cc0d5e8..de93528 100644 --- a/src/lxc/lxc_cgroup.c +++ b/src/lxc/lxc_cgroup.c @@ -484,6 +484,13 @@ virCgroupPtr virLXCCgroupCreate(virDomainDefPtr def) &cgroup) < 0) goto cleanup; + /* setup control group permissions for user namespace */ + if (def->idmap.uidmap) { + if (virCgroupSetOwner(cgroup, def->idmap.uidmap[0].target, + def->idmap.gidmap[0].target)) + goto cleanup; + } + cleanup: return cgroup; } diff --git a/src/util/vircgroup.c b/src/util/vircgroup.c index a6d60c5..b2666e8 100644 --- a/src/util/vircgroup.c +++ b/src/util/vircgroup.c @@ -3252,6 +3252,52 @@ cleanup: return ret; } +int virCgroupSetOwner(virCgroupPtr cgroup, uid_t uid, gid_t gid) { + size_t i; + + for (i = 0; i < VIR_CGROUP_CONTROLLER_LAST; i++) { + char *base, *entry; + DIR *dh; + struct dirent *de; + + if (!cgroup->controllers[i].mountPoint) + continue; + + if (virAsprintf(&base, "%s%s", cgroup->controllers[i].mountPoint, + cgroup->controllers[i].placement) < 0) { + virReportOOMError(); + return -1; + } + + dh = opendir(base); + while ((de = readdir(dh)) != NULL) { + if (STREQ(de->d_name, ".") || + STREQ(de->d_name, "..")) + continue; + + if (virAsprintf(&entry, "%s/%s", base, de->d_name) < 0) { + VIR_FREE(base); + virReportOOMError(); + } + + if (chown(entry, uid, gid) < 0) + virReportSystemError(errno, _("cannot chown '%s' to (%u, %u)"), + entry, uid, gid); + + VIR_FREE(entry); + } + closedir(dh); + + if (chown(base, uid, gid) < 0) + virReportSystemError(errno, _("cannot chown '%s' to (%u, %u)"), + base, uid, gid); + + VIR_FREE(base); + } + + return 0; +} + /** * virCgroupSupportsCpuBW(): diff --git a/src/util/vircgroup.h b/src/util/vircgroup.h index a70eb18..6e00f28 100644 --- a/src/util/vircgroup.h +++ b/src/util/vircgroup.h @@ -225,4 +225,6 @@ int virCgroupIsolateMount(virCgroupPtr group, bool virCgroupSupportsCpuBW(virCgroupPtr cgroup); +int virCgroupSetOwner(virCgroupPtr cgroup, uid_t uid, gid_t gid); + #endif /* __VIR_CGROUP_H__ */ -- 1.8.4.5

Due to security concerns we delegate only VIR_CGROUP_CONTROLLER_SYSTEMD to containers. Currently it is not safe to allow a container access to a resource controller. Signed-off-by: Richard Weinberger <richard@nod.at> --- src/lxc/lxc_container.c | 3 ++- src/util/vircgroup.c | 5 ++++- src/util/vircgroup.h | 3 ++- 3 files changed, 8 insertions(+), 3 deletions(-) diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c index c6bdc8c..abd2db4 100644 --- a/src/lxc/lxc_container.c +++ b/src/lxc/lxc_container.c @@ -1654,7 +1654,8 @@ static int lxcContainerSetupPivotRoot(virDomainDefPtr vmDef, /* Now we can re-mount the cgroups controllers in the * same configuration as before */ - if (virCgroupIsolateMount(cgroup, "/.oldroot/", sec_mount_options) < 0) + if (virCgroupIsolateMount(cgroup, "/.oldroot/", sec_mount_options, + (1 << VIR_CGROUP_CONTROLLER_SYSTEMD)) < 0) goto cleanup; /* Mounts /dev */ diff --git a/src/util/vircgroup.c b/src/util/vircgroup.c index b2666e8..c75de9f 100644 --- a/src/util/vircgroup.c +++ b/src/util/vircgroup.c @@ -3166,7 +3166,7 @@ virCgroupGetFreezerState(virCgroupPtr group, char **state) int virCgroupIsolateMount(virCgroupPtr group, const char *oldroot, - const char *mountopts) + const char *mountopts, int controllers) { int ret = -1; size_t i; @@ -3197,6 +3197,9 @@ virCgroupIsolateMount(virCgroupPtr group, const char *oldroot, } for (i = 0; i < VIR_CGROUP_CONTROLLER_LAST; i++) { + if (!((1 << i) & controllers)) + continue; + if (!group->controllers[i].mountPoint) continue; diff --git a/src/util/vircgroup.h b/src/util/vircgroup.h index 6e00f28..c005d28 100644 --- a/src/util/vircgroup.h +++ b/src/util/vircgroup.h @@ -221,7 +221,8 @@ int virCgroupKillPainfully(virCgroupPtr group); int virCgroupIsolateMount(virCgroupPtr group, const char *oldroot, - const char *mountopts); + const char *mountopts, + int controllers); bool virCgroupSupportsCpuBW(virCgroupPtr cgroup); -- 1.8.4.5

On Tue, Feb 11, 2014 at 11:51:26PM +0100, Richard Weinberger wrote:
Due to security concerns we delegate only VIR_CGROUP_CONTROLLER_SYSTEMD to containers. Currently it is not safe to allow a container access to a resource controller.
We *do* want to allow all controllers to be visible to the container. eg it is valid for them to have read access to view things like block I/O and CPU accounting information. We just don't want to make it writable for usernamespaces. I've adjusted your first patch and reposted with you on CC, if you can confirm it works for you. Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

Am 13.02.2014 18:16, schrieb Daniel P. Berrange:
On Tue, Feb 11, 2014 at 11:51:26PM +0100, Richard Weinberger wrote:
Due to security concerns we delegate only VIR_CGROUP_CONTROLLER_SYSTEMD to containers. Currently it is not safe to allow a container access to a resource controller.
We *do* want to allow all controllers to be visible to the container. eg it is valid for them to have read access to view things like block I/O and CPU accounting information. We just don't want to make it writable for usernamespaces.
Okay. But what if one does not enable user namespaces? Then the controllers are writable within the container.
I've adjusted your first patch and reposted with you on CC, if you can confirm it works for you.
Will test it today. :) Thanks, //richard

On Fri, Feb 14, 2014 at 08:49:07AM +0100, Richard Weinberger wrote:
Am 13.02.2014 18:16, schrieb Daniel P. Berrange:
On Tue, Feb 11, 2014 at 11:51:26PM +0100, Richard Weinberger wrote:
Due to security concerns we delegate only VIR_CGROUP_CONTROLLER_SYSTEMD to containers. Currently it is not safe to allow a container access to a resource controller.
We *do* want to allow all controllers to be visible to the container. eg it is valid for them to have read access to view things like block I/O and CPU accounting information. We just don't want to make it writable for usernamespaces.
Okay. But what if one does not enable user namespaces? Then the controllers are writable within the container.
If you don't enable user namespaces, then containers should be considered insecure unless all processes run non-root and all your filesystems are mounted no-setuid to prevent escalation fo privileges back to root, or you have SELinux applying controls. So once ypou have the requirement that security depends on being non-root then the cgroups are no longer writable, except when your consider is already insecure for other reasons. Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

Am 14.02.2014 11:30, schrieb Daniel P. Berrange:
On Fri, Feb 14, 2014 at 08:49:07AM +0100, Richard Weinberger wrote:
Am 13.02.2014 18:16, schrieb Daniel P. Berrange:
On Tue, Feb 11, 2014 at 11:51:26PM +0100, Richard Weinberger wrote:
Due to security concerns we delegate only VIR_CGROUP_CONTROLLER_SYSTEMD to containers. Currently it is not safe to allow a container access to a resource controller.
We *do* want to allow all controllers to be visible to the container. eg it is valid for them to have read access to view things like block I/O and CPU accounting information. We just don't want to make it writable for usernamespaces.
Okay. But what if one does not enable user namespaces? Then the controllers are writable within the container.
If you don't enable user namespaces, then containers should be considered insecure unless all processes run non-root and all your filesystems are mounted no-setuid to prevent escalation fo privileges back to root, or you have SELinux applying controls.
Yeah, I hope all users know that too. Do you plan to support non-user namespace container in future? Maybe one should communicate this to docker.io folks as well. *scnr*
So once ypou have the requirement that security depends on being non-root then the cgroups are no longer writable, except when your consider is already insecure for other reasons.
Yep. Thanks, //richard

On Fri, Feb 14, 2014 at 12:11:13PM +0100, Richard Weinberger wrote:
Am 14.02.2014 11:30, schrieb Daniel P. Berrange:
On Fri, Feb 14, 2014 at 08:49:07AM +0100, Richard Weinberger wrote:
Am 13.02.2014 18:16, schrieb Daniel P. Berrange:
On Tue, Feb 11, 2014 at 11:51:26PM +0100, Richard Weinberger wrote:
Due to security concerns we delegate only VIR_CGROUP_CONTROLLER_SYSTEMD to containers. Currently it is not safe to allow a container access to a resource controller.
We *do* want to allow all controllers to be visible to the container. eg it is valid for them to have read access to view things like block I/O and CPU accounting information. We just don't want to make it writable for usernamespaces.
Okay. But what if one does not enable user namespaces? Then the controllers are writable within the container.
If you don't enable user namespaces, then containers should be considered insecure unless all processes run non-root and all your filesystems are mounted no-setuid to prevent escalation fo privileges back to root, or you have SELinux applying controls.
Yeah, I hope all users know that too. Do you plan to support non-user namespace container in future?
Maybe one should communicate this to docker.io folks as well. *scnr*
Yep, I've gone into this in much detail with Red Hat folks who are working with Docker on their container impl, so they at least know the risks in what they're going.... Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
participants (2)
-
Daniel P. Berrange
-
Richard Weinberger