[libvirt] [RFC PATCH 0/6] Add user namespace support for libvirt lxc

This patchset try to add userns support for libvirt lxc. Since userns is nearly completed in linux-3.9, the old kernel doesn't support userns, I add some New XML elements to let people decide if enable userns.The userns is disabled by default. And because the uninit userns has no right to create devices, so we should create devices for container on host. This patch alse changes the owner of fuse and tty device. Cgroupfs is unavailable in userns now,so don't mount cgroupfs when we enable userns. Gao feng (6): LXC: New XML element for user namespace LXC: introduce virLXCControllerSetupUserns and lxcContainerSetUserns LXC: only mount cgroupfs when userns is disabled LXC: Creating devices for container on host side LXC: create tty device with proper permission for container LXC: fuse: Change files owner to the root user of container docs/formatdomain.html.in | 20 +++++- docs/schemas/domaincommon.rng | 36 ++++++++++ src/conf/domain_conf.c | 36 ++++++++++ src/conf/domain_conf.h | 21 ++++++ src/lxc/lxc_container.c | 122 ++++++++++++++++---------------- src/lxc/lxc_controller.c | 157 +++++++++++++++++++++++++++++++++++++++++- src/lxc/lxc_fuse.c | 6 ++ 7 files changed, 333 insertions(+), 65 deletions(-) -- 1.7.11.7

This patch introduces three new elements in <os> for user namespace. for example <os> <userns enabled='yes'/> <uidmap first='0' low_first='1000' count='10'/> <gidmap first='0' low_first='1000' count='10'/> </os> this new element userns is used for controlling if enable userns for the domain. the other two elements uidmap and gidmap are used for setting proc files /proc/<pid>/{uid_map,gid_map}. Since user namespace is not very complete,we need this element to enable or disable userns manually. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> --- docs/formatdomain.html.in | 20 +++++++++++++++++++- docs/schemas/domaincommon.rng | 36 ++++++++++++++++++++++++++++++++++++ src/conf/domain_conf.c | 36 ++++++++++++++++++++++++++++++++++++ src/conf/domain_conf.h | 21 +++++++++++++++++++++ src/lxc/lxc_container.c | 14 ++++---------- 5 files changed, 116 insertions(+), 11 deletions(-) diff --git a/docs/formatdomain.html.in b/docs/formatdomain.html.in index 4cafc92..d274c64 100644 --- a/docs/formatdomain.html.in +++ b/docs/formatdomain.html.in @@ -264,7 +264,10 @@ no arguments. To specify the initial argv, use the <code>initarg</code> element, repeated as many time as is required. The <code>cmdline</code> element, if set will be used to provide an equivalent to <code>/proc/cmdline</code> - but will not effect init argv. + but will not effect init argv. If you want to enable userns, use the + <code>userns</code> element,the <code>enabled</code> attribute can be + either "yes" or "no".If not specified, default is "no". <span class="since"> + Since 1.0.4 (LXC only)</span>. </p> <pre> @@ -273,9 +276,24 @@ <init>/bin/systemd</init> <initarg>--unit</initarg> <initarg>emergency.service</initarg> + <userns enabled='yes'/> + <uidmap first='0' low_first='1000' count='10'/> + <gidmap first='0' low_first='1000' count='10'/> </os> </pre> + <p> + The uidmap and gidmap elements have three attributs: + </p> + + <dl> + <dt><code>first</code></dt> + <dd>First user id in container</dd> + <dt><code>low_first</code></dt> + <dd>First user id in container will be mapped to this low_first user id in host</dd> + <dt><code>count</code></dt> + <dd>How many users in container being allowed to map to host's user</dd> + </dl> <h3><a name="elementsSysinfo">SMBIOS System Information</a></h3> diff --git a/docs/schemas/domaincommon.rng b/docs/schemas/domaincommon.rng index 4b60885..03886d1 100644 --- a/docs/schemas/domaincommon.rng +++ b/docs/schemas/domaincommon.rng @@ -435,6 +435,42 @@ <text/> </element> </zeroOrMore> + <optional> + <element name="userns"> + <attribute name="enabled"> + <choice> + <value>yes</value> + <value>no</value> + </choice> + </attribute> + </element> + </optional> + <optional> + <element name="uidmap"> + <attribute name="first"> + <ref name="unsignedInt"/> + </attribute> + <attribute name="low_first"> + <ref name="unsignedInt"/> + </attribute> + <attribute name="count"> + <ref name="unsignedInt"/> + </attribute> + </element> + </optional> + <optional> + <element name="gidmap"> + <attribute name="first"> + <ref name="unsignedInt"/> + </attribute> + <attribute name="low_first"> + <ref name="unsignedInt"/> + </attribute> + <attribute name="count"> + <ref name="unsignedInt"/> + </attribute> + </element> + </optional> </interleave> </element> </define> diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index f7c8af1..7f610c3 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -158,6 +158,11 @@ VIR_ENUM_IMPL(virDomainPMState, VIR_DOMAIN_PM_STATE_LAST, "yes", "no") +VIR_ENUM_IMPL(virDomainUserns, VIR_DOMAIN_USER_NS_LAST, + "default", + "yes", + "no") + VIR_ENUM_IMPL(virDomainDevice, VIR_DOMAIN_DEVICE_LAST, "none", "disk", @@ -10036,6 +10041,27 @@ virDomainDefParseXML(virCapsPtr caps, } def->os.cmdline = virXPathString("string(./os/cmdline[1])", ctxt); + tmp = virXPathString("string(./os/userns/@enabled)", ctxt); + if (tmp) { + def->os.userns = virDomainUsernsTypeFromString(tmp); + if (def->os.userns <= 0) + def->os.userns = VIR_DOMAIN_USER_NS_DISABLED; + else if (def->os.userns == VIR_DOMAIN_USER_NS_ENABLED) { + virXPathUInt("string(./os/uidmap/@first)", ctxt, + &def->os.uidmap.first); + virXPathUInt("string(./os/uidmap/@low_first)", ctxt, + &def->os.uidmap.low_first); + virXPathUInt("string(./os/uidmap/@count)", ctxt, + &def->os.uidmap.count); + virXPathUInt("string(./os/gidmap/@first)", ctxt, + &def->os.gidmap.first); + virXPathUInt("string(./os/gidmap/@low_first)", ctxt, + &def->os.gidmap.low_first); + virXPathUInt("string(./os/gidmap/@count)", ctxt, + &def->os.gidmap.count); + } + } + if ((n = virXPathNodeSet("./os/initarg", ctxt, &nodes)) < 0) { goto error; } @@ -14665,6 +14691,16 @@ virDomainDefFormatInternal(virDomainDefPtr def, for (i = 0 ; def->os.initargv && def->os.initargv[i] ; i++) virBufferEscapeString(buf, " <initarg>%s</initarg>\n", def->os.initargv[i]); + if (def->os.userns != VIR_DOMAIN_USER_NS_DEFAULT) { + const char *enabled = virDomainUsernsTypeToString(def->os.userns); + virBufferAsprintf(buf, " <userns enabled='%s'/>\n", enabled); + virBufferAsprintf(buf, " <uidmap first='%u' low_first='%u'" + " count='%u'/>\n", def->os.uidmap.first, + def->os.uidmap.low_first, def->os.uidmap.count); + virBufferAsprintf(buf, " <gidmap first='%u' low_first='%u'" + " count='%u'/>\n", def->os.gidmap.first, + def->os.gidmap.low_first, def->os.gidmap.count); + } virBufferEscapeString(buf, " <loader>%s</loader>\n", def->os.loader); virBufferEscapeString(buf, " <kernel>%s</kernel>\n", diff --git a/src/conf/domain_conf.h b/src/conf/domain_conf.h index 2509193..af39290 100644 --- a/src/conf/domain_conf.h +++ b/src/conf/domain_conf.h @@ -1438,6 +1438,14 @@ enum virDomainBootMenu { VIR_DOMAIN_BOOT_MENU_LAST }; +enum virDomainUserns { + VIR_DOMAIN_USER_NS_DEFAULT = 0, + VIR_DOMAIN_USER_NS_ENABLED, + VIR_DOMAIN_USER_NS_DISABLED, + + VIR_DOMAIN_USER_NS_LAST +}; + enum virDomainFeature { VIR_DOMAIN_FEATURE_ACPI, VIR_DOMAIN_FEATURE_APIC, @@ -1504,6 +1512,14 @@ enum virDomainPMState { VIR_DOMAIN_PM_STATE_LAST }; +typedef struct _virDomainIdMapDef virDomainIdMapDef; +typedef virDomainIdMapDef *virDomainIdMapDefPtr; +struct _virDomainIdMapDef { + unsigned int first; + unsigned int low_first; + unsigned int count; +}; + enum virDomainBIOSUseserial { VIR_DOMAIN_BIOS_USESERIAL_DEFAULT = 0, VIR_DOMAIN_BIOS_USESERIAL_YES, @@ -1540,8 +1556,12 @@ struct _virDomainOSDef { char *bootloader; char *bootloaderArgs; int smbios_mode; + /* enum virDomainUserns */ + int userns; virDomainBIOSDef bios; + virDomainIdMapDef uidmap; + virDomainIdMapDef gidmap; }; enum virDomainTimerNameType { @@ -2272,6 +2292,7 @@ VIR_ENUM_DECL(virDomainTaint) VIR_ENUM_DECL(virDomainVirt) VIR_ENUM_DECL(virDomainBoot) VIR_ENUM_DECL(virDomainBootMenu) +VIR_ENUM_DECL(virDomainUserns) VIR_ENUM_DECL(virDomainFeature) VIR_ENUM_DECL(virDomainFeatureState) VIR_ENUM_DECL(virDomainLifecycle) diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c index 497539c..1d7bc1e 100644 --- a/src/lxc/lxc_container.c +++ b/src/lxc/lxc_container.c @@ -2293,16 +2293,10 @@ cleanup: return ret; } -static int userns_supported(void) +static int userns_supported(virDomainDefPtr def) { -#if 1 - /* - * put off using userns until uid mapping is implemented - */ - return 0; -#else - return lxcContainerAvailable(LXC_CONTAINER_FEATURE_USER) == 0; -#endif + return ((def->os.userns == VIR_DOMAIN_USER_NS_ENABLED) && + lxcContainerAvailable(LXC_CONTAINER_FEATURE_USER) == 0); } virArch lxcContainerGetAlt32bitArch(virArch arch) @@ -2366,7 +2360,7 @@ int lxcContainerStart(virDomainDefPtr def, cflags = CLONE_NEWPID|CLONE_NEWNS|CLONE_NEWUTS|CLONE_NEWIPC|SIGCHLD; - if (userns_supported()) { + if (userns_supported(def)) { VIR_DEBUG("Enable user namespaces"); cflags |= CLONE_NEWUSER; } -- 1.7.11.7

On Mon, Mar 11, 2013 at 02:26:47PM +0800, Gao feng wrote:
This patch introduces three new elements in <os> for user namespace. for example <os> <userns enabled='yes'/> <uidmap first='0' low_first='1000' count='10'/> <gidmap first='0' low_first='1000' count='10'/> </os>
this new element userns is used for controlling if enable userns for the domain.
We've previously used the <features> block to control whether namespaces are enabled. So I'd prefer to see that we use a '<privuser/>' feature flag for this purpose.
the other two elements uidmap and gidmap are used for setting proc files /proc/<pid>/{uid_map,gid_map}.
There can be many entries per maps, so we should be grouping them in some way. I don't think they belong inside <os> since that is about the guest boot mechanism. Instead we want something like <idmap> <uid start="0" count="100" target="1000"/> <uid start="65536" count="1" target="1101"/> <gid start="0" count="100" target="1000"/> <gid start="65536" count="1" target="1101"/> </idmap> If a <idmap> element is present, then we should automatically set the <privuer/> feature flag during parsing, if not already set by the user. Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On 2013/03/13 18:51, Daniel P. Berrange wrote:
On Mon, Mar 11, 2013 at 02:26:47PM +0800, Gao feng wrote:
This patch introduces three new elements in <os> for user namespace. for example <os> <userns enabled='yes'/> <uidmap first='0' low_first='1000' count='10'/> <gidmap first='0' low_first='1000' count='10'/> </os>
this new element userns is used for controlling if enable userns for the domain.
We've previously used the <features> block to control whether namespaces are enabled. So I'd prefer to see that we use a '<privuser/>' feature flag for this purpose.
Yes, this is more reasonable. Will do it.
the other two elements uidmap and gidmap are used for setting proc files /proc/<pid>/{uid_map,gid_map}.
There can be many entries per maps, so we should be grouping them in some way. I don't think they belong inside <os> since that is about the guest boot mechanism.
Instead we want something like
<idmap> <uid start="0" count="100" target="1000"/> <uid start="65536" count="1" target="1101"/> <gid start="0" count="100" target="1000"/> <gid start="65536" count="1" target="1101"/> </idmap>
If a <idmap> element is present, then we should automatically set the <privuer/> feature flag during parsing, if not already set by the user.
Get it. Thanks!

This patch introduces new helper function virLXCControllerSetupUserns, in this function, we set the files uid_map and gid_map of process libvirt_lxc. lxcContainerSetUserns is used for creating cred for tasks running in container. Since after setuid/setgid, we may be a new user. This patch calls lxcContainerSetUserns at first to make sure the new created files belong to right user. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> --- src/lxc/lxc_container.c | 55 ++++++++++++++++++++++++++++++---------- src/lxc/lxc_controller.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 107 insertions(+), 14 deletions(-) diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c index 1d7bc1e..5c66ae3 100644 --- a/src/lxc/lxc_container.c +++ b/src/lxc/lxc_container.c @@ -329,6 +329,29 @@ int lxcContainerWaitForContinue(int control) /** + * lxcContainerSetUserns: + * + * This function calls setuid and setgid to create proper + * cred for tasks running in container. + * + * Returns 0 on success or -1 in case of error + */ +static int lxcContainerSetUserns(virDomainDefPtr def) +{ + if (def->os.userns != VIR_DOMAIN_USER_NS_ENABLED) + return 0; + + if (virSetUIDGID(def->os.uidmap.first, + def->os.gidmap.first) < 0) { + virReportSystemError(errno, "%s", + _("setuid or setgid failed")); + return -1; + } + + return 0; +} + +/** * lxcContainerRenameAndEnableInterfaces: * @nveths: number of interfaces * @veths: interface names @@ -2221,6 +2244,24 @@ static int lxcContainerChild(void *data) } } + if (!virFileExists(vmDef->os.init)) { + virReportSystemError(errno, + _("cannot find init path '%s' relative to container root"), + vmDef->os.init); + goto cleanup; + } + + /* Wait for interface devices to show up */ + if (lxcContainerWaitForContinue(argv->monitor) < 0) { + virReportSystemError(errno, "%s", + _("Failed to read the container continue message")); + goto cleanup; + } + VIR_DEBUG("Received container continue message"); + + if (lxcContainerSetUserns(vmDef) < 0) + goto cleanup; + VIR_DEBUG("Container TTY path: %s", ttyPath); ttyfd = open(ttyPath, O_RDWR|O_NOCTTY); @@ -2236,20 +2277,6 @@ static int lxcContainerChild(void *data) argv->securityDriver) < 0) goto cleanup; - if (!virFileExists(vmDef->os.init)) { - virReportSystemError(errno, - _("cannot find init path '%s' relative to container root"), - vmDef->os.init); - goto cleanup; - } - - /* Wait for interface devices to show up */ - if (lxcContainerWaitForContinue(argv->monitor) < 0) { - virReportSystemError(errno, "%s", - _("Failed to read the container continue message")); - goto cleanup; - } - VIR_DEBUG("Received container continue message"); /* rename and enable interfaces */ if (lxcContainerRenameAndEnableInterfaces(!!(vmDef->features & diff --git a/src/lxc/lxc_controller.c b/src/lxc/lxc_controller.c index 15aa334..f17142f 100644 --- a/src/lxc/lxc_controller.c +++ b/src/lxc/lxc_controller.c @@ -1028,6 +1028,69 @@ cleanup2: } +/** + * virLXCControllerSetupUserns + * + * Set proc files for user namespace + * + * Returns 0 on success or -1 in case of error + */ +static int virLXCControllerSetupUserns(virLXCControllerPtr ctrl) +{ + char *uid_map = NULL; + char *gid_map = NULL; + char *uidmap_value = NULL; + char *gidmap_value = NULL; + int ret = -1; + + if (ctrl->def->os.userns != VIR_DOMAIN_USER_NS_ENABLED) + return 0; + + if (virAsprintf(&uid_map, "/proc/%d/uid_map", ctrl->initpid) < 0) + goto cleanup; + + if (virAsprintf(&gid_map, "/proc/%d/gid_map", ctrl->initpid) < 0) + goto cleanup; + + if (virAsprintf(&uidmap_value, "%u %u %u", + ctrl->def->os.uidmap.first, + ctrl->def->os.uidmap.low_first, + ctrl->def->os.uidmap.count) < 0) + goto cleanup; + + if (virAsprintf(&gidmap_value, "%u %u %u", + ctrl->def->os.gidmap.first, + ctrl->def->os.gidmap.low_first, + ctrl->def->os.gidmap.count) < 0) + goto cleanup; + + if (virFileWriteStr(uid_map, uidmap_value, 0) < 0) { + if (errno == -ENOENT) + virReportSystemError(errno, + _("%s doesn't exist, please disable userns"), + uid_map); + virReportSystemError(errno, _("unable write to %s"), uid_map); + goto cleanup; + } + + if (virFileWriteStr(gid_map, gidmap_value, 0) < 0) { + if (errno == -ENOENT) + virReportSystemError(errno, + _("%s doesn't exist, please disable userns"), + gid_map); + virReportSystemError(errno, _("unable write to %s"), gid_map); + goto cleanup; + } + + ret = 0; +cleanup: + VIR_FREE(uidmap_value); + VIR_FREE(gidmap_value); + VIR_FREE(uid_map); + VIR_FREE(gid_map); + return ret; +} + /** * virLXCControllerMoveInterfaces @@ -1454,6 +1517,9 @@ virLXCControllerRun(virLXCControllerPtr ctrl) VIR_FORCE_CLOSE(control[1]); VIR_FORCE_CLOSE(containerhandshake[1]); + if (virLXCControllerSetupUserns(ctrl) < 0) + goto cleanup; + if (virLXCControllerMoveInterfaces(ctrl) < 0) goto cleanup; -- 1.7.11.7

On 2013/03/11 14:26, Gao feng wrote:
This patch introduces new helper function virLXCControllerSetupUserns, in this function, we set the files uid_map and gid_map of process libvirt_lxc.
Oops,not libvirt_lxc, it's the init task of container.
lxcContainerSetUserns is used for creating cred for tasks running in container. Since after setuid/setgid, we may be a new user. This patch calls lxcContainerSetUserns at first to make sure the new created files belong to right user.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> --- src/lxc/lxc_container.c | 55 ++++++++++++++++++++++++++++++---------- src/lxc/lxc_controller.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 107 insertions(+), 14 deletions(-)
diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c index 1d7bc1e..5c66ae3 100644 --- a/src/lxc/lxc_container.c +++ b/src/lxc/lxc_container.c @@ -329,6 +329,29 @@ int lxcContainerWaitForContinue(int control)
/** + * lxcContainerSetUserns: + * + * This function calls setuid and setgid to create proper + * cred for tasks running in container. + * + * Returns 0 on success or -1 in case of error + */ +static int lxcContainerSetUserns(virDomainDefPtr def) +{ + if (def->os.userns != VIR_DOMAIN_USER_NS_ENABLED) + return 0; + + if (virSetUIDGID(def->os.uidmap.first, + def->os.gidmap.first) < 0) { + virReportSystemError(errno, "%s", + _("setuid or setgid failed")); + return -1; + } + + return 0; +} + +/** * lxcContainerRenameAndEnableInterfaces: * @nveths: number of interfaces * @veths: interface names @@ -2221,6 +2244,24 @@ static int lxcContainerChild(void *data) } }
+ if (!virFileExists(vmDef->os.init)) { + virReportSystemError(errno, + _("cannot find init path '%s' relative to container root"), + vmDef->os.init); + goto cleanup; + } + + /* Wait for interface devices to show up */ + if (lxcContainerWaitForContinue(argv->monitor) < 0) { + virReportSystemError(errno, "%s", + _("Failed to read the container continue message")); + goto cleanup; + } + VIR_DEBUG("Received container continue message"); + + if (lxcContainerSetUserns(vmDef) < 0) + goto cleanup; + VIR_DEBUG("Container TTY path: %s", ttyPath);
ttyfd = open(ttyPath, O_RDWR|O_NOCTTY); @@ -2236,20 +2277,6 @@ static int lxcContainerChild(void *data) argv->securityDriver) < 0) goto cleanup;
- if (!virFileExists(vmDef->os.init)) { - virReportSystemError(errno, - _("cannot find init path '%s' relative to container root"), - vmDef->os.init); - goto cleanup; - } - - /* Wait for interface devices to show up */ - if (lxcContainerWaitForContinue(argv->monitor) < 0) { - virReportSystemError(errno, "%s", - _("Failed to read the container continue message")); - goto cleanup; - } - VIR_DEBUG("Received container continue message");
/* rename and enable interfaces */ if (lxcContainerRenameAndEnableInterfaces(!!(vmDef->features & diff --git a/src/lxc/lxc_controller.c b/src/lxc/lxc_controller.c index 15aa334..f17142f 100644 --- a/src/lxc/lxc_controller.c +++ b/src/lxc/lxc_controller.c @@ -1028,6 +1028,69 @@ cleanup2: }
+/** + * virLXCControllerSetupUserns + * + * Set proc files for user namespace + * + * Returns 0 on success or -1 in case of error + */ +static int virLXCControllerSetupUserns(virLXCControllerPtr ctrl) +{ + char *uid_map = NULL; + char *gid_map = NULL; + char *uidmap_value = NULL; + char *gidmap_value = NULL; + int ret = -1; + + if (ctrl->def->os.userns != VIR_DOMAIN_USER_NS_ENABLED) + return 0; + + if (virAsprintf(&uid_map, "/proc/%d/uid_map", ctrl->initpid) < 0) + goto cleanup; + + if (virAsprintf(&gid_map, "/proc/%d/gid_map", ctrl->initpid) < 0) + goto cleanup; + + if (virAsprintf(&uidmap_value, "%u %u %u", + ctrl->def->os.uidmap.first, + ctrl->def->os.uidmap.low_first, + ctrl->def->os.uidmap.count) < 0) + goto cleanup; + + if (virAsprintf(&gidmap_value, "%u %u %u", + ctrl->def->os.gidmap.first, + ctrl->def->os.gidmap.low_first, + ctrl->def->os.gidmap.count) < 0) + goto cleanup; + + if (virFileWriteStr(uid_map, uidmap_value, 0) < 0) { + if (errno == -ENOENT) + virReportSystemError(errno, + _("%s doesn't exist, please disable userns"), + uid_map); + virReportSystemError(errno, _("unable write to %s"), uid_map); + goto cleanup; + } + + if (virFileWriteStr(gid_map, gidmap_value, 0) < 0) { + if (errno == -ENOENT) + virReportSystemError(errno, + _("%s doesn't exist, please disable userns"), + gid_map); + virReportSystemError(errno, _("unable write to %s"), gid_map); + goto cleanup; + } + + ret = 0; +cleanup: + VIR_FREE(uidmap_value); + VIR_FREE(gidmap_value); + VIR_FREE(uid_map); + VIR_FREE(gid_map); + return ret; +} +
/** * virLXCControllerMoveInterfaces @@ -1454,6 +1517,9 @@ virLXCControllerRun(virLXCControllerPtr ctrl) VIR_FORCE_CLOSE(control[1]); VIR_FORCE_CLOSE(containerhandshake[1]);
+ if (virLXCControllerSetupUserns(ctrl) < 0) + goto cleanup; + if (virLXCControllerMoveInterfaces(ctrl) < 0) goto cleanup;

On Mon, Mar 11, 2013 at 02:26:48PM +0800, Gao feng wrote:
This patch introduces new helper function virLXCControllerSetupUserns, in this function, we set the files uid_map and gid_map of process libvirt_lxc.
lxcContainerSetUserns is used for creating cred for tasks running in container. Since after setuid/setgid, we may be a new user. This patch calls lxcContainerSetUserns at first to make sure the new created files belong to right user.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> --- src/lxc/lxc_container.c | 55 ++++++++++++++++++++++++++++++---------- src/lxc/lxc_controller.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 107 insertions(+), 14 deletions(-)
diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c index 1d7bc1e..5c66ae3 100644 --- a/src/lxc/lxc_container.c +++ b/src/lxc/lxc_container.c @@ -329,6 +329,29 @@ int lxcContainerWaitForContinue(int control)
/** + * lxcContainerSetUserns: + * + * This function calls setuid and setgid to create proper + * cred for tasks running in container. + * + * Returns 0 on success or -1 in case of error + */ +static int lxcContainerSetUserns(virDomainDefPtr def) +{ + if (def->os.userns != VIR_DOMAIN_USER_NS_ENABLED) + return 0; + + if (virSetUIDGID(def->os.uidmap.first, + def->os.gidmap.first) < 0) { + virReportSystemError(errno, "%s", + _("setuid or setgid failed")); + return -1; + } + + return 0; +}
I don't see why we should force the init process to have the first UID in the map. The init process should always start as uid:gid 0:0 regardless of mapping IMHO. If we want a capability to start the init process as a different uid:gid, then that should involve separate XML configuration.
@@ -2221,6 +2244,24 @@ static int lxcContainerChild(void *data) } }
+ if (!virFileExists(vmDef->os.init)) { + virReportSystemError(errno, + _("cannot find init path '%s' relative to container root"), + vmDef->os.init); + goto cleanup; + } + + /* Wait for interface devices to show up */ + if (lxcContainerWaitForContinue(argv->monitor) < 0) { + virReportSystemError(errno, "%s", + _("Failed to read the container continue message")); + goto cleanup; + } + VIR_DEBUG("Received container continue message");
I don't really see why you're moving these lines - they are unrelated to user namespaces.
+ + if (lxcContainerSetUserns(vmDef) < 0) + goto cleanup; + VIR_DEBUG("Container TTY path: %s", ttyPath);
ttyfd = open(ttyPath, O_RDWR|O_NOCTTY); @@ -2236,20 +2277,6 @@ static int lxcContainerChild(void *data) argv->securityDriver) < 0) goto cleanup;
- if (!virFileExists(vmDef->os.init)) { - virReportSystemError(errno, - _("cannot find init path '%s' relative to container root"), - vmDef->os.init); - goto cleanup; - } - - /* Wait for interface devices to show up */ - if (lxcContainerWaitForContinue(argv->monitor) < 0) { - virReportSystemError(errno, "%s", - _("Failed to read the container continue message")); - goto cleanup; - } - VIR_DEBUG("Received container continue message");
/* rename and enable interfaces */ if (lxcContainerRenameAndEnableInterfaces(!!(vmDef->features & diff --git a/src/lxc/lxc_controller.c b/src/lxc/lxc_controller.c index 15aa334..f17142f 100644 --- a/src/lxc/lxc_controller.c +++ b/src/lxc/lxc_controller.c @@ -1028,6 +1028,69 @@ cleanup2: }
+/** + * virLXCControllerSetupUserns + * + * Set proc files for user namespace + * + * Returns 0 on success or -1 in case of error + */ +static int virLXCControllerSetupUserns(virLXCControllerPtr ctrl) +{ + char *uid_map = NULL; + char *gid_map = NULL; + char *uidmap_value = NULL; + char *gidmap_value = NULL; + int ret = -1; + + if (ctrl->def->os.userns != VIR_DOMAIN_USER_NS_ENABLED) + return 0; + + if (virAsprintf(&uid_map, "/proc/%d/uid_map", ctrl->initpid) < 0) + goto cleanup; + + if (virAsprintf(&gid_map, "/proc/%d/gid_map", ctrl->initpid) < 0) + goto cleanup; + + if (virAsprintf(&uidmap_value, "%u %u %u", + ctrl->def->os.uidmap.first, + ctrl->def->os.uidmap.low_first, + ctrl->def->os.uidmap.count) < 0) + goto cleanup; + + if (virAsprintf(&gidmap_value, "%u %u %u", + ctrl->def->os.gidmap.first, + ctrl->def->os.gidmap.low_first, + ctrl->def->os.gidmap.count) < 0) + goto cleanup; + + if (virFileWriteStr(uid_map, uidmap_value, 0) < 0) { + if (errno == -ENOENT) + virReportSystemError(errno, + _("%s doesn't exist, please disable userns"), + uid_map); + virReportSystemError(errno, _("unable write to %s"), uid_map); + goto cleanup; + } + + if (virFileWriteStr(gid_map, gidmap_value, 0) < 0) { + if (errno == -ENOENT) + virReportSystemError(errno, + _("%s doesn't exist, please disable userns"), + gid_map); + virReportSystemError(errno, _("unable write to %s"), gid_map); + goto cleanup; + } + + ret = 0; +cleanup: + VIR_FREE(uidmap_value); + VIR_FREE(gidmap_value); + VIR_FREE(uid_map); + VIR_FREE(gid_map); + return ret; +}
I think we should have one method can write to either mapping file, and that should be invoked once to setup UID mapping and once to setup GID mapping. This would remove all the duplication in this method. Also we need to be able to write multiple lines to the mapping files in one go. Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On 2013/03/13 18:57, Daniel P. Berrange wrote:
On Mon, Mar 11, 2013 at 02:26:48PM +0800, Gao feng wrote:
This patch introduces new helper function virLXCControllerSetupUserns, in this function, we set the files uid_map and gid_map of process libvirt_lxc.
lxcContainerSetUserns is used for creating cred for tasks running in container. Since after setuid/setgid, we may be a new user. This patch calls lxcContainerSetUserns at first to make sure the new created files belong to right user.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> --- src/lxc/lxc_container.c | 55 ++++++++++++++++++++++++++++++---------- src/lxc/lxc_controller.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 107 insertions(+), 14 deletions(-)
diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c index 1d7bc1e..5c66ae3 100644 --- a/src/lxc/lxc_container.c +++ b/src/lxc/lxc_container.c @@ -329,6 +329,29 @@ int lxcContainerWaitForContinue(int control)
/** + * lxcContainerSetUserns: + * + * This function calls setuid and setgid to create proper + * cred for tasks running in container. + * + * Returns 0 on success or -1 in case of error + */ +static int lxcContainerSetUserns(virDomainDefPtr def) +{ + if (def->os.userns != VIR_DOMAIN_USER_NS_ENABLED) + return 0; + + if (virSetUIDGID(def->os.uidmap.first, + def->os.gidmap.first) < 0) { + virReportSystemError(errno, "%s", + _("setuid or setgid failed")); + return -1; + } + + return 0; +}
I don't see why we should force the init process to have the first UID in the map. The init process should always start as uid:gid 0:0 regardless of mapping IMHO. If we want a capability to start the init process as a different uid:gid, then that should involve separate XML configuration.
Yes,kernel provides flexible interface.but we can force the [u,g]ser id of init process to 0:0.
@@ -2221,6 +2244,24 @@ static int lxcContainerChild(void *data) } }
+ if (!virFileExists(vmDef->os.init)) { + virReportSystemError(errno, + _("cannot find init path '%s' relative to container root"), + vmDef->os.init); + goto cleanup; + } + + /* Wait for interface devices to show up */ + if (lxcContainerWaitForContinue(argv->monitor) < 0) { + virReportSystemError(errno, "%s", + _("Failed to read the container continue message")); + goto cleanup; + } + VIR_DEBUG("Received container continue message");
I don't really see why you're moving these lines - they are unrelated to user namespaces.
This change ensure the init process has new cred first. see lxcContainerSetUserns below.
+ + if (lxcContainerSetUserns(vmDef) < 0) + goto cleanup; + VIR_DEBUG("Container TTY path: %s", ttyPath);
ttyfd = open(ttyPath, O_RDWR|O_NOCTTY); @@ -2236,20 +2277,6 @@ static int lxcContainerChild(void *data) argv->securityDriver) < 0) goto cleanup;
- if (!virFileExists(vmDef->os.init)) { - virReportSystemError(errno, - _("cannot find init path '%s' relative to container root"), - vmDef->os.init); - goto cleanup; - } - - /* Wait for interface devices to show up */ - if (lxcContainerWaitForContinue(argv->monitor) < 0) { - virReportSystemError(errno, "%s", - _("Failed to read the container continue message")); - goto cleanup; - } - VIR_DEBUG("Received container continue message");
/* rename and enable interfaces */ if (lxcContainerRenameAndEnableInterfaces(!!(vmDef->features & diff --git a/src/lxc/lxc_controller.c b/src/lxc/lxc_controller.c index 15aa334..f17142f 100644 --- a/src/lxc/lxc_controller.c +++ b/src/lxc/lxc_controller.c @@ -1028,6 +1028,69 @@ cleanup2: }
+/** + * virLXCControllerSetupUserns + * + * Set proc files for user namespace + * + * Returns 0 on success or -1 in case of error + */ +static int virLXCControllerSetupUserns(virLXCControllerPtr ctrl) +{ + char *uid_map = NULL; + char *gid_map = NULL; + char *uidmap_value = NULL; + char *gidmap_value = NULL; + int ret = -1; + + if (ctrl->def->os.userns != VIR_DOMAIN_USER_NS_ENABLED) + return 0; + + if (virAsprintf(&uid_map, "/proc/%d/uid_map", ctrl->initpid) < 0) + goto cleanup; + + if (virAsprintf(&gid_map, "/proc/%d/gid_map", ctrl->initpid) < 0) + goto cleanup; + + if (virAsprintf(&uidmap_value, "%u %u %u", + ctrl->def->os.uidmap.first, + ctrl->def->os.uidmap.low_first, + ctrl->def->os.uidmap.count) < 0) + goto cleanup; + + if (virAsprintf(&gidmap_value, "%u %u %u", + ctrl->def->os.gidmap.first, + ctrl->def->os.gidmap.low_first, + ctrl->def->os.gidmap.count) < 0) + goto cleanup; + + if (virFileWriteStr(uid_map, uidmap_value, 0) < 0) { + if (errno == -ENOENT) + virReportSystemError(errno, + _("%s doesn't exist, please disable userns"), + uid_map); + virReportSystemError(errno, _("unable write to %s"), uid_map); + goto cleanup; + } + + if (virFileWriteStr(gid_map, gidmap_value, 0) < 0) { + if (errno == -ENOENT) + virReportSystemError(errno, + _("%s doesn't exist, please disable userns"), + gid_map); + virReportSystemError(errno, _("unable write to %s"), gid_map); + goto cleanup; + } + + ret = 0; +cleanup: + VIR_FREE(uidmap_value); + VIR_FREE(gidmap_value); + VIR_FREE(uid_map); + VIR_FREE(gid_map); + return ret; +}
I think we should have one method can write to either mapping file, and that should be invoked once to setup UID mapping and once to setup GID mapping. This would remove all the duplication in this method. Also we need to be able to write multiple lines to the mapping files in one go.
sounds reasonable,will do. Thanks!

Since we can't mount cgroupfs in uninit user namespace now. only mount cgroupfs when userns is disabled. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> --- src/lxc/lxc_container.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c index 5c66ae3..92af3e5 100644 --- a/src/lxc/lxc_container.c +++ b/src/lxc/lxc_container.c @@ -1979,7 +1979,8 @@ static int lxcContainerSetupPivotRoot(virDomainDefPtr vmDef, /* Now we can re-mount the cgroups controllers in the * same configuration as before */ - if (lxcContainerMountCGroups(mounts, nmounts, + if (vmDef->os.userns != VIR_DOMAIN_USER_NS_ENABLED && + lxcContainerMountCGroups(mounts, nmounts, cgroupRoot, sec_mount_options) < 0) goto cleanup; @@ -2087,7 +2088,8 @@ static int lxcContainerSetupExtraMounts(virDomainDefPtr vmDef, /* Now we can re-mount the cgroups controllers in the * same configuration as before */ - if (lxcContainerMountCGroups(mounts, nmounts, + if (vmDef->os.userns != VIR_DOMAIN_USER_NS_ENABLED && + lxcContainerMountCGroups(mounts, nmounts, cgroupRoot, sec_mount_options) < 0) goto cleanup; -- 1.7.11.7

On Mon, Mar 11, 2013 at 02:26:49PM +0800, Gao feng wrote:
Since we can't mount cgroupfs in uninit user namespace now. only mount cgroupfs when userns is disabled.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> --- src/lxc/lxc_container.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c index 5c66ae3..92af3e5 100644 --- a/src/lxc/lxc_container.c +++ b/src/lxc/lxc_container.c @@ -1979,7 +1979,8 @@ static int lxcContainerSetupPivotRoot(virDomainDefPtr vmDef,
/* Now we can re-mount the cgroups controllers in the * same configuration as before */ - if (lxcContainerMountCGroups(mounts, nmounts, + if (vmDef->os.userns != VIR_DOMAIN_USER_NS_ENABLED && + lxcContainerMountCGroups(mounts, nmounts, cgroupRoot, sec_mount_options) < 0) goto cleanup;
@@ -2087,7 +2088,8 @@ static int lxcContainerSetupExtraMounts(virDomainDefPtr vmDef,
/* Now we can re-mount the cgroups controllers in the * same configuration as before */ - if (lxcContainerMountCGroups(mounts, nmounts, + if (vmDef->os.userns != VIR_DOMAIN_USER_NS_ENABLED && + lxcContainerMountCGroups(mounts, nmounts, cgroupRoot, sec_mount_options) < 0) goto cleanup;
I'm not sure that this is the right approach for this. If we can't mount the cgroups filesystems, then we need preserve the existing mounts from the host in some way, rather than unmounting them. Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On 2013/03/13 18:59, Daniel P. Berrange wrote:
On Mon, Mar 11, 2013 at 02:26:49PM +0800, Gao feng wrote:
Since we can't mount cgroupfs in uninit user namespace now. only mount cgroupfs when userns is disabled.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> --- src/lxc/lxc_container.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c index 5c66ae3..92af3e5 100644 --- a/src/lxc/lxc_container.c +++ b/src/lxc/lxc_container.c @@ -1979,7 +1979,8 @@ static int lxcContainerSetupPivotRoot(virDomainDefPtr vmDef,
/* Now we can re-mount the cgroups controllers in the * same configuration as before */ - if (lxcContainerMountCGroups(mounts, nmounts, + if (vmDef->os.userns != VIR_DOMAIN_USER_NS_ENABLED && + lxcContainerMountCGroups(mounts, nmounts, cgroupRoot, sec_mount_options) < 0) goto cleanup;
@@ -2087,7 +2088,8 @@ static int lxcContainerSetupExtraMounts(virDomainDefPtr vmDef,
/* Now we can re-mount the cgroups controllers in the * same configuration as before */ - if (lxcContainerMountCGroups(mounts, nmounts, + if (vmDef->os.userns != VIR_DOMAIN_USER_NS_ENABLED && + lxcContainerMountCGroups(mounts, nmounts, cgroupRoot, sec_mount_options) < 0) goto cleanup;
I'm not sure that this is the right approach for this. If we can't mount the cgroups filesystems, then we need preserve the existing mounts from the host in some way, rather than unmounting them.
I wonder if we should use mount --bind to set cgroupfs for container. we can mount the directory /sys/fs/cgroup/memory/libvirt/lxc/domain of host to the directory /sys/fs/cgroup/memory of container. This can also resolve the cgroup configuration leak problem, and can also resolve the "failed to mount cgroup" problem reported by Yin Olivia.

user namespace doesn't allow to create devices in uninit userns. We should create devices on host side. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> --- src/lxc/lxc_container.c | 47 +++++++-------------------- src/lxc/lxc_controller.c | 83 ++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 94 insertions(+), 36 deletions(-) diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c index 92af3e5..58d6ee5 100644 --- a/src/lxc/lxc_container.c +++ b/src/lxc/lxc_container.c @@ -681,22 +681,10 @@ static int lxcContainerMountFSDevPTS(virDomainFSDefPtr root) return rc; } -static int lxcContainerPopulateDevices(char **ttyPaths, size_t nttyPaths) +static int lxcContainerSetupDevices(char **ttyPaths, size_t nttyPaths) { size_t i; const struct { - int maj; - int min; - mode_t mode; - const char *path; - } devs[] = { - { LXC_DEV_MAJ_MEMORY, LXC_DEV_MIN_NULL, 0666, "/dev/null" }, - { LXC_DEV_MAJ_MEMORY, LXC_DEV_MIN_ZERO, 0666, "/dev/zero" }, - { LXC_DEV_MAJ_MEMORY, LXC_DEV_MIN_FULL, 0666, "/dev/full" }, - { LXC_DEV_MAJ_MEMORY, LXC_DEV_MIN_RANDOM, 0666, "/dev/random" }, - { LXC_DEV_MAJ_MEMORY, LXC_DEV_MIN_URANDOM, 0666, "/dev/urandom" }, - }; - const struct { const char *src; const char *dst; } links[] = { @@ -706,18 +694,6 @@ static int lxcContainerPopulateDevices(char **ttyPaths, size_t nttyPaths) { "/proc/self/fd", "/dev/fd" }, }; - /* Populate /dev/ with a few important bits */ - for (i = 0 ; i < ARRAY_CARDINALITY(devs) ; i++) { - dev_t dev = makedev(devs[i].maj, devs[i].min); - if (mknod(devs[i].path, S_IFCHR, dev) < 0 || - chmod(devs[i].path, devs[i].mode)) { - virReportSystemError(errno, - _("Failed to make device %s"), - devs[i].path); - return -1; - } - } - for (i = 0 ; i < ARRAY_CARDINALITY(links) ; i++) { if (symlink(links[i].src, links[i].dst) < 0) { virReportSystemError(errno, @@ -737,15 +713,6 @@ static int lxcContainerPopulateDevices(char **ttyPaths, size_t nttyPaths) _("Failed to bind /dev/pts/ptmx on to /dev/ptmx")); return -1; } - } else { - /* Legacy devpts, so we need to just use shared one */ - dev_t dev = makedev(LXC_DEV_MAJ_TTY, LXC_DEV_MIN_PTMX); - if (mknod("/dev/ptmx", S_IFCHR, dev) < 0 || - chmod("/dev/ptmx", 0666)) { - virReportSystemError(errno, "%s", - _("Failed to make device /dev/ptmx")); - return -1; - } } for (i = 0 ; i < nttyPaths ; i++) { @@ -1988,8 +1955,8 @@ static int lxcContainerSetupPivotRoot(virDomainDefPtr vmDef, if (lxcContainerMountFSDevPTS(root) < 0) goto cleanup; - /* Populates device nodes in /dev/ */ - if (lxcContainerPopulateDevices(ttyPaths, nttyPaths) < 0) + /* Setup device nodes in /dev/ */ + if (lxcContainerSetupDevices(ttyPaths, nttyPaths) < 0) goto cleanup; /* Sets up any non-root mounts from guest config */ @@ -2306,6 +2273,14 @@ static int lxcContainerChild(void *data) goto cleanup; } + /* Wait for controller creating devices for container */ + if (lxcContainerWaitForContinue(argv->monitor) < 0) { + virReportSystemError(errno, "%s", + _("Failed to read the container continue message")); + goto cleanup; + } + VIR_DEBUG("Received container continue message, create devices success."); + ret = 0; cleanup: VIR_FREE(ttyPath); diff --git a/src/lxc/lxc_controller.c b/src/lxc/lxc_controller.c index f17142f..c6f8c3b 100644 --- a/src/lxc/lxc_controller.c +++ b/src/lxc/lxc_controller.c @@ -1092,6 +1092,79 @@ cleanup: } +static int virLXCControllerPopulateDevices(virLXCControllerPtr ctrl) +{ + size_t i; + char *ptmx = NULL; + const struct { + int maj; + int min; + mode_t mode; + const char *path; + } devs[] = { + { LXC_DEV_MAJ_MEMORY, LXC_DEV_MIN_NULL, 0666, "/dev/null" }, + { LXC_DEV_MAJ_MEMORY, LXC_DEV_MIN_ZERO, 0666, "/dev/zero" }, + { LXC_DEV_MAJ_MEMORY, LXC_DEV_MIN_FULL, 0666, "/dev/full" }, + { LXC_DEV_MAJ_MEMORY, LXC_DEV_MIN_RANDOM, 0666, "/dev/random" }, + { LXC_DEV_MAJ_MEMORY, LXC_DEV_MIN_URANDOM, 0666, "/dev/urandom" }, + }; + virDomainFSDefPtr root = virDomainGetRootFilesystem(ctrl->def); + + if (root == NULL || root->src == NULL) + return 0; + + /* Populate /dev/ with a few important bits */ + for (i = 0 ; i < ARRAY_CARDINALITY(devs) ; i++) { + char *path = NULL; + if (virAsprintf(&path, "/proc/%llu/root/%s", + (unsigned long long)ctrl->initpid, + devs[i].path) < 0) { + virReportOOMError(); + return -1; + } + dev_t dev = makedev(devs[i].maj, devs[i].min); + if (mknod(path, S_IFCHR, dev) < 0 || + chmod(path, devs[i].mode)) { + virReportSystemError(errno, + _("Failed to make device %s"), + path); + VIR_FREE(path); + return -1; + } + VIR_FREE(path); + } + + if (virAsprintf(&ptmx, "/proc/%llu/root/dev/pts/ptmx", + (unsigned long long)ctrl->initpid) < 0) { + virReportOOMError(); + return -1; + } + + if (access(ptmx, W_OK)) { + char *path = NULL; + if (virAsprintf(&path, "/proc/%llu/root/dev/ptmx", + (unsigned long long)ctrl->initpid) < 0) { + virReportOOMError(); + VIR_FREE(ptmx); + return -1; + } + /* Legacy devpts, so we need to just use shared one */ + dev_t dev = makedev(LXC_DEV_MAJ_TTY, LXC_DEV_MIN_PTMX); + if (mknod(path, S_IFCHR, dev) < 0 || + chmod(path, 0666)) { + virReportSystemError(errno, + _("Failed to make device %s"), path); + VIR_FREE(path); + VIR_FREE(ptmx); + return -1; + } + VIR_FREE(path); + } + VIR_FREE(ptmx); + return 0; +} + + /** * virLXCControllerMoveInterfaces * @nveths: number of interfaces @@ -1535,6 +1608,16 @@ virLXCControllerRun(virLXCControllerPtr ctrl) goto cleanup; } + /* Populate devices for container */ + if (virLXCControllerPopulateDevices(ctrl) < 0) + goto cleanup; + + if (lxcContainerSendContinue(control[0]) < 0) { + virReportSystemError(errno, "%s", + _("Unable to send container continue message")); + goto cleanup; + } + /* Now the container is fully setup... */ /* ...we can close the loop devices... */ -- 1.7.11.7

On Mon, Mar 11, 2013 at 02:26:50PM +0800, Gao feng wrote:
user namespace doesn't allow to create devices in uninit userns. We should create devices on host side.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> --- src/lxc/lxc_container.c | 47 +++++++-------------------- src/lxc/lxc_controller.c | 83 ++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 94 insertions(+), 36 deletions(-)
We actually need this change independently of user namespaces. Currently the cgroup devices setup we do allows 'mknod' permission, when it really ought to be blocked. If we move the setup code into the controller then we can fix the cgroup devices setup to block mknod too.
diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c index 92af3e5..58d6ee5 100644 --- a/src/lxc/lxc_container.c +++ b/src/lxc/lxc_container.c @@ -681,22 +681,10 @@ static int lxcContainerMountFSDevPTS(virDomainFSDefPtr root) return rc; }
-static int lxcContainerPopulateDevices(char **ttyPaths, size_t nttyPaths) +static int lxcContainerSetupDevices(char **ttyPaths, size_t nttyPaths) { size_t i; const struct { - int maj; - int min; - mode_t mode; - const char *path; - } devs[] = { - { LXC_DEV_MAJ_MEMORY, LXC_DEV_MIN_NULL, 0666, "/dev/null" }, - { LXC_DEV_MAJ_MEMORY, LXC_DEV_MIN_ZERO, 0666, "/dev/zero" }, - { LXC_DEV_MAJ_MEMORY, LXC_DEV_MIN_FULL, 0666, "/dev/full" }, - { LXC_DEV_MAJ_MEMORY, LXC_DEV_MIN_RANDOM, 0666, "/dev/random" }, - { LXC_DEV_MAJ_MEMORY, LXC_DEV_MIN_URANDOM, 0666, "/dev/urandom" }, - }; - const struct { const char *src; const char *dst; } links[] = { @@ -706,18 +694,6 @@ static int lxcContainerPopulateDevices(char **ttyPaths, size_t nttyPaths) { "/proc/self/fd", "/dev/fd" }, };
- /* Populate /dev/ with a few important bits */ - for (i = 0 ; i < ARRAY_CARDINALITY(devs) ; i++) { - dev_t dev = makedev(devs[i].maj, devs[i].min); - if (mknod(devs[i].path, S_IFCHR, dev) < 0 || - chmod(devs[i].path, devs[i].mode)) { - virReportSystemError(errno, - _("Failed to make device %s"), - devs[i].path); - return -1; - } - } - for (i = 0 ; i < ARRAY_CARDINALITY(links) ; i++) { if (symlink(links[i].src, links[i].dst) < 0) { virReportSystemError(errno, @@ -737,15 +713,6 @@ static int lxcContainerPopulateDevices(char **ttyPaths, size_t nttyPaths) _("Failed to bind /dev/pts/ptmx on to /dev/ptmx")); return -1; } - } else { - /* Legacy devpts, so we need to just use shared one */ - dev_t dev = makedev(LXC_DEV_MAJ_TTY, LXC_DEV_MIN_PTMX); - if (mknod("/dev/ptmx", S_IFCHR, dev) < 0 || - chmod("/dev/ptmx", 0666)) { - virReportSystemError(errno, "%s", - _("Failed to make device /dev/ptmx")); - return -1; - } }
for (i = 0 ; i < nttyPaths ; i++) { @@ -1988,8 +1955,8 @@ static int lxcContainerSetupPivotRoot(virDomainDefPtr vmDef, if (lxcContainerMountFSDevPTS(root) < 0) goto cleanup;
- /* Populates device nodes in /dev/ */ - if (lxcContainerPopulateDevices(ttyPaths, nttyPaths) < 0) + /* Setup device nodes in /dev/ */ + if (lxcContainerSetupDevices(ttyPaths, nttyPaths) < 0) goto cleanup;
/* Sets up any non-root mounts from guest config */ @@ -2306,6 +2273,14 @@ static int lxcContainerChild(void *data) goto cleanup; }
+ /* Wait for controller creating devices for container */ + if (lxcContainerWaitForContinue(argv->monitor) < 0) { + virReportSystemError(errno, "%s", + _("Failed to read the container continue message")); + goto cleanup; + } + VIR_DEBUG("Received container continue message, create devices success."); + ret = 0; cleanup: VIR_FREE(ttyPath); diff --git a/src/lxc/lxc_controller.c b/src/lxc/lxc_controller.c index f17142f..c6f8c3b 100644 --- a/src/lxc/lxc_controller.c +++ b/src/lxc/lxc_controller.c @@ -1092,6 +1092,79 @@ cleanup: }
+static int virLXCControllerPopulateDevices(virLXCControllerPtr ctrl) +{ + size_t i; + char *ptmx = NULL; + const struct { + int maj; + int min; + mode_t mode; + const char *path; + } devs[] = { + { LXC_DEV_MAJ_MEMORY, LXC_DEV_MIN_NULL, 0666, "/dev/null" }, + { LXC_DEV_MAJ_MEMORY, LXC_DEV_MIN_ZERO, 0666, "/dev/zero" }, + { LXC_DEV_MAJ_MEMORY, LXC_DEV_MIN_FULL, 0666, "/dev/full" }, + { LXC_DEV_MAJ_MEMORY, LXC_DEV_MIN_RANDOM, 0666, "/dev/random" }, + { LXC_DEV_MAJ_MEMORY, LXC_DEV_MIN_URANDOM, 0666, "/dev/urandom" }, + }; + virDomainFSDefPtr root = virDomainGetRootFilesystem(ctrl->def); + + if (root == NULL || root->src == NULL) + return 0; + + /* Populate /dev/ with a few important bits */ + for (i = 0 ; i < ARRAY_CARDINALITY(devs) ; i++) { + char *path = NULL; + if (virAsprintf(&path, "/proc/%llu/root/%s", + (unsigned long long)ctrl->initpid, + devs[i].path) < 0) { + virReportOOMError(); + return -1; + } + dev_t dev = makedev(devs[i].maj, devs[i].min); + if (mknod(path, S_IFCHR, dev) < 0 || + chmod(path, devs[i].mode)) { + virReportSystemError(errno, + _("Failed to make device %s"), + path); + VIR_FREE(path); + return -1; + } + VIR_FREE(path); + } + + if (virAsprintf(&ptmx, "/proc/%llu/root/dev/pts/ptmx", + (unsigned long long)ctrl->initpid) < 0) { + virReportOOMError(); + return -1; + } + + if (access(ptmx, W_OK)) { + char *path = NULL; + if (virAsprintf(&path, "/proc/%llu/root/dev/ptmx", + (unsigned long long)ctrl->initpid) < 0) { + virReportOOMError(); + VIR_FREE(ptmx); + return -1; + } + /* Legacy devpts, so we need to just use shared one */ + dev_t dev = makedev(LXC_DEV_MAJ_TTY, LXC_DEV_MIN_PTMX); + if (mknod(path, S_IFCHR, dev) < 0 || + chmod(path, 0666)) { + virReportSystemError(errno, + _("Failed to make device %s"), path); + VIR_FREE(path); + VIR_FREE(ptmx); + return -1; + } + VIR_FREE(path); + } + VIR_FREE(ptmx); + return 0; +} + + /** * virLXCControllerMoveInterfaces * @nveths: number of interfaces @@ -1535,6 +1608,16 @@ virLXCControllerRun(virLXCControllerPtr ctrl) goto cleanup; }
+ /* Populate devices for container */ + if (virLXCControllerPopulateDevices(ctrl) < 0) + goto cleanup;
We ought to be able to invoke this before lxcContainerStart, so that we can avoid the send/wait continue bit Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On 2013/03/13 19:02, Daniel P. Berrange wrote:
On Mon, Mar 11, 2013 at 02:26:50PM +0800, Gao feng wrote:
user namespace doesn't allow to create devices in uninit userns. We should create devices on host side.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> --- src/lxc/lxc_container.c | 47 +++++++-------------------- src/lxc/lxc_controller.c | 83 ++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 94 insertions(+), 36 deletions(-)
We actually need this change independently of user namespaces. Currently the cgroup devices setup we do allows 'mknod' permission, when it really ought to be blocked. If we move the setup code into the controller then we can fix the cgroup devices setup to block mknod too.
Yes, Agree with you. I will add this one into my patchset.
diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c index 92af3e5..58d6ee5 100644 --- a/src/lxc/lxc_container.c +++ b/src/lxc/lxc_container.c @@ -681,22 +681,10 @@ static int lxcContainerMountFSDevPTS(virDomainFSDefPtr root) return rc; }
-static int lxcContainerPopulateDevices(char **ttyPaths, size_t nttyPaths) +static int lxcContainerSetupDevices(char **ttyPaths, size_t nttyPaths) { size_t i; const struct { - int maj; - int min; - mode_t mode; - const char *path; - } devs[] = { - { LXC_DEV_MAJ_MEMORY, LXC_DEV_MIN_NULL, 0666, "/dev/null" }, - { LXC_DEV_MAJ_MEMORY, LXC_DEV_MIN_ZERO, 0666, "/dev/zero" }, - { LXC_DEV_MAJ_MEMORY, LXC_DEV_MIN_FULL, 0666, "/dev/full" }, - { LXC_DEV_MAJ_MEMORY, LXC_DEV_MIN_RANDOM, 0666, "/dev/random" }, - { LXC_DEV_MAJ_MEMORY, LXC_DEV_MIN_URANDOM, 0666, "/dev/urandom" }, - }; - const struct { const char *src; const char *dst; } links[] = { @@ -706,18 +694,6 @@ static int lxcContainerPopulateDevices(char **ttyPaths, size_t nttyPaths) { "/proc/self/fd", "/dev/fd" }, };
- /* Populate /dev/ with a few important bits */ - for (i = 0 ; i < ARRAY_CARDINALITY(devs) ; i++) { - dev_t dev = makedev(devs[i].maj, devs[i].min); - if (mknod(devs[i].path, S_IFCHR, dev) < 0 || - chmod(devs[i].path, devs[i].mode)) { - virReportSystemError(errno, - _("Failed to make device %s"), - devs[i].path); - return -1; - } - } - for (i = 0 ; i < ARRAY_CARDINALITY(links) ; i++) { if (symlink(links[i].src, links[i].dst) < 0) { virReportSystemError(errno, @@ -737,15 +713,6 @@ static int lxcContainerPopulateDevices(char **ttyPaths, size_t nttyPaths) _("Failed to bind /dev/pts/ptmx on to /dev/ptmx")); return -1; } - } else { - /* Legacy devpts, so we need to just use shared one */ - dev_t dev = makedev(LXC_DEV_MAJ_TTY, LXC_DEV_MIN_PTMX); - if (mknod("/dev/ptmx", S_IFCHR, dev) < 0 || - chmod("/dev/ptmx", 0666)) { - virReportSystemError(errno, "%s", - _("Failed to make device /dev/ptmx")); - return -1; - } }
for (i = 0 ; i < nttyPaths ; i++) { @@ -1988,8 +1955,8 @@ static int lxcContainerSetupPivotRoot(virDomainDefPtr vmDef, if (lxcContainerMountFSDevPTS(root) < 0) goto cleanup;
- /* Populates device nodes in /dev/ */ - if (lxcContainerPopulateDevices(ttyPaths, nttyPaths) < 0) + /* Setup device nodes in /dev/ */ + if (lxcContainerSetupDevices(ttyPaths, nttyPaths) < 0) goto cleanup;
/* Sets up any non-root mounts from guest config */ @@ -2306,6 +2273,14 @@ static int lxcContainerChild(void *data) goto cleanup; }
+ /* Wait for controller creating devices for container */ + if (lxcContainerWaitForContinue(argv->monitor) < 0) { + virReportSystemError(errno, "%s", + _("Failed to read the container continue message")); + goto cleanup; + } + VIR_DEBUG("Received container continue message, create devices success."); + ret = 0; cleanup: VIR_FREE(ttyPath); diff --git a/src/lxc/lxc_controller.c b/src/lxc/lxc_controller.c index f17142f..c6f8c3b 100644 --- a/src/lxc/lxc_controller.c +++ b/src/lxc/lxc_controller.c @@ -1092,6 +1092,79 @@ cleanup: }
+static int virLXCControllerPopulateDevices(virLXCControllerPtr ctrl) +{ + size_t i; + char *ptmx = NULL; + const struct { + int maj; + int min; + mode_t mode; + const char *path; + } devs[] = { + { LXC_DEV_MAJ_MEMORY, LXC_DEV_MIN_NULL, 0666, "/dev/null" }, + { LXC_DEV_MAJ_MEMORY, LXC_DEV_MIN_ZERO, 0666, "/dev/zero" }, + { LXC_DEV_MAJ_MEMORY, LXC_DEV_MIN_FULL, 0666, "/dev/full" }, + { LXC_DEV_MAJ_MEMORY, LXC_DEV_MIN_RANDOM, 0666, "/dev/random" }, + { LXC_DEV_MAJ_MEMORY, LXC_DEV_MIN_URANDOM, 0666, "/dev/urandom" }, + }; + virDomainFSDefPtr root = virDomainGetRootFilesystem(ctrl->def); + + if (root == NULL || root->src == NULL) + return 0; + + /* Populate /dev/ with a few important bits */ + for (i = 0 ; i < ARRAY_CARDINALITY(devs) ; i++) { + char *path = NULL; + if (virAsprintf(&path, "/proc/%llu/root/%s", + (unsigned long long)ctrl->initpid, + devs[i].path) < 0) { + virReportOOMError(); + return -1; + } + dev_t dev = makedev(devs[i].maj, devs[i].min); + if (mknod(path, S_IFCHR, dev) < 0 || + chmod(path, devs[i].mode)) { + virReportSystemError(errno, + _("Failed to make device %s"), + path); + VIR_FREE(path); + return -1; + } + VIR_FREE(path); + } + + if (virAsprintf(&ptmx, "/proc/%llu/root/dev/pts/ptmx", + (unsigned long long)ctrl->initpid) < 0) { + virReportOOMError(); + return -1; + } + + if (access(ptmx, W_OK)) { + char *path = NULL; + if (virAsprintf(&path, "/proc/%llu/root/dev/ptmx", + (unsigned long long)ctrl->initpid) < 0) { + virReportOOMError(); + VIR_FREE(ptmx); + return -1; + } + /* Legacy devpts, so we need to just use shared one */ + dev_t dev = makedev(LXC_DEV_MAJ_TTY, LXC_DEV_MIN_PTMX); + if (mknod(path, S_IFCHR, dev) < 0 || + chmod(path, 0666)) { + virReportSystemError(errno, + _("Failed to make device %s"), path); + VIR_FREE(path); + VIR_FREE(ptmx); + return -1; + } + VIR_FREE(path); + } + VIR_FREE(ptmx); + return 0; +} + + /** * virLXCControllerMoveInterfaces * @nveths: number of interfaces @@ -1535,6 +1608,16 @@ virLXCControllerRun(virLXCControllerPtr ctrl) goto cleanup; }
+ /* Populate devices for container */ + if (virLXCControllerPopulateDevices(ctrl) < 0) + goto cleanup;
We ought to be able to invoke this before lxcContainerStart, so that we can avoid the send/wait continue bit
We have to wait for init task of container running. So we can populate devices at the right placement.

Since the root user of container may be a normal user on host, we should make sure the container has rights to use the tty device. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> --- src/lxc/lxc_controller.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/src/lxc/lxc_controller.c b/src/lxc/lxc_controller.c index c6f8c3b..4715f84 100644 --- a/src/lxc/lxc_controller.c +++ b/src/lxc/lxc_controller.c @@ -1311,6 +1311,7 @@ virLXCControllerSetupDevPTS(virLXCControllerPtr ctrl) char *opts = NULL; char *devpts = NULL; int ret = -1; + uid_t uid = 0; if (!root) { if (ctrl->nconsoles != 1) { @@ -1367,10 +1368,13 @@ virLXCControllerSetupDevPTS(virLXCControllerPtr ctrl) goto cleanup; } + if (ctrl->def->os.userns == VIR_DOMAIN_USER_NS_ENABLED) + uid = ctrl->def->os.uidmap.low_first; + /* XXX should we support gid=X for X!=5 for distros which use * a different gid for tty? */ - if (virAsprintf(&opts, "newinstance,ptmxmode=0666,mode=0620,gid=5%s", - (mount_options ? mount_options : "")) < 0) { + if (virAsprintf(&opts, "newinstance,ptmxmode=0666,mode=0620,uid=%d,gid=5%s", + uid, (mount_options ? mount_options : "")) < 0) { virReportOOMError(); goto cleanup; } -- 1.7.11.7

On Mon, Mar 11, 2013 at 02:26:51PM +0800, Gao feng wrote:
Since the root user of container may be a normal user on host, we should make sure the container has rights to use the tty device.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> --- src/lxc/lxc_controller.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/src/lxc/lxc_controller.c b/src/lxc/lxc_controller.c index c6f8c3b..4715f84 100644 --- a/src/lxc/lxc_controller.c +++ b/src/lxc/lxc_controller.c @@ -1311,6 +1311,7 @@ virLXCControllerSetupDevPTS(virLXCControllerPtr ctrl) char *opts = NULL; char *devpts = NULL; int ret = -1; + uid_t uid = 0;
if (!root) { if (ctrl->nconsoles != 1) { @@ -1367,10 +1368,13 @@ virLXCControllerSetupDevPTS(virLXCControllerPtr ctrl) goto cleanup; }
+ if (ctrl->def->os.userns == VIR_DOMAIN_USER_NS_ENABLED) + uid = ctrl->def->os.uidmap.low_first; + /* XXX should we support gid=X for X!=5 for distros which use * a different gid for tty? */ - if (virAsprintf(&opts, "newinstance,ptmxmode=0666,mode=0620,gid=5%s", - (mount_options ? mount_options : "")) < 0) { + if (virAsprintf(&opts, "newinstance,ptmxmode=0666,mode=0620,uid=%d,gid=5%s", + uid, (mount_options ? mount_options : "")) < 0) { virReportOOMError(); goto cleanup; }
This is bogus, if no 'uid' parameter is set for devpts, then the PTYs that are created automatically get given the uid associated with the calling process, which is what you want. With this change, you are hardcoding the 'uid' regardless of what UID the process in the container is running as, which will break things if any container process changes its uid. Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On 2013/03/13 19:08, Daniel P. Berrange wrote:
On Mon, Mar 11, 2013 at 02:26:51PM +0800, Gao feng wrote:
Since the root user of container may be a normal user on host, we should make sure the container has rights to use the tty device.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> --- src/lxc/lxc_controller.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/src/lxc/lxc_controller.c b/src/lxc/lxc_controller.c index c6f8c3b..4715f84 100644 --- a/src/lxc/lxc_controller.c +++ b/src/lxc/lxc_controller.c @@ -1311,6 +1311,7 @@ virLXCControllerSetupDevPTS(virLXCControllerPtr ctrl) char *opts = NULL; char *devpts = NULL; int ret = -1; + uid_t uid = 0;
if (!root) { if (ctrl->nconsoles != 1) { @@ -1367,10 +1368,13 @@ virLXCControllerSetupDevPTS(virLXCControllerPtr ctrl) goto cleanup; }
+ if (ctrl->def->os.userns == VIR_DOMAIN_USER_NS_ENABLED) + uid = ctrl->def->os.uidmap.low_first; + /* XXX should we support gid=X for X!=5 for distros which use * a different gid for tty? */ - if (virAsprintf(&opts, "newinstance,ptmxmode=0666,mode=0620,gid=5%s", - (mount_options ? mount_options : "")) < 0) { + if (virAsprintf(&opts, "newinstance,ptmxmode=0666,mode=0620,uid=%d,gid=5%s", + uid, (mount_options ? mount_options : "")) < 0) { virReportOOMError(); goto cleanup; }
This is bogus, if no 'uid' parameter is set for devpts, then the PTYs that are created automatically get given the uid associated with the calling process, which is what you want. With this change, you are hardcoding the 'uid' regardless of what UID the process in the container is running as, which will break things if any container process changes its uid.
Thanks for teaching me this! What we should do is change the owner of /dev/pts/x to the low_first user. I am right?

Otherwise we will fail to mount the meminfo file. This patch also allows any users to access the fuse mount point. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> --- src/lxc/lxc_fuse.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/src/lxc/lxc_fuse.c b/src/lxc/lxc_fuse.c index b6808da..2517340 100644 --- a/src/lxc/lxc_fuse.c +++ b/src/lxc/lxc_fuse.c @@ -48,6 +48,9 @@ static int lxcProcGetattr(const char *path, struct stat *stbuf) char *mempath = NULL; struct stat sb; + struct fuse_context *context = fuse_get_context(); + virDomainDefPtr def = (virDomainDefPtr)context->private_data; + memset(stbuf, 0, sizeof(struct stat)); if (virAsprintf(&mempath, "/proc/%s", path) < 0) { virReportOOMError(); @@ -65,6 +68,8 @@ static int lxcProcGetattr(const char *path, struct stat *stbuf) goto cleanup; } + stbuf->st_uid = def->os.uidmap.low_first; + stbuf->st_gid = def->os.gidmap.low_first; stbuf->st_mode = sb.st_mode; stbuf->st_nlink = 1; stbuf->st_blksize = sb.st_blksize; @@ -306,6 +311,7 @@ int lxcSetupFuse(virLXCFusePtr *f, virDomainDefPtr def) /* process name is libvirt_lxc */ if (fuse_opt_add_arg(&args, "libvirt_lxc") == -1 || fuse_opt_add_arg(&args, "-odirect_io") == -1 || + fuse_opt_add_arg(&args, "-oallow_other") == -1 || fuse_opt_add_arg(&args, "-ofsname=libvirt") == -1) goto cleanup1; -- 1.7.11.7
participants (2)
-
Daniel P. Berrange
-
Gao feng