[libvirt] [PATCH 0 of 4] [LXC] RFC: Network interface support

I think I have addressed all of the comments. Changes detailed per patch. I converted to using a dynamic detection routine for determining NETNS support, which I think is much better. That is pulled out into a separate patch for easier distinction. Thanks!

Allow check for containers support to be done without CLONE_NEWNET, and then determine support on the fly by checking for iproute2 support and a successful clone(CLONE_NEWNET). This lets us set a flag for later, as well as not completely disable LXC support on a system without NETNS support. diff -r 8d2afc533c91 -r 405ae197cd7e src/lxc_conf.h --- a/src/lxc_conf.h Tue Jun 17 15:55:03 2008 +0000 +++ b/src/lxc_conf.h Mon Jun 23 11:43:03 2008 -0700 @@ -89,6 +89,7 @@ int ninactivevms; char* configDir; char* stateDir; + int have_netns; }; /* Types and structs */ diff -r 8d2afc533c91 -r 405ae197cd7e src/lxc_driver.c --- a/src/lxc_driver.c Tue Jun 17 15:55:03 2008 +0000 +++ b/src/lxc_driver.c Mon Jun 23 11:43:03 2008 -0700 @@ -66,6 +66,9 @@ #ifndef CLONE_NEWIPC #define CLONE_NEWIPC 0x08000000 #endif +#ifndef CLONE_NEWNET +#define CLONE_NEWNET 0x40000000 /* New network namespace */ +#endif static int lxcStartup(void); static int lxcShutdown(void); @@ -77,11 +80,11 @@ exit(0); } -static int lxcCheckContainerSupport( void ) +static int lxcCheckContainerSupport(int extra_flags) { int rc = 0; int flags = CLONE_NEWPID|CLONE_NEWNS|CLONE_NEWUTS|CLONE_NEWUSER| - CLONE_NEWIPC|SIGCHLD; + CLONE_NEWIPC|SIGCHLD|extra_flags; int cpid; char *childStack; char *stack; @@ -112,7 +115,7 @@ static const char *lxcProbe(void) { #ifdef __linux__ - if (0 == lxcCheckContainerSupport()) { + if (0 == lxcCheckContainerSupport(0)) { return("lxc:///"); } #endif @@ -1039,6 +1042,22 @@ return rc; } +static int lxcCheckNetNsSupport(void) +{ + const char *argv[] = {"ip", "link", "set", "lo", "netns", "-1", NULL}; + int ip_rc; + int user_netns = 0; + int kern_netns = 0; + + if (virRun(NULL, (char **)argv, &ip_rc) == 0) + user_netns = WIFEXITED(ip_rc) && (WEXITSTATUS(ip_rc) != 255); + + if (lxcCheckContainerSupport(CLONE_NEWNET) == 0) + kern_netns = 1; + + return kern_netns && user_netns; +} + static int lxcStartup(void) { uid_t uid = getuid(); @@ -1053,10 +1072,11 @@ } /* Check that this is a container enabled kernel */ - if(0 != lxcCheckContainerSupport()) { + if(0 != lxcCheckContainerSupport(0)) { return -1; } + lxc_driver->have_netns = lxcCheckNetNsSupport(); /* Call function to load lxc driver configuration information */ if (lxcLoadDriverConfig(lxc_driver) < 0) {

Dan Smith wrote:
Allow check for containers support to be done without CLONE_NEWNET, and then determine support on the fly by checking for iproute2 support and a successful clone(CLONE_NEWNET). This lets us set a flag for later, as well as not completely disable LXC support on a system without NETNS support.
The CLONE_NEWNET will fail if the network namespace is not compiled in. I understand this check but it looks like a little random. You are not 100% sure this clone has failed because the network namespace is not supported. That can be another subsystem or namespace which has failed during the initialization of the namespaces. Why don't you simply check the presence of the 'netns' process ? Concerning iproute2, I think this is the work of the installer to check the dependencies, eg. the libvirt rpm depends on iproute2-x.y.z version rpm.

DL> The CLONE_NEWNET will fail if the network namespace is not DL> compiled in. I understand this check but it looks like a little DL> random. You are not 100% sure this clone has failed because the DL> network namespace is not supported. That can be another subsystem DL> or namespace which has failed during the initialization of the DL> namespaces. The check is performed twice, once with the basic set of flags and again with CLONE_NEWNET. If the first check fails, we assume no LXC support (as we did before). If the second fails, we assume LXC but no NETNS. Is there something else I'm missing here? DL> Why don't you simply check the presence of the 'netns' process ? That seems like a valid way as well, although we already do our feature checks by testing the clone. Also, by doing it this way, we have a better confirmation that an actual clone(CLONE_NEWNET) will work, IMHO. DL> Concerning iproute2, I think this is the work of the installer to DL> check the dependencies, eg. the libvirt rpm depends on DL> iproute2-x.y.z version rpm. Agreed, and I'm sure it will. Note DV's second comment here: https://www.redhat.com/archives/libvir-list/2008-June/msg00232.html Thanks! -- Dan Smith IBM Linux Technology Center Open Hypervisor Team email: danms@us.ibm.com

Dan Smith wrote:
DL> The CLONE_NEWNET will fail if the network namespace is not DL> compiled in. I understand this check but it looks like a little DL> random. You are not 100% sure this clone has failed because the DL> network namespace is not supported. That can be another subsystem DL> or namespace which has failed during the initialization of the DL> namespaces.
The check is performed twice, once with the basic set of flags and again with CLONE_NEWNET. If the first check fails, we assume no LXC support (as we did before). If the second fails, we assume LXC but no NETNS. Is there something else I'm missing here?
DL> Why don't you simply check the presence of the 'netns' process ?
That seems like a valid way as well, although we already do our feature checks by testing the clone. Also, by doing it this way, we have a better confirmation that an actual clone(CLONE_NEWNET) will work, IMHO.
You are not doing clone(CLONE_NEWNET) in this code. You call clone(CLONE_NEWPID|CLONE_NEWNS|CLONE_NEWUTS|CLONE_NEWUSER|CLONE_NEWIPC|SIGCHLD|CLONE_NEWNET) When this call fails, you 'assume' netns is not compiled in. I agree the clone should have returned ENOSYS in this case, but it returns EINVAL and that do not means 'netns in not active'. If you look at the processes, you have a kthread called '[netns]', which indicates the network namespaces are actives in the kernel and this is not an assumption.
DL> Concerning iproute2, I think this is the work of the installer to DL> check the dependencies, eg. the libvirt rpm depends on DL> iproute2-x.y.z version rpm.
Agreed, and I'm sure it will. Note DV's second comment here:
https://www.redhat.com/archives/libvir-list/2008-June/msg00232.html
In your code, you launch the ip command and if it fails with a particular exit code, you 'assume' netns is not supported. Another assumption ... IMHO you should rely on the package dependencies/command version. Or if you absolutely want to detect that at startup, perhaps doing "ip link help | grep netns" is more secure :)

DL> You call DL> clone(CLONE_NEWPID|CLONE_NEWNS|CLONE_NEWUTS|CLONE_NEWUSER|CLONE_NEWIPC|SIGCHLD|CLONE_NEWNET) DL> When this call fails, you 'assume' netns is not compiled in. No, actually, I do this: int flags = CLONE_NEWPID|CLONE_NEWNS|CLONE_NEWUTS|CLONE_NEWUSER| CLONE_NEWIPC|SIGCHLD|extra_flags; Where extra_flags=0 for the LXC detection and extra_flags=CLONE_NEWNET for the NETNS detection. See the lxcCheckContainerSupport() calls in lxcProbe() and lxcCheckNetNsSupport(). DL> In your code, you launch the ip command and if it fails with a DL> particular exit code, you 'assume' netns is not supported. Another DL> assumption The ip command returns a different error code for an invalid subcommand than for a failure of a known subcommand. That seems like a pretty reasonable sentinel (and certainly better than scraping the help output), IMHO. DL> ... IMHO you should rely on the package dependencies/command DL> version. Or if you absolutely want to detect that at startup, DL> perhaps doing "ip link help | grep netns" is more secure :) DV has already said he'd like to see it done dynamically at the driver probe stage. I'll let him comment on his preferred way of doing that. -- Dan Smith IBM Linux Technology Center Open Hypervisor Team email: danms@us.ibm.com

Dan Smith wrote:
DL> You call DL> clone(CLONE_NEWPID|CLONE_NEWNS|CLONE_NEWUTS|CLONE_NEWUSER|CLONE_NEWIPC|SIGCHLD|CLONE_NEWNET)
DL> When this call fails, you 'assume' netns is not compiled in.
No, actually, I do this:
int flags = CLONE_NEWPID|CLONE_NEWNS|CLONE_NEWUTS|CLONE_NEWUSER| CLONE_NEWIPC|SIGCHLD|extra_flags;
Where extra_flags=0 for the LXC detection and extra_flags=CLONE_NEWNET for the NETNS detection. See the lxcCheckContainerSupport() calls in lxcProbe() and lxcCheckNetNsSupport().
lxcCheckContainerSupport(0) => clone(CLONE_NEWPID| CLONE_NEWNS| CLONE_NEWUTS| CLONE_NEWUSER| CLONE_NEWIPC| SIGCHLD); lxcCheckNetNsSupport() => lxcCheckContainerSupport(CLONE_NEWNET) => clone(CLONE_NEWPID| CLONE_NEWNS| CLONE_NEWUTS| CLONE_NEWUSER| CLONE_NEWIPC| SIGCHLD|CLONE_NEWNET); Did I missed something ?
DL> In your code, you launch the ip command and if it fails with a DL> particular exit code, you 'assume' netns is not supported. Another DL> assumption
The ip command returns a different error code for an invalid subcommand than for a failure of a known subcommand. That seems like a pretty reasonable sentinel (and certainly better than scraping the help output), IMHO.
Who told to scrap the output :) Just verify the return code of the command. Anyway, catching a specific return code for an unknown subcommand makes sense for this check.
DL> ... IMHO you should rely on the package dependencies/command DL> version. Or if you absolutely want to detect that at startup, DL> perhaps doing "ip link help | grep netns" is more secure :)
DV has already said he'd like to see it done dynamically at the driver probe stage. I'll let him comment on his preferred way of doing that.

DL> Did I missed something ? I think I misinterpreted your original statement, so let me go back. You said: DL> When this call fails, you 'assume' netns is not compiled in. Why is this not an appropriate assumption? If I can't clone(CLONE_NETNS) for the check, then why should I not assume that this will fail later when I try to clone() to create the container itself (despite the presence or absence of netns support)? Would you argue that testing for basic container support is not reasonably accomplished by the existing clone() test? If I check for the presence of a kthread, which could go away if the implementation is changed down the road, then I would falsely conclude that the feature is not available. This test would fail, even though I could clone() with the flag and get the proper behavior... Correct? DL> Who told to scrap the output :) Just verify the return code of the DL> command. You're still scraping the output, just pushing the burden for doing so onto grep. But, fair enough :) DL> Anyway, catching a specific return code for an unknown subcommand DL> makes sense for this check. Okay, cool. -- Dan Smith IBM Linux Technology Center Open Hypervisor Team email: danms@us.ibm.com

Dan Smith wrote:
DL> Did I missed something ?
I think I misinterpreted your original statement, so let me go back. You said:
DL> When this call fails, you 'assume' netns is not compiled in.
Why is this not an appropriate assumption? If I can't clone(CLONE_NETNS) for the check, then why should I not assume that this will fail later when I try to clone() to create the container itself (despite the presence or absence of netns support)?
Would you argue that testing for basic container support is not reasonably accomplished by the existing clone() test?
1 - You use a combination of flags to check the network namespace is implemented. One of the namespaces can fail when trying to create it and the clone will fail, and you will deduce the network namespace is not compiled in. But you can do the assumption that never happens. 2 - You try to use only the CLONE_NEWNET flag, and that can fails for another reasons than it is not implemented. The network namespace initialization will trigger the initialization of probably more than one hundred network subsystems which can fail for some reasons. You can assume that never happens too. Honestly, these cases are not frequent but they exists. IMO, it is up to me to warn you when there are some corner cases like these. And it is up to you to consider you can ignore them because that happens only when we reach some limits.
If I check for the presence of a kthread, which could go away if the implementation is changed down the road, then I would falsely conclude that the feature is not available. This test would fail, even though I could clone() with the flag and get the proper behavior... Correct?
It is a good point. But I added this kthread because I needed a separate workq to cleanup the namespace, without this kthread the network namespace can not work properly. IMO, it will not be removed very soon :) But again, you can ignore that, if you prefer to use the 'ip' command.
DL> Who told to scrap the output :) Just verify the return code of the DL> command.
You're still scraping the output, just pushing the burden for doing so onto grep. But, fair enough :)
DL> Anyway, catching a specific return code for an unknown subcommand DL> makes sense for this check.
Okay, cool.

DL> Honestly, these cases are not frequent but they exists. IMO, it is DL> up to me to warn you when there are some corner cases like DL> these. And it is up to you to consider you can ignore them because DL> that happens only when we reach some limits. Fair enough :) DL> It is a good point. But I added this kthread because I needed a DL> separate workq to cleanup the namespace, without this kthread the DL> network namespace can not work properly. IMO, it will not be DL> removed very soon :) I'm not saying that will be, but it's a side effect of your implementation, which may or may not be present in a subsequent revision. DL> But again, you can ignore that, if you prefer to use the 'ip' DL> command. When it really comes down to it, it doesn't matter if the kernel support is present if the user doesn't have an updated 'ip' binary. -- Dan Smith IBM Linux Technology Center Open Hypervisor Team email: danms@us.ibm.com

On Tue, Jun 24, 2008 at 08:51:33AM -0700, Dan Smith wrote:
Allow check for containers support to be done without CLONE_NEWNET, and then determine support on the fly by checking for iproute2 support and a successful clone(CLONE_NEWNET). This lets us set a flag for later, as well as not completely disable LXC support on a system without NETNS support.
Honnestly this looks fine to me, I see that Daniel has arguments about the detection method, I would just prefer to have it dynamic as this patch implement than something static relying on configure or RPM dependancies. The goal to me is to try to provide best services on various installations, since lxc and the associated kernel and net command will take a while to be fully deployed, I prefer something very flexible at this point, even if it may occasionally lead to a failure in some corner cases. The goal is really to get as much people in a situation to use features the earliest possible, +1 Daniel -- Red Hat Virtualization group http://redhat.com/virtualization/ Daniel Veillard | virtualization library http://libvirt.org/ veillard@redhat.com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/ http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/

This gives us the ability to create a veth pair so that we can move one into the network namespace of an LXC container. Changes from last time: - Fixed use of sprintf() - Fixed up 'ip link...' argument synthesis - Fixed license headers - Fixed while() {...} to do {...} while() - Fixed copying of const string in vethInterfaceUpOrDown() diff -r 405ae197cd7e -r 203dce381784 src/Makefile.am --- a/src/Makefile.am Mon Jun 23 11:43:03 2008 -0700 +++ b/src/Makefile.am Mon Jun 23 11:53:42 2008 -0700 @@ -64,6 +64,7 @@ lxc_driver.c lxc_driver.h \ lxc_conf.c lxc_conf.h \ lxc_container.c lxc_container.h \ + veth.c veth.h \ nodeinfo.h nodeinfo.c \ util.c util.h diff -r 405ae197cd7e -r 203dce381784 src/veth.c --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/src/veth.c Mon Jun 23 11:53:42 2008 -0700 @@ -0,0 +1,219 @@ +/* + * veth.c: Tools for managing veth pairs + * + * Copyright IBM Corp. 2008 + * + * See COPYING.LIB for the License of this software + * + * Authors: + * David L. Leskovec <dlesko at linux.vnet.ibm.com> + */ + +#include <config.h> + +#include <string.h> + +#include "veth.h" +#include "internal.h" +#include "memory.h" +#include "util.h" + +#define DEBUG(fmt,...) VIR_DEBUG(__FILE__, fmt, __VA_ARGS__) +#define DEBUG0(msg) VIR_DEBUG(__FILE__, "%s", msg) + +/* Functions */ +/** + * getFreeVethName: + * @veth: name for veth device (NULL to find first open) + * @maxLen: max length of veth name + * @startDev: device number to start at (x in vethx) + * + * Looks in /sys/class/net/ to find the first available veth device + * name. + * + * Returns 0 on success or -1 in case of error + */ +static int getFreeVethName(char *veth, int maxLen, int startDev) +{ + int rc = -1; + int devNum = startDev; + char path[PATH_MAX]; + + do { + snprintf(path, PATH_MAX, "/sys/class/net/veth%d/", devNum); + ++devNum; + } while (virFileExists(path)); + + snprintf(veth, maxLen, "veth%d", devNum); + + rc = devNum; + + return rc; +} + +/** + * vethCreate: + * @veth1: name for one end of veth pair + * @veth1MaxLen: max length of veth1 name + * @veth2: name for one end of veth pair + * @veth2MaxLen: max length of veth1 name + * + * Creates a veth device pair using the ip command: + * ip link add veth1 type veth peer name veth2 + * NOTE: If veth1 and veth2 names are not specified, ip will auto assign + * names. There seems to be two problems here - + * 1) There doesn't seem to be a way to determine the names of the + * devices that it creates. They show up in ip link show and + * under /sys/class/net/ however there is no guarantee that they + * are the devices that this process just created. + * 2) Once one of the veth devices is moved to another namespace, it + * is no longer visible in the parent namespace. This seems to + * confuse the name assignment causing it to fail with File exists. + * Because of these issues, this function currently forces the caller + * to fully specify the veth device names. + * + * Returns 0 on success or -1 in case of error + */ +int vethCreate(char* veth1, int veth1MaxLen, + char* veth2, int veth2MaxLen) +{ + int rc = -1; + const char *argv[] = { + "ip", "link", "add", veth1, "type", "veth", "peer", "name", veth2, NULL + }; + int cmdResult; + int vethDev = 0; + + if ((NULL == veth1) || (NULL == veth2)) { + goto error_out; + } + + DEBUG("veth1: %s veth2: %s", veth1, veth2); + + if (1 > strlen(veth1)) { + vethDev = getFreeVethName(veth1, veth1MaxLen, 0); + ++vethDev; + DEBUG("assigned veth1: %s", veth1); + } + + if (1 > strlen(veth2)) { + vethDev = getFreeVethName(veth2, veth2MaxLen, vethDev); + DEBUG("assigned veth2: %s", veth2); + } + + rc = virRun(NULL, (char**)argv, &cmdResult); + + if (0 == rc) { + rc = cmdResult; + } + +error_out: + return rc; +} + +/** + * vethDelete: + * @veth: name for one end of veth pair + * + * This will delete both veth devices in a pair. Only one end needs to + * be specified. The ip command will identify and delete the other veth + * device as well. + * ip link del veth + * + * Returns 0 on success or -1 in case of error + */ +int vethDelete(const char *veth) +{ + int rc = -1; + const char *argv[] = {"ip", "link", "del", veth, NULL}; + int cmdResult; + + if (NULL == veth) { + goto error_out; + } + + DEBUG("veth: %s", veth); + + rc = virRun(NULL, (char**)argv, &cmdResult); + + if (0 == rc) { + rc = cmdResult; + } + +error_out: + return rc; +} + +/** + * vethInterfaceUpOrDown: + * @veth: name of veth device + * @upOrDown: 0 => down, 1 => up + * + * Enables a veth device using the ifconfig command. A NULL inetAddress + * will cause it to be left off the command line. + * + * Returns 0 on success or -1 in case of error + */ +int vethInterfaceUpOrDown(const char* veth, int upOrDown) +{ + int rc = -1; + const char *argv[] = {"ifconfig", veth, NULL, NULL}; + int cmdResult; + + if (NULL == veth) { + goto error_out; + } + + if (0 == upOrDown) + argv[2] = "down"; + else + argv[2] = "up"; + + rc = virRun(NULL, (char**)argv, &cmdResult); + + if (0 == rc) { + rc = cmdResult; + } + +error_out: + return rc; +} + +/** + * moveInterfaceToNetNs: + * @interface: name of device + * @pidInNs: PID of process in target net namespace + * + * Moves the given device into the target net namespace specified by the given + * pid using this command: + * ip link set interface netns pidInNs + * + * Returns 0 on success or -1 in case of error + */ +int moveInterfaceToNetNs(const char* interface, int pidInNs) +{ + int rc; + /* offset of the pid field in the following args */ + char *pid = NULL; + const char *argv[] = { + "ip", "link", "set", interface, "netns", NULL, NULL + }; + int cmdResult; + int len; + + if (NULL == interface) { + goto error_out; + } + + if (asprintf(&pid, "%d", pidInNs) == -1) + goto error_out; + + argv[5] = pid; + rc = virRun(NULL, (char**)argv, &cmdResult); + if (0 == rc) + rc = cmdResult; + +error_out: + VIR_FREE(pid); + return rc; +} diff -r 405ae197cd7e -r 203dce381784 src/veth.h --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/src/veth.h Mon Jun 23 11:53:42 2008 -0700 @@ -0,0 +1,24 @@ +/* + * veth.h: Interface to tools for managing veth pairs + * + * Copyright IBM Corp. 2008 + * + * See COPYING.LIB for the License of this software + * + * Authors: + * David L. Leskovec <dlesko at linux.vnet.ibm.com> + */ + +#ifndef VETH_H +#define VETH_H + +#include <config.h> + +/* Function declarations */ +int vethCreate(char* veth1, int veth1MaxLen, char* veth2, + int veth2MaxLen); +int vethDelete(const char* veth); +int vethInterfaceUpOrDown(const char* veth, int upOrDown); +int moveInterfaceToNetNs(const char *interface, int pidInNs); + +#endif /* VETH_H */

On Tue, Jun 24, 2008 at 08:51:34AM -0700, Dan Smith wrote:
This gives us the ability to create a veth pair so that we can move one into the network namespace of an LXC container.
Cool, thanks ! +1 Daniel -- Red Hat Virtualization group http://redhat.com/virtualization/ Daniel Veillard | virtualization library http://libvirt.org/ veillard@redhat.com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/ http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/

Changes: - Throw an error after parsing if nets were specified and NETNS support is not present diff -r 203dce381784 -r bb48967cf19e src/lxc_conf.c --- a/src/lxc_conf.c Mon Jun 23 11:53:42 2008 -0700 +++ b/src/lxc_conf.c Mon Jun 23 11:53:45 2008 -0700 @@ -69,6 +69,190 @@ __virRaiseError(conn, dom, NULL, VIR_FROM_LXC, code, VIR_ERR_ERROR, codeErrorMessage, errorMessage, NULL, 0, 0, codeErrorMessage, errorMessage); +} + +/** + * lxcParseInterfaceXML: + * @conn: pointer to connection + * @nodePtr: pointer to xml node structure + * @vm: pointer to net definition structure to fill in + * + * Parses the XML for a network interface and places the configuration + * in the given structure. + * + * Returns 0 on success or -1 in case of error + */ +static int lxcParseInterfaceXML(virConnectPtr conn, xmlNodePtr nodePtr, + lxc_net_def_t *netDef) +{ + int rc = -1; + xmlNodePtr cur; + xmlChar *type = NULL; + xmlChar *parentIfName = NULL; + xmlChar *network = NULL; + xmlChar *bridge = NULL; + xmlChar *macaddr = NULL; + + netDef->type = LXC_NET_NETWORK; + + type = xmlGetProp(nodePtr, BAD_CAST "type"); + if (type != NULL) { + if (xmlStrEqual(type, BAD_CAST "network")) { + netDef->type = LXC_NET_NETWORK; + } + else if (xmlStrEqual(type, BAD_CAST "bridge")) { + netDef->type = LXC_NET_BRIDGE; + } + else { + lxcError(conn, NULL, VIR_ERR_XML_ERROR, + _("invalid interface type: %s"), type); + goto error_out; + } + } + + cur = nodePtr->children; + for (cur = nodePtr->children; cur != NULL; cur = cur->next) { + if (cur->type == XML_ELEMENT_NODE) { + DEBUG("cur->name: %s", (char*)(cur->name)); + if ((macaddr == NULL) && + (xmlStrEqual(cur->name, BAD_CAST "mac"))) { + macaddr = xmlGetProp(cur, BAD_CAST "address"); + } else if ((network == NULL) && + (netDef->type == LXC_NET_NETWORK) && + (xmlStrEqual(cur->name, BAD_CAST "source"))) { + network = xmlGetProp(cur, BAD_CAST "network"); + parentIfName = xmlGetProp(cur, BAD_CAST "dev"); + } else if ((bridge == NULL) && + (netDef->type == LXC_NET_BRIDGE) && + (xmlStrEqual(cur->name, BAD_CAST "source"))) { + bridge = xmlGetProp(cur, BAD_CAST "bridge"); + } else if ((parentIfName == NULL) && + (xmlStrEqual(cur->name, BAD_CAST "target"))) { + parentIfName = xmlGetProp(cur, BAD_CAST "dev"); + } + } + } + + if (netDef->type == LXC_NET_NETWORK) { + if (network == NULL) { + lxcError(conn, NULL, VIR_ERR_XML_ERROR, + _("No <source> 'network' attribute specified with <interface type='network'/>")); + goto error_out; + } + + netDef->txName = strdup((char *)network); + if (NULL == netDef->txName) { + lxcError(conn, NULL, VIR_ERR_NO_MEMORY, + _("No storage for network name")); + goto error_out; + } + + } else if (netDef->type == LXC_NET_BRIDGE) { + if (bridge == NULL) { + lxcError(conn, NULL, VIR_ERR_XML_ERROR, + _("No <source> 'bridge' attribute specified with <interface type='bridge'/>")); + goto error_out; + } + + netDef->txName = strdup((char *)bridge); + if (NULL == netDef->txName) { + lxcError(conn, NULL, VIR_ERR_NO_MEMORY, + _("No storage for bridge name")); + goto error_out; + } + } + + if (parentIfName != NULL) { + DEBUG("set netDef->parentVeth: %s", netDef->parentVeth); + netDef->parentVeth = strdup((char *)parentIfName); + if (NULL == netDef->parentVeth) { + lxcError(conn, NULL, VIR_ERR_NO_MEMORY, + _("No storage for parent veth device name")); + goto error_out; + } + } else { + netDef->parentVeth = NULL; + DEBUG0("set netDef->parentVeth: NULL"); + } + + rc = 0; + +error_out: + xmlFree(macaddr); + macaddr = NULL; + xmlFree(network); + network = NULL; + xmlFree(bridge); + bridge = NULL; + xmlFree(parentIfName); + parentIfName = NULL; + + return rc; +} + +/** + * lxcParseDomainInterfaces: + * @conn: pointer to connection + * @nets: on success, points to an list of net def structs + * @contextPtr: pointer to xml context + * + * Parses the domain network interfaces and returns the information in a list + * + * Returns 0 on success or -1 in case of error + */ +static int lxcParseDomainInterfaces(virConnectPtr conn, + lxc_net_def_t **nets, + xmlXPathContextPtr contextPtr) +{ + lxc_driver_t *driver = conn->privateData; + int rc = -1; + int i; + lxc_net_def_t *netDef; + lxc_net_def_t *prevDef = NULL; + int numNets = 0; + xmlNodePtr *list; + int res; + + DEBUG0("parsing nets"); + + res = virXPathNodeSet("/domain/devices/interface", contextPtr, &list); + if (res > 0) { + for (i = 0; i < res; ++i) { + netDef = calloc(1, sizeof(lxc_net_def_t)); + if (NULL == netDef) { + lxcError(conn, NULL, VIR_ERR_NO_MEMORY, + _("No storage for net def structure")); + goto parse_complete; + } + + rc = lxcParseInterfaceXML(conn, list[i], netDef); + if (0 > rc) { + DEBUG("failed parsing a net: %d", rc); + + free(netDef); + goto parse_complete; + } + + DEBUG0("parsed a net"); + + /* set the linked list pointers */ + numNets++; + netDef->next = NULL; + if (0 == i) { + *nets = netDef; + } else { + prevDef->next = netDef; + } + prevDef = netDef; + } + free(list); + } + + rc = numNets; + +parse_complete: + DEBUG("parsed %d nets", rc); + return rc; } static int lxcParseMountXML(virConnectPtr conn, xmlNodePtr nodePtr, @@ -372,6 +556,13 @@ containerDef->nmounts = lxcParseDomainMounts(conn, &(containerDef->mounts), contextPtr); if (0 > containerDef->nmounts) { + goto error; + } + + containerDef->numNets = lxcParseDomainInterfaces(conn, + &(containerDef->nets), + contextPtr); + if (0 > containerDef->numNets) { goto error; } @@ -741,6 +932,7 @@ unsigned char *uuid; char uuidstr[VIR_UUID_STRING_BUFLEN]; lxc_mount_t *mount; + lxc_net_def_t *net; if (lxcIsActiveVM(vm)) virBufferVSprintf(&buf, "<domain type='%s' id='%d'>\n", @@ -770,6 +962,27 @@ virBufferAddLit(&buf, " </filesystem>\n"); } + /* loop adding nets */ + for (net = def->nets; net; net = net->next) { + if (net->type == LXC_NET_NETWORK) { + virBufferAddLit(&buf, " <interface type='network'>\n"); + virBufferVSprintf(&buf, " <source network='%s'/>\n", + net->txName); + } else { + virBufferAddLit(&buf, " <interface type='bridge'>\n"); + virBufferVSprintf(&buf, " <source bridge='%s'/>\n", + net->txName); + } + + if (NULL != net->parentVeth) { + virBufferVSprintf(&buf, " <target dev='%s'/>\n", + net->parentVeth); + } + + virBufferAddLit(&buf, " </interface>\n"); + + } + virBufferVSprintf(&buf, " <console tty='%s'/>\n", def->tty); virBufferAddLit(&buf, " </devices>\n"); virBufferAddLit(&buf, "</domain>\n"); @@ -786,6 +999,8 @@ { lxc_mount_t *curMount; lxc_mount_t *nextMount; + lxc_net_def_t *curNet; + lxc_net_def_t *nextNet; if (vmdef == NULL) return; @@ -795,6 +1010,17 @@ nextMount = curMount->next; VIR_FREE(curMount); curMount = nextMount; + } + + curNet = vmdef->nets; + while (curNet) { + nextNet = curNet->next; + printf("Freeing %s:%s\n", curNet->parentVeth, curNet->containerVeth); + VIR_FREE(curNet->parentVeth); + VIR_FREE(curNet->containerVeth); + VIR_FREE(curNet->txName); + VIR_FREE(curNet); + curNet = nextNet; } VIR_FREE(vmdef->name); diff -r 203dce381784 -r bb48967cf19e src/lxc_conf.h --- a/src/lxc_conf.h Mon Jun 23 11:53:42 2008 -0700 +++ b/src/lxc_conf.h Mon Jun 23 11:53:45 2008 -0700 @@ -36,6 +36,22 @@ #define LXC_MAX_ERROR_LEN 1024 #define LXC_DOMAIN_TYPE "lxc" +/* types of networks for containers */ +enum lxc_net_type { + LXC_NET_NETWORK, + LXC_NET_BRIDGE +}; + +typedef struct __lxc_net_def lxc_net_def_t; +struct __lxc_net_def { + int type; + char *parentVeth; /* veth device in parent namespace */ + char *containerVeth; /* veth device in container namespace */ + char *txName; /* bridge or network name */ + + lxc_net_def_t *next; +}; + typedef struct __lxc_mount lxc_mount_t; struct __lxc_mount { char source[PATH_MAX]; /* user's directory */ @@ -61,6 +77,10 @@ /* tty device */ char *tty; + + /* network devices */ + int numNets; + lxc_net_def_t *nets; }; typedef struct __lxc_vm lxc_vm_t; diff -r 203dce381784 -r bb48967cf19e src/lxc_driver.c --- a/src/lxc_driver.c Mon Jun 23 11:53:42 2008 -0700 +++ b/src/lxc_driver.c Mon Jun 23 11:53:45 2008 -0700 @@ -288,6 +288,13 @@ virDomainPtr dom; if (!(def = lxcParseVMDef(conn, xml, NULL))) { + return NULL; + } + + if ((def->nets != NULL) && !(driver->have_netns)) { + lxcError(conn, NULL, VIR_ERR_NO_SUPPORT, + _("System lacks NETNS support")); + lxcFreeVMDef(def); return NULL; }

On Tue, Jun 24, 2008 at 08:51:35AM -0700, Dan Smith wrote:
Changes: - Throw an error after parsing if nets were specified and NETNS support is not present Fine by me
[...]
+error_out: + xmlFree(macaddr); + macaddr = NULL; + xmlFree(network); + network = NULL; + xmlFree(bridge); + bridge = NULL; + xmlFree(parentIfName); + parentIfName = NULL; + + return rc; +}
in that case the = NULL; are superfluous as they are local variables, but i assume the compiler optimizes this easilly :-)
+ res = virXPathNodeSet("/domain/devices/interface", contextPtr, &list); + if (res > 0) { + for (i = 0; i < res; ++i) { + netDef = calloc(1, sizeof(lxc_net_def_t)); + if (NULL == netDef) { + lxcError(conn, NULL, VIR_ERR_NO_MEMORY, + _("No storage for net def structure"));
-> list need to be freed here or you leak
+ goto parse_complete; + } + + rc = lxcParseInterfaceXML(conn, list[i], netDef); + if (0 > rc) { + DEBUG("failed parsing a net: %d", rc); + + free(netDef);
-> same here list need to be freed here or you leak
+ goto parse_complete; + } + + DEBUG0("parsed a net"); + + /* set the linked list pointers */ + numNets++; + netDef->next = NULL; + if (0 == i) { + *nets = netDef; + } else { + prevDef->next = netDef; + } + prevDef = netDef; + } + free(list); + } + + rc = numNets; + +parse_complete: + DEBUG("parsed %d nets", rc); + return rc; }
Looks fine +1 once the two leaks are plugged :-) Daniel -- Red Hat Virtualization group http://redhat.com/virtualization/ Daniel Veillard | virtualization library http://libvirt.org/ veillard@redhat.com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/ http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/

DV> in that case the = NULL; are superfluous as they are local DV> variables, but i assume the compiler optimizes this easilly :-) Removed. DV> Looks fine +1 once the two leaks are plugged :-) Fixed. I'll go ahead and push these into CVS. Thanks! -- Dan Smith IBM Linux Technology Center Open Hypervisor Team email: danms@us.ibm.com

Changes: - Remove extraneous "i" variables from various functions - Only bring up lo if we have other interfaces (and thus NETNS) - Fail setup of interfaces if NETNS support is not present - Only add CLONE_NEWNET to start flags if domain has interfaces defined - Make lxc_vm_t parameters const in helper functions (where appropriate) - Cleanup DomainStart procedure to bail on start/move/continue errors diff -r bb48967cf19e -r 88267b7327be src/lxc_conf.h --- a/src/lxc_conf.h Mon Jun 23 11:53:45 2008 -0700 +++ b/src/lxc_conf.h Mon Jun 23 11:53:45 2008 -0700 @@ -35,6 +35,12 @@ #define LXC_MAX_XML_LENGTH 16384 #define LXC_MAX_ERROR_LEN 1024 #define LXC_DOMAIN_TYPE "lxc" +#define LXC_PARENT_SOCKET 0 +#define LXC_CONTAINER_SOCKET 1 + +/* messages between parent and container */ +typedef char lxc_message_t; +#define LXC_CONTINUE_MSG 'c' /* types of networks for containers */ enum lxc_net_type { @@ -96,6 +102,8 @@ int parentTty; int containerTtyFd; char *containerTty; + + int sockpair[2]; lxc_vm_def_t *def; diff -r bb48967cf19e -r 88267b7327be src/lxc_container.c --- a/src/lxc_container.c Mon Jun 23 11:53:45 2008 -0700 +++ b/src/lxc_container.c Mon Jun 23 11:53:45 2008 -0700 @@ -36,6 +36,7 @@ #include "lxc_conf.h" #include "util.h" #include "memory.h" +#include "veth.h" #define DEBUG(fmt,...) VIR_DEBUG(__FILE__, fmt, __VA_ARGS__) #define DEBUG0(msg) VIR_DEBUG(__FILE__, "%s", msg) @@ -159,6 +160,72 @@ } /** + * lxcWaitForContinue: + * @vm: Pointer to vm structure + * + * This function will wait for the container continue message from the + * parent process. It will send this message on the socket pair stored in + * the vm structure once it has completed the post clone container setup. + * + * Returns 0 on success or -1 in case of error + */ +static int lxcWaitForContinue(lxc_vm_t *vm) +{ + int rc = -1; + lxc_message_t msg; + int readLen; + + readLen = saferead(vm->sockpair[LXC_CONTAINER_SOCKET], &msg, sizeof(msg)); + if (readLen != sizeof(msg)) { + lxcError(NULL, NULL, VIR_ERR_INTERNAL_ERROR, + _("Failed to read the container continue message: %s"), + strerror(errno)); + goto error_out; + } + + DEBUG0("Received container continue message"); + + close(vm->sockpair[LXC_PARENT_SOCKET]); + vm->sockpair[LXC_PARENT_SOCKET] = -1; + close(vm->sockpair[LXC_CONTAINER_SOCKET]); + vm->sockpair[LXC_CONTAINER_SOCKET] = -1; + + rc = 0; + +error_out: + return rc; +} + +/** + * lxcEnableInterfaces: + * @vm: Pointer to vm structure + * + * This function will enable the interfaces for this container. + * + * Returns 0 on success or nonzero in case of error + */ +static int lxcEnableInterfaces(const lxc_vm_t *vm) +{ + int rc = 0; + const lxc_net_def_t *net; + + for (net = vm->def->nets; net; net = net->next) { + DEBUG("Enabling %s", net->containerVeth); + rc = vethInterfaceUpOrDown(net->containerVeth, 1); + if (0 != rc) { + goto error_out; + } + } + + /* enable lo device only if there were other net devices */ + if (vm->def->nets) + rc = vethInterfaceUpOrDown("lo", 1); + +error_out: + return rc; +} + +/** * lxcChild: * @argv: Pointer to container arguments * @@ -210,6 +277,16 @@ goto cleanup; } + /* Wait for interface devices to show up */ + if (0 != (rc = lxcWaitForContinue(vm))) { + goto cleanup; + } + + /* enable interfaces */ + if (0 != (rc = lxcEnableInterfaces(vm))) { + goto cleanup; + } + rc = lxcExecWithTty(vm); /* this function will only return if an error occured */ diff -r bb48967cf19e -r 88267b7327be src/lxc_driver.c --- a/src/lxc_driver.c Mon Jun 23 11:53:45 2008 -0700 +++ b/src/lxc_driver.c Mon Jun 23 11:53:45 2008 -0700 @@ -44,6 +44,9 @@ #include "memory.h" #include "util.h" #include "memory.h" +#include "bridge.h" +#include "qemu_conf.h" +#include "veth.h" /* debug macros */ #define DEBUG(fmt,...) VIR_DEBUG(__FILE__, fmt, __VA_ARGS__) @@ -395,6 +398,202 @@ } /** + * lxcSetupInterfaces: + * @conn: pointer to connection + * @vm: pointer to virtual machine structure + * + * Sets up the container interfaces by creating the veth device pairs and + * attaching the parent end to the appropriate bridge. The container end + * will moved into the container namespace later after clone has been called. + * + * Returns 0 on success or -1 in case of error + */ +static int lxcSetupInterfaces(virConnectPtr conn, + lxc_vm_t *vm) +{ + int rc = -1; + lxc_driver_t *driver = conn->privateData; + struct qemud_driver *networkDriver = + (struct qemud_driver *)(conn->networkPrivateData); + lxc_net_def_t *net = vm->def->nets; + char* bridge; + char parentVeth[PATH_MAX] = ""; + char containerVeth[PATH_MAX] = ""; + + if ((vm->def->nets != NULL) && (driver->have_netns == 0)) { + lxcError(conn, NULL, VIR_ERR_NO_SUPPORT, + _("System lacks NETNS support")); + return -1; + } + + for (net = vm->def->nets; net; net = net->next) { + if (LXC_NET_NETWORK == net->type) { + virNetworkPtr network = virNetworkLookupByName(conn, net->txName); + if (!network) { + goto error_exit; + } + + bridge = virNetworkGetBridgeName(network); + + virNetworkFree(network); + + } else { + bridge = net->txName; + } + + DEBUG("bridge: %s", bridge); + if (NULL == bridge) { + lxcError(conn, NULL, VIR_ERR_INTERNAL_ERROR, + _("failed to get bridge for interface")); + goto error_exit; + } + + DEBUG0("calling vethCreate()"); + if (NULL != net->parentVeth) { + strcpy(parentVeth, net->parentVeth); + } + if (NULL != net->containerVeth) { + strcpy(containerVeth, net->containerVeth); + } + DEBUG("parentVeth: %s, containerVeth: %s", parentVeth, containerVeth); + if (0 != (rc = vethCreate(parentVeth, PATH_MAX, containerVeth, PATH_MAX))) { + lxcError(conn, NULL, VIR_ERR_INTERNAL_ERROR, + _("failed to create veth device pair: %d"), rc); + goto error_exit; + } + if (NULL == net->parentVeth) { + net->parentVeth = strdup(parentVeth); + } + if (NULL == net->containerVeth) { + net->containerVeth = strdup(containerVeth); + } + + if ((NULL == net->parentVeth) || (NULL == net->containerVeth)) { + lxcError(conn, NULL, VIR_ERR_INTERNAL_ERROR, + _("failed to allocate veth names")); + goto error_exit; + } + + if (!(networkDriver->brctl) && (rc = brInit(&(networkDriver->brctl)))) { + lxcError(conn, NULL, VIR_ERR_INTERNAL_ERROR, + _("cannot initialize bridge support: %s"), + strerror(rc)); + goto error_exit; + } + + if (0 != (rc = brAddInterface(networkDriver->brctl, bridge, parentVeth))) { + lxcError(conn, NULL, VIR_ERR_INTERNAL_ERROR, + _("failed to add %s device to %s: %s"), + parentVeth, + bridge, + strerror(rc)); + goto error_exit; + } + + if (0 != (rc = vethInterfaceUpOrDown(parentVeth, 1))) { + lxcError(conn, NULL, VIR_ERR_INTERNAL_ERROR, + _("failed to enable parent ns veth device: %d"), rc); + goto error_exit; + } + + } + + rc = 0; + +error_exit: + return rc; +} + +/** + * lxcMoveInterfacesToNetNs: + * @conn: pointer to connection + * @vm: pointer to virtual machine structure + * + * Starts a container process by calling clone() with the namespace flags + * + * Returns 0 on success or -1 in case of error + */ +static int lxcMoveInterfacesToNetNs(virConnectPtr conn, + const lxc_vm_t *vm) +{ + int rc = -1; + lxc_net_def_t *net; + + for (net = vm->def->nets; net; net = net->next) { + if (0 != moveInterfaceToNetNs(net->containerVeth, vm->def->id)) { + lxcError(conn, NULL, VIR_ERR_INTERNAL_ERROR, + _("failed to move interface %s to ns %d"), + net->containerVeth, vm->def->id); + goto error_exit; + } + } + + rc = 0; + +error_exit: + return rc; +} + +/** + * lxcCleanupInterfaces: + * @conn: pointer to connection + * @vm: pointer to virtual machine structure + * + * Cleans up the container interfaces by deleting the veth device pairs. + * + * Returns 0 on success or -1 in case of error + */ +static int lxcCleanupInterfaces(const lxc_vm_t *vm) +{ + int rc = -1; + lxc_net_def_t *net; + + for (net = vm->def->nets; net; net = net->next) { + if (0 != (rc = vethDelete(net->parentVeth))) { + lxcError(NULL, NULL, VIR_ERR_INTERNAL_ERROR, + _("failed to delete veth: %s"), net->parentVeth); + /* will continue to try to cleanup any other interfaces */ + } + } + + return 0; +} + +/** + * lxcSendContainerContinue: + * @vm: pointer to virtual machine structure + * + * Sends the continue message via the socket pair stored in the vm + * structure. + * + * Returns 0 on success or -1 in case of error + */ +static int lxcSendContainerContinue(const lxc_vm_t *vm) +{ + int rc = -1; + lxc_message_t msg = LXC_CONTINUE_MSG; + int writeCount = 0; + + if (NULL == vm) { + goto error_out; + } + + writeCount = safewrite(vm->sockpair[LXC_PARENT_SOCKET], &msg, + sizeof(msg)); + if (writeCount != sizeof(msg)) { + lxcError(NULL, NULL, VIR_ERR_INTERNAL_ERROR, + _("unable to send container continue message: %s"), + strerror(errno)); + goto error_out; + } + + rc = 0; + +error_out: + return rc; +} + +/** * lxcStartContainer: * @conn: pointer to connection * @driver: pointer to driver structure @@ -422,6 +621,9 @@ stacktop = stack + stacksize; flags = CLONE_NEWPID|CLONE_NEWNS|CLONE_NEWUTS|CLONE_NEWUSER|CLONE_NEWIPC|SIGCHLD; + + if (vm->def->nets != NULL) + flags |= CLONE_NEWNET; vm->def->id = clone(lxcChild, stacktop, flags, (void *)vm); @@ -819,15 +1021,42 @@ close(vm->parentTty); close(vm->containerTtyFd); - rc = lxcStartContainer(conn, driver, vm); - - if (rc == 0) { - vm->state = VIR_DOMAIN_RUNNING; - driver->ninactivevms--; - driver->nactivevms++; + if (0 != (rc = lxcSetupInterfaces(conn, vm))) { + goto cleanup; } + /* create a socket pair to send continue message to the container once */ + /* we've completed the post clone configuration */ + if (0 != socketpair(PF_UNIX, SOCK_STREAM, 0, vm->sockpair)) { + lxcError(conn, NULL, VIR_ERR_INTERNAL_ERROR, + _("sockpair failed: %s"), strerror(errno)); + goto cleanup; + } + + /* check this rc */ + + rc = lxcStartContainer(conn, driver, vm); + if (rc != 0) + goto cleanup; + + rc = lxcMoveInterfacesToNetNs(conn, vm); + if (rc != 0) + goto cleanup; + + rc = lxcSendContainerContinue(vm); + if (rc != 0) + goto cleanup; + + vm->state = VIR_DOMAIN_RUNNING; + driver->ninactivevms--; + driver->nactivevms++; + cleanup: + close(vm->sockpair[LXC_PARENT_SOCKET]); + vm->sockpair[LXC_PARENT_SOCKET] = -1; + close(vm->sockpair[LXC_CONTAINER_SOCKET]); + vm->sockpair[LXC_CONTAINER_SOCKET] = -1; + return rc; } @@ -957,6 +1186,9 @@ int rc = -1; int waitRc; int childStatus = -1; + + /* if this fails, we'll continue. it will report any errors */ + lxcCleanupInterfaces(vm); while (((waitRc = waitpid(vm->def->id, &childStatus, 0)) == -1) && errno == EINTR);

Dan Smith wrote:
Changes: - Remove extraneous "i" variables from various functions - Only bring up lo if we have other interfaces (and thus NETNS) - Fail setup of interfaces if NETNS support is not present - Only add CLONE_NEWNET to start flags if domain has interfaces defined - Make lxc_vm_t parameters const in helper functions (where appropriate) - Cleanup DomainStart procedure to bail on start/move/continue errors
diff -r bb48967cf19e -r 88267b7327be src/lxc_conf.h --- a/src/lxc_conf.h Mon Jun 23 11:53:45 2008 -0700 +++ b/src/lxc_conf.h Mon Jun 23 11:53:45 2008 -0700 @@ -35,6 +35,12 @@ #define LXC_MAX_XML_LENGTH 16384 #define LXC_MAX_ERROR_LEN 1024 #define LXC_DOMAIN_TYPE "lxc" +#define LXC_PARENT_SOCKET 0 +#define LXC_CONTAINER_SOCKET 1 + +/* messages between parent and container */ +typedef char lxc_message_t; +#define LXC_CONTINUE_MSG 'c'
/* types of networks for containers */ enum lxc_net_type { @@ -96,6 +102,8 @@ int parentTty; int containerTtyFd; char *containerTty; + + int sockpair[2];
lxc_vm_def_t *def;
diff -r bb48967cf19e -r 88267b7327be src/lxc_container.c --- a/src/lxc_container.c Mon Jun 23 11:53:45 2008 -0700 +++ b/src/lxc_container.c Mon Jun 23 11:53:45 2008 -0700 @@ -36,6 +36,7 @@ #include "lxc_conf.h" #include "util.h" #include "memory.h" +#include "veth.h"
#define DEBUG(fmt,...) VIR_DEBUG(__FILE__, fmt, __VA_ARGS__) #define DEBUG0(msg) VIR_DEBUG(__FILE__, "%s", msg) @@ -159,6 +160,72 @@ }
/** + * lxcWaitForContinue: + * @vm: Pointer to vm structure + * + * This function will wait for the container continue message from the + * parent process. It will send this message on the socket pair stored in + * the vm structure once it has completed the post clone container setup. + * + * Returns 0 on success or -1 in case of error + */ +static int lxcWaitForContinue(lxc_vm_t *vm) +{ + int rc = -1; + lxc_message_t msg; + int readLen; + + readLen = saferead(vm->sockpair[LXC_CONTAINER_SOCKET], &msg, sizeof(msg)); + if (readLen != sizeof(msg)) { + lxcError(NULL, NULL, VIR_ERR_INTERNAL_ERROR, + _("Failed to read the container continue message: %s"), + strerror(errno)); + goto error_out; + } + + DEBUG0("Received container continue message"); + + close(vm->sockpair[LXC_PARENT_SOCKET]); + vm->sockpair[LXC_PARENT_SOCKET] = -1; + close(vm->sockpair[LXC_CONTAINER_SOCKET]); + vm->sockpair[LXC_CONTAINER_SOCKET] = -1; + + rc = 0; + +error_out: + return rc; +} + +/** + * lxcEnableInterfaces: + * @vm: Pointer to vm structure + * + * This function will enable the interfaces for this container. + * + * Returns 0 on success or nonzero in case of error + */ +static int lxcEnableInterfaces(const lxc_vm_t *vm) +{ + int rc = 0; + const lxc_net_def_t *net; + + for (net = vm->def->nets; net; net = net->next) { + DEBUG("Enabling %s", net->containerVeth); + rc = vethInterfaceUpOrDown(net->containerVeth, 1); + if (0 != rc) { + goto error_out; + } + } + + /* enable lo device only if there were other net devices */ + if (vm->def->nets) + rc = vethInterfaceUpOrDown("lo", 1); + +error_out: + return rc; +} + +/** * lxcChild: * @argv: Pointer to container arguments * @@ -210,6 +277,16 @@ goto cleanup; }
+ /* Wait for interface devices to show up */ + if (0 != (rc = lxcWaitForContinue(vm))) { + goto cleanup; + } + + /* enable interfaces */ + if (0 != (rc = lxcEnableInterfaces(vm))) { + goto cleanup; + } + rc = lxcExecWithTty(vm); /* this function will only return if an error occured */
diff -r bb48967cf19e -r 88267b7327be src/lxc_driver.c --- a/src/lxc_driver.c Mon Jun 23 11:53:45 2008 -0700 +++ b/src/lxc_driver.c Mon Jun 23 11:53:45 2008 -0700 @@ -44,6 +44,9 @@ #include "memory.h" #include "util.h" #include "memory.h" +#include "bridge.h" +#include "qemu_conf.h" +#include "veth.h"
/* debug macros */ #define DEBUG(fmt,...) VIR_DEBUG(__FILE__, fmt, __VA_ARGS__) @@ -395,6 +398,202 @@ }
/** + * lxcSetupInterfaces: + * @conn: pointer to connection + * @vm: pointer to virtual machine structure + * + * Sets up the container interfaces by creating the veth device pairs and + * attaching the parent end to the appropriate bridge. The container end + * will moved into the container namespace later after clone has been called. + * + * Returns 0 on success or -1 in case of error + */ +static int lxcSetupInterfaces(virConnectPtr conn, + lxc_vm_t *vm) +{ + int rc = -1; + lxc_driver_t *driver = conn->privateData; + struct qemud_driver *networkDriver = + (struct qemud_driver *)(conn->networkPrivateData); + lxc_net_def_t *net = vm->def->nets; + char* bridge; + char parentVeth[PATH_MAX] = ""; + char containerVeth[PATH_MAX] = ""; + + if ((vm->def->nets != NULL) && (driver->have_netns == 0)) { + lxcError(conn, NULL, VIR_ERR_NO_SUPPORT, + _("System lacks NETNS support")); + return -1; + } + + for (net = vm->def->nets; net; net = net->next) { + if (LXC_NET_NETWORK == net->type) { + virNetworkPtr network = virNetworkLookupByName(conn, net->txName); + if (!network) { + goto error_exit; + } + + bridge = virNetworkGetBridgeName(network); + + virNetworkFree(network); + + } else { + bridge = net->txName; + } + + DEBUG("bridge: %s", bridge); + if (NULL == bridge) { + lxcError(conn, NULL, VIR_ERR_INTERNAL_ERROR, + _("failed to get bridge for interface")); + goto error_exit; + } + + DEBUG0("calling vethCreate()"); + if (NULL != net->parentVeth) { + strcpy(parentVeth, net->parentVeth); + } + if (NULL != net->containerVeth) { + strcpy(containerVeth, net->containerVeth); + } + DEBUG("parentVeth: %s, containerVeth: %s", parentVeth, containerVeth); + if (0 != (rc = vethCreate(parentVeth, PATH_MAX, containerVeth, PATH_MAX))) { + lxcError(conn, NULL, VIR_ERR_INTERNAL_ERROR, + _("failed to create veth device pair: %d"), rc); + goto error_exit; + } + if (NULL == net->parentVeth) { + net->parentVeth = strdup(parentVeth); + } + if (NULL == net->containerVeth) { + net->containerVeth = strdup(containerVeth); + } + + if ((NULL == net->parentVeth) || (NULL == net->containerVeth)) { + lxcError(conn, NULL, VIR_ERR_INTERNAL_ERROR, + _("failed to allocate veth names")); + goto error_exit; + } + + if (!(networkDriver->brctl) && (rc = brInit(&(networkDriver->brctl)))) { + lxcError(conn, NULL, VIR_ERR_INTERNAL_ERROR, + _("cannot initialize bridge support: %s"), + strerror(rc)); + goto error_exit; + } + + if (0 != (rc = brAddInterface(networkDriver->brctl, bridge, parentVeth))) { + lxcError(conn, NULL, VIR_ERR_INTERNAL_ERROR, + _("failed to add %s device to %s: %s"), + parentVeth, + bridge, + strerror(rc)); + goto error_exit; + } + + if (0 != (rc = vethInterfaceUpOrDown(parentVeth, 1))) { + lxcError(conn, NULL, VIR_ERR_INTERNAL_ERROR, + _("failed to enable parent ns veth device: %d"), rc); + goto error_exit; + } + + } + + rc = 0; + +error_exit: + return rc; +} + +/** + * lxcMoveInterfacesToNetNs: + * @conn: pointer to connection + * @vm: pointer to virtual machine structure + * + * Starts a container process by calling clone() with the namespace flags + * + * Returns 0 on success or -1 in case of error + */ +static int lxcMoveInterfacesToNetNs(virConnectPtr conn, + const lxc_vm_t *vm) +{ + int rc = -1; + lxc_net_def_t *net; + + for (net = vm->def->nets; net; net = net->next) { + if (0 != moveInterfaceToNetNs(net->containerVeth, vm->def->id)) { + lxcError(conn, NULL, VIR_ERR_INTERNAL_ERROR, + _("failed to move interface %s to ns %d"), + net->containerVeth, vm->def->id); + goto error_exit; + } + } + + rc = 0; + +error_exit: + return rc; +} + +/** + * lxcCleanupInterfaces: + * @conn: pointer to connection + * @vm: pointer to virtual machine structure + * + * Cleans up the container interfaces by deleting the veth device pairs. + * + * Returns 0 on success or -1 in case of error + */ +static int lxcCleanupInterfaces(const lxc_vm_t *vm) +{ + int rc = -1; + lxc_net_def_t *net; + + for (net = vm->def->nets; net; net = net->next) { + if (0 != (rc = vethDelete(net->parentVeth))) { + lxcError(NULL, NULL, VIR_ERR_INTERNAL_ERROR, + _("failed to delete veth: %s"), net->parentVeth); + /* will continue to try to cleanup any other interfaces */ + } + } + + return 0; +} + +/** + * lxcSendContainerContinue: + * @vm: pointer to virtual machine structure + * + * Sends the continue message via the socket pair stored in the vm + * structure. + * + * Returns 0 on success or -1 in case of error + */ +static int lxcSendContainerContinue(const lxc_vm_t *vm) +{ + int rc = -1; + lxc_message_t msg = LXC_CONTINUE_MSG; + int writeCount = 0; + + if (NULL == vm) { + goto error_out; + } + + writeCount = safewrite(vm->sockpair[LXC_PARENT_SOCKET], &msg, + sizeof(msg)); + if (writeCount != sizeof(msg)) { + lxcError(NULL, NULL, VIR_ERR_INTERNAL_ERROR, + _("unable to send container continue message: %s"), + strerror(errno)); + goto error_out; + } + + rc = 0; + +error_out: + return rc; +} + +/** * lxcStartContainer: * @conn: pointer to connection * @driver: pointer to driver structure @@ -422,6 +621,9 @@ stacktop = stack + stacksize;
flags = CLONE_NEWPID|CLONE_NEWNS|CLONE_NEWUTS|CLONE_NEWUSER|CLONE_NEWIPC|SIGCHLD; + + if (vm->def->nets != NULL) + flags |= CLONE_NEWNET;
vm->def->id = clone(lxcChild, stacktop, flags, (void *)vm);
@@ -819,15 +1021,42 @@ close(vm->parentTty); close(vm->containerTtyFd);
- rc = lxcStartContainer(conn, driver, vm); - - if (rc == 0) { - vm->state = VIR_DOMAIN_RUNNING; - driver->ninactivevms--; - driver->nactivevms++; + if (0 != (rc = lxcSetupInterfaces(conn, vm))) { + goto cleanup; }
+ /* create a socket pair to send continue message to the container once */ + /* we've completed the post clone configuration */ + if (0 != socketpair(PF_UNIX, SOCK_STREAM, 0, vm->sockpair)) { + lxcError(conn, NULL, VIR_ERR_INTERNAL_ERROR, + _("sockpair failed: %s"), strerror(errno)); + goto cleanup; + } + + /* check this rc */ + + rc = lxcStartContainer(conn, driver, vm); + if (rc != 0) + goto cleanup; + + rc = lxcMoveInterfacesToNetNs(conn, vm); + if (rc != 0) + goto cleanup; + + rc = lxcSendContainerContinue(vm); + if (rc != 0) + goto cleanup; + + vm->state = VIR_DOMAIN_RUNNING; + driver->ninactivevms--; + driver->nactivevms++; + cleanup: + close(vm->sockpair[LXC_PARENT_SOCKET]); + vm->sockpair[LXC_PARENT_SOCKET] = -1; + close(vm->sockpair[LXC_CONTAINER_SOCKET]); + vm->sockpair[LXC_CONTAINER_SOCKET] = -1; + return rc; }
@@ -957,6 +1186,9 @@ int rc = -1; int waitRc; int childStatus = -1; + + /* if this fails, we'll continue. it will report any errors */ + lxcCleanupInterfaces(vm);
Is it called when the last process of the container dies ?
while (((waitRc = waitpid(vm->def->id, &childStatus, 0)) == -1) && errno == EINTR);

DL> Is it called when the last process of the container dies ? Yes, it (lxcVMCleanup()) is called from lxcSigHandler() which gets run if the container dies unexpectedly. It's also called from lxcDomainDestroy() which covers the 'shutdown' and 'destroy' cases as well. -- Dan Smith IBM Linux Technology Center Open Hypervisor Team email: danms@us.ibm.com

Dan Smith wrote:
DL> Is it called when the last process of the container dies ?
Yes, it (lxcVMCleanup()) is called from lxcSigHandler() which gets run if the container dies unexpectedly. It's also called from lxcDomainDestroy() which covers the 'shutdown' and 'destroy' cases as well.
So the container is supposed to stay alive until we explicitly shutdown it, right ?

DL> So the container is supposed to stay alive until we explicitly DL> shutdown it, right ? It is supposed to, yes. If it dies early, we catch it with the signal handler and do the cleanup. If it's terminated by the user, we do the cleanup as part of the termination process. -- Dan Smith IBM Linux Technology Center Open Hypervisor Team email: danms@us.ibm.com

On Tue, Jun 24, 2008 at 08:51:36AM -0700, Dan Smith wrote:
Changes: - Remove extraneous "i" variables from various functions - Only bring up lo if we have other interfaces (and thus NETNS) - Fail setup of interfaces if NETNS support is not present - Only add CLONE_NEWNET to start flags if domain has interfaces defined - Make lxc_vm_t parameters const in helper functions (where appropriate) - Cleanup DomainStart procedure to bail on start/move/continue errors
Looks fine to me, +1 thanks ! Daniel -- Red Hat Virtualization group http://redhat.com/virtualization/ Daniel Veillard | virtualization library http://libvirt.org/ veillard@redhat.com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/ http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/
participants (3)
-
Dan Smith
-
Daniel Lezcano
-
Daniel Veillard