[Libvir] [PATCH 1/2] lxc: start container

This is a repost of the start container support. Changes from the last version: * Report an error when allocation for init string fails in lxcExecContainerInit() * Change to find by name in lxcStartDomain() * Move tty forwarding process outside of the container. This allows consolidating the forwarding into a single process at a later time. This also means the the container init process as specified by the user now runs as the container root process with pid = 1. The tty setup will require some (hopefully minor) modifications when pts namespaces are enabled. * Add header comments to a number of the functions. This is an updated rough outline of the functions involved in starting a container and the namespace and process under which they run: lxcVmStart() - runs under libvirtd process lxcSetupTtyTunnel() - opens and configures parent tty lxcSetupContainerTty() - opens container tty fork child process calls lxcTtyForward() see below parent continues lxcStartContainer - see below return lxcStartContainer() - runs in parent namespace, libvirtd process Allocate stack for container clone() - child process will start in lxcChild() see below return lxcChild() - runs within container, child process from clone() mount user filesystems mount container /proc lxcExecWithTty() - see below, will not return lxcExecWithTty() - runs within container, root process lxcSetContainerStdio - sets container tty as primary console lxcExecContainerInit - see below, should not return exit() lxcExecContainerInit() - runs within container, root process exec containers init if exec fails, exit() Thanks! -- Best Regards, Dave Leskovec IBM Linux Technology Center Open Virtualization

Dave Leskovec <dlesko@linux.vnet.ibm.com> wrote:
This is a repost of the start container support. Changes from the last version: ...
Hi Dave, Mostly-impeccable code, as usual. Here are a few suggestions and questions:
Index: b/src/lxc_driver.c =================================================================== ... +static int lxcStartContainer(virConnectPtr conn, + lxc_driver_t* driver, + lxc_vm_t *vm) +{ + int rc = -1; + int flags; + int stacksize = getpagesize() * 4;
I haven't looked at much code using clone, so maybe using 4*getpagesize() is just a standard idiom, but I'll comment anyhow. Four pages is small, in any case, but I'm curious... Do you really always need 4? Even when page size is say, 64K? I.e., is "4" a heuristic, or can you describe how it was derived? Or, if say 32KB is an upper bound, maybe something like this: int stacksize = MIN (getpagesize() * 4, 32 * 1024);
+ void *stack, *stacktop; + + /* allocate a stack for the container */ + stack = malloc(stacksize); + if (!stack) { + lxcError(conn, NULL, VIR_ERR_NO_MEMORY, + _("unable to allocate container stack")); + goto error_exit; + } + stacktop = (char*)stack + stacksize; + + flags = CLONE_NEWPID|CLONE_NEWNS|CLONE_NEWUTS|CLONE_NEWUSER|CLONE_NEWIPC|SIGCHLD; + + vm->def->id = clone(lxcChild, stacktop, flags, (void *)vm);
...
+static int lxcTtyForward(int fd1, int fd2, + int *loopFlag ATTRIBUTE_UNUSED, + int pollmsecs ATTRIBUTE_UNUSED) +{ ... + for (i = 0; i < numFds; ++i) { + if (!fds[i].revents) { + continue; + } + + if (fds[i].revents & POLLIN) { + saferead(fds[i].fd, buf, 1); + if (1 < numFds) { + safewrite(fds[i ^ 1].fd, buf, 1); + }
Don't you want to handle saferead/safewrite failure?
+ } + + } + + } + + rc = 0; + +cleanup: + return rc; +} + +static int lxcVmStart(virConnectPtr conn, + lxc_driver_t * driver, + lxc_vm_t * vm) +{ + int rc = -1; + lxc_vm_def_t *vmDef = vm->def; + + /* open parent tty */ + if (lxcSetupTtyTunnel(conn, vmDef, &vm->parentTty) < 0) { + goto cleanup; + } + + /* open container tty */ + if (lxcSetupContainerTty(conn, &(vm->containerTtyFd), &(vm->containerTty)) < 0) { + goto cleanup; + } + + /* fork process to handle the tty io forwarding */ + if ((vm->pid = fork()) == 0) { + /* child process calls forward routine */ + lxcTtyForward(vm->parentTty, vm->containerTtyFd, NULL, 0); + }
Handling/reporting fork failure would be good.
+ rc = lxcStartContainer(conn, driver, vm); + + if (rc == 0) { + vm->state = VIR_DOMAIN_RUNNING; + driver->ninactivevms--; + driver->nactivevms++; + } + +cleanup: + return rc; +} ...
static int lxcStartup(void) { uid_t uid = getuid();
+ debugFlag = 1;
Is this a debugging relic? ...
Index: b/src/lxc_container.c =================================================================== ... +/* Functions */ +static int lxcExecContainerInit(lxc_vm_def_t *vmDef)
Please make the pointer "const": static int lxcExecContainerInit(const lxc_vm_def_t *vmDef)
+{ + int rc = -1; + char* execString; + int execStringLen = strlen(vmDef->init) + 1 + 5;
s/int/size_t/
+ + if(NULL == (execString = calloc(execStringLen, sizeof(char)))) {
s/if/if /
+ DEBUG0("failed to calloc memory for init string");
Memory allocation failure usually deserves an unconditional diagnostic. Any reason not to do that here?
+ goto error_out; + } + + strcpy(execString, "exec "); + strcat(execString, vmDef->init); + + execl("/bin/sh", "sh", "-c", execString, (char*)NULL); + DEBUG("execl failed: %s", strerror(errno));
Same here, and in several cases below? ...
+#if 0
This function is 99% identical to the non-'#if-0'd one above. Remove it?
+static int lxcTtyForward(int fd1, int fd2, int *loopFlag, int pollmsecs) +{ ... +} + +static pid_t initPid; +static int exitChildLoop; +static void lxcExecChildHandler(int sig ATTRIBUTE_UNUSED, + siginfo_t *signalInfo,
This pointer can be const: const siginfo_t *signalInfo,
+ void *context ATTRIBUTE_UNUSED) +{ ... +static int lxcExecWithTty(lxc_vm_t *vm) +{ + int rc = -1; + lxc_vm_def_t *vmDef = vm->def; +#if 0 + int ttymaster = -1; + int ttyslave = -1; + struct sigaction sigAction; + sigset_t sigMask; + int childStatus; + + if (lxcSetupContainerTty(&ttymaster, &ttyslave) < 0) { + goto exit_with_error; + } + + sigAction.sa_sigaction = lxcExecChildHandler; + sigfillset(&sigMask); + sigAction.sa_mask = sigMask; + sigAction.sa_flags = SA_SIGINFO; + if (0 != sigaction(SIGCHLD, &sigAction, NULL)) { + DEBUG("sigaction failed: %s\n", strerror(errno)); + goto exit_with_error; + } + + exitChildLoop = 0; + if ((initPid = fork()) == 0) {
Check for fork failure here, too.
+ if(lxcSetContainerStdio(ttyslave) < 0) { + exitChildLoop = 1; + goto exit_with_error; + } + + lxcExecContainerInit(vmDef); + /* this function will not return. if it fails, it will exit */ + } + + close(ttyslave);
Probably no big deal here, but in general, it's good practice to report close failure on any writable handle.
+ lxcTtyForward(ttymaster, vm->parentTty, + &exitChildLoop, 100); + + DEBUG("child waiting on pid %d", initPid); + waitpid(initPid, &childStatus, 0);
It's worthwhile to check for waitpid failure, or perhaps to retry on EINTR. I know this is if-0'd, too, but why include it, if not for review?
+ rc = WEXITSTATUS(childStatus); + DEBUG("container exited with rc: %d", rc); +#endif + + if(lxcSetContainerStdio(vm->containerTty) < 0) { + goto exit_with_error; + } + + lxcExecContainerInit(vmDef); + +exit_with_error: + exit(rc); +} ...

Jim - Thanks! I was confused for a little while because I know I had deleted that commented out code. Looks like I grabbed the patch from the wrong directory. It's functionally the same but missing some of the cleanup like deleting commented out code (oops). Sorry about the confusion. Most of your comments were still applicable. I've addressed them inline below and attached an updated patch. Thanks again! Jim Meyering wrote:
Dave Leskovec <dlesko@linux.vnet.ibm.com> wrote:
This is a repost of the start container support. Changes from the last version: ...
Hi Dave,
Mostly-impeccable code, as usual. Here are a few suggestions and questions:
Index: b/src/lxc_driver.c =================================================================== ... +static int lxcStartContainer(virConnectPtr conn, + lxc_driver_t* driver, + lxc_vm_t *vm) +{ + int rc = -1; + int flags; + int stacksize = getpagesize() * 4;
I haven't looked at much code using clone, so maybe using 4*getpagesize() is just a standard idiom, but I'll comment anyhow.
Four pages is small, in any case, but I'm curious... Do you really always need 4? Even when page size is say, 64K? I.e., is "4" a heuristic, or can you describe how it was derived? Or, if say 32KB is an upper bound, maybe something like this:
int stacksize = MIN (getpagesize() * 4, 32 * 1024);
That value is an estimate based on running some containers. My tests so far have been fairly simplistic and I'm sure mileage will vary as containers are utilized. A few options for improving this: - allow the user to specify this in the container definition within the os block (or elsewhere). This would facilitate upping the size if an overrun is encountered. - even if we do this, we still should provide a good default value. Could set this default value based on the architecture. - use mprotect to have a SIGSEGV generated if the stack is overrun
...
+static int lxcTtyForward(int fd1, int fd2, + int *loopFlag ATTRIBUTE_UNUSED, + int pollmsecs ATTRIBUTE_UNUSED) +{ ... + for (i = 0; i < numFds; ++i) { + if (!fds[i].revents) { + continue; + } + + if (fds[i].revents & POLLIN) { + saferead(fds[i].fd, buf, 1); + if (1 < numFds) { + safewrite(fds[i ^ 1].fd, buf, 1); + }
Don't you want to handle saferead/safewrite failure?
Yes, added some handling here.
...
+ /* fork process to handle the tty io forwarding */ + if ((vm->pid = fork()) == 0) { + /* child process calls forward routine */ + lxcTtyForward(vm->parentTty, vm->containerTtyFd, NULL, 0); + }
Handling/reporting fork failure would be good.
Yes, added some handling.
...
static int lxcStartup(void) { uid_t uid = getuid();
+ debugFlag = 1;
Is this a debugging relic?
Yep, removed.
...
Index: b/src/lxc_container.c =================================================================== ... +/* Functions */ +static int lxcExecContainerInit(lxc_vm_def_t *vmDef)
Please make the pointer "const":
static int lxcExecContainerInit(const lxc_vm_def_t *vmDef)
Done.
+{ + int rc = -1; + char* execString; + int execStringLen = strlen(vmDef->init) + 1 + 5;
s/int/size_t/
Done.
+ + if(NULL == (execString = calloc(execStringLen, sizeof(char)))) {
s/if/if /
Done.
+ DEBUG0("failed to calloc memory for init string");
Memory allocation failure usually deserves an unconditional diagnostic. Any reason not to do that here?
I was thinking that I didn't want to call __virRaiseError from within the container. If it's just writing to stderr then that shouldn't be a problem. I've changed many of these but there may be a few more running around. I'll get them as I find em.
+ goto error_out; + } + + strcpy(execString, "exec "); + strcat(execString, vmDef->init); + + execl("/bin/sh", "sh", "-c", execString, (char*)NULL); + DEBUG("execl failed: %s", strerror(errno));
Same here, and in several cases below?
Right
...
+#if 0
This function is 99% identical to the non-'#if-0'd one above. Remove it?
Yes, removed
+static int lxcTtyForward(int fd1, int fd2, int *loopFlag, int pollmsecs) +{ ... +} + +static pid_t initPid; +static int exitChildLoop; +static void lxcExecChildHandler(int sig ATTRIBUTE_UNUSED, + siginfo_t *signalInfo,
This pointer can be const: const siginfo_t *signalInfo,
This was all removed
...
+ + exitChildLoop = 0; + if ((initPid = fork()) == 0) {
Check for fork failure here, too.
This was removed.
+ if(lxcSetContainerStdio(ttyslave) < 0) { + exitChildLoop = 1; + goto exit_with_error; + } + + lxcExecContainerInit(vmDef); + /* this function will not return. if it fails, it will exit */ + } + + close(ttyslave);
Probably no big deal here, but in general, it's good practice to report close failure on any writable handle.
Removed - I'll check other uses of close.
+ lxcTtyForward(ttymaster, vm->parentTty, + &exitChildLoop, 100); + + DEBUG("child waiting on pid %d", initPid); + waitpid(initPid, &childStatus, 0);
It's worthwhile to check for waitpid failure, or perhaps to retry on EINTR. I know this is if-0'd, too, but why include it, if not for review?
Removed - I'll check other uses of waitpid. ... -- Best Regards, Dave Leskovec IBM Linux Technology Center Open Virtualization
participants (2)
-
Dave Leskovec
-
Jim Meyering