[PATCH 0/4] lxc: Let the driver detect CGroups earlier

See 4/4 for explanation. Michal Prívozník (4): lxc_controller: Initialize ctrl->handshakeFd properly lxc_controller: Move closing of handshakeFd out of virLXCControllerDaemonHandshake() lxc: Pass another pipe to lxc_controller lxc: Let the driver detect CGroups earlier src/lxc/lxc_controller.c | 65 +++++++++++++++++++++++++++++++--------- src/lxc/lxc_process.c | 41 ++++++++++++++++++++----- 2 files changed, 84 insertions(+), 22 deletions(-) -- 2.26.3

The lxc_controller has a structure that's keeping its internal state, including so called handshakeFd which is the write end of a pipe that's used to signal to the LXC driver that the container is set up and ready to run. However, the struct member is not initialized to -1, so if anything fails before it is set then the virLXCControllerFree() function tries to close FD 0 (stdin). Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/lxc/lxc_controller.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/lxc/lxc_controller.c b/src/lxc/lxc_controller.c index ab5fc8b88f..50b2987d9a 100644 --- a/src/lxc/lxc_controller.c +++ b/src/lxc/lxc_controller.c @@ -193,8 +193,8 @@ static virLXCController *virLXCControllerNew(const char *name) ctrl->timerShutdown = -1; ctrl->firstClient = true; - ctrl->name = g_strdup(name); + ctrl->handshakeFd = -1; if (!(driver = virLXCControllerDriverNew())) goto error; -- 2.26.3

On Tue, Apr 20, 2021 at 04:15:15PM +0200, Michal Privoznik wrote:
The lxc_controller has a structure that's keeping its internal state, including so called handshakeFd which is the write end of a pipe that's used to signal to the LXC driver that the container is set up and ready to run. However, the struct member is not initialized to -1, so if anything fails before it is set then the virLXCControllerFree() function tries to close FD 0 (stdin).
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/lxc/lxc_controller.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
Reviewed-by: Martin Kletzander <mkletzan@redhat.com>
diff --git a/src/lxc/lxc_controller.c b/src/lxc/lxc_controller.c index ab5fc8b88f..50b2987d9a 100644 --- a/src/lxc/lxc_controller.c +++ b/src/lxc/lxc_controller.c @@ -193,8 +193,8 @@ static virLXCController *virLXCControllerNew(const char *name)
ctrl->timerShutdown = -1; ctrl->firstClient = true; - ctrl->name = g_strdup(name); + ctrl->handshakeFd = -1;
if (!(driver = virLXCControllerDriverNew())) goto error; -- 2.26.3

Future commits will want to reuse the handshakeFd and thus it mustn't be closed in virLXCControllerDaemonHandshake(). Do the closing explicitly afterwards. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/lxc/lxc_controller.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/src/lxc/lxc_controller.c b/src/lxc/lxc_controller.c index 50b2987d9a..797547b05c 100644 --- a/src/lxc/lxc_controller.c +++ b/src/lxc/lxc_controller.c @@ -353,7 +353,6 @@ static int virLXCControllerDaemonHandshake(virLXCController *ctrl) _("error sending continue signal to daemon")); return -1; } - VIR_FORCE_CLOSE(ctrl->handshakeFd); return 0; } @@ -2403,6 +2402,9 @@ virLXCControllerRun(virLXCController *ctrl) if (virLXCControllerDaemonHandshake(ctrl) < 0) goto cleanup; + /* and preemptively close handshakeFd */ + VIR_FORCE_CLOSE(ctrl->handshakeFd); + /* We must not hold open a dbus connection for life * of LXC instance, since dbus-daemon is limited to * only a few 100 connections by default -- 2.26.3

On Tue, Apr 20, 2021 at 04:15:16PM +0200, Michal Privoznik wrote:
Future commits will want to reuse the handshakeFd and thus it mustn't be closed in virLXCControllerDaemonHandshake(). Do the closing explicitly afterwards.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/lxc/lxc_controller.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
Reviewed-by: Martin Kletzander <mkletzan@redhat.com>
diff --git a/src/lxc/lxc_controller.c b/src/lxc/lxc_controller.c index 50b2987d9a..797547b05c 100644 --- a/src/lxc/lxc_controller.c +++ b/src/lxc/lxc_controller.c @@ -353,7 +353,6 @@ static int virLXCControllerDaemonHandshake(virLXCController *ctrl) _("error sending continue signal to daemon")); return -1; } - VIR_FORCE_CLOSE(ctrl->handshakeFd); return 0; }
@@ -2403,6 +2402,9 @@ virLXCControllerRun(virLXCController *ctrl) if (virLXCControllerDaemonHandshake(ctrl) < 0) goto cleanup;
+ /* and preemptively close handshakeFd */ + VIR_FORCE_CLOSE(ctrl->handshakeFd); + /* We must not hold open a dbus connection for life * of LXC instance, since dbus-daemon is limited to * only a few 100 connections by default -- 2.26.3

Currently, there is only a single pipe passed to lxc_controller and it is used by lxc_controller to signal to the LXC driver that the container is set up and ready to run. However, in the next commit we will need to signal that the LXC driver has done its part of startup process and thus the controller can proceed. Unfortunately, virCommand handshake can't be used for this, because it's already used to read controller's PID. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/lxc/lxc_controller.c | 46 ++++++++++++++++++++++++++++------------ src/lxc/lxc_process.c | 21 +++++++++++------- 2 files changed, 46 insertions(+), 21 deletions(-) diff --git a/src/lxc/lxc_controller.c b/src/lxc/lxc_controller.c index 797547b05c..1c0a370d4b 100644 --- a/src/lxc/lxc_controller.c +++ b/src/lxc/lxc_controller.c @@ -100,7 +100,7 @@ struct _virLXCController { virDomainObj *vm; virDomainDef *def; - int handshakeFd; + int handshakeFds[2]; /* { read FD, write FD } */ pid_t initpid; @@ -194,7 +194,8 @@ static virLXCController *virLXCControllerNew(const char *name) ctrl->timerShutdown = -1; ctrl->firstClient = true; ctrl->name = g_strdup(name); - ctrl->handshakeFd = -1; + ctrl->handshakeFds[0] = -1; + ctrl->handshakeFds[1] = -1; if (!(driver = virLXCControllerDriverNew())) goto error; @@ -311,7 +312,8 @@ static void virLXCControllerFree(virLXCController *ctrl) virCgroupFree(ctrl->cgroup); /* This must always be the last thing to be closed */ - VIR_FORCE_CLOSE(ctrl->handshakeFd); + for (i = 0; i < G_N_ELEMENTS(ctrl->handshakeFds); i++) + VIR_FORCE_CLOSE(ctrl->handshakeFds[i]); g_free(ctrl); } @@ -348,7 +350,7 @@ static int virLXCControllerConsoleSetNonblocking(virLXCControllerConsole *consol static int virLXCControllerDaemonHandshake(virLXCController *ctrl) { - if (lxcContainerSendContinue(ctrl->handshakeFd) < 0) { + if (lxcContainerSendContinue(ctrl->handshakeFds[1]) < 0) { virReportSystemError(errno, "%s", _("error sending continue signal to daemon")); return -1; @@ -2402,8 +2404,9 @@ virLXCControllerRun(virLXCController *ctrl) if (virLXCControllerDaemonHandshake(ctrl) < 0) goto cleanup; - /* and preemptively close handshakeFd */ - VIR_FORCE_CLOSE(ctrl->handshakeFd); + /* and preemptively close handshakeFds */ + for (i = 0; i < G_N_ELEMENTS(ctrl->handshakeFds); i++) + VIR_FORCE_CLOSE(ctrl->handshakeFds[i]); /* We must not hold open a dbus connection for life * of LXC instance, since dbus-daemon is limited to @@ -2431,6 +2434,26 @@ virLXCControllerRun(virLXCController *ctrl) } +static int +parseFDPair(const char *arg, + int (*fd)[2]) +{ + g_auto(GStrv) fds = NULL; + + fds = g_strsplit(arg, ":", 0); + + if (fds[0] == NULL || fds[1] == NULL || fds[2] != NULL || + virStrToLong_i(fds[0], NULL, 10, &(*fd)[0]) < 0 || + virStrToLong_i(fds[1], NULL, 10, &(*fd)[1]) < 0) { + fprintf(stderr, "malformed --handshakefd argument '%s'", + optarg); + return -1; + } + + return 0; +} + + int main(int argc, char *argv[]) { pid_t pid; @@ -2439,7 +2462,7 @@ int main(int argc, char *argv[]) size_t nveths = 0; char **veths = NULL; int ns_fd[VIR_LXC_DOMAIN_NAMESPACE_LAST]; - int handshakeFd = -1; + int handshakeFds[2] = { -1, -1 }; bool bg = false; const struct option options[] = { { "background", 0, NULL, 'b' }, @@ -2515,11 +2538,8 @@ int main(int argc, char *argv[]) break; case 's': - if (virStrToLong_i(optarg, NULL, 10, &handshakeFd) < 0) { - fprintf(stderr, "malformed --handshakefd argument '%s'", - optarg); + if (parseFDPair(optarg, &handshakeFds) < 0) goto cleanup; - } break; case 'N': @@ -2578,7 +2598,7 @@ int main(int argc, char *argv[]) goto cleanup; } - if (handshakeFd < 0) { + if (handshakeFds[0] < 0 || handshakeFds[1] < 0) { fprintf(stderr, "%s: missing --handshakefd argument for container PTY\n", argv[0]); goto cleanup; @@ -2596,7 +2616,7 @@ int main(int argc, char *argv[]) if (!(ctrl = virLXCControllerNew(name))) goto cleanup; - ctrl->handshakeFd = handshakeFd; + memcpy(&ctrl->handshakeFds, &handshakeFds, sizeof(handshakeFds)); if (!(ctrl->securityManager = virSecurityManagerNew(securityDriver, LXC_DRIVER_NAME, 0))) diff --git a/src/lxc/lxc_process.c b/src/lxc/lxc_process.c index ac635efe7a..493e19f03d 100644 --- a/src/lxc/lxc_process.c +++ b/src/lxc/lxc_process.c @@ -939,7 +939,8 @@ virLXCProcessBuildControllerCmd(virLXCDriver *driver, int *nsInheritFDs, int *files, size_t nfiles, - int handshakefd, + int handshakefdW, + int handshakefdR, int * const logfd, const char *pidfile) { @@ -1003,12 +1004,13 @@ virLXCProcessBuildControllerCmd(virLXCDriver *driver, virSecurityManagerGetModel(driver->securityManager)); virCommandAddArg(cmd, "--handshakefd"); - virCommandAddArgFormat(cmd, "%d", handshakefd); + virCommandAddArgFormat(cmd, "%d:%d", handshakefdR, handshakefdW); for (i = 0; veths && veths[i]; i++) virCommandAddArgList(cmd, "--veth", veths[i], NULL); - virCommandPassFD(cmd, handshakefd, 0); + virCommandPassFD(cmd, handshakefdW, 0); + virCommandPassFD(cmd, handshakefdR, 0); virCommandDaemonize(cmd); virCommandSetPidFile(cmd, pidfile); virCommandSetOutputFD(cmd, logfd); @@ -1198,7 +1200,7 @@ int virLXCProcessStart(virConnectPtr conn, g_autofree char *logfile = NULL; int logfd = -1; g_auto(GStrv) veths = NULL; - int handshakefds[2] = { -1, -1 }; + int handshakefds[4] = { -1, -1, -1, -1 }; /* two pipes */ off_t pos = -1; char ebuf[1024]; g_autofree char *timestamp = NULL; @@ -1369,7 +1371,8 @@ int virLXCProcessStart(virConnectPtr conn, goto cleanup; } - if (virPipe(handshakefds) < 0) + if (virPipe(&handshakefds[0]) < 0 || + virPipe(&handshakefds[2]) < 0) goto cleanup; if (!(cmd = virLXCProcessBuildControllerCmd(driver, @@ -1379,6 +1382,7 @@ int virLXCProcessStart(virConnectPtr conn, nsInheritFDs, files, nfiles, handshakefds[1], + handshakefds[2], &logfd, pidfile))) goto cleanup; @@ -1448,7 +1452,8 @@ int virLXCProcessStart(virConnectPtr conn, virDomainObjSetState(vm, VIR_DOMAIN_RUNNING, reason); priv->doneStopEvent = false; - if (VIR_CLOSE(handshakefds[1]) < 0) { + if (VIR_CLOSE(handshakefds[1]) < 0 || + VIR_CLOSE(handshakefds[2]) < 0) { virReportSystemError(errno, "%s", _("could not close handshake fd")); goto cleanup; } @@ -1553,8 +1558,8 @@ int virLXCProcessStart(virConnectPtr conn, virCommandFree(cmd); for (i = 0; i < nttyFDs; i++) VIR_FORCE_CLOSE(ttyFDs[i]); - VIR_FORCE_CLOSE(handshakefds[0]); - VIR_FORCE_CLOSE(handshakefds[1]); + for (i = 0; i < G_N_ELEMENTS(handshakefds); i++) + VIR_FORCE_CLOSE(handshakefds[i]); virObjectUnref(cfg); virObjectUnref(caps); -- 2.26.3

On Tue, Apr 20, 2021 at 04:15:17PM +0200, Michal Privoznik wrote:
Currently, there is only a single pipe passed to lxc_controller and it is used by lxc_controller to signal to the LXC driver that the container is set up and ready to run. However, in the next commit we will need to signal that the LXC driver has done its part of startup process and thus the controller can proceed. Unfortunately, virCommand handshake can't be used for this, because it's already used to read controller's PID.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/lxc/lxc_controller.c | 46 ++++++++++++++++++++++++++++------------ src/lxc/lxc_process.c | 21 +++++++++++------- 2 files changed, 46 insertions(+), 21 deletions(-)
[...]
diff --git a/src/lxc/lxc_process.c b/src/lxc/lxc_process.c index ac635efe7a..493e19f03d 100644 --- a/src/lxc/lxc_process.c +++ b/src/lxc/lxc_process.c @@ -939,7 +939,8 @@ virLXCProcessBuildControllerCmd(virLXCDriver *driver, int *nsInheritFDs, int *files, size_t nfiles, - int handshakefd, + int handshakefdW, + int handshakefdR, int * const logfd, const char *pidfile) { @@ -1003,12 +1004,13 @@ virLXCProcessBuildControllerCmd(virLXCDriver *driver, virSecurityManagerGetModel(driver->securityManager));
virCommandAddArg(cmd, "--handshakefd");
Maybe using plural here?
- virCommandAddArgFormat(cmd, "%d", handshakefd); + virCommandAddArgFormat(cmd, "%d:%d", handshakefdR, handshakefdW);
This was a little bit confusing as I would put it the other way, but ultimately does not matter. It would be easier to spot if you also changed the help output and maybe s/handshakefd/handshakefds/ in the struct option passed to getopt_long(). Anyway, with at least the help output changed: Reviewed-by: Martin Kletzander <mkletzan@redhat.com>

This is the bug I'm facing. I deliberately configured a container so that the source of a <filesystem/> to passthrough doesn't exist. The start fails with: lxcContainerPivotRoot:669 : Failed to create /non-existent/path/.oldroot: Permission denied which is expected. But what is NOT expected is that CGroup hierarchy is left behind. This is because the controller sets up the CGroup hierarchy, user namespace, moves interfaces, etc. and finally checks whether container setup (done in a separate process) succeeded. Only after all this the error is propagated to the LXC driver. The driver aborts the startup and tries to perform the cleanup, but this is missing CGroups because those weren't detected yet. Ideally, whenever a function fails, it tries to unroll back so that is has no artifacts left behind (look at all those frees/FD closes/etc. at end of functions). But with CGroups it is different - the controller process can't clean up after itself, because it is still running inside that CGroup. Therefore, what we have to do is to let the driver detect CGroups as soon as they are created, and proceed with controller execution only after that. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/lxc/lxc_controller.c | 19 +++++++++++++++++-- src/lxc/lxc_process.c | 20 ++++++++++++++++++++ 2 files changed, 37 insertions(+), 2 deletions(-) diff --git a/src/lxc/lxc_controller.c b/src/lxc/lxc_controller.c index 1c0a370d4b..33f397c11d 100644 --- a/src/lxc/lxc_controller.c +++ b/src/lxc/lxc_controller.c @@ -348,7 +348,7 @@ static int virLXCControllerConsoleSetNonblocking(virLXCControllerConsole *consol } -static int virLXCControllerDaemonHandshake(virLXCController *ctrl) +static int virLXCControllerDaemonHandshakeCont(virLXCController *ctrl) { if (lxcContainerSendContinue(ctrl->handshakeFds[1]) < 0) { virReportSystemError(errno, "%s", @@ -358,6 +358,15 @@ static int virLXCControllerDaemonHandshake(virLXCController *ctrl) return 0; } +static int virLXCControllerDaemonHandshakeWait(virLXCController *ctrl) +{ + if (lxcContainerWaitForContinue(ctrl->handshakeFds[0]) < 0) { + virReportSystemError(errno, "%s", + _("error waiting for continue signal from daemon")); + return -1; + } + return 0; +} static int virLXCControllerValidateNICs(virLXCController *ctrl) { @@ -2372,6 +2381,11 @@ virLXCControllerRun(virLXCController *ctrl) if (virLXCControllerSetupCgroupLimits(ctrl) < 0) goto cleanup; + /* Allow daemon to detect CGroups. */ + if (virLXCControllerDaemonHandshakeCont(ctrl) < 0 || + virLXCControllerDaemonHandshakeWait(ctrl) < 0) + goto cleanup; + if (virLXCControllerSetupUserns(ctrl) < 0) goto cleanup; @@ -2401,7 +2415,8 @@ virLXCControllerRun(virLXCController *ctrl) if (virLXCControllerConsoleSetNonblocking(&(ctrl->consoles[i])) < 0) goto cleanup; - if (virLXCControllerDaemonHandshake(ctrl) < 0) + /* Allow daemon to connect to the monitor. */ + if (virLXCControllerDaemonHandshakeCont(ctrl) < 0) goto cleanup; /* and preemptively close handshakeFds */ diff --git a/src/lxc/lxc_process.c b/src/lxc/lxc_process.c index 493e19f03d..55e74e913b 100644 --- a/src/lxc/lxc_process.c +++ b/src/lxc/lxc_process.c @@ -1473,6 +1473,7 @@ int virLXCProcessStart(virConnectPtr conn, if (g_atomic_int_add(&driver->nactive, 1) == 0 && driver->inhibitCallback) driver->inhibitCallback(true, driver->inhibitOpaque); + /* The first synchronization point is when the controller creates CGroups. */ if (lxcContainerWaitForContinue(handshakefds[0]) < 0) { char out[1024]; @@ -1504,6 +1505,25 @@ int virLXCProcessStart(virConnectPtr conn, goto cleanup; } + if (lxcContainerSendContinue(handshakefds[3]) < 0) { + virReportSystemError(errno, "%s", + _("Failed to send continue signal to controller")); + goto cleanup; + } + + /* The second synchronization point is when the controller finished + * creating the container. */ + if (lxcContainerWaitForContinue(handshakefds[0]) < 0) { + char out[1024]; + + if (!(virLXCProcessReadLogOutput(vm, logfile, pos, out, 1024) < 0)) { + virReportError(VIR_ERR_INTERNAL_ERROR, + _("guest failed to start: %s"), out); + } + + goto cleanup; + } + /* And we can get the first monitor connection now too */ if (!(priv->monitor = virLXCProcessConnectMonitor(driver, vm))) { /* Intentionally overwrite the real monitor error message, -- 2.26.3

On Tue, Apr 20, 2021 at 04:15:18PM +0200, Michal Privoznik wrote:
This is the bug I'm facing. I deliberately configured a container so that the source of a <filesystem/> to passthrough doesn't exist. The start fails with:
lxcContainerPivotRoot:669 : Failed to create /non-existent/path/.oldroot: Permission denied
which is expected. But what is NOT expected is that CGroup hierarchy is left behind. This is because the controller sets up the CGroup hierarchy, user namespace, moves interfaces, etc. and finally checks whether container setup (done in a separate process) succeeded. Only after all this the error is propagated to the LXC driver. The driver aborts the startup and tries to perform the cleanup, but this is missing CGroups because those weren't detected yet.
Ideally, whenever a function fails, it tries to unroll back so that is has no artifacts left behind (look at all those frees/FD closes/etc. at end of functions). But with CGroups it is different - the controller process can't clean up after itself, because it is still running inside that CGroup.
Therefore, what we have to do is to let the driver detect CGroups as soon as they are created, and proceed with controller execution only after that.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/lxc/lxc_controller.c | 19 +++++++++++++++++-- src/lxc/lxc_process.c | 20 ++++++++++++++++++++ 2 files changed, 37 insertions(+), 2 deletions(-)
[...]
diff --git a/src/lxc/lxc_process.c b/src/lxc/lxc_process.c index 493e19f03d..55e74e913b 100644 --- a/src/lxc/lxc_process.c +++ b/src/lxc/lxc_process.c @@ -1473,6 +1473,7 @@ int virLXCProcessStart(virConnectPtr conn, if (g_atomic_int_add(&driver->nactive, 1) == 0 && driver->inhibitCallback) driver->inhibitCallback(true, driver->inhibitOpaque);
+ /* The first synchronization point is when the controller creates CGroups. */ if (lxcContainerWaitForContinue(handshakefds[0]) < 0) { char out[1024];
@@ -1504,6 +1505,25 @@ int virLXCProcessStart(virConnectPtr conn, goto cleanup; }
+ if (lxcContainerSendContinue(handshakefds[3]) < 0) { + virReportSystemError(errno, "%s", + _("Failed to send continue signal to controller")); + goto cleanup; + } + + /* The second synchronization point is when the controller finished + * creating the container. */ + if (lxcContainerWaitForContinue(handshakefds[0]) < 0) { + char out[1024]; + + if (!(virLXCProcessReadLogOutput(vm, logfile, pos, out, 1024) < 0)) { + virReportError(VIR_ERR_INTERNAL_ERROR, + _("guest failed to start: %s"), out);
It would be nice to differentiate the error message reported here to the one reported originally handful of lines above. Just for the sake of future debugging. Other than that the patches could be nicer to read if the variables were split or named differently, but after reading through a lot of the code around these parts I see that such a request would be unreasonable for this patch. Maybe a rewrite of the signalling could be an item for a TODO list, although I am not sure what would be the best way to handle this. Can we use for example eventfd()s? Would that help us at all? Anyway, with one of the error messages tweaked a bit: Reviewed-by: Martin Kletzander <mkletzan@redhat.com>
participants (2)
-
Martin Kletzander
-
Michal Privoznik