[libvirt] [RFC v2 0/4] LXC with block device and enabled userns

Hi all, This patch series aims to resolve https://bugzilla.redhat.com/show_bug.cgi?id=1328946 For background information about the issue see v1 of this RFC. https://www.redhat.com/archives/libvir-list/2018-April/msg01270.html The current state of this series enables the start of LXC container with NBD file system and enabled user namespace. However, container shutdown causes "kernel BUG at fs/buffer.c:3058!" https://pastebin.com/raw/y0ycSM0H The reason for this is because qemu-nbd process is terminated/killed without unmounting the container root file system. This issue has been reported in [1] and [2]. [1] https://bugzilla.redhat.com/show_bug.cgi?id=1356110 [2] http://lkml.iu.edu/hypermail/linux/kernel/1509.3/00027.html As a workaround we could unmount the root file system of container before shutdown. For example with: $ CT_PID=$(pidof libvirt_lxc) $ sudo nsenter \ --mount=/proc/$CT_PID/task/$CT_PID/ns/mnt \ /bin/bash -c "umount /var/run/libvirt/lxc/guest.root/" I noticed that we already have the functions lxcContainerUnmountSubtree and virProcessRunInMountNamespace. Any suggestions on how to properly implement this? Thanks, Radostin Stoyanov (4): lxc: Make lxcContainerMountFSBlock non static lxc: Move up virLXCControllerAppendNBDPids lxc: Mount NBD devices before clone lxc: Remove unused lxcContainerPrepareRoot src/lxc/lxc_container.c | 58 +------------- src/lxc/lxc_container.h | 4 + src/lxc/lxc_controller.c | 158 +++++++++++++++++++++++---------------- 3 files changed, 97 insertions(+), 123 deletions(-) -- 2.17.1

--- src/lxc/lxc_container.c | 5 +---- src/lxc/lxc_container.h | 4 ++++ 2 files changed, 5 insertions(+), 4 deletions(-) diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c index 665b93a0ac..3f6be9f44d 100644 --- a/src/lxc/lxc_container.c +++ b/src/lxc/lxc_container.c @@ -113,9 +113,6 @@ struct __lxc_child_argv { int *nsInheritFDs; }; -static int lxcContainerMountFSBlock(virDomainFSDefPtr fs, - const char *srcprefix, - const char *sec_mount_options); /* @@ -1499,7 +1496,7 @@ static int lxcContainerMountFSBlockHelper(virDomainFSDefPtr fs, } -static int lxcContainerMountFSBlock(virDomainFSDefPtr fs, +int lxcContainerMountFSBlock(virDomainFSDefPtr fs, const char *srcprefix, const char *sec_mount_options) { diff --git a/src/lxc/lxc_container.h b/src/lxc/lxc_container.h index 641e2d4607..3a3564b794 100644 --- a/src/lxc/lxc_container.h +++ b/src/lxc/lxc_container.h @@ -68,4 +68,8 @@ int lxcContainerChown(virDomainDefPtr def, const char *path); bool lxcIsBasicMountLocation(const char *path); +int lxcContainerMountFSBlock(virDomainFSDefPtr fs, + const char *srcprefix, + const char *sec_mount_options); + #endif /* LXC_CONTAINER_H */ -- 2.17.1

There is no functional change in this patch. It only moves virLXCControllerAppendNBDPids above virLXCControllerSetupNBDDeviceFS. --- src/lxc/lxc_controller.c | 96 ++++++++++++++++++++-------------------- 1 file changed, 49 insertions(+), 47 deletions(-) diff --git a/src/lxc/lxc_controller.c b/src/lxc/lxc_controller.c index 4e84391bf5..c9f416aaab 100644 --- a/src/lxc/lxc_controller.c +++ b/src/lxc/lxc_controller.c @@ -481,6 +481,55 @@ static int virLXCControllerSetupLoopDeviceDisk(virDomainDiskDefPtr disk) } +static int virLXCControllerAppendNBDPids(virLXCControllerPtr ctrl, + const char *dev) +{ + char *pidpath = NULL; + pid_t *pids = NULL; + size_t npids = 0; + size_t i; + int ret = -1; + size_t loops = 0; + pid_t pid; + + if (!STRPREFIX(dev, "/dev/") || + virAsprintf(&pidpath, "/sys/devices/virtual/block/%s/pid", dev + 5) < 0) + goto cleanup; + + /* Wait for the pid file to appear */ + while (!virFileExists(pidpath)) { + /* wait for 100ms before checking again, but don't do it for ever */ + if (errno == ENOENT && loops < 10) { + usleep(100 * 1000); + loops++; + } else { + virReportSystemError(errno, + _("Cannot check NBD device %s pid"), + dev + 5); + goto cleanup; + } + } + + if (virPidFileReadPath(pidpath, &pid) < 0) + goto cleanup; + + if (virProcessGetPids(pid, &npids, &pids) < 0) + goto cleanup; + + for (i = 0; i < npids; i++) { + if (VIR_APPEND_ELEMENT(ctrl->nbdpids, ctrl->nnbdpids, pids[i]) < 0) + goto cleanup; + } + + ret = 0; + + cleanup: + VIR_FREE(pids); + VIR_FREE(pidpath); + return ret; +} + + static int virLXCControllerSetupNBDDeviceFS(virDomainFSDefPtr fs) { char *dev; @@ -545,53 +594,6 @@ static int virLXCControllerSetupNBDDeviceDisk(virDomainDiskDefPtr disk) return 0; } -static int virLXCControllerAppendNBDPids(virLXCControllerPtr ctrl, - const char *dev) -{ - char *pidpath = NULL; - pid_t *pids = NULL; - size_t npids = 0; - size_t i; - int ret = -1; - size_t loops = 0; - pid_t pid; - - if (!STRPREFIX(dev, "/dev/") || - virAsprintf(&pidpath, "/sys/devices/virtual/block/%s/pid", dev + 5) < 0) - goto cleanup; - - /* Wait for the pid file to appear */ - while (!virFileExists(pidpath)) { - /* wait for 100ms before checking again, but don't do it for ever */ - if (errno == ENOENT && loops < 10) { - usleep(100 * 1000); - loops++; - } else { - virReportSystemError(errno, - _("Cannot check NBD device %s pid"), - dev + 5); - goto cleanup; - } - } - - if (virPidFileReadPath(pidpath, &pid) < 0) - goto cleanup; - - if (virProcessGetPids(pid, &npids, &pids) < 0) - goto cleanup; - - for (i = 0; i < npids; i++) { - if (VIR_APPEND_ELEMENT(ctrl->nbdpids, ctrl->nnbdpids, pids[i]) < 0) - goto cleanup; - } - - ret = 0; - - cleanup: - VIR_FREE(pids); - VIR_FREE(pidpath); - return ret; -} static int virLXCControllerSetupLoopDevices(virLXCControllerPtr ctrl) { -- 2.17.1

When user-namespace is enabled we are not allowed to mount block/NBD devices. Instead, mount /dev/nbdX to /run/libvirt/lxc/<domain>.root and set: fs->src->path = /run/libvirt/lxc/<domain>.root fs->type = VIR_DOMAIN_FS_TYPE_MOUNT --- src/lxc/lxc_controller.c | 62 ++++++++++++++++++++++++++++------------ 1 file changed, 43 insertions(+), 19 deletions(-) diff --git a/src/lxc/lxc_controller.c b/src/lxc/lxc_controller.c index c9f416aaab..78b52b7079 100644 --- a/src/lxc/lxc_controller.c +++ b/src/lxc/lxc_controller.c @@ -530,33 +530,63 @@ static int virLXCControllerAppendNBDPids(virLXCControllerPtr ctrl, } -static int virLXCControllerSetupNBDDeviceFS(virDomainFSDefPtr fs) +static int virLXCControllerSetupNBDDeviceFS(virLXCControllerPtr ctrl, + virDomainFSDefPtr fs) { - char *dev; + char *dev = NULL; + char *dst = NULL; + char *tmp = NULL; + char *sec_mount_options; + int ret = -1; + + virDomainDefPtr def = ctrl->def; + virSecurityManagerPtr securityDriver = ctrl->securityManager; if (fs->format <= VIR_STORAGE_FILE_NONE) { virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", _("An explicit disk format must be specified")); - return -1; + goto cleanup; } + if (virAsprintf(&dst, "%s/%s.root/", + LXC_STATE_DIR, def->name) < 0) + goto cleanup; + + if (!(sec_mount_options = virSecurityManagerGetMountOptions(securityDriver, def))) + goto cleanup; + if (virFileNBDDeviceAssociate(fs->src->path, fs->format, fs->readonly, &dev) < 0) - return -1; + goto cleanup; - VIR_DEBUG("Changing fs %s to use type=block for dev %s", - fs->src->path, dev); - /* - * We now change it into a block device type, so that - * the rest of container setup 'just works' - */ - fs->type = VIR_DOMAIN_FS_TYPE_BLOCK; VIR_FREE(fs->src->path); fs->src->path = dev; - return 0; + tmp = fs->dst; + fs->dst = dst; + + if (lxcContainerMountFSBlock(fs, "", sec_mount_options) < 0) { + fs->dst = tmp; + goto cleanup; + } + + fs->dst = tmp; + fs->type = VIR_DOMAIN_FS_TYPE_MOUNT; + + if (virLXCControllerAppendNBDPids(ctrl, fs->src->path) < 0) + return -1; + + VIR_STEAL_PTR(fs->src->path, dst); + + ret = 0; + + cleanup: + VIR_FREE(dev); + VIR_FREE(dst); + VIR_FREE(sec_mount_options); + return ret; } @@ -637,13 +667,7 @@ static int virLXCControllerSetupLoopDevices(virLXCControllerPtr ctrl) } ctrl->loopDevFds[ctrl->nloopDevs - 1] = fd; } else if (fs->fsdriver == VIR_DOMAIN_FS_DRIVER_TYPE_NBD) { - if (virLXCControllerSetupNBDDeviceFS(fs) < 0) - goto cleanup; - - /* The NBD device will be cleaned up while the cgroup will end. - * For this we need to remember the qemu-nbd pid and add it to - * the cgroup*/ - if (virLXCControllerAppendNBDPids(ctrl, fs->src->path) < 0) + if (virLXCControllerSetupNBDDeviceFS(ctrl, fs) < 0) goto cleanup; } else { virReportError(VIR_ERR_CONFIG_UNSUPPORTED, -- 2.17.1

lxcContainerPrepareRoot was introduced with commit 8dbe858 Ensure root filesystem is mounted if a file/block mount. For a root filesystem with type=file or type=block, the LXC container was forgetting to actually mount it, before doing the pivot root step. However, this method is no-longer needed because in the cases when the container root file system is of type 'file' or 'block', it will be mounted by virLXCControllerSetupNBDDeviceFS. Therefore the function lxcContainerPrepareRoot will always be called with: root->type = VIR_DOMAIN_FS_TYPE_MOUNT Which makes the following condition tautology: if (root->type == VIR_DOMAIN_FS_TYPE_MOUNT) return 0; Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com> --- src/lxc/lxc_container.c | 53 ----------------------------------------- 1 file changed, 53 deletions(-) diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c index 3f6be9f44d..3360c608ff 100644 --- a/src/lxc/lxc_container.c +++ b/src/lxc/lxc_container.c @@ -658,55 +658,6 @@ static int lxcContainerResolveSymlinks(virDomainFSDefPtr fs, bool gentle) return 0; } -static int lxcContainerPrepareRoot(virDomainDefPtr def, - virDomainFSDefPtr root, - const char *sec_mount_options) -{ - char *dst; - char *tmp; - - VIR_DEBUG("Prepare root %d", root->type); - - if (root->type == VIR_DOMAIN_FS_TYPE_MOUNT) - return 0; - - if (root->type == VIR_DOMAIN_FS_TYPE_FILE) { - virReportError(VIR_ERR_INTERNAL_ERROR, "%s", - _("Unexpected root filesystem without loop device")); - return -1; - } - - if (root->type != VIR_DOMAIN_FS_TYPE_BLOCK) { - virReportError(VIR_ERR_CONFIG_UNSUPPORTED, - _("Unsupported root filesystem type %s"), - virDomainFSTypeToString(root->type)); - return -1; - } - - if (lxcContainerResolveSymlinks(root, false) < 0) - return -1; - - if (virAsprintf(&dst, "%s/%s.root", - LXC_STATE_DIR, def->name) < 0) - return -1; - - tmp = root->dst; - root->dst = dst; - - if (lxcContainerMountFSBlock(root, "", sec_mount_options) < 0) { - root->dst = tmp; - VIR_FREE(dst); - return -1; - } - - root->dst = tmp; - root->type = VIR_DOMAIN_FS_TYPE_MOUNT; - VIR_FREE(root->src->path); - root->src->path = dst; - - return 0; -} - static int lxcContainerPivotRoot(virDomainFSDefPtr root) { int ret; @@ -1755,10 +1706,6 @@ static int lxcContainerSetupPivotRoot(virDomainDefPtr vmDef, if (virFileResolveAllLinks(LXC_STATE_DIR, &stateDir) < 0) goto cleanup; - /* Ensure the root filesystem is mounted */ - if (lxcContainerPrepareRoot(vmDef, root, sec_mount_options) < 0) - goto cleanup; - /* Gives us a private root, leaving all parent OS mounts on /.oldroot */ if (lxcContainerPivotRoot(root) < 0) goto cleanup; -- 2.17.1

On Sun, Jun 10, 2018 at 12:14:22PM +0100, Radostin Stoyanov wrote:
Hi all,
This patch series aims to resolve https://bugzilla.redhat.com/show_bug.cgi?id=1328946
For background information about the issue see v1 of this RFC. https://www.redhat.com/archives/libvir-list/2018-April/msg01270.html
The current state of this series enables the start of LXC container with NBD file system and enabled user namespace.
However, container shutdown causes "kernel BUG at fs/buffer.c:3058!" https://pastebin.com/raw/y0ycSM0H
The reason for this is because qemu-nbd process is terminated/killed without unmounting the container root file system.
This issue has been reported in [1] and [2]. [1] https://bugzilla.redhat.com/show_bug.cgi?id=1356110 [2] http://lkml.iu.edu/hypermail/linux/kernel/1509.3/00027.html
This is not really a kernel bug at the end of the day. We have a filesystem backed by NBD block device, and we're killing the NBD block device. So there's nothing the kernel can really do here if there's outstanding I/O pendnig at this time. There is also this BZ reported against libvirt that has more info: https://bugzilla.redhat.com/show_bug.cgi?id=1570902
As a workaround we could unmount the root file system of container before shutdown.
For example with: $ CT_PID=$(pidof libvirt_lxc) $ sudo nsenter \ --mount=/proc/$CT_PID/task/$CT_PID/ns/mnt \ /bin/bash -c "umount /var/run/libvirt/lxc/guest.root/"
I noticed that we already have the functions lxcContainerUnmountSubtree and virProcessRunInMountNamespace.
Any suggestions on how to properly implement this?
We can't unmount the filesystem directly because we don't have any process running inside the container's mount namespace at this time. The libvirt_lxc controller is running in a custom mount namespace that is different from what the container has. The first thing we need todo is take qemu-nbd out of the cgroups. This will ensure that it doesn't get killed at the same time as we're killing off all the container PIDs. It will also fix the OOM deadlocks we see when the memory controller prevents qemu-nbd allocating RAM needed to proces I/O. Then, we can kill all processes in the container as normal. Once they are all gone, we know the kernel will have cleaned up the mount namespace. We can thus safely kill qemu-nbd at this point. Ideally qemu-nbd would automatically exit when the last use of /dev/nbdNNN was release (ie when filesystem was unmounted). This is something you can enable for loopback devices, but I'm not sure it works for NBD. THis would be a useful kernel enhancement if someone feels adventurous. Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On 13/06/18 11:46, Daniel P. Berrangé wrote:
Hi all,
This patch series aims to resolve https://bugzilla.redhat.com/show_bug.cgi?id=1328946
For background information about the issue see v1 of this RFC. https://www.redhat.com/archives/libvir-list/2018-April/msg01270.html
The current state of this series enables the start of LXC container with NBD file system and enabled user namespace.
However, container shutdown causes "kernel BUG at fs/buffer.c:3058!" https://pastebin.com/raw/y0ycSM0H
The reason for this is because qemu-nbd process is terminated/killed without unmounting the container root file system.
This issue has been reported in [1] and [2]. [1] https://bugzilla.redhat.com/show_bug.cgi?id=1356110 [2] http://lkml.iu.edu/hypermail/linux/kernel/1509.3/00027.html This is not really a kernel bug at the end of the day. We have a filesystem backed by NBD block device, and we're killing the NBD block device. So there's nothing the kernel can really do here if there's outstanding I/O pendnig at
On Sun, Jun 10, 2018 at 12:14:22PM +0100, Radostin Stoyanov wrote: this time.
There is also this BZ reported against libvirt that has more info:
https://bugzilla.redhat.com/show_bug.cgi?id=1570902
As a workaround we could unmount the root file system of container before shutdown.
For example with: $ CT_PID=$(pidof libvirt_lxc) $ sudo nsenter \ --mount=/proc/$CT_PID/task/$CT_PID/ns/mnt \ /bin/bash -c "umount /var/run/libvirt/lxc/guest.root/"
I noticed that we already have the functions lxcContainerUnmountSubtree and virProcessRunInMountNamespace.
Any suggestions on how to properly implement this? We can't unmount the filesystem directly because we don't have any process running inside the container's mount namespace at this time. The libvirt_lxc controller is running in a custom mount namespace that is different from what the container has.
The first thing we need todo is take qemu-nbd out of the cgroups. This will ensure that it doesn't get killed at the same time as we're killing off all the container PIDs. It will also fix the OOM deadlocks we see when the memory controller prevents qemu-nbd allocating RAM needed to proces I/O.
Then, we can kill all processes in the container as normal. Once they are all gone, we know the kernel will have cleaned up the mount namespace. We can thus safely kill qemu-nbd at this point. Thank you for the pointers! Ideally qemu-nbd would automatically exit when the last use of /dev/nbdNNN was release (ie when filesystem was unmounted). This is something you can enable for loopback devices, but I'm not sure it works for NBD. THis would be a useful kernel enhancement if someone feels adventurous. It seems like qemu-nbd terminates automatically when the last client disconnects.
https://git.qemu.org/?p=qemu.git;a=blob;f=qemu-nbd.c;h=51b9d38c72732c821cb4e... I will send a patch thattakes qemu-nbd out of the cgroups and disconnects qemu-nbd on container shutdown. Radostin

On Wed, Jun 13, 2018 at 03:18:02PM +0100, Radostin Stoyanov wrote:
Ideally qemu-nbd would automatically exit when the last use of /dev/nbdNNN was release (ie when filesystem was unmounted). This is something you can enable for loopback devices, but I'm not sure it works for NBD. THis would be a useful kernel enhancement if someone feels adventurous. It seems like qemu-nbd terminates automatically when the last client disconnects.
Right, but in the case where the kernel has connected qemu-nbd to a block device, the kernel itself is a client that won't exit. So what I was describing was that the kernel should automatically delete the /dev/ndb0 device when the last mount was unmounted, and thus in turn it can close the last client of qemu-nbd
https://git.qemu.org/?p=qemu.git;a=blob;f=qemu-nbd.c;h=51b9d38c72732c821cb4e...
I will send a patch thattakes qemu-nbd out of the cgroups and disconnects qemu-nbd on container shutdown.
Radostin
Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
participants (2)
-
Daniel P. Berrangé
-
Radostin Stoyanov