
On 13/06/18 11:46, Daniel P. Berrangé wrote:
Hi all,
This patch series aims to resolve https://bugzilla.redhat.com/show_bug.cgi?id=1328946
For background information about the issue see v1 of this RFC. https://www.redhat.com/archives/libvir-list/2018-April/msg01270.html
In its current state, this series makes it possible to start an LXC container with an NBD-backed file system and user namespaces enabled.
However, container shutdown causes "kernel BUG at fs/buffer.c:3058!" https://pastebin.com/raw/y0ycSM0H
This happens because the qemu-nbd process is terminated/killed without first unmounting the container root file system.
On Sun, Jun 10, 2018 at 12:14:22PM +0100, Radostin Stoyanov wrote:
This issue has been reported in [1] and [2].
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1356110
[2] http://lkml.iu.edu/hypermail/linux/kernel/1509.3/00027.html
This is not really a kernel bug at the end of the day. We have a filesystem backed by an NBD block device, and we're killing the NBD block device. So there's nothing the kernel can really do here if there's outstanding I/O pending at this time.
There is also this BZ reported against libvirt that has more info:
https://bugzilla.redhat.com/show_bug.cgi?id=1570902
As a workaround we could unmount the root file system of the container before shutdown.
For example with:
  $ CT_PID=$(pidof libvirt_lxc)
  $ sudo nsenter \
      --mount=/proc/$CT_PID/task/$CT_PID/ns/mnt \
      /bin/bash -c "umount /var/run/libvirt/lxc/guest.root/"
I noticed that we already have the functions lxcContainerUnmountSubtree and virProcessRunInMountNamespace.
Any suggestions on how to properly implement this?
We can't unmount the filesystem directly because we don't have any process running inside the container's mount namespace at this time. The libvirt_lxc controller is running in a custom mount namespace that is different from what the container has.
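To illustrate the namespace point above: whether two processes share a mount namespace can be checked by comparing their /proc/PID/ns/mnt links. This is only a minimal sketch; same_mnt_ns is a hypothetical helper name and not something libvirt provides.

```shell
# Sketch: report whether two processes share a mount namespace.
# same_mnt_ns is a hypothetical helper, not part of libvirt.
# Returns 0 if both PIDs are in the same mount namespace, 1 if they
# differ, and 2 if either namespace link cannot be read.
same_mnt_ns() {
    a=$(readlink "/proc/$1/ns/mnt") || return 2
    b=$(readlink "/proc/$2/ns/mnt") || return 2
    [ "$a" = "$b" ]
}
```

Running it against the libvirt_lxc controller PID and the container's init PID (reading another user's namespace links generally needs root) would be expected to return 1 in the setup described here.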
The first thing we need to do is take qemu-nbd out of the cgroups. This will ensure that it doesn't get killed at the same time as we're killing off all the container PIDs. It will also fix the OOM deadlocks we see when the memory controller prevents qemu-nbd from allocating the RAM needed to process I/O.
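As a rough illustration of this first step, a process can be moved into another cgroup by writing its PID to that cgroup's cgroup.procs file. The sketch below assumes this generic kernel interface only; move_pid_to_cgroup is a hypothetical helper, and the actual libvirt patch would do this in C through its own cgroup code.

```shell
# Sketch: move a process into the cgroup rooted at a given directory
# by writing its PID to cgroup.procs.
# move_pid_to_cgroup is a hypothetical helper, not a libvirt function.
move_pid_to_cgroup() {
    pid=$1
    cgroup_dir=$2
    if [ ! -d "$cgroup_dir" ]; then
        echo "no such cgroup directory: $cgroup_dir" >&2
        return 1
    fi
    # On a real cgroup mount, the kernel migrates the process when
    # its PID is written here.
    printf '%s\n' "$pid" > "$cgroup_dir/cgroup.procs"
}
```

On a host using cgroup v1, something like `move_pid_to_cgroup "$QEMU_NBD_PID" /sys/fs/cgroup/memory` (run as root, with QEMU_NBD_PID and the mount path both being assumptions) would place qemu-nbd in the root of the memory controller, outside the container's limits.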
Then, we can kill all processes in the container as normal. Once they are all gone, we know the kernel will have cleaned up the mount namespace. We can thus safely kill qemu-nbd at this point.
Thank you for the pointers!
Ideally qemu-nbd would automatically exit when the last use of /dev/nbdNNN was released (i.e. when the filesystem was unmounted). This is something you can enable for loopback devices, but I'm not sure it works for NBD. This would be a useful kernel enhancement if someone feels adventurous.
It seems like qemu-nbd terminates automatically when the last client disconnects.
https://git.qemu.org/?p=qemu.git;a=blob;f=qemu-nbd.c;h=51b9d38c72732c821cb4e...
I will send a patch that takes qemu-nbd out of the cgroups and disconnects qemu-nbd on container shutdown.
Radostin
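For reference, the shutdown ordering discussed in this thread can be sketched roughly as follows: kill the container processes, wait for them to exit, and only then disconnect the NBD device with qemu-nbd --disconnect. This is a sketch under stated assumptions and not the planned patch; shutdown_container and the RUN dry-run hook are hypothetical, and the device path is an example.

```shell
# RUN is a hypothetical dry-run hook: set RUN=echo to print the
# commands instead of executing them.
RUN=${RUN:-}

# Sketch: shut down container processes before the NBD device.
# Usage: shutdown_container /dev/nbdN PID...
shutdown_container() {
    nbd_dev=$1
    shift
    # 1. Kill every container process.
    for pid in "$@"; do
        $RUN kill -KILL "$pid"
    done
    # 2. Wait until they are gone (skipped in dry-run mode, where
    #    nothing was actually killed).
    if [ -z "$RUN" ]; then
        for pid in "$@"; do
            while kill -0 "$pid" 2>/dev/null; do sleep 1; done
        done
    fi
    # 3. Only now disconnect the device, so that qemu-nbd goes away
    #    after the filesystem is no longer in use.
    $RUN qemu-nbd --disconnect "$nbd_dev"
}
```

With RUN=echo the function just prints the three steps in order, which is a convenient way to check the sequencing without touching a real container.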