Re: [libvirt] [RFC v2 0/4] LXC with block device and enabled userns

13 Jun 2018

On 13/06/18 11:46, Daniel P. Berrangé wrote:
On Sun, Jun 10, 2018 at 12:14:22PM +0100, Radostin Stoyanov wrote:
Hi all,
This patch series aims to resolve
https://bugzilla.redhat.com/show_bug.cgi?id=1328946
For background information about the issue, see v1 of this RFC:
https://www.redhat.com/archives/libvir-list/2018-April/msg01270.html
The current state of this series enables starting an LXC container with an NBD
file system and user namespaces enabled.
However, container shutdown causes "kernel BUG at fs/buffer.c:3058!"
https://pastebin.com/raw/y0ycSM0H
The reason for this is that the qemu-nbd process is terminated/killed without
unmounting the container root file system.
This issue has been reported in [1] and [2].
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1356110
[2] http://lkml.iu.edu/hypermail/linux/kernel/1509.3/00027.html
This is not really a kernel bug at the end of the day. We have a filesystem
backed by an NBD block device, and we're killing the NBD block device. So there's
nothing the kernel can really do here if there's outstanding I/O pending at
this time.
There is also this BZ reported against libvirt that has more info:
https://bugzilla.redhat.com/show_bug.cgi?id=1570902
As a workaround we could unmount the root file system of the container before shutdown.
For example with:
    $ CT_PID=$(pidof libvirt_lxc)
    $ sudo nsenter \
        --mount=/proc/$CT_PID/task/$CT_PID/ns/mnt \
        /bin/bash -c "umount /var/run/libvirt/lxc/guest.root/"
I noticed that we already have the functions lxcContainerUnmountSubtree
and virProcessRunInMountNamespace.
Any suggestions on how to properly implement this?
We can't unmount the filesystem directly because we don't have any process
running inside the container's mount namespace at this time. The libvirt_lxc
controller is running in a custom mount namespace that is different from what
the container has.
The first thing we need to do is take qemu-nbd out of the cgroups. This will
ensure that it doesn't get killed at the same time as we're killing off all
the container PIDs. It will also fix the OOM deadlocks we see when the memory
controller prevents qemu-nbd from allocating the RAM needed to process I/O.
Then, we can kill all processes in the container as normal. Once they are
all gone, we know the kernel will have cleaned up the mount namespace. We
can thus safely kill qemu-nbd at this point.
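
The ordering described above could be sketched in shell roughly as follows.
This is only an illustration, not how libvirt implements it: the function
names, the cgroup directory arguments and the retry count are all
hypothetical, and it handles a single cgroup hierarchy, whereas a real setup
may need the same step repeated for each controller. The only external
command assumed is qemu-nbd with its --disconnect option.

```shell
#!/bin/sh
# Sketch only: function names and arguments are illustrative assumptions.

# Move a process into the cgroup at $2 by writing its PID to cgroup.procs.
move_to_cgroup() {
    pid=$1
    cgroup_dir=$2
    printf '%s\n' "$pid" > "$cgroup_dir/cgroup.procs"
}

# Poll until the cgroup has no member processes left.
# Returns 0 once it is empty, 1 if still populated after $2 tries (default 30).
wait_cgroup_empty() {
    cgroup_dir=$1
    tries=${2:-30}
    while [ "$tries" -gt 0 ]; do
        # Read the content: cgroupfs files report size 0, so "test -s" is no use.
        [ -z "$(cat "$cgroup_dir/cgroup.procs")" ] && return 0
        sleep 1
        tries=$((tries - 1))
    done
    return 1
}

# Shutdown ordering: protect qemu-nbd, kill the container, then disconnect.
shutdown_container() {
    nbd_pid=$1      # PID of the qemu-nbd process
    nbd_dev=$2      # NBD device, e.g. /dev/nbd0
    ct_cgroup=$3    # cgroup directory of the container
    safe_cgroup=$4  # a cgroup outside the container, e.g. the hierarchy root

    # 1. Take qemu-nbd out of the container cgroup so it survives step 2.
    move_to_cgroup "$nbd_pid" "$safe_cgroup" || return 1

    # 2. Kill everything still in the container cgroup and wait for it to drain.
    while read -r pid; do
        kill -KILL "$pid" 2>/dev/null
    done < "$ct_cgroup/cgroup.procs"
    wait_cgroup_empty "$ct_cgroup" || return 1

    # 3. The container's mounts are gone now, so the device can be released.
    qemu-nbd --disconnect "$nbd_dev"
}
```

The point of the ordering is that qemu-nbd is only disconnected after the
container cgroup has drained, so no mount inside the container should still
be issuing I/O to the device at that time.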
Thank you for the pointers!
Ideally qemu-nbd would automatically exit when the last use of /dev/nbdNNN
was released (i.e. when the filesystem was unmounted). This is something you can
enable for loopback devices, but I'm not sure it works for NBD. This would
be a useful kernel enhancement if someone feels adventurous.
It seems like qemu-nbd terminates automatically when the last client
disconnects.
https://git.qemu.org/?p=qemu.git;a=blob;f=qemu-nbd.c;h=51b9d38c72732c821cb4e...

I will send a patch that takes qemu-nbd out of the cgroups and
disconnects qemu-nbd on container shutdown.

Radostin