On 1/18/23 01:58, Jim Fehlig wrote:
On 12/16/22 08:04, Michal Privoznik wrote:
> Our code relies on mount events propagating into the namespace we
> create for a domain. However, there's one caveat. In v8.8.0-rc1~8
> I've tried to make us detect differences in mount tables between
> the namespace in which libvirtd runs and the domain namespace.
> This is crucial for any mounts that happen after the domain was
> started (for instance new hugetlbfs can be mounted on say
> /dev/hugepages1G).
>
> Therefore, we take a look into /proc/$(pgrep qemu)/mounts to see
> what filesystems are mounted under /dev. Now, since we don't
> umount the original /dev, just mount a tmpfs over it, we get all
> the events (e.g. aforementioned hugetlbfs mount on
> /dev/hugepages1G), but we are not really able to access it
> because of the tmpfs that's placed on top. This then confuses our
> algorithm for detecting which filesystems are mounted (the
> algorithm is implemented in qemuDomainGetPreservedMounts()).
>
> To break the link between host's and guest's /dev we just need to
> umount() the original /dev in the namespace. Just before our
> artificially created tmpfs is moved into its place.
>
> Fixes: 46b03819ae8d833b11c2aaccb2c2a0361727f51b
> Resolves:
https://bugzilla.redhat.com/show_bug.cgi?id=2151869#c6
> Signed-off-by: Michal Privoznik <mprivozn(a)redhat.com>
> ---
> src/qemu/qemu_namespace.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/src/qemu/qemu_namespace.c b/src/qemu/qemu_namespace.c
> index 90c0b90024..a6b9af1307 100644
> --- a/src/qemu/qemu_namespace.c
> +++ b/src/qemu/qemu_namespace.c
> @@ -775,6 +775,11 @@ qemuDomainUnshareNamespace(virQEMUDriverConfig *cfg,
> goto cleanup;
> }
> + if (umount("/dev") < 0) {
> + virReportSystemError(errno, "%s", _("failed to umount devfs
> on /dev"));
> + return -1;
> + }
Hi Michal,
While doing some downstream testing of 9.0.0 with apparmor confinement
of libvirtd enabled, I encountered the above error when attempting to
start a VM.
Ah, I keep forgetting about apparmor, sorry.
Additionally from the audit subsystem
kernel: audit: type=1400 audit(1674002774.461:13): apparmor="DENIED"
operation="umount" profile="libvirtd" name="/dev/"
pid=4778
comm="rpc-libvirtd"
Hmm, libvirtd needs to umount /dev, is that right? I added
umount /dev/,
to the usr.sbin.libvirtd profile, reloaded it, and was then able to
start the VM. That seems like a big hammer. I was expecting the umount
in /run/libvirt/qemu/*.dev, but the profile already contains a rule for
that path.
Yeah, we really need to umount /dev and not /run/libvirt/.../*.dev. The
reason is twofold:
1) security - we used to mount a tmpfs over an existing /dev. Therefore,
the original /dev with all devices was still available. If a malicious
QEMU did just 'umount /dev' it would unmount the top most /dev (i.e. our
tmpfs) exposing the original /dev. I agree that this is more theoretical
threat than anything, because QEMU wouldn't probably have the capability
to play with mount table.
2) as the commit message explains, it's needed because of
qemuDomainGetPreservedMounts(). Long story short, when a QEMU is running
and we are deciding whether to make a new node under /dev we look at the
mount table of the QEMU process. This is because it's only devtmpfs we
are replacing. All other filesystems mounted under /dev are preserved
(e.g. /dev/pts, /dev/mqueue, /dev/shm, ...). Since these are preserved
(i.e. they are NOT a copy of their image in the parent namespace, they
ARE the parent namespace mounts), we don't need to mknod() anything.
For instance, when hotplugging a serial console with /dev/pts/123
backend, we must not create it in the QEMU's namespace as it already
exists. But we need to mknod() say /dev/sda1. Now, up until this it
worked just fine. Problem is, when a new filesystem is mounted after
QEMU was started.
What I've seen (and what the associated bug is about), is when another
hugetlbfs is mounted after a guest was started. For instance:
1) virsh start guest
2) mount -t hugetlbfs hugetlbfs /dev/hugepages1G -o pagesize=1G
3) systemctl restart libvirtd
4) hotplug a <memory/> into the guest that is backed by 1GB hugepages
At the last step, we want to take a look into the mount table of running
QEMU and see whether /dev/hugepages1G is mounted. Well, if the original
devtmpfs is kept mounted, then the mount event is propagated into QEMU's
mount table (even though it's inaccessible). Maybe the following shell
commands explain it better:
Domain 'fedora' started
# nsenter -a -t $(pgrep qemu) mount | grep huge
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M)
hugetlb on /sys/fs/cgroup/hugetlb type cgroup
(rw,nosuid,nodev,noexec,relatime,hugetlb)
# mount -t hugetlbfs hugetlbfs /dev/hugepages1G -o pagesize=1G
# nsenter -a -t $(pgrep qemu) mount | grep huge
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M)
hugetlb on /sys/fs/cgroup/hugetlb type cgroup
(rw,nosuid,nodev,noexec,relatime,hugetlb)
hugetlbfs on /dev/hugepages1G type hugetlbfs (rw,relatime,pagesize=1024M)
# nsenter -a -t $(pgrep qemu) ls -l /dev/hugepages1G
ls: cannot access '/dev/hugepages1G': No such file or directory
The reason why this new mount point exists in the QEMU's mount table is
the original devtmpfs. Therefore, to break this 'link', we need to
umount the original devtmpfs after which things work as expected:
# virsh start fedora
Domain 'fedora' started
# nsenter -a -t $(pgrep qemu) mount | grep huge
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M)
hugetlb on /sys/fs/cgroup/hugetlb type cgroup
(rw,nosuid,nodev,noexec,relatime,hugetlb)
# mount -t hugetlbfs hugetlbfs /dev/hugepages1G -o pagesize=1G
# nsenter -a -t $(pgrep qemu) mount | grep huge
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M)
hugetlb on /sys/fs/cgroup/hugetlb type cgroup
(rw,nosuid,nodev,noexec,relatime,hugetlb)
I hope this explains the reason why we need to unmount the original /dev.
Michal