[PATCH 0/4] qemu_process.c: Propagate hugetlbfs mounts on reconnect

See 4/4 for full explanation. Michal Prívozník (4): qemu_namespace: Tolerate missing ACLs when creating a path in namespace qemu_namespace: Fix a corner case in qemuDomainGetPreservedMounts() qemu_namespace: Introduce qemuDomainNamespaceSetupPath() qemu_process.c: Propagate hugetlbfs mounts on reconnect docs/kbase/qemu-passthrough-security.rst | 6 ----- src/qemu/qemu_namespace.c | 32 +++++++++++++++++++++--- src/qemu/qemu_namespace.h | 4 +++ src/qemu/qemu_process.c | 3 +++ 4 files changed, 36 insertions(+), 9 deletions(-) -- 2.35.1

When creating a path in a domain's mount namespace we try to set ACLs on it, so that it's a verbatim copy of the path in parent's namespace. The ACLs are queried upfront (by qemuNamespaceMknodItemInit()) but this is fault tolerant so the pointer to ACLs might be NULL (meaning no ACLs were queried, for instance because the underlying filesystem does not support them). But then we take this NULL and pass it to virFileSetACLs() which immediately returns an error because NULL is invalid value. Mimic what we do with SELinux label - just set it if we queried it successfully before. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_namespace.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/src/qemu/qemu_namespace.c b/src/qemu/qemu_namespace.c index 98cd794666..71e3366ca5 100644 --- a/src/qemu/qemu_namespace.c +++ b/src/qemu/qemu_namespace.c @@ -1040,8 +1040,7 @@ qemuNamespaceMknodOne(qemuNamespaceMknodItem *data) goto cleanup; } - /* Symlinks don't have ACLs. */ - if (!isLink && + if (data->acl && virFileSetACLs(data->file, data->acl) < 0 && errno != ENOTSUP) { virReportSystemError(errno, -- 2.35.1

On Mon, Sep 12, 2022 at 03:46:38PM +0200, Michal Privoznik wrote:
When creating a path in a domain's mount namespace we try to set ACLs on it, so that it's a verbatim copy of the path in parent's namespace. The ACLs are queried upfront (by qemuNamespaceMknodItemInit()) but this is fault tolerant so the pointer to ACLs might be NULL (meaning no ACLs were queried, for instance because the underlying filesystem does not support them). But then we take this NULL and pass it to virFileSetACLs() which immediately returns an error because NULL is invalid value.
Mimic what we do with SELinux label - just set it if we queried it successfully before.
It would be less confusing if you just said something along the lines of: Only set ACLs if they are non-NULL which includes symlinks.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Martin Kletzander <mkletzan@redhat.com>
--- src/qemu/qemu_namespace.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/src/qemu/qemu_namespace.c b/src/qemu/qemu_namespace.c index 98cd794666..71e3366ca5 100644 --- a/src/qemu/qemu_namespace.c +++ b/src/qemu/qemu_namespace.c @@ -1040,8 +1040,7 @@ qemuNamespaceMknodOne(qemuNamespaceMknodItem *data) goto cleanup; }
- /* Symlinks don't have ACLs. */ - if (!isLink && + if (data->acl && virFileSetACLs(data->file, data->acl) < 0 && errno != ENOTSUP) { virReportSystemError(errno, -- 2.35.1

When setting up namespace for QEMU we look at mount points under /dev (like /dev/pts, /dev/mqueue/, etc.) because we want to preserve those (which is done by moving them to a temp location, unshare(), and then moving them back). We have a convenience helper - qemuDomainGetPreservedMounts() - that processes the mount table and (optionally) moves the other filesystems too. This helper is also used when attempting to create a path in NS, because the path, while starting with "/dev/" prefix, may actually lead to one of those filesystems that we preserved. And here comes the corner case: while we require the parent mount table to be in shared mode (equivalent of `mount --make-rshared /'), these mount events propagate iff the target path exist inside the slave mount table (= QEMU's private namespace). And since we create only a subset of /dev nodes, well, that assumption is not always the case. For instance, assume that a domain is already running, no hugepages were configured for it nor any hugetlbfs is mounted. Now, when a hugetlbfs is mounted into '/dev/hugepages', this is propagated into the QEMU's namespace, but since the target dir does not exist in the private /dev, the FS is not mounted in the namespace. Fortunately, this difference between namespaces is visible when comparing /proc/mounts and /proc/$PID/mounts (where PID is the QEMU's PID). Therefore, if possible we should look at the latter. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_namespace.c | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/src/qemu/qemu_namespace.c b/src/qemu/qemu_namespace.c index 71e3366ca5..807ec37c91 100644 --- a/src/qemu/qemu_namespace.c +++ b/src/qemu/qemu_namespace.c @@ -109,6 +109,8 @@ qemuDomainGetPreservedMountPath(virQEMUDriverConfig *cfg, * b) generate backup path for all the entries in a) * * Any of the return pointers can be NULL. Both arrays are NULL-terminated. + * Get the mount table either from @vm's PID (if running), or from the + * namespace we're in (if @vm's not running). * * Returns 0 on success, -1 otherwise (with error reported) */ @@ -123,12 +125,18 @@ qemuDomainGetPreservedMounts(virQEMUDriverConfig *cfg, size_t nmounts = 0; g_auto(GStrv) paths = NULL; g_auto(GStrv) savePaths = NULL; + g_autofree char *mountsPath = NULL; size_t i; if (ndevPath) *ndevPath = 0; - if (virFileGetMountSubtree(QEMU_PROC_MOUNTS, "/dev", &mounts, &nmounts) < 0) + if (vm->pid > 0) + mountsPath = g_strdup_printf("/proc/%lld/mounts", (long long) vm->pid); + else + mountsPath = g_strdup(QEMU_PROC_MOUNTS); + + if (virFileGetMountSubtree(mountsPath, "/dev", &mounts, &nmounts) < 0) return -1; if (nmounts == 0) -- 2.35.1

On Mon, Sep 12, 2022 at 03:46:39PM +0200, Michal Privoznik wrote:
When setting up namespace for QEMU we look at mount points under /dev (like /dev/pts, /dev/mqueue/, etc.) because we want to preserve those (which is done by moving them to a temp location, unshare(), and then moving them back). We have a convenience helper - qemuDomainGetPreservedMounts() - that processes the mount table and (optionally) moves the other filesystems too. This helper is also used when attempting to create a path in NS, because the path, while starting with "/dev/" prefix, may actually lead to one of those filesystems that we preserved.
And here comes the corner case: while we require the parent mount table to be in shared mode (equivalent of `mount --make-rshared /'), these mount events propagate iff the target path exist inside the slave mount table (= QEMU's private namespace). And since we create only a subset of /dev nodes, well, that assumption is not always the case.
For instance, assume that a domain is already running, no hugepages were configured for it nor any hugetlbfs is mounted. Now, when a hugetlbfs is mounted into '/dev/hugepages', this is propagated into the QEMU's namespace, but since the target dir does not exist in the private /dev, the FS is not mounted in the namespace.
Fortunately, this difference between namespaces is visible when comparing /proc/mounts and /proc/$PID/mounts (where PID is the QEMU's PID). Therefore, if possible we should look at the latter.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Martin Kletzander <mkletzan@redhat.com>
--- src/qemu/qemu_namespace.c | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/src/qemu/qemu_namespace.c b/src/qemu/qemu_namespace.c index 71e3366ca5..807ec37c91 100644 --- a/src/qemu/qemu_namespace.c +++ b/src/qemu/qemu_namespace.c @@ -109,6 +109,8 @@ qemuDomainGetPreservedMountPath(virQEMUDriverConfig *cfg, * b) generate backup path for all the entries in a) * * Any of the return pointers can be NULL. Both arrays are NULL-terminated. + * Get the mount table either from @vm's PID (if running), or from the + * namespace we're in (if @vm's not running). * * Returns 0 on success, -1 otherwise (with error reported) */ @@ -123,12 +125,18 @@ qemuDomainGetPreservedMounts(virQEMUDriverConfig *cfg, size_t nmounts = 0; g_auto(GStrv) paths = NULL; g_auto(GStrv) savePaths = NULL; + g_autofree char *mountsPath = NULL; size_t i;
if (ndevPath) *ndevPath = 0;
- if (virFileGetMountSubtree(QEMU_PROC_MOUNTS, "/dev", &mounts, &nmounts) < 0) + if (vm->pid > 0) + mountsPath = g_strdup_printf("/proc/%lld/mounts", (long long) vm->pid); + else + mountsPath = g_strdup(QEMU_PROC_MOUNTS); + + if (virFileGetMountSubtree(mountsPath, "/dev", &mounts, &nmounts) < 0) return -1;
if (nmounts == 0) -- 2.35.1

Sometimes it may come handy to just bind mount a directory/file into domain's namespace. Implement a thin wrapper over qemuNamespaceMknodPaths() which has all the logic we need. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_namespace.c | 19 +++++++++++++++++++ src/qemu/qemu_namespace.h | 4 ++++ 2 files changed, 23 insertions(+) diff --git a/src/qemu/qemu_namespace.c b/src/qemu/qemu_namespace.c index 807ec37c91..09e235e120 100644 --- a/src/qemu/qemu_namespace.c +++ b/src/qemu/qemu_namespace.c @@ -1424,6 +1424,25 @@ qemuNamespaceUnlinkPaths(virDomainObj *vm, } +int +qemuDomainNamespaceSetupPath(virDomainObj *vm, + const char *path, + bool *created) +{ + g_autoptr(virGSListString) paths = NULL; + + if (!qemuDomainNamespaceEnabled(vm, QEMU_DOMAIN_NS_MOUNT)) + return 0; + + paths = g_slist_prepend(paths, g_strdup(path)); + + if (qemuNamespaceMknodPaths(vm, paths, created) < 0) + return -1; + + return 0; +} + + int qemuDomainNamespaceSetupDisk(virDomainObj *vm, virStorageSource *src, diff --git a/src/qemu/qemu_namespace.h b/src/qemu/qemu_namespace.h index fbea865c70..85d990f460 100644 --- a/src/qemu/qemu_namespace.h +++ b/src/qemu/qemu_namespace.h @@ -48,6 +48,10 @@ void qemuDomainDestroyNamespace(virQEMUDriver *driver, bool qemuDomainNamespaceAvailable(qemuDomainNamespace ns); +int qemuDomainNamespaceSetupPath(virDomainObj *vm, + const char *path, + bool *created); + int qemuDomainNamespaceSetupDisk(virDomainObj *vm, virStorageSource *src, bool *created); -- 2.35.1

On Mon, Sep 12, 2022 at 03:46:40PM +0200, Michal Privoznik wrote:
Sometimes it may come handy to just bind mount a directory/file into domain's namespace. Implement a thin wrapper over qemuNamespaceMknodPaths() which has all the logic we need.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Martin Kletzander <mkletzan@redhat.com>
--- src/qemu/qemu_namespace.c | 19 +++++++++++++++++++ src/qemu/qemu_namespace.h | 4 ++++ 2 files changed, 23 insertions(+)
diff --git a/src/qemu/qemu_namespace.c b/src/qemu/qemu_namespace.c index 807ec37c91..09e235e120 100644 --- a/src/qemu/qemu_namespace.c +++ b/src/qemu/qemu_namespace.c @@ -1424,6 +1424,25 @@ qemuNamespaceUnlinkPaths(virDomainObj *vm, }
+int +qemuDomainNamespaceSetupPath(virDomainObj *vm, + const char *path, + bool *created) +{ + g_autoptr(virGSListString) paths = NULL; + + if (!qemuDomainNamespaceEnabled(vm, QEMU_DOMAIN_NS_MOUNT)) + return 0; + + paths = g_slist_prepend(paths, g_strdup(path)); + + if (qemuNamespaceMknodPaths(vm, paths, created) < 0) + return -1; + + return 0; +} + + int qemuDomainNamespaceSetupDisk(virDomainObj *vm, virStorageSource *src, diff --git a/src/qemu/qemu_namespace.h b/src/qemu/qemu_namespace.h index fbea865c70..85d990f460 100644 --- a/src/qemu/qemu_namespace.h +++ b/src/qemu/qemu_namespace.h @@ -48,6 +48,10 @@ void qemuDomainDestroyNamespace(virQEMUDriver *driver,
bool qemuDomainNamespaceAvailable(qemuDomainNamespace ns);
+int qemuDomainNamespaceSetupPath(virDomainObj *vm, + const char *path, + bool *created); + int qemuDomainNamespaceSetupDisk(virDomainObj *vm, virStorageSource *src, bool *created); -- 2.35.1

When reconnecting to a running QEMU process, we construct the per-domain path in all hugetlbfs mounts. This is a relict from the past (v3.4.0-100-g5b24d25062) where we switched to a per-domain path and we want to create those paths when libvirtd restarts on upgrade. And with namespaces enabled there is one corner case where the path is not created. In fact an error is reported and the reconnect fails. Ideally, all mount events are propagated into the QEMU's namespace. And they probably are, except when the target path does not exist inside the namespace. Now, it's pretty common for users to mount hugetlbfs under /dev (e.g. /dev/hugepages), but if domain is started without hugepages (or more specifically - private hugetlbfs path wasn't created on domain startup), then the reconnect code tries to create it. But it fails to do so, well, it fails to set seclabels on the path because, because the path does not exist in the private namespace. And it doesn't exist because we specifically create only a subset of all possible /dev nodes. Therefore, the mount event, whilst propagated, is not successful and hence the filesystem is not mounted. We have to do it ourselves. If hugetlbfs is mount anywhere else there's no problem and this is effectively a dead code. Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=2123196 Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- docs/kbase/qemu-passthrough-security.rst | 6 ------ src/qemu/qemu_process.c | 3 +++ 2 files changed, 3 insertions(+), 6 deletions(-) diff --git a/docs/kbase/qemu-passthrough-security.rst b/docs/kbase/qemu-passthrough-security.rst index 106c3cc5b9..ef10d8af9b 100644 --- a/docs/kbase/qemu-passthrough-security.rst +++ b/docs/kbase/qemu-passthrough-security.rst @@ -172,9 +172,3 @@ command before any guest is started: :: # mount --make-rshared / - -Another requirement for dynamic mount point propagation is to not place -``hugetlbfs`` mount points under ``/dev`` because these won't be propagated as -corresponding directories do not exist in the private namespace. Or just use -``memfd`` memory backend instead which does not require ``hugetlbfs`` mount -points. diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index cbfdd3bda5..b05ad059c3 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -3976,6 +3976,9 @@ qemuProcessBuildDestroyMemoryPathsImpl(virQEMUDriver *driver, return -1; } + if (qemuDomainNamespaceSetupPath(vm, path, NULL) < 0) + return -1; + if (qemuSecurityDomainSetPathLabel(driver, vm, path, true) < 0) return -1; } else { -- 2.35.1

On Mon, Sep 12, 2022 at 03:46:41PM +0200, Michal Privoznik wrote:
When reconnecting to a running QEMU process, we construct the per-domain path in all hugetlbfs mounts. This is a relict from the past (v3.4.0-100-g5b24d25062) where we switched to a per-domain path and we want to create those paths when libvirtd restarts on upgrade.
And with namespaces enabled there is one corner case where the path is not created. In fact an error is reported and the reconnect fails. Ideally, all mount events are propagated into the QEMU's namespace. And they probably are, except when the target path does not exist inside the namespace. Now, it's pretty common for users to mount hugetlbfs under /dev (e.g. /dev/hugepages), but if domain is started without hugepages (or more specifically - private hugetlbfs path wasn't created on domain startup), then the reconnect code tries to create it. But it fails to do so, well, it fails to set seclabels on the path because, because the path does not exist in the private
s/because, // at least =)
namespace. And it doesn't exist because we specifically create only a subset of all possible /dev nodes. Therefore, the mount event, whilst propagated, is not successful and hence the filesystem is not mounted. We have to do it ourselves.
If hugetlbfs is mount anywhere else there's no problem and this is effectively a dead code.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=2123196 Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- docs/kbase/qemu-passthrough-security.rst | 6 ------ src/qemu/qemu_process.c | 3 +++ 2 files changed, 3 insertions(+), 6 deletions(-)
diff --git a/docs/kbase/qemu-passthrough-security.rst b/docs/kbase/qemu-passthrough-security.rst index 106c3cc5b9..ef10d8af9b 100644 --- a/docs/kbase/qemu-passthrough-security.rst +++ b/docs/kbase/qemu-passthrough-security.rst @@ -172,9 +172,3 @@ command before any guest is started: ::
# mount --make-rshared / - -Another requirement for dynamic mount point propagation is to not place -``hugetlbfs`` mount points under ``/dev`` because these won't be propagated as -corresponding directories do not exist in the private namespace. Or just use -``memfd`` memory backend instead which does not require ``hugetlbfs`` mount -points. diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index cbfdd3bda5..b05ad059c3 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -3976,6 +3976,9 @@ qemuProcessBuildDestroyMemoryPathsImpl(virQEMUDriver *driver, return -1; }
+ if (qemuDomainNamespaceSetupPath(vm, path, NULL) < 0) + return -1; +
What's not visible here is that this branch is accessible only if the domain has any hugepage backing configured. And it is only created on startup and reconnect. That means if you start a VM without any hugepages mounted, then mount them somewhere under /dev and restart the daemon (which was always required) then attaching a hugepage-backed dimm, for example, will work without namespaces, but will fail with namespaces because we skipped the creation on restart. I think just recreating the structure unconditionally would provide more benefit than doubly isolating qemu from hugepages (once with permissions and once with namespaces). Since this patch solves the issue of us killing perfectly fine qemu I think it's fine if that's a follow'up patch: Reviewed-by: Martin Kletzander <mkletzan@redhat.com>
if (qemuSecurityDomainSetPathLabel(driver, vm, path, true) < 0) return -1; } else { -- 2.35.1

On 9/23/22 15:06, Martin Kletzander wrote:
On Mon, Sep 12, 2022 at 03:46:41PM +0200, Michal Privoznik wrote:
When reconnecting to a running QEMU process, we construct the per-domain path in all hugetlbfs mounts. This is a relict from the past (v3.4.0-100-g5b24d25062) where we switched to a per-domain path and we want to create those paths when libvirtd restarts on upgrade.
And with namespaces enabled there is one corner case where the path is not created. In fact an error is reported and the reconnect fails. Ideally, all mount events are propagated into the QEMU's namespace. And they probably are, except when the target path does not exist inside the namespace. Now, it's pretty common for users to mount hugetlbfs under /dev (e.g. /dev/hugepages), but if domain is started without hugepages (or more specifically - private hugetlbfs path wasn't created on domain startup), then the reconnect code tries to create it. But it fails to do so, well, it fails to set seclabels on the path because, because the path does not exist in the private
s/because, // at least =)
namespace. And it doesn't exist because we specifically create only a subset of all possible /dev nodes. Therefore, the mount event, whilst propagated, is not successful and hence the filesystem is not mounted. We have to do it ourselves.
If hugetlbfs is mount anywhere else there's no problem and this is effectively a dead code.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=2123196 Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- docs/kbase/qemu-passthrough-security.rst | 6 ------ src/qemu/qemu_process.c | 3 +++ 2 files changed, 3 insertions(+), 6 deletions(-)
diff --git a/docs/kbase/qemu-passthrough-security.rst b/docs/kbase/qemu-passthrough-security.rst index 106c3cc5b9..ef10d8af9b 100644 --- a/docs/kbase/qemu-passthrough-security.rst +++ b/docs/kbase/qemu-passthrough-security.rst @@ -172,9 +172,3 @@ command before any guest is started: ::
# mount --make-rshared / - -Another requirement for dynamic mount point propagation is to not place -``hugetlbfs`` mount points under ``/dev`` because these won't be propagated as -corresponding directories do not exist in the private namespace. Or just use -``memfd`` memory backend instead which does not require ``hugetlbfs`` mount -points. diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index cbfdd3bda5..b05ad059c3 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -3976,6 +3976,9 @@ qemuProcessBuildDestroyMemoryPathsImpl(virQEMUDriver *driver, return -1; }
+ if (qemuDomainNamespaceSetupPath(vm, path, NULL) < 0) + return -1; +
What's not visible here is that this branch is accessible only if the domain has any hugepage backing configured. And it is only created on startup and reconnect. That means if you start a VM without any hugepages mounted, then mount them somewhere under /dev and restart the daemon (which was always required) then attaching a hugepage-backed dimm, for example, will work without namespaces, but will fail with namespaces because we skipped the creation on restart. I think just recreating the structure unconditionally would provide more benefit than doubly isolating qemu from hugepages (once with permissions and once with namespaces).
Actually, the path is created also on memory hotplug: src/qemu/qemu_hotplug.c=2231=qemuDomainAttachMemory(virQEMUDriver *driver, -- src/qemu/qemu_hotplug.c-2275- src/qemu/qemu_hotplug.c:2276: if (qemuProcessBuildDestroyMemoryPaths(driver, vm, mem, true) < 0) src/qemu/qemu_hotplug.c-2277- goto cleanup; So the per-domain dir should be created in all three cases (startup, reconnect & hotplug).
Since this patch solves the issue of us killing perfectly fine qemu I think it's fine if that's a follow'up patch:
Reviewed-by: Martin Kletzander <mkletzan@redhat.com>
Pushed, thanks. Michal
participants (3)
-
Martin Kletzander
-
Michal Privoznik
-
Michal Prívozník