Re: [libvirt] [PATCH v1 09/21] qemu: Spawn qemu under mount namespace

5 Dec 2016

On 05.12.2016 14:26, Daniel P. Berrange wrote:
...
On Thu, Nov 24, 2016 at 03:47:58PM +0100, Michal Privoznik wrote:
...
Prime time. When it comes to spawning qemu process and
relabelling all the devices it's going to touch, there's inherent
race with other applications in the system (e.g. udev). Instead
of trying convincing udev to not touch libvirt managed devices,
we can create a separate mount namespace for the qemu, and mount
our own /dev there. Of course this puts more work onto us as we
have to maintain /dev files on each domain start and device
hot(un-)plug. On the other hand, this enhances security also.
...
From technical POV, on domain startup process the parent
(libvirtd) creates:
/var/lib/libvirt/qemu/$domain.dev
  /var/lib/libvirt/qemu/$domain.devpts
The child (which is going to be qemu eventually) calls unshare()
to create new mount namespace. From now on anything that child
does is invisible to the parent. Child then mounts tmpfs on
$domain.dev (so that it still sees original /dev from the host)
and creates some devices (as explained in one of the previous
patches). The devices have to be created exactly as they are in
the host (including perms, seclabels, ACLs, ...). After that it
moves $domain.dev mount to /dev.
What's the $domain.devpts mount there for then you ask? QEMU can
create PTYs for some chardevs. And historically we exposed the
host ends in our domain XML allowing users to connect to them.
Therefore we must preserve devpts mount to be shared with the
host's one.
To make this patch as small as possible, creating of devices
configured for domain in question is implemented in next patches.
IIUC, this means that QEMU startup will be broken for any guests
that need host devices for disk, usb/pci passthrough, etc ?
Yep.
...
It is nice to keep the work incremental though.  I wonder if there
is any value in doing everything *except* the MS_MOVE of
/var/lib/libvirt/qemu/$domain.dev -> /dev/. ie only turn it on once
all the code is in place ?
Ah, okay.
...
...
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
 src/qemu/qemu_domain.c  | 289 ++++++++++++++++++++++++++++++++++++++++++++++++
 src/qemu/qemu_domain.h  |  10 ++
 src/qemu/qemu_process.c |  13 +++
 3 files changed, 312 insertions(+)

diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index 137d4d5..d6a1c29 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -55,6 +55,9 @@
#include <sys/time.h>
 #include <fcntl.h>
+#if defined(HAVE_SYS_MOUNT_H)
+# include <sys/mount.h>
+#endif
#include <libxml/xpathInternals.h>
@@ -1627,6 +1630,9 @@ qemuDomainObjPrivateXMLFormat(virBufferPtr buf,
                           virDomainChrTypeToString(priv->monConfig->type));
     }
+    if (priv->containerized)
+        virBufferAddLit(buf, "<containerized/>\n");
I wonder if it is worth explicitly listing the namespaces we enabled ?
eg
<namespaces>
     <mount/>
  </namespaces>
so we're prepared to deal with us adding use of more private namespaces
in future ?
Right. I was thinking about this but was unable to come with a reason
for other namespaces. But this doesn't hurt and looks like to be more
future-proof, so why not, right?
...
...
+
     qemuDomainObjPrivateXMLFormatVcpus(buf, vm->def);
if (priv->qemuCaps) {
@@ -1809,6 +1815,8 @@ qemuDomainObjPrivateXMLParse(xmlXPathContextPtr ctxt,
         goto error;
     }
+    priv->containerized = virXPathBoolean("count(./containerized) > 0", ctxt) > 0;
+
     if ((n = virXPathNodeSet("./vcpus/vcpu", ctxt, &nodes)) < 0)
         goto error;
@@ -6653,3 +6661,284 @@ qemuDomainSupportsVideoVga(virDomainVideoDefPtr video,
return true;
 }
+
+
+static int
+qemuDomainCreateDevice(const char *device,
+                       const char *path,
+                       bool allow_noent)
+{
+    char *devicePath = NULL;
+    struct stat sb;
+    int ret = -1;
+
+    if (!STRPREFIX(device, "/dev")) {
+        virReportError(VIR_ERR_INTERNAL_ERROR,
+                       _("invalid device: %s"),
+                       device);
+        goto cleanup;
+    }
+
+    if (virAsprintf(&devicePath, "%s/%s",
+                    path, device + 4) < 0)
+        goto cleanup;
+
+    if (stat(device, &sb) < 0) {
+        if (errno == ENOENT && allow_noent) {
+            /* Ignore non-existent device. */
+            ret = 0;
+            goto cleanup;
+        }
+
+        virReportSystemError(errno, _("Unable to stat %s"), device);
+        goto cleanup;
+    }
+
+    if (virFileMakeParentPath(devicePath) < 0) {
+        virReportSystemError(errno,
+                             _("Unable to create %s"),
+                             devicePath);
+        goto cleanup;
+    }
+
+    if (mknod(devicePath, sb.st_mode, sb.st_rdev) < 0) {
+        virReportSystemError(errno,
+                             _("Failed to make device %s"),
+                             devicePath);
+        goto cleanup;
+    }
+
+    if (chown(devicePath, sb.st_uid, sb.st_gid) < 0) {
+        virReportSystemError(errno,
+                             _("Failed to chown device %s"),
+                             devicePath);
+        goto cleanup;
+    }
+
+    if (virFileCopyACLs(device, devicePath) < 0 &&
+        errno != ENOTSUP) {
+        virReportSystemError(errno,
+                             _("Failed to copy ACLs on device %s"),
+                             devicePath);
+        goto cleanup;
+    }
Per my comments on earlier patches, I don't think we need to have
the chown or ACL copy. Just have the dev owned by root:root and
let our security drivers deal with the rest.
Meanwhile, we discussed this in other thread and found out we really ned
copy as-is.
...
...
+
+    ret = 0;
+ cleanup:
+    VIR_FREE(devicePath);
+    return ret;
+}
+
+
+
+static int
+qemuDomainPopulateDevices(virQEMUDriverPtr driver,
+                          virDomainObjPtr vm ATTRIBUTE_UNUSED,
+                          const char *path)
+{
+    virQEMUDriverConfigPtr cfg = virQEMUDriverGetConfig(driver);
+    const char *const *devices = (const char *const *) cfg->cgroupDeviceACL;
+    size_t i;
+    int ret = -1;
+
+    if (!devices)
+        devices = defaultDeviceACL;
+
+    for (i = 0; devices[i]; i++) {
+        if (qemuDomainCreateDevice(devices[i], path, true) < 0)
+            goto cleanup;
+    }
+
+    ret = 0;
+ cleanup:
+    virObjectUnref(cfg);
+    return ret;
+}
+
+
+static int
+qemuDomainSetupDev(virQEMUDriverPtr driver,
+                   virDomainObjPtr vm,
+                   const char *path)
+{
+    char *mount_options = NULL;
+    char *opts = NULL;
+    int ret = -1;
+
+    VIR_DEBUG("Setting up /dev/ for domain %s", vm->def->name);
+
+    mount_options = virSecurityManagerGetMountOptions(driver->securityManager,
+                                                      vm->def);
+
+    if (!mount_options &&
+        VIR_STRDUP(mount_options, "") < 0)
+        goto cleanup;
+
+    /*
+     * tmpfs is limited to 64kb, since we only have device nodes in there
+     * and don't want to DOS the entire OS RAM usage
+     */
+    if (virAsprintf(&opts,
+                    "mode=755,size=65536%s", mount_options) < 0)
+        goto cleanup;
+
+    if (virFileSetupDev(path, opts) < 0)
+        goto cleanup;
+
+    if (qemuDomainPopulateDevices(driver, vm, path) < 0)
+        goto cleanup;
+
+    ret = 0;
+ cleanup:
+    VIR_FREE(opts);
+    VIR_FREE(mount_options);
+    return ret;
+}
+
+
+int
+qemuDomainBuildNamespace(virQEMUDriverPtr driver,
+                         virDomainObjPtr vm)
+{
+    virQEMUDriverConfigPtr cfg = virQEMUDriverGetConfig(driver);
+    qemuDomainObjPrivatePtr priv = vm->privateData;
+    const unsigned long mount_flags = MS_MOVE;
+    char *devPath = NULL;
+    char *devptsPath = NULL;
+    int ret = -1;
+
+    if (!priv->containerized) {
+        ret = 0;
+        goto cleanup;
+    }
+
+    if (virAsprintf(&devPath, "%s/%s.dev",
+                    cfg->stateDir, vm->def->name) < 0 ||
+        virAsprintf(&devptsPath, "%s/%s.devpts",
+                    cfg->stateDir, vm->def->name) < 0)
+        goto cleanup;
+
+    if (qemuDomainSetupDev(driver, vm, devPath) < 0)
+        goto cleanup;
+
+    /* Save /dev/pts mount point because /dev/pts/NNN is exposed in our live
+     * XML and other applications are supposed to be able to use it. */
+    if (mount("/dev/pts", devptsPath, NULL, mount_flags, NULL) < 0) {
+        virReportSystemError(errno, "%s",
+                             _("Unable to move /dev/pts mount"));
+        goto cleanup;
+    }
+
+    if (mount(devPath, "/dev", NULL, mount_flags, NULL) < 0) {
+        virReportSystemError(errno,
+                             _("Failed to mount %s on /dev"),
+                             devPath);
+        goto cleanup;
+    }
+
+    if (virFileMakePath("/dev/pts") < 0) {
+        virReportSystemError(errno, "%s",
+                             _("Cannot create /dev/pts"));
+        goto cleanup;
+    }
+
+    if (mount(devptsPath, "/dev/pts", NULL, mount_flags, NULL) < 0) {
+        virReportSystemError(errno,
+                             _("Failed to mount %s on /dev/pts"),
+                             devptsPath);
+        goto cleanup;
+    }
+
+    if (virFileBindMountDevice("/dev/pts/ptmx", "/dev/ptmx") < 0)
+        goto cleanup;
+
+    VIR_DEBUG("blaaah: %d", system("find /dev/ -ls > /tmp/blaaah"));
+    VIR_DEBUG("blaaah: %d", system("echo >> /tmp/blaaah"));
+    VIR_DEBUG("blaaah: %d", system("mount >> /tmp/blaaah"));
Heh, :-)
D'oh! I remembered I needed to undo some debug lines somewhere...
...
...
+    ret = 0;
+ cleanup:
+    virObjectUnref(cfg);
+    VIR_FREE(devPath);
+    return ret;
+}
+
+
+int
+qemuDomainCreateNamespace(virQEMUDriverPtr driver,
+                          virDomainObjPtr vm)
+{
+    virQEMUDriverConfigPtr cfg = virQEMUDriverGetConfig(driver);
+    qemuDomainObjPrivatePtr priv = vm->privateData;
+    int ret = -1;
+    char *path = NULL;
+
+#if !defined(__linux__)
+    /* Namespaces are Linux specific. On other platforms just
+     * carry on with the old behaviour. */
+    return 0;
+#endif
This feels quite likely to create compiler warnings on non-linux
about unreachable code. It feels like we should just stub out
the entire method instead.
Ah, good point.
...
...
+
+    if (!virQEMUDriverIsPrivileged(driver)) {
+        ret = 0;
+        goto cleanup;
+    }
+
+    if (virAsprintf(&path, "%s/%s.dev",
+                    cfg->stateDir, vm->def->name) < 0)
+        goto cleanup;
+
+    if (virFileMakePath(path) < 0) {
+        virReportSystemError(errno,
+                             _("Failed to create %s"),
+                             path);
+        goto cleanup;
+    }
+
+    VIR_FREE(path);
+    if (virAsprintf(&path, "%s/%s.devpts",
+                    cfg->stateDir, vm->def->name) < 0)
+        goto cleanup;
+
+    if (virFileMakePath(path) < 0) {
+        virReportSystemError(errno,
+                             _("Failed to create %s"),
+                             path);
+        goto cleanup;
+    }
+
+    priv->containerized = true;
+    ret = 0;
+ cleanup:
+    VIR_FREE(path);
+    virObjectUnref(cfg);
+    return ret;
+}
+
+
+void
+qemuDomainDeleteNamespace(virQEMUDriverPtr driver,
+                          virDomainObjPtr vm)
+{
+    virQEMUDriverConfigPtr cfg = virQEMUDriverGetConfig(driver);
+    qemuDomainObjPrivatePtr priv = vm->privateData;
+    char *path;
+
+    if (!priv->containerized)
+        return;
+
+    if (virAsprintf(&path, "%s/%s.dev",
+                    cfg->stateDir, vm->def->name) < 0)
+        goto cleanup;
+
+    virFileDeleteTree(path);
+
+    VIR_FREE(path);
+    if (virAsprintf(&path, "%s/%s.devpts",
+                    cfg->stateDir, vm->def->name) < 0)
+        goto cleanup;
+
+    virFileDeleteTree(path);
+ cleanup:
+    virObjectUnref(cfg);
+    VIR_FREE(path);
+}
This is running in the host namespace and custom /dev tmpfs
was only ever mounted in the QEMU private namespace, so
should be invisible to the host. As such, IIUC, these
directories should both be 100% empty. IOW, it seems that
we don't need virFileDeleteTree - plain rmdir should be
sufficient and safer against accidents.
Again, very good point. I'll fix it in v2.

Michal