[libvirt PATCH 0/4] src: add configurable support for cgroups usage

This simple series allows apps to choose how cgroups are managed by
libvirt, between cgroupfs and machined, or disabled entirely.

Daniel P. Berrangé (4):
  conf: allow different resource registration modes
  util: allow choice between machined or direct cgroups filesystem
  qemu: wire up support for controlling use of cgroups backend
  lxc: wire up support for controlling use of cgroups backend

 docs/formatdomain.html.in                 | 24 +++++++-
 docs/schemas/domaincommon.rng             | 17 +++++-
 src/conf/domain_conf.c                    | 46 ++++++++++++----
 src/conf/domain_conf.h                    | 13 +++++
 src/libvirt_private.syms                  |  2 +
 src/lxc/lxc_cgroup.c                      | 32 +++++++++++
 src/lxc/lxc_cgroup.h                      |  3 +
 src/lxc/lxc_process.c                     |  7 ++-
 src/qemu/qemu_cgroup.c                    | 67 +++++++++++++++++++++--
 src/qemu/qemu_command.c                   |  9 +--
 src/util/vircgroup.c                      | 57 ++++++++++++-------
 src/util/vircgroup.h                      |  7 +++
 tests/lxcxml2xmldata/lxc-capabilities.xml |  2 +-
 tests/lxcxml2xmldata/lxc-idmap.xml        |  2 +-
 14 files changed, 239 insertions(+), 49 deletions(-)

-- 
2.24.1

From: "Daniel P. Berrange" <berrange@redhat.com>

Currently the QEMU driver has three ways of setting up cgroups. It either
skips them entirely (if non-root), uses systemd-machined, or uses cgroups
directly. This change adds the ability to configure the resource
registration mechanism explicitly, choosing between all of these options
via

  <resource backend="none|cgroupfs|machined"/>

It is further possible to register directly with systemd, bypassing
machined. We don't support this, but systemd-nspawn does, and we ought to
consider it at some point. This would involve a new "systemd" backend
type alongside "machined".

Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
---
 docs/formatdomain.html.in                 | 24 +++++++++++-
 docs/schemas/domaincommon.rng             | 17 +++++++--
 src/conf/domain_conf.c                    | 46 ++++++++++++++++++-----
 src/conf/domain_conf.h                    | 13 +++++++
 src/libvirt_private.syms                  |  2 +
 src/lxc/lxc_cgroup.c                      |  8 ++++
 src/lxc/lxc_process.c                     |  1 +
 src/qemu/qemu_cgroup.c                    | 14 +++++++
 tests/lxcxml2xmldata/lxc-capabilities.xml |  2 +-
 tests/lxcxml2xmldata/lxc-idmap.xml        |  2 +-
 10 files changed, 113 insertions(+), 16 deletions(-)

diff --git a/docs/formatdomain.html.in b/docs/formatdomain.html.in
index 0d229386eb..a016e789f1 100644
--- a/docs/formatdomain.html.in
+++ b/docs/formatdomain.html.in
@@ -1445,7 +1445,7 @@
     </p>
     <pre>
 ...
-<resource>
+<resource backend='none|cgroupfs|machined'>
   <partition>/virtualmachines/production</partition>
 </resource>
 ...
@@ -1455,8 +1455,30 @@
       Resource partitions are currently supported by the QEMU and LXC
       drivers, which map partition paths to cgroups directories, in all
       mounted controllers. <span class="since">Since 1.0.5</span>
+      There is a choice of implementations to use for resource partitions
+      controlled via the optional <code>backend</code> attribute.
+      <span class="since">Since 6.2.0</span>.
+      It accepts the values
+    </p>
+    <dl>
+      <dt>none</dt>
+      <dd>Resource management in libvirt is disabled, with the APIs
+        returning an error indicating the functionality is not available.
+        The QEMU processes will remain in whatever cgroup the libvirt
+        daemon was in. On systemd hosts, this will result in QEMU being
+        terminated at the same time as the privileged libvirt management
+        daemon which launched them.</dd>
+      <dt>cgroupfs</dt>
+      <dd>Cgroups will be directly created via the cgroups virtual
+        filesystem. This is not recommended for use in scenarios where
+        systemd is in charge of the cgroup hierarchy, unless the resource
+        partition points to a subtree whose administration systemd has
+        delegated.</dd>
+      <dt>machined</dt>
+      <dd>Systemd machined will be called to indirectly create cgroups.
+        This is recommended for any host where systemd is managing the
+        cgroup hierarchy.</dd>
+    </dl>
+
     <h3><a id="elementsCPU">CPU model and topology</a></h3>
     <p>
diff --git a/docs/schemas/domaincommon.rng b/docs/schemas/domaincommon.rng
index 6805420451..29ffc3a3cf 100644
--- a/docs/schemas/domaincommon.rng
+++ b/docs/schemas/domaincommon.rng
@@ -1126,9 +1126,20 @@
   <define name="respartition">
     <element name="resource">
-      <element name="partition">
-        <ref name="absFilePath"/>
-      </element>
+      <optional>
+        <attribute name="backend">
+          <choice>
+            <value>none</value>
+            <value>cgroupfs</value>
+            <value>machined</value>
+          </choice>
+        </attribute>
+      </optional>
+      <optional>
+        <element name="partition">
+          <ref name="absFilePath"/>
+        </element>
+      </optional>
     </element>
   </define>
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c
index e0432fc47d..ae512283d0 100644
--- a/src/conf/domain_conf.c
+++ b/src/conf/domain_conf.c
@@ -1244,6 +1244,13 @@ VIR_ENUM_IMPL(virDomainOsDefFirmware,
               "efi",
 );

+VIR_ENUM_IMPL(virDomainResourceBackend,
+              VIR_DOMAIN_RESOURCE_BACKEND_LAST,
+              "default",
+              "none",
+              "cgroupfs",
+              "machined");
+
 /* Internal mapping: subset of block job types that can be present
   in <mirror> XML (remaining types are not two-phase). */
 VIR_ENUM_DECL(virDomainBlockJob);

@@ -19100,17 +19107,24 @@ virDomainResourceDefParse(xmlNodePtr node,
 {
     VIR_XPATH_NODE_AUTORESTORE(ctxt);
     virDomainResourceDefPtr def = NULL;
+    g_autofree char *reg = NULL;

     ctxt->node = node;

     if (VIR_ALLOC(def) < 0)
         goto error;

-    /* Find out what type of virtualization to use */
-    if (!(def->partition = virXPathString("string(./partition)", ctxt))) {
-        virReportError(VIR_ERR_INTERNAL_ERROR,
-                       "%s", _("missing resource partition attribute"));
-        goto error;
+    def->partition = virXPathString("string(./partition)", ctxt);
+
+    reg = virXMLPropString(node, "backend");
+    if (reg != NULL) {
+        if ((def->backend = virDomainResourceBackendTypeFromString(reg)) <= 0) {
+            virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
+                           "%s", _("Invalid backend attribute"));
+            goto error;
+        }
+    } else {
+        def->backend = VIR_DOMAIN_RESOURCE_BACKEND_DEFAULT;
     }

     return def;

@@ -27983,11 +27997,23 @@ static void
 virDomainResourceDefFormat(virBufferPtr buf,
                            virDomainResourceDefPtr def)
 {
-    virBufferAddLit(buf, "<resource>\n");
-    virBufferAdjustIndent(buf, 2);
-    virBufferEscapeString(buf, "<partition>%s</partition>\n", def->partition);
-    virBufferAdjustIndent(buf, -2);
-    virBufferAddLit(buf, "</resource>\n");
+    if (def->backend == VIR_DOMAIN_RESOURCE_BACKEND_DEFAULT &&
+        def->partition == NULL)
+        return;
+
+    virBufferAddLit(buf, "<resource");
+    if (def->backend != VIR_DOMAIN_RESOURCE_BACKEND_DEFAULT)
+        virBufferAsprintf(buf, " backend='%s'",
+                          virDomainResourceBackendTypeToString(def->backend));
+
+    if (def->partition) {
+        virBufferAddLit(buf, ">\n");
+        virBufferAdjustIndent(buf, 2);
+        virBufferEscapeString(buf, "<partition>%s</partition>\n", def->partition);
+        virBufferAdjustIndent(buf, -2);
+        virBufferAddLit(buf, "</resource>\n");
+    } else {
+        virBufferAddLit(buf, "/>\n");
+    }
 }
diff --git a/src/conf/domain_conf.h b/src/conf/domain_conf.h
index 73bd097cf8..4bfda29dee 100644
--- a/src/conf/domain_conf.h
+++ 
b/src/conf/domain_conf.h
@@ -2228,7 +2228,19 @@ struct _virDomainPanicDef {
 void virBlkioDeviceArrayClear(virBlkioDevicePtr deviceWeights,
                               int ndevices);

+typedef enum {
+    VIR_DOMAIN_RESOURCE_BACKEND_DEFAULT,
+    VIR_DOMAIN_RESOURCE_BACKEND_NONE,
+    VIR_DOMAIN_RESOURCE_BACKEND_CGROUPFS,
+    VIR_DOMAIN_RESOURCE_BACKEND_MACHINED,
+
+    VIR_DOMAIN_RESOURCE_BACKEND_LAST,
+} virDomainResourceBackend;
+
+typedef struct _virDomainResourceDef virDomainResourceDef;
+typedef virDomainResourceDef *virDomainResourceDefPtr;
 struct _virDomainResourceDef {
+    int backend; /* enum virDomainResourceBackend */
     char *partition;
 };

@@ -3525,6 +3537,7 @@ VIR_ENUM_DECL(virDomainIOMMUModel);
 VIR_ENUM_DECL(virDomainVsockModel);
 VIR_ENUM_DECL(virDomainShmemModel);
 VIR_ENUM_DECL(virDomainLaunchSecurity);
+VIR_ENUM_DECL(virDomainResourceBackend);

 /* from libvirt.h */
 VIR_ENUM_DECL(virDomainState);
 VIR_ENUM_DECL(virDomainNostateReason);
diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms
index 6b305bdd0e..6e5cc201ff 100644
--- a/src/libvirt_private.syms
+++ b/src/libvirt_private.syms
@@ -575,6 +575,8 @@ virDomainRedirdevBusTypeToString;
 virDomainRedirdevDefFind;
 virDomainRedirdevDefFree;
 virDomainRedirdevDefRemove;
+virDomainResourceBackendTypeFromString;
+virDomainResourceBackendTypeToString;
 virDomainRNGBackendTypeToString;
 virDomainRNGDefFree;
 virDomainRNGFind;
diff --git a/src/lxc/lxc_cgroup.c b/src/lxc/lxc_cgroup.c
index 7df723a4da..326d33981c 100644
--- a/src/lxc/lxc_cgroup.c
+++ b/src/lxc/lxc_cgroup.c
@@ -392,6 +392,14 @@ virCgroupPtr virLXCCgroupCreate(virDomainDefPtr def,
     if (!machineName)
         goto cleanup;

+    if (def->resource->backend != VIR_DOMAIN_RESOURCE_BACKEND_DEFAULT) {
+        virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
+                       _("Resource backend '%s' not available"),
+                       virDomainResourceBackendTypeToString(
+                           def->resource->backend));
+        goto cleanup;
+    }
+
     if (def->resource->partition[0] != '/') {
         virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
                        _("Resource partition '%s' must start with
'/'"),
diff --git a/src/lxc/lxc_process.c b/src/lxc/lxc_process.c
index 5199f3806e..4ec3cc5619 100644
--- a/src/lxc/lxc_process.c
+++ b/src/lxc/lxc_process.c
@@ -1260,6 +1260,7 @@ int virLXCProcessStart(virConnectPtr conn,
         if (VIR_ALLOC(res) < 0)
             goto cleanup;

+        res->backend = VIR_DOMAIN_RESOURCE_BACKEND_DEFAULT;
         res->partition = g_strdup("/machine");

         vm->def->resource = res;
diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c
index c0e30f6152..c407431f6b 100644
--- a/src/qemu/qemu_cgroup.c
+++ b/src/qemu/qemu_cgroup.c
@@ -940,11 +940,20 @@ qemuInitCgroup(virDomainObjPtr vm,
         if (VIR_ALLOC(res) < 0)
             goto cleanup;

+        res->backend = VIR_DOMAIN_RESOURCE_BACKEND_DEFAULT;
         res->partition = g_strdup("/machine");

         vm->def->resource = res;
     }

+    if (vm->def->resource->backend != VIR_DOMAIN_RESOURCE_BACKEND_DEFAULT) {
+        virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
+                       _("Resource backend '%s' not available"),
+                       virDomainResourceBackendTypeToString(
+                           vm->def->resource->backend));
+        goto cleanup;
+    }
+
     if (vm->def->resource->partition[0] != '/') {
         virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
                        _("Resource partition '%s' must start with '/'"),
@@ -1061,6 +1070,11 @@ qemuConnectCgroup(virDomainObjPtr vm)
     virQEMUDriverConfigPtr cfg = virQEMUDriverGetConfig(priv->driver);
     int ret = -1;

+    if (vm->def->resource &&
+        vm->def->resource->backend == VIR_DOMAIN_RESOURCE_BACKEND_NONE) {
+        goto done;
+    }
+
     if (!virQEMUDriverIsPrivileged(priv->driver))
         goto done;

diff --git a/tests/lxcxml2xmldata/lxc-capabilities.xml b/tests/lxcxml2xmldata/lxc-capabilities.xml
index 04d64e3e41..335fdf8b91 100644
--- a/tests/lxcxml2xmldata/lxc-capabilities.xml
+++ b/tests/lxcxml2xmldata/lxc-capabilities.xml
@@ -4,7 +4,7 @@
   <memory unit='KiB'>1048576</memory>
   <currentMemory unit='KiB'>1048576</currentMemory>
   <vcpu placement='static'>1</vcpu>
-  <resource>
+  <resource backend='cgroupfs'>
     <partition>/machine</partition>
   </resource>
   <os>
diff --git a/tests/lxcxml2xmldata/lxc-idmap.xml
b/tests/lxcxml2xmldata/lxc-idmap.xml
index b477636c30..d618d69706 100644
--- a/tests/lxcxml2xmldata/lxc-idmap.xml
+++ b/tests/lxcxml2xmldata/lxc-idmap.xml
@@ -4,7 +4,7 @@
   <memory unit='KiB'>1048576</memory>
   <currentMemory unit='KiB'>1048576</currentMemory>
   <vcpu placement='static'>1</vcpu>
-  <resource>
+  <resource backend='machined'>
     <partition>/machine</partition>
   </resource>
   <os>
-- 
2.24.1
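Taken together, the schema and parser changes in this patch make domain XML fragments like the following valid (an illustrative sketch based on the RNG change above; the partition path is an example, not a requirement):

```xml
<!-- Explicitly request registration via systemd machined -->
<resource backend='machined'>
  <partition>/machine/production</partition>
</resource>

<!-- Disable cgroup-based resource management entirely; since the
     schema now makes <partition> optional, the element can be empty -->
<resource backend='none'/>
```

Omitting the `backend` attribute keeps the pre-existing behaviour, since the parser maps a missing attribute to `VIR_DOMAIN_RESOURCE_BACKEND_DEFAULT`.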

The cgroups code currently tries to use machined and then falls back to
using the cgroups filesystem directly. This introduces a new enum that
allows callers to have explicit control over which impl is used.

Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
---
 src/lxc/lxc_cgroup.c   |  1 +
 src/qemu/qemu_cgroup.c |  1 +
 src/util/vircgroup.c   | 57 ++++++++++++++++++++++++++----------------
 src/util/vircgroup.h   |  7 ++++++
 4 files changed, 45 insertions(+), 21 deletions(-)

diff --git a/src/lxc/lxc_cgroup.c b/src/lxc/lxc_cgroup.c
index 326d33981c..629f9eca1c 100644
--- a/src/lxc/lxc_cgroup.c
+++ b/src/lxc/lxc_cgroup.c
@@ -414,6 +414,7 @@ virCgroupPtr virLXCCgroupCreate(virDomainDefPtr def,
                                    initpid,
                                    true,
                                    nnicindexes, nicindexes,
+                                   VIR_CGROUP_REGISTER_DEFAULT,
                                    def->resource->partition,
                                    -1,
                                    0,
diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c
index c407431f6b..cd7c381185 100644
--- a/src/qemu/qemu_cgroup.c
+++ b/src/qemu/qemu_cgroup.c
@@ -968,6 +968,7 @@ qemuInitCgroup(virDomainObjPtr vm,
                                vm->pid,
                                false,
                                nnicindexes, nicindexes,
+                               VIR_CGROUP_REGISTER_DEFAULT,
                                vm->def->resource->partition,
                                cfg->cgroupControllers,
                                cfg->maxThreadsPerProc,
diff --git a/src/util/vircgroup.c b/src/util/vircgroup.c
index 70d85200cb..0128c8bb60 100644
--- a/src/util/vircgroup.c
+++ b/src/util/vircgroup.c
@@ -1228,6 +1228,7 @@ virCgroupNewMachine(const char *name,
                     bool isContainer,
                     size_t nnicindexes,
                     int *nicindexes,
+                    virCgroupRegister *reg,
                     const char *partition,
                     int controllers,
                     unsigned int maxthreads,
@@ -1237,29 +1238,43 @@ virCgroupNewMachine(const char *name,

     *group = NULL;

-    if ((rv = virCgroupNewMachineSystemd(name,
-                                         drivername,
-                                         uuid,
-                                         rootdir,
-                                         pidleader,
-                                         isContainer,
-                                         nnicindexes,
-                                         nicindexes,
-                                         partition,
-                                         controllers,
-                                         maxthreads,
-                                         group)) == 0)
-        return 0;
+    if (*reg == VIR_CGROUP_REGISTER_DEFAULT ||
+        *reg == VIR_CGROUP_REGISTER_MACHINED) {
+        if ((rv = virCgroupNewMachineSystemd(name,
+                                             drivername,
+                                             uuid,
+                                             rootdir,
+                                             pidleader,
+                                             isContainer,
+                                             nnicindexes,
+                                             nicindexes,
+                                             partition,
+                                             controllers,
+                                             maxthreads,
+                                             group)) == 0) {
+            *reg = VIR_CGROUP_REGISTER_MACHINED;
+            return 0;
+        }

-    if (rv == -1)
-        return -1;
+        if (rv == -1)
+            return -1;
+
+        if (*reg == VIR_CGROUP_REGISTER_MACHINED) {
+            virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
+                           "%s", _("Systemd machined requested, but not available"));
+            return -1;
+        }
+    }

-    return virCgroupNewMachineManual(name,
-                                     drivername,
-                                     pidleader,
-                                     partition,
-                                     controllers,
-                                     group);
+    rv = virCgroupNewMachineManual(name,
+                                   drivername,
+                                   pidleader,
+                                   partition,
+                                   controllers,
+                                   group);
+    if (rv == 0)
+        *reg = VIR_CGROUP_REGISTER_DIRECT;
+    return rv;
 }
diff --git a/src/util/vircgroup.h b/src/util/vircgroup.h
index 1dcd0688f1..b725bc4c94 100644
--- a/src/util/vircgroup.h
+++ b/src/util/vircgroup.h
@@ -87,6 +87,12 @@ virCgroupNewDetectMachine(const char *name,
                           virCgroupPtr *group)
     ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2);

+typedef enum {
+    VIR_CGROUP_REGISTER_DEFAULT,
+    VIR_CGROUP_REGISTER_DIRECT,
+    VIR_CGROUP_REGISTER_MACHINED,
+} virCgroupRegister;
+
 int virCgroupNewMachine(const char *name,
                         const char *drivername,
                         const unsigned char *uuid,
@@ -95,6 +101,7 @@ int virCgroupNewMachine(const char *name,
                         bool isContainer,
                         size_t nnicindexes,
                         int *nicindexes,
+                        virCgroupRegister *reg,
                         const char *partition,
                         int controllers,
                         unsigned int maxthreads,
-- 
2.24.1

Currently the QEMU driver chooses between systemd machined and direct
cgroups usage when privileged, and does not use either when unprivileged.
This wires up support for the new backend choice introduced by the
earlier commits, allowing apps to override the default logic in the
driver when privileged.

This reverts commit c32a7de7d8f81384b17dbe529c6d3b3ac13c631d.
---
 src/qemu/qemu_cgroup.c  | 68 ++++++++++++++++++++++++++++++++---------
 src/qemu/qemu_command.c |  9 +++---
 2 files changed, 59 insertions(+), 18 deletions(-)

diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c
index cd7c381185..a1b53f6628 100644
--- a/src/qemu/qemu_cgroup.c
+++ b/src/qemu/qemu_cgroup.c
@@ -917,6 +917,46 @@ qemuSetupCpuCgroup(virDomainObjPtr vm)
 }

+static int qemuGetCgroupMode(virDomainObjPtr vm,
+                             virDomainResourceBackend backend,
+                             virCgroupRegister *cgreg)
+{
+    qemuDomainObjPrivatePtr priv = vm->privateData;
+    bool avail = virQEMUDriverIsPrivileged(priv->driver) &&
+        virCgroupAvailable();
+
+    switch (backend) {
+    case VIR_DOMAIN_RESOURCE_BACKEND_NONE:
+        return 0;
+    case VIR_DOMAIN_RESOURCE_BACKEND_DEFAULT:
+        if (!avail)
+            return 0;
+        *cgreg = VIR_CGROUP_REGISTER_DEFAULT;
+        break;
+    case VIR_DOMAIN_RESOURCE_BACKEND_MACHINED:
+        if (!avail)
+            goto unsupported;
+        *cgreg = VIR_CGROUP_REGISTER_MACHINED;
+        break;
+    case VIR_DOMAIN_RESOURCE_BACKEND_CGROUPFS:
+        if (!avail)
+            goto unsupported;
+        *cgreg = VIR_CGROUP_REGISTER_DIRECT;
+        break;
+    case VIR_DOMAIN_RESOURCE_BACKEND_LAST:
+    default:
+        virReportEnumRangeError(virDomainResourceBackend, backend);
+    }
+
+    return 1;
+
+ unsupported:
+    virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
+                   _("Resource backend '%s' not available"),
+                   virDomainResourceBackendTypeToString(backend));
+    return -1;
+}
+
 static int
 qemuInitCgroup(virDomainObjPtr vm,
                size_t nnicindexes,
@@ -925,11 +965,17 @@ qemuInitCgroup(virDomainObjPtr vm,
     int ret = -1;
     qemuDomainObjPrivatePtr priv = vm->privateData;
     virQEMUDriverConfigPtr cfg = virQEMUDriverGetConfig(priv->driver);
+    virCgroupRegister reg;
+    int rv;

-    if (!virQEMUDriverIsPrivileged(priv->driver))
-        goto done;
-
-    if (!virCgroupAvailable())
+    rv = qemuGetCgroupMode(vm,
+                           vm->def->resource ?
+                           vm->def->resource->backend :
+                           VIR_DOMAIN_RESOURCE_BACKEND_DEFAULT,
+                           &reg);
+    if (rv < 0)
+        goto cleanup;
+    if (rv == 0)
         goto done;

     virCgroupFree(&priv->cgroup);
@@ -941,18 +987,12 @@ qemuInitCgroup(virDomainObjPtr vm,
             goto cleanup;

         res->backend = VIR_DOMAIN_RESOURCE_BACKEND_DEFAULT;
-        res->partition = g_strdup("/machine");
-
         vm->def->resource = res;
     }

-    if (vm->def->resource->backend != VIR_DOMAIN_RESOURCE_BACKEND_DEFAULT) {
-        virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
-                       _("Resource backend '%s' not available"),
-                       virDomainResourceBackendTypeToString(
-                           vm->def->resource->backend));
-        goto cleanup;
-    }
+    if (vm->def->resource->backend != VIR_DOMAIN_RESOURCE_BACKEND_NONE &&
+        !vm->def->resource->partition)
+        vm->def->resource->partition = g_strdup("/machine");

     if (vm->def->resource->partition[0] != '/') {
         virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
@@ -968,7 +1008,7 @@ qemuInitCgroup(virDomainObjPtr vm,
                                vm->pid,
                                false,
                                nnicindexes, nicindexes,
-                               VIR_CGROUP_REGISTER_DEFAULT,
+                               &reg,
                                vm->def->resource->partition,
                                cfg->cgroupControllers,
                                cfg->maxThreadsPerProc,
diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
index 9790c92cf8..eb1c3f6e12 100644
--- a/src/qemu/qemu_command.c
+++ b/src/qemu/qemu_command.c
@@ -9662,7 +9662,8 @@ qemuBuildCommandLineValidate(virQEMUDriverPtr driver,
     int spice = 0;
     int egl_headless = 0;

-    if (!virQEMUDriverIsPrivileged(driver)) {
+    if (!virQEMUDriverIsPrivileged(driver) ||
+        (def->resource && def->resource->backend == VIR_DOMAIN_RESOURCE_BACKEND_NONE)) {
         /* If we have no cgroups then we can have no tunings that
          * require them */

@@ -9670,13 +9671,13 @@ qemuBuildCommandLineValidate(virQEMUDriverPtr driver,
             virMemoryLimitIsSet(def->mem.soft_limit) ||
             virMemoryLimitIsSet(def->mem.swap_hard_limit)) {
             virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s",
-                           _("Memory tuning is not available in session mode"));
+                           _("Memory tuning is not available without cgroups"));
             return -1;
         }

         if (def->blkio.weight) {
             virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s",
-                           _("Block I/O tuning is not available in session mode"));
+                           _("Block I/O tuning is not available without cgroups"));
             return -1;
         }

@@ -9686,7 +9687,7 @@ qemuBuildCommandLineValidate(virQEMUDriverPtr driver,
             def->cputune.emulator_quota || def->cputune.iothread_period ||
             def->cputune.iothread_quota) {
             virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s",
-                           _("CPU tuning is not available in session mode"));
+                           _("CPU tuning is not available without cgroups"));
             return -1;
         }
     }
-- 
2.24.1

On a Friday in 2020, Daniel P. Berrangé wrote:
Currently the QEMU driver chooses between systemd machined and direct cgroups usage when privileged, and does not use either when unprivileged.
This wires up support for the new backend choice introduced by the earlier commits, allowing apps to override the default logic in the driver when privileged.
This reverts commit c32a7de7d8f81384b17dbe529c6d3b3ac13c631d.
fatal: bad object c32a7de7d8f81384b17dbe529c6d3b3ac13c631d

same in the next commit message.

Jano

On Fri, Mar 20, 2020 at 04:42:46PM +0100, Ján Tomko wrote:
On a Friday in 2020, Daniel P. Berrangé wrote:
Currently the QEMU driver chooses between systemd machined and direct cgroups usage when privileged, and does not use either when unprivileged.
This wires up support for the new backend choice introduced by the earlier commits, allowing apps to override the default logic in the driver when privileged.
This reverts commit c32a7de7d8f81384b17dbe529c6d3b3ac13c631d.
fatal: bad object c32a7de7d8f81384b17dbe529c6d3b3ac13c631d
same in the next commit message.
Oops, cruft from squashing together some commits.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org          -o-             https://fstop138.berrange.com :|
|: https://entangle-photo.org   -o-     https://www.instagram.com/dberrange :|

Currently the LXC driver chooses between systemd machined and direct
cgroups usage. This wires up support for the new backend choice
introduced by the earlier commits, allowing apps to override the default
logic in the driver.

This reverts commit c32a7de7d8f81384b17dbe529c6d3b3ac13c631d.
---
 src/lxc/lxc_cgroup.c  | 37 ++++++++++++++++++++++++++++++-------
 src/lxc/lxc_cgroup.h  |  3 +++
 src/lxc/lxc_process.c |  6 ++++--
 3 files changed, 37 insertions(+), 9 deletions(-)

diff --git a/src/lxc/lxc_cgroup.c b/src/lxc/lxc_cgroup.c
index 629f9eca1c..1cfa831f2b 100644
--- a/src/lxc/lxc_cgroup.c
+++ b/src/lxc/lxc_cgroup.c
@@ -380,6 +380,33 @@ static int virLXCCgroupSetupDeviceACL(virDomainDefPtr def,
     return 0;
 }

+int virLXCCgroupMode(virDomainResourceBackend backend,
+                     virCgroupRegister *cgreg)
+{
+    switch (backend) {
+    case VIR_DOMAIN_RESOURCE_BACKEND_NONE:
+        virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
+                       _("Resource backend '%s' not available"),
+                       virDomainResourceBackendTypeToString(backend));
+        return -1;
+    case VIR_DOMAIN_RESOURCE_BACKEND_DEFAULT:
+        *cgreg = VIR_CGROUP_REGISTER_DEFAULT;
+        break;
+    case VIR_DOMAIN_RESOURCE_BACKEND_MACHINED:
+        *cgreg = VIR_CGROUP_REGISTER_MACHINED;
+        break;
+    case VIR_DOMAIN_RESOURCE_BACKEND_CGROUPFS:
+        *cgreg = VIR_CGROUP_REGISTER_DIRECT;
+        break;
+    case VIR_DOMAIN_RESOURCE_BACKEND_LAST:
+    default:
+        virReportEnumRangeError(virDomainResourceBackend, backend);
+        return -1;
+    }
+
+    return 0;
+}
+
 virCgroupPtr virLXCCgroupCreate(virDomainDefPtr def,
                                 pid_t initpid,
@@ -387,18 +414,14 @@ virCgroupPtr virLXCCgroupCreate(virDomainDefPtr def,
                                 int *nicindexes)
 {
     virCgroupPtr cgroup = NULL;
+    virCgroupRegister reg;
     char *machineName = virLXCDomainGetMachineName(def, 0);

     if (!machineName)
         goto cleanup;

-    if (def->resource->backend != VIR_DOMAIN_RESOURCE_BACKEND_DEFAULT) {
-        virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
-                       _("Resource backend '%s' not available"),
-                       virDomainResourceBackendTypeToString(
-                           def->resource->backend));
+    if
(virLXCCgroupMode(def->resource->backend, &reg) < 0)
         goto cleanup;
-    }

     if (def->resource->partition[0] != '/') {
         virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
@@ -414,7 +437,7 @@ virCgroupPtr virLXCCgroupCreate(virDomainDefPtr def,
                                    initpid,
                                    true,
                                    nnicindexes, nicindexes,
-                                   VIR_CGROUP_REGISTER_DEFAULT,
+                                   &reg,
                                    def->resource->partition,
                                    -1,
                                    0,
diff --git a/src/lxc/lxc_cgroup.h b/src/lxc/lxc_cgroup.h
index 63e9e837b0..df799b2720 100644
--- a/src/lxc/lxc_cgroup.h
+++ b/src/lxc/lxc_cgroup.h
@@ -26,6 +26,9 @@
 #include "lxc_fuse.h"
 #include "virusb.h"

+int virLXCCgroupMode(virDomainResourceBackend backend,
+                     virCgroupRegister *cgreg);
+
 virCgroupPtr virLXCCgroupCreate(virDomainDefPtr def,
                                 pid_t initpid,
                                 size_t nnicindexes,
diff --git a/src/lxc/lxc_process.c b/src/lxc/lxc_process.c
index 4ec3cc5619..ead27c17c0 100644
--- a/src/lxc/lxc_process.c
+++ b/src/lxc/lxc_process.c
@@ -1261,11 +1261,13 @@ int virLXCProcessStart(virConnectPtr conn,
             goto cleanup;

         res->backend = VIR_DOMAIN_RESOURCE_BACKEND_DEFAULT;
-        res->partition = g_strdup("/machine");

         vm->def->resource = res;
     }

+    if (vm->def->resource->backend != VIR_DOMAIN_RESOURCE_BACKEND_NONE &&
+        !vm->def->resource->partition)
+        vm->def->resource->partition = g_strdup("/machine");
+
     logfile = g_strdup_printf("%s/%s.log", cfg->logDir, vm->def->name);

     if (!(pidfile = virPidFileBuildPath(cfg->stateDir, vm->def->name)))
-- 
2.24.1

On Fri, Mar 20, 2020 at 01:40:10PM +0000, Daniel P. Berrangé wrote:
This simple series allows apps to choose how cgroups are managed by libvirt, between cgroupfs and machined, or disabled entirely.
I'm not so sure about this series. The situation with cgroups and
systemd is a bit more complex than the current code handles.

There is an existing issue where we are violating the delegation rules
described by cgroups and systemd.

Currently the "cgroupfs" approach is used only on non-systemd hosts, and
we should keep it that way. Libvirt is not allowed to meddle with
cgroups owned by systemd, so IMHO a warning in the configuration file is
not enough, because without delegation cgroups will not work properly in
libvirt. One example is that without delegation the VM cgroups would not
have any controllers enabled by default.

Having an option to completely disable cgroups is OK, but that's the
only thing I would expose for now.

When using machined the current topology looks like this (only the
files that we use are listed):

/sys/fs/cgroup/machine.slice/
└── machine-qemu\x2d1\x2dcentos8.scope
    ├── emulator
    │   ├── cgroup.controllers
    │   ├── cgroup.threads
    │   ├── cgroup.type
    │   ├── cpuset.cpus
    │   └── cpuset.mems
    ├── vcpu0
    │   ├── cgroup.controllers
    │   ├── cgroup.threads
    │   ├── cgroup.type
    │   ├── cpuset.cpus
    │   └── cpuset.mems
    ├── vcpu1
    │   ├── cgroup.controllers
    │   ├── cgroup.threads
    │   ├── cgroup.type
    │   ├── cpuset.cpus
    │   └── cpuset.mems
    ├── vcpu2
    │   ├── cgroup.controllers
    │   ├── cgroup.threads
    │   ├── cgroup.type
    │   ├── cpuset.cpus
    │   └── cpuset.mems
    ├── vcpu3
    │   ├── cgroup.controllers
    │   ├── cgroup.threads
    │   ├── cgroup.type
    │   ├── cpuset.cpus
    │   └── cpuset.mems
    ├── cgroup.controllers
    ├── cgroup.procs
    ├── cgroup.subtree_control
    ├── cgroup.type
    ├── cpu.max
    ├── cpu.stat
    ├── cpu.weight
    ├── io.bfq.weight
    ├── io.max
    ├── io.stat
    ├── io.weight
    ├── memory.high
    ├── memory.max
    ├── memory.stat
    └── memory.swap.max

which is incorrect.
The group "machine-qemu\x2d1\x2dcentos8.scope" is marked as delegated,
which means we can create new sub-cgroups there, and that we can access
only these files directly in that group:

    cgroup.subtree_control - to enable cgroup controllers for our sub-cgroups
    cgroup.procs - to move processes around into our sub-cgroups

Other files are off limits and we should not touch them. In addition,
systemd may create its own sub-cgroups, and it could fail to do so
because there is another limitation from the kernel: processes can live
only in the leaves, except for the root cgroup. Exactly that is
happening right now, and it was discovered by a user reporting a bug
that systemctl daemon-reload changed the values in cpu.shares, because
systemd owns that file and it was our fault to change it.

Currently we use these cgroups files:

cgroup.type
  - to enable threaded mode, which is required for vcpu, emulator and
    iothread pinning. We should not do that in the delegated cgroup
    because we may break systemd; this has to be moved to a sub-cgroup
    together with the pinning.

io.weight / io.bfq.weight, cpu.weight
  - These cannot be moved into the sub-cgroup, because they would lose
    their effect and it would be completely pointless to set them. All
    the *.weight files work in a way that the resources available to the
    parent are distributed to the children according to their weights.
    If we moved them into the sub-cgroup, there would only ever be a
    single child. Therefore we have to use D-Bus to talk to systemd to
    have these values configured directly on the delegated cgroup.

io.max, memory.max, memory.high, memory.swap.max, cpu.max
  - All of these can be safely set in the sub-cgroup that we will have
    to create, as they are absolute limits and are not affected by
    siblings.

cpuset.mems, cpuset.cpus
  - These are used together with the threaded model and can also be
    safely set in the sub-cgroup.
Based on all of the above, this is the new topology that has to be
created by libvirt if systemd is present:

/sys/fs/cgroup/machine.slice/
└── machine-qemu\x2d1\x2dcentos8.scope
    ├── libvirt-vm (or something else)
    │   ├── emulator
    │   │   ├── cgroup.controllers
    │   │   ├── cgroup.threads
    │   │   ├── cgroup.type
    │   │   ├── cpuset.cpus
    │   │   └── cpuset.mems
    │   ├── vcpu0
    │   │   ├── cgroup.controllers
    │   │   ├── cgroup.threads
    │   │   ├── cgroup.type
    │   │   ├── cpuset.cpus
    │   │   └── cpuset.mems
    │   ├── vcpu1
    │   │   ├── cgroup.controllers
    │   │   ├── cgroup.threads
    │   │   ├── cgroup.type
    │   │   ├── cpuset.cpus
    │   │   └── cpuset.mems
    │   ├── vcpu2
    │   │   ├── cgroup.controllers
    │   │   ├── cgroup.threads
    │   │   ├── cgroup.type
    │   │   ├── cpuset.cpus
    │   │   └── cpuset.mems
    │   ├── vcpu3
    │   │   ├── cgroup.controllers
    │   │   ├── cgroup.threads
    │   │   ├── cgroup.type
    │   │   ├── cpuset.cpus
    │   │   └── cpuset.mems
    │   ├── cgroup.controllers
    │   ├── cgroup.procs
    │   ├── cgroup.subtree_control
    │   ├── cgroup.type
    │   ├── cpu.max
    │   ├── cpu.stat
    │   ├── io.max
    │   ├── io.stat
    │   ├── memory.high
    │   ├── memory.max
    │   ├── memory.stat
    │   └── memory.swap.max
    ├── cgroup.procs
    ├── cgroup.subtree_control
    ├── cpu.weight (only via systemd using D-Bus)
    ├── io.bfq.weight (only via systemd using D-Bus)
    └── io.weight (only via systemd using D-Bus)

Pavel

On Fri, Mar 20, 2020 at 04:30:07PM +0100, Pavel Hrdina wrote:
On Fri, Mar 20, 2020 at 01:40:10PM +0000, Daniel P. Berrangé wrote:
This simple series allows apps to choose how cgroups are managed by libvirt, between cgroupfs and machined, or disabled entirely.
I'm not so sure about this series. The situation with cgroups and systemd is a bit more complex than the current code handles.
There is an existing issue where we are violating the delegation rules described by cgroups and systemd.
Currently the "cgroupfs" approach is used only on non-systemd hosts, and we should keep it that way. Libvirt is not allowed to meddle with cgroups owned by systemd, so IMHO a warning in the configuration file is not enough, because without delegation cgroups will not work properly in libvirt.
That isn't the case currently, AFAICT from the current code. Before this
series, the virCgroupNewMachine method will first try systemd and then
fall back to using cgroupfs directly. This fallback can happen on a
systemd host when machined is not installed, as machined is an optional
component.

If we need to mandate use of systemd on systemd hosts, then our existing
code is broken and needs fixing. I'm happy to do such a fix, and then
adjust this series to take account of it. Essentially we'd allow apps to
specify 'cgroupfs' or 'machined', but we'd enforce that they only make
safe choices, i.e.:

 - on a systemd host we'd only allow 'none' or 'machined'
 - on a non-systemd host, or on a systemd host where we've been
   delegated a subtree, we'd only allow 'none' or 'cgroupfs'
One example is that without delegation the VM cgroups would not have any controllers enabled by default.
Having an option to completely disable cgroups is OK but that's the only thing I would expose for now.
When using machined the current topology looks like this (only the files that we use are listed):
/sys/fs/cgroup/machine.slice/
└── machine-qemu\x2d1\x2dcentos8.scope
    ├── emulator
    │   ├── cgroup.controllers
    │   ├── cgroup.threads
    │   ├── cgroup.type
    │   ├── cpuset.cpus
    │   └── cpuset.mems
    ├── vcpu0
    │   ├── cgroup.controllers
    │   ├── cgroup.threads
    │   ├── cgroup.type
    │   ├── cpuset.cpus
    │   └── cpuset.mems
    ├── vcpu1
    │   ├── cgroup.controllers
    │   ├── cgroup.threads
    │   ├── cgroup.type
    │   ├── cpuset.cpus
    │   └── cpuset.mems
    ├── vcpu2
    │   ├── cgroup.controllers
    │   ├── cgroup.threads
    │   ├── cgroup.type
    │   ├── cpuset.cpus
    │   └── cpuset.mems
    ├── vcpu3
    │   ├── cgroup.controllers
    │   ├── cgroup.threads
    │   ├── cgroup.type
    │   ├── cpuset.cpus
    │   └── cpuset.mems
    ├── cgroup.controllers
    ├── cgroup.procs
    ├── cgroup.subtree_control
    ├── cgroup.type
    ├── cpu.max
    ├── cpu.stat
    ├── cpu.weight
    ├── io.bfq.weight
    ├── io.max
    ├── io.stat
    ├── io.weight
    ├── memory.high
    ├── memory.max
    ├── memory.stat
    └── memory.swap.max
which is incorrect. The group "machine-qemu\x2d1\x2dcentos8.scope" is marked as delegated which means we can create new sub-cgroups there and that we can access only these files directly in that group:
cgroup.subtree_control - to enable cgroup controllers for our sub-cgroups
cgroup.procs - to move processes around into our sub-cgroups
Other files are off limits and we should not touch them. In addition, systemd may create its own sub-cgroups, and it would fail to do so because of another kernel limitation: processes can live only in leaf cgroups, except under the root cgroup.
Exactly this is happening right now, and it was discovered by a user reporting a bug that systemctl daemon-reload changed the values in cpu.shares: systemd owns that file, and it was our fault for changing it.
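The delegation rules above can be sketched as a small predicate. This is an illustrative model, not libvirt code: it only captures the rule that, at the cgroup delegated to us by systemd, only cgroup.procs and cgroup.subtree_control may be written directly, while sub-cgroups we create ourselves below that boundary are entirely ours.

```python
# Illustrative model of systemd's cgroup delegation rules (not libvirt code).
# At the delegation boundary itself only two files may be written by the
# delegatee; everything inside sub-cgroups the delegatee created is fair game.

DELEGATEE_WRITABLE = {"cgroup.procs", "cgroup.subtree_control"}

def may_write(filename: str, at_delegation_boundary: bool) -> bool:
    """Return True if the delegatee may write this file."""
    if not at_delegation_boundary:
        # A sub-cgroup we made below the boundary: we own all of it.
        return True
    return filename in DELEGATEE_WRITABLE
```

Under this model, writing cpu.shares (or cpu.weight on v2) directly in the scope, as the bug report describes, is exactly the kind of access the predicate rejects.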
Currently we use these cgroup files:
cgroup.type
- to enable the threaded mode that is required for vcpu, emulator and iothread pinning. We should not do that in the delegated cgroup because we may break systemd; this has to be moved into a sub-cgroup together with the pinning.
io.weight / io.bfq.weight, cpu.weight
- These cannot be moved into the sub-cgroup, because they would lose their effect and setting them would be completely pointless. All the *.weight files work by distributing the resources available to the parent among its children according to their weights. If we moved them into the sub-cgroup, there would only ever be a single child. Therefore we have to use D-Bus to ask systemd to configure these values directly on the delegated cgroup.
io.max, memory.max, memory.high, memory.swap.max, cpu.max
- All of these can be safely set in the sub-cgroup that we will have to create as they are absolute limits and they are not affected by siblings.
cpuset.mems, cpuset.cpus
- These are used together with the threaded model and can also be safely set in the sub-cgroup.
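The reason the *.weight tunables cannot move into a lone sub-cgroup can be seen with a little arithmetic. A weight distributes the parent's resource among siblings in proportion, so with a single child the value has no effect at all. A minimal sketch of the (simplified) cgroup v2 weight semantics:

```python
def effective_shares(weights):
    """Split the parent's resource among siblings in proportion to
    their cgroup *.weight values (simplified cgroup v2 semantics)."""
    total = sum(weights)
    return [w / total for w in weights]

# Two siblings with weights 100 and 200 get one third and two thirds
# of the parent's resource.  A single child gets everything regardless
# of its weight value, which is why setting cpu.weight inside a lone
# sub-cgroup would be pointless:
#   effective_shares([42]) -> [1.0]
```

This is why the thread concludes that cpu.weight, io.weight and io.bfq.weight must be set on the delegated scope itself, via systemd over D-Bus, while the absolute limits (*.max, memory.high) can go into the sub-cgroup.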
Based on all of the above this is the new topology that has to be created by libvirt if systemd is present:
/sys/fs/cgroup/machine.slice/
└── machine-qemu\x2d1\x2dcentos8.scope
    ├── libvirt-vm (or something else)
So IIUC, we can allow 'cgroupfs' on a systemd host, provided that the path for the <partition> points to a sub-tree that is delegated.
    │   ├── emulator
    │   │   ├── cgroup.controllers
    │   │   ├── cgroup.threads
    │   │   ├── cgroup.type
    │   │   ├── cpuset.cpus
    │   │   └── cpuset.mems
    │   ├── vcpu0
    │   │   ├── cgroup.controllers
    │   │   ├── cgroup.threads
    │   │   ├── cgroup.type
    │   │   ├── cpuset.cpus
    │   │   └── cpuset.mems
    │   ├── vcpu1
    │   │   ├── cgroup.controllers
    │   │   ├── cgroup.threads
    │   │   ├── cgroup.type
    │   │   ├── cpuset.cpus
    │   │   └── cpuset.mems
    │   ├── vcpu2
    │   │   ├── cgroup.controllers
    │   │   ├── cgroup.threads
    │   │   ├── cgroup.type
    │   │   ├── cpuset.cpus
    │   │   └── cpuset.mems
    │   ├── vcpu3
    │   │   ├── cgroup.controllers
    │   │   ├── cgroup.threads
    │   │   ├── cgroup.type
    │   │   ├── cpuset.cpus
    │   │   └── cpuset.mems
    │   ├── cgroup.controllers
    │   ├── cgroup.procs
    │   ├── cgroup.subtree_control
    │   ├── cgroup.type
    │   ├── cpu.max
    │   ├── cpu.stat
    │   ├── io.max
    │   ├── io.stat
    │   ├── memory.high
    │   ├── memory.max
    │   ├── memory.stat
    │   └── memory.swap.max
    ├── cgroup.procs
    ├── cgroup.subtree_control
    ├── cpu.weight (only via systemd using D-Bus)
    ├── io.bfq.weight (only via systemd using D-Bus)
    └── io.weight (only via systemd using D-Bus)
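The sequence of cgroupfs operations needed to carve out such a sub-cgroup could look like the following dry-run sketch. The function and the "libvirt-vm" name are placeholders from the discussion, not libvirt API; the real work would happen in libvirt's C code. The point is the ordering: create the sub-cgroup, move our processes down into it, and only then enable controllers in cgroup.subtree_control, because of the kernel's no-processes-in-inner-nodes rule.

```python
def plan_sub_cgroup(scope, name="libvirt-vm", pids=(),
                    controllers=("cpu", "io", "memory", "cpuset")):
    """Return the (operation, path, value) steps that would carve a
    sub-cgroup out of a delegated scope.  Pure dry run: nothing here
    touches the filesystem."""
    sub = f"{scope}/{name}"
    steps = [("mkdir", sub, None)]
    # Processes must be moved out of the scope before controllers can
    # be enabled for its children (no-processes-in-inner-nodes rule).
    steps += [("write", f"{sub}/cgroup.procs", str(pid)) for pid in pids]
    steps.append(("write", f"{scope}/cgroup.subtree_control",
                  " ".join(f"+{c}" for c in controllers)))
    return steps
```

Note that only cgroup.procs and cgroup.subtree_control of the scope are written, consistent with the delegation rules discussed above.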
Pavel
Regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On Fri, Mar 20, 2020 at 03:39:58PM +0000, Daniel P. Berrangé wrote:
On Fri, Mar 20, 2020 at 04:30:07PM +0100, Pavel Hrdina wrote:
On Fri, Mar 20, 2020 at 01:40:10PM +0000, Daniel P. Berrangé wrote:
This simple series allows apps to choose how cgroups are managed by libvirt, between cgroupfs and machined, or disabled entirely.
I'm not so sure about this series. The situation with cgroups and systemd is a bit more complex than the current code handles.
There is an existing issue where we are violating the delegation rules described by cgroups and systemd.
Currently the "cgroupfs" approach is used only on non-systemd hosts and we should keep it that way. Libvirt is not allowed to tamper with cgroups owned by systemd, so IMHO a warning in the configuration file is not enough, because without delegation cgroups will not work properly in libvirt.
That isn't the case currently AFAICT from current code.
Before this series, the virCgroupNewMachine method will first try systemd and then fall back directly to cgroupfs.
This fallback can happen on a systemd host, when machined is not installed, as machined is an optional component.
Right, I should have checked the code.
If we need to mandate use of systemd on systemd hosts, then our existing code is broken and needs fixing.
I think we should do that because without delegation we should not touch anything in cgroups.
I'm happy to do such a fix, and then adjust this series to take account of it. Essentially we'd allow apps to specify 'cgroupfs' or 'machined', but we'd enforce that they only make safe choices.
ie on a systemd host we'd only allow 'none' or 'machined'
On a non-systemd host, or on a systemd host where we've been delegated a subtree, we'd only allow 'none' or 'cgroupfs'
This sounds reasonable. I was thinking about checking whether we have a delegated subtree, but decided not to mention it. The only place where we can check for delegation is the systemd API, probably over D-Bus. It's not reflected anywhere in the cgroup files.

Pavel
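One real systemd interface for this is the Delegate property on a unit, e.g. the output of `systemctl show -p Delegate <unit>` (or the equivalent property on org.freedesktop.systemd1 over D-Bus). A sketch of parsing that output; how the output is obtained (subprocess, D-Bus call) is left as an assumption, only the parsing is shown:

```python
def unit_is_delegated(show_output: str) -> bool:
    """Parse the output of `systemctl show -p Delegate <unit>`, which
    looks like 'Delegate=yes' or 'Delegate=no' (newer systemd may show
    a controller list instead, which also implies delegation)."""
    for line in show_output.splitlines():
        if line.startswith("Delegate="):
            value = line.split("=", 1)[1].strip()
            return value not in ("no", "")
    return False
```

As Pavel notes, nothing in cgroupfs itself records delegation, so a check like this against systemd is the only reliable source of truth.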

On Fri, Mar 20, 2020 at 04:48:58PM +0100, Pavel Hrdina wrote:
On Fri, Mar 20, 2020 at 03:39:58PM +0000, Daniel P. Berrangé wrote:
On Fri, Mar 20, 2020 at 04:30:07PM +0100, Pavel Hrdina wrote:
On Fri, Mar 20, 2020 at 01:40:10PM +0000, Daniel P. Berrangé wrote:
This simple series allows apps to choose how cgroups are managed by libvirt, between cgroupfs and machined, or disabled entirely.
I'm not so sure about this series. The situation with cgroups and systemd is a bit more complex than the current code handles.
There is an existing issue where we are violating the delegation rules described by cgroups and systemd.
Currently the "cgroupfs" approach is used only on non-systemd hosts and we should keep it that way. Libvirt is not allowed to tamper with cgroups owned by systemd, so IMHO a warning in the configuration file is not enough, because without delegation cgroups will not work properly in libvirt.
That isn't the case currently AFAICT from current code.
Before this series, the virCgroupNewMachine method will first try systemd and then fall back directly to cgroupfs.
This fallback can happen on a systemd host, when machined is not installed, as machined is an optional component.
Right, I should have checked the code.
If we need to mandate use of systemd on systemd hosts, then our existing code is broken and needs fixing.
I think we should do that because without delegation we should not touch anything in cgroups.
I'm happy to do such a fix, and then adjust this series to take account of it. Essentially we'd allow apps to specify 'cgroupfs' or 'machined', but we'd enforce that they only make safe choices.
ie on a systemd host we'd only allow 'none' or 'machined'
On a non-systemd host, or on a systemd host where we've been delegated a subtree, we'd only allow 'none' or 'cgroupfs'
This sounds reasonable. I was thinking about checking whether we have a delegated subtree, but decided not to mention it. The only place where we can check for delegation is the systemd API, probably over D-Bus.
It's not reflected anywhere in the cgroup files.
The key scenario for cgroupfs would be if libvirtd was run inside a container on a host that otherwise uses systemd. IIUC, the container should be given a delegated subtree, and we need to be able to use the cgroupfs backend there. Currently this works (accidentally) because we'll try to talk to machined and fail, so we fall back to direct usage.

Regards,
Daniel
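The backend policy converging in this thread (machined on systemd hosts, cgroupfs only on non-systemd hosts or where a subtree has been delegated to us, 'none' always allowed) could be sketched as follows. The function name and boolean flags are hypothetical illustrations, not libvirt API:

```python
def allowed_backends(systemd_host: bool, delegated_subtree: bool) -> set:
    """Which <resource backend=...> values an app may safely request,
    per the policy discussed in this thread (hypothetical sketch)."""
    allowed = {"none"}          # disabling cgroups entirely is always safe
    if systemd_host:
        allowed.add("machined") # systemd owns the cgroup tree
    if not systemd_host or delegated_subtree:
        # e.g. libvirtd in a container that was handed a delegated subtree
        allowed.add("cgroupfs")
    return allowed
```

The container scenario above is the case where both flags are true, so both machined (if reachable) and cgroupfs would be acceptable rather than relying on the accidental fallback.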

On Fri, Mar 20, 2020 at 03:59:41PM +0000, Daniel P. Berrangé wrote:
On Fri, Mar 20, 2020 at 04:48:58PM +0100, Pavel Hrdina wrote:
On Fri, Mar 20, 2020 at 03:39:58PM +0000, Daniel P. Berrangé wrote:
On Fri, Mar 20, 2020 at 04:30:07PM +0100, Pavel Hrdina wrote:
On Fri, Mar 20, 2020 at 01:40:10PM +0000, Daniel P. Berrangé wrote:
This simple series allows apps to choose how cgroups are managed by libvirt, between cgroupfs and machined, or disabled entirely.
I'm not so sure about this series. The situation with cgroups and systemd is a bit more complex than the current code handles.
There is an existing issue where we are violating the delegation rules described by cgroups and systemd.
Currently the "cgroupfs" approach is used only on non-systemd hosts and we should keep it that way. Libvirt is not allowed to tamper with cgroups owned by systemd, so IMHO a warning in the configuration file is not enough, because without delegation cgroups will not work properly in libvirt.
That isn't the case currently AFAICT from current code.
Before this series, the virCgroupNewMachine method will first try systemd and then fall back directly to cgroupfs.
This fallback can happen on a systemd host, when machined is not installed, as machined is an optional component.
Right, I should have checked the code.
If we need to mandate use of systemd on systemd hosts, then our existing code is broken and needs fixing.
I think we should do that because without delegation we should not touch anything in cgroups.
I'm happy to do such a fix, and then adjust this series to take account of it. Essentially we'd allow apps to specify 'cgroupfs' or 'machined', but we'd enforce that they only make safe choices.
ie on a systemd host we'd only allow 'none' or 'machined'
On a non-systemd host, or on a systemd host where we've been delegated a subtree, we'd only allow 'none' or 'cgroupfs'
This sounds reasonable. I was thinking about checking whether we have a delegated subtree, but decided not to mention it. The only place where we can check for delegation is the systemd API, probably over D-Bus.
It's not reflected anywhere in the cgroup files.
The key scenario for cgroupfs would be if libvirtd was run inside a container on a host that otherwise uses systemd. IIUC, the container should be given a delegated subtree, and we need to be able to use the cgroupfs backend there. Currently this works (accidentally) because we'll try to talk to machined and fail, so we fall back to direct usage.
That should be OK, and we can support this even outside of containers if we can verify that the cgroup subtree is delegated and safe to use for VMs. The wrong usage of delegation by libvirt is the source of the reported BUG, so we should be restrictive.

Pavel

On Fri, Mar 20, 2020 at 05:42:57PM +0100, Pavel Hrdina wrote:
On Fri, Mar 20, 2020 at 03:59:41PM +0000, Daniel P. Berrangé wrote:
On Fri, Mar 20, 2020 at 04:48:58PM +0100, Pavel Hrdina wrote:
On Fri, Mar 20, 2020 at 03:39:58PM +0000, Daniel P. Berrangé wrote:
On Fri, Mar 20, 2020 at 04:30:07PM +0100, Pavel Hrdina wrote:
On Fri, Mar 20, 2020 at 01:40:10PM +0000, Daniel P. Berrangé wrote:
This simple series allows apps to choose how cgroups are managed by libvirt, between cgroupfs and machined, or disabled entirely.
I'm not so sure about this series. The situation with cgroups and systemd is a bit more complex than the current code handles.
There is an existing issue where we are violating the delegation rules described by cgroups and systemd.
Currently the "cgroupfs" approach is used only on non-systemd hosts and we should keep it that way. Libvirt is not allowed to tamper with cgroups owned by systemd, so IMHO a warning in the configuration file is not enough, because without delegation cgroups will not work properly in libvirt.
That isn't the case currently AFAICT from current code.
Before this series, the virCgroupNewMachine method will first try systemd and then fall back directly to cgroupfs.
This fallback can happen on a systemd host, when machined is not installed, as machined is an optional component.
Right, I should have checked the code.
If we need to mandate use of systemd on systemd hosts, then our existing code is broken and needs fixing.
I think we should do that because without delegation we should not touch anything in cgroups.
I'm happy to do such a fix, and then adjust this series to take account of it. Essentially we'd allow apps to specify 'cgroupfs' or 'machined', but we'd enforce that they only make safe choices.
ie on a systemd host we'd only allow 'none' or 'machined'
On a non-systemd host, or on a systemd host where we've been delegated a subtree, we'd only allow 'none' or 'cgroupfs'
This sounds reasonable. I was thinking about checking whether we have a delegated subtree, but decided not to mention it. The only place where we can check for delegation is the systemd API, probably over D-Bus.
It's not reflected anywhere in the cgroup files.
The key scenario for cgroupfs would be if libvirtd was run inside a container on a host that otherwise uses systemd. IIUC, the container should be given a delegated subtree, and we need to be able to use the cgroupfs backend there. Currently this works (accidentally) because we'll try to talk to machined and fail, so we fall back to direct usage.
That should be OK and we can support this even outside of containers if we can verify that the cgroup subtree is delegated and it's safe to use it for VMs.
The wrong usage of delegation by libvirt is the source of the reported BUG so we should be restrictive.
Do you have a pointer to the bugs you've seen in this area? In particular I'm wondering if there's a need for different behaviour with v1 vs v2 cgroups. Systemd has always wanted proper delegation behaviour from apps, but IIUC, this was mostly a nice-to-have under v1, compared to a must-have under v2.

My big concern is that we put in a restriction and accidentally break someone's deployment scenario under v1.

Regards,
Daniel

On Fri, Mar 20, 2020 at 05:01:40PM +0000, Daniel P. Berrangé wrote:
On Fri, Mar 20, 2020 at 05:42:57PM +0100, Pavel Hrdina wrote:
On Fri, Mar 20, 2020 at 03:59:41PM +0000, Daniel P. Berrangé wrote:
On Fri, Mar 20, 2020 at 04:48:58PM +0100, Pavel Hrdina wrote:
On Fri, Mar 20, 2020 at 03:39:58PM +0000, Daniel P. Berrangé wrote:
On Fri, Mar 20, 2020 at 04:30:07PM +0100, Pavel Hrdina wrote:
On Fri, Mar 20, 2020 at 01:40:10PM +0000, Daniel P. Berrangé wrote:
> This simple series allows apps to choose how cgroups are managed by
> libvirt, between cgroupfs and machined, or disabled entirely.
I'm not so sure about this series. The situation with cgroups and systemd is a bit more complex than the current code handles.
There is an existing issue where we are violating the delegation rules described by cgroups and systemd.
Currently the "cgroupfs" approach is used only on non-systemd hosts and we should keep it that way. Libvirt is not allowed to tamper with cgroups owned by systemd, so IMHO a warning in the configuration file is not enough, because without delegation cgroups will not work properly in libvirt.
That isn't the case currently AFAICT from current code.
Before this series, the virCgroupNewMachine method will first try systemd and then fall back directly to cgroupfs.
This fallback can happen on a systemd host, when machined is not installed, as machined is an optional component.
Right, I should have checked the code.
If we need to mandate use of systemd on systemd hosts, then our existing code is broken and needs fixing.
I think we should do that because without delegation we should not touch anything in cgroups.
I'm happy to do such a fix, and then adjust this series to take account of it. Essentially we'd allow apps to specify 'cgroupfs' or 'machined', but we'd enforce that they only make safe choices.
ie on a systemd host we'd only allow 'none' or 'machined'
On a non-systemd host, or on a systemd host where we've been delegated a subtree, we'd only allow 'none' or 'cgroupfs'
This sounds reasonable. I was thinking about checking whether we have a delegated subtree, but decided not to mention it. The only place where we can check for delegation is the systemd API, probably over D-Bus.
It's not reflected anywhere in the cgroup files.
The key scenario for cgroupfs would be if libvirtd was run inside a container on a host that otherwise uses systemd. IIUC, the container should be given a delegated subtree, and we need to be able to use the cgroupfs backend there. Currently this works (accidentally) because we'll try to talk to machined and fail, so we fall back to direct usage.
That should be OK and we can support this even outside of containers if we can verify that the cgroup subtree is delegated and it's safe to use it for VMs.
The wrong usage of delegation by libvirt is the source of the reported BUG so we should be restrictive.
Do you have a pointer to the bugs you've seen in this area? In particular I'm wondering if there's a need for different behaviour with v1 vs v2 cgroups. Systemd has always wanted proper delegation behaviour from apps, but IIUC, this was mostly a nice-to-have under v1, compared to a must-have under v2.
Sure, the first BZ was actually reported on RHEL-7: https://bugzilla.redhat.com/show_bug.cgi?id=1789824 and we have clones for RHEL-8 and RHEL-AV-8; the issue is present with both v1 and v2. Delegation is only nice-to-have with v1 because nothing is really enforced there, not even the no-processes-in-inner-nodes rule. In addition, cgroups v1 are broken in many ways. The only other difference that I can remember now is that with cgroups v2 we cannot enable controllers that are not supported by systemd, because of the unified hierarchy and the fact that we should not touch the cgroup.subtree_control file. In cgroups v1, where each controller is usually mounted separately, we are free to do whatever we like with the controllers not managed by systemd. I have to check the code, but we probably violate this restriction with v2 when we fall back to the non-machined codepath.
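The structural difference Pavel describes, one unified hierarchy in v2 versus separately mounted controllers in v1, is also how the two can be told apart: the v2 hierarchy exposes a cgroup.controllers file at its mount root, while v1 controller mounts have no such file. A sketch, with the conventional mount point as a default that may differ on some hosts:

```python
import os

def cgroup_version(mount_point: str = "/sys/fs/cgroup") -> int:
    """Guess the cgroup version from the mount layout: the unified (v2)
    hierarchy has cgroup.controllers at its root; v1 mounts each
    controller separately with no such file at the top."""
    if os.path.exists(os.path.join(mount_point, "cgroup.controllers")):
        return 2
    return 1
```

A check like this is only about the hierarchy layout; it says nothing about delegation, which (as discussed earlier in the thread) is only knowable via systemd.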
My big concern is that we put in a restriction and accidentally break someone's deployment scenario under v1.
That might happen, and unfortunately those users would have to fix their scenario: it might not work as they expect, since systemd is the owner of non-delegated cgroups in both v1 and v2, as the BZ demonstrates.

Pavel
participants (3)
- Daniel P. Berrangé
- Ján Tomko
- Pavel Hrdina