[libvirt PATCH v3] cgroup/LXC: Do not condition availability of v2 by controllers

systemd in hybrid mode uses v1 hierarchies for controllers and v2 for process tracking. The LXC code uses virCgroupAddMachineProcess() to move processes into appropriate cgroup by manipulating cgroupfs directly. (Note, despite libvirt also supports talking to systemd directly via org.freedesktop.machine1 API.) If this path is taken, libvirt/lxc must convince systemd that processes really belong to new cgroup, i.e. also the tracking v2 hierarchy must undergo migration too. The current check would evaluate v2 backend as unavailable with hybrid mode (because there are no available controllers). Simplify the condition and consider the mounted cgroup2 as sufficient to touch v2 hierarchy. This consequently creates an issue with binding the V2 mount. In hybrid mode the V2 filesystem may be mounted upon the V1 filesystem. By reversing the order in which backends are mounted in virCgroupBindMount this problem is circumvented. Fixes: #182 Signed-off-by: Eric van Blokland <mail@ericvanblokland.nl> --- src/util/vircgroup.c | 8 +++++--- src/util/vircgroupv2.c | 12 ------------ 2 files changed, 5 insertions(+), 15 deletions(-) diff --git a/src/util/vircgroup.c b/src/util/vircgroup.c index a6a409af3d..48fbcf625a 100644 --- a/src/util/vircgroup.c +++ b/src/util/vircgroup.c @@ -2924,9 +2924,11 @@ virCgroupBindMount(virCgroup *group, const char *oldroot, size_t i; virCgroup *parent = virCgroupGetNested(group); - for (i = 0; i < VIR_CGROUP_BACKEND_TYPE_LAST; i++) { - if (parent->backends[i] && - parent->backends[i]->bindMount(parent, oldroot, mountopts) < 0) { + /* In hybrid environments, V2 may be mounted over V1. + * Mount the backends in reverse order. */ + for (i = 1; i <= VIR_CGROUP_BACKEND_TYPE_LAST; i++) { + if (parent->backends[VIR_CGROUP_BACKEND_TYPE_LAST - i] && + parent->backends[VIR_CGROUP_BACKEND_TYPE_LAST - i]->bindMount(parent, oldroot, mountopts) < 0) { return -1; } } diff --git a/src/util/vircgroupv2.c b/src/util/vircgroupv2.c index 4c110940cf..0e0c61d466 100644 --- a/src/util/vircgroupv2.c +++ b/src/util/vircgroupv2.c @@ -75,22 +75,10 @@ virCgroupV2Available(void) if (STRNEQ(entry.mnt_type, "cgroup2")) continue; - /* Systemd uses cgroup v2 for process tracking but no controller is - * available. We should consider this configuration as cgroup v2 is - * not available. */ - contFile = g_strdup_printf("%s/cgroup.controllers", entry.mnt_dir); - - if (virFileReadAll(contFile, 1024 * 1024, &contStr) < 0) - goto cleanup; - - if (STREQ(contStr, "")) - continue; - ret = true; break; } - cleanup: VIR_FORCE_FCLOSE(mounts); return ret; } -- 2.35.3

On 10/23/22 14:08, Eric van Blokland wrote:
systemd in hybrid mode uses v1 hierarchies for controllers and v2 for process tracking.
The LXC code uses virCgroupAddMachineProcess() to move processes into appropriate cgroup by manipulating cgroupfs directly. (Note, despite libvirt also supports talking to systemd directly via org.freedesktop.machine1 API.)
If this path is taken, libvirt/lxc must convince systemd that processes really belong to new cgroup, i.e. also the tracking v2 hierarchy must undergo migration too.
The current check would evaluate v2 backend as unavailable with hybrid mode (because there are no available controllers). Simplify the condition and consider the mounted cgroup2 as sufficient to touch v2 hierarchy.
This consequently creates an issue with binding the V2 mount. In hybrid mode the V2 filesystem may be mounted upon the V1 filesystem. By reversing the order in which backends are mounted in virCgroupBindMount this problem is circumvented.
Fixes: #182
We prefer full URL because it enables to just click the link when viewing git log. Also, v3? Somehow I missed v2 and v1 :-D
Signed-off-by: Eric van Blokland <mail@ericvanblokland.nl> --- src/util/vircgroup.c | 8 +++++--- src/util/vircgroupv2.c | 12 ------------ 2 files changed, 5 insertions(+), 15 deletions(-)
diff --git a/src/util/vircgroup.c b/src/util/vircgroup.c index a6a409af3d..48fbcf625a 100644 --- a/src/util/vircgroup.c +++ b/src/util/vircgroup.c @@ -2924,9 +2924,11 @@ virCgroupBindMount(virCgroup *group, const char *oldroot, size_t i; virCgroup *parent = virCgroupGetNested(group);
- for (i = 0; i < VIR_CGROUP_BACKEND_TYPE_LAST; i++) { - if (parent->backends[i] && - parent->backends[i]->bindMount(parent, oldroot, mountopts) < 0) { + /* In hybrid environments, V2 may be mounted over V1. + * Mount the backends in reverse order. */ + for (i = 1; i <= VIR_CGROUP_BACKEND_TYPE_LAST; i++) { + if (parent->backends[VIR_CGROUP_BACKEND_TYPE_LAST - i] && + parent->backends[VIR_CGROUP_BACKEND_TYPE_LAST - i]->bindMount(parent, oldroot, mountopts) < 0) {
This works, but I'd rather have the variable count down. For that we'd need to change the type of it from size_t to ssize_t but that's okay.
return -1; } } diff --git a/src/util/vircgroupv2.c b/src/util/vircgroupv2.c index 4c110940cf..0e0c61d466 100644 --- a/src/util/vircgroupv2.c +++ b/src/util/vircgroupv2.c @@ -75,22 +75,10 @@ virCgroupV2Available(void) if (STRNEQ(entry.mnt_type, "cgroup2")) continue;
- /* Systemd uses cgroup v2 for process tracking but no controller is - * available. We should consider this configuration as cgroup v2 is - * not available. */ - contFile = g_strdup_printf("%s/cgroup.controllers", entry.mnt_dir); - - if (virFileReadAll(contFile, 1024 * 1024, &contStr) < 0) - goto cleanup; - - if (STREQ(contStr, "")) - continue; - ret = true; break; }
- cleanup: VIR_FORCE_FCLOSE(mounts); return ret; }
I'm fixing the small nit in the first hunk, cleaning up the commit message and merging. Congratulations on your first libvirt contribution! Reviewed-by: Michal Privoznik <mprivozn@redhat.com> Michal

On Sun, Oct 23, 2022 at 02:08:28PM +0200, Eric van Blokland wrote:
systemd in hybrid mode uses v1 hierarchies for controllers and v2 for process tracking.
The LXC code uses virCgroupAddMachineProcess() to move processes into appropriate cgroup by manipulating cgroupfs directly. (Note, despite libvirt also supports talking to systemd directly via org.freedesktop.machine1 API.)
If this path is taken, libvirt/lxc must convince systemd that processes really belong to new cgroup, i.e. also the tracking v2 hierarchy must undergo migration too.
The current check would evaluate v2 backend as unavailable with hybrid mode (because there are no available controllers). Simplify the condition and consider the mounted cgroup2 as sufficient to touch v2 hierarchy.
This consequently creates an issue with binding the V2 mount. In hybrid mode the V2 filesystem may be mounted upon the V1 filesystem. By reversing the order in which backends are mounted in virCgroupBindMount this problem is circumvented.
Fixes: #182 Signed-off-by: Eric van Blokland <mail@ericvanblokland.nl> --- src/util/vircgroup.c | 8 +++++--- src/util/vircgroupv2.c | 12 ------------ 2 files changed, 5 insertions(+), 15 deletions(-)
diff --git a/src/util/vircgroup.c b/src/util/vircgroup.c index a6a409af3d..48fbcf625a 100644 --- a/src/util/vircgroup.c +++ b/src/util/vircgroup.c @@ -2924,9 +2924,11 @@ virCgroupBindMount(virCgroup *group, const char *oldroot, size_t i; virCgroup *parent = virCgroupGetNested(group);
- for (i = 0; i < VIR_CGROUP_BACKEND_TYPE_LAST; i++) { - if (parent->backends[i] && - parent->backends[i]->bindMount(parent, oldroot, mountopts) < 0) { + /* In hybrid environments, V2 may be mounted over V1. + * Mount the backends in reverse order. */
I don't understand what you mean by mounted over?
+ for (i = 1; i <= VIR_CGROUP_BACKEND_TYPE_LAST; i++) { + if (parent->backends[VIR_CGROUP_BACKEND_TYPE_LAST - i] && + parent->backends[VIR_CGROUP_BACKEND_TYPE_LAST - i]->bindMount(parent, oldroot, mountopts) < 0) { return -1; } } diff --git a/src/util/vircgroupv2.c b/src/util/vircgroupv2.c index 4c110940cf..0e0c61d466 100644 --- a/src/util/vircgroupv2.c +++ b/src/util/vircgroupv2.c @@ -75,22 +75,10 @@ virCgroupV2Available(void) if (STRNEQ(entry.mnt_type, "cgroup2")) continue;
- /* Systemd uses cgroup v2 for process tracking but no controller is - * available. We should consider this configuration as cgroup v2 is - * not available. */ - contFile = g_strdup_printf("%s/cgroup.controllers", entry.mnt_dir); - - if (virFileReadAll(contFile, 1024 * 1024, &contStr) < 0) - goto cleanup; - - if (STREQ(contStr, "")) - continue; -
I don't like this at all and IMO this is incorrect fix of the issue you are trying to address. In hybrid mode with systemd the cgroup v2 controller is not a real controller. It's something systemd uses for process tracking and some other features. It is owned by systemd and we should not touch it directly at all. We need to use proper systemd APIs to make any changes to that directory or if needed ask systemd to create cgroup with Delegate=yes which in this case is probably also not the correct approach. I know it was already pushed but I'll most likely revert this patch and we should find better and proper solution. Pavel
ret = true; break; }
- cleanup: VIR_FORCE_FCLOSE(mounts); return ret; } -- 2.35.3

On 10/24/22 13:54, Pavel Hrdina wrote:
On Sun, Oct 23, 2022 at 02:08:28PM +0200, Eric van Blokland wrote:
systemd in hybrid mode uses v1 hierarchies for controllers and v2 for process tracking.
The LXC code uses virCgroupAddMachineProcess() to move processes into appropriate cgroup by manipulating cgroupfs directly. (Note, despite libvirt also supports talking to systemd directly via org.freedesktop.machine1 API.)
If this path is taken, libvirt/lxc must convince systemd that processes really belong to new cgroup, i.e. also the tracking v2 hierarchy must undergo migration too.
The current check would evaluate v2 backend as unavailable with hybrid mode (because there are no available controllers). Simplify the condition and consider the mounted cgroup2 as sufficient to touch v2 hierarchy.
This consequently creates an issue with binding the V2 mount. In hybrid mode the V2 filesystem may be mounted upon the V1 filesystem. By reversing the order in which backends are mounted in virCgroupBindMount this problem is circumvented.
Fixes: #182 Signed-off-by: Eric van Blokland <mail@ericvanblokland.nl> --- src/util/vircgroup.c | 8 +++++--- src/util/vircgroupv2.c | 12 ------------ 2 files changed, 5 insertions(+), 15 deletions(-)
diff --git a/src/util/vircgroup.c b/src/util/vircgroup.c index a6a409af3d..48fbcf625a 100644 --- a/src/util/vircgroup.c +++ b/src/util/vircgroup.c @@ -2924,9 +2924,11 @@ virCgroupBindMount(virCgroup *group, const char *oldroot, size_t i; virCgroup *parent = virCgroupGetNested(group);
- for (i = 0; i < VIR_CGROUP_BACKEND_TYPE_LAST; i++) { - if (parent->backends[i] && - parent->backends[i]->bindMount(parent, oldroot, mountopts) < 0) { + /* In hybrid environments, V2 may be mounted over V1. + * Mount the backends in reverse order. */
I don't understand what you mean by mounted over?
+ for (i = 1; i <= VIR_CGROUP_BACKEND_TYPE_LAST; i++) { + if (parent->backends[VIR_CGROUP_BACKEND_TYPE_LAST - i] && + parent->backends[VIR_CGROUP_BACKEND_TYPE_LAST - i]->bindMount(parent, oldroot, mountopts) < 0) { return -1; } } diff --git a/src/util/vircgroupv2.c b/src/util/vircgroupv2.c index 4c110940cf..0e0c61d466 100644 --- a/src/util/vircgroupv2.c +++ b/src/util/vircgroupv2.c @@ -75,22 +75,10 @@ virCgroupV2Available(void) if (STRNEQ(entry.mnt_type, "cgroup2")) continue;
- /* Systemd uses cgroup v2 for process tracking but no controller is - * available. We should consider this configuration as cgroup v2 is - * not available. */ - contFile = g_strdup_printf("%s/cgroup.controllers", entry.mnt_dir); - - if (virFileReadAll(contFile, 1024 * 1024, &contStr) < 0) - goto cleanup; - - if (STREQ(contStr, "")) - continue; -
I don't like this at all and IMO this is incorrect fix of the issue you are trying to address. In hybrid mode with systemd the cgroup v2 controller is not a real controller. It's something systemd uses for process tracking and some other features. It is owned by systemd and we should not touch it directly at all. We need to use proper systemd APIs to make any changes to that directory or if needed ask systemd to create cgroup with Delegate=yes which in this case is probably also not the correct approach.
Maybe the proper way is to have systemd developers explain to us how the hybrid mode should be handled. Because we came up with more solutions but neither was good (enough). Either the cgroup hierarchy works for qemu driver, or lxc driver or machined but never for all three at the same time. Alternatively, we may just refuse to use cgroups in hybrid mode which even systemd maintainer admitted was a step sideways [1]. Michal 1: https://github.com/systemd/systemd/issues/10107#issuecomment-424028793

Hello. (Sorry for a stitched mail below, I'm not subscribed to the ML so this is what I got from public archives. Please, keep me Cced.) On 10/24/22 13:54, Pavel Hrdina wrote:
I don't like this at all and IMO this is incorrect fix of the issue you are trying to address. In hybrid mode with systemd the cgroup v2 controller is not a real controller. It's something systemd uses for process tracking and some other features. It is owned by systemd and we should not touch it directly at all.
I very much agree with this. Despite that I suggested the posted patch [1] because libvirt/lxc apparently violates systemd rules of exclusive access to cgroup hierarchies already. (At least that's how I understand the migration between libvirtd.service and target machine.slice/.../*.scope.) The patch is meant as a quick (and admittedly dirty) fix to the issue that users have been reporting with libvirt/lxc in the hybrid mode for several months.
We need to use proper systemd APIs to make any changes to that directory or if needed ask systemd to create cgroup with Delegate=yes which in this case is probably also not the correct approach.
Yes, as I wrote in the commit message, there seems to be some inconsistency in what libvirt core vs libvirt/lxc uses to access cgroups. Using always systemd API would seem most consistent but I'd retain the current patch as a workaround provided it doesn't break things more (than the current state). On 10/24/22 13:28, Pavel Hrdina wrote:
What I've seen is that hybrid systemd environments have:
/sys/fs/cgroup/unified /sys/fs/cgroup/memory /sys/fs/cgroup/blkio ...
but it doesn't mean that it is mounted in V1 cgroup filesystem.
In this case the /sys/fs/cgroup is tmpfs filesystem and there is nothing mounted over V1.
As a more detailed explanation why the reversed order works: It relies on the tmpfs mount over the ro-path in the v1 virCgroupBindMount so that ../unified mount directory can be created for the v2 mount. (IIUC, this is related to the environment inside the container where libvirt/lxc attempts to prepare hybrid-like cgroup setup.) HTH, Michal [1] v1 in https://bugzilla.opensuse.org/show_bug.cgi?id=1183247#c35

On Mon, Oct 24, 2022 at 05:52:04PM +0200, Michal Koutný wrote:
Hello.
(Sorry for a stitched mail below, I'm not subscribed to the ML so this is what I got from public archives. Please, keep me Cced.)
On 10/24/22 13:54, Pavel Hrdina wrote:
I don't like this at all and IMO this is incorrect fix of the issue you are trying to address. In hybrid mode with systemd the cgroup v2 controller is not a real controller. It's something systemd uses for process tracking and some other features. It is owned by systemd and we should not touch it directly at all.
I very much agree with this. Despite that I suggested the posted patch [1] because libvirt/lxc apparently violates systemd rules of exclusive access to cgroup hierarchies already. (At least that's how I understand the migration between libvirtd.service and target machine.slice/.../*.scope.) The patch is meant as a quick (and admittedly dirty) fix to the issue that users have been reporting with libvirt/lxc in the hybrid mode for several months.
Unfortunately this breaks a lot of things and we cannot use it even as workaround. Libvirt code has few assumptions as that's how unified mode works and when the cgroups v2 backend is enabled for hybrid topology we assume that cpuacct, devices controllers are enabled. But in the case of hybrid topology they are enabled by cgroups v1. Due to this change we no longer limit access to devices and cpu accounting no longer works.
We need to use proper systemd APIs to make any changes to that directory or if needed ask systemd to create cgroup with Delegate=yes which in this case is probably also not the correct approach.
Yes, as I wrote in the commit message, there seems to be some inconsistency in what libvirt core vs libvirt/lxc uses to access cgroups. Using always systemd API would seem most consistent but I'd retain the current patch as a workaround provided it doesn't break things more (than the current state).
On 10/24/22 13:28, Pavel Hrdina wrote:
What I've seen is that hybrid systemd environments have:
/sys/fs/cgroup/unified /sys/fs/cgroup/memory /sys/fs/cgroup/blkio ...
but it doesn't mean that it is mounted in V1 cgroup filesystem.
In this case the /sys/fs/cgroup is tmpfs filesystem and there is nothing mounted over V1.
As a more detailed explanation why the reversed order works: It relies on the tmpfs mount over the ro-path in the v1 virCgroupBindMount so that ../unified mount directory can be created for the v2 mount.
(IIUC, this is related to the environment inside the container where libvirt/lxc attempts to prepare hybrid-like cgroup setup.)
I did a bit of digging and now I see what the issue is. The commit message wording and explanation here was not enough for me to understand the problem. In the virCgroupV1BindMount we have a code to create /sys/fs/cgroup directory before we start mounting the controllers, however, in the virCgroupV2BindMount there is no code to do that as cgroups v2 in unified mode are mounted directly at /sys/fs/cgroup. But with the hybrid mode this doesn't work because we try to mount /sys/fs/cgroup/unified without the /sys/fs/cgroup directory existing and it fails. This workaround works but I would still rather fix the root of the issue by modifying both functions where they would create the necessary directory only if it doesn't exist so there is no need to care about the call order. I'm going to send a patch to revert this one and we need to find yet another solution to the issue. Just a note for future patches, single patch should fix only one issue so this patch should have been split into two patches. Pavel
HTH, Michal
[1] v1 in https://bugzilla.opensuse.org/show_bug.cgi?id=1183247#c35

Hello. On Tue, Oct 25, 2022 at 11:33:56AM +0200, Pavel Hrdina <phrdina@redhat.com> wrote:
Unfortunately this breaks a lot of things and we cannot use it even as workaround. Libvirt code has few assumptions as that's how unified mode works and when the cgroups v2 backend is enabled for hybrid topology we assume that cpuacct, devices controllers are enabled.
To be on the same side -- there's no cpuacct and devices on cgroup v2 (regardless of hybrid or unified mode).
But in the case of hybrid topology they are enabled by cgroups v1. Due to this change we no longer limit access to devices and cpu accounting no longer works.
Shouldn't v1 backend still carry that out? (On a higher level -- I understand that having both backends enabled for the hybrid mode is what's not working well.)
I'm going to send a patch to revert this one and we need to find yet another solution to the issue.
BTW do you have any pointers (a commit) when the hybrid mode could have become broken for libvirt/lxc? (I estimate from the (lack of) user reports that it must have worked at the time of libvirt 4.0.) Thanks, Michal

On Tue, Oct 25, 2022 at 02:31:21PM +0200, Michal Koutný wrote:
Hello.
On Tue, Oct 25, 2022 at 11:33:56AM +0200, Pavel Hrdina <phrdina@redhat.com> wrote:
Unfortunately this breaks a lot of things and we cannot use it even as workaround. Libvirt code has few assumptions as that's how unified mode works and when the cgroups v2 backend is enabled for hybrid topology we assume that cpuacct, devices controllers are enabled.
To be on the same side -- there's no cpuacct and devices on cgroup v2 (regardless of hybrid or unified mode).
From cgroup POV yes there is no cpuacct and devices controller on cgroup v2, but the functionality of cpuacct is provided directly by cgroup v2 and for devices there is BPF.
But in the case of hybrid topology they are enabled by cgroups v1. Due to this change we no longer limit access to devices and cpu accounting no longer works.
Shouldn't v1 backend still carry that out? (On a higher level -- I understand that having both backends enabled for the hybrid mode is what's not working well.)
Exactly, there is the same ordering issue, if cgroup v2 backend is enabled it is the only one used for cpuacct and devices within libvirt code.
I'm going to send a patch to revert this one and we need to find yet another solution to the issue.
BTW do you have any pointers (a commit) when the hybrid mode could have become broken for libvirt/lxc? (I estimate from the (lack of) user reports that it must have worked at the time of libvirt 4.0.)
Only this specific hybrid mode of systemd which is cgroup v1 for everything and unified for process tracking. As I've already mentioned in previous reply, to fix the process tracking we should ideally use org.freedesktop.systemd1.Manager.AttachProcessesToUnit dbus call to add the process into the cgroup and to fix the other issue with bind mount we need to modify both cgroup v1 and v2 code in libvirt to create the necessary directory if it doesn't exist. Pavel
Thanks, Michal

On Tue, Oct 25, 2022 at 02:49:35PM +0200, Pavel Hrdina <phrdina@redhat.com> wrote:
Exactly, there is the same ordering issue, if cgroup v2 backend is enabled it is the only one used for cpuacct and devices within libvirt code.
Aha, I see what's the issue with the approach now.
BTW do you have any pointers (a commit) when the hybrid mode could have become broken for libvirt/lxc? (I estimate from the (lack of) user reports that it must have worked at the time of libvirt 4.0.)
Only this specific hybrid mode of systemd which is cgroup v1 for everything and unified for process tracking.
I mean we've had the hybrid mode since Leap 15.0 and apparently there were no (reported) issues of this kind (libvirt/lxc) until Leap 15.3 (which roughly translates to libvirt >= 7.1.0). So it must have worked somehow and something changed meanwhile (move to use the systemd API partially to manage cgroups?).
As I've already mentioned in previous reply, to fix the process tracking we should ideally use
org.freedesktop.systemd1.Manager.AttachProcessesToUnit
dbus call to add the process into the cgroup
This is a rather low-level (and as you mentioned internal purposes) API. It should primarily use org.freedesktop.machine1.Manager.CreateMachine() or org.freedesktop.systemd1.Manager.StartTransientUnit() if machined is unwanted. These calls would ensure that processes are migrated in all hierarchies there are (i.e. v2 and all v1s in hybrid mode) by systemd. I thought that recent libvirt (8.7.0 manifests the bug on hybrid too) would only use the systemd API. Do I get it correctly that it only applies to the libvirt "core" and then libvirt/lxc needs to place some more processes together with the leader process? (Which currently works only with full v1 or full v2 mode by direct cgroupfs manipulation.) Thanks, Michal
participants (4)
-
Eric van Blokland
-
Michal Koutný
-
Michal Prívozník
-
Pavel Hrdina