Re: [libvirt PATCH v3] cgroup/LXC: Do not condition availability of v2 by controllers

25 Oct 2022

      On Mon, Oct 24, 2022 at 05:52:04PM +0200, Michal Koutný wrote:
...
Hello.
(Sorry for a stitched mail below, I'm not subscribed to the ML so this is
what I got from public archives. Please, keep me Cced.)
On 10/24/22 13:54, Pavel Hrdina wrote:
...
I don't like this at all and IMO this is incorrect fix of the issue you
are trying to address. In hybrid mode with systemd the cgroup v2
controller is not a real controller. It's something systemd uses for
process tracking and some other features. It is owned by systemd and we
should not touch it directly at all.
I very much agree with this. Despite that I suggested the posted patch
[1] because libvirt/lxc apparently violates systemd rules of exclusive
access to cgroup hierarchies already. (At least that's how I understand
the migration between libvirtd.service and target
machine.slice/.../*.scope.) The patch is meant as a quick (and
admittedly dirty) fix to the issue that users have been reporting with
libvirt/lxc in the hybrid mode for several months.
Unfortunately this breaks a lot of things and we cannot use it even as
workaround. Libvirt code has few assumptions as that's how unified mode
works and when the cgroups v2 backend is enabled for hybrid topology we
assume that cpuacct, devices controllers are enabled. But in the case of
hybrid topology they are enabled by cgroups v1. Due to this change we no
longer limit access to devices and cpu accounting no longer works.
...
...
We need to use proper systemd APIs to make any changes to that
directory or if needed ask systemd to create cgroup with Delegate=yes
which in this case is probably also not the correct approach.
Yes, as I wrote in the commit message, there seems to be some
inconsistency in what libvirt core vs libvirt/lxc uses to access
cgroups. Using always systemd API would seem most consistent but I'd
retain the current patch as a workaround provided it doesn't break
things more (than the current state).
On 10/24/22 13:28, Pavel Hrdina wrote:
...
What I've seen is that hybrid systemd environments have:
/sys/fs/cgroup/unified
    /sys/fs/cgroup/memory
    /sys/fs/cgroup/blkio
    ...
but it doesn't mean that it is mounted in V1 cgroup filesystem.
In this case the /sys/fs/cgroup is tmpfs filesystem and there is nothing
mounted over V1.
As a more detailed explanation why the reversed order works:
It relies on the tmpfs mount over the ro-path in the v1
virCgroupBindMount so that ../unified mount directory can be created
for the v2 mount.
(IIUC, this is related to the environment inside the container where
libvirt/lxc attempts to prepare hybrid-like cgroup setup.)
I did a bit of digging and now I see what the issue is. The commit
message wording and explanation here was not enough for me to understand
the problem.

In the virCgroupV1BindMount we have a code to create /sys/fs/cgroup
directory before we start mounting the controllers, however, in the
virCgroupV2BindMount there is no code to do that as cgroups v2 in
unified mode are mounted directly at /sys/fs/cgroup.

But with the hybrid mode this doesn't work because we try to mount
/sys/fs/cgroup/unified without the /sys/fs/cgroup directory existing
and it fails.

This workaround works but I would still rather fix the root of the issue
by modifying both functions where they would create the necessary
directory only if it doesn't exist so there is no need to care about the
call order.

I'm going to send a patch to revert this one and we need to find yet
another solution to the issue.

Just a note for future patches, single patch should fix only one issue
so this patch should have been split into two patches.

Pavel
...
HTH,
Michal
[1] v1 in https://bugzilla.opensuse.org/show_bug.cgi?id=1183247#c35