[PATCH 0/1] vircgroup: Fix nested cgroup on cleanup

newer
[PATCH 1/1] Signed-off-by: Jiadong...

older
[libvirt PATCH 0/2] Minor fixes to...

Eric Farman

26 Mar 2021 26 Mar '21

5:25 p.m.

Hi Pavel, et al, Running Fedora 33 KVM/QEMU on s390x, I recently noticed a couple of oddities when shutting down my guests, which I bisected between 7.0.0 and 7.1.0 to your commit: commit 184245f53b94fc84f727eb6e8a2aa52df02d69c0 Author: Pavel Hrdina <phrdina@redhat.com> Date: Tue Feb 9 12:33:53 2021 +0100 vircgroup: introduce nested cgroup to properly work with systemd The attached patch is based on some of the tweaks you did to use the parent/nested pointers in a cgroup, rather than a cgroup itself. It fixes both problems I'm encountering, but as this code is quite unfamiliar to me, I might well be way off track and would appreciate your feedback in the matter. Apologies for hitting this on the same day as freeze; I had not noticed the messages previously, and started digging yesterday when I noted the number of file descriptors left open by libvirt. Thanks, Eric Eric Farman (1): vircgroup: Cleanup nested cgroups src/util/vircgroup.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) -- 2.25.1

Show replies by date

Eric Farman

26 Mar 26 Mar

5:25 p.m.

New subject: [PATCH 1/1] vircgroup: Cleanup nested cgroups

The introduction of nested cgroups used a little macro virCgroupGetNested() to retrieve the nested cgroup pointer, if one exists. But this macro isn't used when removing cgroups, resulting in some messages: Mar 25 20:55:17 fedora33 libvirtd[955]: unable to open '/sys/fs/cgroup/machine.slice/machine-qemu\x2d1\x2dguest.scope/': No such file or directory Mar 25 20:55:17 fedora33 libvirtd[955]: Failed to remove cgroup for guest That directory exists while the guest is running, as it was created by systemd/machined, so the code probably meant to open the libvirt/ subdirectory from that point. Similarly, there happen to be BPF-related file descriptors that don't get cleaned up in this process too, because they are anchored off the nested cgroup location: [test@fedora33 ~]# ls /proc/$(pgrep libvirtd)/fd/* | wc -l 35 [test@fedora33 ~]# virsh create guest.xml Domain 'guest' created from guest.xml [test@fedora33 ~]# ls /proc/$(pgrep libvirtd)/fd/* | wc -l 42 [test@fedora33 ~]# virsh shutdown guest Domain 'guest' is being shutdown [test@fedora33 ~]# ls /proc/$(pgrep libvirtd)/fd/* | wc -l 37 [test@fedora33 ~]# virsh create guest.xml Domain 'guest' created from guest.xml [test@fedora33 ~]# ls /proc/$(pgrep libvirtd)/fd/* | wc -l 44 [test@fedora33 ~]# virsh shutdown guest Domain 'guest' is being shutdown [test@fedora33 ~]# ls /proc/$(pgrep libvirtd)/fd/* | wc -l 39 Let's fix this by using the same macro when removing cgroups, so that it picks up the right structure and can remove the associated resources properly. Fixes: 184245f53b94 ("vircgroup: introduce nested cgroup to properly work with systemd") Signed-off-by: Eric Farman <farman@linux.ibm.com> --- src/util/vircgroup.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/src/util/vircgroup.c b/src/util/vircgroup.c index 5b6097c335..e606cbfe0b 100644 --- a/src/util/vircgroup.c +++ b/src/util/vircgroup.c @@ -2613,11 +2613,12 @@ virCgroupRemoveRecursively(char *grppath) int virCgroupRemove(virCgroupPtr group) { + virCgroupPtr parent = virCgroupGetNested(group); size_t i; for (i = 0; i < VIR_CGROUP_BACKEND_TYPE_LAST; i++) { - if (group->backends[i]) { - int rc = group->backends[i]->remove(group); + if (parent->backends[i]) { + int rc = parent->backends[i]->remove(parent); if (rc < 0) return rc; } -- 2.25.1

Pavel Hrdina

7 Apr 7 Apr

3:07 p.m.

New subject: [PATCH 1/1] vircgroup: Cleanup nested cgroups

On Fri, Mar 26, 2021 at 05:25:03PM +0100, Eric Farman wrote:

...

The introduction of nested cgroups used a little macro virCgroupGetNested() to retrieve the nested cgroup pointer, if one exists. But this macro isn't used when removing cgroups, resulting in some messages:

Mar 25 20:55:17 fedora33 libvirtd[955]: unable to open '/sys/fs/cgroup/machine.slice/machine-qemu\x2d1\x2dguest.scope/': No such file or directory Mar 25 20:55:17 fedora33 libvirtd[955]: Failed to remove cgroup for guest

That directory exists while the guest is running, as it was created by systemd/machined, so the code probably meant to open the libvirt/ subdirectory from that point.

Similarly, there happen to be BPF-related file descriptors that don't get cleaned up in this process too, because they are anchored off the nested cgroup location:

[test@fedora33 ~]# ls /proc/$(pgrep libvirtd)/fd/* | wc -l 35 [test@fedora33 ~]# virsh create guest.xml Domain 'guest' created from guest.xml

[test@fedora33 ~]# ls /proc/$(pgrep libvirtd)/fd/* | wc -l 42 [test@fedora33 ~]# virsh shutdown guest Domain 'guest' is being shutdown

[test@fedora33 ~]# ls /proc/$(pgrep libvirtd)/fd/* | wc -l 37 [test@fedora33 ~]# virsh create guest.xml Domain 'guest' created from guest.xml

[test@fedora33 ~]# ls /proc/$(pgrep libvirtd)/fd/* | wc -l 44 [test@fedora33 ~]# virsh shutdown guest Domain 'guest' is being shutdown

[test@fedora33 ~]# ls /proc/$(pgrep libvirtd)/fd/* | wc -l 39

Let's fix this by using the same macro when removing cgroups, so that it picks up the right structure and can remove the associated resources properly.

Fixes: 184245f53b94 ("vircgroup: introduce nested cgroup to properly work with systemd") Signed-off-by: Eric Farman <farman@linux.ibm.com> --- src/util/vircgroup.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)

I don't thing this patch is correct. With systemd we would get the same error without the nested cgroup as well. It's because we terminate the qemu process which makes systemd remove the VM root cgroup as well. This happens only on cgroup controllers managed by systemd. For example if you switch to cgroups v1 where each controller is in separate directory not all controllers supported by libvirt are also supported by systemd. In this case libvirt creates all the cgroups by itself and is responsible to cleanup as well. With this patch we would not remove the VM root cgroups in these controllers. This would affect following controllers: cpuset freezer net_cls net_prio perf_event You can verify what happens with cgroups v1 by adding systemd.unified_cgroup_hierarchy=0 to your kernel cmdline. Pavel

Eric Farman

8 Apr 8 Apr

5:28 a.m.

New subject: [PATCH 1/1] vircgroup: Cleanup nested cgroups

On 4/7/21 9:07 AM, Pavel Hrdina wrote:

...

On Fri, Mar 26, 2021 at 05:25:03PM +0100, Eric Farman wrote:

...
The introduction of nested cgroups used a little macro virCgroupGetNested() to retrieve the nested cgroup pointer, if one exists. But this macro isn't used when removing cgroups, resulting in some messages:

Mar 25 20:55:17 fedora33 libvirtd[955]: unable to open '/sys/fs/cgroup/machine.slice/machine-qemu\x2d1\x2dguest.scope/': No such file or directory Mar 25 20:55:17 fedora33 libvirtd[955]: Failed to remove cgroup for guest

That directory exists while the guest is running, as it was created by systemd/machined, so the code probably meant to open the libvirt/ subdirectory from that point.

Similarly, there happen to be BPF-related file descriptors that don't get cleaned up in this process too, because they are anchored off the nested cgroup location:

[test@fedora33 ~]# ls /proc/$(pgrep libvirtd)/fd/* | wc -l 35 [test@fedora33 ~]# virsh create guest.xml Domain 'guest' created from guest.xml

[test@fedora33 ~]# ls /proc/$(pgrep libvirtd)/fd/* | wc -l 42 [test@fedora33 ~]# virsh shutdown guest Domain 'guest' is being shutdown

[test@fedora33 ~]# ls /proc/$(pgrep libvirtd)/fd/* | wc -l 37 [test@fedora33 ~]# virsh create guest.xml Domain 'guest' created from guest.xml

[test@fedora33 ~]# ls /proc/$(pgrep libvirtd)/fd/* | wc -l 44 [test@fedora33 ~]# virsh shutdown guest Domain 'guest' is being shutdown

[test@fedora33 ~]# ls /proc/$(pgrep libvirtd)/fd/* | wc -l 39

Let's fix this by using the same macro when removing cgroups, so that it picks up the right structure and can remove the associated resources properly.

Fixes: 184245f53b94 ("vircgroup: introduce nested cgroup to properly work with systemd") Signed-off-by: Eric Farman <farman@linux.ibm.com> --- src/util/vircgroup.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)

I don't thing this patch is correct. With systemd we would get the same error without the nested cgroup as well. It's because we terminate the qemu process which makes systemd remove the VM root cgroup as well.

I don't experience any problems reverting the blamed patch. The qemu cgroup code doesn't make any distinction about systemd or not; it just calls the virCgroupRemove() to clean up the resources that were set up here during init: qemuInitCgroup() virCgroupNewMachine() virCgroupNewMachineSystemd() virCgroupNewNested() The group pointer that's stashed in qemu's struct is that of the machine-qemu...scope group, rather than the nested group, but nothing in the cleanup path touches group->nested. My initial patch is certainly flawed (as you explain below), so maybe something like this is better? @@ -2615,6 +2615,9 @@ virCgroupRemove(virCgroupPtr group) { size_t i; + if (group->nested) + virCgroupRemove(group->nested); + for (i = 0; i < VIR_CGROUP_BACKEND_TYPE_LAST; i++) { if (group->backends[i]) { int rc = group->backends[i]->remove(group); Not great, since it cleans up the nested group but then still attempts to clean up the machine-qemu...scope group that was setup by systemd. This group wasn't setup by virCgroupV2MakeGroup(), so calling virCgroupV2Remove() seems wrong too. Not sure how to address that.

...

This happens only on cgroup controllers managed by systemd. For example if you switch to cgroups v1 where each controller is in separate directory not all controllers supported by libvirt are also supported by systemd. In this case libvirt creates all the cgroups by itself and is responsible to cleanup as well. With this patch we would not remove the VM root cgroups in these controllers. This would affect following controllers:

cpuset freezer net_cls net_prio perf_event

You can verify what happens with cgroups v1 by adding systemd.unified_cgroup_hierarchy=0 to your kernel cmdline.

Neat, thanks for that tip. Thanks, Eric

...

Pavel

Pavel Hrdina

2:12 p.m.

New subject: [PATCH 1/1] vircgroup: Cleanup nested cgroups

On Wed, Apr 07, 2021 at 11:28:48PM -0400, Eric Farman wrote:

...

On 4/7/21 9:07 AM, Pavel Hrdina wrote:

...
On Fri, Mar 26, 2021 at 05:25:03PM +0100, Eric Farman wrote:

...
The introduction of nested cgroups used a little macro virCgroupGetNested() to retrieve the nested cgroup pointer, if one exists. But this macro isn't used when removing cgroups, resulting in some messages:

Mar 25 20:55:17 fedora33 libvirtd[955]: unable to open '/sys/fs/cgroup/machine.slice/machine-qemu\x2d1\x2dguest.scope/': No such file or directory Mar 25 20:55:17 fedora33 libvirtd[955]: Failed to remove cgroup for guest

That directory exists while the guest is running, as it was created by systemd/machined, so the code probably meant to open the libvirt/ subdirectory from that point.

Similarly, there happen to be BPF-related file descriptors that don't get cleaned up in this process too, because they are anchored off the nested cgroup location:

[test@fedora33 ~]# ls /proc/$(pgrep libvirtd)/fd/* | wc -l 35 [test@fedora33 ~]# virsh create guest.xml Domain 'guest' created from guest.xml

[test@fedora33 ~]# ls /proc/$(pgrep libvirtd)/fd/* | wc -l 42 [test@fedora33 ~]# virsh shutdown guest Domain 'guest' is being shutdown

[test@fedora33 ~]# ls /proc/$(pgrep libvirtd)/fd/* | wc -l 37 [test@fedora33 ~]# virsh create guest.xml Domain 'guest' created from guest.xml

[test@fedora33 ~]# ls /proc/$(pgrep libvirtd)/fd/* | wc -l 44 [test@fedora33 ~]# virsh shutdown guest Domain 'guest' is being shutdown

[test@fedora33 ~]# ls /proc/$(pgrep libvirtd)/fd/* | wc -l 39

Let's fix this by using the same macro when removing cgroups, so that it picks up the right structure and can remove the associated resources properly.

Fixes: 184245f53b94 ("vircgroup: introduce nested cgroup to properly work with systemd") Signed-off-by: Eric Farman <farman@linux.ibm.com> --- src/util/vircgroup.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)

I don't thing this patch is correct. With systemd we would get the same error without the nested cgroup as well. It's because we terminate the qemu process which makes systemd remove the VM root cgroup as well.

I don't experience any problems reverting the blamed patch. The qemu cgroup code doesn't make any distinction about systemd or not; it just calls the virCgroupRemove() to clean up the resources that were set up here during init:

qemuInitCgroup() virCgroupNewMachine() virCgroupNewMachineSystemd() virCgroupNewNested()

The group pointer that's stashed in qemu's struct is that of the machine-qemu...scope group, rather than the nested group, but nothing in the cleanup path touches group->nested. My initial patch is certainly flawed (as you explain below), so maybe something like this is better?

@@ -2615,6 +2615,9 @@ virCgroupRemove(virCgroupPtr group) { size_t i;

+ if (group->nested) + virCgroupRemove(group->nested); + for (i = 0; i < VIR_CGROUP_BACKEND_TYPE_LAST; i++) { if (group->backends[i]) { int rc = group->backends[i]->remove(group);

Not great, since it cleans up the nested group but then still attempts to clean up the machine-qemu...scope group that was setup by systemd. This group wasn't setup by virCgroupV2MakeGroup(), so calling virCgroupV2Remove() seems wrong too. Not sure how to address that.

I'm not sure how this will help. As I've already pointed out calling virCgroupRemove(group) results in calling one or both functions: virCgroupV1Remove() virCgroupV2Remove() Where both of these functions will call virCgroupRemoveRecursively() which will recursively remove all subdirectories including the nested one. So the extra if (group->nested) virCgroupRemove(group->nested); will not help with anything. Looking at the code (I did not test it) it looks like the error message is produced by following this path: qemuProcessStop() qemuRemoveCgroup() virCgroupTerminateMachine() this will make systemd to remove the cgroup virCgroupRemove() virCgroupV2Remove() virCgroupV2DevicesRemoveProg() virCgroupV2DevicesDetectProg() open() Unfortunately we cannot simply ignore ENOENT in the virCgroupV2DevicesDetectProg() because it is used in different places where such error should be reported. What we can do is to introduce new parameter `bool quiet` into virCgroupV2DevicesDetectProg() that would silence the ENOENT error if set to true and we could use it from virCgroupV2DevicesRemoveProg() or something similar. Pavel

...

...
This happens only on cgroup controllers managed by systemd. For example if you switch to cgroups v1 where each controller is in separate directory not all controllers supported by libvirt are also supported by systemd. In this case libvirt creates all the cgroups by itself and is responsible to cleanup as well. With this patch we would not remove the VM root cgroups in these controllers. This would affect following controllers:

cpuset freezer net_cls net_prio perf_event

You can verify what happens with cgroups v1 by adding systemd.unified_cgroup_hierarchy=0 to your kernel cmdline.

Neat, thanks for that tip.

Thanks, Eric

...
Pavel

Eric Farman

9 Apr 9 Apr

4 a.m.

New subject: [PATCH 1/1] vircgroup: Cleanup nested cgroups

On 4/8/21 8:12 AM, Pavel Hrdina wrote:

...

On Wed, Apr 07, 2021 at 11:28:48PM -0400, Eric Farman wrote:

...
On 4/7/21 9:07 AM, Pavel Hrdina wrote:

...
On Fri, Mar 26, 2021 at 05:25:03PM +0100, Eric Farman wrote:

...
The introduction of nested cgroups used a little macro virCgroupGetNested() to retrieve the nested cgroup pointer, if one exists. But this macro isn't used when removing cgroups, resulting in some messages:

Mar 25 20:55:17 fedora33 libvirtd[955]: unable to open '/sys/fs/cgroup/machine.slice/machine-qemu\x2d1\x2dguest.scope/': No such file or directory Mar 25 20:55:17 fedora33 libvirtd[955]: Failed to remove cgroup for guest

That directory exists while the guest is running, as it was created by systemd/machined, so the code probably meant to open the libvirt/ subdirectory from that point.

Similarly, there happen to be BPF-related file descriptors that don't get cleaned up in this process too, because they are anchored off the nested cgroup location:

[test@fedora33 ~]# ls /proc/$(pgrep libvirtd)/fd/* | wc -l 35 [test@fedora33 ~]# virsh create guest.xml Domain 'guest' created from guest.xml

[test@fedora33 ~]# ls /proc/$(pgrep libvirtd)/fd/* | wc -l 42 [test@fedora33 ~]# virsh shutdown guest Domain 'guest' is being shutdown

[test@fedora33 ~]# ls /proc/$(pgrep libvirtd)/fd/* | wc -l 37 [test@fedora33 ~]# virsh create guest.xml Domain 'guest' created from guest.xml

[test@fedora33 ~]# ls /proc/$(pgrep libvirtd)/fd/* | wc -l 44 [test@fedora33 ~]# virsh shutdown guest Domain 'guest' is being shutdown

[test@fedora33 ~]# ls /proc/$(pgrep libvirtd)/fd/* | wc -l 39

Let's fix this by using the same macro when removing cgroups, so that it picks up the right structure and can remove the associated resources properly.

Fixes: 184245f53b94 ("vircgroup: introduce nested cgroup to properly work with systemd") Signed-off-by: Eric Farman <farman@linux.ibm.com> --- src/util/vircgroup.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)

I don't thing this patch is correct. With systemd we would get the same error without the nested cgroup as well. It's because we terminate the qemu process which makes systemd remove the VM root cgroup as well.

I don't experience any problems reverting the blamed patch. The qemu cgroup code doesn't make any distinction about systemd or not; it just calls the virCgroupRemove() to clean up the resources that were set up here during init:

qemuInitCgroup() virCgroupNewMachine() virCgroupNewMachineSystemd() virCgroupNewNested()

The group pointer that's stashed in qemu's struct is that of the machine-qemu...scope group, rather than the nested group, but nothing in the cleanup path touches group->nested. My initial patch is certainly flawed (as you explain below), so maybe something like this is better?

@@ -2615,6 +2615,9 @@ virCgroupRemove(virCgroupPtr group) { size_t i;

+ if (group->nested) + virCgroupRemove(group->nested); + for (i = 0; i < VIR_CGROUP_BACKEND_TYPE_LAST; i++) { if (group->backends[i]) { int rc = group->backends[i]->remove(group);

Not great, since it cleans up the nested group but then still attempts to clean up the machine-qemu...scope group that was setup by systemd. This group wasn't setup by virCgroupV2MakeGroup(), so calling virCgroupV2Remove() seems wrong too. Not sure how to address that.

I'm not sure how this will help. As I've already pointed out calling virCgroupRemove(group) results in calling one or both functions:

virCgroupV1Remove() virCgroupV2Remove()

Where both of these functions will call virCgroupRemoveRecursively() which will recursively remove all subdirectories including the nested one.

So the extra

if (group->nested) virCgroupRemove(group->nested);

will not help with anything.

Looking at the code (I did not test it) it looks like the error message is produced by following this path:

qemuProcessStop() qemuRemoveCgroup() virCgroupTerminateMachine() this will make systemd to remove the cgroup virCgroupRemove() virCgroupV2Remove() virCgroupV2DevicesRemoveProg() virCgroupV2DevicesDetectProg() open()

Yes, this is the path where exit out, and thus never get to the RemoveRecursively routine you mentioned above. But it's not because of the virCgroupTerminateMachine() call, but rather the progfd and mapfd fields in the cgroup passed to virCgroupV2DevicesDetectProg(). With the blamed patch, these fields are zero, so we go ahead and try to do all that other work. Prior to that patch, and with my proposed patch, these fd's are non-zero, and so it exits immediately back to virCgroupV2DevicesRemoveProg() which does a VIR_FORCE_CLOSE on them. The non-zero fd's are related to BPF (see below), and are stashed in that nested cgroup nowadays. (virsh create/shutdown a guest three times) # readlink /proc/$(pgrep libvirtd)/fd/* | grep bpf anon_inode:bpf-map anon_inode:bpf-map anon_inode:bpf-prog anon_inode:bpf-map anon_inode:bpf-prog anon_inode:bpf-prog

...

Unfortunately we cannot simply ignore ENOENT in the virCgroupV2DevicesDetectProg() because it is used in different places where such error should be reported.

What we can do is to introduce new parameter `bool quiet` into virCgroupV2DevicesDetectProg() that would silence the ENOENT error if set to true and we could use it from virCgroupV2DevicesRemoveProg() or something similar.

This smells fishy to me. I tried it, and it does indeed get us to virCgroupRemoveRecursively() later down that callchain. The downside is it is of course called with the same cgroup (grppath="/sys/fs/cgroup/machine.slice/machine-qemu\x2d1\x2dguest.scope/") and returns without doing anything when virDirOpenQuiet() also returns ENOENT. Eric

...

Pavel

...
...
This happens only on cgroup controllers managed by systemd. For example if you switch to cgroups v1 where each controller is in separate directory not all controllers supported by libvirt are also supported by systemd. In this case libvirt creates all the cgroups by itself and is responsible to cleanup as well. With this patch we would not remove the VM root cgroups in these controllers. This would affect following controllers:

cpuset freezer net_cls net_prio perf_event

You can verify what happens with cgroups v1 by adding systemd.unified_cgroup_hierarchy=0 to your kernel cmdline.

Neat, thanks for that tip.

Thanks, Eric

...
Pavel

Pavel Hrdina

12:02 p.m.

New subject: [PATCH 1/1] vircgroup: Cleanup nested cgroups

On Thu, Apr 08, 2021 at 10:00:32PM -0400, Eric Farman wrote:

...

On 4/8/21 8:12 AM, Pavel Hrdina wrote:

...
On Wed, Apr 07, 2021 at 11:28:48PM -0400, Eric Farman wrote:

...
On 4/7/21 9:07 AM, Pavel Hrdina wrote:

...
On Fri, Mar 26, 2021 at 05:25:03PM +0100, Eric Farman wrote:

...
The introduction of nested cgroups used a little macro virCgroupGetNested() to retrieve the nested cgroup pointer, if one exists. But this macro isn't used when removing cgroups, resulting in some messages:

Mar 25 20:55:17 fedora33 libvirtd[955]: unable to open '/sys/fs/cgroup/machine.slice/machine-qemu\x2d1\x2dguest.scope/': No such file or directory Mar 25 20:55:17 fedora33 libvirtd[955]: Failed to remove cgroup for guest

That directory exists while the guest is running, as it was created by systemd/machined, so the code probably meant to open the libvirt/ subdirectory from that point.

Similarly, there happen to be BPF-related file descriptors that don't get cleaned up in this process too, because they are anchored off the nested cgroup location:

[test@fedora33 ~]# ls /proc/$(pgrep libvirtd)/fd/* | wc -l 35 [test@fedora33 ~]# virsh create guest.xml Domain 'guest' created from guest.xml

[test@fedora33 ~]# ls /proc/$(pgrep libvirtd)/fd/* | wc -l 42 [test@fedora33 ~]# virsh shutdown guest Domain 'guest' is being shutdown

[test@fedora33 ~]# ls /proc/$(pgrep libvirtd)/fd/* | wc -l 37 [test@fedora33 ~]# virsh create guest.xml Domain 'guest' created from guest.xml

[test@fedora33 ~]# ls /proc/$(pgrep libvirtd)/fd/* | wc -l 44 [test@fedora33 ~]# virsh shutdown guest Domain 'guest' is being shutdown

[test@fedora33 ~]# ls /proc/$(pgrep libvirtd)/fd/* | wc -l 39

Let's fix this by using the same macro when removing cgroups, so that it picks up the right structure and can remove the associated resources properly.

Fixes: 184245f53b94 ("vircgroup: introduce nested cgroup to properly work with systemd") Signed-off-by: Eric Farman <farman@linux.ibm.com> --- src/util/vircgroup.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)

I don't thing this patch is correct. With systemd we would get the same error without the nested cgroup as well. It's because we terminate the qemu process which makes systemd remove the VM root cgroup as well.

I don't experience any problems reverting the blamed patch. The qemu cgroup code doesn't make any distinction about systemd or not; it just calls the virCgroupRemove() to clean up the resources that were set up here during init:

qemuInitCgroup() virCgroupNewMachine() virCgroupNewMachineSystemd() virCgroupNewNested()

The group pointer that's stashed in qemu's struct is that of the machine-qemu...scope group, rather than the nested group, but nothing in the cleanup path touches group->nested. My initial patch is certainly flawed (as you explain below), so maybe something like this is better?

@@ -2615,6 +2615,9 @@ virCgroupRemove(virCgroupPtr group) { size_t i;

+ if (group->nested) + virCgroupRemove(group->nested); + for (i = 0; i < VIR_CGROUP_BACKEND_TYPE_LAST; i++) { if (group->backends[i]) { int rc = group->backends[i]->remove(group);

Not great, since it cleans up the nested group but then still attempts to clean up the machine-qemu...scope group that was setup by systemd. This group wasn't setup by virCgroupV2MakeGroup(), so calling virCgroupV2Remove() seems wrong too. Not sure how to address that.

I'm not sure how this will help. As I've already pointed out calling virCgroupRemove(group) results in calling one or both functions:

virCgroupV1Remove() virCgroupV2Remove()

Where both of these functions will call virCgroupRemoveRecursively() which will recursively remove all subdirectories including the nested one.

So the extra

if (group->nested) virCgroupRemove(group->nested);

will not help with anything.

Looking at the code (I did not test it) it looks like the error message is produced by following this path:

qemuProcessStop() qemuRemoveCgroup() virCgroupTerminateMachine() this will make systemd to remove the cgroup virCgroupRemove() virCgroupV2Remove() virCgroupV2DevicesRemoveProg() virCgroupV2DevicesDetectProg() open()

Yes, this is the path where exit out, and thus never get to the RemoveRecursively routine you mentioned above. But it's not because of the virCgroupTerminateMachine() call, but rather the progfd and mapfd fields in the cgroup passed to virCgroupV2DevicesDetectProg().

With the blamed patch, these fields are zero, so we go ahead and try to do all that other work. Prior to that patch, and with my proposed patch, these fd's are non-zero, and so it exits immediately back to virCgroupV2DevicesRemoveProg() which does a VIR_FORCE_CLOSE on them. The non-zero fd's are related to BPF (see below), and are stashed in that nested cgroup nowadays.

(virsh create/shutdown a guest three times) # readlink /proc/$(pgrep libvirtd)/fd/* | grep bpf anon_inode:bpf-map anon_inode:bpf-map anon_inode:bpf-prog anon_inode:bpf-map anon_inode:bpf-prog anon_inode:bpf-prog

I still fail to see how calling virCgroupRemove(group->nested) in virCgroupRemove() would help at all. The original issue you mentioned in the commit message is that we log this error: unable to open '/sys/fs/cgroup/machine.slice/machine-qemu\x2d1\x2dguest.scope/': No such file or directory So calling virCgroupRemove(group->nested) would also fail with: unable to open '/sys/fs/cgroup/machine.slice/machine-qemu\x2d1\x2dguest.scope/libvirt/': No such file or directory because if the VM root cgroup ('machine-qemu\x2d1\x2dguest.scope') doesn't exist the nested cgroup ('libvirt') will not exist as well. Now to the progfd and mapfd being zero, that depends whether libvirtd was restarted while the VM was running. When a VM is started on host with cgroups v2 libvirt will create BPF program and BPF map to limit access to system devices and stores both in the mentioned variables. But if you restart libvirtd it will load the BPF prog ID and BPF map ID only on demand, for example when destroying the VM. Now on hosts with systemd the owner of the VM root cgroup is systemd and because we use call talk to systemd-machined and register the VM there the VM root cgroup is created for us by systemd-machined and it is associated with qemu PID. When destroying the VM we will kill the qemu process which will trigger systemd-machined to automatically cleanup the cgroup. Once that happens kernel should eventually cleanup both BPF prog and BPF map that were associated with the nested cgroup because it no longer exist and there are no other references to the prog or map. It may take some time before kernel actually removes the prog and map. The only thing on systemd based host we can do is to skip the whole BPF detection code if the directory was already removed which I've already proposed by handling ENOENT in virCgroupV2DevicesDetectProg(). All of the above applies to systemd based hosts. If we consider system without systemd then there is an actual bug as you already pointed out that the BPF prog and BPF map are now associated with the nested cgroup, to fix that we should change only virCgroupV2Remove: diff --git a/src/util/vircgroupv2.c b/src/util/vircgroupv2.c index 248d4047e5..4664492c34 100644 --- a/src/util/vircgroupv2.c +++ b/src/util/vircgroupv2.c @@ -523,6 +523,7 @@ static int virCgroupV2Remove(virCgroupPtr group) { g_autofree char *grppath = NULL; + virCgroupPtr parent = virCgroupGetNested(group); int controller; /* Don't delete the root group, if we accidentally @@ -534,7 +535,7 @@ virCgroupV2Remove(virCgroupPtr group) if (virCgroupV2PathOfController(group, controller, "", &grppath) < 0) return 0; - if (virCgroupV2DevicesRemoveProg(group) < 0) + if (virCgroupV2DevicesRemoveProg(parent) < 0) return -1; return virCgroupRemoveRecursively(grppath);

...

...
Unfortunately we cannot simply ignore ENOENT in the virCgroupV2DevicesDetectProg() because it is used in different places where such error should be reported.

What we can do is to introduce new parameter `bool quiet` into virCgroupV2DevicesDetectProg() that would silence the ENOENT error if set to true and we could use it from virCgroupV2DevicesRemoveProg() or something similar.

This smells fishy to me. I tried it, and it does indeed get us to virCgroupRemoveRecursively() later down that callchain. The downside is it is of course called with the same cgroup (grppath="/sys/fs/cgroup/machine.slice/machine-qemu\x2d1\x2dguest.scope/") and returns without doing anything when virDirOpenQuiet() also returns ENOENT.

Well that's the only thing we can do if the directories no longer exist. Pavel

...

Eric

...
Pavel

...
...
This happens only on cgroup controllers managed by systemd. For example if you switch to cgroups v1 where each controller is in separate directory not all controllers supported by libvirt are also supported by systemd. In this case libvirt creates all the cgroups by itself and is responsible to cleanup as well. With this patch we would not remove the VM root cgroups in these controllers. This would affect following controllers:

cpuset freezer net_cls net_prio perf_event

You can verify what happens with cgroups v1 by adding systemd.unified_cgroup_hierarchy=0 to your kernel cmdline.

Neat, thanks for that tip.

Thanks, Eric

...
Pavel

Eric Farman

3:13 p.m.

New subject: [PATCH 1/1] vircgroup: Cleanup nested cgroups

On 4/9/21 6:02 AM, Pavel Hrdina wrote:

...

On Thu, Apr 08, 2021 at 10:00:32PM -0400, Eric Farman wrote:

...
On 4/8/21 8:12 AM, Pavel Hrdina wrote:

...
On Wed, Apr 07, 2021 at 11:28:48PM -0400, Eric Farman wrote:

...
On 4/7/21 9:07 AM, Pavel Hrdina wrote:

...
On Fri, Mar 26, 2021 at 05:25:03PM +0100, Eric Farman wrote:

...
The introduction of nested cgroups used a little macro virCgroupGetNested() to retrieve the nested cgroup pointer, if one exists. But this macro isn't used when removing cgroups, resulting in some messages:

Mar 25 20:55:17 fedora33 libvirtd[955]: unable to open '/sys/fs/cgroup/machine.slice/machine-qemu\x2d1\x2dguest.scope/': No such file or directory Mar 25 20:55:17 fedora33 libvirtd[955]: Failed to remove cgroup for guest

That directory exists while the guest is running, as it was created by systemd/machined, so the code probably meant to open the libvirt/ subdirectory from that point.

Similarly, there happen to be BPF-related file descriptors that don't get cleaned up in this process too, because they are anchored off the nested cgroup location:

[test@fedora33 ~]# ls /proc/$(pgrep libvirtd)/fd/* | wc -l 35 [test@fedora33 ~]# virsh create guest.xml Domain 'guest' created from guest.xml

[test@fedora33 ~]# ls /proc/$(pgrep libvirtd)/fd/* | wc -l 42 [test@fedora33 ~]# virsh shutdown guest Domain 'guest' is being shutdown

[test@fedora33 ~]# ls /proc/$(pgrep libvirtd)/fd/* | wc -l 37 [test@fedora33 ~]# virsh create guest.xml Domain 'guest' created from guest.xml

[test@fedora33 ~]# ls /proc/$(pgrep libvirtd)/fd/* | wc -l 44 [test@fedora33 ~]# virsh shutdown guest Domain 'guest' is being shutdown

[test@fedora33 ~]# ls /proc/$(pgrep libvirtd)/fd/* | wc -l 39

Let's fix this by using the same macro when removing cgroups, so that it picks up the right structure and can remove the associated resources properly.

Fixes: 184245f53b94 ("vircgroup: introduce nested cgroup to properly work with systemd") Signed-off-by: Eric Farman <farman@linux.ibm.com> --- src/util/vircgroup.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)

I don't thing this patch is correct. With systemd we would get the same error without the nested cgroup as well. It's because we terminate the qemu process which makes systemd remove the VM root cgroup as well.

I don't experience any problems reverting the blamed patch. The qemu cgroup code doesn't make any distinction about systemd or not; it just calls the virCgroupRemove() to clean up the resources that were set up here during init:

qemuInitCgroup() virCgroupNewMachine() virCgroupNewMachineSystemd() virCgroupNewNested()

The group pointer that's stashed in qemu's struct is that of the machine-qemu...scope group, rather than the nested group, but nothing in the cleanup path touches group->nested. My initial patch is certainly flawed (as you explain below), so maybe something like this is better?

@@ -2615,6 +2615,9 @@ virCgroupRemove(virCgroupPtr group) { size_t i;

+ if (group->nested) + virCgroupRemove(group->nested); + for (i = 0; i < VIR_CGROUP_BACKEND_TYPE_LAST; i++) { if (group->backends[i]) { int rc = group->backends[i]->remove(group);

Not great, since it cleans up the nested group but then still attempts to clean up the machine-qemu...scope group that was setup by systemd. This group wasn't setup by virCgroupV2MakeGroup(), so calling virCgroupV2Remove() seems wrong too. Not sure how to address that.

I'm not sure how this will help. As I've already pointed out calling virCgroupRemove(group) results in calling one or both functions:

virCgroupV1Remove() virCgroupV2Remove()

Where both of these functions will call virCgroupRemoveRecursively() which will recursively remove all subdirectories including the nested one.

So the extra

if (group->nested) virCgroupRemove(group->nested);

will not help with anything.

Looking at the code (I did not test it) it looks like the error message is produced by following this path:

qemuProcessStop() qemuRemoveCgroup() virCgroupTerminateMachine() this will make systemd to remove the cgroup virCgroupRemove() virCgroupV2Remove() virCgroupV2DevicesRemoveProg() virCgroupV2DevicesDetectProg() open()

Yes, this is the path where exit out, and thus never get to the RemoveRecursively routine you mentioned above. But it's not because of the virCgroupTerminateMachine() call, but rather the progfd and mapfd fields in the cgroup passed to virCgroupV2DevicesDetectProg().

With the blamed patch, these fields are zero, so we go ahead and try to do all that other work. Prior to that patch, and with my proposed patch, these fd's are non-zero, and so it exits immediately back to virCgroupV2DevicesRemoveProg() which does a VIR_FORCE_CLOSE on them. The non-zero fd's are related to BPF (see below), and are stashed in that nested cgroup nowadays.

(virsh create/shutdown a guest three times) # readlink /proc/$(pgrep libvirtd)/fd/* | grep bpf anon_inode:bpf-map anon_inode:bpf-map anon_inode:bpf-prog anon_inode:bpf-map anon_inode:bpf-prog anon_inode:bpf-prog

I still fail to see how calling virCgroupRemove(group->nested) in virCgroupRemove() would help at all. The original issue you mentioned in the commit message is that we log this error:

unable to open '/sys/fs/cgroup/machine.slice/machine-qemu\x2d1\x2dguest.scope/': No such file or directory

So calling virCgroupRemove(group->nested) would also fail with:

unable to open '/sys/fs/cgroup/machine.slice/machine-qemu\x2d1\x2dguest.scope/libvirt/': No such file or directory

because if the VM root cgroup ('machine-qemu\x2d1\x2dguest.scope') doesn't exist the nested cgroup ('libvirt') will not exist as well.

While you are correct that the nested cgroup doesn't exist in sysfs, the pointers being handled by libvirt still do since virCgroupFree() hasn't yet been called. This message shows up because the cgroup we are processing (the VM root cgroup) contains zeros for progfd and mapfd, and virCgroupV2DevicesDetectProg() attempts to open the corresponding path that systemd already cleaned up. The nested cgroup ('libvirt') has nonzero file descriptors, so the check at the top of virCgroupV2DevicesDetectProg() would return before attempting to open the nonexistent '/sys/fs/cgroup/.../libvirt/' path. In that case, the message would NOT be generated, and the code will just go back to virCgroupV2DevicesRemoveProg() to close the file descriptors.

...

Now to the progfd and mapfd being zero, that depends whether libvirtd was restarted while the VM was running.

Huh? No, I see the progfd and mapfd fields zero in the VM root cgroup and stored with the BPF file descriptors in the nested ('libvirt') cgroup with your patch. Previously, they were in the VM root cgroup itself. When a VM is started on host

...

with cgroups v2 libvirt will create BPF program and BPF map to limit access to system devices and stores both in the mentioned variables. But if you restart libvirtd it will load the BPF prog ID and BPF map ID only on demand, for example when destroying the VM.

Now on hosts with systemd the owner of the VM root cgroup is systemd and because we use call talk to systemd-machined and register the VM there the VM root cgroup is created for us by systemd-machined and it is associated with qemu PID. When destroying the VM we will kill the qemu process which will trigger systemd-machined to automatically cleanup the cgroup. Once that happens kernel should eventually cleanup both BPF prog and BPF map that were associated with the nested cgroup because it no longer exist and there are no other references to the prog or map. It may take some time before kernel actually removes the prog and map.

This all may be true, but as I said in my commit message libvirt's leaving open two file descriptors for the BPF prog and map that aren't being closed when the guest is shut down. I was attempting to trigger a different bug by doing a virsh create/shutdown in a loop, and eventually my creates would fail because I'd exhausted the number of open files.

...

The only thing on systemd based host we can do is to skip the whole BPF detection code if the directory was already removed which I've already proposed by handling ENOENT in virCgroupV2DevicesDetectProg().

Yes. That removes the message, but doesn't clean up the file descriptors being left open.

...

All of the above applies to systemd based hosts. If we consider system without systemd then there is an actual bug as you already pointed out that the BPF prog and BPF map are now associated with the nested cgroup, to fix that we should change only virCgroupV2Remove:

diff --git a/src/util/vircgroupv2.c b/src/util/vircgroupv2.c index 248d4047e5..4664492c34 100644 --- a/src/util/vircgroupv2.c +++ b/src/util/vircgroupv2.c @@ -523,6 +523,7 @@ static int virCgroupV2Remove(virCgroupPtr group) { g_autofree char *grppath = NULL; + virCgroupPtr parent = virCgroupGetNested(group); int controller;

/* Don't delete the root group, if we accidentally @@ -534,7 +535,7 @@ virCgroupV2Remove(virCgroupPtr group) if (virCgroupV2PathOfController(group, controller, "", &grppath) < 0) return 0;

- if (virCgroupV2DevicesRemoveProg(group) < 0) + if (virCgroupV2DevicesRemoveProg(parent) < 0) return -1;

return virCgroupRemoveRecursively(grppath);

This addresses both symptoms I'm experiencing, but it feels weird to be the only place outside of vircgroup.c that needs to navigate/understand the difference between group and group->nested. This is why I'd proposed that logic in the caller, so it's covered by both v1 and v2, but if it's only the v2 code that needs this then okay. I don't have a non-systemd system nearby to try it with. Eric

...

...
...
Unfortunately we cannot simply ignore ENOENT in the virCgroupV2DevicesDetectProg() because it is used in different places where such error should be reported.

What we can do is to introduce new parameter `bool quiet` into virCgroupV2DevicesDetectProg() that would silence the ENOENT error if set to true and we could use it from virCgroupV2DevicesRemoveProg() or something similar.

This smells fishy to me. I tried it, and it does indeed get us to virCgroupRemoveRecursively() later down that callchain. The downside is it is of course called with the same cgroup (grppath="/sys/fs/cgroup/machine.slice/machine-qemu\x2d1\x2dguest.scope/") and returns without doing anything when virDirOpenQuiet() also returns ENOENT.

Well that's the only thing we can do if the directories no longer exist.

Pavel

...
Eric

...
Pavel

...
...
This happens only on cgroup controllers managed by systemd. For example if you switch to cgroups v1 where each controller is in separate directory not all controllers supported by libvirt are also supported by systemd. In this case libvirt creates all the cgroups by itself and is responsible to cleanup as well. With this patch we would not remove the VM root cgroups in these controllers. This would affect following controllers:

cpuset freezer net_cls net_prio perf_event

You can verify what happens with cgroups v1 by adding systemd.unified_cgroup_hierarchy=0 to your kernel cmdline.

Neat, thanks for that tip.

Thanks, Eric

...
Pavel

Pavel Hrdina

4:14 p.m.

New subject: [PATCH 1/1] vircgroup: Cleanup nested cgroups

On Fri, Apr 09, 2021 at 09:13:32AM -0400, Eric Farman wrote:

...

...
I still fail to see how calling virCgroupRemove(group->nested) in virCgroupRemove() would help at all. The original issue you mentioned in the commit message is that we log this error:

unable to open '/sys/fs/cgroup/machine.slice/machine-qemu\x2d1\x2dguest.scope/': No such file or directory

So calling virCgroupRemove(group->nested) would also fail with:

unable to open '/sys/fs/cgroup/machine.slice/machine-qemu\x2d1\x2dguest.scope/libvirt/': No such file or directory

because if the VM root cgroup ('machine-qemu\x2d1\x2dguest.scope') doesn't exist the nested cgroup ('libvirt') will not exist as well.

While you are correct that the nested cgroup doesn't exist in sysfs, the pointers being handled by libvirt still do since virCgroupFree() hasn't yet been called. This message shows up because the cgroup we are processing (the VM root cgroup) contains zeros for progfd and mapfd, and virCgroupV2DevicesDetectProg() attempts to open the corresponding path that systemd already cleaned up.

The nested cgroup ('libvirt') has nonzero file descriptors, so the check at the top of virCgroupV2DevicesDetectProg() would return before attempting to open the nonexistent '/sys/fs/cgroup/.../libvirt/' path. In that case, the message would NOT be generated, and the code will just go back to virCgroupV2DevicesRemoveProg() to close the file descriptors.

That's one bug that while calling virCgroupV2DevicesRemoveProg() we are passing the root cgroup but we need to pass the nested cgroup instead. This will fix the FD leak when VMs are started and destroyed while libvirt is running the whole time.

...

...
Now to the progfd and mapfd being zero, that depends whether libvirtd was restarted while the VM was running.

Huh? No, I see the progfd and mapfd fields zero in the VM root cgroup and stored with the BPF file descriptors in the nested ('libvirt') cgroup with your patch. Previously, they were in the VM root cgroup itself.

When a VM is started on host

...
with cgroups v2 libvirt will create BPF program and BPF map to limit access to system devices and stores both in the mentioned variables. But if you restart libvirtd it will load the BPF prog ID and BPF map ID only on demand, for example when destroying the VM.

Now on hosts with systemd the owner of the VM root cgroup is systemd and because we use call talk to systemd-machined and register the VM there the VM root cgroup is created for us by systemd-machined and it is associated with qemu PID. When destroying the VM we will kill the qemu process which will trigger systemd-machined to automatically cleanup the cgroup. Once that happens kernel should eventually cleanup both BPF prog and BPF map that were associated with the nested cgroup because it no longer exist and there are no other references to the prog or map. It may take some time before kernel actually removes the prog and map.

This all may be true, but as I said in my commit message libvirt's leaving open two file descriptors for the BPF prog and map that aren't being closed when the guest is shut down. I was attempting to trigger a different bug by doing a virsh create/shutdown in a loop, and eventually my creates would fail because I'd exhausted the number of open files.

...
The only thing on systemd based host we can do is to skip the whole BPF detection code if the directory was already removed which I've already proposed by handling ENOENT in virCgroupV2DevicesDetectProg().

Yes. That removes the message, but doesn't clean up the file descriptors being left open.

Correct, to me it was not obvious from the original commit message that we are leaking file descriptors because I was more focused on the error itself and I was understanding that you are trying to fix the code to stop printing that error.

...

...
All of the above applies to systemd based hosts. If we consider system without systemd then there is an actual bug as you already pointed out that the BPF prog and BPF map are now associated with the nested cgroup, to fix that we should change only virCgroupV2Remove:

diff --git a/src/util/vircgroupv2.c b/src/util/vircgroupv2.c index 248d4047e5..4664492c34 100644 --- a/src/util/vircgroupv2.c +++ b/src/util/vircgroupv2.c @@ -523,6 +523,7 @@ static int virCgroupV2Remove(virCgroupPtr group) { g_autofree char *grppath = NULL; + virCgroupPtr parent = virCgroupGetNested(group); int controller;

/* Don't delete the root group, if we accidentally @@ -534,7 +535,7 @@ virCgroupV2Remove(virCgroupPtr group) if (virCgroupV2PathOfController(group, controller, "", &grppath) < 0) return 0;

- if (virCgroupV2DevicesRemoveProg(group) < 0) + if (virCgroupV2DevicesRemoveProg(parent) < 0) return -1;

return virCgroupRemoveRecursively(grppath);

This addresses both symptoms I'm experiencing, but it feels weird to be the only place outside of vircgroup.c that needs to navigate/understand the difference between group and group->nested. This is why I'd proposed that logic in the caller, so it's covered by both v1 and v2, but if it's only the v2 code that needs this then okay. I don't have a non-systemd system nearby to try it with.

We use BPF only with cgroups v2, cgroups v1 have normal devices controller with sysfs files that we can write/read to/from. I removed most of the previous quotes as they are not relevant now. To summarize the current code has two issues: 1) We are leaking BPF prog and BPF map file descriptors because they now live in the nested "libvirt" cgroup instead of the VM root cgroup. This can be addressed by the code above. This happens only if libvirtd was not restarted between VM start and VM destroy. 2) When running on systemd and libvirtd was restarted between VM start and VM destroy we will always attempt to detect BPF prog and BPF map in order to close it right away. Now that I think about it we can most likely drop the detection completely. That will fix the unnecessary error report when the cgroup was already deleted by systemd. I'll test it on non-systemd OS as well and post a patches if you are OK with that. Thanks for noticing the FD leak! Pavel

Eric Farman

7:32 p.m.

New subject: [PATCH 1/1] vircgroup: Cleanup nested cgroups

On 4/9/21 10:14 AM, Pavel Hrdina wrote:

...

On Fri, Apr 09, 2021 at 09:13:32AM -0400, Eric Farman wrote:

...
...
I still fail to see how calling virCgroupRemove(group->nested) in virCgroupRemove() would help at all. The original issue you mentioned in the commit message is that we log this error:

unable to open '/sys/fs/cgroup/machine.slice/machine-qemu\x2d1\x2dguest.scope/': No such file or directory

So calling virCgroupRemove(group->nested) would also fail with:

unable to open '/sys/fs/cgroup/machine.slice/machine-qemu\x2d1\x2dguest.scope/libvirt/': No such file or directory

because if the VM root cgroup ('machine-qemu\x2d1\x2dguest.scope') doesn't exist the nested cgroup ('libvirt') will not exist as well.

While you are correct that the nested cgroup doesn't exist in sysfs, the pointers being handled by libvirt still do since virCgroupFree() hasn't yet been called. This message shows up because the cgroup we are processing (the VM root cgroup) contains zeros for progfd and mapfd, and virCgroupV2DevicesDetectProg() attempts to open the corresponding path that systemd already cleaned up.

The nested cgroup ('libvirt') has nonzero file descriptors, so the check at the top of virCgroupV2DevicesDetectProg() would return before attempting to open the nonexistent '/sys/fs/cgroup/.../libvirt/' path. In that case, the message would NOT be generated, and the code will just go back to virCgroupV2DevicesRemoveProg() to close the file descriptors.

That's one bug that while calling virCgroupV2DevicesRemoveProg() we are passing the root cgroup but we need to pass the nested cgroup instead. This will fix the FD leak when VMs are started and destroyed while libvirt is running the whole time.

...
...
Now to the progfd and mapfd being zero, that depends whether libvirtd was restarted while the VM was running.

Huh? No, I see the progfd and mapfd fields zero in the VM root cgroup and stored with the BPF file descriptors in the nested ('libvirt') cgroup with your patch. Previously, they were in the VM root cgroup itself.

When a VM is started on host

...
with cgroups v2 libvirt will create BPF program and BPF map to limit access to system devices and stores both in the mentioned variables. But if you restart libvirtd it will load the BPF prog ID and BPF map ID only on demand, for example when destroying the VM.

Now on hosts with systemd the owner of the VM root cgroup is systemd and because we use call talk to systemd-machined and register the VM there the VM root cgroup is created for us by systemd-machined and it is associated with qemu PID. When destroying the VM we will kill the qemu process which will trigger systemd-machined to automatically cleanup the cgroup. Once that happens kernel should eventually cleanup both BPF prog and BPF map that were associated with the nested cgroup because it no longer exist and there are no other references to the prog or map. It may take some time before kernel actually removes the prog and map.

This all may be true, but as I said in my commit message libvirt's leaving open two file descriptors for the BPF prog and map that aren't being closed when the guest is shut down. I was attempting to trigger a different bug by doing a virsh create/shutdown in a loop, and eventually my creates would fail because I'd exhausted the number of open files.

...
The only thing on systemd based host we can do is to skip the whole BPF detection code if the directory was already removed which I've already proposed by handling ENOENT in virCgroupV2DevicesDetectProg().

Yes. That removes the message, but doesn't clean up the file descriptors being left open.

Correct, to me it was not obvious from the original commit message that we are leaking file descriptors because I was more focused on the error itself and I was understanding that you are trying to fix the code to stop printing that error.

I'm glad I mentioned both, though it wasn't obvious to me there was a relation between the two until I started looking into the file descriptor problem.

...

...
...
All of the above applies to systemd based hosts. If we consider system without systemd then there is an actual bug as you already pointed out that the BPF prog and BPF map are now associated with the nested cgroup, to fix that we should change only virCgroupV2Remove:

diff --git a/src/util/vircgroupv2.c b/src/util/vircgroupv2.c index 248d4047e5..4664492c34 100644 --- a/src/util/vircgroupv2.c +++ b/src/util/vircgroupv2.c @@ -523,6 +523,7 @@ static int virCgroupV2Remove(virCgroupPtr group) { g_autofree char *grppath = NULL; + virCgroupPtr parent = virCgroupGetNested(group); int controller;

/* Don't delete the root group, if we accidentally @@ -534,7 +535,7 @@ virCgroupV2Remove(virCgroupPtr group) if (virCgroupV2PathOfController(group, controller, "", &grppath) < 0) return 0;

- if (virCgroupV2DevicesRemoveProg(group) < 0) + if (virCgroupV2DevicesRemoveProg(parent) < 0) return -1;

return virCgroupRemoveRecursively(grppath);

This addresses both symptoms I'm experiencing, but it feels weird to be the only place outside of vircgroup.c that needs to navigate/understand the difference between group and group->nested. This is why I'd proposed that logic in the caller, so it's covered by both v1 and v2, but if it's only the v2 code that needs this then okay. I don't have a non-systemd system nearby to try it with.

We use BPF only with cgroups v2, cgroups v1 have normal devices controller with sysfs files that we can write/read to/from.

I removed most of the previous quotes as they are not relevant now.

To summarize the current code has two issues:

1) We are leaking BPF prog and BPF map file descriptors because they now live in the nested "libvirt" cgroup instead of the VM root cgroup. This can be addressed by the code above. This happens only if libvirtd was not restarted between VM start and VM destroy.

Agreed.

...

2) When running on systemd and libvirtd was restarted between VM start and VM destroy we will always attempt to detect BPF prog and BPF map in order to close it right away.

I'll trust you regarding this, because I haven't looked into what happens when libvirtd restarts.

...

Now that I think about it we can most likely drop the detection completely. That will fix the unnecessary error report when the cgroup was already deleted by systemd.

I'll test it on non-systemd OS as well and post a patches if you are OK with that.

That sounds great.

...

Thanks for noticing the FD leak!

You're welcome. Thanks for taking the time to walk through this with me! Cheers, Eric

...

Pavel

Eric Farman

7 Apr 7 Apr

2:25 p.m.

Polite ping, now that release freeze has finished. On 3/26/21 12:25 PM, Eric Farman wrote:

...

Hi Pavel, et al,

Running Fedora 33 KVM/QEMU on s390x, I recently noticed a couple of oddities when shutting down my guests, which I bisected between 7.0.0 and 7.1.0 to your commit:

commit 184245f53b94fc84f727eb6e8a2aa52df02d69c0 Author: Pavel Hrdina <phrdina@redhat.com> Date: Tue Feb 9 12:33:53 2021 +0100

vircgroup: introduce nested cgroup to properly work with systemd

The attached patch is based on some of the tweaks you did to use the parent/nested pointers in a cgroup, rather than a cgroup itself. It fixes both problems I'm encountering, but as this code is quite unfamiliar to me, I might well be way off track and would appreciate your feedback in the matter.

Apologies for hitting this on the same day as freeze; I had not noticed the messages previously, and started digging yesterday when I noted the number of file descriptors left open by libvirt.

Thanks, Eric

Eric Farman (1): vircgroup: Cleanup nested cgroups

src/util/vircgroup.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)

1710

Age (days ago)

1724

Last active (days ago)

List overview

Download

10 comments

2 participants

participants (2)

Eric Farman
Pavel Hrdina

[PATCH 0/1] vircgroup: Fix nested cgroup on cleanup

tags

participants (2)