[libvirt] [REPOST] regarding cgroup v2 support in libvirt

(reposting w/ libvir-list cc'd, sorry about the delay in reposting, was traveling and then on vacation)
Hello, Daniel. How have you been?
We (facebook) are deploying cgroup v2 and internally use libvirt to manage virtual machines, so I'm trying to add cgroup v2 support to libvirt.
Because cgroup v2's resource configurations differ from v1 in varying degrees depending on the specific resource type, it unfortunately introduces new configurations (some completely new configs, others just a different range / format). This means that adding cgroup v2 support to libvirt requires adding new config options to it and maybe implementing some form of translation mechanism between overlapping configs.
The upcoming systemd release includes all that's necessary to support v1/v2 compatibility so that users setting resource configs through systemd don't have to worry about whether v1 or v2 is in use. I'm wondering whether it would make sense to make libvirt use dbus calls to systemd to set resource configs when systemd is in use, so that it can piggyback on systemd's v1/v2 compatibility.
It is true that, as libvirt can be used without systemd, libvirt will probably want its own direct implementation down the line, but I think there are benefits to going through systemd for resource settings in general given that hierarchy setup is already done through systemd when available.
What do you think?
Thanks!
--
tejun
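To make the "different range / format" point concrete, here is a minimal sketch of one possible translation between overlapping configs. It assumes the v1 `cpu.shares` scale (2 to 262144, default 1024) and the v2 `cpu.weight` scale (1 to 10000, default 100); the linear mapping and the function name `shares_to_weight` are illustrative assumptions, not something specified in this thread.

```python
# Illustrative sketch only: a linear v1 -> v2 weight translation.
# The ranges below are assumptions about the v1 "cpu.shares" and v2
# "cpu.weight" scales; the mapping itself is one possible choice.
V1_SHARES_MIN, V1_SHARES_MAX, V1_SHARES_DEFAULT = 2, 262144, 1024
V2_WEIGHT_MIN, V2_WEIGHT_MAX, V2_WEIGHT_DEFAULT = 1, 10000, 100


def shares_to_weight(shares: int) -> int:
    """Map a v1-style shares value onto the v2-style weight scale."""
    if not V1_SHARES_MIN <= shares <= V1_SHARES_MAX:
        raise ValueError(f"shares out of range: {shares}")
    # Scale so that the v1 default lands on the v2 default.
    weight = shares * V2_WEIGHT_DEFAULT // V1_SHARES_DEFAULT
    # Clamp, because the two scales do not cover the same ratio range.
    return max(V2_WEIGHT_MIN, min(V2_WEIGHT_MAX, weight))
```

The clamping step is where information is lost: very small and very large v1 values collapse onto the ends of the v2 range, which is one reason a translation layer cannot be fully transparent.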

On Thu, Oct 20, 2016 at 02:59:45PM -0400, Tejun Heo wrote:
(reposting w/ libvir-list cc'd, sorry about the delay in reposting, was traveling and then on vacation)
Hello, Daniel. How have you been?
We (facebook) are deploying cgroup v2 and internally use libvirt to manage virtual machines, so I'm trying to add cgroup v2 support to libvirt.
Because cgroup v2's resource configurations differ from v1 in varying degrees depending on the specific resource type, it unfortunately introduces new configurations (some completely new configs, others just a different range / format). This means that adding cgroup v2 support to libvirt requires adding new config options to it and maybe implementing some form of translation mechanism between overlapping configs.
The upcoming systemd release includes all that's necessary to support v1/v2 compatibility so that users setting resource configs through systemd don't have to worry about whether v1 or v2 is in use. I'm wondering whether it would make sense to make libvirt use dbus calls to systemd to set resource configs when systemd is in use, so that it can piggyback on systemd's v1/v2 compatibility.
The big question I have around cgroup v2 is state of support for all controllers that libvirt uses (cpu, cpuacct, cpuset, memory, devices, freezer, blkio). IIUC, not all of these have been ported to cgroup v2 setup and the cpu port in particular was rejected by Linux maintainers. Libvirt has a general policy that we won't support features that only exist in out of tree patches (applies to kernel and any other software we build against or use).
IIRC from earlier discussions, the model for dealing with processes in cgroup v2 was quite different. In libvirt we rely on the ability to assign different threads within a process to different cgroups, because we need to control CPU scheduler parameters on different threads in QEMU. eg we have vCPU threads, I/O threads and general emulator threads each of which get different policies.
When I spoke with Lennart about cgroup v2, way back in Jan, he indicated that while systemd can technically work with a system where some controllers are mounted as v1, while others are mounted as v2, this would not be an officially supported solution. Thus systemd in Fedora was not likely to switch to v2 until all required controllers could use v2. I'm not sure if this still corresponds to Lennart's current views, so CC'ing him to confirm/deny.
I think from Libvirt POV it would greatly simplify life if we could likewise restrict ourselves to dealing with hosts which are exclusively v1 or exclusively v2, and not a mixture. ie we can completely isolate our codebases for v1 vs v2 management, making it easier to reason about and test their correctness, reducing QA testing burden.
I recall that systemd policy for v2 was intended to be that no app should write to cgroup sysfs except for systemd, unless there was a sub-tree created with Delegate=yes set on the scope. So this clearly means when using v2 we'll have to use the systemd DBus APIs for managing cgroups v2 on such hosts.
It is true that, as libvirt can be used without systemd, libvirt will probably want its own direct implementation down the line, but I think there are benefits to going through systemd for resource settings in general given that hierarchy setup is already done through systemd when available.
While it is certainly nice that the vast majority of OS distros have switched over to using systemd for init, there's still enough users out there that I think we'll need to continue to have libvirt support for using sysfs for v2 on non-systemd hosts.
Anyway, in summary, we'd like to see v2 support of course, since that is clearly the future. The big question is what we do about the situation wrt not all controllers being supported in v2 - the lack of complete conversion is what has stopped me from doing any work in this area up to now.
Regards,
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://entangle-photo.org -o- http://search.cpan.org/~danberr/ :|

On Fri, 21.10.16 11:19, Daniel P. Berrange (berrange@redhat.com) wrote:
On Thu, Oct 20, 2016 at 02:59:45PM -0400, Tejun Heo wrote:
(reposting w/ libvir-list cc'd, sorry about the delay in reposting, was traveling and then on vacation)
Hello, Daniel. How have you been?
We (facebook) are deploying cgroup v2 and internally use libvirt to manage virtual machines, so I'm trying to add cgroup v2 support to libvirt.
Because cgroup v2's resource configurations differ from v1 in varying degrees depending on the specific resource type, it unfortunately introduces new configurations (some completely new configs, others just a different range / format). This means that adding cgroup v2 support to libvirt requires adding new config options to it and maybe implementing some form of translation mechanism between overlapping configs.
The upcoming systemd release includes all that's necessary to support v1/v2 compatibility so that users setting resource configs through systemd don't have to worry about whether v1 or v2 is in use. I'm wondering whether it would make sense to make libvirt use dbus calls to systemd to set resource configs when systemd is in use, so that it can piggyback on systemd's v1/v2 compatibility.
The big question I have around cgroup v2 is state of support for all controllers that libvirt uses (cpu, cpuacct, cpuset, memory, devices, freezer, blkio). IIUC, not all of these have been ported to cgroup v2 setup and the cpu port in particular was rejected by Linux maintainers. Libvirt has a general policy that we won't support features that only exist in out of tree patches (applies to kernel and any other software we build against or use).
IIRC from earlier discussions, the model for dealing with processes in cgroup v2 was quite different. In libvirt we rely on the ability to assign different threads within a process to different cgroups, because we need to control CPU scheduler parameters on different threads in QEMU. eg we have vCPU threads, I/O threads and general emulator threads each of which get different policies.
When I spoke with Lennart about cgroup v2, way back in Jan, he indicated that while systemd can technically work with a system where some controllers are mounted as v1, while others are mounted as v2, this would not be an officially supported solution. Thus systemd in Fedora was not likely to switch to v2 until all required controllers could use v2. I'm not sure if this still corresponds to Lennart's current views, so CC'ing him to confirm/deny.
So, the "hybrid" mode is probably nothing RHEL or so would want to support. However, I think it might be a good step for Fedora at least. But yes, supporting this mode means additional porting effort for the various daemons that access cgroupfs...
I recall that systemd policy for v2 was intended to be that no app should write to cgroup sysfs except for systemd, unless there was a sub-tree created with Delegate=yes set on the scope. So this clearly means when using v2 we'll have to use the systemd DBus APIs for managing cgroups v2 on such hosts.
Yes, this is our policy: the cgroup tree is private property of systemd (at least regarding write access), except when you have a service or scope unit where Delegate=yes is set, in which case you can manage your own subtree of that freely.
Lennart
--
Lennart Poettering, Red Hat

Hello, Daniel.
On Fri, Oct 21, 2016 at 11:19:02AM +0100, Daniel P. Berrange wrote:
The big question I have around cgroup v2 is state of support for all controllers that libvirt uses (cpu, cpuacct, cpuset, memory, devices, freezer, blkio). IIUC, not all of these have been ported to cgroup v2 setup and the cpu port in particular was rejected by Linux maintainers. Libvirt has a general policy that we won't support features that only exist in out of tree patches (applies to kernel and any other software we build against or use).
I see and that's understandable. However, I think supporting resource control through systemd can be a good way of navigating the situation. The back and forward compatibility issues are handled by systemd allowing libvirt users to make use of what's available on the system without burdening libvirt with complications.
IIRC from earlier discussions, the model for dealing with processes in cgroup v2 was quite different. In libvirt we rely on the ability to assign different threads within a process to different cgroups, because we need to control CPU scheduler parameters on different threads in QEMU. eg we have vCPU threads, I/O threads and general emulator threads each of which get different policies.
How thread granularity will be handled in cgroup v2 is still contentious but I believe that we'll eventually have something. I have always been curious about the QEMU thread control tho. What prevents it from using the usual nice level adjustments? Does it actually require hierarchical resource distribution?
When I spoke with Lennart about cgroup v2, way back in Jan, he indicated that while systemd can technically work with a system where some controllers are mounted as v1, while others are mounted as v2, this would not be an officially supported solution. Thus systemd in Fedora was not likely to switch to v2 until all required controllers could use v2. I'm not sure if this still corresponds to Lennart's current views, so CC'ing him to confirm/deny.
The hybrid mode implemented in systemd uses cgroup v2 for process management (the "name=systemd" hierarchy) but keeps using v1 hierarchies for all resource control. For "Delegate=" users, I don't think it'd matter all that much. Such users either see all v1 hierarchies for all resource controllers as before or the v2 hierarchy.
I think from Libvirt POV it would greatly simplify life if we could likewise restrict ourselves to dealing with hosts which are exclusively v1 or exclusively v2, and not a mixture. ie we can completely isolate our codebases for v1 vs v2 management, making it easier to reason about and test their correctness, reducing QA testing burden.
I think that's gonna be the case. People *may* try to mix v1 and v2 hierarchies for resource control manually but supporting the mixture in any major software project would require a lot of complications which are difficult to justify.
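As a rough illustration of the "exclusively v1 or exclusively v2" distinction, the sketch below classifies a host from the text of `/proc/self/mountinfo`. It assumes v1 hierarchies appear with filesystem type `cgroup` and the v2 hierarchy with type `cgroup2`; the function name and the returned labels are hypothetical, and a v2 mount used only for process tracking (as in the hybrid mode described above) would also be reported as "mixed" by this simple check.

```python
def classify_cgroup_setup(mountinfo_text: str) -> str:
    """Return 'v1', 'v2', 'mixed' or 'none' based on mounted cgroup types.

    Assumption: in each mountinfo line, the filesystem type is the first
    field after the ' - ' separator.
    """
    fstypes = set()
    for line in mountinfo_text.splitlines():
        _, sep, tail = line.partition(" - ")
        if not sep:
            continue  # not a well-formed mountinfo line
        fields = tail.split()
        if fields and fields[0] in ("cgroup", "cgroup2"):
            fstypes.add(fields[0])
    if fstypes == {"cgroup"}:
        return "v1"
    if fstypes == {"cgroup2"}:
        return "v2"
    if fstypes == {"cgroup", "cgroup2"}:
        return "mixed"
    return "none"
```

A caller could use the result to pick a single v1 or v2 code path and refuse, or fall back, on "mixed".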
I recall that systemd policy for v2 was intended to be that no app should write to cgroup sysfs except for systemd, unless there was a sub-tree created with Delegate=yes set on the scope. So this clearly means when using v2 we'll have to use the systemd DBus APIs for managing cgroups v2 on such hosts.
Hmmm... maybe I'm mistaken but it's also kinda broken without "Delegate=" on v1 too and we got bitten by that already. An internal piece of software assumed that it could branch down from the cgroups that the target process is in at the time of startup and ended up building sub-hierarchies at different positions in different hierarchies. Later somebody launched a systemd service which requested some resource accounting and systemd ended up relocating processes from those sub-hierarchies. On systemd systems, I don't think it makes sense to try to do sub-hierarchy management directly without telling systemd about it. The flip side is the same too. With "Delegate=" set, cgroup v2 doesn't pose any more problems than v1 does.
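The "different positions in different hierarchies" situation can be spotted from the text of `/proc/<pid>/cgroup`. The following is a minimal sketch, assuming the usual `hierarchy-ID:controller-list:path` line format with the v2 entry written as `0::/path`; the helper names are hypothetical.

```python
def v1_paths_by_controller(proc_cgroup_text: str) -> dict:
    """Map each v1 controller (or named hierarchy) to the process's path."""
    paths = {}
    for line in proc_cgroup_text.splitlines():
        parts = line.split(":", 2)  # the path itself may contain ':'
        if len(parts) != 3:
            continue
        hierarchy_id, controllers, path = parts
        if hierarchy_id == "0" and controllers == "":
            continue  # v2 entry, not a v1 hierarchy
        for controller in controllers.split(","):
            paths[controller] = path
    return paths


def has_consistent_placement(proc_cgroup_text: str) -> bool:
    """True if the process sits at the same path in every v1 hierarchy."""
    return len(set(v1_paths_by_controller(proc_cgroup_text).values())) <= 1
```

A process that fails this check is in the state described above: a manager that later relocates it in one hierarchy has no single sub-tree to treat as "the" position of that process.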
It is true that, as libvirt can be used without systemd, libvirt will probably want its own direct implementation down the line, but I think there are benefits to going through systemd for resource settings in general given that hierarchy setup is already done through systemd when available.
While it is certainly nice that the vast majority of OS distros have switched over to using systemd for init, there's still enough users out there that I think we'll need to continue to have libvirt support for using sysfs for v2 on non-systemd hosts.
Definitely.
Anyway, in summary, we'd like to see v2 support of course, since that is clearly the future. The big question is what we do about the situation wrt not all controllers being supported in v2 - the lack of complete conversion is what has stopped me from doing any work in this area up to now.
What I'm suggesting now is, if available, to use systemd to set up resource control up to delegation point. This also would make control ownership arbitration between systemd and libvirt easier to solve. Beyond arbitration point, libvirt can keep doing whatever it has been doing. If there are v1 hierarchies, it can keep doing the subhierarchy management. If v2, it can ignore it for now until cgroup v2 and libvirt support for it are ready. IMHO, this would give a substantial part of resource containment that people want without libvirt having to deal with the headaches of transitional period.
Thanks.
--
tejun

On Fri, Oct 21, 2016 at 02:24:27PM -0400, Tejun Heo wrote:
Hello, Daniel.
On Fri, Oct 21, 2016 at 11:19:02AM +0100, Daniel P. Berrange wrote:
The big question I have around cgroup v2 is state of support for all controllers that libvirt uses (cpu, cpuacct, cpuset, memory, devices, freezer, blkio). IIUC, not all of these have been ported to cgroup v2 setup and the cpu port in particular was rejected by Linux maintainers. Libvirt has a general policy that we won't support features that only exist in out of tree patches (applies to kernel and any other software we build against or use).
I see and that's understandable. However, I think supporting resource control through systemd can be a good way of navigating the situation. The back and forward compatibility issues are handled by systemd allowing libvirt users to make use of what's available on the system without burdening libvirt with complications.
I don't think that's satisfactory - the risk is that the semantic behaviour of what is finally merged in the kernel may be different from the semantics of the cpu controller out of tree patches. This could in turn cause behavioural differences for existing deployed VMs.
IIRC from earlier discussions, the model for dealing with processes in cgroup v2 was quite different. In libvirt we rely on the ability to assign different threads within a process to different cgroups, because we need to control CPU scheduler parameters on different threads in QEMU. eg we have vCPU threads, I/O threads and general emulator threads each of which get different policies.
How thread granularity will be handled in cgroup v2 is still contentious but I believe that we'll eventually have something. I have always been curious about the QEMU thread control tho. What prevents it from using the usual nice level adjustments? Does it actually require hierarchical resource distribution?
nice level adjustments only apply to individual threads. In some cases we can apply controls to individual threads, but in other cases we need to apply controls to multiple threads as a group. We currently have the following children under the main CPU controller group for a VM:
  $maincgroup
    |
    +- vcpu0 - single thread for vCPU 0
    +- vcpu1 - single thread for vCPU 1
    ...
    +- vcpuN - single thread for vCPU N
    +- iothread0 - multiple threads for device I/O thread group 0
    +- iothread1 - multiple threads for device I/O thread group 1
    ...
    +- iothreadN - multiple threads for device I/O thread group N
    +- emulator - multiple threads (main event loop, migration, file I/O threads)
Against the top level group we set the 'shares' tunable which gives us relative weighting of the entire VM against other VMs. Against each of the child groups we set quota + period, so we have absolute control over usage from different functional parts of QEMU. Setting per-thread nice levels can't replicate any of this functionality afaict.
When I spoke with Lennart about cgroup v2, way back in Jan, he indicated that while systemd can technically work with a system where some controllers are mounted as v1, while others are mounted as v2, this would not be an officially supported solution. Thus systemd in Fedora was not likely to switch to v2 until all required controllers could use v2. I'm not sure if this still corresponds to Lennart's current views, so CC'ing him to confirm/deny.
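The per-VM layout described here can be written down as data. The sketch below is an illustration only: it assumes 'shares' and 'quota + period' correspond to the v1 cpu controller files `cpu.shares`, `cpu.cfs_quota_us` and `cpu.cfs_period_us`, and the function name `vm_cpu_layout` and its parameters are hypothetical rather than libvirt's own.

```python
def vm_cpu_layout(n_vcpus, n_iothreads, shares, period_us, quota_us):
    """Return {relative cgroup path: {tunable file: value}} for one VM.

    "." stands for the VM's main cgroup; every other key is a child group.
    """
    # Relative weight of the whole VM against other VMs.
    layout = {".": {"cpu.shares": shares}}
    children = (
        [f"vcpu{i}" for i in range(n_vcpus)]
        + [f"iothread{i}" for i in range(n_iothreads)]
        + ["emulator"]
    )
    for child in children:
        # Absolute cap per functional part: quota_us of CPU time per period_us.
        layout[child] = {
            "cpu.cfs_period_us": period_us,
            "cpu.cfs_quota_us": quota_us,
        }
    return layout
```

For example, `vm_cpu_layout(2, 1, 1024, 100000, 50000)` caps each child group at half a CPU while the VM as a whole competes with a weight of 1024. The point of the grouping is that a cap on `emulator` or `iothread0` applies to several threads at once, which a per-thread setting cannot express.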
The hybrid mode implemented in systemd uses cgroup v2 for process management (the "name=systemd" hierarchy) but keeps using v1 hierarchies for all resource control. For "Delegate=" users, I don't think it'd matter all that much. Such users either see all v1 hierarchies for all resource controllers as before or the v2 hierarchy.
I think from Libvirt POV it would greatly simplify life if we could likewise restrict ourselves to dealing with hosts which are exclusively v1 or exclusively v2, and not a mixture. ie we can completely isolate our codebases for v1 vs v2 management, making it easier to reason about and test their correctness, reducing QA testing burden.
I think that's gonna be the case. People *may* try to mix v1 and v2 hierarchies for resource control manually but supporting the mixture in any major software project would require a lot of complications which are difficult to justify.
Ok, that's good to know.
Anyway, in summary, we'd like to see v2 support of course, since that is clearly the future. The big question is what we do about the situation wrt not all controllers being supported in v2 - the lack of complete conversion is what has stopped me from doing any work in this area up to now.
What I'm suggesting now is, if available, to use systemd to set up resource control up to delegation point. This also would make control ownership arbitration between systemd and libvirt easier to solve.
Libvirt currently uses machined to create the cgroup directory, eg /machines/foo, and then writes settings to /machine/foo/$KEY.
IIUC, Delegate=yes doesn't let you write to tunables at the cgroup /machines/foo - it merely gives libvirt permissions to create /machines/foo/bar and write at /machines/foo/bar/$KEY. So the Delegate=yes feature is only useful to libvirt in the context of LXC guests, as it lets the OS libvirt spawns inside the guest control its sub-hierarchy. Libvirt still has to rely on using the systemd DBus API to set the tunables at /machine/foo/$KEY.
Regards,
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://entangle-photo.org -o- http://search.cpan.org/~danberr/ :|
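For reference, setting a tunable on the unit itself through systemd's D-Bus API might look roughly like the sketch below. It only builds a `busctl` command line for the `SetUnitProperties` method on `org.freedesktop.systemd1.Manager` and does not run it. The unit name `machine-foo.scope` is a hypothetical placeholder, and the assumption that the properties passed (such as `CPUShares`) are unsigned 64-bit values (D-Bus type `t`) should be checked against the systemd version in use.

```python
def set_unit_properties_argv(unit, properties, runtime=True):
    """Build a busctl argv for Manager.SetUnitProperties(s, b, a(sv)).

    Assumption: every property value is an unsigned 64-bit integer ('t').
    """
    argv = [
        "busctl", "call",
        "org.freedesktop.systemd1",
        "/org/freedesktop/systemd1",
        "org.freedesktop.systemd1.Manager",
        "SetUnitProperties",
        "sba(sv)",
        unit,
        "true" if runtime else "false",  # runtime=True: do not persist
        str(len(properties)),            # array length precedes the entries
    ]
    for name, value in properties.items():
        # Each entry: property name, variant type signature, value.
        argv += [name, "t", str(value)]
    return argv
```

A call such as `set_unit_properties_argv("machine-foo.scope", {"CPUShares": 2048})` yields an argument list that could be handed to `subprocess.run` on a systemd host; a real implementation would more likely talk to the bus directly than shell out.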
participants (3): Daniel P. Berrange, Lennart Poettering, Tejun Heo