[libvirt] Suboptimal default cpu Cgroup

Hello,

by default, libvirt with KVM creates a Cgroup hierarchy in 'cpu,cpuacct' [1], with 'shares' set to 1024 on every level. This raises two points:

1) Every VM is given an equal amount of CPU time. [2] ($CG/machine.slice/*/shares = 1024)
Which means that smaller / less loaded guests are given an advantage.

2) All VMs combined are given 1024 shares. [3] ($CG/machine.slice/shares)
This is made even worse on RHEL7, by sched_autogroup_enabled = 0, so every other process in the system is given the same amount of CPU as all VMs combined.

It does not seem to be possible to tune shares and get a good general behavior, so the best solution I can see is to disable the cpu cgroup and let users do it when needed. (Keeping all tasks in $CG/tasks.)

Do we want cgroups in the default at all? (Is OpenStack dealing with these quirks?)

Thanks.

---
1: machine.slice/
     machine-qemu\x2d${name}.scope/
       {emulator,vcpu*}/

2: To reproduce, run two guests with > 1 VCPU and execute two spinners on the first and one on the second. The result will be 50%/50% CPU assignment between guests; 66%/33% seems more natural, but it could still be considered as a feature.

3: Run a guest with $n VCPUs and $n spinners in it, and $n spinners in the host.
- RHEL7: the guest gets 1/($n + 1) of the CPU -- I'd expect 50%/50%.
- Upstream: 50%/50% between guest and host because of autogrouping; if you run $n more spinners in the host, it will still be 50%/50%, instead of the seemingly fairer 33%/66%. (And you can run spinners from different groups, so it would then be the same as on RHEL7.)
It also works the other way: if the host has $n CPUs, then $n/2 tasks in the host suffice to minimize the VMs' performance, regardless of the number of running VCPUs.
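For reference, the hierarchy from [1] and its default 'shares' values can be read straight from cgroupfs. A minimal inspection sketch, assuming cgroup v1 with the cpu,cpuacct controller mounted under /sys/fs/cgroup and a hypothetical guest named "fedora" (the scope directory name follows footnote [1] and may differ on other setups):

    # Sketch: read the default cpu.shares at every level of the hierarchy in [1].
    CG=/sys/fs/cgroup/cpu,cpuacct
    cat "$CG/machine.slice/cpu.shares"                                     # all VMs combined -> 1024
    cat "$CG/machine.slice/machine-qemu\x2dfedora.scope/cpu.shares"        # one VM           -> 1024
    cat "$CG/machine.slice/machine-qemu\x2dfedora.scope/vcpu0/cpu.shares"  # one VCPU thread  -> 1024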

----- Original Message -----
From: "Radim Krčmář" <rkrcmar@redhat.com> To: libvir-list@redhat.com Cc: "Daniel P. Berrange" <berrange@redhat.com>, "Andrew Theurer" <atheurer@redhat.com> Sent: Thursday, August 14, 2014 9:25:05 AM Subject: Suboptimal default cpu Cgroup
Hello,
by default, libvirt with KVM creates a Cgroup hierarchy in 'cpu,cpuacct' [1], with 'shares' set to 1024 on every level. This raises two points:
1) Every VM is given an equal amount of CPU time. [2] ($CG/machine.slice/*/shares = 1024)
Which means that smaller / less loaded guests are given an advantage.
2) All VMs combined are given 1024 shares. [3] ($CG/machine.slice/shares)
This is made even worse on RHEL7, by sched_autogroup_enabled = 0, so every other process in the system is given the same amount of CPU as all VMs combined.
It does not seem to be possible to tune shares and get a good general behavior, so the best solution I can see is to disable the cpu cgroup and let users do it when needed. (Keeping all tasks in $CG/tasks.)
Could we have each VM's shares be nr_vcpu * 1024, and the share for $CG/machine.slice be the sum of all VMs' shares?
[...]

2014-08-14 13:55-0400, Andrew Theurer:
[...]
Could we have each VM's shares be nr_vcpu * 1024, and the share for $CG/machine.slice be the sum of all VMs' shares?
That would be unfair in a different way ... some examples:

VM's shares = nr_vcpu * 1024:
- With a 1 VCPU and a 10 VCPU guest each running only one task in overcommit, the larger guest gets 10 times more CPU. (Feature?)

$CG/machine.slice = sum(VM's shares):
- 'shares' is bound by 262144 right now, so it wouldn't scale beyond one large guest. (Not a big problem, but the solutions are ugly.)
- Default system tasks still have 1024, so their share would get unfairly small if we had some idle guests as well. Take a 10 CPU machine with 10 guests of 10 VCPUs each, only one of which is actively running: a non-VM task would get just ~1% of the CPU, not the ~10% we would expect with 11 running tasks. And it would be even worse with autogrouping.

---
[...]
2: To reproduce, run two guests with > 1 VCPU and execute two spinners on the first and one on the second. The result will be 50%/50% CPU assignment between guests; 66%/33% seems more natural, but it could still be considered as a feature.
(Please note a mistake here: the host is implied to have 1-2 CPUs. It would have been better to use nr_cpus as well ...)
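To make the ~1% vs ~10% figures above concrete, here is a small worked sketch of the share arithmetic under the proposed scheme; it only models the cpu.shares ratio between one default host task and machine.slice on the assumed 10-CPU host with ten 10-VCPU guests, nothing else about the scheduler:

    # Worked numbers for the example above (a sketch, not a benchmark).
    awk 'BEGIN {
        slice = 10 * 10 * 1024;   # machine.slice = sum of VM shares = 102400
        task  = 1024;             # one default host task
        printf "host task share of CPU:          ~%.1f%%\n", 100 * task / (task + slice);
        printf "naive 1-of-11-tasks expectation: ~%.1f%%\n", 100 / 11;
    }'
    # prints roughly 1.0% vs 9.1%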

On Thu, Aug 14, 2014 at 01:55:11PM -0400, Andrew Theurer wrote:
[...]
Could we have each VM's shares be nr_vcpu * 1024, and the share for $CG/machine.slice be the sum of all VMs' shares?
Realistically libvirt can't change what it does by default for VMs with regard to this cgroups setting, because it would cause an immediate functional change for anyone who has deployed current libvirt versions and upgrades. Management apps like oVirt or OpenStack should explicitly set the policy they desire in this respect.

Regards,
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
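For completeness, a sketch of what setting that policy explicitly can look like with the tunables libvirt already exposes; the domain name and the nr_vcpu * 1024 weighting below are just the proposal from this thread, not anything libvirt does on its own:

    # Sketch: give one VM an explicit weight instead of relying on the 1024 default.
    dom=myguest     # hypothetical domain name
    vcpus=4         # assumed VCPU count of that domain
    virsh schedinfo "$dom" --set cpu_shares=$((vcpus * 1024))
    # The same value can be made persistent in the domain XML:
    #   <cputune><shares>4096</shares></cputune>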

On Thu, Aug 14, 2014 at 01:55:11PM -0400, Andrew Theurer wrote:
[...]
Could we have each VM's shares be nr_vcpu * 1024, and the share for $CG/machine.slice be the sum of all VMs' shares?
Realistically libvirt can't change what it does by default for VMs with regard to this cgroups setting, because it would cause an immediate functional change for anyone who has deployed current libvirt versions and upgrades.
Is this another way of saying, "we have already set a bad precedent, so we need to keep it"? I am concerned that anyone who may be experiencing this problem may be unsure of what is causing it, and is not aware of how to fix it.
Management apps like oVirt or OpenStack should explicitly set the policy they desire in this respect.
Shouldn't a user or upper level mgmt have some expectation of sane defaults? A user or mgmt app has already specified a preference in the number of vcpus - shouldn't that be enough? Why does this fix need to be pushed to multiple upper layers when it can be remedied in just one (libvirt)? Honestly, I don't understand how this even got out the way it is. -Andrew

On Fri, Aug 15, 2014 at 09:23:35AM -0400, Andrew Theurer wrote:
On Thu, Aug 14, 2014 at 01:55:11PM -0400, Andrew Theurer wrote:
[...]
Realistically libvirt can't change what it does by default for VMs with regard to this cgroups setting, because it would cause an immediate functional change for anyone who has deployed current libvirt versions and upgrades.
Is this another way of saying, "we have already set a bad precedent, so we need to keep it"? I am concerned that anyone who may be experiencing this problem may be unsure of what is causing it, and is not aware of how to fix it.
Management apps like oVirt or OpenStack should explicitly set the policy they desire in this respect.
Shouldn't a user or upper level mgmt have some expectation of sane defaults? A user or mgmt app has already specified a preference in the number of vcpus - shouldn't that be enough? Why does this fix need to be pushed to multiple upper layers when it can be remedied in just one (libvirt)? Honestly, I don't understand how this even got out the way it is.
If we hadn't already had this behaviour in libvirt for 3+ years then sure, it would be desirable to change it. At this point though, applications have been exposed to the current semantics for a long time and may have set up usage policies which rely on this. If we change the defaults, we have a non-negligible risk of causing regressions in behaviour for our existing userbase.

Regards,
Daniel

2014-08-15 14:44+0100, Daniel P. Berrange:
On Fri, Aug 15, 2014 at 09:23:35AM -0400, Andrew Theurer wrote:
On Thu, Aug 14, 2014 at 01:55:11PM -0400, Andrew Theurer wrote:
From: "Radim Krčmář" <rkrcmar@redhat.com> It does not seem to be possible to tune shares and get a good general behavior, so the best solution I can see is to disable the cpu cgroup and let users do it when needed. (Keeping all tasks in $CG/tasks.)
Could we have each VM's shares be nr_vcpu * 1024, and the share for $CG/machine.slice be the sum of all VMs' shares?
Realistically libvirt can't change what it does by default for VMs with regard to this cgroups setting, because it would cause an immediate functional change for anyone who has deployed current libvirt versions and upgrades.
Is this another way of saying, "we have already set a bad precedent, so we need to keep it"? I am concerned that anyone who may be experiencing this problem may be unsure of what is causing it, and is not aware of how to fix it.
Management apps like oVirt or OpenStack should explicitly set the policy they desire in this respect.
Shouldn't a user or upper level mgmt have some expectation of sane defaults? A user or mgmt app has already specified a preference in the number of vcpus - shouldn't that be enough? Why does this fix need to be pushed to multiple upper layers when it can be remedied in just one (libvirt)? Honestly, I don't understand how this even got out the way it is.
If we hadn't already had this behaviour in libvirt for 3+ years then sure, it would be desirable to change it. At this point though, applications have been exposed to the current semantics for a long time and may have set up usage policies which rely on this. If we change the defaults, we have a non-negligible risk of causing regressions in behaviour for our existing userbase.
I think that preserving existing behaviour is what (enterprise) distributions are for, while upstream keeps looking forward, so we don't end up with the huge pile that accumulates over the years. And if we are trying to prevent changes, we should be especially wary of adding new features. Well, it depends on the expected lifetime of libvirt.

On Fri, Aug 15, 2014 at 04:13:13PM +0200, Radim Krčmář wrote:
2014-08-15 14:44+0100, Daniel P. Berrange:
On Fri, Aug 15, 2014 at 09:23:35AM -0400, Andrew Theurer wrote:
On Thu, Aug 14, 2014 at 01:55:11PM -0400, Andrew Theurer wrote:
From: "Radim Krčmář" <rkrcmar@redhat.com> It does not seem to be possible to tune shares and get a good general behavior, so the best solution I can see is to disable the cpu cgroup and let users do it when needed. (Keeping all tasks in $CG/tasks.)
Could we have each VM's shares be nr_vcpu * 1024, and the share for $CG/machine.slice be the sum of all VMs' shares?
Realistically libvirt can't change what it does by default for VMs with regard to this cgroups setting, because it would cause an immediate functional change for anyone who has deployed current libvirt versions and upgrades.
Is this another way of saying, "we have already set a bad precedent, so we need to keep it"? I am concerned that anyone who may be experiencing this problem may be unsure of what is causing it, and is not aware of how to fix it.
Management apps like oVirt or OpenStack should explicitly set the policy they desire in this respect.
Shouldn't a user or upper level mgmt have some expectation of sane defaults? A user or mgmt app has already specified a preference in the number of vcpus - shouldn't that be enough? Why does this fix need to be pushed to multiple upper layers when it can be remedied in just one (libvirt)? Honestly, I don't understand how this even got out the way it is.
If we hadn't already had this behaviour in libvirt for 3+ years then sure, it would be desirable to change it. At this point though, applications have been exposed to the current semantics for a long time and may have set up usage policies which rely on this. If we change the defaults, we have a non-negligible risk of causing regressions in behaviour for our existing userbase.
I think that preserving existing behaviour is what (enterprise) distributions are for, while upstream keeps looking forward, so we don't end up with the huge pile that accumulates over the years.
It is not merely a concern of enterprise distro maintainers. It is a general libvirt project goal to try to avoid changes that will cause regressions for our downstream applications, unless the application was relying on what was a bug. Also note that distros will often rebase to newer libvirt versions during their lifetime, so changes will make their way into the enterprise distros if we make them upstream.

Regards,
Daniel

On Thu, Aug 14, 2014 at 04:25:05PM +0200, Radim Krčmář wrote:
Hello,
by default, libvirt with KVM creates a Cgroup hierarchy in 'cpu,cpuacct' [1], with 'shares' set to 1024 on every level. This raises two points:
1) Every VM is given an equal amount of CPU time. [2] ($CG/machine.slice/*/shares = 1024)
Which means that smaller / less loaded guests are given an advantage.
This is a default with which we do nothing unless the user (or mgmt app) wants to. What you say is true only when there is no spare time (the machines need more time than available). Such overcommit is the problem of the user, I'd say.
2) All VMs combined are given 1024 shares. [3] ($CG/machine.slice/shares)
This is a problem even on systems without slices (systemd), because there is /machine/cpu.shares == 1024 anyway. Is there a way to disable hierarchy in this case (to say cpu.shares=-1 for example)? Because if not, then it has only limited use (we cannot prepare the hierarchy and just write a number into some file when we want to start using it). That's a pity, but there are probably fewer use cases than the hundreds of lines of code that would need to be changed in order to support this in the kernel.
This is made even worse on RHEL7, by sched_autogroup_enabled = 0, so every other process in the system is given the same amount of CPU as all VMs combined.
But sched_autogroup_enabled = 1 wouldn't make it much better, since it would group the machines together anyway, right?
It does not seem to be possible to tune shares and get a good general behavior, so the best solution I can see is to disable the cpu cgroup and let users do it when needed. (Keeping all tasks in $CG/tasks.)
I agree with you that it's not the best default we could provide, and maybe not using cgroups until needed would bring us a real benefit. That would be for cgroups like cpu and blkio only, I think.
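Also a note on the "not using cgroups until needed" idea: if I remember the qemu driver config correctly, an admin can already limit which controllers libvirt touches through the cgroup_controllers list in /etc/libvirt/qemu.conf. A hedged sketch, to be checked against the local qemu.conf before relying on it:

    # Sketch: keep libvirt away from the cpu controller by leaving "cpu" out of
    # cgroup_controllers in /etc/libvirt/qemu.conf, then restart libvirtd.
    grep -n 'cgroup_controllers' /etc/libvirt/qemu.conf
    #   e.g. cgroup_controllers = [ "devices", "memory", "blkio", "cpuacct" ]
    sudo systemctl restart libvirtd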
[...]

2014-08-15 10:50+0200, Martin Kletzander:
On Thu, Aug 14, 2014 at 04:25:05PM +0200, Radim Krčmář wrote:
Hello,
by default, libvirt with KVM creates a Cgroup hierarchy in 'cpu,cpuacct' [1], with 'shares' set to 1024 on every level. This raises two points:
1) Every VM is given an equal amount of CPU time. [2] ($CG/machine.slice/*/shares = 1024)
Which means that smaller / less loaded guests are given an advantage.
This is a default with which we do nothing unless the user (or mgmt app) wants to.
(I'd argue that the default is to do nothing at all ;)
What you say is true only when there is no spare time (the machines need more time than available). Such overcommit is the problem of the user, I'd say.
I don't like that it breaks the assumption that a VCPU behaves like a task. (Complicated systems are hard to operate without consistency, and our behavior really punishes users who don't read everything.)
2) All VMs combined are given 1024 shares. [3] ($CG/machine.slice/shares)
This is a problem even on systems without slices (systemd), because there is /machine/cpu.shares == 1024 anyway.
(Thanks, I hadn't noticed this because of my professionally deformed userspace choices.)
Is there a way to disable hierarchy in this case (to say cpu.shares=-1 for example)?
Apart from the obvious "don't create what you don't want", probably not; cpu.shares is clamped between 2 and 2^18.
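A quick sketch of that clamping, for anyone wondering why there is no neutral "off" value to write into an existing group (cgroup v1 paths as in [1], run as root):

    # Sketch: cpu.shares only accepts weights in the range [2, 262144].
    CG=/sys/fs/cgroup/cpu,cpuacct
    echo 262144 > "$CG/machine.slice/cpu.shares"   # the maximum weight (2^18)
    cat "$CG/machine.slice/cpu.shares"             # -> 262144
    echo 1 > "$CG/machine.slice/cpu.shares"        # below the minimum ...
    cat "$CG/machine.slice/cpu.shares"             # -> 2 (clamped), never "disabled"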
Because if not, then it has only limited use (we cannot prepare the hierarchy and just write a number into some file when we want to start using it). That's a pity, but there are probably fewer use cases than the hundreds of lines of code that would need to be changed in order to support this in the kernel.
And the hierarchy imposes a performance degradation as well, so the developers probably never expected we'd create useless cgroups. (It should be proportional to the depth => having {emulator,vcpu*} by default is counterproductive as well.) Creating the hierarchy on demand is not much harder than writing a value, especially if we do it through libvirt anyway.

A version of your proposal would extend cgroups with something like categorization: we could add an "effective control group" variable that allows the scheduler code to start at a point higher in the hierarchy. Libvirt could continue doing what it does now and performance would improve without creating too many special cases. I can already see the flames on LKML.
This is made even worse on RHEL7, by sched_autogroup_enabled = 0, so every other process in the system is given the same amount of CPU as all VMs combined.
But sched_autogroup_enabled = 1 wouldn't make it much better, since it would group the machines together anyway, right?
Yes, it would be just a bit better for VMs, because other processes would be grouped as well.
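For completeness, the knob being discussed is the kernel.sched_autogroup_enabled sysctl; a minimal sketch for checking and flipping it (needs root):

    cat /proc/sys/kernel/sched_autogroup_enabled   # 0 on RHEL7 according to this thread
    sysctl -w kernel.sched_autogroup_enabled=1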
It does not seem to be possible to tune shares and get a good general behavior, so the best solution I can see is to disable the cpu cgroup and let users do it when needed. (Keeping all tasks in $CG/tasks.)
I agree with you that it's not the best default we could provide, and maybe not using cgroups until needed would bring us a real benefit. That would be for cgroups like cpu and blkio only, I think.
I haven't delved into the other cgroups much, but it is a good question whether we want them :) Does $feature do something useful on top of complicating things?
Participants (4): Andrew Theurer, Daniel P. Berrange, Martin Kletzander, Radim Krčmář