
2014-08-15 10:50+0200, Martin Kletzander:
On Thu, Aug 14, 2014 at 04:25:05PM +0200, Radim Krčmář wrote:
Hello,
by default, libvirt with KVM creates a cgroup hierarchy in the 'cpu,cpuacct' controller [1], with 'shares' set to 1024 at every level. This raises two points:
1) Every VM is given an equal amount of CPU time. [2] ($CG/machine.slice/*/shares = 1024)
Which means that smaller / less loaded guests are given an advantage.
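For illustration, this is how the defaults look on a running host (a minimal sketch; it assumes the cgroup v1 'cpu,cpuacct' controller is mounted at /sys/fs/cgroup/cpu,cpuacct and that machines are placed under machine.slice, as on a systemd host):

  CG=/sys/fs/cgroup/cpu,cpuacct            # assumed mount point
  # every per-VM group carries the same default weight of 1024,
  # regardless of how many VCPUs it has or how loaded it is
  grep . "$CG"/machine.slice/*/cpu.shares

So under overcommit, e.g. an idle single-VCPU guest and a busy eight-VCPU guest end up with the same weight.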
This is a default with which we do nothing unless the user (or mgmt app) wants to.
(I'd argue that the default is to do nothing at all ;)
What you say is true only when there is no spare time (the machines need more time than is available). Such overcommit is the user's problem, I'd say.
I don't like that it breaks the assumption that a VCPU behaves like a task. (Complicated systems are hard to operate without consistency, and our behavior really punishes users who don't read everything.)
2) All VMs combined are given 1024 shares. [3] ($CG/machine.slice/shares)
This is a problem even on systems without slices (no systemd), because there is /machine/cpu.shares == 1024 anyway.
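Both layouts can be checked directly and show the same value (sketch, paths assumed as above; only one of the two exists on a given host):

  CG=/sys/fs/cgroup/cpu,cpuacct
  cat "$CG"/machine.slice/cpu.shares   # systemd hosts: 1024 for all VMs together
  cat "$CG"/machine/cpu.shares         # non-systemd hosts: 1024 as well
  # for comparison, a single nice-0 task in the root group also has weight 1024,
  # i.e. it competes on equal terms with all VMs combined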
(Thanks, I hadn't noticed this, given my professionally deformed userspace choices.)
Is there a way to disable the hierarchy in this case (by setting cpu.shares=-1, for example)?
Apart from the obvious "don't create what you don't want", probably not; cpu.shares is clamped between 2 and 2^18.
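The clamping is easy to see with a throw-away group (sketch; the 'demo' group name is made up and this needs root):

  CG=/sys/fs/cgroup/cpu,cpuacct
  mkdir "$CG"/demo                       # hypothetical scratch group
  echo 1 > "$CG"/demo/cpu.shares         # below the minimum ...
  cat "$CG"/demo/cpu.shares              # ... reads back as 2
  echo 999999 > "$CG"/demo/cpu.shares    # above the maximum ...
  cat "$CG"/demo/cpu.shares              # ... reads back as 262144 (2^18)
  echo -1 > "$CG"/demo/cpu.shares        # fails, the file only accepts an unsigned value
  rmdir "$CG"/demo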
Because if not, then it has only limited use (we cannot prepare the hierarchy and just write a number into some file when we want to start using it). That's a pity, but there are probably fewer use cases than the hundreds of lines of kernel code that would need to change to support this.
And a hierarchy imposes a performance penalty as well, so the developers probably never expected that we would create useless cgroups. (The penalty should be proportional to depth, so creating {emulator,vcpu*} by default is counterproductive too.)

Creating the hierarchy on demand is not much harder than writing a value, especially if we do it through libvirt anyway. A variant of your proposal would extend cgroups with something like categorization: we could add an "effective control group" knob that lets the scheduler start at a point higher in the hierarchy. Libvirt could keep doing what it does now and performance would improve, without creating too many special cases. I can already see the flames on LKML.
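To put "not much harder" into perspective, setting up a per-VM group on demand is only a few writes (sketch; the machine-foo.scope name and $QEMU_PID are hypothetical, and libvirt would normally do this for us):

  CG=/sys/fs/cgroup/cpu,cpuacct
  mkdir -p "$CG"/machine.slice/machine-foo.scope                  # made-up per-VM group
  echo 2048 > "$CG"/machine.slice/machine-foo.scope/cpu.shares
  echo "$QEMU_PID" > "$CG"/machine.slice/machine-foo.scope/tasks  # each write to 'tasks' moves one thread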
This is made even worse on RHEL7, where sched_autogroup_enabled = 0, so every other process in the system is given as much CPU as all VMs combined.
But sched_autogroup_enabled = 1 wouldn't make it much better, since it would group the machines together anyway, right?
Yes, it would be just a bit better for VMs, because other processes would be grouped as well.
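For reference, the knob can be inspected and flipped like this (sketch; the second command needs root):

  cat /proc/sys/kernel/sched_autogroup_enabled   # 0 on the RHEL7 hosts described above
  sysctl -w kernel.sched_autogroup_enabled=1     # groups other sessions too, which only
                                                 # softens the imbalance, it does not fix it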
It does not seem possible to tune shares and get good general behavior, so the best solution I can see is not to use the cpu cgroup by default and to let users set it up when needed. (Keeping all tasks in $CG/tasks.)
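Done by hand, "keeping all tasks in $CG/tasks" would amount to something like this (sketch; assumes the v1 layout from above, and libvirt's emulator/vcpu* child groups would need the same treatment):

  CG=/sys/fs/cgroup/cpu,cpuacct
  # move every VM thread back to the root group, so plain per-task
  # weights apply again instead of the per-group 1024 shares
  for f in "$CG"/machine.slice/*/tasks; do
      for tid in $(cat "$f"); do
          echo "$tid" > "$CG"/tasks
      done
  done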
I agree with you that it's not the best default we could have, and maybe not using cgroups until they are needed would be a real benefit. That applies to cgroups like cpu and blkio only, I think.
I haven't delved into the other cgroups much, but it is a good question whether we want them at all :) Does $feature do anything useful on top of complicating things?