2014-08-15 10:50+0200, Martin Kletzander:
On Thu, Aug 14, 2014 at 04:25:05PM +0200, Radim Krčmář wrote:
>Hello,
>
>by default, libvirt with KVM creates a Cgroup hierarchy in 'cpu,cpuacct'
>[1], with 'shares' set to 1024 on every level. This raises two points:
>
>1) Every VM is given an equal amount of CPU time. [2]
> ($CG/machine.slice/*/shares = 1024)
>
> Which means that smaller / less loaded guests are given an advantage.
>
This is a default with which we do nothing unless the user (or mgmt
app) wants to.
(I'd argue that the default is to do nothing at all ;)
What you say is true only when there is no spare time (the machines
need more time than available). Such overcommit is the problem of the
user, I'd say.
I don't like that it breaks an assumption that VCPU behaves as a task.
(Complicated systems are hard to operate without consistency and our
behavior is really punishing for users that don't read everything.)
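
To put numbers on point 1, here is a minimal sketch of the proportional
split under full contention. It assumes every vCPU is always runnable and
every VM group carries the same shares. The function and guest names are
made up for illustration, and it models only the arithmetic, not the
scheduler itself.

```python
def per_vcpu_share(vcpus_per_vm, shares=1024):
    """Fraction of contended CPU time that each vCPU of each VM receives.

    Assumption: all vCPUs are busy and every VM cgroup has the same
    'shares' value, so each VM gets an equal slice regardless of size.
    """
    total = shares * len(vcpus_per_vm)
    result = {}
    for name, vcpus in vcpus_per_vm.items():
        vm_fraction = shares / total        # equal slice per VM
        result[name] = vm_fraction / vcpus  # slice is split among its vCPUs
    return result


# Hypothetical guests: a 1-vCPU VM next to an 8-vCPU VM.
guests = {"small": 1, "big": 8}
for name, fraction in per_vcpu_share(guests).items():
    print(f"{name}: {fraction:.4f} of contended CPU time per vCPU")
```

In this model a vCPU of the small guest gets eight times what a vCPU of
the big guest gets, which is the "VCPU does not behave as a task" problem.
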
>2) All VMs combined are given 1024 shares. [3]
> ($CG/machine.slice/shares)
>
This is a problem even on systems without slices (systemd), because
there is /machine/cpu.shares == 1024 anyway.
(Thanks, haven't noticed this on my professionally deformed userspace
choices.)
Is there a way to disable hierarchy in this case (to say
cpu.shares=-1 for example)?
Apart from the obvious "don't create what you don't want", probably
not; cpu.shares are clamped between 2 and 2^18.
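
A minimal sketch of that clamping, to show why no value can mean
"disabled". The bounds are the 2 and 2^18 mentioned above; `clamp_shares`
is a hypothetical helper, and how the kernel treats negative input is not
modelled here.

```python
MIN_SHARES = 2
MAX_SHARES = 1 << 18  # 2^18 = 262144


def clamp_shares(requested):
    """Return the value that would take effect for a requested weight.

    Assumption: 'requested' is a non-negative integer; this sketch does
    not cover how a negative write is handled.
    """
    return max(MIN_SHARES, min(MAX_SHARES, requested))


for value in (0, 1024, 10**9):
    print(value, "->", clamp_shares(value))
```

Every accepted value ends up as a real weight inside the range, so the
group always takes part in the hierarchy once it exists.
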
Because if not, then it has only limited use (we cannot prepare the
hierarchy and just write a number in some file when we want to start
using it). That's a pity, but there are probably fewer use cases than
hundreds of lines of code that would need to be changed in order to
support this in kernel.
And hierarchy imposes performance degradation as well, so developers
probably never expected we'd create useless cgroups.
(Should be proportional to their depth => having {emulator,vcpu*} by
default is counterproductive as well.)
Creating the hierarchy on demand is not much harder than writing a
value, especially if we do it through libvirt anyway.
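
As a rough illustration of "create on demand", the sketch below makes the
group only when somebody asks for non-default shares. It assumes a cgroup
v1 'cpu' controller mounted at a path the caller passes in; both function
names and the example path are hypothetical, and this is not how libvirt
implements it.

```python
import os


def ensure_cpu_group(cpu_root, group, shares=None):
    """Create the cgroup directory if missing and optionally set its weight.

    cpu_root is the mount point of the cpu controller (an assumption,
    e.g. somewhere under /sys/fs/cgroup); group is a relative path.
    """
    path = os.path.join(cpu_root, group)
    os.makedirs(path, exist_ok=True)
    if shares is not None:
        with open(os.path.join(path, "cpu.shares"), "w") as f:
            f.write(str(shares))
    return path


def move_task(group_path, pid):
    """Move one task into the group by writing its id to 'tasks'."""
    with open(os.path.join(group_path, "tasks"), "w") as f:
        f.write(str(pid))


# Usage (needs root and a real cgroup mount, so it is left commented out):
# path = ensure_cpu_group("/sys/fs/cgroup/cpu", "machine/guest1", shares=2048)
# move_task(path, 12345)
```

Until `ensure_cpu_group` is called, the tasks simply stay in the root
group and no extra level is added to the hierarchy.
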
A version of your proposal would extend cgroups with something like
categorization: we could add an "effective control group" variable that
allows scheduler code to start at a point higher in the hierarchy.
Libvirt could continue doing what it does now and performance would
improve without creating too many special cases.
I can see the flame on LKML.
> This is made even worse on RHEL7, by sched_autogroup_enabled = 0, so
> every other process in the system is given the same amount of CPU as
> all VMs combined.
>
But sched_autogroup_enabled = 1 wouldn't make it much better, since it
would group the machines together anyway, right?
Yes, it would be just a bit better for VMs, because other processes
would be grouped as well.
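
A back-of-the-envelope sketch of point 2 without autogroup. It assumes
full contention for the same CPU time, that each busy host process sits in
the root group as a nice-0 task with weight 1024, and that all VMs live
under one group with shares 1024. `machine_group_fraction` is a made-up
name.

```python
def machine_group_fraction(busy_host_tasks, group_shares=1024, task_weight=1024):
    """Fraction of contended CPU time that all VMs combined receive.

    The whole machine group competes as a single entity against every
    busy task that lives directly in the root group.
    """
    total = group_shares + busy_host_tasks * task_weight
    return group_shares / total


for n in (1, 3, 9):
    print(f"{n} busy host task(s): VMs combined get {machine_group_fraction(n):.2f}")
```

The number of VMs does not appear in the formula at all: ten guests share
the same slice that one guest would get.
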
>It does not seem to be possible to tune shares and get a good general
>behavior, so the best solution I can see is to disable the cpu cgroup
>and let users do it when needed. (Keeping all tasks in $CG/tasks.)
>
I agree with you that it's not the best default scenario we can do,
and maybe not using cgroups until needed would bring us a good
benefit. That is for cgroups like cpu and blkio only, I think.
I haven't delved into other cgroups much, but there is a good question
whether we want them :)
Does $feature do something useful on top of complicating things?