On Thu, 14 Jan 2016 11:57:44 +0000
"Daniel P. Berrange" <berrange(a)redhat.com> wrote:
On Wed, Jan 13, 2016 at 07:29:46AM -0500, John Ferlan wrote:
> Reposting my cgroup fixes series:
>
>
http://www.redhat.com/archives/libvir-list/2016-January/msg00236.html
>
> partially because I originally forgot to CC the author (Henning
> Schild) of the original series for which these patch fix a couple
> of issues discovered during regression testing (virt-test memtune
> failures in Red Hat regression environment), but also to bring them
> up to date with the top of libvirt git.
>
> NB: I did send Henning the changes after the fact, but my resend
> using the same message-id skills so that replies are left in the
> onlist series are lacking. Henning has looked at the first patch -
> with a response here:
>
>
http://www.redhat.com/archives/libvir-list/2016-January/msg00443.html
>
> Finally, I think these changes should go into 1.3.1 since that's
> when the regression was introduced.
Since this has been puzzelling us for a while, let me recap on the
cgroup setup in general.
First, I'll describe how it used to work *before* Henning's patches
were merged, on a systemd based host.
- The QEMU driver forks a child process, but does *not* exec QEMU yet
The cgroup placement at this point is inherited from libvirtd. It
may look like this:
10:freezer:/
9:cpuset:/
8:perf_event:/
7:hugetlb:/
6:blkio:/system.slice
5:memory:/system.slice
4:net_cls,net_prio:/
3:devices:/system.slice/libvirtd.service
2:cpu,cpuacct:/system.slice
1:name=systemd:/system.slice/libvirtd.service
- The QEMU driver calls virCgroupNewMachine()
- We calll virSystemdCreateMachine with pidleader=$child
- Systemd creates the initial machine scope unit under
the machine slice unit, for the "systemd" controller.
It may also add the PID to *zero* or more other
resource controllers. So at this point the cgroup
placement may look like this:
10:freezer:/
9:cpuset:/
8:perf_event:/
7:hugetlb:/
6:blkio:/
5:memory:/
4:net_cls,net_prio:/
3:devices:/
2:cpu,cpuacct:/
1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
Or may look like this:
10:freezer:/machine.slice/machine-qemu\x2dserial.scope
9:cpuset:/machine.slice/machine-qemu\x2dserial.scope
8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
6:blkio:/machine.slice/machine-qemu\x2dserial.scope
5:memory:/machine.slice/machine-qemu\x2dserial.scope
4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
3:devices:/machine.slice/machine-qemu\x2dserial.scope
2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope
1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
Or anywhere in between. We have *ZERO* guarantee about
what other resource controllers we may have been placed in by
systemd. There is some fairly complex logic that
determines this, based on what other tasks current exist in sibling
cgroups, and what tasks have *previously* existed in the
cgroups. IOW, you should consider the list of etra
resource controllers essentially non-deterministic
- We call virCgroupAddTask with pid=$child
This places the pid in any resource controllers we need, which
systemd has not already setup. IOW, it guarantees that we now
have placement that should look like this, regardless of what
systemd has done:
10:freezer:/machine.slice/machine-qemu\x2dserial.scope
9:cpuset:/machine.slice/machine-qemu\x2dserial.scope
8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
6:blkio:/machine.slice/machine-qemu\x2dserial.scope
5:memory:/machine.slice/machine-qemu\x2dserial.scope
4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
3:devices:/machine.slice/machine-qemu\x2dserial.scope
2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope
1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
- The QEMU driver now lets the child process exec QEMU. QEMU creates
its vCPU threads at this point. All QEMU threads (emulator, vcpu
and I/O threads) now have the cgroup placement shown above.
- We create the emulator cgroup for the cpuset, cpu, cpuacct
controllers move all threads into this new cgroup. All threads
(emulator, vcpu and I/O threads) thus now have placement of:
10:freezer:/machine.slice/machine-qemu\x2dserial.scope
9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/emulator
8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
6:blkio:/machine.slice/machine-qemu\x2dserial.scope
5:memory:/machine.slice/machine-qemu\x2dserial.scope
4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
3:devices:/machine.slice/machine-qemu\x2dserial.scope
2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/emulator
1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
Yes, we really did move the vcpu threads into the emulator group...
- We now ask QEMU which are the vCPU & I/O threads.
- Foreach CPU thread we new vCPU cgroups and move them into this
place
10:freezer:/machine.slice/machine-qemu\x2dserial.scope
9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/vcpuN
8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
6:blkio:/machine.slice/machine-qemu\x2dserial.scope
5:memory:/machine.slice/machine-qemu\x2dserial.scope
4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
3:devices:/machine.slice/machine-qemu\x2dserial.scope
2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/vpuN
1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
- Foreach I/O thread we new vCPU cgroups and move them into this
place
10:freezer:/machine.slice/machine-qemu\x2dserial.scope
9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/iothreadN
8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
6:blkio:/machine.slice/machine-qemu\x2dserial.scope
5:memory:/machine.slice/machine-qemu\x2dserial.scope
4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
3:devices:/machine.slice/machine-qemu\x2dserial.scope
2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/iothreadN
1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
Now, lets Henning's three patches
commit 71ce475967b3523882b0e1a0f649fdbea5abf9d5
Author: Henning Schild <henning.schild(a)siemens.com>
Date: Wed Dec 9 17:58:10 2015 +0100
util: cgroups do not implicitly add task to new machine cgroup
This alters virCgroupNewMachine so that it no longer calls
virCgroupAddTask. instead the callers are responsible for doing it.
This is a functional no-opp change, so ostensibly harmless at this
point.
The second patch
commit a41c00b472efaa192d2deae51ab732e65903238f
Author: Henning Schild <henning.schild(a)siemens.com>
Date: Mon Dec 14 15:48:05 2015 -0500
qemu: do not put a task into machine cgroup
This stops qemuInitCgroup from calling virCgroupAddTask. Instead it
has added a call to vriCgroupAddTask in qemuSetupCgroupForEmulator(),
which only affects the cpu, cpuset & cpuacct controllers.
This means we no longer have any guarantee about which resource
controllers the QEMU processes in general are in. All we can say is
that they are in the 'systemd' controller, and the cpu, cpuset &
cpuacct controllers. Systemd may have also placed it into *all* the
other resource controllers, or it may have placed it into none of the
them.
As such, after this change, we are potentially not in the correct
cgroup for the memory, blkio, netcls, devices controllers.
For added fun, the qemuSetupCgroupForEmulator() still has a call to
virCgroupMoveTask, so the addition of virCgroupAddTask to this method
was useless.
The final patch
commit 90b721e43ec9232b5b218e891437bed04548e841
Author: Henning Schild <henning.schild(a)siemens.com>
Date: Mon Dec 14 15:58:05 2015 -0500
qemu cgroups: move new threads to new cgroup after cpuset is set
up
This delays the point at which we call virCgroupAddTask for the
vCPU and I/O thread cgroups, until after we have configured properties
on those cgroups. This change is fine
So I think we need to revert the first 2 of Hennings patches - the 2nd
one is the real serious problem, but once we revert that, the 1st
change becomes pointless and so should also be reverted.
Going back to the key problem Henning was trying to address....
The issue is that we only setup the emulator cgroup /after/ QEMU has
already started running, so there's a period of time where threads are
not confined by the cgroup affinity.
I think we can solve that more simply by just moving the call to
qemuSetupCgroupForEmulator(), into qemuSetupCgroup(). That way we
place the emulator thread into the correct cgroup *before* we even
exec QEMU. So we never have the window where we run in the unconfined
cpuset controller.
Thanks for that very thorough analysis and description, very helpful
indeed! I fully agree with your conclusion and suggestion on how the
code should be changed.
Henning