On Wed, Jan 13, 2016 at 07:29:46AM -0500, John Ferlan wrote:
Reposting my cgroup fixes series:
http://www.redhat.com/archives/libvir-list/2016-January/msg00236.html
partially because I originally forgot to CC the author (Henning Schild)
of the original series for which these patch fix a couple of issues
discovered during regression testing (virt-test memtune failures in
Red Hat regression environment), but also to bring them up to date
with the top of libvirt git.
NB: I did send Henning the changes after the fact, but my resend using
the same message-id skills so that replies are left in the onlist series
are lacking. Henning has looked at the first patch - with a response here:
http://www.redhat.com/archives/libvir-list/2016-January/msg00443.html
Finally, I think these changes should go into 1.3.1 since that's when the
regression was introduced.
Since this has been puzzelling us for a while, let me recap on the
cgroup setup in general.
First, I'll describe how it used to work *before* Henning's patches
were merged, on a systemd based host.
- The QEMU driver forks a child process, but does *not* exec QEMU yet
The cgroup placement at this point is inherited from libvirtd. It
may look like this:
10:freezer:/
9:cpuset:/
8:perf_event:/
7:hugetlb:/
6:blkio:/system.slice
5:memory:/system.slice
4:net_cls,net_prio:/
3:devices:/system.slice/libvirtd.service
2:cpu,cpuacct:/system.slice
1:name=systemd:/system.slice/libvirtd.service
- The QEMU driver calls virCgroupNewMachine()
- We calll virSystemdCreateMachine with pidleader=$child
- Systemd creates the initial machine scope unit under
the machine slice unit, for the "systemd" controller.
It may also add the PID to *zero* or more other
resource controllers. So at this point the cgroup
placement may look like this:
10:freezer:/
9:cpuset:/
8:perf_event:/
7:hugetlb:/
6:blkio:/
5:memory:/
4:net_cls,net_prio:/
3:devices:/
2:cpu,cpuacct:/
1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
Or may look like this:
10:freezer:/machine.slice/machine-qemu\x2dserial.scope
9:cpuset:/machine.slice/machine-qemu\x2dserial.scope
8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
6:blkio:/machine.slice/machine-qemu\x2dserial.scope
5:memory:/machine.slice/machine-qemu\x2dserial.scope
4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
3:devices:/machine.slice/machine-qemu\x2dserial.scope
2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope
1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
Or anywhere in between. We have *ZERO* guarantee about what
other resource controllers we may have been placed in by
systemd. There is some fairly complex logic that determines
this, based on what other tasks current exist in sibling
cgroups, and what tasks have *previously* existed in the
cgroups. IOW, you should consider the list of etra resource
controllers essentially non-deterministic
- We call virCgroupAddTask with pid=$child
This places the pid in any resource controllers we need, which
systemd has not already setup. IOW, it guarantees that we now
have placement that should look like this, regardless of what
systemd has done:
10:freezer:/machine.slice/machine-qemu\x2dserial.scope
9:cpuset:/machine.slice/machine-qemu\x2dserial.scope
8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
6:blkio:/machine.slice/machine-qemu\x2dserial.scope
5:memory:/machine.slice/machine-qemu\x2dserial.scope
4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
3:devices:/machine.slice/machine-qemu\x2dserial.scope
2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope
1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
- The QEMU driver now lets the child process exec QEMU. QEMU creates
its vCPU threads at this point. All QEMU threads (emulator, vcpu
and I/O threads) now have the cgroup placement shown above.
- We create the emulator cgroup for the cpuset, cpu, cpuacct controllers
move all threads into this new cgroup. All threads (emulator, vcpu
and I/O threads) thus now have placement of:
10:freezer:/machine.slice/machine-qemu\x2dserial.scope
9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/emulator
8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
6:blkio:/machine.slice/machine-qemu\x2dserial.scope
5:memory:/machine.slice/machine-qemu\x2dserial.scope
4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
3:devices:/machine.slice/machine-qemu\x2dserial.scope
2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/emulator
1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
Yes, we really did move the vcpu threads into the emulator group...
- We now ask QEMU which are the vCPU & I/O threads.
- Foreach CPU thread we new vCPU cgroups and move them into this
place
10:freezer:/machine.slice/machine-qemu\x2dserial.scope
9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/vcpuN
8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
6:blkio:/machine.slice/machine-qemu\x2dserial.scope
5:memory:/machine.slice/machine-qemu\x2dserial.scope
4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
3:devices:/machine.slice/machine-qemu\x2dserial.scope
2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/vpuN
1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
- Foreach I/O thread we new vCPU cgroups and move them into this
place
10:freezer:/machine.slice/machine-qemu\x2dserial.scope
9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/iothreadN
8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
6:blkio:/machine.slice/machine-qemu\x2dserial.scope
5:memory:/machine.slice/machine-qemu\x2dserial.scope
4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
3:devices:/machine.slice/machine-qemu\x2dserial.scope
2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/iothreadN
1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
Now, lets Henning's three patches
commit 71ce475967b3523882b0e1a0f649fdbea5abf9d5
Author: Henning Schild <henning.schild(a)siemens.com>
Date: Wed Dec 9 17:58:10 2015 +0100
util: cgroups do not implicitly add task to new machine cgroup
This alters virCgroupNewMachine so that it no longer calls virCgroupAddTask.
instead the callers are responsible for doing it. This is a functional
no-opp change, so ostensibly harmless at this point.
The second patch
commit a41c00b472efaa192d2deae51ab732e65903238f
Author: Henning Schild <henning.schild(a)siemens.com>
Date: Mon Dec 14 15:48:05 2015 -0500
qemu: do not put a task into machine cgroup
This stops qemuInitCgroup from calling virCgroupAddTask. Instead it has
added a call to vriCgroupAddTask in qemuSetupCgroupForEmulator(), which
only affects the cpu, cpuset & cpuacct controllers.
This means we no longer have any guarantee about which resource controllers
the QEMU processes in general are in. All we can say is that they are in the
'systemd' controller, and the cpu, cpuset & cpuacct controllers. Systemd may
have also placed it into *all* the other resource controllers, or it may have
placed it into none of the them.
As such, after this change, we are potentially not in the correct cgroup for
the memory, blkio, netcls, devices controllers.
For added fun, the qemuSetupCgroupForEmulator() still has a call to
virCgroupMoveTask, so the addition of virCgroupAddTask to this method
was useless.
The final patch
commit 90b721e43ec9232b5b218e891437bed04548e841
Author: Henning Schild <henning.schild(a)siemens.com>
Date: Mon Dec 14 15:58:05 2015 -0500
qemu cgroups: move new threads to new cgroup after cpuset is set up
This delays the point at which we call virCgroupAddTask for the
vCPU and I/O thread cgroups, until after we have configured properties
on those cgroups. This change is fine
So I think we need to revert the first 2 of Hennings patches - the 2nd
one is the real serious problem, but once we revert that, the 1st change
becomes pointless and so should also be reverted.
Going back to the key problem Henning was trying to address....
The issue is that we only setup the emulator cgroup /after/ QEMU has
already started running, so there's a period of time where threads are
not confined by the cgroup affinity.
I think we can solve that more simply by just moving the call to
qemuSetupCgroupForEmulator(), into qemuSetupCgroup(). That way we
place the emulator thread into the correct cgroup *before* we even
exec QEMU. So we never have the window where we run in the unconfined
cpuset controller.
Regards,
Daniel
--
|:
http://berrange.com -o-
http://www.flickr.com/photos/dberrange/ :|
|:
http://libvirt.org -o-
http://virt-manager.org :|
|:
http://autobuild.org -o-
http://search.cpan.org/~danberr/ :|
|:
http://entangle-photo.org -o-
http://live.gnome.org/gtk-vnc :|