On Thu, Jan 14, 2016 at 11:57:44AM +0000, Daniel P. Berrange wrote:
Since this has been puzzling us for a while, let me recap on the
cgroup setup in general.
First, I'll describe how it used to work *before* Henning's patches
were merged, on a systemd based host.
- The QEMU driver forks a child process, but does *not* exec QEMU yet.
The cgroup placement at this point is inherited from libvirtd. It
may look like this:
10:freezer:/
9:cpuset:/
8:perf_event:/
7:hugetlb:/
6:blkio:/system.slice
5:memory:/system.slice
4:net_cls,net_prio:/
3:devices:/system.slice/libvirtd.service
2:cpu,cpuacct:/system.slice
1:name=systemd:/system.slice/libvirtd.service
- The QEMU driver calls virCgroupNewMachine()
- We call virSystemdCreateMachine with pidleader=$child
- Systemd creates the initial machine scope unit under
the machine slice unit, for the "systemd" controller.
It may also add the PID to *zero* or more other
resource controllers. So at this point the cgroup
placement may look like this:
10:freezer:/
9:cpuset:/
8:perf_event:/
7:hugetlb:/
6:blkio:/
5:memory:/
4:net_cls,net_prio:/
3:devices:/
2:cpu,cpuacct:/
1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
Or may look like this:
10:freezer:/machine.slice/machine-qemu\x2dserial.scope
9:cpuset:/machine.slice/machine-qemu\x2dserial.scope
8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
6:blkio:/machine.slice/machine-qemu\x2dserial.scope
5:memory:/machine.slice/machine-qemu\x2dserial.scope
4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
3:devices:/machine.slice/machine-qemu\x2dserial.scope
2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope
1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
Or anywhere in between. We have *ZERO* guarantee about what
other resource controllers we may have been placed in by
systemd. There is some fairly complex logic that determines
this, based on what other tasks currently exist in sibling
cgroups, and what tasks have *previously* existed in the
cgroups. IOW, you should consider the list of extra resource
controllers essentially non-deterministic.
- We call virCgroupAddTask with pid=$child
This places the pid in any resource controllers we need, which
systemd has not already set up. IOW, it guarantees that we now
have placement that should look like this, regardless of what
systemd has done:
10:freezer:/machine.slice/machine-qemu\x2dserial.scope
9:cpuset:/machine.slice/machine-qemu\x2dserial.scope
8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
6:blkio:/machine.slice/machine-qemu\x2dserial.scope
5:memory:/machine.slice/machine-qemu\x2dserial.scope
4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
3:devices:/machine.slice/machine-qemu\x2dserial.scope
2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope
1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
- The QEMU driver now lets the child process exec QEMU. QEMU creates
its vCPU threads at this point. All QEMU threads (emulator, vcpu
and I/O threads) now have the cgroup placement shown above.
- We create the emulator cgroup for the cpuset, cpu and cpuacct
controllers and move all threads into this new cgroup. All threads
(emulator, vcpu and I/O threads) thus now have placement of:
10:freezer:/machine.slice/machine-qemu\x2dserial.scope
9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/emulator
8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
6:blkio:/machine.slice/machine-qemu\x2dserial.scope
5:memory:/machine.slice/machine-qemu\x2dserial.scope
4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
3:devices:/machine.slice/machine-qemu\x2dserial.scope
2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/emulator
1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
Yes, we really did move the vcpu threads into the emulator group...
- We now ask QEMU which are the vCPU & I/O threads.
- For each vCPU thread we create a new vcpuN cgroup and move the
thread into it:
10:freezer:/machine.slice/machine-qemu\x2dserial.scope
9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/vcpuN
8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
6:blkio:/machine.slice/machine-qemu\x2dserial.scope
5:memory:/machine.slice/machine-qemu\x2dserial.scope
4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
3:devices:/machine.slice/machine-qemu\x2dserial.scope
2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/vcpuN
1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
- For each I/O thread we create a new iothreadN cgroup and move the
thread into it:
10:freezer:/machine.slice/machine-qemu\x2dserial.scope
9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/iothreadN
8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
6:blkio:/machine.slice/machine-qemu\x2dserial.scope
5:memory:/machine.slice/machine-qemu\x2dserial.scope
4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
3:devices:/machine.slice/machine-qemu\x2dserial.scope
2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/iothreadN
1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
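To make the walkthrough above a bit more concrete, here is a minimal Python sketch. It is not libvirt's actual code (which is C), just an illustration of the two mechanics involved: parsing the /proc/<pid>/cgroup format shown in the listings, and moving a single thread into a child group by writing its thread ID to the cgroup v1 "tasks" file. The helper names, the /sys/fs/cgroup mount root and the "cpu,cpuacct" directory name are assumptions for illustration, so check the mounts on your own host.

```python
import os
from typing import Dict

# Assumption: cgroup v1 controllers are mounted under this root.
CGROUP_V1_ROOT = "/sys/fs/cgroup"


def parse_proc_cgroup(text: str) -> Dict[str, str]:
    """Map each controller name to its cgroup path.

    Each input line has the form "hierarchy-ID:controller-list:path",
    where controller-list may name several co-mounted controllers.
    """
    placement: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split at most twice so a ':' inside the path is preserved.
        _hierarchy_id, controllers, path = line.split(":", 2)
        for controller in controllers.split(","):
            placement[controller] = path
    return placement


def move_thread(tid: int, controller_dir: str, scope: str, child: str,
                root: str = CGROUP_V1_ROOT) -> str:
    """Create the child group (hypothetical helper) and move one thread.

    In cgroup v1, writing an ID to "tasks" moves just that thread,
    whereas "cgroup.procs" moves the whole process.
    """
    group = os.path.join(root, controller_dir, scope.lstrip("/"), child)
    os.makedirs(group, exist_ok=True)
    # Not handled here: a new cpuset group needs cpuset.cpus and
    # cpuset.mems populated before it will accept any tasks.
    with open(os.path.join(group, "tasks"), "w") as handle:
        handle.write(f"{tid}\n")
    return group
```

On a real cgroup v1 host this needs root, and a call might look like move_thread(tid, "cpu,cpuacct", r"/machine.slice/machine-qemu\x2dserial.scope", "vcpu0"). Running parse_proc_cgroup() on /proc/<pid>/task/<tid>/cgroup afterwards is one way to confirm where the thread ended up.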
BTW, on a slight tangent, the kernel is throwing a spanner in the
works in the near future. They have just accepted cgroupv2 into
mainline. Broadly speaking this is very nice because they got rid
of the idea of a separate mount point for each controller, and instead
have a single filesystem tree. The problem is that they decided the
granularity of placement is at a *process* level, not a *thread*
level. So it will no longer be possible for us to have the cgroups
for emulator, vcpus & i/o threads. Everything will have to live in
the same cgroup :-( For cpu accounting and cpu affinity I think we
can still achieve what we need by using a combination of cgroups
and sched_setaffinity and /proc. I'm not sure what we'll do about
per-thread scheduler policies for period + quota though - not sure
if there's an API for setting those or not ?!?!
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Docu...
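As a rough illustration of the sched_setaffinity + /proc combination mentioned above, the following Python sketch pins a single thread and reads its accumulated CPU time. It is only a sketch under assumptions: the helper names are made up, it relies on Linux treating the ID passed to sched_setaffinity as a thread ID, and it does nothing about period + quota.

```python
import os
from typing import Set, Tuple


def pin_thread(tid: int, cpus: Set[int]) -> Set[int]:
    """Pin one thread to the given CPUs and return the resulting mask.

    On Linux the ID is a thread ID; 0 means the calling thread.
    """
    os.sched_setaffinity(tid, cpus)
    return os.sched_getaffinity(tid)


def thread_cpu_seconds(pid: int, tid: int,
                       proc_root: str = "/proc") -> Tuple[float, float]:
    """Return (user, system) CPU seconds consumed by one thread."""
    with open(f"{proc_root}/{pid}/task/{tid}/stat") as handle:
        data = handle.read()
    # Field 2 (comm) may contain spaces or parentheses, so split only
    # what follows the last ')'. The remainder starts at field 3.
    fields = data[data.rindex(")") + 2:].split()
    utime_ticks = int(fields[11])  # field 14: utime, in clock ticks
    stime_ticks = int(fields[12])  # field 15: stime, in clock ticks
    ticks_per_second = os.sysconf("SC_CLK_TCK")
    return utime_ticks / ticks_per_second, stime_ticks / ticks_per_second
```

The thread IDs to feed into these helpers can be listed from /proc/<pid>/task.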
Regards,
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|