On Thu, 14 Jan 2016 12:37:18 +0000
"Daniel P. Berrange" <berrange(a)redhat.com> wrote:
On Thu, Jan 14, 2016 at 11:57:44AM +0000, Daniel P. Berrange wrote:
> Since this has been puzzling us for a while, let me recap on the
> cgroup setup in general.
>
> First, I'll describe how it used to work *before* Henning's patches
> were merged, on a systemd based host.
>
> - The QEMU driver forks a child process, but does *not* exec QEMU
> yet. The cgroup placement at this point is inherited from libvirtd.
> It may look like this:
>
> 10:freezer:/
> 9:cpuset:/
> 8:perf_event:/
> 7:hugetlb:/
> 6:blkio:/system.slice
> 5:memory:/system.slice
> 4:net_cls,net_prio:/
> 3:devices:/system.slice/libvirtd.service
> 2:cpu,cpuacct:/system.slice
> 1:name=systemd:/system.slice/libvirtd.service
>
> - The QEMU driver calls virCgroupNewMachine()
>
> - We call virSystemdCreateMachine with pidleader=$child
>
> - Systemd creates the initial machine scope unit under
> the machine slice unit, for the "systemd" controller.
> It may also add the PID to *zero* or more other
> resource controllers. So at this point the cgroup
> placement may look like this:
>
> 10:freezer:/
> 9:cpuset:/
> 8:perf_event:/
> 7:hugetlb:/
> 6:blkio:/
> 5:memory:/
> 4:net_cls,net_prio:/
> 3:devices:/
> 2:cpu,cpuacct:/
> 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
>
> Or may look like this:
>
> 10:freezer:/machine.slice/machine-qemu\x2dserial.scope
> 9:cpuset:/machine.slice/machine-qemu\x2dserial.scope
> 8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
> 7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
> 6:blkio:/machine.slice/machine-qemu\x2dserial.scope
> 5:memory:/machine.slice/machine-qemu\x2dserial.scope
> 4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
> 3:devices:/machine.slice/machine-qemu\x2dserial.scope
> 2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope
> 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
>
> Or anywhere in between. We have *ZERO* guarantee about
> what other resource controllers we may have been placed in by
> systemd. There is some fairly complex logic that
> determines this, based on what other tasks currently exist in sibling
> cgroups, and what tasks have *previously* existed in
> the cgroups. IOW, you should consider the list of extra resource
> controllers essentially non-deterministic.
>
> - We call virCgroupAddTask with pid=$child
>
> This places the pid in any resource controllers we need,
> which systemd has not already set up. IOW, it guarantees that we now
> have placement that should look like this, regardless of
> what systemd has done:
>
> 10:freezer:/machine.slice/machine-qemu\x2dserial.scope
> 9:cpuset:/machine.slice/machine-qemu\x2dserial.scope
> 8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
> 7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
> 6:blkio:/machine.slice/machine-qemu\x2dserial.scope
> 5:memory:/machine.slice/machine-qemu\x2dserial.scope
> 4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
> 3:devices:/machine.slice/machine-qemu\x2dserial.scope
> 2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope
> 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
>
> - The QEMU driver now lets the child process exec QEMU. QEMU
> creates its vCPU threads at this point. All QEMU threads (emulator,
> vcpu and I/O threads) now have the cgroup placement shown above.
>
> - We create the emulator cgroup for the cpuset, cpu, cpuacct
> controllers and move all threads into this new cgroup. All threads
> (emulator, vcpu and I/O threads) thus now have placement of:
>
> 10:freezer:/machine.slice/machine-qemu\x2dserial.scope
> 9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/emulator
> 8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
> 7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
> 6:blkio:/machine.slice/machine-qemu\x2dserial.scope
> 5:memory:/machine.slice/machine-qemu\x2dserial.scope
> 4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
> 3:devices:/machine.slice/machine-qemu\x2dserial.scope
> 2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/emulator
> 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
>
> Yes, we really did move the vcpu threads into the emulator
> group...
>
> - We now ask QEMU which are the vCPU & I/O threads.
>
> - For each vCPU thread we create a new vCPU cgroup and move the
> thread into it, giving this placement:
>
> 10:freezer:/machine.slice/machine-qemu\x2dserial.scope
> 9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/vcpuN
> 8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
> 7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
> 6:blkio:/machine.slice/machine-qemu\x2dserial.scope
> 5:memory:/machine.slice/machine-qemu\x2dserial.scope
> 4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
> 3:devices:/machine.slice/machine-qemu\x2dserial.scope
> 2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/vcpuN
> 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
>
> - For each I/O thread we create a new iothread cgroup and move the
> thread into it, giving this placement:
>
> 10:freezer:/machine.slice/machine-qemu\x2dserial.scope
> 9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/iothreadN
> 8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
> 7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
> 6:blkio:/machine.slice/machine-qemu\x2dserial.scope
> 5:memory:/machine.slice/machine-qemu\x2dserial.scope
> 4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
> 3:devices:/machine.slice/machine-qemu\x2dserial.scope
> 2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/iothreadN
> 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
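
To make the recap above easier to check on a live host, here is a rough Python sketch (not libvirt code) that parses the cgroup v1 format of /proc/<pid>/cgroup and reports which controllers are not yet placed in the machine scope, i.e. the ones virCgroupAddTask would still have to deal with after systemd has done its part. The function names and the shortened sample listing are made up for illustration.

```python
def parse_cgroup(text):
    """Parse cgroup v1 /proc/<pid>/cgroup content into {controller: path}."""
    placement = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line is "hierarchy-id:controller[,controller...]:path".
        _, controllers, path = line.split(":", 2)
        for ctrl in controllers.split(","):
            placement[ctrl] = path
    return placement


def controllers_outside(text, scope):
    """Return the sorted controllers whose path is not at or below scope."""
    placement = parse_cgroup(text)
    return sorted(
        ctrl for ctrl, path in placement.items()
        if path != scope and not path.startswith(scope + "/")
    )


# Hypothetical, shortened listing: systemd only set up its own controller.
SCOPE = r"/machine.slice/machine-qemu\x2dserial.scope"
SAMPLE = r"""
5:memory:/
3:devices:/
2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/emulator
1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
"""

print(controllers_outside(SAMPLE, SCOPE))
```

Since the threads end up in different sub-groups, the same parser can be pointed at /proc/<pid>/task/<tid>/cgroup to compare the emulator, vcpuN and iothreadN placement thread by thread.
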
BTW, on a slight tangent, the kernel is throwing a spanner in the
works in the near future. They have just accepted cgroupv2 into
mainline. Broadly speaking this is very nice because they got rid
of the idea of a separate mount point for each controller, and instead
have a single filesystem tree. The problem is that they decided the
granularity of placement is at a *process* level, not a *thread*
level. So it will no longer be possible for us to have the cgroups
for emulator, vcpus & i/o threads. Everything will have to live in
the same cgroup :-( For cpu accounting and cpu affinity I think we
can still achieve what we need by using a combination of cgroups
and sched_setaffinity and /proc. I'm not sure what we'll do about
per-thread scheduler policies for period + quota though - not sure
if there's an API for setting those or not ?!?!
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Docu...
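
As a rough illustration of the sched_setaffinity + /proc idea for the affinity part only, a Python sketch might look like the following. It is not how libvirt does or will do it: the helper names are invented, which TID is a vCPU or I/O thread would still have to come from asking QEMU, and it does nothing for period + quota.

```python
import os


def thread_ids(pid):
    """List the thread IDs of a process by reading /proc/<pid>/task."""
    return sorted(int(name) for name in os.listdir(f"/proc/{pid}/task"))


def pin_threads(cpus_for_tid):
    """Set the CPU affinity of individual threads.

    cpus_for_tid maps a thread ID to an iterable of CPU numbers. On Linux
    the underlying sched_setaffinity() call accepts a thread ID, so this
    works per thread without any per-thread cgroup.
    """
    for tid, cpus in cpus_for_tid.items():
        os.sched_setaffinity(tid, cpus)


if __name__ == "__main__":
    # Harmless demo: re-apply the current mask to every thread of this process.
    current = os.sched_getaffinity(0)
    pin_threads({tid: current for tid in thread_ids(os.getpid())})
    print(thread_ids(os.getpid()), sorted(current))
```
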
Good to know. Do you have that on the agenda for libvirt? I guess
eventually v1 will get deprecated...