On Thu, 14 Jan 2016 12:37:18 +0000
"Daniel P. Berrange" <berrange(a)redhat.com> wrote:
> On Thu, Jan 14, 2016 at 11:57:44AM +0000, Daniel P. Berrange wrote:
> > Since this has been puzzling us for a while, let me recap on the
> > cgroup setup in general.
> >
> > First, I'll describe how it used to work *before* Henning's patches
> > were merged, on a systemd based host.
> >
> > - The QEMU driver forks a child process, but does *not* exec QEMU
> > yet. The cgroup placement at this point is inherited from libvirtd.
> > It may look like this:
> >
> > 10:freezer:/
> > 9:cpuset:/
> > 8:perf_event:/
> > 7:hugetlb:/
> > 6:blkio:/system.slice
> > 5:memory:/system.slice
> > 4:net_cls,net_prio:/
> > 3:devices:/system.slice/libvirtd.service
> > 2:cpu,cpuacct:/system.slice
> > 1:name=systemd:/system.slice/libvirtd.service
> >
> > - The QEMU driver calls virCgroupNewMachine()
> >
> > - We call virSystemdCreateMachine with pidleader=$child
> >
> > - Systemd creates the initial machine scope unit under
> > the machine slice unit, for the "systemd" controller.
> > It may also add the PID to *zero* or more other
> > resource controllers. So at this point the cgroup
> > placement may look like this:
> >
> > 10:freezer:/
> > 9:cpuset:/
> > 8:perf_event:/
> > 7:hugetlb:/
> > 6:blkio:/
> > 5:memory:/
> > 4:net_cls,net_prio:/
> > 3:devices:/
> > 2:cpu,cpuacct:/
> > 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
> >
> > Or may look like this:
> >
> > 10:freezer:/machine.slice/machine-qemu\x2dserial.scope
> > 9:cpuset:/machine.slice/machine-qemu\x2dserial.scope
> > 8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
> > 7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
> > 6:blkio:/machine.slice/machine-qemu\x2dserial.scope
> > 5:memory:/machine.slice/machine-qemu\x2dserial.scope
> > 4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
> > 3:devices:/machine.slice/machine-qemu\x2dserial.scope
> > 2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope
> > 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
> >
> > Or anywhere in between. We have *ZERO* guarantee about
> > what other resource controllers we may have been placed in by
> > systemd. There is some fairly complex logic that
> > determines this, based on what other tasks currently exist in sibling
> > cgroups, and what tasks have *previously* existed in
> > the cgroups. IOW, you should consider the list of extra resource
> > controllers essentially non-deterministic.
> >
> > - We call virCgroupAddTask with pid=$child
> >
> > This places the pid in any resource controllers we need,
> > which systemd has not already set up. IOW, it guarantees that we now
> > have placement that should look like this, regardless of
> > what systemd has done:
> >
> > 10:freezer:/machine.slice/machine-qemu\x2dserial.scope
> > 9:cpuset:/machine.slice/machine-qemu\x2dserial.scope
> > 8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
> > 7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
> > 6:blkio:/machine.slice/machine-qemu\x2dserial.scope
> > 5:memory:/machine.slice/machine-qemu\x2dserial.scope
> > 4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
> > 3:devices:/machine.slice/machine-qemu\x2dserial.scope
> > 2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope
> > 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
> >
> > - The QEMU driver now lets the child process exec QEMU. QEMU
> > creates its vCPU threads at this point. All QEMU threads (emulator,
> > vcpu and I/O threads) now have the cgroup placement shown above.
> >
> > - We create the emulator cgroup for the cpuset, cpu, cpuacct
> > controllers and move all threads into this new cgroup. All threads
> > (emulator, vcpu and I/O threads) thus now have placement of:
> >
> > 10:freezer:/machine.slice/machine-qemu\x2dserial.scope
> > 9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/emulator
> > 8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
> > 7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
> > 6:blkio:/machine.slice/machine-qemu\x2dserial.scope
> > 5:memory:/machine.slice/machine-qemu\x2dserial.scope
> > 4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
> > 3:devices:/machine.slice/machine-qemu\x2dserial.scope
> > 2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/emulator
> > 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
> >
> > Yes, we really did move the vcpu threads into the emulator
> > group...
> >
> > - We now ask QEMU which are the vCPU & I/O threads.
> >
> > - For each vCPU thread we create a new vCPU cgroup and move the
> > thread into it, giving this placement:
> >
> > 10:freezer:/machine.slice/machine-qemu\x2dserial.scope
> > 9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/vcpuN
> > 8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
> > 7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
> > 6:blkio:/machine.slice/machine-qemu\x2dserial.scope
> > 5:memory:/machine.slice/machine-qemu\x2dserial.scope
> > 4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
> > 3:devices:/machine.slice/machine-qemu\x2dserial.scope
> > 2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/vcpuN
> > 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
> >
> > - For each I/O thread we create a new iothread cgroup and move the
> > thread into it, giving this placement:
> >
> > 10:freezer:/machine.slice/machine-qemu\x2dserial.scope
> > 9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/iothreadN
> > 8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
> > 7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
> > 6:blkio:/machine.slice/machine-qemu\x2dserial.scope
> > 5:memory:/machine.slice/machine-qemu\x2dserial.scope
> > 4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
> > 3:devices:/machine.slice/machine-qemu\x2dserial.scope
> > 2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/iothreadN
> > 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
>
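To make the recap above easier to follow, here is a minimal Python sketch of the bookkeeping involved. It is not libvirt code: the function names (parse_placement, controllers_to_fix, thread_placement) are hypothetical, and the sample text is the first "may look like this" listing after the systemd call. It parses /proc/<pid>/cgroup style lines, reports which controllers systemd left outside the machine scope (the ones virCgroupAddTask would still need to handle), and derives the per-thread placement for the emulator, vcpuN and iothreadN groups.

```python
# Machine scope path used in the listings of this thread.
SCOPE = r"/machine.slice/machine-qemu\x2dserial.scope"

# One possible placement right after systemd created the scope:
# only the name=systemd hierarchy has been set up.
SAMPLE = r"""
10:freezer:/
9:cpuset:/
8:perf_event:/
7:hugetlb:/
6:blkio:/
5:memory:/
4:net_cls,net_prio:/
3:devices:/
2:cpu,cpuacct:/
1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
"""


def parse_placement(text):
    """Parse cgroup v1 /proc/<pid>/cgroup text into {controller-list: path}."""
    placement = {}
    for line in text.strip().splitlines():
        # Each line has the form "<hierarchy-id>:<controller-list>:<path>".
        _, controllers, path = line.strip().split(":", 2)
        placement[controllers] = path
    return placement


def controllers_to_fix(placement, scope):
    """Return the controllers whose placement is not yet the machine scope."""
    return sorted(c for c, path in placement.items() if path != scope)


def thread_placement(placement, scope, name):
    """Placement of a thread moved into sub-cgroup `name`.

    As described in the recap, only cpuset and cpu,cpuacct get a
    per-thread sub-cgroup; every other controller stays at the scope.
    """
    per_thread = {"cpuset", "cpu,cpuacct"}
    return {
        c: (scope + "/" + name if c in per_thread else scope)
        for c in placement
    }
```

With the sample above, controllers_to_fix() lists every controller except name=systemd, and thread_placement(..., "vcpu0") reproduces the shape of the vcpuN listing.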
> BTW, on a slight tangent, the kernel is throwing a spanner in the
> works in the near future. They have just accepted cgroupv2 into
> mainline. Broadly speaking this is very nice because they got rid
> of the idea of separate mount point for each controller, and instead
> have a single filesystem tree. The problem is that they decided the
> granularity of placement is at a *process* level, not a *thread*
> level. So it will no longer be possible for us to have the cgroups
> for emulator, vcpus & i/o threads. Everything will have to live in
> the same cgroup :-( For cpu accounting and cpu affinity I think we
> can still achieve what we need by using a combination of cgroups
> and sched_setaffinity and /proc. I'm not sure what we'll do about
> per-thread scheduler policies for period + quota though - not sure
> if there's an API for setting those or not ?!?!
>
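For the affinity and accounting part, a rough Python sketch of what "sched_setaffinity and /proc" could look like per thread follows. This is only an illustration under assumptions, not a proposal for the libvirt implementation: the helper names are made up, and it does nothing about the period + quota question raised above. It relies on sched_setaffinity(2) acting on a single thread when given a TID, and on the utime/stime fields (14 and 15) of /proc/<pid>/task/<tid>/stat.

```python
import os


def pin_thread(tid, cpus):
    """Restrict one thread, identified by its kernel TID, to the given CPUs."""
    # sched_setaffinity(2) operates on a single thread, so a TID is accepted
    # where the Python API says "pid".
    os.sched_setaffinity(tid, cpus)


def parse_thread_cpu_ticks(stat_line):
    """Return (utime, stime) in clock ticks from a task stat line."""
    # Field 2 (comm) is wrapped in parentheses and may contain spaces,
    # so split after the last ')' before splitting on whitespace.
    fields = stat_line.rsplit(")", 1)[1].split()
    # fields[0] is field 3 (state); utime and stime are fields 14 and 15.
    return int(fields[11]), int(fields[12])


def thread_cpu_seconds(pid, tid):
    """CPU time consumed by one thread, in seconds."""
    with open(f"/proc/{pid}/task/{tid}/stat") as f:
        utime, stime = parse_thread_cpu_ticks(f.read())
    return (utime + stime) / os.sysconf("SC_CLK_TCK")
```

A caller would look up the vCPU and I/O thread IDs (as the QEMU driver already does), call pin_thread(tid, {0, 1}) for each, and poll thread_cpu_seconds(qemu_pid, tid) for accounting.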
>
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Docu...
Good to know. Do you have that on the agenda for libvirt? I guess
eventually v1 will get deprecated...
We'll have no choice but to use cgroupv2 as soon as systemd starts
using it....
Regards,
Daniel