Re: [libvirt] [REPOST 0/4] Adjustment to recent cgroup/cpuset changes (for 1.3.1)

14 Jan 2016

      On Wed, Jan 13, 2016 at 07:29:46AM -0500, John Ferlan wrote:
...
Reposting my cgroup fixes series:
http://www.redhat.com/archives/libvir-list/2016-January/msg00236.html
partially because I originally forgot to CC the author (Henning Schild)
of the original series for which these patch fix a couple of issues
discovered during regression testing (virt-test memtune failures in
Red Hat regression environment), but also to bring them up to date
with the top of libvirt git.
NB: I did send Henning the changes after the fact, but my resend using
the same message-id skills so that replies are left in the onlist series
are lacking.  Henning has looked at the first patch - with a response here:
http://www.redhat.com/archives/libvir-list/2016-January/msg00443.html
Finally, I think these changes should go into 1.3.1 since that's when the
regression was introduced.
Since this has been puzzelling us for a while, let me recap on the
cgroup setup in general.

First, I'll describe how it used to work *before* Henning's patches
were merged, on a systemd based host.

 - The QEMU driver forks a child process, but does *not* exec QEMU yet
   The cgroup placement at this point is inherited from libvirtd. It
   may look like this:

     10:freezer:/
     9:cpuset:/
     8:perf_event:/
     7:hugetlb:/
     6:blkio:/system.slice
     5:memory:/system.slice
     4:net_cls,net_prio:/
     3:devices:/system.slice/libvirtd.service
     2:cpu,cpuacct:/system.slice
     1:name=systemd:/system.slice/libvirtd.service

 - The QEMU driver calls virCgroupNewMachine()

      - We calll virSystemdCreateMachine with pidleader=$child

           - Systemd creates the initial machine scope unit under
	     the machine slice unit, for the "systemd" controller.
	     It may also add the PID to *zero* or more other
	     resource controllers. So at this point the cgroup
	     placement may look like this:

              10:freezer:/
              9:cpuset:/
              8:perf_event:/
              7:hugetlb:/
              6:blkio:/
              5:memory:/
              4:net_cls,net_prio:/
              3:devices:/
              2:cpu,cpuacct:/
              1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope

             Or may look like this:

              10:freezer:/machine.slice/machine-qemu\x2dserial.scope
              9:cpuset:/machine.slice/machine-qemu\x2dserial.scope
              8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
              7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
              6:blkio:/machine.slice/machine-qemu\x2dserial.scope
              5:memory:/machine.slice/machine-qemu\x2dserial.scope
              4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
              3:devices:/machine.slice/machine-qemu\x2dserial.scope
              2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope
              1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope

             Or anywhere in between. We have *ZERO* guarantee about what
	     other resource controllers we may have been placed in by
	     systemd. There is some fairly complex logic that determines
	     this, based on what other tasks current exist in sibling
	     cgroups, and what tasks have *previously* existed in the
	     cgroups. IOW, you should consider the list of etra resource
	     controllers essentially non-deterministic

      - We call virCgroupAddTask with pid=$child

        This places the pid in any resource controllers we need, which
	systemd has not already setup. IOW, it guarantees that we now
	have placement that should look like this, regardless of what
	systemd has done:

              10:freezer:/machine.slice/machine-qemu\x2dserial.scope
              9:cpuset:/machine.slice/machine-qemu\x2dserial.scope
              8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
              7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
              6:blkio:/machine.slice/machine-qemu\x2dserial.scope
              5:memory:/machine.slice/machine-qemu\x2dserial.scope
              4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
              3:devices:/machine.slice/machine-qemu\x2dserial.scope
              2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope
              1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope

 - The QEMU driver now lets the child process exec QEMU. QEMU creates
   its vCPU threads at this point. All QEMU threads (emulator, vcpu
   and I/O threads) now have the cgroup placement shown above.

 - We create the emulator cgroup for the cpuset, cpu, cpuacct controllers
   move all threads into this new cgroup. All threads (emulator, vcpu
   and I/O threads) thus now have placement of:

           10:freezer:/machine.slice/machine-qemu\x2dserial.scope
           9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/emulator
           8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
           7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
           6:blkio:/machine.slice/machine-qemu\x2dserial.scope
           5:memory:/machine.slice/machine-qemu\x2dserial.scope
           4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
           3:devices:/machine.slice/machine-qemu\x2dserial.scope
           2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/emulator
           1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope

   Yes, we really did move the vcpu threads into the emulator group...

 - We now ask QEMU which are the vCPU & I/O threads.

    - Foreach CPU thread we new vCPU cgroups and move them into this
      place

           10:freezer:/machine.slice/machine-qemu\x2dserial.scope
           9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/vcpuN
           8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
           7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
           6:blkio:/machine.slice/machine-qemu\x2dserial.scope
           5:memory:/machine.slice/machine-qemu\x2dserial.scope
           4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
           3:devices:/machine.slice/machine-qemu\x2dserial.scope
           2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/vpuN
           1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope

    - Foreach I/O thread we new vCPU cgroups and move them into this
      place

           10:freezer:/machine.slice/machine-qemu\x2dserial.scope
           9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/iothreadN
           8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
           7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
           6:blkio:/machine.slice/machine-qemu\x2dserial.scope
           5:memory:/machine.slice/machine-qemu\x2dserial.scope
           4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
           3:devices:/machine.slice/machine-qemu\x2dserial.scope
           2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/iothreadN
           1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope

Now, lets Henning's three patches

   commit 71ce475967b3523882b0e1a0f649fdbea5abf9d5
   Author: Henning Schild <henning.schild@siemens.com>
   Date:   Wed Dec 9 17:58:10 2015 +0100

    util: cgroups do not implicitly add task to new machine cgroup

This alters virCgroupNewMachine so that it no longer calls virCgroupAddTask.
instead the callers are responsible for doing it. This is a functional
no-opp change, so ostensibly harmless at this point.

The second patch

  commit a41c00b472efaa192d2deae51ab732e65903238f
  Author: Henning Schild <henning.schild@siemens.com>
  Date:   Mon Dec 14 15:48:05 2015 -0500

    qemu: do not put a task into machine cgroup

This stops qemuInitCgroup from calling virCgroupAddTask. Instead it has
added a call to vriCgroupAddTask in qemuSetupCgroupForEmulator(), which
only affects the cpu, cpuset & cpuacct controllers.

This means we no longer have any guarantee about which resource controllers
the QEMU processes in general are in. All we can say is that they are in the
'systemd' controller, and the cpu, cpuset & cpuacct controllers. Systemd may
have also placed it into *all* the other resource controllers, or it may have
placed it into none of the them.

As such, after this change, we are potentially not in the correct cgroup for
the memory, blkio, netcls, devices controllers.

For added fun, the qemuSetupCgroupForEmulator() still has a call to
virCgroupMoveTask, so the addition of virCgroupAddTask to this method
was useless.

The final patch

  commit 90b721e43ec9232b5b218e891437bed04548e841
  Author: Henning Schild <henning.schild@siemens.com>
  Date:   Mon Dec 14 15:58:05 2015 -0500

    qemu cgroups: move new threads to new cgroup after cpuset is set up

This delays the point at which we call virCgroupAddTask for the
vCPU and I/O thread cgroups, until after we have configured properties
on those cgroups.  This change is fine

So I think we need to revert the first 2 of Hennings patches - the 2nd
one is the real serious problem, but once we revert that, the 1st change
becomes pointless and so should also be reverted.

Going back to the key problem Henning was trying to address....

The issue is that we only setup the emulator cgroup /after/ QEMU has
already started running, so there's a period of time where threads are
not confined by the cgroup affinity.

I think we can solve that more simply by just moving the call to
qemuSetupCgroupForEmulator(), into qemuSetupCgroup(). That way we
place the emulator thread into the correct cgroup *before* we even
exec QEMU. So we never have the window where we run in the unconfined
cpuset controller.

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|

Re: [libvirt] [REPOST 0/4] Adjustment to recent cgroup/cpuset changes (for 1.3.1)

Daniel P. Berrange