Re: [libvirt] [PATCH 4/9] util: cgroups do not implicitly add task to new machine cgroup

26 Feb 2016

On Fri, 26 Feb 2016 12:21:02 +0000
"Daniel P. Berrange" <berrange@redhat.com> wrote:
...
On Fri, Feb 26, 2016 at 01:16:15PM +0100, Henning Schild wrote:
...
On Fri, 26 Feb 2016 11:13:13 +0000
"Daniel P. Berrange" <berrange@redhat.com> wrote:
...
On Tue, Feb 23, 2016 at 04:58:39PM +0100, Henning Schild wrote:
...
virCgroupNewMachine used to add the pidleader to the newly
created machine cgroup. Do not do this implicit anymore.
Signed-off-by: Henning Schild <henning.schild@siemens.com>
---
 src/lxc/lxc_cgroup.c   | 11 +++++++++++
 src/qemu/qemu_cgroup.c | 11 +++++++++++
 src/util/vircgroup.c   | 22 ----------------------
 3 files changed, 22 insertions(+), 22 deletions(-)
NACK to this patch once again.
This does not actually work as you think it does.
...

diff --git a/src/util/vircgroup.c b/src/util/vircgroup.c
index 11f33ab..aef8e8c 100644
--- a/src/util/vircgroup.c
+++ b/src/util/vircgroup.c
@@ -1682,16 +1682,6 @@ virCgroupNewMachineSystemd(const char
*name, }
     }
-    if (virCgroupAddTask(*group, pidleader) < 0) {
-        virErrorPtr saved = virSaveLastError();
-        virCgroupRemove(*group);
-        virCgroupFree(group);
-        if (saved) {
-            virSetError(saved);
-            virFreeError(saved);
-        }
-    }
Just above this we called virSystemdCreateMachine.  Systemd will
create the cgroup and add the pidleader to those cgroups. Systemd
may add the pidleader to just the 'systemd' controller, or it may
add the pidleader to *ALL* controllers. We have no way of knowing.
This virCgroupAddTask call deals with whatever systemd chose not
todo, so we can guarantee consistent behaviour with the pidleader
in all cgroups.
By removing this you make this method non-deterministic - the pid
may or may not be in the cpu controller now. THis is bad because
it can lead to QEMU/LXC driver code working in some cases but
failing in other cases.
Furthermore, this existing does not cause any problems for the
scenario you care about. THis cgroup placement is being set
in between the time libvirtd calls fork() and exec(). With your
later patch 5, we ensure that the PID is moved across into the
emulator cgroup, before we call exec(). When we call exec all
memory mappings will be replaced, so QEMU will stil start with
the correct vCPU placement and memory allocation placement.
I agree having the task in the wrong cgroup before the exec() seems
harmless. But i am not sure all the fiddling with cgroups is indeed
harmless and does not cause i.e. kernel work on cores that should be
left alone. I have the feeling allowing the task in the parent
cgroup is a bad idea, no matter how short the window seems to be.
Right now the parent cgroup contains all cpus found in
machine.slice, which for pinned VMs is too much. How about we
calculate the size of the child cgroups before and make the parent
the union of them. Or give the parent the emulator pinning and
extend it for the vcpus later. But that might turn out pretty
complicated as well, getting the order right with the mix of
cpusets and sched_setaffinity().
...
Just just drop this patch please.
The point is though that we have *no* choice. Systemd can put the task
in the cpu controller and we've no way to prevent that. So the code
*has* to be able to cope with that happening. Therefore this patch is
wrong it just makes behaviour non-deterministic increasing the
chances that we don't correctly handle the case where systemd adds
the task to the cpu controllers
Understood! I was suggesting a "growing on demand" policy instead of
"shrinking after inheriting all".

If we can not control what systemd does we have to give it harmless
cpus to mess around with. That assumes we can control the size of the
cpuset before systemd puts anything in. Once back in control we grow
the parent group before deriving more child groups.

Would that be possible?

I have no objections to keep using the shrinking approach. Especially
since the controlled growing is harder to implement in the given
codebase. It just feels like it should be the other way around.
...
Regards,
Daniel