On Fri, 26 Feb 2016 13:00:04 +0000
"Daniel P. Berrange" <berrange(a)redhat.com> wrote:
On Fri, Feb 26, 2016 at 01:43:05PM +0100, Henning Schild wrote:
> On Fri, 26 Feb 2016 12:21:02 +0000
> "Daniel P. Berrange" <berrange(a)redhat.com> wrote:
>
> > On Fri, Feb 26, 2016 at 01:16:15PM +0100, Henning Schild wrote:
> > > On Fri, 26 Feb 2016 11:13:13 +0000
> > > "Daniel P. Berrange" <berrange(a)redhat.com> wrote:
> > >
> > > > On Tue, Feb 23, 2016 at 04:58:39PM +0100, Henning Schild
> > > > wrote:
> > > > > virCgroupNewMachine used to add the pidleader to the newly
> > > > > created machine cgroup. Do not do this implicit anymore.
> > > > >
> > > > > Signed-off-by: Henning Schild
<henning.schild(a)siemens.com>
> > > > > ---
> > > > > src/lxc/lxc_cgroup.c | 11 +++++++++++
> > > > > src/qemu/qemu_cgroup.c | 11 +++++++++++
> > > > > src/util/vircgroup.c | 22 ----------------------
> > > > > 3 files changed, 22 insertions(+), 22 deletions(-)
> > > >
> > > > NACK to this patch once again.
> > > >
> > > > This does not actually work as you think it does.
> > > >
> > > > > diff --git a/src/util/vircgroup.c b/src/util/vircgroup.c
> > > > > index 11f33ab..aef8e8c 100644
> > > > > --- a/src/util/vircgroup.c
> > > > > +++ b/src/util/vircgroup.c
> > > > > @@ -1682,16 +1682,6 @@ virCgroupNewMachineSystemd(const char
> > > > > *name, }
> > > > > }
> > > > >
> > > > > - if (virCgroupAddTask(*group, pidleader) < 0) {
> > > > > - virErrorPtr saved = virSaveLastError();
> > > > > - virCgroupRemove(*group);
> > > > > - virCgroupFree(group);
> > > > > - if (saved) {
> > > > > - virSetError(saved);
> > > > > - virFreeError(saved);
> > > > > - }
> > > > > - }
> > > >
> > > > Just above this we called virSystemdCreateMachine. Systemd
> > > > will create the cgroup and add the pidleader to those
> > > > cgroups. Systemd may add the pidleader to just the 'systemd'
> > > > controller, or it may add the pidleader to *ALL* controllers.
> > > > We have no way of knowing.
> > > >
> > > > This virCgroupAddTask call deals with whatever systemd chose
> > > > not todo, so we can guarantee consistent behaviour with the
> > > > pidleader in all cgroups.
> > > >
> > > > By removing this you make this method non-deterministic - the
> > > > pid may or may not be in the cpu controller now. THis is bad
> > > > because it can lead to QEMU/LXC driver code working in some
> > > > cases but failing in other cases.
> > > >
> > > > Furthermore, this existing does not cause any problems for the
> > > > scenario you care about. THis cgroup placement is being set
> > > > in between the time libvirtd calls fork() and exec(). With
> > > > your later patch 5, we ensure that the PID is moved across
> > > > into the emulator cgroup, before we call exec(). When we call
> > > > exec all memory mappings will be replaced, so QEMU will stil
> > > > start with the correct vCPU placement and memory allocation
> > > > placement.
> > >
> > > I agree having the task in the wrong cgroup before the exec()
> > > seems harmless. But i am not sure all the fiddling with cgroups
> > > is indeed harmless and does not cause i.e. kernel work on cores
> > > that should be left alone. I have the feeling allowing the task
> > > in the parent cgroup is a bad idea, no matter how short the
> > > window seems to be.
> > >
> > > Right now the parent cgroup contains all cpus found in
> > > machine.slice, which for pinned VMs is too much. How about we
> > > calculate the size of the child cgroups before and make the
> > > parent the union of them. Or give the parent the emulator
> > > pinning and extend it for the vcpus later. But that might turn
> > > out pretty complicated as well, getting the order right with
> > > the mix of cpusets and sched_setaffinity().
> > > > Just just drop this patch please.
> >
> > The point is though that we have *no* choice. Systemd can put the
> > task in the cpu controller and we've no way to prevent that. So
> > the code *has* to be able to cope with that happening. Therefore
> > this patch is wrong it just makes behaviour non-deterministic
> > increasing the chances that we don't correctly handle the case
> > where systemd adds the task to the cpu controllers
>
> Understood! I was suggesting a "growing on demand" policy instead of
> "shrinking after inheriting all".
>
> If we can not control what systemd does we have to give it harmless
> cpus to mess around with. That assumes we can control the size of
> the cpuset before systemd puts anything in. Once back in control we
> grow the parent group before deriving more child groups.
>
> Would that be possible?
>
> I have no objections to keep using the shrinking approach.
> Especially since the controlled growing is harder to implement in
> the given codebase. It just feels like it should be the other way
> around.
I don't see what actual problem you are trying to solve here.
IIUC, the original problem you wanted to address was that vCPU pids
could run the wrong CPU for a short time. ie The original code did
1. Libvirtd forks child pid
... this pid runs on whatever pCPUs that libvirt is permitted to
use... 2. Libvirtd creates root cgroup for VM
... this pid runs on whatever pCPUs the root cgroup inherited...
3. Child pid execs QEMU
... QEMU pid runs on whatever pCPUs the root cgroup inherited...
4. QEMU creates vCPU pids
... vCPU pids run on whatever pCPUs the root cgroup inherited...
4. Libvirtd moves emulator PIDs and vCPU PIDs
... emulator PIDs are running on assigned pCPUs for emulator...
... vCPU PIDs are running on assigned pCPUs for vCPUs....
With the important change in patch 5 this now looks like
1. Libvirtd forks child pid
... this pid runs on whatever pCPUs that libvirt is permitted to
use... 2. Libvirtd creates root cgroup for VM
... this pid runs on whatever pCPUs the root cgroup inherited...
I am trying to come up with a solution that eliminates the above line
from the whole bringup. I.e never allow a pid belonging to the VM
outside the pinnings of libvirtd and the VM configuration.
But until step 4 it should probably be
"... this pid *just sits* on whatever pCPUs the root cgroup
inherited..."
If we are sure that it does not "run" before 4. patch 5 does the trick
already
> 3. Libvirtd moves pid into emulator group
> ... this pid runs on assigned pCPUs for emulator...
> 4. Child pid execs QEMU
> ... QEMU pid runs on assigned pCPUs for emulator...
> 5. QEMU creates vCPU pids
> ... vCPU pids are running on assigned pCPUS for emulator...
> 6. Libvirtd moves vCPU PIDs
> ... emulator PIDs are running on assigned pCPUs for emulator...
> ... vCPU PIDs are running on assigned pCPUs for vCPUs....
>
> Which is good, because vCPU pids don't ever run on un-restricted
> pCPUs.
>
>
> Your patch 4 here is attempting to change step 2 only so that it
> looks like
>
>
> 1. Libvirtd forks child pid
> ... this pid runs on whatever pCPUs that libvirt is permitted to
> use... 2. Libvirtd creates root cgroup for VM
> ... this pid runs on whatever pCPUs that libvirt is permitted to
> use... or depending on what controller system added
... this pid runs on whatever pCPUs the root cgroup inherited...
> 3. Libvirtd adds pid into emulator group
> ... this pid runs on assigned pCPUs for emulator...
> 4. Child pid execs QEMU
> ... QEMU pid runs on assigned pCPUs for emulator...
> 5. QEMU creates vCPU pids
> ... vCPU pids are running on assigned pCPUS for emulator...
> 6. Libvirtd moves vCPU PIDs
> ... emulator PIDs are running on assigned pCPUs for emulator...
> ... vCPU PIDs are running on assigned pCPUs for vCPUs....
>
> At the time we exec QEMU in step 4 the situation is exactly the same
> as before. The vCPU pids are still created in the right place
> straight away.
>
> So this patch 4 doesn't achieve anything useful
>
> Regards,
> Daniel