[libvirt] [REPOST 0/4] Adjustment to recent cgroup/cpuset changes (for 1.3.1)

Reposting my cgroup fixes series: http://www.redhat.com/archives/libvir-list/2016-January/msg00236.html partially because I originally forgot to CC the author (Henning Schild) of the original series for which these patch fix a couple of issues discovered during regression testing (virt-test memtune failures in Red Hat regression environment), but also to bring them up to date with the top of libvirt git. NB: I did send Henning the changes after the fact, but my resend using the same message-id skills so that replies are left in the onlist series are lacking. Henning has looked at the first patch - with a response here: http://www.redhat.com/archives/libvir-list/2016-January/msg00443.html Finally, I think these changes should go into 1.3.1 since that's when the regression was introduced. John Ferlan (4): cgroup: Fix possible bug as a result of code motion for vcpu cgroup setup qemu: Add check for NULL cgroup return from virCgroupNewMachine Revert "qemu: do not put a task into machine cgroup" qemu: Put the emulator cgroup pid into the right task file src/qemu/qemu_cgroup.c | 18 +++++++++++++----- src/qemu/qemu_process.c | 12 ++++++------ 2 files changed, 19 insertions(+), 11 deletions(-) -- 2.5.0

Commit id '90b721e43' moved where the virCgroupAddTask was made until after the check for the vcpupin checks. However, in doing so it missed an option where if the cpumap didn't exist, then the code would continue back to the top of the current vcpu loop. The results was that the virCgroupAddTask wouldn't be called. Signed-off-by: John Ferlan <jferlan@redhat.com> --- src/qemu/qemu_cgroup.c | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c index 1c406ce..91b3328 100644 --- a/src/qemu/qemu_cgroup.c +++ b/src/qemu/qemu_cgroup.c @@ -1079,10 +1079,7 @@ qemuSetupCgroupForVcpu(virDomainObjPtr vm) } } - if (!cpumap) - continue; - - if (qemuSetupCgroupCpusetCpus(cgroup_vcpu, cpumap) < 0) + if (cpumap && qemuSetupCgroupCpusetCpus(cgroup_vcpu, cpumap) < 0) goto cleanup; } -- 2.5.0

On Wed, Jan 13, 2016 at 07:29:47AM -0500, John Ferlan wrote:
Commit id '90b721e43' moved where the virCgroupAddTask was made until after the check for the vcpupin checks. However, in doing so it missed an option where if the cpumap didn't exist, then the code would continue back to the top of the current vcpu loop. The results was that the virCgroupAddTask wouldn't be called.
Signed-off-by: John Ferlan <jferlan@redhat.com> --- src/qemu/qemu_cgroup.c | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-)
ACK && safe for freeze.
diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c index 1c406ce..91b3328 100644 --- a/src/qemu/qemu_cgroup.c +++ b/src/qemu/qemu_cgroup.c @@ -1079,10 +1079,7 @@ qemuSetupCgroupForVcpu(virDomainObjPtr vm) } }
- if (!cpumap) - continue; - - if (qemuSetupCgroupCpusetCpus(cgroup_vcpu, cpumap) < 0) + if (cpumap && qemuSetupCgroupCpusetCpus(cgroup_vcpu, cpumap) < 0) goto cleanup; }
-- 2.5.0
-- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list

On Wed, Jan 13, 2016 at 07:29:47AM -0500, John Ferlan wrote:
Commit id '90b721e43' moved where the virCgroupAddTask was made until after the check for the vcpupin checks. However, in doing so it missed an option where if the cpumap didn't exist, then the code would continue back to the top of the current vcpu loop. The results was that the virCgroupAddTask wouldn't be called.
Signed-off-by: John Ferlan <jferlan@redhat.com> --- src/qemu/qemu_cgroup.c | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-)
diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c index 1c406ce..91b3328 100644 --- a/src/qemu/qemu_cgroup.c +++ b/src/qemu/qemu_cgroup.c @@ -1079,10 +1079,7 @@ qemuSetupCgroupForVcpu(virDomainObjPtr vm) } }
- if (!cpumap) - continue; - - if (qemuSetupCgroupCpusetCpus(cgroup_vcpu, cpumap) < 0) + if (cpumap && qemuSetupCgroupCpusetCpus(cgroup_vcpu, cpumap) < 0) goto cleanup; }
ACK Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

Commit id '71ce4759' altered the cgroup processing with respect to the call to virCgroupAddTask being moved out from lower layers into the calling layers especially for qemu processing of emulator and vcpu threads. What was missed in the processing was that it is possible for a code path to return a NULL cgroup *and* a 0 return status via virCgroupNewPartition failure when virCgroupNewIgnoreError succeeded when virCgroupNewMachineManual returns. This was initially seen for lxc cgroup processing and patched by comit id 'ae09988eb'. We'll make the same check here as a subsquent patch will revert commit id 'a41c00b47' which would leave the possibility that priv->cgroup is NULL. Signed-off-by: John Ferlan <jferlan@redhat.com> --- src/qemu/qemu_cgroup.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c index 91b3328..515d76a 100644 --- a/src/qemu/qemu_cgroup.c +++ b/src/qemu/qemu_cgroup.c @@ -782,7 +782,7 @@ qemuInitCgroup(virQEMUDriverPtr driver, nnicindexes, nicindexes, vm->def->resource->partition, cfg->cgroupControllers, - &priv->cgroup) < 0) { + &priv->cgroup) < 0 || !priv->cgroup) { if (virCgroupNewIgnoreError()) goto done; -- 2.5.0

On Wed, Jan 13, 2016 at 07:29:48AM -0500, John Ferlan wrote:
Commit id '71ce4759' altered the cgroup processing with respect to the call to virCgroupAddTask being moved out from lower layers into the calling layers especially for qemu processing of emulator and vcpu threads.
What was missed in the processing was that it is possible for a code path to return a NULL cgroup *and* a 0 return status via virCgroupNewPartition failure when virCgroupNewIgnoreError succeeded when virCgroupNewMachineManual returns. This was initially seen for lxc cgroup processing and patched by comit id 'ae09988eb'. We'll make the same check here as a subsquent patch will revert commit id 'a41c00b47' which would leave the possibility that priv->cgroup is NULL.
Signed-off-by: John Ferlan <jferlan@redhat.com> --- src/qemu/qemu_cgroup.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c index 91b3328..515d76a 100644 --- a/src/qemu/qemu_cgroup.c +++ b/src/qemu/qemu_cgroup.c @@ -782,7 +782,7 @@ qemuInitCgroup(virQEMUDriverPtr driver, nnicindexes, nicindexes, vm->def->resource->partition, cfg->cgroupControllers, - &priv->cgroup) < 0) { + &priv->cgroup) < 0 || !priv->cgroup) { if (virCgroupNewIgnoreError())
Why don't we check for virCgroupNewIgnoreError() in LXC? If cgroups are required there, then why *does* the vircgroup code check for that? I think we should not call that function here, mainly because virCgroupNewMachine() calls that every time cgroup creation fails. Calling this second time probably doesn't hurt now, but could be potentially problematic in hypothetical future change. Anyway, I think the proper call codeflow should be: if (virCgroupNewMachine() < 0) goto cleanup; if (!priv->cgroup) goto done; It's more readable, more error-prone and so on. Same with LXC. Martin

On Wed, Jan 13, 2016 at 07:29:48AM -0500, John Ferlan wrote:
Commit id '71ce4759' altered the cgroup processing with respect to the call to virCgroupAddTask being moved out from lower layers into the calling layers especially for qemu processing of emulator and vcpu threads.
What was missed in the processing was that it is possible for a code path to return a NULL cgroup *and* a 0 return status via virCgroupNewPartition failure when virCgroupNewIgnoreError succeeded when virCgroupNewMachineManual returns. This was initially seen for lxc cgroup processing and patched by comit id 'ae09988eb'. We'll make the same check here as a subsquent patch will revert commit id 'a41c00b47' which would leave the possibility that priv->cgroup is NULL.
Signed-off-by: John Ferlan <jferlan@redhat.com> --- src/qemu/qemu_cgroup.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
We should be reverting commit 71ce4759, so I don't think thi sis needed Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

This reverts commit a41c00b472efaa192d2deae51ab732e65903238f. The patch was causing erroneous updates to the /proc/$pid/cgroup file. This resulted in some unexpected behavoirs. Signed-off-by: John Ferlan <jferlan@redhat.com> --- src/qemu/qemu_cgroup.c | 15 +++++++++++---- src/qemu/qemu_process.c | 12 ++++++------ 2 files changed, 17 insertions(+), 10 deletions(-) diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c index 515d76a..16c6492 100644 --- a/src/qemu/qemu_cgroup.c +++ b/src/qemu/qemu_cgroup.c @@ -789,6 +789,17 @@ qemuInitCgroup(virQEMUDriverPtr driver, goto cleanup; } + if (virCgroupAddTask(priv->cgroup, vm->pid) < 0) { + virErrorPtr saved = virSaveLastError(); + virCgroupRemove(priv->cgroup); + virCgroupFree(&priv->cgroup); + if (saved) { + virSetError(saved); + virFreeError(saved); + } + goto cleanup; + } + done: ret = 0; cleanup: @@ -1157,10 +1168,6 @@ qemuSetupCgroupForEmulator(virDomainObjPtr vm) goto cleanup; } - /* consider the first thread an emulator-thread */ - if (virCgroupAddTask(cgroup_emulator, vm->pid) < 0) - goto cleanup; - virCgroupFree(&cgroup_emulator); return 0; diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 3d9e0e5..76df860 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -4894,6 +4894,12 @@ qemuProcessLaunch(virConnectPtr conn, if (qemuSetupCgroup(driver, vm, nnicindexes, nicindexes) < 0) goto cleanup; + /* This must be done after cgroup placement to avoid resetting CPU + * affinity */ + if (!vm->def->cputune.emulatorpin && + qemuProcessInitCpuAffinity(vm) < 0) + goto cleanup; + VIR_DEBUG("Setting domain security labels"); if (virSecurityManagerSetAllLabel(driver->securityManager, vm->def, @@ -4940,12 +4946,6 @@ qemuProcessLaunch(virConnectPtr conn, if (qemuSetupCgroupForEmulator(vm) < 0) goto cleanup; - /* This must be done after cgroup placement to avoid resetting CPU - * affinity */ - if (!vm->def->cputune.emulatorpin && - qemuProcessInitCpuAffinity(vm) < 0) - goto cleanup; - VIR_DEBUG("Setting affinity of emulator threads"); if (qemuProcessSetEmulatorAffinity(vm) < 0) goto cleanup; -- 2.5.0

On Wed, Jan 13, 2016 at 07:29:49AM -0500, John Ferlan wrote:
This reverts commit a41c00b472efaa192d2deae51ab732e65903238f.
The patch was causing erroneous updates to the /proc/$pid/cgroup file. This resulted in some unexpected behavoirs.
Signed-off-by: John Ferlan <jferlan@redhat.com>
ACK, Please also revert commit 71ce475967b3523882b0e1a0f649fdbea5abf9d5 Author: Henning Schild <henning.schild@siemens.com> Date: Wed Dec 9 17:58:10 2015 +0100 util: cgroups do not implicitly add task to new machine cgroup Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On 01/14/2016 06:59 AM, Daniel P. Berrange wrote:
On Wed, Jan 13, 2016 at 07:29:49AM -0500, John Ferlan wrote:
This reverts commit a41c00b472efaa192d2deae51ab732e65903238f.
The patch was causing erroneous updates to the /proc/$pid/cgroup file. This resulted in some unexpected behavoirs.
Signed-off-by: John Ferlan <jferlan@redhat.com>
ACK,
Please also revert
commit 71ce475967b3523882b0e1a0f649fdbea5abf9d5 Author: Henning Schild <henning.schild@siemens.com> Date: Wed Dec 9 17:58:10 2015 +0100
util: cgroups do not implicitly add task to new machine cgroup
Regards, Daniel
So the order of revert is patch 2/3 (commit id 'a41c00b472'), patch 1/3 (commit id '71ce4759'), the lxc patch made as a result of patch 1/3 (commit id 'ae09988e'). Then also apply patch 1/4 of this series. Learned quite a bit about cgroups... I'll post patches shortly... John

On Thu, Jan 14, 2016 at 11:10:36AM -0500, John Ferlan wrote:
On 01/14/2016 06:59 AM, Daniel P. Berrange wrote:
On Wed, Jan 13, 2016 at 07:29:49AM -0500, John Ferlan wrote:
This reverts commit a41c00b472efaa192d2deae51ab732e65903238f.
The patch was causing erroneous updates to the /proc/$pid/cgroup file. This resulted in some unexpected behavoirs.
Signed-off-by: John Ferlan <jferlan@redhat.com>
ACK,
Please also revert
commit 71ce475967b3523882b0e1a0f649fdbea5abf9d5 Author: Henning Schild <henning.schild@siemens.com> Date: Wed Dec 9 17:58:10 2015 +0100
util: cgroups do not implicitly add task to new machine cgroup
Regards, Daniel
So the order of revert is patch 2/3 (commit id 'a41c00b472'), patch 1/3 (commit id '71ce4759'), the lxc patch made as a result of patch 1/3 (commit id 'ae09988e').
Then also apply patch 1/4 of this series.
Yep
Learned quite a bit about cgroups...
I re-learned all the bits I had forgotten :-P Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

Recently reverted commit id 'a41c00b4' was designed to move the setting of the task file into the right place in the cgroup hierarchy. This patch applies the portion of the reverted patch which writes the pid to the right task file. Signed-off-by: John Ferlan <jferlan@redhat.com> --- src/qemu/qemu_cgroup.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c index 16c6492..a0ad03f 100644 --- a/src/qemu/qemu_cgroup.c +++ b/src/qemu/qemu_cgroup.c @@ -1168,6 +1168,10 @@ qemuSetupCgroupForEmulator(virDomainObjPtr vm) goto cleanup; } + /* consider the first thread an emulator-thread */ + if (virCgroupAddTask(cgroup_emulator, vm->pid) < 0) + goto cleanup; + virCgroupFree(&cgroup_emulator); return 0; -- 2.5.0

On Wed, Jan 13, 2016 at 07:29:50AM -0500, John Ferlan wrote:
Recently reverted commit id 'a41c00b4' was designed to move the setting of the task file into the right place in the cgroup hierarchy. This patch applies the portion of the reverted patch which writes the pid to the right task file.
Signed-off-by: John Ferlan <jferlan@redhat.com> --- src/qemu/qemu_cgroup.c | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c index 16c6492..a0ad03f 100644 --- a/src/qemu/qemu_cgroup.c +++ b/src/qemu/qemu_cgroup.c @@ -1168,6 +1168,10 @@ qemuSetupCgroupForEmulator(virDomainObjPtr vm) goto cleanup; }
+ /* consider the first thread an emulator-thread */ + if (virCgroupAddTask(cgroup_emulator, vm->pid) < 0) + goto cleanup; + virCgroupFree(&cgroup_emulator); return 0;
This doesn't actually do anything useful. Take a look further up in this method and you'll see virCgroupMoveTask() which moves every single PID, including vm->pid. So NACK to this as it doesn't solve the race condition Henning was addressing. Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On Wed, Jan 13, 2016 at 07:29:46AM -0500, John Ferlan wrote:
Reposting my cgroup fixes series:
http://www.redhat.com/archives/libvir-list/2016-January/msg00236.html
partially because I originally forgot to CC the author (Henning Schild) of the original series for which these patch fix a couple of issues discovered during regression testing (virt-test memtune failures in Red Hat regression environment), but also to bring them up to date with the top of libvirt git.
NB: I did send Henning the changes after the fact, but my resend using the same message-id skills so that replies are left in the onlist series are lacking. Henning has looked at the first patch - with a response here:
http://www.redhat.com/archives/libvir-list/2016-January/msg00443.html
Finally, I think these changes should go into 1.3.1 since that's when the regression was introduced.
It would be nice to have them in, I really tried reviewing them, but I can't wrap my head around last two of them. Maybe because I'm already late for an appointment I have. So unfortunately I have to leave you without the review for those two as I really need to go, but anyone else feel free to continue. And even re-check my reviews for 1 and 2 if you want. It would be a pity not to fix a regression when we could. On a side note, this mail is missing "PATCH" in the subject, so people filtering those into different folders (and not reading all of them, i.e. not me) could easily miss it. I won't be able to get to this for some time now, so please pursue someone just in case I won't get to that for another day. Sorry for the inconvenience.
John Ferlan (4): cgroup: Fix possible bug as a result of code motion for vcpu cgroup setup qemu: Add check for NULL cgroup return from virCgroupNewMachine Revert "qemu: do not put a task into machine cgroup" qemu: Put the emulator cgroup pid into the right task file
src/qemu/qemu_cgroup.c | 18 +++++++++++++----- src/qemu/qemu_process.c | 12 ++++++------ 2 files changed, 19 insertions(+), 11 deletions(-)
-- 2.5.0
-- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list

On 01/13/2016 11:51 AM, Martin Kletzander wrote:
On Wed, Jan 13, 2016 at 07:29:46AM -0500, John Ferlan wrote:
Reposting my cgroup fixes series:
http://www.redhat.com/archives/libvir-list/2016-January/msg00236.html
partially because I originally forgot to CC the author (Henning Schild) of the original series for which these patch fix a couple of issues discovered during regression testing (virt-test memtune failures in Red Hat regression environment), but also to bring them up to date with the top of libvirt git.
NB: I did send Henning the changes after the fact, but my resend using the same message-id skills so that replies are left in the onlist series are lacking. Henning has looked at the first patch - with a response here:
http://www.redhat.com/archives/libvir-list/2016-January/msg00443.html
Finally, I think these changes should go into 1.3.1 since that's when the regression was introduced.
It would be nice to have them in, I really tried reviewing them, but I can't wrap my head around last two of them. Maybe because I'm already late for an appointment I have.
What I found happening is that by moving the virCgroupAddTask from qemuInitCgroup until later in qemuSetupCgroupForEmulator is that for "some reason" on (at least in the test environment) an older Fedora release (f20) that the /proc/$pid/cgroup file was updated "strangely". On a more recent Fedora release (f23) I didn't see the same phenomena. And by strangely - what I saw was even though the *SetupEmulator path was 'supposed to be' modifying only cpuset and cpu,cpuacct - other entries for memory, blkio, and devices were also updated - thus pointing at the wrong place (rather than the machine specific place). This caused the memtune test to fail. What was even stranger is that the code was updating the machine specific area when changing the values (e.g., internally we had the right path), but the test cannot see that so it uses the /proc/$pid/cgroup file to find the path. In any case, it's a very strange anomaly.
So unfortunately I have to leave you without the review for those two as I really need to go, but anyone else feel free to continue. And even re-check my reviews for 1 and 2 if you want. It would be a pity not to fix a regression when we could.
Thanks for your time on the first two... I'll push the first one and wait on the others.
On a side note, this mail is missing "PATCH" in the subject, so people filtering those into different folders (and not reading all of them, i.e. not me) could easily miss it. I won't be able to get to this for some time now, so please pursue someone just in case I won't get to that for another day. Sorry for the inconvenience.
To each their own on filters - there's a version with PATCH in the title and one that only has REPOST. The only difference is the REPOST does the CC and is 'up to date' with top of branch. The PATCH version should apply fine too. John
John Ferlan (4): cgroup: Fix possible bug as a result of code motion for vcpu cgroup setup qemu: Add check for NULL cgroup return from virCgroupNewMachine Revert "qemu: do not put a task into machine cgroup" qemu: Put the emulator cgroup pid into the right task file
src/qemu/qemu_cgroup.c | 18 +++++++++++++----- src/qemu/qemu_process.c | 12 ++++++------ 2 files changed, 19 insertions(+), 11 deletions(-)
-- 2.5.0
-- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list

On Wed, Jan 13, 2016 at 12:15:23PM -0500, John Ferlan wrote:
On 01/13/2016 11:51 AM, Martin Kletzander wrote:
On Wed, Jan 13, 2016 at 07:29:46AM -0500, John Ferlan wrote:
Reposting my cgroup fixes series:
http://www.redhat.com/archives/libvir-list/2016-January/msg00236.html
partially because I originally forgot to CC the author (Henning Schild) of the original series for which these patch fix a couple of issues discovered during regression testing (virt-test memtune failures in Red Hat regression environment), but also to bring them up to date with the top of libvirt git.
NB: I did send Henning the changes after the fact, but my resend using the same message-id skills so that replies are left in the onlist series are lacking. Henning has looked at the first patch - with a response here:
http://www.redhat.com/archives/libvir-list/2016-January/msg00443.html
Finally, I think these changes should go into 1.3.1 since that's when the regression was introduced.
It would be nice to have them in, I really tried reviewing them, but I can't wrap my head around last two of them. Maybe because I'm already late for an appointment I have.
What I found happening is that by moving the virCgroupAddTask from qemuInitCgroup until later in qemuSetupCgroupForEmulator is that for "some reason" on (at least in the test environment) an older Fedora release (f20) that the /proc/$pid/cgroup file was updated "strangely". On a more recent Fedora release (f23) I didn't see the same phenomena.
And by strangely - what I saw was even though the *SetupEmulator path was 'supposed to be' modifying only cpuset and cpu,cpuacct - other entries for memory, blkio, and devices were also updated - thus pointing at the wrong place (rather than the machine specific place). This caused the memtune test to fail. What was even stranger is that the code was updating the machine specific area when changing the values (e.g., internally we had the right path), but the test cannot see that so it uses the /proc/$pid/cgroup file to find the path.
You can't assume that the different controllers are independant, though on most systems they will be. The default systemd layout creates a separate mount for each controler under /sys/fs/cgroup/$controller, but it is valid to create a single mount point to hold *all* controllers. In such a case, if you move placement for 'cpu' controller it will affect *all* controllers. Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On Wed, Jan 13, 2016 at 05:51:34PM +0100, Martin Kletzander wrote:
On Wed, Jan 13, 2016 at 07:29:46AM -0500, John Ferlan wrote:
Reposting my cgroup fixes series:
http://www.redhat.com/archives/libvir-list/2016-January/msg00236.html
partially because I originally forgot to CC the author (Henning Schild) of the original series for which these patch fix a couple of issues discovered during regression testing (virt-test memtune failures in Red Hat regression environment), but also to bring them up to date with the top of libvirt git.
NB: I did send Henning the changes after the fact, but my resend using the same message-id skills so that replies are left in the onlist series are lacking. Henning has looked at the first patch - with a response here:
http://www.redhat.com/archives/libvir-list/2016-January/msg00443.html
Finally, I think these changes should go into 1.3.1 since that's when the regression was introduced.
It would be nice to have them in, I really tried reviewing them, but I can't wrap my head around last two of them. Maybe because I'm already late for an appointment I have.
So unfortunately I have to leave you without the review for those two as I really need to go, but anyone else feel free to continue. And even re-check my reviews for 1 and 2 if you want. It would be a pity not to fix a regression when we could.
I agree we need to get these into the next release, but please hold off from merging them. I want to re-examine Henning's original patches in more detail, as I have a bad feeling we might need to simply revert all of them and start again. Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On Wed, 13 Jan 2016 17:53:16 +0000 "Daniel P. Berrange" <berrange@redhat.com> wrote:
On Wed, Jan 13, 2016 at 05:51:34PM +0100, Martin Kletzander wrote:
On Wed, Jan 13, 2016 at 07:29:46AM -0500, John Ferlan wrote:
Reposting my cgroup fixes series:
http://www.redhat.com/archives/libvir-list/2016-January/msg00236.html
partially because I originally forgot to CC the author (Henning Schild) of the original series for which these patch fix a couple of issues discovered during regression testing (virt-test memtune failures in Red Hat regression environment), but also to bring them up to date with the top of libvirt git.
NB: I did send Henning the changes after the fact, but my resend using the same message-id skills so that replies are left in the onlist series are lacking. Henning has looked at the first patch - with a response here:
http://www.redhat.com/archives/libvir-list/2016-January/msg00443.html
Finally, I think these changes should go into 1.3.1 since that's when the regression was introduced.
It would be nice to have them in, I really tried reviewing them, but I can't wrap my head around last two of them. Maybe because I'm already late for an appointment I have.
So unfortunately I have to leave you without the review for those two as I really need to go, but anyone else feel free to continue. And even re-check my reviews for 1 and 2 if you want. It would be a pity not to fix a regression when we could.
I agree we need to get these into the next release, but please hold off from merging them. I want to re-examine Henning's original patches in more detail, as I have a bad feeling we might need to simply revert all of them and start again.
Until when do these patches need to be reviewed? The 1/4 is obvious but the other ones need a closer look. I can just say that they do not seem right. I pulled virCgroupAddTask out of virCgroupNewMachine* and that should be done. But it seems to me that the way virCgroupAddTask was called contained important error handling that should remain in virCgroupNewMachine. If we only have a couple of days to the next release i would suggest reverting my changes to give us time to figure the whole thing out. I will have limited time to look into that until the end of the month. But i certainly want those changes merged, or something that helps solve/mitigate the realtime problems of the current implementation. My changes address only a fraction of the problem. My suggestion for a proper solution would be using the exclusive flags of cgroups, which will require a totally different cgroup layout. Something like a "machine.slice_excl" next to "machine.slice" with the current structure replicated in there. Or a "top-level" exclusive cgroup per instance. So far i only looked at the implications for cpu_exclusive, where such a setup can work and will properly "protect" the reserved cpus from accidential use. Regards, Henning

On Wed, Jan 13, 2016 at 07:29:46AM -0500, John Ferlan wrote:
Reposting my cgroup fixes series:
http://www.redhat.com/archives/libvir-list/2016-January/msg00236.html
partially because I originally forgot to CC the author (Henning Schild) of the original series for which these patch fix a couple of issues discovered during regression testing (virt-test memtune failures in Red Hat regression environment), but also to bring them up to date with the top of libvirt git.
NB: I did send Henning the changes after the fact, but my resend using the same message-id skills so that replies are left in the onlist series are lacking. Henning has looked at the first patch - with a response here:
http://www.redhat.com/archives/libvir-list/2016-January/msg00443.html
Finally, I think these changes should go into 1.3.1 since that's when the regression was introduced.
Since this has been puzzelling us for a while, let me recap on the cgroup setup in general. First, I'll describe how it used to work *before* Henning's patches were merged, on a systemd based host. - The QEMU driver forks a child process, but does *not* exec QEMU yet The cgroup placement at this point is inherited from libvirtd. It may look like this: 10:freezer:/ 9:cpuset:/ 8:perf_event:/ 7:hugetlb:/ 6:blkio:/system.slice 5:memory:/system.slice 4:net_cls,net_prio:/ 3:devices:/system.slice/libvirtd.service 2:cpu,cpuacct:/system.slice 1:name=systemd:/system.slice/libvirtd.service - The QEMU driver calls virCgroupNewMachine() - We calll virSystemdCreateMachine with pidleader=$child - Systemd creates the initial machine scope unit under the machine slice unit, for the "systemd" controller. It may also add the PID to *zero* or more other resource controllers. So at this point the cgroup placement may look like this: 10:freezer:/ 9:cpuset:/ 8:perf_event:/ 7:hugetlb:/ 6:blkio:/ 5:memory:/ 4:net_cls,net_prio:/ 3:devices:/ 2:cpu,cpuacct:/ 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope Or may look like this: 10:freezer:/machine.slice/machine-qemu\x2dserial.scope 9:cpuset:/machine.slice/machine-qemu\x2dserial.scope 8:perf_event:/machine.slice/machine-qemu\x2dserial.scope 7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope 6:blkio:/machine.slice/machine-qemu\x2dserial.scope 5:memory:/machine.slice/machine-qemu\x2dserial.scope 4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope 3:devices:/machine.slice/machine-qemu\x2dserial.scope 2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope Or anywhere in between. We have *ZERO* guarantee about what other resource controllers we may have been placed in by systemd. There is some fairly complex logic that determines this, based on what other tasks current exist in sibling cgroups, and what tasks have *previously* existed in the cgroups. IOW, you should consider the list of etra resource controllers essentially non-deterministic - We call virCgroupAddTask with pid=$child This places the pid in any resource controllers we need, which systemd has not already setup. IOW, it guarantees that we now have placement that should look like this, regardless of what systemd has done: 10:freezer:/machine.slice/machine-qemu\x2dserial.scope 9:cpuset:/machine.slice/machine-qemu\x2dserial.scope 8:perf_event:/machine.slice/machine-qemu\x2dserial.scope 7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope 6:blkio:/machine.slice/machine-qemu\x2dserial.scope 5:memory:/machine.slice/machine-qemu\x2dserial.scope 4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope 3:devices:/machine.slice/machine-qemu\x2dserial.scope 2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope - The QEMU driver now lets the child process exec QEMU. QEMU creates its vCPU threads at this point. All QEMU threads (emulator, vcpu and I/O threads) now have the cgroup placement shown above. - We create the emulator cgroup for the cpuset, cpu, cpuacct controllers move all threads into this new cgroup. All threads (emulator, vcpu and I/O threads) thus now have placement of: 10:freezer:/machine.slice/machine-qemu\x2dserial.scope 9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/emulator 8:perf_event:/machine.slice/machine-qemu\x2dserial.scope 7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope 6:blkio:/machine.slice/machine-qemu\x2dserial.scope 5:memory:/machine.slice/machine-qemu\x2dserial.scope 4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope 3:devices:/machine.slice/machine-qemu\x2dserial.scope 2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/emulator 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope Yes, we really did move the vcpu threads into the emulator group... - We now ask QEMU which are the vCPU & I/O threads. - Foreach CPU thread we new vCPU cgroups and move them into this place 10:freezer:/machine.slice/machine-qemu\x2dserial.scope 9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/vcpuN 8:perf_event:/machine.slice/machine-qemu\x2dserial.scope 7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope 6:blkio:/machine.slice/machine-qemu\x2dserial.scope 5:memory:/machine.slice/machine-qemu\x2dserial.scope 4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope 3:devices:/machine.slice/machine-qemu\x2dserial.scope 2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/vpuN 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope - Foreach I/O thread we new vCPU cgroups and move them into this place 10:freezer:/machine.slice/machine-qemu\x2dserial.scope 9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/iothreadN 8:perf_event:/machine.slice/machine-qemu\x2dserial.scope 7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope 6:blkio:/machine.slice/machine-qemu\x2dserial.scope 5:memory:/machine.slice/machine-qemu\x2dserial.scope 4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope 3:devices:/machine.slice/machine-qemu\x2dserial.scope 2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/iothreadN 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope Now, lets Henning's three patches commit 71ce475967b3523882b0e1a0f649fdbea5abf9d5 Author: Henning Schild <henning.schild@siemens.com> Date: Wed Dec 9 17:58:10 2015 +0100 util: cgroups do not implicitly add task to new machine cgroup This alters virCgroupNewMachine so that it no longer calls virCgroupAddTask. instead the callers are responsible for doing it. This is a functional no-opp change, so ostensibly harmless at this point. The second patch commit a41c00b472efaa192d2deae51ab732e65903238f Author: Henning Schild <henning.schild@siemens.com> Date: Mon Dec 14 15:48:05 2015 -0500 qemu: do not put a task into machine cgroup This stops qemuInitCgroup from calling virCgroupAddTask. Instead it has added a call to vriCgroupAddTask in qemuSetupCgroupForEmulator(), which only affects the cpu, cpuset & cpuacct controllers. This means we no longer have any guarantee about which resource controllers the QEMU processes in general are in. All we can say is that they are in the 'systemd' controller, and the cpu, cpuset & cpuacct controllers. Systemd may have also placed it into *all* the other resource controllers, or it may have placed it into none of the them. As such, after this change, we are potentially not in the correct cgroup for the memory, blkio, netcls, devices controllers. For added fun, the qemuSetupCgroupForEmulator() still has a call to virCgroupMoveTask, so the addition of virCgroupAddTask to this method was useless. The final patch commit 90b721e43ec9232b5b218e891437bed04548e841 Author: Henning Schild <henning.schild@siemens.com> Date: Mon Dec 14 15:58:05 2015 -0500 qemu cgroups: move new threads to new cgroup after cpuset is set up This delays the point at which we call virCgroupAddTask for the vCPU and I/O thread cgroups, until after we have configured properties on those cgroups. This change is fine So I think we need to revert the first 2 of Hennings patches - the 2nd one is the real serious problem, but once we revert that, the 1st change becomes pointless and so should also be reverted. Going back to the key problem Henning was trying to address.... The issue is that we only setup the emulator cgroup /after/ QEMU has already started running, so there's a period of time where threads are not confined by the cgroup affinity. I think we can solve that more simply by just moving the call to qemuSetupCgroupForEmulator(), into qemuSetupCgroup(). That way we place the emulator thread into the correct cgroup *before* we even exec QEMU. So we never have the window where we run in the unconfined cpuset controller. Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On Thu, 14 Jan 2016 11:57:44 +0000 "Daniel P. Berrange" <berrange@redhat.com> wrote:
On Wed, Jan 13, 2016 at 07:29:46AM -0500, John Ferlan wrote:
Reposting my cgroup fixes series:
http://www.redhat.com/archives/libvir-list/2016-January/msg00236.html
partially because I originally forgot to CC the author (Henning Schild) of the original series for which these patch fix a couple of issues discovered during regression testing (virt-test memtune failures in Red Hat regression environment), but also to bring them up to date with the top of libvirt git.
NB: I did send Henning the changes after the fact, but my resend using the same message-id skills so that replies are left in the onlist series are lacking. Henning has looked at the first patch - with a response here:
http://www.redhat.com/archives/libvir-list/2016-January/msg00443.html
Finally, I think these changes should go into 1.3.1 since that's when the regression was introduced.
Since this has been puzzelling us for a while, let me recap on the cgroup setup in general.
First, I'll describe how it used to work *before* Henning's patches were merged, on a systemd based host.
- The QEMU driver forks a child process, but does *not* exec QEMU yet The cgroup placement at this point is inherited from libvirtd. It may look like this:
10:freezer:/ 9:cpuset:/ 8:perf_event:/ 7:hugetlb:/ 6:blkio:/system.slice 5:memory:/system.slice 4:net_cls,net_prio:/ 3:devices:/system.slice/libvirtd.service 2:cpu,cpuacct:/system.slice 1:name=systemd:/system.slice/libvirtd.service
- The QEMU driver calls virCgroupNewMachine()
- We calll virSystemdCreateMachine with pidleader=$child
- Systemd creates the initial machine scope unit under the machine slice unit, for the "systemd" controller. It may also add the PID to *zero* or more other resource controllers. So at this point the cgroup placement may look like this:
10:freezer:/ 9:cpuset:/ 8:perf_event:/ 7:hugetlb:/ 6:blkio:/ 5:memory:/ 4:net_cls,net_prio:/ 3:devices:/ 2:cpu,cpuacct:/ 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
Or may look like this:
10:freezer:/machine.slice/machine-qemu\x2dserial.scope 9:cpuset:/machine.slice/machine-qemu\x2dserial.scope 8:perf_event:/machine.slice/machine-qemu\x2dserial.scope 7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope 6:blkio:/machine.slice/machine-qemu\x2dserial.scope 5:memory:/machine.slice/machine-qemu\x2dserial.scope 4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope 3:devices:/machine.slice/machine-qemu\x2dserial.scope 2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
Or anywhere in between. We have *ZERO* guarantee about what other resource controllers we may have been placed in by systemd. There is some fairly complex logic that determines this, based on what other tasks current exist in sibling cgroups, and what tasks have *previously* existed in the cgroups. IOW, you should consider the list of etra resource controllers essentially non-deterministic
- We call virCgroupAddTask with pid=$child
This places the pid in any resource controllers we need, which systemd has not already setup. IOW, it guarantees that we now have placement that should look like this, regardless of what systemd has done:
10:freezer:/machine.slice/machine-qemu\x2dserial.scope 9:cpuset:/machine.slice/machine-qemu\x2dserial.scope 8:perf_event:/machine.slice/machine-qemu\x2dserial.scope 7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope 6:blkio:/machine.slice/machine-qemu\x2dserial.scope 5:memory:/machine.slice/machine-qemu\x2dserial.scope 4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope 3:devices:/machine.slice/machine-qemu\x2dserial.scope 2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
- The QEMU driver now lets the child process exec QEMU. QEMU creates its vCPU threads at this point. All QEMU threads (emulator, vcpu and I/O threads) now have the cgroup placement shown above.
- We create the emulator cgroup for the cpuset, cpu, cpuacct controllers move all threads into this new cgroup. All threads (emulator, vcpu and I/O threads) thus now have placement of:
10:freezer:/machine.slice/machine-qemu\x2dserial.scope 9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/emulator 8:perf_event:/machine.slice/machine-qemu\x2dserial.scope 7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope 6:blkio:/machine.slice/machine-qemu\x2dserial.scope 5:memory:/machine.slice/machine-qemu\x2dserial.scope 4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope 3:devices:/machine.slice/machine-qemu\x2dserial.scope 2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/emulator 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
Yes, we really did move the vcpu threads into the emulator group...
- We now ask QEMU which are the vCPU & I/O threads.
- Foreach CPU thread we new vCPU cgroups and move them into this place
10:freezer:/machine.slice/machine-qemu\x2dserial.scope 9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/vcpuN 8:perf_event:/machine.slice/machine-qemu\x2dserial.scope 7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope 6:blkio:/machine.slice/machine-qemu\x2dserial.scope 5:memory:/machine.slice/machine-qemu\x2dserial.scope 4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope 3:devices:/machine.slice/machine-qemu\x2dserial.scope 2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/vpuN 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
- Foreach I/O thread we new vCPU cgroups and move them into this place
10:freezer:/machine.slice/machine-qemu\x2dserial.scope 9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/iothreadN 8:perf_event:/machine.slice/machine-qemu\x2dserial.scope 7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope 6:blkio:/machine.slice/machine-qemu\x2dserial.scope 5:memory:/machine.slice/machine-qemu\x2dserial.scope 4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope 3:devices:/machine.slice/machine-qemu\x2dserial.scope 2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/iothreadN 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
Now, lets Henning's three patches
commit 71ce475967b3523882b0e1a0f649fdbea5abf9d5 Author: Henning Schild <henning.schild@siemens.com> Date: Wed Dec 9 17:58:10 2015 +0100
util: cgroups do not implicitly add task to new machine cgroup
This alters virCgroupNewMachine so that it no longer calls virCgroupAddTask. instead the callers are responsible for doing it. This is a functional no-opp change, so ostensibly harmless at this point.
The second patch
commit a41c00b472efaa192d2deae51ab732e65903238f Author: Henning Schild <henning.schild@siemens.com> Date: Mon Dec 14 15:48:05 2015 -0500
qemu: do not put a task into machine cgroup
This stops qemuInitCgroup from calling virCgroupAddTask. Instead it has added a call to vriCgroupAddTask in qemuSetupCgroupForEmulator(), which only affects the cpu, cpuset & cpuacct controllers.
This means we no longer have any guarantee about which resource controllers the QEMU processes in general are in. All we can say is that they are in the 'systemd' controller, and the cpu, cpuset & cpuacct controllers. Systemd may have also placed it into *all* the other resource controllers, or it may have placed it into none of the them.
As such, after this change, we are potentially not in the correct cgroup for the memory, blkio, netcls, devices controllers.
For added fun, the qemuSetupCgroupForEmulator() still has a call to virCgroupMoveTask, so the addition of virCgroupAddTask to this method was useless.
The final patch
commit 90b721e43ec9232b5b218e891437bed04548e841 Author: Henning Schild <henning.schild@siemens.com> Date: Mon Dec 14 15:58:05 2015 -0500
qemu cgroups: move new threads to new cgroup after cpuset is set up
This delays the point at which we call virCgroupAddTask for the vCPU and I/O thread cgroups, until after we have configured properties on those cgroups. This change is fine
So I think we need to revert the first 2 of Hennings patches - the 2nd one is the real serious problem, but once we revert that, the 1st change becomes pointless and so should also be reverted.
Going back to the key problem Henning was trying to address....
The issue is that we only setup the emulator cgroup /after/ QEMU has already started running, so there's a period of time where threads are not confined by the cgroup affinity.
I think we can solve that more simply by just moving the call to qemuSetupCgroupForEmulator(), into qemuSetupCgroup(). That way we place the emulator thread into the correct cgroup *before* we even exec QEMU. So we never have the window where we run in the unconfined cpuset controller.
Thanks for that very thorough analysis and description, very helpful indeed! I fully agree with your conclusion and suggestion on how the code should be changed. Henning

On Thu, Jan 14, 2016 at 11:57:44AM +0000, Daniel P. Berrange wrote:
Since this has been puzzelling us for a while, let me recap on the cgroup setup in general.
First, I'll describe how it used to work *before* Henning's patches were merged, on a systemd based host.
- The QEMU driver forks a child process, but does *not* exec QEMU yet The cgroup placement at this point is inherited from libvirtd. It may look like this:
10:freezer:/ 9:cpuset:/ 8:perf_event:/ 7:hugetlb:/ 6:blkio:/system.slice 5:memory:/system.slice 4:net_cls,net_prio:/ 3:devices:/system.slice/libvirtd.service 2:cpu,cpuacct:/system.slice 1:name=systemd:/system.slice/libvirtd.service
- The QEMU driver calls virCgroupNewMachine()
- We calll virSystemdCreateMachine with pidleader=$child
- Systemd creates the initial machine scope unit under the machine slice unit, for the "systemd" controller. It may also add the PID to *zero* or more other resource controllers. So at this point the cgroup placement may look like this:
10:freezer:/ 9:cpuset:/ 8:perf_event:/ 7:hugetlb:/ 6:blkio:/ 5:memory:/ 4:net_cls,net_prio:/ 3:devices:/ 2:cpu,cpuacct:/ 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
Or may look like this:
10:freezer:/machine.slice/machine-qemu\x2dserial.scope 9:cpuset:/machine.slice/machine-qemu\x2dserial.scope 8:perf_event:/machine.slice/machine-qemu\x2dserial.scope 7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope 6:blkio:/machine.slice/machine-qemu\x2dserial.scope 5:memory:/machine.slice/machine-qemu\x2dserial.scope 4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope 3:devices:/machine.slice/machine-qemu\x2dserial.scope 2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
Or anywhere in between. We have *ZERO* guarantee about what other resource controllers we may have been placed in by systemd. There is some fairly complex logic that determines this, based on what other tasks current exist in sibling cgroups, and what tasks have *previously* existed in the cgroups. IOW, you should consider the list of etra resource controllers essentially non-deterministic
- We call virCgroupAddTask with pid=$child
This places the pid in any resource controllers we need, which systemd has not already setup. IOW, it guarantees that we now have placement that should look like this, regardless of what systemd has done:
10:freezer:/machine.slice/machine-qemu\x2dserial.scope 9:cpuset:/machine.slice/machine-qemu\x2dserial.scope 8:perf_event:/machine.slice/machine-qemu\x2dserial.scope 7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope 6:blkio:/machine.slice/machine-qemu\x2dserial.scope 5:memory:/machine.slice/machine-qemu\x2dserial.scope 4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope 3:devices:/machine.slice/machine-qemu\x2dserial.scope 2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
- The QEMU driver now lets the child process exec QEMU. QEMU creates its vCPU threads at this point. All QEMU threads (emulator, vcpu and I/O threads) now have the cgroup placement shown above.
- We create the emulator cgroup for the cpuset, cpu, cpuacct controllers move all threads into this new cgroup. All threads (emulator, vcpu and I/O threads) thus now have placement of:
10:freezer:/machine.slice/machine-qemu\x2dserial.scope 9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/emulator 8:perf_event:/machine.slice/machine-qemu\x2dserial.scope 7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope 6:blkio:/machine.slice/machine-qemu\x2dserial.scope 5:memory:/machine.slice/machine-qemu\x2dserial.scope 4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope 3:devices:/machine.slice/machine-qemu\x2dserial.scope 2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/emulator 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
Yes, we really did move the vcpu threads into the emulator group...
- We now ask QEMU which are the vCPU & I/O threads.
- Foreach CPU thread we new vCPU cgroups and move them into this place
10:freezer:/machine.slice/machine-qemu\x2dserial.scope 9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/vcpuN 8:perf_event:/machine.slice/machine-qemu\x2dserial.scope 7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope 6:blkio:/machine.slice/machine-qemu\x2dserial.scope 5:memory:/machine.slice/machine-qemu\x2dserial.scope 4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope 3:devices:/machine.slice/machine-qemu\x2dserial.scope 2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/vpuN 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
- Foreach I/O thread we new vCPU cgroups and move them into this place
10:freezer:/machine.slice/machine-qemu\x2dserial.scope 9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/iothreadN 8:perf_event:/machine.slice/machine-qemu\x2dserial.scope 7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope 6:blkio:/machine.slice/machine-qemu\x2dserial.scope 5:memory:/machine.slice/machine-qemu\x2dserial.scope 4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope 3:devices:/machine.slice/machine-qemu\x2dserial.scope 2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/iothreadN 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
BTW, on a slight tangent, the kernel is throwing a spanner in the works in the near future. They have just accepted cgroupv2 into mainline. Broadly speaking this is very nice because they got rid of the idea of separate mount point for each controller, and instead have a single filesystem tree. The problem is that they decided the granularity of placement is at a *process* level, not a *thread* level. So it will no longer be possible for us to have the cgroups for emulator, vcpus & i/o threads. Everything will have to live in the same cgroup :-( For cpu accounting and cpu affinity I think we can still achieve what we need by using a combination of cgroups and sched_setaffinity and /proc. I'm not sure what we'll do about per-thread schedular policies for period + quota though - not sure if there's an API for setting those or not ?!?! https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documen... Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On Thu, 14 Jan 2016 12:37:18 +0000 "Daniel P. Berrange" <berrange@redhat.com> wrote:
On Thu, Jan 14, 2016 at 11:57:44AM +0000, Daniel P. Berrange wrote:
Since this has been puzzelling us for a while, let me recap on the cgroup setup in general.
First, I'll describe how it used to work *before* Henning's patches were merged, on a systemd based host.
- The QEMU driver forks a child process, but does *not* exec QEMU yet The cgroup placement at this point is inherited from libvirtd. It may look like this:
10:freezer:/ 9:cpuset:/ 8:perf_event:/ 7:hugetlb:/ 6:blkio:/system.slice 5:memory:/system.slice 4:net_cls,net_prio:/ 3:devices:/system.slice/libvirtd.service 2:cpu,cpuacct:/system.slice 1:name=systemd:/system.slice/libvirtd.service
- The QEMU driver calls virCgroupNewMachine()
- We calll virSystemdCreateMachine with pidleader=$child
- Systemd creates the initial machine scope unit under the machine slice unit, for the "systemd" controller. It may also add the PID to *zero* or more other resource controllers. So at this point the cgroup placement may look like this:
10:freezer:/ 9:cpuset:/ 8:perf_event:/ 7:hugetlb:/ 6:blkio:/ 5:memory:/ 4:net_cls,net_prio:/ 3:devices:/ 2:cpu,cpuacct:/ 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
Or may look like this:
10:freezer:/machine.slice/machine-qemu\x2dserial.scope 9:cpuset:/machine.slice/machine-qemu\x2dserial.scope 8:perf_event:/machine.slice/machine-qemu\x2dserial.scope 7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope 6:blkio:/machine.slice/machine-qemu\x2dserial.scope 5:memory:/machine.slice/machine-qemu\x2dserial.scope 4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope 3:devices:/machine.slice/machine-qemu\x2dserial.scope 2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
Or anywhere in between. We have *ZERO* guarantee about what other resource controllers we may have been placed in by systemd. There is some fairly complex logic that determines this, based on what other tasks current exist in sibling cgroups, and what tasks have *previously* existed in the cgroups. IOW, you should consider the list of etra resource controllers essentially non-deterministic
- We call virCgroupAddTask with pid=$child
This places the pid in any resource controllers we need, which systemd has not already setup. IOW, it guarantees that we now have placement that should look like this, regardless of what systemd has done:
10:freezer:/machine.slice/machine-qemu\x2dserial.scope 9:cpuset:/machine.slice/machine-qemu\x2dserial.scope 8:perf_event:/machine.slice/machine-qemu\x2dserial.scope 7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope 6:blkio:/machine.slice/machine-qemu\x2dserial.scope 5:memory:/machine.slice/machine-qemu\x2dserial.scope 4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope 3:devices:/machine.slice/machine-qemu\x2dserial.scope 2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
- The QEMU driver now lets the child process exec QEMU. QEMU creates its vCPU threads at this point. All QEMU threads (emulator, vcpu and I/O threads) now have the cgroup placement shown above.
- We create the emulator cgroup for the cpuset, cpu, cpuacct controllers move all threads into this new cgroup. All threads (emulator, vcpu and I/O threads) thus now have placement of:
10:freezer:/machine.slice/machine-qemu\x2dserial.scope 9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/emulator 8:perf_event:/machine.slice/machine-qemu\x2dserial.scope 7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope 6:blkio:/machine.slice/machine-qemu\x2dserial.scope 5:memory:/machine.slice/machine-qemu\x2dserial.scope 4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope 3:devices:/machine.slice/machine-qemu\x2dserial.scope 2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/emulator 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
Yes, we really did move the vcpu threads into the emulator group...
- We now ask QEMU which are the vCPU & I/O threads.
- Foreach CPU thread we new vCPU cgroups and move them into this place
10:freezer:/machine.slice/machine-qemu\x2dserial.scope 9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/vcpuN 8:perf_event:/machine.slice/machine-qemu\x2dserial.scope 7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope 6:blkio:/machine.slice/machine-qemu\x2dserial.scope 5:memory:/machine.slice/machine-qemu\x2dserial.scope 4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope 3:devices:/machine.slice/machine-qemu\x2dserial.scope 2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/vpuN 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
- Foreach I/O thread we new vCPU cgroups and move them into this place
10:freezer:/machine.slice/machine-qemu\x2dserial.scope 9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/iothreadN 8:perf_event:/machine.slice/machine-qemu\x2dserial.scope 7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope 6:blkio:/machine.slice/machine-qemu\x2dserial.scope 5:memory:/machine.slice/machine-qemu\x2dserial.scope 4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope 3:devices:/machine.slice/machine-qemu\x2dserial.scope 2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/iothreadN 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
BTW, on a slight tangent, the kernel is throwing a spanner in the works in the near future. They have just accepted cgroupv2 into mainline. Broadly speaking this is very nice because they got rid of the idea of separate mount point for each controller, and instead have a single filesystem tree. The problem is that they decided the granularity of placement is at a *process* level, not a *thread* level. So it will no longer be possible for us to have the cgroups for emulator, vcpus & i/o threads. Everything will have to live in the same cgroup :-( For cpu accounting and cpu affinity I think we can still achieve what we need by using a combination of cgroups and sched_setaffinity and /proc. I'm not sure what we'll do about per-thread schedular policies for period + quota though - not sure if there's an API for setting those or not ?!?!
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documen...
Good to know. Do you you have that on the agenda for libvirt? I guess eventually v1 will get deprecated...
Regards, Daniel

On Thu, Jan 14, 2016 at 02:09:52PM +0100, Henning Schild wrote:
On Thu, 14 Jan 2016 12:37:18 +0000 "Daniel P. Berrange" <berrange@redhat.com> wrote:
On Thu, Jan 14, 2016 at 11:57:44AM +0000, Daniel P. Berrange wrote:
Since this has been puzzelling us for a while, let me recap on the cgroup setup in general.
First, I'll describe how it used to work *before* Henning's patches were merged, on a systemd based host.
- The QEMU driver forks a child process, but does *not* exec QEMU yet The cgroup placement at this point is inherited from libvirtd. It may look like this:
10:freezer:/ 9:cpuset:/ 8:perf_event:/ 7:hugetlb:/ 6:blkio:/system.slice 5:memory:/system.slice 4:net_cls,net_prio:/ 3:devices:/system.slice/libvirtd.service 2:cpu,cpuacct:/system.slice 1:name=systemd:/system.slice/libvirtd.service
- The QEMU driver calls virCgroupNewMachine()
- We calll virSystemdCreateMachine with pidleader=$child
- Systemd creates the initial machine scope unit under the machine slice unit, for the "systemd" controller. It may also add the PID to *zero* or more other resource controllers. So at this point the cgroup placement may look like this:
10:freezer:/ 9:cpuset:/ 8:perf_event:/ 7:hugetlb:/ 6:blkio:/ 5:memory:/ 4:net_cls,net_prio:/ 3:devices:/ 2:cpu,cpuacct:/ 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
Or may look like this:
10:freezer:/machine.slice/machine-qemu\x2dserial.scope 9:cpuset:/machine.slice/machine-qemu\x2dserial.scope 8:perf_event:/machine.slice/machine-qemu\x2dserial.scope 7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope 6:blkio:/machine.slice/machine-qemu\x2dserial.scope 5:memory:/machine.slice/machine-qemu\x2dserial.scope 4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope 3:devices:/machine.slice/machine-qemu\x2dserial.scope 2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
Or anywhere in between. We have *ZERO* guarantee about what other resource controllers we may have been placed in by systemd. There is some fairly complex logic that determines this, based on what other tasks current exist in sibling cgroups, and what tasks have *previously* existed in the cgroups. IOW, you should consider the list of etra resource controllers essentially non-deterministic
- We call virCgroupAddTask with pid=$child
This places the pid in any resource controllers we need, which systemd has not already setup. IOW, it guarantees that we now have placement that should look like this, regardless of what systemd has done:
10:freezer:/machine.slice/machine-qemu\x2dserial.scope 9:cpuset:/machine.slice/machine-qemu\x2dserial.scope 8:perf_event:/machine.slice/machine-qemu\x2dserial.scope 7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope 6:blkio:/machine.slice/machine-qemu\x2dserial.scope 5:memory:/machine.slice/machine-qemu\x2dserial.scope 4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope 3:devices:/machine.slice/machine-qemu\x2dserial.scope 2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
- The QEMU driver now lets the child process exec QEMU. QEMU creates its vCPU threads at this point. All QEMU threads (emulator, vcpu and I/O threads) now have the cgroup placement shown above.
- We create the emulator cgroup for the cpuset, cpu, cpuacct controllers move all threads into this new cgroup. All threads (emulator, vcpu and I/O threads) thus now have placement of:
10:freezer:/machine.slice/machine-qemu\x2dserial.scope 9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/emulator 8:perf_event:/machine.slice/machine-qemu\x2dserial.scope 7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope 6:blkio:/machine.slice/machine-qemu\x2dserial.scope 5:memory:/machine.slice/machine-qemu\x2dserial.scope 4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope 3:devices:/machine.slice/machine-qemu\x2dserial.scope 2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/emulator 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
Yes, we really did move the vcpu threads into the emulator group...
- We now ask QEMU which are the vCPU & I/O threads.
- Foreach CPU thread we new vCPU cgroups and move them into this place
10:freezer:/machine.slice/machine-qemu\x2dserial.scope 9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/vcpuN 8:perf_event:/machine.slice/machine-qemu\x2dserial.scope 7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope 6:blkio:/machine.slice/machine-qemu\x2dserial.scope 5:memory:/machine.slice/machine-qemu\x2dserial.scope 4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope 3:devices:/machine.slice/machine-qemu\x2dserial.scope 2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/vpuN 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
- Foreach I/O thread we new vCPU cgroups and move them into this place
10:freezer:/machine.slice/machine-qemu\x2dserial.scope 9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/iothreadN 8:perf_event:/machine.slice/machine-qemu\x2dserial.scope 7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope 6:blkio:/machine.slice/machine-qemu\x2dserial.scope 5:memory:/machine.slice/machine-qemu\x2dserial.scope 4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope 3:devices:/machine.slice/machine-qemu\x2dserial.scope 2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/iothreadN 1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
BTW, on a slight tangent, the kernel is throwing a spanner in the works in the near future. They have just accepted cgroupv2 into mainline. Broadly speaking this is very nice because they got rid of the idea of separate mount point for each controller, and instead have a single filesystem tree. The problem is that they decided the granularity of placement is at a *process* level, not a *thread* level. So it will no longer be possible for us to have the cgroups for emulator, vcpus & i/o threads. Everything will have to live in the same cgroup :-( For cpu accounting and cpu affinity I think we can still achieve what we need by using a combination of cgroups and sched_setaffinity and /proc. I'm not sure what we'll do about per-thread schedular policies for period + quota though - not sure if there's an API for setting those or not ?!?!
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documen...
Good to know. Do you you have that on the agenda for libvirt? I guess eventually v1 will get deprecated...
We'll have no choice but to use cgroupv2 as soon as systemd starts using it.... Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
participants (4)
-
Daniel P. Berrange
-
Henning Schild
-
John Ferlan
-
Martin Kletzander