[libvirt] [PATCH 0/3] several cgroups/cpuset fixes

Hi,

i already explained some of the cgroup problems in some detail so i will
not do that again.
https://www.redhat.com/archives/libvir-list/2015-October/msg00876.html

I managed to solve some of the problems in the current codebase, and am
now sharing the patches. But they are really just half of what i had to
change to get libvirt to behave in a system with isolated cpus.

Other changes/hacks i am not sending here because they do not work for
the general case:
- create machine.slice before starting libvirtd (smaller than root)
  ... and hope it won't grow
- disabling cpuset.cpus inheritance in libvirtd
- allowing only xml with fully specified cputune
- set machine cpuset to (vcpupins | emulatorpin)

I am not sure how useful the individual fixes are, i am sending them as
concrete examples for the problems i described earlier. And i am hoping
that will start a discussion.

Henning

Henning Schild (3):
  util: cgroups do not implicitly add task to new machine cgroup
  qemu: do not put a task into machine cgroup
  qemu cgroups: move new threads to new cgroup after cpuset is set up

 src/lxc/lxc_cgroup.c   |  6 ++++++
 src/qemu/qemu_cgroup.c | 23 ++++++++++++++---------
 src/util/vircgroup.c   | 22 ----------------------
 3 files changed, 20 insertions(+), 31 deletions(-)

--
2.4.10

virCgroupNewMachine used to add the pidleader to the newly created
machine cgroup. Do not do this implicitly anymore.

Signed-off-by: Henning Schild <henning.schild@siemens.com>
---
 src/lxc/lxc_cgroup.c   |  6 ++++++
 src/qemu/qemu_cgroup.c |  6 ++++++
 src/util/vircgroup.c   | 22 ----------------------
 3 files changed, 12 insertions(+), 22 deletions(-)

diff --git a/src/lxc/lxc_cgroup.c b/src/lxc/lxc_cgroup.c
index ad254e4..e5ac893 100644
--- a/src/lxc/lxc_cgroup.c
+++ b/src/lxc/lxc_cgroup.c
@@ -504,6 +504,12 @@ virCgroupPtr virLXCCgroupCreate(virDomainDefPtr def,
                                 &cgroup) < 0)
         goto cleanup;
 
+    if (virCgroupAddTask(cgroup, initpid) < 0) {
+        virCgroupRemove(cgroup);
+        virCgroupFree(&cgroup);
+        goto cleanup;
+    }
+
     /* setup control group permissions for user namespace */
     if (def->idmap.uidmap) {
         if (virCgroupSetOwner(cgroup,
diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c
index a8e0b8c..28d2ca2 100644
--- a/src/qemu/qemu_cgroup.c
+++ b/src/qemu/qemu_cgroup.c
@@ -769,6 +769,12 @@ qemuInitCgroup(virQEMUDriverPtr driver,
         goto cleanup;
     }
 
+    if (virCgroupAddTask(priv->cgroup, vm->pid) < 0) {
+        virCgroupRemove(priv->cgroup);
+        virCgroupFree(&priv->cgroup);
+        goto cleanup;
+    }
+
 done:
     ret = 0;
 cleanup:
diff --git a/src/util/vircgroup.c b/src/util/vircgroup.c
index 0379c2e..a07f3c2 100644
--- a/src/util/vircgroup.c
+++ b/src/util/vircgroup.c
@@ -1669,16 +1669,6 @@ virCgroupNewMachineSystemd(const char *name,
         }
     }
 
-    if (virCgroupAddTask(*group, pidleader) < 0) {
-        virErrorPtr saved = virSaveLastError();
-        virCgroupRemove(*group);
-        virCgroupFree(group);
-        if (saved) {
-            virSetError(saved);
-            virFreeError(saved);
-        }
-    }
-
     ret = 0;
 cleanup:
     virCgroupFree(&parent);
@@ -1701,7 +1691,6 @@ int virCgroupTerminateMachine(const char *name,
 static int
 virCgroupNewMachineManual(const char *name,
                           const char *drivername,
-                          pid_t pidleader,
                           const char *partition,
                           int controllers,
                           virCgroupPtr *group)
@@ -1727,16 +1716,6 @@ virCgroupNewMachineManual(const char *name,
                            group) < 0)
         goto cleanup;
 
-    if (virCgroupAddTask(*group, pidleader) < 0) {
-        virErrorPtr saved = virSaveLastError();
-        virCgroupRemove(*group);
-        virCgroupFree(group);
-        if (saved) {
-            virSetError(saved);
-            virFreeError(saved);
-        }
-    }
-
 done:
     ret = 0;
@@ -1783,7 +1762,6 @@ virCgroupNewMachine(const char *name,
     return virCgroupNewMachineManual(name, drivername,
-                                     pidleader,
                                      partition,
                                      controllers,
                                      group);
--
2.4.10

On 11/13/2015 11:56 AM, Henning Schild wrote:
virCgroupNewMachine used to add the pidleader to the newly created
machine cgroup. Do not do this implicitly anymore.
Signed-off-by: Henning Schild <henning.schild@siemens.com>
---
 src/lxc/lxc_cgroup.c   |  6 ++++++
 src/qemu/qemu_cgroup.c |  6 ++++++
 src/util/vircgroup.c   | 22 ----------------------
 3 files changed, 12 insertions(+), 22 deletions(-)
diff --git a/src/lxc/lxc_cgroup.c b/src/lxc/lxc_cgroup.c
index ad254e4..e5ac893 100644
--- a/src/lxc/lxc_cgroup.c
+++ b/src/lxc/lxc_cgroup.c
@@ -504,6 +504,12 @@ virCgroupPtr virLXCCgroupCreate(virDomainDefPtr def,
                                 &cgroup) < 0)
         goto cleanup;
 
+    if (virCgroupAddTask(cgroup, initpid) < 0) {
+        virCgroupRemove(cgroup);
+        virCgroupFree(&cgroup);
+        goto cleanup;
+    }
+
For both this and qemu, the store/restore last error:

    virErrorPtr saved = virSaveLastError();
    ...
    if (saved) {
        virSetError(saved);
        virFreeError(saved);
    }

is "lost". I realize no other call to virCgroupRemove saves the error,
but as I found in a different review:

http://www.redhat.com/archives/libvir-list/2015-October/msg00823.html

the call to virCgroupPathOfController from virCgroupRemove could
overwrite the last error. Even though others don't have it, I think
perhaps we should ensure it still exists here. Or perhaps a patch prior
to this one that would adjust virCgroupRemove to "save/restore" the
last error around the virCgroupPathOfController call...
     /* setup control group permissions for user namespace */
     if (def->idmap.uidmap) {
         if (virCgroupSetOwner(cgroup,
diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c
index a8e0b8c..28d2ca2 100644
--- a/src/qemu/qemu_cgroup.c
+++ b/src/qemu/qemu_cgroup.c
@@ -769,6 +769,12 @@ qemuInitCgroup(virQEMUDriverPtr driver,
         goto cleanup;
     }
 
+    if (virCgroupAddTask(priv->cgroup, vm->pid) < 0) {
+        virCgroupRemove(priv->cgroup);
+        virCgroupFree(&priv->cgroup);
+        goto cleanup;
+    }
+
 done:
     ret = 0;
 cleanup:
diff --git a/src/util/vircgroup.c b/src/util/vircgroup.c
index 0379c2e..a07f3c2 100644
--- a/src/util/vircgroup.c
+++ b/src/util/vircgroup.c
@@ -1669,16 +1669,6 @@ virCgroupNewMachineSystemd(const char *name,
         }
     }
 
-    if (virCgroupAddTask(*group, pidleader) < 0) {
-        virErrorPtr saved = virSaveLastError();
-        virCgroupRemove(*group);
-        virCgroupFree(group);
-        if (saved) {
-            virSetError(saved);
-            virFreeError(saved);
-        }
-    }
-
     ret = 0;
 cleanup:
     virCgroupFree(&parent);
@@ -1701,7 +1691,6 @@ int virCgroupTerminateMachine(const char *name,
 static int
 virCgroupNewMachineManual(const char *name,
                           const char *drivername,
-                          pid_t pidleader,
                           const char *partition,
                           int controllers,
                           virCgroupPtr *group)
@@ -1727,16 +1716,6 @@ virCgroupNewMachineManual(const char *name,
                            group) < 0)
         goto cleanup;
 
-    if (virCgroupAddTask(*group, pidleader) < 0) {
-        virErrorPtr saved = virSaveLastError();
-        virCgroupRemove(*group);
-        virCgroupFree(group);
-        if (saved) {
-            virSetError(saved);
-            virFreeError(saved);
-        }
-    }
-
 done:
     ret = 0;
 
@@ -1783,7 +1762,6 @@ virCgroupNewMachine(const char *name,
 
     return virCgroupNewMachineManual(name, drivername,
-                                     pidleader,
                                      partition,
                                      controllers,
                                      group);
Beyond that - things seem reasonable. I usually defer to Martin or
Peter for cgroup stuff though...

Another thought/addition/change would be to have virCgroupNewMachine
return 'cgroup' rather than have it as the last parameter and then
check vs. NULL for success/failure rather than 0/-1...

Weak ACK - hopefully either Peter/Martin can look. I think Peter in
particular may be interested due to upcoming vCpu changes.

John

On Tue, 8 Dec 2015 12:23:14 -0500 John Ferlan <jferlan@redhat.com> wrote:
On 11/13/2015 11:56 AM, Henning Schild wrote:
virCgroupNewMachine used to add the pidleader to the newly created
machine cgroup. Do not do this implicitly anymore.
Signed-off-by: Henning Schild <henning.schild@siemens.com>
---
 src/lxc/lxc_cgroup.c   |  6 ++++++
 src/qemu/qemu_cgroup.c |  6 ++++++
 src/util/vircgroup.c   | 22 ----------------------
 3 files changed, 12 insertions(+), 22 deletions(-)

diff --git a/src/lxc/lxc_cgroup.c b/src/lxc/lxc_cgroup.c
index ad254e4..e5ac893 100644
--- a/src/lxc/lxc_cgroup.c
+++ b/src/lxc/lxc_cgroup.c
@@ -504,6 +504,12 @@ virCgroupPtr virLXCCgroupCreate(virDomainDefPtr def,
                                 &cgroup) < 0)
         goto cleanup;
 
+    if (virCgroupAddTask(cgroup, initpid) < 0) {
+        virCgroupRemove(cgroup);
+        virCgroupFree(&cgroup);
+        goto cleanup;
+    }
+
For both this and qemu, the store/restore last error:
    virErrorPtr saved = virSaveLastError();
    ...
    if (saved) {
        virSetError(saved);
        virFreeError(saved);
    }
Is "lost". I realize no other call to virCgroupRemove saves the error, but as I found in a different review:
Yes that was lost and i will get it back in. Further discussions on where it should be are out of the scope of this series.
http://www.redhat.com/archives/libvir-list/2015-October/msg00823.html
the call to virCgroupPathOfController from virCgroupRemove could overwrite the last error.
Even though others don't have it, I think perhaps we should ensure it still exists here. Or perhaps a patch prior to this one that would adjust the virCgroupRemove to "save/restore" the last error around the virCgroupPathOfController call...
     /* setup control group permissions for user namespace */
     if (def->idmap.uidmap) {
         if (virCgroupSetOwner(cgroup,
diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c
index a8e0b8c..28d2ca2 100644
--- a/src/qemu/qemu_cgroup.c
+++ b/src/qemu/qemu_cgroup.c
@@ -769,6 +769,12 @@ qemuInitCgroup(virQEMUDriverPtr driver,
         goto cleanup;
     }
 
+    if (virCgroupAddTask(priv->cgroup, vm->pid) < 0) {
+        virCgroupRemove(priv->cgroup);
+        virCgroupFree(&priv->cgroup);
+        goto cleanup;
+    }
+
 done:
     ret = 0;
 cleanup:
diff --git a/src/util/vircgroup.c b/src/util/vircgroup.c
index 0379c2e..a07f3c2 100644
--- a/src/util/vircgroup.c
+++ b/src/util/vircgroup.c
@@ -1669,16 +1669,6 @@ virCgroupNewMachineSystemd(const char *name,
         }
     }
 
-    if (virCgroupAddTask(*group, pidleader) < 0) {
-        virErrorPtr saved = virSaveLastError();
-        virCgroupRemove(*group);
-        virCgroupFree(group);
-        if (saved) {
-            virSetError(saved);
-            virFreeError(saved);
-        }
-    }
-
     ret = 0;
 cleanup:
     virCgroupFree(&parent);
@@ -1701,7 +1691,6 @@ int virCgroupTerminateMachine(const char *name,
 static int
 virCgroupNewMachineManual(const char *name,
                           const char *drivername,
-                          pid_t pidleader,
                           const char *partition,
                           int controllers,
                           virCgroupPtr *group)
@@ -1727,16 +1716,6 @@ virCgroupNewMachineManual(const char *name,
                            group) < 0)
         goto cleanup;
 
-    if (virCgroupAddTask(*group, pidleader) < 0) {
-        virErrorPtr saved = virSaveLastError();
-        virCgroupRemove(*group);
-        virCgroupFree(group);
-        if (saved) {
-            virSetError(saved);
-            virFreeError(saved);
-        }
-    }
-
 done:
     ret = 0;
 
@@ -1783,7 +1762,6 @@ virCgroupNewMachine(const char *name,
 
     return virCgroupNewMachineManual(name, drivername,
-                                     pidleader,
                                      partition,
                                      controllers,
                                      group);
Beyond that - things seem reasonable. I usually defer to Martin or Peter for cgroup stuff though...
Another thought/addition/change would be to have virCgroupNewMachine return 'cgroup' rather than have it as the last parameter and then check vs. NULL for success/failure rather than 0/-1...
Weak ACK - hopefully either Peter/Martin can look. I think Peter in particular may be interested due to upcoming vCpu changes.
John

virCgroupNewMachine used to add the pidleader to the newly created
machine cgroup. Do not do this implicitly anymore.

Signed-off-by: Henning Schild <henning.schild@siemens.com>
---
 src/lxc/lxc_cgroup.c   | 11 +++++++++++
 src/qemu/qemu_cgroup.c | 11 +++++++++++
 src/util/vircgroup.c   | 22 ----------------------
 3 files changed, 22 insertions(+), 22 deletions(-)

diff --git a/src/lxc/lxc_cgroup.c b/src/lxc/lxc_cgroup.c
index ad254e4..609e9ea 100644
--- a/src/lxc/lxc_cgroup.c
+++ b/src/lxc/lxc_cgroup.c
@@ -504,6 +504,17 @@ virCgroupPtr virLXCCgroupCreate(virDomainDefPtr def,
                                 &cgroup) < 0)
         goto cleanup;
 
+    if (virCgroupAddTask(cgroup, initpid) < 0) {
+        virErrorPtr saved = virSaveLastError();
+        virCgroupRemove(cgroup);
+        virCgroupFree(&cgroup);
+        if (saved) {
+            virSetError(saved);
+            virFreeError(saved);
+        }
+        goto cleanup;
+    }
+
     /* setup control group permissions for user namespace */
     if (def->idmap.uidmap) {
         if (virCgroupSetOwner(cgroup,
diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c
index 0da6c02..7320046 100644
--- a/src/qemu/qemu_cgroup.c
+++ b/src/qemu/qemu_cgroup.c
@@ -770,6 +770,17 @@ qemuInitCgroup(virQEMUDriverPtr driver,
         goto cleanup;
     }
 
+    if (virCgroupAddTask(priv->cgroup, vm->pid) < 0) {
+        virErrorPtr saved = virSaveLastError();
+        virCgroupRemove(priv->cgroup);
+        virCgroupFree(&priv->cgroup);
+        if (saved) {
+            virSetError(saved);
+            virFreeError(saved);
+        }
+        goto cleanup;
+    }
+
 done:
     ret = 0;
 cleanup:
diff --git a/src/util/vircgroup.c b/src/util/vircgroup.c
index 0379c2e..a07f3c2 100644
--- a/src/util/vircgroup.c
+++ b/src/util/vircgroup.c
@@ -1669,16 +1669,6 @@ virCgroupNewMachineSystemd(const char *name,
         }
     }
 
-    if (virCgroupAddTask(*group, pidleader) < 0) {
-        virErrorPtr saved = virSaveLastError();
-        virCgroupRemove(*group);
-        virCgroupFree(group);
-        if (saved) {
-            virSetError(saved);
-            virFreeError(saved);
-        }
-    }
-
     ret = 0;
 cleanup:
     virCgroupFree(&parent);
@@ -1701,7 +1691,6 @@ int virCgroupTerminateMachine(const char *name,
 static int
 virCgroupNewMachineManual(const char *name,
                           const char *drivername,
-                          pid_t pidleader,
                           const char *partition,
                           int controllers,
                           virCgroupPtr *group)
@@ -1727,16 +1716,6 @@ virCgroupNewMachineManual(const char *name,
                            group) < 0)
         goto cleanup;
 
-    if (virCgroupAddTask(*group, pidleader) < 0) {
-        virErrorPtr saved = virSaveLastError();
-        virCgroupRemove(*group);
-        virCgroupFree(group);
-        if (saved) {
-            virSetError(saved);
-            virFreeError(saved);
-        }
-    }
-
 done:
     ret = 0;
 
@@ -1783,7 +1762,6 @@ virCgroupNewMachine(const char *name,
 
     return virCgroupNewMachineManual(name, drivername,
-                                     pidleader,
                                      partition,
                                      controllers,
                                      group);
--
2.4.10

The machine cgroup is a superset, a parent to the emulator and vcpuX
cgroups. The parent cgroup should never have any tasks directly in it.
In fact the parent cpuset might contain way more cpus than the sum of
emulatorpin and vcpupins. So putting tasks in the superset will allow
them to run outside of <cputune>.

Signed-off-by: Henning Schild <henning.schild@siemens.com>
---
 src/qemu/qemu_cgroup.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c
index 28d2ca2..2c74a22 100644
--- a/src/qemu/qemu_cgroup.c
+++ b/src/qemu/qemu_cgroup.c
@@ -769,12 +769,6 @@ qemuInitCgroup(virQEMUDriverPtr driver,
         goto cleanup;
     }
 
-    if (virCgroupAddTask(priv->cgroup, vm->pid) < 0) {
-        virCgroupRemove(priv->cgroup);
-        virCgroupFree(&priv->cgroup);
-        goto cleanup;
-    }
-
 done:
     ret = 0;
 cleanup:
@@ -1145,6 +1139,10 @@ qemuSetupCgroupForEmulator(virDomainObjPtr vm)
         goto cleanup;
     }
 
+    /* consider the first thread an emulator-thread */
+    if (virCgroupAddTask(cgroup_emulator, vm->pid) < 0)
+        goto cleanup;
+
     virCgroupFree(&cgroup_emulator);
 
     return 0;
--
2.4.10

On 11/13/2015 11:57 AM, Henning Schild wrote:
The machine cgroup is a superset, a parent to the emulator and vcpuX cgroups. The parent cgroup should never have any tasks directly in it. In fact the parent cpuset might contain way more cpus than the sum of emulatorpin and vcpupins. So putting tasks in the superset will allow them to run outside of <cputune>.
Signed-off-by: Henning Schild <henning.schild@siemens.com>
---
 src/qemu/qemu_cgroup.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c
index 28d2ca2..2c74a22 100644
--- a/src/qemu/qemu_cgroup.c
+++ b/src/qemu/qemu_cgroup.c
@@ -769,12 +769,6 @@ qemuInitCgroup(virQEMUDriverPtr driver,
         goto cleanup;
     }
 
-    if (virCgroupAddTask(priv->cgroup, vm->pid) < 0) {
-        virCgroupRemove(priv->cgroup);
-        virCgroupFree(&priv->cgroup);
-        goto cleanup;
-    }
-
Moving this to later would also seem to imply that the code after the
qemuSetupCgroup (which calls qemuInitCgroup) from qemuProcessLaunch
would need some movement too, e.g.:

    /* This must be done after cgroup placement to avoid resetting CPU
     * affinity */
    if (!vm->def->cputune.emulatorpin &&
        qemuProcessInitCpuAffinity(vm) < 0)
        goto cleanup;

Theoretically that would then need to be between the following:

    VIR_DEBUG("Setting cgroup for emulator (if required)");
    if (qemuSetupCgroupForEmulator(vm) < 0)
        goto cleanup;

    <<<... right here, I believe ...>>>

    VIR_DEBUG("Setting affinity of emulator threads");
    if (qemuProcessSetEmulatorAffinity(vm) < 0)
        goto cleanup;

Again, weak ACK - hopefully Peter/Martin can take a look. In any case a
v2 probably should be done.

John
 done:
     ret = 0;
 cleanup:
@@ -1145,6 +1139,10 @@ qemuSetupCgroupForEmulator(virDomainObjPtr vm)
         goto cleanup;
     }
 
+    /* consider the first thread an emulator-thread */
+    if (virCgroupAddTask(cgroup_emulator, vm->pid) < 0)
+        goto cleanup;
+
     virCgroupFree(&cgroup_emulator);
 
     return 0;

On Tue, 8 Dec 2015 12:23:19 -0500 John Ferlan <jferlan@redhat.com> wrote:
On 11/13/2015 11:57 AM, Henning Schild wrote:
The machine cgroup is a superset, a parent to the emulator and vcpuX cgroups. The parent cgroup should never have any tasks directly in it. In fact the parent cpuset might contain way more cpus than the sum of emulatorpin and vcpupins. So putting tasks in the superset will allow them to run outside of <cputune>.
Signed-off-by: Henning Schild <henning.schild@siemens.com>
---
 src/qemu/qemu_cgroup.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c
index 28d2ca2..2c74a22 100644
--- a/src/qemu/qemu_cgroup.c
+++ b/src/qemu/qemu_cgroup.c
@@ -769,12 +769,6 @@ qemuInitCgroup(virQEMUDriverPtr driver,
         goto cleanup;
     }
 
-    if (virCgroupAddTask(priv->cgroup, vm->pid) < 0) {
-        virCgroupRemove(priv->cgroup);
-        virCgroupFree(&priv->cgroup);
-        goto cleanup;
-    }
-
Moving this to later would also seem to imply that the code after the qemuSetupCgroup (which calls qemuInitCgroup) from qemuProcessLaunch would need some movement too, e.g.:
    /* This must be done after cgroup placement to avoid resetting CPU
     * affinity */
    if (!vm->def->cputune.emulatorpin &&
        qemuProcessInitCpuAffinity(vm) < 0)
        goto cleanup;
Theoretically that would then need to be between the following:
    VIR_DEBUG("Setting cgroup for emulator (if required)");
    if (qemuSetupCgroupForEmulator(vm) < 0)
        goto cleanup;
<<<... right here, I believe ...>>>
Good catch! That code is confusing. I will try and merge qemuProcessInitCpuAffinity with qemuProcessSetEmulatorAffinity.
    VIR_DEBUG("Setting affinity of emulator threads");
    if (qemuProcessSetEmulatorAffinity(vm) < 0)
        goto cleanup;
Again, weak ACK - hopefully Peter/Martin can take a look. In any case a v2 probably should be done.
John
 done:
     ret = 0;
 cleanup:
@@ -1145,6 +1139,10 @@ qemuSetupCgroupForEmulator(virDomainObjPtr vm)
         goto cleanup;
     }
 
+    /* consider the first thread an emulator-thread */
+    if (virCgroupAddTask(cgroup_emulator, vm->pid) < 0)
+        goto cleanup;
+
     virCgroupFree(&cgroup_emulator);
 
     return 0;

The machine cgroup is a superset, a parent to the emulator and vcpuX
cgroups. The parent cgroup should never have any tasks directly in it.
In fact the parent cpuset might contain way more cpus than the sum of
emulatorpin and vcpupins. So putting tasks in the superset will allow
them to run outside of <cputune>.

Signed-off-by: Henning Schild <henning.schild@siemens.com>
---
 src/qemu/qemu_cgroup.c  | 15 ++++-----------
 src/qemu/qemu_process.c | 12 ++++++------
 2 files changed, 10 insertions(+), 17 deletions(-)

diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c
index 7320046..85b8e4e 100644
--- a/src/qemu/qemu_cgroup.c
+++ b/src/qemu/qemu_cgroup.c
@@ -770,17 +770,6 @@ qemuInitCgroup(virQEMUDriverPtr driver,
         goto cleanup;
     }
 
-    if (virCgroupAddTask(priv->cgroup, vm->pid) < 0) {
-        virErrorPtr saved = virSaveLastError();
-        virCgroupRemove(priv->cgroup);
-        virCgroupFree(&priv->cgroup);
-        if (saved) {
-            virSetError(saved);
-            virFreeError(saved);
-        }
-        goto cleanup;
-    }
-
 done:
     ret = 0;
 cleanup:
@@ -1151,6 +1140,10 @@ qemuSetupCgroupForEmulator(virDomainObjPtr vm)
         goto cleanup;
     }
 
+    /* consider the first thread an emulator-thread */
+    if (virCgroupAddTask(cgroup_emulator, vm->pid) < 0)
+        goto cleanup;
+
     virCgroupFree(&cgroup_emulator);
 
     return 0;
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
index f7eb2b6..cfe1da8 100644
--- a/src/qemu/qemu_process.c
+++ b/src/qemu/qemu_process.c
@@ -4881,12 +4881,6 @@ int qemuProcessStart(virConnectPtr conn,
     if (qemuSetupCgroup(driver, vm, nnicindexes, nicindexes) < 0)
         goto cleanup;
 
-    /* This must be done after cgroup placement to avoid resetting CPU
-     * affinity */
-    if (!vm->def->cputune.emulatorpin &&
-        qemuProcessInitCpuAffinity(vm) < 0)
-        goto cleanup;
-
     VIR_DEBUG("Setting domain security labels");
     if (virSecurityManagerSetAllLabel(driver->securityManager,
                                       vm->def, stdin_path) < 0)
@@ -4934,6 +4928,12 @@ int qemuProcessStart(virConnectPtr conn,
     if (qemuSetupCgroupForEmulator(vm) < 0)
         goto cleanup;
 
+    /* This must be done after cgroup placement to avoid resetting CPU
+     * affinity */
+    if (!vm->def->cputune.emulatorpin &&
+        qemuProcessInitCpuAffinity(vm) < 0)
+        goto cleanup;
+
     VIR_DEBUG("Setting affinity of emulator threads");
     if (qemuProcessSetEmulatorAffinity(vm) < 0)
         goto cleanup;
--
2.4.10

Moving tasks to cgroups implied sched_setaffinity. Changing the cpus in
a set implies the same for all tasks in the group. The old code put the
thread into the cpuset inherited from the machine cgroup, which allowed
it to run outside of vcpupin for a short while.

Signed-off-by: Henning Schild <henning.schild@siemens.com>
---
 src/qemu/qemu_cgroup.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c
index 2c74a22..ab61a09 100644
--- a/src/qemu/qemu_cgroup.c
+++ b/src/qemu/qemu_cgroup.c
@@ -1030,10 +1030,6 @@ qemuSetupCgroupForVcpu(virDomainObjPtr vm)
                                true, &cgroup_vcpu) < 0)
             goto cleanup;
 
-        /* move the thread for vcpu to sub dir */
-        if (virCgroupAddTask(cgroup_vcpu, priv->vcpupids[i]) < 0)
-            goto cleanup;
-
         if (period || quota) {
             if (qemuSetupCgroupVcpuBW(cgroup_vcpu, period, quota) < 0)
                 goto cleanup;
@@ -1067,6 +1063,11 @@ qemuSetupCgroupForVcpu(virDomainObjPtr vm)
             if (qemuSetupCgroupCpusetCpus(cgroup_vcpu, cpumap) < 0)
                 goto cleanup;
         }
+
+        /* move the thread for vcpu to sub dir */
+        if (virCgroupAddTask(cgroup_vcpu, priv->vcpupids[i]) < 0)
+            goto cleanup;
+
     }
 
     virCgroupFree(&cgroup_vcpu);
     VIR_FREE(mem_mask);
@@ -1208,11 +1209,6 @@ qemuSetupCgroupForIOThreads(virDomainObjPtr vm)
                                true, &cgroup_iothread) < 0)
             goto cleanup;
 
-        /* move the thread for iothread to sub dir */
-        if (virCgroupAddTask(cgroup_iothread,
-                             def->iothreadids[i]->thread_id) < 0)
-            goto cleanup;
-
         if (period || quota) {
             if (qemuSetupCgroupVcpuBW(cgroup_iothread, period, quota) < 0)
                 goto cleanup;
@@ -1239,6 +1235,11 @@ qemuSetupCgroupForIOThreads(virDomainObjPtr vm)
                 goto cleanup;
         }
 
+        /* move the thread for iothread to sub dir */
+        if (virCgroupAddTask(cgroup_iothread,
+                             def->iothreadids[i]->thread_id) < 0)
+            goto cleanup;
+
         virCgroupFree(&cgroup_iothread);
     }
 
     VIR_FREE(mem_mask);
--
2.4.10

On 11/13/2015 11:57 AM, Henning Schild wrote:
Moving tasks to cgroups implied sched_setaffinity. Changing the cpus in
a set implies the same for all tasks in the group. The old code put the
thread into the cpuset inherited from the machine cgroup, which allowed
it to run outside of vcpupin for a short while.
Signed-off-by: Henning Schild <henning.schild@siemens.com>
---
 src/qemu/qemu_cgroup.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)
This seems reasonable and follows (more or less) what was done for the emulator at least with respect to order of operations. Another weak ACK - John

On 11/13/2015 11:56 AM, Henning Schild wrote:
Hi,
i already explained some of the cgroup problems in some detail so i will not do that again. https://www.redhat.com/archives/libvir-list/2015-October/msg00876.html
I managed to solve some of the problems in the current codebase, and am now sharing the patches. But they are really just half of what i had to change to get libvirt to behave in a system with isolated cpus.
Other changes/hacks i am not sending here because they do not work for
the general case:
- create machine.slice before starting libvirtd (smaller than root)
  ... and hope it won't grow
- disabling cpuset.cpus inheritance in libvirtd
- allowing only xml with fully specified cputune
- set machine cpuset to (vcpupins | emulatorpin)
I am not sure how useful the individual fixes are, i am sending them as concrete examples for the problems i described earlier. And i am hoping that will start a discussion.
Henning
Henning Schild (3):
  util: cgroups do not implicitly add task to new machine cgroup
  qemu: do not put a task into machine cgroup
  qemu cgroups: move new threads to new cgroup after cpuset is set up

 src/lxc/lxc_cgroup.c   |  6 ++++++
 src/qemu/qemu_cgroup.c | 23 ++++++++++++++---------
 src/util/vircgroup.c   | 22 ----------------------
 3 files changed, 20 insertions(+), 31 deletions(-)
The updated code looks fine to me - although it didn't directly git am -3 to top of tree - I was able to make a few adjustments to get things merged... Since no one has objected to this ordering change - I've pushed. Tks - John

On Mon, 14 Dec 2015 16:27:54 -0500 John Ferlan <jferlan@redhat.com> wrote:
On 11/13/2015 11:56 AM, Henning Schild wrote:
Hi,
i already explained some of the cgroup problems in some detail so i will not do that again. https://www.redhat.com/archives/libvir-list/2015-October/msg00876.html
I managed to solve some of the problems in the current codebase, and am now sharing the patches. But they are really just half of what i had to change to get libvirt to behave in a system with isolated cpus.
Other changes/hacks i am not sending here because they do not work for
the general case:
- create machine.slice before starting libvirtd (smaller than root)
  ... and hope it won't grow
- disabling cpuset.cpus inheritance in libvirtd
- allowing only xml with fully specified cputune
- set machine cpuset to (vcpupins | emulatorpin)
I am not sure how useful the individual fixes are, i am sending them as concrete examples for the problems i described earlier. And i am hoping that will start a discussion.
Henning
Henning Schild (3):
  util: cgroups do not implicitly add task to new machine cgroup
  qemu: do not put a task into machine cgroup
  qemu cgroups: move new threads to new cgroup after cpuset is set up

 src/lxc/lxc_cgroup.c   |  6 ++++++
 src/qemu/qemu_cgroup.c | 23 ++++++++++++++---------
 src/util/vircgroup.c   | 22 ----------------------
 3 files changed, 20 insertions(+), 31 deletions(-)
The updated code looks fine to me - although it didn't directly git am -3 to top of tree - I was able to make a few adjustments to get things merged... Since no one has objected to this ordering change - I've pushed.
Sorry the patches were still based on v1.2.19. Thanks for the merge and
accepting them!

Wrong operation ordering within libvirt cgroups (like the ones fixed by
the patches) could still push tasks onto dedicated cpus. And more
importantly other cgroups users can still grab the dedicated cpus as
well. The only reliable solution to prevent that seems to be making use
of the "exclusive" feature of cpusets. And that would imply changing
the cgroups layout of libvirt again, because sets can not be partially
exclusive and libvirt deals with dedicated cpus and shared ones.

How to deal with these problems is a discussion that i wanted to get
started with this patch-series. It would be nice to receive general
comments on that. How should we proceed here? I could maybe write an
RFC mail describing the problems again and suggesting changes to
libvirt on a conceptual basis.

But until then maybe people responsible for cgroups in libvirt (Paul
and Martin?) can again look at
https://www.redhat.com/archives/libvir-list/2015-October/msg00876.html
There i described how naive use of cgroups can place tasks on cpus that
are supposed to be isolated/dedicated/exclusive. Even if libvirt does
not make these mistakes it should protect itself against docker,
systemd, ...

Henning

On 12/21/2015 03:36 AM, Henning Schild wrote:
On Mon, 14 Dec 2015 16:27:54 -0500 John Ferlan <jferlan@redhat.com> wrote:
On 11/13/2015 11:56 AM, Henning Schild wrote:
Hi,
i already explained some of the cgroup problems in some detail so i will not do that again. https://www.redhat.com/archives/libvir-list/2015-October/msg00876.html
I managed to solve some of the problems in the current codebase, and am now sharing the patches. But they are really just half of what i had to change to get libvirt to behave in a system with isolated cpus.
Other changes/hacks i am not sending here because they do not work for
the general case:
- create machine.slice before starting libvirtd (smaller than root)
  ... and hope it won't grow
- disabling cpuset.cpus inheritance in libvirtd
- allowing only xml with fully specified cputune
- set machine cpuset to (vcpupins | emulatorpin)
I am not sure how useful the individual fixes are, i am sending them as concrete examples for the problems i described earlier. And i am hoping that will start a discussion.
Henning
Henning Schild (3):
  util: cgroups do not implicitly add task to new machine cgroup
  qemu: do not put a task into machine cgroup
  qemu cgroups: move new threads to new cgroup after cpuset is set up

 src/lxc/lxc_cgroup.c   |  6 ++++++
 src/qemu/qemu_cgroup.c | 23 ++++++++++++++---------
 src/util/vircgroup.c   | 22 ----------------------
 3 files changed, 20 insertions(+), 31 deletions(-)
The updated code looks fine to me - although it didn't directly git am -3 to top of tree - I was able to make a few adjustments to get things merged... Since no one has objected to this ordering change - I've pushed.
Sorry the patches where still based on v1.2.19. Thanks for the merge and accepting them!
No problem - although it seems they've generated a regression in the
virttest memtune test suite.

I'm 'technically' on vacation for the next couple of weeks; however, I
think/perhaps the problem is a result of this patch and the change to
adding the task to the cgroup at the end of the for loop, but perhaps
the following code causes the control to jump back to the top of the
loop:

    if (!cpumap)
        continue;

    if (qemuSetupCgroupCpusetCpus(cgroup_vcpu, cpumap) < 0)
        goto cleanup;

not allowing the

    /* move the thread for vcpu to sub dir */
    if (virCgroupAddTask(cgroup_vcpu, qemuDomainGetVcpuPid(vm, i)) < 0)
        goto cleanup;

to be executed. The code should probably change to be (like IOThreads):

    if (cpumap &&
        qemuSetupCgroupCpusetCpus(cgroup_vcpu, cpumap) < 0)
        goto cleanup;

As for the rest, I suspect things will be quite quiet around here over
the next couple of weeks. A discussion to perhaps start in the new
year.

John
Wrong operation ordering within libvirt cgroups (like the ones fixed by the patches) could still push tasks onto dedicated cpus. And more importantly other cgroups users can still grab the dedicated cpus as well. The only reliable solution to prevent that seems to be making use of the "exclusive" feature of cpusets. And that would imply changing the cgroups layout of libvirt again. Because sets can not be partially exclusive and libvirt deals with dedicated cpus and shared ones. How to deal with these problems is a discussion that i wanted to get started with this patch-series. It would be nice to receive general comments on that. How should we proceed here? I could maybe write an RFC mail describing the problems again and suggesting changes to libvirt on a conceptual basis.
But until then maybe people responsible for cgroups in libvirt (Paul and Martin?) can again look at https://www.redhat.com/archives/libvir-list/2015-October/msg00876.html There i described how naive use of cgroups can place tasks on cpus that are supposed to be isolated/dedicated/exclusive. Even if libvirt does not make these mistakes it should protect itself against docker, systemd, ...
Henning

On Mon, 21 Dec 2015 12:44:32 -0500 John Ferlan <jferlan@redhat.com> wrote:
On 12/21/2015 03:36 AM, Henning Schild wrote:
On Mon, 14 Dec 2015 16:27:54 -0500 John Ferlan <jferlan@redhat.com> wrote:
On 11/13/2015 11:56 AM, Henning Schild wrote:
Hi,
I already explained some of the cgroup problems in some detail, so I will not do that again. https://www.redhat.com/archives/libvir-list/2015-October/msg00876.html
I managed to solve some of the problems in the current codebase, and am now sharing the patches. But they are really just half of what I had to change to get libvirt to behave in a system with isolated cpus.
Other changes/hacks I am not sending here because they do not work for the general case:
- create machine.slice before starting libvirtd (smaller than root) ... and hope it won't grow
- disabling cpuset.cpus inheritance in libvirtd
- allowing only xml with fully specified cputune
- set machine cpuset to (vcpupins | emulatorpin)
I am not sure how useful the individual fixes are; I am sending them as concrete examples for the problems I described earlier. And I am hoping that will start a discussion.
Henning
Henning Schild (3):
  util: cgroups do not implicitly add task to new machine cgroup
  qemu: do not put a task into machine cgroup
  qemu cgroups: move new threads to new cgroup after cpuset is set up

 src/lxc/lxc_cgroup.c   |  6 ++++++
 src/qemu/qemu_cgroup.c | 23 ++++++++++++++---------
 src/util/vircgroup.c   | 22 ----------------------
 3 files changed, 20 insertions(+), 31 deletions(-)
The updated code looks fine to me - although it didn't directly git am -3 to top of tree - I was able to make a few adjustments to get things merged... Since no one has objected to this ordering change - I've pushed.
Sorry, the patches were still based on v1.2.19. Thanks for the merge and for accepting them!
No problem - although it seems they've generated a regression in the virttest memtune test suite. I'm 'technically' on vacation for the next couple of weeks; however, I think the problem is a result of this patch and its change to add the task to the cgroup at the end of the for loop: the following code can cause control to jump back to the top of the loop:
if (!cpumap)
    continue;

if (qemuSetupCgroupCpusetCpus(cgroup_vcpu, cpumap) < 0)
    goto cleanup;

not allowing the

/* move the thread for vcpu to sub dir */
if (virCgroupAddTask(cgroup_vcpu,
                     qemuDomainGetVcpuPid(vm, i)) < 0)
    goto cleanup;

to be executed.

The code should probably change to be (like IOThreads):

if (cpumap &&
    qemuSetupCgroupCpusetCpus(cgroup_vcpu, cpumap) < 0)
    goto cleanup;
As for the rest, I suspect things will be quite quiet around here over the next couple of weeks. A discussion to perhaps start in the new year.
Same here. I will have a look at that regression after my vacation, should it still be there. Henning
John
Wrong operation ordering within libvirt cgroups (like the cases fixed by these patches) could still push tasks onto dedicated cpus. More importantly, other cgroups users can still grab the dedicated cpus as well. The only reliable way to prevent that seems to be the "exclusive" feature of cpusets, and that would imply changing libvirt's cgroups layout again, because sets cannot be partially exclusive and libvirt deals with both dedicated and shared cpus. How to deal with these problems is the discussion I wanted to start with this patch series. It would be nice to receive general comments on that. How should we proceed here? I could write an RFC mail describing the problems again and suggesting changes to libvirt on a conceptual basis.
But until then, maybe the people responsible for cgroups in libvirt (Paul and Martin?) can take another look at https://www.redhat.com/archives/libvir-list/2015-October/msg00876.html There I described how naive use of cgroups can place tasks on cpus that are supposed to be isolated/dedicated/exclusive. Even if libvirt does not make these mistakes, it should protect itself against docker, systemd, ...
Henning

[...]
No problem - although it seems they've generated a regression in the virttest memtune test suite. I'm 'technically' on vacation for the next couple of weeks; however, I think/perhaps the problem is a result of this patch and the change to adding the task to the cgroup at the end of the for loop, but perhaps the following code causes the control to jump back to the top of the loop:
if (!cpumap) continue;
if (qemuSetupCgroupCpusetCpus(cgroup_vcpu, cpumap) < 0) goto cleanup;
not allowing the
/* move the thread for vcpu to sub dir */ if (virCgroupAddTask(cgroup_vcpu, qemuDomainGetVcpuPid(vm, i)) < 0) goto cleanup;
to be executed.
The code should probably change to be (like IOThreads):
if (cpumap && qemuSetupCgroupCpusetCpus(cgroup_vcpu, cpumap) < 0) goto cleanup;
As for the rest, I suspect things will be quite quiet around here over the next couple of weeks. A discussion to perhaps start in the new year.
Same here. I will have a look at that regression after my vacation, should it still be there.
Henning
More data from the issue... While the above mentioned path is an issue, I don't believe it's what's causing the test failure.

I haven't quite figured out why yet, but it seems the /proc/#/cgroup file isn't getting the proper path for the 'memory' slice and thus the test fails because it's looking at the:

/sys/fs/cgroup/memory/machine.slice/memory.*

files instead of the

/sys/fs/cgroup/memory/machine.slice/$path/memory.*

Where $path is "machine-qemu\x2dvirt\x2dtests\x2dvm1.scope"

This affects the virsh memtune $dom command test suite, which uses the /proc/$pid/cgroup file in order to find the path for the 'memory' or 'cpuset' or 'cpu,cpuacct' cgroup paths.

Seems to be some interaction with systemd that I haven't quite figured out.

I'm assuming this is essentially the issue you were trying to fix - that is, changes to values should be done to the machine-qemu* specific files rather than the machine.slice files.

The good news is I can see the changes occurring in the machine-qemu* specific files, so it seems libvirt is doing the right thing.

However, there's something strange with perhaps previously existing/running domains where that /proc/$pid/cgroup file doesn't get the $path for the memory entry, thus causing the test validation to look in the wrong place.

Hopefully this makes sense. What's really strange (for me at least) is that it's only occurring on one test system. I can set up the same test on another system and things work just fine. I'm not quite sure what interaction generates that /proc/$pid/cgroup file - hopefully someone else understands it and can help me make sense of it.

John

On Thu, 7 Jan 2016 11:20:23 -0500 John Ferlan <jferlan@redhat.com> wrote:
[...]
No problem - although it seems they've generated a regression in the virttest memtune test suite. I'm 'technically' on vacation for the next couple of weeks; however, I think/perhaps the problem is a result of this patch and the change to adding the task to the cgroup at the end of the for loop, but perhaps the following code causes the control to jump back to the top of the loop:
if (!cpumap) continue;
if (qemuSetupCgroupCpusetCpus(cgroup_vcpu, cpumap) < 0) goto cleanup;
not allowing the
/* move the thread for vcpu to sub dir */ if (virCgroupAddTask(cgroup_vcpu, qemuDomainGetVcpuPid(vm, i)) < 0) goto cleanup;
to be executed.
The code should probably change to be (like IOThreads):
if (cpumap && qemuSetupCgroupCpusetCpus(cgroup_vcpu, cpumap) < 0) goto cleanup;
As for the rest, I suspect things will be quite quiet around here over the next couple of weeks. A discussion to perhaps start in the new year.
Same here. I will have a look at that regression after my vacation, should it still be there.
Henning
More data from the issue... While the above mentioned path is an issue, I don't believe it's what's causing the test failure.
I haven't quite figured out why yet, but it seems the /proc/#/cgroup file isn't getting the proper path for the 'memory' slice and thus the test fails because it's looking at the:
/sys/fs/cgroup/memory/machine.slice/memory.*
files instead of the
/sys/fs/cgroup/memory/machine.slice/$path/memory.*
To be honest I did just look at the cgroup/cpuset/ hierarchy, but I just browsed cgroup/memory/ as well.

The target of my patch series was to get cgroup/cpuset/machine.slice/tasks to be empty; all tasks should be in their sub-cgroup under the machine.slice. And the ordering patches make sure the file is always empty.

In the memory cgroups all tasks are in the parent group (all in machine.slice/tasks); machine.slice/*/tasks are empty. I am not sure whether that is intended, I can only assume it is a bug in the memory cgroup handling. Why are the groups created and tuned when the tasks stay in the big superset?

/proc/#/cgroup is showing the correct path; libvirt seems to fail to migrate tasks into memory subgroups. (I am talking about a patched 1.2.19 where vms do not have any special memory tuning.)

Without my patches the first qemu thread was in "2:cpuset:/machine.slice" and the name did match "4:memory:/machine.slice". Now if the test wants matching names, the test might just be wrong. Or, as indicated before, there might be a bug in the memory cgroups.
Where $path is "machine-qemu\x2dvirt\x2dtests\x2dvm1.scope"
This affects the virsh memtune $dom command test suite which uses the /proc/$pid/cgroup file in order to find the path for the 'memory' or 'cpuset' or 'cpu,cpuacct' cgroup paths.
Seems to be some interaction with systemd that I have quite figured out.
I'm assuming this is essentially the issue you were trying to fix - that is changes to values should be done to the machine-qemu* specific files rather than the machine.slice files.
The good news is I can see the changes occurring in the machine-qemu* specific files, so it seems libvirt is doing the right thing.
However, there's something strange with perhaps previously existing/running domains where that /proc/$pid/cgroup file doesn't get the $path for the memory entry, thus causing the test validation to look in the wrong place.
Hopefully this makes sense. What's really strange (for me at least) is that it's only occurring on one test system. I can set up the same test on another system and things work just fine. I'm not quite sure what interaction generates that /proc/$pid/cgroup file - hopefully someone else understands it and help me make sense of it.

On 01/07/2016 02:01 PM, Henning Schild wrote:
On Thu, 7 Jan 2016 11:20:23 -0500 John Ferlan <jferlan@redhat.com> wrote:
[...]
No problem - although it seems they've generated a regression in the virttest memtune test suite. I'm 'technically' on vacation for the next couple of weeks; however, I think/perhaps the problem is a result of this patch and the change to adding the task to the cgroup at the end of the for loop, but perhaps the following code causes the control to jump back to the top of the loop:
if (!cpumap) continue;
if (qemuSetupCgroupCpusetCpus(cgroup_vcpu, cpumap) < 0) goto cleanup;
not allowing the
/* move the thread for vcpu to sub dir */ if (virCgroupAddTask(cgroup_vcpu, qemuDomainGetVcpuPid(vm, i)) < 0) goto cleanup;
to be executed.
The code should probably change to be (like IOThreads):
if (cpumap && qemuSetupCgroupCpusetCpus(cgroup_vcpu, cpumap) < 0) goto cleanup;
As for the rest, I suspect things will be quite quiet around here over the next couple of weeks. A discussion to perhaps start in the new year.
Same here. I will have a look at that regression after my vacation, should it still be there.
Henning
More data from the issue... While the above mentioned path is an issue, I don't believe it's what's causing the test failure.
I haven't quite figured out why yet, but it seems the /proc/#/cgroup file isn't getting the proper path for the 'memory' slice and thus the test fails because it's looking at the:
/sys/fs/cgroup/memory/machine.slice/memory.*
files instead of the
/sys/fs/cgroup/memory/machine.slice/$path/memory.*
To be honest i did just look at the cgroup/cpuset/ hierarchy, but i just browsed cgroup/memory/ as well.
The target of my patch series was to get cgroup/cpuset/machine.slice/tasks to be emtpy, all tasks should be in their sub-cgroup under the machine.slice. And the ordering patches make sure the file is always empty.
In the memory cgroups all tasks are in the parent group (all in machine.slice/tasks). machine.slice/*/tasks are empty. I am not sure whether that is intended, i can just assume it is a bug in the memory cgroup subsystem. Why are the groups created and tuned when the tasks stay in the big superset?
TBH - there's quite a bit of this that mystifies me... Use of cgroups is not something I've spent a whole lot of time looking at...

I guess I've been working under the assumption that when the machine.slice/$path is created, the domain would use that for all cgroup specific file adjustments for that domain. Not sure how the /proc/$pid/cgroup is related to this.

My f23 system seems to generate the /proc/$pid/cgroup with the machine.slice/$path/ for each of the cgroups libvirt cares about, while the f20 system with the test only has that path for cpuset and cpu,cpuacct. Since that's what the test uses to find the memory path for validation, that's why it fails.

I've been looking through the libvirtd debug logs to see if anything jumps out at me, but it seems both the systems I've looked at will build the path for the domain using the machine.slice/$path as seen during domain startup.

Very odd - perhaps looking at it too long right now though!
/proc/#/cgroup is showing the correct path, libvirt seems to fail to migrate tasks into memory subgroups. (i am talking about a patched 1.2.19 where vms do not have any special memory tuning)
I'm using latest upstream 1.3.1 - it seems to set the machine.slice/$path for blkio, cpu,cpuacct, cpuset, memory, and devices entries.
Without my patches the first qemu thread was in "2:cpuset:/machine.slice" and the name did match "4:memory:/machine.slice". Now if the test wants matching names the test might just be wrong. Or as indicated before there might be a bug in the memory cgroups.
I'm leaning towards something in the test. I'll check if reverting these changes alters the results. I don't imagine it will.

John
Where $path is "machine-qemu\x2dvirt\x2dtests\x2dvm1.scope"
This affects the virsh memtune $dom command test suite which uses the /proc/$pid/cgroup file in order to find the path for the 'memory' or 'cpuset' or 'cpu,cpuacct' cgroup paths.
Seems to be some interaction with systemd that I have quite figured out.
I'm assuming this is essentially the issue you were trying to fix - that is changes to values should be done to the machine-qemu* specific files rather than the machine.slice files.
The good news is I can see the changes occurring in the machine-qemu* specific files, so it seems libvirt is doing the right thing.
However, there's something strange with perhaps previously existing/running domains where that /proc/$pid/cgroup file doesn't get the $path for the memory entry, thus causing the test validation to look in the wrong place.
Hopefully this makes sense. What's really strange (for me at least) is that it's only occurring on one test system. I can set up the same test on another system and things work just fine. I'm not quite sure what interaction generates that /proc/$pid/cgroup file - hopefully someone else understands it and help me make sense of it.

On Thu, 7 Jan 2016 19:56:33 -0500 John Ferlan <jferlan@redhat.com> wrote:
On 01/07/2016 02:01 PM, Henning Schild wrote:
On Thu, 7 Jan 2016 11:20:23 -0500 John Ferlan <jferlan@redhat.com> wrote:
[...]
No problem - although it seems they've generated a regression in the virttest memtune test suite. I'm 'technically' on vacation for the next couple of weeks; however, I think/perhaps the problem is a result of this patch and the change to adding the task to the cgroup at the end of the for loop, but perhaps the following code causes the control to jump back to the top of the loop:
if (!cpumap) continue;
if (qemuSetupCgroupCpusetCpus(cgroup_vcpu, cpumap) < 0) goto cleanup;
not allowing the
/* move the thread for vcpu to sub dir */ if (virCgroupAddTask(cgroup_vcpu, qemuDomainGetVcpuPid(vm, i)) < 0) goto cleanup;
to be executed.
The code should probably change to be (like IOThreads):
if (cpumap && qemuSetupCgroupCpusetCpus(cgroup_vcpu, cpumap) < 0) goto cleanup;
As for the rest, I suspect things will be quite quiet around here over the next couple of weeks. A discussion to perhaps start in the new year.
Same here. I will have a look at that regression after my vacation, should it still be there.
Henning
More data from the issue... While the above mentioned path is an issue, I don't believe it's what's causing the test failure.
I haven't quite figured out why yet, but it seems the /proc/#/cgroup file isn't getting the proper path for the 'memory' slice and thus the test fails because it's looking at the:
/sys/fs/cgroup/memory/machine.slice/memory.*
files instead of the
/sys/fs/cgroup/memory/machine.slice/$path/memory.*
To be honest i did just look at the cgroup/cpuset/ hierarchy, but i just browsed cgroup/memory/ as well.
The target of my patch series was to get cgroup/cpuset/machine.slice/tasks to be emtpy, all tasks should be in their sub-cgroup under the machine.slice. And the ordering patches make sure the file is always empty.
In the memory cgroups all tasks are in the parent group (all in machine.slice/tasks). machine.slice/*/tasks are empty. I am not sure whether that is intended, i can just assume it is a bug in the memory cgroup subsystem. Why are the groups created and tuned when the tasks stay in the big superset?
TBH - there's quite a bit of this that mystifies me... Use of cgroups is not something I've spent a whole lot of time looking at...
I guess I've been working under the assumption that when the machine.slice/$path is created, the domain would use that for all cgroup specific file adjustments for that domain. Not sure how the /proc/$pid/cgroup is related to this.
My f23 system seems to generate the /proc/$pid/cgroup with the machine.slice/$path/ for each of the cgroups libvirt cares about while the f20 system with the test only has that path for cpuset and cpu,cpuacct. Since that's what the test uses for to find the memory path for validation that's why it fails.
I've been looking through the libvirtd debug logs to see if anything jumps out at me, but it seems both the systems I've looked at will build the path for the domain using the machine.slice/$path as seen during domain startup.
Very odd - perhaps looking at it too long right now though!
/proc/#/cgroup is showing the correct path, libvirt seems to fail to migrate tasks into memory subgroups. (i am talking about a patched 1.2.19 where vms do not have any special memory tuning)
I'm using latest upstream 1.3.1 - it seems to set the machine.slice/$path for blkio, cpu,cpuacct, cpuset, memory, and devices entries.
Without my patches the first qemu thread was in "2:cpuset:/machine.slice" and the name did match "4:memory:/machine.slice". Now if the test wants matching names the test might just be wrong. Or as indicated before there might be a bug in the memory cgroups.
I'm leaning towards something in the test. I'll check if reverting these changes alters the results. I don't imagine it will.
The real question is which thread it fails on, and at what point in time. My patches only changed the order of operations, so threads enter the cpuset cgroups at a slightly different time. And the qemu main thread never enters the parent group; it becomes an emulator-thread.

Maybe you can point to exactly the assertion that fails, including a link to the test code. And yes, if you can confirm that the patches are to blame that would be a good first step ;).

Thanks,
Henning
John
Where $path is "machine-qemu\x2dvirt\x2dtests\x2dvm1.scope"
This affects the virsh memtune $dom command test suite which uses the /proc/$pid/cgroup file in order to find the path for the 'memory' or 'cpuset' or 'cpu,cpuacct' cgroup paths.
Seems to be some interaction with systemd that I have quite figured out.
I'm assuming this is essentially the issue you were trying to fix - that is changes to values should be done to the machine-qemu* specific files rather than the machine.slice files.
The good news is I can see the changes occurring in the machine-qemu* specific files, so it seems libvirt is doing the right thing.
However, there's something strange with perhaps previously existing/running domains where that /proc/$pid/cgroup file doesn't get the $path for the memory entry, thus causing the test validation to look in the wrong place.
Hopefully this makes sense. What's really strange (for me at least) is that it's only occurring on one test system. I can set up the same test on another system and things work just fine. I'm not quite sure what interaction generates that /proc/$pid/cgroup file - hopefully someone else understands it and help me make sense of it.

I'm leaning towards something in the test. I'll check if reverting these changes alters the results. I don't imagine it will.
The real question is which thread it fails on and at what point in time. My patches only changed the order of operations where threads enter the cpuset cgroups at a slightly different time. And the qemu main thread never enters the parent group, it becomes an emulator-thread. Maybe you can point to exactly the assertion that fails. Including a link to the test code. And yes if you can confirm that the patches are to blame that would be a good first step ;).
Thanks, Henning
Not quite sure how to answer your question about which thread - I'm still at the point of figuring out the symptoms.

At startup, there's a priv->cgroup created. Then when vcpu, emulator, iothread calls are made - each seems to create its own cgroup thread via virCgroupNewThread using the priv->cgroup, then make the adjustment, and free the cgroup.

As for which test - it's part of the 'virt-test' suite. It's run on a Red Hat internal system every night in order to help determine what/if any changes made during the work day have caused a regression. You can look up 'virt-test' on github, but it's being replaced by something known as Avocado.

A test was run with all the patches reverted and the test passed, so something in the way things were moved. What's "interesting" (to me at least) is that if I start the vm on that system used for the test, the /proc/$pid/cgroup file is as follows:

10:hugetlb:/
9:perf_event:/machine.slice/machine-qemu\x2dvirt\x2dtests\x2dvm1.scope
8:blkio:/machine.slice/machine-qemu\x2dvirt\x2dtests\x2dvm1.scope
7:net_cls,net_prio:/machine.slice/machine-qemu\x2dvirt\x2dtests\x2dvm1.scope
6:freezer:/machine.slice/machine-qemu\x2dvirt\x2dtests\x2dvm1.scope
5:devices:/machine.slice/machine-qemu\x2dvirt\x2dtests\x2dvm1.scope
4:memory:/machine.slice/machine-qemu\x2dvirt\x2dtests\x2dvm1.scope
3:cpu,cpuacct:/machine.slice/machine-qemu\x2dvirt\x2dtests\x2dvm1.scope/emulator
2:cpuset:/machine.slice/machine-qemu\x2dvirt\x2dtests\x2dvm1.scope/emulator
1:name=systemd:/machine.slice/machine-qemu\x2dvirt\x2dtests\x2dvm1.scope

If I then "unrevert" the patches one by one (e.g. poor man's git bisect), I find that patch 2/3 results in the following adjustment to the /proc/$pid/cgroup file:

10:hugetlb:/
9:perf_event:/
8:blkio:/machine.slice
7:net_cls,net_prio:/
6:freezer:/
5:devices:/machine.slice
4:memory:/machine.slice
3:cpu,cpuacct:/machine.slice/machine-qemu\x2dvirt\x2dtests\x2dvm1.scope/emulator
2:cpuset:/machine.slice/machine-qemu\x2dvirt\x2dtests\x2dvm1.scope/emulator
1:name=systemd:/machine.slice/machine-qemu\x2dvirt\x2dtests\x2dvm1.scope

So, I have a candidate... It seems that by changing the AddTask from using 'priv->cgroup' to a copy of the cgroup as created by virCgroupNewThread in qemuSetupCgroupForEmulator, the /proc/$pid/cgroup file only modifies the 'cpuset' and 'cpu,cpuacct' entries, thus changing the other entries back to /machine.slice. I'm not clear why that happens (yet).

BTW: What's interesting with the file changes is that they differ from my f23 system, in which the same revert processing would have the following /proc/$pid/cgroup file when patch 2 is re-applied:

10:devices:/machine.slice/machine-qemu\x2dvirt\x2dtests\x2dvm1.scope
9:memory:/machine.slice/machine-qemu\x2dvirt\x2dtests\x2dvm1.scope
8:freezer:/
7:cpuset:/machine.slice/machine-qemu\x2dvirt\x2dtests\x2dvm1.scope/emulator
6:net_cls,net_prio:/
5:cpu,cpuacct:/machine.slice/machine-qemu\x2dvirt\x2dtests\x2dvm1.scope/emulator
4:blkio:/machine.slice/machine-qemu\x2dvirt\x2dtests\x2dvm1.scope
3:hugetlb:/
2:perf_event:/
1:name=systemd:/machine.slice/machine-qemu\x2dvirt\x2dtests\x2dvm1.scope

This does seem similar in a way to something I found while doing a search: https://bugzilla.redhat.com/show_bug.cgi?id=1139223. It's not completely the same, but the symptom of systemd overwriting non-changing controller entries feels similar.

John

I'm leaning towards something in the test. I'll check if reverting these changes alters the results. I don't imagine it will.
The real question is which thread it fails on and at what point in time. My patches only changed the order of operations where threads enter the cpuset cgroups at a slightly different time. And the qemu main thread never enters the parent group, it becomes an emulator-thread. Maybe you can point to exactly the assertion that fails. Including a link to the test code. And yes if you can confirm that the patches are to blame that would be a good first step ;).
Thanks, Henning
Update:

I have found that if I revert patch 2...

Then modify qemuInitCgroup() to modify the virCgroupNewMachine check to also ensure "|| !priv->cgroup)

Then modify qemuSetupCgroupForEmulator() to make the virCgroupAddTask() call like was in patch 2

Then modify patch 3 (qemuSetupCgroupForVcpu) to change the call:

if (!cpumap) continue;
if (qemuSetupCgroupCpusetCpus(cgroup_vcpu, cpumap) < 0) goto cleanup;

to

if (cpumap && qemuSetupCgroupCpusetCpus(cgroup_vcpu, cpumap) < 0) goto cleanup;

Then retest and the test passes again.

Note that taking this route, I found that when I start the guest, I have the following in 'tasks':

# cat /sys/fs/cgroup/memory/machine.slice/tasks
# cat /sys/fs/cgroup/memory/machine.slice/*/tasks
15007
15008
15010
15011
15013
#

Where '15007' is the virt-tests-vm1 process (eg, /proc/$pid/cgroup). If I read the intentions you had, this follows that...

I'll post a couple of patches in a bit...

John

On Fri, 8 Jan 2016 11:05:59 -0500 John Ferlan <jferlan@redhat.com> wrote:
I'm leaning towards something in the test. I'll check if reverting these changes alters the results. I don't imagine it will.
The real question is which thread it fails on and at what point in time. My patches only changed the order of operations where threads enter the cpuset cgroups at a slightly different time. And the qemu main thread never enters the parent group, it becomes an emulator-thread. Maybe you can point to exactly the assertion that fails. Including a link to the test code. And yes if you can confirm that the patches are to blame that would be a good first step ;).
Thanks, Henning
Update:
I have found that if I revert patch 2...
Then modify qemuInitCgroup() to modify the virCgroupNewMachine check to also ensure "|| !priv->cgroup)
I see the check for the parent cgroup should probably go back into virCgroupNewMachine, including the cleanup stuff in case of failure.
Then modify qemuSetupCgroupForEmulator() to make the virCgroupAddTask() call like was in patch 2
Then modify patch 3 (qemuSetupCgroupForVcpu) to change the call:
if (!cpumap) continue;
if (qemuSetupCgroupCpusetCpus(cgroup_vcpu, cpumap) < 0) goto cleanup;
to
if (cpumap && qemuSetupCgroupCpusetCpus(cgroup_vcpu, cpumap) < 0) goto cleanup;
Well that is not a syntactical change, maybe easier to read and in line with the other places where qemuSetupCgroupCpusetCpus is called.
Then retest and the test passes again.
Note that taking this route, I found that when I start the guest, I have the following in 'tasks':
# cat /sys/fs/cgroup/memory/machine.slice/tasks
# cat /sys/fs/cgroup/memory/machine.slice/*/tasks
15007
15008
15010
15011
15013
#
Where '15007' is the virt-tests-vm1 process (eg, /proc/$pid/cgroup). If I read the intentions you had, this follows that...
I'll post a couple of patches in a bit...
John

On 01/11/2016 06:38 AM, Henning Schild wrote:
On Fri, 8 Jan 2016 11:05:59 -0500 John Ferlan <jferlan@redhat.com> wrote:
I'm leaning towards something in the test. I'll check if reverting these changes alters the results. I don't imagine it will.
The real question is which thread it fails on and at what point in time. My patches only changed the order of operations where threads enter the cpuset cgroups at a slightly different time. And the qemu main thread never enters the parent group, it becomes an emulator-thread. Maybe you can point to exactly the assertion that fails. Including a link to the test code. And yes if you can confirm that the patches are to blame that would be a good first step ;).
Thanks, Henning
Update:
I have found that if I revert patch 2...
Then modify qemuInitCgroup() to modify the virCgroupNewMachine check to also ensure "|| !priv->cgroup)
I see the check for the parent cgroup should probably go back into virCgroupNewMachine, including the cleanup stuff in case of failure.
Forgot to CC you (and Jan) on the 4-patch series I sent:

http://www.redhat.com/archives/libvir-list/2016-January/msg00236.html

Patches 2, 3, & 4 are related to the above while patch 1 is for the below.

John
Then modify qemuSetupCgroupForEmulator() to make the virCgroupAddTask() call like was in patch 2
Then modify patch 3 (qemuSetupCgroupForVcpu) to change the call:
if (!cpumap) continue;
if (qemuSetupCgroupCpusetCpus(cgroup_vcpu, cpumap) < 0) goto cleanup;
to
if (cpumap && qemuSetupCgroupCpusetCpus(cgroup_vcpu, cpumap) < 0) goto cleanup;
Well that is not a syntactical change, maybe easier to read and in line with the other places where qemuSetupCgroupCpusetCpus is called.
Then retest and the test passes again.
Note that taking this route, I found that when I start the guest, I have the following in 'tasks':
# cat /sys/fs/cgroup/memory/machine.slice/tasks
# cat /sys/fs/cgroup/memory/machine.slice/*/tasks
15007
15008
15010
15011
15013
#
Where '15007' is the virt-tests-vm1 process (eg, /proc/$pid/cgroup). If I read the intentions you had, this follows that...
I'll post a couple of patches in a bit...
John

On Mon, 11 Jan 2016 07:05:11 -0500 John Ferlan <jferlan@redhat.com> wrote:
On 01/11/2016 06:38 AM, Henning Schild wrote:
On Fri, 8 Jan 2016 11:05:59 -0500 John Ferlan <jferlan@redhat.com> wrote:
I'm leaning towards something in the test. I'll check if reverting these changes alters the results. I don't imagine it will.
The real question is which thread it fails on and at what point in time. My patches only changed the order of operations where threads enter the cpuset cgroups at a slightly different time. And the qemu main thread never enters the parent group, it becomes an emulator-thread. Maybe you can point to exactly the assertion that fails. Including a link to the test code. And yes if you can confirm that the patches are to blame that would be a good first step ;).
Thanks, Henning
Update:
I have found that if I revert patch 2...
Then modify qemuInitCgroup() to modify the virCgroupNewMachine check to also ensure "|| !priv->cgroup)
I see the check for the parent cgroup should probably go back into virCgroupNewMachine, including the cleanup stuff in case of failure.
Forgot to CC you (and Jan) on the 4 patch series I sent:
http://www.redhat.com/archives/libvir-list/2016-January/msg00236.html
Patches 2, 3, & 4 are related to above while patch 1 is for below.
If you are subscribed, could you please send me a copy of the mails, as received on the list, for review?
John
Then modify qemuSetupCgroupForEmulator() to make the virCgroupAddTask() call like was in patch 2
Then modify patch 3 (qemuSetupCgroupForVcpu) to change the call:
if (!cpumap) continue;
if (qemuSetupCgroupCpusetCpus(cgroup_vcpu, cpumap) < 0) goto cleanup;
to
if (cpumap && qemuSetupCgroupCpusetCpus(cgroup_vcpu, cpumap) < 0) goto cleanup;
Well that is not a syntactical change, maybe easier to read and in line with the other places where qemuSetupCgroupCpusetCpus is called.
Then retest and the test passes again.
Note that taking this route, I found that when I start the guest, I have the following in 'tasks':
# cat /sys/fs/cgroup/memory/machine.slice/tasks
# cat /sys/fs/cgroup/memory/machine.slice/*/tasks
15007
15008
15010
15011
15013
#
Where '15007' is the virt-tests-vm1 process (eg, /proc/$pid/cgroup). If I read the intentions you had, this follows that...
I'll post a couple of patches in a bit...
John
participants (2):
- Henning Schild
- John Ferlan