[libvirt] [PATCH 0/9] fix thread related controllers in cgroups

This series picks up the cgroups work i started earlier. My initial patches got in and later reverted before 1.3.1. The problem the series is solving is about qemu-threads becoming runnable on pcpus outside the pinning masks configured for the machine. That only happens for a short time before the thread is moved to its final cpuset. But it can disturb other load on the system or can lead to qemu never starting. (qemu main thread ends up on a pcpu with busy high prio rt-task). The problem in the original series was the lack of understanding that one virCgroup can cover all controllers. Instead of just touching cpusets the patches had side effects on all the other controllers (memory, blkio etc.) Again the general idea is to put all threads right into the correct cgroups and to not move them around. But this series touches only the cpu, cpuset, and cpuacct controllers. That are the ones relevant to threads and that are the controllers the threading sub-groups have mounted. Patches 1, 2, and 9 deal with asserting correct behaviour. They are optional. But given the complexity of the "bringup" and the importance of getting that right, i think they should go in as well! The tricky bits are in patches 5 and 8, i kept them as simple as possible. The series is based on v1.3.1. Henning Schild (9): vircgroup: one central point for adding tasks to cgroups vircgroup: add assertion to allow cgroup controllers to stay empty vircgroup: introduce controller mask for threads util: cgroups do not implicitly add task to new machine cgroup qemu_cgroup: put qemu right into emulator sub-cgroup qemu_cgroup: use virCgroupAddTask instead of virCgroupMoveTask vircgroup: add controller mask to virCgroupAddTask qemu_cgroup: dont put qemu main thread into wrong cgroup qemu_cgroup: assert threading cgroup layout for machine cgroup src/libvirt_private.syms | 3 +- src/lxc/lxc_cgroup.c | 11 ++++ src/lxc/lxc_controller.c | 4 +- src/qemu/qemu_cgroup.c | 30 +++++++-- src/qemu/qemu_driver.c | 2 +- src/qemu/qemu_process.c | 8 +-- src/util/vircgroup.c | 155 ++++++++--------------------------------------- src/util/vircgroup.h | 13 +++- src/util/vircgrouppriv.h | 1 + 9 files changed, 81 insertions(+), 146 deletions(-) -- 2.4.10

Signed-off-by: Henning Schild <henning.schild@siemens.com> --- make-kpkg | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/make-kpkg b/make-kpkg index 5cb8ec3..ba663c9 100755 --- a/make-kpkg +++ b/make-kpkg @@ -662,9 +662,10 @@ sub main () { } if ( $config_target - !~ /^(?:|silentold|old|menu|n|x|g|rand|def|all(mod|yes|no))(config)?$/ ) { + !~ /^(?:|silentold|old|olddef|menu|n|x|g|rand|def|all(mod|yes|no))(config)?$/ ) { print - "Config type must be one of {config,silentoldconfig,oldconfig,menuconfig,xconfig,\n"; + "Config type must be one of {config,silentoldconfig,oldconfig,olddefconfig,menuconfig," + . "xconfig,\n"; print " nconfig,gconfig,randconfig,defconfig,allmodconfigallyesconfig,allnoconfig}.\n"; print "use --help to display command line syntax help.\n"; -- 2.4.10

Sorry that one is unrelated. Ignore it. On Tue, 23 Feb 2016 16:58:34 +0100 Henning Schild <henning.schild@siemens.com> wrote:
Signed-off-by: Henning Schild <henning.schild@siemens.com> --- make-kpkg | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/make-kpkg b/make-kpkg index 5cb8ec3..ba663c9 100755 --- a/make-kpkg +++ b/make-kpkg @@ -662,9 +662,10 @@ sub main () { }
if ( $config_target - !~ /^(?:|silentold|old|menu|n|x|g|rand|def|all(mod|yes|no))(config)?$/ ) { + !~ /^(?:|silentold|old|olddef|menu|n|x|g|rand|def|all(mod|yes|no))(config)?$/ ) { print - "Config type must be one of {config,silentoldconfig,oldconfig,menuconfig,xconfig,\n"; + "Config type must be one of {config,silentoldconfig,oldconfig,olddefconfig,menuconfig," + . "xconfig,\n"; print " nconfig,gconfig,randconfig,defconfig,allmodconfigallyesconfig,allnoconfig}.\n"; print "use --help to display command line syntax help.\n";

Signed-off-by: Henning Schild <henning.schild@siemens.com> --- make-kpkg | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/make-kpkg b/make-kpkg index 5cb8ec3..ba663c9 100755 --- a/make-kpkg +++ b/make-kpkg @@ -662,9 +662,10 @@ sub main () { } if ( $config_target - !~ /^(?:|silentold|old|menu|n|x|g|rand|def|all(mod|yes|no))(config)?$/ ) { + !~ /^(?:|silentold|old|olddef|menu|n|x|g|rand|def|all(mod|yes|no))(config)?$/ ) { print - "Config type must be one of {config,silentoldconfig,oldconfig,menuconfig,xconfig,\n"; + "Config type must be one of {config,silentoldconfig,oldconfig,olddefconfig,menuconfig," + . "xconfig,\n"; print " nconfig,gconfig,randconfig,defconfig,allmodconfigallyesconfig,allnoconfig}.\n"; print "use --help to display command line syntax help.\n"; -- 2.4.10

Sorry that one is unrelated. Ignore it. On Tue, 23 Feb 2016 16:58:35 +0100 Henning Schild <henning.schild@siemens.com> wrote:
Signed-off-by: Henning Schild <henning.schild@siemens.com> --- make-kpkg | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/make-kpkg b/make-kpkg index 5cb8ec3..ba663c9 100755 --- a/make-kpkg +++ b/make-kpkg @@ -662,9 +662,10 @@ sub main () { }
if ( $config_target - !~ /^(?:|silentold|old|menu|n|x|g|rand|def|all(mod|yes|no))(config)?$/ ) { + !~ /^(?:|silentold|old|olddef|menu|n|x|g|rand|def|all(mod|yes|no))(config)?$/ ) { print - "Config type must be one of {config,silentoldconfig,oldconfig,menuconfig,xconfig,\n"; + "Config type must be one of {config,silentoldconfig,oldconfig,olddefconfig,menuconfig," + . "xconfig,\n"; print " nconfig,gconfig,randconfig,defconfig,allmodconfigallyesconfig,allnoconfig}.\n"; print "use --help to display command line syntax help.\n";

Use virCgroupAddTaskController in virCgroupAddTask so we have one single point where we add tasks to cgroups. Signed-off-by: Henning Schild <henning.schild@siemens.com> --- src/util/vircgroup.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/util/vircgroup.c b/src/util/vircgroup.c index 7584ee4..0b65238 100644 --- a/src/util/vircgroup.c +++ b/src/util/vircgroup.c @@ -1162,7 +1162,7 @@ virCgroupAddTask(virCgroupPtr group, pid_t pid) if (i == VIR_CGROUP_CONTROLLER_SYSTEMD) continue; - if (virCgroupSetValueU64(group, i, "tasks", pid) < 0) + if (virCgroupAddTaskController(group, pid, i) < 0) goto cleanup; } -- 2.4.10

On 02/23/2016 10:58 AM, Henning Schild wrote:
Use virCgroupAddTaskController in virCgroupAddTask so we have one single point where we add tasks to cgroups.
Signed-off-by: Henning Schild <henning.schild@siemens.com> --- src/util/vircgroup.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
This seems reasonable - couple of extra sanity checks being made, but it definitely seems like the right thing to do. John

On Tue, Feb 23, 2016 at 04:58:36PM +0100, Henning Schild wrote:
Use virCgroupAddTaskController in virCgroupAddTask so we have one single point where we add tasks to cgroups.
Signed-off-by: Henning Schild <henning.schild@siemens.com> --- src/util/vircgroup.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/util/vircgroup.c b/src/util/vircgroup.c index 7584ee4..0b65238 100644 --- a/src/util/vircgroup.c +++ b/src/util/vircgroup.c @@ -1162,7 +1162,7 @@ virCgroupAddTask(virCgroupPtr group, pid_t pid) if (i == VIR_CGROUP_CONTROLLER_SYSTEMD) continue;
- if (virCgroupSetValueU64(group, i, "tasks", pid) < 0) + if (virCgroupAddTaskController(group, pid, i) < 0) goto cleanup; }
ACK Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

When using a hierarchy of cgroups we might want to add tasks just to the children cgroups but never to the parent. To make sure we do not use a parent cgroup by accident add a mechanism that lets us assert a correct implementation in cases we want such a hierarchy. i.e. for qemu cpusets we want all tasks in /vcpuX or /emulator, not in /. Signed-off-by: Henning Schild <henning.schild@siemens.com> --- src/libvirt_private.syms | 2 ++ src/util/vircgroup.c | 19 +++++++++++++++++++ src/util/vircgroup.h | 3 +++ src/util/vircgrouppriv.h | 1 + 4 files changed, 25 insertions(+) diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index 83f6e2c..cf93d06 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -1201,6 +1201,7 @@ virCgroupDenyDeviceMajor; virCgroupDenyDevicePath; virCgroupDetectMountsFromFile; virCgroupFree; +virCgroupGetAssertEmpty; virCgroupGetBlkioDeviceReadBps; virCgroupGetBlkioDeviceReadIops; virCgroupGetBlkioDeviceWeight; @@ -1245,6 +1246,7 @@ virCgroupNewThread; virCgroupPathOfController; virCgroupRemove; virCgroupRemoveRecursively; +virCgroupSetAssertEmpty; virCgroupSetBlkioDeviceReadBps; virCgroupSetBlkioDeviceReadIops; virCgroupSetBlkioDeviceWeight; diff --git a/src/util/vircgroup.c b/src/util/vircgroup.c index 0b65238..ad46dfc 100644 --- a/src/util/vircgroup.c +++ b/src/util/vircgroup.c @@ -1197,6 +1197,15 @@ virCgroupAddTaskController(virCgroupPtr group, pid_t pid, int controller) return -1; } + if(group->assert_empty & (1 << controller)) { + virReportError(VIR_ERR_INTERNAL_ERROR, + _("Controller '%s' is not supposed to contain any" + " tasks. group=%s pid=%d\n"), + virCgroupControllerTypeToString(controller), + group->path, pid); + return -1; + } + return virCgroupSetValueU64(group, controller, "tasks", (unsigned long long)pid); } @@ -1246,6 +1255,16 @@ virCgroupAddTaskStrController(virCgroupPtr group, return rc; } +void +virCgroupSetAssertEmpty(virCgroupPtr group, int mask) { + group->assert_empty = mask; +} + +int +virCgroupGetAssertEmpty(virCgroupPtr group) { + return group->assert_empty; +} + /** * virCgroupMoveTask: diff --git a/src/util/vircgroup.h b/src/util/vircgroup.h index 63a9e1c..f244c24 100644 --- a/src/util/vircgroup.h +++ b/src/util/vircgroup.h @@ -131,6 +131,9 @@ int virCgroupAddTaskController(virCgroupPtr group, pid_t pid, int controller); +void virCgroupSetAssertEmpty(virCgroupPtr group, int mask); +int virCgroupGetAssertEmpty(virCgroupPtr group); + int virCgroupMoveTask(virCgroupPtr src_group, virCgroupPtr dest_group); diff --git a/src/util/vircgrouppriv.h b/src/util/vircgrouppriv.h index 722863e..944d6ae 100644 --- a/src/util/vircgrouppriv.h +++ b/src/util/vircgrouppriv.h @@ -44,6 +44,7 @@ struct virCgroupController { struct virCgroup { char *path; + int assert_empty; struct virCgroupController controllers[VIR_CGROUP_CONTROLLER_LAST]; }; -- 2.4.10

On 02/23/2016 10:58 AM, Henning Schild wrote:
When using a hierarchy of cgroups we might want to add tasks just to the children cgroups but never to the parent. To make sure we do not use a parent cgroup by accident add a mechanism that lets us assert a correct implementation in cases we want such a hierarchy.
i.e. for qemu cpusets we want all tasks in /vcpuX or /emulator, not in /.
Signed-off-by: Henning Schild <henning.schild@siemens.com> --- src/libvirt_private.syms | 2 ++ src/util/vircgroup.c | 19 +++++++++++++++++++ src/util/vircgroup.h | 3 +++ src/util/vircgrouppriv.h | 1 + 4 files changed, 25 insertions(+)
These aren't used until patch 9 - I think this should be closer to that patch... That is introduce it just before you use it..
diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index 83f6e2c..cf93d06 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -1201,6 +1201,7 @@ virCgroupDenyDeviceMajor; virCgroupDenyDevicePath; virCgroupDetectMountsFromFile; virCgroupFree; +virCgroupGetAssertEmpty; virCgroupGetBlkioDeviceReadBps; virCgroupGetBlkioDeviceReadIops; virCgroupGetBlkioDeviceWeight; @@ -1245,6 +1246,7 @@ virCgroupNewThread; virCgroupPathOfController; virCgroupRemove; virCgroupRemoveRecursively; +virCgroupSetAssertEmpty; virCgroupSetBlkioDeviceReadBps; virCgroupSetBlkioDeviceReadIops; virCgroupSetBlkioDeviceWeight; diff --git a/src/util/vircgroup.c b/src/util/vircgroup.c index 0b65238..ad46dfc 100644 --- a/src/util/vircgroup.c +++ b/src/util/vircgroup.c @@ -1197,6 +1197,15 @@ virCgroupAddTaskController(virCgroupPtr group, pid_t pid, int controller) return -1; }
+ if(group->assert_empty & (1 << controller)) {
should be : if (group...
+ virReportError(VIR_ERR_INTERNAL_ERROR, + _("Controller '%s' is not supposed to contain any" + " tasks. group=%s pid=%d\n"), + virCgroupControllerTypeToString(controller), + group->path, pid); + return -1; + } + return virCgroupSetValueU64(group, controller, "tasks", (unsigned long long)pid); } @@ -1246,6 +1255,16 @@ virCgroupAddTaskStrController(virCgroupPtr group, return rc; }
Need to have some code comments regarding input and what these do...
+void +virCgroupSetAssertEmpty(virCgroupPtr group, int mask) { + group->assert_empty = mask; +} + +int +virCgroupGetAssertEmpty(virCgroupPtr group) { + return group->assert_empty; +} +
You'll need to add the corresponding API in the "#else /* !VIR_CGROUP_SUPPORTED */" area... Search on virCgroupAddTaskController - you'll find a second entry in the module which reports a system error. That's what you'll need to add for these John
/** * virCgroupMoveTask: diff --git a/src/util/vircgroup.h b/src/util/vircgroup.h index 63a9e1c..f244c24 100644 --- a/src/util/vircgroup.h +++ b/src/util/vircgroup.h @@ -131,6 +131,9 @@ int virCgroupAddTaskController(virCgroupPtr group, pid_t pid, int controller);
+void virCgroupSetAssertEmpty(virCgroupPtr group, int mask); +int virCgroupGetAssertEmpty(virCgroupPtr group); + int virCgroupMoveTask(virCgroupPtr src_group, virCgroupPtr dest_group);
diff --git a/src/util/vircgrouppriv.h b/src/util/vircgrouppriv.h index 722863e..944d6ae 100644 --- a/src/util/vircgrouppriv.h +++ b/src/util/vircgrouppriv.h @@ -44,6 +44,7 @@ struct virCgroupController {
struct virCgroup { char *path; + int assert_empty;
struct virCgroupController controllers[VIR_CGROUP_CONTROLLER_LAST]; };

Ok i will reorder, fix style and docs etc. On Thu, 25 Feb 2016 17:52:55 -0500 John Ferlan <jferlan@redhat.com> wrote:
On 02/23/2016 10:58 AM, Henning Schild wrote:
When using a hierarchy of cgroups we might want to add tasks just to the children cgroups but never to the parent. To make sure we do not use a parent cgroup by accident add a mechanism that lets us assert a correct implementation in cases we want such a hierarchy.
i.e. for qemu cpusets we want all tasks in /vcpuX or /emulator, not in /.
Signed-off-by: Henning Schild <henning.schild@siemens.com> --- src/libvirt_private.syms | 2 ++ src/util/vircgroup.c | 19 +++++++++++++++++++ src/util/vircgroup.h | 3 +++ src/util/vircgrouppriv.h | 1 + 4 files changed, 25 insertions(+)
These aren't used until patch 9 - I think this should be closer to that patch... That is introduce it just before you use it..
diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index 83f6e2c..cf93d06 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -1201,6 +1201,7 @@ virCgroupDenyDeviceMajor; virCgroupDenyDevicePath; virCgroupDetectMountsFromFile; virCgroupFree; +virCgroupGetAssertEmpty; virCgroupGetBlkioDeviceReadBps; virCgroupGetBlkioDeviceReadIops; virCgroupGetBlkioDeviceWeight; @@ -1245,6 +1246,7 @@ virCgroupNewThread; virCgroupPathOfController; virCgroupRemove; virCgroupRemoveRecursively; +virCgroupSetAssertEmpty; virCgroupSetBlkioDeviceReadBps; virCgroupSetBlkioDeviceReadIops; virCgroupSetBlkioDeviceWeight; diff --git a/src/util/vircgroup.c b/src/util/vircgroup.c index 0b65238..ad46dfc 100644 --- a/src/util/vircgroup.c +++ b/src/util/vircgroup.c @@ -1197,6 +1197,15 @@ virCgroupAddTaskController(virCgroupPtr group, pid_t pid, int controller) return -1; }
+ if(group->assert_empty & (1 << controller)) {
should be :
if (group...
+ virReportError(VIR_ERR_INTERNAL_ERROR, + _("Controller '%s' is not supposed to contain any" + " tasks. group=%s pid=%d\n"), + virCgroupControllerTypeToString(controller), + group->path, pid); + return -1; + } + return virCgroupSetValueU64(group, controller, "tasks", (unsigned long long)pid); } @@ -1246,6 +1255,16 @@ virCgroupAddTaskStrController(virCgroupPtr group, return rc; }
Need to have some code comments regarding input and what these do...
+void +virCgroupSetAssertEmpty(virCgroupPtr group, int mask) { + group->assert_empty = mask; +} + +int +virCgroupGetAssertEmpty(virCgroupPtr group) { + return group->assert_empty; +} +
You'll need to add the corresponding API in the "#else /* !VIR_CGROUP_SUPPORTED */" area... Search on virCgroupAddTaskController - you'll find a second entry in the module which reports a system error. That's what you'll need to add for these
John
/** * virCgroupMoveTask: diff --git a/src/util/vircgroup.h b/src/util/vircgroup.h index 63a9e1c..f244c24 100644 --- a/src/util/vircgroup.h +++ b/src/util/vircgroup.h @@ -131,6 +131,9 @@ int virCgroupAddTaskController(virCgroupPtr group, pid_t pid, int controller);
+void virCgroupSetAssertEmpty(virCgroupPtr group, int mask); +int virCgroupGetAssertEmpty(virCgroupPtr group); + int virCgroupMoveTask(virCgroupPtr src_group, virCgroupPtr dest_group);
diff --git a/src/util/vircgrouppriv.h b/src/util/vircgrouppriv.h index 722863e..944d6ae 100644 --- a/src/util/vircgrouppriv.h +++ b/src/util/vircgrouppriv.h @@ -44,6 +44,7 @@ struct virCgroupController {
struct virCgroup { char *path; + int assert_empty;
struct virCgroupController controllers[VIR_CGROUP_CONTROLLER_LAST]; };

On Tue, Feb 23, 2016 at 04:58:37PM +0100, Henning Schild wrote:
When using a hierarchy of cgroups we might want to add tasks just to the children cgroups but never to the parent. To make sure we do not use a parent cgroup by accident add a mechanism that lets us assert a correct implementation in cases we want such a hierarchy.
i.e. for qemu cpusets we want all tasks in /vcpuX or /emulator, not in /.
Signed-off-by: Henning Schild <henning.schild@siemens.com> --- src/libvirt_private.syms | 2 ++ src/util/vircgroup.c | 19 +++++++++++++++++++ src/util/vircgroup.h | 3 +++ src/util/vircgrouppriv.h | 1 + 4 files changed, 25 insertions(+)
This is a nice idea, but sadly it won't work - see comments on later patches. So there's no point adding this. Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

When using a cgroups hierarchy threads have child cgroups for certain controllers. Introduce an enum for later reuse. Signed-off-by: Henning Schild <henning.schild@siemens.com> --- src/util/vircgroup.c | 12 +++--------- src/util/vircgroup.h | 7 +++++++ 2 files changed, 10 insertions(+), 9 deletions(-) diff --git a/src/util/vircgroup.c b/src/util/vircgroup.c index ad46dfc..11f33ab 100644 --- a/src/util/vircgroup.c +++ b/src/util/vircgroup.c @@ -277,9 +277,7 @@ virCgroupValidateMachineGroup(virCgroupPtr group, goto cleanup; if (stripEmulatorSuffix && - (i == VIR_CGROUP_CONTROLLER_CPU || - i == VIR_CGROUP_CONTROLLER_CPUACCT || - i == VIR_CGROUP_CONTROLLER_CPUSET)) { + (i & VIR_CGROUP_THREAD_CONTROLLER_MASK)) { if (STREQ(tmp, "/emulator")) *tmp = '\0'; tmp = strrchr(group->controllers[i].placement, '/'); @@ -1518,7 +1516,6 @@ virCgroupNewThread(virCgroupPtr domain, { int ret = -1; char *name = NULL; - int controllers; switch (nameval) { case VIR_CGROUP_THREAD_VCPU: @@ -1539,11 +1536,8 @@ virCgroupNewThread(virCgroupPtr domain, goto cleanup; } - controllers = ((1 << VIR_CGROUP_CONTROLLER_CPU) | - (1 << VIR_CGROUP_CONTROLLER_CPUACCT) | - (1 << VIR_CGROUP_CONTROLLER_CPUSET)); - - if (virCgroupNew(-1, name, domain, controllers, group) < 0) + if (virCgroupNew(-1, name, domain, VIR_CGROUP_THREAD_CONTROLLER_MASK, + group) < 0) goto cleanup; if (virCgroupMakeGroup(domain, *group, create, VIR_CGROUP_NONE) < 0) { diff --git a/src/util/vircgroup.h b/src/util/vircgroup.h index f244c24..f71aed5 100644 --- a/src/util/vircgroup.h +++ b/src/util/vircgroup.h @@ -52,6 +52,13 @@ VIR_ENUM_DECL(virCgroupController); * Make sure we will not overflow */ verify(VIR_CGROUP_CONTROLLER_LAST < 8 * sizeof(int)); +enum { + VIR_CGROUP_THREAD_CONTROLLER_MASK = + ((1 << VIR_CGROUP_CONTROLLER_CPU) | + (1 << VIR_CGROUP_CONTROLLER_CPUACCT) | + (1 << VIR_CGROUP_CONTROLLER_CPUSET)) +}; + typedef enum { VIR_CGROUP_THREAD_VCPU = 0, VIR_CGROUP_THREAD_EMULATOR, -- 2.4.10

On 02/23/2016 10:58 AM, Henning Schild wrote:
When using a cgroups hierarchy threads have child cgroups for certain controllers. Introduce an enum for later reuse.
Signed-off-by: Henning Schild <henning.schild@siemens.com> --- src/util/vircgroup.c | 12 +++--------- src/util/vircgroup.h | 7 +++++++ 2 files changed, 10 insertions(+), 9 deletions(-)
diff --git a/src/util/vircgroup.c b/src/util/vircgroup.c index ad46dfc..11f33ab 100644 --- a/src/util/vircgroup.c +++ b/src/util/vircgroup.c @@ -277,9 +277,7 @@ virCgroupValidateMachineGroup(virCgroupPtr group, goto cleanup;
if (stripEmulatorSuffix && - (i == VIR_CGROUP_CONTROLLER_CPU || - i == VIR_CGROUP_CONTROLLER_CPUACCT || - i == VIR_CGROUP_CONTROLLER_CPUSET)) { + (i & VIR_CGROUP_THREAD_CONTROLLER_MASK)) {
Not sure this works as expected because 'i' is not a mask - it's just an int... On entry, VIR_CGROUP_THREAD_CONTROLLER_MASK is 7... The loop goes from 0 to VIR_CGROUP_CONTROLLER_LAST (11). If you logically go through the values of 'i': 0 & 7 1 & 7 2 & 7 3 & 7 4 & 7 ... etc You'd find 0 & 8 fail, but 1 -> 7, 9, & 10 succeed So what "would" work is : mask = (1 << i); if (stripEmulatorSuffix && mask & VIR_CGROUP_THREAD_CONTROLLER_MASK)) John
if (STREQ(tmp, "/emulator")) *tmp = '\0'; tmp = strrchr(group->controllers[i].placement, '/'); @@ -1518,7 +1516,6 @@ virCgroupNewThread(virCgroupPtr domain, { int ret = -1; char *name = NULL; - int controllers;
switch (nameval) { case VIR_CGROUP_THREAD_VCPU: @@ -1539,11 +1536,8 @@ virCgroupNewThread(virCgroupPtr domain, goto cleanup; }
- controllers = ((1 << VIR_CGROUP_CONTROLLER_CPU) | - (1 << VIR_CGROUP_CONTROLLER_CPUACCT) | - (1 << VIR_CGROUP_CONTROLLER_CPUSET)); - - if (virCgroupNew(-1, name, domain, controllers, group) < 0) + if (virCgroupNew(-1, name, domain, VIR_CGROUP_THREAD_CONTROLLER_MASK, + group) < 0) goto cleanup;
if (virCgroupMakeGroup(domain, *group, create, VIR_CGROUP_NONE) < 0) { diff --git a/src/util/vircgroup.h b/src/util/vircgroup.h index f244c24..f71aed5 100644 --- a/src/util/vircgroup.h +++ b/src/util/vircgroup.h @@ -52,6 +52,13 @@ VIR_ENUM_DECL(virCgroupController); * Make sure we will not overflow */ verify(VIR_CGROUP_CONTROLLER_LAST < 8 * sizeof(int));
+enum { + VIR_CGROUP_THREAD_CONTROLLER_MASK = + ((1 << VIR_CGROUP_CONTROLLER_CPU) | + (1 << VIR_CGROUP_CONTROLLER_CPUACCT) | + (1 << VIR_CGROUP_CONTROLLER_CPUSET)) +}; + typedef enum { VIR_CGROUP_THREAD_VCPU = 0, VIR_CGROUP_THREAD_EMULATOR,

On Thu, 25 Feb 2016 17:53:07 -0500 John Ferlan <jferlan@redhat.com> wrote:
On 02/23/2016 10:58 AM, Henning Schild wrote:
When using a cgroups hierarchy threads have child cgroups for certain controllers. Introduce an enum for later reuse.
Signed-off-by: Henning Schild <henning.schild@siemens.com> --- src/util/vircgroup.c | 12 +++--------- src/util/vircgroup.h | 7 +++++++ 2 files changed, 10 insertions(+), 9 deletions(-)
diff --git a/src/util/vircgroup.c b/src/util/vircgroup.c index ad46dfc..11f33ab 100644 --- a/src/util/vircgroup.c +++ b/src/util/vircgroup.c @@ -277,9 +277,7 @@ virCgroupValidateMachineGroup(virCgroupPtr group, goto cleanup;
if (stripEmulatorSuffix && - (i == VIR_CGROUP_CONTROLLER_CPU || - i == VIR_CGROUP_CONTROLLER_CPUACCT || - i == VIR_CGROUP_CONTROLLER_CPUSET)) { + (i & VIR_CGROUP_THREAD_CONTROLLER_MASK)) {
Not sure this works as expected because 'i' is not a mask - it's just an int... On entry, VIR_CGROUP_THREAD_CONTROLLER_MASK is 7...
The loop goes from 0 to VIR_CGROUP_CONTROLLER_LAST (11).
If you logically go through the values of 'i':
0 & 7 1 & 7 2 & 7 3 & 7 4 & 7 ... etc
You'd find 0 & 8 fail, but 1 -> 7, 9, & 10 succeed
So what "would" work is :
mask = (1 << i); if (stripEmulatorSuffix && mask & VIR_CGROUP_THREAD_CONTROLLER_MASK))
Sure, Stupid mistake, will fix.
John
if (STREQ(tmp, "/emulator")) *tmp = '\0'; tmp = strrchr(group->controllers[i].placement, '/'); @@ -1518,7 +1516,6 @@ virCgroupNewThread(virCgroupPtr domain, { int ret = -1; char *name = NULL; - int controllers;
switch (nameval) { case VIR_CGROUP_THREAD_VCPU: @@ -1539,11 +1536,8 @@ virCgroupNewThread(virCgroupPtr domain, goto cleanup; }
- controllers = ((1 << VIR_CGROUP_CONTROLLER_CPU) | - (1 << VIR_CGROUP_CONTROLLER_CPUACCT) | - (1 << VIR_CGROUP_CONTROLLER_CPUSET)); - - if (virCgroupNew(-1, name, domain, controllers, group) < 0) + if (virCgroupNew(-1, name, domain, VIR_CGROUP_THREAD_CONTROLLER_MASK, + group) < 0) goto cleanup;
if (virCgroupMakeGroup(domain, *group, create, VIR_CGROUP_NONE) < 0) { diff --git a/src/util/vircgroup.h b/src/util/vircgroup.h index f244c24..f71aed5 100644 --- a/src/util/vircgroup.h +++ b/src/util/vircgroup.h @@ -52,6 +52,13 @@ VIR_ENUM_DECL(virCgroupController); * Make sure we will not overflow */ verify(VIR_CGROUP_CONTROLLER_LAST < 8 * sizeof(int));
+enum { + VIR_CGROUP_THREAD_CONTROLLER_MASK = + ((1 << VIR_CGROUP_CONTROLLER_CPU) | + (1 << VIR_CGROUP_CONTROLLER_CPUACCT) | + (1 << VIR_CGROUP_CONTROLLER_CPUSET)) +}; + typedef enum { VIR_CGROUP_THREAD_VCPU = 0, VIR_CGROUP_THREAD_EMULATOR,

virCgroupNewMachine used to add the pidleader to the newly created machine cgroup. Do not do this implicit anymore. Signed-off-by: Henning Schild <henning.schild@siemens.com> --- src/lxc/lxc_cgroup.c | 11 +++++++++++ src/qemu/qemu_cgroup.c | 11 +++++++++++ src/util/vircgroup.c | 22 ---------------------- 3 files changed, 22 insertions(+), 22 deletions(-) diff --git a/src/lxc/lxc_cgroup.c b/src/lxc/lxc_cgroup.c index ad254e4..609e9ea 100644 --- a/src/lxc/lxc_cgroup.c +++ b/src/lxc/lxc_cgroup.c @@ -504,6 +504,17 @@ virCgroupPtr virLXCCgroupCreate(virDomainDefPtr def, &cgroup) < 0) goto cleanup; + if (virCgroupAddTask(cgroup, initpid) < 0) { + virErrorPtr saved = virSaveLastError(); + virCgroupRemove(cgroup); + virCgroupFree(&cgroup); + if (saved) { + virSetError(saved); + virFreeError(saved); + } + goto cleanup; + } + /* setup control group permissions for user namespace */ if (def->idmap.uidmap) { if (virCgroupSetOwner(cgroup, diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c index e41f461..66dc782 100644 --- a/src/qemu/qemu_cgroup.c +++ b/src/qemu/qemu_cgroup.c @@ -789,6 +789,17 @@ qemuInitCgroup(virQEMUDriverPtr driver, goto cleanup; } + if (virCgroupAddTask(priv->cgroup, vm->pid) < 0) { + virErrorPtr saved = virSaveLastError(); + virCgroupRemove(priv->cgroup); + virCgroupFree(&priv->cgroup); + if (saved) { + virSetError(saved); + virFreeError(saved); + } + goto cleanup; + } + done: ret = 0; cleanup: diff --git a/src/util/vircgroup.c b/src/util/vircgroup.c index 11f33ab..aef8e8c 100644 --- a/src/util/vircgroup.c +++ b/src/util/vircgroup.c @@ -1682,16 +1682,6 @@ virCgroupNewMachineSystemd(const char *name, } } - if (virCgroupAddTask(*group, pidleader) < 0) { - virErrorPtr saved = virSaveLastError(); - virCgroupRemove(*group); - virCgroupFree(group); - if (saved) { - virSetError(saved); - virFreeError(saved); - } - } - ret = 0; cleanup: virCgroupFree(&parent); @@ -1714,7 +1704,6 @@ int virCgroupTerminateMachine(const char *name, static int virCgroupNewMachineManual(const char *name, const char *drivername, - pid_t pidleader, const char *partition, int controllers, virCgroupPtr *group) @@ -1740,16 +1729,6 @@ virCgroupNewMachineManual(const char *name, group) < 0) goto cleanup; - if (virCgroupAddTask(*group, pidleader) < 0) { - virErrorPtr saved = virSaveLastError(); - virCgroupRemove(*group); - virCgroupFree(group); - if (saved) { - virSetError(saved); - virFreeError(saved); - } - } - done: ret = 0; @@ -1796,7 +1775,6 @@ virCgroupNewMachine(const char *name, return virCgroupNewMachineManual(name, drivername, - pidleader, partition, controllers, group); -- 2.4.10

On 02/23/2016 10:58 AM, Henning Schild wrote:
virCgroupNewMachine used to add the pidleader to the newly created machine cgroup. Do not do this implicit anymore.
Signed-off-by: Henning Schild <henning.schild@siemens.com> --- src/lxc/lxc_cgroup.c | 11 +++++++++++ src/qemu/qemu_cgroup.c | 11 +++++++++++ src/util/vircgroup.c | 22 ---------------------- 3 files changed, 22 insertions(+), 22 deletions(-)
So here we are again... Trying to recall where we left off... Ah yes, the entrails for this particular patch point me at the following: http://www.redhat.com/archives/libvir-list/2015-December/msg00759.html for lxc, and the following for qemu: http://www.redhat.com/archives/libvir-list/2016-January/msg00513.html Essentially the problem being that we can get a 0 return status *and* the returned cgroup pointer is still NULL That still needs to be addressed here (if this gets accepted)
diff --git a/src/lxc/lxc_cgroup.c b/src/lxc/lxc_cgroup.c index ad254e4..609e9ea 100644 --- a/src/lxc/lxc_cgroup.c +++ b/src/lxc/lxc_cgroup.c @@ -504,6 +504,17 @@ virCgroupPtr virLXCCgroupCreate(virDomainDefPtr def, &cgroup) < 0)
IOW: < 0 || !cgroup)
goto cleanup;
+ if (virCgroupAddTask(cgroup, initpid) < 0) { + virErrorPtr saved = virSaveLastError(); + virCgroupRemove(cgroup); + virCgroupFree(&cgroup); + if (saved) { + virSetError(saved); + virFreeError(saved); + } + goto cleanup; + } + /* setup control group permissions for user namespace */ if (def->idmap.uidmap) { if (virCgroupSetOwner(cgroup, diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c index e41f461..66dc782 100644 --- a/src/qemu/qemu_cgroup.c +++ b/src/qemu/qemu_cgroup.c @@ -789,6 +789,17 @@ qemuInitCgroup(virQEMUDriverPtr driver,
and a few lines above here: &priv->cgroup) < 0 || !priv->cgroup) { John
goto cleanup; }
+ if (virCgroupAddTask(priv->cgroup, vm->pid) < 0) { + virErrorPtr saved = virSaveLastError(); + virCgroupRemove(priv->cgroup); + virCgroupFree(&priv->cgroup); + if (saved) { + virSetError(saved); + virFreeError(saved); + } + goto cleanup; + } + done: ret = 0; cleanup: diff --git a/src/util/vircgroup.c b/src/util/vircgroup.c index 11f33ab..aef8e8c 100644 --- a/src/util/vircgroup.c +++ b/src/util/vircgroup.c @@ -1682,16 +1682,6 @@ virCgroupNewMachineSystemd(const char *name, } }
- if (virCgroupAddTask(*group, pidleader) < 0) { - virErrorPtr saved = virSaveLastError(); - virCgroupRemove(*group); - virCgroupFree(group); - if (saved) { - virSetError(saved); - virFreeError(saved); - } - } - ret = 0; cleanup: virCgroupFree(&parent); @@ -1714,7 +1704,6 @@ int virCgroupTerminateMachine(const char *name, static int virCgroupNewMachineManual(const char *name, const char *drivername, - pid_t pidleader, const char *partition, int controllers, virCgroupPtr *group) @@ -1740,16 +1729,6 @@ virCgroupNewMachineManual(const char *name, group) < 0) goto cleanup;
- if (virCgroupAddTask(*group, pidleader) < 0) { - virErrorPtr saved = virSaveLastError(); - virCgroupRemove(*group); - virCgroupFree(group); - if (saved) { - virSetError(saved); - virFreeError(saved); - } - } - done: ret = 0;
@@ -1796,7 +1775,6 @@ virCgroupNewMachine(const char *name,
return virCgroupNewMachineManual(name, drivername, - pidleader, partition, controllers, group);

On Tue, Feb 23, 2016 at 04:58:39PM +0100, Henning Schild wrote:
virCgroupNewMachine used to add the pidleader to the newly created machine cgroup. Do not do this implicit anymore.
Signed-off-by: Henning Schild <henning.schild@siemens.com> --- src/lxc/lxc_cgroup.c | 11 +++++++++++ src/qemu/qemu_cgroup.c | 11 +++++++++++ src/util/vircgroup.c | 22 ---------------------- 3 files changed, 22 insertions(+), 22 deletions(-)
NACK to this patch once again. This does not actually work as you think it does.
diff --git a/src/util/vircgroup.c b/src/util/vircgroup.c index 11f33ab..aef8e8c 100644 --- a/src/util/vircgroup.c +++ b/src/util/vircgroup.c @@ -1682,16 +1682,6 @@ virCgroupNewMachineSystemd(const char *name, } }
- if (virCgroupAddTask(*group, pidleader) < 0) { - virErrorPtr saved = virSaveLastError(); - virCgroupRemove(*group); - virCgroupFree(group); - if (saved) { - virSetError(saved); - virFreeError(saved); - } - }
Just above this we called virSystemdCreateMachine. Systemd will create the cgroup and add the pidleader to those cgroups. Systemd may add the pidleader to just the 'systemd' controller, or it may add the pidleader to *ALL* controllers. We have no way of knowing. This virCgroupAddTask call deals with whatever systemd chose not todo, so we can guarantee consistent behaviour with the pidleader in all cgroups. By removing this you make this method non-deterministic - the pid may or may not be in the cpu controller now. THis is bad because it can lead to QEMU/LXC driver code working in some cases but failing in other cases. Furthermore, this existing does not cause any problems for the scenario you care about. THis cgroup placement is being set in between the time libvirtd calls fork() and exec(). With your later patch 5, we ensure that the PID is moved across into the emulator cgroup, before we call exec(). When we call exec all memory mappings will be replaced, so QEMU will stil start with the correct vCPU placement and memory allocation placement. Just just drop this patch please. Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On Fri, 26 Feb 2016 11:13:13 +0000 "Daniel P. Berrange" <berrange@redhat.com> wrote:
On Tue, Feb 23, 2016 at 04:58:39PM +0100, Henning Schild wrote:
virCgroupNewMachine used to add the pidleader to the newly created machine cgroup. Do not do this implicit anymore.
Signed-off-by: Henning Schild <henning.schild@siemens.com> --- src/lxc/lxc_cgroup.c | 11 +++++++++++ src/qemu/qemu_cgroup.c | 11 +++++++++++ src/util/vircgroup.c | 22 ---------------------- 3 files changed, 22 insertions(+), 22 deletions(-)
NACK to this patch once again.
This does not actually work as you think it does.
diff --git a/src/util/vircgroup.c b/src/util/vircgroup.c index 11f33ab..aef8e8c 100644 --- a/src/util/vircgroup.c +++ b/src/util/vircgroup.c @@ -1682,16 +1682,6 @@ virCgroupNewMachineSystemd(const char *name, } }
- if (virCgroupAddTask(*group, pidleader) < 0) { - virErrorPtr saved = virSaveLastError(); - virCgroupRemove(*group); - virCgroupFree(group); - if (saved) { - virSetError(saved); - virFreeError(saved); - } - }
Just above this we called virSystemdCreateMachine. Systemd will create the cgroup and add the pidleader to those cgroups. Systemd may add the pidleader to just the 'systemd' controller, or it may add the pidleader to *ALL* controllers. We have no way of knowing.
This virCgroupAddTask call deals with whatever systemd chose not todo, so we can guarantee consistent behaviour with the pidleader in all cgroups.
By removing this you make this method non-deterministic - the pid may or may not be in the cpu controller now. THis is bad because it can lead to QEMU/LXC driver code working in some cases but failing in other cases.
Furthermore, this existing does not cause any problems for the scenario you care about. THis cgroup placement is being set in between the time libvirtd calls fork() and exec(). With your later patch 5, we ensure that the PID is moved across into the emulator cgroup, before we call exec(). When we call exec all memory mappings will be replaced, so QEMU will stil start with the correct vCPU placement and memory allocation placement.
I agree having the task in the wrong cgroup before the exec() seems harmless. But i am not sure all the fiddling with cgroups is indeed harmless and does not cause i.e. kernel work on cores that should be left alone. I have the feeling allowing the task in the parent cgroup is a bad idea, no matter how short the window seems to be. Right now the parent cgroup contains all cpus found in machine.slice, which for pinned VMs is too much. How about we calculate the size of the child cgroups before and make the parent the union of them. Or give the parent the emulator pinning and extend it for the vcpus later. But that might turn out pretty complicated as well, getting the order right with the mix of cpusets and sched_setaffinity().
Just just drop this patch please.
Regards, Daniel

On Fri, Feb 26, 2016 at 01:16:15PM +0100, Henning Schild wrote:
On Fri, 26 Feb 2016 11:13:13 +0000 "Daniel P. Berrange" <berrange@redhat.com> wrote:
On Tue, Feb 23, 2016 at 04:58:39PM +0100, Henning Schild wrote:
virCgroupNewMachine used to add the pidleader to the newly created machine cgroup. Do not do this implicit anymore.
Signed-off-by: Henning Schild <henning.schild@siemens.com> --- src/lxc/lxc_cgroup.c | 11 +++++++++++ src/qemu/qemu_cgroup.c | 11 +++++++++++ src/util/vircgroup.c | 22 ---------------------- 3 files changed, 22 insertions(+), 22 deletions(-)
NACK to this patch once again.
This does not actually work as you think it does.
diff --git a/src/util/vircgroup.c b/src/util/vircgroup.c index 11f33ab..aef8e8c 100644 --- a/src/util/vircgroup.c +++ b/src/util/vircgroup.c @@ -1682,16 +1682,6 @@ virCgroupNewMachineSystemd(const char *name, } }
- if (virCgroupAddTask(*group, pidleader) < 0) { - virErrorPtr saved = virSaveLastError(); - virCgroupRemove(*group); - virCgroupFree(group); - if (saved) { - virSetError(saved); - virFreeError(saved); - } - }
Just above this we called virSystemdCreateMachine. Systemd will create the cgroup and add the pidleader to those cgroups. Systemd may add the pidleader to just the 'systemd' controller, or it may add the pidleader to *ALL* controllers. We have no way of knowing.
This virCgroupAddTask call deals with whatever systemd chose not todo, so we can guarantee consistent behaviour with the pidleader in all cgroups.
By removing this you make this method non-deterministic - the pid may or may not be in the cpu controller now. THis is bad because it can lead to QEMU/LXC driver code working in some cases but failing in other cases.
Furthermore, this existing does not cause any problems for the scenario you care about. THis cgroup placement is being set in between the time libvirtd calls fork() and exec(). With your later patch 5, we ensure that the PID is moved across into the emulator cgroup, before we call exec(). When we call exec all memory mappings will be replaced, so QEMU will stil start with the correct vCPU placement and memory allocation placement.
I agree having the task in the wrong cgroup before the exec() seems harmless. But i am not sure all the fiddling with cgroups is indeed harmless and does not cause i.e. kernel work on cores that should be left alone. I have the feeling allowing the task in the parent cgroup is a bad idea, no matter how short the window seems to be.
Right now the parent cgroup contains all cpus found in machine.slice, which for pinned VMs is too much. How about we calculate the size of the child cgroups before and make the parent the union of them. Or give the parent the emulator pinning and extend it for the vcpus later. But that might turn out pretty complicated as well, getting the order right with the mix of cpusets and sched_setaffinity().
Just just drop this patch please.
The point is though that we have *no* choice. Systemd can put the task in the cpu controller and we've no way to prevent that. So the code *has* to be able to cope with that happening. Therefore this patch is wrong it just makes behaviour non-deterministic increasing the chances that we don't correctly handle the case where systemd adds the task to the cpu controllers Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On Fri, 26 Feb 2016 12:21:02 +0000 "Daniel P. Berrange" <berrange@redhat.com> wrote:
On Fri, Feb 26, 2016 at 01:16:15PM +0100, Henning Schild wrote:
On Fri, 26 Feb 2016 11:13:13 +0000 "Daniel P. Berrange" <berrange@redhat.com> wrote:
On Tue, Feb 23, 2016 at 04:58:39PM +0100, Henning Schild wrote:
virCgroupNewMachine used to add the pidleader to the newly created machine cgroup. Do not do this implicit anymore.
Signed-off-by: Henning Schild <henning.schild@siemens.com> --- src/lxc/lxc_cgroup.c | 11 +++++++++++ src/qemu/qemu_cgroup.c | 11 +++++++++++ src/util/vircgroup.c | 22 ---------------------- 3 files changed, 22 insertions(+), 22 deletions(-)
NACK to this patch once again.
This does not actually work as you think it does.
diff --git a/src/util/vircgroup.c b/src/util/vircgroup.c index 11f33ab..aef8e8c 100644 --- a/src/util/vircgroup.c +++ b/src/util/vircgroup.c @@ -1682,16 +1682,6 @@ virCgroupNewMachineSystemd(const char *name, } }
- if (virCgroupAddTask(*group, pidleader) < 0) { - virErrorPtr saved = virSaveLastError(); - virCgroupRemove(*group); - virCgroupFree(group); - if (saved) { - virSetError(saved); - virFreeError(saved); - } - }
Just above this we called virSystemdCreateMachine. Systemd will create the cgroup and add the pidleader to those cgroups. Systemd may add the pidleader to just the 'systemd' controller, or it may add the pidleader to *ALL* controllers. We have no way of knowing.
This virCgroupAddTask call deals with whatever systemd chose not todo, so we can guarantee consistent behaviour with the pidleader in all cgroups.
By removing this you make this method non-deterministic - the pid may or may not be in the cpu controller now. THis is bad because it can lead to QEMU/LXC driver code working in some cases but failing in other cases.
Furthermore, this existing does not cause any problems for the scenario you care about. THis cgroup placement is being set in between the time libvirtd calls fork() and exec(). With your later patch 5, we ensure that the PID is moved across into the emulator cgroup, before we call exec(). When we call exec all memory mappings will be replaced, so QEMU will stil start with the correct vCPU placement and memory allocation placement.
I agree having the task in the wrong cgroup before the exec() seems harmless. But i am not sure all the fiddling with cgroups is indeed harmless and does not cause i.e. kernel work on cores that should be left alone. I have the feeling allowing the task in the parent cgroup is a bad idea, no matter how short the window seems to be.
Right now the parent cgroup contains all cpus found in machine.slice, which for pinned VMs is too much. How about we calculate the size of the child cgroups before and make the parent the union of them. Or give the parent the emulator pinning and extend it for the vcpus later. But that might turn out pretty complicated as well, getting the order right with the mix of cpusets and sched_setaffinity().
Just just drop this patch please.
The point is though that we have *no* choice. Systemd can put the task in the cpu controller and we've no way to prevent that. So the code *has* to be able to cope with that happening. Therefore this patch is wrong it just makes behaviour non-deterministic increasing the chances that we don't correctly handle the case where systemd adds the task to the cpu controllers
Understood! I was suggesting a "growing on demand" policy instead of "shrinking after inheriting all". If we can not control what systemd does we have to give it harmless cpus to mess around with. That assumes we can control the size of the cpuset before systemd puts anything in. Once back in control we grow the parent group before deriving more child groups. Would that be possible? I have no objections to keep using the shrinking approach. Especially since the controlled growing is harder to implement in the given codebase. It just feels like it should be the other way around.
Regards, Daniel

On Fri, Feb 26, 2016 at 01:43:05PM +0100, Henning Schild wrote:
On Fri, 26 Feb 2016 12:21:02 +0000 "Daniel P. Berrange" <berrange@redhat.com> wrote:
On Fri, Feb 26, 2016 at 01:16:15PM +0100, Henning Schild wrote:
On Fri, 26 Feb 2016 11:13:13 +0000 "Daniel P. Berrange" <berrange@redhat.com> wrote:
On Tue, Feb 23, 2016 at 04:58:39PM +0100, Henning Schild wrote:
virCgroupNewMachine used to add the pidleader to the newly created machine cgroup. Do not do this implicit anymore.
Signed-off-by: Henning Schild <henning.schild@siemens.com> --- src/lxc/lxc_cgroup.c | 11 +++++++++++ src/qemu/qemu_cgroup.c | 11 +++++++++++ src/util/vircgroup.c | 22 ---------------------- 3 files changed, 22 insertions(+), 22 deletions(-)
NACK to this patch once again.
This does not actually work as you think it does.
diff --git a/src/util/vircgroup.c b/src/util/vircgroup.c index 11f33ab..aef8e8c 100644 --- a/src/util/vircgroup.c +++ b/src/util/vircgroup.c @@ -1682,16 +1682,6 @@ virCgroupNewMachineSystemd(const char *name, } }
- if (virCgroupAddTask(*group, pidleader) < 0) { - virErrorPtr saved = virSaveLastError(); - virCgroupRemove(*group); - virCgroupFree(group); - if (saved) { - virSetError(saved); - virFreeError(saved); - } - }
Just above this we called virSystemdCreateMachine. Systemd will create the cgroup and add the pidleader to those cgroups. Systemd may add the pidleader to just the 'systemd' controller, or it may add the pidleader to *ALL* controllers. We have no way of knowing.
This virCgroupAddTask call deals with whatever systemd chose not todo, so we can guarantee consistent behaviour with the pidleader in all cgroups.
By removing this you make this method non-deterministic - the pid may or may not be in the cpu controller now. THis is bad because it can lead to QEMU/LXC driver code working in some cases but failing in other cases.
Furthermore, this existing does not cause any problems for the scenario you care about. THis cgroup placement is being set in between the time libvirtd calls fork() and exec(). With your later patch 5, we ensure that the PID is moved across into the emulator cgroup, before we call exec(). When we call exec all memory mappings will be replaced, so QEMU will stil start with the correct vCPU placement and memory allocation placement.
I agree having the task in the wrong cgroup before the exec() seems harmless. But i am not sure all the fiddling with cgroups is indeed harmless and does not cause i.e. kernel work on cores that should be left alone. I have the feeling allowing the task in the parent cgroup is a bad idea, no matter how short the window seems to be.
Right now the parent cgroup contains all cpus found in machine.slice, which for pinned VMs is too much. How about we calculate the size of the child cgroups before and make the parent the union of them. Or give the parent the emulator pinning and extend it for the vcpus later. But that might turn out pretty complicated as well, getting the order right with the mix of cpusets and sched_setaffinity().
Just just drop this patch please.
The point is though that we have *no* choice. Systemd can put the task in the cpu controller and we've no way to prevent that. So the code *has* to be able to cope with that happening. Therefore this patch is wrong it just makes behaviour non-deterministic increasing the chances that we don't correctly handle the case where systemd adds the task to the cpu controllers
Understood! I was suggesting a "growing on demand" policy instead of "shrinking after inheriting all".
If we can not control what systemd does we have to give it harmless cpus to mess around with. That assumes we can control the size of the cpuset before systemd puts anything in. Once back in control we grow the parent group before deriving more child groups.
Would that be possible?
I have no objections to keep using the shrinking approach. Especially since the controlled growing is harder to implement in the given codebase. It just feels like it should be the other way around.
I don't see what actual problem you are trying to solve here. IIUC, the original problem you wanted to address was that vCPU pids could run the wrong CPU for a short time. ie The original code did 1. Libvirtd forks child pid ... this pid runs on whatever pCPUs that libvirt is permitted to use... 2. Libvirtd creates root cgroup for VM ... this pid runs on whatever pCPUs the root cgroup inherited... 3. Child pid execs QEMU ... QEMU pid runs on whatever pCPUs the root cgroup inherited... 4. QEMU creates vCPU pids ... vCPU pids run on whatever pCPUs the root cgroup inherited... 4. Libvirtd moves emulator PIDs and vCPU PIDs ... emulator PIDs are running on assigned pCPUs for emulator... ... vCPU PIDs are running on assigned pCPUs for vCPUs.... With the important change in patch 5 this now looks like 1. Libvirtd forks child pid ... this pid runs on whatever pCPUs that libvirt is permitted to use... 2. Libvirtd creates root cgroup for VM ... this pid runs on whatever pCPUs the root cgroup inherited... 3. Libvirtd moves pid into emulator group ... this pid runs on assigned pCPUs for emulator... 4. Child pid execs QEMU ... QEMU pid runs on assigned pCPUs for emulator... 5. QEMU creates vCPU pids ... vCPU pids are running on assigned pCPUS for emulator... 6. Libvirtd moves vCPU PIDs ... emulator PIDs are running on assigned pCPUs for emulator... ... vCPU PIDs are running on assigned pCPUs for vCPUs.... Which is good, because vCPU pids don't ever run on un-restricted pCPUs. Your patch 4 here is attempting to change step 2 only so that it looks like 1. Libvirtd forks child pid ... this pid runs on whatever pCPUs that libvirt is permitted to use... 2. Libvirtd creates root cgroup for VM ... this pid runs on whatever pCPUs that libvirt is permitted to use... or depending on what controller system added ... this pid runs on whatever pCPUs the root cgroup inherited... 3. Libvirtd adds pid into emulator group ... this pid runs on assigned pCPUs for emulator... 4. Child pid execs QEMU ... QEMU pid runs on assigned pCPUs for emulator... 5. QEMU creates vCPU pids ... vCPU pids are running on assigned pCPUS for emulator... 6. Libvirtd moves vCPU PIDs ... emulator PIDs are running on assigned pCPUs for emulator... ... vCPU PIDs are running on assigned pCPUs for vCPUs.... At the time we exec QEMU in step 4 the situation is exactly the same as before. The vCPU pids are still created in the right place straight away. So this patch 4 doesn't achieve anything useful Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On Fri, 26 Feb 2016 13:00:04 +0000 "Daniel P. Berrange" <berrange@redhat.com> wrote:
On Fri, Feb 26, 2016 at 01:43:05PM +0100, Henning Schild wrote:
On Fri, 26 Feb 2016 12:21:02 +0000 "Daniel P. Berrange" <berrange@redhat.com> wrote:
On Fri, Feb 26, 2016 at 01:16:15PM +0100, Henning Schild wrote:
On Fri, 26 Feb 2016 11:13:13 +0000 "Daniel P. Berrange" <berrange@redhat.com> wrote:
On Tue, Feb 23, 2016 at 04:58:39PM +0100, Henning Schild wrote:
virCgroupNewMachine used to add the pidleader to the newly created machine cgroup. Do not do this implicit anymore.
Signed-off-by: Henning Schild <henning.schild@siemens.com> --- src/lxc/lxc_cgroup.c | 11 +++++++++++ src/qemu/qemu_cgroup.c | 11 +++++++++++ src/util/vircgroup.c | 22 ---------------------- 3 files changed, 22 insertions(+), 22 deletions(-)
NACK to this patch once again.
This does not actually work as you think it does.
diff --git a/src/util/vircgroup.c b/src/util/vircgroup.c index 11f33ab..aef8e8c 100644 --- a/src/util/vircgroup.c +++ b/src/util/vircgroup.c @@ -1682,16 +1682,6 @@ virCgroupNewMachineSystemd(const char *name, } }
- if (virCgroupAddTask(*group, pidleader) < 0) { - virErrorPtr saved = virSaveLastError(); - virCgroupRemove(*group); - virCgroupFree(group); - if (saved) { - virSetError(saved); - virFreeError(saved); - } - }
Just above this we called virSystemdCreateMachine. Systemd will create the cgroup and add the pidleader to those cgroups. Systemd may add the pidleader to just the 'systemd' controller, or it may add the pidleader to *ALL* controllers. We have no way of knowing.
This virCgroupAddTask call deals with whatever systemd chose not todo, so we can guarantee consistent behaviour with the pidleader in all cgroups.
By removing this you make this method non-deterministic - the pid may or may not be in the cpu controller now. THis is bad because it can lead to QEMU/LXC driver code working in some cases but failing in other cases.
Furthermore, this existing does not cause any problems for the scenario you care about. THis cgroup placement is being set in between the time libvirtd calls fork() and exec(). With your later patch 5, we ensure that the PID is moved across into the emulator cgroup, before we call exec(). When we call exec all memory mappings will be replaced, so QEMU will stil start with the correct vCPU placement and memory allocation placement.
I agree having the task in the wrong cgroup before the exec() seems harmless. But i am not sure all the fiddling with cgroups is indeed harmless and does not cause i.e. kernel work on cores that should be left alone. I have the feeling allowing the task in the parent cgroup is a bad idea, no matter how short the window seems to be.
Right now the parent cgroup contains all cpus found in machine.slice, which for pinned VMs is too much. How about we calculate the size of the child cgroups before and make the parent the union of them. Or give the parent the emulator pinning and extend it for the vcpus later. But that might turn out pretty complicated as well, getting the order right with the mix of cpusets and sched_setaffinity().
Just just drop this patch please.
The point is though that we have *no* choice. Systemd can put the task in the cpu controller and we've no way to prevent that. So the code *has* to be able to cope with that happening. Therefore this patch is wrong it just makes behaviour non-deterministic increasing the chances that we don't correctly handle the case where systemd adds the task to the cpu controllers
Understood! I was suggesting a "growing on demand" policy instead of "shrinking after inheriting all".
If we can not control what systemd does we have to give it harmless cpus to mess around with. That assumes we can control the size of the cpuset before systemd puts anything in. Once back in control we grow the parent group before deriving more child groups.
Would that be possible?
I have no objections to keep using the shrinking approach. Especially since the controlled growing is harder to implement in the given codebase. It just feels like it should be the other way around.
I don't see what actual problem you are trying to solve here.
IIUC, the original problem you wanted to address was that vCPU pids could run the wrong CPU for a short time. ie The original code did
1. Libvirtd forks child pid ... this pid runs on whatever pCPUs that libvirt is permitted to use... 2. Libvirtd creates root cgroup for VM ... this pid runs on whatever pCPUs the root cgroup inherited... 3. Child pid execs QEMU ... QEMU pid runs on whatever pCPUs the root cgroup inherited... 4. QEMU creates vCPU pids ... vCPU pids run on whatever pCPUs the root cgroup inherited... 4. Libvirtd moves emulator PIDs and vCPU PIDs ... emulator PIDs are running on assigned pCPUs for emulator... ... vCPU PIDs are running on assigned pCPUs for vCPUs....
With the important change in patch 5 this now looks like
1. Libvirtd forks child pid ... this pid runs on whatever pCPUs that libvirt is permitted to use... 2. Libvirtd creates root cgroup for VM
... this pid runs on whatever pCPUs the root cgroup inherited...
I am trying to come up with a solution that eliminates the above line from the whole bringup. I.e never allow a pid belonging to the VM outside the pinnings of libvirtd and the VM configuration. But until step 4 it should probably be "... this pid *just sits* on whatever pCPUs the root cgroup inherited..." If we are sure that it does not "run" before 4. patch 5 does the trick already
3. Libvirtd moves pid into emulator group ... this pid runs on assigned pCPUs for emulator... 4. Child pid execs QEMU ... QEMU pid runs on assigned pCPUs for emulator... 5. QEMU creates vCPU pids ... vCPU pids are running on assigned pCPUS for emulator... 6. Libvirtd moves vCPU PIDs ... emulator PIDs are running on assigned pCPUs for emulator... ... vCPU PIDs are running on assigned pCPUs for vCPUs....
Which is good, because vCPU pids don't ever run on un-restricted pCPUs.
Your patch 4 here is attempting to change step 2 only so that it looks like
1. Libvirtd forks child pid ... this pid runs on whatever pCPUs that libvirt is permitted to use... 2. Libvirtd creates root cgroup for VM ... this pid runs on whatever pCPUs that libvirt is permitted to use... or depending on what controller system added ... this pid runs on whatever pCPUs the root cgroup inherited... 3. Libvirtd adds pid into emulator group ... this pid runs on assigned pCPUs for emulator... 4. Child pid execs QEMU ... QEMU pid runs on assigned pCPUs for emulator... 5. QEMU creates vCPU pids ... vCPU pids are running on assigned pCPUS for emulator... 6. Libvirtd moves vCPU PIDs ... emulator PIDs are running on assigned pCPUs for emulator... ... vCPU PIDs are running on assigned pCPUs for vCPUs....
At the time we exec QEMU in step 4 the situation is exactly the same as before. The vCPU pids are still created in the right place straight away.
So this patch 4 doesn't achieve anything useful
Regards, Daniel

On Fri, Feb 26, 2016 at 02:17:35PM +0100, Henning Schild wrote:
On Fri, 26 Feb 2016 13:00:04 +0000 "Daniel P. Berrange" <berrange@redhat.com> wrote:
On Fri, Feb 26, 2016 at 01:43:05PM +0100, Henning Schild wrote: IIUC, the original problem you wanted to address was that vCPU pids could run the wrong CPU for a short time. ie The original code did
1. Libvirtd forks child pid ... this pid runs on whatever pCPUs that libvirt is permitted to use... 2. Libvirtd creates root cgroup for VM ... this pid runs on whatever pCPUs the root cgroup inherited... 3. Child pid execs QEMU ... QEMU pid runs on whatever pCPUs the root cgroup inherited... 4. QEMU creates vCPU pids ... vCPU pids run on whatever pCPUs the root cgroup inherited... 4. Libvirtd moves emulator PIDs and vCPU PIDs ... emulator PIDs are running on assigned pCPUs for emulator... ... vCPU PIDs are running on assigned pCPUs for vCPUs....
With the important change in patch 5 this now looks like
1. Libvirtd forks child pid ... this pid runs on whatever pCPUs that libvirt is permitted to use... 2. Libvirtd creates root cgroup for VM
... this pid runs on whatever pCPUs the root cgroup inherited...
I am trying to come up with a solution that eliminates the above line from the whole bringup. I.e never allow a pid belonging to the VM outside the pinnings of libvirtd and the VM configuration.
That's imposible because you can't stop systemd placing the pid leader
But until step 4 it should probably be "... this pid *just sits* on whatever pCPUs the root cgroup inherited..." If we are sure that it does not "run" before 4. patch 5 does the trick already
Yes the pid *runs* - it has to run in order to do the setup tasks before exec'ing QEMU. Indeed even invoking 'execve()' syscall requires that it run.
3. Libvirtd moves pid into emulator group ... this pid runs on assigned pCPUs for emulator... 4. Child pid execs QEMU ... QEMU pid runs on assigned pCPUs for emulator... 5. QEMU creates vCPU pids ... vCPU pids are running on assigned pCPUS for emulator... 6. Libvirtd moves vCPU PIDs ... emulator PIDs are running on assigned pCPUs for emulator... ... vCPU PIDs are running on assigned pCPUs for vCPUs....
Which is good, because vCPU pids don't ever run on un-restricted pCPUs.
Your patch 4 here is attempting to change step 2 only so that it looks like
1. Libvirtd forks child pid ... this pid runs on whatever pCPUs that libvirt is permitted to use... 2. Libvirtd creates root cgroup for VM ... this pid runs on whatever pCPUs that libvirt is permitted to use... or depending on what controller system added ... this pid runs on whatever pCPUs the root cgroup inherited... 3. Libvirtd adds pid into emulator group ... this pid runs on assigned pCPUs for emulator... 4. Child pid execs QEMU ... QEMU pid runs on assigned pCPUs for emulator... 5. QEMU creates vCPU pids ... vCPU pids are running on assigned pCPUS for emulator... 6. Libvirtd moves vCPU PIDs ... emulator PIDs are running on assigned pCPUs for emulator... ... vCPU PIDs are running on assigned pCPUs for vCPUs....
At the time we exec QEMU in step 4 the situation is exactly the same as before. The vCPU pids are still created in the right place straight away.
So this patch 4 doesn't achieve anything useful
Regards, Daniel
Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On Fri, 26 Feb 2016 13:26:36 +0000 "Daniel P. Berrange" <berrange@redhat.com> wrote:
On Fri, Feb 26, 2016 at 02:17:35PM +0100, Henning Schild wrote:
On Fri, 26 Feb 2016 13:00:04 +0000 "Daniel P. Berrange" <berrange@redhat.com> wrote:
On Fri, Feb 26, 2016 at 01:43:05PM +0100, Henning Schild wrote: IIUC, the original problem you wanted to address was that vCPU pids could run the wrong CPU for a short time. ie The original code did
1. Libvirtd forks child pid ... this pid runs on whatever pCPUs that libvirt is permitted to use... 2. Libvirtd creates root cgroup for VM ... this pid runs on whatever pCPUs the root cgroup inherited... 3. Child pid execs QEMU ... QEMU pid runs on whatever pCPUs the root cgroup inherited... 4. QEMU creates vCPU pids ... vCPU pids run on whatever pCPUs the root cgroup inherited... 4. Libvirtd moves emulator PIDs and vCPU PIDs ... emulator PIDs are running on assigned pCPUs for emulator... ... vCPU PIDs are running on assigned pCPUs for vCPUs....
With the important change in patch 5 this now looks like
1. Libvirtd forks child pid ... this pid runs on whatever pCPUs that libvirt is permitted to use... 2. Libvirtd creates root cgroup for VM
... this pid runs on whatever pCPUs the root cgroup inherited...
I am trying to come up with a solution that eliminates the above line from the whole bringup. I.e never allow a pid belonging to the VM outside the pinnings of libvirtd and the VM configuration.
That's imposible because you can't stop systemd placing the pid leader
But until step 4 it should probably be "... this pid *just sits* on whatever pCPUs the root cgroup inherited..." If we are sure that it does not "run" before 4. patch 5 does the trick already
Yes the pid *runs* - it has to run in order to do the setup tasks before exec'ing QEMU. Indeed even invoking 'execve()' syscall requires that it run.
I am saying it should not run while in the parent cgroup. so between steps 2 and 3. If we can not stop the pid from getting into the parent cgroup we have to rely on it not causing disturbance by "running". Otherwise the whole series is not a solution to the disturbance problem, it is just a mitigation. That beeing said i think it is still good enough and we should stop that discussion here. I will send a v2 series.
3. Libvirtd moves pid into emulator group ... this pid runs on assigned pCPUs for emulator...
from now on it can run all it wants, because it is in the corrent cpuset
4. Child pid execs QEMU ... QEMU pid runs on assigned pCPUs for emulator... 5. QEMU creates vCPU pids ... vCPU pids are running on assigned pCPUS for emulator... 6. Libvirtd moves vCPU PIDs ... emulator PIDs are running on assigned pCPUs for emulator... ... vCPU PIDs are running on assigned pCPUs for vCPUs....
Which is good, because vCPU pids don't ever run on un-restricted pCPUs.
Your patch 4 here is attempting to change step 2 only so that it looks like
1. Libvirtd forks child pid ... this pid runs on whatever pCPUs that libvirt is permitted to use... 2. Libvirtd creates root cgroup for VM ... this pid runs on whatever pCPUs that libvirt is permitted to use... or depending on what controller system added ... this pid runs on whatever pCPUs the root cgroup inherited... 3. Libvirtd adds pid into emulator group ... this pid runs on assigned pCPUs for emulator... 4. Child pid execs QEMU ... QEMU pid runs on assigned pCPUs for emulator... 5. QEMU creates vCPU pids ... vCPU pids are running on assigned pCPUS for emulator... 6. Libvirtd moves vCPU PIDs ... emulator PIDs are running on assigned pCPUs for emulator... ... vCPU PIDs are running on assigned pCPUs for vCPUs....
At the time we exec QEMU in step 4 the situation is exactly the same as before. The vCPU pids are still created in the right place straight away.
So this patch 4 doesn't achieve anything useful
Regards, Daniel
Regards, Daniel

On Fri, Feb 26, 2016 at 02:46:56PM +0100, Henning Schild wrote:
On Fri, 26 Feb 2016 13:26:36 +0000 "Daniel P. Berrange" <berrange@redhat.com> wrote:
On Fri, Feb 26, 2016 at 02:17:35PM +0100, Henning Schild wrote:
On Fri, 26 Feb 2016 13:00:04 +0000 "Daniel P. Berrange" <berrange@redhat.com> wrote:
On Fri, Feb 26, 2016 at 01:43:05PM +0100, Henning Schild wrote: IIUC, the original problem you wanted to address was that vCPU pids could run the wrong CPU for a short time. ie The original code did
1. Libvirtd forks child pid ... this pid runs on whatever pCPUs that libvirt is permitted to use... 2. Libvirtd creates root cgroup for VM ... this pid runs on whatever pCPUs the root cgroup inherited... 3. Child pid execs QEMU ... QEMU pid runs on whatever pCPUs the root cgroup inherited... 4. QEMU creates vCPU pids ... vCPU pids run on whatever pCPUs the root cgroup inherited... 4. Libvirtd moves emulator PIDs and vCPU PIDs ... emulator PIDs are running on assigned pCPUs for emulator... ... vCPU PIDs are running on assigned pCPUs for vCPUs....
With the important change in patch 5 this now looks like
1. Libvirtd forks child pid ... this pid runs on whatever pCPUs that libvirt is permitted to use... 2. Libvirtd creates root cgroup for VM
... this pid runs on whatever pCPUs the root cgroup inherited...
I am trying to come up with a solution that eliminates the above line from the whole bringup. I.e never allow a pid belonging to the VM outside the pinnings of libvirtd and the VM configuration.
That's imposible because you can't stop systemd placing the pid leader
But until step 4 it should probably be "... this pid *just sits* on whatever pCPUs the root cgroup inherited..." If we are sure that it does not "run" before 4. patch 5 does the trick already
Yes the pid *runs* - it has to run in order to do the setup tasks before exec'ing QEMU. Indeed even invoking 'execve()' syscall requires that it run.
I am saying it should not run while in the parent cgroup. so between steps 2 and 3. If we can not stop the pid from getting into the parent cgroup we have to rely on it not causing disturbance by "running". Otherwise the whole series is not a solution to the disturbance problem, it is just a mitigation.
We *already* have the possibility of running on arbitrary pCPUs, because in between fork() and use creating the cgroup the process is runing on whatever pCPUs libvirtd is on. This is potentially any CPUs. Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

Move qemuSetupCgroupForEmulator up under qemuSetupCgroup. That way we move the one main thread right into the emulator cgroup, instead of moving multiple threads later on. And we do not actually want any threads running in the parent cgroups (cpu cpuacct cpuset). Signed-off-by: Henning Schild <henning.schild@siemens.com> --- src/qemu/qemu_process.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 05cbda2..65f718c 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -4895,6 +4895,10 @@ qemuProcessLaunch(virConnectPtr conn, if (qemuSetupCgroup(driver, vm, nnicindexes, nicindexes) < 0) goto cleanup; + VIR_DEBUG("Setting cgroup for emulator (if required)"); + if (qemuSetupCgroupForEmulator(vm) < 0) + goto cleanup; + /* This must be done after cgroup placement to avoid resetting CPU * affinity */ if (!vm->def->cputune.emulatorpin && @@ -4943,10 +4947,6 @@ qemuProcessLaunch(virConnectPtr conn, if (rv == -1) /* The VM failed to start */ goto cleanup; - VIR_DEBUG("Setting cgroup for emulator (if required)"); - if (qemuSetupCgroupForEmulator(vm) < 0) - goto cleanup; - VIR_DEBUG("Setting affinity of emulator threads"); if (qemuProcessSetEmulatorAffinity(vm) < 0) goto cleanup; -- 2.4.10

On 02/23/2016 10:58 AM, Henning Schild wrote:
Move qemuSetupCgroupForEmulator up under qemuSetupCgroup. That way we move the one main thread right into the emulator cgroup, instead of moving multiple threads later on. And we do not actually want any threads running in the parent cgroups (cpu cpuacct cpuset).
Signed-off-by: Henning Schild <henning.schild@siemens.com> --- src/qemu/qemu_process.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)
This is where things are going to get messy. Peter Krempa posted a series after yours: http://www.redhat.com/archives/libvir-list/2016-February/msg01211.html which conflicts with this and the followup patch. Hopefully between you, Peter, and Dan something can be worked out. Also, it seems starting at patch 7 there's more conflicts with the top of the upstream, so I couldn't 'git am -3' them into a local branch. John
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 05cbda2..65f718c 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -4895,6 +4895,10 @@ qemuProcessLaunch(virConnectPtr conn, if (qemuSetupCgroup(driver, vm, nnicindexes, nicindexes) < 0) goto cleanup;
+ VIR_DEBUG("Setting cgroup for emulator (if required)"); + if (qemuSetupCgroupForEmulator(vm) < 0) + goto cleanup; + /* This must be done after cgroup placement to avoid resetting CPU * affinity */ if (!vm->def->cputune.emulatorpin && @@ -4943,10 +4947,6 @@ qemuProcessLaunch(virConnectPtr conn, if (rv == -1) /* The VM failed to start */ goto cleanup;
- VIR_DEBUG("Setting cgroup for emulator (if required)"); - if (qemuSetupCgroupForEmulator(vm) < 0) - goto cleanup; - VIR_DEBUG("Setting affinity of emulator threads"); if (qemuProcessSetEmulatorAffinity(vm) < 0) goto cleanup;

On Thu, 25 Feb 2016 17:53:30 -0500 John Ferlan <jferlan@redhat.com> wrote:
On 02/23/2016 10:58 AM, Henning Schild wrote:
Move qemuSetupCgroupForEmulator up under qemuSetupCgroup. That way we move the one main thread right into the emulator cgroup, instead of moving multiple threads later on. And we do not actually want any threads running in the parent cgroups (cpu cpuacct cpuset).
Signed-off-by: Henning Schild <henning.schild@siemens.com> --- src/qemu/qemu_process.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)
This is where things are going to get messy. Peter Krempa posted a series after yours:
http://www.redhat.com/archives/libvir-list/2016-February/msg01211.html
which conflicts with this and the followup patch. Hopefully between you, Peter, and Dan something can be worked out.
Peters patch 1 seems like a semantic noop. It just merge cgroup creation and affinity setting. I can probably rebase on top of that without a problem.
Also, it seems starting at patch 7 there's more conflicts with the top of the upstream, so I couldn't 'git am -3' them into a local branch.
John
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 05cbda2..65f718c 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -4895,6 +4895,10 @@ qemuProcessLaunch(virConnectPtr conn, if (qemuSetupCgroup(driver, vm, nnicindexes, nicindexes) < 0) goto cleanup;
+ VIR_DEBUG("Setting cgroup for emulator (if required)"); + if (qemuSetupCgroupForEmulator(vm) < 0) + goto cleanup; + /* This must be done after cgroup placement to avoid resetting CPU * affinity */ if (!vm->def->cputune.emulatorpin && @@ -4943,10 +4947,6 @@ qemuProcessLaunch(virConnectPtr conn, if (rv == -1) /* The VM failed to start */ goto cleanup;
- VIR_DEBUG("Setting cgroup for emulator (if required)"); - if (qemuSetupCgroupForEmulator(vm) < 0) - goto cleanup; - VIR_DEBUG("Setting affinity of emulator threads"); if (qemuProcessSetEmulatorAffinity(vm) < 0) goto cleanup;

On Thu, 25 Feb 2016 17:53:30 -0500 John Ferlan <jferlan@redhat.com> wrote:
On 02/23/2016 10:58 AM, Henning Schild wrote:
Move qemuSetupCgroupForEmulator up under qemuSetupCgroup. That way we move the one main thread right into the emulator cgroup, instead of moving multiple threads later on. And we do not actually want any threads running in the parent cgroups (cpu cpuacct cpuset).
Signed-off-by: Henning Schild <henning.schild@siemens.com> --- src/qemu/qemu_process.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)
This is where things are going to get messy. Peter Krempa posted a series after yours:
http://www.redhat.com/archives/libvir-list/2016-February/msg01211.html
Where can i get a raw copy of this to "git am"? Is there something like patchwork for people who are subscribed write-only or not at all?
which conflicts with this and the followup patch. Hopefully between you, Peter, and Dan something can be worked out.
Also, it seems starting at patch 7 there's more conflicts with the top of the upstream, so I couldn't 'git am -3' them into a local branch.
John
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 05cbda2..65f718c 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -4895,6 +4895,10 @@ qemuProcessLaunch(virConnectPtr conn, if (qemuSetupCgroup(driver, vm, nnicindexes, nicindexes) < 0) goto cleanup;
+ VIR_DEBUG("Setting cgroup for emulator (if required)"); + if (qemuSetupCgroupForEmulator(vm) < 0) + goto cleanup; + /* This must be done after cgroup placement to avoid resetting CPU * affinity */ if (!vm->def->cputune.emulatorpin && @@ -4943,10 +4947,6 @@ qemuProcessLaunch(virConnectPtr conn, if (rv == -1) /* The VM failed to start */ goto cleanup;
- VIR_DEBUG("Setting cgroup for emulator (if required)"); - if (qemuSetupCgroupForEmulator(vm) < 0) - goto cleanup; - VIR_DEBUG("Setting affinity of emulator threads"); if (qemuProcessSetEmulatorAffinity(vm) < 0) goto cleanup;

On Tue, Feb 23, 2016 at 04:58:40PM +0100, Henning Schild wrote:
Move qemuSetupCgroupForEmulator up under qemuSetupCgroup. That way we move the one main thread right into the emulator cgroup, instead of moving multiple threads later on. And we do not actually want any threads running in the parent cgroups (cpu cpuacct cpuset).
Signed-off-by: Henning Schild <henning.schild@siemens.com> --- src/qemu/qemu_process.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)
ACK, this is the key part of the fix. With the old code the QEMU pids are only moved /after/ exec(), with this change, the pids are moved /before/ exec(), fixing the core problem of threads runing int the wrong place between 'exec()' and libvirt querying vCPUs.
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 05cbda2..65f718c 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -4895,6 +4895,10 @@ qemuProcessLaunch(virConnectPtr conn, if (qemuSetupCgroup(driver, vm, nnicindexes, nicindexes) < 0) goto cleanup;
+ VIR_DEBUG("Setting cgroup for emulator (if required)"); + if (qemuSetupCgroupForEmulator(vm) < 0) + goto cleanup; + /* This must be done after cgroup placement to avoid resetting CPU * affinity */ if (!vm->def->cputune.emulatorpin && @@ -4943,10 +4947,6 @@ qemuProcessLaunch(virConnectPtr conn, if (rv == -1) /* The VM failed to start */ goto cleanup;
- VIR_DEBUG("Setting cgroup for emulator (if required)"); - if (qemuSetupCgroupForEmulator(vm) < 0) - goto cleanup; - VIR_DEBUG("Setting affinity of emulator threads"); if (qemuProcessSetEmulatorAffinity(vm) < 0) goto cleanup; -- 2.4.10
Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On Fri, 26 Feb 2016 11:14:31 +0000 "Daniel P. Berrange" <berrange@redhat.com> wrote:
On Tue, Feb 23, 2016 at 04:58:40PM +0100, Henning Schild wrote:
Move qemuSetupCgroupForEmulator up under qemuSetupCgroup. That way we move the one main thread right into the emulator cgroup, instead of moving multiple threads later on. And we do not actually want any threads running in the parent cgroups (cpu cpuacct cpuset).
Signed-off-by: Henning Schild <henning.schild@siemens.com> --- src/qemu/qemu_process.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)
ACK, this is the key part of the fix. With the old code the QEMU pids are only moved /after/ exec(), with this change, the pids are moved /before/ exec(), fixing the core problem of threads runing int the wrong place between 'exec()' and libvirt querying vCPUs.
If the asserts wont work and we have to live with the task being in the parent cgroup between fork() and exec() we need to make sure the new process is truly inactive. We need to make sure we are not just making the window smaller.
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 05cbda2..65f718c 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -4895,6 +4895,10 @@ qemuProcessLaunch(virConnectPtr conn, if (qemuSetupCgroup(driver, vm, nnicindexes, nicindexes) < 0) goto cleanup;
+ VIR_DEBUG("Setting cgroup for emulator (if required)"); + if (qemuSetupCgroupForEmulator(vm) < 0) + goto cleanup; + /* This must be done after cgroup placement to avoid resetting CPU * affinity */ if (!vm->def->cputune.emulatorpin && @@ -4943,10 +4947,6 @@ qemuProcessLaunch(virConnectPtr conn, if (rv == -1) /* The VM failed to start */ goto cleanup;
- VIR_DEBUG("Setting cgroup for emulator (if required)"); - if (qemuSetupCgroupForEmulator(vm) < 0) - goto cleanup; - VIR_DEBUG("Setting affinity of emulator threads"); if (qemuProcessSetEmulatorAffinity(vm) < 0) goto cleanup; -- 2.4.10
Regards, Daniel

qemuSetupCgroupForEmulator runs at a point in time where there is only the qemu main thread. Use virCgroupAddTask to put just that one task into the emulator cgroup. That patch makes virCgroupMoveTask and virCgroupAddTaskStrController obsolete. Signed-off-by: Henning Schild <henning.schild@siemens.com> --- src/libvirt_private.syms | 1 - src/qemu/qemu_cgroup.c | 4 +- src/util/vircgroup.c | 102 ----------------------------------------------- src/util/vircgroup.h | 3 -- 4 files changed, 2 insertions(+), 108 deletions(-) diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index cf93d06..c5e57bf 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -1234,7 +1234,6 @@ virCgroupIsolateMount; virCgroupKill; virCgroupKillPainfully; virCgroupKillRecursive; -virCgroupMoveTask; virCgroupNewDetect; virCgroupNewDetectMachine; virCgroupNewDomainPartition; diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c index 66dc782..d410a66 100644 --- a/src/qemu/qemu_cgroup.c +++ b/src/qemu/qemu_cgroup.c @@ -1145,8 +1145,8 @@ qemuSetupCgroupForEmulator(virDomainObjPtr vm) true, &cgroup_emulator) < 0) goto cleanup; - if (virCgroupMoveTask(priv->cgroup, cgroup_emulator) < 0) - goto cleanup; + if (virCgroupAddTask(cgroup_emulator, vm->pid) < 0) + goto cleanup; if (def->cputune.emulatorpin) cpumask = def->cputune.emulatorpin; diff --git a/src/util/vircgroup.c b/src/util/vircgroup.c index aef8e8c..c31c83b 100644 --- a/src/util/vircgroup.c +++ b/src/util/vircgroup.c @@ -1209,50 +1209,6 @@ virCgroupAddTaskController(virCgroupPtr group, pid_t pid, int controller) } -static int -virCgroupAddTaskStrController(virCgroupPtr group, - const char *pidstr, - int controller) -{ - char *str = NULL, *cur = NULL, *next = NULL; - unsigned long long p = 0; - int rc = 0; - char *endp; - - if (VIR_STRDUP(str, pidstr) < 0) - return -1; - - cur = str; - while (*cur != '\0') { - if (virStrToLong_ull(cur, &endp, 10, &p) < 0) { - virReportError(VIR_ERR_INTERNAL_ERROR, - _("Cannot parse '%s' as an integer"), cur); - goto cleanup; - } - - if (virCgroupAddTaskController(group, p, controller) < 0) { - /* A thread that exits between when we first read the source - * tasks and now is not fatal. */ - if (virLastErrorIsSystemErrno(ESRCH)) - virResetLastError(); - else - goto cleanup; - } - - next = strchr(cur, '\n'); - if (next) { - cur = next + 1; - *next = '\0'; - } else { - break; - } - } - - cleanup: - VIR_FREE(str); - return rc; -} - void virCgroupSetAssertEmpty(virCgroupPtr group, int mask) { group->assert_empty = mask; @@ -1264,54 +1220,6 @@ virCgroupGetAssertEmpty(virCgroupPtr group) { } -/** - * virCgroupMoveTask: - * - * @src_group: The source cgroup where all tasks are removed from - * @dest_group: The destination where all tasks are added to - * - * Returns: 0 on success or -1 on failure - */ -int -virCgroupMoveTask(virCgroupPtr src_group, virCgroupPtr dest_group) -{ - int ret = -1; - char *content = NULL; - size_t i; - - for (i = 0; i < VIR_CGROUP_CONTROLLER_LAST; i++) { - if (!src_group->controllers[i].mountPoint || - !dest_group->controllers[i].mountPoint) - continue; - - /* We must never move tasks in systemd's hierarchy */ - if (i == VIR_CGROUP_CONTROLLER_SYSTEMD) - continue; - - /* New threads are created in the same group as their parent; - * but if a thread is created after we first read we aren't - * aware that it needs to move. Therefore, we must iterate - * until content is empty. */ - while (1) { - VIR_FREE(content); - if (virCgroupGetValueStr(src_group, i, "tasks", &content) < 0) - return -1; - - if (!*content) - break; - - if (virCgroupAddTaskStrController(dest_group, content, i) < 0) - goto cleanup; - } - } - - ret = 0; - cleanup: - VIR_FREE(content); - return ret; -} - - static int virCgroupSetPartitionSuffix(const char *path, char **res) { @@ -4309,16 +4217,6 @@ virCgroupAddTaskController(virCgroupPtr group ATTRIBUTE_UNUSED, int -virCgroupMoveTask(virCgroupPtr src_group ATTRIBUTE_UNUSED, - virCgroupPtr dest_group ATTRIBUTE_UNUSED) -{ - virReportSystemError(ENXIO, "%s", - _("Control groups not supported on this platform")); - return -1; -} - - -int virCgroupGetBlkioIoServiced(virCgroupPtr group ATTRIBUTE_UNUSED, long long *bytes_read ATTRIBUTE_UNUSED, long long *bytes_write ATTRIBUTE_UNUSED, diff --git a/src/util/vircgroup.h b/src/util/vircgroup.h index f71aed5..0286d56 100644 --- a/src/util/vircgroup.h +++ b/src/util/vircgroup.h @@ -141,9 +141,6 @@ int virCgroupAddTaskController(virCgroupPtr group, void virCgroupSetAssertEmpty(virCgroupPtr group, int mask); int virCgroupGetAssertEmpty(virCgroupPtr group); -int virCgroupMoveTask(virCgroupPtr src_group, - virCgroupPtr dest_group); - int virCgroupSetBlkioWeight(virCgroupPtr group, unsigned int weight); int virCgroupGetBlkioWeight(virCgroupPtr group, unsigned int *weight); -- 2.4.10

Is it ok do drop unused code in the same patch that makes the code obsolete, or should i split that up? On Tue, 23 Feb 2016 16:58:41 +0100 Henning Schild <henning.schild@siemens.com> wrote:
qemuSetupCgroupForEmulator runs at a point in time where there is only the qemu main thread. Use virCgroupAddTask to put just that one task into the emulator cgroup. That patch makes virCgroupMoveTask and virCgroupAddTaskStrController obsolete.
Signed-off-by: Henning Schild <henning.schild@siemens.com> --- src/libvirt_private.syms | 1 - src/qemu/qemu_cgroup.c | 4 +- src/util/vircgroup.c | 102 ----------------------------------------------- src/util/vircgroup.h | 3 -- 4 files changed, 2 insertions(+), 108 deletions(-)
diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index cf93d06..c5e57bf 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -1234,7 +1234,6 @@ virCgroupIsolateMount; virCgroupKill; virCgroupKillPainfully; virCgroupKillRecursive; -virCgroupMoveTask; virCgroupNewDetect; virCgroupNewDetectMachine; virCgroupNewDomainPartition; diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c index 66dc782..d410a66 100644 --- a/src/qemu/qemu_cgroup.c +++ b/src/qemu/qemu_cgroup.c @@ -1145,8 +1145,8 @@ qemuSetupCgroupForEmulator(virDomainObjPtr vm) true, &cgroup_emulator) < 0) goto cleanup;
- if (virCgroupMoveTask(priv->cgroup, cgroup_emulator) < 0) - goto cleanup; + if (virCgroupAddTask(cgroup_emulator, vm->pid) < 0) + goto cleanup;
if (def->cputune.emulatorpin) cpumask = def->cputune.emulatorpin; diff --git a/src/util/vircgroup.c b/src/util/vircgroup.c index aef8e8c..c31c83b 100644 --- a/src/util/vircgroup.c +++ b/src/util/vircgroup.c @@ -1209,50 +1209,6 @@ virCgroupAddTaskController(virCgroupPtr group, pid_t pid, int controller) }
-static int -virCgroupAddTaskStrController(virCgroupPtr group, - const char *pidstr, - int controller) -{ - char *str = NULL, *cur = NULL, *next = NULL; - unsigned long long p = 0; - int rc = 0; - char *endp; - - if (VIR_STRDUP(str, pidstr) < 0) - return -1; - - cur = str; - while (*cur != '\0') { - if (virStrToLong_ull(cur, &endp, 10, &p) < 0) { - virReportError(VIR_ERR_INTERNAL_ERROR, - _("Cannot parse '%s' as an integer"), cur); - goto cleanup; - } - - if (virCgroupAddTaskController(group, p, controller) < 0) { - /* A thread that exits between when we first read the source - * tasks and now is not fatal. */ - if (virLastErrorIsSystemErrno(ESRCH)) - virResetLastError(); - else - goto cleanup; - } - - next = strchr(cur, '\n'); - if (next) { - cur = next + 1; - *next = '\0'; - } else { - break; - } - } - - cleanup: - VIR_FREE(str); - return rc; -} - void virCgroupSetAssertEmpty(virCgroupPtr group, int mask) { group->assert_empty = mask; @@ -1264,54 +1220,6 @@ virCgroupGetAssertEmpty(virCgroupPtr group) { }
-/** - * virCgroupMoveTask: - * - * @src_group: The source cgroup where all tasks are removed from - * @dest_group: The destination where all tasks are added to - * - * Returns: 0 on success or -1 on failure - */ -int -virCgroupMoveTask(virCgroupPtr src_group, virCgroupPtr dest_group) -{ - int ret = -1; - char *content = NULL; - size_t i; - - for (i = 0; i < VIR_CGROUP_CONTROLLER_LAST; i++) { - if (!src_group->controllers[i].mountPoint || - !dest_group->controllers[i].mountPoint) - continue; - - /* We must never move tasks in systemd's hierarchy */ - if (i == VIR_CGROUP_CONTROLLER_SYSTEMD) - continue; - - /* New threads are created in the same group as their parent; - * but if a thread is created after we first read we aren't - * aware that it needs to move. Therefore, we must iterate - * until content is empty. */ - while (1) { - VIR_FREE(content); - if (virCgroupGetValueStr(src_group, i, "tasks", &content) < 0) - return -1; - - if (!*content) - break; - - if (virCgroupAddTaskStrController(dest_group, content, i) < 0) - goto cleanup; - } - } - - ret = 0; - cleanup: - VIR_FREE(content); - return ret; -} - - static int virCgroupSetPartitionSuffix(const char *path, char **res) { @@ -4309,16 +4217,6 @@ virCgroupAddTaskController(virCgroupPtr group ATTRIBUTE_UNUSED,
int -virCgroupMoveTask(virCgroupPtr src_group ATTRIBUTE_UNUSED, - virCgroupPtr dest_group ATTRIBUTE_UNUSED) -{ - virReportSystemError(ENXIO, "%s", - _("Control groups not supported on this platform")); - return -1; -} - - -int virCgroupGetBlkioIoServiced(virCgroupPtr group ATTRIBUTE_UNUSED, long long *bytes_read ATTRIBUTE_UNUSED, long long *bytes_write ATTRIBUTE_UNUSED, diff --git a/src/util/vircgroup.h b/src/util/vircgroup.h index f71aed5..0286d56 100644 --- a/src/util/vircgroup.h +++ b/src/util/vircgroup.h @@ -141,9 +141,6 @@ int virCgroupAddTaskController(virCgroupPtr group, void virCgroupSetAssertEmpty(virCgroupPtr group, int mask); int virCgroupGetAssertEmpty(virCgroupPtr group);
-int virCgroupMoveTask(virCgroupPtr src_group, - virCgroupPtr dest_group); - int virCgroupSetBlkioWeight(virCgroupPtr group, unsigned int weight); int virCgroupGetBlkioWeight(virCgroupPtr group, unsigned int *weight);

On Fri, 2016-02-26 at 11:27 +0100, Henning Schild wrote:
Is it ok do drop unused code in the same patch that makes the code obsolete, or should i split that up?
I think it's generally okay to go either way; on the other hand, sometimes you won't even have a choice because the compiler will complain about unused functions. Cheers. --Â Andrea Bolognani Software Engineer - Virtualization Team

On Fri, Feb 26, 2016 at 11:27:41AM +0100, Henning Schild wrote:
Is it ok do drop unused code in the same patch that makes the code obsolete, or should i split that up?
It is fine as is in this case. Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On Tue, Feb 23, 2016 at 04:58:41PM +0100, Henning Schild wrote:
qemuSetupCgroupForEmulator runs at a point in time where there is only the qemu main thread. Use virCgroupAddTask to put just that one task into the emulator cgroup. That patch makes virCgroupMoveTask and virCgroupAddTaskStrController obsolete.
Signed-off-by: Henning Schild <henning.schild@siemens.com> --- src/libvirt_private.syms | 1 - src/qemu/qemu_cgroup.c | 4 +- src/util/vircgroup.c | 102 ----------------------------------------------- src/util/vircgroup.h | 3 -- 4 files changed, 2 insertions(+), 108 deletions(-)
ACK, its nice to kill this MoveTask method. Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

Add a way to exclude controllers from virCgroupAddTask. In a cgroup hierarchy the parent might have controllers just to allow children cgroups to inherit them, not necessarily to put any tasks in them. Signed-off-by: Henning Schild <henning.schild@siemens.com> --- src/lxc/lxc_cgroup.c | 2 +- src/lxc/lxc_controller.c | 4 ++-- src/qemu/qemu_cgroup.c | 8 ++++---- src/qemu/qemu_driver.c | 2 +- src/util/vircgroup.c | 8 ++++++-- src/util/vircgroup.h | 2 +- 6 files changed, 15 insertions(+), 11 deletions(-) diff --git a/src/lxc/lxc_cgroup.c b/src/lxc/lxc_cgroup.c index 609e9ea..9b91dd2 100644 --- a/src/lxc/lxc_cgroup.c +++ b/src/lxc/lxc_cgroup.c @@ -504,7 +504,7 @@ virCgroupPtr virLXCCgroupCreate(virDomainDefPtr def, &cgroup) < 0) goto cleanup; - if (virCgroupAddTask(cgroup, initpid) < 0) { + if (virCgroupAddTask(cgroup, initpid, -1) < 0) { virErrorPtr saved = virSaveLastError(); virCgroupRemove(cgroup); virCgroupFree(&cgroup); diff --git a/src/lxc/lxc_controller.c b/src/lxc/lxc_controller.c index 438103a..b1fe8fa 100644 --- a/src/lxc/lxc_controller.c +++ b/src/lxc/lxc_controller.c @@ -863,12 +863,12 @@ static int virLXCControllerSetupCgroupLimits(virLXCControllerPtr ctrl) ctrl->nicindexes))) goto cleanup; - if (virCgroupAddTask(ctrl->cgroup, getpid()) < 0) + if (virCgroupAddTask(ctrl->cgroup, getpid(), -1) < 0) goto cleanup; /* Add all qemu-nbd tasks to the cgroup */ for (i = 0; i < ctrl->nnbdpids; i++) { - if (virCgroupAddTask(ctrl->cgroup, ctrl->nbdpids[i]) < 0) + if (virCgroupAddTask(ctrl->cgroup, ctrl->nbdpids[i], -1) < 0) goto cleanup; } diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c index d410a66..41a583c 100644 --- a/src/qemu/qemu_cgroup.c +++ b/src/qemu/qemu_cgroup.c @@ -789,7 +789,7 @@ qemuInitCgroup(virQEMUDriverPtr driver, goto cleanup; } - if (virCgroupAddTask(priv->cgroup, vm->pid) < 0) { + if (virCgroupAddTask(priv->cgroup, vm->pid, -1) < 0) { virErrorPtr saved = virSaveLastError(); virCgroupRemove(priv->cgroup); virCgroupFree(&priv->cgroup); @@ -1096,7 +1096,7 @@ qemuSetupCgroupForVcpu(virDomainObjPtr vm) /* move the thread for vcpu to sub dir */ if (virCgroupAddTask(cgroup_vcpu, - qemuDomainGetVcpuPid(vm, i)) < 0) + qemuDomainGetVcpuPid(vm, i), -1) < 0) goto cleanup; } @@ -1145,7 +1145,7 @@ qemuSetupCgroupForEmulator(virDomainObjPtr vm) true, &cgroup_emulator) < 0) goto cleanup; - if (virCgroupAddTask(cgroup_emulator, vm->pid) < 0) + if (virCgroupAddTask(cgroup_emulator, vm->pid, -1) < 0) goto cleanup; if (def->cputune.emulatorpin) @@ -1255,7 +1255,7 @@ qemuSetupCgroupForIOThreads(virDomainObjPtr vm) /* move the thread for iothread to sub dir */ if (virCgroupAddTask(cgroup_iothread, - def->iothreadids[i]->thread_id) < 0) + def->iothreadids[i]->thread_id, -1) < 0) goto cleanup; virCgroupFree(&cgroup_iothread); diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index 8ccf68b..c0b840b 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -4583,7 +4583,7 @@ qemuDomainAddCgroupForThread(virCgroupPtr cgroup, goto error; /* Add pid/thread to the cgroup */ - rv = virCgroupAddTask(new_cgroup, pid); + rv = virCgroupAddTask(new_cgroup, pid, -1); if (rv < 0) { virCgroupRemove(new_cgroup); goto error; diff --git a/src/util/vircgroup.c b/src/util/vircgroup.c index c31c83b..bbc88f3 100644 --- a/src/util/vircgroup.c +++ b/src/util/vircgroup.c @@ -1142,16 +1142,19 @@ virCgroupNew(pid_t pid, * * @group: The cgroup to add a task to * @pid: The pid of the task to add + * @controllers: mask of controllers to operate on * * Returns: 0 on success, -1 on error */ int -virCgroupAddTask(virCgroupPtr group, pid_t pid) +virCgroupAddTask(virCgroupPtr group, pid_t pid, int controllers) { int ret = -1; size_t i; for (i = 0; i < VIR_CGROUP_CONTROLLER_LAST; i++) { + if (((controllers & (1 << i)) == 0)) + continue; /* Skip over controllers not mounted */ if (!group->controllers[i].mountPoint) continue; @@ -4197,7 +4200,8 @@ virCgroupPathOfController(virCgroupPtr group ATTRIBUTE_UNUSED, int virCgroupAddTask(virCgroupPtr group ATTRIBUTE_UNUSED, - pid_t pid ATTRIBUTE_UNUSED) + pid_t pid ATTRIBUTE_UNUSED, + int controllers ATTRIBUTE_UNUSED) { virReportSystemError(ENXIO, "%s", _("Control groups not supported on this platform")); diff --git a/src/util/vircgroup.h b/src/util/vircgroup.h index 0286d56..d9c5100 100644 --- a/src/util/vircgroup.h +++ b/src/util/vircgroup.h @@ -132,7 +132,7 @@ int virCgroupPathOfController(virCgroupPtr group, const char *key, char **path); -int virCgroupAddTask(virCgroupPtr group, pid_t pid); +int virCgroupAddTask(virCgroupPtr group, pid_t pid, int controllers); int virCgroupAddTaskController(virCgroupPtr group, pid_t pid, -- 2.4.10

Eventually all qemu threads are supposed to be in the final cgroup structure in one of the leaf cgroups (vcpuX, emulator). They should never be in the main machine cgroup. That is for all thread related controllers like cpuset, cpu, and cpuacct. By excluding the threading related controllers we do not put the task into the machine cgroup controllers just to move it out to the emulator subgroup later. Signed-off-by: Henning Schild <henning.schild@siemens.com> --- src/qemu/qemu_cgroup.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c index 41a583c..99fb5bf 100644 --- a/src/qemu/qemu_cgroup.c +++ b/src/qemu/qemu_cgroup.c @@ -789,7 +789,8 @@ qemuInitCgroup(virQEMUDriverPtr driver, goto cleanup; } - if (virCgroupAddTask(priv->cgroup, vm->pid, -1) < 0) { + if (virCgroupAddTask(priv->cgroup, vm->pid, + ~VIR_CGROUP_THREAD_CONTROLLER_MASK) < 0) { virErrorPtr saved = virSaveLastError(); virCgroupRemove(priv->cgroup); virCgroupFree(&priv->cgroup); -- 2.4.10

On Tue, Feb 23, 2016 at 04:58:43PM +0100, Henning Schild wrote:
Eventually all qemu threads are supposed to be in the final cgroup structure in one of the leaf cgroups (vcpuX, emulator). They should never be in the main machine cgroup. That is for all thread related controllers like cpuset, cpu, and cpuacct. By excluding the threading related controllers we do not put the task into the machine cgroup controllers just to move it out to the emulator subgroup later.
Signed-off-by: Henning Schild <henning.schild@siemens.com> --- src/qemu/qemu_cgroup.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
NACK, because your earlier patch that this relies on does not work Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

Make sure the thread related controls of the machine cgroup never get any tasks assigned. Signed-off-by: Henning Schild <henning.schild@siemens.com> --- src/qemu/qemu_cgroup.c | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c index 99fb5bf..c827787 100644 --- a/src/qemu/qemu_cgroup.c +++ b/src/qemu/qemu_cgroup.c @@ -740,6 +740,7 @@ qemuInitCgroup(virQEMUDriverPtr driver, int *nicindexes) { int ret = -1; + int assert_empty, controllers; qemuDomainObjPrivatePtr priv = vm->privateData; virQEMUDriverConfigPtr cfg = virQEMUDriverGetConfig(driver); @@ -789,8 +790,17 @@ qemuInitCgroup(virQEMUDriverPtr driver, goto cleanup; } - if (virCgroupAddTask(priv->cgroup, vm->pid, - ~VIR_CGROUP_THREAD_CONTROLLER_MASK) < 0) { + /* + * the child cgroups for emulator, vcpu, and io -threads contain + * all qemu threads for the following controllers, the parent + * group has to stay empty. + */ + controllers = VIR_CGROUP_THREAD_CONTROLLER_MASK; + assert_empty = virCgroupGetAssertEmpty(priv->cgroup); + assert_empty |= controllers; + virCgroupSetAssertEmpty(priv->cgroup, assert_empty); + + if (virCgroupAddTask(priv->cgroup, vm->pid, ~controllers) < 0) { virErrorPtr saved = virSaveLastError(); virCgroupRemove(priv->cgroup); virCgroupFree(&priv->cgroup); -- 2.4.10

On Tue, Feb 23, 2016 at 04:58:44PM +0100, Henning Schild wrote:
Make sure the thread related controls of the machine cgroup never get any tasks assigned.
Signed-off-by: Henning Schild <henning.schild@siemens.com> --- src/qemu/qemu_cgroup.c | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-)
NACK This also won't work for same reason as previous patch Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On Fri, 26 Feb 2016 11:22:07 +0000 "Daniel P. Berrange" <berrange@redhat.com> wrote:
On Tue, Feb 23, 2016 at 04:58:44PM +0100, Henning Schild wrote:
Make sure the thread related controls of the machine cgroup never get any tasks assigned.
Signed-off-by: Henning Schild <henning.schild@siemens.com> --- src/qemu/qemu_cgroup.c | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-)
NACK This also won't work for same reason as previous patch
Having that in place can still be useful after we have sorted out the random result of what systemd gave us. Is the general idea of such an assertion a good idea, and should i adopt it according to comments? At the moment i just used the assertion mask in the only code-path that adds tasks within libvirt. If we have to deal with manipulation from the outside, it might be a good idea to introduce more assertions based on the mask. Henning

On Fri, Feb 26, 2016 at 12:57:38PM +0100, Henning Schild wrote:
On Fri, 26 Feb 2016 11:22:07 +0000 "Daniel P. Berrange" <berrange@redhat.com> wrote:
On Tue, Feb 23, 2016 at 04:58:44PM +0100, Henning Schild wrote:
Make sure the thread related controls of the machine cgroup never get any tasks assigned.
Signed-off-by: Henning Schild <henning.schild@siemens.com> --- src/qemu/qemu_cgroup.c | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-)
NACK This also won't work for same reason as previous patch
Having that in place can still be useful after we have sorted out the random result of what systemd gave us. Is the general idea of such an assertion a good idea, and should i adopt it according to comments? At the moment i just used the assertion mask in the only code-path that adds tasks within libvirt. If we have to deal with manipulation from the outside, it might be a good idea to introduce more assertions based on the mask.
Without patch 4 though, there's nowhere you can put this afaict. Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On Fri, 26 Feb 2016 12:02:52 +0000 "Daniel P. Berrange" <berrange@redhat.com> wrote:
On Fri, Feb 26, 2016 at 12:57:38PM +0100, Henning Schild wrote:
On Fri, 26 Feb 2016 11:22:07 +0000 "Daniel P. Berrange" <berrange@redhat.com> wrote:
On Tue, Feb 23, 2016 at 04:58:44PM +0100, Henning Schild wrote:
Make sure the thread related controls of the machine cgroup never get any tasks assigned.
Signed-off-by: Henning Schild <henning.schild@siemens.com> --- src/qemu/qemu_cgroup.c | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-)
NACK This also won't work for same reason as previous patch
Having that in place can still be useful after we have sorted out the random result of what systemd gave us. Is the general idea of such an assertion a good idea, and should i adopt it according to comments? At the moment i just used the assertion mask in the only code-path that adds tasks within libvirt. If we have to deal with manipulation from the outside, it might be a good idea to introduce more assertions based on the mask.
Without patch 4 though, there's nowhere you can put this afaict.
After moving the pid to the emulator cgroup, i can assert that the parent is now empty and then put it in place. It would assert libvirt itself does not use the parent group somewhen in the future.
Regards, Daniel

John, thanks for the review so far. Since i will have to rebase and check conflicts with other queued patches i would like to also get a review of the semantics of the overall series. Or do i first have to get the changes past a CI testsuit? Henning On Tue, 23 Feb 2016 16:58:33 +0100 Henning Schild <henning.schild@siemens.com> wrote:
This series picks up the cgroups work i started earlier. My initial patches got in and later reverted before 1.3.1.
The problem the series is solving is about qemu-threads becoming runnable on pcpus outside the pinning masks configured for the machine. That only happens for a short time before the thread is moved to its final cpuset. But it can disturb other load on the system or can lead to qemu never starting. (qemu main thread ends up on a pcpu with busy high prio rt-task).
The problem in the original series was the lack of understanding that one virCgroup can cover all controllers. Instead of just touching cpusets the patches had side effects on all the other controllers (memory, blkio etc.) Again the general idea is to put all threads right into the correct cgroups and to not move them around. But this series touches only the cpu, cpuset, and cpuacct controllers. That are the ones relevant to threads and that are the controllers the threading sub-groups have mounted.
Patches 1, 2, and 9 deal with asserting correct behaviour. They are optional. But given the complexity of the "bringup" and the importance of getting that right, i think they should go in as well!
The tricky bits are in patches 5 and 8, i kept them as simple as possible.
The series is based on v1.3.1.
Henning Schild (9): vircgroup: one central point for adding tasks to cgroups vircgroup: add assertion to allow cgroup controllers to stay empty vircgroup: introduce controller mask for threads util: cgroups do not implicitly add task to new machine cgroup qemu_cgroup: put qemu right into emulator sub-cgroup qemu_cgroup: use virCgroupAddTask instead of virCgroupMoveTask vircgroup: add controller mask to virCgroupAddTask qemu_cgroup: dont put qemu main thread into wrong cgroup qemu_cgroup: assert threading cgroup layout for machine cgroup
src/libvirt_private.syms | 3 +- src/lxc/lxc_cgroup.c | 11 ++++ src/lxc/lxc_controller.c | 4 +- src/qemu/qemu_cgroup.c | 30 +++++++-- src/qemu/qemu_driver.c | 2 +- src/qemu/qemu_process.c | 8 +-- src/util/vircgroup.c | 155 ++++++++--------------------------------------- src/util/vircgroup.h | 13 +++- src/util/vircgrouppriv.h | 1 + 9 files changed, 81 insertions(+), 146 deletions(-)
participants (4)
-
Andrea Bolognani
-
Daniel P. Berrange
-
Henning Schild
-
John Ferlan