[libvirt] [PATCH 0/6] Few NUMA fixes

The heart of the patchset are the last two patches. Long story short, if a domain is configured to be pinned onto a set of NUMA nodes stricly, we use both CGroups and numa_set_membind(). While we can change the former later on a user's wish, we can't change the latter. Therefore, any subsequent memory enlargement (e.g. via virsh setmem $dom) will result in memory still being allocated from old NUMA nodes and OOM killer interference is likely to occur. Michal Privoznik (6): virCgroupNewPartition: Fix comment virCgroupNew: Enhance debug message virCgroupController: Check the enum fits into 'int' qemuDomainGetNumaParameters: Check for the correct CGroup controller qemuProcessHook: Call virNuma*() iff needed virLXCControllerSetupResourceLimits: Call virNuma*() iff needed src/lxc/lxc_controller.c | 31 +++++++++++++++++++++++++------ src/qemu/qemu_driver.c | 2 +- src/qemu/qemu_process.c | 32 ++++++++++++++++++++++++++++---- src/util/vircgroup.c | 6 +++--- src/util/vircgroup.h | 5 +++++ 5 files changed, 62 insertions(+), 14 deletions(-) -- 2.0.5

The function has no argument named @name rather than @path instead. The comment is, however, referring to @name while it should have been referring to @path really. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/util/vircgroup.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/util/vircgroup.c b/src/util/vircgroup.c index 0a2e729..660f840 100644 --- a/src/util/vircgroup.c +++ b/src/util/vircgroup.c @@ -1290,7 +1290,7 @@ virCgroupSetPartitionSuffix(const char *path, char **res) * @controllers: mask of controllers to create * * Creates a new cgroup to represent the resource - * partition path identified by @name. + * partition path identified by @path. * * Returns 0 on success, -1 on failure */ -- 2.0.5

When creating new internal representation of cgroups, all passed arguments are logged. Well, except for two: pid and pointer for return value. Lets log them too. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/util/vircgroup.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/src/util/vircgroup.c b/src/util/vircgroup.c index 660f840..0ae99c9 100644 --- a/src/util/vircgroup.c +++ b/src/util/vircgroup.c @@ -1050,8 +1050,8 @@ virCgroupNew(pid_t pid, int controllers, virCgroupPtr *group) { - VIR_DEBUG("parent=%p path=%s controllers=%d", - parent, path, controllers); + VIR_DEBUG("pid=%lld path=%s parent=%p controllers=%d group=%p", + (long long) pid, path, parent, controllers, group); *group = NULL; if (VIR_ALLOC((*group)) < 0) -- 2.0.5

Throughout our code, the virCgroupController enum is used in two ways. First as an index to an array of cgroup controllers: struct virCgroup { char *path; struct virCgroupController controllers[VIR_CGROUP_CONTROLLER_LAST]; }; Second way is that when calling virCgroupNew() a bitmask of the enum items can be passed to selectively detect only some controllers. For instance: int virCgroupNewVcpu(virCgroupPtr domain, int vcpuid, bool create, virCgroupPtr *group) { ... controllers = ((1 << VIR_CGROUP_CONTROLLER_CPU) | (1 << VIR_CGROUP_CONTROLLER_CPUACCT) | (1 << VIR_CGROUP_CONTROLLER_CPUSET)); if (virCgroupNew(-1, name, domain, controllers, group) < 0) goto cleanup; } Even though it's highly unlikely that so many new controllers will be invented so that we would overflow when constructing the bitmask, it doesn't hurt to check at compile time either. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/util/vircgroup.h | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/src/util/vircgroup.h b/src/util/vircgroup.h index bfb3a9b..eee15ca 100644 --- a/src/util/vircgroup.h +++ b/src/util/vircgroup.h @@ -46,6 +46,11 @@ enum { }; VIR_ENUM_DECL(virCgroupController); +/* Items of this enum are used later in virCgroupNew to create + * bit array stored in int. Like this: + * 1 << VIR_CGROUP_CONTROLLER_CPU + * Make sure we will not overflow */ +verify(VIR_CGROUP_CONTROLLER_LAST < 8 * sizeof(int)); bool virCgroupAvailable(void); -- 2.0.5

When getting info on NUMA parameters for domain, virCgroupGetCpusetMems() may be called. However, as of 43b67f2e the call is guarded by check if memory controller is present. Even though it may be not obvious instantly, NUMA parameters are stored under cpuset controller. Therefore the check needs to look like this: if (!virCgroupHasController(priv->cgroup, VIR_CGROUP_CONTROLLER_CPUSET) || virCgroupGetCpusetMems(priv->cgroup, &nodeset) < 0) { Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_driver.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index f07e4fb..4d05221 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -10187,7 +10187,7 @@ qemuDomainGetNumaParameters(virDomainPtr dom, goto cleanup; } else { if (!virCgroupHasController(priv->cgroup, - VIR_CGROUP_CONTROLLER_MEMORY) || + VIR_CGROUP_CONTROLLER_CPUSET) || virCgroupGetCpusetMems(priv->cgroup, &nodeset) < 0) { nodeset = virDomainNumatuneFormatNodeset(vm->def->numa, NULL, -1); -- 2.0.5

Once upon a time, there was a little domain. And the domain was pinned onto a NUMA node and hasn't fully allocated its memory: <memory unit='KiB'>2355200</memory> <currentMemory unit='KiB'>1048576</currentMemory> <numatune> <memory mode='strict' nodeset='0'/> </numatune> Oh little me, said the domain, what will I do with so few memory. If I only had a few megabytes more. But the old admin noticed whimpering, barely audible to untrained human ear. And good admin he was, he gave the domain yet more memory. But the old NUMA topology witch forbidden to allocate more memory on the node zero. So he decided to allocate it on a different node: virsh # numatune little_domain --nodeset 0-1 virsh # setmem little_domain 2355200 The little domain was happy. For a while. Until bad, sharp teeth shaped creature came. Every process in the system was afraid of him. The OOM Killer they called him. Oh no, he's after the little domain. There's no escape. Do you kids know why? Because when the little domain was born, her father, Libvirt, called numa_set_membind(). So even if the admin allowed her to allocate memory from other nodes in the cgroups, the membind() forbid it. So what's the lesson? Libvirt should rely on cgroups, whenever possible and use numa_set_membind() as the last ditch effort. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_process.c | 32 ++++++++++++++++++++++++++++---- 1 file changed, 28 insertions(+), 4 deletions(-) diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 79f763e..cba042d 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -3154,6 +3154,7 @@ static int qemuProcessHook(void *data) int fd; virBitmapPtr nodeset = NULL; virDomainNumatuneMemMode mode; + bool doNuma = true; /* This method cannot use any mutexes, which are not * protected across fork() @@ -3185,11 +3186,34 @@ static int qemuProcessHook(void *data) goto cleanup; mode = virDomainNumatuneGetMode(h->vm->def->numa, -1); - nodeset = virDomainNumatuneGetNodeset(h->vm->def->numa, - priv->autoNodeset, -1); - if (virNumaSetupMemoryPolicy(mode, nodeset) < 0) - goto cleanup; + if (mode == VIR_DOMAIN_NUMATUNE_MEM_STRICT) { + virCgroupPtr cgroup = NULL; + + /* Create dummy cgroup ... */ + if (virCgroupNewSelf(&cgroup) < 0) + goto cleanup; + + /* ... just to detect if cpuset cgroup is present */ + if (virCgroupHasController(cgroup, VIR_CGROUP_CONTROLLER_CPUSET)) { + /* Because if it's not, we must use virNuma* APIs to bind + * memory onto desired nodes. CGroup way is preferred, as + * it allows runtime tuning, while virNuma - well, once + * set and child (qemu) is exec()-ed, we can't do + * anything about the settings. virNuma* does not take + * any PID argument after all. */ + doNuma = false; + } + virCgroupFree(&cgroup); + } + + if (doNuma) { + nodeset = virDomainNumatuneGetNodeset(h->vm->def->numa, + priv->autoNodeset, -1); + + if (virNumaSetupMemoryPolicy(mode, nodeset) < 0) + goto cleanup; + } ret = 0; -- 2.0.5

On Fri, Mar 27, 2015 at 03:26:53PM +0100, Michal Privoznik wrote:
Once upon a time, there was a little domain. And the domain was pinned onto a NUMA node and hasn't fully allocated its memory:
<memory unit='KiB'>2355200</memory> <currentMemory unit='KiB'>1048576</currentMemory>
<numatune> <memory mode='strict' nodeset='0'/> </numatune>
Oh little me, said the domain, what will I do with so few memory.
s/few/little/
If I only had a few megabytes more. But the old admin noticed whimpering, barely audible to untrained human ear. And good admin he
the whimpering the good admin (?)
was, he gave the domain yet more memory. But the old NUMA topology witch forbidden to allocate more memory on the node zero. So he
forbidden -> forbade or forbid
decided to allocate it on a different node:
virsh # numatune little_domain --nodeset 0-1
virsh # setmem little_domain 2355200
The little domain was happy. For a while. Until bad, sharp teeth
a bad? the bad?
shaped creature came. Every process in the system was afraid of him. The OOM Killer they called him. Oh no, he's after the little domain. There's no escape.
Do you kids know why? Because when the little domain was born, her father, Libvirt, called numa_set_membind(). So even if the admin allowed her to allocate memory from other nodes in the cgroups, the membind() forbid it.
So what's the lesson? Libvirt should rely on cgroups, whenever possible and use numa_set_membind() as the last ditch effort.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_process.c | 32 ++++++++++++++++++++++++++++---- 1 file changed, 28 insertions(+), 4 deletions(-)
I don't have much experience proofreading children's books, but the logic looks okay to me.
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 79f763e..cba042d 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -3154,6 +3154,7 @@ static int qemuProcessHook(void *data) int fd; virBitmapPtr nodeset = NULL; virDomainNumatuneMemMode mode; + bool doNuma = true;
/* This method cannot use any mutexes, which are not * protected across fork() @@ -3185,11 +3186,34 @@ static int qemuProcessHook(void *data) goto cleanup;
mode = virDomainNumatuneGetMode(h->vm->def->numa, -1); - nodeset = virDomainNumatuneGetNodeset(h->vm->def->numa, - priv->autoNodeset, -1);
- if (virNumaSetupMemoryPolicy(mode, nodeset) < 0) - goto cleanup; + if (mode == VIR_DOMAIN_NUMATUNE_MEM_STRICT) { + virCgroupPtr cgroup = NULL; + + /* Create dummy cgroup ... */ + if (virCgroupNewSelf(&cgroup) < 0) + goto cleanup;
The domain's cgroup is accessible under priv->cgroup here, you can use that one instead.
+ + /* ... just to detect if cpuset cgroup is present */ + if (virCgroupHasController(cgroup, VIR_CGROUP_CONTROLLER_CPUSET)) { + /* Because if it's not, we must use virNuma* APIs to bind + * memory onto desired nodes. CGroup way is preferred, as + * it allows runtime tuning, while virNuma - well, once + * set and child (qemu) is exec()-ed, we can't do + * anything about the settings. virNuma* does not take + * any PID argument after all. */
Can this comment be shortened? Jan

On 30.03.2015 15:15, Ján Tomko wrote:
On Fri, Mar 27, 2015 at 03:26:53PM +0100, Michal Privoznik wrote:
Once upon a time, there was a little domain. And the domain was pinned onto a NUMA node and hasn't fully allocated its memory:
<memory unit='KiB'>2355200</memory> <currentMemory unit='KiB'>1048576</currentMemory>
<numatune> <memory mode='strict' nodeset='0'/> </numatune>
Oh little me, said the domain, what will I do with so few memory.
s/few/little/
If I only had a few megabytes more. But the old admin noticed whimpering, barely audible to untrained human ear. And good admin he
the whimpering the good admin (?)
was, he gave the domain yet more memory. But the old NUMA topology witch forbidden to allocate more memory on the node zero. So he
forbidden -> forbade or forbid
decided to allocate it on a different node:
virsh # numatune little_domain --nodeset 0-1
virsh # setmem little_domain 2355200
The little domain was happy. For a while. Until bad, sharp teeth
a bad? the bad?
shaped creature came. Every process in the system was afraid of him. The OOM Killer they called him. Oh no, he's after the little domain. There's no escape.
Do you kids know why? Because when the little domain was born, her father, Libvirt, called numa_set_membind(). So even if the admin allowed her to allocate memory from other nodes in the cgroups, the membind() forbid it.
So what's the lesson? Libvirt should rely on cgroups, whenever possible and use numa_set_membind() as the last ditch effort.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_process.c | 32 ++++++++++++++++++++++++++++---- 1 file changed, 28 insertions(+), 4 deletions(-)
I don't have much experience proofreading children's books, but the logic looks okay to me.
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 79f763e..cba042d 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -3154,6 +3154,7 @@ static int qemuProcessHook(void *data) int fd; virBitmapPtr nodeset = NULL; virDomainNumatuneMemMode mode; + bool doNuma = true;
/* This method cannot use any mutexes, which are not * protected across fork() @@ -3185,11 +3186,34 @@ static int qemuProcessHook(void *data) goto cleanup;
mode = virDomainNumatuneGetMode(h->vm->def->numa, -1); - nodeset = virDomainNumatuneGetNodeset(h->vm->def->numa, - priv->autoNodeset, -1);
- if (virNumaSetupMemoryPolicy(mode, nodeset) < 0) - goto cleanup; + if (mode == VIR_DOMAIN_NUMATUNE_MEM_STRICT) { + virCgroupPtr cgroup = NULL; + + /* Create dummy cgroup ... */ + if (virCgroupNewSelf(&cgroup) < 0) + goto cleanup;
The domain's cgroup is accessible under priv->cgroup here, you can use that one instead.
Good point, I though that CGroups are created later in the process. But due to handshaking with child, we are guaranteed that we can access priv->cgroup. Will send v2 in a while. Michal

Like we are doing in qemu driver ($COMMIT_HASH_TO_BE_FILLED_DURING_PUSHING), lets call virNumaSetupMemoryPolicy() only if really needed. Problem is, if we numa_set_membind() child, there's no way to change it from the daemon afterwards. So any later attempts to change the pinning will fail. But in very weird way - CGroups will be set, but due to membind child will not allocate memory from any other node. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/lxc/lxc_controller.c | 31 +++++++++++++++++++++++++------ 1 file changed, 25 insertions(+), 6 deletions(-) diff --git a/src/lxc/lxc_controller.c b/src/lxc/lxc_controller.c index 8545f29..6881a37 100644 --- a/src/lxc/lxc_controller.c +++ b/src/lxc/lxc_controller.c @@ -742,14 +742,33 @@ static int virLXCControllerSetupResourceLimits(virLXCControllerPtr ctrl) virBitmapPtr nodeset = NULL; virDomainNumatuneMemMode mode; - VIR_DEBUG("Setting up process resource limits"); - - if (virLXCControllerGetNumadAdvice(ctrl, &auto_nodeset) < 0) - goto cleanup; - - nodeset = virDomainNumatuneGetNodeset(ctrl->def->numa, auto_nodeset, -1); mode = virDomainNumatuneGetMode(ctrl->def->numa, -1); + if (mode == VIR_DOMAIN_NUMATUNE_MEM_STRICT) { + virCgroupPtr cgroup; + + /* Create dummy cgroup ... */ + if (virCgroupNewSelf(&cgroup) < 0) + goto cleanup; + + /* ... just to detect if cpuset cgroup is present */ + if (virCgroupHasController(cgroup, VIR_CGROUP_CONTROLLER_CPUSET)) { + /* Because if it's not, we will pin the child onto + * specified nodes for good. No later cpuset.mems + * change will have any effect. */ + VIR_DEBUG("Postponing setting up resource limits to CGroup set up phase"); + virCgroupFree(&cgroup); + return virLXCControllerSetupCpuAffinity(ctrl); + } + } + + VIR_DEBUG("Setting up process resource limits"); + + if (virLXCControllerGetNumadAdvice(ctrl, &auto_nodeset) < 0) + goto cleanup; + + nodeset = virDomainNumatuneGetNodeset(ctrl->def->numa, auto_nodeset, -1); + if (virNumaSetupMemoryPolicy(mode, nodeset) < 0) goto cleanup; -- 2.0.5

On Fri, Mar 27, 2015 at 03:26:48PM +0100, Michal Privoznik wrote:
The heart of the patchset are the last two patches. Long story short, if a domain is configured to be pinned onto a set of NUMA nodes stricly, we use both CGroups and numa_set_membind(). While we can change the former later on a user's wish, we can't change the latter. Therefore, any subsequent memory enlargement (e.g. via virsh setmem $dom) will result in memory still being allocated from old NUMA nodes and OOM killer interference is likely to occur.
Michal Privoznik (6): virCgroupNewPartition: Fix comment virCgroupNew: Enhance debug message virCgroupController: Check the enum fits into 'int' qemuDomainGetNumaParameters: Check for the correct CGroup controller qemuProcessHook: Call virNuma*() iff needed virLXCControllerSetupResourceLimits: Call virNuma*() iff needed
ACK to patches 1-4, safe during the freeze. Jan
participants (2)
-
Ján Tomko
-
Michal Privoznik