[PATCH 0/3] qemu: Be more generous on cpuset.mems

*** BLURB HERE *** Michal Prívozník (3): qemu: Don't try to 'fix up' cpuset.mems after QEMU's memory allocation qemu: Allow more generous cpuset.mems for vCPUs and IOThreads qemu: Drop @unionMems argument from qemuProcessSetupPid() src/qemu/qemu_hotplug.c | 7 +++---- src/qemu/qemu_process.c | 45 +++++++++++++---------------------------- src/qemu/qemu_process.h | 3 +-- 3 files changed, 18 insertions(+), 37 deletions(-) -- 2.39.3

In ideal world, my plan was perfect. We allow union of all host nodes in cpuset.mems and once QEMU has allocated its memory, we 'fix up' restriction of its emulator thread by writing the original value we wanted to set all along. But in fact, we can't do it because that triggers memory movement. For instance, consider the following <numatune/>: <numatune> <memory mode="strict" nodeset="0"/> <memnode cellid="1" mode="strict" nodeset="1"/> </numatune> <numa> <cell id="0" cpus="0-1" memory="1024000" unit="KiB" /> <cell id="1" cpus="2-3" memory="1048576" unit="KiB"/> </numa> This is meant to create 1:1 mapping between guest and host NUMA nodes. So we start QEMU with cpuset.mems set to "0-1" (so that it can allocate memory even for guest node #1 and have the memory come fro host node #1) and then, set cpuset.mems to "0" (because that's where we wanted emulator thread to live). But this in turn triggers movement of all memory (even the allocated one) to host NUMA node #0. Therefore, we have to just keep cpuset.mems untouched and rely on .host-nodes passed on the QEMU cmd line. The placement still suffers because of cpuset.mems set for vcpus or iothreads, but that's fixed in next commit. Fixes: 3ec6d586bc3ec7a8cf406b1b6363e87d50aa159c Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_hotplug.c | 7 +++---- src/qemu/qemu_process.c | 15 ++++----------- src/qemu/qemu_process.h | 3 +-- 3 files changed, 8 insertions(+), 17 deletions(-) diff --git a/src/qemu/qemu_hotplug.c b/src/qemu/qemu_hotplug.c index 806aecb29d..ba9e44945b 100644 --- a/src/qemu/qemu_hotplug.c +++ b/src/qemu/qemu_hotplug.c @@ -2295,7 +2295,7 @@ qemuDomainAttachMemory(virQEMUDriver *driver, if (qemuDomainAdjustMaxMemLock(vm) < 0) goto removedef; - if (qemuProcessSetupEmulator(vm, true) < 0) + if (qemuProcessSetupEmulator(vm) < 0) goto removedef; restoreemulatorcgroup = true; @@ -2336,11 +2336,10 @@ qemuDomainAttachMemory(virQEMUDriver *driver, VIR_WARN("Unable to remove memory device from /dev"); if (releaseaddr) qemuDomainReleaseMemoryDeviceSlot(vm, mem); + if (restoreemulatorcgroup) + qemuProcessSetupEmulator(vm); } - if (restoreemulatorcgroup) - qemuProcessSetupEmulator(vm, false); - virDomainMemoryDefFree(mem); return ret; diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index a4a4a17a9b..1d3cdeff9a 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -2596,9 +2596,7 @@ qemuProcessSetupPid(virDomainObj *vm, mem_mode == VIR_DOMAIN_NUMATUNE_MEM_RESTRICTIVE)) { /* QEMU allocates its memory from the emulator thread. Thus it - * needs to access union of all host nodes configured. This is - * going to be replaced with proper value later in the startup - * process. */ + * needs to access union of all host nodes configured. */ if (unionMems && nameval == VIR_CGROUP_THREAD_EMULATOR && mem_mode != VIR_DOMAIN_NUMATUNE_MEM_RESTRICTIVE) { @@ -2702,15 +2700,14 @@ qemuProcessSetupPid(virDomainObj *vm, int -qemuProcessSetupEmulator(virDomainObj *vm, - bool unionMems) +qemuProcessSetupEmulator(virDomainObj *vm) { return qemuProcessSetupPid(vm, vm->pid, VIR_CGROUP_THREAD_EMULATOR, 0, vm->def->cputune.emulatorpin, vm->def->cputune.emulator_period, vm->def->cputune.emulator_quota, vm->def->cputune.emulatorsched, - unionMems); + true); } @@ -7764,7 +7761,7 @@ qemuProcessLaunch(virConnectPtr conn, goto cleanup; VIR_DEBUG("Setting emulator tuning/settings"); - if (qemuProcessSetupEmulator(vm, true) < 0) + if (qemuProcessSetupEmulator(vm) < 0) goto cleanup; VIR_DEBUG("Setting cgroup for external devices (if required)"); @@ -7827,10 +7824,6 @@ qemuProcessLaunch(virConnectPtr conn, if (qemuConnectAgent(driver, vm) < 0) goto cleanup; - VIR_DEBUG("Fixing up emulator tuning/settings"); - if (qemuProcessSetupEmulator(vm, false) < 0) - goto cleanup; - VIR_DEBUG("setting up hotpluggable cpus"); if (qemuDomainHasHotpluggableStartupVcpus(vm->def)) { if (qemuDomainRefreshVcpuInfo(vm, asyncJob, false) < 0) diff --git a/src/qemu/qemu_process.h b/src/qemu/qemu_process.h index 1c4c0678ab..cae1b49756 100644 --- a/src/qemu/qemu_process.h +++ b/src/qemu/qemu_process.h @@ -236,5 +236,4 @@ void qemuProcessCleanupMigrationJob(virQEMUDriver *driver, void qemuProcessRefreshDiskProps(virDomainDiskDef *disk, struct qemuDomainDiskInfo *info); -int qemuProcessSetupEmulator(virDomainObj *vm, - bool unionMems); +int qemuProcessSetupEmulator(virDomainObj *vm); -- 2.39.3

On Wed, Jun 07, 2023 at 04:40:59PM +0200, Michal Privoznik wrote:
In ideal world, my plan was perfect. We allow union of all host nodes in cpuset.mems and once QEMU has allocated its memory, we 'fix up' restriction of its emulator thread by writing the original value we wanted to set all along. But in fact, we can't do it because that triggers memory movement. For instance, consider the following <numatune/>:
<numatune> <memory mode="strict" nodeset="0"/> <memnode cellid="1" mode="strict" nodeset="1"/> </numatune>
<numa> <cell id="0" cpus="0-1" memory="1024000" unit="KiB" /> <cell id="1" cpus="2-3" memory="1048576" unit="KiB"/> </numa>
This is meant to create 1:1 mapping between guest and host NUMA nodes. So we start QEMU with cpuset.mems set to "0-1" (so that it can allocate memory even for guest node #1 and have the memory come fro host node #1) and then, set cpuset.mems to "0" (because that's where we wanted emulator thread to live).
But this in turn triggers movement of all memory (even the allocated one) to host NUMA node #0. Therefore, we have to just keep cpuset.mems untouched and rely on .host-nodes passed on the QEMU cmd line.
The placement still suffers because of cpuset.mems set for vcpus or iothreads, but that's fixed in next commit.
Fixes: 3ec6d586bc3ec7a8cf406b1b6363e87d50aa159c Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Martin Kletzander <mkletzan@redhat.com>

The unit that cpuset CGroups controller works with is a thread/process, not individual memory allocations. Therefore, after we've set cpuset.mems for emulator (after previous commit it's set to union of all host NUMA nodes allowed for given domain), and as we try to set up cpuset.mems for vCPUs/IOThreads, memory is migrated to selected NUMA node(s). We are effectively saying: "this thread (vCPU thread) can have memory only from these NUMA node(s)". That's not really what we want though. The cpuset controller doesn't differentiate memory "belonging" to the emulator thread and vCPU thread or IOThread even. Therefore, set union of all allowed host NUMA nodes, just like we're doing for the emulator thread. Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=2138150 Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_process.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 1d3cdeff9a..d9269e37a1 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -2598,7 +2598,6 @@ qemuProcessSetupPid(virDomainObj *vm, /* QEMU allocates its memory from the emulator thread. Thus it * needs to access union of all host nodes configured. */ if (unionMems && - nameval == VIR_CGROUP_THREAD_EMULATOR && mem_mode != VIR_DOMAIN_NUMATUNE_MEM_RESTRICTIVE) { qemuDomainNumatuneMaybeFormatNodesetUnion(vm, NULL, &mem_mask); } else { @@ -5911,7 +5910,7 @@ qemuProcessSetupVcpu(virDomainObj *vm, vm->def->cputune.period, vm->def->cputune.quota, &vcpu->sched, - false) < 0) + true) < 0) return -1; if (schedCore && @@ -6067,7 +6066,7 @@ qemuProcessSetupIOThread(virDomainObj *vm, vm->def->cputune.iothread_period, vm->def->cputune.iothread_quota, &iothread->sched, - false); + true); } -- 2.39.3

On Wed, Jun 07, 2023 at 04:41:00PM +0200, Michal Privoznik wrote:
The unit that cpuset CGroups controller works with is a thread/process, not individual memory allocations. Therefore, after we've set cpuset.mems for emulator (after previous commit it's set to union of all host NUMA nodes allowed for given domain), and as we try to set up cpuset.mems for vCPUs/IOThreads, memory is migrated to selected NUMA node(s). We are effectively saying: "this thread (vCPU thread) can have memory only from these NUMA node(s)".
That's not really what we want though. The cpuset controller doesn't differentiate memory "belonging" to the emulator thread and vCPU thread or IOThread even.
Therefore, set union of all allowed host NUMA nodes, just like we're doing for the emulator thread.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=2138150 Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Martin Kletzander <mkletzan@redhat.com>

The @unionMems argument of qemuProcessSetupPid() function is not necessary really as all callers pass 'true'. Drop it. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_process.c | 31 +++++++++++-------------------- 1 file changed, 11 insertions(+), 20 deletions(-) diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index d9269e37a1..74e85c8464 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -2550,8 +2550,7 @@ qemuProcessSetupPid(virDomainObj *vm, virBitmap *cpumask, unsigned long long period, long long quota, - virDomainThreadSchedParam *sched, - bool unionMems) + virDomainThreadSchedParam *sched) { qemuDomainObjPrivate *priv = vm->privateData; virDomainNuma *numatune = vm->def->numa; @@ -2591,21 +2590,16 @@ qemuProcessSetupPid(virDomainObj *vm, if (virCgroupHasController(priv->cgroup, VIR_CGROUP_CONTROLLER_CPU) || virCgroupHasController(priv->cgroup, VIR_CGROUP_CONTROLLER_CPUSET)) { - if (virDomainNumatuneGetMode(numatune, -1, &mem_mode) == 0 && - (mem_mode == VIR_DOMAIN_NUMATUNE_MEM_STRICT || - mem_mode == VIR_DOMAIN_NUMATUNE_MEM_RESTRICTIVE)) { - + if (virDomainNumatuneGetMode(numatune, -1, &mem_mode) == 0) { /* QEMU allocates its memory from the emulator thread. Thus it * needs to access union of all host nodes configured. */ - if (unionMems && - mem_mode != VIR_DOMAIN_NUMATUNE_MEM_RESTRICTIVE) { + if (mem_mode == VIR_DOMAIN_NUMATUNE_MEM_STRICT) { qemuDomainNumatuneMaybeFormatNodesetUnion(vm, NULL, &mem_mask); - } else { - if (virDomainNumatuneMaybeFormatNodeset(numatune, - priv->autoNodeset, - &mem_mask, -1) < 0) - goto cleanup; - } + } else if (mem_mode == VIR_DOMAIN_NUMATUNE_MEM_RESTRICTIVE && + virDomainNumatuneMaybeFormatNodeset(numatune, + priv->autoNodeset, + &mem_mask, -1) < 0) + goto cleanup; } /* For restrictive numatune mode we need to set cpuset.mems for vCPU @@ -2705,8 +2699,7 @@ qemuProcessSetupEmulator(virDomainObj *vm) 0, vm->def->cputune.emulatorpin, vm->def->cputune.emulator_period, vm->def->cputune.emulator_quota, - vm->def->cputune.emulatorsched, - true); + vm->def->cputune.emulatorsched); } @@ -5909,8 +5902,7 @@ qemuProcessSetupVcpu(virDomainObj *vm, vcpuid, vcpu->cpumask, vm->def->cputune.period, vm->def->cputune.quota, - &vcpu->sched, - true) < 0) + &vcpu->sched) < 0) return -1; if (schedCore && @@ -6065,8 +6057,7 @@ qemuProcessSetupIOThread(virDomainObj *vm, iothread->cpumask, vm->def->cputune.iothread_period, vm->def->cputune.iothread_quota, - &iothread->sched, - true); + &iothread->sched); } -- 2.39.3

On Wed, Jun 07, 2023 at 04:41:01PM +0200, Michal Privoznik wrote:
The @unionMems argument of qemuProcessSetupPid() function is not necessary really as all callers pass 'true'. Drop it.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_process.c | 31 +++++++++++-------------------- 1 file changed, 11 insertions(+), 20 deletions(-)
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index d9269e37a1..74e85c8464 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -2550,8 +2550,7 @@ qemuProcessSetupPid(virDomainObj *vm, virBitmap *cpumask, unsigned long long period, long long quota, - virDomainThreadSchedParam *sched, - bool unionMems) + virDomainThreadSchedParam *sched) { qemuDomainObjPrivate *priv = vm->privateData; virDomainNuma *numatune = vm->def->numa; @@ -2591,21 +2590,16 @@ qemuProcessSetupPid(virDomainObj *vm, if (virCgroupHasController(priv->cgroup, VIR_CGROUP_CONTROLLER_CPU) || virCgroupHasController(priv->cgroup, VIR_CGROUP_CONTROLLER_CPUSET)) {
- if (virDomainNumatuneGetMode(numatune, -1, &mem_mode) == 0 && - (mem_mode == VIR_DOMAIN_NUMATUNE_MEM_STRICT || - mem_mode == VIR_DOMAIN_NUMATUNE_MEM_RESTRICTIVE)) { - + if (virDomainNumatuneGetMode(numatune, -1, &mem_mode) == 0) { /* QEMU allocates its memory from the emulator thread. Thus it * needs to access union of all host nodes configured. */ - if (unionMems && - mem_mode != VIR_DOMAIN_NUMATUNE_MEM_RESTRICTIVE) { + if (mem_mode == VIR_DOMAIN_NUMATUNE_MEM_STRICT) { qemuDomainNumatuneMaybeFormatNodesetUnion(vm, NULL, &mem_mask); - } else { - if (virDomainNumatuneMaybeFormatNodeset(numatune, - priv->autoNodeset, - &mem_mask, -1) < 0) - goto cleanup; - } + } else if (mem_mode == VIR_DOMAIN_NUMATUNE_MEM_RESTRICTIVE && + virDomainNumatuneMaybeFormatNodeset(numatune, + priv->autoNodeset, + &mem_mask, -1) < 0) + goto cleanup;
This body should also use squiggly brackets based on our coding style. It might be cleaner to switch it around and do: if (mem_mode == VIR_DOMAIN_NUMATUNE_MEM_RESTRICTIVE && virDomainNumatuneMaybeFormatNodeset(numatune, priv->autoNodeset, &mem_mask, -1) < 0) goto cleanup; else if (mem_mode == VIR_DOMAIN_NUMATUNE_MEM_STRICT) qemuDomainNumatuneMaybeFormatNodesetUnion(vm, NULL, &mem_mask); or just do it as two different if's without the "else", mem_mode cannot be both anyway. With that fixed up: Reviewed-by: Martin Kletzander <mkletzan@redhat.com>

On 6/8/23 08:45, Martin Kletzander wrote:
On Wed, Jun 07, 2023 at 04:41:01PM +0200, Michal Privoznik wrote:
The @unionMems argument of qemuProcessSetupPid() function is not necessary really as all callers pass 'true'. Drop it.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_process.c | 31 +++++++++++-------------------- 1 file changed, 11 insertions(+), 20 deletions(-)
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index d9269e37a1..74e85c8464 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -2550,8 +2550,7 @@ qemuProcessSetupPid(virDomainObj *vm, virBitmap *cpumask, unsigned long long period, long long quota, - virDomainThreadSchedParam *sched, - bool unionMems) + virDomainThreadSchedParam *sched) { qemuDomainObjPrivate *priv = vm->privateData; virDomainNuma *numatune = vm->def->numa; @@ -2591,21 +2590,16 @@ qemuProcessSetupPid(virDomainObj *vm, if (virCgroupHasController(priv->cgroup, VIR_CGROUP_CONTROLLER_CPU) || virCgroupHasController(priv->cgroup, VIR_CGROUP_CONTROLLER_CPUSET)) {
- if (virDomainNumatuneGetMode(numatune, -1, &mem_mode) == 0 && - (mem_mode == VIR_DOMAIN_NUMATUNE_MEM_STRICT || - mem_mode == VIR_DOMAIN_NUMATUNE_MEM_RESTRICTIVE)) { - + if (virDomainNumatuneGetMode(numatune, -1, &mem_mode) == 0) { /* QEMU allocates its memory from the emulator thread. Thus it * needs to access union of all host nodes configured. */ - if (unionMems && - mem_mode != VIR_DOMAIN_NUMATUNE_MEM_RESTRICTIVE) { + if (mem_mode == VIR_DOMAIN_NUMATUNE_MEM_STRICT) { qemuDomainNumatuneMaybeFormatNodesetUnion(vm, NULL, &mem_mask); - } else { - if (virDomainNumatuneMaybeFormatNodeset(numatune, - priv->autoNodeset, - &mem_mask, -1) < 0) - goto cleanup; - } + } else if (mem_mode == VIR_DOMAIN_NUMATUNE_MEM_RESTRICTIVE && + virDomainNumatuneMaybeFormatNodeset(numatune, + priv->autoNodeset, + &mem_mask, -1) < 0) + goto cleanup;
This body should also use squiggly brackets based on our coding style. It might be cleaner to switch it around and do:
if (mem_mode == VIR_DOMAIN_NUMATUNE_MEM_RESTRICTIVE && virDomainNumatuneMaybeFormatNodeset(numatune, priv->autoNodeset, &mem_mask, -1) < 0) goto cleanup; else if (mem_mode == VIR_DOMAIN_NUMATUNE_MEM_STRICT) qemuDomainNumatuneMaybeFormatNodesetUnion(vm, NULL, &mem_mask);
or just do it as two different if's without the "else", mem_mode cannot be both anyway.
Good point. This got me playing with switch() and instantly made me realize - whether MEM_STRICT and MEM_INTERLEAVE should do the same thing here. I mean, it's now obvious that strict needs an union of all (configured) nodes. But MEM_INTERLEAVE also needs it as the only difference is how memory is distributed across those nodes (i.e. irrelevant from CGroup's POV). Of course, if anything, that would be a separate commit, but if I use switch() here, then it's a trivial one-liner. Michal

On Thu, Jun 08, 2023 at 09:03:04AM +0200, Michal Prívozník wrote:
On 6/8/23 08:45, Martin Kletzander wrote:
On Wed, Jun 07, 2023 at 04:41:01PM +0200, Michal Privoznik wrote:
The @unionMems argument of qemuProcessSetupPid() function is not necessary really as all callers pass 'true'. Drop it.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_process.c | 31 +++++++++++-------------------- 1 file changed, 11 insertions(+), 20 deletions(-)
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index d9269e37a1..74e85c8464 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -2550,8 +2550,7 @@ qemuProcessSetupPid(virDomainObj *vm, virBitmap *cpumask, unsigned long long period, long long quota, - virDomainThreadSchedParam *sched, - bool unionMems) + virDomainThreadSchedParam *sched) { qemuDomainObjPrivate *priv = vm->privateData; virDomainNuma *numatune = vm->def->numa; @@ -2591,21 +2590,16 @@ qemuProcessSetupPid(virDomainObj *vm, if (virCgroupHasController(priv->cgroup, VIR_CGROUP_CONTROLLER_CPU) || virCgroupHasController(priv->cgroup, VIR_CGROUP_CONTROLLER_CPUSET)) {
- if (virDomainNumatuneGetMode(numatune, -1, &mem_mode) == 0 && - (mem_mode == VIR_DOMAIN_NUMATUNE_MEM_STRICT || - mem_mode == VIR_DOMAIN_NUMATUNE_MEM_RESTRICTIVE)) { - + if (virDomainNumatuneGetMode(numatune, -1, &mem_mode) == 0) { /* QEMU allocates its memory from the emulator thread. Thus it * needs to access union of all host nodes configured. */ - if (unionMems && - mem_mode != VIR_DOMAIN_NUMATUNE_MEM_RESTRICTIVE) { + if (mem_mode == VIR_DOMAIN_NUMATUNE_MEM_STRICT) { qemuDomainNumatuneMaybeFormatNodesetUnion(vm, NULL, &mem_mask); - } else { - if (virDomainNumatuneMaybeFormatNodeset(numatune, - priv->autoNodeset, - &mem_mask, -1) < 0) - goto cleanup; - } + } else if (mem_mode == VIR_DOMAIN_NUMATUNE_MEM_RESTRICTIVE && + virDomainNumatuneMaybeFormatNodeset(numatune, + priv->autoNodeset, + &mem_mask, -1) < 0) + goto cleanup;
This body should also use squiggly brackets based on our coding style. It might be cleaner to switch it around and do:
if (mem_mode == VIR_DOMAIN_NUMATUNE_MEM_RESTRICTIVE && virDomainNumatuneMaybeFormatNodeset(numatune, priv->autoNodeset, &mem_mask, -1) < 0) goto cleanup; else if (mem_mode == VIR_DOMAIN_NUMATUNE_MEM_STRICT) qemuDomainNumatuneMaybeFormatNodesetUnion(vm, NULL, &mem_mask);
or just do it as two different if's without the "else", mem_mode cannot be both anyway.
Good point. This got me playing with switch() and instantly made me realize - whether MEM_STRICT and MEM_INTERLEAVE should do the same thing here. I mean, it's now obvious that strict needs an union of all (configured) nodes. But MEM_INTERLEAVE also needs it as the only difference is how memory is distributed across those nodes (i.e. irrelevant from CGroup's POV).
Unlike STRICT, INTERLEAVE is just a hint, so I don't think so.
Of course, if anything, that would be a separate commit, but if I use switch() here, then it's a trivial one-liner.
Michal
participants (3)
-
Martin Kletzander
-
Michal Privoznik
-
Michal Prívozník