[libvirt] [PATCH 0/4] qemu: Honor memory mode='strict'

If there's a domain configured as:

  <currentMemory unit='MiB'>4096</currentMemory>
  <numatune>
    <memory mode='strict' nodeset='1'/>
  </numatune>

but there is not enough memory on NUMA node 1, the domain will start
successfully because we allow it to allocate memory from other nodes.
This is a result of an earlier fix (v1.2.7-rc1~91). However, I've tested
my fix successfully on a NUMA machine with F29 and a recent kernel, so
the kernel bug I'm mentioning in 4/4 is probably fixed by now and we can
drop the workaround.

Michal Prívozník (4):
  qemuSetupCpusetMems: Use VIR_AUTOFREE()
  qemuSetupCpusetMems: Create EMULATOR thread upfront
  qemu_cgroup: Make qemuSetupCpusetMems static
  qemuSetupCpusetCgroup: Set up cpuset.mems before execing qemu

 src/qemu/qemu_cgroup.c | 12 +++++++-----
 src/qemu/qemu_cgroup.h |  1 -
 2 files changed, 7 insertions(+), 6 deletions(-)

--
2.21.0

[libvirt] [PATCH 1/4] qemuSetupCpusetMems: Use VIR_AUTOFREE()

There is one string here that VIR_AUTOFREE() can be used on.

Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
 src/qemu/qemu_cgroup.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c
index c23f0af2aa..689e0839cd 100644
--- a/src/qemu/qemu_cgroup.c
+++ b/src/qemu/qemu_cgroup.c
@@ -830,7 +830,7 @@ qemuSetupCpusetMems(virDomainObjPtr vm)
     virCgroupPtr cgroup_temp = NULL;
     qemuDomainObjPrivatePtr priv = vm->privateData;
     virDomainNumatuneMemMode mode;
-    char *mem_mask = NULL;
+    VIR_AUTOFREE(char *) mem_mask = NULL;
     int ret = -1;
 
     if (!virCgroupHasController(priv->cgroup, VIR_CGROUP_CONTROLLER_CPUSET))
@@ -843,7 +843,7 @@ qemuSetupCpusetMems(virDomainObjPtr vm)
     if (virDomainNumatuneMaybeFormatNodeset(vm->def->numa,
                                             priv->autoNodeset,
                                             &mem_mask, -1) < 0)
-        goto cleanup;
+        return -1;
 
     if (mem_mask)
         if (virCgroupNewThread(priv->cgroup, VIR_CGROUP_THREAD_EMULATOR, 0,
@@ -853,7 +853,6 @@ qemuSetupCpusetMems(virDomainObjPtr vm)
     ret = 0;
 
  cleanup:
-    VIR_FREE(mem_mask);
     virCgroupFree(&cgroup_temp);
     return ret;
 }
--
2.21.0
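
For readers unfamiliar with VIR_AUTOFREE(): it builds on the GCC/Clang
cleanup attribute, which runs a designated function when a variable goes
out of scope. A minimal self-contained sketch of that mechanism follows;
the helper and macro names are illustrative, not libvirt's actual
definitions.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>

    /* Free the string when the variable leaves scope; free(NULL) is a no-op. */
    static void autofree_str(char **ptr)
    {
        free(*ptr);
    }

    /* Illustrative stand-in for the VIR_AUTOFREE() idea. */
    #define AUTOFREE __attribute__((cleanup(autofree_str)))

    int main(void)
    {
        AUTOFREE char *mem_mask = NULL;

        if (asprintf(&mem_mask, "%d-%d", 0, 1) < 0)
            return -1;              /* freed automatically on this early return */

        printf("mask: %s\n", mem_mask);
        return 0;                   /* ...and here, no explicit free() needed */
    }

This is why the patch can replace "goto cleanup" with a plain "return -1":
the compiler inserts the free on every exit path.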

[libvirt] [PATCH 2/4] qemuSetupCpusetMems: Create EMULATOR thread upfront

This function is going to be called after fork() and before exec() of
qemu to ensure that the 'strict' mode of numatune is honored. That point
is well before qemuProcessSetupEmulator() is called, and thus the CGroup
for the EMULATOR thread does not exist yet.

Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
 src/qemu/qemu_cgroup.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c
index 689e0839cd..2c663aca0a 100644
--- a/src/qemu/qemu_cgroup.c
+++ b/src/qemu/qemu_cgroup.c
@@ -847,7 +847,7 @@ qemuSetupCpusetMems(virDomainObjPtr vm)
 
     if (mem_mask)
         if (virCgroupNewThread(priv->cgroup, VIR_CGROUP_THREAD_EMULATOR, 0,
-                               false, &cgroup_temp) < 0 ||
+                               true, &cgroup_temp) < 0 ||
             virCgroupSetCpusetMems(cgroup_temp, mem_mask) < 0)
             goto cleanup;
--
2.21.0
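
For context, on cgroup v1 "creating the EMULATOR thread cgroup and setting
its cpuset.mems" ultimately amounts to making a child directory under the
cpuset controller and writing a node list into it. The sketch below shows
that filesystem-level view under an assumed, hypothetical path; it is not
libvirt's virCgroupNewThread() implementation.

    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>
    #include <limits.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static int write_string(const char *path, const char *val)
    {
        int fd = open(path, O_WRONLY);
        if (fd < 0)
            return -1;
        ssize_t rc = write(fd, val, strlen(val));
        close(fd);
        return rc < 0 ? -1 : 0;
    }

    int main(void)
    {
        /* Hypothetical location; libvirt derives the real path from the
         * machine slice/scope of the domain. */
        const char *dir = "/sys/fs/cgroup/cpuset/machine/qemu-1-demo/emulator";
        char path[PATH_MAX];

        /* create=true: make the directory if it is missing;
         * create=false would treat a missing directory as an error. */
        if (mkdir(dir, 0755) < 0 && errno != EEXIST)
            return 1;

        /* Restrict the emulator thread's allocations to NUMA node 1. */
        snprintf(path, sizeof(path), "%s/cpuset.mems", dir);
        return write_string(path, "1") == 0 ? 0 : 1;
    }

Passing true for the "create" argument is what lets the function run this
early, when the directory has not been made yet.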

[libvirt] [PATCH 3/4] qemu_cgroup: Make qemuSetupCpusetMems static

This function is not used (nor going to be used) outside of the
qemu_cgroup.c file. Mark it as static.

Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
 src/qemu/qemu_cgroup.c | 2 +-
 src/qemu/qemu_cgroup.h | 1 -
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c
index 2c663aca0a..b2d300b419 100644
--- a/src/qemu/qemu_cgroup.c
+++ b/src/qemu/qemu_cgroup.c
@@ -824,7 +824,7 @@ qemuSetupDevicesCgroup(virDomainObjPtr vm)
 }
 
 
-int
+static int
 qemuSetupCpusetMems(virDomainObjPtr vm)
 {
     virCgroupPtr cgroup_temp = NULL;
diff --git a/src/qemu/qemu_cgroup.h b/src/qemu/qemu_cgroup.h
index dc6d173fce..88ad4a38b6 100644
--- a/src/qemu/qemu_cgroup.h
+++ b/src/qemu/qemu_cgroup.h
@@ -61,7 +61,6 @@ int qemuConnectCgroup(virDomainObjPtr vm);
 int qemuSetupCgroup(virDomainObjPtr vm,
                     size_t nnicindexes,
                     int *nicindexes);
-int qemuSetupCpusetMems(virDomainObjPtr vm);
 int qemuSetupCgroupVcpuBW(virCgroupPtr cgroup,
                           unsigned long long period,
                           long long quota);
--
2.21.0

[libvirt] [PATCH 4/4] qemuSetupCpusetCgroup: Set up cpuset.mems before execing qemu

It's funny how this went unnoticed for such a long time. Long story
short, if a domain is configured with VIR_DOMAIN_NUMATUNE_MEM_STRICT,
libvirt doesn't really honour that. This is because of 7e72ac787848,
after which libvirt allowed qemu to allocate memory just anywhere and
only afterwards used some magic involving cpuset.memory_migrate and
cpuset.mems to move the memory to the desired NUMA nodes. This was done
to work around a KVM bug where KVM would fail if there wasn't a DMA zone
available on the NUMA node.

Well, while the workaround may have stopped libvirt from tickling the
KVM bug, it also caused a bug on the libvirt side: if there is not
enough memory on the configured NUMA node(s), then any attempt to start
a domain must fail, yet because of the way we play with guest memory,
domains start just happily.

Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
 src/qemu/qemu_cgroup.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c
index b2d300b419..d6c982aeed 100644
--- a/src/qemu/qemu_cgroup.c
+++ b/src/qemu/qemu_cgroup.c
@@ -869,6 +869,9 @@ qemuSetupCpusetCgroup(virDomainObjPtr vm)
 
     if (virCgroupSetCpusetMemoryMigrate(priv->cgroup, true) < 0)
         return -1;
 
+    if (qemuSetupCpusetMems(vm) < 0)
+        return -1;
+
     return 0;
 }
--
2.21.0
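
The ordering is the whole point here: qemu allocates guest RAM while it
starts up, so the node mask has to be in force between fork() and exec();
applying it afterwards can only try to migrate memory that was already
placed. A rough standalone sketch of that ordering follows;
setup_cpuset_mems() is a hypothetical stand-in for the cgroup work, and
the stop/continue handshake is a simplification of what libvirt really
does when launching qemu.

    #include <signal.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Hypothetical stand-in: confine "pid" to the given NUMA nodes,
     * e.g. by placing it into a cpuset whose cpuset.mems is "mems". */
    static int setup_cpuset_mems(pid_t pid, const char *mems)
    {
        fprintf(stderr, "would confine %d to nodes %s\n", (int)pid, mems);
        return 0;
    }

    int main(void)
    {
        pid_t child = fork();

        if (child == 0) {
            raise(SIGSTOP);     /* wait until the parent has set the mask */
            execlp("qemu-system-x86_64", "qemu-system-x86_64",
                   "-m", "4096", (char *)NULL);
            _exit(127);         /* only reached if exec failed */
        }

        waitpid(child, NULL, WUNTRACED);        /* child is now stopped */

        if (setup_cpuset_mems(child, "1") < 0)  /* BEFORE the child execs */
            return 1;

        kill(child, SIGCONT);                   /* let it proceed to exec qemu */
        return waitpid(child, NULL, 0) < 0;
    }

With the mask applied this early, a strict allocation that cannot be
satisfied on the configured node fails at startup instead of silently
spilling onto other nodes.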

Hi,

On 4/9/19 11:10 AM, Michal Privoznik wrote:
If there's a domain configured as:
  <currentMemory unit='MiB'>4096</currentMemory>
  <numatune>
    <memory mode='strict' nodeset='1'/>
  </numatune>
but there is not enough memory on NUMA node 1, the domain will start
successfully because we allow it to allocate memory from other nodes.
This is a result of an earlier fix (v1.2.7-rc1~91). However, I've tested
my fix successfully on a NUMA machine with F29 and a recent kernel, so
the kernel bug I'm mentioning in 4/4 is probably fixed by now and we can
drop the workaround.
Out of curiosity, I've tested your patch set on a Power8 system to see
if I could spot a difference, but in my case it didn't change the
behavior. I've tried a guest with the following numatune:

  <memory unit='KiB'>67108864</memory>
  <currentMemory unit='KiB'>67108864</currentMemory>
  <vcpu placement='static' current='4'>16</vcpu>
  <numatune>
    <memory mode='strict' nodeset='0'/>
  </numatune>

This is the numa setup of the host:

$ numactl -H
available: 4 nodes (0-1,16-17)
node 0 cpus: 0 8 16 24 32 40
node 0 size: 32606 MB
node 0 free: 24125 MB
node 1 cpus: 48 56 64 72 80 88
node 1 size: 32704 MB
node 1 free: 27657 MB
node 16 cpus: 96 104 112 120 128 136
node 16 size: 32704 MB
node 16 free: 25455 MB
node 17 cpus: 144 152 160 168 176 184
node 17 size: 32565 MB
node 17 free: 30030 MB
node distances:
node   0   1  16  17
  0:  10  20  40  40
  1:  20  10  40  40
 16:  40  40  10  20
 17:  40  40  20  10

If I understood it right, the patches removed the capability to allocate
memory from different NUMA nodes with the 'strict' setting, making the
guest fail to launch if the NUMA node does not have enough memory.
Unless I am getting something wrong, this guest shouldn't launch after
applying this patch (node 0 does not have 64 GiB available). But the
guest is launching as if nothing changed.

I'll dig into it further if I have the chance. I'm just curious whether
this is something that works differently with pseries guests.

Thanks,

DHB
Michal Prívozník (4):
  qemuSetupCpusetMems: Use VIR_AUTOFREE()
  qemuSetupCpusetMems: Create EMULATOR thread upfront
  qemu_cgroup: Make qemuSetupCpusetMems static
  qemuSetupCpusetCgroup: Set up cpuset.mems before execing qemu

 src/qemu/qemu_cgroup.c | 12 +++++++-----
 src/qemu/qemu_cgroup.h |  1 -
 2 files changed, 7 insertions(+), 6 deletions(-)

On 4/10/19 12:35 AM, Daniel Henrique Barboza wrote:
Hi,
On 4/9/19 11:10 AM, Michal Privoznik wrote:
If there's a domain configured as:
  <currentMemory unit='MiB'>4096</currentMemory>
  <numatune>
    <memory mode='strict' nodeset='1'/>
  </numatune>
but there is not enough memory on NUMA node 1, the domain will start
successfully because we allow it to allocate memory from other nodes.
This is a result of an earlier fix (v1.2.7-rc1~91). However, I've tested
my fix successfully on a NUMA machine with F29 and a recent kernel, so
the kernel bug I'm mentioning in 4/4 is probably fixed by now and we can
drop the workaround.
Out of curiosity, I've tested your patch set on a Power8 system to see
if I could spot a difference, but in my case it didn't change the
behavior.
I've tried a guest with the following numatune:
  <memory unit='KiB'>67108864</memory>
  <currentMemory unit='KiB'>67108864</currentMemory>
  <vcpu placement='static' current='4'>16</vcpu>
  <numatune>
    <memory mode='strict' nodeset='0'/>
  </numatune>
This is the numa setup of the host:
$ numactl -H
available: 4 nodes (0-1,16-17)
node 0 cpus: 0 8 16 24 32 40
node 0 size: 32606 MB
node 0 free: 24125 MB
node 1 cpus: 48 56 64 72 80 88
node 1 size: 32704 MB
node 1 free: 27657 MB
node 16 cpus: 96 104 112 120 128 136
node 16 size: 32704 MB
node 16 free: 25455 MB
node 17 cpus: 144 152 160 168 176 184
node 17 size: 32565 MB
node 17 free: 30030 MB
node distances:
node   0   1  16  17
  0:  10  20  40  40
  1:  20  10  40  40
 16:  40  40  10  20
 17:  40  40  20  10
If I understood it right, the patches removed the capability to allocate
memory from different NUMA nodes with the 'strict' setting, making the
guest fail to launch if the NUMA node does not have enough memory.
Unless I am getting something wrong, this guest shouldn't launch after
applying this patch (node 0 does not have 64 GiB available). But the
guest is launching as if nothing changed.
I'll dig into it further if I have the chance. I'm just curious whether
this is something that works differently with pseries guests.
Hey, firstly thanks for testing this! And yes, it shows a flaw in my
patches. The thing is, my patches set up emulator/cpuset.mems, but at
the time qemu is doing its allocation it is still living under the
top-level CGroup (which is left untouched). Will post v2!

Thanks,

Michal
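
One way to see the flaw Michal describes on a running system is to check
which cpuset cgroup the qemu process actually sits in, and which NUMA
nodes the kernel allows it to allocate from, at the moment it is placing
guest memory; both are visible in /proc. A small sketch follows — the PID
is a placeholder and the sample path in the comment is illustrative only.

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char line[512];
        FILE *fp;

        /* cgroup membership: the line for the cpuset controller,
         * e.g. "4:cpuset:/machine/qemu-1-demo" */
        if ((fp = fopen("/proc/1234/cgroup", "r"))) {
            while (fgets(line, sizeof(line), fp))
                if (strstr(line, "cpuset"))
                    fputs(line, stdout);
            fclose(fp);
        }

        /* effective NUMA node mask the kernel enforces for allocations */
        if ((fp = fopen("/proc/1234/status", "r"))) {
            while (fgets(line, sizeof(line), fp))
                if (strncmp(line, "Mems_allowed_list:", 18) == 0)
                    fputs(line, stdout);
            fclose(fp);
        }
        return 0;
    }

If the process is still in the top-level machine cgroup rather than the
emulator child, the mask written to emulator/cpuset.mems does not apply
to the allocation, which matches the behavior Daniel observed.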