[libvirt] [PATCH v2 0/2] qemu: Honor memory mode='strict'

v2 of: https://www.redhat.com/archives/libvir-list/2019-April/msg00658.html

diff to v1:
- Fixed the reported problem. Basically, even though the emulator CGroup was created, qemu was not running in it. Now qemu is moved into the CGroup even before exec().

Michal Prívozník (2):
  qemuSetupCpusetMems: Use VIR_AUTOFREE()
  qemu: Set up EMULATOR thread and cpuset.mems before exec()-ing qemu

 src/qemu/qemu_cgroup.c  |  5 ++---
 src/qemu/qemu_process.c | 12 ++++++++----
 2 files changed, 10 insertions(+), 7 deletions(-)

--
2.21.0

There is one string that VIR_AUTOFREE() can be used on.

Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
 src/qemu/qemu_cgroup.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c
index c23f0af2aa..689e0839cd 100644
--- a/src/qemu/qemu_cgroup.c
+++ b/src/qemu/qemu_cgroup.c
@@ -830,7 +830,7 @@ qemuSetupCpusetMems(virDomainObjPtr vm)
     virCgroupPtr cgroup_temp = NULL;
     qemuDomainObjPrivatePtr priv = vm->privateData;
     virDomainNumatuneMemMode mode;
-    char *mem_mask = NULL;
+    VIR_AUTOFREE(char *) mem_mask = NULL;
     int ret = -1;
 
     if (!virCgroupHasController(priv->cgroup, VIR_CGROUP_CONTROLLER_CPUSET))
@@ -843,7 +843,7 @@ qemuSetupCpusetMems(virDomainObjPtr vm)
     if (virDomainNumatuneMaybeFormatNodeset(vm->def->numa,
                                             priv->autoNodeset,
                                             &mem_mask, -1) < 0)
-        goto cleanup;
+        return -1;
 
     if (mem_mask)
         if (virCgroupNewThread(priv->cgroup, VIR_CGROUP_THREAD_EMULATOR, 0,
@@ -853,7 +853,6 @@
     ret = 0;
 cleanup:
-    VIR_FREE(mem_mask);
     virCgroupFree(&cgroup_temp);
     return ret;
 }
--
2.21.0

On Wed, Apr 10, 2019 at 06:10:43PM +0200, Michal Privoznik wrote:
There is one string that VIR_AUTOFREE() can be used on.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
 src/qemu/qemu_cgroup.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)
ACK, trivial

It's funny how this went unnoticed for such a long time. Long story short, if a domain is configured with VIR_DOMAIN_NUMATUNE_MEM_STRICT libvirt doesn't really honour that. This is because of 7e72ac787848, after which libvirt allowed qemu to allocate memory just anywhere and only after that used some magic involving cpuset.memory_migrate and cpuset.mems to move the memory to the desired NUMA nodes. This was done to work around a KVM bug where KVM would fail if there wasn't a DMA zone available on the NUMA node. Well, while the workaround may have stopped libvirt tickling the KVM bug, it also caused a bug on the libvirt side: if there is not enough memory on the configured NUMA node(s) then any attempt to start a domain should fail, but because of the way we play with guest memory, domains start just happily.

The solution is to move the child we've just forked into the emulator cgroup, set up cpuset.mems, and exec() qemu only after that.

Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
 src/qemu/qemu_process.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
index 47d8ca2ff1..076ec18e21 100644
--- a/src/qemu/qemu_process.c
+++ b/src/qemu/qemu_process.c
@@ -6653,6 +6653,14 @@ qemuProcessLaunch(virConnectPtr conn,
     if (qemuProcessInitCpuAffinity(vm) < 0)
         goto cleanup;
 
+    VIR_DEBUG("Setting emulator tuning/settings");
+    if (qemuProcessSetupEmulator(vm) < 0)
+        goto cleanup;
+
+    VIR_DEBUG("Setting up post-init cgroup restrictions");
+    if (qemuSetupCpusetMems(vm) < 0)
+        goto cleanup;
+
     VIR_DEBUG("Setting cgroup for external devices (if required)");
     if (qemuSetupCgroupForExtDevices(vm, driver) < 0)
         goto cleanup;
@@ -6744,10 +6752,6 @@ qemuProcessLaunch(virConnectPtr conn,
     if (qemuProcessDetectIOThreadPIDs(driver, vm, asyncJob) < 0)
         goto cleanup;
 
-    VIR_DEBUG("Setting emulator tuning/settings");
-    if (qemuProcessSetupEmulator(vm) < 0)
-        goto cleanup;
-
     VIR_DEBUG("Setting global CPU cgroup (if required)");
     if (qemuSetupGlobalCpuCgroup(vm) < 0)
         goto cleanup;
--
2.21.0

On Wed, Apr 10, 2019 at 06:10:44PM +0200, Michal Privoznik wrote:
It's funny how this went unnoticed for such a long time. Long story short, if a domain is configured with VIR_DOMAIN_NUMATUNE_MEM_STRICT libvirt doesn't really honour that. This is because of 7e72ac787848, after which libvirt allowed qemu to allocate memory just anywhere and only after that used some magic involving cpuset.memory_migrate and cpuset.mems to move the memory to the desired NUMA nodes. This was done to work around a KVM bug where KVM would fail if there wasn't a DMA zone available on the NUMA node. Well, while the workaround may have stopped libvirt tickling the KVM bug, it also caused a bug on the libvirt side: if there is not enough memory on the configured NUMA node(s) then any attempt to start a domain should fail, but because of the way we play with guest memory, domains start just happily.
The solution is to move the child we've just forked into the emulator cgroup, set up cpuset.mems, and exec() qemu only after that.
So you are saying this was a bug in KVM? Is it fixed now? I am not against this patch; I hated that I had to do the workaround, but I just want to be sure we won't start hitting that again.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
 src/qemu/qemu_process.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
index 47d8ca2ff1..076ec18e21 100644
--- a/src/qemu/qemu_process.c
+++ b/src/qemu/qemu_process.c
@@ -6653,6 +6653,14 @@ qemuProcessLaunch(virConnectPtr conn,
     if (qemuProcessInitCpuAffinity(vm) < 0)
         goto cleanup;
+    VIR_DEBUG("Setting emulator tuning/settings");
+    if (qemuProcessSetupEmulator(vm) < 0)
+        goto cleanup;
+
+    VIR_DEBUG("Setting up post-init cgroup restrictions");
This is not post-init any more, but more importantly,
+    if (qemuSetupCpusetMems(vm) < 0)
This function does a subset of what qemuProcessSetupEmulator(), called right before it, does, so I see no reason for it to be called here, or to keep existing in the codebase for that matter.
+        goto cleanup;
+
     VIR_DEBUG("Setting cgroup for external devices (if required)");
     if (qemuSetupCgroupForExtDevices(vm, driver) < 0)
         goto cleanup;
@@ -6744,10 +6752,6 @@ qemuProcessLaunch(virConnectPtr conn,
     if (qemuProcessDetectIOThreadPIDs(driver, vm, asyncJob) < 0)
         goto cleanup;
-    VIR_DEBUG("Setting emulator tuning/settings");
-    if (qemuProcessSetupEmulator(vm) < 0)
-        goto cleanup;
-
     VIR_DEBUG("Setting global CPU cgroup (if required)");
     if (qemuSetupGlobalCpuCgroup(vm) < 0)
         goto cleanup;
--
2.21.0
--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list

On 4/15/19 3:47 PM, Martin Kletzander wrote:
On Wed, Apr 10, 2019 at 06:10:44PM +0200, Michal Privoznik wrote:
It's funny how this went unnoticed for such a long time. Long story short, if a domain is configured with VIR_DOMAIN_NUMATUNE_MEM_STRICT libvirt doesn't really honour that. This is because of 7e72ac787848, after which libvirt allowed qemu to allocate memory just anywhere and only after that used some magic involving cpuset.memory_migrate and cpuset.mems to move the memory to the desired NUMA nodes. This was done to work around a KVM bug where KVM would fail if there wasn't a DMA zone available on the NUMA node. Well, while the workaround may have stopped libvirt tickling the KVM bug, it also caused a bug on the libvirt side: if there is not enough memory on the configured NUMA node(s) then any attempt to start a domain should fail, but because of the way we play with guest memory, domains start just happily.
The solution is to move the child we've just forked into the emulator cgroup, set up cpuset.mems, and exec() qemu only after that.
So you are saying this was a bug in KVM? Is it fixed now? I am not against this patch; I hated that I had to do the workaround, but I just want to be sure we won't start hitting that again.
Yes, that's what I'm saying. It looks like the KVM bug is fixed now, because with Fedora 29 on a NUMA machine I can start domains just fine.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
 src/qemu/qemu_process.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
index 47d8ca2ff1..076ec18e21 100644
--- a/src/qemu/qemu_process.c
+++ b/src/qemu/qemu_process.c
@@ -6653,6 +6653,14 @@ qemuProcessLaunch(virConnectPtr conn,
     if (qemuProcessInitCpuAffinity(vm) < 0)
         goto cleanup;
+    VIR_DEBUG("Setting emulator tuning/settings");
+    if (qemuProcessSetupEmulator(vm) < 0)
+        goto cleanup;
+
+    VIR_DEBUG("Setting up post-init cgroup restrictions");
This is not post-init any more, but more importantly,
+    if (qemuSetupCpusetMems(vm) < 0)
This function does a subset of what qemuProcessSetupEmulator(), called right before it, does, so I see no reason for it to be called here, or to keep existing in the codebase for that matter.
Ah, good point. I'll send v3.

Michal

On Mon, Apr 15, 2019 at 06:32:32PM +0200, Michal Privoznik wrote:
On 4/15/19 3:47 PM, Martin Kletzander wrote:
On Wed, Apr 10, 2019 at 06:10:44PM +0200, Michal Privoznik wrote:
It's funny how this went unnoticed for such a long time. Long story short, if a domain is configured with VIR_DOMAIN_NUMATUNE_MEM_STRICT libvirt doesn't really honour that. This is because of 7e72ac787848, after which libvirt allowed qemu to allocate memory just anywhere and only after that used some magic involving cpuset.memory_migrate and cpuset.mems to move the memory to the desired NUMA nodes. This was done to work around a KVM bug where KVM would fail if there wasn't a DMA zone available on the NUMA node. Well, while the workaround may have stopped libvirt tickling the KVM bug, it also caused a bug on the libvirt side: if there is not enough memory on the configured NUMA node(s) then any attempt to start a domain should fail, but because of the way we play with guest memory, domains start just happily.
The solution is to move the child we've just forked into the emulator cgroup, set up cpuset.mems, and exec() qemu only after that.
So you are saying this was a bug in KVM? Is it fixed now? I am not against this patch, I hated that I had to do the workaround, but I just want to be sure we won't start hitting that again.
Yes, that's what I'm saying. It looks like the KVM bug is fixed now, because with Fedora 29 on a NUMA machine I can start domains just fine.
What I was saying was that it would be nice to have some proof for this instead of guesswork. I, however, acknowledge that this might not be easy, or even possible (the first patch that introduced the need for the initial workaround was not pinpointed, at least not to my knowledge).

Just make sure that when checking for this, you strictly require all the allocations to be done from a node not mentioned in the output of:

  cat /proc/zoneinfo | grep DMA

and also that you use multiple vCPUs. If you can also hotplug an extra vCPU later on, then the test is perfect enough for me to justify this change [1].

Martin

[1] If you feel like looking up (bisecting) the kernel commit that fixed this, I'm _not_ standing in your way ;)

Hi,

I've tested these patches again, twice, in similar setups to those where I tested the first version (first on a Power8, then on a Power9 server).

Same results, though. Libvirt will not avoid the launch of a pseries guest with numanode=strict, even if the NUMA node does not have available RAM. If I stress test the memory of the guest to force the allocation, QEMU exits with an error as soon as the memory of the host NUMA node is exhausted.

If I change the numanode setting to 'preferred' and repeat the test, QEMU doesn't exit with an error - the process starts to take memory from other NUMA nodes. This indicates that the numanode policy is apparently being enforced on the QEMU process - however, it is not enforced at VM boot.

I've debugged it a little and haven't found anything wrong that jumps out. All functions that follow qemuSetupCpusetMems exit with ret = 0. Unfortunately, I don't have access to an x86 server with more than one NUMA node to compare results.

Since I can't say for sure whether what I'm seeing is exclusively pseries behavior, I see no problem in pushing this series upstream if it makes sense for x86. We can debug/fix the Power side later.

Thanks,

DHB

On 4/10/19 1:10 PM, Michal Privoznik wrote:
v2 of:
https://www.redhat.com/archives/libvir-list/2019-April/msg00658.html
diff to v1:
- Fixed the reported problem. Basically, even though the emulator CGroup was created, qemu was not running in it. Now qemu is moved into the CGroup even before exec().
Michal Prívozník (2):
  qemuSetupCpusetMems: Use VIR_AUTOFREE()
  qemu: Set up EMULATOR thread and cpuset.mems before exec()-ing qemu
 src/qemu/qemu_cgroup.c  |  5 ++---
 src/qemu/qemu_process.c | 12 ++++++++----
 2 files changed, 10 insertions(+), 7 deletions(-)

On 4/11/19 4:23 PM, Daniel Henrique Barboza wrote:
Hi,
I've tested these patches again, twice, in similar setups to those where I tested the first version (first on a Power8, then on a Power9 server).
Same results, though. Libvirt will not avoid the launch of a pseries guest with numanode=strict, even if the NUMA node does not have available RAM. If I stress test the memory of the guest to force the allocation, QEMU exits with an error as soon as the memory of the host NUMA node is exhausted.
Yes, this is expected. I mean, by default qemu doesn't allocate memory for the guest fully. You'd have to force it:

  <memoryBacking>
    <allocation mode='immediate'/>
  </memoryBacking>
If I change the numanode setting to 'preferred' and repeat the test, QEMU doesn't exit with an error - the process starts to take memory from other NUMA nodes. This indicates that the numanode policy is apparently being enforced on the QEMU process - however, it is not enforced at VM boot.
I've debugged it a little and haven't found anything wrong that jumps out. All functions that follow qemuSetupCpusetMems exit with ret = 0. Unfortunately, I don't have access to an x86 server with more than one NUMA node to compare results.
Since I can't say for sure whether what I'm seeing is exclusively pseries behavior, I see no problem in pushing this series upstream if it makes sense for x86. We can debug/fix the Power side later.
I bet that if you force the allocation then the domain will be unable to boot.

Thanks for the testing!

Michal

On 4/11/19 11:56 AM, Michal Privoznik wrote:
On 4/11/19 4:23 PM, Daniel Henrique Barboza wrote:
Hi,
I've tested these patches again, twice, in similar setups to those where I tested the first version (first on a Power8, then on a Power9 server).
Same results, though. Libvirt will not avoid the launch of a pseries guest with numanode=strict, even if the NUMA node does not have available RAM. If I stress test the memory of the guest to force the allocation, QEMU exits with an error as soon as the memory of the host NUMA node is exhausted.
Yes, this is expected. I mean, by default qemu doesn't allocate memory for the guest fully. You'd have to force it:
  <memoryBacking>
    <allocation mode='immediate'/>
  </memoryBacking>
Tried with this extra setting, still no good. The domain still boots, even if there is not enough memory to load all of its RAM on the NUMA node I am setting. For reference, this is the top of the guest XML:

  <name>vm1</name>
  <uuid>f48e9e35-8406-4784-875f-5185cb4d47d7</uuid>
  <memory unit='KiB'>314572800</memory>
  <currentMemory unit='KiB'>314572800</currentMemory>
  <memoryBacking>
    <allocation mode='immediate'/>
  </memoryBacking>
  <vcpu placement='static'>16</vcpu>
  <numatune>
    <memory mode='strict' nodeset='0'/>
  </numatune>
  <os>
    <type arch='ppc64' machine='pseries'>hvm</type>
    <boot dev='hd'/>
  </os>
  <clock offset='utc'/>

While doing this test, I recalled that some of my IBM peers recently mentioned that they were unable to do a pre-allocation of the RAM of a pseries guest using Libvirt, but they were able to do it using QEMU directly (using -realtime mlock=on). In fact, I just tried it out with command line QEMU and the guest allocated all the memory at boot.

This means that the pseries guest is able to do mem pre-alloc. I'd say that there might be something missing somewhere (XML, host setup, libvirt config ...) or perhaps even a bug that is preventing Libvirt from doing this pre-alloc. This explains why I can't verify this patch series. I'll see if I can dig into it further when I have the time.

Thanks,

DHB
If I change the numanode setting to 'preferred' and repeat the test, QEMU doesn't exit with an error - the process starts to take memory from other NUMA nodes. This indicates that the numanode policy is apparently being enforced on the QEMU process - however, it is not enforced at VM boot.
I've debugged it a little and haven't found anything wrong that jumps out. All functions that follow qemuSetupCpusetMems exit with ret = 0. Unfortunately, I don't have access to an x86 server with more than one NUMA node to compare results.
Since I can't say for sure whether what I'm seeing is exclusively pseries behavior, I see no problem in pushing this series upstream if it makes sense for x86. We can debug/fix the Power side later.
I bet that if you force the allocation then the domain will be unable to boot.
Thanks for the testing!
Michal

On 4/11/19 7:29 PM, Daniel Henrique Barboza wrote:
On 4/11/19 11:56 AM, Michal Privoznik wrote:
On 4/11/19 4:23 PM, Daniel Henrique Barboza wrote:
Hi,
I've tested these patches again, twice, in similar setups to those where I tested the first version (first on a Power8, then on a Power9 server).
Same results, though. Libvirt will not avoid the launch of a pseries guest with numanode=strict, even if the NUMA node does not have available RAM. If I stress test the memory of the guest to force the allocation, QEMU exits with an error as soon as the memory of the host NUMA node is exhausted.
Yes, this is expected. I mean, by default qemu doesn't allocate memory for the guest fully. You'd have to force it:
  <memoryBacking>
    <allocation mode='immediate'/>
  </memoryBacking>
Tried with this extra setting, still no good. The domain still boots, even if there is not enough memory to load all of its RAM on the NUMA node I am setting. For reference, this is the top of the guest XML:
  <name>vm1</name>
  <uuid>f48e9e35-8406-4784-875f-5185cb4d47d7</uuid>
  <memory unit='KiB'>314572800</memory>
  <currentMemory unit='KiB'>314572800</currentMemory>
  <memoryBacking>
    <allocation mode='immediate'/>
  </memoryBacking>
  <vcpu placement='static'>16</vcpu>
  <numatune>
    <memory mode='strict' nodeset='0'/>
  </numatune>
  <os>
    <type arch='ppc64' machine='pseries'>hvm</type>
    <boot dev='hd'/>
  </os>
  <clock offset='utc'/>
While doing this test, I recalled that some of my IBM peers recently mentioned that they were unable to do a pre-allocation of the RAM of a pseries guest using Libvirt, but they were able to do it using QEMU directly (using -realtime mlock=on). In fact, I just tried it out with command line QEMU and the guest allocated all the memory at boot.
Ah, so it looks like -mem-prealloc doesn't work on Power? Can you please check:

1) that -mem-prealloc is on the qemu command line

2) how much memory qemu allocates right after it starts the guest? I mean, before you start some mem stress test which causes it to allocate the memory fully.
This means that the pseries guest is able to do mem pre-alloc. I'd say that there might be something missing somewhere (XML, host setup, libvirt config ...) or perhaps even a bug that is preventing Libvirt from doing this pre-alloc. This explains why I can't verify this patch series. I'll see if I can dig into it further when I have the time.
Yeah, I don't know Power well enough to help you. Sorry.

Michal

On 4/12/19 6:10 AM, Michal Privoznik wrote:
On 4/11/19 7:29 PM, Daniel Henrique Barboza wrote:
On 4/11/19 11:56 AM, Michal Privoznik wrote:
On 4/11/19 4:23 PM, Daniel Henrique Barboza wrote:
Hi,
I've tested these patches again, twice, in similar setups like I tested the first version (first in a Power8, then in a Power9 server).
Same results, though. Libvirt will not avoid the launch of a pseries guest, with numanode=strict, even if the numa node does not have available RAM. If I stress test the memory of the guest to force the allocation, QEMU exits with an error as soon as the memory of the host numa node is exhausted.
Yes, this is expected. I mean, by default qemu doesn't allocate memory for the guest fully. You'd have to force it:
  <memoryBacking>
    <allocation mode='immediate'/>
  </memoryBacking>
Tried with this extra setting, still no good. Domain still boots, even if there is not enough memory to load up all its ram in the NUMA node I am setting. For reference, this is the top of the guest XML:
  <name>vm1</name>
  <uuid>f48e9e35-8406-4784-875f-5185cb4d47d7</uuid>
  <memory unit='KiB'>314572800</memory>
  <currentMemory unit='KiB'>314572800</currentMemory>
  <memoryBacking>
    <allocation mode='immediate'/>
  </memoryBacking>
  <vcpu placement='static'>16</vcpu>
  <numatune>
    <memory mode='strict' nodeset='0'/>
  </numatune>
  <os>
    <type arch='ppc64' machine='pseries'>hvm</type>
    <boot dev='hd'/>
  </os>
  <clock offset='utc'/>
While doing this test, I recalled that some of my IBM peers recently mentioned that they were unable to do a pre-allocation of the RAM of a pseries guest using Libvirt, but they were able to do it using QEMU directly (using -realtime mlock=on). In fact, I just tried it out with command line QEMU and the guest allocated all the memory at boot.
Ah, so looks like -mem-prealloc doesn't work at Power? Can you please check:
1) that -mem-prealloc is on the qemu command line
Yes. This is the cmd line generated:

/usr/bin/qemu-system-ppc64 \
-name guest=vm1,debug-threads=on \
-S \
-object secret,id=masterKey0,format=raw,file=/home/user/usr/var/lib/libvirt/qemu/domain-2-vm1/master-key.aes \
-machine pseries-2.11,accel=kvm,usb=off,dump-guest-core=off \
-bios /home/user/boot_rom.bin \
-m 307200 \
-mem-prealloc \
-realtime mlock=off \
-smp 16,sockets=16,cores=1,threads=1 \
-uuid f48e9e35-8406-4784-875f-5185cb4d47d7 \
-display none \
-no-user-config \
-nodefaults \
-chardev socket,id=charmonitor,path=/home/user/usr/var/lib/libvirt/qemu/domain-2-vm1/monitor.sock,server,nowait \
-mon chardev=charmonitor,id=monitor,mode=control \
-rtc base=utc \
-no-shutdown \
-boot strict=on \
-device spapr-pci-host-bridge,index=1,id=pci.1 \
-device qemu-xhci,id=usb,bus=pci.0,addr=0x3 \
-drive file=/home/user/nv2-vm1.qcow2,format=qcow2,if=none,id=drive-virtio-disk0 \
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
-chardev pty,id=charserial0 \
-device spapr-vty,chardev=charserial0,id=serial0,reg=0x30000000 \
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x1 \
-sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \
-msg timestamp=on
2) how much memory qemu allocates right after it started the guest? I mean, before you start some mem stress test which causes it to allocate the memory fully.
It starts with 300Gb. It depletes its assigned NUMA node (that has 256Gb), then it takes ~70Gb from another NUMA node to complete the 300Gb.
This means that the pseries guest is able to do mem pre-alloc. I'd say that there might be something missing somewhere (XML, host setup, libvirt config ...) or perhaps even a bug that is preventing Libvirt from doing this pre-alloc. This explains why I can't verify this patch series. I'll see if I dig it further to understand why when I have the time.
Yeah, I don't know Power well enough to help you. Sorry.
No problem. One question: Libvirt is supposed to let the VM do the full allocation of its RAM using -mem-prealloc and with -realtime mlock=off, is that correct?

Thanks,

DHB
Michal

On 4/12/19 12:11 PM, Daniel Henrique Barboza wrote:
On 4/12/19 6:10 AM, Michal Privoznik wrote:
On 4/11/19 7:29 PM, Daniel Henrique Barboza wrote:
On 4/11/19 11:56 AM, Michal Privoznik wrote:
On 4/11/19 4:23 PM, Daniel Henrique Barboza wrote:
Hi,
I've tested these patches again, twice, in similar setups like I tested the first version (first in a Power8, then in a Power9 server).
Same results, though. Libvirt will not avoid the launch of a pseries guest, with numanode=strict, even if the numa node does not have available RAM. If I stress test the memory of the guest to force the allocation, QEMU exits with an error as soon as the memory of the host numa node is exhausted.
Yes, this is expected. I mean, by default qemu doesn't allocate memory for the guest fully. You'd have to force it:
  <memoryBacking>
    <allocation mode='immediate'/>
  </memoryBacking>
Tried with this extra setting, still no good. Domain still boots, even if there is not enough memory to load up all its ram in the NUMA node I am setting. For reference, this is the top of the guest XML:
  <name>vm1</name>
  <uuid>f48e9e35-8406-4784-875f-5185cb4d47d7</uuid>
  <memory unit='KiB'>314572800</memory>
  <currentMemory unit='KiB'>314572800</currentMemory>
  <memoryBacking>
    <allocation mode='immediate'/>
  </memoryBacking>
  <vcpu placement='static'>16</vcpu>
  <numatune>
    <memory mode='strict' nodeset='0'/>
  </numatune>
  <os>
    <type arch='ppc64' machine='pseries'>hvm</type>
    <boot dev='hd'/>
  </os>
  <clock offset='utc'/>
While doing this test, I recalled that some of my IBM peers recently mentioned that they were unable to do a pre-allocation of the RAM of a pseries guest using Libvirt, but they were able to do it using QEMU directly (using -realtime mlock=on). In fact, I just tried it out with command line QEMU and the guest allocated all the memory at boot.
Ah, so looks like -mem-prealloc doesn't work at Power? Can you please check:
1) that -mem-prealloc is on the qemu command line
Yes. This is the cmd line generated:
/usr/bin/qemu-system-ppc64 \
-name guest=vm1,debug-threads=on \
-S \
-object secret,id=masterKey0,format=raw,file=/home/user/usr/var/lib/libvirt/qemu/domain-2-vm1/master-key.aes \
-machine pseries-2.11,accel=kvm,usb=off,dump-guest-core=off \
-bios /home/user/boot_rom.bin \
-m 307200 \
-mem-prealloc \
-realtime mlock=off \
This looks correct.
-smp 16,sockets=16,cores=1,threads=1 \
-uuid f48e9e35-8406-4784-875f-5185cb4d47d7 \
-display none \
-no-user-config \
-nodefaults \
-chardev socket,id=charmonitor,path=/home/user/usr/var/lib/libvirt/qemu/domain-2-vm1/monitor.sock,server,nowait \
-mon chardev=charmonitor,id=monitor,mode=control \
-rtc base=utc \
-no-shutdown \
-boot strict=on \
-device spapr-pci-host-bridge,index=1,id=pci.1 \
-device qemu-xhci,id=usb,bus=pci.0,addr=0x3 \
-drive file=/home/user/nv2-vm1.qcow2,format=qcow2,if=none,id=drive-virtio-disk0 \
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
-chardev pty,id=charserial0 \
-device spapr-vty,chardev=charserial0,id=serial0,reg=0x30000000 \
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x1 \
-sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \
-msg timestamp=on
2) how much memory qemu allocates right after it started the guest? I mean, before you start some mem stress test which causes it to allocate the memory fully.
It starts with 300Gb. It depletes its assigned NUMA node (that has 256Gb), then it takes ~70Gb from another NUMA node to complete the 300Gb.
Huh, then -mem-prealloc is working but something else is not. What strikes me is that once the guest starts using the memory, the host kernel kills the guest. So the host kernel knows about the limits we've set but doesn't enforce them when allocating the memory.
This means that the pseries guest is able to do mem pre-alloc. I'd say that there might be something missing somewhere (XML, host setup, libvirt config ...) or perhaps even a bug that is preventing Libvirt from doing this pre-alloc. This explains why I can't verify this patch series. I'll see if I dig it further to understand why when I have the time.
Yeah, I don't know Power well enough to help you. Sorry.
No problem. One question: Libvirt is supposed to let the VM do the full allocation of its RAM using -mem-prealloc and with -realtime mlock=off, is that correct?
-mem-prealloc should be enough. -realtime mlock is there to lock the allocated memory so that it doesn't get swapped out. You can enable memory locking via:

  <memoryBacking>
    <locked/>
  </memoryBacking>

Michal

On Fri, Apr 12, 2019 at 01:15:05PM +0200, Michal Privoznik wrote:
On 4/12/19 12:11 PM, Daniel Henrique Barboza wrote:
On 4/12/19 6:10 AM, Michal Privoznik wrote:
On 4/11/19 7:29 PM, Daniel Henrique Barboza wrote:
On 4/11/19 11:56 AM, Michal Privoznik wrote:
On 4/11/19 4:23 PM, Daniel Henrique Barboza wrote:
Hi,
I've tested these patches again, twice, in similar setups like I tested the first version (first in a Power8, then in a Power9 server).
Same results, though. Libvirt will not avoid the launch of a pseries guest, with numanode=strict, even if the numa node does not have available RAM. If I stress test the memory of the guest to force the allocation, QEMU exits with an error as soon as the memory of the host numa node is exhausted.
Yes, this is expected. I mean, by default qemu doesn't allocate memory for the guest fully. You'd have to force it:
  <memoryBacking>
    <allocation mode='immediate'/>
  </memoryBacking>
Tried with this extra setting, still no good. Domain still boots, even if there is not enough memory to load up all its ram in the NUMA node I am setting. For reference, this is the top of the guest XML:
  <name>vm1</name>
  <uuid>f48e9e35-8406-4784-875f-5185cb4d47d7</uuid>
  <memory unit='KiB'>314572800</memory>
  <currentMemory unit='KiB'>314572800</currentMemory>
  <memoryBacking>
    <allocation mode='immediate'/>
  </memoryBacking>
  <vcpu placement='static'>16</vcpu>
  <numatune>
    <memory mode='strict' nodeset='0'/>
  </numatune>
  <os>
    <type arch='ppc64' machine='pseries'>hvm</type>
    <boot dev='hd'/>
  </os>
  <clock offset='utc'/>
While doing this test, I recalled that some of my IBM peers recently mentioned that they were unable to do a pre-allocation of the RAM of a pseries guest using Libvirt, but they were able to do it using QEMU directly (using -realtime mlock=on). In fact, I just tried it out with command line QEMU and the guest allocated all the memory at boot.
Ah, so looks like -mem-prealloc doesn't work at Power? Can you please check:
1) that -mem-prealloc is on the qemu command line
Yes. This is the cmd line generated:
/usr/bin/qemu-system-ppc64 \
-name guest=vm1,debug-threads=on \
-S \
-object secret,id=masterKey0,format=raw,file=/home/user/usr/var/lib/libvirt/qemu/domain-2-vm1/master-key.aes \
-machine pseries-2.11,accel=kvm,usb=off,dump-guest-core=off \
-bios /home/user/boot_rom.bin \
-m 307200 \
-mem-prealloc \
-realtime mlock=off \
This looks correct.
-smp 16,sockets=16,cores=1,threads=1 \
-uuid f48e9e35-8406-4784-875f-5185cb4d47d7 \
-display none \
-no-user-config \
-nodefaults \
-chardev socket,id=charmonitor,path=/home/user/usr/var/lib/libvirt/qemu/domain-2-vm1/monitor.sock,server,nowait \
-mon chardev=charmonitor,id=monitor,mode=control \
-rtc base=utc \
-no-shutdown \
-boot strict=on \
-device spapr-pci-host-bridge,index=1,id=pci.1 \
-device qemu-xhci,id=usb,bus=pci.0,addr=0x3 \
-drive file=/home/user/nv2-vm1.qcow2,format=qcow2,if=none,id=drive-virtio-disk0 \
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
-chardev pty,id=charserial0 \
-device spapr-vty,chardev=charserial0,id=serial0,reg=0x30000000 \
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x1 \
-sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \
-msg timestamp=on
2) how much memory qemu allocates right after it started the guest? I mean, before you start some mem stress test which causes it to allocate the memory fully.
It starts with 300Gb. It depletes its assigned NUMA node (that has 256Gb), then it takes ~70Gb from another NUMA node to complete the 300Gb.
Huh, then -mem-prealloc is working but something else is not. What strikes me is that once the guest starts using the memory, the host kernel kills the guest. So the host kernel knows about the limits we've set but doesn't enforce them when allocating the memory.
The way QEMU implements -mem-prealloc is a bit of a hack. Essentially it tries to write a single byte in each page of memory, on the belief that this will cause the kernel to allocate that page. See do_touch_pages() in qemu's util/oslib-posix.c:

    for (i = 0; i < numpages; i++) {
        /*
         * Read & write back the same value, so we don't
         * corrupt existing user/app data that might be
         * stored.
         *
         * 'volatile' to stop compiler optimizing this away
         * to a no-op
         *
         * TODO: get a better solution from kernel so we
         * don't need to write at all so we don't cause
         * wear on the storage backing the region...
         */
        *(volatile char *)addr = *addr;
        addr += hpagesize;
    }

I wonder if the compiler on PPC is optimizing this in some way that turns it into a no-op unexpectedly.

Regards,
Daniel

--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
participants (4):
- Daniel Henrique Barboza
- Daniel P. Berrangé
- Martin Kletzander
- Michal Privoznik