On Fri, Apr 12, 2019 at 01:15:05PM +0200, Michal Privoznik wrote:
On 4/12/19 12:11 PM, Daniel Henrique Barboza wrote:
>
>
> On 4/12/19 6:10 AM, Michal Privoznik wrote:
> > On 4/11/19 7:29 PM, Daniel Henrique Barboza wrote:
> > >
> > >
> > > On 4/11/19 11:56 AM, Michal Privoznik wrote:
> > > > On 4/11/19 4:23 PM, Daniel Henrique Barboza wrote:
> > > > > Hi,
> > > > >
> > > > > I've tested these patches again, twice, in setups similar to the
> > > > > ones I used for the first version (first on a Power8, then on a
> > > > > Power9 server).
> > > > >
> > > > > Same results, though. Libvirt will not avoid the launch of a
> > > > > pseries guest with numanode=strict, even if the numa node does not
> > > > > have available RAM. If I stress test the memory of the guest to
> > > > > force the allocation, QEMU exits with an error as soon as the
> > > > > memory of the host numa node is exhausted.
> > > >
> > > > Yes, this is expected. I mean, by default qemu doesn't
> > > > allocate memory for the guest fully. You'd have to force it:
> > > >
> > > > <memoryBacking>
> > > > <allocation mode='immediate'/>
> > > > </memoryBacking>
> > > >
> > >
> > > Tried with this extra setting, still no good. The domain still boots,
> > > even if the NUMA node I am pinning it to does not have enough memory
> > > to hold all of its RAM. For reference, this is the top of the guest
> > > XML:
> > >
> > >
> > > <name>vm1</name>
> > > <uuid>f48e9e35-8406-4784-875f-5185cb4d47d7</uuid>
> > > <memory unit='KiB'>314572800</memory>
> > > <currentMemory unit='KiB'>314572800</currentMemory>
> > > <memoryBacking>
> > > <allocation mode='immediate'/>
> > > </memoryBacking>
> > > <vcpu placement='static'>16</vcpu>
> > > <numatune>
> > > <memory mode='strict' nodeset='0'/>
> > > </numatune>
> > > <os>
> > > <type arch='ppc64' machine='pseries'>hvm</type>
> > > <boot dev='hd'/>
> > > </os>
> > > <clock offset='utc'/>
> > >
> > > While doing this test, I recalled that some of my IBM peers recently
> > > mentioned that they were unable to do a pre-allocation of the RAM of
> > > a pseries guest using Libvirt, but they were able to do it using QEMU
> > > directly (using -realtime mlock=on). In fact, I just tried it out
> > > with command line QEMU and the guest allocated all the memory at boot.
> >
> > Ah, so it looks like -mem-prealloc doesn't work on Power? Can you
> > please check:
> >
> > 1) that -mem-prealloc is on the qemu command line
>
> Yes. This is the cmd line generated:
>
> /usr/bin/qemu-system-ppc64 \
> -name guest=vm1,debug-threads=on \
> -S \
> -object secret,id=masterKey0,format=raw,file=/home/user/usr/var/lib/libvirt/qemu/domain-2-vm1/master-key.aes \
> -machine pseries-2.11,accel=kvm,usb=off,dump-guest-core=off \
> -bios /home/user/boot_rom.bin \
> -m 307200 \
> -mem-prealloc \
> -realtime mlock=off \
This looks correct.
> -smp 16,sockets=16,cores=1,threads=1 \
> -uuid f48e9e35-8406-4784-875f-5185cb4d47d7 \
> -display none \
> -no-user-config \
> -nodefaults \
> -chardev socket,id=charmonitor,path=/home/user/usr/var/lib/libvirt/qemu/domain-2-vm1/monitor.sock,server,nowait \
> -mon chardev=charmonitor,id=monitor,mode=control \
> -rtc base=utc \
> -no-shutdown \
> -boot strict=on \
> -device spapr-pci-host-bridge,index=1,id=pci.1 \
> -device qemu-xhci,id=usb,bus=pci.0,addr=0x3 \
> -drive file=/home/user/nv2-vm1.qcow2,format=qcow2,if=none,id=drive-virtio-disk0 \
> -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
> -chardev pty,id=charserial0 \
> -device spapr-vty,chardev=charserial0,id=serial0,reg=0x30000000 \
> -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x1 \
> -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \
> -msg timestamp=on
>
>
>
>
> > 2) how much memory does qemu allocate right after it starts the guest?
> > I mean, before you run some memory stress test which causes it to
> > allocate the memory fully.
>
> It starts with 300GB. It depletes its assigned NUMA node (which has 256GB),
> then it takes ~70GB from another NUMA node to complete the 300GB.
Huh, then -mem-prealloc is working but something else is not. What strikes
me is that once the guest starts using the memory, the host kernel kills the
guest. So the host kernel knows about the limits we've set but doesn't
enforce them when allocating the memory.
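
For reference, with mode='strict' the enforcement ultimately comes from the
kernel (a bound memory policy and/or the cpuset cgroup set up for the QEMU
process). Below is a minimal standalone sketch, using libnuma and not
libvirt's actual code (the node number, sizes and file name are made up for
the test), of what a strict membind amounts to; with it in effect, touching
more anonymous memory than the node can supply should get the process killed
rather than let it spill onto another node:

    /*
     * Minimal sketch, not libvirt's actual code: bind all future
     * allocations of this process to one NUMA node (node 0 here,
     * purely as an example), then fault in anonymous memory.
     * With a strict (MPOL_BIND) policy the page faults should fail
     * once the node is exhausted instead of spilling elsewhere.
     *
     * Build: gcc -o membind-test membind-test.c -lnuma
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <numa.h>

    int main(int argc, char **argv)
    {
        size_t chunk = 1024UL * 1024 * 1024;      /* fault in 1 GiB at a time */
        int gib = argc > 1 ? atoi(argv[1]) : 4;   /* how many GiB to touch */
        struct bitmask *nodes;
        int i;

        if (numa_available() < 0) {
            fprintf(stderr, "libnuma: NUMA not available on this host\n");
            return 1;
        }

        nodes = numa_allocate_nodemask();
        numa_bitmask_setbit(nodes, 0);            /* node 0: adjust for your host */
        numa_set_bind_policy(1);                  /* strict, not preferred */
        numa_set_membind(nodes);
        numa_bitmask_free(nodes);

        for (i = 0; i < gib; i++) {
            char *p = mmap(NULL, chunk, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
            }
            memset(p, 0x55, chunk);               /* actually fault every page in */
            printf("touched %d GiB so far\n", i + 1);
        }
        return 0;
    }

Running something equivalent on the Power host, bound to the same nodeset as
the guest, would show whether the policy itself is honoured there; if it
happily crosses over onto another node, the problem is in how the binding is
applied rather than in -mem-prealloc.
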
The way QEMU implements -mem-prealloc is a bit of a hack.
Essentially it tries to write a single byte in each page of
memory, on the belief that this will cause the kernel to
allocate that page.
See do_touch_pages() in qemu's util/oslib-posix.c:
    for (i = 0; i < numpages; i++) {
        /*
         * Read & write back the same value, so we don't
         * corrupt existing user/app data that might be
         * stored.
         *
         * 'volatile' to stop compiler optimizing this away
         * to a no-op
         *
         * TODO: get a better solution from kernel so we
         * don't need to write at all so we don't cause
         * wear on the storage backing the region...
         */
        *(volatile char *)addr = *addr;
        addr += hpagesize;
    }
I wonder if the compiler on PPC is optimizing this in some
way that turns it into a no-op unexpectedly.
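
One way to rule the compiler in or out without rebuilding QEMU would be a
standalone reproduction of the same pattern. This is just a sketch, not
QEMU's code, with an arbitrary 512 MiB region and the host page size standing
in for hpagesize: it touches one byte per page exactly as above and prints
the RSS delta read from /proc/self/statm. Compiled with the same toolchain
and optimization level on the PPC host, no RSS growth after the loop would
mean the store really is being dropped; disassembling the loop (objdump -d)
would confirm it either way.

    /*
     * Standalone reproduction of the do_touch_pages() pattern (not
     * QEMU's code): mmap an anonymous region, touch one byte per page
     * through a volatile pointer, and report the process RSS before
     * and after.  If RSS does not grow, the touch was a no-op on this
     * compiler/arch.
     *
     * Build on the target host, e.g.:  gcc -O2 -o touchtest touchtest.c
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/mman.h>

    static long rss_pages(void)
    {
        long size = 0, rss = -1;
        FILE *fp = fopen("/proc/self/statm", "r");

        if (fp && fscanf(fp, "%ld %ld", &size, &rss) != 2)
            rss = -1;
        if (fp)
            fclose(fp);
        return rss;
    }

    int main(void)
    {
        size_t pagesize = sysconf(_SC_PAGESIZE);   /* 64 KiB on most ppc64 hosts */
        size_t len = 512UL * 1024 * 1024;          /* 512 MiB test region */
        size_t numpages = len / pagesize;
        size_t i;
        char *addr;
        long before, after;

        addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (addr == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        before = rss_pages();

        /* Same touch pattern as QEMU's do_touch_pages() */
        for (i = 0; i < numpages; i++) {
            *(volatile char *)addr = *addr;
            addr += pagesize;
        }

        after = rss_pages();
        printf("RSS before: %ld pages, after: %ld pages (expected delta ~%zu)\n",
               before, after, numpages);
        return 0;
    }
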
Regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|