On 09/17/2018 04:59 PM, Lukas Hejtmanek wrote:
> Hello,
>
> so the current domain configuration:
>
>   <cpu mode='host-passthrough'>
>     <topology sockets='8' cores='4' threads='1'/>
>     <numa>
>       <cell cpus='0-3' memory='62000000'/>
>       <cell cpus='4-7' memory='62000000'/>
>       <cell cpus='8-11' memory='62000000'/>
>       <cell cpus='12-15' memory='62000000'/>
>       <cell cpus='16-19' memory='62000000'/>
>       <cell cpus='20-23' memory='62000000'/>
>       <cell cpus='24-27' memory='62000000'/>
>       <cell cpus='28-31' memory='62000000'/>
>     </numa>
>   </cpu>
>   <cputune>
>     <vcpupin vcpu='0' cpuset='0'/>
>     <vcpupin vcpu='1' cpuset='1'/>
>     <vcpupin vcpu='2' cpuset='2'/>
>     <vcpupin vcpu='3' cpuset='3'/>
>     <vcpupin vcpu='4' cpuset='4'/>
>     <vcpupin vcpu='5' cpuset='5'/>
>     <vcpupin vcpu='6' cpuset='6'/>
>     <vcpupin vcpu='7' cpuset='7'/>
>     <vcpupin vcpu='8' cpuset='8'/>
>     <vcpupin vcpu='9' cpuset='9'/>
>     <vcpupin vcpu='10' cpuset='10'/>
>     <vcpupin vcpu='11' cpuset='11'/>
>     <vcpupin vcpu='12' cpuset='12'/>
>     <vcpupin vcpu='13' cpuset='13'/>
>     <vcpupin vcpu='14' cpuset='14'/>
>     <vcpupin vcpu='15' cpuset='15'/>
>     <vcpupin vcpu='16' cpuset='16'/>
>     <vcpupin vcpu='17' cpuset='17'/>
>     <vcpupin vcpu='18' cpuset='18'/>
>     <vcpupin vcpu='19' cpuset='19'/>
>     <vcpupin vcpu='20' cpuset='20'/>
>     <vcpupin vcpu='21' cpuset='21'/>
>     <vcpupin vcpu='22' cpuset='22'/>
>     <vcpupin vcpu='23' cpuset='23'/>
>     <vcpupin vcpu='24' cpuset='24'/>
>     <vcpupin vcpu='25' cpuset='25'/>
>     <vcpupin vcpu='26' cpuset='26'/>
>     <vcpupin vcpu='27' cpuset='27'/>
>     <vcpupin vcpu='28' cpuset='28'/>
>     <vcpupin vcpu='29' cpuset='29'/>
>     <vcpupin vcpu='30' cpuset='30'/>
>     <vcpupin vcpu='31' cpuset='31'/>
>   </cputune>
>   <numatune>
>     <memnode cellid="0" mode="strict" nodeset="0"/>
>     <memnode cellid="1" mode="strict" nodeset="1"/>
>     <memnode cellid="2" mode="strict" nodeset="2"/>
>     <memnode cellid="3" mode="strict" nodeset="3"/>
>     <memnode cellid="4" mode="strict" nodeset="4"/>
>     <memnode cellid="5" mode="strict" nodeset="5"/>
>     <memnode cellid="6" mode="strict" nodeset="6"/>
>     <memnode cellid="7" mode="strict" nodeset="7"/>
>   </numatune>
>
> Hopefully I got it right.
Yes, looking good.
> Good news is that the SPEC benchmark looks promising. The first test,
> bwaves, finished in 1003 seconds, compared to 1700 seconds in the previous,
> wrong configuration. So far so good.
Very well, this means that the config above is correct.
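
To double-check the pinning and memory placement at runtime, something like
this should do ('fedora' stands in for your actual domain name, and the
pgrep pattern assumes a single qemu process on the host):

  # vCPU-to-pCPU pinning as reported by libvirt
  virsh vcpupin fedora

  # per-NUMA-node memory usage of the qemu process
  numastat -p $(pgrep -f qemu-system-x86_64)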
> Bad news is that iozone is still the same. There might be some
> misunderstanding.
>
> I have two cases:
>
> 1) cache=unsafe. In this case I can see that the hypervisor is prone to
> swap, and swap a lot. It usually eats the whole swap partition and kswapd
> runs at 100% CPU. Tuning swappiness, dirty_ratio and company does not
> improve things at all. However, I believe this is simply the wrong option
> for scratch disks where one can expect huge I/O load. Moreover, the
> hypervisor is a poor machine with only little memory left (OK, in my case
> about 10GB available), so it does not make sense to use that memory for
> additional cache/disk buffers.
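
For reference, the tunables mentioned above live in the vm.* sysctl
namespace and can be inspected and adjusted on the host like this (the
values below are purely illustrative, not a recommendation):

  # inspect the current settings
  sysctl vm.swappiness vm.dirty_ratio vm.dirty_background_ratio

  # example: make the host less eager to swap and start writeback earlier
  sysctl -w vm.swappiness=10
  sysctl -w vm.dirty_background_ratio=5
  sysctl -w vm.dirty_ratio=10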
One thing that just occurred to me - is the qcow2 file fully allocated?

  # qemu-img info /var/lib/libvirt/images/fedora.qcow2
  ..
  virtual size: 20G (21474836480 bytes)
  disk size: 7.0G
  ..

This is NOT a fully allocated qcow2.
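
If the image on your host is sparse too, one way to rule out allocation
overhead is to preallocate it fully; a sketch (the paths and the 200G size
are examples, and the conversion needs enough free space for the copy):

  # create a new, fully preallocated qcow2
  qemu-img create -f qcow2 -o preallocation=full \
      /var/lib/libvirt/images/scratch.qcow2 200G

  # or convert the existing image into a fully preallocated copy
  qemu-img convert -O qcow2 -o preallocation=full \
      /var/lib/libvirt/images/fedora.qcow2 \
      /var/lib/libvirt/images/fedora-full.qcow2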
> 2) cache=none. In this case performance is better (only a few percent
> behind bare metal). However, as soon as the amount of stored data
> approaches the memory size of the virtual machine, writes stop and iozone
> eats a whole CPU; it looks like it is searching for free pages and that
> gets harder and harder. But I am not sure, I am not skilled in this area.
Hmm. Could it be that the SSD doesn't have enough free blocks and thus
writes are throttled? Can you fstrim it and see if that helps?
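
Something along these lines, on the host (the mount point is an example;
check discard support first):

  # does the device advertise discard support?
  lsblk --discard

  # trim the filesystem holding the disk images
  fstrim -v /var/lib/libvirt/images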
> Here you can clearly see that it starts writing, does the writes, then
> takes a pause, writes again, and so on, but the pauses get longer and
> longer:
>
> https://pastebin.com/2gfPFgb9
>
> The output runs until the very end of iozone (I cancelled it with Ctrl-C).
> It seems that this is not happening on the 2-NUMA-node machine with
> rotational disks only. It is partly happening on the 2-NUMA-node machine
> with 2 NVMe SSDs; by "partly" I mean that there are also pauses in the
> writes, but it finishes, although at reduced speed. On the 1-NUMA-node
> machine, with the same test, I can see steady writes from the very
> beginning to the very end at roughly the same speed.
>
> Maybe it could be related to the fact that an NVMe SSD is a PCI device
> that is attached to one NUMA node only?
Can be. I don't know qemu internals well enough to say whether it's capable
of doing zero-copy disk writes.
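
You can at least confirm which node the NVMe controller is attached to via
sysfs (nvme0 is an example; a value of -1 means no affinity is reported):

  # NUMA node of the NVMe controller's PCI device
  cat /sys/class/nvme/nvme0/device/numa_node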
> As for iothreads, I have only one disk (vde) that is exposed to high I/O
> load, so I believe more I/O threads are not applicable here. If I
> understand correctly, I cannot assign more than one iothread to a single
> device. And it does not seem to be iothread-related, as the same scenario
> in the 1-NUMA configuration works OK (I mean that memory penalties can be
> huge since it does not reflect the real NUMA topology, but disk speed is
> OK anyway).
Ah, since it's only one disk, iothreads will not help much here.
Still worth giving it a shot ;-) Remember, iothreads are for all I/O,
not disk I/O only.
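
For completeness, a minimal sketch of what wiring an iothread to the
scratch disk would look like (the pinning, the paths, and the cache='none'
from case 2 above are assumptions, not a tested recommendation):

  <domain>
    ...
    <iothreads>1</iothreads>
    <cputune>
      ...
      <iothreadpin iothread='1' cpuset='0-3'/>
    </cputune>
    <devices>
      <disk type='file' device='disk'>
        <driver name='qemu' type='qcow2' cache='none' io='native' iothread='1'/>
        <source file='/var/lib/libvirt/images/scratch.qcow2'/>
        <target dev='vde' bus='virtio'/>
      </disk>
    </devices>
  </domain>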
Anyway, this is the point where I have to say "I don't know". Sorry. Try
contacting the qemu guys:

  qemu-discuss@nongnu.org
  qemu-devel@nongnu.org
Michal