[libvirt-users] NUMA issues on virtualized hosts

Hello,

I have a cluster with AMD EPYC 7351 CPUs, two CPUs per node, in the performance 8-NUMA configuration.

This is from the hypervisor:

[root@hde10 ~]# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 2
NUMA node(s): 8
Vendor ID: AuthenticAMD
CPU family: 23
Model: 1
Model name: AMD EPYC 7351 16-Core Processor
Stepping: 2
CPU MHz: 1800.000
CPU max MHz: 2400.0000
CPU min MHz: 1200.0000
BogoMIPS: 4800.05
Virtualization: AMD-V
L1d cache: 32K
L1i cache: 64K
L2 cache: 512K
L3 cache: 8192K
NUMA node0 CPU(s): 0-3,32-35
NUMA node1 CPU(s): 4-7,36-39
NUMA node2 CPU(s): 8-11,40-43
NUMA node3 CPU(s): 12-15,44-47
NUMA node4 CPU(s): 16-19,48-51
NUMA node5 CPU(s): 20-23,52-55
NUMA node6 CPU(s): 24-27,56-59
NUMA node7 CPU(s): 28-31,60-63

I'm running one big virtual machine on this hypervisor - almost the whole memory plus all physical CPUs. This is what I'm seeing inside:

root@zenon10:~# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 8
NUMA node(s): 8
Vendor ID: AuthenticAMD
CPU family: 23
Model: 1
Model name: AMD EPYC 7351 16-Core Processor
Stepping: 2
CPU MHz: 2400.000
BogoMIPS: 4800.00
Virtualization: AMD-V
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 64K
L1i cache: 64K
L2 cache: 512K
NUMA node0 CPU(s): 0-3
NUMA node1 CPU(s): 4-7
NUMA node2 CPU(s): 8-11
NUMA node3 CPU(s): 12-15
NUMA node4 CPU(s): 16-19
NUMA node5 CPU(s): 20-23
NUMA node6 CPU(s): 24-27
NUMA node7 CPU(s): 28-31

This is the virtual node configuration (I tried different numatune settings, but the result was still the same):

<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
  <name>one-55782</name>
  <vcpu><![CDATA[32]]></vcpu>
  <cputune>
    <shares>32768</shares>
  </cputune>
  <memory>507904000</memory>
  <os>
    <type arch='x86_64'>hvm</type>
  </os>
  <devices>
    <emulator><![CDATA[/usr/bin/kvm]]></emulator>
    <disk type='file' device='disk'>
      <source file='/opt/opennebula/var/datastores/108/55782/disk.0'/>
      <target dev='vda'/>
      <driver name='qemu' type='qcow2' cache='unsafe'/>
    </disk>
    <disk type='file' device='disk'>
      <source file='/opt/opennebula/var/datastores/108/55782/disk.1'/>
      <target dev='vdc'/>
      <driver name='qemu' type='raw' cache='unsafe'/>
    </disk>
    <disk type='file' device='disk'>
      <source file='/opt/opennebula/var/datastores/108/55782/disk.2'/>
      <target dev='vdd'/>
      <driver name='qemu' type='raw' cache='unsafe'/>
    </disk>
    <disk type='file' device='disk'>
      <source file='/opt/opennebula/var/datastores/108/55782/disk.3'/>
      <target dev='vde'/>
      <driver name='qemu' type='raw' cache='unsafe'/>
    </disk>
    <disk type='file' device='cdrom'>
      <source file='/opt/opennebula/var/datastores/108/55782/disk.4'/>
      <target dev='vdb'/>
      <readonly/>
      <driver name='qemu' type='raw'/>
    </disk>
    <interface type='bridge'>
      <source bridge='br0'/>
      <mac address='02:00:93:fb:3b:78'/>
      <target dev='one-55782-0'/>
      <model type='virtio'/>
      <filterref filter='no-arp-mac-spoofing'>
        <parameter name='IP' value='147.251.59.120'/>
      </filterref>
    </interface>
  </devices>
  <features>
    <pae/>
    <acpi/>
  </features>
  <!-- RAW data follows: -->
  <cpu mode='host-passthrough'>
    <topology sockets='8' cores='4' threads='1'/>
    <numa>
      <cell cpus='0-3' memory='62000000'/>
      <cell cpus='4-7' memory='62000000'/>
      <cell cpus='8-11' memory='62000000'/>
      <cell cpus='12-15' memory='62000000'/>
      <cell cpus='16-19' memory='62000000'/>
      <cell cpus='20-23' memory='62000000'/>
      <cell cpus='24-27' memory='62000000'/>
      <cell cpus='28-31' memory='62000000'/>
    </numa>
  </cpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='0'/> <vcpupin vcpu='1' cpuset='2'/> <vcpupin vcpu='2' cpuset='4'/> <vcpupin vcpu='3' cpuset='6'/>
    <vcpupin vcpu='4' cpuset='8'/> <vcpupin vcpu='5' cpuset='10'/> <vcpupin vcpu='6' cpuset='12'/> <vcpupin vcpu='7' cpuset='14'/>
    <vcpupin vcpu='8' cpuset='16'/> <vcpupin vcpu='9' cpuset='18'/> <vcpupin vcpu='10' cpuset='20'/> <vcpupin vcpu='11' cpuset='22'/>
    <vcpupin vcpu='12' cpuset='24'/> <vcpupin vcpu='13' cpuset='26'/> <vcpupin vcpu='14' cpuset='28'/> <vcpupin vcpu='15' cpuset='30'/>
    <vcpupin vcpu='16' cpuset='1'/> <vcpupin vcpu='17' cpuset='3'/> <vcpupin vcpu='18' cpuset='5'/> <vcpupin vcpu='19' cpuset='7'/>
    <vcpupin vcpu='20' cpuset='9'/> <vcpupin vcpu='21' cpuset='11'/> <vcpupin vcpu='22' cpuset='13'/> <vcpupin vcpu='23' cpuset='15'/>
    <vcpupin vcpu='24' cpuset='17'/> <vcpupin vcpu='25' cpuset='19'/> <vcpupin vcpu='26' cpuset='21'/> <vcpupin vcpu='27' cpuset='23'/>
    <vcpupin vcpu='28' cpuset='25'/> <vcpupin vcpu='29' cpuset='27'/> <vcpupin vcpu='30' cpuset='29'/> <vcpupin vcpu='31' cpuset='31'/>
  </cputune>
  <numatune>
    <memory mode='preferred' nodeset='0'/>
  </numatune>
  <devices>
    <serial type='pty'><target port='0'/></serial>
    <console type='pty'><target type='serial' port='0'/></console>
    <channel type='pty'><target type='virtio' name='org.qemu.guest_agent.0'/></channel>
  </devices>
  <devices>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source><address domain='0x0' bus='0x11' slot='0x0' function='0x1'/></source>
    </hostdev>
  </devices>
  <devices>
    <controller type='pci' index='1' model='pci-bridge'/>
    <controller type='pci' index='2' model='pci-bridge'/>
    <controller type='pci' index='3' model='pci-bridge'/>
    <controller type='pci' index='4' model='pci-bridge'/>
    <controller type='pci' index='5' model='pci-bridge'/>
  </devices>
  <metadata>
    <system_datastore><![CDATA[/opt/opennebula/var/datastores/108/55782]]></system_datastore>
  </metadata>
</domain>

If I run, e.g., spec2017 in the virtual machine, I can see:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1350 root 20 0 843136 830068 2524 R 78.1 0.2 513:16.16 bwaves_r_base.m
2456 root 20 0 804608 791264 2524 R 76.6 0.2 491:39.92 bwaves_r_base.m
4631 root 20 0 843136 829892 2344 R 75.8 0.2 450:16.04 bwaves_r_base.m
6441 root 20 0 802580 790212 2532 R 75.0 0.2 120:37.54 bwaves_r_base.m
7991 root 20 0 784676 772092 2576 R 75.0 0.2 387:15.39 bwaves_r_base.m
8142 root 20 0 843136 830044 2496 R 75.0 0.2 384:39.02 bwaves_r_base.m
8234 root 20 0 843136 830064 2524 R 75.0 0.2 99:04.48 bwaves_r_base.m
8578 root 20 0 749240 736604 2468 R 73.4 0.2 375:45.66 bwaves_r_base.m
9974 root 20 0 784676 771984 2468 R 73.4 0.2 348:01.36 bwaves_r_base.m
10396 root 20 0 802580 790264 2576 R 73.4 0.2 340:08.40 bwaves_r_base.m
12932 root 20 0 843136 830024 2480 R 73.4 0.2 288:39.76 bwaves_r_base.m
13113 root 20 0 784676 771864 2348 R 71.9 0.2 284:47.34 bwaves_r_base.m
13518 root 20 0 784676 762816 2540 R 71.9 0.2 276:31.58 bwaves_r_base.m
14443 root 20 0 784676 771984 2468 R 71.9 0.2 260:01.82 bwaves_r_base.m
12791 root 20 0 784676 772060 2544 R 70.3 0.2 291:43.96 bwaves_r_base.m
10544 root 20 0 843136 830068 2520 R 68.8 0.2 336:47.43 bwaves_r_base.m
15464 root 20 0 784676 762880 2608 R 60.9 0.2 239:19.14 bwaves_r_base.m
15487 root 20 0 784676 772048 2532 R 60.2 0.2 238:37.07 bwaves_r_base.m
16824 root 20 0 784676 772120 2604 R 55.5 0.2 212:10.92 bwaves_r_base.m
17255 root 20 0 843136 830012 2468 R 54.7 0.2 203:22.89 bwaves_r_base.m
17962 root 20 0 784676 772004 2488 R 54.7 0.2 188:26.07 bwaves_r_base.m
17505 root 20 0 843136 830068 2520 R 53.1 0.2 198:04.25 bwaves_r_base.m
27767 root 20 0 784676 771860 2344 R 52.3 0.2 592:25.95 bwaves_r_base.m
24458 root 20 0 843136 829888 2344 R 50.8 0.2 658:23.70 bwaves_r_base.m
30746 root 20 0 747376 735160 2604 R 43.0 0.2 556:47.67 bwaves_r_base.m

The CPU time should be roughly the same for all of these processes, but huge differences are obvious.

This is what I see on the hypervisor:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
18201 oneadmin 20 0 474.0g 473.3g 1732 S 2459 94.0 33332:54 kvm
369 root 20 0 0 0 0 R 100.0 0.0 768:12.85 kswapd1
368 root 20 0 0 0 0 R 94.1 0.0 869:05.61 kswapd0

i.e., kswapd is eating a whole CPU, even though swap is turned off.

[root@hde10 ~]# free
      total       used      free    shared  buff/cache  available
Mem:  528151432  503432580  1214048  34740  23504804  21907800
Swap: 0          0          0

The hypervisor is:

[root@hde10 ~]# cat /etc/redhat-release
CentOS Linux release 7.5.1804 (Core)
qemu-kvm-1.5.3-156.el7_5.5.x86_64

The virtual machine runs Debian 9.

Moreover, I'm using this type of disk for the virtual machines:

<disk type='file' device='disk'>
  <source file='/opt/opennebula/var/datastores/108/55782/disk.3'/>
  <target dev='vde'/>
  <driver name='qemu' type='raw' cache='unsafe'/>
</disk>

If I keep cache='unsafe' and run an iozone test on really big files (e.g., 8x 100 GB), I can see huge page cache pressure on the hypervisor - all 8 kswapd threads are running at 100 % and slowing things down. The disk under the datastore is an Intel 4500 NVMe SSD.

If I set cache='none', the kswapd threads are idle and disk writes are pretty fast; however, with the 8-NUMA configuration, writes slow down to less than 10 MB/s as soon as the amount of written data is roughly the same as the memory size of the virtual node. iozone then sits at 100 % CPU usage and it seems that it is traversing page lists. If I do the same with a 1-NUMA configuration, everything is OK except for a performance penalty of about 25 %.

--
Lukáš Hejtmánek
Linux Administrator only because Full Time Multitasking Ninja is not an official job title
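For comparison, the cache='none' case mentioned above differs only in the driver line of the same disk definition; a minimal sketch (the io='native' attribute is an addition for illustration, commonly paired with cache='none' to use Linux native AIO, and is not part of the original setup):

<disk type='file' device='disk'>
  <source file='/opt/opennebula/var/datastores/108/55782/disk.3'/>
  <target dev='vde'/>
  <!-- cache='none' bypasses the host page cache; io='native' is an optional, assumed extra -->
  <driver name='qemu' type='raw' cache='none' io='native'/>
</disk>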

Hello,

OK, I found that the CPU pinning was wrong, so I corrected it to be 1:1. The issue with iozone remains the same. The spec is running; however, it runs slower than in the 1-NUMA case.

The corrected XML looks as follows:

<cpu mode='host-passthrough'>
  <topology sockets='8' cores='4' threads='1'/>
  <numa>
    <cell cpus='0-3' memory='62000000'/>
    <cell cpus='4-7' memory='62000000'/>
    <cell cpus='8-11' memory='62000000'/>
    <cell cpus='12-15' memory='62000000'/>
    <cell cpus='16-19' memory='62000000'/>
    <cell cpus='20-23' memory='62000000'/>
    <cell cpus='24-27' memory='62000000'/>
    <cell cpus='28-31' memory='62000000'/>
  </numa>
</cpu>
<cputune>
  <vcpupin vcpu='0' cpuset='0'/> <vcpupin vcpu='1' cpuset='1'/> <vcpupin vcpu='2' cpuset='2'/> <vcpupin vcpu='3' cpuset='3'/>
  <vcpupin vcpu='4' cpuset='4'/> <vcpupin vcpu='5' cpuset='5'/> <vcpupin vcpu='6' cpuset='6'/> <vcpupin vcpu='7' cpuset='7'/>
  <vcpupin vcpu='8' cpuset='8'/> <vcpupin vcpu='9' cpuset='9'/> <vcpupin vcpu='10' cpuset='10'/> <vcpupin vcpu='11' cpuset='11'/>
  <vcpupin vcpu='12' cpuset='12'/> <vcpupin vcpu='13' cpuset='13'/> <vcpupin vcpu='14' cpuset='14'/> <vcpupin vcpu='15' cpuset='15'/>
  <vcpupin vcpu='16' cpuset='16'/> <vcpupin vcpu='17' cpuset='17'/> <vcpupin vcpu='18' cpuset='18'/> <vcpupin vcpu='19' cpuset='19'/>
  <vcpupin vcpu='20' cpuset='20'/> <vcpupin vcpu='21' cpuset='21'/> <vcpupin vcpu='22' cpuset='22'/> <vcpupin vcpu='23' cpuset='23'/>
  <vcpupin vcpu='24' cpuset='24'/> <vcpupin vcpu='25' cpuset='25'/> <vcpupin vcpu='26' cpuset='26'/> <vcpupin vcpu='27' cpuset='27'/>
  <vcpupin vcpu='28' cpuset='28'/> <vcpupin vcpu='29' cpuset='29'/> <vcpupin vcpu='30' cpuset='30'/> <vcpupin vcpu='31' cpuset='31'/>
</cputune>
<numatune>
  <memory mode='strict' nodeset='0-7'/>
</numatune>

In this case, the first part took more than 1700 seconds, while the 1-NUMA config finishes it in 1646 seconds. Running directly on the hypervisor, the 1-NUMA config finishes in 1470 seconds and the 8-NUMA config in 900 seconds.

On Fri, Sep 14, 2018 at 02:06:26PM +0200, Lukas Hejtmanek wrote:
-- Lukáš Hejtmánek Linux Administrator only because Full Time Multitasking Ninja is not an official job title

Hello again,

when the iozone writes are slow, this is how slabtop looks:

OBJS ACTIVE USE OBJ-SIZE SLABS OBJ/SLAB CACHE-SIZE NAME
62476752 62476728 0% 0.10K 1601968 39 6407872K buffer_head
1000678 999168 0% 0.56K 142954 7 571816K radix_tree_node
132184 125911 0% 0.03K 1066 124 4264K kmalloc-32
118496 118224 0% 0.12K 3703 32 14812K kmalloc-node
73206 56467 0% 0.19K 3486 21 13944K dentry
34816 33247 0% 0.12K 1024 34 4096K kernfs_node_cache
34496 29031 0% 0.06K 539 64 2156K kmalloc-64
23283 22707 0% 1.05K 7761 3 31044K ext4_inode_cache
16940 16052 0% 0.57K 2420 7 9680K inode_cache
14464 4124 0% 0.06K 226 64 904K anon_vma_chain
11900 11841 0% 0.14K 425 28 1700K ext4_groupinfo_4k
11312 9861 0% 0.50K 1414 8 5656K kmalloc-512
10692 10066 0% 0.04K 108 99 432K ext4_extent_status
10688 4238 0% 0.25K 668 16 2672K kmalloc-256
8120 2420 0% 0.07K 145 56 580K anon_vma
8040 4563 0% 0.20K 402 20 1608K vm_area_struct
7488 3845 0% 0.12K 234 32 936K kmalloc-96
7456 7061 0% 1.00K 1864 4 7456K kmalloc-1024
7234 7227 0% 4.00K 7234 1 28936K kmalloc-4096

and this is /proc/$PID/stack of iozone eating CPU but not writing data:

[<ffffffffba78151b>] find_get_entry+0x1b/0x100
[<ffffffffba781de0>] pagecache_get_page+0x30/0x2a0
[<ffffffffc06ec12b>] ext4_da_get_block_prep+0x27b/0x440 [ext4]
[<ffffffffba840d8b>] __find_get_block_slow+0x3b/0x150
[<ffffffffba840ebd>] unmap_underlying_metadata+0x1d/0x70
[<ffffffffc06ec960>] ext4_block_write_begin+0x2e0/0x520 [ext4]
[<ffffffffc06ebeb0>] ext4_inode_attach_jinode.part.72+0xa0/0xa0 [ext4]
[<ffffffffc041f9f9>] jbd2__journal_start+0xd9/0x1e0 [jbd2]
[<ffffffffba80511a>] __check_object_size+0xfa/0x1d8
[<ffffffffba946b85>] iov_iter_copy_from_user_atomic+0xa5/0x330
[<ffffffffba780dcb>] generic_perform_write+0xfb/0x1d0
[<ffffffffba7831ca>] __generic_file_write_iter+0x16a/0x1b0
[<ffffffffc06e7220>] ext4_file_write_iter+0x90/0x370 [ext4]
[<ffffffffc06e7190>] ext4_dax_fault+0x140/0x140 [ext4]
[<ffffffffba6aef01>] update_curr+0xe1/0x160
[<ffffffffba808890>] new_sync_write+0xe0/0x130
[<ffffffffba809010>] vfs_write+0xb0/0x190
[<ffffffffba80a452>] SyS_write+0x52/0xc0
[<ffffffffba603b7d>] do_syscall_64+0x8d/0xf0
[<ffffffffbac15c4e>] entry_SYSCALL_64_after_swapgs+0x58/0xc6
[<ffffffffffffffff>] 0xffffffffffffffff

On Fri, Sep 14, 2018 at 03:36:59PM +0200, Lukas Hejtmanek wrote:
-- Lukáš Hejtmánek Linux Administrator only because Full Time Multitasking Ninja is not an official job title

Hello,

I did some performance measurements with SPEC CPU 2017 in the fp rate variant (i.e., utilizing all CPU cores). It looks like this:

8-NUMA hypervisor                     specfp2017 - 124
1-NUMA hypervisor                     specfp2017 - 103
2-NUMA hypervisor                     specfp2017 - 120
8-NUMA virtual (on 8N hypervisor)     specfp2017 - 92
1-NUMA virtual (on 1N hypervisor)     specfp2017 - 95.2
2-NUMA virtual (on 2N hypervisor)     specfp2017 - 98   (memory strict)
2-NUMA virtual (on 2N hypervisor)     specfp2017 - 98.1 (memory interleave)
2x 1-NUMA virtual (on 2N hypervisor)  specfp2017 - 117.2 (sum for both)

On Fri, Sep 14, 2018 at 03:40:56PM +0200, Lukas Hejtmanek wrote:
-- Lukáš Hejtmánek Linux Administrator only because Full Time Multitasking Ninja is not an official job title

On 09/14/2018 03:36 PM, Lukas Hejtmanek wrote:
Hello,
OK, I found that the CPU pinning was wrong, so I corrected it to be 1:1. The issue with iozone remains the same.
The spec is running; however, it runs slower than in the 1-NUMA case.
The corrected XML looks as follows: [Reformatted XML for better reading]
<cpu mode="host-passthrough"> <topology sockets="8" cores="4" threads="1"/> <numa> <cell cpus="0-3" memory="62000000"/> <cell cpus="4-7" memory="62000000"/> <cell cpus="8-11" memory="62000000"/> <cell cpus="12-15" memory="62000000"/> <cell cpus="16-19" memory="62000000"/> <cell cpus="20-23" memory="62000000"/> <cell cpus="24-27" memory="62000000"/> <cell cpus="28-31" memory="62000000"/> </numa> </cpu> <cputune> <vcpupin vcpu="0" cpuset="0"/> <vcpupin vcpu="1" cpuset="1"/> <vcpupin vcpu="2" cpuset="2"/> <vcpupin vcpu="3" cpuset="3"/> <vcpupin vcpu="4" cpuset="4"/> <vcpupin vcpu="5" cpuset="5"/> <vcpupin vcpu="6" cpuset="6"/> <vcpupin vcpu="7" cpuset="7"/> <vcpupin vcpu="8" cpuset="8"/> <vcpupin vcpu="9" cpuset="9"/> <vcpupin vcpu="10" cpuset="10"/> <vcpupin vcpu="11" cpuset="11"/> <vcpupin vcpu="12" cpuset="12"/> <vcpupin vcpu="13" cpuset="13"/> <vcpupin vcpu="14" cpuset="14"/> <vcpupin vcpu="15" cpuset="15"/> <vcpupin vcpu="16" cpuset="16"/> <vcpupin vcpu="17" cpuset="17"/> <vcpupin vcpu="18" cpuset="18"/> <vcpupin vcpu="19" cpuset="19"/> <vcpupin vcpu="20" cpuset="20"/> <vcpupin vcpu="21" cpuset="21"/> <vcpupin vcpu="22" cpuset="22"/> <vcpupin vcpu="23" cpuset="23"/> <vcpupin vcpu="24" cpuset="24"/> <vcpupin vcpu="25" cpuset="25"/> <vcpupin vcpu="26" cpuset="26"/> <vcpupin vcpu="27" cpuset="27"/> <vcpupin vcpu="28" cpuset="28"/> <vcpupin vcpu="29" cpuset="29"/> <vcpupin vcpu="30" cpuset="30"/> <vcpupin vcpu="31" cpuset="31"/> </cputune> <numatune> <memory mode="strict" nodeset="0-7"/> </numatune> However, this is not enough. This XML pins only vCPUs and not guest memory. So while say vCPU #0 is pinned onto physical CPU #0, the memory for guest NUMA #0 might be allocated at host NUMA #7 (for instance). You need to add: <numatune> <memnode cellid="0" mode="strict" nodeset="0"/> <memnode cellid="1" mode="strict" nodeset="1"/> ... </numatune> This will ensure also the guest memory pinning. But wait, there is more. In your later e-mails you mention slow disk I/O. This might be caused by various variables but the most obvious one in this case is qemu I/O loop, I'd say. Without iothreads qemu has only one I/O loop and thus if your guest issues writes from all 32 cores at once this loop is unable to handle it (performance wise) and therefore the performance drop. You can try enabling iothreads: https://libvirt.org/formatdomain.html#elementsIOThreadsAllocation This is a qemu feature that allows you to create more I/O threads and also pin them. This is an example how to use them: https://libvirt.org/git/?p=libvirt.git;a=blob;f=tests/qemuxml2argvdata/iothr... And this is an example how to pin them: https://libvirt.org/git/?p=libvirt.git;a=blob;f=tests/qemuxml2argvdata/cputu... Also, since iothreads are capable of handling just any I/O they can be used for other devices too, not only disks. For instance interfaces. Hopefully, this will boost your performance. Regards, Michal (who is a bit envious about your machine :-P)

Hello,

so the current domain configuration:

<cpu mode='host-passthrough'>
  <topology sockets='8' cores='4' threads='1'/>
  <numa>
    <cell cpus='0-3' memory='62000000'/>
    <cell cpus='4-7' memory='62000000'/>
    <cell cpus='8-11' memory='62000000'/>
    <cell cpus='12-15' memory='62000000'/>
    <cell cpus='16-19' memory='62000000'/>
    <cell cpus='20-23' memory='62000000'/>
    <cell cpus='24-27' memory='62000000'/>
    <cell cpus='28-31' memory='62000000'/>
  </numa>
</cpu>
<cputune>
  <vcpupin vcpu='0' cpuset='0'/> <vcpupin vcpu='1' cpuset='1'/> <vcpupin vcpu='2' cpuset='2'/> <vcpupin vcpu='3' cpuset='3'/>
  <vcpupin vcpu='4' cpuset='4'/> <vcpupin vcpu='5' cpuset='5'/> <vcpupin vcpu='6' cpuset='6'/> <vcpupin vcpu='7' cpuset='7'/>
  <vcpupin vcpu='8' cpuset='8'/> <vcpupin vcpu='9' cpuset='9'/> <vcpupin vcpu='10' cpuset='10'/> <vcpupin vcpu='11' cpuset='11'/>
  <vcpupin vcpu='12' cpuset='12'/> <vcpupin vcpu='13' cpuset='13'/> <vcpupin vcpu='14' cpuset='14'/> <vcpupin vcpu='15' cpuset='15'/>
  <vcpupin vcpu='16' cpuset='16'/> <vcpupin vcpu='17' cpuset='17'/> <vcpupin vcpu='18' cpuset='18'/> <vcpupin vcpu='19' cpuset='19'/>
  <vcpupin vcpu='20' cpuset='20'/> <vcpupin vcpu='21' cpuset='21'/> <vcpupin vcpu='22' cpuset='22'/> <vcpupin vcpu='23' cpuset='23'/>
  <vcpupin vcpu='24' cpuset='24'/> <vcpupin vcpu='25' cpuset='25'/> <vcpupin vcpu='26' cpuset='26'/> <vcpupin vcpu='27' cpuset='27'/>
  <vcpupin vcpu='28' cpuset='28'/> <vcpupin vcpu='29' cpuset='29'/> <vcpupin vcpu='30' cpuset='30'/> <vcpupin vcpu='31' cpuset='31'/>
</cputune>
<numatune>
  <memnode cellid="0" mode="strict" nodeset="0"/>
  <memnode cellid="1" mode="strict" nodeset="1"/>
  <memnode cellid="2" mode="strict" nodeset="2"/>
  <memnode cellid="3" mode="strict" nodeset="3"/>
  <memnode cellid="4" mode="strict" nodeset="4"/>
  <memnode cellid="5" mode="strict" nodeset="5"/>
  <memnode cellid="6" mode="strict" nodeset="6"/>
  <memnode cellid="7" mode="strict" nodeset="7"/>
</numatune>

Hopefully, I got it right.

The good news is that the spec benchmark looks promising. The first test, bwaves, finished in 1003 seconds, compared to 1700 seconds in the previous wrong case. So far so good.

The bad news is that iozone is still the same. There might be some misunderstanding; I have two cases:

1) cache=unsafe. In this case, I can see that the hypervisor is prone to swap - and swap a lot. It usually eats the whole swap partition and kswapd runs at 100 % CPU. swappiness, dirty_ratio and company do not improve things at all. However, I believe this is simply the wrong option for scratch disks where one can expect huge I/O load. Moreover, the hypervisor is a poor machine with only little memory left (OK, in my case about 10 GB available), so it does not make sense to use that memory for additional cache/disk buffers.

2) cache=none. In this case, performance is better (only a few percent behind bare metal). However, as soon as the size of the stored data is about the size of the memory of the virtual machine, writes stop and iozone eats a whole CPU; it looks like it is searching for free pages and it gets harder and harder. But I am not sure, I am not skilled in this area.

Here you can clearly see that it starts writing, does the writes, then takes a pause, writes again, and so on, but the pauses get longer and longer:
https://pastebin.com/2gfPFgb9

The output runs until the very end of iozone (I cancelled it with ctrl-c).

It seems that this is not happening on a 2-NUMA node with rotational disks only. It is partly happening on a 2-NUMA node with 2 NVMe SSDs. "Partly" means that there are also pauses in writes, but it finishes; speed is reduced, though.
On a 1-NUMA node, with the same test, I can see steady writes from the very beginning to the very end at roughly the same speed.

Maybe it could be related to the fact that the NVMe is a PCI device that is linked to one NUMA node only?

As for iothreads, I have only one disk (the vde one) that is exposed to high I/O load, so I believe more I/O threads are not applicable here. If I understand correctly, I cannot assign more iothreads to a single device. And it does not seem to be iothread-related anyway, as the same scenario in the 1-NUMA configuration works OK (I mean that memory penalties can be huge because it does not reflect the real NUMA topology, but disk speed is OK anyway).

And as for that machine, what about this one? :)

[root@urga1 ~]$ free -g
      total  used  free  shared  buff/cache  available
Mem:  5857   75    5746  0       35          5768
...
NUMA node47 CPU(s): 376-383

this is not virtualized, though :)

On Mon, Sep 17, 2018 at 03:08:34PM +0200, Michal Privoznik wrote:
On 09/14/2018 03:36 PM, Lukas Hejtmanek wrote:
Hello,
ok, I found that cpu pinning was wrong, so I corrected it to be 1:1. The issue with iozone remains the same.
The spec is running, however, it runs slower than 1-NUMA case.
The corrected XML looks like follows: [Reformated XML for better reading]
<cpu mode="host-passthrough"> <topology sockets="8" cores="4" threads="1"/> <numa> <cell cpus="0-3" memory="62000000"/> <cell cpus="4-7" memory="62000000"/> <cell cpus="8-11" memory="62000000"/> <cell cpus="12-15" memory="62000000"/> <cell cpus="16-19" memory="62000000"/> <cell cpus="20-23" memory="62000000"/> <cell cpus="24-27" memory="62000000"/> <cell cpus="28-31" memory="62000000"/> </numa> </cpu> <cputune> <vcpupin vcpu="0" cpuset="0"/> <vcpupin vcpu="1" cpuset="1"/> <vcpupin vcpu="2" cpuset="2"/> <vcpupin vcpu="3" cpuset="3"/> <vcpupin vcpu="4" cpuset="4"/> <vcpupin vcpu="5" cpuset="5"/> <vcpupin vcpu="6" cpuset="6"/> <vcpupin vcpu="7" cpuset="7"/> <vcpupin vcpu="8" cpuset="8"/> <vcpupin vcpu="9" cpuset="9"/> <vcpupin vcpu="10" cpuset="10"/> <vcpupin vcpu="11" cpuset="11"/> <vcpupin vcpu="12" cpuset="12"/> <vcpupin vcpu="13" cpuset="13"/> <vcpupin vcpu="14" cpuset="14"/> <vcpupin vcpu="15" cpuset="15"/> <vcpupin vcpu="16" cpuset="16"/> <vcpupin vcpu="17" cpuset="17"/> <vcpupin vcpu="18" cpuset="18"/> <vcpupin vcpu="19" cpuset="19"/> <vcpupin vcpu="20" cpuset="20"/> <vcpupin vcpu="21" cpuset="21"/> <vcpupin vcpu="22" cpuset="22"/> <vcpupin vcpu="23" cpuset="23"/> <vcpupin vcpu="24" cpuset="24"/> <vcpupin vcpu="25" cpuset="25"/> <vcpupin vcpu="26" cpuset="26"/> <vcpupin vcpu="27" cpuset="27"/> <vcpupin vcpu="28" cpuset="28"/> <vcpupin vcpu="29" cpuset="29"/> <vcpupin vcpu="30" cpuset="30"/> <vcpupin vcpu="31" cpuset="31"/> </cputune> <numatune> <memory mode="strict" nodeset="0-7"/> </numatune>
However, this is not enough. This XML pins only vCPUs and not guest memory. So while say vCPU #0 is pinned onto physical CPU #0, the memory for guest NUMA #0 might be allocated at host NUMA #7 (for instance). You need to add:
<numatune> <memnode cellid="0" mode="strict" nodeset="0"/> <memnode cellid="1" mode="strict" nodeset="1"/> ... </numatune>
This will also ensure the guest memory pinning. But wait, there is more. In your later e-mails you mention slow disk I/O. This might be caused by various factors, but the most obvious one in this case is the qemu I/O loop, I'd say. Without iothreads, qemu has only one I/O loop, so if your guest issues writes from all 32 cores at once this loop is unable to handle them (performance-wise), hence the performance drop. You can try enabling iothreads:
https://libvirt.org/formatdomain.html#elementsIOThreadsAllocation
This is a qemu feature that allows you to create more I/O threads and also pin them. This is an example how to use them:
https://libvirt.org/git/?p=libvirt.git;a=blob;f=tests/qemuxml2argvdata/iothr...
And this is an example how to pin them:
https://libvirt.org/git/?p=libvirt.git;a=blob;f=tests/qemuxml2argvdata/cputu...
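For illustration only, a rough sketch of how this could look for the domain discussed in this thread; the iothread count, cpuset values and the cache/io driver options are placeholders, not a tested recommendation:

<domain type='kvm'>
  ...
  <!-- create two extra I/O event loops besides qemu's main loop -->
  <iothreads>2</iothreads>
  <cputune>
    <!-- optionally pin each iothread to host CPUs, e.g. close to the disk's NUMA node -->
    <iothreadpin iothread='1' cpuset='0-3'/>
    <iothreadpin iothread='2' cpuset='4-7'/>
  </cputune>
  <devices>
    <disk type='file' device='disk'>
      <source file='/opt/opennebula/var/datastores/108/55782/disk.3'/>
      <target dev='vde' bus='virtio'/>
      <!-- hand this disk's I/O to iothread 1 (works for virtio-blk disks) -->
      <driver name='qemu' type='raw' cache='none' io='native' iothread='1'/>
    </disk>
  </devices>
</domain>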
Also, since iothreads are capable of handling any I/O, they can be used for other devices too, not only disks. For instance, interfaces.
Hopefully, this will boost your performance.
Regards, Michal (who is a bit envious about your machine :-P)
-- Lukáš Hejtmánek Linux Administrator only because Full Time Multitasking Ninja is not an official job title

On 09/17/2018 04:59 PM, Lukas Hejtmanek wrote:
Hello,
so the current domain configuration:

<cpu mode='host-passthrough'>
  <topology sockets='8' cores='4' threads='1'/>
  <numa>
    <cell cpus='0-3' memory='62000000'/>
    <cell cpus='4-7' memory='62000000'/>
    <cell cpus='8-11' memory='62000000'/>
    <cell cpus='12-15' memory='62000000'/>
    <cell cpus='16-19' memory='62000000'/>
    <cell cpus='20-23' memory='62000000'/>
    <cell cpus='24-27' memory='62000000'/>
    <cell cpus='28-31' memory='62000000'/>
  </numa>
</cpu>
<cputune>
  <vcpupin vcpu='0' cpuset='0'/> <vcpupin vcpu='1' cpuset='1'/> <vcpupin vcpu='2' cpuset='2'/> <vcpupin vcpu='3' cpuset='3'/>
  <vcpupin vcpu='4' cpuset='4'/> <vcpupin vcpu='5' cpuset='5'/> <vcpupin vcpu='6' cpuset='6'/> <vcpupin vcpu='7' cpuset='7'/>
  <vcpupin vcpu='8' cpuset='8'/> <vcpupin vcpu='9' cpuset='9'/> <vcpupin vcpu='10' cpuset='10'/> <vcpupin vcpu='11' cpuset='11'/>
  <vcpupin vcpu='12' cpuset='12'/> <vcpupin vcpu='13' cpuset='13'/> <vcpupin vcpu='14' cpuset='14'/> <vcpupin vcpu='15' cpuset='15'/>
  <vcpupin vcpu='16' cpuset='16'/> <vcpupin vcpu='17' cpuset='17'/> <vcpupin vcpu='18' cpuset='18'/> <vcpupin vcpu='19' cpuset='19'/>
  <vcpupin vcpu='20' cpuset='20'/> <vcpupin vcpu='21' cpuset='21'/> <vcpupin vcpu='22' cpuset='22'/> <vcpupin vcpu='23' cpuset='23'/>
  <vcpupin vcpu='24' cpuset='24'/> <vcpupin vcpu='25' cpuset='25'/> <vcpupin vcpu='26' cpuset='26'/> <vcpupin vcpu='27' cpuset='27'/>
  <vcpupin vcpu='28' cpuset='28'/> <vcpupin vcpu='29' cpuset='29'/> <vcpupin vcpu='30' cpuset='30'/> <vcpupin vcpu='31' cpuset='31'/>
</cputune>
<numatune>
  <memnode cellid="0" mode="strict" nodeset="0"/>
  <memnode cellid="1" mode="strict" nodeset="1"/>
  <memnode cellid="2" mode="strict" nodeset="2"/>
  <memnode cellid="3" mode="strict" nodeset="3"/>
  <memnode cellid="4" mode="strict" nodeset="4"/>
  <memnode cellid="5" mode="strict" nodeset="5"/>
  <memnode cellid="6" mode="strict" nodeset="6"/>
  <memnode cellid="7" mode="strict" nodeset="7"/>
</numatune>
hopefully, I got it right.
Yes, looking good.
The good news is that the spec benchmark looks promising. The first test, bwaves, finished in 1003 seconds compared to 1700 seconds in the previous, wrong case. So far so good.
Very well, this means that the config above is correct.
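(Just to double-check from the host side, the actual placement can also be verified while the guest is running. A sketch, using the domain name from this thread; the pgrep pattern is only an assumption about how the qemu process is named:)

# per-NUMA-node memory usage of the qemu process backing the guest
numastat -p $(pgrep -f one-55782)

# vCPU-to-physical-CPU mapping and affinity as libvirt sees it
virsh vcpuinfo one-55782

# current memory tuning (mode/nodeset) of the running domain
virsh numatune one-55782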
The bad news is that iozone is still the same. There might be some misunderstanding.
I have two cases:
1) cache=unsafe. In this case, I can see that the hypervisor is prone to swap. A lot. It usually eats the whole swap partition and kswapd runs at 100% CPU. swappiness, dirty_ratio and company do not improve things at all. However, I believe this is simply the wrong option for scratch disks where one can expect a huge I/O load. Moreover, the hypervisor is left with only a little memory (in my case about 10 GB available), so it does not make sense to use that memory for additional cache/disk buffers.
One thing that just occurred to me - is the qcow2 file fully allocated?

# qemu-img info /var/lib/libvirt/images/fedora.qcow2
..
virtual size: 20G (21474836480 bytes)
disk size: 7.0G
..

This is NOT a fully allocated qcow2.
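(If it turns out not to be fully allocated, a preallocated image can be created or converted roughly like this; file names and size below are placeholders:)

# create a new qcow2 with all clusters allocated up front
qemu-img create -f qcow2 -o preallocation=full /path/to/scratch.qcow2 500G

# or turn an existing sparse image into a fully allocated copy
qemu-img convert -f qcow2 -O qcow2 -o preallocation=full sparse.qcow2 preallocated.qcow2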
2) cache=none. In this case, performance is better (only a few percent behind bare metal). However, as soon as the size of the stored data approaches the memory size of the virtual machine, writes stop and iozone eats a whole CPU; it looks like it is searching for more free pages and it gets harder and harder. But I am not sure, I am not skilled in this area.
Hmm. Could it be that SSD doesn't have enough free blocks and thus writes are throttled? Can you fstrim it and see if that helps?
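(Something like the following, with the mount point of the scratch filesystem as a placeholder:)

# discard unused blocks so the SSD gets its free blocks back; -v reports how much was trimmed
fstrim -v /scratch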
Here you can clearly see that it starts the writes, does the writes, then takes a pause, writes again, and so on, but the pauses get longer and longer: https://pastebin.com/2gfPFgb9 The output runs until the very end of iozone (I cancelled it with Ctrl-C).
It seems that this is not happening on a 2-NUMA node with rotational disks only. It is partly happening on a 2-NUMA node with 2 NVMe SSDs. "Partly" means that there are also pauses in writes, but it finishes; the speed is reduced though. On a 1-NUMA node, with the same test, I can see steady writes from the very beginning to the very end at roughly the same speed.
Maybe it could be related to the fact that the NVMe is a PCI device that is linked to one NUMA node only?
Can be. I don't know qemu internals that much to know if its capable of doing zero copy disk writes.
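(For what it's worth, the NUMA node a PCI device sits on can at least be read from sysfs; the PCI address below is a placeholder:)

# prints the NUMA node the NVMe's PCI function is attached to (-1 = no affinity reported)
cat /sys/bus/pci/devices/0000:41:00.0/numa_node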
As for iothreads, I have only 1 disk (the vde) that is exposed to high I/O load, so I believe more I/O threads are not applicable here. If I understand correctly, I cannot assign more iothreads to a single device. And it does not seem to be iothread-related, as the same scenario in a 1-NUMA configuration works OK (I mean that memory penalties can be huge as it does not reflect the real NUMA topology, but disk speed is OK anyway).
Ah, since it's only one disk, iothreads will not help much here. Still worth giving it a shot ;-) Remember, iothreads are for all I/O, not disk I/O only. Anyway, this is the point where I have to say "I don't know". Sorry. Try contacting the qemu guys: qemu-discuss@nongnu.org qemu-devel@nongnu.org

Michal

Hello,

so the final working solution for the 8-NUMA node configuration is:

<cpu mode='host-passthrough'>
  <topology sockets='8' cores='4' threads='1'/>
  <numa>
    <cell id='0' cpus='0-3' memory='62000000'>
      <distances> <sibling id='0' value='10'/> <sibling id='1' value='16'/> <sibling id='2' value='16'/> <sibling id='3' value='16'/> <sibling id='4' value='32'/> <sibling id='5' value='32'/> <sibling id='6' value='32'/> <sibling id='7' value='32'/> </distances>
    </cell>
    <cell id='1' cpus='4-7' memory='62000000'>
      <distances> <sibling id='0' value='16'/> <sibling id='1' value='10'/> <sibling id='2' value='16'/> <sibling id='3' value='16'/> <sibling id='4' value='32'/> <sibling id='5' value='32'/> <sibling id='6' value='32'/> <sibling id='7' value='32'/> </distances>
    </cell>
    <cell id='2' cpus='8-11' memory='62000000'>
      <distances> <sibling id='0' value='16'/> <sibling id='1' value='16'/> <sibling id='2' value='10'/> <sibling id='3' value='16'/> <sibling id='4' value='32'/> <sibling id='5' value='32'/> <sibling id='6' value='32'/> <sibling id='7' value='32'/> </distances>
    </cell>
    <cell id='3' cpus='12-15' memory='62000000'>
      <distances> <sibling id='0' value='16'/> <sibling id='1' value='16'/> <sibling id='2' value='16'/> <sibling id='3' value='10'/> <sibling id='4' value='32'/> <sibling id='5' value='32'/> <sibling id='6' value='32'/> <sibling id='7' value='32'/> </distances>
    </cell>
    <cell id='4' cpus='16-19' memory='62000000'>
      <distances> <sibling id='0' value='32'/> <sibling id='1' value='32'/> <sibling id='2' value='32'/> <sibling id='3' value='32'/> <sibling id='4' value='10'/> <sibling id='5' value='16'/> <sibling id='6' value='16'/> <sibling id='7' value='16'/> </distances>
    </cell>
    <cell id='5' cpus='20-23' memory='62000000'>
      <distances> <sibling id='0' value='32'/> <sibling id='1' value='32'/> <sibling id='2' value='32'/> <sibling id='3' value='32'/> <sibling id='4' value='16'/> <sibling id='5' value='10'/> <sibling id='6' value='16'/> <sibling id='7' value='16'/> </distances>
    </cell>
    <cell id='6' cpus='24-27' memory='62000000'>
      <distances> <sibling id='0' value='32'/> <sibling id='1' value='32'/> <sibling id='2' value='32'/> <sibling id='3' value='32'/> <sibling id='4' value='16'/> <sibling id='5' value='16'/> <sibling id='6' value='10'/> <sibling id='7' value='16'/> </distances>
    </cell>
    <cell id='7' cpus='28-31' memory='62000000'>
      <distances> <sibling id='0' value='32'/> <sibling id='1' value='32'/> <sibling id='2' value='32'/> <sibling id='3' value='32'/> <sibling id='4' value='16'/> <sibling id='5' value='16'/> <sibling id='6' value='16'/> <sibling id='7' value='10'/> </distances>
    </cell>
  </numa>
</cpu>
<cputune>
  <vcpupin vcpu='0' cpuset='0'/> <vcpupin vcpu='1' cpuset='1'/> <vcpupin vcpu='2' cpuset='2'/> <vcpupin vcpu='3' cpuset='3'/>
  <vcpupin vcpu='4' cpuset='4'/> <vcpupin vcpu='5' cpuset='5'/> <vcpupin vcpu='6' cpuset='6'/> <vcpupin vcpu='7' cpuset='7'/>
  <vcpupin vcpu='8' cpuset='8'/> <vcpupin vcpu='9' cpuset='9'/> <vcpupin vcpu='10' cpuset='10'/> <vcpupin vcpu='11' cpuset='11'/>
  <vcpupin vcpu='12' cpuset='12'/> <vcpupin vcpu='13' cpuset='13'/> <vcpupin vcpu='14' cpuset='14'/> <vcpupin vcpu='15' cpuset='15'/>
  <vcpupin vcpu='16' cpuset='16'/> <vcpupin vcpu='17' cpuset='17'/> <vcpupin vcpu='18' cpuset='18'/> <vcpupin vcpu='19' cpuset='19'/>
  <vcpupin vcpu='20' cpuset='20'/> <vcpupin vcpu='21' cpuset='21'/> <vcpupin vcpu='22' cpuset='22'/> <vcpupin vcpu='23' cpuset='23'/>
  <vcpupin vcpu='24' cpuset='24'/> <vcpupin vcpu='25' cpuset='25'/> <vcpupin vcpu='26' cpuset='26'/> <vcpupin vcpu='27' cpuset='27'/>
  <vcpupin vcpu='28' cpuset='28'/> <vcpupin vcpu='29' cpuset='29'/> <vcpupin vcpu='30' cpuset='30'/> <vcpupin vcpu='31' cpuset='31'/>
</cputune>
<numatune>
  <memnode cellid='0' mode='strict' nodeset='0'/>
  <memnode cellid='1' mode='strict' nodeset='1'/>
  <memnode cellid='2' mode='strict' nodeset='2'/>
  <memnode cellid='3' mode='strict' nodeset='3'/>
  <memnode cellid='4' mode='strict' nodeset='4'/>
  <memnode cellid='5' mode='strict' nodeset='5'/>
  <memnode cellid='6' mode='strict' nodeset='6'/>
  <memnode cellid='7' mode='strict' nodeset='7'/>
</numatune>

With this configuration, virtualized Debian 9 even slightly outperforms the same Debian 9 on bare metal.

As for the iozone and cache=none case: it seems that the problem is with KVM, which stalls iozone on the first run when not all memory pages are populated yet, or something like that. The number of pages is not small on a 512 GB machine. However, letting KVM populate the pages and running iozone again does not bring any performance loss.

On Tue, Sep 18, 2018 at 09:50:44AM +0200, Michal Privoznik wrote:
On 09/17/2018 04:59 PM, Lukas Hejtmanek wrote:
Hello,
so the current domain configuration:
Yes, looking good.
The good news is that the spec benchmark looks promising. The first test, bwaves, finished in 1003 seconds compared to 1700 seconds in the previous, wrong case. So far so good.
Very well, this means that the config above is correct.
The bad news is that iozone is still the same. There might be some misunderstanding.
I have two cases:
1) cache=unsafe. In this case, I can see that the hypervisor is prone to swap. A lot. It usually eats the whole swap partition and kswapd runs at 100% CPU. swappiness, dirty_ratio and company do not improve things at all. However, I believe this is simply the wrong option for scratch disks where one can expect a huge I/O load. Moreover, the hypervisor is left with only a little memory (in my case about 10 GB available), so it does not make sense to use that memory for additional cache/disk buffers.
One thing that just occurred to me - is the qcow2 file fully allocated?
# qemu-img info /var/lib/libvirt/images/fedora.qcow2
..
virtual size: 20G (21474836480 bytes)
disk size: 7.0G
..
This is NOT a fully allocated qcow2.
2) cache=none. In this case, performance is better (only a few percent behind bare metal). However, as soon as the size of the stored data approaches the memory size of the virtual machine, writes stop and iozone eats a whole CPU; it looks like it is searching for more free pages and it gets harder and harder. But I am not sure, I am not skilled in this area.
Hmm. Could it be that SSD doesn't have enough free blocks and thus writes are throttled? Can you fstrim it and see if that helps?
Here you can clearly see that it starts the writes, does the writes, then takes a pause, writes again, and so on, but the pauses get longer and longer: https://pastebin.com/2gfPFgb9 The output runs until the very end of iozone (I cancelled it with Ctrl-C).
It seems that this is not happening on a 2-NUMA node with rotational disks only. It is partly happening on a 2-NUMA node with 2 NVMe SSDs. "Partly" means that there are also pauses in writes, but it finishes; the speed is reduced though. On a 1-NUMA node, with the same test, I can see steady writes from the very beginning to the very end at roughly the same speed.
Maybe it could be related to the fact that the NVMe is a PCI device that is linked to one NUMA node only?
Can be. I don't know qemu internals that much to know if its capable of doing zero copy disk writes.
As for iothreads, I have only 1 disk (the vde) that is exposed to high I/O load, so I believe more I/O threads are not applicable here. If I understand correctly, I cannot assign more iothreads to a single device. And it does not seem to be iothread-related, as the same scenario in a 1-NUMA configuration works OK (I mean that memory penalties can be huge as it does not reflect the real NUMA topology, but disk speed is OK anyway).
Ah, since it's only one disk, iothreads will not help much here. Still worth giving it a shot ;-) Remember, iothreads are for all I/O, not disk I/O only.
Anyway, this is the point where I have to say "I don't know". Sorry. Try contacting qemu guys:
qemu-discuss@nongnu.org qemu-devel@nongnu.org
Michal
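(A speculative footnote to the page-population observation above, not something verified in this thread: depending on the libvirt version, the guest memory can be preallocated when the domain starts instead of being faulted in on first touch, e.g. via a memoryBacking element:)

<!-- sketch: allocate all guest RAM up front; backing the guest with hugepages would be another option -->
<memoryBacking>
  <allocation mode='immediate'/>
</memoryBacking>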
-- Lukáš Hejtmánek Linux Administrator only because Full Time Multitasking Ninja is not an official job title