[libvirt-users] NUMA issues on virtualized hosts

Hello,

I have a cluster with AMD EPYC 7351 CPUs, two CPUs per node, in the performance 8-NUMA configuration.

This is from the hypervisor:

[root@hde10 ~]# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 2
NUMA node(s): 8
Vendor ID: AuthenticAMD
CPU family: 23
Model: 1
Model name: AMD EPYC 7351 16-Core Processor
Stepping: 2
CPU MHz: 1800.000
CPU max MHz: 2400.0000
CPU min MHz: 1200.0000
BogoMIPS: 4800.05
Virtualization: AMD-V
L1d cache: 32K
L1i cache: 64K
L2 cache: 512K
L3 cache: 8192K
NUMA node0 CPU(s): 0-3,32-35
NUMA node1 CPU(s): 4-7,36-39
NUMA node2 CPU(s): 8-11,40-43
NUMA node3 CPU(s): 12-15,44-47
NUMA node4 CPU(s): 16-19,48-51
NUMA node5 CPU(s): 20-23,52-55
NUMA node6 CPU(s): 24-27,56-59
NUMA node7 CPU(s): 28-31,60-63

I'm running one big virtual machine on this hypervisor - almost the whole memory plus all physical CPUs. This is what I'm seeing inside:

root@zenon10:~# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 8
NUMA node(s): 8
Vendor ID: AuthenticAMD
CPU family: 23
Model: 1
Model name: AMD EPYC 7351 16-Core Processor
Stepping: 2
CPU MHz: 2400.000
BogoMIPS: 4800.00
Virtualization: AMD-V
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 64K
L1i cache: 64K
L2 cache: 512K
NUMA node0 CPU(s): 0-3
NUMA node1 CPU(s): 4-7
NUMA node2 CPU(s): 8-11
NUMA node3 CPU(s): 12-15
NUMA node4 CPU(s): 16-19
NUMA node5 CPU(s): 20-23
NUMA node6 CPU(s): 24-27
NUMA node7 CPU(s): 28-31

This is the virtual node configuration (I tried different numatune settings, but the result was still the same):

<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
  <name>one-55782</name>
  <vcpu><![CDATA[32]]></vcpu>
  <cputune>
    <shares>32768</shares>
  </cputune>
  <memory>507904000</memory>
  <os>
    <type arch='x86_64'>hvm</type>
  </os>
  <devices>
    <emulator><![CDATA[/usr/bin/kvm]]></emulator>
    <disk type='file' device='disk'>
      <source file='/opt/opennebula/var/datastores/108/55782/disk.0'/>
      <target dev='vda'/>
      <driver name='qemu' type='qcow2' cache='unsafe'/>
    </disk>
    <disk type='file' device='disk'>
      <source file='/opt/opennebula/var/datastores/108/55782/disk.1'/>
      <target dev='vdc'/>
      <driver name='qemu' type='raw' cache='unsafe'/>
    </disk>
    <disk type='file' device='disk'>
      <source file='/opt/opennebula/var/datastores/108/55782/disk.2'/>
      <target dev='vdd'/>
      <driver name='qemu' type='raw' cache='unsafe'/>
    </disk>
    <disk type='file' device='disk'>
      <source file='/opt/opennebula/var/datastores/108/55782/disk.3'/>
      <target dev='vde'/>
      <driver name='qemu' type='raw' cache='unsafe'/>
    </disk>
    <disk type='file' device='cdrom'>
      <source file='/opt/opennebula/var/datastores/108/55782/disk.4'/>
      <target dev='vdb'/>
      <readonly/>
      <driver name='qemu' type='raw'/>
    </disk>
    <interface type='bridge'>
      <source bridge='br0'/>
      <mac address='02:00:93:fb:3b:78'/>
      <target dev='one-55782-0'/>
      <model type='virtio'/>
      <filterref filter='no-arp-mac-spoofing'>
        <parameter name='IP' value='147.251.59.120'/>
      </filterref>
    </interface>
  </devices>
  <features>
    <pae/>
    <acpi/>
  </features>
  <!-- RAW data follows: -->
  <cpu mode='host-passthrough'>
    <topology sockets='8' cores='4' threads='1'/>
    <numa>
      <cell cpus='0-3' memory='62000000'/>
      <cell cpus='4-7' memory='62000000'/>
      <cell cpus='8-11' memory='62000000'/>
      <cell cpus='12-15' memory='62000000'/>
      <cell cpus='16-19' memory='62000000'/>
      <cell cpus='20-23' memory='62000000'/>
      <cell cpus='24-27' memory='62000000'/>
      <cell cpus='28-31' memory='62000000'/>
    </numa>
  </cpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='0'/> <vcpupin vcpu='1' cpuset='2'/> <vcpupin vcpu='2' cpuset='4'/> <vcpupin vcpu='3' cpuset='6'/>
    <vcpupin vcpu='4' cpuset='8'/> <vcpupin vcpu='5' cpuset='10'/> <vcpupin vcpu='6' cpuset='12'/> <vcpupin vcpu='7' cpuset='14'/>
    <vcpupin vcpu='8' cpuset='16'/> <vcpupin vcpu='9' cpuset='18'/> <vcpupin vcpu='10' cpuset='20'/> <vcpupin vcpu='11' cpuset='22'/>
    <vcpupin vcpu='12' cpuset='24'/> <vcpupin vcpu='13' cpuset='26'/> <vcpupin vcpu='14' cpuset='28'/> <vcpupin vcpu='15' cpuset='30'/>
    <vcpupin vcpu='16' cpuset='1'/> <vcpupin vcpu='17' cpuset='3'/> <vcpupin vcpu='18' cpuset='5'/> <vcpupin vcpu='19' cpuset='7'/>
    <vcpupin vcpu='20' cpuset='9'/> <vcpupin vcpu='21' cpuset='11'/> <vcpupin vcpu='22' cpuset='13'/> <vcpupin vcpu='23' cpuset='15'/>
    <vcpupin vcpu='24' cpuset='17'/> <vcpupin vcpu='25' cpuset='19'/> <vcpupin vcpu='26' cpuset='21'/> <vcpupin vcpu='27' cpuset='23'/>
    <vcpupin vcpu='28' cpuset='25'/> <vcpupin vcpu='29' cpuset='27'/> <vcpupin vcpu='30' cpuset='29'/> <vcpupin vcpu='31' cpuset='31'/>
  </cputune>
  <numatune>
    <memory mode='preferred' nodeset='0'/>
  </numatune>
  <devices>
    <serial type='pty'><target port='0'/></serial>
    <console type='pty'><target type='serial' port='0'/></console>
    <channel type='pty'><target type='virtio' name='org.qemu.guest_agent.0'/></channel>
  </devices>
  <devices>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source><address domain='0x0' bus='0x11' slot='0x0' function='0x1'/></source>
    </hostdev>
  </devices>
  <devices>
    <controller type='pci' index='1' model='pci-bridge'/>
    <controller type='pci' index='2' model='pci-bridge'/>
    <controller type='pci' index='3' model='pci-bridge'/>
    <controller type='pci' index='4' model='pci-bridge'/>
    <controller type='pci' index='5' model='pci-bridge'/>
  </devices>
  <metadata>
    <system_datastore><![CDATA[/opt/opennebula/var/datastores/108/55782]]></system_datastore>
  </metadata>
</domain>

If I run, e.g., spec2017 in the virtual machine, I can see:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1350 root 20 0 843136 830068 2524 R 78.1 0.2 513:16.16 bwaves_r_base.m
2456 root 20 0 804608 791264 2524 R 76.6 0.2 491:39.92 bwaves_r_base.m
4631 root 20 0 843136 829892 2344 R 75.8 0.2 450:16.04 bwaves_r_base.m
6441 root 20 0 802580 790212 2532 R 75.0 0.2 120:37.54 bwaves_r_base.m
7991 root 20 0 784676 772092 2576 R 75.0 0.2 387:15.39 bwaves_r_base.m
8142 root 20 0 843136 830044 2496 R 75.0 0.2 384:39.02 bwaves_r_base.m
8234 root 20 0 843136 830064 2524 R 75.0 0.2 99:04.48 bwaves_r_base.m
8578 root 20 0 749240 736604 2468 R 73.4 0.2 375:45.66 bwaves_r_base.m
9974 root 20 0 784676 771984 2468 R 73.4 0.2 348:01.36 bwaves_r_base.m
10396 root 20 0 802580 790264 2576 R 73.4 0.2 340:08.40 bwaves_r_base.m
12932 root 20 0 843136 830024 2480 R 73.4 0.2 288:39.76 bwaves_r_base.m
13113 root 20 0 784676 771864 2348 R 71.9 0.2 284:47.34 bwaves_r_base.m
13518 root 20 0 784676 762816 2540 R 71.9 0.2 276:31.58 bwaves_r_base.m
14443 root 20 0 784676 771984 2468 R 71.9 0.2 260:01.82 bwaves_r_base.m
12791 root 20 0 784676 772060 2544 R 70.3 0.2 291:43.96 bwaves_r_base.m
10544 root 20 0 843136 830068 2520 R 68.8 0.2 336:47.43 bwaves_r_base.m
15464 root 20 0 784676 762880 2608 R 60.9 0.2 239:19.14 bwaves_r_base.m
15487 root 20 0 784676 772048 2532 R 60.2 0.2 238:37.07 bwaves_r_base.m
16824 root 20 0 784676 772120 2604 R 55.5 0.2 212:10.92 bwaves_r_base.m
17255 root 20 0 843136 830012 2468 R 54.7 0.2 203:22.89 bwaves_r_base.m
17962 root 20 0 784676 772004 2488 R 54.7 0.2 188:26.07 bwaves_r_base.m
17505 root 20 0 843136 830068 2520 R 53.1 0.2 198:04.25 bwaves_r_base.m
27767 root 20 0 784676 771860 2344 R 52.3 0.2 592:25.95 bwaves_r_base.m
24458 root 20 0 843136 829888 2344 R 50.8 0.2 658:23.70 bwaves_r_base.m
30746 root 20 0 747376 735160 2604 R 43.0 0.2 556:47.67 bwaves_r_base.m

The CPU time should be roughly the same for all of these processes, but huge differences are obvious.

This is what I see on the hypervisor:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
18201 oneadmin 20 0 474.0g 473.3g 1732 S 2459 94.0 33332:54 kvm
369 root 20 0 0 0 0 R 100.0 0.0 768:12.85 kswapd1
368 root 20 0 0 0 0 R 94.1 0.0 869:05.61 kswapd0

i.e., kswapd is eating a whole CPU, even though swap is turned off.

[root@hde10 ~]# free
      total       used      free    shared  buff/cache  available
Mem:  528151432  503432580  1214048  34740  23504804  21907800
Swap: 0          0          0

The hypervisor is:

[root@hde10 ~]# cat /etc/redhat-release
CentOS Linux release 7.5.1804 (Core)
qemu-kvm-1.5.3-156.el7_5.5.x86_64

The virtual machine runs Debian 9.

Moreover, I'm using this type of disk for the virtual machines:

<disk type='file' device='disk'>
  <source file='/opt/opennebula/var/datastores/108/55782/disk.3'/>
  <target dev='vde'/>
  <driver name='qemu' type='raw' cache='unsafe'/>
</disk>

If I keep cache='unsafe' and run an iozone test on really big files (e.g., 8x 100 GB), I can see huge page cache pressure on the hypervisor - all 8 kswapd threads are running at 100 % and slowing things down. The disk under the datastore is an Intel 4500 NVMe SSD.

If I set cache='none', the kswapd threads are idle and disk writes are pretty fast; however, with the 8-NUMA configuration, writes slow down to less than 10 MB/s as soon as the amount of written data is roughly the same as the memory size of the virtual node. iozone then sits at 100 % CPU usage and it seems that it is traversing page lists. If I do the same with a 1-NUMA configuration, everything is OK except for a performance penalty of about 25 %.

--
Lukáš Hejtmánek
Linux Administrator only because Full Time Multitasking Ninja is not an official job title
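For comparison, the cache='none' case mentioned above differs only in the driver line of the same disk definition; a minimal sketch (the io='native' attribute is an addition for illustration, commonly paired with cache='none' to use Linux native AIO, and is not part of the original setup):

<disk type='file' device='disk'>
  <source file='/opt/opennebula/var/datastores/108/55782/disk.3'/>
  <target dev='vde'/>
  <!-- cache='none' bypasses the host page cache; io='native' is an optional, assumed extra -->
  <driver name='qemu' type='raw' cache='none' io='native'/>
</disk>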

Hello,

OK, I found that the CPU pinning was wrong, so I corrected it to be 1:1. The issue with iozone remains the same. The spec is running; however, it runs slower than in the 1-NUMA case.

The corrected XML looks as follows:

<cpu mode='host-passthrough'>
  <topology sockets='8' cores='4' threads='1'/>
  <numa>
    <cell cpus='0-3' memory='62000000'/>
    <cell cpus='4-7' memory='62000000'/>
    <cell cpus='8-11' memory='62000000'/>
    <cell cpus='12-15' memory='62000000'/>
    <cell cpus='16-19' memory='62000000'/>
    <cell cpus='20-23' memory='62000000'/>
    <cell cpus='24-27' memory='62000000'/>
    <cell cpus='28-31' memory='62000000'/>
  </numa>
</cpu>
<cputune>
  <vcpupin vcpu='0' cpuset='0'/> <vcpupin vcpu='1' cpuset='1'/> <vcpupin vcpu='2' cpuset='2'/> <vcpupin vcpu='3' cpuset='3'/>
  <vcpupin vcpu='4' cpuset='4'/> <vcpupin vcpu='5' cpuset='5'/> <vcpupin vcpu='6' cpuset='6'/> <vcpupin vcpu='7' cpuset='7'/>
  <vcpupin vcpu='8' cpuset='8'/> <vcpupin vcpu='9' cpuset='9'/> <vcpupin vcpu='10' cpuset='10'/> <vcpupin vcpu='11' cpuset='11'/>
  <vcpupin vcpu='12' cpuset='12'/> <vcpupin vcpu='13' cpuset='13'/> <vcpupin vcpu='14' cpuset='14'/> <vcpupin vcpu='15' cpuset='15'/>
  <vcpupin vcpu='16' cpuset='16'/> <vcpupin vcpu='17' cpuset='17'/> <vcpupin vcpu='18' cpuset='18'/> <vcpupin vcpu='19' cpuset='19'/>
  <vcpupin vcpu='20' cpuset='20'/> <vcpupin vcpu='21' cpuset='21'/> <vcpupin vcpu='22' cpuset='22'/> <vcpupin vcpu='23' cpuset='23'/>
  <vcpupin vcpu='24' cpuset='24'/> <vcpupin vcpu='25' cpuset='25'/> <vcpupin vcpu='26' cpuset='26'/> <vcpupin vcpu='27' cpuset='27'/>
  <vcpupin vcpu='28' cpuset='28'/> <vcpupin vcpu='29' cpuset='29'/> <vcpupin vcpu='30' cpuset='30'/> <vcpupin vcpu='31' cpuset='31'/>
</cputune>
<numatune>
  <memory mode='strict' nodeset='0-7'/>
</numatune>

In this case, the first part took more than 1700 seconds, while the 1-NUMA config finishes it in 1646 seconds. Running directly on the hypervisor, the 1-NUMA config finishes in 1470 seconds and the 8-NUMA config in 900 seconds.

On Fri, Sep 14, 2018 at 02:06:26PM +0200, Lukas Hejtmanek wrote:
-- Lukáš Hejtmánek Linux Administrator only because Full Time Multitasking Ninja is not an official job title

Hello again,

when the iozone writes are slow, this is how slabtop looks:

OBJS ACTIVE USE OBJ-SIZE SLABS OBJ/SLAB CACHE-SIZE NAME
62476752 62476728 0% 0.10K 1601968 39 6407872K buffer_head
1000678 999168 0% 0.56K 142954 7 571816K radix_tree_node
132184 125911 0% 0.03K 1066 124 4264K kmalloc-32
118496 118224 0% 0.12K 3703 32 14812K kmalloc-node
73206 56467 0% 0.19K 3486 21 13944K dentry
34816 33247 0% 0.12K 1024 34 4096K kernfs_node_cache
34496 29031 0% 0.06K 539 64 2156K kmalloc-64
23283 22707 0% 1.05K 7761 3 31044K ext4_inode_cache
16940 16052 0% 0.57K 2420 7 9680K inode_cache
14464 4124 0% 0.06K 226 64 904K anon_vma_chain
11900 11841 0% 0.14K 425 28 1700K ext4_groupinfo_4k
11312 9861 0% 0.50K 1414 8 5656K kmalloc-512
10692 10066 0% 0.04K 108 99 432K ext4_extent_status
10688 4238 0% 0.25K 668 16 2672K kmalloc-256
8120 2420 0% 0.07K 145 56 580K anon_vma
8040 4563 0% 0.20K 402 20 1608K vm_area_struct
7488 3845 0% 0.12K 234 32 936K kmalloc-96
7456 7061 0% 1.00K 1864 4 7456K kmalloc-1024
7234 7227 0% 4.00K 7234 1 28936K kmalloc-4096

and this is /proc/$PID/stack of iozone eating CPU but not writing data:

[<ffffffffba78151b>] find_get_entry+0x1b/0x100
[<ffffffffba781de0>] pagecache_get_page+0x30/0x2a0
[<ffffffffc06ec12b>] ext4_da_get_block_prep+0x27b/0x440 [ext4]
[<ffffffffba840d8b>] __find_get_block_slow+0x3b/0x150
[<ffffffffba840ebd>] unmap_underlying_metadata+0x1d/0x70
[<ffffffffc06ec960>] ext4_block_write_begin+0x2e0/0x520 [ext4]
[<ffffffffc06ebeb0>] ext4_inode_attach_jinode.part.72+0xa0/0xa0 [ext4]
[<ffffffffc041f9f9>] jbd2__journal_start+0xd9/0x1e0 [jbd2]
[<ffffffffba80511a>] __check_object_size+0xfa/0x1d8
[<ffffffffba946b85>] iov_iter_copy_from_user_atomic+0xa5/0x330
[<ffffffffba780dcb>] generic_perform_write+0xfb/0x1d0
[<ffffffffba7831ca>] __generic_file_write_iter+0x16a/0x1b0
[<ffffffffc06e7220>] ext4_file_write_iter+0x90/0x370 [ext4]
[<ffffffffc06e7190>] ext4_dax_fault+0x140/0x140 [ext4]
[<ffffffffba6aef01>] update_curr+0xe1/0x160
[<ffffffffba808890>] new_sync_write+0xe0/0x130
[<ffffffffba809010>] vfs_write+0xb0/0x190
[<ffffffffba80a452>] SyS_write+0x52/0xc0
[<ffffffffba603b7d>] do_syscall_64+0x8d/0xf0
[<ffffffffbac15c4e>] entry_SYSCALL_64_after_swapgs+0x58/0xc6
[<ffffffffffffffff>] 0xffffffffffffffff

On Fri, Sep 14, 2018 at 03:36:59PM +0200, Lukas Hejtmanek wrote:
-- Lukáš Hejtmánek Linux Administrator only because Full Time Multitasking Ninja is not an official job title

Hello,

I did some performance measurements with SPEC CPU 2017 in the fp rate variant (i.e., utilizing all CPU cores). It looks like this:

8-NUMA hypervisor                     specfp2017 - 124
1-NUMA hypervisor                     specfp2017 - 103
2-NUMA hypervisor                     specfp2017 - 120
8-NUMA virtual (on 8N hypervisor)     specfp2017 - 92
1-NUMA virtual (on 1N hypervisor)     specfp2017 - 95.2
2-NUMA virtual (on 2N hypervisor)     specfp2017 - 98   (memory strict)
2-NUMA virtual (on 2N hypervisor)     specfp2017 - 98.1 (memory interleave)
2x 1-NUMA virtual (on 2N hypervisor)  specfp2017 - 117.2 (sum for both)

On Fri, Sep 14, 2018 at 03:40:56PM +0200, Lukas Hejtmanek wrote:
-- Lukáš Hejtmánek Linux Administrator only because Full Time Multitasking Ninja is not an official job title

On 09/14/2018 03:36 PM, Lukas Hejtmanek wrote:
Hello,
OK, I found that the CPU pinning was wrong, so I corrected it to be 1:1. The issue with iozone remains the same.
The spec is running; however, it runs slower than in the 1-NUMA case.
The corrected XML looks as follows: [Reformatted XML for better reading]
<cpu mode="host-passthrough"> <topology sockets="8" cores="4" threads="1"/> <numa> <cell cpus="0-3" memory="62000000"/> <cell cpus="4-7" memory="62000000"/> <cell cpus="8-11" memory="62000000"/> <cell cpus="12-15" memory="62000000"/> <cell cpus="16-19" memory="62000000"/> <cell cpus="20-23" memory="62000000"/> <cell cpus="24-27" memory="62000000"/> <cell cpus="28-31" memory="62000000"/> </numa> </cpu> <cputune> <vcpupin vcpu="0" cpuset="0"/> <vcpupin vcpu="1" cpuset="1"/> <vcpupin vcpu="2" cpuset="2"/> <vcpupin vcpu="3" cpuset="3"/> <vcpupin vcpu="4" cpuset="4"/> <vcpupin vcpu="5" cpuset="5"/> <vcpupin vcpu="6" cpuset="6"/> <vcpupin vcpu="7" cpuset="7"/> <vcpupin vcpu="8" cpuset="8"/> <vcpupin vcpu="9" cpuset="9"/> <vcpupin vcpu="10" cpuset="10"/> <vcpupin vcpu="11" cpuset="11"/> <vcpupin vcpu="12" cpuset="12"/> <vcpupin vcpu="13" cpuset="13"/> <vcpupin vcpu="14" cpuset="14"/> <vcpupin vcpu="15" cpuset="15"/> <vcpupin vcpu="16" cpuset="16"/> <vcpupin vcpu="17" cpuset="17"/> <vcpupin vcpu="18" cpuset="18"/> <vcpupin vcpu="19" cpuset="19"/> <vcpupin vcpu="20" cpuset="20"/> <vcpupin vcpu="21" cpuset="21"/> <vcpupin vcpu="22" cpuset="22"/> <vcpupin vcpu="23" cpuset="23"/> <vcpupin vcpu="24" cpuset="24"/> <vcpupin vcpu="25" cpuset="25"/> <vcpupin vcpu="26" cpuset="26"/> <vcpupin vcpu="27" cpuset="27"/> <vcpupin vcpu="28" cpuset="28"/> <vcpupin vcpu="29" cpuset="29"/> <vcpupin vcpu="30" cpuset="30"/> <vcpupin vcpu="31" cpuset="31"/> </cputune> <numatune> <memory mode="strict" nodeset="0-7"/> </numatune> However, this is not enough. This XML pins only vCPUs and not guest memory. So while say vCPU #0 is pinned onto physical CPU #0, the memory for guest NUMA #0 might be allocated at host NUMA #7 (for instance). You need to add: <numatune> <memnode cellid="0" mode="strict" nodeset="0"/> <memnode cellid="1" mode="strict" nodeset="1"/> ... </numatune> This will ensure also the guest memory pinning. But wait, there is more. In your later e-mails you mention slow disk I/O. This might be caused by various variables but the most obvious one in this case is qemu I/O loop, I'd say. Without iothreads qemu has only one I/O loop and thus if your guest issues writes from all 32 cores at once this loop is unable to handle it (performance wise) and therefore the performance drop. You can try enabling iothreads: https://libvirt.org/formatdomain.html#elementsIOThreadsAllocation This is a qemu feature that allows you to create more I/O threads and also pin them. This is an example how to use them: https://libvirt.org/git/?p=libvirt.git;a=blob;f=tests/qemuxml2argvdata/iothr... And this is an example how to pin them: https://libvirt.org/git/?p=libvirt.git;a=blob;f=tests/qemuxml2argvdata/cputu... Also, since iothreads are capable of handling just any I/O they can be used for other devices too, not only disks. For instance interfaces. Hopefully, this will boost your performance. Regards, Michal (who is a bit envious about your machine :-P)

Hello,

so the current domain configuration:

<cpu mode='host-passthrough'>
  <topology sockets='8' cores='4' threads='1'/>
  <numa>
    <cell cpus='0-3' memory='62000000'/>
    <cell cpus='4-7' memory='62000000'/>
    <cell cpus='8-11' memory='62000000'/>
    <cell cpus='12-15' memory='62000000'/>
    <cell cpus='16-19' memory='62000000'/>
    <cell cpus='20-23' memory='62000000'/>
    <cell cpus='24-27' memory='62000000'/>
    <cell cpus='28-31' memory='62000000'/>
  </numa>
</cpu>
<cputune>
  <vcpupin vcpu='0' cpuset='0'/> <vcpupin vcpu='1' cpuset='1'/> <vcpupin vcpu='2' cpuset='2'/> <vcpupin vcpu='3' cpuset='3'/>
  <vcpupin vcpu='4' cpuset='4'/> <vcpupin vcpu='5' cpuset='5'/> <vcpupin vcpu='6' cpuset='6'/> <vcpupin vcpu='7' cpuset='7'/>
  <vcpupin vcpu='8' cpuset='8'/> <vcpupin vcpu='9' cpuset='9'/> <vcpupin vcpu='10' cpuset='10'/> <vcpupin vcpu='11' cpuset='11'/>
  <vcpupin vcpu='12' cpuset='12'/> <vcpupin vcpu='13' cpuset='13'/> <vcpupin vcpu='14' cpuset='14'/> <vcpupin vcpu='15' cpuset='15'/>
  <vcpupin vcpu='16' cpuset='16'/> <vcpupin vcpu='17' cpuset='17'/> <vcpupin vcpu='18' cpuset='18'/> <vcpupin vcpu='19' cpuset='19'/>
  <vcpupin vcpu='20' cpuset='20'/> <vcpupin vcpu='21' cpuset='21'/> <vcpupin vcpu='22' cpuset='22'/> <vcpupin vcpu='23' cpuset='23'/>
  <vcpupin vcpu='24' cpuset='24'/> <vcpupin vcpu='25' cpuset='25'/> <vcpupin vcpu='26' cpuset='26'/> <vcpupin vcpu='27' cpuset='27'/>
  <vcpupin vcpu='28' cpuset='28'/> <vcpupin vcpu='29' cpuset='29'/> <vcpupin vcpu='30' cpuset='30'/> <vcpupin vcpu='31' cpuset='31'/>
</cputune>
<numatune>
  <memnode cellid="0" mode="strict" nodeset="0"/>
  <memnode cellid="1" mode="strict" nodeset="1"/>
  <memnode cellid="2" mode="strict" nodeset="2"/>
  <memnode cellid="3" mode="strict" nodeset="3"/>
  <memnode cellid="4" mode="strict" nodeset="4"/>
  <memnode cellid="5" mode="strict" nodeset="5"/>
  <memnode cellid="6" mode="strict" nodeset="6"/>
  <memnode cellid="7" mode="strict" nodeset="7"/>
</numatune>

Hopefully, I got it right.

The good news is that the spec benchmark looks promising. The first test, bwaves, finished in 1003 seconds, compared to 1700 seconds in the previous wrong case. So far so good.

The bad news is that iozone is still the same. There might be some misunderstanding; I have two cases:

1) cache=unsafe. In this case, I can see that the hypervisor is prone to swap - and swap a lot. It usually eats the whole swap partition and kswapd runs at 100 % CPU. swappiness, dirty_ratio and company do not improve things at all. However, I believe this is simply the wrong option for scratch disks where one can expect huge I/O load. Moreover, the hypervisor is a poor machine with only little memory left (OK, in my case about 10 GB available), so it does not make sense to use that memory for additional cache/disk buffers.

2) cache=none. In this case, performance is better (only a few percent behind bare metal). However, as soon as the size of the stored data is about the size of the memory of the virtual machine, writes stop and iozone eats a whole CPU; it looks like it is searching for free pages and it gets harder and harder. But I am not sure, I am not skilled in this area.

Here you can clearly see that it starts writing, does the writes, then takes a pause, writes again, and so on, but the pauses get longer and longer:
https://pastebin.com/2gfPFgb9

The output runs until the very end of iozone (I cancelled it with ctrl-c).

It seems that this is not happening on a 2-NUMA node with rotational disks only. It is partly happening on a 2-NUMA node with 2 NVMe SSDs. "Partly" means that there are also pauses in writes, but it finishes; speed is reduced, though.
On a 1-NUMA node, with the same test, I can see steady writes from the very beginning to the very end at roughly the same speed.

Maybe it could be related to the fact that the NVMe is a PCI device that is linked to one NUMA node only?

As for iothreads, I have only one disk (the vde one) that is exposed to high I/O load, so I believe more I/O threads are not applicable here. If I understand correctly, I cannot assign more iothreads to a single device. And it does not seem to be iothread-related anyway, as the same scenario in the 1-NUMA configuration works OK (I mean that memory penalties can be huge because it does not reflect the real NUMA topology, but disk speed is OK anyway).

And as for that machine, what about this one? :)

[root@urga1 ~]$ free -g
      total  used  free  shared  buff/cache  available
Mem:  5857   75    5746  0       35          5768
...
NUMA node47 CPU(s): 376-383

this is not virtualized, though :)

On Mon, Sep 17, 2018 at 03:08:34PM +0200, Michal Privoznik wrote:
On 09/14/2018 03:36 PM, Lukas Hejtmanek wrote:
Hello,
ok, I found that cpu pinning was wrong, so I corrected it to be 1:1. The issue with iozone remains the same.
The spec is running, however, it runs slower than 1-NUMA case.
The corrected XML looks like follows: [Reformated XML for better reading]
<cpu mode="host-passthrough"> <topology sockets="8" cores="4" threads="1"/> <numa> <cell cpus="0-3" memory="62000000"/> <cell cpus="4-7" memory="62000000"/> <cell cpus="8-11" memory="62000000"/> <cell cpus="12-15" memory="62000000"/> <cell cpus="16-19" memory="62000000"/> <cell cpus="20-23" memory="62000000"/> <cell cpus="24-27" memory="62000000"/> <cell cpus="28-31" memory="62000000"/> </numa> </cpu> <cputune> <vcpupin vcpu="0" cpuset="0"/> <vcpupin vcpu="1" cpuset="1"/> <vcpupin vcpu="2" cpuset="2"/> <vcpupin vcpu="3" cpuset="3"/> <vcpupin vcpu="4" cpuset="4"/> <vcpupin vcpu="5" cpuset="5"/> <vcpupin vcpu="6" cpuset="6"/> <vcpupin vcpu="7" cpuset="7"/> <vcpupin vcpu="8" cpuset="8"/> <vcpupin vcpu="9" cpuset="9"/> <vcpupin vcpu="10" cpuset="10"/> <vcpupin vcpu="11" cpuset="11"/> <vcpupin vcpu="12" cpuset="12"/> <vcpupin vcpu="13" cpuset="13"/> <vcpupin vcpu="14" cpuset="14"/> <vcpupin vcpu="15" cpuset="15"/> <vcpupin vcpu="16" cpuset="16"/> <vcpupin vcpu="17" cpuset="17"/> <vcpupin vcpu="18" cpuset="18"/> <vcpupin vcpu="19" cpuset="19"/> <vcpupin vcpu="20" cpuset="20"/> <vcpupin vcpu="21" cpuset="21"/> <vcpupin vcpu="22" cpuset="22"/> <vcpupin vcpu="23" cpuset="23"/> <vcpupin vcpu="24" cpuset="24"/> <vcpupin vcpu="25" cpuset="25"/> <vcpupin vcpu="26" cpuset="26"/> <vcpupin vcpu="27" cpuset="27"/> <vcpupin vcpu="28" cpuset="28"/> <vcpupin vcpu="29" cpuset="29"/> <vcpupin vcpu="30" cpuset="30"/> <vcpupin vcpu="31" cpuset="31"/> </cputune> <numatune> <memory mode="strict" nodeset="0-7"/> </numatune>
However, this is not enough. This XML pins only vCPUs and not guest memory. So while say vCPU #0 is pinned onto physical CPU #0, the memory for guest NUMA #0 might be allocated at host NUMA #7 (for instance). You need to add:
<numatune> <memnode cellid="0" mode="strict" nodeset="0"/> <memnode cellid="1" mode="strict" nodeset="1"/> ... </numatune>
This will also ensure the guest memory pinning. But wait, there is more. In your later e-mails you mention slow disk I/O. This might be caused by various factors, but the most obvious one in this case is the qemu I/O loop, I'd say. Without iothreads, qemu has only one I/O loop, so if your guest issues writes from all 32 cores at once this loop is unable to handle them (performance-wise), hence the performance drop. You can try enabling iothreads:
https://libvirt.org/formatdomain.html#elementsIOThreadsAllocation
This is a qemu feature that allows you to create more I/O threads and also pin them. This is an example how to use them:
https://libvirt.org/git/?p=libvirt.git;a=blob;f=tests/qemuxml2argvdata/iothr...
And this is an example how to pin them:
https://libvirt.org/git/?p=libvirt.git;a=blob;f=tests/qemuxml2argvdata/cputu...
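For illustration only, a rough sketch of how this could look for the domain discussed in this thread; the iothread count, cpuset values and the cache/io driver options are placeholders, not a tested recommendation:

<domain type='kvm'>
  ...
  <!-- create two extra I/O event loops besides qemu's main loop -->
  <iothreads>2</iothreads>
  <cputune>
    <!-- optionally pin each iothread to host CPUs, e.g. close to the disk's NUMA node -->
    <iothreadpin iothread='1' cpuset='0-3'/>
    <iothreadpin iothread='2' cpuset='4-7'/>
  </cputune>
  <devices>
    <disk type='file' device='disk'>
      <source file='/opt/opennebula/var/datastores/108/55782/disk.3'/>
      <target dev='vde' bus='virtio'/>
      <!-- hand this disk's I/O to iothread 1 (works for virtio-blk disks) -->
      <driver name='qemu' type='raw' cache='none' io='native' iothread='1'/>
    </disk>
  </devices>
</domain>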
Also, since iothreads are capable of handling any I/O, they can be used for other devices too, not only disks. For instance, interfaces.
Hopefully, this will boost your performance.
Regards, Michal (who is a bit envious about your machine :-P)
-- Lukáš Hejtmánek Linux Administrator only because Full Time Multitasking Ninja is not an official job title

On 09/17/2018 04:59 PM, Lukas Hejtmanek wrote:
Hello,
so the current domain configuration:

<cpu mode='host-passthrough'>
  <topology sockets='8' cores='4' threads='1'/>
  <numa>
    <cell cpus='0-3' memory='62000000'/>
    <cell cpus='4-7' memory='62000000'/>
    <cell cpus='8-11' memory='62000000'/>
    <cell cpus='12-15' memory='62000000'/>
    <cell cpus='16-19' memory='62000000'/>
    <cell cpus='20-23' memory='62000000'/>
    <cell cpus='24-27' memory='62000000'/>
    <cell cpus='28-31' memory='62000000'/>
  </numa>
</cpu>
<cputune>
  <vcpupin vcpu='0' cpuset='0'/> <vcpupin vcpu='1' cpuset='1'/> <vcpupin vcpu='2' cpuset='2'/> <vcpupin vcpu='3' cpuset='3'/>
  <vcpupin vcpu='4' cpuset='4'/> <vcpupin vcpu='5' cpuset='5'/> <vcpupin vcpu='6' cpuset='6'/> <vcpupin vcpu='7' cpuset='7'/>
  <vcpupin vcpu='8' cpuset='8'/> <vcpupin vcpu='9' cpuset='9'/> <vcpupin vcpu='10' cpuset='10'/> <vcpupin vcpu='11' cpuset='11'/>
  <vcpupin vcpu='12' cpuset='12'/> <vcpupin vcpu='13' cpuset='13'/> <vcpupin vcpu='14' cpuset='14'/> <vcpupin vcpu='15' cpuset='15'/>
  <vcpupin vcpu='16' cpuset='16'/> <vcpupin vcpu='17' cpuset='17'/> <vcpupin vcpu='18' cpuset='18'/> <vcpupin vcpu='19' cpuset='19'/>
  <vcpupin vcpu='20' cpuset='20'/> <vcpupin vcpu='21' cpuset='21'/> <vcpupin vcpu='22' cpuset='22'/> <vcpupin vcpu='23' cpuset='23'/>
  <vcpupin vcpu='24' cpuset='24'/> <vcpupin vcpu='25' cpuset='25'/> <vcpupin vcpu='26' cpuset='26'/> <vcpupin vcpu='27' cpuset='27'/>
  <vcpupin vcpu='28' cpuset='28'/> <vcpupin vcpu='29' cpuset='29'/> <vcpupin vcpu='30' cpuset='30'/> <vcpupin vcpu='31' cpuset='31'/>
</cputune>
<numatune>
  <memnode cellid="0" mode="strict" nodeset="0"/>
  <memnode cellid="1" mode="strict" nodeset="1"/>
  <memnode cellid="2" mode="strict" nodeset="2"/>
  <memnode cellid="3" mode="strict" nodeset="3"/>
  <memnode cellid="4" mode="strict" nodeset="4"/>
  <memnode cellid="5" mode="strict" nodeset="5"/>
  <memnode cellid="6" mode="strict" nodeset="6"/>
  <memnode cellid="7" mode="strict" nodeset="7"/>
</numatune>
hopefully, I got it right.
Yes, looking good.
The good news is that the spec benchmark looks promising. The first test, bwaves, finished in 1003 seconds compared to 1700 seconds in the previous, wrong case. So far so good.
Very well, this means that the config above is correct.
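(Just to double-check from the host side, the actual placement can also be verified while the guest is running. A sketch, using the domain name from this thread; the pgrep pattern is only an assumption about how the qemu process is named:)

# per-NUMA-node memory usage of the qemu process backing the guest
numastat -p $(pgrep -f one-55782)

# vCPU-to-physical-CPU mapping and affinity as libvirt sees it
virsh vcpuinfo one-55782

# current memory tuning (mode/nodeset) of the running domain
virsh numatune one-55782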
The bad news is that iozone is still the same. There might be some misunderstanding.
I have two cases:
1) cache=unsafe. In this case, I can see that the hypervisor is prone to swap. A lot. It usually eats the whole swap partition and kswapd runs at 100% CPU. swappiness, dirty_ratio and company do not improve things at all. However, I believe this is simply the wrong option for scratch disks where one can expect a huge I/O load. Moreover, the hypervisor is left with only a little memory (in my case about 10 GB available), so it does not make sense to use that memory for additional cache/disk buffers.
One thing that just occurred to me - is the qcow2 file fully allocated?

# qemu-img info /var/lib/libvirt/images/fedora.qcow2
..
virtual size: 20G (21474836480 bytes)
disk size: 7.0G
..

This is NOT a fully allocated qcow2.
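(If it turns out not to be fully allocated, a preallocated image can be created or converted roughly like this; file names and size below are placeholders:)

# create a new qcow2 with all clusters allocated up front
qemu-img create -f qcow2 -o preallocation=full /path/to/scratch.qcow2 500G

# or turn an existing sparse image into a fully allocated copy
qemu-img convert -f qcow2 -O qcow2 -o preallocation=full sparse.qcow2 preallocated.qcow2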
2) cache=none. In this case, performance is better (only a few percent behind bare metal). However, as soon as the size of the stored data approaches the memory size of the virtual machine, writes stop and iozone eats a whole CPU; it looks like it is searching for more free pages and it gets harder and harder. But I am not sure, I am not skilled in this area.
Hmm. Could it be that SSD doesn't have enough free blocks and thus writes are throttled? Can you fstrim it and see if that helps?
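(Something like the following, with the mount point of the scratch filesystem as a placeholder:)

# discard unused blocks so the SSD gets its free blocks back; -v reports how much was trimmed
fstrim -v /scratch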
Here you can clearly see that it starts the writes, does the writes, then takes a pause, writes again, and so on, but the pauses get longer and longer: https://pastebin.com/2gfPFgb9 The output runs until the very end of iozone (I cancelled it with Ctrl-C).
It seems that this is not happening on a 2-NUMA node with rotational disks only. It is partly happening on a 2-NUMA node with 2 NVMe SSDs. "Partly" means that there are also pauses in writes, but it finishes; the speed is reduced though. On a 1-NUMA node, with the same test, I can see steady writes from the very beginning to the very end at roughly the same speed.
Maybe it could be related to the fact that the NVMe is a PCI device that is linked to one NUMA node only?
Can be. I don't know qemu internals that much to know if its capable of doing zero copy disk writes.
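(For what it's worth, the NUMA node a PCI device sits on can at least be read from sysfs; the PCI address below is a placeholder:)

# prints the NUMA node the NVMe's PCI function is attached to (-1 = no affinity reported)
cat /sys/bus/pci/devices/0000:41:00.0/numa_node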
As for iothreads, I have only 1 disk (the vde) that is exposed to high I/O load, so I believe more I/O threads are not applicable here. If I understand correctly, I cannot assign more iothreads to a single device. And it does not seem to be iothread-related, as the same scenario in a 1-NUMA configuration works OK (I mean that memory penalties can be huge as it does not reflect the real NUMA topology, but disk speed is OK anyway).
Ah, since it's only one disk, iothreads will not help much here. Still worth giving it a shot ;-) Remember, iothreads are for all I/O, not disk I/O only. Anyway, this is the point where I have to say "I don't know". Sorry. Try contacting the qemu guys: qemu-discuss@nongnu.org qemu-devel@nongnu.org

Michal

Hello,

so the final working solution for the 8-NUMA node configuration is:

<cpu mode='host-passthrough'>
  <topology sockets='8' cores='4' threads='1'/>
  <numa>
    <cell id='0' cpus='0-3' memory='62000000'>
      <distances> <sibling id='0' value='10'/> <sibling id='1' value='16'/> <sibling id='2' value='16'/> <sibling id='3' value='16'/> <sibling id='4' value='32'/> <sibling id='5' value='32'/> <sibling id='6' value='32'/> <sibling id='7' value='32'/> </distances>
    </cell>
    <cell id='1' cpus='4-7' memory='62000000'>
      <distances> <sibling id='0' value='16'/> <sibling id='1' value='10'/> <sibling id='2' value='16'/> <sibling id='3' value='16'/> <sibling id='4' value='32'/> <sibling id='5' value='32'/> <sibling id='6' value='32'/> <sibling id='7' value='32'/> </distances>
    </cell>
    <cell id='2' cpus='8-11' memory='62000000'>
      <distances> <sibling id='0' value='16'/> <sibling id='1' value='16'/> <sibling id='2' value='10'/> <sibling id='3' value='16'/> <sibling id='4' value='32'/> <sibling id='5' value='32'/> <sibling id='6' value='32'/> <sibling id='7' value='32'/> </distances>
    </cell>
    <cell id='3' cpus='12-15' memory='62000000'>
      <distances> <sibling id='0' value='16'/> <sibling id='1' value='16'/> <sibling id='2' value='16'/> <sibling id='3' value='10'/> <sibling id='4' value='32'/> <sibling id='5' value='32'/> <sibling id='6' value='32'/> <sibling id='7' value='32'/> </distances>
    </cell>
    <cell id='4' cpus='16-19' memory='62000000'>
      <distances> <sibling id='0' value='32'/> <sibling id='1' value='32'/> <sibling id='2' value='32'/> <sibling id='3' value='32'/> <sibling id='4' value='10'/> <sibling id='5' value='16'/> <sibling id='6' value='16'/> <sibling id='7' value='16'/> </distances>
    </cell>
    <cell id='5' cpus='20-23' memory='62000000'>
      <distances> <sibling id='0' value='32'/> <sibling id='1' value='32'/> <sibling id='2' value='32'/> <sibling id='3' value='32'/> <sibling id='4' value='16'/> <sibling id='5' value='10'/> <sibling id='6' value='16'/> <sibling id='7' value='16'/> </distances>
    </cell>
    <cell id='6' cpus='24-27' memory='62000000'>
      <distances> <sibling id='0' value='32'/> <sibling id='1' value='32'/> <sibling id='2' value='32'/> <sibling id='3' value='32'/> <sibling id='4' value='16'/> <sibling id='5' value='16'/> <sibling id='6' value='10'/> <sibling id='7' value='16'/> </distances>
    </cell>
    <cell id='7' cpus='28-31' memory='62000000'>
      <distances> <sibling id='0' value='32'/> <sibling id='1' value='32'/> <sibling id='2' value='32'/> <sibling id='3' value='32'/> <sibling id='4' value='16'/> <sibling id='5' value='16'/> <sibling id='6' value='16'/> <sibling id='7' value='10'/> </distances>
    </cell>
  </numa>
</cpu>
<cputune>
  <vcpupin vcpu='0' cpuset='0'/> <vcpupin vcpu='1' cpuset='1'/> <vcpupin vcpu='2' cpuset='2'/> <vcpupin vcpu='3' cpuset='3'/>
  <vcpupin vcpu='4' cpuset='4'/> <vcpupin vcpu='5' cpuset='5'/> <vcpupin vcpu='6' cpuset='6'/> <vcpupin vcpu='7' cpuset='7'/>
  <vcpupin vcpu='8' cpuset='8'/> <vcpupin vcpu='9' cpuset='9'/> <vcpupin vcpu='10' cpuset='10'/> <vcpupin vcpu='11' cpuset='11'/>
  <vcpupin vcpu='12' cpuset='12'/> <vcpupin vcpu='13' cpuset='13'/> <vcpupin vcpu='14' cpuset='14'/> <vcpupin vcpu='15' cpuset='15'/>
  <vcpupin vcpu='16' cpuset='16'/> <vcpupin vcpu='17' cpuset='17'/> <vcpupin vcpu='18' cpuset='18'/> <vcpupin vcpu='19' cpuset='19'/>
  <vcpupin vcpu='20' cpuset='20'/> <vcpupin vcpu='21' cpuset='21'/> <vcpupin vcpu='22' cpuset='22'/> <vcpupin vcpu='23' cpuset='23'/>
  <vcpupin vcpu='24' cpuset='24'/> <vcpupin vcpu='25' cpuset='25'/> <vcpupin vcpu='26' cpuset='26'/> <vcpupin vcpu='27' cpuset='27'/>
  <vcpupin vcpu='28' cpuset='28'/> <vcpupin vcpu='29' cpuset='29'/> <vcpupin vcpu='30' cpuset='30'/> <vcpupin vcpu='31' cpuset='31'/>
</cputune>
<numatune>
  <memnode cellid='0' mode='strict' nodeset='0'/>
  <memnode cellid='1' mode='strict' nodeset='1'/>
  <memnode cellid='2' mode='strict' nodeset='2'/>
  <memnode cellid='3' mode='strict' nodeset='3'/>
  <memnode cellid='4' mode='strict' nodeset='4'/>
  <memnode cellid='5' mode='strict' nodeset='5'/>
  <memnode cellid='6' mode='strict' nodeset='6'/>
  <memnode cellid='7' mode='strict' nodeset='7'/>
</numatune>

With this configuration, virtualized Debian 9 even slightly outperforms the same Debian 9 on bare metal.

As for the iozone and cache=none case: it seems that the problem is with KVM, which stalls iozone on the first run when not all memory pages are populated yet, or something like that. The number of pages is not small on a 512 GB machine. However, letting KVM populate the pages and running iozone again does not bring any performance loss.

On Tue, Sep 18, 2018 at 09:50:44AM +0200, Michal Privoznik wrote:
On 09/17/2018 04:59 PM, Lukas Hejtmanek wrote:
Hello,
so the current domain configuration:
Yes, looking good.
The good news is that the spec benchmark looks promising. The first test, bwaves, finished in 1003 seconds compared to 1700 seconds in the previous, wrong case. So far so good.
Very well, this means that the config above is correct.
The bad news is that iozone is still the same. There might be some misunderstanding.
I have two cases:
1) cache=unsafe. In this case, I can see that the hypervisor is prone to swap. A lot. It usually eats the whole swap partition and kswapd runs at 100% CPU. swappiness, dirty_ratio and company do not improve things at all. However, I believe this is simply the wrong option for scratch disks where one can expect a huge I/O load. Moreover, the hypervisor is left with only a little memory (in my case about 10 GB available), so it does not make sense to use that memory for additional cache/disk buffers.
One thing that just occurred to me - is the qcow2 file fully allocated?
# qemu-img info /var/lib/libvirt/images/fedora.qcow2
..
virtual size: 20G (21474836480 bytes)
disk size: 7.0G
..
This is NOT a fully allocated qcow2.
2) cache=none. In this case, performance is better (only a few percent behind bare metal). However, as soon as the size of the stored data approaches the memory size of the virtual machine, writes stop and iozone eats a whole CPU; it looks like it is searching for more free pages and it gets harder and harder. But I am not sure, I am not skilled in this area.
Hmm. Could it be that SSD doesn't have enough free blocks and thus writes are throttled? Can you fstrim it and see if that helps?
Here you can clearly see that it starts the writes, does the writes, then takes a pause, writes again, and so on, but the pauses get longer and longer: https://pastebin.com/2gfPFgb9 The output runs until the very end of iozone (I cancelled it with Ctrl-C).
It seems that this is not happening on a 2-NUMA node with rotational disks only. It is partly happening on a 2-NUMA node with 2 NVMe SSDs. "Partly" means that there are also pauses in writes, but it finishes; the speed is reduced though. On a 1-NUMA node, with the same test, I can see steady writes from the very beginning to the very end at roughly the same speed.
Maybe it could be related to the fact that the NVMe is a PCI device that is linked to one NUMA node only?
Can be. I don't know qemu internals that much to know if its capable of doing zero copy disk writes.
As for iothreads, I have only 1 disk (the vde) that is exposed to high I/O load, so I believe more I/O threads are not applicable here. If I understand correctly, I cannot assign more iothreads to a single device. And it does not seem to be iothread-related, as the same scenario in a 1-NUMA configuration works OK (I mean that memory penalties can be huge as it does not reflect the real NUMA topology, but disk speed is OK anyway).
Ah, since it's only one disk, iothreads will not help much here. Still worth giving it a shot ;-) Remember, iothreads are for all I/O, not disk I/O only.
Anyway, this is the point where I have to say "I don't know". Sorry. Try contacting qemu guys:
qemu-discuss@nongnu.org qemu-devel@nongnu.org
Michal
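(A speculative footnote to the page-population observation above, not something verified in this thread: depending on the libvirt version, the guest memory can be preallocated when the domain starts instead of being faulted in on first touch, e.g. via a memoryBacking element:)

<!-- sketch: allocate all guest RAM up front; backing the guest with hugepages would be another option -->
<memoryBacking>
  <allocation mode='immediate'/>
</memoryBacking>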
-- Lukáš Hejtmánek Linux Administrator only because Full Time Multitasking Ninja is not an official job title