Hello,
I did some performance measurements with SPEC CPU 2017 in the fp rate variant
(i.e., utilizing all CPU cores). The results look like this:
8-NUMA Hypervisor                        specfp2017 - 124
1-NUMA Hypervisor                        specfp2017 - 103
2-NUMA Hypervisor                        specfp2017 - 120
8-NUMA Virtual (on 8N Hypervisor)        specfp2017 - 92
1-NUMA Virtual (on 1N Hypervisor)        specfp2017 - 95.2
2-NUMA Virtual (on 2N Hypervisor)        specfp2017 - 98 (memory strict)
2-NUMA Virtual (on 2N Hypervisor)        specfp2017 - 98.1 (memory interleave)
2x 1-NUMA Virtual (on 2N Hypervisor)     specfp2017 - 117.2 (sum for both)
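
To put these numbers side by side, the score lost to virtualization in each
layout can be computed directly from the results above. This is only a quick
sketch: the dictionary keys and function names are made up for illustration,
and the 2x 1-NUMA sum is compared against the 2-NUMA hypervisor score because
that is the host both guests ran on.

```python
# Scores copied from the measurements above (SPECfp 2017 rate, higher is better).
HOST = {"8-NUMA": 124.0, "1-NUMA": 103.0, "2-NUMA": 120.0}
GUEST = {
    "8-NUMA": 92.0,
    "1-NUMA": 95.2,
    "2-NUMA (strict)": 98.0,
    "2-NUMA (interleave)": 98.1,
    "2x 1-NUMA (sum)": 117.2,
}
# Which hypervisor score each guest result is compared against.
BASELINE = {
    "8-NUMA": "8-NUMA",
    "1-NUMA": "1-NUMA",
    "2-NUMA (strict)": "2-NUMA",
    "2-NUMA (interleave)": "2-NUMA",
    "2x 1-NUMA (sum)": "2-NUMA",
}


def overhead_percent(guest, host):
    """Return the score lost to virtualization, in percent of the host score."""
    return (1.0 - guest / host) * 100.0


def overheads():
    """Map each guest configuration to its overhead, rounded to one decimal."""
    return {
        name: round(overhead_percent(score, HOST[BASELINE[name]]), 1)
        for name, score in GUEST.items()
    }


if __name__ == "__main__":
    for name, pct in overheads().items():
        print(f"{name:22s} {pct:5.1f} % below the hypervisor score")
```

By this measure the 8-NUMA guest loses roughly a quarter of the hypervisor
score, while the two 1-NUMA guests together lose only about 2 %.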
On Fri, Sep 14, 2018 at 03:40:56PM +0200, Lukas Hejtmanek wrote:
Hello again,
this is what slabtop looks like when iozone is writing slowly:
62476752 62476728 0% 0.10K 1601968 39 6407872K buffer_head
1000678 999168 0% 0.56K 142954 7 571816K radix_tree_node
132184 125911 0% 0.03K 1066 124 4264K kmalloc-32
118496 118224 0% 0.12K 3703 32 14812K kmalloc-node
73206 56467 0% 0.19K 3486 21 13944K dentry
34816 33247 0% 0.12K 1024 34 4096K kernfs_node_cache
34496 29031 0% 0.06K 539 64 2156K kmalloc-64
23283 22707 0% 1.05K 7761 3 31044K ext4_inode_cache
16940 16052 0% 0.57K 2420 7 9680K inode_cache
14464 4124 0% 0.06K 226 64 904K anon_vma_chain
11900 11841 0% 0.14K 425 28 1700K ext4_groupinfo_4k
11312 9861 0% 0.50K 1414 8 5656K kmalloc-512
10692 10066 0% 0.04K 108 99 432K ext4_extent_status
10688 4238 0% 0.25K 668 16 2672K kmalloc-256
8120 2420 0% 0.07K 145 56 580K anon_vma
8040 4563 0% 0.20K 402 20 1608K vm_area_struct
7488 3845 0% 0.12K 234 32 936K kmalloc-96
7456 7061 0% 1.00K 1864 4 7456K kmalloc-1024
7234 7227 0% 4.00K 7234 1 28936K kmalloc-4096
and this is /proc/$PID/stack of iozone eating CPU but not writing data.
[<ffffffffba78151b>] find_get_entry+0x1b/0x100
[<ffffffffba781de0>] pagecache_get_page+0x30/0x2a0
[<ffffffffc06ec12b>] ext4_da_get_block_prep+0x27b/0x440 [ext4]
[<ffffffffba840d8b>] __find_get_block_slow+0x3b/0x150
[<ffffffffba840ebd>] unmap_underlying_metadata+0x1d/0x70
[<ffffffffc06ec960>] ext4_block_write_begin+0x2e0/0x520 [ext4]
[<ffffffffc06ebeb0>] ext4_inode_attach_jinode.part.72+0xa0/0xa0 [ext4]
[<ffffffffc041f9f9>] jbd2__journal_start+0xd9/0x1e0 [jbd2]
[<ffffffffba80511a>] __check_object_size+0xfa/0x1d8
[<ffffffffba946b85>] iov_iter_copy_from_user_atomic+0xa5/0x330
[<ffffffffba780dcb>] generic_perform_write+0xfb/0x1d0
[<ffffffffba7831ca>] __generic_file_write_iter+0x16a/0x1b0
[<ffffffffc06e7220>] ext4_file_write_iter+0x90/0x370 [ext4]
[<ffffffffc06e7190>] ext4_dax_fault+0x140/0x140 [ext4]
[<ffffffffba6aef01>] update_curr+0xe1/0x160
[<ffffffffba808890>] new_sync_write+0xe0/0x130
[<ffffffffba809010>] vfs_write+0xb0/0x190
[<ffffffffba80a452>] SyS_write+0x52/0xc0
[<ffffffffba603b7d>] do_syscall_64+0x8d/0xf0
[<ffffffffbac15c4e>] entry_SYSCALL_64_after_swapgs+0x58/0xc6
[<ffffffffffffffff>] 0xffffffffffffffff
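
For reference, the slabtop listing above can be ranked by cache size with a
few lines of Python. This is only a sketch: it assumes the standard slabtop
column order (OBJS, ACTIVE, USE, OBJ SIZE, SLABS, OBJ/SLAB, CACHE SIZE, NAME),
and the function names are hypothetical.

```python
def parse_slabtop(text):
    """Parse slabtop data rows into (name, cache_size_kib) tuples."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # A data row has 8 columns; CACHE SIZE is a number followed by 'K'.
        if len(fields) != 8:
            continue
        size = fields[6]
        if not size.endswith("K") or not size[:-1].isdigit():
            continue
        rows.append((fields[7], int(size[:-1])))
    return rows


def top_caches(text, n=3):
    """Return the n largest caches as (name, size_in_gib), largest first."""
    rows = sorted(parse_slabtop(text), key=lambda row: row[1], reverse=True)
    return [(name, kib / (1024 * 1024)) for name, kib in rows[:n]]


# Three rows copied from the listing above.
SAMPLE = """\
62476752 62476728 0% 0.10K 1601968 39 6407872K buffer_head
1000678 999168 0% 0.56K 142954 7 571816K radix_tree_node
23283 22707 0% 1.05K 7761 3 31044K ext4_inode_cache
"""

if __name__ == "__main__":
    for name, gib in top_caches(SAMPLE):
        print(f"{name:20s} {gib:6.2f} GiB")
```

In the listing above, buffer_head alone accounts for about 6.1 GiB, far more
than any other cache shown.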
On Fri, Sep 14, 2018 at 03:36:59PM +0200, Lukas Hejtmanek wrote:
> Hello,
>
> OK, I found that the CPU pinning was wrong, so I corrected it to be 1:1. The
> issue with iozone remains the same.
>
> SPEC is running; however, it runs slower than in the 1-NUMA case.
>
> The corrected XML looks as follows:
> <cpu mode='host-passthrough'>
>   <topology sockets='8' cores='4' threads='1'/>
>   <numa>
>     <cell cpus='0-3' memory='62000000' />
>     <cell cpus='4-7' memory='62000000' />
>     <cell cpus='8-11' memory='62000000' />
>     <cell cpus='12-15' memory='62000000' />
>     <cell cpus='16-19' memory='62000000' />
>     <cell cpus='20-23' memory='62000000' />
>     <cell cpus='24-27' memory='62000000' />
>     <cell cpus='28-31' memory='62000000' />
>   </numa>
> </cpu>
> <cputune>
>   <vcpupin vcpu='0' cpuset='0' />
>   <vcpupin vcpu='1' cpuset='1' />
>   <vcpupin vcpu='2' cpuset='2' />
>   <vcpupin vcpu='3' cpuset='3' />
>   <vcpupin vcpu='4' cpuset='4' />
>   <vcpupin vcpu='5' cpuset='5' />
>   <vcpupin vcpu='6' cpuset='6' />
>   <vcpupin vcpu='7' cpuset='7' />
>   <vcpupin vcpu='8' cpuset='8' />
>   <vcpupin vcpu='9' cpuset='9' />
>   <vcpupin vcpu='10' cpuset='10' />
>   <vcpupin vcpu='11' cpuset='11' />
>   <vcpupin vcpu='12' cpuset='12' />
>   <vcpupin vcpu='13' cpuset='13' />
>   <vcpupin vcpu='14' cpuset='14' />
>   <vcpupin vcpu='15' cpuset='15' />
>   <vcpupin vcpu='16' cpuset='16' />
>   <vcpupin vcpu='17' cpuset='17' />
>   <vcpupin vcpu='18' cpuset='18' />
>   <vcpupin vcpu='19' cpuset='19' />
>   <vcpupin vcpu='20' cpuset='20' />
>   <vcpupin vcpu='21' cpuset='21' />
>   <vcpupin vcpu='22' cpuset='22' />
>   <vcpupin vcpu='23' cpuset='23' />
>   <vcpupin vcpu='24' cpuset='24' />
>   <vcpupin vcpu='25' cpuset='25' />
>   <vcpupin vcpu='26' cpuset='26' />
>   <vcpupin vcpu='27' cpuset='27' />
>   <vcpupin vcpu='28' cpuset='28' />
>   <vcpupin vcpu='29' cpuset='29' />
>   <vcpupin vcpu='30' cpuset='30' />
>   <vcpupin vcpu='31' cpuset='31' />
> </cputune>
> <numatune>
>   <memory mode='strict' nodeset='0-7'/>
> </numatune>
>
> In this case, the first part took more than 1700 seconds. 1-NUMA config
> finishes in 1646 seconds.
>
> Hypervisor with 1-NUMA config finishes in 1470 seconds, the hypervisor with
> 8-NUMA config finishes in 900 seconds.
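
The difference between the original and the corrected pinning can be checked
against the host topology from the hypervisor's lscpu output quoted below. The
following is a rough sketch under stated assumptions: the helper names are
invented, host node n is taken to own CPUs 4n..4n+3 plus the SMT siblings
32+4n..32+4n+3 (as lscpu reports), and guest cell n is taken to hold vCPUs
4n..4n+3 (as in the <numa> element). It only compares vCPU placement with the
host node of the same index and says nothing about where each cell's memory
is actually allocated.

```python
def host_node(cpu):
    """Host NUMA node of a host CPU, per the hypervisor's lscpu output."""
    return (cpu % 32) // 4


def guest_cell(vcpu):
    """Guest NUMA cell of a vCPU, per the <numa> element (4 vCPUs per cell)."""
    return vcpu // 4


def misplaced(pinning):
    """Return the vCPUs pinned to a host node other than their guest cell index."""
    return [v for v, cpu in sorted(pinning.items())
            if host_node(cpu) != guest_cell(v)]


# Original pinning: vCPUs 0-15 on even host CPUs, vCPUs 16-31 on odd host CPUs.
OLD = {v: 2 * v if v < 16 else 2 * (v - 16) + 1 for v in range(32)}
# Corrected pinning: vCPU n on host CPU n.
NEW = {v: v for v in range(32)}

if __name__ == "__main__":
    print("original pinning, misplaced vCPUs: ", len(misplaced(OLD)))
    print("corrected pinning, misplaced vCPUs:", len(misplaced(NEW)))
```

Under these assumptions, the original pinning spreads the four vCPUs of most
guest cells across two host nodes, while the 1:1 pinning keeps every guest
cell on a single host node.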
>
> On Fri, Sep 14, 2018 at 02:06:26PM +0200, Lukas Hejtmanek wrote:
> > Hello,
> >
> > I have a cluster with AMD EPYC 7351 CPUs, two CPUs per node, in the
> > performance 8-NUMA configuration:
> >
> > This is from the hypervisor:
> > [root@hde10 ~]# lscpu
> > Architecture: x86_64
> > CPU op-mode(s): 32-bit, 64-bit
> > Byte Order: Little Endian
> > CPU(s): 64
> > On-line CPU(s) list: 0-63
> > Thread(s) per core: 2
> > Core(s) per socket: 16
> > Socket(s): 2
> > NUMA node(s): 8
> > Vendor ID: AuthenticAMD
> > CPU family: 23
> > Model: 1
> > Model name: AMD EPYC 7351 16-Core Processor
> > Stepping: 2
> > CPU MHz: 1800.000
> > CPU max MHz: 2400.0000
> > CPU min MHz: 1200.0000
> > BogoMIPS: 4800.05
> > Virtualization: AMD-V
> > L1d cache: 32K
> > L1i cache: 64K
> > L2 cache: 512K
> > L3 cache: 8192K
> > NUMA node0 CPU(s): 0-3,32-35
> > NUMA node1 CPU(s): 4-7,36-39
> > NUMA node2 CPU(s): 8-11,40-43
> > NUMA node3 CPU(s): 12-15,44-47
> > NUMA node4 CPU(s): 16-19,48-51
> > NUMA node5 CPU(s): 20-23,52-55
> > NUMA node6 CPU(s): 24-27,56-59
> > NUMA node7 CPU(s): 28-31,60-63
> >
> > I'm running one big virtual machine on this hypervisor, using almost the
> > whole memory and all physical CPUs.
> >
> > This is what I'm seeing inside:
> >
> > root@zenon10:~# lscpu
> > Architecture: x86_64
> > CPU op-mode(s): 32-bit, 64-bit
> > Byte Order: Little Endian
> > CPU(s): 32
> > On-line CPU(s) list: 0-31
> > Thread(s) per core: 1
> > Core(s) per socket: 4
> > Socket(s): 8
> > NUMA node(s): 8
> > Vendor ID: AuthenticAMD
> > CPU family: 23
> > Model: 1
> > Model name: AMD EPYC 7351 16-Core Processor
> > Stepping: 2
> > CPU MHz: 2400.000
> > BogoMIPS: 4800.00
> > Virtualization: AMD-V
> > Hypervisor vendor: KVM
> > Virtualization type: full
> > L1d cache: 64K
> > L1i cache: 64K
> > L2 cache: 512K
> > NUMA node0 CPU(s): 0-3
> > NUMA node1 CPU(s): 4-7
> > NUMA node2 CPU(s): 8-11
> > NUMA node3 CPU(s): 12-15
> > NUMA node4 CPU(s): 16-19
> > NUMA node5 CPU(s): 20-23
> > NUMA node6 CPU(s): 24-27
> > NUMA node7 CPU(s): 28-31
> >
> > This is the virtual node configuration (I tried different numatune settings,
> > but the result was still the same):
> >
> > <domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
> > <name>one-55782</name>
> > <vcpu><![CDATA[32]]></vcpu>
> > <cputune>
> > <shares>32768</shares>
> > </cputune>
> > <memory>507904000</memory>
> > <os>
> > <type arch='x86_64'>hvm</type>
> > </os>
> > <devices>
> > <emulator><![CDATA[/usr/bin/kvm]]></emulator>
> > <disk type='file' device='disk'>
> > <source file='/opt/opennebula/var/datastores/108/55782/disk.0'/>
> > <target dev='vda'/>
> > <driver name='qemu' type='qcow2' cache='unsafe'/>
> > </disk>
> > <disk type='file' device='disk'>
> > <source file='/opt/opennebula/var/datastores/108/55782/disk.1'/>
> > <target dev='vdc'/>
> > <driver name='qemu' type='raw' cache='unsafe'/>
> > </disk>
> > <disk type='file' device='disk'>
> > <source file='/opt/opennebula/var/datastores/108/55782/disk.2'/>
> > <target dev='vdd'/>
> > <driver name='qemu' type='raw' cache='unsafe'/>
> > </disk>
> > <disk type='file' device='disk'>
> > <source file='/opt/opennebula/var/datastores/108/55782/disk.3'/>
> > <target dev='vde'/>
> > <driver name='qemu' type='raw' cache='unsafe'/>
> > </disk>
> > <disk type='file' device='cdrom'>
> > <source file='/opt/opennebula/var/datastores/108/55782/disk.4'/>
> > <target dev='vdb'/>
> > <readonly/>
> > <driver name='qemu' type='raw'/>
> > </disk>
> > <interface type='bridge'>
> > <source bridge='br0'/>
> > <mac address='02:00:93:fb:3b:78'/>
> > <target dev='one-55782-0'/>
> > <model type='virtio'/>
> > <filterref filter='no-arp-mac-spoofing'>
> > <parameter name='IP' value='147.251.59.120'/>
> > </filterref>
> > </interface>
> > </devices>
> > <features>
> > <pae/>
> > <acpi/>
> > </features>
> > <!-- RAW data follows: -->
> > <cpu mode='host-passthrough'>
> >   <topology sockets='8' cores='4' threads='1'/>
> >   <numa>
> >     <cell cpus='0-3' memory='62000000' />
> >     <cell cpus='4-7' memory='62000000' />
> >     <cell cpus='8-11' memory='62000000' />
> >     <cell cpus='12-15' memory='62000000' />
> >     <cell cpus='16-19' memory='62000000' />
> >     <cell cpus='20-23' memory='62000000' />
> >     <cell cpus='24-27' memory='62000000' />
> >     <cell cpus='28-31' memory='62000000' />
> >   </numa>
> > </cpu>
> > <cputune>
> >   <vcpupin vcpu='0' cpuset='0' />
> >   <vcpupin vcpu='1' cpuset='2' />
> >   <vcpupin vcpu='2' cpuset='4' />
> >   <vcpupin vcpu='3' cpuset='6' />
> >   <vcpupin vcpu='4' cpuset='8' />
> >   <vcpupin vcpu='5' cpuset='10' />
> >   <vcpupin vcpu='6' cpuset='12' />
> >   <vcpupin vcpu='7' cpuset='14' />
> >   <vcpupin vcpu='8' cpuset='16' />
> >   <vcpupin vcpu='9' cpuset='18' />
> >   <vcpupin vcpu='10' cpuset='20' />
> >   <vcpupin vcpu='11' cpuset='22' />
> >   <vcpupin vcpu='12' cpuset='24' />
> >   <vcpupin vcpu='13' cpuset='26' />
> >   <vcpupin vcpu='14' cpuset='28' />
> >   <vcpupin vcpu='15' cpuset='30' />
> >   <vcpupin vcpu='16' cpuset='1' />
> >   <vcpupin vcpu='17' cpuset='3' />
> >   <vcpupin vcpu='18' cpuset='5' />
> >   <vcpupin vcpu='19' cpuset='7' />
> >   <vcpupin vcpu='20' cpuset='9' />
> >   <vcpupin vcpu='21' cpuset='11' />
> >   <vcpupin vcpu='22' cpuset='13' />
> >   <vcpupin vcpu='23' cpuset='15' />
> >   <vcpupin vcpu='24' cpuset='17' />
> >   <vcpupin vcpu='25' cpuset='19' />
> >   <vcpupin vcpu='26' cpuset='21' />
> >   <vcpupin vcpu='27' cpuset='23' />
> >   <vcpupin vcpu='28' cpuset='25' />
> >   <vcpupin vcpu='29' cpuset='27' />
> >   <vcpupin vcpu='30' cpuset='29' />
> >   <vcpupin vcpu='31' cpuset='31' />
> > </cputune>
> > <numatune>
> >   <memory mode='preferred' nodeset='0'/>
> > </numatune>)
> > <devices>
> >   <serial type='pty'><target port='0'/></serial>
> >   <console type='pty'><target type='serial' port='0'/></console>
> >   <channel type='pty'><target type='virtio' name='org.qemu.guest_agent.0'/></channel>
> > </devices>
> > <devices>
> >   <hostdev mode='subsystem' type='pci' managed='yes'>
> >     <source><address domain='0x0' bus='0x11' slot='0x0' function='0x1'/></source>
> >   </hostdev>
> > </devices>
> >
> > <devices>
> >   <controller type='pci' index='1' model='pci-bridge'/>
> >   <controller type='pci' index='2' model='pci-bridge'/>
> >   <controller type='pci' index='3' model='pci-bridge'/>
> >   <controller type='pci' index='4' model='pci-bridge'/>
> >   <controller type='pci' index='5' model='pci-bridge'/>
> > </devices>
> > <metadata>
> > <system_datastore><![CDATA[/opt/opennebula/var/datastores/108/55782]]></system_datastore>
> > </metadata>
> > </domain>
> >
> > If I run e.g., spec2017 on the virtual, I can see:
> >
> >   PID USER PR NI   VIRT    RES  SHR S %CPU %MEM     TIME+ COMMAND
> >  1350 root 20  0 843136 830068 2524 R 78.1  0.2 513:16.16 bwaves_r_base.m
> >  2456 root 20  0 804608 791264 2524 R 76.6  0.2 491:39.92 bwaves_r_base.m
> >  4631 root 20  0 843136 829892 2344 R 75.8  0.2 450:16.04 bwaves_r_base.m
> >  6441 root 20  0 802580 790212 2532 R 75.0  0.2 120:37.54 bwaves_r_base.m
> >  7991 root 20  0 784676 772092 2576 R 75.0  0.2 387:15.39 bwaves_r_base.m
> >  8142 root 20  0 843136 830044 2496 R 75.0  0.2 384:39.02 bwaves_r_base.m
> >  8234 root 20  0 843136 830064 2524 R 75.0  0.2  99:04.48 bwaves_r_base.m
> >  8578 root 20  0 749240 736604 2468 R 73.4  0.2 375:45.66 bwaves_r_base.m
> >  9974 root 20  0 784676 771984 2468 R 73.4  0.2 348:01.36 bwaves_r_base.m
> > 10396 root 20  0 802580 790264 2576 R 73.4  0.2 340:08.40 bwaves_r_base.m
> > 12932 root 20  0 843136 830024 2480 R 73.4  0.2 288:39.76 bwaves_r_base.m
> > 13113 root 20  0 784676 771864 2348 R 71.9  0.2 284:47.34 bwaves_r_base.m
> > 13518 root 20  0 784676 762816 2540 R 71.9  0.2 276:31.58 bwaves_r_base.m
> > 14443 root 20  0 784676 771984 2468 R 71.9  0.2 260:01.82 bwaves_r_base.m
> > 12791 root 20  0 784676 772060 2544 R 70.3  0.2 291:43.96 bwaves_r_base.m
> > 10544 root 20  0 843136 830068 2520 R 68.8  0.2 336:47.43 bwaves_r_base.m
> > 15464 root 20  0 784676 762880 2608 R 60.9  0.2 239:19.14 bwaves_r_base.m
> > 15487 root 20  0 784676 772048 2532 R 60.2  0.2 238:37.07 bwaves_r_base.m
> > 16824 root 20  0 784676 772120 2604 R 55.5  0.2 212:10.92 bwaves_r_base.m
> > 17255 root 20  0 843136 830012 2468 R 54.7  0.2 203:22.89 bwaves_r_base.m
> > 17962 root 20  0 784676 772004 2488 R 54.7  0.2 188:26.07 bwaves_r_base.m
> > 17505 root 20  0 843136 830068 2520 R 53.1  0.2 198:04.25 bwaves_r_base.m
> > 27767 root 20  0 784676 771860 2344 R 52.3  0.2 592:25.95 bwaves_r_base.m
> > 24458 root 20  0 843136 829888 2344 R 50.8  0.2 658:23.70 bwaves_r_base.m
> > 30746 root 20  0 747376 735160 2604 R 43.0  0.2 556:47.67 bwaves_r_base.m
> >
> > The CPU time (TIME+) should be roughly the same for all processes, but there
> > are huge differences.
> >
> > This is what I see on the hypervisor:
> > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> > 18201 oneadmin 20 0 474.0g 473.3g 1732 S 2459 94.0 33332:54 kvm
> > 369 root 20 0 0 0 0 R 100.0 0.0 768:12.85 kswapd1
> > 368 root 20 0 0 0 0 R 94.1 0.0 869:05.61 kswapd0
> >
> > i.e., each kswapd thread is eating a whole CPU. Swap is turned off.
> >
> > [root@hde10 ~]# free
> >               total        used        free      shared  buff/cache   available
> > Mem:      528151432   503432580     1214048       34740    23504804    21907800
> > Swap:             0           0           0
> >
> > Hypervisor is
> > [root@hde10 ~]# cat /etc/redhat-release
> > CentOS Linux release 7.5.1804 (Core)
> >
> > qemu-kvm-1.5.3-156.el7_5.5.x86_64
> >
> > The virtual machine runs Debian 9.
> >
> > Moreover, I'm using this type of disk for the virtual machines:
> > <disk type='file' device='disk'>
> > <source file='/opt/opennebula/var/datastores/108/55782/disk.3'/>
> > <target dev='vde'/>
> > <driver name='qemu' type='raw' cache='unsafe'/>
> > </disk>
> >
> > If I keep cache='unsafe' and run an iozone test on really big files (e.g.,
> > 8x 100GB), I can see huge cache pressure on the hypervisor: all 8 kswapd
> > threads are running at 100 % and slowing things down. The disk under the
> > datastore is an Intel 4500 NVMe SSD.
> >
> > If I set cache='none', the kswapd threads are idle and disk writes are pretty
> > fast; however, with the 8-NUMA configuration, writes slow down to less than
> > 10MB/s as soon as the size of the written data is roughly the same as the
> > memory size in the virtual node. iozone has 100 % CPU usage thereafter, and
> > it seems that it is traversing page lists. If I do the same with the 1-NUMA
> > configuration, everything is OK except for a performance penalty of about
> > 25 %.
> >
> > --
> > Lukáš Hejtmánek
> >
> > Linux Administrator only because
> > Full Time Multitasking Ninja
> > is not an official job title
>
> --
> Lukáš Hejtmánek
>
> Linux Administrator only because
> Full Time Multitasking Ninja
> is not an official job title
--
Lukáš Hejtmánek
Linux Administrator only because
Full Time Multitasking Ninja
is not an official job title