From: Wim ten Have <wim.ten.have@oracle.com>
This patch series extends guest domain administration, adding support to
advertise node sibling distances when configuring HVM NUMA guests.

NUMA (non-uniform memory access) is a method of configuring a cluster of
nodes within a single multiprocessing system such that each node has its
own local memory yet can share it with the others, improving performance
and the ability of the system to be expanded.
A NUMA system can be illustrated as shown below. Within this 4-node
system, every socket is equipped with its own distinct memory. The whole
typically resembles an SMP (symmetric multiprocessing) system: a
"tightly-coupled", "share everything" system in which multiple processors
work under a single operating system and can access each other's memory
over multiple "Bus Interconnect" paths.
+-----+-----+-----+         +-----+-----+-----+
|  M  | CPU | CPU |         | CPU | CPU |  M  |
|  E  |     |     |         |     |     |  E  |
|  M  +- Socket0 -+         +- Socket3 -+  M  |
|  O  |     |     |         |     |     |  O  |
|  R  | CPU | CPU <---------> CPU | CPU |  R  |
|  Y  |     |     |         |     |     |  Y  |
+-----+--^--+-----+         +-----+--^--+-----+
         |                           |
         |      Bus Interconnect     |
         |                           |
+-----+--v--+-----+         +-----+--v--+-----+
|  M  |     |     |         |     |     |  M  |
|  E  | CPU | CPU <---------> CPU | CPU |  E  |
|  M  |     |     |         |     |     |  M  |
|  O  +- Socket1 -+         +- Socket2 -+  O  |
|  R  |     |     |         |     |     |  R  |
|  Y  | CPU | CPU |         | CPU | CPU |  Y  |
+-----+-----+-----+         +-----+-----+-----+
Contrast this with a flat SMP system (not illustrated): there, as sockets
are added, the bus (data and address path) gets overloaded under high
activity and easily becomes a performance bottleneck. NUMA adds an
intermediate level of memory, shared amongst a few cores per socket as
illustrated above, so that data accesses do not have to travel over a
single bus.
Unfortunately, the way NUMA does this adds its own limitation. This,
as visualized in the illustration above, happens when data stored in
memory associated with Socket2 is accessed by a CPU (core) in Socket0.
The processors use the "Bus Interconnect" to create gateways between the
sockets (nodes) enabling inter-socket access to memory. These "Bus
Interconnect" hops add data access delays when a CPU (core) accesses
memory associated with a remote socket (node).
As for terminology, we refer to sockets as "nodes", where access to each
other's distinct resources, such as memory, makes them "siblings" with a
designated "distance" between them. The specific encoding is described in
the ACPI (Advanced Configuration and Power Interface) specification, in
the chapter on the system's SLIT (System Locality Distance Information
Table). SLIT distances are relative values: local access is normalized to
10, so a remote value of 21, for example, represents roughly 2.1 times the
local access cost.
These patches extend core libvirt's XML description of a virtual machine's
hardware to include NUMA distance information for sibling nodes, which
is then passed to Xen guests via libxl. QEMU recently gained support for
constructing the SLIT with commit 0f203430dd ("numa: Allow setting NUMA
distance for different NUMA nodes"), so these core libvirt extensions can
also help other drivers support this feature.
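For reference, that QEMU commit lets inter-node distances be specified
directly on the QEMU command line, along these lines (a hand-written
illustration, not output produced by this series):

  -numa node,nodeid=0,cpus=0,mem=2048 \
  -numa node,nodeid=1,cpus=1,mem=2048 \
  -numa dist,src=0,dst=1,val=21 -numa dist,src=1,dst=0,val=21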
The XML changes allow describing, per <cell> (node/socket), the
<distances> to its <sibling> node identifiers, and propagate these through
the NUMA domain configuration code, finally adding support to libxl.

[below is an example illustrating a 4 node/socket <cell> setup]
<cpu>
  <numa>
    <cell id='0' cpus='0,4-7' memory='2097152' unit='KiB'>
      <distances>
        <sibling id='0' value='10'/>
        <sibling id='1' value='21'/>
        <sibling id='2' value='31'/>
        <sibling id='3' value='41'/>
      </distances>
    </cell>
    <cell id='1' cpus='1,8-10,12-15' memory='2097152' unit='KiB'>
      <distances>
        <sibling id='0' value='21'/>
        <sibling id='1' value='10'/>
        <sibling id='2' value='21'/>
        <sibling id='3' value='31'/>
      </distances>
    </cell>
    <cell id='2' cpus='2,11' memory='2097152' unit='KiB'>
      <distances>
        <sibling id='0' value='31'/>
        <sibling id='1' value='21'/>
        <sibling id='2' value='10'/>
        <sibling id='3' value='21'/>
      </distances>
    </cell>
    <cell id='3' cpus='3' memory='2097152' unit='KiB'>
      <distances>
        <sibling id='0' value='41'/>
        <sibling id='1' value='31'/>
        <sibling id='2' value='21'/>
        <sibling id='3' value='10'/>
      </distances>
    </cell>
  </numa>
</cpu>
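For a Xen guest, such a description roughly corresponds to a vnuma
specification in the xl domain configuration, along these lines (a
hand-written sketch following the xl.cfg(5) vnuma syntax; the pnode
placement is an assumption, and the exact output of the xenconfig
conversion in patch 3 may differ in detail):

  vnuma = [
      [ "pnode=0", "size=2048", "vcpus=0,4-7",        "vdistances=10,21,31,41" ],
      [ "pnode=1", "size=2048", "vcpus=1,8-10,12-15", "vdistances=21,10,21,31" ],
      [ "pnode=2", "size=2048", "vcpus=2,11",         "vdistances=31,21,10,21" ],
      [ "pnode=3", "size=2048", "vcpus=3",            "vdistances=41,31,21,10" ]
  ]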
Under libxl, if no <distances> are given to describe the SLIT data between
different <cell>s, this patch defaults to a scheme using 10 for local and
21 for any remote node/socket, which is what a guest OS assumes when no
SLIT is specified. While the SLIT itself is optional, libxl requires that
distances be set nonetheless.
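For illustration, that default amounts to the following (a minimal,
generic C sketch; not the actual libxl_conf.c code from this series):

  #include <stddef.h>

  #define LOCAL_DISTANCE  10   /* distance of a node to itself */
  #define REMOTE_DISTANCE 21   /* assumed distance to any other node */

  /* Fill an n x n distance matrix (row-major) with the defaults used
   * when the domain XML provides no <distances> element. */
  static void
  fill_default_distances(unsigned int *dist, size_t n)
  {
      size_t i, j;

      for (i = 0; i < n; i++)
          for (j = 0; j < n; j++)
              dist[i * n + j] = (i == j) ? LOCAL_DISTANCE : REMOTE_DISTANCE;
  }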
On Linux systems the SLIT details can be listed with the help of the
'numactl -H' command. The HVM guest described above would report the
following:
[root@f25 ~]# numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 4 5 6 7
node 0 size: 1988 MB
node 0 free: 1743 MB
node 1 cpus: 1 8 9 10 12 13 14 15
node 1 size: 1946 MB
node 1 free: 1885 MB
node 2 cpus: 2 11
node 2 size: 2011 MB
node 2 free: 1912 MB
node 3 cpus: 3
node 3 size: 2010 MB
node 3 free: 1980 MB
node distances:
node   0   1   2   3
  0:  10  21  31  41
  1:  21  10  21  31
  2:  31  21  10  21
  3:  41  31  21  10
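The per-node distance rows can also be read straight from sysfs inside the
guest, for example (illustrative output matching the table above):

  [root@f25 ~]# cat /sys/devices/system/node/node0/distance
  10 21 31 41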
Wim ten Have (4):
numa: describe siblings distances within cells
libxl: vnuma support
xenconfig: add domxml conversions for xen-xl
xlconfigtest: add tests for numa cell sibling distances
docs/formatdomain.html.in | 70 ++++-
docs/schemas/basictypes.rng | 9 +
docs/schemas/cputypes.rng | 18 ++
src/conf/cpu_conf.c | 2 +-
src/conf/numa_conf.c | 323 +++++++++++++++++++-
src/conf/numa_conf.h | 25 +-
src/libvirt_private.syms | 6 +
src/libxl/libxl_conf.c | 120 ++++++++
src/libxl/libxl_driver.c | 3 +-
src/xenconfig/xen_xl.c | 333 +++++++++++++++++++++
.../test-fullvirt-vnuma-nodistances.cfg | 26 ++
.../test-fullvirt-vnuma-nodistances.xml | 53 ++++
tests/xlconfigdata/test-fullvirt-vnuma.cfg | 26 ++
tests/xlconfigdata/test-fullvirt-vnuma.xml | 81 +++++
tests/xlconfigtest.c | 4 +
15 files changed, 1089 insertions(+), 10 deletions(-)
create mode 100644 tests/xlconfigdata/test-fullvirt-vnuma-nodistances.cfg
create mode 100644 tests/xlconfigdata/test-fullvirt-vnuma-nodistances.xml
create mode 100644 tests/xlconfigdata/test-fullvirt-vnuma.cfg
create mode 100644 tests/xlconfigdata/test-fullvirt-vnuma.xml
--
2.9.5