From: Wim ten Have <wim.ten.have(a)oracle.com>
This patch extends guest domain administration by adding a feature that
creates a guest with a NUMA layout, also referred to as vNUMA (Virtual
NUMA).
NUMA (Non-Uniform Memory Access) is a method of configuring a cluster of
nodes within a single multiprocessing system such that each node shares
its processor local memory with other nodes, improving performance and
the ability of the system to be expanded.
The illustration below shows a typical 4-node NUMA system. Within this
system, each socket is equipped with its own distinct memory and some
also with I/O. Access to memory or I/O on remote nodes is only possible
through the "Interconnect".
    +-------------+-------+        +-------+-------------+
    |NODE0|       |       |        |       |       |NODE3|
    |     | CPU00 | CPU03 |        | CPU12 | CPU15 |     |
    |     |       |       |        |       |       |     |
    | Mem +--- Socket0 ---<-------->--- Socket3 ---+ Mem |
    |     |       |       |        |       |       |     |
    +-----+ CPU01 | CPU02 |        | CPU13 | CPU14 |     |
    | I/O |       |       |        |       |       |     |
    +-----+-------^-------+        +-------^-------+-----+
                  |                        |
                  |      Interconnect      |
                  |                        |
    +-------------v-------+        +-------v-------------+
    |NODE1|       |       |        |       |       |NODE2|
    |     | CPU04 | CPU07 |        | CPU08 | CPU11 |     |
    |     |       |       |        |       |       |     |
    | Mem +--- Socket1 ---<-------->--- Socket2 ---+ Mem |
    |     |       |       |        |       |       |     |
    +-----+ CPU05 | CPU06 |        | CPU09 | CPU10 |     |
    | I/O |       |       |        |       |       |     |
    +-----+-------+-------+        +-------+-------+-----+
Unfortunately, NUMA architectures have some drawbacks. For example,
when data is stored in memory associated with Socket2 but is accessed
by a CPU in Socket0, that CPU uses the interconnect to access the
memory associated with Socket2. These interconnect hops add data access
delays. Some high performance software takes NUMA architecture into
account by carefully placing data in memory and pinning the processes
most likely to access that data to CPUs with the shortest access times.
Similarly, such software can pin its I/O processes to CPUs with the
shortest access times to I/O devices. When such software is run within
a guest VM, constructing the VM such that its virtual NUMA topology
mirrors the physical NUMA topology preserves the application software's
performance.
This patch series adds a new libvirt domain element named <vnuma> that
allows for dynamic 'host' or 'node' partitioning of a guest: libvirt
inspects the host capabilities and renders a best-fit guest XML design
whose vNUMA topology matches the host.
  <domain>
    ..
    <vnuma mode='host|node'
           distribution='contiguous|siblings|round-robin|interleave'>
      <memory unit='KiB'>524288</memory>
      <partition nodeset="1-4,^3" cells="8"/>
    </vnuma>
    ..
  </domain>
The content of this <vnuma> element causes libvirt to dynamically
partition the guest domain XML into a 'host' or 'node' NUMA model.
Under <vnuma mode='host' ... > the guest domain is automatically
partitioned according to the "host" capabilities.
Under <vnuma mode='node' ... > the guest domain is partitioned according
to the nodeset and cells under the vnuma partition subelement.
The optional <vnuma> attribute distribution='type' indicates how the
guest cpus are distributed over the guest numa cells. It can have the
following values:
- 'contiguous': the cpus enumerate sequentially over the defined numa
cells.
- 'siblings': the cpus are distributed over the numa cells matching the
host CPU SMT model.
- 'round-robin': the cpus are distributed over the numa cells matching
the host CPU topology.
- 'interleave': the cpus are interleaved one at a time over the numa
cells.
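To make the difference between two of these values concrete, here is a minimal Python sketch of 'contiguous' and 'interleave' as described above. It is not the code from this series: the function name distribute_vcpus is made up for illustration, and the handling of a vCPU count that does not divide evenly over the cells is an assumption. 'siblings' and 'round-robin' depend on the host topology and are left out.

```python
def distribute_vcpus(n_vcpus, n_cells, distribution="contiguous"):
    """Return one list of vCPU ids per guest NUMA cell (illustrative only)."""
    if n_cells <= 0:
        raise ValueError("n_cells must be positive")
    cells = [[] for _ in range(n_cells)]
    if distribution == "contiguous":
        # Consecutive ids fill one cell before moving to the next.
        # Assumption: any remainder is absorbed by the lowest-numbered cells.
        base, extra = divmod(n_vcpus, n_cells)
        cpu = 0
        for cell in range(n_cells):
            count = base + (1 if cell < extra else 0)
            cells[cell] = list(range(cpu, cpu + count))
            cpu += count
    elif distribution == "interleave":
        # One vCPU at a time is handed to each cell in turn.
        for cpu in range(n_vcpus):
            cells[cpu % n_cells].append(cpu)
    else:
        raise ValueError("unsupported distribution in this sketch: " + distribution)
    return cells


if __name__ == "__main__":
    print(distribute_vcpus(8, 4, "contiguous"))
    print(distribute_vcpus(8, 4, "interleave"))
```

With 8 vCPUs over 4 cells, 'contiguous' places cpus 0-1 in cell 0 and 2-3 in cell 1, while 'interleave' places cpus 0 and 4 in cell 0 and cpus 1 and 5 in cell 1.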
The optional subelement <memory> specifies the memory size reserved
for the guest to dimension its <numa> <cell id> size. If no memory is
specified, the <vnuma> <memory> setting is taken from the guest's
total memory, that is, the <domain> <memory> setting.
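The fallback rule above can be sketched in a few lines of Python. The function name cell_memory_kib is hypothetical, and the even split across cells is an assumption made for illustration; this letter does not state how the memory is divided among the cells.

```python
def cell_memory_kib(n_cells, domain_memory_kib, vnuma_memory_kib=None):
    """Return the memory size (KiB) given to each guest NUMA cell."""
    if n_cells <= 0:
        raise ValueError("n_cells must be positive")
    # Use <vnuma> <memory> when present, else fall back to <domain> <memory>.
    if vnuma_memory_kib is not None:
        total = vnuma_memory_kib
    else:
        total = domain_memory_kib
    # Assumption: an even split; any remainder is simply dropped here.
    return [total // n_cells] * n_cells


if __name__ == "__main__":
    # Values from the XML example: 524288 KiB over 8 cells.
    print(cell_memory_kib(8, 1048576, 524288))
```

Under that assumption, the 524288 KiB from the XML example spread over 8 cells gives 65536 KiB per cell.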
The optional subelement <partition> is only active when <vnuma mode='node'>
is in effect and allows for defining the "nodeset" and "cells" to target
for the guest domain. For example, the "nodeset" attribute can limit the
host NUMA nodes assigned to the guest with the help of NUMA node tuning
(<numatune>). Alternatively, the "cells" attribute can define the number
of vNUMA cells to render for the guest.
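As a reading aid for the nodeset="1-4,^3" value in the XML example, here is a small Python sketch that expands such a string into node ids. It follows the usual libvirt convention of comma-separated ids and ranges, with a leading '^' excluding a node. The function name parse_nodeset is hypothetical, and applying all exclusions after all inclusions is a simplification, not a description of libvirt's parser.

```python
def parse_nodeset(spec):
    """Expand a nodeset string such as "1-4,^3" into a sorted list of node ids."""
    included, excluded = set(), set()
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        target = included
        if item.startswith("^"):
            # A leading '^' marks a node (or range) to exclude.
            target = excluded
            item = item[1:]
        if "-" in item:
            first, last = item.split("-", 1)
            target.update(range(int(first), int(last) + 1))
        else:
            target.add(int(item))
    # Simplification: exclusions are applied after all inclusions.
    return sorted(included - excluded)


if __name__ == "__main__":
    print(parse_nodeset("1-4,^3"))
```

For the example value this yields host nodes 1, 2 and 4.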
We're planning a 'virsh vnuma' command to convert existing guest domains
to one of these vNUMA models.
Wim ten Have (4):
  XML definitions for guest vNUMA and parsing routines
  qemu: driver changes adding vNUMA vCPU hotplug support
  qemu: driver changes adding vNUMA memory hotplug support
  tests: add various tests to exercise vNUMA host partitioning
docs/formatdomain.html.in | 94 ++++
docs/schemas/domaincommon.rng | 65 +++
src/conf/domain_conf.c | 482 +++++++++++++++++-
src/conf/domain_conf.h | 2 +
src/conf/numa_conf.c | 241 ++++++++-
src/conf/numa_conf.h | 58 ++-
src/libvirt_private.syms | 8 +
src/qemu/qemu_driver.c | 65 ++-
src/qemu/qemu_hotplug.c | 95 +++-
.../cpu-host-passthrough-nonuma.args | 29 ++
.../cpu-host-passthrough-nonuma.xml | 19 +
.../cpu-host-passthrough-numa-contiguous.args | 37 ++
.../cpu-host-passthrough-numa-contiguous.xml | 20 +
.../cpu-host-passthrough-numa-interleave.args | 41 ++
.../cpu-host-passthrough-numa-interleave.xml | 19 +
...host-passthrough-numa-node-contiguous.args | 53 ++
...-host-passthrough-numa-node-contiguous.xml | 21 +
...host-passthrough-numa-node-interleave.args | 41 ++
...-host-passthrough-numa-node-interleave.xml | 22 +
...ost-passthrough-numa-node-round-robin.args | 125 +++++
...host-passthrough-numa-node-round-robin.xml | 21 +
...u-host-passthrough-numa-node-siblings.args | 32 ++
...pu-host-passthrough-numa-node-siblings.xml | 23 +
...cpu-host-passthrough-numa-round-robin.args | 37 ++
.../cpu-host-passthrough-numa-round-robin.xml | 22 +
.../cpu-host-passthrough-numa-siblings.args | 37 ++
.../cpu-host-passthrough-numa-siblings.xml | 20 +
.../cpu-host-passthrough-numa.args | 37 ++
.../cpu-host-passthrough-numa.xml | 20 +
tests/qemuxml2argvtest.c | 10 +
30 files changed, 1765 insertions(+), 31 deletions(-)
create mode 100644 tests/qemuxml2argvdata/cpu-host-passthrough-nonuma.args
create mode 100644 tests/qemuxml2argvdata/cpu-host-passthrough-nonuma.xml
create mode 100644 tests/qemuxml2argvdata/cpu-host-passthrough-numa-contiguous.args
create mode 100644 tests/qemuxml2argvdata/cpu-host-passthrough-numa-contiguous.xml
create mode 100644 tests/qemuxml2argvdata/cpu-host-passthrough-numa-interleave.args
create mode 100644 tests/qemuxml2argvdata/cpu-host-passthrough-numa-interleave.xml
create mode 100644 tests/qemuxml2argvdata/cpu-host-passthrough-numa-node-contiguous.args
create mode 100644 tests/qemuxml2argvdata/cpu-host-passthrough-numa-node-contiguous.xml
create mode 100644 tests/qemuxml2argvdata/cpu-host-passthrough-numa-node-interleave.args
create mode 100644 tests/qemuxml2argvdata/cpu-host-passthrough-numa-node-interleave.xml
create mode 100644 tests/qemuxml2argvdata/cpu-host-passthrough-numa-node-round-robin.args
create mode 100644 tests/qemuxml2argvdata/cpu-host-passthrough-numa-node-round-robin.xml
create mode 100644 tests/qemuxml2argvdata/cpu-host-passthrough-numa-node-siblings.args
create mode 100644 tests/qemuxml2argvdata/cpu-host-passthrough-numa-node-siblings.xml
create mode 100644 tests/qemuxml2argvdata/cpu-host-passthrough-numa-round-robin.args
create mode 100644 tests/qemuxml2argvdata/cpu-host-passthrough-numa-round-robin.xml
create mode 100644 tests/qemuxml2argvdata/cpu-host-passthrough-numa-siblings.args
create mode 100644 tests/qemuxml2argvdata/cpu-host-passthrough-numa-siblings.xml
create mode 100644 tests/qemuxml2argvdata/cpu-host-passthrough-numa.args
create mode 100644 tests/qemuxml2argvdata/cpu-host-passthrough-numa.xml
--
2.21.0