[libvirt] [PATCH v3 0/4] numa: describe sibling nodes distances

From: Wim ten Have <wim.ten.have@oracle.com> This patch extents guest domain administration adding support to advertise node sibling distances when configuring HVM numa guests. NUMA (non-uniform memory access), a method of configuring a cluster of nodes within a single multiprocessing system such that it shares processor local memory amongst others improving performance and the ability of the system to be expanded. A NUMA system could be illustrated as shown below. Within this 4-node system, every socket is equipped with its own distinct memory. The whole typically resembles a SMP (symmetric multiprocessing) system being a "tightly-coupled," "share everything" system in which multiple processors are working under a single operating system and can access each others' memory over multiple "Bus Interconnect" paths. +-----+-----+-----+ +-----+-----+-----+ | M | CPU | CPU | | CPU | CPU | M | | E | | | | | | E | | M +- Socket0 -+ +- Socket3 -+ M | | O | | | | | | O | | R | CPU | CPU <---------> CPU | CPU | R | | Y | | | | | | Y | +-----+--^--+-----+ +-----+--^--+-----+ | | | Bus Interconnect | | | +-----+--v--+-----+ +-----+--v--+-----+ | M | | | | | | M | | E | CPU | CPU <---------> CPU | CPU | E | | M | | | | | | M | | O +- Socket1 -+ +- Socket2 -+ O | | R | | | | | | R | | Y | CPU | CPU | | CPU | CPU | Y | +-----+-----+-----+ +-----+-----+-----+ In contrast there is the limitation of a flat SMP system, not illustrated. Here, as sockets are added, the bus (data and address path), under high activity, gets overloaded and easily becomes a performance bottleneck. NUMA adds an intermediate level of memory shared amongst a few cores per socket as illustrated above, so that data accesses do not have to travel over a single bus. Unfortunately the way NUMA does this adds its own limitations. This, as visualized in the illustration above, happens when data is stored in memory associated with Socket2 and is accessed by a CPU (core) in Socket0. The processors use the "Bus Interconnect" to create gateways between the sockets (nodes) enabling inter-socket access to memory. These "Bus Interconnect" hops add data access delays when a CPU (core) accesses memory associated with a remote socket (node). For terminology we refer to sockets as "nodes" where access to each others' distinct resources such as memory make them "siblings" with a designated "distance" between them. A specific design is described under the ACPI (Advanced Configuration and Power Interface Specification) within the chapter explaining the system's SLIT (System Locality Distance Information Table). These patches extend core libvirt's XML description of a virtual machine's hardware to include NUMA distance information for sibling nodes, which is then passed to Xen guests via libxl. Recently qemu landed support for constructing the SLIT since commit 0f203430dd ("numa: Allow setting NUMA distance for different NUMA nodes"), hence these core libvirt extensions can also help other drivers in supporting this feature. The XML changes made allow to describe the <cell> (or node/sockets) <distances> amongst <sibling> node identifiers and propagate these towards the numa domain functionality finally adding support to libxl. [below is an example illustrating a 4 node/socket <cell> setup] <cpu> <numa> <cell id='0' cpus='0,4-7' memory='2097152' unit='KiB'> <distances> <sibling id='0' value='10'/> <sibling id='1' value='21'/> <sibling id='2' value='31'/> <sibling id='3' value='41'/> </distances> </cell> <cell id='1' cpus='1,8-10,12-15' memory='2097152' unit='KiB'> <distances> <sibling id='0' value='21'/> <sibling id='1' value='10'/> <sibling id='2' value='21'/> <sibling id='3' value='31'/> </distances> </cell> <cell id='2' cpus='2,11' memory='2097152' unit='KiB'> <distances> <sibling id='0' value='31'/> <sibling id='1' value='21'/> <sibling id='2' value='10'/> <sibling id='3' value='21'/> </distances> </cell> <cell id='3' cpus='3' memory='2097152' unit='KiB'> <distances> <sibling id='0' value='41'/> <sibling id='1' value='31'/> <sibling id='2' value='21'/> <sibling id='3' value='10'/> </distances> </cell> </numa> </cpu> By default on libxl, if no <distances> are given to describe the SLIT data between different <cell>s, this patch will default to a scheme using 10 for local and 21 for any remote node/socket, which is the assumption of guest OS when no SLIT is specified. While SLIT is optional, libxl requires that distances are set nonetheless. On Linux systems the SLIT detail can be listed with help of the 'numactl -H' command. An above HVM guest as described would on such prompt with below output. [root@f25 ~]# numactl -H available: 4 nodes (0-3) node 0 cpus: 0 4 5 6 7 node 0 size: 1988 MB node 0 free: 1743 MB node 1 cpus: 1 8 9 10 12 13 14 15 node 1 size: 1946 MB node 1 free: 1885 MB node 2 cpus: 2 11 node 2 size: 2011 MB node 2 free: 1912 MB node 3 cpus: 3 node 3 size: 2010 MB node 3 free: 1980 MB node distances: node 0 1 2 3 0: 10 21 31 41 1: 21 10 21 31 2: 31 21 10 21 3: 41 31 21 10 Wim ten Have (4): numa: describe siblings distances within cells libxl: vnuma support xenconfig: add domxml conversions for xen-xl xlconfigtest: add tests for numa cell sibling distances docs/formatdomain.html.in | 70 ++++- docs/schemas/basictypes.rng | 9 + docs/schemas/cputypes.rng | 18 ++ src/conf/cpu_conf.c | 2 +- src/conf/numa_conf.c | 323 +++++++++++++++++++- src/conf/numa_conf.h | 25 +- src/libvirt_private.syms | 6 + src/libxl/libxl_conf.c | 120 ++++++++ src/libxl/libxl_driver.c | 3 +- src/xenconfig/xen_xl.c | 333 +++++++++++++++++++++ .../test-fullvirt-vnuma-nodistances.cfg | 26 ++ .../test-fullvirt-vnuma-nodistances.xml | 53 ++++ tests/xlconfigdata/test-fullvirt-vnuma.cfg | 26 ++ tests/xlconfigdata/test-fullvirt-vnuma.xml | 81 +++++ tests/xlconfigtest.c | 4 + 15 files changed, 1089 insertions(+), 10 deletions(-) create mode 100644 tests/xlconfigdata/test-fullvirt-vnuma-nodistances.cfg create mode 100644 tests/xlconfigdata/test-fullvirt-vnuma-nodistances.xml create mode 100644 tests/xlconfigdata/test-fullvirt-vnuma.cfg create mode 100644 tests/xlconfigdata/test-fullvirt-vnuma.xml -- 2.9.5

From: Wim ten Have <wim.ten.have@oracle.com> Add libvirtd NUMA cell domain administration functionality to describe underlying cell id sibling distances in full fashion when configuring HVM guests. [below is an example of a 4 node setup] <cpu> <numa> <cell id='0' cpus='0' memory='2097152' unit='KiB'> <distances> <sibling id='0' value='10'/> <sibling id='1' value='21'/> <sibling id='2' value='31'/> <sibling id='3' value='41'/> </distances> </cell> <cell id='1' cpus='1' memory='2097152' unit='KiB'> <distances> <sibling id='0' value='21'/> <sibling id='1' value='10'/> <sibling id='2' value='31'/> <sibling id='3' value='41'/> </distances> </cell> <cell id='2' cpus='2' memory='2097152' unit='KiB'> <distances> <sibling id='0' value='31'/> <sibling id='1' value='21'/> <sibling id='2' value='10'/> <sibling id='3' value='21'/> </distances> <cell id='3' cpus='3' memory='2097152' unit='KiB'> <distances> <sibling id='0' value='41'/> <sibling id='1' value='31'/> <sibling id='2' value='21'/> <sibling id='3' value='10'/> </distances> </cell> </numa> </cpu> Changes under this commit concern all those that require adding the valid data structures, virDomainNuma* functional routines and the XML doc schema enhancements to enforce appropriate administration. These changes are accompanied with topic related documentation under docs/formatdomain.html within the "CPU model and topology" extending the "Guest NUMA topology" paragraph. For terminology we refer to sockets as "nodes" where access to each others' distinct resources such as memory make them "siblings" with a designated "distance" between them. A specific design is described under the ACPI (Advanced Configuration and Power Interface Specification) within the chapter explaining the system's SLIT (System Locality Distance Information Table). These patches extend core libvirt's XML description of a virtual machine's hardware to include NUMA distance information for sibling nodes, which is then passed to Xen guests via libxl. Recently qemu landed support for constructing the SLIT since commit 0f203430dd ("numa: Allow setting NUMA distance for different NUMA nodes"), hence these core libvirt extensions can also help other drivers in supporting this feature. These changes alter the docs/schemas/cputypes.rng enforcing domain administration to follow the syntax below per numa cell id. These changes also alter docs/schemas/basictypes.rng to add "numaDistanceValue" which is an "unsignedInt" with a minimum value of 10 as 0-9 are reserved values and can not be used as System Locality Distance Information Table data. Signed-off-by: Wim ten Have <wim.ten.have@oracle.com> --- Changes on v1: - Add changes to docs/formatdomain.html.in describing schema update. Changes on v2: - Automatically apply distance symmetry maintaining cell <-> sibling. - Check for maximum '255' on numaDistanceValue. - Automatically complete empty distance ranges. - Check that sibling_id's are in range with cell identifiers. - Allow non-contiguous ranges, starting from any node id. - Respect parameters as ATTRIBUTE_NONNULL fix functions and callers. - Add and apply topoly for LOCAL_DISTANCE=10 and REMOTE_DISTANCE=20. --- docs/formatdomain.html.in | 70 +++++++++- docs/schemas/basictypes.rng | 9 ++ docs/schemas/cputypes.rng | 18 +++ src/conf/cpu_conf.c | 2 +- src/conf/numa_conf.c | 323 +++++++++++++++++++++++++++++++++++++++++++- src/conf/numa_conf.h | 25 +++- src/libvirt_private.syms | 6 + 7 files changed, 444 insertions(+), 9 deletions(-) diff --git a/docs/formatdomain.html.in b/docs/formatdomain.html.in index 8ca7637..19a2ac7 100644 --- a/docs/formatdomain.html.in +++ b/docs/formatdomain.html.in @@ -1529,7 +1529,75 @@ </p> <p> - This guest NUMA specification is currently available only for QEMU/KVM. + This guest NUMA specification is currently available only for + QEMU/KVM and Xen. Whereas Xen driver also allows for a distinct + description of NUMA arranged <code>sibling</code> <code>cell</code> + <code>distances</code> <span class="since">Since 3.6.0</span>. + </p> + + <p> + Under NUMA h/w architecture, distinct resources such as memory + create a designated distance between <code>cell</code> and + <code>siblings</code> that now can be described with the help of + <code>distances</code>. A detailed describtion can be found within + the ACPI (Advanced Configuration and Power Interface Specification) + within the chapter explaining the system's SLIT (System Locality + Distance Information Table). + </p> + +<pre> +... +<cpu> + ... + <numa> + <cell id='0' cpus='0,4-7' memory='512000' unit='KiB'> + <distances> + <sibling id='0' value='10'/> + <sibling id='1' value='21'/> + <sibling id='2' value='31'/> + <sibling id='3' value='41'/> + </distances> + </cell> + <cell id='1' cpus='1,8-10,12-15' memory='512000' unit='KiB' memAccess='shared'> + <distances> + <sibling id='0' value='21'/> + <sibling id='1' value='10'/> + <sibling id='2' value='21'/> + <sibling id='3' value='31'/> + </distances> + </cell> + <cell id='2' cpus='2,11' memory='512000' unit='KiB' memAccess='shared'> + <distances> + <sibling id='0' value='31'/> + <sibling id='1' value='21'/> + <sibling id='2' value='10'/> + <sibling id='3' value='21'/> + </distances> + </cell> + <cell id='3' cpus='3' memory='512000' unit='KiB'> + <distances> + <sibling id='0' value='41'/> + <sibling id='1' value='31'/> + <sibling id='2' value='21'/> + <sibling id='3' value='10'/> + </distances> + </cell> + </numa> + ... +</cpu> +...</pre> + + <p> + Under Xen driver, if no <code>distances</code> are given to describe + the SLIT data between different cells, it will default to a scheme + using 10 for local and 20 for remote distances. Made defaults + for guest OS when no SLIT is specified. + <br/>See include/linux/topology.h under + <pre> + /* Conform to ACPI 2.0 SLIT distance definitions */ + #define LOCAL_DISTANCE 10 + #define REMOTE_DISTANCE 20 + </pre> </p> <h3><a id="elementsEvents">Events configuration</a></h3> diff --git a/docs/schemas/basictypes.rng b/docs/schemas/basictypes.rng index 1ea667c..bbef282 100644 --- a/docs/schemas/basictypes.rng +++ b/docs/schemas/basictypes.rng @@ -77,6 +77,15 @@ </choice> </define> + <define name="numaDistanceValue"> + <choice> + <data type="unsignedInt"> + <param name="minInclusive">10</param> + <param name="maxInclusive">255</param> + </data> + </choice> + </define> + <define name="pciaddress"> <optional> <attribute name="domain"> diff --git a/docs/schemas/cputypes.rng b/docs/schemas/cputypes.rng index 3eef16a..c45b6df 100644 --- a/docs/schemas/cputypes.rng +++ b/docs/schemas/cputypes.rng @@ -129,6 +129,24 @@ </choice> </attribute> </optional> + <optional> + <element name="distances"> + <oneOrMore> + <ref name="numaDistance"/> + </oneOrMore> + </element> + </optional> + </element> + </define> + + <define name="numaDistance"> + <element name="sibling"> + <attribute name="id"> + <ref name="unsignedInt"/> + </attribute> + <attribute name="value"> + <ref name="numaDistanceValue"/> + </attribute> </element> </define> diff --git a/src/conf/cpu_conf.c b/src/conf/cpu_conf.c index c21d11d..8d804a1 100644 --- a/src/conf/cpu_conf.c +++ b/src/conf/cpu_conf.c @@ -642,7 +642,7 @@ virCPUDefFormatBufFull(virBufferPtr buf, if (virCPUDefFormatBuf(&childrenBuf, def, updateCPU) < 0) goto cleanup; - if (virDomainNumaDefCPUFormat(&childrenBuf, numa) < 0) + if (virDomainNumaDefCPUFormatXML(&childrenBuf, numa) < 0) goto cleanup; if (virBufferCheckError(&attributeBuf) < 0 || diff --git a/src/conf/numa_conf.c b/src/conf/numa_conf.c index bfd3703..fb2a74c 100644 --- a/src/conf/numa_conf.c +++ b/src/conf/numa_conf.c @@ -29,6 +29,13 @@ #include "virnuma.h" #include "virstring.h" +/* + * Distance definitions defined Conform ACPI 2.0 SLIT. + * See include/linux/topology.h + */ +#define LOCAL_DISTANCE 10 +#define REMOTE_DISTANCE 20 + #define VIR_FROM_THIS VIR_FROM_DOMAIN VIR_ENUM_IMPL(virDomainNumatuneMemMode, @@ -48,6 +55,8 @@ VIR_ENUM_IMPL(virDomainMemoryAccess, VIR_DOMAIN_MEMORY_ACCESS_LAST, "shared", "private") +typedef struct _virDomainNumaDistance virDomainNumaDistance; +typedef virDomainNumaDistance *virDomainNumaDistancePtr; typedef struct _virDomainNumaNode virDomainNumaNode; typedef virDomainNumaNode *virDomainNumaNodePtr; @@ -66,6 +75,12 @@ struct _virDomainNuma { virBitmapPtr nodeset; /* host memory nodes where this guest node resides */ virDomainNumatuneMemMode mode; /* memory mode selection */ virDomainMemoryAccess memAccess; /* shared memory access configuration */ + + struct _virDomainNumaDistance { + unsigned int value; /* locality value for node i->j or j->i */ + unsigned int cellid; + } *distances; /* remote node distances */ + size_t ndistances; } *mem_nodes; /* guest node configuration */ size_t nmem_nodes; @@ -686,6 +701,128 @@ virDomainNumatuneNodesetIsAvailable(virDomainNumaPtr numatune, } +static int +virDomainNumaDefNodeDistanceParseXML(virDomainNumaPtr def, + xmlXPathContextPtr ctxt, + unsigned int cur_cell) +{ + int ret = -1; + int sibling; + char *tmp = NULL; + xmlNodePtr *nodes = NULL; + size_t i, ndistances = def->nmem_nodes; + + if (!ndistances) + return 0; + + if (!virXPathNode("./distances[1]/sibling", ctxt)) + return 0; + + if ((sibling = virXPathNodeSet("./distances[1]/sibling", ctxt, &nodes)) <= 0) { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("NUMA distances defined without siblings")); + goto cleanup; + } + + for (i = 0; i < sibling; i++) { + virDomainNumaDistancePtr ldist, rdist; + unsigned int sibling_id, sibling_value; + + /* siblings are in order of parsing or explicitly numbered */ + if ((tmp = virXMLPropString(nodes[i], "id"))) { + if (virStrToLong_uip(tmp, NULL, 10, &sibling_id) < 0) { + virReportError(VIR_ERR_XML_ERROR, + _("Invalid 'id' attribute in NUMA distances for sibling: '%s'"), + tmp); + goto cleanup; + } + } + VIR_FREE(tmp); + + if (sibling_id >= ndistances) { + virReportError(VIR_ERR_XML_ERROR, + _("No cell administration for 'sibling_id %d' under NUMA cell '%d' "), + sibling_id, cur_cell); + goto cleanup; + } + + /* We need a locality value. Check and correct + * distance to local and distance to remote node. + */ + if (!(tmp = virXMLPropString(nodes[i], "value"))) { + virReportError(VIR_ERR_XML_ERROR, + _("Missing 'value' attribute in NUMA distances for sibling id: '%d'"), + sibling_id); + goto cleanup; + } + + /* It needs to be applicable */ + if (virStrToLong_uip(tmp, NULL, 10, &sibling_value) < 0) { + virReportError(VIR_ERR_XML_ERROR, + _("Invalid 'value' attribute in NUMA distances for sibling id: '%d'"), + sibling_id); + goto cleanup; + } + VIR_FREE(tmp); + + ldist = def->mem_nodes[cur_cell].distances; + if (!ldist) { + if (def->mem_nodes[cur_cell].ndistances) { + virReportError(VIR_ERR_XML_ERROR, + _("Erratic 'ndistances' set in NUMA distances for sibling id: '%d'"), + cur_cell); + goto cleanup; + } + + if (VIR_ALLOC_N(ldist, ndistances) < 0) + goto cleanup; + + if (!ldist[cur_cell].value) + ldist[cur_cell].value = LOCAL_DISTANCE; + ldist[cur_cell].cellid = cur_cell; + def->mem_nodes[cur_cell].ndistances = ndistances; + } + + ldist[sibling_id].cellid = sibling_id; + ldist[sibling_id].value = sibling_value; + def->mem_nodes[cur_cell].distances = ldist; + + rdist = def->mem_nodes[sibling_id].distances; + if (!rdist) { + if (def->mem_nodes[sibling_id].ndistances) { + virReportError(VIR_ERR_XML_ERROR, + _("Erratic 'ndistances' set in NUMA distances for sibling id: '%d'"), + sibling_id); + goto cleanup; + } + + if (VIR_ALLOC_N(rdist, ndistances) < 0) + goto cleanup; + + if (!rdist[sibling_id].value) + rdist[sibling_id].value = LOCAL_DISTANCE; + rdist[sibling_id].cellid = sibling_id; + def->mem_nodes[sibling_id].ndistances = ndistances; + } + + rdist[cur_cell].cellid = cur_cell; + rdist[cur_cell].value = sibling_value; + def->mem_nodes[sibling_id].distances = rdist; + } + + ret = 0; + + cleanup: + if (ret) { + for (i = 0; i < ndistances; i++) + VIR_FREE(def->mem_nodes[i].distances); + } + VIR_FREE(nodes); + VIR_FREE(tmp); + + return ret; +} + int virDomainNumaDefCPUParseXML(virDomainNumaPtr def, xmlXPathContextPtr ctxt) @@ -694,7 +831,7 @@ virDomainNumaDefCPUParseXML(virDomainNumaPtr def, xmlNodePtr oldNode = ctxt->node; char *tmp = NULL; int n; - size_t i; + size_t i, j; int ret = -1; /* check if NUMA definition is present */ @@ -712,7 +849,6 @@ virDomainNumaDefCPUParseXML(virDomainNumaPtr def, def->nmem_nodes = n; for (i = 0; i < n; i++) { - size_t j; int rc; unsigned int cur_cell = i; @@ -788,6 +924,10 @@ virDomainNumaDefCPUParseXML(virDomainNumaPtr def, def->mem_nodes[cur_cell].memAccess = rc; VIR_FREE(tmp); } + + /* Parse NUMA distances info */ + if (virDomainNumaDefNodeDistanceParseXML(def, ctxt, cur_cell) < 0) + goto cleanup; } ret = 0; @@ -801,8 +941,8 @@ virDomainNumaDefCPUParseXML(virDomainNumaPtr def, int -virDomainNumaDefCPUFormat(virBufferPtr buf, - virDomainNumaPtr def) +virDomainNumaDefCPUFormatXML(virBufferPtr buf, + virDomainNumaPtr def) { virDomainMemoryAccess memAccess; char *cpustr; @@ -815,6 +955,8 @@ virDomainNumaDefCPUFormat(virBufferPtr buf, virBufferAddLit(buf, "<numa>\n"); virBufferAdjustIndent(buf, 2); for (i = 0; i < ncells; i++) { + int ndistances; + memAccess = virDomainNumaGetNodeMemoryAccessMode(def, i); if (!(cpustr = virBitmapFormat(virDomainNumaGetNodeCpumask(def, i)))) @@ -829,7 +971,32 @@ virDomainNumaDefCPUFormat(virBufferPtr buf, if (memAccess) virBufferAsprintf(buf, " memAccess='%s'", virDomainMemoryAccessTypeToString(memAccess)); - virBufferAddLit(buf, "/>\n"); + + ndistances = def->mem_nodes[i].ndistances; + if (!ndistances) { + virBufferAddLit(buf, "/>\n"); + } else { + size_t j; + virDomainNumaDistancePtr distances = def->mem_nodes[i].distances; + + virBufferAddLit(buf, ">\n"); + virBufferAdjustIndent(buf, 2); + virBufferAddLit(buf, "<distances>\n"); + virBufferAdjustIndent(buf, 2); + for (j = 0; j < ndistances; j++) { + if (distances[j].value) { + virBufferAddLit(buf, "<sibling"); + virBufferAsprintf(buf, " id='%d'", distances[j].cellid); + virBufferAsprintf(buf, " value='%d'", distances[j].value); + virBufferAddLit(buf, "/>\n"); + } + } + virBufferAdjustIndent(buf, -2); + virBufferAddLit(buf, "</distances>\n"); + virBufferAdjustIndent(buf, -2); + virBufferAddLit(buf, "</cell>\n"); + } + VIR_FREE(cpustr); } virBufferAdjustIndent(buf, -2); @@ -922,13 +1089,146 @@ virDomainNumaCheckABIStability(virDomainNumaPtr src, size_t virDomainNumaGetNodeCount(virDomainNumaPtr numa) { - if (!numa) + if (!numa || !numa->mem_nodes) return 0; return numa->nmem_nodes; } +size_t +virDomainNumaSetNodeCount(virDomainNumaPtr numa, size_t nmem_nodes) +{ + if (!nmem_nodes) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("Cannot set an empty nmem_nodes set")); + return 0; + } + + if (numa->mem_nodes) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("Cannot alter an existing nmem_nodes set")); + return 0; + } + + if (VIR_ALLOC_N(numa->mem_nodes, nmem_nodes) < 0) + return 0; + + numa->nmem_nodes = nmem_nodes; + + return numa->nmem_nodes; +} + +size_t +virDomainNumaGetNodeDistance(virDomainNumaPtr numa, + size_t node, + size_t cellid) +{ + virDomainNumaDistancePtr distances = NULL; + + if (node < numa->nmem_nodes) + distances = numa->mem_nodes[node].distances; + + /* + * Present the configured distance value. If + * out of range or not available set the platform + * defined default for local and remote nodes. + */ + if (!distances || + !distances[cellid].value || + !numa->mem_nodes[node].ndistances) + return (node == cellid) ? \ + LOCAL_DISTANCE : REMOTE_DISTANCE; + + return distances[cellid].value; +} + + +int +virDomainNumaSetNodeDistance(virDomainNumaPtr numa, + size_t node, + size_t cellid, + unsigned int value) +{ + virDomainNumaDistancePtr distances; + + if (node >= numa->nmem_nodes) { + virReportError(VIR_ERR_INTERNAL_ERROR, + _("Argument 'node' %zu outranges defined number " + "of NUMA nodes"), node); + return -1; + } + + distances = numa->mem_nodes[node].distances; + if (!distances || + cellid >= numa->mem_nodes[node].ndistances) { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("Arguments under memnode element do not " + "correspond with existing guest's NUMA cell")); + return -1; + } + + /* + * Advanced Configuration and Power Interface + * Specification version 6.1. Chapter 5.2.17 + * System Locality Distance Information Table + * ... Distance values of 0-9 are reserved. + */ + if (value < LOCAL_DISTANCE) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("Distance value of %d is in 0-9 reserved range"), + value); + return -1; + } + + if (value > LOCAL_DISTANCE && node == cellid) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("Distance value %d under node %zu does " + "not meet LOCAL_DISTANCE of 10"), + value, node); + return -1; + } + + distances[cellid].cellid = cellid; + distances[cellid].value = value; + + return distances[cellid].value; +} + + +size_t +virDomainNumaGetNodeDistanceCount(virDomainNumaPtr numa, + size_t node) +{ + return numa->mem_nodes[node].ndistances; +} + + +size_t +virDomainNumaSetNodeDistanceCount(virDomainNumaPtr numa, + size_t node, + size_t ndistances) +{ + virDomainNumaDistancePtr distances; + + distances = numa->mem_nodes[node].distances; + if (distances) { + virReportError(VIR_ERR_INTERNAL_ERROR, + _("Cannot alter an existing nmem_nodes distances set for node: %zu"), + node); + return 0; + } + + if (VIR_ALLOC_N(distances, ndistances) < 0) + return 0; + + numa->mem_nodes[node].distances = distances; + numa->mem_nodes[node].ndistances = ndistances; + + return numa->mem_nodes[node].ndistances; +} + + virBitmapPtr virDomainNumaGetNodeCpumask(virDomainNumaPtr numa, size_t node) @@ -937,6 +1237,17 @@ virDomainNumaGetNodeCpumask(virDomainNumaPtr numa, } +virBitmapPtr +virDomainNumaSetNodeCpumask(virDomainNumaPtr numa, + size_t node, + virBitmapPtr cpumask) +{ + numa->mem_nodes[node].cpumask = cpumask; + + return numa->mem_nodes[node].cpumask; +} + + virDomainMemoryAccess virDomainNumaGetNodeMemoryAccessMode(virDomainNumaPtr numa, size_t node) diff --git a/src/conf/numa_conf.h b/src/conf/numa_conf.h index b6a5354..785d12b 100644 --- a/src/conf/numa_conf.h +++ b/src/conf/numa_conf.h @@ -87,12 +87,35 @@ int virDomainNumatuneMaybeGetNodeset(virDomainNumaPtr numatune, size_t virDomainNumaGetNodeCount(virDomainNumaPtr numa); +size_t virDomainNumaSetNodeCount(virDomainNumaPtr numa, + size_t nmem_nodes) + ATTRIBUTE_NONNULL(1); +size_t virDomainNumaGetNodeDistance(virDomainNumaPtr numa, + size_t node, + size_t sibling) + ATTRIBUTE_NONNULL(1); +int virDomainNumaSetNodeDistance(virDomainNumaPtr numa, + size_t node, + size_t sibling, + unsigned int value) + ATTRIBUTE_NONNULL(1); +size_t virDomainNumaGetNodeDistanceCount(virDomainNumaPtr numa, + size_t node) + ATTRIBUTE_NONNULL(1); +size_t virDomainNumaSetNodeDistanceCount(virDomainNumaPtr numa, + size_t node, + size_t ndistances) + ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(3); virBitmapPtr virDomainNumaGetNodeCpumask(virDomainNumaPtr numa, size_t node) ATTRIBUTE_NONNULL(1); virDomainMemoryAccess virDomainNumaGetNodeMemoryAccessMode(virDomainNumaPtr numa, size_t node) ATTRIBUTE_NONNULL(1); +virBitmapPtr virDomainNumaSetNodeCpumask(virDomainNumaPtr numa, + size_t node, + virBitmapPtr cpumask) + ATTRIBUTE_NONNULL(1); unsigned long long virDomainNumaGetNodeMemorySize(virDomainNumaPtr numa, size_t node) ATTRIBUTE_NONNULL(1); @@ -151,7 +174,7 @@ bool virDomainNumatuneNodeSpecified(virDomainNumaPtr numatune, int cellid); int virDomainNumaDefCPUParseXML(virDomainNumaPtr def, xmlXPathContextPtr ctxt); -int virDomainNumaDefCPUFormat(virBufferPtr buf, virDomainNumaPtr def); +int virDomainNumaDefCPUFormatXML(virBufferPtr buf, virDomainNumaPtr def); unsigned int virDomainNumaGetCPUCountTotal(virDomainNumaPtr numa); diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index f30a04b..7fbda7e 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -709,9 +709,15 @@ virDomainNumaGetMaxCPUID; virDomainNumaGetMemorySize; virDomainNumaGetNodeCount; virDomainNumaGetNodeCpumask; +virDomainNumaGetNodeDistance; +virDomainNumaGetNodeDistanceCount; virDomainNumaGetNodeMemoryAccessMode; virDomainNumaGetNodeMemorySize; virDomainNumaNew; +virDomainNumaSetNodeCount; +virDomainNumaSetNodeCpumask; +virDomainNumaSetNodeDistance; +virDomainNumaSetNodeDistanceCount; virDomainNumaSetNodeMemorySize; virDomainNumatuneFormatNodeset; virDomainNumatuneFormatXML; -- 2.9.5

On Thu, Aug 31, 2017 at 04:02:38PM +0200, Wim Ten Have wrote:
From: Wim ten Have <wim.ten.have@oracle.com>
I haven't seen the previous versions, so sorry if I point out something that got changed already.
Add libvirtd NUMA cell domain administration functionality to describe underlying cell id sibling distances in full fashion when configuring HVM guests.
[below is an example of a 4 node setup]
<cpu> <numa> <cell id='0' cpus='0' memory='2097152' unit='KiB'> <distances> <sibling id='0' value='10'/> <sibling id='1' value='21'/> <sibling id='2' value='31'/> <sibling id='3' value='41'/> </distances> </cell> <cell id='1' cpus='1' memory='2097152' unit='KiB'> <distances> <sibling id='0' value='21'/> <sibling id='1' value='10'/> <sibling id='2' value='31'/> <sibling id='3' value='41'/> </distances> </cell> <cell id='2' cpus='2' memory='2097152' unit='KiB'> <distances> <sibling id='0' value='31'/> <sibling id='1' value='21'/> <sibling id='2' value='10'/> <sibling id='3' value='21'/> </distances> <cell id='3' cpus='3' memory='2097152' unit='KiB'> <distances> <sibling id='0' value='41'/> <sibling id='1' value='31'/> <sibling id='2' value='21'/> <sibling id='3' value='10'/> </distances> </cell> </numa> </cpu>
Are all those required? I don't see the point in requiring duplicate values.
Changes under this commit concern all those that require adding the valid data structures, virDomainNuma* functional routines and the XML doc schema enhancements to enforce appropriate administration.
I don't understand the point of this paragraph.
These changes are accompanied with topic related documentation under docs/formatdomain.html within the "CPU model and topology" extending the "Guest NUMA topology" paragraph.
This can be seen from the patch.
For terminology we refer to sockets as "nodes" where access to each others' distinct resources such as memory make them "siblings" with a designated "distance" between them. A specific design is described under the ACPI (Advanced Configuration and Power Interface Specification) within the chapter explaining the system's SLIT (System Locality Distance Information Table).
The above paragraph is also being added in the docs.
These patches extend core libvirt's XML description of a virtual machine's hardware to include NUMA distance information for sibling nodes, which is then passed to Xen guests via libxl. Recently qemu landed support for constructing the SLIT since commit 0f203430dd ("numa: Allow setting NUMA distance for different NUMA nodes"), hence these core libvirt extensions can also help other drivers in supporting this feature.
This is not true, libxl changes are in separate patch. There's a lot of copy-paste from the cover-letter...
These changes alter the docs/schemas/cputypes.rng enforcing domain administration to follow the syntax below per numa cell id.
These changes also alter docs/schemas/basictypes.rng to add "numaDistanceValue" which is an "unsignedInt" with a minimum value of 10 as 0-9 are reserved values and can not be used as System Locality Distance Information Table data.
Signed-off-by: Wim ten Have <wim.ten.have@oracle.com> --- Changes on v1: - Add changes to docs/formatdomain.html.in describing schema update. Changes on v2: - Automatically apply distance symmetry maintaining cell <-> sibling. - Check for maximum '255' on numaDistanceValue. - Automatically complete empty distance ranges. - Check that sibling_id's are in range with cell identifiers. - Allow non-contiguous ranges, starting from any node id. - Respect parameters as ATTRIBUTE_NONNULL fix functions and callers. - Add and apply topoly for LOCAL_DISTANCE=10 and REMOTE_DISTANCE=20. --- docs/formatdomain.html.in | 70 +++++++++- docs/schemas/basictypes.rng | 9 ++ docs/schemas/cputypes.rng | 18 +++ src/conf/cpu_conf.c | 2 +- src/conf/numa_conf.c | 323 +++++++++++++++++++++++++++++++++++++++++++- src/conf/numa_conf.h | 25 +++- src/libvirt_private.syms | 6 + 7 files changed, 444 insertions(+), 9 deletions(-)
diff --git a/docs/formatdomain.html.in b/docs/formatdomain.html.in index 8ca7637..19a2ac7 100644 --- a/docs/formatdomain.html.in +++ b/docs/formatdomain.html.in @@ -1529,7 +1529,75 @@ </p>
<p> - This guest NUMA specification is currently available only for QEMU/KVM. + This guest NUMA specification is currently available only for + QEMU/KVM and Xen. Whereas Xen driver also allows for a distinct + description of NUMA arranged <code>sibling</code> <code>cell</code> + <code>distances</code> <span class="since">Since 3.6.0</span>. + </p> + + <p> + Under NUMA h/w architecture, distinct resources such as memory + create a designated distance between <code>cell</code> and + <code>siblings</code> that now can be described with the help of + <code>distances</code>. A detailed describtion can be found within + the ACPI (Advanced Configuration and Power Interface Specification) + within the chapter explaining the system's SLIT (System Locality + Distance Information Table). + </p> + +<pre> +... +<cpu> + ... + <numa> + <cell id='0' cpus='0,4-7' memory='512000' unit='KiB'> + <distances> + <sibling id='0' value='10'/> + <sibling id='1' value='21'/> + <sibling id='2' value='31'/> + <sibling id='3' value='41'/> + </distances> + </cell> + <cell id='1' cpus='1,8-10,12-15' memory='512000' unit='KiB' memAccess='shared'> + <distances> + <sibling id='0' value='21'/> + <sibling id='1' value='10'/> + <sibling id='2' value='21'/> + <sibling id='3' value='31'/> + </distances> + </cell> + <cell id='2' cpus='2,11' memory='512000' unit='KiB' memAccess='shared'> + <distances> + <sibling id='0' value='31'/> + <sibling id='1' value='21'/> + <sibling id='2' value='10'/> + <sibling id='3' value='21'/> + </distances> + </cell> + <cell id='3' cpus='3' memory='512000' unit='KiB'> + <distances> + <sibling id='0' value='41'/> + <sibling id='1' value='31'/> + <sibling id='2' value='21'/> + <sibling id='3' value='10'/> + </distances> + </cell> + </numa> + ... +</cpu> +...</pre> + + <p> + Under Xen driver, if no <code>distances</code> are given to describe + the SLIT data between different cells, it will default to a scheme + using 10 for local and 20 for remote distances. Made defaults + for guest OS when no SLIT is specified. + <br/>See include/linux/topology.h under + <pre> + /* Conform to ACPI 2.0 SLIT distance definitions */ + #define LOCAL_DISTANCE 10 + #define REMOTE_DISTANCE 20 + </pre>
I wouldn't include the code in the <pre>, I don't see the point for that.
</p>
<h3><a id="elementsEvents">Events configuration</a></h3> diff --git a/docs/schemas/basictypes.rng b/docs/schemas/basictypes.rng index 1ea667c..bbef282 100644 --- a/docs/schemas/basictypes.rng +++ b/docs/schemas/basictypes.rng @@ -77,6 +77,15 @@ </choice> </define>
+ <define name="numaDistanceValue"> + <choice> + <data type="unsignedInt"> + <param name="minInclusive">10</param> + <param name="maxInclusive">255</param> + </data> + </choice>
Why <choice>?
+ </define> + <define name="pciaddress"> <optional> <attribute name="domain"> diff --git a/docs/schemas/cputypes.rng b/docs/schemas/cputypes.rng index 3eef16a..c45b6df 100644 --- a/docs/schemas/cputypes.rng +++ b/docs/schemas/cputypes.rng @@ -129,6 +129,24 @@ </choice> </attribute> </optional> + <optional> + <element name="distances"> + <oneOrMore> + <ref name="numaDistance"/> + </oneOrMore> + </element> + </optional> + </element> + </define> + + <define name="numaDistance"> + <element name="sibling"> + <attribute name="id"> + <ref name="unsignedInt"/> + </attribute> + <attribute name="value"> + <ref name="numaDistanceValue"/> + </attribute> </element> </define>
diff --git a/src/conf/cpu_conf.c b/src/conf/cpu_conf.c index c21d11d..8d804a1 100644 --- a/src/conf/cpu_conf.c +++ b/src/conf/cpu_conf.c @@ -642,7 +642,7 @@ virCPUDefFormatBufFull(virBufferPtr buf, if (virCPUDefFormatBuf(&childrenBuf, def, updateCPU) < 0) goto cleanup;
- if (virDomainNumaDefCPUFormat(&childrenBuf, numa) < 0) + if (virDomainNumaDefCPUFormatXML(&childrenBuf, numa) < 0) goto cleanup;
Changing function names should be separate patch. Why is this changed anyway?
if (virBufferCheckError(&attributeBuf) < 0 || diff --git a/src/conf/numa_conf.c b/src/conf/numa_conf.c index bfd3703..fb2a74c 100644 --- a/src/conf/numa_conf.c +++ b/src/conf/numa_conf.c @@ -29,6 +29,13 @@ #include "virnuma.h" #include "virstring.h"
+/* + * Distance definitions defined Conform ACPI 2.0 SLIT. + * See include/linux/topology.h + */ +#define LOCAL_DISTANCE 10 +#define REMOTE_DISTANCE 20 + #define VIR_FROM_THIS VIR_FROM_DOMAIN
VIR_ENUM_IMPL(virDomainNumatuneMemMode, @@ -48,6 +55,8 @@ VIR_ENUM_IMPL(virDomainMemoryAccess, VIR_DOMAIN_MEMORY_ACCESS_LAST, "shared", "private")
+typedef struct _virDomainNumaDistance virDomainNumaDistance; +typedef virDomainNumaDistance *virDomainNumaDistancePtr;
typedef struct _virDomainNumaNode virDomainNumaNode; typedef virDomainNumaNode *virDomainNumaNodePtr; @@ -66,6 +75,12 @@ struct _virDomainNuma { virBitmapPtr nodeset; /* host memory nodes where this guest node resides */ virDomainNumatuneMemMode mode; /* memory mode selection */ virDomainMemoryAccess memAccess; /* shared memory access configuration */ + + struct _virDomainNumaDistance { + unsigned int value; /* locality value for node i->j or j->i */ + unsigned int cellid; + } *distances; /* remote node distances */ + size_t ndistances; } *mem_nodes; /* guest node configuration */ size_t nmem_nodes;
@@ -686,6 +701,128 @@ virDomainNumatuneNodesetIsAvailable(virDomainNumaPtr numatune, }
+static int +virDomainNumaDefNodeDistanceParseXML(virDomainNumaPtr def, + xmlXPathContextPtr ctxt, + unsigned int cur_cell) +{ + int ret = -1; + int sibling; + char *tmp = NULL; + xmlNodePtr *nodes = NULL; + size_t i, ndistances = def->nmem_nodes; + + if (!ndistances) + return 0; + + if (!virXPathNode("./distances[1]/sibling", ctxt)) + return 0; + + if ((sibling = virXPathNodeSet("./distances[1]/sibling", ctxt, &nodes)) <= 0) { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("NUMA distances defined without siblings")); + goto cleanup; + } +
Aren't these two blocks of code checking the same thing?
+ for (i = 0; i < sibling; i++) { + virDomainNumaDistancePtr ldist, rdist; + unsigned int sibling_id, sibling_value; + + /* siblings are in order of parsing or explicitly numbered */ + if ((tmp = virXMLPropString(nodes[i], "id"))) { + if (virStrToLong_uip(tmp, NULL, 10, &sibling_id) < 0) { + virReportError(VIR_ERR_XML_ERROR, + _("Invalid 'id' attribute in NUMA distances for sibling: '%s'"), + tmp); + goto cleanup; + } + } + VIR_FREE(tmp); + + if (sibling_id >= ndistances) {
Isn't sibling_id uninitialized in case there is no 'id' specified?
+ virReportError(VIR_ERR_XML_ERROR, + _("No cell administration for 'sibling_id %d' under NUMA cell '%d' "), + sibling_id, cur_cell); + goto cleanup; + } + + /* We need a locality value. Check and correct + * distance to local and distance to remote node. + */ + if (!(tmp = virXMLPropString(nodes[i], "value"))) { + virReportError(VIR_ERR_XML_ERROR, + _("Missing 'value' attribute in NUMA distances for sibling id: '%d'"), + sibling_id); + goto cleanup; + } + + /* It needs to be applicable */ + if (virStrToLong_uip(tmp, NULL, 10, &sibling_value) < 0) { + virReportError(VIR_ERR_XML_ERROR, + _("Invalid 'value' attribute in NUMA distances for sibling id: '%d'"), + sibling_id); + goto cleanup; + } + VIR_FREE(tmp); + + ldist = def->mem_nodes[cur_cell].distances; + if (!ldist) { + if (def->mem_nodes[cur_cell].ndistances) { + virReportError(VIR_ERR_XML_ERROR, + _("Erratic 'ndistances' set in NUMA distances for sibling id: '%d'"), + cur_cell); + goto cleanup; + } + + if (VIR_ALLOC_N(ldist, ndistances) < 0) + goto cleanup; + + if (!ldist[cur_cell].value) + ldist[cur_cell].value = LOCAL_DISTANCE; + ldist[cur_cell].cellid = cur_cell; + def->mem_nodes[cur_cell].ndistances = ndistances; + } + + ldist[sibling_id].cellid = sibling_id; + ldist[sibling_id].value = sibling_value;
I don't see the check for 10 <= sibling_value <= 255 anywhere.
+ def->mem_nodes[cur_cell].distances = ldist; + + rdist = def->mem_nodes[sibling_id].distances; + if (!rdist) { + if (def->mem_nodes[sibling_id].ndistances) { + virReportError(VIR_ERR_XML_ERROR, + _("Erratic 'ndistances' set in NUMA distances for sibling id: '%d'"), + sibling_id); + goto cleanup; + } + + if (VIR_ALLOC_N(rdist, ndistances) < 0) + goto cleanup; + + if (!rdist[sibling_id].value) + rdist[sibling_id].value = LOCAL_DISTANCE; + rdist[sibling_id].cellid = sibling_id; + def->mem_nodes[sibling_id].ndistances = ndistances; + } + + rdist[cur_cell].cellid = cur_cell; + rdist[cur_cell].value = sibling_value; + def->mem_nodes[sibling_id].distances = rdist; + } + + ret = 0; + + cleanup: + if (ret) { + for (i = 0; i < ndistances; i++) + VIR_FREE(def->mem_nodes[i].distances); + } + VIR_FREE(nodes); + VIR_FREE(tmp); + + return ret; +} + int virDomainNumaDefCPUParseXML(virDomainNumaPtr def, xmlXPathContextPtr ctxt) @@ -694,7 +831,7 @@ virDomainNumaDefCPUParseXML(virDomainNumaPtr def, xmlNodePtr oldNode = ctxt->node; char *tmp = NULL; int n; - size_t i; + size_t i, j; int ret = -1;
/* check if NUMA definition is present */ @@ -712,7 +849,6 @@ virDomainNumaDefCPUParseXML(virDomainNumaPtr def, def->nmem_nodes = n;
for (i = 0; i < n; i++) { - size_t j; int rc; unsigned int cur_cell = i;
@@ -788,6 +924,10 @@ virDomainNumaDefCPUParseXML(virDomainNumaPtr def, def->mem_nodes[cur_cell].memAccess = rc; VIR_FREE(tmp); } + + /* Parse NUMA distances info */ + if (virDomainNumaDefNodeDistanceParseXML(def, ctxt, cur_cell) < 0) + goto cleanup; }
ret = 0; @@ -801,8 +941,8 @@ virDomainNumaDefCPUParseXML(virDomainNumaPtr def,
int -virDomainNumaDefCPUFormat(virBufferPtr buf, - virDomainNumaPtr def) +virDomainNumaDefCPUFormatXML(virBufferPtr buf, + virDomainNumaPtr def)
Same here with the rename.
{ virDomainMemoryAccess memAccess; char *cpustr; @@ -815,6 +955,8 @@ virDomainNumaDefCPUFormat(virBufferPtr buf, virBufferAddLit(buf, "<numa>\n"); virBufferAdjustIndent(buf, 2); for (i = 0; i < ncells; i++) { + int ndistances; + memAccess = virDomainNumaGetNodeMemoryAccessMode(def, i);
if (!(cpustr = virBitmapFormat(virDomainNumaGetNodeCpumask(def, i)))) @@ -829,7 +971,32 @@ virDomainNumaDefCPUFormat(virBufferPtr buf, if (memAccess) virBufferAsprintf(buf, " memAccess='%s'", virDomainMemoryAccessTypeToString(memAccess)); - virBufferAddLit(buf, "/>\n"); + + ndistances = def->mem_nodes[i].ndistances; + if (!ndistances) { + virBufferAddLit(buf, "/>\n"); + } else { + size_t j; + virDomainNumaDistancePtr distances = def->mem_nodes[i].distances; + + virBufferAddLit(buf, ">\n"); + virBufferAdjustIndent(buf, 2); + virBufferAddLit(buf, "<distances>\n"); + virBufferAdjustIndent(buf, 2); + for (j = 0; j < ndistances; j++) { + if (distances[j].value) { + virBufferAddLit(buf, "<sibling"); + virBufferAsprintf(buf, " id='%d'", distances[j].cellid); + virBufferAsprintf(buf, " value='%d'", distances[j].value); + virBufferAddLit(buf, "/>\n"); + } + } + virBufferAdjustIndent(buf, -2); + virBufferAddLit(buf, "</distances>\n"); + virBufferAdjustIndent(buf, -2); + virBufferAddLit(buf, "</cell>\n"); + } + VIR_FREE(cpustr); } virBufferAdjustIndent(buf, -2); @@ -922,13 +1089,146 @@ virDomainNumaCheckABIStability(virDomainNumaPtr src, size_t virDomainNumaGetNodeCount(virDomainNumaPtr numa) { - if (!numa) + if (!numa || !numa->mem_nodes)
Just "why?" ^^
return 0;
return numa->nmem_nodes; }
+size_t +virDomainNumaSetNodeCount(virDomainNumaPtr numa, size_t nmem_nodes) +{ + if (!nmem_nodes) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("Cannot set an empty nmem_nodes set")); + return 0; + } + + if (numa->mem_nodes) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("Cannot alter an existing nmem_nodes set")); + return 0; + } + + if (VIR_ALLOC_N(numa->mem_nodes, nmem_nodes) < 0) + return 0; + + numa->nmem_nodes = nmem_nodes; + + return numa->nmem_nodes; +} +
Are you introducing a function that is not used anywhere as a part of a patch that does something else as well? I can't see the point of this function really... It looks like you are guarding against user input which doesn't really makes sense here for me. Maybe I need to see where it is used but it's not in this patch...
+size_t +virDomainNumaGetNodeDistance(virDomainNumaPtr numa, + size_t node, + size_t cellid) +{ + virDomainNumaDistancePtr distances = NULL; + + if (node < numa->nmem_nodes) + distances = numa->mem_nodes[node].distances; + + /* + * Present the configured distance value. If + * out of range or not available set the platform + * defined default for local and remote nodes. + */ + if (!distances || + !distances[cellid].value || + !numa->mem_nodes[node].ndistances) + return (node == cellid) ? \ + LOCAL_DISTANCE : REMOTE_DISTANCE; + + return distances[cellid].value; +} + + +int +virDomainNumaSetNodeDistance(virDomainNumaPtr numa, + size_t node, + size_t cellid, + unsigned int value) +{ + virDomainNumaDistancePtr distances; + + if (node >= numa->nmem_nodes) { + virReportError(VIR_ERR_INTERNAL_ERROR, + _("Argument 'node' %zu outranges defined number " + "of NUMA nodes"), node); + return -1; + } + + distances = numa->mem_nodes[node].distances; + if (!distances || + cellid >= numa->mem_nodes[node].ndistances) { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("Arguments under memnode element do not " + "correspond with existing guest's NUMA cell")); + return -1; + } + + /* + * Advanced Configuration and Power Interface + * Specification version 6.1. Chapter 5.2.17 + * System Locality Distance Information Table + * ... Distance values of 0-9 are reserved. + */ + if (value < LOCAL_DISTANCE) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("Distance value of %d is in 0-9 reserved range"), + value); + return -1; + } + + if (value > LOCAL_DISTANCE && node == cellid) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("Distance value %d under node %zu does " + "not meet LOCAL_DISTANCE of 10"),
Maybe you meant REMOTE_DISTANCE here? If there were tests it would be visible. Similarly if this function was used anywhere in this patch. I don't feel like reading further, other may continue the review. Have a nice day, Martin

On Thu, 31 Aug 2017 16:36:58 +0200 Martin Kletzander <mkletzan@redhat.com> wrote:
On Thu, Aug 31, 2017 at 04:02:38PM +0200, Wim Ten Have wrote:
From: Wim ten Have <wim.ten.have@oracle.com>
I haven't seen the previous versions, so sorry if I point out something that got changed already.
There was a v1 and v2 cycle by Jim and Daniel 2 month back. Due to personal issues whole got delayed at my end where i am catching up.
Add libvirtd NUMA cell domain administration functionality to describe underlying cell id sibling distances in full fashion when configuring HVM guests.
[below is an example of a 4 node setup]
<cpu> <numa> <cell id='0' cpus='0' memory='2097152' unit='KiB'> <distances> <sibling id='0' value='10'/> <sibling id='1' value='21'/> <sibling id='2' value='31'/> <sibling id='3' value='41'/> </distances> </cell> <cell id='1' cpus='1' memory='2097152' unit='KiB'> <distances> <sibling id='0' value='21'/> <sibling id='1' value='10'/> <sibling id='2' value='31'/> <sibling id='3' value='41'/> </distances> </cell> <cell id='2' cpus='2' memory='2097152' unit='KiB'> <distances> <sibling id='0' value='31'/> <sibling id='1' value='21'/> <sibling id='2' value='10'/> <sibling id='3' value='21'/> </distances> <cell id='3' cpus='3' memory='2097152' unit='KiB'> <distances> <sibling id='0' value='41'/> <sibling id='1' value='31'/> <sibling id='2' value='21'/> <sibling id='3' value='10'/> </distances> </cell> </numa> </cpu>
Are all those required? I don't see the point in requiring duplicate values.
Sorry I am not sure I understand what you mean by "duplicate values" here. Can you elaborate?
Changes under this commit concern all those that require adding the valid data structures, virDomainNuma* functional routines and the XML doc schema enhancements to enforce appropriate administration.
I don't understand the point of this paragraph.
Let me rephrase this paragraph in future version.
These changes are accompanied with topic related documentation under docs/formatdomain.html within the "CPU model and topology" extending the "Guest NUMA topology" paragraph.
This can be seen from the patch.
Agree, so I'll address this in future version too.
For terminology we refer to sockets as "nodes" where access to each others' distinct resources such as memory make them "siblings" with a designated "distance" between them. A specific design is described under the ACPI (Advanced Configuration and Power Interface Specification) within the chapter explaining the system's SLIT (System Locality Distance Information Table).
The above paragraph is also being added in the docs.
True so I'll scratch this part of the commit message.
These patches extend core libvirt's XML description of a virtual machine's hardware to include NUMA distance information for sibling nodes, which is then passed to Xen guests via libxl. Recently qemu landed support for constructing the SLIT since commit 0f203430dd ("numa: Allow setting NUMA distance for different NUMA nodes"), hence these core libvirt extensions can also help other drivers in supporting this feature.
This is not true, libxl changes are in separate patch.
Like the previous comments and replies, I'll rearrange the commit messages and rephrase them to be more specific to the patch.
There's a lot of copy-paste from the cover-letter...
There is indeed room for improvement.
These changes alter the docs/schemas/cputypes.rng enforcing domain administration to follow the syntax below per numa cell id.
These changes also alter docs/schemas/basictypes.rng to add "numaDistanceValue" which is an "unsignedInt" with a minimum value of 10 as 0-9 are reserved values and can not be used as System Locality Distance Information Table data.
Signed-off-by: Wim ten Have <wim.ten.have@oracle.com> --- Changes on v1: - Add changes to docs/formatdomain.html.in describing schema update. Changes on v2: - Automatically apply distance symmetry maintaining cell <-> sibling. - Check for maximum '255' on numaDistanceValue. - Automatically complete empty distance ranges. - Check that sibling_id's are in range with cell identifiers. - Allow non-contiguous ranges, starting from any node id. - Respect parameters as ATTRIBUTE_NONNULL fix functions and callers. - Add and apply topoly for LOCAL_DISTANCE=10 and REMOTE_DISTANCE=20. --- docs/formatdomain.html.in | 70 +++++++++- docs/schemas/basictypes.rng | 9 ++ docs/schemas/cputypes.rng | 18 +++ src/conf/cpu_conf.c | 2 +- src/conf/numa_conf.c | 323 +++++++++++++++++++++++++++++++++++++++++++- src/conf/numa_conf.h | 25 +++- src/libvirt_private.syms | 6 + 7 files changed, 444 insertions(+), 9 deletions(-)
diff --git a/docs/formatdomain.html.in b/docs/formatdomain.html.in index 8ca7637..19a2ac7 100644 --- a/docs/formatdomain.html.in +++ b/docs/formatdomain.html.in @@ -1529,7 +1529,75 @@ </p>
<p> - This guest NUMA specification is currently available only for QEMU/KVM. + This guest NUMA specification is currently available only for + QEMU/KVM and Xen. Whereas Xen driver also allows for a distinct + description of NUMA arranged <code>sibling</code> <code>cell</code> + <code>distances</code> <span class="since">Since 3.6.0</span>. + </p> + + <p> + Under NUMA h/w architecture, distinct resources such as memory + create a designated distance between <code>cell</code> and + <code>siblings</code> that now can be described with the help of + <code>distances</code>. A detailed describtion can be found within + the ACPI (Advanced Configuration and Power Interface Specification) + within the chapter explaining the system's SLIT (System Locality + Distance Information Table). + </p> + +<pre> +... +<cpu> + ... + <numa> + <cell id='0' cpus='0,4-7' memory='512000' unit='KiB'> + <distances> + <sibling id='0' value='10'/> + <sibling id='1' value='21'/> + <sibling id='2' value='31'/> + <sibling id='3' value='41'/> + </distances> + </cell> + <cell id='1' cpus='1,8-10,12-15' memory='512000' unit='KiB' memAccess='shared'> + <distances> + <sibling id='0' value='21'/> + <sibling id='1' value='10'/> + <sibling id='2' value='21'/> + <sibling id='3' value='31'/> + </distances> + </cell> + <cell id='2' cpus='2,11' memory='512000' unit='KiB' memAccess='shared'> + <distances> + <sibling id='0' value='31'/> + <sibling id='1' value='21'/> + <sibling id='2' value='10'/> + <sibling id='3' value='21'/> + </distances> + </cell> + <cell id='3' cpus='3' memory='512000' unit='KiB'> + <distances> + <sibling id='0' value='41'/> + <sibling id='1' value='31'/> + <sibling id='2' value='21'/> + <sibling id='3' value='10'/> + </distances> + </cell> + </numa> + ... +</cpu> +...</pre> + + <p> + Under Xen driver, if no <code>distances</code> are given to describe + the SLIT data between different cells, it will default to a scheme + using 10 for local and 20 for remote distances. Made defaults + for guest OS when no SLIT is specified. + <br/>See include/linux/topology.h under + <pre> + /* Conform to ACPI 2.0 SLIT distance definitions */ + #define LOCAL_DISTANCE 10 + #define REMOTE_DISTANCE 20 + </pre>
I wouldn't include the code in the <pre>, I don't see the point for that.
Let me check.
</p>
<h3><a id="elementsEvents">Events configuration</a></h3> diff --git a/docs/schemas/basictypes.rng b/docs/schemas/basictypes.rng index 1ea667c..bbef282 100644 --- a/docs/schemas/basictypes.rng +++ b/docs/schemas/basictypes.rng @@ -77,6 +77,15 @@ </choice> </define>
+ <define name="numaDistanceValue"> + <choice> + <data type="unsignedInt"> + <param name="minInclusive">10</param> + <param name="maxInclusive">255</param> + </data> + </choice>
Why <choice>?
Good point! ... will fix this.
+ </define> + <define name="pciaddress"> <optional> <attribute name="domain"> diff --git a/docs/schemas/cputypes.rng b/docs/schemas/cputypes.rng index 3eef16a..c45b6df 100644 --- a/docs/schemas/cputypes.rng +++ b/docs/schemas/cputypes.rng @@ -129,6 +129,24 @@ </choice> </attribute> </optional> + <optional> + <element name="distances"> + <oneOrMore> + <ref name="numaDistance"/> + </oneOrMore> + </element> + </optional> + </element> + </define> + + <define name="numaDistance"> + <element name="sibling"> + <attribute name="id"> + <ref name="unsignedInt"/> + </attribute> + <attribute name="value"> + <ref name="numaDistanceValue"/> + </attribute> </element> </define>
diff --git a/src/conf/cpu_conf.c b/src/conf/cpu_conf.c index c21d11d..8d804a1 100644 --- a/src/conf/cpu_conf.c +++ b/src/conf/cpu_conf.c @@ -642,7 +642,7 @@ virCPUDefFormatBufFull(virBufferPtr buf, if (virCPUDefFormatBuf(&childrenBuf, def, updateCPU) < 0) goto cleanup;
- if (virDomainNumaDefCPUFormat(&childrenBuf, numa) < 0) + if (virDomainNumaDefCPUFormatXML(&childrenBuf, numa) < 0) goto cleanup;
Changing function names should be separate patch. Why is this changed anyway?
I renamed virDomainNumaDefCPUFormat() to virDomainNumaDefCPUFormatXML() to make it consistent with already existing function names like virDomainNumaDefCPUParseXML()
if (virBufferCheckError(&attributeBuf) < 0 || diff --git a/src/conf/numa_conf.c b/src/conf/numa_conf.c index bfd3703..fb2a74c 100644 --- a/src/conf/numa_conf.c +++ b/src/conf/numa_conf.c @@ -29,6 +29,13 @@ #include "virnuma.h" #include "virstring.h"
+/* + * Distance definitions defined Conform ACPI 2.0 SLIT. + * See include/linux/topology.h + */ +#define LOCAL_DISTANCE 10 +#define REMOTE_DISTANCE 20 + #define VIR_FROM_THIS VIR_FROM_DOMAIN
VIR_ENUM_IMPL(virDomainNumatuneMemMode, @@ -48,6 +55,8 @@ VIR_ENUM_IMPL(virDomainMemoryAccess, VIR_DOMAIN_MEMORY_ACCESS_LAST, "shared", "private")
+typedef struct _virDomainNumaDistance virDomainNumaDistance; +typedef virDomainNumaDistance *virDomainNumaDistancePtr;
typedef struct _virDomainNumaNode virDomainNumaNode; typedef virDomainNumaNode *virDomainNumaNodePtr; @@ -66,6 +75,12 @@ struct _virDomainNuma { virBitmapPtr nodeset; /* host memory nodes where this guest node resides */ virDomainNumatuneMemMode mode; /* memory mode selection */ virDomainMemoryAccess memAccess; /* shared memory access configuration */ + + struct _virDomainNumaDistance { + unsigned int value; /* locality value for node i->j or j->i */ + unsigned int cellid; + } *distances; /* remote node distances */ + size_t ndistances; } *mem_nodes; /* guest node configuration */ size_t nmem_nodes;
@@ -686,6 +701,128 @@ virDomainNumatuneNodesetIsAvailable(virDomainNumaPtr numatune, }
+static int +virDomainNumaDefNodeDistanceParseXML(virDomainNumaPtr def, + xmlXPathContextPtr ctxt, + unsigned int cur_cell) +{ + int ret = -1; + int sibling; + char *tmp = NULL; + xmlNodePtr *nodes = NULL; + size_t i, ndistances = def->nmem_nodes; + + if (!ndistances) + return 0; + + if (!virXPathNode("./distances[1]/sibling", ctxt)) + return 0; + + if ((sibling = virXPathNodeSet("./distances[1]/sibling", ctxt, &nodes)) <= 0) { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("NUMA distances defined without siblings")); + goto cleanup; + } +
Aren't these two blocks of code checking the same thing?
They are not checking the same thing. The first one checks if there are distances under the cell description at all, where the second test ensures that distances descibing the cell actually hold siblings. Let me add a comment similar to the one in virDomainNumaDefCPUParseXML().
+ for (i = 0; i < sibling; i++) { + virDomainNumaDistancePtr ldist, rdist; + unsigned int sibling_id, sibling_value; + + /* siblings are in order of parsing or explicitly numbered */ + if ((tmp = virXMLPropString(nodes[i], "id"))) { + if (virStrToLong_uip(tmp, NULL, 10, &sibling_id) < 0) { + virReportError(VIR_ERR_XML_ERROR, + _("Invalid 'id' attribute in NUMA distances for sibling: '%s'"), + tmp); + goto cleanup; + } + } + VIR_FREE(tmp); + + if (sibling_id >= ndistances) {
Isn't sibling_id uninitialized in case there is no 'id' specified?
You are right ... I'll fix this.
+ virReportError(VIR_ERR_XML_ERROR, + _("No cell administration for 'sibling_id %d' under NUMA cell '%d' "), + sibling_id, cur_cell); + goto cleanup; + } + + /* We need a locality value. Check and correct + * distance to local and distance to remote node. + */ + if (!(tmp = virXMLPropString(nodes[i], "value"))) { + virReportError(VIR_ERR_XML_ERROR, + _("Missing 'value' attribute in NUMA distances for sibling id: '%d'"), + sibling_id); + goto cleanup; + } + + /* It needs to be applicable */ + if (virStrToLong_uip(tmp, NULL, 10, &sibling_value) < 0) { + virReportError(VIR_ERR_XML_ERROR, + _("Invalid 'value' attribute in NUMA distances for sibling id: '%d'"), + sibling_id); + goto cleanup; + } + VIR_FREE(tmp); + + ldist = def->mem_nodes[cur_cell].distances; + if (!ldist) { + if (def->mem_nodes[cur_cell].ndistances) { + virReportError(VIR_ERR_XML_ERROR, + _("Erratic 'ndistances' set in NUMA distances for sibling id: '%d'"), + cur_cell); + goto cleanup; + } + + if (VIR_ALLOC_N(ldist, ndistances) < 0) + goto cleanup; + + if (!ldist[cur_cell].value) + ldist[cur_cell].value = LOCAL_DISTANCE; + ldist[cur_cell].cellid = cur_cell; + def->mem_nodes[cur_cell].ndistances = ndistances; + } + + ldist[sibling_id].cellid = sibling_id; + ldist[sibling_id].value = sibling_value;
I don't see the check for 10 <= sibling_value <= 255 anywhere.
That is already taken care of by the schema validation.
+ def->mem_nodes[cur_cell].distances = ldist; + + rdist = def->mem_nodes[sibling_id].distances; + if (!rdist) { + if (def->mem_nodes[sibling_id].ndistances) { + virReportError(VIR_ERR_XML_ERROR, + _("Erratic 'ndistances' set in NUMA distances for sibling id: '%d'"), + sibling_id); + goto cleanup; + } + + if (VIR_ALLOC_N(rdist, ndistances) < 0) + goto cleanup; + + if (!rdist[sibling_id].value) + rdist[sibling_id].value = LOCAL_DISTANCE; + rdist[sibling_id].cellid = sibling_id; + def->mem_nodes[sibling_id].ndistances = ndistances; + } + + rdist[cur_cell].cellid = cur_cell; + rdist[cur_cell].value = sibling_value; + def->mem_nodes[sibling_id].distances = rdist; + } + + ret = 0; + + cleanup: + if (ret) { + for (i = 0; i < ndistances; i++) + VIR_FREE(def->mem_nodes[i].distances); + } + VIR_FREE(nodes); + VIR_FREE(tmp); + + return ret; +} + int virDomainNumaDefCPUParseXML(virDomainNumaPtr def, xmlXPathContextPtr ctxt) @@ -694,7 +831,7 @@ virDomainNumaDefCPUParseXML(virDomainNumaPtr def, xmlNodePtr oldNode = ctxt->node; char *tmp = NULL; int n; - size_t i; + size_t i, j; int ret = -1;
/* check if NUMA definition is present */ @@ -712,7 +849,6 @@ virDomainNumaDefCPUParseXML(virDomainNumaPtr def, def->nmem_nodes = n;
for (i = 0; i < n; i++) { - size_t j; int rc; unsigned int cur_cell = i;
@@ -788,6 +924,10 @@ virDomainNumaDefCPUParseXML(virDomainNumaPtr def, def->mem_nodes[cur_cell].memAccess = rc; VIR_FREE(tmp); } + + /* Parse NUMA distances info */ + if (virDomainNumaDefNodeDistanceParseXML(def, ctxt, cur_cell) < 0) + goto cleanup; }
ret = 0; @@ -801,8 +941,8 @@ virDomainNumaDefCPUParseXML(virDomainNumaPtr def,
int -virDomainNumaDefCPUFormat(virBufferPtr buf, - virDomainNumaPtr def) +virDomainNumaDefCPUFormatXML(virBufferPtr buf, + virDomainNumaPtr def)
Same here with the rename.
To make it consistent with already existing function names like virDomainNumaDefCPUParseXML()
{ virDomainMemoryAccess memAccess; char *cpustr; @@ -815,6 +955,8 @@ virDomainNumaDefCPUFormat(virBufferPtr buf, virBufferAddLit(buf, "<numa>\n"); virBufferAdjustIndent(buf, 2); for (i = 0; i < ncells; i++) { + int ndistances; + memAccess = virDomainNumaGetNodeMemoryAccessMode(def, i);
if (!(cpustr = virBitmapFormat(virDomainNumaGetNodeCpumask(def, i)))) @@ -829,7 +971,32 @@ virDomainNumaDefCPUFormat(virBufferPtr buf, if (memAccess) virBufferAsprintf(buf, " memAccess='%s'", virDomainMemoryAccessTypeToString(memAccess)); - virBufferAddLit(buf, "/>\n"); + + ndistances = def->mem_nodes[i].ndistances; + if (!ndistances) { + virBufferAddLit(buf, "/>\n"); + } else { + size_t j; + virDomainNumaDistancePtr distances = def->mem_nodes[i].distances; + + virBufferAddLit(buf, ">\n"); + virBufferAdjustIndent(buf, 2); + virBufferAddLit(buf, "<distances>\n"); + virBufferAdjustIndent(buf, 2); + for (j = 0; j < ndistances; j++) { + if (distances[j].value) { + virBufferAddLit(buf, "<sibling"); + virBufferAsprintf(buf, " id='%d'", distances[j].cellid); + virBufferAsprintf(buf, " value='%d'", distances[j].value); + virBufferAddLit(buf, "/>\n"); + } + } + virBufferAdjustIndent(buf, -2); + virBufferAddLit(buf, "</distances>\n"); + virBufferAdjustIndent(buf, -2); + virBufferAddLit(buf, "</cell>\n"); + } + VIR_FREE(cpustr); } virBufferAdjustIndent(buf, -2); @@ -922,13 +1089,146 @@ virDomainNumaCheckABIStability(virDomainNumaPtr src, size_t virDomainNumaGetNodeCount(virDomainNumaPtr numa) { - if (!numa) + if (!numa || !numa->mem_nodes)
Just "why?" ^^
Overly cautious. ... Will revert!
return 0;
return numa->nmem_nodes; }
+size_t +virDomainNumaSetNodeCount(virDomainNumaPtr numa, size_t nmem_nodes) +{ + if (!nmem_nodes) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("Cannot set an empty nmem_nodes set")); + return 0; + } + + if (numa->mem_nodes) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("Cannot alter an existing nmem_nodes set")); + return 0; + } + + if (VIR_ALLOC_N(numa->mem_nodes, nmem_nodes) < 0) + return 0; + + numa->nmem_nodes = nmem_nodes; + + return numa->nmem_nodes; +} +
Are you introducing a function that is not used anywhere as a part of a patch that does something else as well? I can't see the point of this function really... It looks like you are guarding against user input which doesn't really makes sense here for me. Maybe I need to see where it is used but it's not in this patch...
It is used in [PATCH v3 3/4] by xenParseXLVnuma().
+size_t +virDomainNumaGetNodeDistance(virDomainNumaPtr numa, + size_t node, + size_t cellid) +{ + virDomainNumaDistancePtr distances = NULL; + + if (node < numa->nmem_nodes) + distances = numa->mem_nodes[node].distances; + + /* + * Present the configured distance value. If + * out of range or not available set the platform + * defined default for local and remote nodes. + */ + if (!distances || + !distances[cellid].value || + !numa->mem_nodes[node].ndistances) + return (node == cellid) ? \ + LOCAL_DISTANCE : REMOTE_DISTANCE; + + return distances[cellid].value; +} + + +int +virDomainNumaSetNodeDistance(virDomainNumaPtr numa, + size_t node, + size_t cellid, + unsigned int value) +{ + virDomainNumaDistancePtr distances; + + if (node >= numa->nmem_nodes) { + virReportError(VIR_ERR_INTERNAL_ERROR, + _("Argument 'node' %zu outranges defined number " + "of NUMA nodes"), node); + return -1; + } + + distances = numa->mem_nodes[node].distances; + if (!distances || + cellid >= numa->mem_nodes[node].ndistances) { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("Arguments under memnode element do not " + "correspond with existing guest's NUMA cell")); + return -1; + } + + /* + * Advanced Configuration and Power Interface + * Specification version 6.1. Chapter 5.2.17 + * System Locality Distance Information Table + * ... Distance values of 0-9 are reserved. + */ + if (value < LOCAL_DISTANCE) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("Distance value of %d is in 0-9 reserved range"), + value); + return -1; + } + + if (value > LOCAL_DISTANCE && node == cellid) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("Distance value %d under node %zu does " + "not meet LOCAL_DISTANCE of 10"),
Maybe you meant REMOTE_DISTANCE here?
LOCAL_DISTANCE is meant here because the (node == cellid) expression tests for the node itself. Maybe moving (node == cellid) to the front makes it more clear.
If there were tests it would be visible. Similarly if this function was used anywhere in this patch.
I am in agreement on adding more tests and specifically one for this scenario.
I don't feel like reading further, other may continue the review.
Thanks for your review comments so far.
Have a nice day, Martin
Regards, - Wim.

On Fri, Sep 01, 2017 at 04:31:50PM +0200, Wim ten Have wrote:
On Thu, 31 Aug 2017 16:36:58 +0200 Martin Kletzander <mkletzan@redhat.com> wrote:
On Thu, Aug 31, 2017 at 04:02:38PM +0200, Wim Ten Have wrote:
From: Wim ten Have <wim.ten.have@oracle.com>
I haven't seen the previous versions, so sorry if I point out something that got changed already.
There was a v1 and v2 cycle by Jim and Daniel 2 month back. Due to personal issues whole got delayed at my end where i am catching up.
Add libvirtd NUMA cell domain administration functionality to describe underlying cell id sibling distances in full fashion when configuring HVM guests.
[below is an example of a 4 node setup]
<cpu> <numa> <cell id='0' cpus='0' memory='2097152' unit='KiB'> <distances> <sibling id='0' value='10'/> <sibling id='1' value='21'/> <sibling id='2' value='31'/> <sibling id='3' value='41'/> </distances> </cell> <cell id='1' cpus='1' memory='2097152' unit='KiB'> <distances> <sibling id='0' value='21'/> <sibling id='1' value='10'/> <sibling id='2' value='31'/> <sibling id='3' value='41'/> </distances> </cell> <cell id='2' cpus='2' memory='2097152' unit='KiB'> <distances> <sibling id='0' value='31'/> <sibling id='1' value='21'/> <sibling id='2' value='10'/> <sibling id='3' value='21'/> </distances> <cell id='3' cpus='3' memory='2097152' unit='KiB'> <distances> <sibling id='0' value='41'/> <sibling id='1' value='31'/> <sibling id='2' value='21'/> <sibling id='3' value='10'/> </distances> </cell> </numa> </cpu>
Are all those required? I don't see the point in requiring duplicate values.
Sorry I am not sure I understand what you mean by "duplicate values" here. Can you elaborate?
Sure. For simplicity let's assume 2 cells: <cell id='0' cpus='0' memory='2097152' unit='KiB'> <distances> <sibling id='0' value='10'/> <sibling id='1' value='21'/> </distances> </cell> <cell id='1' cpus='1' memory='2097152' unit='KiB'> <distances> <sibling id='0' value='21'/> <!-- This should not be needed --> <sibling id='1' value='10'/> </distances> </cell>
Changes under this commit concern all those that require adding the valid data structures, virDomainNuma* functional routines and the XML doc schema enhancements to enforce appropriate administration.
I don't understand the point of this paragraph.
Let me rephrase this paragraph in future version.
These changes are accompanied with topic related documentation under docs/formatdomain.html within the "CPU model and topology" extending the "Guest NUMA topology" paragraph.
This can be seen from the patch.
Agree, so I'll address this in future version too.
For terminology we refer to sockets as "nodes" where access to each others' distinct resources such as memory make them "siblings" with a designated "distance" between them. A specific design is described under the ACPI (Advanced Configuration and Power Interface Specification) within the chapter explaining the system's SLIT (System Locality Distance Information Table).
The above paragraph is also being added in the docs.
True so I'll scratch this part of the commit message.
These patches extend core libvirt's XML description of a virtual machine's hardware to include NUMA distance information for sibling nodes, which is then passed to Xen guests via libxl. Recently qemu landed support for constructing the SLIT since commit 0f203430dd ("numa: Allow setting NUMA distance for different NUMA nodes"), hence these core libvirt extensions can also help other drivers in supporting this feature.
This is not true, libxl changes are in separate patch.
Like the previous comments and replies, I'll rearrange the commit messages and rephrase them to be more specific to the patch.
There's a lot of copy-paste from the cover-letter...
There is indeed room for improvement.
These changes alter the docs/schemas/cputypes.rng enforcing domain administration to follow the syntax below per numa cell id.
These changes also alter docs/schemas/basictypes.rng to add "numaDistanceValue" which is an "unsignedInt" with a minimum value of 10 as 0-9 are reserved values and can not be used as System Locality Distance Information Table data.
Signed-off-by: Wim ten Have <wim.ten.have@oracle.com> --- Changes on v1: - Add changes to docs/formatdomain.html.in describing schema update. Changes on v2: - Automatically apply distance symmetry maintaining cell <-> sibling. - Check for maximum '255' on numaDistanceValue. - Automatically complete empty distance ranges. - Check that sibling_id's are in range with cell identifiers. - Allow non-contiguous ranges, starting from any node id. - Respect parameters as ATTRIBUTE_NONNULL fix functions and callers. - Add and apply topoly for LOCAL_DISTANCE=10 and REMOTE_DISTANCE=20. --- docs/formatdomain.html.in | 70 +++++++++- docs/schemas/basictypes.rng | 9 ++ docs/schemas/cputypes.rng | 18 +++ src/conf/cpu_conf.c | 2 +- src/conf/numa_conf.c | 323 +++++++++++++++++++++++++++++++++++++++++++- src/conf/numa_conf.h | 25 +++- src/libvirt_private.syms | 6 + 7 files changed, 444 insertions(+), 9 deletions(-)
diff --git a/docs/formatdomain.html.in b/docs/formatdomain.html.in index 8ca7637..19a2ac7 100644 --- a/docs/formatdomain.html.in +++ b/docs/formatdomain.html.in @@ -1529,7 +1529,75 @@ </p>
<p> - This guest NUMA specification is currently available only for QEMU/KVM. + This guest NUMA specification is currently available only for + QEMU/KVM and Xen. Whereas Xen driver also allows for a distinct + description of NUMA arranged <code>sibling</code> <code>cell</code> + <code>distances</code> <span class="since">Since 3.6.0</span>. + </p> + + <p> + Under NUMA h/w architecture, distinct resources such as memory + create a designated distance between <code>cell</code> and + <code>siblings</code> that now can be described with the help of + <code>distances</code>. A detailed describtion can be found within + the ACPI (Advanced Configuration and Power Interface Specification) + within the chapter explaining the system's SLIT (System Locality + Distance Information Table). + </p> + +<pre> +... +<cpu> + ... + <numa> + <cell id='0' cpus='0,4-7' memory='512000' unit='KiB'> + <distances> + <sibling id='0' value='10'/> + <sibling id='1' value='21'/> + <sibling id='2' value='31'/> + <sibling id='3' value='41'/> + </distances> + </cell> + <cell id='1' cpus='1,8-10,12-15' memory='512000' unit='KiB' memAccess='shared'> + <distances> + <sibling id='0' value='21'/> + <sibling id='1' value='10'/> + <sibling id='2' value='21'/> + <sibling id='3' value='31'/> + </distances> + </cell> + <cell id='2' cpus='2,11' memory='512000' unit='KiB' memAccess='shared'> + <distances> + <sibling id='0' value='31'/> + <sibling id='1' value='21'/> + <sibling id='2' value='10'/> + <sibling id='3' value='21'/> + </distances> + </cell> + <cell id='3' cpus='3' memory='512000' unit='KiB'> + <distances> + <sibling id='0' value='41'/> + <sibling id='1' value='31'/> + <sibling id='2' value='21'/> + <sibling id='3' value='10'/> + </distances> + </cell> + </numa> + ... +</cpu> +...</pre> + + <p> + Under Xen driver, if no <code>distances</code> are given to describe + the SLIT data between different cells, it will default to a scheme + using 10 for local and 20 for remote distances. Made defaults + for guest OS when no SLIT is specified. + <br/>See include/linux/topology.h under + <pre> + /* Conform to ACPI 2.0 SLIT distance definitions */ + #define LOCAL_DISTANCE 10 + #define REMOTE_DISTANCE 20 + </pre>
I wouldn't include the code in the <pre>, I don't see the point for that.
Let me check.
</p>
<h3><a id="elementsEvents">Events configuration</a></h3> diff --git a/docs/schemas/basictypes.rng b/docs/schemas/basictypes.rng index 1ea667c..bbef282 100644 --- a/docs/schemas/basictypes.rng +++ b/docs/schemas/basictypes.rng @@ -77,6 +77,15 @@ </choice> </define>
+ <define name="numaDistanceValue"> + <choice> + <data type="unsignedInt"> + <param name="minInclusive">10</param> + <param name="maxInclusive">255</param> + </data> + </choice>
Why <choice>?
Good point! ... will fix this.
+ </define> + <define name="pciaddress"> <optional> <attribute name="domain"> diff --git a/docs/schemas/cputypes.rng b/docs/schemas/cputypes.rng index 3eef16a..c45b6df 100644 --- a/docs/schemas/cputypes.rng +++ b/docs/schemas/cputypes.rng @@ -129,6 +129,24 @@ </choice> </attribute> </optional> + <optional> + <element name="distances"> + <oneOrMore> + <ref name="numaDistance"/> + </oneOrMore> + </element> + </optional> + </element> + </define> + + <define name="numaDistance"> + <element name="sibling"> + <attribute name="id"> + <ref name="unsignedInt"/> + </attribute> + <attribute name="value"> + <ref name="numaDistanceValue"/> + </attribute> </element> </define>
diff --git a/src/conf/cpu_conf.c b/src/conf/cpu_conf.c index c21d11d..8d804a1 100644 --- a/src/conf/cpu_conf.c +++ b/src/conf/cpu_conf.c @@ -642,7 +642,7 @@ virCPUDefFormatBufFull(virBufferPtr buf, if (virCPUDefFormatBuf(&childrenBuf, def, updateCPU) < 0) goto cleanup;
- if (virDomainNumaDefCPUFormat(&childrenBuf, numa) < 0) + if (virDomainNumaDefCPUFormatXML(&childrenBuf, numa) < 0) goto cleanup;
Changing function names should be separate patch. Why is this changed anyway?
I renamed virDomainNumaDefCPUFormat() to virDomainNumaDefCPUFormatXML() to make it consistent with already existing function names like virDomainNumaDefCPUParseXML()
Then put it in a separate patch.
if (virBufferCheckError(&attributeBuf) < 0 || diff --git a/src/conf/numa_conf.c b/src/conf/numa_conf.c index bfd3703..fb2a74c 100644 --- a/src/conf/numa_conf.c +++ b/src/conf/numa_conf.c @@ -29,6 +29,13 @@ #include "virnuma.h" #include "virstring.h"
+/* + * Distance definitions defined Conform ACPI 2.0 SLIT. + * See include/linux/topology.h + */ +#define LOCAL_DISTANCE 10 +#define REMOTE_DISTANCE 20 + #define VIR_FROM_THIS VIR_FROM_DOMAIN
VIR_ENUM_IMPL(virDomainNumatuneMemMode, @@ -48,6 +55,8 @@ VIR_ENUM_IMPL(virDomainMemoryAccess, VIR_DOMAIN_MEMORY_ACCESS_LAST, "shared", "private")
+typedef struct _virDomainNumaDistance virDomainNumaDistance; +typedef virDomainNumaDistance *virDomainNumaDistancePtr;
typedef struct _virDomainNumaNode virDomainNumaNode; typedef virDomainNumaNode *virDomainNumaNodePtr; @@ -66,6 +75,12 @@ struct _virDomainNuma { virBitmapPtr nodeset; /* host memory nodes where this guest node resides */ virDomainNumatuneMemMode mode; /* memory mode selection */ virDomainMemoryAccess memAccess; /* shared memory access configuration */ + + struct _virDomainNumaDistance { + unsigned int value; /* locality value for node i->j or j->i */ + unsigned int cellid; + } *distances; /* remote node distances */ + size_t ndistances; } *mem_nodes; /* guest node configuration */ size_t nmem_nodes;
@@ -686,6 +701,128 @@ virDomainNumatuneNodesetIsAvailable(virDomainNumaPtr numatune, }
+static int +virDomainNumaDefNodeDistanceParseXML(virDomainNumaPtr def, + xmlXPathContextPtr ctxt, + unsigned int cur_cell) +{ + int ret = -1; + int sibling; + char *tmp = NULL; + xmlNodePtr *nodes = NULL; + size_t i, ndistances = def->nmem_nodes; + + if (!ndistances) + return 0; + + if (!virXPathNode("./distances[1]/sibling", ctxt)) + return 0; + + if ((sibling = virXPathNodeSet("./distances[1]/sibling", ctxt, &nodes)) <= 0) { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("NUMA distances defined without siblings")); + goto cleanup; + } +
Aren't these two blocks of code checking the same thing?
They are not checking the same thing. The first one checks if there are distances under the cell description at all, where the second test ensures that distances descibing the cell actually hold siblings.
Sure, but if you remove the first block the code will work the same way, thus it's pointless IMHO.
Let me add a comment similar to the one in virDomainNumaDefCPUParseXML().
+ for (i = 0; i < sibling; i++) { + virDomainNumaDistancePtr ldist, rdist; + unsigned int sibling_id, sibling_value; + + /* siblings are in order of parsing or explicitly numbered */ + if ((tmp = virXMLPropString(nodes[i], "id"))) { + if (virStrToLong_uip(tmp, NULL, 10, &sibling_id) < 0) { + virReportError(VIR_ERR_XML_ERROR, + _("Invalid 'id' attribute in NUMA distances for sibling: '%s'"), + tmp); + goto cleanup; + } + } + VIR_FREE(tmp); + + if (sibling_id >= ndistances) {
Isn't sibling_id uninitialized in case there is no 'id' specified?
You are right ... I'll fix this.
+ virReportError(VIR_ERR_XML_ERROR, + _("No cell administration for 'sibling_id %d' under NUMA cell '%d' "), + sibling_id, cur_cell); + goto cleanup; + } + + /* We need a locality value. Check and correct + * distance to local and distance to remote node. + */ + if (!(tmp = virXMLPropString(nodes[i], "value"))) { + virReportError(VIR_ERR_XML_ERROR, + _("Missing 'value' attribute in NUMA distances for sibling id: '%d'"), + sibling_id); + goto cleanup; + } + + /* It needs to be applicable */ + if (virStrToLong_uip(tmp, NULL, 10, &sibling_value) < 0) { + virReportError(VIR_ERR_XML_ERROR, + _("Invalid 'value' attribute in NUMA distances for sibling id: '%d'"), + sibling_id); + goto cleanup; + } + VIR_FREE(tmp); + + ldist = def->mem_nodes[cur_cell].distances; + if (!ldist) { + if (def->mem_nodes[cur_cell].ndistances) { + virReportError(VIR_ERR_XML_ERROR, + _("Erratic 'ndistances' set in NUMA distances for sibling id: '%d'"), + cur_cell); + goto cleanup; + } + + if (VIR_ALLOC_N(ldist, ndistances) < 0) + goto cleanup; + + if (!ldist[cur_cell].value) + ldist[cur_cell].value = LOCAL_DISTANCE; + ldist[cur_cell].cellid = cur_cell; + def->mem_nodes[cur_cell].ndistances = ndistances; + } + + ldist[sibling_id].cellid = sibling_id; + ldist[sibling_id].value = sibling_value;
I don't see the check for 10 <= sibling_value <= 255 anywhere.
That is already taken care of by the schema validation.
Which can be skipped. You cannot rely on that.
+ def->mem_nodes[cur_cell].distances = ldist; + + rdist = def->mem_nodes[sibling_id].distances; + if (!rdist) { + if (def->mem_nodes[sibling_id].ndistances) { + virReportError(VIR_ERR_XML_ERROR, + _("Erratic 'ndistances' set in NUMA distances for sibling id: '%d'"), + sibling_id); + goto cleanup; + } + + if (VIR_ALLOC_N(rdist, ndistances) < 0) + goto cleanup; + + if (!rdist[sibling_id].value) + rdist[sibling_id].value = LOCAL_DISTANCE; + rdist[sibling_id].cellid = sibling_id; + def->mem_nodes[sibling_id].ndistances = ndistances; + } + + rdist[cur_cell].cellid = cur_cell; + rdist[cur_cell].value = sibling_value; + def->mem_nodes[sibling_id].distances = rdist; + } + + ret = 0; + + cleanup: + if (ret) { + for (i = 0; i < ndistances; i++) + VIR_FREE(def->mem_nodes[i].distances); + } + VIR_FREE(nodes); + VIR_FREE(tmp); + + return ret; +} + int virDomainNumaDefCPUParseXML(virDomainNumaPtr def, xmlXPathContextPtr ctxt) @@ -694,7 +831,7 @@ virDomainNumaDefCPUParseXML(virDomainNumaPtr def, xmlNodePtr oldNode = ctxt->node; char *tmp = NULL; int n; - size_t i; + size_t i, j; int ret = -1;
/* check if NUMA definition is present */ @@ -712,7 +849,6 @@ virDomainNumaDefCPUParseXML(virDomainNumaPtr def, def->nmem_nodes = n;
for (i = 0; i < n; i++) { - size_t j; int rc; unsigned int cur_cell = i;
@@ -788,6 +924,10 @@ virDomainNumaDefCPUParseXML(virDomainNumaPtr def, def->mem_nodes[cur_cell].memAccess = rc; VIR_FREE(tmp); } + + /* Parse NUMA distances info */ + if (virDomainNumaDefNodeDistanceParseXML(def, ctxt, cur_cell) < 0) + goto cleanup; }
ret = 0; @@ -801,8 +941,8 @@ virDomainNumaDefCPUParseXML(virDomainNumaPtr def,
int -virDomainNumaDefCPUFormat(virBufferPtr buf, - virDomainNumaPtr def) +virDomainNumaDefCPUFormatXML(virBufferPtr buf, + virDomainNumaPtr def)
Same here with the rename.
To make it consistent with already existing function names like virDomainNumaDefCPUParseXML()
{ virDomainMemoryAccess memAccess; char *cpustr; @@ -815,6 +955,8 @@ virDomainNumaDefCPUFormat(virBufferPtr buf, virBufferAddLit(buf, "<numa>\n"); virBufferAdjustIndent(buf, 2); for (i = 0; i < ncells; i++) { + int ndistances; + memAccess = virDomainNumaGetNodeMemoryAccessMode(def, i);
if (!(cpustr = virBitmapFormat(virDomainNumaGetNodeCpumask(def, i)))) @@ -829,7 +971,32 @@ virDomainNumaDefCPUFormat(virBufferPtr buf, if (memAccess) virBufferAsprintf(buf, " memAccess='%s'", virDomainMemoryAccessTypeToString(memAccess)); - virBufferAddLit(buf, "/>\n"); + + ndistances = def->mem_nodes[i].ndistances; + if (!ndistances) { + virBufferAddLit(buf, "/>\n"); + } else { + size_t j; + virDomainNumaDistancePtr distances = def->mem_nodes[i].distances; + + virBufferAddLit(buf, ">\n"); + virBufferAdjustIndent(buf, 2); + virBufferAddLit(buf, "<distances>\n"); + virBufferAdjustIndent(buf, 2); + for (j = 0; j < ndistances; j++) { + if (distances[j].value) { + virBufferAddLit(buf, "<sibling"); + virBufferAsprintf(buf, " id='%d'", distances[j].cellid); + virBufferAsprintf(buf, " value='%d'", distances[j].value); + virBufferAddLit(buf, "/>\n"); + } + } + virBufferAdjustIndent(buf, -2); + virBufferAddLit(buf, "</distances>\n"); + virBufferAdjustIndent(buf, -2); + virBufferAddLit(buf, "</cell>\n"); + } + VIR_FREE(cpustr); } virBufferAdjustIndent(buf, -2); @@ -922,13 +1089,146 @@ virDomainNumaCheckABIStability(virDomainNumaPtr src, size_t virDomainNumaGetNodeCount(virDomainNumaPtr numa) { - if (!numa) + if (!numa || !numa->mem_nodes)
Just "why?" ^^
Overly cautious. ... Will revert!
return 0;
return numa->nmem_nodes; }
+size_t +virDomainNumaSetNodeCount(virDomainNumaPtr numa, size_t nmem_nodes) +{ + if (!nmem_nodes) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("Cannot set an empty nmem_nodes set")); + return 0; + } + + if (numa->mem_nodes) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("Cannot alter an existing nmem_nodes set")); + return 0; + } + + if (VIR_ALLOC_N(numa->mem_nodes, nmem_nodes) < 0) + return 0; + + numa->nmem_nodes = nmem_nodes; + + return numa->nmem_nodes; +} +
Are you introducing a function that is not used anywhere as a part of a patch that does something else as well? I can't see the point of this function really... It looks like you are guarding against user input which doesn't really makes sense here for me. Maybe I need to see where it is used but it's not in this patch...
It is used in [PATCH v3 3/4] by xenParseXLVnuma().
Then introduce it either there or in its own patch. Not in a patch that is parsing XML.
+size_t +virDomainNumaGetNodeDistance(virDomainNumaPtr numa, + size_t node, + size_t cellid) +{ + virDomainNumaDistancePtr distances = NULL; + + if (node < numa->nmem_nodes) + distances = numa->mem_nodes[node].distances; + + /* + * Present the configured distance value. If + * out of range or not available set the platform + * defined default for local and remote nodes. + */ + if (!distances || + !distances[cellid].value || + !numa->mem_nodes[node].ndistances) + return (node == cellid) ? \ + LOCAL_DISTANCE : REMOTE_DISTANCE; + + return distances[cellid].value; +} + + +int +virDomainNumaSetNodeDistance(virDomainNumaPtr numa, + size_t node, + size_t cellid, + unsigned int value) +{ + virDomainNumaDistancePtr distances; + + if (node >= numa->nmem_nodes) { + virReportError(VIR_ERR_INTERNAL_ERROR, + _("Argument 'node' %zu outranges defined number " + "of NUMA nodes"), node); + return -1; + } + + distances = numa->mem_nodes[node].distances; + if (!distances || + cellid >= numa->mem_nodes[node].ndistances) { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("Arguments under memnode element do not " + "correspond with existing guest's NUMA cell")); + return -1; + } + + /* + * Advanced Configuration and Power Interface + * Specification version 6.1. Chapter 5.2.17 + * System Locality Distance Information Table + * ... Distance values of 0-9 are reserved. + */ + if (value < LOCAL_DISTANCE) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("Distance value of %d is in 0-9 reserved range"), + value); + return -1; + } + + if (value > LOCAL_DISTANCE && node == cellid) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("Distance value %d under node %zu does " + "not meet LOCAL_DISTANCE of 10"),
Maybe you meant REMOTE_DISTANCE here?
LOCAL_DISTANCE is meant here because the (node == cellid) expression tests for the node itself. Maybe moving (node == cellid) to the front makes it more clear.
Oh, I missed the "node == cellid" part, sorry for that. In that case why parse the value when it can only be 10?
If there were tests it would be visible. Similarly if this function was used anywhere in this patch.
I am in agreement on adding more tests and specifically one for this scenario.
I don't feel like reading further, other may continue the review.
Thanks for your review comments so far.
Have a nice day, Martin
Regards, - Wim.

On Mon, 4 Sep 2017 08:49:33 +0200 Martin Kletzander <mkletzan@redhat.com> wrote:
On Fri, Sep 01, 2017 at 04:31:50PM +0200, Wim ten Have wrote:
On Thu, 31 Aug 2017 16:36:58 +0200 Martin Kletzander <mkletzan@redhat.com> wrote:
diff --git a/src/conf/cpu_conf.c b/src/conf/cpu_conf.c index c21d11d..8d804a1 100644 --- a/src/conf/cpu_conf.c +++ b/src/conf/cpu_conf.c @@ -642,7 +642,7 @@ virCPUDefFormatBufFull(virBufferPtr buf, if (virCPUDefFormatBuf(&childrenBuf, def, updateCPU) < 0) goto cleanup;
- if (virDomainNumaDefCPUFormat(&childrenBuf, numa) < 0) + if (virDomainNumaDefCPUFormatXML(&childrenBuf, numa) < 0) goto cleanup;
Changing function names should be separate patch. Why is this changed anyway?
I renamed virDomainNumaDefCPUFormat() to virDomainNumaDefCPUFormatXML() to make it consistent with already existing function names like virDomainNumaDefCPUParseXML()
Then put it in a separate patch.
Sure. Do you advise me to put this patch in same or in a separated set? - Wim.

On Mon, Sep 04, 2017 at 11:51:21AM +0200, Wim ten Have wrote:
On Mon, 4 Sep 2017 08:49:33 +0200 Martin Kletzander <mkletzan@redhat.com> wrote:
On Fri, Sep 01, 2017 at 04:31:50PM +0200, Wim ten Have wrote:
On Thu, 31 Aug 2017 16:36:58 +0200 Martin Kletzander <mkletzan@redhat.com> wrote:
diff --git a/src/conf/cpu_conf.c b/src/conf/cpu_conf.c index c21d11d..8d804a1 100644 --- a/src/conf/cpu_conf.c +++ b/src/conf/cpu_conf.c @@ -642,7 +642,7 @@ virCPUDefFormatBufFull(virBufferPtr buf, if (virCPUDefFormatBuf(&childrenBuf, def, updateCPU) < 0) goto cleanup;
- if (virDomainNumaDefCPUFormat(&childrenBuf, numa) < 0) + if (virDomainNumaDefCPUFormatXML(&childrenBuf, numa) < 0) goto cleanup;
Changing function names should be separate patch. Why is this changed anyway?
I renamed virDomainNumaDefCPUFormat() to virDomainNumaDefCPUFormatXML() to make it consistent with already existing function names like virDomainNumaDefCPUParseXML()
Then put it in a separate patch.
Sure. Do you advise me to put this patch in same or in a separated set?
Whatever suits you, I usually put clean-ups in the series as first patches so that it is cleanly prepared for the actual changes. But it's only a matter of not doing multiple things in one patch in case someone would be targetting one change in the future (finding a regression, back-porting it, reverting it). It also reads a bit more nicely.
- Wim.

On Mon, 4 Sep 2017 08:49:33 +0200 Martin Kletzander <mkletzan@redhat.com> wrote:
On Fri, Sep 01, 2017 at 04:31:50PM +0200, Wim ten Have wrote:
On Thu, 31 Aug 2017 16:36:58 +0200 Martin Kletzander <mkletzan@redhat.com> wrote:
On Thu, Aug 31, 2017 at 04:02:38PM +0200, Wim Ten Have wrote:
From: Wim ten Have <wim.ten.have@oracle.com>
I haven't seen the previous versions, so sorry if I point out something that got changed already.
There was a v1 and v2 cycle by Jim and Daniel 2 month back. Due to personal issues whole got delayed at my end where i am catching up.
Add libvirtd NUMA cell domain administration functionality to describe underlying cell id sibling distances in full fashion when configuring HVM guests.
[below is an example of a 4 node setup]
<cpu> <numa> <cell id='0' cpus='0' memory='2097152' unit='KiB'> <distances> <sibling id='0' value='10'/> <sibling id='1' value='21'/> <sibling id='2' value='31'/> <sibling id='3' value='41'/> </distances> </cell> <cell id='1' cpus='1' memory='2097152' unit='KiB'> <distances> <sibling id='0' value='21'/> <sibling id='1' value='10'/> <sibling id='2' value='31'/> <sibling id='3' value='41'/> </distances> </cell> <cell id='2' cpus='2' memory='2097152' unit='KiB'> <distances> <sibling id='0' value='31'/> <sibling id='1' value='21'/> <sibling id='2' value='10'/> <sibling id='3' value='21'/> </distances> <cell id='3' cpus='3' memory='2097152' unit='KiB'> <distances> <sibling id='0' value='41'/> <sibling id='1' value='31'/> <sibling id='2' value='21'/> <sibling id='3' value='10'/> </distances> </cell> </numa> </cpu>
Are all those required? I don't see the point in requiring duplicate values.
Sorry I am not sure I understand what you mean by "duplicate values" here. Can you elaborate?
Sure. For simplicity let's assume 2 cells:
<cell id='0' cpus='0' memory='2097152' unit='KiB'> <distances> <sibling id='0' value='10'/> <sibling id='1' value='21'/> </distances> </cell> <cell id='1' cpus='1' memory='2097152' unit='KiB'> <distances> <sibling id='0' value='21'/> <!-- This should not be needed --> <sibling id='1' value='10'/> </distances> </cell>
The fields are not a necessity. And to reduce even more we could also remove LOCAL_DISTANCES as they make a constant factor where; (cell_id == sibling_id) So, although the <distance> presentation of a guest domain may heavily, giving this a look it seems far from attractive to me. I am not sure if this is what we want and should do. Let me go forward sending out "PATCH v4" addressing other comments and reserve this idea for "PATCH v5". Regards, - Wim.
Changes under this commit concern all those that require adding the valid data structures, virDomainNuma* functional routines and the XML doc schema enhancements to enforce appropriate administration.
I don't understand the point of this paragraph.
Let me rephrase this paragraph in future version.
These changes are accompanied with topic related documentation under docs/formatdomain.html within the "CPU model and topology" extending the "Guest NUMA topology" paragraph.
This can be seen from the patch.
Agree, so I'll address this in future version too.
For terminology we refer to sockets as "nodes" where access to each others' distinct resources such as memory make them "siblings" with a designated "distance" between them. A specific design is described under the ACPI (Advanced Configuration and Power Interface Specification) within the chapter explaining the system's SLIT (System Locality Distance Information Table).
The above paragraph is also being added in the docs.
True so I'll scratch this part of the commit message.
These patches extend core libvirt's XML description of a virtual machine's hardware to include NUMA distance information for sibling nodes, which is then passed to Xen guests via libxl. Recently qemu landed support for constructing the SLIT since commit 0f203430dd ("numa: Allow setting NUMA distance for different NUMA nodes"), hence these core libvirt extensions can also help other drivers in supporting this feature.
This is not true, libxl changes are in separate patch.
Like the previous comments and replies, I'll rearrange the commit messages and rephrase them to be more specific to the patch.
There's a lot of copy-paste from the cover-letter...
There is indeed room for improvement.
These changes alter the docs/schemas/cputypes.rng enforcing domain administration to follow the syntax below per numa cell id.
These changes also alter docs/schemas/basictypes.rng to add "numaDistanceValue" which is an "unsignedInt" with a minimum value of 10 as 0-9 are reserved values and can not be used as System Locality Distance Information Table data.
Signed-off-by: Wim ten Have <wim.ten.have@oracle.com> --- Changes on v1: - Add changes to docs/formatdomain.html.in describing schema update. Changes on v2: - Automatically apply distance symmetry maintaining cell <-> sibling. - Check for maximum '255' on numaDistanceValue. - Automatically complete empty distance ranges. - Check that sibling_id's are in range with cell identifiers. - Allow non-contiguous ranges, starting from any node id. - Respect parameters as ATTRIBUTE_NONNULL fix functions and callers. - Add and apply topoly for LOCAL_DISTANCE=10 and REMOTE_DISTANCE=20. --- docs/formatdomain.html.in | 70 +++++++++- docs/schemas/basictypes.rng | 9 ++ docs/schemas/cputypes.rng | 18 +++ src/conf/cpu_conf.c | 2 +- src/conf/numa_conf.c | 323 +++++++++++++++++++++++++++++++++++++++++++- src/conf/numa_conf.h | 25 +++- src/libvirt_private.syms | 6 + 7 files changed, 444 insertions(+), 9 deletions(-)
diff --git a/docs/formatdomain.html.in b/docs/formatdomain.html.in index 8ca7637..19a2ac7 100644 --- a/docs/formatdomain.html.in +++ b/docs/formatdomain.html.in @@ -1529,7 +1529,75 @@ </p>
<p> - This guest NUMA specification is currently available only for QEMU/KVM. + This guest NUMA specification is currently available only for + QEMU/KVM and Xen. Whereas Xen driver also allows for a distinct + description of NUMA arranged <code>sibling</code> <code>cell</code> + <code>distances</code> <span class="since">Since 3.6.0</span>. + </p> + + <p> + Under NUMA h/w architecture, distinct resources such as memory + create a designated distance between <code>cell</code> and + <code>siblings</code> that now can be described with the help of + <code>distances</code>. A detailed describtion can be found within + the ACPI (Advanced Configuration and Power Interface Specification) + within the chapter explaining the system's SLIT (System Locality + Distance Information Table). + </p> + +<pre> +... +<cpu> + ... + <numa> + <cell id='0' cpus='0,4-7' memory='512000' unit='KiB'> + <distances> + <sibling id='0' value='10'/> + <sibling id='1' value='21'/> + <sibling id='2' value='31'/> + <sibling id='3' value='41'/> + </distances> + </cell> + <cell id='1' cpus='1,8-10,12-15' memory='512000' unit='KiB' memAccess='shared'> + <distances> + <sibling id='0' value='21'/> + <sibling id='1' value='10'/> + <sibling id='2' value='21'/> + <sibling id='3' value='31'/> + </distances> + </cell> + <cell id='2' cpus='2,11' memory='512000' unit='KiB' memAccess='shared'> + <distances> + <sibling id='0' value='31'/> + <sibling id='1' value='21'/> + <sibling id='2' value='10'/> + <sibling id='3' value='21'/> + </distances> + </cell> + <cell id='3' cpus='3' memory='512000' unit='KiB'> + <distances> + <sibling id='0' value='41'/> + <sibling id='1' value='31'/> + <sibling id='2' value='21'/> + <sibling id='3' value='10'/> + </distances> + </cell> + </numa> + ... +</cpu> +...</pre> + + <p> + Under Xen driver, if no <code>distances</code> are given to describe + the SLIT data between different cells, it will default to a scheme + using 10 for local and 20 for remote distances. Made defaults + for guest OS when no SLIT is specified. + <br/>See include/linux/topology.h under + <pre> + /* Conform to ACPI 2.0 SLIT distance definitions */ + #define LOCAL_DISTANCE 10 + #define REMOTE_DISTANCE 20 + </pre>
I wouldn't include the code in the <pre>, I don't see the point for that.
Let me check.
</p>
<h3><a id="elementsEvents">Events configuration</a></h3> diff --git a/docs/schemas/basictypes.rng b/docs/schemas/basictypes.rng index 1ea667c..bbef282 100644 --- a/docs/schemas/basictypes.rng +++ b/docs/schemas/basictypes.rng @@ -77,6 +77,15 @@ </choice> </define>
+ <define name="numaDistanceValue"> + <choice> + <data type="unsignedInt"> + <param name="minInclusive">10</param> + <param name="maxInclusive">255</param> + </data> + </choice>
Why <choice>?
Good point! ... will fix this.
+ </define> + <define name="pciaddress"> <optional> <attribute name="domain"> diff --git a/docs/schemas/cputypes.rng b/docs/schemas/cputypes.rng index 3eef16a..c45b6df 100644 --- a/docs/schemas/cputypes.rng +++ b/docs/schemas/cputypes.rng @@ -129,6 +129,24 @@ </choice> </attribute> </optional> + <optional> + <element name="distances"> + <oneOrMore> + <ref name="numaDistance"/> + </oneOrMore> + </element> + </optional> + </element> + </define> + + <define name="numaDistance"> + <element name="sibling"> + <attribute name="id"> + <ref name="unsignedInt"/> + </attribute> + <attribute name="value"> + <ref name="numaDistanceValue"/> + </attribute> </element> </define>
diff --git a/src/conf/cpu_conf.c b/src/conf/cpu_conf.c index c21d11d..8d804a1 100644 --- a/src/conf/cpu_conf.c +++ b/src/conf/cpu_conf.c @@ -642,7 +642,7 @@ virCPUDefFormatBufFull(virBufferPtr buf, if (virCPUDefFormatBuf(&childrenBuf, def, updateCPU) < 0) goto cleanup;
- if (virDomainNumaDefCPUFormat(&childrenBuf, numa) < 0) + if (virDomainNumaDefCPUFormatXML(&childrenBuf, numa) < 0) goto cleanup;
Changing function names should be separate patch. Why is this changed anyway?
I renamed virDomainNumaDefCPUFormat() to virDomainNumaDefCPUFormatXML() to make it consistent with already existing function names like virDomainNumaDefCPUParseXML()
Then put it in a separate patch.
if (virBufferCheckError(&attributeBuf) < 0 || diff --git a/src/conf/numa_conf.c b/src/conf/numa_conf.c index bfd3703..fb2a74c 100644 --- a/src/conf/numa_conf.c +++ b/src/conf/numa_conf.c @@ -29,6 +29,13 @@ #include "virnuma.h" #include "virstring.h"
+/* + * Distance definitions defined Conform ACPI 2.0 SLIT. + * See include/linux/topology.h + */ +#define LOCAL_DISTANCE 10 +#define REMOTE_DISTANCE 20 + #define VIR_FROM_THIS VIR_FROM_DOMAIN
VIR_ENUM_IMPL(virDomainNumatuneMemMode, @@ -48,6 +55,8 @@ VIR_ENUM_IMPL(virDomainMemoryAccess, VIR_DOMAIN_MEMORY_ACCESS_LAST, "shared", "private")
+typedef struct _virDomainNumaDistance virDomainNumaDistance; +typedef virDomainNumaDistance *virDomainNumaDistancePtr;
typedef struct _virDomainNumaNode virDomainNumaNode; typedef virDomainNumaNode *virDomainNumaNodePtr; @@ -66,6 +75,12 @@ struct _virDomainNuma { virBitmapPtr nodeset; /* host memory nodes where this guest node resides */ virDomainNumatuneMemMode mode; /* memory mode selection */ virDomainMemoryAccess memAccess; /* shared memory access configuration */ + + struct _virDomainNumaDistance { + unsigned int value; /* locality value for node i->j or j->i */ + unsigned int cellid; + } *distances; /* remote node distances */ + size_t ndistances; } *mem_nodes; /* guest node configuration */ size_t nmem_nodes;
@@ -686,6 +701,128 @@ virDomainNumatuneNodesetIsAvailable(virDomainNumaPtr numatune, }
+static int +virDomainNumaDefNodeDistanceParseXML(virDomainNumaPtr def, + xmlXPathContextPtr ctxt, + unsigned int cur_cell) +{ + int ret = -1; + int sibling; + char *tmp = NULL; + xmlNodePtr *nodes = NULL; + size_t i, ndistances = def->nmem_nodes; + + if (!ndistances) + return 0; + + if (!virXPathNode("./distances[1]/sibling", ctxt)) + return 0; + + if ((sibling = virXPathNodeSet("./distances[1]/sibling", ctxt, &nodes)) <= 0) { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("NUMA distances defined without siblings")); + goto cleanup; + } +
Aren't these two blocks of code checking the same thing?
They are not checking the same thing. The first one checks if there are distances under the cell description at all, where the second test ensures that distances descibing the cell actually hold siblings.
Sure, but if you remove the first block the code will work the same way, thus it's pointless IMHO.
Let me add a comment similar to the one in virDomainNumaDefCPUParseXML().
+ for (i = 0; i < sibling; i++) { + virDomainNumaDistancePtr ldist, rdist; + unsigned int sibling_id, sibling_value; + + /* siblings are in order of parsing or explicitly numbered */ + if ((tmp = virXMLPropString(nodes[i], "id"))) { + if (virStrToLong_uip(tmp, NULL, 10, &sibling_id) < 0) { + virReportError(VIR_ERR_XML_ERROR, + _("Invalid 'id' attribute in NUMA distances for sibling: '%s'"), + tmp); + goto cleanup; + } + } + VIR_FREE(tmp); + + if (sibling_id >= ndistances) {
Isn't sibling_id uninitialized in case there is no 'id' specified?
You are right ... I'll fix this.
+ virReportError(VIR_ERR_XML_ERROR, + _("No cell administration for 'sibling_id %d' under NUMA cell '%d' "), + sibling_id, cur_cell); + goto cleanup; + } + + /* We need a locality value. Check and correct + * distance to local and distance to remote node. + */ + if (!(tmp = virXMLPropString(nodes[i], "value"))) { + virReportError(VIR_ERR_XML_ERROR, + _("Missing 'value' attribute in NUMA distances for sibling id: '%d'"), + sibling_id); + goto cleanup; + } + + /* It needs to be applicable */ + if (virStrToLong_uip(tmp, NULL, 10, &sibling_value) < 0) { + virReportError(VIR_ERR_XML_ERROR, + _("Invalid 'value' attribute in NUMA distances for sibling id: '%d'"), + sibling_id); + goto cleanup; + } + VIR_FREE(tmp); + + ldist = def->mem_nodes[cur_cell].distances; + if (!ldist) { + if (def->mem_nodes[cur_cell].ndistances) { + virReportError(VIR_ERR_XML_ERROR, + _("Erratic 'ndistances' set in NUMA distances for sibling id: '%d'"), + cur_cell); + goto cleanup; + } + + if (VIR_ALLOC_N(ldist, ndistances) < 0) + goto cleanup; + + if (!ldist[cur_cell].value) + ldist[cur_cell].value = LOCAL_DISTANCE; + ldist[cur_cell].cellid = cur_cell; + def->mem_nodes[cur_cell].ndistances = ndistances; + } + + ldist[sibling_id].cellid = sibling_id; + ldist[sibling_id].value = sibling_value;
I don't see the check for 10 <= sibling_value <= 255 anywhere.
That is already taken care of by the schema validation.
Which can be skipped. You cannot rely on that.
+ def->mem_nodes[cur_cell].distances = ldist; + + rdist = def->mem_nodes[sibling_id].distances; + if (!rdist) { + if (def->mem_nodes[sibling_id].ndistances) { + virReportError(VIR_ERR_XML_ERROR, + _("Erratic 'ndistances' set in NUMA distances for sibling id: '%d'"), + sibling_id); + goto cleanup; + } + + if (VIR_ALLOC_N(rdist, ndistances) < 0) + goto cleanup; + + if (!rdist[sibling_id].value) + rdist[sibling_id].value = LOCAL_DISTANCE; + rdist[sibling_id].cellid = sibling_id; + def->mem_nodes[sibling_id].ndistances = ndistances; + } + + rdist[cur_cell].cellid = cur_cell; + rdist[cur_cell].value = sibling_value; + def->mem_nodes[sibling_id].distances = rdist; + } + + ret = 0; + + cleanup: + if (ret) { + for (i = 0; i < ndistances; i++) + VIR_FREE(def->mem_nodes[i].distances); + } + VIR_FREE(nodes); + VIR_FREE(tmp); + + return ret; +} + int virDomainNumaDefCPUParseXML(virDomainNumaPtr def, xmlXPathContextPtr ctxt) @@ -694,7 +831,7 @@ virDomainNumaDefCPUParseXML(virDomainNumaPtr def, xmlNodePtr oldNode = ctxt->node; char *tmp = NULL; int n; - size_t i; + size_t i, j; int ret = -1;
/* check if NUMA definition is present */ @@ -712,7 +849,6 @@ virDomainNumaDefCPUParseXML(virDomainNumaPtr def, def->nmem_nodes = n;
for (i = 0; i < n; i++) { - size_t j; int rc; unsigned int cur_cell = i;
@@ -788,6 +924,10 @@ virDomainNumaDefCPUParseXML(virDomainNumaPtr def, def->mem_nodes[cur_cell].memAccess = rc; VIR_FREE(tmp); } + + /* Parse NUMA distances info */ + if (virDomainNumaDefNodeDistanceParseXML(def, ctxt, cur_cell) < 0) + goto cleanup; }
ret = 0; @@ -801,8 +941,8 @@ virDomainNumaDefCPUParseXML(virDomainNumaPtr def,
int -virDomainNumaDefCPUFormat(virBufferPtr buf, - virDomainNumaPtr def) +virDomainNumaDefCPUFormatXML(virBufferPtr buf, + virDomainNumaPtr def)
Same here with the rename.
To make it consistent with already existing function names like virDomainNumaDefCPUParseXML()
{ virDomainMemoryAccess memAccess; char *cpustr; @@ -815,6 +955,8 @@ virDomainNumaDefCPUFormat(virBufferPtr buf, virBufferAddLit(buf, "<numa>\n"); virBufferAdjustIndent(buf, 2); for (i = 0; i < ncells; i++) { + int ndistances; + memAccess = virDomainNumaGetNodeMemoryAccessMode(def, i);
if (!(cpustr = virBitmapFormat(virDomainNumaGetNodeCpumask(def, i)))) @@ -829,7 +971,32 @@ virDomainNumaDefCPUFormat(virBufferPtr buf, if (memAccess) virBufferAsprintf(buf, " memAccess='%s'", virDomainMemoryAccessTypeToString(memAccess)); - virBufferAddLit(buf, "/>\n"); + + ndistances = def->mem_nodes[i].ndistances; + if (!ndistances) { + virBufferAddLit(buf, "/>\n"); + } else { + size_t j; + virDomainNumaDistancePtr distances = def->mem_nodes[i].distances; + + virBufferAddLit(buf, ">\n"); + virBufferAdjustIndent(buf, 2); + virBufferAddLit(buf, "<distances>\n"); + virBufferAdjustIndent(buf, 2); + for (j = 0; j < ndistances; j++) { + if (distances[j].value) { + virBufferAddLit(buf, "<sibling"); + virBufferAsprintf(buf, " id='%d'", distances[j].cellid); + virBufferAsprintf(buf, " value='%d'", distances[j].value); + virBufferAddLit(buf, "/>\n"); + } + } + virBufferAdjustIndent(buf, -2); + virBufferAddLit(buf, "</distances>\n"); + virBufferAdjustIndent(buf, -2); + virBufferAddLit(buf, "</cell>\n"); + } + VIR_FREE(cpustr); } virBufferAdjustIndent(buf, -2); @@ -922,13 +1089,146 @@ virDomainNumaCheckABIStability(virDomainNumaPtr src, size_t virDomainNumaGetNodeCount(virDomainNumaPtr numa) { - if (!numa) + if (!numa || !numa->mem_nodes)
Just "why?" ^^
Overly cautious. ... Will revert!
return 0;
return numa->nmem_nodes; }
+size_t +virDomainNumaSetNodeCount(virDomainNumaPtr numa, size_t nmem_nodes) +{ + if (!nmem_nodes) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("Cannot set an empty nmem_nodes set")); + return 0; + } + + if (numa->mem_nodes) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("Cannot alter an existing nmem_nodes set")); + return 0; + } + + if (VIR_ALLOC_N(numa->mem_nodes, nmem_nodes) < 0) + return 0; + + numa->nmem_nodes = nmem_nodes; + + return numa->nmem_nodes; +} +
Are you introducing a function that is not used anywhere as a part of a patch that does something else as well? I can't see the point of this function really... It looks like you are guarding against user input which doesn't really makes sense here for me. Maybe I need to see where it is used but it's not in this patch...
It is used in [PATCH v3 3/4] by xenParseXLVnuma().
Then introduce it either there or in its own patch. Not in a patch that is parsing XML.
+size_t +virDomainNumaGetNodeDistance(virDomainNumaPtr numa, + size_t node, + size_t cellid) +{ + virDomainNumaDistancePtr distances = NULL; + + if (node < numa->nmem_nodes) + distances = numa->mem_nodes[node].distances; + + /* + * Present the configured distance value. If + * out of range or not available set the platform + * defined default for local and remote nodes. + */ + if (!distances || + !distances[cellid].value || + !numa->mem_nodes[node].ndistances) + return (node == cellid) ? \ + LOCAL_DISTANCE : REMOTE_DISTANCE; + + return distances[cellid].value; +} + + +int +virDomainNumaSetNodeDistance(virDomainNumaPtr numa, + size_t node, + size_t cellid, + unsigned int value) +{ + virDomainNumaDistancePtr distances; + + if (node >= numa->nmem_nodes) { + virReportError(VIR_ERR_INTERNAL_ERROR, + _("Argument 'node' %zu outranges defined number " + "of NUMA nodes"), node); + return -1; + } + + distances = numa->mem_nodes[node].distances; + if (!distances || + cellid >= numa->mem_nodes[node].ndistances) { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("Arguments under memnode element do not " + "correspond with existing guest's NUMA cell")); + return -1; + } + + /* + * Advanced Configuration and Power Interface + * Specification version 6.1. Chapter 5.2.17 + * System Locality Distance Information Table + * ... Distance values of 0-9 are reserved. + */ + if (value < LOCAL_DISTANCE) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("Distance value of %d is in 0-9 reserved range"), + value); + return -1; + } + + if (value > LOCAL_DISTANCE && node == cellid) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("Distance value %d under node %zu does " + "not meet LOCAL_DISTANCE of 10"),
Maybe you meant REMOTE_DISTANCE here?
LOCAL_DISTANCE is meant here because the (node == cellid) expression tests for the node itself. Maybe moving (node == cellid) to the front makes it more clear.
Oh, I missed the "node == cellid" part, sorry for that. In that case why parse the value when it can only be 10?
If there were tests it would be visible. Similarly if this function was used anywhere in this patch.
I am in agreement on adding more tests and specifically one for this scenario.
I don't feel like reading further, other may continue the review.
Thanks for your review comments so far.
Have a nice day, Martin
Regards, - Wim.

On Fri, Sep 08, 2017 at 04:45:52PM +0200, Wim ten Have wrote:
The fields are not a necessity. And to reduce even more we could also remove LOCAL_DISTANCES as they make a constant factor where; (cell_id == sibling_id)
So, although the <distance> presentation of a guest domain may heavily, giving this a look it seems far from attractive to me. I am not sure if this is what we want and should do.
Let me go forward sending out "PATCH v4" addressing other comments and reserve this idea for "PATCH v5".
Sure, let's see what others think. Have a nice day, Martin

From: Wim ten Have <wim.ten.have@oracle.com> This patch generates a NUMA distance-aware libxl description from the information extracted from a NUMA distance-aware libvirt XML file. By default, if no NUMA node distance information is supplied in the libvirt XML file, this patch uses the distances 10 for local and 21 for remote nodes/sockets." Signed-off-by: Wim ten Have <wim.ten.have@oracle.com> --- Changes on v1: - Fix function calling (coding) standards. - Use virReportError(...) under all failing circumstances. - Reduce redundant (format->parse) code sorting bitmaps. - Avoid non GNU shorthand notations, difficult code practise. Changes on v2: - Have autonomous defaults applied from virDomainNumaGetNodeDistance. - Automatically make Xen driver simulate vnuma pnodes on non NUMA h/w. - Compute 'memory unit=' making it the sum of all node memory. --- src/libxl/libxl_conf.c | 120 +++++++++++++++++++++++++++++++++++++++++++++++ src/libxl/libxl_driver.c | 3 +- 2 files changed, 122 insertions(+), 1 deletion(-) diff --git a/src/libxl/libxl_conf.c b/src/libxl/libxl_conf.c index 4416a09..5fb3561 100644 --- a/src/libxl/libxl_conf.c +++ b/src/libxl/libxl_conf.c @@ -618,6 +618,121 @@ libxlMakeDomBuildInfo(virDomainDefPtr def, return 0; } +#ifdef LIBXL_HAVE_VNUMA +static int +libxlMakeVnumaList(virDomainDefPtr def, + libxl_ctx *ctx, + libxl_domain_config *d_config) +{ + int ret = -1; + size_t i, j; + size_t nr_nodes; + size_t num_vnuma; + bool simulate = false; + virBitmapPtr bitmap = NULL; + virDomainNumaPtr numa = def->numa; + libxl_domain_build_info *b_info = &d_config->b_info; + libxl_physinfo physinfo; + libxl_vnode_info *vnuma_nodes = NULL; + + if (!numa) + return 0; + + num_vnuma = virDomainNumaGetNodeCount(numa); + if (!num_vnuma) + return 0; + + libxl_physinfo_init(&physinfo); + if (libxl_get_physinfo(ctx, &physinfo) < 0) { + libxl_physinfo_dispose(&physinfo); + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("libxl_get_physinfo_info failed")); + return -1; + } + nr_nodes = physinfo.nr_nodes; + libxl_physinfo_dispose(&physinfo); + + if (num_vnuma > nr_nodes) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("Number of configured numa cells %zu exceeds the physical available nodes %zu, guest simulates numa"), + num_vnuma, nr_nodes); + simulate = true; + } + + /* + * allocate the vnuma_nodes for assignment under b_info. + */ + if (VIR_ALLOC_N(vnuma_nodes, num_vnuma) < 0) + return -1; + + /* + * parse the vnuma vnodes data. + */ + for (i = 0; i < num_vnuma; i++) { + int cpu; + libxl_bitmap vcpu_bitmap; + libxl_vnode_info *p = &vnuma_nodes[i]; + + libxl_vnode_info_init(p); + + /* pnode */ + p->pnode = simulate ? 0 : i; + + /* memory size */ + p->memkb = virDomainNumaGetNodeMemorySize(numa, i); + + /* vcpus */ + bitmap = virDomainNumaGetNodeCpumask(numa, i); + if (bitmap == NULL) { + virReportError(VIR_ERR_INTERNAL_ERROR, + _("vnuma sibling %zu missing vcpus set"), i); + goto cleanup; + } + + if ((cpu = virBitmapNextSetBit(bitmap, -1)) < 0) + goto cleanup; + + libxl_bitmap_init(&vcpu_bitmap); + if (libxl_cpu_bitmap_alloc(ctx, &vcpu_bitmap, b_info->max_vcpus)) { + virReportOOMError(); + goto cleanup; + } + + do { + libxl_bitmap_set(&vcpu_bitmap, cpu); + } while ((cpu = virBitmapNextSetBit(bitmap, cpu)) >= 0); + + libxl_bitmap_copy_alloc(ctx, &p->vcpus, &vcpu_bitmap); + libxl_bitmap_dispose(&vcpu_bitmap); + + /* vdistances */ + if (VIR_ALLOC_N(p->distances, num_vnuma) < 0) + goto cleanup; + p->num_distances = num_vnuma; + + for (j = 0; j < num_vnuma; j++) + p->distances[j] = virDomainNumaGetNodeDistance(numa, i, j); + } + + b_info->vnuma_nodes = vnuma_nodes; + b_info->num_vnuma_nodes = num_vnuma; + + ret = 0; + + cleanup: + if (ret) { + for (i = 0; i < num_vnuma; i++) { + libxl_vnode_info *p = &vnuma_nodes[i]; + + VIR_FREE(p->distances); + } + VIR_FREE(vnuma_nodes); + } + + return ret; +} +#endif + static int libxlDiskSetDiscard(libxl_device_disk *x_disk, int discard) { @@ -2208,6 +2323,11 @@ libxlBuildDomainConfig(virPortAllocatorPtr graphicsports, if (libxlMakeDomBuildInfo(def, ctx, caps, d_config) < 0) return -1; +#ifdef LIBXL_HAVE_VNUMA + if (libxlMakeVnumaList(def, ctx, d_config) < 0) + return -1; +#endif + if (libxlMakeDiskList(def, d_config) < 0) return -1; diff --git a/src/libxl/libxl_driver.c b/src/libxl/libxl_driver.c index 8fefce6..a7d8bfe 100644 --- a/src/libxl/libxl_driver.c +++ b/src/libxl/libxl_driver.c @@ -2788,7 +2788,8 @@ libxlDomainDefineXMLFlags(virConnectPtr conn, const char *xml, unsigned int flag virCheckFlags(VIR_DOMAIN_DEFINE_VALIDATE, NULL); if (flags & VIR_DOMAIN_DEFINE_VALIDATE) - parse_flags |= VIR_DOMAIN_DEF_PARSE_VALIDATE_SCHEMA; + parse_flags |= (VIR_DOMAIN_DEF_PARSE_VALIDATE_SCHEMA | + VIR_DOMAIN_DEF_PARSE_ABI_UPDATE); if (!(def = virDomainDefParseString(xml, cfg->caps, driver->xmlopt, NULL, parse_flags))) -- 2.9.5

From: Wim ten Have <wim.ten.have@oracle.com> This patch converts NUMA configurations between the Xen libxl configuration file format and libvirt's XML format. XML HVM domain configuration: <cpu> <numa> <cell id='0' cpus='0-1' memory='2097152' unit='KiB'> <distances> <sibling id='0' value='10'/> <sibling id='1' value='21'/> <sibling id='2' value='31'/> <sibling id='3' value='41'/> </distances> </cell> <cell id='1' cpus='2-3' memory='2097152' unit='KiB'> <distances> <sibling id='0' value='21'/> <sibling id='1' value='10'/> <sibling id='2' value='21'/> <sibling id='3' value='31'/> </distances> </cell> <cell id='2' cpus='3-4' memory='2097152' unit='KiB'> <distances> <sibling id='0' value='31'/> <sibling id='1' value='21'/> <sibling id='2' value='10'/> <sibling id='3' value='21'/> </distances> </cell> <cell id='3' cpus='5-6' memory='2097152' unit='KiB'> <distances> <sibling id='0' value='41'/> <sibling id='1' value='31'/> <sibling id='2' value='21'/> <sibling id='3' value='10'/> </distances> </cell> </numa> </cpu> Xen xl.cfg domain configuration: vnuma = [["pnode=0","size=2048","vcpus=0-1","vdistances=10,21,31,41"], ["pnode=1","size=2048","vcpus=2-3","vdistances=21,10,21,31"], ["pnode=2","size=2048","vcpus=4-5","vdistances=31,21,10,21"], ["pnode=3","size=2048","vcpus=6-7","vdistances=41,31,21,10"]] If there is no XML <distances> description amongst the <cell> data the conversion schema from xml to native will generate 10 for local and 21 for all remote instances. Signed-off-by: Wim ten Have <wim.ten.have@oracle.com> --- Changes on v2: - Reduce the indentation level under xenParseXLVnuma(). --- src/xenconfig/xen_xl.c | 333 +++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 333 insertions(+) diff --git a/src/xenconfig/xen_xl.c b/src/xenconfig/xen_xl.c index d168d3f..b72ada9 100644 --- a/src/xenconfig/xen_xl.c +++ b/src/xenconfig/xen_xl.c @@ -309,6 +309,205 @@ xenParseXLSpice(virConfPtr conf, virDomainDefPtr def) return -1; } +#ifdef LIBXL_HAVE_VNUMA +static int +xenParseXLVnuma(virConfPtr conf, virDomainDefPtr def) +{ + int ret = -1; + char *tmp = NULL; + char **token = NULL; + size_t vcpus = 0; + size_t nr_nodes = 0; + size_t vnodeCnt = 0; + virCPUDefPtr cpu = NULL; + virConfValuePtr list; + virConfValuePtr vnode; + virDomainNumaPtr numa; + + numa = def->numa; + if (numa == NULL) + return -1; + + list = virConfGetValue(conf, "vnuma"); + if (!list || list->type != VIR_CONF_LIST) + return 0; + + vnode = list->list; + while (vnode && vnode->type == VIR_CONF_LIST) { + vnode = vnode->next; + nr_nodes++; + } + + if (!virDomainNumaSetNodeCount(numa, nr_nodes)) + goto cleanup; + + if (VIR_ALLOC(cpu) < 0) + goto cleanup; + + list = list->list; + while (list) { + int pnode = -1; + virBitmapPtr cpumask = NULL; + unsigned long long kbsize = 0; + + /* Is there a sublist (vnode)? */ + if (list && list->type == VIR_CONF_LIST) { + vnode = list->list; + + while (vnode && vnode->type == VIR_CONF_STRING) { + const char *data; + const char *str = vnode->str; + + if (!str || + !(data = strrchr(str, '='))) { + virReportError(VIR_ERR_INTERNAL_ERROR, + _("vnuma vnode invalid format '%s'"), + str); + goto cleanup; + } + data++; + + if (*data) { + size_t len; + char vtoken[64]; + + if (STRPREFIX(str, "pnode")) { + unsigned int cellid; + + len = strlen(data); + if (!virStrncpy(vtoken, data, + len, sizeof(vtoken))) { + virReportError(VIR_ERR_INTERNAL_ERROR, + _("vnuma vnode %zu pnode '%s' too long for destination"), + vnodeCnt, data); + goto cleanup; + } + + if ((virStrToLong_ui(vtoken, NULL, 10, &cellid) < 0) || + (cellid >= nr_nodes)) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("vnuma vnode %zu invalid value pnode '%s'"), + vnodeCnt, data); + goto cleanup; + } + pnode = cellid; + } else if (STRPREFIX(str, "size")) { + len = strlen(data); + if (!virStrncpy(vtoken, data, + len, sizeof(vtoken))) { + virReportError(VIR_ERR_INTERNAL_ERROR, + _("vnuma vnode %zu size '%s' too long for destination"), + vnodeCnt, data); + goto cleanup; + } + + if (virStrToLong_ull(vtoken, NULL, 10, &kbsize) < 0) + goto cleanup; + + virDomainNumaSetNodeMemorySize(numa, vnodeCnt, (kbsize * 1024)); + + } else if (STRPREFIX(str, "vcpus")) { + len = strlen(data); + if (!virStrncpy(vtoken, data, + len, sizeof(vtoken))) { + virReportError(VIR_ERR_INTERNAL_ERROR, + _("vnuma vnode %zu vcpus '%s' too long for destination"), + vnodeCnt, data); + goto cleanup; + } + + if ((virBitmapParse(vtoken, &cpumask, VIR_DOMAIN_CPUMASK_LEN) < 0) || + (virDomainNumaSetNodeCpumask(numa, vnodeCnt, cpumask) == NULL)) + goto cleanup; + + vcpus += virBitmapCountBits(cpumask); + + } else if (STRPREFIX(str, "vdistances")) { + size_t i, ndistances; + unsigned int value; + + len = strlen(data); + if (!virStrncpy(vtoken, data, + len, sizeof(vtoken))) { + virReportError(VIR_ERR_INTERNAL_ERROR, + _("vnuma vnode %zu vdistances '%s' too long for destination"), + vnodeCnt, data); + goto cleanup; + } + + if (VIR_STRDUP(tmp, vtoken) < 0) + goto cleanup; + + if (!(token = virStringSplitCount(tmp, ",", 0, &ndistances))) + goto cleanup; + + if (ndistances != nr_nodes) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("vnuma pnode %d configured '%s' (count %zu) doesn't fit the number of specified vnodes %zu"), + pnode, str, ndistances, nr_nodes); + goto cleanup; + } + + if (virDomainNumaSetNodeDistanceCount(numa, vnodeCnt, ndistances) != ndistances) + goto cleanup; + + for (i = 0; i < ndistances; i++) { + if ((virStrToLong_ui(token[i], NULL, 10, &value) < 0) || + (virDomainNumaSetNodeDistance(numa, vnodeCnt, i, value) != value)) + goto cleanup; + } + + } else { + virReportError(VIR_ERR_CONF_SYNTAX, + _("vnuma vnode %zu invalid token '%s'"), + vnodeCnt, str); + goto cleanup; + } + } + vnode = vnode->next; + } + } + + if ((pnode < 0) || + (cpumask == NULL) || + (kbsize == 0)) { + virReportError(VIR_ERR_CONF_SYNTAX, + _("vnuma vnode %zu incomplete token. Missing%s%s%s"), + vnodeCnt, + (pnode < 0) ? " \'pnode\'":"", + (cpumask == NULL) ? " \'vcpus\'":"", + (kbsize == 0) ? " \'size\'":""); + goto cleanup; + } + + list = list->next; + vnodeCnt++; + } + + if (def->maxvcpus == 0) + def->maxvcpus = vcpus; + + if (def->maxvcpus < vcpus) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("vnuma requested vcpus %zu fails available maxvcpus %zu"), + vcpus, def->maxvcpus); + goto cleanup; + } + + cpu->type = VIR_CPU_TYPE_GUEST; + def->cpu = cpu; + + ret = 0; + + cleanup: + if (ret) + VIR_FREE(cpu); + virStringListFree(token); + VIR_FREE(tmp); + + return ret; +} +#endif static int xenParseXLDiskSrc(virDomainDiskDefPtr disk, char *srcstr) @@ -863,6 +1062,11 @@ xenParseXL(virConfPtr conf, if (xenParseXLOS(conf, def, caps) < 0) goto cleanup; +#ifdef LIBXL_HAVE_VNUMA + if (xenParseXLVnuma(conf, def) < 0) + goto cleanup; +#endif + if (xenParseXLDisk(conf, def) < 0) goto cleanup; @@ -1005,6 +1209,130 @@ xenFormatXLOS(virConfPtr conf, virDomainDefPtr def) return 0; } +#ifdef LIBXL_HAVE_VNUMA +static int +xenFormatXLVnode(virConfValuePtr list, virBufferPtr buf) +{ + int ret = -1; + virConfValuePtr numaPnode, tmp; + + if (virBufferCheckError(buf) < 0) + goto cleanup; + + if (VIR_ALLOC(numaPnode) < 0) + goto cleanup; + + /* Place VNODE directive */ + numaPnode->type = VIR_CONF_STRING; + numaPnode->str = virBufferContentAndReset(buf); + + tmp = list->list; + while (tmp && tmp->next) + tmp = tmp->next; + if (tmp) + tmp->next = numaPnode; + else + list->list = numaPnode; + ret = 0; + + cleanup: + virBufferFreeAndReset(buf); + return ret; +} + +static int ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2) +xenFormatXLVnuma(virConfValuePtr list, + virDomainNumaPtr numa, size_t node, size_t nr_nodes) +{ + int ret = -1; + size_t i; + + virBuffer buf = VIR_BUFFER_INITIALIZER; + virConfValuePtr numaVnode, tmp; + + size_t nodeSize = virDomainNumaGetNodeMemorySize(numa, node) / 1024; + char *nodeVcpus = virBitmapFormat(virDomainNumaGetNodeCpumask(numa, node)); + + if (VIR_ALLOC(numaVnode) < 0) + goto cleanup; + + numaVnode->type = VIR_CONF_LIST; + numaVnode->list = NULL; + + /* pnode */ + virBufferAsprintf(&buf, "pnode=%ld", node); + xenFormatXLVnode(numaVnode, &buf); + + /* size */ + virBufferAsprintf(&buf, "size=%ld", nodeSize); + xenFormatXLVnode(numaVnode, &buf); + + /* vcpus */ + virBufferAsprintf(&buf, "vcpus=%s", nodeVcpus); + xenFormatXLVnode(numaVnode, &buf); + + /* distances */ + virBufferAddLit(&buf, "vdistances="); + for (i = 0; i < nr_nodes; i++) { + virBufferAsprintf(&buf, "%zu", + virDomainNumaGetNodeDistance(numa, node, i)); + if ((nr_nodes - i) > 1) + virBufferAddLit(&buf, ","); + } + xenFormatXLVnode(numaVnode, &buf); + + tmp = list->list; + while (tmp && tmp->next) + tmp = tmp->next; + if (tmp) + tmp->next = numaVnode; + else + list->list = numaVnode; + ret = 0; + + cleanup: + VIR_FREE(nodeVcpus); + return ret; +} + +static int +xenFormatXLDomainVnuma(virConfPtr conf, virDomainDefPtr def) +{ + virDomainNumaPtr numa = def->numa; + virConfValuePtr vnumaVal; + size_t i; + size_t nr_nodes; + + if (numa == NULL) + return -1; + + if (VIR_ALLOC(vnumaVal) < 0) + return -1; + + vnumaVal->type = VIR_CONF_LIST; + vnumaVal->list = NULL; + + nr_nodes = virDomainNumaGetNodeCount(numa); + for (i = 0; i < nr_nodes; i++) { + if (xenFormatXLVnuma(vnumaVal, numa, i, nr_nodes) < 0) + goto cleanup; + } + + if (vnumaVal->list != NULL) { + int ret = virConfSetValue(conf, "vnuma", vnumaVal); + vnumaVal = NULL; + if (ret < 0) + return -1; + } + VIR_FREE(vnumaVal); + + return 0; + + cleanup: + virConfFreeValue(vnumaVal); + return -1; +} +#endif static char * xenFormatXLDiskSrcNet(virStorageSourcePtr src) @@ -1641,6 +1969,11 @@ xenFormatXL(virDomainDefPtr def, virConnectPtr conn) if (xenFormatXLOS(conf, def) < 0) goto cleanup; +#ifdef LIBXL_HAVE_VNUMA + if (xenFormatXLDomainVnuma(conf, def) < 0) + goto cleanup; +#endif + if (xenFormatXLDomainDisks(conf, def) < 0) goto cleanup; -- 2.9.5

From: Wim ten Have <wim.ten.have@oracle.com> Test a bidirectional xen-xl domxml to and from native for numa support administration as brought under this patch series. Signed-off-by: Wim ten Have <wim.ten.have@oracle.com> --- .../test-fullvirt-vnuma-nodistances.cfg | 26 +++++++ .../test-fullvirt-vnuma-nodistances.xml | 53 ++++++++++++++ tests/xlconfigdata/test-fullvirt-vnuma.cfg | 26 +++++++ tests/xlconfigdata/test-fullvirt-vnuma.xml | 81 ++++++++++++++++++++++ tests/xlconfigtest.c | 4 ++ 5 files changed, 190 insertions(+) create mode 100644 tests/xlconfigdata/test-fullvirt-vnuma-nodistances.cfg create mode 100644 tests/xlconfigdata/test-fullvirt-vnuma-nodistances.xml create mode 100644 tests/xlconfigdata/test-fullvirt-vnuma.cfg create mode 100644 tests/xlconfigdata/test-fullvirt-vnuma.xml diff --git a/tests/xlconfigdata/test-fullvirt-vnuma-nodistances.cfg b/tests/xlconfigdata/test-fullvirt-vnuma-nodistances.cfg new file mode 100644 index 0000000..8186edf --- /dev/null +++ b/tests/xlconfigdata/test-fullvirt-vnuma-nodistances.cfg @@ -0,0 +1,26 @@ +name = "XenGuest2" +uuid = "c7a5fdb2-cdaf-9455-926a-d65c16db1809" +maxmem = 8192 +memory = 8192 +vcpus = 8 +pae = 1 +acpi = 1 +apic = 1 +viridian = 0 +rtc_timeoffset = 0 +localtime = 0 +on_poweroff = "destroy" +on_reboot = "restart" +on_crash = "restart" +device_model = "/usr/lib/xen/bin/qemu-system-i386" +sdl = 0 +vnc = 1 +vncunused = 1 +vnclisten = "127.0.0.1" +vif = [ "mac=00:16:3e:66:92:9c,bridge=xenbr1,script=vif-bridge,model=e1000" ] +parallel = "none" +serial = "none" +builder = "hvm" +boot = "d" +vnuma = [ [ "pnode=0", "size=2048", "vcpus=0-1", "vdistances=10,20,20,20" ], [ "pnode=1", "size=2048", "vcpus=2-3", "vdistances=20,10,20,20" ], [ "pnode=2", "size=2048", "vcpus=4-5", "vdistances=20,20,10,20" ], [ "pnode=3", "size=2048", "vcpus=6-7", "vdistances=20,20,20,10" ] ] +disk = [ "format=raw,vdev=hda,access=rw,backendtype=phy,target=/dev/HostVG/XenGuest2" ] diff --git a/tests/xlconfigdata/test-fullvirt-vnuma-nodistances.xml b/tests/xlconfigdata/test-fullvirt-vnuma-nodistances.xml new file mode 100644 index 0000000..9cab3ca --- /dev/null +++ b/tests/xlconfigdata/test-fullvirt-vnuma-nodistances.xml @@ -0,0 +1,53 @@ +<domain type='xen'> + <name>XenGuest2</name> + <uuid>c7a5fdb2-cdaf-9455-926a-d65c16db1809</uuid> + <memory unit='KiB'>8388608</memory> + <currentMemory unit='KiB'>8388608</currentMemory> + <vcpu placement='static'>8</vcpu> + <os> + <type arch='x86_64' machine='xenfv'>hvm</type> + <loader type='rom'>/usr/lib/xen/boot/hvmloader</loader> + <boot dev='cdrom'/> + </os> + <features> + <acpi/> + <apic/> + <pae/> + </features> + <cpu> + <numa> + <cell id='0' cpus='0-1' memory='2097152' unit='KiB'/> + <cell id='1' cpus='2-3' memory='2097152' unit='KiB'/> + <cell id='2' cpus='4-5' memory='2097152' unit='KiB'/> + <cell id='3' cpus='6-7' memory='2097152' unit='KiB'/> + </numa> + </cpu> + <clock offset='variable' adjustment='0' basis='utc'/> + <on_poweroff>destroy</on_poweroff> + <on_reboot>restart</on_reboot> + <on_crash>restart</on_crash> + <devices> + <emulator>/usr/lib/xen/bin/qemu-system-i386</emulator> + <disk type='block' device='disk'> + <driver name='phy' type='raw'/> + <source dev='/dev/HostVG/XenGuest2'/> + <target dev='hda' bus='ide'/> + <address type='drive' controller='0' bus='0' target='0' unit='0'/> + </disk> + <controller type='ide' index='0'/> + <interface type='bridge'> + <mac address='00:16:3e:66:92:9c'/> + <source bridge='xenbr1'/> + <script path='vif-bridge'/> + <model type='e1000'/> + </interface> + <input type='mouse' bus='ps2'/> + <input type='keyboard' bus='ps2'/> + <graphics type='vnc' port='-1' autoport='yes' listen='127.0.0.1'> + <listen type='address' address='127.0.0.1'/> + </graphics> + <video> + <model type='cirrus' vram='8192' heads='1' primary='yes'/> + </video> + </devices> +</domain> diff --git a/tests/xlconfigdata/test-fullvirt-vnuma.cfg b/tests/xlconfigdata/test-fullvirt-vnuma.cfg new file mode 100644 index 0000000..91e233a --- /dev/null +++ b/tests/xlconfigdata/test-fullvirt-vnuma.cfg @@ -0,0 +1,26 @@ +name = "XenGuest2" +uuid = "c7a5fdb2-cdaf-9455-926a-d65c16db1809" +maxmem = 8192 +memory = 8192 +vcpus = 8 +pae = 1 +acpi = 1 +apic = 1 +viridian = 0 +rtc_timeoffset = 0 +localtime = 0 +on_poweroff = "destroy" +on_reboot = "restart" +on_crash = "restart" +device_model = "/usr/lib/xen/bin/qemu-system-i386" +sdl = 0 +vnc = 1 +vncunused = 1 +vnclisten = "127.0.0.1" +vif = [ "mac=00:16:3e:66:92:9c,bridge=xenbr1,script=vif-bridge,model=e1000" ] +parallel = "none" +serial = "none" +builder = "hvm" +boot = "d" +vnuma = [ [ "pnode=0", "size=2048", "vcpus=0-1", "vdistances=10,21,31,41" ], [ "pnode=1", "size=2048", "vcpus=2-3", "vdistances=21,10,21,31" ], [ "pnode=2", "size=2048", "vcpus=4-5", "vdistances=31,21,10,21" ], [ "pnode=3", "size=2048", "vcpus=6-7", "vdistances=41,31,21,10" ] ] +disk = [ "format=raw,vdev=hda,access=rw,backendtype=phy,target=/dev/HostVG/XenGuest2" ] diff --git a/tests/xlconfigdata/test-fullvirt-vnuma.xml b/tests/xlconfigdata/test-fullvirt-vnuma.xml new file mode 100644 index 0000000..5368b0d --- /dev/null +++ b/tests/xlconfigdata/test-fullvirt-vnuma.xml @@ -0,0 +1,81 @@ +<domain type='xen'> + <name>XenGuest2</name> + <uuid>c7a5fdb2-cdaf-9455-926a-d65c16db1809</uuid> + <memory unit='KiB'>8388608</memory> + <currentMemory unit='KiB'>8388608</currentMemory> + <vcpu placement='static'>8</vcpu> + <os> + <type arch='x86_64' machine='xenfv'>hvm</type> + <loader type='rom'>/usr/lib/xen/boot/hvmloader</loader> + <boot dev='cdrom'/> + </os> + <features> + <acpi/> + <apic/> + <pae/> + </features> + <cpu> + <numa> + <cell id='0' cpus='0-1' memory='2097152' unit='KiB'> + <distances> + <sibling id='0' value='10'/> + <sibling id='1' value='21'/> + <sibling id='2' value='31'/> + <sibling id='3' value='41'/> + </distances> + </cell> + <cell id='1' cpus='2-3' memory='2097152' unit='KiB'> + <distances> + <sibling id='0' value='21'/> + <sibling id='1' value='10'/> + <sibling id='2' value='21'/> + <sibling id='3' value='31'/> + </distances> + </cell> + <cell id='2' cpus='4-5' memory='2097152' unit='KiB'> + <distances> + <sibling id='0' value='31'/> + <sibling id='1' value='21'/> + <sibling id='2' value='10'/> + <sibling id='3' value='21'/> + </distances> + </cell> + <cell id='3' cpus='6-7' memory='2097152' unit='KiB'> + <distances> + <sibling id='0' value='41'/> + <sibling id='1' value='31'/> + <sibling id='2' value='21'/> + <sibling id='3' value='10'/> + </distances> + </cell> + </numa> + </cpu> + <clock offset='variable' adjustment='0' basis='utc'/> + <on_poweroff>destroy</on_poweroff> + <on_reboot>restart</on_reboot> + <on_crash>restart</on_crash> + <devices> + <emulator>/usr/lib/xen/bin/qemu-system-i386</emulator> + <disk type='block' device='disk'> + <driver name='phy' type='raw'/> + <source dev='/dev/HostVG/XenGuest2'/> + <target dev='hda' bus='ide'/> + <address type='drive' controller='0' bus='0' target='0' unit='0'/> + </disk> + <controller type='ide' index='0'/> + <interface type='bridge'> + <mac address='00:16:3e:66:92:9c'/> + <source bridge='xenbr1'/> + <script path='vif-bridge'/> + <model type='e1000'/> + </interface> + <input type='mouse' bus='ps2'/> + <input type='keyboard' bus='ps2'/> + <graphics type='vnc' port='-1' autoport='yes' listen='127.0.0.1'> + <listen type='address' address='127.0.0.1'/> + </graphics> + <video> + <model type='cirrus' vram='8192' heads='1' primary='yes'/> + </video> + </devices> +</domain> diff --git a/tests/xlconfigtest.c b/tests/xlconfigtest.c index 30468c9..4605d34 100644 --- a/tests/xlconfigtest.c +++ b/tests/xlconfigtest.c @@ -270,6 +270,10 @@ mymain(void) DO_TEST("fullvirt-multi-timer"); DO_TEST("fullvirt-nestedhvm"); DO_TEST("fullvirt-nestedhvm-disabled"); +#ifdef LIBXL_HAVE_VNUMA + DO_TEST("fullvirt-vnuma"); + DO_TEST_PARSE("fullvirt-vnuma-nodistances", false); +#endif DO_TEST("paravirt-cmdline"); DO_TEST_FORMAT("paravirt-cmdline-extra-root", false); -- 2.9.5

Hi Wim, I'll be away for a few weeks and won't be able to review this in detail until later in the month. I see Martin provided some feedback on patch1, which is awesome since I'd prefer a broader agreement on that patch than my single 'ack'. BTW, the new code in patch2 can also be tested now that we have domXML <-> libxl_domain_config conversion tests :-). See tests/libxlxml2domconfigtest.c Regards, Jim On 08/31/2017 08:02 AM, Wim Ten Have wrote:
From: Wim ten Have <wim.ten.have@oracle.com>
This patch extents guest domain administration adding support to advertise node sibling distances when configuring HVM numa guests.
NUMA (non-uniform memory access), a method of configuring a cluster of nodes within a single multiprocessing system such that it shares processor local memory amongst others improving performance and the ability of the system to be expanded.
A NUMA system could be illustrated as shown below. Within this 4-node system, every socket is equipped with its own distinct memory. The whole typically resembles a SMP (symmetric multiprocessing) system being a "tightly-coupled," "share everything" system in which multiple processors are working under a single operating system and can access each others' memory over multiple "Bus Interconnect" paths.
+-----+-----+-----+ +-----+-----+-----+ | M | CPU | CPU | | CPU | CPU | M | | E | | | | | | E | | M +- Socket0 -+ +- Socket3 -+ M | | O | | | | | | O | | R | CPU | CPU <---------> CPU | CPU | R | | Y | | | | | | Y | +-----+--^--+-----+ +-----+--^--+-----+ | | | Bus Interconnect | | | +-----+--v--+-----+ +-----+--v--+-----+ | M | | | | | | M | | E | CPU | CPU <---------> CPU | CPU | E | | M | | | | | | M | | O +- Socket1 -+ +- Socket2 -+ O | | R | | | | | | R | | Y | CPU | CPU | | CPU | CPU | Y | +-----+-----+-----+ +-----+-----+-----+
In contrast there is the limitation of a flat SMP system, not illustrated. Here, as sockets are added, the bus (data and address path), under high activity, gets overloaded and easily becomes a performance bottleneck. NUMA adds an intermediate level of memory shared amongst a few cores per socket as illustrated above, so that data accesses do not have to travel over a single bus.
Unfortunately the way NUMA does this adds its own limitations. This, as visualized in the illustration above, happens when data is stored in memory associated with Socket2 and is accessed by a CPU (core) in Socket0. The processors use the "Bus Interconnect" to create gateways between the sockets (nodes) enabling inter-socket access to memory. These "Bus Interconnect" hops add data access delays when a CPU (core) accesses memory associated with a remote socket (node).
For terminology we refer to sockets as "nodes" where access to each others' distinct resources such as memory make them "siblings" with a designated "distance" between them. A specific design is described under the ACPI (Advanced Configuration and Power Interface Specification) within the chapter explaining the system's SLIT (System Locality Distance Information Table).
These patches extend core libvirt's XML description of a virtual machine's hardware to include NUMA distance information for sibling nodes, which is then passed to Xen guests via libxl. Recently qemu landed support for constructing the SLIT since commit 0f203430dd ("numa: Allow setting NUMA distance for different NUMA nodes"), hence these core libvirt extensions can also help other drivers in supporting this feature.
The XML changes made allow to describe the <cell> (or node/sockets) <distances> amongst <sibling> node identifiers and propagate these towards the numa domain functionality finally adding support to libxl.
[below is an example illustrating a 4 node/socket <cell> setup]
<cpu> <numa> <cell id='0' cpus='0,4-7' memory='2097152' unit='KiB'> <distances> <sibling id='0' value='10'/> <sibling id='1' value='21'/> <sibling id='2' value='31'/> <sibling id='3' value='41'/> </distances> </cell> <cell id='1' cpus='1,8-10,12-15' memory='2097152' unit='KiB'> <distances> <sibling id='0' value='21'/> <sibling id='1' value='10'/> <sibling id='2' value='21'/> <sibling id='3' value='31'/> </distances> </cell> <cell id='2' cpus='2,11' memory='2097152' unit='KiB'> <distances> <sibling id='0' value='31'/> <sibling id='1' value='21'/> <sibling id='2' value='10'/> <sibling id='3' value='21'/> </distances> </cell> <cell id='3' cpus='3' memory='2097152' unit='KiB'> <distances> <sibling id='0' value='41'/> <sibling id='1' value='31'/> <sibling id='2' value='21'/> <sibling id='3' value='10'/> </distances> </cell> </numa> </cpu>
By default on libxl, if no <distances> are given to describe the SLIT data between different <cell>s, this patch will default to a scheme using 10 for local and 21 for any remote node/socket, which is the assumption of guest OS when no SLIT is specified. While SLIT is optional, libxl requires that distances are set nonetheless.
On Linux systems the SLIT detail can be listed with help of the 'numactl -H' command. An above HVM guest as described would on such prompt with below output.
[root@f25 ~]# numactl -H available: 4 nodes (0-3) node 0 cpus: 0 4 5 6 7 node 0 size: 1988 MB node 0 free: 1743 MB node 1 cpus: 1 8 9 10 12 13 14 15 node 1 size: 1946 MB node 1 free: 1885 MB node 2 cpus: 2 11 node 2 size: 2011 MB node 2 free: 1912 MB node 3 cpus: 3 node 3 size: 2010 MB node 3 free: 1980 MB node distances: node 0 1 2 3 0: 10 21 31 41 1: 21 10 21 31 2: 31 21 10 21 3: 41 31 21 10
Wim ten Have (4): numa: describe siblings distances within cells libxl: vnuma support xenconfig: add domxml conversions for xen-xl xlconfigtest: add tests for numa cell sibling distances
docs/formatdomain.html.in | 70 ++++- docs/schemas/basictypes.rng | 9 + docs/schemas/cputypes.rng | 18 ++ src/conf/cpu_conf.c | 2 +- src/conf/numa_conf.c | 323 +++++++++++++++++++- src/conf/numa_conf.h | 25 +- src/libvirt_private.syms | 6 + src/libxl/libxl_conf.c | 120 ++++++++ src/libxl/libxl_driver.c | 3 +- src/xenconfig/xen_xl.c | 333 +++++++++++++++++++++ .../test-fullvirt-vnuma-nodistances.cfg | 26 ++ .../test-fullvirt-vnuma-nodistances.xml | 53 ++++ tests/xlconfigdata/test-fullvirt-vnuma.cfg | 26 ++ tests/xlconfigdata/test-fullvirt-vnuma.xml | 81 +++++ tests/xlconfigtest.c | 4 + 15 files changed, 1089 insertions(+), 10 deletions(-) create mode 100644 tests/xlconfigdata/test-fullvirt-vnuma-nodistances.cfg create mode 100644 tests/xlconfigdata/test-fullvirt-vnuma-nodistances.xml create mode 100644 tests/xlconfigdata/test-fullvirt-vnuma.cfg create mode 100644 tests/xlconfigdata/test-fullvirt-vnuma.xml
participants (4)
-
Jim Fehlig
-
Martin Kletzander
-
Wim Ten Have
-
Wim ten Have