[PATCH v2 0/4] NUMA CPUs 'auto-fill' for incomplete topologies

changes in v2:
- removed patch 5/5

Gitlab link: https://gitlab.com/danielhb/libvirt/-/tree/vcpus_numa_v2
v1 link: https://www.redhat.com/archives/libvir-list/2020-June/msg00016.html

Daniel Henrique Barboza (4):
  numa_conf.c: add helper functions for cpumap operations
  qemu_domain.c: NUMA CPUs auto-fill for incomplete topologies
  qemuxml2xmltest.c: add NUMA vcpus auto fill tests
  formatdomain.html.in: document the NUMA cpus auto fill feature

 docs/formatdomain.html.in                     | 11 ++++-
 src/conf/numa_conf.c                          | 46 ++++++++++++++++++
 src/conf/numa_conf.h                          |  3 ++
 src/libvirt_private.syms                      |  1 +
 src/qemu/qemu_domain.c                        | 47 +++++++++++++++++++
 src/qemu/qemu_domain.h                        |  4 ++
 src/qemu/qemu_driver.c                        |  9 ++++
 .../numavcpus-topology-mismatch.xml           | 37 +++++++++++++++
 ...avcpus-topology-mismatch.x86_64-latest.xml | 38 +++++++++++++++
 tests/qemuxml2xmltest.c                       |  1 +
 10 files changed, 196 insertions(+), 1 deletion(-)
 create mode 100644 tests/qemuxml2argvdata/numavcpus-topology-mismatch.xml
 create mode 100644 tests/qemuxml2xmloutdata/numavcpus-topology-mismatch.x86_64-latest.xml

--
2.26.2

These helpers will be used in an auto-fill feature for incomplete NUMA topologies in the next patch.

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 src/conf/numa_conf.c     | 46 ++++++++++++++++++++++++++++++++++++++++
 src/conf/numa_conf.h     |  3 +++
 src/libvirt_private.syms |  1 +
 3 files changed, 50 insertions(+)

diff --git a/src/conf/numa_conf.c b/src/conf/numa_conf.c
index 09811cb51b..d022685284 100644
--- a/src/conf/numa_conf.c
+++ b/src/conf/numa_conf.c
@@ -1372,3 +1372,49 @@ virDomainNumaGetMemorySize(virDomainNumaPtr numa)
 
     return ret;
 }
+
+
+static int
+virDomainNumaRemoveCPUsFromMap(virBitmapPtr result, virBitmapPtr exclude)
+{
+    size_t i;
+
+    for (i = 0; i < virBitmapSize(exclude); i++) {
+        if (!virBitmapIsBitSet(exclude, i))
+            continue;
+
+        if (virBitmapClearBitExpand(result, i) < 0)
+            return -1;
+    }
+
+    return 0;
+}
+
+
+int
+virDomainNumaFillCPUsInNode(virDomainNumaPtr numa, size_t node,
+                            unsigned int maxCpus)
+{
+    g_autoptr(virBitmap) maxCPUsBitmap = virBitmapNew(maxCpus);
+    size_t i;
+
+    if (node >= virDomainNumaGetNodeCount(numa))
+        return -1;
+
+    virBitmapSetAll(maxCPUsBitmap);
+
+    for (i = 0; i < numa->nmem_nodes; i++) {
+        virBitmapPtr nodeCpus = virDomainNumaGetNodeCpumask(numa, i);
+
+        if (i == node)
+            continue;
+
+        if (virDomainNumaRemoveCPUsFromMap(maxCPUsBitmap, nodeCpus) < 0)
+            return -1;
+    }
+
+    virBitmapFree(numa->mem_nodes[node].cpumask);
+    numa->mem_nodes[node].cpumask = g_steal_pointer(&maxCPUsBitmap);
+
+    return 0;
+}
diff --git a/src/conf/numa_conf.h b/src/conf/numa_conf.h
index ae3599bb8b..cdf87a87e8 100644
--- a/src/conf/numa_conf.h
+++ b/src/conf/numa_conf.h
@@ -185,3 +185,6 @@ int virDomainNumaDefCPUParseXML(virDomainNumaPtr def, xmlXPathContextPtr ctxt);
 int virDomainNumaDefCPUFormatXML(virBufferPtr buf, virDomainNumaPtr def);
 
 unsigned int virDomainNumaGetCPUCountTotal(virDomainNumaPtr numa);
+
+int virDomainNumaFillCPUsInNode(virDomainNumaPtr numa, size_t node,
+                                unsigned int maxCpus);
diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms
index a6af44fe1c..788e08045a 100644
--- a/src/libvirt_private.syms
+++ b/src/libvirt_private.syms
@@ -834,6 +834,7 @@ virDomainMemoryAccessTypeFromString;
 virDomainMemoryAccessTypeToString;
 virDomainNumaCheckABIStability;
 virDomainNumaEquals;
+virDomainNumaFillCPUsInNode;
 virDomainNumaFree;
 virDomainNumaGetCPUCountTotal;
 virDomainNumaGetMaxCPUID;
--
2.26.2
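To make the arithmetic concrete, take the test case added later in this series: <vcpu> is 12 and cell 1 owns cpus='2-4,6'. virDomainNumaFillCPUsInNode(numa, 0, 12) starts from an all-set bitmap and clears every CPU owned by the other cells, so node 0 ends up with the remainder:

  all-set bitmap (0..maxCpus-1):  0-11
  minus cell 1's cpus:            2-4,6
  new node 0 cpumask:             0-1,5,7-11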

On 6/10/20 8:35 PM, Daniel Henrique Barboza wrote:
These helpers will be used in an auto-fill feature for incomplete NUMA topologies in the next patch.
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 src/conf/numa_conf.c     | 46 ++++++++++++++++++++++++++++++++++++++++
 src/conf/numa_conf.h     |  3 +++
 src/libvirt_private.syms |  1 +
 3 files changed, 50 insertions(+)
diff --git a/src/conf/numa_conf.c b/src/conf/numa_conf.c
index 09811cb51b..d022685284 100644
--- a/src/conf/numa_conf.c
+++ b/src/conf/numa_conf.c
@@ -1372,3 +1372,49 @@ virDomainNumaGetMemorySize(virDomainNumaPtr numa)
     return ret;
 }
+
+
+static int
+virDomainNumaRemoveCPUsFromMap(virBitmapPtr result, virBitmapPtr exclude)
+{
+    size_t i;
+
+    for (i = 0; i < virBitmapSize(exclude); i++) {
+        if (!virBitmapIsBitSet(exclude, i))
+            continue;
+
+        if (virBitmapClearBitExpand(result, i) < 0)
+            return -1;
+    }
+
+    return 0;
+}
+
This is exactly what virBitmapSubtract() does :-)
+
+int
+virDomainNumaFillCPUsInNode(virDomainNumaPtr numa, size_t node,
+                            unsigned int maxCpus)
+{
+    g_autoptr(virBitmap) maxCPUsBitmap = virBitmapNew(maxCpus);
+    size_t i;
+
+    if (node >= virDomainNumaGetNodeCount(numa))
+        return -1;
+
+    virBitmapSetAll(maxCPUsBitmap);
+
+    for (i = 0; i < numa->nmem_nodes; i++) {
+        virBitmapPtr nodeCpus = virDomainNumaGetNodeCpumask(numa, i);
+
+        if (i == node)
+            continue;
+
+        if (virDomainNumaRemoveCPUsFromMap(maxCPUsBitmap, nodeCpus) < 0)
+            return -1;
+    }
+
+    virBitmapFree(numa->mem_nodes[node].cpumask);
+    numa->mem_nodes[node].cpumask = g_steal_pointer(&maxCPUsBitmap);
For some weird reason, I'd feel better if the bitmap were replaced only if it differs. I can't really explain why.

Michal

On 6/18/20 7:34 AM, Michal Privoznik wrote:
On 6/10/20 8:35 PM, Daniel Henrique Barboza wrote:
These helpers will be used in an auto-fill feature for incomplete NUMA topologies in the next patch.
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 src/conf/numa_conf.c     | 46 ++++++++++++++++++++++++++++++++++++++++
 src/conf/numa_conf.h     |  3 +++
 src/libvirt_private.syms |  1 +
 3 files changed, 50 insertions(+)
diff --git a/src/conf/numa_conf.c b/src/conf/numa_conf.c
index 09811cb51b..d022685284 100644
--- a/src/conf/numa_conf.c
+++ b/src/conf/numa_conf.c
@@ -1372,3 +1372,49 @@ virDomainNumaGetMemorySize(virDomainNumaPtr numa)
     return ret;
 }
+
+
+static int
+virDomainNumaRemoveCPUsFromMap(virBitmapPtr result, virBitmapPtr exclude)
+{
+    size_t i;
+
+    for (i = 0; i < virBitmapSize(exclude); i++) {
+        if (!virBitmapIsBitSet(exclude, i))
+            continue;
+
+        if (virBitmapClearBitExpand(result, i) < 0)
+            return -1;
+    }
+
+    return 0;
+}
+
This is exactly what virBitmapSubtract() does :-)
I completely forgot about virBitmapSubtract() after I tried to use it in an earlier iteration of the code and it didn't suit what I needed. Granted, I was doing things in a far more convoluted way back then. virBitmapSubtract() really works as a replacement in the code below.
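For reference, the loop body then collapses to a single call, something like this (a sketch; it assumes virBitmapSubtract() keeps its in-place "clear in the first map every bit set in the second" semantics and cannot fail):

    for (i = 0; i < numa->nmem_nodes; i++) {
        virBitmapPtr nodeCpus = virDomainNumaGetNodeCpumask(numa, i);

        if (i == node)
            continue;

        /* drop every CPU already owned by node i from the candidate map */
        virBitmapSubtract(maxCPUsBitmap, nodeCpus);
    }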
+
+int
+virDomainNumaFillCPUsInNode(virDomainNumaPtr numa, size_t node,
+                            unsigned int maxCpus)
+{
+    g_autoptr(virBitmap) maxCPUsBitmap = virBitmapNew(maxCpus);
+    size_t i;
+
+    if (node >= virDomainNumaGetNodeCount(numa))
+        return -1;
+
+    virBitmapSetAll(maxCPUsBitmap);
+
+    for (i = 0; i < numa->nmem_nodes; i++) {
+        virBitmapPtr nodeCpus = virDomainNumaGetNodeCpumask(numa, i);
+
+        if (i == node)
+            continue;
+
+        if (virDomainNumaRemoveCPUsFromMap(maxCPUsBitmap, nodeCpus) < 0)
+            return -1;
+    }
+
+    virBitmapFree(numa->mem_nodes[node].cpumask);
+    numa->mem_nodes[node].cpumask = g_steal_pointer(&maxCPUsBitmap);
For some weird reason, I'd feel better if the bitmap were replaced only if it differs. I can't really explain why.
I think it's OK to check with virBitmapEqual() before replacing the existing one, e.g.:
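Something along these lines (a sketch; virBitmapEqual() compares the two maps bit for bit):

    if (!virBitmapEqual(numa->mem_nodes[node].cpumask, maxCPUsBitmap)) {
        virBitmapFree(numa->mem_nodes[node].cpumask);
        numa->mem_nodes[node].cpumask = g_steal_pointer(&maxCPUsBitmap);
    }

DHB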
Michal

Libvirt allows the user to define an incomplete NUMA topology, where the sum of all CPUs in each cell is less than the total of VCPUs. What ends up happening is that QEMU allocates the non-enumerated CPUs in the first NUMA node. This behavior is being flagged as 'to be deprecated' at least since QEMU commit ec78f8114bc4 ("numa: use possible_cpus for not mapped CPUs check").

In [1], Maxiwell suggested that we forbid the user to define such topologies. In his review [2], Peter Krempa pointed out that we can't break existing guests, and suggested that Libvirt should emulate the QEMU behavior of putting the remaining vCPUs in the first NUMA node in these cases.

This patch implements Peter Krempa's suggestion. Since we're most likely going to end up with a disjointed NUMA configuration in node 0 after the auto-fill, we're making auto-fill dependent on QEMU_CAPS_NUMA.

A following patch will update the documentation not just to inform about the auto-fill mechanic with incomplete NUMA topologies, but also to discourage the user from creating such topologies in the future. This approach also makes Libvirt independent of whether QEMU changes its current behavior, since we're either auto-filling the CPUs in node 0 or the user (hopefully) is aware that incomplete topologies, although supported in Libvirt, are to be avoided.

[1] https://www.redhat.com/archives/libvir-list/2019-June/msg00224.html
[2] https://www.redhat.com/archives/libvir-list/2019-June/msg00263.html

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 src/qemu/qemu_domain.c | 47 ++++++++++++++++++++++++++++++++++++++++++
 src/qemu/qemu_domain.h |  4 ++++
 src/qemu/qemu_driver.c |  9 ++++++++
 3 files changed, 60 insertions(+)

diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index 2dad823a86..76191e028b 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -4963,6 +4963,50 @@ qemuDomainDefTsegPostParse(virDomainDefPtr def,
 }
 
 
+/**
+ * qemuDomainDefNumaCPUsRectify:
+ * @def: domain definition
+ * @qemuCaps: QEMU capabilities
+ *
+ * This function emulates the (to be deprecated) QEMU behavior of
+ * filling up node0 with the remaining CPUs of an incomplete NUMA
+ * setup, up to virDomainDefGetVcpusMax().
+ *
+ * Returns: 0 on success, -1 on error
+ */
+int
+qemuDomainDefNumaCPUsRectify(virDomainDefPtr def, virQEMUCapsPtr qemuCaps)
+{
+    unsigned int vcpusMax, numacpus;
+
+    /* QEMU_CAPS_NUMA tells us if QEMU is able to handle disjointed
+     * NUMA CPU ranges. The filling process will create a disjointed
+     * setup in node0 most of the time. Do not proceed if QEMU
+     * can't handle it. */
+    if (virDomainNumaGetNodeCount(def->numa) == 0 ||
+        !virQEMUCapsGet(qemuCaps, QEMU_CAPS_NUMA))
+        return 0;
+
+    vcpusMax = virDomainDefGetVcpusMax(def);
+    numacpus = virDomainNumaGetCPUCountTotal(def->numa);
+
+    if (numacpus < vcpusMax) {
+        if (virDomainNumaFillCPUsInNode(def->numa, 0, vcpusMax) < 0)
+            return -1;
+    }
+
+    return 0;
+}
+
+
+static int
+qemuDomainDefNumaCPUsPostParse(virDomainDefPtr def,
+                               virQEMUCapsPtr qemuCaps)
+{
+    return qemuDomainDefNumaCPUsRectify(def, qemuCaps);
+}
+
+
 static int
 qemuDomainDefPostParseBasic(virDomainDefPtr def,
                             void *opaque G_GNUC_UNUSED)
@@ -5049,6 +5093,9 @@ qemuDomainDefPostParse(virDomainDefPtr def,
     if (qemuDomainDefTsegPostParse(def, qemuCaps) < 0)
         return -1;
 
+    if (qemuDomainDefNumaCPUsPostParse(def, qemuCaps) < 0)
+        return -1;
+
     return 0;
 }
 
diff --git a/src/qemu/qemu_domain.h b/src/qemu/qemu_domain.h
index 41d3f1561d..e78a2b935d 100644
--- a/src/qemu/qemu_domain.h
+++ b/src/qemu/qemu_domain.h
@@ -1297,3 +1297,7 @@ qemuDomainInitializePflashStorageSource(virDomainObjPtr vm);
 bool
 qemuDomainDiskBlockJobIsSupported(virDomainObjPtr vm,
                                   virDomainDiskDefPtr disk);
+
+int
+qemuDomainDefNumaCPUsRectify(virDomainDefPtr def,
+                             virQEMUCapsPtr qemuCaps);
diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
index 88517ba6a7..ff9414f3c4 100644
--- a/src/qemu/qemu_driver.c
+++ b/src/qemu/qemu_driver.c
@@ -4999,6 +4999,7 @@ qemuDomainSetVcpusMax(virQEMUDriverPtr driver,
                       unsigned int nvcpus)
 {
     g_autoptr(virQEMUDriverConfig) cfg = virQEMUDriverGetConfig(driver);
+    g_autoptr(virQEMUCaps) qemuCaps = NULL;
     unsigned int topologycpus;
 
     if (def) {
@@ -5029,6 +5030,14 @@ qemuDomainSetVcpusMax(virQEMUDriverPtr driver,
     if (virDomainDefSetVcpusMax(persistentDef, nvcpus, driver->xmlopt) < 0)
         return -1;
 
+    /* re-adjust NUMA nodes if needed */
+    if (!(qemuCaps = virQEMUCapsCacheLookup(driver->qemuCapsCache,
+                                            persistentDef->emulator)))
+        return -1;
+
+    if (qemuDomainDefNumaCPUsRectify(persistentDef, qemuCaps) < 0)
+        return -1;
+
     if (virDomainDefSave(persistentDef, driver->xmlopt, cfg->configDir) < 0)
         return -1;
--
2.26.2

On 6/10/20 8:35 PM, Daniel Henrique Barboza wrote:
Libvirt allows the user to define an incomplete NUMA topology, where the sum of all CPUs in each cell is less than the total of VCPUs. What ends up happening is that QEMU allocates the non-enumerated CPUs in the first NUMA node. This behavior is being flagged as 'to be deprecated' at least since QEMU commit ec78f8114bc4 ("numa: use possible_cpus for not mapped CPUs check").
In [1], Maxiwell suggested that we forbid the user to define such topologies. In his review [2], Peter Krempa pointed out that we can't break existing guests, and suggested that Libvirt should emulate the QEMU behavior of putting the remaining vCPUs in the first NUMA node in these cases.
This patch implements Peter Krempa's suggestion. Since we're most likely going to end up with a disjointed NUMA configuration in node 0 after the auto-fill, we're making auto-fill dependent on QEMU_CAPS_NUMA.
A following patch will update the documentation not just to inform about the auto-fill mechanic with incomplete NUMA topologies, but also to discourage the user from creating such topologies in the future. This approach also makes Libvirt independent of whether QEMU changes its current behavior, since we're either auto-filling the CPUs in node 0 or the user (hopefully) is aware that incomplete topologies, although supported in Libvirt, are to be avoided.
[1] https://www.redhat.com/archives/libvir-list/2019-June/msg00224.html
[2] https://www.redhat.com/archives/libvir-list/2019-June/msg00263.html

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 src/qemu/qemu_domain.c | 47 ++++++++++++++++++++++++++++++++++++++++++
 src/qemu/qemu_domain.h |  4 ++++
 src/qemu/qemu_driver.c |  9 ++++++++
 3 files changed, 60 insertions(+)
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index 2dad823a86..76191e028b 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -4963,6 +4963,50 @@ qemuDomainDefTsegPostParse(virDomainDefPtr def,
 }
 
 
+/**
+ * qemuDomainDefNumaCPUsRectify:
+ * @def: domain definition
+ * @qemuCaps: QEMU capabilities
+ *
+ * This function emulates the (to be deprecated) QEMU behavior of
+ * filling up node0 with the remaining CPUs of an incomplete NUMA
+ * setup, up to virDomainDefGetVcpusMax().
+ *
+ * Returns: 0 on success, -1 on error
+ */
+int
+qemuDomainDefNumaCPUsRectify(virDomainDefPtr def, virQEMUCapsPtr qemuCaps)
+{
+    unsigned int vcpusMax, numacpus;
+
+    /* QEMU_CAPS_NUMA tells us if QEMU is able to handle disjointed
+     * NUMA CPU ranges. The filling process will create a disjointed
+     * setup in node0 most of the time. Do not proceed if QEMU
+     * can't handle it. */
+    if (virDomainNumaGetNodeCount(def->numa) == 0 ||
+        !virQEMUCapsGet(qemuCaps, QEMU_CAPS_NUMA))
+        return 0;
+
+    vcpusMax = virDomainDefGetVcpusMax(def);
+    numacpus = virDomainNumaGetCPUCountTotal(def->numa);
+
+    if (numacpus < vcpusMax) {
+        if (virDomainNumaFillCPUsInNode(def->numa, 0, vcpusMax) < 0)
+            return -1;
+    }
+
+    return 0;
+}
+
+
+static int
+qemuDomainDefNumaCPUsPostParse(virDomainDefPtr def,
+                               virQEMUCapsPtr qemuCaps)
+{
+    return qemuDomainDefNumaCPUsRectify(def, qemuCaps);
+}
+
+
 static int
 qemuDomainDefPostParseBasic(virDomainDefPtr def,
                             void *opaque G_GNUC_UNUSED)
@@ -5049,6 +5093,9 @@ qemuDomainDefPostParse(virDomainDefPtr def,
     if (qemuDomainDefTsegPostParse(def, qemuCaps) < 0)
         return -1;
 
+    if (qemuDomainDefNumaCPUsPostParse(def, qemuCaps) < 0)
+        return -1;
+
     return 0;
 }
 
diff --git a/src/qemu/qemu_domain.h b/src/qemu/qemu_domain.h
index 41d3f1561d..e78a2b935d 100644
--- a/src/qemu/qemu_domain.h
+++ b/src/qemu/qemu_domain.h
@@ -1297,3 +1297,7 @@ qemuDomainInitializePflashStorageSource(virDomainObjPtr vm);
 bool
 qemuDomainDiskBlockJobIsSupported(virDomainObjPtr vm,
                                   virDomainDiskDefPtr disk);
+
+int
+qemuDomainDefNumaCPUsRectify(virDomainDefPtr def,
+                             virQEMUCapsPtr qemuCaps);
diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
index 88517ba6a7..ff9414f3c4 100644
--- a/src/qemu/qemu_driver.c
+++ b/src/qemu/qemu_driver.c
@@ -4999,6 +4999,7 @@ qemuDomainSetVcpusMax(virQEMUDriverPtr driver,
                       unsigned int nvcpus)
 {
     g_autoptr(virQEMUDriverConfig) cfg = virQEMUDriverGetConfig(driver);
+    g_autoptr(virQEMUCaps) qemuCaps = NULL;
     unsigned int topologycpus;
 
     if (def) {
@@ -5029,6 +5030,14 @@ qemuDomainSetVcpusMax(virQEMUDriverPtr driver,
     if (virDomainDefSetVcpusMax(persistentDef, nvcpus, driver->xmlopt) < 0)
         return -1;
 
+    /* re-adjust NUMA nodes if needed */
+    if (!(qemuCaps = virQEMUCapsCacheLookup(driver->qemuCapsCache,
+                                            persistentDef->emulator)))
This will not do what you think it will do. Although, perhaps for QEMU_CAPS_NUMA (introduced in what, QEMU 2.1.0? The point is it's an ancient QEMU) it doesn't matter, because the likelihood is very small.

virQEMUCapsCacheLookup() will return the caps as detected now, by the current version of libvirt. But what we need is to compare the capability against the set of caps from when the domain was started (vm->priv->qemuCaps). For instance (and this shows how unlikely this corner case is, but still):

1) a domain is started with qemu-1.5.0 (or whatever the minimal version is)
2) yum update qemu
3) virQEMUCapsCacheLookup() will now fetch new caps for the updated QEMU.
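Something along these lines would be closer to that (a rough sketch, untested; it assumes the qemu driver's private data, priv->qemuCaps, carries the capabilities snapshotted at domain start):

    qemuDomainObjPrivatePtr priv = vm->privateData;

    if (virDomainObjIsActive(vm) && priv->qemuCaps) {
        /* the caps the running domain was actually started with */
        qemuCaps = virQEMUCapsNewCopy(priv->qemuCaps);
    } else {
        /* no running QEMU to ask: fall back to the binary on disk */
        qemuCaps = virQEMUCapsCacheLookup(driver->qemuCapsCache,
                                          persistentDef->emulator);
    }

Michal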

Add a unit test to verify the NUMA vcpus autocomplete implemented in the previous patch.

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 .../numavcpus-topology-mismatch.xml           | 37 ++++++++++++++++++
 ...avcpus-topology-mismatch.x86_64-latest.xml | 38 +++++++++++++++++++
 tests/qemuxml2xmltest.c                       |  1 +
 3 files changed, 76 insertions(+)
 create mode 100644 tests/qemuxml2argvdata/numavcpus-topology-mismatch.xml
 create mode 100644 tests/qemuxml2xmloutdata/numavcpus-topology-mismatch.x86_64-latest.xml

diff --git a/tests/qemuxml2argvdata/numavcpus-topology-mismatch.xml b/tests/qemuxml2argvdata/numavcpus-topology-mismatch.xml
new file mode 100644
index 0000000000..d3b2c1297e
--- /dev/null
+++ b/tests/qemuxml2argvdata/numavcpus-topology-mismatch.xml
@@ -0,0 +1,37 @@
+<domain type='kvm'>
+  <name>dummy</name>
+  <uuid>4d92ec27-9ebf-400b-ae91-20c71c647c19</uuid>
+  <maxMemory slots='16' unit='KiB'>16777216</maxMemory>
+  <memory unit='KiB'>7864320</memory>
+  <currentMemory unit='KiB'>3906240</currentMemory>
+  <vcpu placement='auto'>12</vcpu>
+  <numatune>
+    <memory mode='strict' placement='auto'/>
+  </numatune>
+  <os>
+    <type arch='x86_64' machine='pc-i440fx-2.3'>hvm</type>
+    <boot dev='hd'/>
+  </os>
+  <cpu>
+    <numa>
+      <cell id='0' cpus='0' memory='3932160' unit='KiB'/>
+      <cell id='1' cpus='2-4,6' memory='3932160' unit='KiB'/>
+    </numa>
+  </cpu>
+  <clock offset='utc'/>
+  <on_poweroff>destroy</on_poweroff>
+  <on_reboot>restart</on_reboot>
+  <on_crash>destroy</on_crash>
+  <devices>
+    <emulator>/usr/bin/qemu-system-x86_64</emulator>
+    <controller type='usb' index='0' model='piix3-uhci'>
+      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/>
+    </controller>
+    <controller type='pci' index='0' model='pci-root'/>
+    <input type='mouse' bus='ps2'/>
+    <input type='keyboard' bus='ps2'/>
+    <memballoon model='virtio'>
+      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
+    </memballoon>
+  </devices>
+</domain>
diff --git a/tests/qemuxml2xmloutdata/numavcpus-topology-mismatch.x86_64-latest.xml b/tests/qemuxml2xmloutdata/numavcpus-topology-mismatch.x86_64-latest.xml
new file mode 100644
index 0000000000..2d2245b10c
--- /dev/null
+++ b/tests/qemuxml2xmloutdata/numavcpus-topology-mismatch.x86_64-latest.xml
@@ -0,0 +1,38 @@
+<domain type='kvm'>
+  <name>dummy</name>
+  <uuid>4d92ec27-9ebf-400b-ae91-20c71c647c19</uuid>
+  <maxMemory slots='16' unit='KiB'>16777216</maxMemory>
+  <memory unit='KiB'>7864320</memory>
+  <currentMemory unit='KiB'>3906240</currentMemory>
+  <vcpu placement='auto'>12</vcpu>
+  <numatune>
+    <memory mode='strict' placement='auto'/>
+  </numatune>
+  <os>
+    <type arch='x86_64' machine='pc-i440fx-2.3'>hvm</type>
+    <boot dev='hd'/>
+  </os>
+  <cpu mode='custom' match='exact' check='none'>
+    <model fallback='forbid'>qemu64</model>
+    <numa>
+      <cell id='0' cpus='0-1,5,7-11' memory='3932160' unit='KiB'/>
+      <cell id='1' cpus='2-4,6' memory='3932160' unit='KiB'/>
+    </numa>
+  </cpu>
+  <clock offset='utc'/>
+  <on_poweroff>destroy</on_poweroff>
+  <on_reboot>restart</on_reboot>
+  <on_crash>destroy</on_crash>
+  <devices>
+    <emulator>/usr/bin/qemu-system-x86_64</emulator>
+    <controller type='usb' index='0' model='piix3-uhci'>
+      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/>
+    </controller>
+    <controller type='pci' index='0' model='pci-root'/>
+    <input type='mouse' bus='ps2'/>
+    <input type='keyboard' bus='ps2'/>
+    <memballoon model='virtio'>
+      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
+    </memballoon>
+  </devices>
+</domain>
diff --git a/tests/qemuxml2xmltest.c b/tests/qemuxml2xmltest.c
index 157e686f2a..0526a1f104 100644
--- a/tests/qemuxml2xmltest.c
+++ b/tests/qemuxml2xmltest.c
@@ -553,6 +553,7 @@ mymain(void)
     DO_TEST("vcpu-placement-static", QEMU_CAPS_KVM,
             QEMU_CAPS_OBJECT_IOTHREAD);
 
+    DO_TEST_CAPS_LATEST("numavcpus-topology-mismatch");
     DO_TEST("smp", NONE);
     DO_TEST("iothreads", NONE);
--
2.26.2
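For what it's worth, once built, the new case can be exercised directly with something along these lines (invocation from memory; the exact path depends on the build setup):

    $ VIR_TEST_DEBUG=1 ./tests/qemuxml2xmltest

and checking that the numavcpus-topology-mismatch case reports OK.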

We're not mentioning that we're replicating QEMU behavior on purpose. First because QEMU may one day change the behavior and start to refuse incomplete NUMA setups, which would leave our documentation outdated. Second, auto-filling the CPUs in the first cell will work regardless of QEMU changes in the future.

The idea is to encourage the user to provide a complete NUMA CPU topology, not to rely on the CPU auto-fill mechanic.

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 docs/formatdomain.html.in | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/docs/formatdomain.html.in b/docs/formatdomain.html.in
index 20c28a47e3..07d0fa5c70 100644
--- a/docs/formatdomain.html.in
+++ b/docs/formatdomain.html.in
@@ -1829,7 +1829,16 @@
     <p>
       Each <code>cell</code> element specifies a NUMA cell or a NUMA node.
       <code>cpus</code> specifies the CPU or range of CPUs that are
-      part of the node. <code>memory</code> specifies the node memory
+      part of the node. <span class="since">Since 6.5.0</span> For the qemu
+      driver, if the emulator binary supports disjointed <code>cpus</code> ranges
+      in each <code>cell</code>, the sum of all CPUs declared in each <code>cell</code>
+      will be matched with the maximum number of virtual CPUs declared in the
+      <code>vcpu</code> element. This is done by filling any remaining CPUs
+      into the first NUMA <code>cell</code>. Users are encouraged to supply a
+      complete NUMA topology, where the sum of the NUMA CPUs matches the maximum
+      virtual CPUs number declared in <code>vcpus</code>, to make the domain
+      consistent across qemu and libvirt versions.
+      <code>memory</code> specifies the node memory
       in kibibytes (i.e. blocks of 1024 bytes).
       <span class="since">Since 1.2.11</span> one can use an additional
       <a href="#elementsMemoryAllocation"><code>unit</code></a> attribute to
--
2.26.2
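As an illustration of the documented behavior, take the topology from the test in the previous patch. With <vcpu placement='auto'>12</vcpu> and this incomplete definition:

  <numa>
    <cell id='0' cpus='0' memory='3932160' unit='KiB'/>
    <cell id='1' cpus='2-4,6' memory='3932160' unit='KiB'/>
  </numa>

the first cell is auto-filled with every vCPU not claimed elsewhere:

  <numa>
    <cell id='0' cpus='0-1,5,7-11' memory='3932160' unit='KiB'/>
    <cell id='1' cpus='2-4,6' memory='3932160' unit='KiB'/>
  </numa>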

On 6/10/20 3:35 PM, Daniel Henrique Barboza wrote:
We're not mentioning that we're replicating QEMU behavior on purpose. First because QEMU may one day change the behavior and start to refuse incomplete NUMA setups, which would leave our documentation outdated. Second, auto-filling the CPUs in the first cell will work regardless of QEMU changes in the future.
The idea is to encourage the user to provide a complete NUMA CPU topology, not to rely on the CPU auto-fill mechanic.
Update: this patch (on top of the 3 previous patches) ends up providing a solution for https://bugzilla.redhat.com/show_bug.cgi?id=1656762

If a new version of the series is required I'll include the bug link in the commit message of this patch.

Thanks, DHB
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 docs/formatdomain.html.in | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/docs/formatdomain.html.in b/docs/formatdomain.html.in
index 20c28a47e3..07d0fa5c70 100644
--- a/docs/formatdomain.html.in
+++ b/docs/formatdomain.html.in
@@ -1829,7 +1829,16 @@
     <p>
       Each <code>cell</code> element specifies a NUMA cell or a NUMA node.
       <code>cpus</code> specifies the CPU or range of CPUs that are
-      part of the node. <code>memory</code> specifies the node memory
+      part of the node. <span class="since">Since 6.5.0</span> For the qemu
+      driver, if the emulator binary supports disjointed <code>cpus</code> ranges
+      in each <code>cell</code>, the sum of all CPUs declared in each <code>cell</code>
+      will be matched with the maximum number of virtual CPUs declared in the
+      <code>vcpu</code> element. This is done by filling any remaining CPUs
+      into the first NUMA <code>cell</code>. Users are encouraged to supply a
+      complete NUMA topology, where the sum of the NUMA CPUs matches the maximum
+      virtual CPUs number declared in <code>vcpus</code>, to make the domain
+      consistent across qemu and libvirt versions.
+      <code>memory</code> specifies the node memory
       in kibibytes (i.e. blocks of 1024 bytes).
       <span class="since">Since 1.2.11</span> one can use an additional
       <a href="#elementsMemoryAllocation"><code>unit</code></a> attribute to

On 6/10/20 8:35 PM, Daniel Henrique Barboza wrote:
changes in v2: - removed patch 5/5
Gitlab link: https://gitlab.com/danielhb/libvirt/-/tree/vcpus_numa_v2
v1 link: https://www.redhat.com/archives/libvir-list/2020-June/msg00016.html
Daniel Henrique Barboza (4):
  numa_conf.c: add helper functions for cpumap operations
  qemu_domain.c: NUMA CPUs auto-fill for incomplete topologies
  qemuxml2xmltest.c: add NUMA vcpus auto fill tests
  formatdomain.html.in: document the NUMA cpus auto fill feature

 docs/formatdomain.html.in                     | 11 ++++-
 src/conf/numa_conf.c                          | 46 ++++++++++++++++++
 src/conf/numa_conf.h                          |  3 ++
 src/libvirt_private.syms                      |  1 +
 src/qemu/qemu_domain.c                        | 47 +++++++++++++++++++
 src/qemu/qemu_domain.h                        |  4 ++
 src/qemu/qemu_driver.c                        |  9 ++++
 .../numavcpus-topology-mismatch.xml           | 37 +++++++++++++++
 ...avcpus-topology-mismatch.x86_64-latest.xml | 38 +++++++++++++++
 tests/qemuxml2xmltest.c                       |  1 +
 10 files changed, 196 insertions(+), 1 deletion(-)
 create mode 100644 tests/qemuxml2argvdata/numavcpus-topology-mismatch.xml
 create mode 100644 tests/qemuxml2xmloutdata/numavcpus-topology-mismatch.x86_64-latest.xml
Patches look good to me.

My only concern is that I plan to introduce vCPU-less NUMA nodes [1] (because of HMAT [2]). But I guess that if the user assigns vCPUs to NUMA nodes fully, then we can still have vCPU-less nodes because your code would be a NOP, right?

Michal

1: https://gitlab.com/MichalPrivoznik/libvirt/-/tree/hmat
2: https://www.redhat.com/archives/libvir-list/2020-January/msg00422.html

On 6/17/20 4:19 PM, Michal Privoznik wrote:
On 6/10/20 8:35 PM, Daniel Henrique Barboza wrote:
changes in v2: - removed patch 5/5
Gitlab link: https://gitlab.com/danielhb/libvirt/-/tree/vcpus_numa_v2
v1 link: https://www.redhat.com/archives/libvir-list/2020-June/msg00016.html
Daniel Henrique Barboza (4):
  numa_conf.c: add helper functions for cpumap operations
  qemu_domain.c: NUMA CPUs auto-fill for incomplete topologies
  qemuxml2xmltest.c: add NUMA vcpus auto fill tests
  formatdomain.html.in: document the NUMA cpus auto fill feature

 docs/formatdomain.html.in                     | 11 ++++-
 src/conf/numa_conf.c                          | 46 ++++++++++++++++++
 src/conf/numa_conf.h                          |  3 ++
 src/libvirt_private.syms                      |  1 +
 src/qemu/qemu_domain.c                        | 47 +++++++++++++++++++
 src/qemu/qemu_domain.h                        |  4 ++
 src/qemu/qemu_driver.c                        |  9 ++++
 .../numavcpus-topology-mismatch.xml           | 37 +++++++++++++++
 ...avcpus-topology-mismatch.x86_64-latest.xml | 38 +++++++++++++++
 tests/qemuxml2xmltest.c                       |  1 +
 10 files changed, 196 insertions(+), 1 deletion(-)
 create mode 100644 tests/qemuxml2argvdata/numavcpus-topology-mismatch.xml
 create mode 100644 tests/qemuxml2xmloutdata/numavcpus-topology-mismatch.x86_64-latest.xml
Patches look good to me.
Thanks for the review!
My only concern is that I plan to introduce vCPU-less NUMA nodes [1] (because of HMAT [2]). But I guess that if the user assigns vCPUs to NUMA nodes fully, then we can still have vCPU-less nodes because your code would be a NOP, right?
It'll be a NOP because the sum of CPUs in the NUMA topology would be equal to the maxcpus declared in <vcpus>.

Now, for the new use case you're going to introduce, you'll need to either (1) forbid incomplete NUMA nodes entirely for this case or (2) check how QEMU fills in the vcpus in this scenario.

For (2) my brutal uneducated guess is that the behavior would be similar, but populating the first non-cpuless NUMA node (which wouldn't necessarily be node0). This can be arranged by creating a function that returns whether a node is cpu-less and using the first non-cpuless node in qemuDomainDefNumaCPUsRectify() (patch 2) instead of node0, as sketched below. You'll want to check it with QEMU first (Igor Mammedov perhaps?) to ensure that this is what QEMU would do in these cases.

TBH I believe that cpu-less NUMA nodes are quite an advanced feature and (1) is a good approach for that, especially because there are no existing guests in the wild that would be impacted by this restriction, since Libvirt does not support it yet.
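A possible shape for that helper (just a sketch; the name and the exact emptiness check are made up here):

    static ssize_t
    virDomainNumaGetFirstNodeWithCPUs(virDomainNumaPtr numa)
    {
        size_t i;

        for (i = 0; i < virDomainNumaGetNodeCount(numa); i++) {
            virBitmapPtr nodeCpus = virDomainNumaGetNodeCpumask(numa, i);

            /* skip cpu-less nodes */
            if (nodeCpus && virBitmapCountBits(nodeCpus) > 0)
                return i;
        }

        return -1;
    }

qemuDomainDefNumaCPUsRectify() would then pass the returned node to virDomainNumaFillCPUsInNode() instead of the literal 0.

Thanks, DHB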
Michal
1: https://gitlab.com/MichalPrivoznik/libvirt/-/tree/hmat
2: https://www.redhat.com/archives/libvir-list/2020-January/msg00422.html

On 6/17/20 10:08 PM, Daniel Henrique Barboza wrote:
On 6/17/20 4:19 PM, Michal Privoznik wrote:
On 6/10/20 8:35 PM, Daniel Henrique Barboza wrote:
changes in v2: - removed patch 5/5
Gitlab link: https://gitlab.com/danielhb/libvirt/-/tree/vcpus_numa_v2
v1 link: https://www.redhat.com/archives/libvir-list/2020-June/msg00016.html
Daniel Henrique Barboza (4):
  numa_conf.c: add helper functions for cpumap operations
  qemu_domain.c: NUMA CPUs auto-fill for incomplete topologies
  qemuxml2xmltest.c: add NUMA vcpus auto fill tests
  formatdomain.html.in: document the NUMA cpus auto fill feature

 docs/formatdomain.html.in                     | 11 ++++-
 src/conf/numa_conf.c                          | 46 ++++++++++++++++++
 src/conf/numa_conf.h                          |  3 ++
 src/libvirt_private.syms                      |  1 +
 src/qemu/qemu_domain.c                        | 47 +++++++++++++++++++
 src/qemu/qemu_domain.h                        |  4 ++
 src/qemu/qemu_driver.c                        |  9 ++++
 .../numavcpus-topology-mismatch.xml           | 37 +++++++++++++++
 ...avcpus-topology-mismatch.x86_64-latest.xml | 38 +++++++++++++++
 tests/qemuxml2xmltest.c                       |  1 +
 10 files changed, 196 insertions(+), 1 deletion(-)
 create mode 100644 tests/qemuxml2argvdata/numavcpus-topology-mismatch.xml
 create mode 100644 tests/qemuxml2xmloutdata/numavcpus-topology-mismatch.x86_64-latest.xml
Patches look good to me.
Thanks for the review!
My only concern is that I plan to introduce vCPU-less NUMA nodes [1] (because of HMAT [2]). But I guess that if the user assigns vCPUs to NUMA nodes fully, then we can still have vCPU-less nodes because your code would be a NOP, right?
It'll be a NOP because the sum of CPUs in the NUMA topology would be equal to the maxcpus declared in <vcpus>.
Now, for the new use case you're going to introduce, you'll need to either (1) forbid incomplete NUMA nodes entirely for this case or (2) check how QEMU fills in the vcpus in this scenario.
For (2) my brutal uneducated guess is that the behavior would be similar, but populating the first non-cpuless NUMA node (which wouldn't necessarily be node0). This can be arranged by creating a function that returns whether a node is cpu-less and using the first non-cpuless node in qemuDomainDefNumaCPUsRectify() (patch 2) instead of node0. You'll want to check it with QEMU first (Igor Mammedov perhaps?) to ensure that this is what QEMU would do in these cases.
TBH I believe that cpu-less NUMA nodes are quite an advanced feature and (1) is a good approach for that, especially because there are no existing guests in the wild that would be impacted by this restriction, since Libvirt does not support it yet.
Yes, for qemu 2.7+ in qemuValidateDomainDef(): if a topology is specified then we require full vCPU to NUMA assignment. Well, we warn users if that is not the case (see the QEMU_CAPS_QUERY_HOTPLUGGABLE_CPUS check and the code around it).

Anyways, as I said, your code is okay (I'm fixing a couple of small nits) and I'm pushing it.

Reviewed-by: Michal Privoznik <mprivozn@redhat.com>

Do you think it's worth documenting in the release notes too?

Michal

On 6/18/20 7:34 AM, Michal Privoznik wrote:
On 6/17/20 10:08 PM, Daniel Henrique Barboza wrote:
On 6/17/20 4:19 PM, Michal Privoznik wrote:
On 6/10/20 8:35 PM, Daniel Henrique Barboza wrote:
changes in v2: - removed patch 5/5
Gitlab link: https://gitlab.com/danielhb/libvirt/-/tree/vcpus_numa_v2
v1 link: https://www.redhat.com/archives/libvir-list/2020-June/msg00016.html
Daniel Henrique Barboza (4):
  numa_conf.c: add helper functions for cpumap operations
  qemu_domain.c: NUMA CPUs auto-fill for incomplete topologies
  qemuxml2xmltest.c: add NUMA vcpus auto fill tests
  formatdomain.html.in: document the NUMA cpus auto fill feature

 docs/formatdomain.html.in                     | 11 ++++-
 src/conf/numa_conf.c                          | 46 ++++++++++++++++++
 src/conf/numa_conf.h                          |  3 ++
 src/libvirt_private.syms                      |  1 +
 src/qemu/qemu_domain.c                        | 47 +++++++++++++++++++
 src/qemu/qemu_domain.h                        |  4 ++
 src/qemu/qemu_driver.c                        |  9 ++++
 .../numavcpus-topology-mismatch.xml           | 37 +++++++++++++++
 ...avcpus-topology-mismatch.x86_64-latest.xml | 38 +++++++++++++++
 tests/qemuxml2xmltest.c                       |  1 +
 10 files changed, 196 insertions(+), 1 deletion(-)
 create mode 100644 tests/qemuxml2argvdata/numavcpus-topology-mismatch.xml
 create mode 100644 tests/qemuxml2xmloutdata/numavcpus-topology-mismatch.x86_64-latest.xml
Patches look good to me.
Thanks for the review!
My only concern is that I plan to introduce vCPU-less NUMA nodes [1] (because of HMAT [2]). But I guess that if the user assigns vCPUs to NUMA nodes fully, then we can still have vCPU-less nodes because your code would be a NOP, right?
It'll be a NOP because the sum of CPUs in the NUMA topology would be equal to the maxcpus declared in <vcpus>.
Now, for the new use case you're going to introduce, you'll need to either (1) forbid incomplete NUMA nodes entirely for this case or (2) check how QEMU fills in the vcpus in this scenario.
For (2) my brutal uneducated guess is that the behavior would be similar, but populating the first non-cpuless NUMA node (which wouldn't necessarily be node0). This can be arranged by creating a function that returns whether a node is cpu-less and using the first non-cpuless node in qemuDomainDefNumaCPUsRectify() (patch 2) instead of node0. You'll want to check it with QEMU first (Igor Mammedov perhaps?) to ensure that this is what QEMU would do in these cases.
TBH I believe that cpu-less NUMA nodes are quite an advanced feature and (1) is a good approach for that, especially because there are no existing guests in the wild that would be impacted by this restriction, since Libvirt does not support it yet.
Yes, for qemu 2.7+ in qemuValidateDomainDef(): if a topology is specified then we require full vCPU to NUMA assignment. Well, we warn users if that is not the case (see the QEMU_CAPS_QUERY_HOTPLUGGABLE_CPUS check and the code around it).
Anyways, as I said, your code is okay (I'm fixing a couple of small nits) and I'm pushing it.
Thanks for the tweaks before pushing!
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
Do you think it's worth documenting in the release notes too?
In the "Improvements" section, perhaps. I just sent a patch.

Thanks, DHB
Michal