[libvirt] [PATCH 0/5] [RFC] Add support for per-guest-node binding

Currently we are only able to bind the whole domain to a set of host nodes using the /domain/numatune/memory element. Numerous requests have been made to support host<->guest NUMA node bindings, so this series tries to pitch an idea on how to do that using /domain/numatune/memnode elements. Here are a few points I'd like to hear others' opinions on.

For some reason, qemu wants to know which host nodes it can use for allocation of the memory. While adding support for that, qemu introduced various memory objects (-object memory*) with different backends: 'memory-file', which is used for hugepages, and 'memory-ram', which is used for standard allocation. The latest version of the qemu proposal is here:

http://lists.gnu.org/archive/html/qemu-devel/2014-05/msg02706.html

Caveats:

 - I'm not sure how cpu hotplug is done with guest numa nodes, but if there is a possibility of increasing the number of numa nodes (which does not make sense to me from (a) a user's point of view and (b) our XMLs and APIs), we need to be able to hotplug the ram as well,

 - virDomainGetNumaParameters() now reflects only the /domain/numatune/memory settings, not the 'memnode' ones,

 - virDomainSetNumaParameters() is not allowed when there is any /domain/numatune/memnode parameter, because we can query memdev info but not change it (if I understood the QEMU side correctly),

 - when the domain is started, the cpuset.mems cgroup is not modified for each vcpu; this will be fixed, but the question is how to handle it for non-strict settings [*],

 - automatic numad placement can now be used together with memnode settings, which IMHO doesn't make any sense, but I was hesitant to disable that in case somebody has constructive criticism in this area,

 - this series alone is broken when used with /domain/memoryBacking/hugepages, because it will still use the memory-ram object, but that will be fixed by Michal's patches on top of this series.

One idea how to solve some of these problems is to say that /domain/numatune/memory is set for the whole domain regardless of what anyone puts in /domain/numatune/memnode. virDomainGetNumaParameters() could be extended to report the info for all guest numa nodes, although a new API seems better suited for this kind of information. But is it really needed when we are not able to modify it live and the information is available in the domain XML?

*) Does (or should) this:

  ...
  <numatune>
    <memory mode='strict' placement='static' nodeset='0-7'/>
    <memnode nodeid='0' mode='preferred' nodeset='7'/>
  </numatune>
  ...

mean what it looks like it means, that is "in guest node 0, prefer allocating from host node 7, but feel free to allocate from 0-6 as well in case you can't use 7, but never try allocating from host nodes 8-15"?
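For reference, a rough sketch of the intended XML-to-command-line mapping (this mirrors the numatune-memnodes test case added in patch 5/5; sizes and option ordering are illustrative only):

  <numatune>
    <memory mode='strict' nodeset='0-3'/>
    <memnode cellid='0' mode='preferred' nodeset='3'/>
  </numatune>
  ...
  <cpu>
    <numa>
      <cell id='0' cpus='0' memory='32768'/>
      <cell id='1' cpus='1' memory='32768'/>
    </numa>
  </cpu>

should, with a memdev-capable qemu, translate to roughly:

  -object memory-ram,size=32M,id=ram-node0,host-nodes=3,policy=preferred \
  -numa node,nodeid=0,cpus=0,memdev=ram-node0 \
  -object memory-ram,size=32M,id=ram-node1,host-nodes=0-3,policy=bind \
  -numa node,nodeid=1,cpus=1,memdev=ram-node1

i.e. guest node 0 follows its own memnode policy, while guest node 1, which has no memnode element, falls back to the global <memory> nodeset and mode.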
Martin Kletzander (5):
  conf, schema: add 'id' field for cells
  conf, schema: add support for numatune memnode element
  qemu: purely a code movement
  qemu: numa capability probing
  qemu: pass numa node binding preferences to qemu

 docs/formatdomain.html.in                           |  29 +++-
 docs/schemas/domaincommon.rng                       |  22 +++
 src/conf/cpu_conf.c                                 |  39 ++++-
 src/conf/domain_conf.c                              | 181 +++++++++++++++++----
 src/qemu/qemu_capabilities.c                        |   2 +
 src/qemu/qemu_capabilities.h                        |   1 +
 src/qemu/qemu_cgroup.c                              |   2 +
 src/qemu/qemu_command.c                             | 160 ++++++++++++++++--
 src/qemu/qemu_command.h                             |   3 +-
 src/qemu/qemu_domain.c                              |  23 ++-
 src/qemu/qemu_driver.c                              |  14 +-
 src/qemu/qemu_process.c                             |   3 +-
 src/util/virnuma.h                                  |  14 +-
 tests/qemuxml2argvdata/qemuxml2argv-cpu-numa1.xml   |   6 +-
 tests/qemuxml2argvdata/qemuxml2argv-cpu-numa2.xml   |   6 +-
 tests/qemuxml2argvdata/qemuxml2argv-cpu-numa3.xml   |  25 +++
 .../qemuxml2argv-numatune-auto-prefer.args          |   6 +
 .../qemuxml2argv-numatune-auto-prefer.xml           |  29 ++++
 .../qemuxml2argv-numatune-auto.args                 |   6 +
 .../qemuxml2argv-numatune-auto.xml                  |  26 +++
 .../qemuxml2argv-numatune-memnode-nocpu.xml         |  25 +++
 .../qemuxml2argv-numatune-memnodes-problematic.xml  |  31 ++++
 .../qemuxml2argv-numatune-memnodes.args             |   8 +
 .../qemuxml2argv-numatune-memnodes.xml              |  31 ++++
 .../qemuxml2argv-numatune-prefer.args               |   6 +
 .../qemuxml2argv-numatune-prefer.xml                |  29 ++++
 tests/qemuxml2argvtest.c                            |  51 ++++--
 .../qemuxml2xmlout-cpu-numa1.xml                    |  28 ++++
 .../qemuxml2xmlout-cpu-numa2.xml                    |  28 ++++
 tests/qemuxml2xmltest.c                             |   4 +
 tests/qemuxmlnstest.c                               |   2 +-
 31 files changed, 747 insertions(+), 93 deletions(-)
 create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-cpu-numa3.xml
 create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-numatune-auto-prefer.args
 create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-numatune-auto-prefer.xml
 create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-numatune-auto.args
 create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-numatune-auto.xml
 create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-numatune-memnode-nocpu.xml
 create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-numatune-memnodes-problematic.xml
 create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-numatune-memnodes.args
 create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-numatune-memnodes.xml
 create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-numatune-prefer.args
 create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-numatune-prefer.xml
 create mode 100644 tests/qemuxml2xmloutdata/qemuxml2xmlout-cpu-numa1.xml
 create mode 100644 tests/qemuxml2xmloutdata/qemuxml2xmlout-cpu-numa2.xml

--
1.9.3

In XML format, by definition, order of fields should not matter, so oder of parsing the elements doesn't affect the end result. When specifying guest NUMA cells, we depend only on the order of the 'cell' elements. With this patch all older domain XMLs are parsed as before, but with the 'id' attribute they are parsed and formatted according to that field. This will be useful when we have tuning settings for particular guest NUMA node. Signed-off-by: Martin Kletzander <mkletzan@redhat.com> --- docs/formatdomain.html.in | 11 +++--- docs/schemas/domaincommon.rng | 5 +++ src/conf/cpu_conf.c | 39 +++++++++++++++++++--- tests/qemuxml2argvdata/qemuxml2argv-cpu-numa1.xml | 6 ++-- tests/qemuxml2argvdata/qemuxml2argv-cpu-numa2.xml | 6 ++-- tests/qemuxml2argvdata/qemuxml2argv-cpu-numa3.xml | 25 ++++++++++++++ tests/qemuxml2argvtest.c | 1 + .../qemuxml2xmlout-cpu-numa1.xml | 28 ++++++++++++++++ .../qemuxml2xmlout-cpu-numa2.xml | 28 ++++++++++++++++ tests/qemuxml2xmltest.c | 3 ++ 10 files changed, 138 insertions(+), 14 deletions(-) create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-cpu-numa3.xml create mode 100644 tests/qemuxml2xmloutdata/qemuxml2xmlout-cpu-numa1.xml create mode 100644 tests/qemuxml2xmloutdata/qemuxml2xmlout-cpu-numa2.xml diff --git a/docs/formatdomain.html.in b/docs/formatdomain.html.in index 691a451..041f70d 100644 --- a/docs/formatdomain.html.in +++ b/docs/formatdomain.html.in @@ -1030,8 +1030,8 @@ <cpu> ... <numa> - <cell cpus='0-3' memory='512000'/> - <cell cpus='4-7' memory='512000'/> + <cell id='0' cpus='0-3' memory='512000'/> + <cell id='1' cpus='4-7' memory='512000'/> </numa> ... </cpu> @@ -1041,8 +1041,11 @@ Each <code>cell</code> element specifies a NUMA cell or a NUMA node. <code>cpus</code> specifies the CPU or range of CPUs that are part of the node. <code>memory</code> specifies the node memory in kibibytes - (i.e. blocks of 1024 bytes). Each cell or node is assigned cellid - or nodeid in the increasing order starting from 0. + (i.e. blocks of 1024 bytes). All cells should have <code>id</code> + attribute in case referring to some cell is necessary in the code, + otherwise the cells are assigned ids in the increasing order starting + from 0. Mixing cells with and without the <code>id</code> attribute + is not recommended as it may result in unwanted behaviour. 
</p> <p> diff --git a/docs/schemas/domaincommon.rng b/docs/schemas/domaincommon.rng index af67123..0787b5a 100644 --- a/docs/schemas/domaincommon.rng +++ b/docs/schemas/domaincommon.rng @@ -3892,6 +3892,11 @@ <define name="numaCell"> <element name="cell"> + <optional> + <attribute name="id"> + <ref name="unsignedInt"/> + </attribute> + </optional> <attribute name="cpus"> <ref name="cpuset"/> </attribute> diff --git a/src/conf/cpu_conf.c b/src/conf/cpu_conf.c index ebdaa19..ec80340 100644 --- a/src/conf/cpu_conf.c +++ b/src/conf/cpu_conf.c @@ -438,17 +438,47 @@ virCPUDefParseXML(xmlNodePtr node, for (i = 0; i < n; i++) { char *cpus, *memory; int ret, ncpus = 0; + unsigned int cur_cell; + char *tmp = NULL; + + tmp = virXMLPropString(nodes[i], "id"); + if (!tmp) { + cur_cell = i; + } else { + ret = virStrToLong_ui(tmp, NULL, 10, &cur_cell); + VIR_FREE(tmp); + if (ret == -1) { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("Invalid 'id' attribute in NUMA cell")); + goto error; + } + } + + if (cur_cell >= n) { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("Exactly one 'cell' element per guest " + "NUMA cell allowed, non-contiguous ranges or " + "ranges not starting from 0 are not allowed")); + goto error; + } + + if (def->cells[cur_cell].cpustr) { + virReportError(VIR_ERR_XML_ERROR, + _("Duplicate NUMA cell info for cell id '%u'"), + cur_cell); + goto error; + } - def->cells[i].cellid = i; + def->cells[cur_cell].cellid = cur_cell; cpus = virXMLPropString(nodes[i], "cpus"); if (!cpus) { virReportError(VIR_ERR_XML_ERROR, "%s", _("Missing 'cpus' attribute in NUMA cell")); goto error; } - def->cells[i].cpustr = cpus; + def->cells[cur_cell].cpustr = cpus; - ncpus = virBitmapParse(cpus, 0, &def->cells[i].cpumask, + ncpus = virBitmapParse(cpus, 0, &def->cells[cur_cell].cpumask, VIR_DOMAIN_CPUMASK_LEN); if (ncpus <= 0) goto error; @@ -461,7 +491,7 @@ virCPUDefParseXML(xmlNodePtr node, goto error; } - ret = virStrToLong_ui(memory, NULL, 10, &def->cells[i].mem); + ret = virStrToLong_ui(memory, NULL, 10, &def->cells[cur_cell].mem); if (ret == -1) { virReportError(VIR_ERR_XML_ERROR, "%s", _("Invalid 'memory' attribute in NUMA cell")); @@ -647,6 +677,7 @@ virCPUDefFormatBuf(virBufferPtr buf, virBufferAdjustIndent(buf, 2); for (i = 0; i < def->ncells; i++) { virBufferAddLit(buf, "<cell"); + virBufferAsprintf(buf, " id='%u'", def->cells[i].cellid); virBufferAsprintf(buf, " cpus='%s'", def->cells[i].cpustr); virBufferAsprintf(buf, " memory='%d'", def->cells[i].mem); virBufferAddLit(buf, "/>\n"); diff --git a/tests/qemuxml2argvdata/qemuxml2argv-cpu-numa1.xml b/tests/qemuxml2argvdata/qemuxml2argv-cpu-numa1.xml index ee402c8..0543f7f 100644 --- a/tests/qemuxml2argvdata/qemuxml2argv-cpu-numa1.xml +++ b/tests/qemuxml2argvdata/qemuxml2argv-cpu-numa1.xml @@ -9,10 +9,10 @@ <boot dev='network'/> </os> <cpu> - <topology sockets="2" cores="4" threads="2"/> + <topology sockets='2' cores='4' threads='2'/> <numa> - <cell cpus="0-7" memory="109550"/> - <cell cpus="8-15" memory="109550"/> + <cell cpus='0-7' memory='109550'/> + <cell cpus='8-15' memory='109550'/> </numa> </cpu> <clock offset='utc'/> diff --git a/tests/qemuxml2argvdata/qemuxml2argv-cpu-numa2.xml b/tests/qemuxml2argvdata/qemuxml2argv-cpu-numa2.xml index ee402c8..0a5f9fc 100644 --- a/tests/qemuxml2argvdata/qemuxml2argv-cpu-numa2.xml +++ b/tests/qemuxml2argvdata/qemuxml2argv-cpu-numa2.xml @@ -9,10 +9,10 @@ <boot dev='network'/> </os> <cpu> - <topology sockets="2" cores="4" threads="2"/> + <topology sockets='2' cores='4' threads='2'/> <numa> - <cell cpus="0-7" 
memory="109550"/> - <cell cpus="8-15" memory="109550"/> + <cell id='1' cpus='8-15' memory='109550'/> + <cell id='0' cpus='0-7' memory='109550'/> </numa> </cpu> <clock offset='utc'/> diff --git a/tests/qemuxml2argvdata/qemuxml2argv-cpu-numa3.xml b/tests/qemuxml2argvdata/qemuxml2argv-cpu-numa3.xml new file mode 100644 index 0000000..fa3070d --- /dev/null +++ b/tests/qemuxml2argvdata/qemuxml2argv-cpu-numa3.xml @@ -0,0 +1,25 @@ +<domain type='qemu'> + <name>QEMUGuest1</name> + <uuid>c7a5fdbd-edaf-9455-926a-d65c16db1809</uuid> + <memory unit='KiB'>219100</memory> + <currentMemory unit='KiB'>219100</currentMemory> + <vcpu placement='static'>16</vcpu> + <os> + <type arch='x86_64' machine='pc'>hvm</type> + <boot dev='network'/> + </os> + <cpu> + <topology sockets='2' cores='4' threads='2'/> + <numa> + <cell id='1' cpus='0-7' memory='109550'/> + <cell id='2' cpus='8-15' memory='109550'/> + </numa> + </cpu> + <clock offset='utc'/> + <on_poweroff>destroy</on_poweroff> + <on_reboot>restart</on_reboot> + <on_crash>destroy</on_crash> + <devices> + <emulator>/usr/bin/qemu</emulator> + </devices> +</domain> diff --git a/tests/qemuxml2argvtest.c b/tests/qemuxml2argvtest.c index 24d104e..bdea8b0 100644 --- a/tests/qemuxml2argvtest.c +++ b/tests/qemuxml2argvtest.c @@ -1169,6 +1169,7 @@ mymain(void) DO_TEST("cpu-strict1", NONE); DO_TEST("cpu-numa1", NONE); DO_TEST("cpu-numa2", QEMU_CAPS_SMP_TOPOLOGY); + DO_TEST_PARSE_ERROR("cpu-numa3", NONE); DO_TEST("cpu-host-model", NONE); skipLegacyCPUs = true; DO_TEST("cpu-host-model-fallback", NONE); diff --git a/tests/qemuxml2xmloutdata/qemuxml2xmlout-cpu-numa1.xml b/tests/qemuxml2xmloutdata/qemuxml2xmlout-cpu-numa1.xml new file mode 100644 index 0000000..227bf1c --- /dev/null +++ b/tests/qemuxml2xmloutdata/qemuxml2xmlout-cpu-numa1.xml @@ -0,0 +1,28 @@ +<domain type='qemu'> + <name>QEMUGuest1</name> + <uuid>c7a5fdbd-edaf-9455-926a-d65c16db1809</uuid> + <memory unit='KiB'>219100</memory> + <currentMemory unit='KiB'>219100</currentMemory> + <vcpu placement='static'>16</vcpu> + <os> + <type arch='x86_64' machine='pc'>hvm</type> + <boot dev='network'/> + </os> + <cpu> + <topology sockets='2' cores='4' threads='2'/> + <numa> + <cell id='0' cpus='0-7' memory='109550'/> + <cell id='1' cpus='8-15' memory='109550'/> + </numa> + </cpu> + <clock offset='utc'/> + <on_poweroff>destroy</on_poweroff> + <on_reboot>restart</on_reboot> + <on_crash>destroy</on_crash> + <devices> + <emulator>/usr/bin/qemu</emulator> + <controller type='usb' index='0'/> + <controller type='pci' index='0' model='pci-root'/> + <memballoon model='virtio'/> + </devices> +</domain> diff --git a/tests/qemuxml2xmloutdata/qemuxml2xmlout-cpu-numa2.xml b/tests/qemuxml2xmloutdata/qemuxml2xmlout-cpu-numa2.xml new file mode 100644 index 0000000..227bf1c --- /dev/null +++ b/tests/qemuxml2xmloutdata/qemuxml2xmlout-cpu-numa2.xml @@ -0,0 +1,28 @@ +<domain type='qemu'> + <name>QEMUGuest1</name> + <uuid>c7a5fdbd-edaf-9455-926a-d65c16db1809</uuid> + <memory unit='KiB'>219100</memory> + <currentMemory unit='KiB'>219100</currentMemory> + <vcpu placement='static'>16</vcpu> + <os> + <type arch='x86_64' machine='pc'>hvm</type> + <boot dev='network'/> + </os> + <cpu> + <topology sockets='2' cores='4' threads='2'/> + <numa> + <cell id='0' cpus='0-7' memory='109550'/> + <cell id='1' cpus='8-15' memory='109550'/> + </numa> + </cpu> + <clock offset='utc'/> + <on_poweroff>destroy</on_poweroff> + <on_reboot>restart</on_reboot> + <on_crash>destroy</on_crash> + <devices> + <emulator>/usr/bin/qemu</emulator> + <controller type='usb' 
index='0'/> + <controller type='pci' index='0' model='pci-root'/> + <memballoon model='virtio'/> + </devices> +</domain> diff --git a/tests/qemuxml2xmltest.c b/tests/qemuxml2xmltest.c index da528da..a446ea1 100644 --- a/tests/qemuxml2xmltest.c +++ b/tests/qemuxml2xmltest.c @@ -364,6 +364,9 @@ mymain(void) DO_TEST("chardev-label"); + DO_TEST_DIFFERENT("cpu-numa1"); + DO_TEST_DIFFERENT("cpu-numa2"); + virObjectUnref(driver.caps); virObjectUnref(driver.xmlopt); -- 1.9.3

This element specifies similar settings as the memory element, although memnode can be used per guest NUMA node. Signed-off-by: Martin Kletzander <mkletzan@redhat.com> --- docs/formatdomain.html.in | 18 +++++ docs/schemas/domaincommon.rng | 17 ++++ src/conf/domain_conf.c | 181 +++++++++++++++++++++++++++++++++--------- src/qemu/qemu_domain.c | 23 +++++- src/qemu/qemu_driver.c | 12 +++ src/util/virnuma.h | 14 +++- 6 files changed, 225 insertions(+), 40 deletions(-) diff --git a/docs/formatdomain.html.in b/docs/formatdomain.html.in index 041f70d..fd29ae3 100644 --- a/docs/formatdomain.html.in +++ b/docs/formatdomain.html.in @@ -709,6 +709,8 @@ ... <numatune> <memory mode="strict" nodeset="1-4,^3"/> + <memnode cellid="0" mode="strict" nodeset="1"/> + <memnode cellid="2" mode="preferred" nodeset="2"/> </numatune> ... </domain> @@ -745,6 +747,22 @@ <span class='since'>Since 0.9.3</span> </dd> + <dt><code>memnode</code></dt> + <dd> + Optional <code>memnode</code> elements can specify memory allocation + policies per each guest NUMA node. For those nodes having no + corresponding <code>memnode</code> element, the default from + element <code>memory</code> will be used. Attribute <code>cellid</code> + addresses guest NUMA node for which the settings are applied. + Attributes <code>mode</code> and <code>nodeset</code> have the same + meaning and syntax as in <code>memory</code> element. + + Due to possible memory migration issues according to kernel settings, + using this <code>memnode</code> element effectively disables any live + changes of numatune settings in current versions of libvirt. + + <span class='since'>QEMU Since 1.2.6</span> + </dd> </dl> diff --git a/docs/schemas/domaincommon.rng b/docs/schemas/domaincommon.rng index 0787b5a..a8e3ba0 100644 --- a/docs/schemas/domaincommon.rng +++ b/docs/schemas/domaincommon.rng @@ -789,6 +789,23 @@ </choice> </element> </optional> + <zeroOrMore> + <element name="memnode"> + <attribute name="cellid"> + <ref name="unsignedInt"/> + </attribute> + <attribute name="mode"> + <choice> + <value>strict</value> + <value>preferred</value> + <value>interleave</value> + </choice> + </attribute> + <attribute name='nodeset'> + <ref name='cpuset'/> + </attribute> + </element> + </zeroOrMore> </element> </define> diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index f1df092..4818cfb 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -2085,6 +2085,9 @@ void virDomainDefFree(virDomainDefPtr def) virDomainVcpuPinDefFree(def->cputune.emulatorpin); virBitmapFree(def->numatune.memory.nodemask); + for (i = 0; i < def->numatune.nmem_nodes; i++) + virBitmapFree(def->numatune.mem_nodes[i].nodemask); + VIR_FREE(def->numatune.mem_nodes); virSysinfoDefFree(def->sysinfo); @@ -11233,6 +11236,7 @@ virDomainDefParseXML(xmlDocPtr xml, bool usb_master = false; bool primaryVideo = false; + if (VIR_ALLOC(def) < 0) return NULL; @@ -11666,6 +11670,33 @@ virDomainDefParseXML(xmlDocPtr xml, } VIR_FREE(nodes); + + /* analysis of cpu handling */ + if ((node = virXPathNode("./cpu[1]", ctxt)) != NULL) { + xmlNodePtr oldnode = ctxt->node; + ctxt->node = node; + def->cpu = virCPUDefParseXML(node, ctxt, VIR_CPU_TYPE_GUEST); + ctxt->node = oldnode; + + if (def->cpu == NULL) + goto error; + + if (def->cpu->sockets && + def->maxvcpus > + def->cpu->sockets * def->cpu->cores * def->cpu->threads) { + virReportError(VIR_ERR_XML_DETAIL, "%s", + _("Maximum CPUs greater than topology limit")); + goto error; + } + + if (def->cpu->cells_cpus > def->maxvcpus) { + 
virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("Number of CPUs in <numa> exceeds the" + " <vcpu> count")); + goto error; + } + } + /* Extract numatune if exists. */ if ((n = virXPathNodeSet("./numatune", ctxt, &nodes)) < 0) { virReportError(VIR_ERR_INTERNAL_ERROR, @@ -11682,6 +11713,12 @@ virDomainDefParseXML(xmlDocPtr xml, if (n) { cur = nodes[0]->children; + if (def->cpu) { + if (VIR_ALLOC_N(def->numatune.mem_nodes, def->cpu->ncells) < 0) + goto error; + def->numatune.nmem_nodes = def->cpu->ncells; + } + while (cur != NULL) { if (cur->type == XML_ELEMENT_NODE) { if (xmlStrEqual(cur->name, BAD_CAST "memory")) { @@ -11764,6 +11801,78 @@ virDomainDefParseXML(xmlDocPtr xml, def->placement_mode = VIR_DOMAIN_CPU_PLACEMENT_MODE_AUTO; def->numatune.memory.placement_mode = placement_mode; + + } else if (xmlStrEqual(cur->name, BAD_CAST "memnode")) { + unsigned int cellid; + struct mem_node *mem_node = NULL; + + if (!def->numatune.nmem_nodes) { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("Element 'memnode' is invalid without " + "any guest NUMA cells")); + goto error; + } + tmp = virXMLPropString(cur, "cellid"); + if (!tmp) { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("Missing required cellid attribute " + "in numatune memnode element")); + goto error; + } + if (virStrToLong_ui(tmp, NULL, 10, &cellid) < 0) { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("Invalid cellid attribute " + "in numatune memnode element")); + goto error; + } + VIR_FREE(tmp); + + if (cellid >= def->numatune.nmem_nodes) { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("Argument 'cellid' in numatune " + "memnode element must correspond to " + "existing guest's NUMA cell")); + goto error; + } + + mem_node = &def->numatune.mem_nodes[cellid]; + + if (mem_node->specified) { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("Multiple numatune memnode elements " + "with duplicate 'cellid'")); + goto error; + } + + mem_node->specified = true; + + tmp = virXMLPropString(cur, "mode"); + if (tmp && + (mem_node->mode = + virDomainNumatuneMemModeTypeFromString(tmp)) < 0) { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("Invalid mode attribute " + "in numatune memnode element")); + goto error; + } else if (!tmp) { + mem_node->mode = VIR_DOMAIN_NUMATUNE_MEM_STRICT; + } + VIR_FREE(tmp); + + tmp = virXMLPropString(cur, "nodeset"); + if (!tmp) { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("Missing required nodeset attribute " + "in numatune memnode element")); + goto error; + } + if (virBitmapParse(tmp, 0, + &mem_node->nodemask, + VIR_DOMAIN_CPUMASK_LEN) < 0) { + goto error; + } + VIR_FREE(tmp); + } else { virReportError(VIR_ERR_XML_ERROR, _("unsupported XML element %s"), @@ -12863,32 +12972,6 @@ virDomainDefParseXML(xmlDocPtr xml, goto error; } - /* analysis of cpu handling */ - if ((node = virXPathNode("./cpu[1]", ctxt)) != NULL) { - xmlNodePtr oldnode = ctxt->node; - ctxt->node = node; - def->cpu = virCPUDefParseXML(node, ctxt, VIR_CPU_TYPE_GUEST); - ctxt->node = oldnode; - - if (def->cpu == NULL) - goto error; - - if (def->cpu->sockets && - def->maxvcpus > - def->cpu->sockets * def->cpu->cores * def->cpu->threads) { - virReportError(VIR_ERR_XML_DETAIL, "%s", - _("Maximum CPUs greater than topology limit")); - goto error; - } - - if (def->cpu->cells_cpus > def->maxvcpus) { - virReportError(VIR_ERR_INTERNAL_ERROR, "%s", - _("Number of CPUs in <numa> exceeds the" - " <vcpu> count")); - goto error; - } - } - if ((node = virXPathNode("./sysinfo[1]", ctxt)) != NULL) { xmlNodePtr oldnode = ctxt->node; ctxt->node = node; @@ -17395,31 +17478,57 
@@ virDomainDefFormatInternal(virDomainDefPtr def, virBufferAddLit(buf, "</cputune>\n"); if (def->numatune.memory.nodemask || - def->numatune.memory.placement_mode) { + def->numatune.memory.placement_mode || + def->numatune.nmem_nodes) { const char *mode; char *nodemask = NULL; const char *placement; virBufferAddLit(buf, "<numatune>\n"); virBufferAdjustIndent(buf, 2); - mode = virDomainNumatuneMemModeTypeToString(def->numatune.memory.mode); - virBufferAsprintf(buf, "<memory mode='%s' ", mode); - if (def->numatune.memory.placement_mode == - VIR_NUMA_TUNE_MEM_PLACEMENT_MODE_STATIC) { - nodemask = virBitmapFormat(def->numatune.memory.nodemask); + if (def->numatune.memory.nodemask || + def->numatune.memory.placement_mode) { + + mode = virDomainNumatuneMemModeTypeToString(def->numatune.memory.mode); + virBufferAsprintf(buf, "<memory mode='%s' ", mode); + + if (def->numatune.memory.placement_mode == + VIR_NUMA_TUNE_MEM_PLACEMENT_MODE_STATIC) { + nodemask = virBitmapFormat(def->numatune.memory.nodemask); + if (nodemask == NULL) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("failed to format nodeset for " + "NUMA memory tuning")); + goto error; + } + virBufferAsprintf(buf, "nodeset='%s'/>\n", nodemask); + VIR_FREE(nodemask); + } else if (def->numatune.memory.placement_mode) { + placement = virNumaTuneMemPlacementModeTypeToString(def->numatune.memory.placement_mode); + virBufferAsprintf(buf, "placement='%s'/>\n", placement); + } + } + + for (i = 0; i < def->numatune.nmem_nodes; i++) { + struct mem_node *mem_node = &def->numatune.mem_nodes[i]; + if (!mem_node->specified) + continue; + + nodemask = virBitmapFormat(mem_node->nodemask); + mode = virDomainNumatuneMemModeTypeToString(mem_node->mode); if (nodemask == NULL) { virReportError(VIR_ERR_INTERNAL_ERROR, "%s", _("failed to format nodeset for " "NUMA memory tuning")); goto error; } - virBufferAsprintf(buf, "nodeset='%s'/>\n", nodemask); + virBufferAsprintf(buf, + "<memnode nodeid='%zu' mode='%s' nodeset='%s'/>\n", + i, mode, nodemask); VIR_FREE(nodemask); - } else if (def->numatune.memory.placement_mode) { - placement = virNumaTuneMemPlacementModeTypeToString(def->numatune.memory.placement_mode); - virBufferAsprintf(buf, "placement='%s'/>\n", placement); } + virBufferAdjustIndent(buf, -2); virBufferAddLit(buf, "</numatune>\n"); } diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index 78cfdc6..89800ab 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -710,7 +710,28 @@ qemuDomainDefPostParse(virDomainDefPtr def, bool addDefaultMemballoon = true; bool addDefaultUSBKBD = false; bool addDefaultUSBMouse = false; - + size_t i = 0; + + if (def->numatune.memory.nodemask) { + for (i = 0; i < def->numatune.nmem_nodes; i++) { + struct mem_node *mem_node = &def->numatune.mem_nodes[i]; + ssize_t pos = -1; + bool bit = false; + + if (!mem_node->specified) + continue; + + while ((pos = virBitmapNextSetBit(mem_node->nodemask, pos)) >= 0) { + if (virBitmapGetBit(def->numatune.memory.nodemask, pos, &bit) < 0 || + !bit) { + virReportError(VIR_ERR_XML_DETAIL, "%s", + _("memnode nodeset must be subset of the " + "global memory nodeset")); + return -1; + } + } + } + } /* check for emulator and create a default one if needed */ if (!def->emulator && !(def->emulator = virDomainDefGetDefaultEmulator(def, caps))) diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index 2b852eb..d838ad2 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -8613,6 +8613,16 @@ qemuDomainSetNumaParamsLive(virDomainObjPtr vm, goto 
cleanup; } + for (i = 0; i < def->numatune->nmem_nodes; i++) { + if (def->numatune->mem_nodes[i].specified) { + virReportError(VIR_ERR_OPERATION_INVALID, "%s", + _("change of nodeset for running domain " + "with per guest NUMA node numatune settings " + "is not supported")); + goto cleanup; + } + } + /* Get existing nodeset values */ if (virCgroupGetCpusetMems(priv->cgroup, &nodeset_str) < 0 || virBitmapParse(nodeset_str, 0, &temp_nodeset, @@ -8853,6 +8863,8 @@ qemuDomainGetNumaParameters(virDomainPtr dom, } } + /* ASDF: <memnode> settings are not reflected here! yet */ + for (i = 0; i < QEMU_NB_NUMA_PARAM && i < *nparams; i++) { virMemoryParameterPtr param = ¶ms[i]; diff --git a/src/util/virnuma.h b/src/util/virnuma.h index 8464b19..5e7608d 100644 --- a/src/util/virnuma.h +++ b/src/util/virnuma.h @@ -1,7 +1,7 @@ /* * virnuma.h: helper APIs for managing numa * - * Copyright (C) 2011-2013 Red Hat, Inc. + * Copyright (C) 2011-2014 Red Hat, Inc. * * This library is free software; you can redistribute it and/or * modify it under the terms of the GNU Lesser General Public @@ -43,9 +43,17 @@ typedef virNumaTuneDef *virNumaTuneDefPtr; struct _virNumaTuneDef { struct { virBitmapPtr nodemask; - int mode; + int mode; /* enum virDomainNumatuneMemMode */ int placement_mode; /* enum virNumaTuneMemPlacementMode */ - } memory; + } memory; /* pinning for all the memory */ + + struct mem_node { + bool specified; + unsigned int nodeid; + virBitmapPtr nodemask; + int mode; + } *mem_nodes; /* pinning per guest's NUMA node */ + size_t nmem_nodes; /* Future NUMA tuning related stuff should go here. */ }; -- 1.9.3

to ease the review of commits to follow. Signed-off-by: Martin Kletzander <mkletzan@redhat.com> --- src/qemu/qemu_command.c | 25 ++++++++++++++----------- 1 file changed, 14 insertions(+), 11 deletions(-) diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index 1d5bce6..9e674a2 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -6541,11 +6541,22 @@ qemuBuildNumaArgStr(const virDomainDef *def, virCommandPtr cmd) for (i = 0; i < def->cpu->ncells; i++) { VIR_FREE(cpumask); + int cellmem = VIR_DIV_UP(def->cpu->cells[i].mem, 1024) + def->cpu->cells[i].mem = cellmem * 1024; + + if (!(cpumask = virBitmapFormat(def->cpu->cells[i].cpumask))) + goto cleanup; + + if (strchr(cpumask, ',')) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("disjoint NUMA cpu ranges are not supported " + "with this QEMU")); + goto cleanup; + } + virCommandAddArg(cmd, "-numa"); virBufferAsprintf(&buf, "node,nodeid=%d", def->cpu->cells[i].cellid); virBufferAddLit(&buf, ",cpus="); - if (!(cpumask = virBitmapFormat(def->cpu->cells[i].cpumask))) - goto cleanup; /* Up through qemu 1.4, -numa does not accept a cpus * argument any more complex than start-stop. @@ -6553,16 +6564,8 @@ qemuBuildNumaArgStr(const virDomainDef *def, virCommandPtr cmd) * XXX For qemu 1.5, the syntax has not yet been decided; * but when it is, we need a capability bit and * translation of our cpumask into the qemu syntax. */ - if (strchr(cpumask, ',')) { - virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", - _("disjoint NUMA cpu ranges are not supported " - "with this QEMU")); - goto cleanup; - } virBufferAdd(&buf, cpumask, -1); - def->cpu->cells[i].mem = VIR_DIV_UP(def->cpu->cells[i].mem, - 1024) * 1024; - virBufferAsprintf(&buf, ",mem=%d", def->cpu->cells[i].mem / 1024); + virBufferAsprintf(&buf, ",mem=%d", cellmem); if (virBufferError(&buf)) { virReportOOMError(); -- 1.9.3

The numa patch series in qemu adds "query-memdev" command by which we can tell whether we can use memory objects. Signed-off-by: Martin Kletzander <mkletzan@redhat.com> --- src/qemu/qemu_capabilities.c | 2 ++ src/qemu/qemu_capabilities.h | 1 + 2 files changed, 3 insertions(+) diff --git a/src/qemu/qemu_capabilities.c b/src/qemu/qemu_capabilities.c index 08c3d04..4d4f024 100644 --- a/src/qemu/qemu_capabilities.c +++ b/src/qemu/qemu_capabilities.c @@ -256,6 +256,7 @@ VIR_ENUM_IMPL(virQEMUCaps, QEMU_CAPS_LAST, "usb-kbd", /* 165 */ "host-pci-multidomain", "msg-timestamp", + "memdev", ); @@ -1382,6 +1383,7 @@ struct virQEMUCapsStringFlags virQEMUCapsCommands[] = { { "blockdev-snapshot-sync", QEMU_CAPS_DISK_SNAPSHOT }, { "add-fd", QEMU_CAPS_ADD_FD }, { "nbd-server-start", QEMU_CAPS_NBD_SERVER }, + { "query-memdev", QEMU_CAPS_OBJECT_MEMORY }, }; struct virQEMUCapsStringFlags virQEMUCapsEvents[] = { diff --git a/src/qemu/qemu_capabilities.h b/src/qemu/qemu_capabilities.h index d755caa..a3461e1 100644 --- a/src/qemu/qemu_capabilities.h +++ b/src/qemu/qemu_capabilities.h @@ -206,6 +206,7 @@ enum virQEMUCapsFlags { QEMU_CAPS_DEVICE_USB_KBD = 165, /* -device usb-kbd */ QEMU_CAPS_HOST_PCI_MULTIDOMAIN = 166, /* support domain > 0 in host pci address */ QEMU_CAPS_MSG_TIMESTAMP = 167, /* -msg timestamp */ + QEMU_CAPS_OBJECT_MEMORY = 168, /* -object memory */ QEMU_CAPS_LAST, /* this must always be the last item */ }; -- 1.9.3
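(A note on how the probing works: the new entry in virQEMUCapsCommands means the capability bit gets set when "query-memdev" shows up in the list returned by "query-commands" on the capabilities-probing monitor. A minimal QMP exchange, trimmed for illustration, would look something like:

  -> { "execute": "query-commands" }
  <- { "return": [ { "name": "query-memdev" }, { "name": "query-commands" }, ... ] }

so no extra monitor round-trip is needed beyond what we already do.)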

Currently, we only bind the whole QEMU domain to memory nodes specified in nodemask altogether. That, however, doesn't make much sense when one wants to control from where the memory for particular guest nodes should be allocated. QEMU allows us to do that by specifying 'host-nodes' parameter for the 'memory-ram' object, so let's use that. Signed-off-by: Martin Kletzander <mkletzan@redhat.com> --- src/conf/domain_conf.c | 2 +- src/qemu/qemu_cgroup.c | 2 + src/qemu/qemu_command.c | 141 +++++++++++++++++++-- src/qemu/qemu_command.h | 3 +- src/qemu/qemu_driver.c | 6 +- src/qemu/qemu_process.c | 3 +- .../qemuxml2argv-numatune-auto-prefer.args | 6 + .../qemuxml2argv-numatune-auto-prefer.xml | 29 +++++ .../qemuxml2argv-numatune-auto.args | 6 + .../qemuxml2argv-numatune-auto.xml | 26 ++++ .../qemuxml2argv-numatune-memnode-nocpu.xml | 25 ++++ .../qemuxml2argv-numatune-memnodes-problematic.xml | 31 +++++ .../qemuxml2argv-numatune-memnodes.args | 8 ++ .../qemuxml2argv-numatune-memnodes.xml | 31 +++++ .../qemuxml2argv-numatune-prefer.args | 6 + .../qemuxml2argv-numatune-prefer.xml | 29 +++++ tests/qemuxml2argvtest.c | 50 +++++--- tests/qemuxml2xmltest.c | 1 + tests/qemuxmlnstest.c | 2 +- 19 files changed, 373 insertions(+), 34 deletions(-) create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-numatune-auto-prefer.args create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-numatune-auto-prefer.xml create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-numatune-auto.args create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-numatune-auto.xml create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-numatune-memnode-nocpu.xml create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-numatune-memnodes-problematic.xml create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-numatune-memnodes.args create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-numatune-memnodes.xml create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-numatune-prefer.args create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-numatune-prefer.xml diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 4818cfb..b77914e 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -17524,7 +17524,7 @@ virDomainDefFormatInternal(virDomainDefPtr def, goto error; } virBufferAsprintf(buf, - "<memnode nodeid='%zu' mode='%s' nodeset='%s'/>\n", + "<memnode cellid='%zu' mode='%s' nodeset='%s'/>\n", i, mode, nodemask); VIR_FREE(nodemask); } diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c index b1bfb5a..ce03a4f 100644 --- a/src/qemu/qemu_cgroup.c +++ b/src/qemu/qemu_cgroup.c @@ -588,6 +588,8 @@ qemuSetupCpusetCgroup(virDomainObjPtr vm, if (!virCgroupHasController(priv->cgroup, VIR_CGROUP_CONTROLLER_CPUSET)) return 0; + /* ASDF: per-node pinnings are not reflected here */ + if ((vm->def->numatune.memory.nodemask || (vm->def->numatune.memory.placement_mode == VIR_NUMA_TUNE_MEM_PLACEMENT_MODE_AUTO)) && diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index 9e674a2..a913101 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -6531,21 +6531,99 @@ qemuBuildSmpArgStr(const virDomainDef *def, return virBufferContentAndReset(&buf); } +static const char * +qemuNumaPolicyToString(virDomainNumatuneMemMode mode) +{ + const char *policy = NULL; + + switch (mode) { + case VIR_DOMAIN_NUMATUNE_MEM_STRICT: + policy = "bind"; + break; + + case VIR_DOMAIN_NUMATUNE_MEM_PREFERRED: + policy = "preferred"; + break; + + case VIR_DOMAIN_NUMATUNE_MEM_INTERLEAVE: + policy = "interleave"; + break; + + case 
VIR_DOMAIN_NUMATUNE_MEM_LAST: + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("Cannot translate memory policy")); + break; + } + + return policy; +} + static int -qemuBuildNumaArgStr(const virDomainDef *def, virCommandPtr cmd) +qemuBuildNumaArgStr(const virDomainDef *def, + virCommandPtr cmd, + virQEMUDriverPtr driver, + virQEMUCapsPtr qemuCaps, + virBitmapPtr autoNodemask) { + int ret = -1; size_t i; - virBuffer buf = VIR_BUFFER_INITIALIZER; char *cpumask = NULL; - int ret = -1; + virCapsPtr caps = virQEMUDriverGetCapabilities(driver, false); + virBuffer buf = VIR_BUFFER_INITIALIZER; + char *nodemask = NULL; + char *globalNodemask = NULL; + const char *globalPolicy = qemuNumaPolicyToString(def->numatune.memory.mode); + + if (!globalPolicy) + goto cleanup; + + /* + * The @autoNodemask parameter means there is numa placement advise we + * should use if possible. If it's NULL, then we ought to format our + * own nodemask from the numatune specification. Let's hide it from + * most of the code in this function for the sake of simplification. + */ + if (autoNodemask || def->numatune.memory.nodemask) { + + if (autoNodemask) + globalNodemask = virBitmapFormat(autoNodemask); + else + globalNodemask = virBitmapFormat(def->numatune.memory.nodemask); + + if (!globalNodemask) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("Unable to format nodemask")); + goto cleanup; + } + } for (i = 0; i < def->cpu->ncells; i++) { + const char *policy = NULL; + int cellmem = VIR_DIV_UP(def->cpu->cells[i].mem, 1024); + + VIR_FREE(nodemask); + if (def->numatune.mem_nodes && + def->numatune.mem_nodes[i].specified) { + policy = qemuNumaPolicyToString(def->numatune.mem_nodes[i].mode); + if (!policy) + goto cleanup; + + nodemask = virBitmapFormat(def->numatune.mem_nodes[i].nodemask); + if (!nodemask) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("Unable to format nodemask")); + goto cleanup; + } + } + VIR_FREE(cpumask); - int cellmem = VIR_DIV_UP(def->cpu->cells[i].mem, 1024) def->cpu->cells[i].mem = cellmem * 1024; - if (!(cpumask = virBitmapFormat(def->cpu->cells[i].cpumask))) + if (!(cpumask = virBitmapFormat(def->cpu->cells[i].cpumask))) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("Unable to format nodemask")); goto cleanup; + } if (strchr(cpumask, ',')) { virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", @@ -6554,6 +6632,38 @@ qemuBuildNumaArgStr(const virDomainDef *def, virCommandPtr cmd) goto cleanup; } + if (virQEMUCapsGet(qemuCaps, QEMU_CAPS_OBJECT_MEMORY)) { + virBufferAsprintf(&buf, "memory-ram,size=%dM,id=ram-node%d", + cellmem, def->cpu->cells[i].cellid); + + if (nodemask) { + if (strchr(nodemask, ',')) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("disjoint NUMA node ranges are not supported " + "with this QEMU")); + goto cleanup; + } + + virBufferEscape(&buf, ',', ",", ",host-nodes=%s", nodemask); + virBufferAsprintf(&buf, ",policy=%s", policy); + + } else if (globalNodemask) { + + if (strchr(globalNodemask, ',')) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("disjoint NUMA node ranges are not supported " + "with this QEMU")); + goto cleanup; + } + + virBufferEscape(&buf, ',', ",", ",host-nodes=%s", globalNodemask); + virBufferAsprintf(&buf, ",policy=%s", globalPolicy); + } + + virCommandAddArg(cmd, "-object"); + virCommandAddArgBuffer(cmd, &buf); + } + virCommandAddArg(cmd, "-numa"); virBufferAsprintf(&buf, "node,nodeid=%d", def->cpu->cells[i].cellid); virBufferAddLit(&buf, ",cpus="); @@ -6565,7 +6675,13 @@ qemuBuildNumaArgStr(const virDomainDef *def, 
virCommandPtr cmd) * but when it is, we need a capability bit and * translation of our cpumask into the qemu syntax. */ virBufferAdd(&buf, cpumask, -1); - virBufferAsprintf(&buf, ",mem=%d", cellmem); + + if (virQEMUCapsGet(qemuCaps, QEMU_CAPS_OBJECT_MEMORY)) { + virBufferAsprintf(&buf, ",memdev=ram-node%d", + def->cpu->cells[i].cellid); + } else { + virBufferAsprintf(&buf, ",mem=%d", cellmem); + } if (virBufferError(&buf)) { virReportOOMError(); @@ -6577,6 +6693,10 @@ qemuBuildNumaArgStr(const virDomainDef *def, virCommandPtr cmd) ret = 0; cleanup: + if (!autoNodemask) + VIR_FREE(globalNodemask); + VIR_FREE(nodemask); + virObjectUnref(caps); VIR_FREE(cpumask); virBufferFreeAndReset(&buf); return ret; @@ -7215,7 +7335,8 @@ qemuBuildCommandLine(virConnectPtr conn, virDomainSnapshotObjPtr snapshot, virNetDevVPortProfileOp vmop, qemuBuildCommandLineCallbacksPtr callbacks, - bool standalone) + bool standalone, + virBitmapPtr nodemask) { virErrorPtr originalError = NULL; size_t i, j; @@ -7399,9 +7520,9 @@ qemuBuildCommandLine(virConnectPtr conn, virCommandAddArg(cmd, smp); VIR_FREE(smp); - if (def->cpu && def->cpu->ncells) - if (qemuBuildNumaArgStr(def, cmd) < 0) - goto error; + if (def->cpu && def->cpu->ncells && + qemuBuildNumaArgStr(def, cmd, driver, qemuCaps, nodemask) < 0) + goto error; if (virQEMUCapsGet(qemuCaps, QEMU_CAPS_UUID)) virCommandAddArgList(cmd, "-uuid", uuid, NULL); diff --git a/src/qemu/qemu_command.h b/src/qemu/qemu_command.h index afbd6ff..4856d74 100644 --- a/src/qemu/qemu_command.h +++ b/src/qemu/qemu_command.h @@ -78,7 +78,8 @@ virCommandPtr qemuBuildCommandLine(virConnectPtr conn, virDomainSnapshotObjPtr current_snapshot, virNetDevVPortProfileOp vmop, qemuBuildCommandLineCallbacksPtr callbacks, - bool forXMLToArgv) + bool standalone, + virBitmapPtr nodemask) ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(11); /* Generate '-device' string for chardev device */ diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index d838ad2..197f13c 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -5987,7 +5987,7 @@ static char *qemuConnectDomainXMLToNative(virConnectPtr conn, NULL, -1, NULL, VIR_NETDEV_VPORT_PROFILE_OP_NO_OP, &buildCommandLineCallbacks, - true))) + true, NULL))) goto cleanup; ret = virCommandToString(cmd); @@ -8613,8 +8613,8 @@ qemuDomainSetNumaParamsLive(virDomainObjPtr vm, goto cleanup; } - for (i = 0; i < def->numatune->nmem_nodes; i++) { - if (def->numatune->mem_nodes[i].specified) { + for (i = 0; i < vm->def->numatune.nmem_nodes; i++) { + if (vm->def->numatune.mem_nodes[i].specified) { virReportError(VIR_ERR_OPERATION_INVALID, "%s", _("change of nodeset for running domain " "with per guest NUMA node numatune settings " diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 124fe28..46d0010 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -3910,7 +3910,8 @@ int qemuProcessStart(virConnectPtr conn, if (!(cmd = qemuBuildCommandLine(conn, driver, vm->def, priv->monConfig, priv->monJSON, priv->qemuCaps, migrateFrom, stdin_fd, snapshot, vmop, - &buildCommandLineCallbacks, false))) + &buildCommandLineCallbacks, false, + nodemask))) goto cleanup; /* now that we know it is about to start call the hook if present */ diff --git a/tests/qemuxml2argvdata/qemuxml2argv-numatune-auto-prefer.args b/tests/qemuxml2argvdata/qemuxml2argv-numatune-auto-prefer.args new file mode 100644 index 0000000..d797088 --- /dev/null +++ b/tests/qemuxml2argvdata/qemuxml2argv-numatune-auto-prefer.args @@ -0,0 +1,6 @@ +LC_ALL=C PATH=/bin 
HOME=/home/test USER=test LOGNAME=test \ +QEMU_AUDIO_DRV=none /usr/bin/kvm -S -M pc -m 64 -smp 1 -object \ +memory-ram,size=64M,id=ram-node0 -numa \ +node,nodeid=0,cpus=0,memdev=ram-node0 -nographic -monitor \ +unix:/tmp/test-monitor,server,nowait -no-acpi -boot c -usb -net none \ +-serial none -parallel none diff --git a/tests/qemuxml2argvdata/qemuxml2argv-numatune-auto-prefer.xml b/tests/qemuxml2argvdata/qemuxml2argv-numatune-auto-prefer.xml new file mode 100644 index 0000000..63f0d1f --- /dev/null +++ b/tests/qemuxml2argvdata/qemuxml2argv-numatune-auto-prefer.xml @@ -0,0 +1,29 @@ +<domain type='qemu'> + <name>QEMUGuest</name> + <uuid>9f4b6512-e73a-4a25-93e8-5307802821ce</uuid> + <memory unit='KiB'>65536</memory> + <currentMemory unit='KiB'>65536</currentMemory> + <vcpu placement='auto'>1</vcpu> + <numatune> + <memory mode='preferred'/> + </numatune> + <os> + <type arch='x86_64' machine='pc'>hvm</type> + <boot dev='hd'/> + </os> + <cpu> + <numa> + <cell id='0' cpus='0' memory='65536'/> + </numa> + </cpu> + <clock offset='utc'/> + <on_poweroff>destroy</on_poweroff> + <on_reboot>restart</on_reboot> + <on_crash>destroy</on_crash> + <devices> + <emulator>/usr/bin/kvm</emulator> + <controller type='usb' index='0'/> + <controller type='pci' index='0' model='pci-root'/> + <memballoon model='virtio'/> + </devices> +</domain> diff --git a/tests/qemuxml2argvdata/qemuxml2argv-numatune-auto.args b/tests/qemuxml2argvdata/qemuxml2argv-numatune-auto.args new file mode 100644 index 0000000..7287f4f --- /dev/null +++ b/tests/qemuxml2argvdata/qemuxml2argv-numatune-auto.args @@ -0,0 +1,6 @@ +LC_ALL=C PATH=/bin HOME=/home/test USER=test LOGNAME=test \ +QEMU_AUDIO_DRV=none /usr/bin/kvm -S -M pc -m 64 -smp 1 -object \ +memory-ram,size=64M,id=ram-node0,host-nodes=4-5,policy=bind -numa \ +node,nodeid=0,cpus=0,memdev=ram-node0 -nographic -monitor \ +unix:/tmp/test-monitor,server,nowait -no-acpi -boot c -usb -net none \ +-serial none -parallel none diff --git a/tests/qemuxml2argvdata/qemuxml2argv-numatune-auto.xml b/tests/qemuxml2argvdata/qemuxml2argv-numatune-auto.xml new file mode 100644 index 0000000..8b9c90d --- /dev/null +++ b/tests/qemuxml2argvdata/qemuxml2argv-numatune-auto.xml @@ -0,0 +1,26 @@ +<domain type='qemu'> + <name>QEMUGuest</name> + <uuid>9f4b6512-e73a-4a25-93e8-5307802821ce</uuid> + <memory unit='KiB'>65536</memory> + <currentMemory unit='KiB'>65536</currentMemory> + <vcpu placement='auto'>1</vcpu> + <os> + <type arch='x86_64' machine='pc'>hvm</type> + <boot dev='hd'/> + </os> + <cpu> + <numa> + <cell id='0' cpus='0' memory='65536'/> + </numa> + </cpu> + <clock offset='utc'/> + <on_poweroff>destroy</on_poweroff> + <on_reboot>restart</on_reboot> + <on_crash>destroy</on_crash> + <devices> + <emulator>/usr/bin/kvm</emulator> + <controller type='usb' index='0'/> + <controller type='pci' index='0' model='pci-root'/> + <memballoon model='virtio'/> + </devices> +</domain> diff --git a/tests/qemuxml2argvdata/qemuxml2argv-numatune-memnode-nocpu.xml b/tests/qemuxml2argvdata/qemuxml2argv-numatune-memnode-nocpu.xml new file mode 100644 index 0000000..7b0e248 --- /dev/null +++ b/tests/qemuxml2argvdata/qemuxml2argv-numatune-memnode-nocpu.xml @@ -0,0 +1,25 @@ +<domain type='kvm'> + <name>QEMUGuest</name> + <uuid>9f4b6512-e73a-4a25-93e8-5307802821ce</uuid> + <memory unit='KiB'>65536</memory> + <currentMemory unit='KiB'>65536</currentMemory> + <vcpu placement='static'>1</vcpu> + <numatune> + <memory mode='strict' nodeset='0,2'/> + <memnode cellid='0' mode='interleave' nodeset='3'/> + </numatune> + <os> + <type 
arch='x86_64' machine='pc'>hvm</type> + <boot dev='hd'/> + </os> + <clock offset='utc'/> + <on_poweroff>destroy</on_poweroff> + <on_reboot>restart</on_reboot> + <on_crash>destroy</on_crash> + <devices> + <emulator>/usr/bin/kvm</emulator> + <controller type='usb' index='0'/> + <controller type='pci' index='0' model='pci-root'/> + <memballoon model='virtio'/> + </devices> +</domain> diff --git a/tests/qemuxml2argvdata/qemuxml2argv-numatune-memnodes-problematic.xml b/tests/qemuxml2argvdata/qemuxml2argv-numatune-memnodes-problematic.xml new file mode 100644 index 0000000..390f6af --- /dev/null +++ b/tests/qemuxml2argvdata/qemuxml2argv-numatune-memnodes-problematic.xml @@ -0,0 +1,31 @@ +<domain type='qemu'> + <name>QEMUGuest</name> + <uuid>9f4b6512-e73a-4a25-93e8-5307802821ce</uuid> + <memory unit='KiB'>65536</memory> + <currentMemory unit='KiB'>65536</currentMemory> + <vcpu placement='static'>2</vcpu> + <numatune> + <memory mode='strict' nodeset='0-2'/> + <memnode cellid='0' mode='preferred' nodeset='3'/> + </numatune> + <os> + <type arch='x86_64' machine='pc'>hvm</type> + <boot dev='hd'/> + </os> + <cpu> + <numa> + <cell id='0' cpus='0' memory='32768'/> + <cell id='1' cpus='1' memory='32768'/> + </numa> + </cpu> + <clock offset='utc'/> + <on_poweroff>destroy</on_poweroff> + <on_reboot>restart</on_reboot> + <on_crash>destroy</on_crash> + <devices> + <emulator>/usr/bin/kvm</emulator> + <controller type='usb' index='0'/> + <controller type='pci' index='0' model='pci-root'/> + <memballoon model='virtio'/> + </devices> +</domain> diff --git a/tests/qemuxml2argvdata/qemuxml2argv-numatune-memnodes.args b/tests/qemuxml2argvdata/qemuxml2argv-numatune-memnodes.args new file mode 100644 index 0000000..47db091 --- /dev/null +++ b/tests/qemuxml2argvdata/qemuxml2argv-numatune-memnodes.args @@ -0,0 +1,8 @@ +LC_ALL=C PATH=/bin HOME=/home/test USER=test LOGNAME=test \ +QEMU_AUDIO_DRV=none /usr/bin/kvm -S -M pc -m 64 -smp 2 -object \ +memory-ram,size=32M,id=ram-node0,host-nodes=3,policy=preferred -numa \ +node,nodeid=0,cpus=0,memdev=ram-node0 -object \ +memory-ram,size=32M,id=ram-node1,host-nodes=0-3,policy=bind -numa \ +node,nodeid=1,cpus=1,memdev=ram-node1 -nographic -monitor \ +unix:/tmp/test-monitor,server,nowait -no-acpi -boot c -usb -net none \ +-serial none -parallel none diff --git a/tests/qemuxml2argvdata/qemuxml2argv-numatune-memnodes.xml b/tests/qemuxml2argvdata/qemuxml2argv-numatune-memnodes.xml new file mode 100644 index 0000000..18b00d8 --- /dev/null +++ b/tests/qemuxml2argvdata/qemuxml2argv-numatune-memnodes.xml @@ -0,0 +1,31 @@ +<domain type='qemu'> + <name>QEMUGuest</name> + <uuid>9f4b6512-e73a-4a25-93e8-5307802821ce</uuid> + <memory unit='KiB'>65536</memory> + <currentMemory unit='KiB'>65536</currentMemory> + <vcpu placement='static'>2</vcpu> + <numatune> + <memory mode='strict' nodeset='0-3'/> + <memnode cellid='0' mode='preferred' nodeset='3'/> + </numatune> + <os> + <type arch='x86_64' machine='pc'>hvm</type> + <boot dev='hd'/> + </os> + <cpu> + <numa> + <cell id='0' cpus='0' memory='32768'/> + <cell id='1' cpus='1' memory='32768'/> + </numa> + </cpu> + <clock offset='utc'/> + <on_poweroff>destroy</on_poweroff> + <on_reboot>restart</on_reboot> + <on_crash>destroy</on_crash> + <devices> + <emulator>/usr/bin/kvm</emulator> + <controller type='usb' index='0'/> + <controller type='pci' index='0' model='pci-root'/> + <memballoon model='virtio'/> + </devices> +</domain> diff --git a/tests/qemuxml2argvdata/qemuxml2argv-numatune-prefer.args 
b/tests/qemuxml2argvdata/qemuxml2argv-numatune-prefer.args new file mode 100644 index 0000000..079930a --- /dev/null +++ b/tests/qemuxml2argvdata/qemuxml2argv-numatune-prefer.args @@ -0,0 +1,6 @@ +LC_ALL=C PATH=/bin HOME=/home/test USER=test LOGNAME=test \ +QEMU_AUDIO_DRV=none /usr/bin/kvm -S -M pc -m 64 -smp 1 -object \ +memory-ram,size=64M,id=ram-node0,host-nodes=1-2,policy=preferred -numa \ +node,nodeid=0,cpus=0,memdev=ram-node0 -nographic -monitor \ +unix:/tmp/test-monitor,server,nowait -no-acpi -boot c -usb -net none \ +-serial none -parallel none diff --git a/tests/qemuxml2argvdata/qemuxml2argv-numatune-prefer.xml b/tests/qemuxml2argvdata/qemuxml2argv-numatune-prefer.xml new file mode 100644 index 0000000..58cd12c --- /dev/null +++ b/tests/qemuxml2argvdata/qemuxml2argv-numatune-prefer.xml @@ -0,0 +1,29 @@ +<domain type='qemu'> + <name>QEMUGuest</name> + <uuid>9f4b6512-e73a-4a25-93e8-5307802821ce</uuid> + <memory unit='KiB'>65536</memory> + <currentMemory unit='KiB'>65536</currentMemory> + <vcpu placement='static'>1</vcpu> + <numatune> + <memory mode='preferred'/> + </numatune> + <os> + <type arch='x86_64' machine='pc'>hvm</type> + <boot dev='hd'/> + </os> + <cpu> + <numa> + <cell id='0' cpus='0' memory='65536'/> + </numa> + </cpu> + <clock offset='utc'/> + <on_poweroff>destroy</on_poweroff> + <on_reboot>restart</on_reboot> + <on_crash>destroy</on_crash> + <devices> + <emulator>/usr/bin/kvm</emulator> + <controller type='usb' index='0'/> + <controller type='pci' index='0' model='pci-root'/> + <memballoon model='virtio'/> + </devices> +</domain> diff --git a/tests/qemuxml2argvtest.c b/tests/qemuxml2argvtest.c index bdea8b0..ecc0b43 100644 --- a/tests/qemuxml2argvtest.c +++ b/tests/qemuxml2argvtest.c @@ -263,7 +263,8 @@ static int testCompareXMLToArgvFiles(const char *xml, virQEMUCapsPtr extraFlags, const char *migrateFrom, int migrateFd, - virQemuXML2ArgvTestFlags flags) + virQemuXML2ArgvTestFlags flags, + const char *nodemaskStr) { char *expectargv = NULL; int len; @@ -271,10 +272,15 @@ static int testCompareXMLToArgvFiles(const char *xml, int ret = -1; virDomainDefPtr vmdef = NULL; virDomainChrSourceDef monitor_chr; - virConnectPtr conn; + virConnectPtr conn = NULL; char *log = NULL; virCommandPtr cmd = NULL; size_t i; + virBitmapPtr nodemask = NULL; + + if (nodemaskStr && + virBitmapParse(nodemaskStr, 0, &nodemask, VIR_DOMAIN_CPUMASK_LEN) < 0) + goto out; if (!(conn = virGetConnect())) goto out; @@ -357,7 +363,7 @@ static int testCompareXMLToArgvFiles(const char *xml, (flags & FLAG_JSON), extraFlags, migrateFrom, migrateFd, NULL, VIR_NETDEV_VPORT_PROFILE_OP_NO_OP, - &testCallbacks, false))) { + &testCallbacks, false, nodemask))) { if (!virtTestOOMActive() && (flags & FLAG_EXPECT_FAILURE)) { ret = 0; @@ -404,6 +410,7 @@ static int testCompareXMLToArgvFiles(const char *xml, ret = 0; out: + virBitmapFree(nodemask); VIR_FREE(log); VIR_FREE(expectargv); VIR_FREE(actualargv); @@ -420,6 +427,7 @@ struct testInfo { const char *migrateFrom; int migrateFd; unsigned int flags; + const char *nodemask; }; static int @@ -442,7 +450,7 @@ testCompareXMLToArgvHelper(const void *data) result = testCompareXMLToArgvFiles(xml, args, info->extraFlags, info->migrateFrom, info->migrateFd, - flags); + flags, info->nodemask); cleanup: VIR_FREE(xml); @@ -532,10 +540,10 @@ mymain(void) if (VIR_STRDUP_QUIET(driver.config->spicePassword, "123456") < 0) return EXIT_FAILURE; -# define DO_TEST_FULL(name, migrateFrom, migrateFd, flags, ...) \ +# define DO_TEST_FULL(name, migrateFrom, migrateFd, flags, nodemask, ...) 
\ do { \ static struct testInfo info = { \ - name, NULL, migrateFrom, migrateFd, (flags) \ + name, NULL, migrateFrom, migrateFd, (flags), nodemask \ }; \ if (!(info.extraFlags = virQEMUCapsNew())) \ return EXIT_FAILURE; \ @@ -549,18 +557,18 @@ mymain(void) } while (0) # define DO_TEST(name, ...) \ - DO_TEST_FULL(name, NULL, -1, 0, __VA_ARGS__) + DO_TEST_FULL(name, NULL, -1, 0, NULL, __VA_ARGS__) # define DO_TEST_ERROR(name, ...) \ - DO_TEST_FULL(name, NULL, -1, FLAG_EXPECT_ERROR, __VA_ARGS__) + DO_TEST_FULL(name, NULL, -1, FLAG_EXPECT_ERROR, NULL, __VA_ARGS__) # define DO_TEST_FAILURE(name, ...) \ - DO_TEST_FULL(name, NULL, -1, FLAG_EXPECT_FAILURE, __VA_ARGS__) + DO_TEST_FULL(name, NULL, -1, FLAG_EXPECT_FAILURE, NULL, __VA_ARGS__) # define DO_TEST_PARSE_ERROR(name, ...) \ DO_TEST_FULL(name, NULL, -1, \ FLAG_EXPECT_PARSE_ERROR | FLAG_EXPECT_ERROR, \ - __VA_ARGS__) + NULL, __VA_ARGS__) # define NONE QEMU_CAPS_LAST @@ -1144,13 +1152,13 @@ mymain(void) QEMU_CAPS_PCIDEVICE, QEMU_CAPS_DEVICE, QEMU_CAPS_NODEFCONFIG, QEMU_CAPS_PCI_ROMBAR); - DO_TEST_FULL("restore-v1", "stdio", 7, 0, QEMU_CAPS_MIGRATE_KVM_STDIO); - DO_TEST_FULL("restore-v2", "stdio", 7, 0, QEMU_CAPS_MIGRATE_QEMU_EXEC); - DO_TEST_FULL("restore-v2", "exec:cat", 7, 0, QEMU_CAPS_MIGRATE_QEMU_EXEC); - DO_TEST_FULL("restore-v2-fd", "stdio", 7, 0, QEMU_CAPS_MIGRATE_QEMU_FD); - DO_TEST_FULL("restore-v2-fd", "fd:7", 7, 0, QEMU_CAPS_MIGRATE_QEMU_FD); - DO_TEST_FULL("migrate", "tcp:10.0.0.1:5000", -1, 0, - QEMU_CAPS_MIGRATE_QEMU_TCP); + DO_TEST_FULL("restore-v1", "stdio", 7, 0, NULL, QEMU_CAPS_MIGRATE_KVM_STDIO); + DO_TEST_FULL("restore-v2", "stdio", 7, 0, NULL, QEMU_CAPS_MIGRATE_QEMU_EXEC); + DO_TEST_FULL("restore-v2", "exec:cat", 7, 0, NULL, QEMU_CAPS_MIGRATE_QEMU_EXEC); + DO_TEST_FULL("restore-v2-fd", "stdio", 7, 0, NULL, QEMU_CAPS_MIGRATE_QEMU_FD); + DO_TEST_FULL("restore-v2-fd", "fd:7", 7, 0, NULL, QEMU_CAPS_MIGRATE_QEMU_FD); + DO_TEST_FULL("migrate", "tcp:10.0.0.1:5000", -1, 0, NULL, + QEMU_CAPS_MIGRATE_QEMU_TCP); DO_TEST("qemu-ns", NONE); @@ -1187,7 +1195,12 @@ mymain(void) DO_TEST("cputune", QEMU_CAPS_NAME); DO_TEST("cputune-zero-shares", QEMU_CAPS_NAME); DO_TEST("numatune-memory", NONE); + DO_TEST("numatune-memnodes", QEMU_CAPS_OBJECT_MEMORY); + DO_TEST_PARSE_ERROR("numatune-memnodes-problematic", QEMU_CAPS_OBJECT_MEMORY); + DO_TEST_PARSE_ERROR("numatune-memnode-nocpu", QEMU_CAPS_OBJECT_MEMORY); + DO_TEST("numatune-auto-prefer", QEMU_CAPS_OBJECT_MEMORY); DO_TEST("numatune-auto-nodeset-invalid", NONE); + DO_TEST_FULL("numatune-auto", NULL, -1, 0, "4-5", QEMU_CAPS_OBJECT_MEMORY); DO_TEST("numad", NONE); DO_TEST("numad-auto-vcpu-static-numatune", NONE); DO_TEST("numad-auto-memory-vcpu-cpuset", NONE); @@ -1378,6 +1391,9 @@ mymain(void) DO_TEST("panic", QEMU_CAPS_DEVICE_PANIC, QEMU_CAPS_DEVICE, QEMU_CAPS_NODEFCONFIG); + DO_TEST("panic", QEMU_CAPS_DEVICE_PANIC, + QEMU_CAPS_DEVICE, QEMU_CAPS_NODEFCONFIG); + virObjectUnref(driver.config); virObjectUnref(driver.caps); virObjectUnref(driver.xmlopt); diff --git a/tests/qemuxml2xmltest.c b/tests/qemuxml2xmltest.c index a446ea1..f1f419a 100644 --- a/tests/qemuxml2xmltest.c +++ b/tests/qemuxml2xmltest.c @@ -366,6 +366,7 @@ mymain(void) DO_TEST_DIFFERENT("cpu-numa1"); DO_TEST_DIFFERENT("cpu-numa2"); + DO_TEST("numatune-memnodes"); virObjectUnref(driver.caps); virObjectUnref(driver.xmlopt); diff --git a/tests/qemuxmlnstest.c b/tests/qemuxmlnstest.c index e8f70d6..ec5eec5 100644 --- a/tests/qemuxmlnstest.c +++ b/tests/qemuxmlnstest.c @@ -119,7 +119,7 @@ static int testCompareXMLToArgvFiles(const char 
*xml, vmdef, &monitor_chr, json, extraFlags, migrateFrom, migrateFd, NULL, VIR_NETDEV_VPORT_PROFILE_OP_NO_OP, - &testCallbacks, false))) + &testCallbacks, false, NULL))) goto fail; if (!virtTestOOMActive()) { -- 1.9.3

On Wed, May 28, 2014 at 11:48:36AM +0200, Martin Kletzander wrote:
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c
index 4818cfb..b77914e 100644
--- a/src/conf/domain_conf.c
+++ b/src/conf/domain_conf.c
@@ -17524,7 +17524,7 @@ virDomainDefFormatInternal(virDomainDefPtr def,
                 goto error;
             }
             virBufferAsprintf(buf,
-                              "<memnode nodeid='%zu' mode='%s' nodeset='%s'/>\n",
+                              "<memnode cellid='%zu' mode='%s' nodeset='%s'/>\n",
                               i, mode, nodemask);
             VIR_FREE(nodemask);
         }
Presumably this belongs in an earlier patch in the series.
diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c
index b1bfb5a..ce03a4f 100644
--- a/src/qemu/qemu_cgroup.c
+++ b/src/qemu/qemu_cgroup.c
@@ -588,6 +588,8 @@ qemuSetupCpusetCgroup(virDomainObjPtr vm,
     if (!virCgroupHasController(priv->cgroup, VIR_CGROUP_CONTROLLER_CPUSET))
         return 0;

+    /* ASDF: per-node pinnings are not reflected here */
Are you referring to per-node vCPU pinnings here?

We already have a per-vCPU pin capability in both the XML and the cgroup setup code, and the new <memnode> affinity mask you're adding presumably only applies to memory, not vCPUs, otherwise we'd effectively have two places to set the same info.

Regards,
Daniel

--
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|

On Wed, May 28, 2014 at 11:29:40AM +0100, Daniel P. Berrange wrote:
On Wed, May 28, 2014 at 11:48:36AM +0200, Martin Kletzander wrote:
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c
index 4818cfb..b77914e 100644
--- a/src/conf/domain_conf.c
+++ b/src/conf/domain_conf.c
@@ -17524,7 +17524,7 @@ virDomainDefFormatInternal(virDomainDefPtr def,
                 goto error;
             }
             virBufferAsprintf(buf,
-                              "<memnode nodeid='%zu' mode='%s' nodeset='%s'/>\n",
+                              "<memnode cellid='%zu' mode='%s' nodeset='%s'/>\n",
                               i, mode, nodemask);
             VIR_FREE(nodemask);
         }
Presumably this belongs in an earlier patch in the series.
Oh, yes :) the fixup was on the wrong line in interactive rebase.
diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c
index b1bfb5a..ce03a4f 100644
--- a/src/qemu/qemu_cgroup.c
+++ b/src/qemu/qemu_cgroup.c
@@ -588,6 +588,8 @@ qemuSetupCpusetCgroup(virDomainObjPtr vm,
     if (!virCgroupHasController(priv->cgroup, VIR_CGROUP_CONTROLLER_CPUSET))
         return 0;

+    /* ASDF: per-node pinnings are not reflected here */
Are you referring to per-node vCPU pinnings here?
We already have a per-vCPU pin capability in both the XML and the cgroup setup code, and the new <memnode> affinity mask you're adding presumably applies only to memory, not vCPUs; otherwise we'd effectively have two places to set the same info.
What we are doing now is taking the nodeset from <memory>, putting it into cpuset.mems and doing virCgroupCpuSetInherit() for the underlying cgroups in the hierarchy. What we should do is put the right nodeset into each of them (from the related <memnode>, or from the global <memory>, depending on which was set), but *only* if the mode is 'strict'. However, I'm not sure how to properly handle a mix of <memory> and <memnode> settings with different modes (strict, preferred, interleave).
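[A minimal sketch of the idea described above, not code from the series: push a per-guest-node nodeset into one cgroup's cpuset.mems, but only for strict mode, since preferred/interleave have no cpuset.mems equivalent. The function name is hypothetical; virCgroupSetCpusetMems(), virBitmapFormat(), VIR_FREE() and VIR_DOMAIN_NUMATUNE_MEM_STRICT are existing libvirt helpers (vircgroup.h, virbitmap.h, viralloc.h, domain_conf.h).]

/* Hypothetical helper, illustrative only: apply a per-guest-node nodeset
 * to one vCPU cgroup, but only when the effective mode is 'strict'. */
static int
qemuSetupMemnodeCpusetMems(virCgroupPtr cgroup_vcpu,
                           virBitmapPtr nodeset,  /* from <memnode nodeset=.../> */
                           int mode)              /* virDomainNumatuneMemMode */
{
    char *mems = NULL;
    int ret = -1;

    if (mode != VIR_DOMAIN_NUMATUNE_MEM_STRICT)
        return 0;

    if (!(mems = virBitmapFormat(nodeset)))
        goto cleanup;

    /* Write the formatted nodeset (e.g. "7") into cpuset.mems */
    if (virCgroupSetCpusetMems(cgroup_vcpu, mems) < 0)
        goto cleanup;

    ret = 0;
 cleanup:
    VIR_FREE(mems);
    return ret;
}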

On Wed, May 28, 2014 at 01:44:49PM +0200, Martin Kletzander wrote:
On Wed, May 28, 2014 at 11:29:40AM +0100, Daniel P. Berrange wrote:
On Wed, May 28, 2014 at 11:48:36AM +0200, Martin Kletzander wrote:
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c
index 4818cfb..b77914e 100644
--- a/src/conf/domain_conf.c
+++ b/src/conf/domain_conf.c
@@ -17524,7 +17524,7 @@ virDomainDefFormatInternal(virDomainDefPtr def,
                 goto error;
             }
             virBufferAsprintf(buf,
-                              "<memnode nodeid='%zu' mode='%s' nodeset='%s'/>\n",
+                              "<memnode cellid='%zu' mode='%s' nodeset='%s'/>\n",
                               i, mode, nodemask);
             VIR_FREE(nodemask);
         }
Presumably this belongs in an earlier patch in the series.
Oh, yes :) the fixup was on the wrong line in interactive rebase.
diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c
index b1bfb5a..ce03a4f 100644
--- a/src/qemu/qemu_cgroup.c
+++ b/src/qemu/qemu_cgroup.c
@@ -588,6 +588,8 @@ qemuSetupCpusetCgroup(virDomainObjPtr vm,
     if (!virCgroupHasController(priv->cgroup, VIR_CGROUP_CONTROLLER_CPUSET))
         return 0;

+    /* ASDF: per-node pinnings are not reflected here */
Are you referring to per-node vCPU pinnings here?
We already have a per-vCPU pin capability in both the XML and the cgroup setup code, and the new <memnode> affinity mask you're adding presumably applies only to memory, not vCPUs; otherwise we'd effectively have two places to set the same info.
What we are doing now is taking the nodeset from <memory>, putting it into cpuset.mems and doing virCgroupCpuSetInherit() for the underlying cgroups in the hierarchy. What we should do is put the right nodeset into each of them (from the related <memnode>, or from the global <memory>, depending on which was set), but *only* if the mode is 'strict'. However, I'm not sure how to properly handle a mix of <memory> and <memnode> settings with different modes (strict, preferred, interleave).
Oh, so you're thinking that you want to set the memory allocation policy for the vCPU threads to match the <memnode> that the vCPU is associated with. I'm not convinced that is a good idea, actually. IMHO <memnode> should only affect the placement of the guest RAM block. I'd class memory allocations done by the vCPU threads as being part of the QEMU global memory pool, so I'd expect them to use the <memory> settings, not <memnode>.

Regards, Daniel

On Wed, May 28, 2014 at 11:48:31AM +0200, Martin Kletzander wrote:
Caveats:
- I'm not sure how cpu hotplug is done with guest numa nodes, but if there is a possibility to increase the number of numa nodes (which does not make sense to me from (a) user point of view and (b) our XMLs and APIs), we need to be able to hotplug the ram as well,
AFAIK, you cannot change the NUMA topology once booted. You'd have to add any new CPUs to existing NUMA nodes (assuming they have space). Likewise, I'd expect that any RAM would have to be added to existing defined nodes. Of course, ultimately nothing should stop the user from defining some "empty" NUMA nodes if they want space to add new CPUs beyond the initial setup.
- virDomainGetNumaParameters() now reflects only the /domain/numatune/memory settings, not 'memnode' ones,
Yep, that's fine, though we'll likely want to have new APIs to deal with the guest NUMA node settings.
- virDomainSetNumaParameters() is not allowed when there is some /domain/numatune/memnode parameter as we can query memdev info, but not change it (if I understood the QEMU side correctly),
- when domain is started, cpuset.mems cgroup is not modified for each vcpu, this will be fixed, but the question is how to handle it for non-strict settings [*],
- automatic numad placement can be now used together with memnode settings which IMHO doesn't make any sense, but I was hesitant to disable that in case somebody has a constructive criticism in this area.
IMHO if you're doing fine grained configuration of guest <-> host NUMA nodes, then you're not going to want numad. numad is really serving the use cases where you're lazy and want to ignore NUMA settings in the guest config. IOW, I think it is fine to forbid numad.
- This series alone is broken when used with /domain/memoryBacking/hugepages, because it will still use the memory-ram object, but that will be fixed with Michal's patches on top of this series.
One idea how to solve some of the problems is to say that /domain/numatune/memory is set for the whole domain regardless of what anyone puts in /domain/numatune/memnode. virDomainGetNumaParameters() could be extended to tell the info for all guest numa nodes, although it seems a new API would suit better for this kind of information. But is it really needed when we are not able to modify it live and the information is available in the domain XML?
*) does (or should) this:
...
  <numatune>
    <memory mode='strict' placement='static' nodeset='0-7'/>
    <memnode nodeid='0' mode='preferred' nodeset='7'/>
  </numatune>
...
mean what it looks like it means, that is "in guest node 0, prefer allocating from host node 7 but feel free to allocate from 0-6 as well in case you can't use 7, but never try allocating from host nodes 8-15"?
What we have to remember is that there are two different sets of threads and memory we're dealing with. There are the vCPU threads and the guest RAM allocation as one set, and there are the misc emulator threads and other QEMU memory allocations. The <memnode> elements only apply to guest vCPUs and guest RAM. The <memory> element will still apply policy to the other QEMU emulator threads / RAM allocations.

WRT your question about virDomainSetNumaParameters above - I think the answer to whether that makes sense to allow is dependent on whether it would be able to affect the QEMU emulator threads/RAM without affecting the vCPU threads/guest RAM. We'd likely want a new virDomainSetNumaNodeParameters API to control settings for the <memnode> elements directly.

Regards, Daniel
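[For illustration only, a guess at what the shape of such an API could be; no virDomainSetNumaNodeParameters() exists, it is merely being proposed above, and the name, the cellid argument and the prototype below simply mirror the existing virDomainSetNumaParameters() with an extra guest-cell argument.]

#include <libvirt/libvirt.h>

/* Hypothetical prototype only -- sketch of the API proposed in the thread,
 * addressing the <memnode> settings of one guest NUMA cell directly. */
int
virDomainSetNumaNodeParameters(virDomainPtr domain,
                               unsigned int cellid,          /* guest NUMA cell */
                               virTypedParameterPtr params,
                               int nparams,
                               unsigned int flags);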

On Wed, May 28, 2014 at 11:18:30AM +0100, Daniel P. Berrange wrote:
On Wed, May 28, 2014 at 11:48:31AM +0200, Martin Kletzander wrote:
Caveats:
- I'm not sure how cpu hotplug is done with guest numa nodes, but if there is a possibility to increase the number of numa nodes (which does not make sense to me from (a) user point of view and (b) our XMLs and APIs), we need to be able to hotplug the ram as well,
AFAIK, you cannot change the NUMA topology once booted. You'd have to add any new CPUs to existing NUMA nodes (assuming they have space). Likewise, I'd expect that any RAM would have to be added to existing defined nodes. Of course, ultimately nothing should stop the user from defining some "empty" NUMA nodes if they want space to add new CPUs beyond the initial setup.
- virDomainGetNumaParameters() now reflects only the /domain/numatune/memory settings, not 'memnode' ones,
Yep, that's fine, though we'll likely want to have new APIs to deal with the guest NUMA node settings.
I was thinking about how to extend the current one, but all the ideas seem too "messy".
- virDomainSetNumaParameters() is not allowed when there is some /domain/numatune/memnode parameter as we can query memdev info, but not change it (if I understood the QEMU side correctly),
- when domain is started, cpuset.mems cgroup is not modified for each vcpu, this will be fixed, but the question is how to handle it for non-strict settings [*],
- automatic numad placement can be now used together with memnode settings which IMHO doesn't make any sense, but I was hesitant to disable that in case somebody has a constructive criticism in this area.
IMHO if you're doing fine grained configuration of guest <-> host NUMA nodes, then you're not going to want numad. numad is really serving the use cases where you're lazy and want to ignore NUMA settings in the guest config. IOW, I think it is fine to forbid numad.
Good that we're on the same page here; I just wanted to make sure before jumping to conclusions. Most of the use cases where one depends on numad are not very thought-through, I guess.
- This series alone is broken when used with /domain/memoryBacking/hugepages, because it will still use the memory-ram object, but that will be fixed with Michal's patches on top of this series.
One idea how to solve some of the problems is to say that /domain/numatune/memory is set for the whole domain regardless of what anyone puts in /domain/numatune/memnode. virDomainGetNumaParameters() could be extended to tell the info for all guest numa nodes, although it seems a new API would suit better for this kind of information. But is it really needed when we are not able to modify it live and the information is available in the domain XML?
*) does (or should) this:
...
  <numatune>
    <memory mode='strict' placement='static' nodeset='0-7'/>
    <memnode nodeid='0' mode='preferred' nodeset='7'/>
  </numatune>
...
mean what it looks like it means, that is "in guest node 0, prefer allocating from host node 7 but feel free to allocate from 0-6 as well in case you can't use 7, but never try allocating from host nodes 8-15"?
What we have to remember is that there are two different sets of threads and memory we're dealing with. There are the vCPU threads and the guest RAM allocation as one set, and there are the misc emulator threads and other QEMU memory allocations. The <memnode> elements only apply to guest vCPUs and guest RAM. The <memory> element will still apply policy to the other QEMU emulator threads / RAM allocations.
Yes, my sense is that the emulator should be set up according to <memory> (as is done now, IIUC). That would mean we need to restrict cpuset.cpus for it as well (in case there is no <emulatorpin>). We also need to properly error out on configurations that are bound to fail (strict memory mode with memory nodes and pinned CPUs coming from different host NUMA nodes). This should be done for both emulator threads and vCPU threads (it is not done now, btw).
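[A sketch only, not code from the series, of the kind of check meant above: walk a vCPU's pinning bitmap and refuse a strict config whose pinned host CPUs do not sit on the host nodes named by the cell's <memnode>. hostNodeOfCpu() is a hypothetical helper (e.g. something built on the virNuma utilities); virBitmapNextSetBit() and virBitmapIsBitSet() are existing libvirt bitmap functions.]

/* Illustrative consistency check for mode='strict': every host CPU the
 * vCPU may run on must belong to a host node the cell's memory is bound to. */
static int
checkStrictMemnodeConsistency(virBitmapPtr vcpuPin,     /* host CPUs the vCPU may run on */
                              virBitmapPtr memNodeset)  /* host nodes from <memnode nodeset=.../> */
{
    ssize_t cpu = -1;

    while ((cpu = virBitmapNextSetBit(vcpuPin, cpu)) >= 0) {
        int node = hostNodeOfCpu(cpu);  /* hypothetical helper */

        if (node < 0 || !virBitmapIsBitSet(memNodeset, node))
            return -1;  /* caller reports a configuration error */
    }
    return 0;
}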
WRT your question about virDomainSetNumaParameters above - I think the answer to whether that makes sense to allow is dependent on whether it would be able to affect the QEMU emulator threads/RAM without affecting the vCPU threads/guest RAM. We'd likely want a new virDomainSetNumaNodeParameters API to control settings for the <memnode> elements directly.
Yes. It also depends on whether we'll be able to modify the guest node settings in QEMU (if not, then we have to somehow unify when we use host-nodes and when not). Another question that came to my mind just now is whether we want to expose cpuset.memory_migrate as a parameter too (or set it to some default value), since it will affect performance a lot when cpuset.mems is changed, won't it?

Thank you for the responses,
Martin
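[For reference, cpuset.memory_migrate is a plain cgroup v1 cpuset knob: when set to 1, pages already allocated by tasks in the cgroup are migrated to the new nodes whenever cpuset.mems changes, which is the potentially expensive operation mentioned above. A minimal sketch of flipping it directly through the filesystem; the path layout is an assumption for illustration, and libvirt itself would go through its vircgroup helpers instead.]

#include <stdio.h>

/* Illustrative only: enable/disable page migration on cpuset.mems changes
 * for one cgroup.  'cgroup_path' is an assumed cpuset-controller directory,
 * e.g. the per-vCPU directory libvirt creates for the domain. */
static int
set_memory_migrate(const char *cgroup_path, int enable)
{
    char path[4096];
    FILE *fp;

    snprintf(path, sizeof(path), "%s/cpuset.memory_migrate", cgroup_path);
    if (!(fp = fopen(path, "w")))
        return -1;

    fprintf(fp, "%d\n", enable);
    return fclose(fp) == 0 ? 0 : -1;
}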
participants (2)
- Daniel P. Berrange
- Martin Kletzander