[libvirt] [RFC PATCH 0/3] Guest NUMA topology support - v0

Hi,
I discussed the possibilities of adding NUMA topology XML specification support for guests here some time back. Since my latest proposal (http://permalink.gmane.org/gmane.comp.emulators.libvirt/44626) didn't get any response, I am posting a prototype implementation that supports specifying NUMA topology for QEMU guests.
- The implementation is based on the last proposal I listed above.
- The implementation is for QEMU only.
- The patchset has gone through extremely light testing; I have only tested booting a 2-socket, 2-core, 2-thread QEMU guest.
- I haven't yet covered all the corner cases and haven't run the libvirt tests after this patchset. For example, there is no code to validate that the CPU combinations specified by <topology> and <numa> match each other. I plan to cover all of these after we freeze the specification itself.
Regards,
Bharata.
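As a rough illustration of the validation mentioned in the last point above (this is not part of the patchset): one possible check, assuming that "match" means every CPU implied by <topology> (sockets * cores * threads) belongs to exactly one NUMA node and that no node names a CPU beyond that range. The struct, the function name and MAX_CPUS are hypothetical; they only mimic the per-node cpumask array used in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical limit for this sketch; the patch uses VIR_DOMAIN_CPUMASK_LEN. */
#define MAX_CPUS 64

/* One guest NUMA node: cpumask[i] is non-zero if CPU i belongs to the node. */
struct numa_node {
    char cpumask[MAX_CPUS];
};

/*
 * Return true if every CPU in [0, sockets*cores*threads) belongs to exactly
 * one node and no node claims a CPU outside that range.
 */
static bool
numa_matches_topology(const struct numa_node *nodes, size_t nnodes,
                      unsigned int sockets, unsigned int cores,
                      unsigned int threads)
{
    size_t ncpus = (size_t)sockets * cores * threads;
    size_t cpu, n;

    assert(nodes != NULL || nnodes == 0);

    if (ncpus == 0 || ncpus > MAX_CPUS)
        return false;

    for (cpu = 0; cpu < MAX_CPUS; cpu++) {
        size_t owners = 0;

        /* Count how many nodes claim this CPU. */
        for (n = 0; n < nnodes; n++)
            if (nodes[n].cpumask[cpu])
                owners++;

        /* CPUs inside the topology need one owner, CPUs outside need none. */
        if (cpu < ncpus ? owners != 1 : owners != 0)
            return false;
    }
    return true;
}
```

With the 2 socket, 2 core, 2 thread guest from the cover letter, two nodes holding CPUs 0-3 and 4-7 would pass this check, while a CPU listed in both nodes or a CPU 8 would fail it.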

Add XML definitions for guest NUMA specifications.
From: Bharata B Rao <bharata@linux.vnet.ibm.com>
NUMA topology for guest is specified as follows:
<cpu>
  ...
  <numa>
    <node cpus='0-3' mems='1024'/>
    <node cpus='4,5,6,7' mems='1024'/>
    <node cpus='8-10,11-12,^12' mems='1024'/>
  </numa>
</cpu>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
 docs/schemas/domaincommon.rng |   32 ++++++++++++++++++++++++++++++++
 1 files changed, 32 insertions(+), 0 deletions(-)
diff --git a/docs/schemas/domaincommon.rng b/docs/schemas/domaincommon.rng
index 4972fac..99b70c3 100644
--- a/docs/schemas/domaincommon.rng
+++ b/docs/schemas/domaincommon.rng
@@ -2252,6 +2252,9 @@
             <zeroOrMore>
               <ref name="cpuFeature"/>
             </zeroOrMore>
+            <optional>
+              <ref name="cpuNuma"/>
+            </optional>
           </interleave>
         </group>
       </choice>
@@ -2312,6 +2315,25 @@
     </element>
   </define>
 
+  <define name="cpuNuma">
+    <element name="numa">
+      <oneOrMore>
+        <ref name="numaNode"/>
+      </oneOrMore>
+    </element>
+  </define>
+
+  <define name="numaNode">
+    <element name="node">
+      <attribute name="cpus">
+        <ref name="Nodecpus"/>
+      </attribute>
+      <attribute name="mems">
+        <ref name="Nodemems"/>
+      </attribute>
+    </element>
+  </define>
+
   <!--
      System information specification:
      Placeholder for system specific informations likes the ones
@@ -2665,4 +2687,14 @@
       <param name="pattern">[a-zA-Z0-9_\.:]+</param>
     </data>
   </define>
+  <define name="Nodecpus">
+    <data type="string">
+      <param name="pattern">([0-9]+(-[0-9]+)?|\^[0-9]+)(,([0-9]+(-[0-9]+)?|\^[0-9]+))*</param>
+    </data>
+  </define>
+  <define name="Nodemems">
+    <data type="unsignedInt">
+      <param name="pattern">[0-9]+</param>
+    </data>
+  </define>
 </grammar>
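To see which cpus strings the Nodecpus pattern above accepts, here is a small stand-alone sketch that applies the same expression with POSIX regcomp()/regexec(). The function name is hypothetical and nothing here is libvirt code. The pattern is copied from the schema, with ^ and $ added because a schema pattern has to match the whole value, whereas regexec() also matches substrings.

```c
#include <assert.h>
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* The Nodecpus pattern from the schema, anchored to match the whole string. */
static const char nodecpus_pattern[] =
    "^([0-9]+(-[0-9]+)?|\\^[0-9]+)(,([0-9]+(-[0-9]+)?|\\^[0-9]+))*$";

/* Return true if the cpus attribute value matches the schema pattern. */
static bool
nodecpus_is_valid(const char *cpus)
{
    regex_t re;
    bool ok;

    assert(cpus != NULL);

    /* REG_NOSUB: only a yes/no answer is needed, no match offsets. */
    if (regcomp(&re, nodecpus_pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return false;
    ok = regexec(&re, cpus, 0, NULL, 0) == 0;
    regfree(&re);
    return ok;
}
```

Under this pattern an exclusion such as ^12 has to be its own comma-separated item, so "8-10,11-12,^12" is accepted while "11-12^12" is not. The pattern checks syntax only: a reversed range like "5-2" still matches, so semantic checks would have to live in the parser.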

Routines to parse <numa> ... </numa>
From: Bharata B Rao <bharata@linux.vnet.ibm.com>
This patch adds routines to parse guest numa XML configuration for qemu.
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
 src/conf/cpu_conf.c     |   48 ++++++++++++++++++++++++
 src/conf/cpu_conf.h     |   11 ++++++
 src/qemu/qemu_command.c |   94 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 153 insertions(+), 0 deletions(-)
diff --git a/src/conf/cpu_conf.c b/src/conf/cpu_conf.c
index 5cecda2..c520025 100644
--- a/src/conf/cpu_conf.c
+++ b/src/conf/cpu_conf.c
@@ -28,6 +28,7 @@
 #include "util.h"
 #include "buf.h"
 #include "cpu_conf.h"
+#include "domain_conf.h"
 
 #define VIR_FROM_THIS VIR_FROM_CPU
 
@@ -67,6 +68,10 @@ virCPUDefFree(virCPUDefPtr def)
         VIR_FREE(def->features[i].name);
     VIR_FREE(def->features);
 
+    for (i = 0 ; i < def->nnodes ; i++)
+        VIR_FREE(def->nodes[i].cpumask);
+    VIR_FREE(def->nodes);
+
     VIR_FREE(def);
 }
 
@@ -289,6 +294,49 @@ virCPUDefParseXML(const xmlNodePtr node,
         def->features[i].policy = policy;
     }
 
+    if (virXPathNode("./numa[1]", ctxt)) {
+        VIR_FREE(nodes);
+        n = virXPathNodeSet("./numa[1]/node", ctxt, &nodes);
+        if (n < 0 || n == 0) {
+            virCPUReportError(VIR_ERR_INTERNAL_ERROR,
+                    "%s", _("NUMA topology defined without NUMA nodes"));
+            goto error;
+        }
+
+        if (VIR_RESIZE_N(def->nodes, def->nnodes_max,
+                         def->nnodes, n) < 0)
+            goto no_memory;
+
+        def->nnodes = n;
+
+        for (i = 0 ; i < n ; i++) {
+            char *cpus;
+            int cpumasklen = VIR_DOMAIN_CPUMASK_LEN;
+            unsigned long ul;
+            int ret;
+
+            def->nodes[i].nodeid = i;
+            cpus = virXMLPropString(nodes[i], "cpus");
+
+            if (VIR_ALLOC_N(def->nodes[i].cpumask, cpumasklen) < 0)
+                goto no_memory;
+
+            if (virDomainCpuSetParse((const char **)&cpus,
+                                     0, def->nodes[i].cpumask,
+                                     cpumasklen) < 0)
+                goto error;
+
+            ret = virXPathULong("string(./numa[1]/node/@mems)",
+                                ctxt, &ul);
+            if (ret < 0) {
+                virCPUReportError(VIR_ERR_INTERNAL_ERROR,
+                    "%s", _("Missing 'mems' attribute in NUMA topology"));
+                goto error;
+            }
+            def->nodes[i].mem = (unsigned int) ul;
+        }
+    }
+
 cleanup:
     VIR_FREE(nodes);
 
diff --git a/src/conf/cpu_conf.h b/src/conf/cpu_conf.h
index 57b85e1..266ec81 100644
--- a/src/conf/cpu_conf.h
+++ b/src/conf/cpu_conf.h
@@ -67,6 +67,14 @@ struct _virCPUFeatureDef {
     int policy;         /* enum virCPUFeaturePolicy */
 };
 
+typedef struct _virNodeDef virNodeDef;
+typedef virNodeDef *virNodeDefPtr;
+struct _virNodeDef {
+    int nodeid;
+    char *cpumask;      /* CPUs that are part of this node */
+    unsigned int mem;   /* Node memory */
+};
+
 typedef struct _virCPUDef virCPUDef;
 typedef virCPUDef *virCPUDefPtr;
 struct _virCPUDef {
@@ -81,6 +89,9 @@ struct _virCPUDef {
     size_t nfeatures;
     size_t nfeatures_max;
     virCPUFeatureDefPtr features;
+    size_t nnodes;
+    size_t nnodes_max;
+    virNodeDefPtr nodes;
 };
 
 
diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
index a13ba71..5b4345e 100644
--- a/src/qemu/qemu_command.c
+++ b/src/qemu/qemu_command.c
@@ -3153,6 +3153,97 @@ qemuBuildSmpArgStr(const virDomainDefPtr def,
     return virBufferContentAndReset(&buf);
 }
 
+static char *
+virParseNodeCPUs(char *cpumask)
+{
+    int i, first, last, ret;
+    char *cpus, *ptr;
+    int cpuSet = 0;
+    int remaining = 128;
+
+    if (VIR_ALLOC_N(cpus, remaining) < 0)
+        return NULL;
+
+    ptr = cpus;
+
+    for (i = 0; i < VIR_DOMAIN_CPUMASK_LEN; i++) {
+        if (cpumask[i]) {
+            if (cpuSet)
+                last = i;
+            else {
+                first = last = i;
+                cpuSet = 1;
+            }
+        } else {
+            if (!cpuSet)
+                continue;
+            if (first == last)
+                ret = snprintf(ptr, remaining, "%d,", first);
+            else
+                ret = snprintf(ptr, remaining, "%d-%d,", first, last);
+            if (ret > remaining)
+                goto error;
+            ptr += ret;
+            remaining -= ret;
+            cpuSet = 0;
+        }
+    }
+
+    if (cpuSet) {
+        if (first == last)
+            ret = snprintf(ptr, remaining, "%d,", first);
+        else
+            ret = snprintf(ptr, remaining, "%d-%d,", first, last);
+        if (ret > remaining)
+            goto error;
+    }
+
+    /* Remove the trailing comma */
+    *(--ptr) = '\0';
+    return cpus;
+
+error:
+    VIR_FREE(cpus);
+    return NULL;
+}
+
+static int
+qemuBuildNumaArgStr(const virDomainDefPtr def, virCommandPtr cmd)
+{
+    int i;
+    char *cpus, *node;
+    virBuffer buf = VIR_BUFFER_INITIALIZER;
+
+    for (i = 0; i < def->cpu->nnodes; i++) {
+        virCommandAddArg(cmd, "-numa");
+        virBufferAsprintf(&buf, "%s", "node");
+        virBufferAsprintf(&buf, ",nodeid=%d", def->cpu->nodes[i].nodeid);
+
+        cpus = virParseNodeCPUs(def->cpu->nodes[i].cpumask);
+        if (!cpus)
+            goto error;
+
+        virBufferAsprintf(&buf, ",cpus=%s", cpus);
+        virBufferAsprintf(&buf, ",mems=%d", def->cpu->nodes[i].mem);
+
+        if (virBufferError(&buf)) {
+            VIR_FREE(cpus);
+            goto error;
+        }
+
+        node = virBufferContentAndReset(&buf);
+        virCommandAddArg(cmd, node);
+
+        VIR_FREE(cpus);
+        VIR_FREE(node);
+    }
+    return 0;
+
+error:
+    virBufferFreeAndReset(&buf);
+    virReportOOMError();
+    return -1;
+}
 
 /*
  * Constructs a argv suitable for launching qemu with config defined
@@ -3319,6 +3410,9 @@ qemuBuildCommandLine(virConnectPtr conn,
     virCommandAddArg(cmd, smp);
     VIR_FREE(smp);
 
+    if (def->cpu->nnodes && qemuBuildNumaArgStr(def, cmd))
+        goto error;
+
     if (qemuCapsGet(qemuCaps, QEMU_CAPS_NAME)) {
         virCommandAddArg(cmd, "-name");
         if (driver->setProcessName &&

On Mon, Oct 03, 2011 at 03:31:35PM +0530, Bharata B Rao wrote:
Routines to parse <numa> ... </numa>
From: Bharata B Rao <bharata@linux.vnet.ibm.com>
This patch adds routines to parse guest numa XML configuration for qemu.
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com> ---
src/conf/cpu_conf.c | 48 ++++++++++++++++++++++++ src/conf/cpu_conf.h | 11 ++++++ src/qemu/qemu_command.c | 94 +++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 153 insertions(+), 0 deletions(-)
diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index a13ba71..5b4345e 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -3153,6 +3153,97 @@ qemuBuildSmpArgStr(const virDomainDefPtr def, return virBufferContentAndReset(&buf); }
+static char * +virParseNodeCPUs(char *cpumask)
This is a rather misleading name, since it is really formatting the argument value. Can you rename this to qemuBuildNumaCPUArgStr?
+{ + int i, first, last, ret; + char *cpus, *ptr; + int cpuSet = 0; + int remaining = 128; + + if (VIR_ALLOC_N(cpus, remaining) < 0) + return NULL; + + ptr = cpus; + + for (i = 0; i < VIR_DOMAIN_CPUMASK_LEN; i++) { + if (cpumask[i]) { + if (cpuSet) + last = i; + else { + first = last = i; + cpuSet = 1; + } + } else { + if (!cpuSet) + continue; + if (first == last) + ret = snprintf(ptr, remaining, "%d,", first); + else + ret = snprintf(ptr, remaining, "%d-%d,", first, last); + if (ret > remaining) + goto error; + ptr += ret; + remaining -= ret; + cpuSet = 0; + } + } + + if (cpuSet) { + if (first == last) + ret = snprintf(ptr, remaining, "%d,", first); + else + ret = snprintf(ptr, remaining, "%d-%d,", first, last); + if (ret > remaining) + goto error; + } + + /* Remove the trailing comma */ + *(--ptr) = '\0'; + return cpus; + +error: + VIR_FREE(cpus); + return NULL;
Using VIR_ALLOC_N + snprintf here is not desirable, when you already have a nice virBufferPtr object in the call that you could use. Just pass the virBufferPtr straight into this method.
+} + +static int +qemuBuildNumaArgStr(const virDomainDefPtr def, virCommandPtr cmd) +{ + int i; + char *cpus, *node; + virBuffer buf = VIR_BUFFER_INITIALIZER; + + for (i = 0; i < def->cpu->nnodes; i++) { + virCommandAddArg(cmd, "-numa"); + virBufferAsprintf(&buf, "%s", "node"); + virBufferAsprintf(&buf, ",nodeid=%d", def->cpu->nodes[i].nodeid); + + cpus = virParseNodeCPUs(def->cpu->nodes[i].cpumask); + if (!cpus) + goto error; + + virBufferAsprintf(&buf, ",cpus=%s", cpus); + virBufferAsprintf(&buf, ",mems=%d", def->cpu->nodes[i].mem); + + if (virBufferError(&buf)) { + VIR_FREE(cpus); + goto error; + } + + node = virBufferContentAndReset(&buf); + virCommandAddArg(cmd, node); + + VIR_FREE(cpus); + VIR_FREE(node); + } + return 0; + +error: + virBufferFreeAndReset(&buf); + virReportOOMError(); + return -1; +}
/* * Constructs a argv suitable for launching qemu with config defined @@ -3319,6 +3410,9 @@ qemuBuildCommandLine(virConnectPtr conn, virCommandAddArg(cmd, smp); VIR_FREE(smp);
+ if (def->cpu->nnodes && qemuBuildNumaArgStr(def, cmd)) + goto error; + if (qemuCapsGet(qemuCaps, QEMU_CAPS_NAME)) { virCommandAddArg(cmd, "-name"); if (driver->setProcessName &&
Looks fine code wise. For the future iterations, it would be good to change the split of patches slightly
- Patch 1 for XML bits: cpu_conf.c, cpu_conf.h, and docs/schemas/domain.rn
- Patch 2 for qemu_command.c, tests/qemuxml2argvtest.c
Regards, Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On Thu, Oct 13, 2011 at 12:50:00PM +0100, Daniel P. Berrange wrote:
On Mon, Oct 03, 2011 at 03:31:35PM +0530, Bharata B Rao wrote:
Routines to parse <numa> ... </numa>
From: Bharata B Rao <bharata@linux.vnet.ibm.com>
This patch adds routines to parse guest numa XML configuration for qemu.
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com> ---
src/conf/cpu_conf.c | 48 ++++++++++++++++++++++++ src/conf/cpu_conf.h | 11 ++++++ src/qemu/qemu_command.c | 94 +++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 153 insertions(+), 0 deletions(-)
diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index a13ba71..5b4345e 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -3153,6 +3153,97 @@ qemuBuildSmpArgStr(const virDomainDefPtr def, return virBufferContentAndReset(&buf); }
+static char * +virParseNodeCPUs(char *cpumask)
This is a rather misleading name, since it is really formatting the argument value. Can you rename this to qemuBuildNumaCPUArgStr?
Sure will do.
+{ + int i, first, last, ret; + char *cpus, *ptr; + int cpuSet = 0; + int remaining = 128; + + if (VIR_ALLOC_N(cpus, remaining) < 0) + return NULL; + + ptr = cpus; + + for (i = 0; i < VIR_DOMAIN_CPUMASK_LEN; i++) { + if (cpumask[i]) { + if (cpuSet) + last = i; + else { + first = last = i; + cpuSet = 1; + } + } else { + if (!cpuSet) + continue; + if (first == last) + ret = snprintf(ptr, remaining, "%d,", first); + else + ret = snprintf(ptr, remaining, "%d-%d,", first, last); + if (ret > remaining) + goto error; + ptr += ret; + remaining -= ret; + cpuSet = 0; + } + } + + if (cpuSet) { + if (first == last) + ret = snprintf(ptr, remaining, "%d,", first); + else + ret = snprintf(ptr, remaining, "%d-%d,", first, last); + if (ret > remaining) + goto error; + } + + /* Remove the trailing comma */ + *(--ptr) = '\0'; + return cpus; + +error: + VIR_FREE(cpus); + return NULL;
Using VIR_ALLOC_N + snprintf here is not desirable, when you already have a nice virBufferPtr object in the call that you could use. Just pass the virBufferPtr straight into this method.
I wanted to use virBufferPtr, but I realized that I need to remove the last comma from the string and couldn't find an easy way to do that, hence I resorted to this method. But I think I can still achieve this by not prepending a comma to the next part of the argument (,mems). Let me see if I can do this cleanly in the next post.
+} + +static int +qemuBuildNumaArgStr(const virDomainDefPtr def, virCommandPtr cmd) +{ + int i; + char *cpus, *node; + virBuffer buf = VIR_BUFFER_INITIALIZER; + + for (i = 0; i < def->cpu->nnodes; i++) { + virCommandAddArg(cmd, "-numa"); + virBufferAsprintf(&buf, "%s", "node"); + virBufferAsprintf(&buf, ",nodeid=%d", def->cpu->nodes[i].nodeid); + + cpus = virParseNodeCPUs(def->cpu->nodes[i].cpumask); + if (!cpus) + goto error; + + virBufferAsprintf(&buf, ",cpus=%s", cpus); + virBufferAsprintf(&buf, ",mems=%d", def->cpu->nodes[i].mem); + + if (virBufferError(&buf)) { + VIR_FREE(cpus); + goto error; + } + + node = virBufferContentAndReset(&buf); + virCommandAddArg(cmd, node); + + VIR_FREE(cpus); + VIR_FREE(node); + } + return 0; + +error: + virBufferFreeAndReset(&buf); + virReportOOMError(); + return -1; +}
/* * Constructs a argv suitable for launching qemu with config defined @@ -3319,6 +3410,9 @@ qemuBuildCommandLine(virConnectPtr conn, virCommandAddArg(cmd, smp); VIR_FREE(smp);
+ if (def->cpu->nnodes && qemuBuildNumaArgStr(def, cmd)) + goto error; + if (qemuCapsGet(qemuCaps, QEMU_CAPS_NAME)) { virCommandAddArg(cmd, "-name"); if (driver->setProcessName &&
Looks fine code wise. For the future iterations, it would be good to change the split of patches slightly
- Patch 1 for XML bits: cpu_conf.c, cpu_conf.h, and docs/schemas/domain.rn - Patch 2 for qemu_command.c, tests/qemuxml2argvtest.c
Sure will rearrange in the next iteration. Thanks for your review. Regards, Bharata.
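For reference, one way to sidestep the trailing-comma problem discussed above is to emit the separator in front of every range except the first, so nothing has to be removed afterwards. The following is only a stand-alone sketch with a hypothetical function name; it writes into a caller-supplied buffer with snprintf() rather than using virBuffer, and it is not the code from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Format the set entries of cpumask[0..masklen) as comma-separated ranges,
 * e.g. "0-3,8,10-11", into out[0..outlen). Returns 0 on success, -1 if the
 * output does not fit.
 */
static int
format_cpu_ranges(const char *cpumask, size_t masklen,
                  char *out, size_t outlen)
{
    size_t used = 0;
    size_t i = 0;
    int first_item = 1;

    assert(cpumask != NULL && out != NULL);

    if (outlen == 0)
        return -1;
    out[0] = '\0';

    while (i < masklen) {
        size_t start, end;
        int ret;

        if (!cpumask[i]) {
            i++;
            continue;
        }

        /* Extend the run of consecutive set CPUs as far as it goes. */
        start = i;
        while (i < masklen && cpumask[i])
            i++;
        end = i - 1;

        /* The separator goes before the item, never after it. */
        if (start == end)
            ret = snprintf(out + used, outlen - used, "%s%zu",
                           first_item ? "" : ",", start);
        else
            ret = snprintf(out + used, outlen - used, "%s%zu-%zu",
                           first_item ? "" : ",", start, end);

        /* snprintf returns the length it wanted; >= means truncation. */
        if (ret < 0 || (size_t)ret >= outlen - used)
            return -1;

        used += (size_t)ret;
        first_item = 0;
    }
    return 0;
}
```

Because the run is closed inside the inner loop, a range that extends to the last mask entry needs no special handling after the loop, and an empty mask simply yields an empty string.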

On 10/13/2011 10:10 PM, Bharata B Rao wrote:
Using VIR_ALLOC_N + snprintf here is not desirable, when you already have a nice virBufferPtr object in the call that you could use. Just pass the virBufferPtr straight into this method.
I wanted to use virBufferPtr, but I realized that I need to remove the last comma from the string and couldn't find an easy way to do that, hence I resorted to this method. But I think I can still achieve this by not prepending a comma to the next part of the argument (,mems). Let me see if I can do this cleanly in the next post.
Yeah, it might make sense to first add a helper function to util/buf.[ch] that allows one to truncate previously appended bytes, since sometimes the logic is easier if you always emit a trailing comma and then trim it than if you emit a leading comma whenever you are not the first item. Nothing wrong with adding helper functions to make life easier, if virBuffer doesn't already meet your needs.
--
Eric Blake eblake@redhat.com +1-801-349-2682
Libvirt virtualization library http://libvirt.org
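To make the "always emit a trailing comma, then trim" idea concrete, here is a minimal sketch of such a helper on a tiny fixed-size string buffer. The strbuf type and both functions are hypothetical stand-ins written for this illustration; they are not the virBuffer API, and the real helper in util/buf.[ch] could look quite different.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Minimal fixed-size stand-in for virBuffer; not the libvirt API. */
struct strbuf {
    char data[256];
    size_t len;     /* bytes used, excluding the terminating NUL */
    int error;      /* set once an append did not fit */
};

/* Append printf-style formatted text, recording overflow in buf->error. */
static void
strbuf_appendf(struct strbuf *buf, const char *fmt, ...)
{
    va_list ap;
    int ret;

    assert(buf != NULL && fmt != NULL);

    if (buf->error)
        return;

    va_start(ap, fmt);
    ret = vsnprintf(buf->data + buf->len, sizeof(buf->data) - buf->len,
                    fmt, ap);
    va_end(ap);

    if (ret < 0 || (size_t)ret >= sizeof(buf->data) - buf->len) {
        buf->data[buf->len] = '\0';  /* undo any partial write */
        buf->error = 1;
        return;
    }
    buf->len += (size_t)ret;
}

/* Drop up to n bytes from the end of the buffer. */
static void
strbuf_trim(struct strbuf *buf, size_t n)
{
    assert(buf != NULL);

    if (n > buf->len)
        n = buf->len;
    buf->len -= n;
    buf->data[buf->len] = '\0';
}
```

A caller would append every range as "%d-%d," or "%d," and finish with a single strbuf_trim(&buf, 1), which keeps the loop body free of "am I the first item" bookkeeping. Trimming an empty buffer is a no-op here, so a mask with no CPUs set stays safe.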

On Mon, Oct 03, 2011 at 03:28:44PM +0530, Bharata B Rao wrote:
Hi,
I discussed the possibilities of adding NUMA topology XML specification support for guests here some time back. Since my latest proposal (http://permalink.gmane.org/gmane.comp.emulators.libvirt/44626) didn't get any response, I am posting a prototype implementation that supports specifying NUMA topology for QEMU guests.
- The implementation is based on the last proposal I listed above.
So we're basically only allowing a flat NUMA topology
  <numa>
    <node cpus='0,2,4,6' mems='1024'/>
    <node cpus='8,10,12,14' mems='1024'/>
    <node cpus='1,3,5,7' mems='1024'/>
    <node cpus='9,11,13,15' mems='1024'/>
  </numa>
which mirrors what QEMU allows currently. Should we need to support a hierarchy, we can trivially extend this syntax in a backwards compatible fashion
  <numa>
    <node>
      <node cpus='0,2,4,6' mems='1024'/>
      <node cpus='8,10,12,14' mems='1024'/>
    </node>
    <node>
      <node cpus='1,3,5,7' mems='1024'/>
      <node cpus='9,11,13,15' mems='1024'/>
    </node>
  </numa>
so I think this limitation is OK for now.
In the virsh capabilities XML, we actually use the word 'cell' rather than 'node'. I think it might be preferable to be consistent and use 'cell' here too.
- The implementation is for QEMU only.
That's fine.
- The patchset has gone through extremely light testing and I have just tested booting a 2 socket 2 core 2 thread QEMU guest. - I haven't really bothered to cover all the corner cases and haven't run libvirt tests after this patchset. For eg, there is no code to validate if the CPU combination specified by <topology> and <numa> match with each other. I plan to cover all these after we freeze the specification itself.
ok
WRT the question about CPU enumeration order in the URL quoted above. I don't think it matters whether we enumerate CPUs in the same order as real hardware. The key thing is that we just choose an order, document what *our* enumeration order is, and then stick to it forever.
Regards, Daniel

On Thu, Oct 13, 2011 at 12:53:22PM +0100, Daniel P. Berrange wrote:
On Mon, Oct 03, 2011 at 03:28:44PM +0530, Bharata B Rao wrote:
Hi,
I discussed the possibilities of adding NUMA topology XML specification support for guests here some time back. Since my latest proposal (http://permalink.gmane.org/gmane.comp.emulators.libvirt/44626) didn't get any response, I am posting a prototype implementation that supports specifying NUMA topology for QEMU guests.
- The implementation is based on the last proposal I listed above.
So we're basically only allowing a flat NUMA topology
  <numa>
    <node cpus='0,2,4,6' mems='1024'/>
    <node cpus='8,10,12,14' mems='1024'/>
    <node cpus='1,3,5,7' mems='1024'/>
    <node cpus='9,11,13,15' mems='1024'/>
  </numa>
which mirrors what QEMU allows currently. Should we need to support a hierarchy, we can trivially extend this syntax in a backwards compatible fashion
  <numa>
    <node>
      <node cpus='0,2,4,6' mems='1024'/>
      <node cpus='8,10,12,14' mems='1024'/>
    </node>
    <node>
      <node cpus='1,3,5,7' mems='1024'/>
      <node cpus='9,11,13,15' mems='1024'/>
    </node>
  </numa>
so I think this limitation is OK for now.
Fine then.
In the virsh capabilities XML, we actually use the word 'cell' rather than 'node'. I think it might be preferable to be consistent and use 'cell' here too.
I feel NUMA node sounds more familiar than NUMA cell. But if libvirt prefers cell, we can go with cell I suppose.
- The implementation is for QEMU only.
That's fine.
- The patchset has gone through extremely light testing and I have just tested booting a 2 socket 2 core 2 thread QEMU guest. - I haven't really bothered to cover all the corner cases and haven't run libvirt tests after this patchset. For eg, there is no code to validate if the CPU combination specified by <topology> and <numa> match with each other. I plan to cover all these after we freeze the specification itself.
ok
WRT the question about CPU enumeration order in the URL quoted above. I don't think it matters whether we enumerate CPUs in the same order as real hardware. The key thing is that we just choose an order, document what *our* enumeration order is, and then stick to it forever.
Ok fine. Regards, Bharata.
participants (3)
- Bharata B Rao
- Daniel P. Berrange
- Eric Blake