[libvirt] [PATCH 0/5 v3] Add NUMA memory tuning support

Hi,

This series incorporates previous feedback received on the XML schema, especially on the syntax of "nodeset", which now uses the same syntax as the "cpuset" attribute of "vcpu". The conversion is done internally in the qemu driver, as libnuma uses a different bitmask format to represent NUMA nodes.

The XML looks like:

  <numatune>
    <memory mode="strict" nodeset="0-10,^5"/>
  </numatune>

The value of "mode" can be one of "strict", "preferred", or "interleave".

One may specify a nodemask which is out of range; in this case, libvirt will report a soft warning instead of a hard error. (The kernel is silent as long as one of the set bits is valid. E.g. for a host with 2 NUMA nodes, the kernel will be silent for nodemask "0101010101", as the first "bit", which represents the first NUMA node, is set.) The reason for reporting a soft warning instead of a hard error is that one may want to define the memory tuning policy in advance for a node that doesn't exist yet, but may come soon.

Any feedback is appreciated.

[PATCH 1/5 v3] numatune: Define XML schema
[PATCH 2/5 v3] numatune: Add doc for new numatune XML
[PATCH 3/5 v3] numatune: Support persistent XML for numatune
[PATCH 4/5 v3] numatune: Support NUMA tuning in qemu driver
[PATCH 5/5 v3] numatune: Add tests for numatune XML

Regards,
Osier
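To make the nodeset syntax concrete, here is a rough standalone sketch of how a string like "0-10,^5" expands into a per-node bitmap. This is only an illustration of the syntax, not libvirt's actual virDomainCpuSetParse() implementation, and it skips whitespace and overflow handling:

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  #define MASK_LEN 1024  /* one char per node, like VIR_DOMAIN_CPUMASK_LEN */

  /* Expand e.g. "0-10,^5" into mask[], where mask[i] != 0 means node i
   * is set.  Returns 0 on success, -1 on a parse error. */
  static int
  nodeset_parse(const char *str, char *mask, size_t len)
  {
      long i;

      memset(mask, 0, len);

      while (*str) {
          int negate = 0;
          char *end;
          long first, last;

          if (*str == '^') {
              negate = 1;
              str++;
          }

          first = strtol(str, &end, 10);
          last = first;
          if (*end == '-' && !negate)   /* "a-b" ranges, not negated here */
              last = strtol(end + 1, &end, 10);
          if (end == str || first < 0 || last < first || (size_t)last >= len)
              return -1;

          for (i = first; i <= last; i++)
              mask[i] = negate ? 0 : 1;

          if (*end == ',')
              end++;
          else if (*end != '\0')
              return -1;
          str = end;
      }
      return 0;
  }

  int main(void)
  {
      char mask[MASK_LEN];
      long i;

      if (nodeset_parse("0-10,^5", mask, MASK_LEN) == 0)
          for (i = 0; i <= 12; i++)
              printf("node %ld: %d\n", i, mask[i]);
      return 0;
  }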

Example XML:

  <numatune>
    <memory mode="strict" nodeset="0-10,^4"/>
  </numatune>

---
 docs/schemas/domain.rng |   20 ++++++++++++++++++++
 1 files changed, 20 insertions(+), 0 deletions(-)

diff --git a/docs/schemas/domain.rng b/docs/schemas/domain.rng
index b71778b..6de024e 100644
--- a/docs/schemas/domain.rng
+++ b/docs/schemas/domain.rng
@@ -387,6 +387,26 @@
         </zeroOrMore>
       </element>
     </optional>
+
+    <!-- All the NUMA related tunables would go in the numatune -->
+    <optional>
+      <element name="numatune">
+        <optional>
+          <element name="memory">
+            <attribute name="mode">
+              <choice>
+                <value>strict</value>
+                <value>preferred</value>
+                <value>interleave</value>
+              </choice>
+            </attribute>
+            <attribute name="nodeset">
+              <ref name="cpuset"/>
+            </attribute>
+          </element>
+        </optional>
+      </element>
+    </optional>
   </interleave>
 </define>
 <define name="clock">
-- 
1.7.4

---
 docs/formatdomain.html.in |   18 ++++++++++++++++++
 1 files changed, 18 insertions(+), 0 deletions(-)

diff --git a/docs/formatdomain.html.in b/docs/formatdomain.html.in
index 267839b..ab39417 100644
--- a/docs/formatdomain.html.in
+++ b/docs/formatdomain.html.in
@@ -295,6 +295,9 @@
     <vcpupin vcpu="3" cpuset="0,4"/>
     <shares>2048</shares>
   </cputune>
+  <numatune>
+    <memory mode="strict" nodeset="1-4,^3"/>
+  </numatune>
   ...</pre>

 <dl>
@@ -382,6 +385,21 @@
       2048 will get twice as much CPU time as a VM configured with value
       1024. <span class="since">Since 0.9.0</span>
     </dd>
+    <dt><code>numatune</code></dt>
+    <dd>
+      The optional <code>numatune</code> element provides details of how to
+      tune the performance of a NUMA host by controlling the NUMA policy for
+      the domain process. NB: currently only supported by the QEMU driver.
+      <span class='since'>Since 0.9.3</span>
+      <dt><code>memory</code></dt>
+      <dd>
+        The optional <code>memory</code> element specifies how to allocate
+        memory for the domain process on a NUMA host. It contains two
+        attributes: attribute <code>mode</code> is either 'interleave',
+        'strict', or 'preferred'; attribute <code>nodeset</code> specifies the
+        NUMA nodes, using the same syntax as attribute <code>cpuset</code> of
+        element <code>vcpu</code>. <span class='since'>Since 0.9.3</span>
+      </dd>
   </dl>

 <h3><a name="elementsCPU">CPU model and topology</a></h3>
-- 
1.7.4

* src/conf/domain_conf.h: Introduce a new struct for representing NUMA
  tuning related config.
* src/conf/domain_conf.c: Parse and format numatune XML.
---
 src/conf/domain_conf.c   |   72 ++++++++++++++++++++++++++++++++++++++++++++++
 src/conf/domain_conf.h   |   22 ++++++++++++++
 src/libvirt_private.syms |    2 +
 3 files changed, 96 insertions(+), 0 deletions(-)

diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c
index 02707dd..8d5e173 100644
--- a/src/conf/domain_conf.c
+++ b/src/conf/domain_conf.c
@@ -484,6 +484,11 @@ VIR_ENUM_IMPL(virDomainTimerMode, VIR_DOMAIN_TIMER_MODE_LAST,
               "paravirt",
               "smpsafe");
 
+VIR_ENUM_IMPL(virDomainNumatuneMemMode, VIR_DOMAIN_NUMATUNE_MEM_LAST,
+              "strict",
+              "preferred",
+              "interleave");
+
 #define virDomainReportError(code, ...)                              \
     virReportErrorHelper(VIR_FROM_DOMAIN, code, __FILE__,            \
                          __FUNCTION__, __LINE__, __VA_ARGS__)
@@ -5875,6 +5880,51 @@ static virDomainDefPtr virDomainDefParseXML(virCapsPtr caps,
     }
     VIR_FREE(nodes);
 
+    /* Extract numatune if it exists. */
+    if ((n = virXPathNodeSet("./numatune", ctxt, NULL)) < 0) {
+        virDomainReportError(VIR_ERR_INTERNAL_ERROR,
+                             "%s", _("cannot extract numatune nodes"));
+        goto error;
+    }
+
+    if (n) {
+        tmp = virXPathString("string(./numatune/memory/@nodeset)", ctxt);
+        if (tmp) {
+            char *set = tmp;
+            int nodemasklen = VIR_DOMAIN_CPUMASK_LEN;
+
+            if (VIR_ALLOC_N(def->numatune.memory.nodemask, nodemasklen) < 0) {
+                virReportOOMError();
+                goto error;
+            }
+
+            /* "nodeset" uses the same syntax as "cpuset". */
+            if (virDomainCpuSetParse((const char **)&set,
+                                     0, def->numatune.memory.nodemask,
+                                     nodemasklen) < 0)
+                goto error;
+            VIR_FREE(tmp);
+        } else {
+            virDomainReportError(VIR_ERR_INTERNAL_ERROR,
+                                 "%s", _("nodeset for NUMA memory tuning must be set"));
+            goto error;
+        }
+
+        tmp = virXPathString("string(./numatune/memory/@mode)", ctxt);
+        if (tmp) {
+            if ((def->numatune.memory.mode =
+                 virDomainNumatuneMemModeTypeFromString(tmp)) < 0) {
+                virDomainReportError(VIR_ERR_INTERNAL_ERROR,
+                                     _("Unsupported NUMA memory tuning mode '%s'"),
+                                     tmp);
+                goto error;
+            }
+            VIR_FREE(tmp);
+        } else {
+            def->numatune.memory.mode = VIR_DOMAIN_NUMATUNE_MEM_STRICT;
+        }
+    }
+
     n = virXPathNodeSet("./features/*", ctxt, &nodes);
     if (n < 0)
         goto error;
@@ -9457,6 +9507,28 @@ char *virDomainDefFormat(virDomainDefPtr def,
     if (def->cputune.shares || def->cputune.vcpupin)
         virBufferAddLit(&buf, "  </cputune>\n");
 
+    if (def->numatune.memory.nodemask)
+        virBufferAddLit(&buf, "  <numatune>\n");
+
+    if (def->numatune.memory.nodemask) {
+        char *nodemask = NULL;
+        nodemask = virDomainCpuSetFormat(def->numatune.memory.nodemask,
+                                         VIR_DOMAIN_CPUMASK_LEN);
+        if (nodemask == NULL) {
+            virDomainReportError(VIR_ERR_INTERNAL_ERROR,
+                                 "%s", _("failed to format nodeset for NUMA memory tuning"));
+            goto cleanup;
+        }
+
+        virBufferAsprintf(&buf, "    <memory mode='%s' nodeset='%s'/>\n",
+                          virDomainNumatuneMemModeTypeToString(def->numatune.memory.mode),
+                          nodemask);
+        VIR_FREE(nodemask);
+    }
+
+    if (def->numatune.memory.nodemask)
+        virBufferAddLit(&buf, "  </numatune>\n");
+
     if (def->sysinfo)
         virDomainSysinfoDefFormat(&buf, def->sysinfo);

diff --git a/src/conf/domain_conf.h b/src/conf/domain_conf.h
index 1f4bf1d..3c54e8b 100644
--- a/src/conf/domain_conf.h
+++ b/src/conf/domain_conf.h
@@ -1117,6 +1117,25 @@ virDomainVcpupinDefPtr virDomainVcpupinFindByVcpu(virDomainVcpupinDefPtr *def,
                                                   int nvcpupin,
                                                   int vcpu);
 
+enum virDomainNumatuneMemMode {
+    VIR_DOMAIN_NUMATUNE_MEM_STRICT,
+    VIR_DOMAIN_NUMATUNE_MEM_PREFERRED,
+    VIR_DOMAIN_NUMATUNE_MEM_INTERLEAVE,
+
+    VIR_DOMAIN_NUMATUNE_MEM_LAST
+};
+
+typedef struct _virDomainNumatuneDef virDomainNumatuneDef;
+typedef virDomainNumatuneDef *virDomainNumatuneDefPtr;
+struct _virDomainNumatuneDef {
+    struct {
+        char *nodemask;
+        int mode;
+    } memory;
+
+    /* Future NUMA tuning related stuff should go here. */
+};
+
 /*
  * Guest VM main configuration
  *
@@ -1156,6 +1175,8 @@ struct _virDomainDef {
         virDomainVcpupinDefPtr *vcpupin;
     } cputune;
 
+    virDomainNumatuneDef numatune;
+
     /* These 3 are based on virDomainLifeCycleAction enum flags */
     int onReboot;
     int onPoweroff;
@@ -1564,6 +1585,7 @@ VIR_ENUM_DECL(virDomainGraphicsSpiceZlibCompression)
 VIR_ENUM_DECL(virDomainGraphicsSpicePlaybackCompression)
 VIR_ENUM_DECL(virDomainGraphicsSpiceStreamingMode)
 VIR_ENUM_DECL(virDomainGraphicsSpiceClipboardCopypaste)
+VIR_ENUM_DECL(virDomainNumatuneMemMode)
 /* from libvirt.h */
 VIR_ENUM_DECL(virDomainState)
 VIR_ENUM_DECL(virDomainNostateReason)
diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms
index 5202da3..ae18c0d 100644
--- a/src/libvirt_private.syms
+++ b/src/libvirt_private.syms
@@ -310,6 +310,8 @@ virDomainNetIndexByMac;
 virDomainNetInsert;
 virDomainNetRemoveByMac;
 virDomainNetTypeToString;
+virDomainNumatuneMemModeTypeFromString;
+virDomainNumatuneMemModeTypeToString;
 virDomainObjAssignDef;
 virDomainObjCopyPersistentDef;
 virDomainObjGetPersistentDef;
-- 
1.7.4

On Fri, Jun 17, 2011 at 06:22:56PM +0800, Osier Yang wrote:
> diff --git a/src/conf/domain_conf.h b/src/conf/domain_conf.h
> index 1f4bf1d..3c54e8b 100644
> --- a/src/conf/domain_conf.h
> +++ b/src/conf/domain_conf.h
> @@ -1117,6 +1117,25 @@ virDomainVcpupinDefPtr virDomainVcpupinFindByVcpu(virDomainVcpupinDefPtr *def,
>                                                    int nvcpupin,
>                                                    int vcpu);
> +enum virDomainNumatuneMemMode {
> +    VIR_DOMAIN_NUMATUNE_MEM_STRICT,
> +    VIR_DOMAIN_NUMATUNE_MEM_PREFERRED,
> +    VIR_DOMAIN_NUMATUNE_MEM_INTERLEAVE,
> +
> +    VIR_DOMAIN_NUMATUNE_MEM_LAST
> +};
> +
> +typedef struct _virDomainNumatuneDef virDomainNumatuneDef;
> +typedef virDomainNumatuneDef *virDomainNumatuneDefPtr;
> +struct _virDomainNumatuneDef {
> +    struct {
> +        char *nodemask;
This is a heap allocated string, but there was no addition in domain_conf.c that ever frees it, so this leaks AFAICT.
> +        int mode;
> +    } memory;
> +
> +    /* Future NUMA tuning related stuff should go here. */
> +};
> +
Regards,
Daniel

On 2011-06-21 18:46, Daniel P. Berrange wrote:
> On Fri, Jun 17, 2011 at 06:22:56PM +0800, Osier Yang wrote:
>> [...]
>> +typedef struct _virDomainNumatuneDef virDomainNumatuneDef;
>> +typedef virDomainNumatuneDef *virDomainNumatuneDefPtr;
>> +struct _virDomainNumatuneDef {
>> +    struct {
>> +        char *nodemask;
> This is a heap allocated string, but there was no addition in
> domain_conf.c that ever frees it, so this leaks AFAICT.
This was fixed in commit 9a2ac25a63de74
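For reference, the fix boils down to freeing the mask when the domain definition is released. A minimal sketch of the shape of the change (the exact hunk in that commit may differ):

  /* In virDomainDefFree() in src/conf/domain_conf.c, next to the other
   * tuning-related cleanup: */
  VIR_FREE(def->numatune.memory.nodemask);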

Implemented by setting the NUMA policy between fork and exec of QEMU, as a
hook, using libnuma. Only memory tuning of the domain process is supported
currently.

For a nodemask that is out of range, a soft warning is reported instead of
a hard error in the libvirt layer. (The kernel is silent as long as at
least one set bit in the nodemask is valid on the host; e.g. for a host
with two NUMA nodes, the kernel accepts nodemask "01010101".) So a soft
warning is the only thing libvirt can do, as one might want to specify a
NUMA policy for a node that doesn't exist yet but may be hotplugged soon.
---
 src/qemu/qemu_process.c |  168 ++++++++++++++++++++++++++++++++++++++++++++++-
 1 files changed, 167 insertions(+), 1 deletions(-)

diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
index fe66100..8e09e52 100644
--- a/src/qemu/qemu_process.c
+++ b/src/qemu/qemu_process.c
@@ -39,6 +39,10 @@
 #include "qemu_hotplug.h"
 #include "qemu_bridge_filter.h"
 
+#if HAVE_NUMACTL
+# include <numa.h>
+#endif
+
 #include "datatypes.h"
 #include "logging.h"
 #include "virterror_internal.h"
@@ -1175,6 +1179,166 @@ qemuProcessDetectVcpuPIDs(struct qemud_driver *driver,
     return 0;
 }
 
+
+/*
+ * Set NUMA memory policy for qemu process, to be run between
+ * fork/exec of QEMU only.
+ */
+#if HAVE_NUMACTL
+static int
+qemuProcessInitNumaMemoryPolicy(virDomainObjPtr vm)
+{
+    struct bitmask *mask = NULL;
+    virErrorPtr orig_err = NULL;
+    virErrorPtr err = NULL;
+    int mode = -1;
+    int node = -1;
+    int ret = -1;
+    int i = 0;
+    int maxnode = 0;
+
+    VIR_DEBUG("Setting NUMA memory policy");
+
+    if (numa_available() < 0) {
+        qemuReportError(VIR_ERR_INTERNAL_ERROR,
+                        "%s", _("Host kernel is not aware of NUMA."));
+        return -1;
+    }
+
+    if (!vm->def->numatune.memory.nodemask)
+        return 0;
+
+    if (VIR_ALLOC(mask) < 0) {
+        virReportOOMError();
+        return -1;
+    }
+
+    mask->size = VIR_DOMAIN_CPUMASK_LEN / sizeof (unsigned long);
+    if (VIR_ALLOC_N(mask->maskp, mask->size) < 0) {
+        virReportOOMError();
+        goto cleanup;
+    }
+    /* Convert nodemask to NUMA bitmask. */
+    for (i = 0; i < VIR_DOMAIN_CPUMASK_LEN; i++) {
+        int cur = 0;
+        int mod = 0;
+
+        if (i) {
+            cur = i / (8 * sizeof (unsigned long));
+            mod = i % (8 * sizeof (unsigned long));
+        }
+
+        if (vm->def->numatune.memory.nodemask[i]) {
+            mask->maskp[cur] |= 1 << mod;
+        }
+    }
+
+    maxnode = numa_max_node() + 1;
+
+    for (i = 0; i < mask->size; i++) {
+        if (i < (maxnode % (8 * sizeof (unsigned long)))) {
+            if (mask->maskp[i] & ~maxnode) {
+                VIR_WARN("nodeset is out of range, there is only %d NUMA "
+                         "nodes on host", maxnode);
+                break;
+            }
+
+            continue;
+        }
+
+        if (mask->maskp[i]) {
+            VIR_WARN("nodeset is out of range, there is only %d NUMA nodes "
+                     "on host", maxnode);
+        }
+    }
+
+    orig_err = virSaveLastError();
+    mode = vm->def->numatune.memory.mode;
+
+    if (mode == VIR_DOMAIN_NUMATUNE_MEM_STRICT) {
+        numa_set_bind_policy(1);
+        numa_set_membind(mask);
+        numa_set_bind_policy(0);
+
+        err = virGetLastError();
+        if ((err && (err->code != orig_err->code)) ||
+            (err && !orig_err)) {
+            qemuReportError(VIR_ERR_INTERNAL_ERROR,
+                            _("Failed to bind memory to specified nodeset: %s"),
+                            err ? err->message : _("unknown error"));
+            virResetLastError();
+            goto cleanup;
+        }
+    } else if (mode == VIR_DOMAIN_NUMATUNE_MEM_PREFERRED) {
+        int nnodes = 0;
+        for (i = 0; i < mask->size; i++) {
+            if (numa_bitmask_isbitset(mask, i)) {
+                node = i;
+                nnodes++;
+            }
+        }
+
+        if (nnodes != 1) {
+            qemuReportError(VIR_ERR_INTERNAL_ERROR,
+                            "%s", _("NUMA memory tuning in 'preferred' mode "
+                                    "only supports single node"));
+            goto cleanup;
+        }
+
+        numa_set_bind_policy(0);
+        numa_set_preferred(node);
+
+        err = virGetLastError();
+        if ((err && (err->code != orig_err->code)) ||
+            (err && !orig_err)) {
+            qemuReportError(VIR_ERR_INTERNAL_ERROR,
+                            _("Failed to set memory policy as preferred to specified "
+                              "node: %s"), err ? err->message : _("unknown error"));
+            virResetLastError();
+            goto cleanup;
+        }
+    } else if (mode == VIR_DOMAIN_NUMATUNE_MEM_INTERLEAVE) {
+        numa_set_interleave_mask(mask);
+
+        err = virGetLastError();
+        if ((err && (err->code != orig_err->code)) ||
+            (err && !orig_err)) {
+            qemuReportError(VIR_ERR_INTERNAL_ERROR,
+                            _("Failed to interleave memory to specified nodeset: %s"),
+                            err ? err->message : _("unknown error"));
+            virResetLastError();
+            goto cleanup;
+        }
+    } else {
+        /* XXX: Shouldn't go here, as we already do checking when
+         * parsing domain XML.
+         */
+        qemuReportError(VIR_ERR_XML_ERROR,
+                        "%s", _("Invalid mode for memory NUMA tuning."));
+        goto cleanup;
+    }
+
+    ret = 0;
+
+cleanup:
+    numa_bitmask_free(mask);
+    return ret;
+}
+#else
+static int
+qemuProcessInitNumaMemoryPolicy(virDomainObjPtr vm)
+{
+    if (vm->def->numatune.memory.nodemask) {
+        qemuReportError(VIR_ERR_INTERNAL_ERROR, "%s",
+                        _("libvirt is compiled without NUMA tuning support"));
+
+        return -1;
+    }
+
+    return 0;
+}
+#endif
+
 /*
  * To be run between fork/exec of QEMU only
  */
@@ -1891,6 +2055,9 @@ static int qemuProcessHook(void *data)
     if (qemuProcessInitCpuAffinity(h->vm) < 0)
         goto cleanup;
 
+    if (qemuProcessInitNumaMemoryPolicy(h->vm) < 0)
+        return -1;
+
     VIR_DEBUG("Setting up security labeling");
     if (virSecurityManagerSetProcessLabel(h->driver->securityManager, h->vm) < 0)
         goto cleanup;
@@ -1902,7 +2069,6 @@ cleanup:
     return ret;
 }
 
-
 int
 qemuProcessPrepareMonitorChr(struct qemud_driver *driver,
                              virDomainChrSourceDefPtr monConfig,
-- 
1.7.4

On Fri, Jun 17, 2011 at 06:22:57PM +0800, Osier Yang wrote:
> Implemented by setting the NUMA policy between fork and exec of QEMU, as a
> hook, using libnuma. Only memory tuning of the domain process is supported
> currently. For a nodemask that is out of range, a soft warning is reported
> instead of a hard error in the libvirt layer. [...]
> +/*
> + * Set NUMA memory policy for qemu process, to be run between
> + * fork/exec of QEMU only.
> + */
> +#if HAVE_NUMACTL
> +static int
> +qemuProcessInitNumaMemoryPolicy(virDomainObjPtr vm)
> +{
> [...]
> +    maxnode = numa_max_node() + 1;
> +
> +    for (i = 0; i < mask->size; i++) {
> +        if (i < (maxnode % (8 * sizeof (unsigned long)))) {
> +            if (mask->maskp[i] & ~maxnode) {
> +                VIR_WARN("nodeset is out of range, there is only %d NUMA "
> +                         "nodes on host", maxnode);
This should be a fatal error.
> +                break;
> +            }
> +
> +            continue;
> +        }
> +
> +        if (mask->maskp[i]) {
> +            VIR_WARN("nodeset is out of range, there is only %d NUMA nodes "
> +                     "on host", maxnode);
This should be a fatal error.
> +        }
> +    }
> +
> +    orig_err = virSaveLastError();
> +    mode = vm->def->numatune.memory.mode;
> +
> +    if (mode == VIR_DOMAIN_NUMATUNE_MEM_STRICT) {
> +        numa_set_bind_policy(1);
> +        numa_set_membind(mask);
> +        numa_set_bind_policy(0);
> +
> +        err = virGetLastError();
> +        if ((err && (err->code != orig_err->code)) ||
> +            (err && !orig_err)) {
> +            qemuReportError(VIR_ERR_INTERNAL_ERROR,
> +                            _("Failed to bind memory to specified nodeset: %s"),
> +                            err ? err->message : _("unknown error"));
> +            virResetLastError();
What ????? There is no code here between the virSaveLastError() call and
virGetLastError() that will ever set an error. This then calls
qemuReportError(), and immediately clears the error again by calling
virResetLastError(). The numa_XXXX API calls can't fail, so I don't see
what this error reporting code is trying to do. AFAICT this all needs to
be deleted.
> +            goto cleanup;
> +        }
> +    } else if (mode == VIR_DOMAIN_NUMATUNE_MEM_PREFERRED) {
> [...]
> +        numa_set_bind_policy(0);
> +        numa_set_preferred(node);
> +
> +        err = virGetLastError();
> +        if ((err && (err->code != orig_err->code)) ||
> +            (err && !orig_err)) {
> +            qemuReportError(VIR_ERR_INTERNAL_ERROR,
> +                            _("Failed to set memory policy as preferred to specified "
> +                              "node: %s"), err ? err->message : _("unknown error"));
> +            virResetLastError();
> +            goto cleanup;
> +        }
Again, this code seems just bogus
> +    } else if (mode == VIR_DOMAIN_NUMATUNE_MEM_INTERLEAVE) {
> +        numa_set_interleave_mask(mask);
> +
> +        err = virGetLastError();
> +        if ((err && (err->code != orig_err->code)) ||
> +            (err && !orig_err)) {
> +            qemuReportError(VIR_ERR_INTERNAL_ERROR,
> +                            _("Failed to interleave memory to specified nodeset: %s"),
> +                            err ? err->message : _("unknown error"));
> +            virResetLastError();
> +            goto cleanup;
> +        }
And this code
> [...]
Since this appears to have already been committed, please send a followup
patch to fix these problems.

Regards,
Daniel

On Tue, Jun 21, 2011 at 11:51:44AM +0100, Daniel P. Berrange wrote:
> On Fri, Jun 17, 2011 at 06:22:57PM +0800, Osier Yang wrote:
>> Implemented by setting the NUMA policy between fork and exec of QEMU, as a
>> hook, using libnuma. [...]
>>
>> +    maxnode = numa_max_node() + 1;
>> +
>> +    for (i = 0; i < mask->size; i++) {
>> +        if (i < (maxnode % (8 * sizeof (unsigned long)))) {
>> +            if (mask->maskp[i] & ~maxnode) {
>> +                VIR_WARN("nodeset is out of range, there is only %d NUMA "
>> +                         "nodes on host", maxnode);
> This should be a fatal error.
We discussed this question with a bunch of NUMA people and the consensus was that since it's possible to hotplug nodes, it's not definitively wrong to set a mask containing nodes that do not currently exist. Since it's an unusual thing to do, we thought it was appropriate to warn the user, but the consensus was that it should not be categorically rejected.

On 2011-06-21 21:58, Dave Allan wrote:
> On Tue, Jun 21, 2011 at 11:51:44AM +0100, Daniel P. Berrange wrote:
>> On Fri, Jun 17, 2011 at 06:22:57PM +0800, Osier Yang wrote:
>>> Implemented by setting the NUMA policy between fork and exec of QEMU, as a
>>> hook, using libnuma. [...]
>>>
>>> +    maxnode = numa_max_node() + 1;
>>> +
>>> +    for (i = 0; i < mask->size; i++) {
>>> +        if (i < (maxnode % (8 * sizeof (unsigned long)))) {
>>> +            if (mask->maskp[i] & ~maxnode) {
>>> +                VIR_WARN("nodeset is out of range, there is only %d NUMA "
>>> +                         "nodes on host", maxnode);
>> This should be a fatal error.
> We discussed this question with a bunch of NUMA people and the consensus
> was that since it's possible to hotplug nodes, it's not definitively
> wrong to set a mask containing nodes that do not currently exist. Since
> it's an unusual thing to do, we thought it was appropriate to warn the
> user, but the consensus was that it should not be categorically rejected.
Yes, I explained this in [PATCH 0/5]:

<snip>
One may specify a nodemask which is out of range; in this case, libvirt
will report a soft warning instead of a hard error. (The kernel is silent
as long as one of the set bits is valid. E.g. for a host with 2 NUMA
nodes, the kernel will be silent for nodemask "0101010101", as the first
"bit", which represents the first NUMA node, is set.) The reason for
reporting a soft warning instead of a hard error is that one may want to
define the memory tuning policy in advance for a node that doesn't exist
yet, but may come soon.
</snip>

The point is that the kernel allows it, so we think it's better not to
reject it at the libvirt layer.
>>> [...]
>>> +    orig_err = virSaveLastError();
>>> +    mode = vm->def->numatune.memory.mode;
>>> +
>>> +    if (mode == VIR_DOMAIN_NUMATUNE_MEM_STRICT) {
>>> +        numa_set_bind_policy(1);
>>> +        numa_set_membind(mask);
>>> +        numa_set_bind_policy(0);
>>> +
>>> +        err = virGetLastError();
>>> +        if ((err && (err->code != orig_err->code)) ||
>>> +            (err && !orig_err)) {
>>> +            qemuReportError(VIR_ERR_INTERNAL_ERROR,
>>> +                            _("Failed to bind memory to specified nodeset: %s"),
>>> +                            err ? err->message : _("unknown error"));
>>> +            virResetLastError();
>> What ?????
>> There is no code here between the virSaveLastError() call and
>> virGetLastError() that will ever set an error. This then calls
>> qemuReportError(), and immediately clears the error again by calling
>> virResetLastError(). The numa_XXXX API calls can't fail, so I don't see
>> what this error reporting code is trying to do. AFAICT this all needs
>> to be deleted.
Agree, will update.
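With that error-checking dance removed, the body of qemuProcessInitNumaMemoryPolicy() would collapse to something like the following for the strict and interleave branches (a sketch of the direction of the cleanup, not necessarily the final committed code):

  if (mode == VIR_DOMAIN_NUMATUNE_MEM_STRICT) {
      /* numa_set_membind() cannot fail, so no error checking is
       * needed; bind strictly to the mask and restore the default
       * bind policy flag afterwards. */
      numa_set_bind_policy(1);
      numa_set_membind(mask);
      numa_set_bind_policy(0);
  } else if (mode == VIR_DOMAIN_NUMATUNE_MEM_INTERLEAVE) {
      numa_set_interleave_mask(mask);
  }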

On 06/17/2011 04:22 AM, Osier Yang wrote:
> +#if HAVE_NUMACTL
> +static int
> +qemuProcessInitNumaMemoryPolicy(virDomainObjPtr vm)
> +{
> +    struct bitmask *mask = NULL;
> [...]
> +    if (VIR_ALLOC(mask) < 0) {
On RHEL 5, I'm getting compilation failures:

  qemu/qemu_process.c: In function 'qemuProcessInitNumaMemoryPolicy':
  qemu/qemu_process.c:1211: error: dereferencing pointer to incomplete type
  ...
  qemu/qemu_process.c:1260: warning: passing argument 1 of 'numa_set_membind' from incompatible pointer type

Where is 'struct bitmask' defined? It's not in libvirt, so it is probably
a struct leaked by a newer NUMA header, but not available with older NUMA
headers (and if so, it's not very namespace safe, is it).

On RHEL 5, 'man numa_set_membind' shows me:

  void numa_set_membind(nodemask_t *nodemask);

On Fedora 14, numa.h has:

  typedef struct {
      unsigned long n[NUMA_NUM_NODES/(sizeof(unsigned long)*8)];
  } nodemask_t;

  struct bitmask {
      unsigned long size; /* number of bits in the map */
      unsigned long *maskp;
  };

Sounds like we need to use the various nodemask_* functions (in the older
numa.h, they directly operate on nodemask_t; in the newer numa.h, they are
compatibility wrappers that convert to the newer struct bitmask).

Eric Blake
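A portable variant along the lines Eric suggests would build a nodemask_t rather than a struct bitmask. A rough sketch, assuming the v1 libnuma API (nodemask_zero()/nodemask_set(), and numa_set_membind() taking a nodemask_t *, as on RHEL 5); the helper name here is hypothetical:

  #include <numa.h>

  /* Convert libvirt's per-node char array into a v1 nodemask_t and
   * apply a strict binding.  Sketch only; error reporting omitted. */
  static int
  qemuSetStrictMembindCompat(const char *virMask, int virMaskLen)
  {
      nodemask_t mask;
      int i;

      if (numa_available() < 0)
          return -1;

      nodemask_zero(&mask);
      for (i = 0; i < virMaskLen && i < NUMA_NUM_NODES; i++) {
          if (virMask[i])
              nodemask_set(&mask, i);
      }

      numa_set_bind_policy(1);
      numa_set_membind(&mask);
      numa_set_bind_policy(0);
      return 0;
  }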

Only add tests for qemuxml2argvtest.c, as there is no qemu command line
for numatune XML; this just makes sure the XML validates properly.
---
 .../qemuxml2argv-numatune-memory.args              |    4 ++
 .../qemuxml2argv-numatune-memory.xml               |   31 ++++++++++++++++++++
 tests/qemuxml2argvtest.c                           |    1 +
 3 files changed, 36 insertions(+), 0 deletions(-)
 create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-numatune-memory.args
 create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-numatune-memory.xml

diff --git a/tests/qemuxml2argvdata/qemuxml2argv-numatune-memory.args b/tests/qemuxml2argvdata/qemuxml2argv-numatune-memory.args
new file mode 100644
index 0000000..23bcb70
--- /dev/null
+++ b/tests/qemuxml2argvdata/qemuxml2argv-numatune-memory.args
@@ -0,0 +1,4 @@
+LC_ALL=C PATH=/bin HOME=/home/test USER=test LOGNAME=test /usr/bin/qemu -S -M \
+pc -m 214 -smp 2 -nographic -monitor \
+unix:/tmp/test-monitor,server,nowait -no-acpi -boot c -hda \
+/dev/HostVG/QEMUGuest1 -net none -serial none -parallel none -usb
diff --git a/tests/qemuxml2argvdata/qemuxml2argv-numatune-memory.xml b/tests/qemuxml2argvdata/qemuxml2argv-numatune-memory.xml
new file mode 100644
index 0000000..66ec6d0
--- /dev/null
+++ b/tests/qemuxml2argvdata/qemuxml2argv-numatune-memory.xml
@@ -0,0 +1,31 @@
+<domain type='qemu'>
+  <name>QEMUGuest1</name>
+  <uuid>c7a5fdbd-edaf-9455-926a-d65c16db1809</uuid>
+  <memory>219136</memory>
+  <currentMemory>219136</currentMemory>
+  <vcpu>2</vcpu>
+  <numatune>
+    <memory mode="strict" nodeset="0-5,^4"/>
+  </numatune>
+  <os>
+    <type arch='i686' machine='pc'>hvm</type>
+    <boot dev='hd'/>
+  </os>
+  <cpu>
+    <topology sockets='2' cores='1' threads='1'/>
+  </cpu>
+  <clock offset='utc'/>
+  <on_poweroff>destroy</on_poweroff>
+  <on_reboot>restart</on_reboot>
+  <on_crash>destroy</on_crash>
+  <devices>
+    <emulator>/usr/bin/qemu</emulator>
+    <disk type='block' device='disk'>
+      <source dev='/dev/HostVG/QEMUGuest1'/>
+      <target dev='hda' bus='ide'/>
+      <address type='drive' controller='0' bus='0' unit='0'/>
+    </disk>
+    <controller type='ide' index='0'/>
+    <memballoon model='virtio'/>
+  </devices>
+</domain>
diff --git a/tests/qemuxml2argvtest.c b/tests/qemuxml2argvtest.c
index 6fd0f9b..bd07efa 100644
--- a/tests/qemuxml2argvtest.c
+++ b/tests/qemuxml2argvtest.c
@@ -519,6 +519,7 @@ mymain(void)
     DO_TEST("memtune", false, QEMU_CAPS_NAME);
     DO_TEST("blkiotune", false, QEMU_CAPS_NAME);
     DO_TEST("cputune", false, QEMU_CAPS_NAME);
+    DO_TEST("numatune-memory", false, NONE);
     DO_TEST("multifunction-pci-device", false,
             QEMU_CAPS_DRIVE, QEMU_CAPS_DEVICE, QEMU_CAPS_NODEFCONFIG,
-- 
1.7.4

On Fri, Jun 17, 2011 at 06:22:53PM +0800, Osier Yang wrote:
> Hi,
>
> This series incorporates previous feedback received on the XML schema,
> especially on the syntax of "nodeset", which now uses the same syntax as
> the "cpuset" attribute of "vcpu". The conversion is done internally in
> the qemu driver, as libnuma uses a different bitmask format to represent
> NUMA nodes.
>
> The XML looks like:
>
>   <numatune>
>     <memory mode="strict" nodeset="0-10,^5"/>
>   </numatune>
>
> The value of "mode" can be one of "strict", "preferred", or "interleave".
>
> [...]
>
> Any feedback is appreciated.
At this point I think this patchset is ripe for consumption. ACK, please
push, thanks!

Daniel

On 2011-06-20 13:30, Daniel Veillard wrote:
> On Fri, Jun 17, 2011 at 06:22:53PM +0800, Osier Yang wrote:
>> [...]
>
> At this point I think this patchset is ripe for consumption. ACK, please
> push, thanks!
Thanks, COMMITTED.

Regards,
Osier
participants (5)
- Daniel P. Berrange
- Daniel Veillard
- Dave Allan
- Eric Blake
- Osier Yang