[libvirt] [numatune PATCH v2] Support NUMA tuning

Hi, All

This series adopts Daniel's suggestion on v1: it uses libnuma instead of invoking numactl to set the NUMA policy. It also adds support for the "interleave" and "preferred" modes, in addition to the "strict" mode supported in v1.

The new XML looks like:

<numatune>
  <memory model="interleave" nodeset="+0-4,8-12"/>
</numatune>

I have kept the numactl nodeset syntax for "nodeset". The purpose of adding NUMA tuning support is to serve NUMA users, and keeping the syntax the same as numactl's will make it familiar to them.

Regards
Osier

[PATCH 1/4] numatune: Define XML schema and add docs
[PATCH 2/4] numatune: Support persistent XML for numa tuning
[PATCH 3/4] numatune: Set NUMA policy between fork and exec as a hook
[PATCH 4/4] numatune: Add tests to validate the persistent XML

Example of numatune XML:

<numatune>
  <memory model="interleave" nodeset="+0-4,8-12"/>
</numatune>
---
 docs/formatdomain.html.in |   14 ++++++++++++++
 docs/schemas/domain.rng   |   25 +++++++++++++++++++++++++
 2 files changed, 39 insertions(+), 0 deletions(-)

diff --git a/docs/formatdomain.html.in b/docs/formatdomain.html.in
index dcfcd94..f6ab621 100644
--- a/docs/formatdomain.html.in
+++ b/docs/formatdomain.html.in
@@ -288,6 +288,9 @@
     <min_guarantee>65536</min_guarantee>
   </memtune>
   <vcpu cpuset="1-4,^3,6" current="1">2</vcpu>
+  <numatune>
+    <memory model="strict" nodeset="1,2,!3-6"/>
+  </numatune>
   ...</pre>
 
     <dl>
@@ -366,6 +369,17 @@
         the OS provided defaults. NB, There is no unit for the value,
         it's a relative measure based on the setting of other VM, e.g. A VM
         configured with value 2048 will get twice as much CPU time as a VM configured with value 1024.</dd>
+      <dt><code>numatune</code></dt>
+      <dd> The optional <code>numatune</code> element provides details of
+        how to tune the performance of a NUMA host via controlling NUMA policy for
+        domain process. NB, only supported by QEMU driver.</dd>
+      <dt><code>memory</code></dt>
+      <dd> The optional <code>memory</code> element specifies how to allocate memory
+        for the domain process on a NUMA host. It contains two attributes,
+        attribute <code>model</code> is either 'interleave', 'strict', or 'preferred'.
+        attribute <code>nodeset</code> specifies the NUMA nodes, it can be specified as
+        25 or 12-15 or 1,3,5-7 or +6-10 or 1-7,!3-5 or !+6-10. NB, if <code>model</code>
+        is "preferred", <code>nodeset</code> only accepts single node.</dd>
     </dl>
 
     <h3><a name="elementsCPU">CPU model and topology</a></h3>
diff --git a/docs/schemas/domain.rng b/docs/schemas/domain.rng
index 7163c6e..7e7765d 100644
--- a/docs/schemas/domain.rng
+++ b/docs/schemas/domain.rng
@@ -387,6 +387,26 @@
             </zeroOrMore>
           </element>
         </optional>
+
+        <!-- All the NUMA related tunables would go in the numatune -->
+        <optional>
+          <element name="numatune">
+            <optional>
+              <element name="memory">
+                <attribute name="model">
+                  <choice>
+                    <value>interleave</value>
+                    <value>strict</value>
+                    <value>preferred</value>
+                  </choice>
+                </attribute>
+                <attribute name="nodeset">
+                  <ref name="nodeset"/>
+                </attribute>
+              </element>
+            </optional>
+          </element>
+        </optional>
       </interleave>
   </define>
   <define name="clock">
@@ -2265,6 +2285,11 @@
       <param name="pattern">([0-9]+(-[0-9]+)?|\^[0-9]+)(,([0-9]+(-[0-9]+)?|\^[0-9]+))*</param>
     </data>
   </define>
+  <define name="nodeset">
+    <data type="string">
+      <param name="pattern">([!\+]?[0-9]+(-[0-9]+)?)(,([!\+]?[0-9]+(-[0-9]+)?))*</param>
+    </data>
+  </define>
   <define name="countCPU">
     <data type="unsignedShort">
       <param name="pattern">[0-9]+</param>
-- 
1.7.4
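A quick way to check what the proposed "nodeset" schema pattern accepts is to run it through a regex engine. The sketch below is only an illustration: it transcribes the pattern from the domain.rng hunk into POSIX extended syntax, anchors it at both ends (a RELAX NG pattern has to match the whole value), and wraps it in a hypothetical helper, nodeset_matches_schema(), which is not part of libvirt. Note that the pattern allows at most one prefix character per element, so the "!+6-10" form listed in the documentation hunk would not validate against it.

```c
#include <assert.h>
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* The "nodeset" pattern from the proposed domain.rng, rewritten for POSIX
 * extended regular expressions and anchored at both ends. */
static const char NODESET_PATTERN[] =
    "^([!+]?[0-9]+(-[0-9]+)?)(,([!+]?[0-9]+(-[0-9]+)?))*$";

/* Hypothetical helper (not a libvirt function): returns true if `value`
 * matches the schema pattern, false otherwise. */
static bool nodeset_matches_schema(const char *value)
{
    regex_t re;
    bool ok;

    assert(value != NULL);

    if (regcomp(&re, NODESET_PATTERN, REG_EXTENDED | REG_NOSUB) != 0)
        return false;

    ok = regexec(&re, value, 0, NULL, 0) == 0;
    regfree(&re);
    return ok;
}
```

With this helper, "+0-4,8-12" and "1,2,!3-6" are accepted, while "!+6-10" and the cpuset-style "1-4,^3" are rejected.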

On 2011-05-12 18:22, Osier Yang wrote:
As Igor pointed out, I will add the version information when pushing, once the patch is approved.

Regards
Osier

* src/conf/domain_conf.h (Define data structure for new XML)
* src/conf/domain_conf.c (Parse and format new XML)
* src/libvirt_private.syms (Add functions that convert NUMA memory
  tuning model types to strings and back)
---
 src/conf/domain_conf.c   |   91 ++++++++++++++++++++++++++++++++++++++++++++++
 src/conf/domain_conf.h   |   21 +++++++++++
 src/libvirt_private.syms |    2 +
 3 files changed, 114 insertions(+), 0 deletions(-)

diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c
index d3efec6..038c6ad 100644
--- a/src/conf/domain_conf.c
+++ b/src/conf/domain_conf.c
@@ -30,6 +30,10 @@
 #include <dirent.h>
 #include <sys/time.h>
 
+#if HAVE_NUMACTL
+# include <numa.h>
+#endif
+
 #include "virterror_internal.h"
 #include "datatypes.h"
 #include "domain_conf.h"
@@ -421,6 +425,11 @@ VIR_ENUM_IMPL(virDomainTimerMode, VIR_DOMAIN_TIMER_MODE_LAST,
               "paravirt",
               "smpsafe");
 
+VIR_ENUM_IMPL(virDomainNumatuneMemModel, VIR_DOMAIN_NUMATUNE_MEM_LAST,
+              "strict",
+              "preferred",
+              "interleave");
+
 #define virDomainReportError(code, ...)                              \
     virReportErrorHelper(VIR_FROM_DOMAIN, code, __FILE__,           \
                          __FUNCTION__, __LINE__, __VA_ARGS__)
@@ -1006,6 +1015,8 @@ void virDomainDefFree(virDomainDefPtr def)
 
     virDomainVcpupinDefFree(def->cputune.vcpupin, def->cputune.nvcpupin);
 
+    VIR_FREE(def->numatune.memory.nodeset);
+
     virSysinfoDefFree(def->sysinfo);
 
     if (def->namespaceData && def->ns.free)
@@ -5551,6 +5562,77 @@ static virDomainDefPtr virDomainDefParseXML(virCapsPtr caps,
     }
     VIR_FREE(nodes);
 
+    /* Extract numatune if exists. */
+    if ((n = virXPathNodeSet("./numatune", ctxt, NULL)) < 0) {
+        virDomainReportError(VIR_ERR_INTERNAL_ERROR,
+                             "%s", _("cannot extract numatune nodes"));
+        goto error;
+    }
+
+    if (n) {
+#ifdef HAVE_NUMACTL
+        if (numa_available() < 0) {
+            virDomainReportError(VIR_ERR_INTERNAL_ERROR,
+                                 "%s", _("Host kernel is not aware of NUMA."));
+            goto error;
+        }
+
+        tmp = virXPathString("string(./numatune/memory/@model)", ctxt);
+        if (tmp) {
+            if ((def->numatune.memory.model =
+                 virDomainNumatuneMemModelTypeFromString(tmp)) < 0) {
+                virDomainReportError(VIR_ERR_INTERNAL_ERROR,
+                                     _("Unsupported NUMA memory tuning model '%s'"),
+                                     tmp);
+                goto error;
+            }
+            VIR_FREE(tmp);
+        } else {
+            def->numatune.memory.model = VIR_DOMAIN_NUMATUNE_MEM_STRICT;
+        }
+
+        char * nodeset = NULL;
+        nodeset = virXPathString("string(./numatune/memory/@nodeset)", ctxt);
+        if (!nodeset) {
+            virDomainReportError(VIR_ERR_INTERNAL_ERROR,
+                                 "%s", _("nodeset for NUMA memory tuning must be set"));
+            goto error;
+        }
+
+        struct bitmask *mask = NULL;
+        mask = numa_parse_nodestring(nodeset);
+        if (!mask) {
+            virDomainReportError(VIR_ERR_INTERNAL_ERROR,
+                                 _("Invalid nodeset for NUMA memory tuning"));
+            goto error;
+        }
+
+        int nnodes = 0;
+        if (def->numatune.memory.model == VIR_DOMAIN_NUMATUNE_MEM_PREFERRED) {
+            for (i=0; i<mask->size; i++) {
+                if (numa_bitmask_isbitset(mask, i)) {
+                    nnodes++;
+                }
+            }
+
+            if (nnodes != 1) {
+                virDomainReportError(VIR_ERR_INTERNAL_ERROR,
+                                     "%s", _("NUMA memory tuning in 'preferred' mode "
+                                             "only supports single node"));
+                numa_bitmask_free(mask);
+                goto error;
+            }
+        }
+
+        def->numatune.memory.nodeset = nodeset;
+        numa_bitmask_free(mask);
+#else
+        virDomainReportError(VIR_ERR_INTERNAL_ERROR,
+                             "%s", _("libvirt is compiled without NUMA tuning support"));
+        goto error;
+#endif
+    }
+
     n = virXPathNodeSet("./features/*", ctxt, &nodes);
     if (n < 0)
         goto error;
@@ -8219,6 +8301,15 @@ char *virDomainDefFormat(virDomainDefPtr def,
     if (def->cputune.shares || def->cputune.vcpupin)
         virBufferAddLit(&buf, "  </cputune>\n");
 
+    if (def->numatune.memory.nodeset)
+        virBufferAddLit(&buf, "  <numatune>\n");
+    if (def->numatune.memory.nodeset)
+        virBufferAsprintf(&buf, "    <memory model='%s' nodeset='%s'/>\n",
+                          virDomainNumatuneMemModelTypeToString(def->numatune.memory.model),
+                          def->numatune.memory.nodeset);
+    if (def->numatune.memory.nodeset)
+        virBufferAddLit(&buf, "  </numatune>\n");
+
     if (def->sysinfo)
         virDomainSysinfoDefFormat(&buf, def->sysinfo);
 
diff --git a/src/conf/domain_conf.h b/src/conf/domain_conf.h
index a0f820c..8685611 100644
--- a/src/conf/domain_conf.h
+++ b/src/conf/domain_conf.h
@@ -1085,6 +1085,24 @@ int virDomainVcpupinIsDuplicate(virDomainVcpupinDefPtr *def,
 virDomainVcpupinDefPtr virDomainVcpupinFindByVcpu(virDomainVcpupinDefPtr *def,
                                                   int nvcpupin,
                                                   int vcpu);
+enum virDomainNumatuneMemModel {
+    VIR_DOMAIN_NUMATUNE_MEM_STRICT,     /* order must match VIR_ENUM_IMPL */
+    VIR_DOMAIN_NUMATUNE_MEM_PREFERRED,
+    VIR_DOMAIN_NUMATUNE_MEM_INTERLEAVE,
+
+    VIR_DOMAIN_NUMATUNE_MEM_LAST
+};
+
+typedef struct _virDomainNumatuneDef virDomainNumatuneDef;
+typedef virDomainNumatuneDef *virDomainNumatuneDefPtr;
+struct _virDomainNumatuneDef {
+    struct {
+        char *nodeset;
+        int model;
+    } memory;
+
+    /* Future NUMA tuning related stuff should go here. */
+};
 
 /* Guest VM main configuration */
 typedef struct _virDomainDef virDomainDef;
@@ -1120,6 +1138,8 @@ struct _virDomainDef {
         virDomainVcpupinDefPtr *vcpupin;
     } cputune;
 
+    virDomainNumatuneDef numatune;
+
     /* These 3 are based on virDomainLifeCycleAction enum flags */
     int onReboot;
     int onPoweroff;
@@ -1492,6 +1512,7 @@ VIR_ENUM_DECL(virDomainGraphicsSpiceImageCompression)
 VIR_ENUM_DECL(virDomainGraphicsSpiceJpegCompression)
 VIR_ENUM_DECL(virDomainGraphicsSpiceZlibCompression)
 VIR_ENUM_DECL(virDomainGraphicsSpicePlaybackCompression)
+VIR_ENUM_DECL(virDomainNumatuneMemModel)
 /* from libvirt.h */
 VIR_ENUM_DECL(virDomainState)
 VIR_ENUM_DECL(virDomainSeclabel)
diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms
index e2e706d..0ca4f8a 100644
--- a/src/libvirt_private.syms
+++ b/src/libvirt_private.syms
@@ -289,6 +289,8 @@ virDomainMemballoonModelTypeFromString;
 virDomainMemballoonModelTypeToString;
 virDomainNetDefFree;
 virDomainNetTypeToString;
+virDomainNumatuneMemModelTypeFromString;
+virDomainNumatuneMemModelTypeToString;
 virDomainObjAssignDef;
 virDomainObjCopyPersistentDef;
 virDomainObjGetPersistentDef;
-- 
1.7.4
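As far as I know, VIR_ENUM_IMPL pairs its string list with the enum values by position, so the strings and the enum must be declared in the same order or "interleave" would round-trip as a different model. The following is a minimal standalone sketch of that mapping, not libvirt code: the names mem_model, mem_model_names, mem_model_to_string() and mem_model_from_string() are all hypothetical. It uses designated initializers so that each string stays tied to its enum value even if the enum is later reordered.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for virDomainNumatuneMemModel. */
enum mem_model {
    MEM_MODEL_STRICT,
    MEM_MODEL_PREFERRED,
    MEM_MODEL_INTERLEAVE,

    MEM_MODEL_LAST
};

/* Designated initializers bind each name to its enum value explicitly,
 * so the table cannot silently drift out of order. */
static const char *const mem_model_names[] = {
    [MEM_MODEL_STRICT]     = "strict",
    [MEM_MODEL_PREFERRED]  = "preferred",
    [MEM_MODEL_INTERLEAVE] = "interleave",
};

/* Compile-time check that every enum value has a name. */
_Static_assert(sizeof(mem_model_names) / sizeof(mem_model_names[0])
               == MEM_MODEL_LAST, "one name per enum value");

/* Returns the name for `model`, or NULL if it is out of range. */
static const char *mem_model_to_string(int model)
{
    if (model < 0 || model >= MEM_MODEL_LAST)
        return NULL;
    return mem_model_names[model];
}

/* Returns the enum value for `name`, or -1 if it is unknown. */
static int mem_model_from_string(const char *name)
{
    assert(name != NULL);

    for (int i = 0; i < MEM_MODEL_LAST; i++) {
        if (strcmp(mem_model_names[i], name) == 0)
            return i;
    }
    return -1;
}
```

A round trip such as mem_model_to_string(mem_model_from_string("interleave")) then yields "interleave" again regardless of how the enum is ordered.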

* src/qemu/qemu_process.c (New function qemuProcessInitNumaPolicy,
  invoked from qemuProcessHook)
---
 src/qemu/qemu_process.c |   98 +++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 98 insertions(+), 0 deletions(-)

diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
index bd7c932..43d793c 100644
--- a/src/qemu/qemu_process.c
+++ b/src/qemu/qemu_process.c
@@ -28,6 +28,10 @@
 #include <sys/time.h>
 #include <sys/resource.h>
 
+#if HAVE_NUMACTL
+# include <numa.h>
+#endif
+
 #include "qemu_process.h"
 #include "qemu_domain.h"
 #include "qemu_cgroup.h"
@@ -1081,6 +1085,97 @@ qemuProcessDetectVcpuPIDs(struct qemud_driver *driver,
 }
 
 /*
+ * Set Numa policy for qemu process, to be run between fork/exec of
+ * QEMU only.
+ */
+#if HAVE_NUMACTL
+static int
+qemuProcessInitNumaPolicy(virDomainObjPtr vm)
+{
+    struct bitmask *mask = NULL;
+    virErrorPtr orig_err = NULL;
+    virErrorPtr err = NULL;
+    int model = -1;
+    int node = -1;
+    int ret = -1;
+    int i = 0;
+
+    if (!vm->def->numatune.memory.nodeset)
+        return 0;
+
+    model = vm->def->numatune.memory.model;
+    mask = numa_parse_nodestring(vm->def->numatune.memory.nodeset);
+
+    orig_err = virSaveLastError();
+
+    if (model == VIR_DOMAIN_NUMATUNE_MEM_STRICT) {
+        numa_set_bind_policy(1);
+        numa_set_membind(mask);
+        numa_set_bind_policy(0);
+
+        err = virGetLastError();
+        if ((err && (err->code != orig_err->code)) ||
+            (err && !orig_err)) {
+            VIR_ERROR(_("Failed to bind memory to nodeset '%s': %s"),
+                      vm->def->numatune.memory.nodeset,
+                      err ? err->message : _("unknown error"));
+            virResetLastError();
+            goto cleanup;
+        }
+    } else if (model == VIR_DOMAIN_NUMATUNE_MEM_PREFERRED) {
+        for (i=0; i<mask->size; i++) {
+            if (numa_bitmask_isbitset(mask, i))
+                node = i;
+        }
+
+        numa_set_bind_policy(0);
+        numa_set_preferred(node);
+
+        err = virGetLastError();
+        if ((err && (err->code != orig_err->code)) ||
+            (err && !orig_err)) {
+            VIR_ERROR(_("Failed to set memory policy as preferred to node "
+                        "'%s': %s"), vm->def->numatune.memory.nodeset,
+                      err ? err->message : _("unknown error"));
+            virResetLastError();
+            goto cleanup;
+        }
+    } else if (model == VIR_DOMAIN_NUMATUNE_MEM_INTERLEAVE) {
+        numa_set_interleave_mask(mask);
+
+        err = virGetLastError();
+        if ((err && (err->code != orig_err->code)) ||
+            (err && !orig_err)) {
+            VIR_ERROR(_("Failed to interleave memory to nodeset '%s': %s"),
+                      vm->def->numatune.memory.nodeset,
+                      err ? err->message : _("unknown error"));
+            virResetLastError();
+            goto cleanup;
+        }
+    } else {
+        /* XXX: Shouldn't go here, as we already do checking when
+         * parsing domain XML.
+         */
+        qemuReportError(VIR_ERR_XML_ERROR,
+                        "%s", _("Invalid model for memory NUMA tuning."));
+        goto cleanup;
+    }
+
+    ret = 0;
+
+cleanup:
+    numa_bitmask_free(mask);
+    return ret;
+}
+#else
+static int
+qemuProcessInitNumaPolicy(virDomainObjPtr vm ATTRIBUTE_UNUSED)
+{
+    return 0;
+}
+#endif
+
+/*
  * To be run between fork/exec of QEMU only
  */
 static int
@@ -1789,6 +1884,9 @@ static int qemuProcessHook(void *data)
     if (qemuProcessInitCpuAffinity(h->vm) < 0)
         return -1;
 
+    if (qemuProcessInitNumaPolicy(h->vm) < 0)
+        return -1;
+
     if (virSecurityManagerSetProcessLabel(h->driver->securityManager, h->vm) < 0)
         return -1;
 
-- 
1.7.4
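The reason for running the policy setup "between fork and exec" is that a thread's memory policy is, per the set_mempolicy(2) documentation, preserved across execve() and inherited across fork(), so applying it in the child affects QEMU without changing the policy of the daemon itself. The sketch below shows only that general pattern with plain POSIX calls. It deliberately leaves out libnuma, and run_with_hook(), exec_hook, failing_hook() and passing_hook() are hypothetical names, not libvirt's command API; the exit codes 126 and 127 are arbitrary choices for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* A hook runs in the child after fork() and before exec(); it returns
 * 0 on success and a negative value to abort the launch. */
typedef int (*exec_hook)(void *data);

/* Fork, run `hook` in the child, then exec argv[0]. Returns the child's
 * exit status, or -1 if fork/wait failed or the child died abnormally.
 * Exit code 127 means the hook failed, 126 means exec itself failed. */
static int run_with_hook(char *const argv[], exec_hook hook, void *data)
{
    pid_t pid;
    int status;

    assert(argv != NULL && argv[0] != NULL);

    pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: anything set up here (affinity, memory policy, ...)
         * applies to the new program but not to the parent. */
        if (hook != NULL && hook(data) < 0)
            _exit(127);
        execvp(argv[0], argv);
        _exit(126);   /* only reached if exec failed */
    }

    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/* Example hooks, standing in for something like a NUMA policy setup. */
static int failing_hook(void *data) { (void)data; return -1; }
static int passing_hook(void *data) { (void)data; return 0; }
```

In the patch, qemuProcessHook plays the role of the hook, and a negative return from qemuProcessInitNumaPolicy aborts the launch in the same way.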

The added tests only validate that the persistent XML is correct. They cannot check whether the NUMA policy is set correctly; we may need to write specific tests for that.
---
 .../qemuxml2argvdata/qemuxml2argv-numa-memory.args |    4 +++
 .../qemuxml2argvdata/qemuxml2argv-numa-memory.xml  |   28 ++++++++++++++++++++
 tests/qemuxml2argvtest.c                           |    2 +
 tests/qemuxml2xmltest.c                            |    2 +
 4 files changed, 36 insertions(+), 0 deletions(-)
 create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-numa-memory.args
 create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-numa-memory.xml

diff --git a/tests/qemuxml2argvdata/qemuxml2argv-numa-memory.args b/tests/qemuxml2argvdata/qemuxml2argv-numa-memory.args
new file mode 100644
index 0000000..f44b73a
--- /dev/null
+++ b/tests/qemuxml2argvdata/qemuxml2argv-numa-memory.args
@@ -0,0 +1,4 @@
+LC_ALL=C PATH=/bin HOME=/home/test USER=test LOGNAME=test /usr/bin/qemu \
+-S -M pc -m 214 -smp 2 -nographic -monitor \
+unix:/tmp/test-monitor,server,nowait -no-acpi -boot c -hda \
+/dev/HostVG/QEMUGuest1 -net none -serial none -parallel none -usb
diff --git a/tests/qemuxml2argvdata/qemuxml2argv-numa-memory.xml b/tests/qemuxml2argvdata/qemuxml2argv-numa-memory.xml
new file mode 100644
index 0000000..d350f7c
--- /dev/null
+++ b/tests/qemuxml2argvdata/qemuxml2argv-numa-memory.xml
@@ -0,0 +1,28 @@
+<domain type='qemu'>
+  <name>QEMUGuest1</name>
+  <uuid>c7a5fdbd-edaf-9455-926a-d65c16db1809</uuid>
+  <memory>219136</memory>
+  <currentMemory>219136</currentMemory>
+  <vcpu>2</vcpu>
+  <numatune>
+    <memory model='interleave' nodeset='0'/>
+  </numatune>
+  <os>
+    <type arch='i686' machine='pc'>hvm</type>
+    <boot dev='hd'/>
+  </os>
+  <clock offset='utc'/>
+  <on_poweroff>destroy</on_poweroff>
+  <on_reboot>restart</on_reboot>
+  <on_crash>destroy</on_crash>
+  <devices>
+    <emulator>/usr/bin/qemu</emulator>
+    <disk type='block' device='disk'>
+      <source dev='/dev/HostVG/QEMUGuest1'/>
+      <target dev='hda' bus='ide'/>
+      <address type='drive' controller='0' bus='0' unit='0'/>
+    </disk>
+    <controller type='ide' index='0'/>
+    <memballoon model='virtio'/>
+  </devices>
+</domain>
diff --git a/tests/qemuxml2argvtest.c b/tests/qemuxml2argvtest.c
index a7e4cc0..880c59d 100644
--- a/tests/qemuxml2argvtest.c
+++ b/tests/qemuxml2argvtest.c
@@ -480,6 +480,8 @@ mymain(void)
 
     DO_TEST("smp", false, QEMU_CAPS_SMP_TOPOLOGY);
 
+    DO_TEST("numa-memory", false, NONE);
+
     DO_TEST("cpu-topology1", false, QEMU_CAPS_SMP_TOPOLOGY);
     DO_TEST("cpu-topology2", false, QEMU_CAPS_SMP_TOPOLOGY);
     DO_TEST("cpu-topology3", false, NONE);
diff --git a/tests/qemuxml2xmltest.c b/tests/qemuxml2xmltest.c
index 5bfbcab..a9d40ca 100644
--- a/tests/qemuxml2xmltest.c
+++ b/tests/qemuxml2xmltest.c
@@ -180,6 +180,8 @@ mymain(void)
 
     DO_TEST("smp");
 
+    DO_TEST("numa-memory");
+
     /* These tests generate different XML */
     DO_TEST_DIFFERENT("balloon-device-auto");
     DO_TEST_DIFFERENT("channel-virtio-auto");
-- 
1.7.4

On Thu, May 12, 2011 at 06:22:49PM +0800, Osier Yang wrote:
> Hi, All
>
> This series adopts Daniel's suggestion on v1: it uses libnuma instead of invoking numactl to set the NUMA policy. It also adds support for the "interleave" and "preferred" modes, in addition to the "strict" mode supported in v1.
>
> The new XML looks like:
>
> <numatune>
>   <memory model="interleave" nodeset="+0-4,8-12"/>
> </numatune>
>
> I have kept the numactl nodeset syntax for "nodeset". The purpose of adding NUMA tuning support is to serve NUMA users, and keeping the syntax the same as numactl's will make it familiar to them.

Compatibility with numactl syntax is an explicit non-goal. numactl is just one platform-specific implementation. Compatibility with numactl syntax is of no interest to the ESX or VirtualBox drivers. The libvirt NUMA syntax should use other existing libvirt XML as its design compatibility target.

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|

On 2011-05-12 18:45, Daniel P. Berrange wrote:
> Compatibility with numactl syntax is an explicit non-goal. numactl is just one platform specific impl. Compatibility with numactl syntax is of no interest to the ESX or VirtualBox drivers. The libvirt NUMA syntax should be using other existing libvirt XML as the design compatibility target.

Hi, Dan

The syntax is actually not numactl's but libnuma's, which provides the API numa_parse_nodestring() to parse it. I'm not sure how ESX/VirtualBox will support NUMA tuning, but if they use libnuma, IMHO there is no problem here.

Regards
Osier

On 05/12/2011 05:01 AM, Osier Yang wrote:
> Hi, Dan
>
> The syntax is actually not of numactl, but of libnuma, it provides API numa_parse_nodestring() to parse the syntax,

The point we're trying to make is that the XML should _not_ match libnuma, but should match <vcpu cpuset=...>. That is, the XML should use 0-4,^3 to mean 0, 1, 2, 4; and then we need an internal translation routine that converts ^ to + before calling libnuma functions.

The fact that we use libnuma under the hood is an implementation detail; in the future we may find it easier to use some other mechanism to get the same semantic effect, and that other mechanism may have yet some third syntax. Therefore, it is better for libvirt to present consistent syntax for all of its cpuset parsing, rather than to have two different cpuset spellings based on what under-the-hood capability it is targeting.

-- 
Eric Blake   eblake@redhat.com    +1-801-349-2682
Libvirt virtualization library http://libvirt.org
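To make the "0-4,^3 means 0, 1, 2, 4" reading concrete, here is a minimal sketch of a parser for the cpuset-style syntax that produces a plain bitmask, which a driver could then hand to whatever backend it uses. It is an illustration under stated assumptions, not libvirt's parser: parse_nodeset() and parse_num() are hypothetical names, node numbers are limited to 0-63 so that the set fits in a uint64_t, and only the forms allowed by the existing cpuset schema pattern (N, N-M, ^N) are accepted.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_NODES 64   /* assumption: the set fits in one 64-bit word */

/* Parse a decimal number below MAX_NODES at *s and advance *s past it.
 * Returns -1 if there is no digit or the value is too large. */
static int parse_num(const char **s)
{
    int v = 0;

    if (!isdigit((unsigned char)**s))
        return -1;
    while (isdigit((unsigned char)**s)) {
        v = v * 10 + (**s - '0');
        if (v >= MAX_NODES)
            return -1;
        (*s)++;
    }
    return v;
}

/* Parse "N", "N-M" and "^N" elements separated by commas. Elements are
 * applied left to right, so "^N" removes N from what was listed before.
 * Returns 0 and stores the bitmask in *mask, or -1 on malformed input. */
static int parse_nodeset(const char *str, uint64_t *mask)
{
    const char *p = str;
    uint64_t m = 0;

    assert(str != NULL && mask != NULL);

    for (;;) {
        int negate = 0;
        int lo, hi;

        if (*p == '^') {
            negate = 1;
            p++;
        }
        if ((lo = parse_num(&p)) < 0)
            return -1;
        hi = lo;
        if (*p == '-') {
            if (negate)
                return -1;          /* "^N-M" is not accepted here */
            p++;
            if ((hi = parse_num(&p)) < lo)
                return -1;          /* missing or reversed upper bound */
        }
        for (int i = lo; i <= hi; i++) {
            if (negate)
                m &= ~(UINT64_C(1) << i);
            else
                m |= UINT64_C(1) << i;
        }
        if (*p == '\0')
            break;
        if (*p != ',')
            return -1;
        p++;
    }

    *mask = m;
    return 0;
}
```

With this sketch, "0-4,^3" yields the mask 0x17 (nodes 0, 1, 2 and 4), while libnuma-only spellings such as "+0-4" are rejected.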

On 2011-05-12 22:11, Eric Blake wrote:
> The point we're trying to make is that the XML should _not_ match libnuma, but should match <vcpu cpuset=...>. That is, the XML should use 0-4,^3 to mean 0, 1, 2, 4; and then we need an internal translation routine that converts ^ to + before calling libnuma functions.

"+" means something different from "^":

[quote]
The + indicates that the node numbers are relative to the process' set of allowed nodes in its current cpuset.
[/quote]

So does "!":

[quote]
A !N-N notation indicates the inverse of N-N, in other words all nodes except N-N
[/quote]

> The fact that we use libnuma under the hood is an implementation detail; in the future we may find it easier to use some other mechanism to get the same semantic effect, and that other mechanism may have yet some third syntax. Therefore, it is better for libvirt to present consistent syntax for all of its cpuset parsing, rather than to have two different cpuset spellings based on what under-the-hood capability it is targetting.

I agree that we may use another mechanism to get the same semantic effect in the future; that is a good consideration.

But the syntax is for a NUMA *NODE* set, not a *CPU* set. If we use the cpuset syntax for the NUMA nodeset, we lose some semantics. For "!2-4" we could use "^2,^3,^4" as an alternative, though it looks quite awkward. The disadvantage is that we give up some smarter syntax; the advantage is that we follow the cpuset syntax, and the two differ only at the user-visible level, since the final bitmasks are the same.

For "+2-4", however, there is no alternative in the cpuset syntax; as far as I understand, it has a specific meaning for a NUMA nodeset.

So I still think we need to introduce a different syntax for the NUMA nodeset rather than reuse the cpuset syntax. Even if we use some other mechanism in the future, can't we ask the new mechanism to follow the NUMA nodeset syntax?

Regards
Osier
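The claim that "!2-4" and a list of single-node exclusions end in the same bitmask can be checked with a few bit operations. The sketch below is a hedged illustration: first_n_nodes(), range_mask() and except_range() are hypothetical helpers, the host is assumed to have at most 64 nodes, and the cpuset-style spelling is assumed to start from an explicit base range (for example "0-7,^2,^3,^4" on an 8-node host), since "^N" only removes nodes that were listed before it.

```c
#include <assert.h>
#include <stdint.h>

/* Mask with bits 0..n-1 set; n may be 0..64. */
static uint64_t first_n_nodes(unsigned n)
{
    assert(n <= 64);
    return n == 64 ? UINT64_MAX : (UINT64_C(1) << n) - 1;
}

/* Mask with bits lo..hi set (inclusive). */
static uint64_t range_mask(unsigned lo, unsigned hi)
{
    assert(lo <= hi && hi < 64);
    return first_n_nodes(hi + 1) & ~first_n_nodes(lo);
}

/* libnuma-style "!lo-hi" on a host with `nnodes` nodes:
 * every node except those in lo..hi. */
static uint64_t except_range(unsigned nnodes, unsigned lo, unsigned hi)
{
    return first_n_nodes(nnodes) & ~range_mask(lo, hi);
}
```

On an 8-node host, except_range(8, 2, 4) is 0xE3 (nodes 0, 1, 5, 6, 7), which is the same mask obtained by starting from nodes 0-7 and clearing bits 2, 3 and 4 one at a time.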

On Thu, May 12, 2011 at 11:01:09PM +0800, Osier Yang wrote:
> > The point we're trying to make is that the XML should _not_ match libnuma, but should match <vcpu cpuset=...>. That is, the XML should use 0-4,^3 to mean 0, 1, 2, 4; and then we need an internal translation routine that converts ^ to + before calling libnuma functions.
>
> "+" means different with "^".
>
> [quote] The + indicates that the node numbers are relative to the process' set of allowed nodes in its current cpuset. [/quote]
>
> Also "!",
>
> [quote] A !N-N notation indicates the inverse of N-N, in other words all nodes except N-N [/quote]
>
> > The fact that we use libnuma under the hood is an implementation detail; in the future we may find it easier to use some other mechanism to get the same semantic effect, and that other mechanism may have yet some third syntax. Therefore, it is better for libvirt to present consistent syntax for all of its cpuset parsing, rather than to have two different cpuset spellings based on what under-the-hood capability it is targetting.
>
> Agree that we may use other mechanism to get the same sementic effect in future, this is good consideration.
>
> but the syntax is for NUMA *NODE* set, not *CPU* set, if we use same syntax as cpuset for NUMA nodeset, then we lose some semantics, e.g. for a "!2-4", we could use "^2,^3,^4" as an alternative solution, though it looks quite uncomfortable, and the disadvantage is we abort some smarter syntax, but the advantage is we follow the syntax of cpuset, and actually they are just different at user visible level, the final bitmask are same.

There is no reason why '^2' couldn't also be made to support '^2-4' in our current CPU set parsing code.

> But for "+2-4", we have no alternative solution with cpuset's syntax, as far as I could understand, it has specific meaning for NUMA nodeset.

That kind of syntax does not make sense for libvirt. Configuration should not be described relative to the current runtime policy. The XML description should be in a self-contained, canonical format. So I consider it a benefit that we don't support that syntax.

Regards,
Daniel
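One way to read "self-contained and canonical" is that whatever the user typed, the stored XML always spells a given node set the same way, with no reference to runtime state. The sketch below illustrates that idea by formatting a bitmask as sorted ranges. It is not libvirt code: format_nodeset() is a hypothetical helper, and the 64-node limit and the "N-M" range style are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Write the canonical spelling of `mask` into `buf`: ascending node
 * numbers, with runs of consecutive nodes collapsed into "N-M".
 * Returns 0 on success, or -1 if `buf` is too small. */
static int format_nodeset(uint64_t mask, char *buf, size_t len)
{
    size_t used = 0;
    int i = 0;

    assert(buf != NULL && len > 0);
    buf[0] = '\0';

    while (i < 64) {
        int start, end, n;

        if (!(mask & (UINT64_C(1) << i))) {
            i++;
            continue;
        }

        /* Found the start of a run; extend it as far as it goes. */
        start = i;
        while (i < 64 && (mask & (UINT64_C(1) << i)))
            i++;
        end = i - 1;

        if (start == end)
            n = snprintf(buf + used, len - used, "%s%d",
                         used ? "," : "", start);
        else
            n = snprintf(buf + used, len - used, "%s%d-%d",
                         used ? "," : "", start, end);
        if (n < 0 || (size_t)n >= len - used)
            return -1;
        used += (size_t)n;
    }
    return 0;
}
```

With this helper, the mask for nodes 0, 1, 2 and 4 is always written as "0-2,4", whether the user originally wrote "0-4,^3" or "0,1,2,4".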

On 2011-05-13 17:42, Daniel P. Berrange wrote:
> There is no reason why '^2' couldn't also be made to support '^2-4' in our current CPU set parsing code.
>
> > But for "+2-4", we have no alternative solution with cpuset's syntax, as far as I could understand, it has specific meaning for NUMA nodeset.
>
> That kind of syntax does not make sense for libvirt. Configuration should not be described relative to the current runtime policy. The XML description should be self-contained & canonical format. So I consider it a benefit that we don't support that syntax.

Okay, I talked with Bill, and he's fine with losing the "+" syntax, so my main reason for insisting on the libnuma syntax no longer applies. :-)

Leaving aside the loss of "+", yes, we need to conform to the existing cpuset syntax. Will update.

Thanks for your patience.

Regards
Osier

On 05/12/2011 06:45 AM, Daniel P. Berrange wrote:
> Compatibility with numactl syntax is an explicit non-goal. numactl is just one platform specific impl. Compatibility with numactl syntax is of no interest to the ESX or VirtualBox drivers. The libvirt NUMA syntax should be using other existing libvirt XML as the design compatibility target.

I won't argue the semantics of the XML with you, but please keep in mind that one of the main differences between a numactl-like mechanism and taskset is that the NUMA mechanisms also let you bind to the memory of specific NUMA nodes, as well as specify the access type.

So from the outside looking in, keeping things in terms of cpusets would seem not to be in full agreement with the RFE for NUMA support. I would think that the specification of NUMA binding would need to include NUMA nodes and specify memory bindings as well as the access type. From a performance perspective, support for true NUMA is the last hurdle keeping libvirt from being used in high-performance situations.

I think that specifying things in terms of nodes instead of CPUs will make it easier for the end user. So I guess I need to withdraw the part about not arguing XML...

Thanks for your time,
-mark

> Regards,
> Daniel

-- 
Mark Wagner
Principal SW Engineer - Performance
Red Hat

On Sun, May 15, 2011 at 09:37:21PM -0400, Mark Wagner wrote:
I won't argue the semantics of the XML with you, but please keep in mind that one of the main differences between using a numactl-like mechanism and taskset is that the NUMA mechanisms also let you bind to specific NUMA node memory, as well as specifying the access type.
So from the outside looking in, keeping things in terms of cpusets would seem not to be in full agreement with the RFE for NUMA support. I would think that the specification of NUMA binding would need to include NUMA nodes and specify memory bindings as well as the access type. From a performance perspective, support for true NUMA is the last hurdle keeping libvirt from being used in high-performance situations.
I think that specifying things in terms of nodes instead of CPUs will make it easier for the end user. So I guess I need to withdraw the part about not arguing XML...
Hi Mark,
I'm not 100% sure I understand what you are disagreeing with:
- it seems to me that the proposed model does allow the specification of the nodes and the associated memory binding
- I wonder if you just object to the "nodeset" attribute name here
- please note that "Node" in the context of libvirt has the specific meaning of the whole physical machine (http://libvirt.org/goals.html); that terminology was set up 5 years ago and is present in many places of the libvirt API. On the other hand "nodeset" is being used in other places to specify a set of cpu nodes in a NUMA context.
Could you help us clarify your point of view?
thanks!
Daniel
--
Daniel Veillard | libxml Gnome XML XSLT toolkit http://xmlsoft.org/
daniel@veillard.com | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library http://libvirt.org/

On 2011-05-19 15:34, Daniel Veillard wrote:
Hi Mark,
I'm not 100% sure I understand what you are disagreeing with:
- it seems to me that the proposed model does allow the specification of the nodes and the associated memory binding
- I wonder if you just object to the "nodeset" attribute name here
- please note that "Node" in the context of libvirt has the specific meaning of the whole physical machine (http://libvirt.org/goals.html); that terminology was set up 5 years ago and is present in many places of the libvirt API. On the other hand "nodeset" is being used in other places to specify a set of cpu nodes in a NUMA context.
I guess Mark is not objecting to the attribute name "nodeset"; it seems he means that if we use the same syntax as "cpuset", it is not in full agreement with the RFE for "NUMA support", as we will lose some syntax that libnuma uses.
As a conclusion after the discussion, we will use "nodeset" as the attribute name, with the same syntax as "cpuset", and we won't use the node string parsing function "numa_parse_nodestring" provided by libnuma, if we don't want to make things a mess:
"numa_parse_nodestring" only accepts "!" (also "+", but as we won't support "+", I skip it here) at the beginning of the specified node string, e.g. "0-4,!8-12" is not valid. However, our current "cpuset" syntax allows "^" to be specified anywhere, e.g. "0-8,^2-4" is valid. So even if we convert "^" to "!" before passing the string to "numa_parse_nodestring", that still doesn't make sense, unless we declare in the documents that we use the same syntax as "cpuset" except that "^" must be specified at the beginning, and that is no better than introducing a different syntax. On the other hand, "numa_parse_nodestring" doesn't support syntax like "!6". So in one word, if we use the same syntax as "cpuset", we can't/won't use the numa parsing function.
We will use "virDomainCpuSetParse" to parse the value of "nodeset" into a bit mask and then pass it to the numa setting functions. We need to do some conversion before passing it to the numa functions though, as the data types are different.
Even if we modify the current "cpuset" parsing function to support "^2-4", that will still be different from what "!" means in libnuma.
That means we will use a nearly completely different syntax from libnuma to represent NUMA nodes in libvirt, losing the semantics of both "+" and "!" in the presentation layer.
Thoughts?
Regards
Osier
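The "cpuset"-style semantics described in this message can be sketched in C as follows. This is only an illustration under stated assumptions, not libvirt's virDomainCpuSetParse: the names parse_nodeset and MAX_NODES are hypothetical, entries are assumed to be applied left to right, and "^" is assumed to be allowed before either a single node or a range. The libnuma-only prefixes "+" and "!" are rejected.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

#define MAX_NODES 64 /* hypothetical limit so the set fits in one 64-bit mask */

/*
 * Hypothetical helper: parse a cpuset-style node string such as "0-8,^2-4"
 * into a bit mask (bit N set means node N is selected).
 * Entries are applied left to right; "^" clears the node or range it prefixes.
 * Returns 0 on success, -1 on a syntax or range error.
 */
int parse_nodeset(const char *str, unsigned long long *mask)
{
    const char *p = str;

    assert(str != NULL && mask != NULL);
    *mask = 0;

    for (;;) {
        int negate = 0;
        char *end;
        unsigned long first, last;

        if (*p == '^') {              /* exclusion may appear in any entry */
            negate = 1;
            p++;
        }
        if (!isdigit((unsigned char)*p))
            return -1;                /* rejects "", "+0-4", "!8-12", "1," */
        first = last = strtoul(p, &end, 10);
        p = end;

        if (*p == '-') {              /* optional range "first-last" */
            p++;
            if (!isdigit((unsigned char)*p))
                return -1;
            last = strtoul(p, &end, 10);
            p = end;
        }
        if (first > last || last >= MAX_NODES)
            return -1;

        for (unsigned long n = first; n <= last; n++) {
            if (negate)
                *mask &= ~(1ULL << n);
            else
                *mask |= 1ULL << n;
        }

        if (*p == '\0')
            return 0;
        if (*p != ',')
            return -1;
        p++;
    }
}
```

Under these assumptions, "0-8,^2-4" selects nodes 0, 1 and 5 through 8 (mask 0x1E3), while "+0-4,8-12" and "0-4,!8-12" are rejected as syntax errors.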

On Fri, May 20, 2011 at 03:22:10PM +0800, Osier Yang wrote:
I guess Mark is not objecting to the attribute name "nodeset"; it seems he means that if we use the same syntax as "cpuset", it is not in full agreement with the RFE for "NUMA support", as we will lose some syntax that libnuma uses.
As a conclusion after the discussion, we will use "nodeset" as the attribute name, with the same syntax as "cpuset", and we won't use the node string parsing function "numa_parse_nodestring" provided by libnuma, if we don't want to make things a mess:
"numa_parse_nodestring" only accepts "!" (also "+", but as we won't support "+", I skip it here) at the beginning of the specified node string, e.g. "0-4,!8-12" is not valid. However, our current "cpuset" syntax allows "^" to be specified anywhere, e.g. "0-8,^2-4" is valid. So even if we convert "^" to "!" before passing the string to "numa_parse_nodestring", that still doesn't make sense, unless we declare in the documents that we use the same syntax as "cpuset" except that "^" must be specified at the beginning, and that is no better than introducing a different syntax. On the other hand, "numa_parse_nodestring" doesn't support syntax like "!6". So in one word, if we use the same syntax as "cpuset", we can't/won't use the numa parsing function.
We will use "virDomainCpuSetParse" to parse the value of "nodeset" into a bit mask and then pass it to the numa setting functions. We need to do some conversion before passing it to the numa functions though, as the data types are different.
That is necessary regardless. The virDomainDef struct cannot use any data types or parsing APIs from libnuma, since it must be generic, platform-independent code. The libnuma data types & APIs will only be relevant in the QEMU driver at the time when the affinity is being set.
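The conversion step mentioned here can be sketched in C as follows. This is an illustration only: it assumes the generic, platform-independent result of parsing is an array with one byte per node (non-zero meaning selected), and it packs that into an array of unsigned long words, the general shape of node mask that the Linux set_mempolicy(2) call takes. The name pack_nodemask is hypothetical, and no libnuma types or calls are used.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/*
 * Hypothetical helper: convert a generic one-byte-per-node map into a
 * packed bit mask stored in unsigned long words (bit N of the mask is
 * set when map[N] is non-zero).
 * Returns 0 on success, -1 if the output array is too small.
 */
int pack_nodemask(const char *map, size_t nnodes,
                  unsigned long *words, size_t nwords)
{
    size_t needed = (nnodes + BITS_PER_WORD - 1) / BITS_PER_WORD;

    assert(map != NULL && words != NULL);
    if (needed > nwords)
        return -1;

    /* Start from an empty mask, then set one bit per selected node. */
    memset(words, 0, nwords * sizeof(*words));
    for (size_t i = 0; i < nnodes; i++) {
        if (map[i])
            words[i / BITS_PER_WORD] |= 1UL << (i % BITS_PER_WORD);
    }
    return 0;
}
```

Keeping the conversion in a driver-side helper like this matches the point made above: the parsed representation stays generic, and only the code that actually applies the policy needs to know the platform's mask type.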
Even if we modify the current "cpuset" parsing function to support "^2-4", that will still be different from what "!" means in libnuma.
That means we will use a nearly completely different syntax from libnuma to represent NUMA nodes in libvirt, losing the semantics of both "+" and "!" in the presentation layer.
"+" is irrelevant, and everything else libvirt does is functionally equivalent, even if the syntax is different, so we're not losing anything here.
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On 05/19/2011 03:34 AM, Daniel Veillard wrote:
Hi Mark,
I'm not 100% sure I understand what you are disagreeing with:
- it seems to me that the proposed model does allow the specification of the nodes and the associated memory binding
- I wonder if you just object to the "nodeset" attribute name here
- please note that "Node" in the context of libvirt has the specific meaning of the whole physical machine (http://libvirt.org/goals.html); that terminology was set up 5 years ago and is present in many places of the libvirt API. On the other hand "nodeset" is being used in other places to specify a set of cpu nodes in a NUMA context.
Could you help us clarify your point of view?
thanks!
Daniel
Daniel,
I think that maybe I didn't fully understand the entire context. My main goal is to make sure that we consider the differences between NUMA and simple CPU pinning. After rereading the threads and some conversations, it appears that you are doing that.
Sorry for the noise on this issue.
btw - I must say that I think that the libvirt team is doing a great job overall in getting support for the features that are needed to achieve good to top performance from KVM. I actually based a lot of my Summit presentation around using libvirt and virt-manager. Once NUMA support is in, I expect that we will see some SPECvirt submissions that are based on libvirt.
Thanks for all of the hard work!
-mark
--
Mark Wagner Principal SW Engineer - Performance Red Hat

On Fri, May 20, 2011 at 07:06:46AM -0400, Mark Wagner wrote:
On 05/19/2011 03:34 AM, Daniel Veillard wrote:
Hi Mark,
I think that maybe I didn't fully understand the entire context.
Well I wasn't sure either if the problem was lack of expressivity or some problem in the syntax :-)
My main goal is to make sure that we consider the differences between NUMA and simple CPU pinning. After rereading the threads and some conversations, it appears that you are doing that.
Ah, good, hopefully you will be able to fully check the result soon.
Sorry for the noise on this issue
Well, I prefer a bit of noise and making sure things are implemented fully rather than pushing code that would not be adequate, so no problem! :-)
btw - I must say that I think that the libvirt team is doing a great job overall in getting support for the features that are needed to achieve good to top performance from KVM. I actually based a lot of my Summit presentation around using libvirt and virt-manager. Once NUMA support is in, I expect that we will see some SPECvirt submissions that are based on libvirt. Thanks for all of the hard work!
Thanks for the kind words, hopefully we will be able to get all of the results using libvirt soon :-)
Daniel
--
Daniel Veillard | libxml Gnome XML XSLT toolkit http://xmlsoft.org/
daniel@veillard.com | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library http://libvirt.org/
participants (5): Daniel P. Berrange, Daniel Veillard, Eric Blake, Mark Wagner, Osier Yang