[libvirt] [RFC PATCH] NUMA tuning support

Hi, all,

This is a simple implementation of NUMA tuning support based on the 'numactl' binary. Currently it only supports binding memory to specified nodes, using the "--membind" option; it may need to support more, but I'd like to send it early to make sure the principle is correct.

Ideally, NUMA tuning support should be added to qemu-kvm first, so that it provides command-line options and all libvirt needs to do is pass those options through. Unfortunately qemu-kvm doesn't support that yet, so all we can do currently is use numactl. It forks a process, which is a bit more expensive than qemu-kvm doing NUMA tuning internally with libnuma, but I guess it shouldn't affect much.

The NUMA tuning XML looks like:

<numatune>
  <membind nodeset='+0-4,8-12'/>
</numatune>

Any thoughts/feedback is appreciated.

Regards,
Osier

[PATCH 1/5] build: Define NUMACTL for numa tuning use
[PATCH 2/5] docs: Define XML schema for numa tuning and add docs
[PATCH 3/5] conf: Support NUMA tuning XML
[PATCH 4/5] qemu: Build command line for NUMA tuning
[PATCH 5/5] tests: Add tests for guest use NUMA tuning

NUMACTL is the path of the 'numactl' binary program.
---
 configure.ac |   14 ++++++++++----
 1 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/configure.ac b/configure.ac
index 7c68bca..3b8e0be 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1233,19 +1233,22 @@ AM_CONDITIONAL([WITH_DTRACE], [test "$with_dtrace" != "no"])
 dnl NUMA lib
 AC_ARG_WITH([numactl],
-  AC_HELP_STRING([--with-numactl], [use numactl for host topology info @<:@default=check@:>@]),
-  [],
-  [with_numactl=check])
+  AC_HELP_STRING([--with-numactl],
+    [use numactl for host topology info and setting NUMA policy for domain process @<:@default=check@:>@]),
+  [],
+  [with_numactl=check])
 NUMACTL_CFLAGS=
 NUMACTL_LIBS=
 if test "$with_qemu" = "yes" && test "$with_numactl" != "no"; then
   old_cflags="$CFLAGS"
   old_libs="$LIBS"
+  AC_PATH_PROG([NUMACTL], [numactl], [], [$PATH:/sbin:/usr/sbin])
+
   if test "$with_numactl" = "check"; then
     AC_CHECK_HEADER([numa.h],[],[with_numactl=no])
     AC_CHECK_LIB([numa], [numa_available],[],[with_numactl=no])
-    if test "$with_numactl" != "no"; then
+    if test "$with_numactl" != "no" && test -n "$NUMACTL"; then
       with_numactl="yes"
     fi
   else
@@ -1254,6 +1257,8 @@ if test "$with_qemu" = "yes" && test "$with_numactl" != "no"; then
     AC_CHECK_LIB([numa], [numa_available],[],[fail=1])
     test $fail = 1 &&
       AC_MSG_ERROR([You must install the numactl development package in order to compile and run libvirt])
+    test -z "$NUMACTL" &&
+      AC_MSG_ERROR([You must install numactl package to compile and run libvirt])
   fi
   CFLAGS="$old_cflags"
   LIBS="$old_libs"
@@ -1261,6 +1266,7 @@ fi
 if test "$with_numactl" = "yes"; then
   NUMACTL_LIBS="-lnuma"
   AC_DEFINE_UNQUOTED([HAVE_NUMACTL], 1, [whether numactl is available for topology info])
+  AC_DEFINE_UNQUOTED([NUMACTL],["$NUMACTL"], [Location or name of the numactl program])
 fi
 AM_CONDITIONAL([HAVE_NUMACTL], [test "$with_numactl" != "no"])
 AC_SUBST([NUMACTL_CFLAGS])
--
1.7.4

Currently we only want to use the "membind" function of numactl, but we may need more of its functions in the future, so introduce a "<numatune>" element; future NUMA-tuning-related XML should go into it.
---
 docs/formatdomain.html.in |   17 +++++++++++++++++
 docs/schemas/domain.rng   |   20 ++++++++++++++++++++
 2 files changed, 37 insertions(+), 0 deletions(-)

diff --git a/docs/formatdomain.html.in b/docs/formatdomain.html.in
index 5013c48..6da6465 100644
--- a/docs/formatdomain.html.in
+++ b/docs/formatdomain.html.in
@@ -288,6 +288,9 @@
   <min_guarantee>65536</min_guarantee>
 </memtune>
 <vcpu cpuset="1-4,^3,6" current="1">2</vcpu>
+<numatune>
+  <membind nodeset="1,2,!3-6">
+</numatune>
 ...</pre>

 <dl>
@@ -366,6 +369,20 @@
   the OS provided defaults. NB, There is no unit for the value, it's a
   relative measure based on the setting of other VM, e.g. A VM configured
   with value 2048 will get twice as much CPU time as a VM configured with value 1024.</dd>
+  <dt><code>numatune</code></dt>
+  <dd> The optional <code>numatune</code> element provides details of
+  how to tune the performance of a NUMA host via controlling NUMA policy for
+  domain process.
+  <dt><code>membind</code></dt>
+  <dd> The optional <code>membind</code> element specify how to allocate memory
+  for the domain process on a NUMA host. It contains attribute <code>nodeset</code>
+  , which specifies the NUMA nodes, the memory of domain process will only be
+  allocated from the specified nodes. <code>nodeset</code> can be specified as
+  N,N,N or N-N or N,N-N or N-N,N-N and so forth. Relative nodes may be specifed
+  as +N,N,N or +N-N or +N,N-N and so forth. The + indicates that the node numbers
+  are relative to the process' set of allowed nodes in its current cpuset. A
+  !N-N notation indicates the inverse of N-N, in other words all nodes except N-N.
+  If used with + notation, specify !+N-N.</dd>
 </dl>

 <h3><a name="elementsCPU">CPU model and topology</a></h3>

diff --git a/docs/schemas/domain.rng b/docs/schemas/domain.rng
index 7163c6e..811f5ed 100644
--- a/docs/schemas/domain.rng
+++ b/docs/schemas/domain.rng
@@ -387,6 +387,21 @@
         </zeroOrMore>
       </element>
     </optional>
+
+    <!-- All the NUMA related tunables would go in the numatune -->
+    <optional>
+      <element name="numatune">
+        <optional>
+          <!-- Only allocate memory from specified NUMA nodes. -->
+          <element name="membind">
+            <attribute name="nodeset">
+              <ref name="nodeset"/>
+            </attribute>
+          </element>
+        </optional>
+      </element>
+    </optional>
+
   </interleave>
 </define>

 <define name="clock">
@@ -2265,6 +2280,11 @@
       <param name="pattern">([0-9]+(-[0-9]+)?|\^[0-9]+)(,([0-9]+(-[0-9]+)?|\^[0-9]+))*</param>
     </data>
   </define>
+  <define name="nodeset">
+    <data type="string">
+      <param name="pattern">([!\+]?[0-9]+(-[0-9]+)?)(,([!\+]?[0-9]+(-[0-9]+)?))*</param>
+    </data>
+  </define>
   <define name="countCPU">
     <data type="unsignedShort">
       <param name="pattern">[0-9]+</param>
--
1.7.4

On Thu, May 05, 2011 at 05:38:27PM +0800, Osier Yang wrote:
Currently we only want to use the "membind" function of numactl, but we may need more of its functions in the future, so introduce a "<numatune>" element; future NUMA-tuning-related XML should go into it.
---
 docs/formatdomain.html.in |   17 +++++++++++++++++
 docs/schemas/domain.rng   |   20 ++++++++++++++++++++
 2 files changed, 37 insertions(+), 0 deletions(-)

diff --git a/docs/formatdomain.html.in b/docs/formatdomain.html.in
index 5013c48..6da6465 100644
--- a/docs/formatdomain.html.in
+++ b/docs/formatdomain.html.in
@@ -288,6 +288,9 @@
   <min_guarantee>65536</min_guarantee>
 </memtune>
 <vcpu cpuset="1-4,^3,6" current="1">2</vcpu>
+<numatune>
+  <membind nodeset="1,2,!3-6">
+</numatune>
I don't think we should be creating a new <numatune> element here, since it does not actually cover all aspects of NUMA tuning. We already have CPU NUMA pinning in the separate <vcpu> element. NUMA memory pinning should likely be in either the <memtune> or <memoryBacking> elements, probably the latter.

Also, it is not very nice to use a different negation syntax for the memory node specification than for the VCPU specification: "^3" vs "!3".

Looking to the future, we may want to consider how we'd allow host NUMA mapping on a fine-grained basis, per guest NUMA node. E.g. it is possible with QEMU to actually define a guest-visible NUMA topology for the virtual CPUs and memory using

  -numa node[,mem=size][,cpus=cpu[-cpu]][,nodeid=node]

We don't support that yet, which is something we ought to do. At that point you would probably also want to be able to map guest NUMA nodes to host NUMA nodes.

Regards,
Daniel

--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
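(As an illustration of that syntax, a two-node guest topology could be described with something like "-numa node,mem=1024,cpus=0-1,nodeid=0 -numa node,mem=1024,cpus=2-3,nodeid=1"; the concrete values here are made up for the example and are not taken from the patch series.)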

On 2011-05-05 23:29, Daniel P. Berrange wrote:
On Thu, May 05, 2011 at 05:38:27PM +0800, Osier Yang wrote:
Currently we only want to use the "membind" function of numactl, but we may need more of its functions in the future, so introduce a "<numatune>" element; future NUMA-tuning-related XML should go into it.
---
 docs/formatdomain.html.in |   17 +++++++++++++++++
 docs/schemas/domain.rng   |   20 ++++++++++++++++++++
 2 files changed, 37 insertions(+), 0 deletions(-)

diff --git a/docs/formatdomain.html.in b/docs/formatdomain.html.in
index 5013c48..6da6465 100644
--- a/docs/formatdomain.html.in
+++ b/docs/formatdomain.html.in
@@ -288,6 +288,9 @@
   <min_guarantee>65536</min_guarantee>
 </memtune>
 <vcpu cpuset="1-4,^3,6" current="1">2</vcpu>
+<numatune>
+  <membind nodeset="1,2,!3-6">
+</numatune>
I don't think we should be creating a new <numatune> element here since it is not actually covering all aspects of NUMA tuning. We already have CPU NUMA pinning in the separate <vcpu> element. NUMA memory pinning should likely be either in the <memtune> or <memoryBacking> elements, probably the latter.
Agreed that it doesn't cover all aspects of NUMA tuning; we also have <vcpupin>. The reason I didn't put it into <memtune> is that I'm not sure whether we will also support other tuning options.
Also, it is not very nice to use a different negation syntax for the memory node specification than for the VCPU specification: "^3" vs "!3".
NUMA tuning uses a different syntax; it also has "+", which is not used by the VCPU specification. So IMHO, once we have to accept "+", "!" should be accepted too; or we could do a conversion from "^" to "!"?
Looking to the future, we may want to consider how we'd allow host NUMA mapping on a fine-grained basis, per guest NUMA node. E.g. it is possible with QEMU to actually define a guest-visible NUMA topology for the virtual CPUs and memory using
-numa node[,mem=size][,cpus=cpu[-cpu]][,nodeid=node]
We don't support that yet, which is something we ought to do. At that point you would probably also want to be able to map guest NUMA nodes to host NUMA nodes.
As far as I understand it, don't we need a standalone <numatune> for things like this?

Thanks,
Osier

On Fri, May 06, 2011 at 10:25:31AM +0800, Osier Yang wrote:
On 2011-05-05 23:29, Daniel P. Berrange wrote:
On Thu, May 05, 2011 at 05:38:27PM +0800, Osier Yang wrote:
Currently we only want to use the "membind" function of numactl, but we may need more of its functions in the future, so introduce a "<numatune>" element; future NUMA-tuning-related XML should go into it.
---
 docs/formatdomain.html.in |   17 +++++++++++++++++
 docs/schemas/domain.rng   |   20 ++++++++++++++++++++
 2 files changed, 37 insertions(+), 0 deletions(-)

diff --git a/docs/formatdomain.html.in b/docs/formatdomain.html.in
index 5013c48..6da6465 100644
--- a/docs/formatdomain.html.in
+++ b/docs/formatdomain.html.in
@@ -288,6 +288,9 @@
   <min_guarantee>65536</min_guarantee>
 </memtune>
 <vcpu cpuset="1-4,^3,6" current="1">2</vcpu>
+<numatune>
+  <membind nodeset="1,2,!3-6">
+</numatune>
I don't think we should be creating a new <numatune> element here since it is not actually covering all aspects of NUMA tuning. We already have CPU NUMA pinning in the separate <vcpu> element. NUMA memory pinning should likely be either in the <memtune> or <memoryBacking> elements, probably the latter.
Agreed that it doesn't cover all aspects of NUMA tuning; we also have <vcpupin>. The reason I didn't put it into <memtune> is that I'm not sure whether we will also support other tuning options.
Also, it is not very nice to use a different negation syntax for the memory node specification than for the VCPU specification: "^3" vs "!3".
NUMA tuning uses a different syntax; it also has "+", which is not used by the VCPU specification. So IMHO, once we have to accept "+", "!" should be accepted too; or we could do a conversion from "^" to "!"?
My point is that it should *not* use a different syntax. The reason it currently uses a different syntax is that the code is directly exposing the numactl command line in the XML, rather than defining the syntax ourselves.

Daniel

--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

E.g.:

<numatune>
  <membind nodeset='+0-4,8-12'/>
</numatune>
---
 src/conf/domain_conf.c |   42 +++++++++++++++++++++++++++++++++++++++++-
 src/conf/domain_conf.h |   11 +++++++++++
 2 files changed, 52 insertions(+), 1 deletions(-)

diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c
index 2a681d9..fbf5f81 100644
--- a/src/conf/domain_conf.c
+++ b/src/conf/domain_conf.c
@@ -48,6 +48,11 @@
 #include "files.h"
 #include "bitmap.h"

+#if HAVE_NUMACTL
+# include <numa.h>
+#endif
+
+
 #define VIR_FROM_THIS VIR_FROM_DOMAIN

 VIR_ENUM_IMPL(virDomainVirt, VIR_DOMAIN_VIRT_LAST,
@@ -5530,6 +5535,29 @@ static virDomainDefPtr virDomainDefParseXML(virCapsPtr caps,
     }
     VIR_FREE(nodes);

+    /* Extract numatune if exists. */
+    if ((n = virXPathNodeSet("./numatune", ctxt, NULL)) < 0) {
+        virDomainReportError(VIR_ERR_INTERNAL_ERROR,
+                             "%s", _("cannot extract numatune nodes"));
+        goto error;
+    }
+
+    if (n) {
+#ifdef HAVE_NUMACTL
+        if (numa_available() < 0) {
+            virDomainReportError(VIR_ERR_INTERNAL_ERROR,
+                                 "%s", _("Host kernel is not aware of NUMA."));
+            goto error;
+        }
+
+        def->numatune.membind.nodeset = virXPathString("string(./numatune/membind/@nodeset)", ctxt);
+#else
+        virDomainReportError(VIR_ERR_INTERNAL_ERROR,
+                             "%s", _("libvirt is compiled without NUMA tuning support"));
+        goto error;
+#endif
+    }
+
     n = virXPathNodeSet("./features/*", ctxt, &nodes);
     if (n < 0)
         goto error;
@@ -6020,7 +6048,6 @@ static virDomainDefPtr virDomainDefParseXML(virCapsPtr caps,
         def->ninputs++;
     }

-
     /* analysis of the sound devices */
     if ((n = virXPathNodeSet("./devices/sound", ctxt, &nodes)) < 0) {
         virDomainReportError(VIR_ERR_INTERNAL_ERROR,
@@ -8175,6 +8202,19 @@ char *virDomainDefFormat(virDomainDefPtr def,
     if (def->cputune.shares || def->cputune.vcpupin)
         virBufferAddLit(&buf, "  </cputune>\n");

+#ifdef HAVE_NUMACTL
+    if (def->numatune.membind.nodeset) {
+        virBufferAddLit(&buf, "  <numatune>\n");
+    }
+
+    if (def->numatune.membind.nodeset)
+        virBufferVSprintf(&buf, "    <membind nodeset='%s'/>\n",
+                          def->numatune.membind.nodeset);
+
+    if (def->numatune.membind.nodeset)
+        virBufferAddLit(&buf, "  </numatune>\n");
+#endif
+
     if (def->sysinfo)
         virDomainSysinfoDefFormat(&buf, def->sysinfo);

diff --git a/src/conf/domain_conf.h b/src/conf/domain_conf.h
index 1dadf98..61c7d65 100644
--- a/src/conf/domain_conf.h
+++ b/src/conf/domain_conf.h
@@ -1085,6 +1085,15 @@ int virDomainVcpupinIsDuplicate(virDomainVcpupinDefPtr *def,
 virDomainVcpupinDefPtr virDomainVcpupinFindByVcpu(virDomainVcpupinDefPtr *def,
                                                   int nvcpupin,
                                                   int vcpu);
+typedef struct _virDomainNumatuneDef virDomainNumatuneDef;
+typedef virDomainNumatuneDef *virDomainNumatuneDefPtr;
+struct _virDomainNumatuneDef {
+    struct {
+        char *nodeset;
+    } membind;
+
+    /* Future NUMA tuning related stuff should go here. */
+};

 /* Guest VM main configuration */
 typedef struct _virDomainDef virDomainDef;
@@ -1120,6 +1129,8 @@ struct _virDomainDef {
         virDomainVcpupinDefPtr *vcpupin;
     } cputune;

+    virDomainNumatuneDef numatune;
+
     /* These 3 are based on virDomainLifeCycleAction enum flags */
     int onReboot;
     int onPoweroff;
--
1.7.4

Just prepend the numactl command line to the qemu command line.
---
 src/qemu/qemu_command.c |   16 ++++++++++++++++
 1 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
index 2205ed1..156fdfb 100644
--- a/src/qemu/qemu_command.c
+++ b/src/qemu/qemu_command.c
@@ -2762,7 +2762,23 @@ qemuBuildCommandLine(virConnectPtr conn,
         break;
     }

+#ifdef HAVE_NUMACTL
+    /* XXX: Logic here to build numactl commmand line need to be changed if
+     * more NUMA tuning related stuffs in future.
+     */
+    if (def->numatune.membind.nodeset) {
+        cmd = virCommandNewArgList(NUMACTL,
+                                   "-m",
+                                   def->numatune.membind.nodeset,
+                                   NULL);
+
+        virCommandAddArgList(cmd, emulator, "-S", NULL);
+    } else {
+        cmd = virCommandNewArgList(emulator, "-S", NULL);
+    }
+#else
     cmd = virCommandNewArgList(emulator, "-S", NULL);
+#endif

     virCommandAddEnvPassCommon(cmd);
--
1.7.4

On Thu, May 05, 2011 at 05:38:29PM +0800, Osier Yang wrote:
Just prepend the numactl command line to the qemu command line.
---
 src/qemu/qemu_command.c |   16 ++++++++++++++++
 1 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
index 2205ed1..156fdfb 100644
--- a/src/qemu/qemu_command.c
+++ b/src/qemu/qemu_command.c
@@ -2762,7 +2762,23 @@ qemuBuildCommandLine(virConnectPtr conn,
         break;
     }

+#ifdef HAVE_NUMACTL
+    /* XXX: Logic here to build numactl commmand line need to be changed if
+     * more NUMA tuning related stuffs in future.
+     */
+    if (def->numatune.membind.nodeset) {
+        cmd = virCommandNewArgList(NUMACTL,
+                                   "-m",
+                                   def->numatune.membind.nodeset,
+                                   NULL);
+
+        virCommandAddArgList(cmd, emulator, "-S", NULL);
+    } else {
+        cmd = virCommandNewArgList(emulator, "-S", NULL);
+    }
+#else
     cmd = virCommandNewArgList(emulator, "-S", NULL);
+#endif
NACK to this approach. We should be using the libnuma APIs to set the memory binding for the process directly, and not running the numactl command.

Regards,
Daniel

--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
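For reference, a minimal sketch of what the libnuma-based approach could look like, using libnuma's numa_parse_nodestring() and numa_set_membind() interfaces; the helper name and error handling here are illustrative, not actual libvirt code:

#include <stdio.h>
#include <numa.h>            /* link with -lnuma */

/* Apply a "membind"-style policy to the *current* process; the caller
 * would be the forked child, before exec'ing QEMU. */
static int set_membind_policy(char *nodeset)
{
    struct bitmask *nodes;

    if (numa_available() < 0) {
        fprintf(stderr, "host kernel is not NUMA aware\n");
        return -1;
    }

    /* Parse a string such as "0-4,8-12" into a node mask. */
    nodes = numa_parse_nodestring(nodeset);
    if (!nodes) {
        fprintf(stderr, "failed to parse nodeset '%s'\n", nodeset);
        return -1;
    }

    /* Restrict all future memory allocations to these nodes. */
    numa_set_membind(nodes);
    numa_free_nodemask(nodes);
    return 0;
}

numactl's --membind option ultimately installs the same kind of bind policy before exec'ing its target, so the effect on QEMU's allocations should be equivalent.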

Hi Daniel,

How can we get NUMA-aligned memory and CPUs if we apply binding APIs after the process has already started? Might not all the memory already be allocated on the wrong nodes by then?

For expert users, what are the problems with starting qemu with an external numactl command (with --cpunodebind and --membind) to guarantee optimal alignment?

Alternatives?

- Bill

On 05/05/2011 11:21 AM, Daniel P. Berrange wrote:
On Thu, May 05, 2011 at 05:38:29PM +0800, Osier Yang wrote:
Just prepend the numactl command line to the qemu command line.
---
 src/qemu/qemu_command.c |   16 ++++++++++++++++
 1 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
index 2205ed1..156fdfb 100644
--- a/src/qemu/qemu_command.c
+++ b/src/qemu/qemu_command.c
@@ -2762,7 +2762,23 @@ qemuBuildCommandLine(virConnectPtr conn,
         break;
     }

+#ifdef HAVE_NUMACTL
+    /* XXX: Logic here to build numactl commmand line need to be changed if
+     * more NUMA tuning related stuffs in future.
+     */
+    if (def->numatune.membind.nodeset) {
+        cmd = virCommandNewArgList(NUMACTL,
+                                   "-m",
+                                   def->numatune.membind.nodeset,
+                                   NULL);
+
+        virCommandAddArgList(cmd, emulator, "-S", NULL);
+    } else {
+        cmd = virCommandNewArgList(emulator, "-S", NULL);
+    }
+#else
     cmd = virCommandNewArgList(emulator, "-S", NULL);
+#endif
NACK to this approach. We should be using the libnuma APIs to set the memory binding for the process directly, and not running the numactl command.
Regards, Daniel

On 2011-05-06 04:30, Bill Gray wrote:
Hi Daniel,
How can we get NUMA-aligned memory and CPUs if we apply binding APIs after the process has already started? Might not all the memory already be allocated on the wrong nodes by then?
I guess that's why libnuma only supports setting the NUMA policy when the process starts up.
For expert users, what are the problems with starting qemu with an external numactl command (with --cpunodebind and --membind) to guarantee optimal alignment?
Alternatives?
- Bill
On 05/05/2011 11:21 AM, Daniel P. Berrange wrote:
On Thu, May 05, 2011 at 05:38:29PM +0800, Osier Yang wrote:
Just prepend the numactl command line to the qemu command line.
---
 src/qemu/qemu_command.c |   16 ++++++++++++++++
 1 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
index 2205ed1..156fdfb 100644
--- a/src/qemu/qemu_command.c
+++ b/src/qemu/qemu_command.c
@@ -2762,7 +2762,23 @@ qemuBuildCommandLine(virConnectPtr conn,
         break;
     }

+#ifdef HAVE_NUMACTL
+    /* XXX: Logic here to build numactl commmand line need to be changed if
+     * more NUMA tuning related stuffs in future.
+     */
+    if (def->numatune.membind.nodeset) {
+        cmd = virCommandNewArgList(NUMACTL,
+                                   "-m",
+                                   def->numatune.membind.nodeset,
+                                   NULL);
+
+        virCommandAddArgList(cmd, emulator, "-S", NULL);
+    } else {
+        cmd = virCommandNewArgList(emulator, "-S", NULL);
+    }
+#else
     cmd = virCommandNewArgList(emulator, "-S", NULL);
+#endif
NACK to this approach. We should be using the libnuma APIs to set the memory binding for the process directly, and not running the numactl command.
Regards, Daniel

On Thu, May 05, 2011 at 04:30:30PM -0400, Bill Gray wrote:
Hi Daniel,
How can we get NUMA-aligned memory and CPUs if we apply binding APIs after the process has already started? Might not all the memory already be allocated on the wrong nodes by then?
The policy has to be set after fork'ing the new QEMU process, but before exec'ing QEMU. This is essentially what you're doing with numactl, but with the problem of an extra binary that screws up the SELinux domain transitions from libvirtd_t -> svirt_t.
For expert users, what are the problems with starting qemu with an external numactl command (with --cpunodebind and --membind) to guarantee optimal alignment?
Adding an intermediate process will prevent the necessary SELinux domain transitions from working. We don't want to allow the numactl binary to be able to transition to svirt_t, because that would be inappropriate for most users of numactl.

Daniel

--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
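A schematic of the ordering Daniel describes, reusing the hypothetical set_membind_policy() helper sketched earlier (libvirt's real code drives fork/exec through its own virCommand and hook machinery, so this is only an illustration):

#include <sys/types.h>
#include <unistd.h>

/* Fork, set the NUMA policy in the child, then exec QEMU, so that every
 * allocation QEMU makes is already covered by the policy. */
static pid_t spawn_qemu_bound(char *nodeset, char *const argv[])
{
    pid_t pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0) {                      /* child */
        if (set_membind_policy(nodeset) < 0)
            _exit(127);
        execv(argv[0], argv);            /* QEMU inherits the policy */
        _exit(127);                      /* only reached if exec fails */
    }

    return pid;                          /* parent waits for it elsewhere */
}

Because the policy is applied in the already-forked child rather than by an intermediate numactl process, the normal libvirtd_t -> svirt_t transition on the QEMU exec is preserved.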

On 2011-05-06 17:23, Daniel P. Berrange wrote:
On Thu, May 05, 2011 at 04:30:30PM -0400, Bill Gray wrote:
Hi Daniel,
How can we get NUMA-aligned memory and CPUs if we apply binding APIs after the process has already started? Might not all the memory already be allocated on the wrong nodes by then?
The policy has to be set after fork'ing the new QEMU process, but before exec'ing QEMU. This is essentially what you're doing with numactl, but with the problem of an extra binary that screws up the SELinux domain transitions from libvirtd_t -> svirt_t.
For expert users, what are the problems with starting qemu with an external numactl command (with --cpunodebind and --membind) to guarantee optimal alignment?
Adding an intermediate process will prevent the necessary SELinux domain transitions from working. We don't want to allow the numactl binary to be able to transition to svirt_t, because that would be inappropriate for most users of numactl.
This makes sense. As you said in another mail, perhaps we need to do some work on __virExec; I will make a v2 series. Thanks for the feedback.

Regards,
Osier

On Fri, May 06, 2011 at 09:20:18PM +0800, Osier Yang wrote:
On 2011-05-06 17:23, Daniel P. Berrange wrote:
On Thu, May 05, 2011 at 04:30:30PM -0400, Bill Gray wrote:
Hi Daniel,
How can we get NUMA-aligned memory and CPUs if we apply binding APIs after the process has already started? Might not all the memory already be allocated on the wrong nodes by then?
The policy has to be set after fork'ing the new QEMU process, but before exec'ing QEMU. This is essentially what you're doing with numactl, but with the problem of an extra binary that screws up the SELinux domain transitions from libvirtd_t -> svirt_t.
For expert users, what are the problems with starting qemu with an external numactl command (with --cpunodebind and --membind) to guarantee optimal alignment?
Adding an intermediate process will prevent the necessary SELinux domain transitions from working. We don't want to allow the numactl binary to be able to transition to svirt_t, because that would be inappropriate for most users of numactl.
This makes sense. As you said in another mail, perhaps we need to do some work on __virExec; I will make a v2 series. Thanks for the feedback.
Not virExec, but rather in the QEMU exec hook function.

Daniel

--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

Looks like there is only a single call-back function -- qemudSecurityHook() -- which has already had some cgroup and CPU affinity code added to it.

Perhaps a good approach would be to add an invocation of a new function -- qemudInitMemAffinity() -- as a peer to the already present invocation of qemudInitCpuAffinity(). The qemudInitMemAffinity() function could use set_mempolicy() to bind or prefer local/specific memory, depending on whether the user specifies the explicit memory node list as mandatory or just advisory. Advisory/preferred won't work correctly for large, multi-node guests until multiple nodes can be preferred (presumably selected by the amount of free memory when multiple nodes are preferred). It would also be helpful to have an additional attribute to specify interleaved memory.

How does this approach sound? (A rough sketch of such a function follows the quoted text below.)

On 05/06/2011 09:24 AM, Daniel P. Berrange wrote:
On Fri, May 06, 2011 at 09:20:18PM +0800, Osier Yang wrote:
On 2011-05-06 17:23, Daniel P. Berrange wrote:
On Thu, May 05, 2011 at 04:30:30PM -0400, Bill Gray wrote:
Hi Daniel,
How can we get NUMA-aligned memory and CPUs if we apply binding APIs after the process has already started? Might not all the memory already be allocated on the wrong nodes by then?
The policy has to be set after fork'ing the new QEMU process, but before exec'ing QEMU. This is essentially what you're doing with numactl, but with the problem of an extra binary that screws up the SELinux domain transitions from libvirtd_t -> svirt_t.
For expert users, what are the problems with starting qemu with an external numactl command (with --cpunodebind and --membind) to guarantee optimal alignment?
Adding an intermediate process will prevent the necessary SELinux domain transitions from working. We don't want to allow the numactl binary to be able to transition to svirt_t, because that would be inappropriate for most users of numactl.
This makes sense. As you said in another mail, perhaps we need to do some work on __virExec; I will make a v2 series. Thanks for the feedback.
Not virExec, but rather in the QEMU exec hook function
Daniel
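A sketch of what a qemudInitMemAffinity()-style helper might look like, using the set_mempolicy() call mentioned above; the function name, the fixed node-count limit, and the mode selection are assumptions for illustration only, not actual libvirt code:

#include <stdio.h>
#include <numaif.h>     /* set_mempolicy(), MPOL_*; link with -lnuma */

enum { MAX_NODES = 64 };    /* illustrative upper bound on node ids */

/* Apply a memory policy to the current (pre-exec) QEMU process.
 * 'strict' selects MPOL_BIND semantics, otherwise MPOL_PREFERRED;
 * 'interleave' selects MPOL_INTERLEAVE across the nodeset. */
static int qemudInitMemAffinity(const int *node_ids, int nnodes,
                                int strict, int interleave)
{
    unsigned long nodemask[MAX_NODES / (8 * sizeof(unsigned long))] = { 0 };
    int mode, i;

    for (i = 0; i < nnodes; i++)
        nodemask[node_ids[i] / (8 * sizeof(unsigned long))] |=
            1UL << (node_ids[i] % (8 * sizeof(unsigned long)));

    if (interleave)
        mode = MPOL_INTERLEAVE;
    else if (strict)
        mode = MPOL_BIND;       /* hard restriction, may OOM under pressure */
    else
        mode = MPOL_PREFERRED;  /* advisory; kernel may overflow to other nodes */

    if (set_mempolicy(mode, nodemask, MAX_NODES) < 0) {
        perror("set_mempolicy");
        return -1;
    }
    return 0;
}

MPOL_BIND gives the strict semantics of membind/cpuset.mems, MPOL_PREFERRED the more forgiving behaviour discussed elsewhere in the thread, and MPOL_INTERLEAVE would cover the suggested interleaving attribute; note that MPOL_PREFERRED honours only the first node of the mask, which matches the point about preferring a single node.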

On 2011-05-07 05:04, Bill Gray wrote:
Looks like there is only a single call-back function -- qemudSecurityHook() -- which has had some cgroup and CPU affinity code already added in it.
Perhaps a good approach would be to add an invocation of a new function -- qemudInitMemAffinity() -- as a peer to the already present invocation of qemudInitCpuAffinity(). The qemudInitMemAffinity() function could use set_mempolicy() to bind or prefer local/specific memory (depending on whether the user specifies the explicit memory node list as mandatory or just advisory). Advisory / preferred won't work correctly for large, multi-node guests until multiple nodes can be preferred (presumably selected by amount of free memory resources when multiple nodes are preferred). It would also be helpful to have an additional attribute to specify interleaved memory.
How does this approach sound?
Yes, the process should work like that, except the code you are looking at is not upstream libvirt. :) I will add support for "interleave" in the next patch series.

Regards,
Osier
On 05/06/2011 09:24 AM, Daniel P. Berrange wrote:
On Fri, May 06, 2011 at 09:20:18PM +0800, Osier Yang wrote:
On 2011-05-06 17:23, Daniel P. Berrange wrote:
On Thu, May 05, 2011 at 04:30:30PM -0400, Bill Gray wrote:
Hi Daniel,
How can we get NUMA-aligned memory and CPUs if we apply binding APIs after the process has already started? Might not all the memory already be allocated on the wrong nodes by then?
The policy has to be set after fork'ing the new QEMU process, but before exec'ing QEMU. This is essentially what you're doing with numactl, but with the problem of an extra binary that screws up the SELinux domain transitions from libvirtd_t -> svirt_t.
For expert users, what are the problems with starting qemu with an external numactl command (with --cpunodebind and --membind) to guarantee optimal alignment?
Adding an intermediate process will prevent the necessary SELinux domain transitions from working. We don't want to allow the numactl binary to be able to transition to svirt_t, because that would be inappropriate for most users of numactl.
This makes sense. As you said in another mail, perhaps we need to do some work on __virExec; I will make a v2 series. Thanks for the feedback.
Not virExec, but rather in the QEMU exec hook function
Daniel

On 2011-05-05 23:21, Daniel P. Berrange wrote:
On Thu, May 05, 2011 at 05:38:29PM +0800, Osier Yang wrote:
Just prepend the numactl command line to the qemu command line.
---
 src/qemu/qemu_command.c |   16 ++++++++++++++++
 1 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
index 2205ed1..156fdfb 100644
--- a/src/qemu/qemu_command.c
+++ b/src/qemu/qemu_command.c
@@ -2762,7 +2762,23 @@ qemuBuildCommandLine(virConnectPtr conn,
         break;
     }

+#ifdef HAVE_NUMACTL
+    /* XXX: Logic here to build numactl commmand line need to be changed if
+     * more NUMA tuning related stuffs in future.
+     */
+    if (def->numatune.membind.nodeset) {
+        cmd = virCommandNewArgList(NUMACTL,
+                                   "-m",
+                                   def->numatune.membind.nodeset,
+                                   NULL);
+
+        virCommandAddArgList(cmd, emulator, "-S", NULL);
+    } else {
+        cmd = virCommandNewArgList(emulator, "-S", NULL);
+    }
+#else
     cmd = virCommandNewArgList(emulator, "-S", NULL);
+#endif
NACK to this approach. We should be using the libnuma APIs to set the memory binding for the process directly, and not running the numactl command.
Hi, Dan,

I looked at the libnuma API, and it looks to me like there is no API provided to change the NUMA policy of a process with a specified PID, the way sched_setaffinity() does for CPU affinity. I think it's reasonable not to allow setting the policy for a specified PID; otherwise, if someone changed the NUMA policy of a process frequently, it would be a pain for the kernel.

AFAIK, if one wants to use libnuma to change a process' NUMA policy, one way is to use libnuma in the code of the program for which one wants to set the policy. The other way is to use numactl. As qemu doesn't use libnuma to support NUMA tuning yet, IMHO the only thing we can do currently, if we want to add support for NUMA tuning, is to use numactl.

Thanks,
Osier

On Fri, May 06, 2011 at 10:35:48AM +0800, Osier Yang wrote:
On 2011-05-05 23:21, Daniel P. Berrange wrote:
On Thu, May 05, 2011 at 05:38:29PM +0800, Osier Yang wrote:
Just prepend the numactl command line to the qemu command line.
---
 src/qemu/qemu_command.c |   16 ++++++++++++++++
 1 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
index 2205ed1..156fdfb 100644
--- a/src/qemu/qemu_command.c
+++ b/src/qemu/qemu_command.c
@@ -2762,7 +2762,23 @@ qemuBuildCommandLine(virConnectPtr conn,
         break;
     }

+#ifdef HAVE_NUMACTL
+    /* XXX: Logic here to build numactl commmand line need to be changed if
+     * more NUMA tuning related stuffs in future.
+     */
+    if (def->numatune.membind.nodeset) {
+        cmd = virCommandNewArgList(NUMACTL,
+                                   "-m",
+                                   def->numatune.membind.nodeset,
+                                   NULL);
+
+        virCommandAddArgList(cmd, emulator, "-S", NULL);
+    } else {
+        cmd = virCommandNewArgList(emulator, "-S", NULL);
+    }
+#else
     cmd = virCommandNewArgList(emulator, "-S", NULL);
+#endif
NACK to this approach. We should be using the libnuma APIs to set the memory binding for the process directly, and not running the numactl command.
Hi, Dan,
I looked at the libnuma API, and it looks to me like there is no API provided to change the NUMA policy of a process with a specified PID, the way sched_setaffinity() does for CPU affinity. I think it's reasonable not to allow setting the policy for a specified PID; otherwise, if someone changed the NUMA policy of a process frequently, it would be a pain for the kernel.
AFAIK, if one wants to use libnuma to change a process' NUMA policy, one way is to use libnuma in the code of the program for which one wants to set the policy. The other way is to use numactl.
You don't want to set the policy for another PID. The code should be run in the child process after fork'ing it, but before exec'ing it.

Daniel

--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

---
 .../qemuxml2argv-numa-membind.args                   |    4 +++
 tests/qemuxml2argvdata/qemuxml2argv-numa-membind.xml |   28 ++++++++++++++++++++
 tests/qemuxml2argvtest.c                             |    2 +
 tests/qemuxml2xmltest.c                              |    2 +
 4 files changed, 36 insertions(+), 0 deletions(-)
 create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-numa-membind.args
 create mode 100644 tests/qemuxml2argvdata/qemuxml2argv-numa-membind.xml

diff --git a/tests/qemuxml2argvdata/qemuxml2argv-numa-membind.args b/tests/qemuxml2argvdata/qemuxml2argv-numa-membind.args
new file mode 100644
index 0000000..64f3975
--- /dev/null
+++ b/tests/qemuxml2argvdata/qemuxml2argv-numa-membind.args
@@ -0,0 +1,4 @@
+LC_ALL=C PATH=/bin HOME=/home/test USER=test LOGNAME=test /usr/bin/numactl \
+--membind +0-4,8-12 /usr/bin/qemu -S -M pc -m 214 -smp 2 -nographic -monitor \
+unix:/tmp/test-monitor,server,nowait -no-acpi -boot c -hda \
+/dev/HostVG/QEMUGuest1 -net none -serial none -parallel none -usb
diff --git a/tests/qemuxml2argvdata/qemuxml2argv-numa-membind.xml b/tests/qemuxml2argvdata/qemuxml2argv-numa-membind.xml
new file mode 100644
index 0000000..2df608d
--- /dev/null
+++ b/tests/qemuxml2argvdata/qemuxml2argv-numa-membind.xml
@@ -0,0 +1,28 @@
+<domain type='qemu'>
+  <name>QEMUGuest1</name>
+  <uuid>c7a5fdbd-edaf-9455-926a-d65c16db1809</uuid>
+  <memory>219136</memory>
+  <currentMemory>219136</currentMemory>
+  <vcpu>2</vcpu>
+  <numatune>
+    <membind nodeset='+0-4,8-12'/>
+  </numatune>
+  <os>
+    <type arch='i686' machine='pc'>hvm</type>
+    <boot dev='hd'/>
+  </os>
+  <clock offset='utc'/>
+  <on_poweroff>destroy</on_poweroff>
+  <on_reboot>restart</on_reboot>
+  <on_crash>destroy</on_crash>
+  <devices>
+    <emulator>/usr/bin/qemu</emulator>
+    <disk type='block' device='disk'>
+      <source dev='/dev/HostVG/QEMUGuest1'/>
+      <target dev='hda' bus='ide'/>
+      <address type='drive' controller='0' bus='0' unit='0'/>
+    </disk>
+    <controller type='ide' index='0'/>
+    <memballoon model='virtio'/>
+  </devices>
+</domain>
diff --git a/tests/qemuxml2argvtest.c b/tests/qemuxml2argvtest.c
index a7e4cc0..16926c3 100644
--- a/tests/qemuxml2argvtest.c
+++ b/tests/qemuxml2argvtest.c
@@ -480,6 +480,8 @@ mymain(void)

     DO_TEST("smp", false, QEMU_CAPS_SMP_TOPOLOGY);

+    DO_TEST("numa-membind", false, NONE);
+
     DO_TEST("cpu-topology1", false, QEMU_CAPS_SMP_TOPOLOGY);
     DO_TEST("cpu-topology2", false, QEMU_CAPS_SMP_TOPOLOGY);
     DO_TEST("cpu-topology3", false, NONE);
diff --git a/tests/qemuxml2xmltest.c b/tests/qemuxml2xmltest.c
index 5bfbcab..71640ea 100644
--- a/tests/qemuxml2xmltest.c
+++ b/tests/qemuxml2xmltest.c
@@ -180,6 +180,8 @@ mymain(void)

     DO_TEST("smp");

+    DO_TEST("numa-membind");
+
     /* These tests generate different XML */
     DO_TEST_DIFFERENT("balloon-device-auto");
     DO_TEST_DIFFERENT("channel-virtio-auto");
--
1.7.4

On Thu, 2011-05-05 at 17:38 +0800, Osier Yang wrote:
Hi, All,
This is a simple implementation of NUMA tuning support based on the 'numactl' binary. Currently it only supports binding memory to specified nodes, using the "--membind" option; it may need to support more, but I'd like to send it early to make sure the principle is correct.

Ideally, NUMA tuning support should be added to qemu-kvm first, so that it provides command-line options and all libvirt needs to do is pass those options through. Unfortunately qemu-kvm doesn't support that yet, so all we can do currently is use numactl. It forks a process, which is a bit more expensive than qemu-kvm doing NUMA tuning internally with libnuma, but I guess it shouldn't affect much.
The NUMA tuning XML is like:
<numatune> <membind nodeset='+0-4,8-12'/> </numatune>
Any thoughts/feedback is appreciated.
Osier:

A couple of thoughts/observations:

1) You can accomplish the same thing -- restricting a domain's memory to a specified set of nodes -- using the cpuset cgroup that is already associated with each domain. E.g.:

   cgset -r cpuset.mems=<nodeset> /libvirt/qemu/<domain>

or the equivalent libcgroup call. However, numactl is more flexible, especially if you intend to support more policies: preferred, interleave. Which leads to the question:

2) Do you really want the full "membind" semantics as opposed to "preferred" by default? Membind policy will restrict the VM's pages to the specified nodeset and will initiate reclaim/stealing and wait for pages to become available, or the task is OOM-killed because of mempolicy when all of the nodes in the nodeset reach their minimum watermark. Membind works the same as cpuset.mems in this respect. Preferred policy will keep memory allocations [but not vcpu execution] local to the specified set of nodes as long as there is sufficient memory, and will silently "overflow" allocations to other nodes when necessary. I.e., it's a little more forgiving under memory pressure.

But then, pinning a VM's vcpus to the physical cpus of a set of nodes and retaining the default local allocation policy will have the same effect as "preferred", while ensuring that the VM component tasks execute locally to the memory footprint. Currently, I do this by looking up the cpulist associated with the node[s] from e.g. /sys/devices/system/node/node<i>/cpulist and using that list with the vcpu.cpuset attribute. Adding a 'nodeset' attribute to the cputune.vcpupin element would simplify specifying that configuration.

Regards, Lee
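To make the cgroup alternative concrete, here is a minimal sketch of setting cpuset.mems from C; the cgroup mount point and group path below are assumptions that vary per host, and libvirt's real code goes through its own cgroup helpers (or libcgroup) rather than raw file I/O:

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

/* Restrict a domain's memory to 'nodeset' (e.g. "0-3") by writing to the
 * cpuset controller; the path layout assumes cgroup v1 mounted at /cgroup. */
static int set_cpuset_mems(const char *domain, const char *nodeset)
{
    char path[256];
    int fd, ret = -1;

    snprintf(path, sizeof(path),
             "/cgroup/cpuset/libvirt/qemu/%s/cpuset.mems", domain);

    fd = open(path, O_WRONLY);
    if (fd < 0) {
        perror("open cpuset.mems");
        return -1;
    }

    if (write(fd, nodeset, strlen(nodeset)) == (ssize_t)strlen(nodeset))
        ret = 0;
    else
        perror("write cpuset.mems");

    close(fd);
    return ret;
}

Unlike a mempolicy installed before exec, cpuset.mems can also be changed after the guest has started, but as noted above it carries the same hard-binding (and potential OOM) behaviour as membind.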

Thanks for the feedback, Lee!

One reason to use "membind" instead of "preferred" is that one can prefer only a single node. For large guests, you can specify multiple nodes with "membind". I think "preferred" would be preferred if it allowed multiple nodes.

- Bill

On 05/05/2011 10:33 AM, Lee Schermerhorn wrote:
On Thu, 2011-05-05 at 17:38 +0800, Osier Yang wrote:
Hi, All,
This is a simple implementation of NUMA tuning support based on the 'numactl' binary. Currently it only supports binding memory to specified nodes, using the "--membind" option; it may need to support more, but I'd like to send it early to make sure the principle is correct.

Ideally, NUMA tuning support should be added to qemu-kvm first, so that it provides command-line options and all libvirt needs to do is pass those options through. Unfortunately qemu-kvm doesn't support that yet, so all we can do currently is use numactl. It forks a process, which is a bit more expensive than qemu-kvm doing NUMA tuning internally with libnuma, but I guess it shouldn't affect much.
The NUMA tuning XML is like:
<numatune> <membind nodeset='+0-4,8-12'/> </numatune>
Any thoughts/feedback is appreciated.
Osier:
A couple of thoughts/observations:
1) you can accomplish the same thing -- restricting a domain's memory to a specified set of nodes -- using the cpuset cgroup that is already associated with each domain. E.g.,
cgset -r cpuset.mems=<nodeset> /libvirt/qemu/<domain>
Or the equivalent libcgroup call.
However, numactl is more flexible; especially if you intend to support more policies: preferred, interleave. Which leads to the question:
2) Do you really want the full "membind" semantics as opposed to "preferred" by default? Membind policy will restrict the VMs pages to the specified nodeset and will initiate reclaim/stealing and wait for pages to become available or the task is OOM-killed because of mempolicy when all of the nodes in nodeset reach their minimum watermark. Membind works the same as cpuset.mems in this respect. Preferred policy will keep memory allocations [but not vcpu execution] local to the specified set of nodes as long as there is sufficient memory, and will silently "overflow" allocations to other nodes when necessary. I.e., it's a little more forgiving under memory pressure.
But then pinning a VM's vcpus to the physical cpus of a set of nodes and retaining the default local allocation policy will have the same effect as "preferred" while ensuring that the VM component tasks execute locally to the memory footprint. Currently, I do this by looking up the cpulist associated with the node[s] from e.g., /sys/devices/system/node/node<i>/cpulist and using that list with the vcpu.cpuset attribute. Adding a 'nodeset' attribute to the cputune.vcpupin element would simplify specifying that configuration.
Regards, Lee

On 2011-05-06 04:43, Bill Gray wrote:
Thanks for the feedback Lee!
One reason to use "membind" instead of "preferred" is that one can prefer only a single node. For large guests, you can specify multiple nodes with "membind". I think "preferred" would be preferred if it allowed multiple nodes.
- Bill
Hi, Bill,

Will "preferred" still be useful even if it only supports a single node?

Regards,
Osier
On 05/05/2011 10:33 AM, Lee Schermerhorn wrote:
On Thu, 2011-05-05 at 17:38 +0800, Osier Yang wrote:
Hi, All,
This is a simple implementation of NUMA tuning support based on the 'numactl' binary. Currently it only supports binding memory to specified nodes, using the "--membind" option; it may need to support more, but I'd like to send it early to make sure the principle is correct.

Ideally, NUMA tuning support should be added to qemu-kvm first, so that it provides command-line options and all libvirt needs to do is pass those options through. Unfortunately qemu-kvm doesn't support that yet, so all we can do currently is use numactl. It forks a process, which is a bit more expensive than qemu-kvm doing NUMA tuning internally with libnuma, but I guess it shouldn't affect much.
The NUMA tuning XML is like:
<numatune> <membind nodeset='+0-4,8-12'/> </numatune>
Any thoughts/feedback is appreciated.
Osier:
A couple of thoughts/observations:
1) you can accomplish the same thing -- restricting a domain's memory to a specified set of nodes -- using the cpuset cgroup that is already associated with each domain. E.g.,
cgset -r cpuset.mems=<nodeset> /libvirt/qemu/<domain>
Or the equivalent libcgroup call.
However, numactl is more flexible; especially if you intend to support more policies: preferred, interleave. Which leads to the question:
2) Do you really want the full "membind" semantics as opposed to "preferred" by default? Membind policy will restrict the VMs pages to the specified nodeset and will initiate reclaim/stealing and wait for pages to become available or the task is OOM-killed because of mempolicy when all of the nodes in nodeset reach their minimum watermark. Membind works the same as cpuset.mems in this respect. Preferred policy will keep memory allocations [but not vcpu execution] local to the specified set of nodes as long as there is sufficient memory, and will silently "overflow" allocations to other nodes when necessary. I.e., it's a little more forgiving under memory pressure.
But then pinning a VM's vcpus to the physical cpus of a set of nodes and retaining the default local allocation policy will have the same effect as "preferred" while ensuring that the VM component tasks execute locally to the memory footprint. Currently, I do this by looking up the cpulist associated with the node[s] from e.g., /sys/devices/system/node/node<i>/cpulist and using that list with the vcpu.cpuset attribute. Adding a 'nodeset' attribute to the cputune.vcpupin element would simplify specifying that configuration.
Regards, Lee

On 2011-05-05 22:33, Lee Schermerhorn wrote:
On Thu, 2011-05-05 at 17:38 +0800, Osier Yang wrote:
Hi, All,
This is a simple implementation of NUMA tuning support based on the 'numactl' binary. Currently it only supports binding memory to specified nodes, using the "--membind" option; it may need to support more, but I'd like to send it early to make sure the principle is correct.

Ideally, NUMA tuning support should be added to qemu-kvm first, so that it provides command-line options and all libvirt needs to do is pass those options through. Unfortunately qemu-kvm doesn't support that yet, so all we can do currently is use numactl. It forks a process, which is a bit more expensive than qemu-kvm doing NUMA tuning internally with libnuma, but I guess it shouldn't affect much.
The NUMA tuning XML is like:
<numatune> <membind nodeset='+0-4,8-12'/> </numatune>
Any thoughts/feedback is appreciated.
Osier:
A couple of thoughts/observations:
1) you can accomplish the same thing -- restricting a domain's memory to a specified set of nodes -- using the cpuset cgroup that is already associated with each domain. E.g.,
cgset -r cpuset.mems=<nodeset> /libvirt/qemu/<domain>
Or the equivalent libcgroup call.
However, numactl is more flexible; especially if you intend to support more policies: preferred, interleave. Which leads to the question:
2) Do you really want the full "membind" semantics as opposed to "preferred" by default? Membind policy will restrict the VMs pages to the specified nodeset and will initiate reclaim/stealing and wait for pages to become available or the task is OOM-killed because of mempolicy when all of the nodes in nodeset reach their minimum watermark. Membind works the same as cpuset.mems in this respect. Preferred policy will keep memory allocations [but not vcpu execution] local to the specified set of nodes as long as there is sufficient memory, and will silently "overflow" allocations to other nodes when necessary. I.e., it's a little more forgiving under memory pressure.
Thanks for the thoughts, Lee. Yes, we might support "preferred" too, once it's needed.
But then pinning a VM's vcpus to the physical cpus of a set of nodes and retaining the default local allocation policy will have the same effect as "preferred" while ensuring that the VM component tasks execute locally to the memory footprint. Currently, I do this by looking up the cpulist associated with the node[s] from e.g., /sys/devices/system/node/node<i>/cpulist and using that list with the vcpu.cpuset attribute. Adding a 'nodeset' attribute to the cputune.vcpupin element would simplify specifying that configuration.
Yes, binding to a specified nodeset can be achieved with the current <vcpu cpuset="">, but it's not clear enough; e.g. here you need to look up */node/cpulist manually. However, I'm not sure it's good to add another "nodeset" attribute, as semantically "nodeset" is already implied by "cpuset": a CPU ID is unique regardless of which node it belongs to.

Regards,
Osier

On Thu, May 05, 2011 at 10:33:46AM -0400, Lee Schermerhorn wrote:
On Thu, 2011-05-05 at 17:38 +0800, Osier Yang wrote:
Hi, All,
This is a simple implementation of NUMA tuning support based on the 'numactl' binary. Currently it only supports binding memory to specified nodes, using the "--membind" option; it may need to support more, but I'd like to send it early to make sure the principle is correct.

Ideally, NUMA tuning support should be added to qemu-kvm first, so that it provides command-line options and all libvirt needs to do is pass those options through. Unfortunately qemu-kvm doesn't support that yet, so all we can do currently is use numactl. It forks a process, which is a bit more expensive than qemu-kvm doing NUMA tuning internally with libnuma, but I guess it shouldn't affect much.
The NUMA tuning XML is like:
<numatune> <membind nodeset='+0-4,8-12'/> </numatune>
Any thoughts/feedback is appreciated.
Osier:
A couple of thoughts/observations:
1) you can accomplish the same thing -- restricting a domain's memory to a specified set of nodes -- using the cpuset cgroup that is already associated with each domain. E.g.,
cgset -r cpuset.mems=<nodeset> /libvirt/qemu/<domain>
Or the equivalent libcgroup call.
However, numactl is more flexible; especially if you intend to support more policies: preferred, interleave. Which leads to the question:
2) Do you really want the full "membind" semantics as opposed to "preferred" by default? Membind policy will restrict the VMs pages to the specified nodeset and will initiate reclaim/stealing and wait for pages to become available or the task is OOM-killed because of mempolicy when all of the nodes in nodeset reach their minimum watermark. Membind works the same as cpuset.mems in this respect. Preferred policy will keep memory allocations [but not vcpu execution] local to the specified set of nodes as long as there is sufficient memory, and will silently "overflow" allocations to other nodes when necessary. I.e., it's a little more forgiving under memory pressure.
I think we need to make the choice of strict binding vs preferred binding an XML tunable, since both options are valid.

Daniel

--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
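(For illustration only, such a tunable could be carried as an attribute next to the nodeset, e.g. <membind nodeset='0-4,8-12' mode='strict'/> versus mode='preferred'; these attribute names are hypothetical and not part of the posted patches.)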
Participants (4): Bill Gray, Daniel P. Berrange, Lee Schermerhorn, Osier Yang