[libvirt] [PATCH 1/1] qemu: host NUMA hugepage policy without guest NUMA

At the moment, guests that are backed by hugepages in the host are only able to use policy to control the placement of those hugepages on a per-(guest-)CPU basis. Policy applied globally is ignored.

Such guests would use <memoryBacking><hugepages/></memoryBacking> and a <numatune> block with <memory mode=... nodeset=.../> but no <memnode .../> elements.

This patch corrects this by, in this specific case, changing the QEMU command line from "-mem-prealloc -mem-path=..." (which cannot specify NUMA policy) to "-object memory-backend-file ..." (which can).

Note: This is not visible to the guest and does not appear to create a migration incompatibility.

Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
---
There was some discussion leading up to this patch, here:

https://www.redhat.com/archives/libvir-list/2016-October/msg00033.html

During that discussion it seemed that fixing this issue would cause migration incompatibility but after some careful testing, it appears that it only does when a memory-backend object is attached to a guest NUMA node and that is not the case here. If only one is created, and used globally (not attached via mem=<id>), the migration data does not seem to be changed and so it seems reasonable to try something like this patch.

This patch does work for my test cases but I don't claim a deep understanding of the libvirt code so this is at least partly a RFC. Comments welcome :-)

Cheers,
Sam.

 src/qemu/qemu_command.c | 55 ++++++++++++++++++++++++++++++++++++++++---------
 src/qemu/qemu_command.h |  2 +-
 2 files changed, 46 insertions(+), 11 deletions(-)

diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
index 0804133..c28c8f2 100644
--- a/src/qemu/qemu_command.c
+++ b/src/qemu/qemu_command.c
@@ -3143,7 +3143,7 @@ qemuBuildMemoryBackendStr(unsigned long long size,
                           int guestNode,
                           virBitmapPtr userNodeset,
                           virBitmapPtr autoNodeset,
-                          virDomainDefPtr def,
+                          const virDomainDefPtr def,
                           virQEMUCapsPtr qemuCaps,
                           virQEMUDriverConfigPtr cfg,
                           const char **backendType,
@@ -7129,12 +7129,18 @@ qemuBuildSmpCommandLine(virCommandPtr cmd,
 
 static int
 qemuBuildMemPathStr(virQEMUDriverConfigPtr cfg,
-                    const virDomainDef *def,
+                    const virDomainDefPtr def,
                     virQEMUCapsPtr qemuCaps,
-                    virCommandPtr cmd)
+                    virCommandPtr cmd,
+                    virBitmapPtr auto_nodeset)
 {
     const long system_page_size = virGetSystemPageSizeKB();
     char *mem_path = NULL;
+    virBitmapPtr nodemask = NULL;
+    const char *backendType = NULL;
+    char *backendStr = NULL;
+    virJSONValuePtr props;
+    int rv = -1;
 
     /*
      * No-op if hugepages were not requested.
@@ -7159,18 +7165,47 @@ qemuBuildMemPathStr(virQEMUDriverConfigPtr cfg,
     if (qemuGetHupageMemPath(cfg, def->mem.hugepages[0].size, &mem_path) < 0)
         return -1;
 
-    virCommandAddArgList(cmd, "-mem-prealloc", "-mem-path", mem_path, NULL);
+    if (virDomainNumatuneMaybeGetNodeset(def->numa, auto_nodeset,
+                                         &nodemask, -1) < 0)
+        return -1;
+    if (nodemask && virQEMUCapsGet(qemuCaps, QEMU_CAPS_OBJECT_MEMORY_FILE)) {
+        props = virJSONValueNewObject();
+        if (qemuBuildMemoryBackendStr(virDomainDefGetMemoryInitial(def),
+                                      0, -1, NULL, auto_nodeset,
+                                      def, qemuCaps, cfg, &backendType,
+                                      &props, false) < 0)
+            goto cleanup;
+        if (!(backendStr = virQEMUBuildObjectCommandlineFromJSON(backendType,
+                                                                 "mem",
+                                                                 props)))
+            goto cleanup;
+        virCommandAddArgList(cmd, "-object", backendStr, NULL);
+        rv = 0;
+cleanup:
+        VIR_FREE(backendStr);
+        VIR_FREE(props);
+    }
+    else {
+        if (nodemask || 1)
+            virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s",
+                           _("Memory file backend objects are "
+                             "unsupported by QEMU binary. Global NUMA "
+                             "hugepage policy will be ignored."));
+        virCommandAddArgList(cmd, "-mem-prealloc", "-mem-path", mem_path, NULL);
+        rv = 0;
+    }
 
     VIR_FREE(mem_path);
-    return 0;
+    return rv;
 }
 
 
 static int
 qemuBuildMemCommandLine(virCommandPtr cmd,
                         virQEMUDriverConfigPtr cfg,
-                        const virDomainDef *def,
-                        virQEMUCapsPtr qemuCaps)
+                        const virDomainDefPtr def,
+                        virQEMUCapsPtr qemuCaps,
+                        virBitmapPtr auto_nodeset)
 {
     if (qemuDomainDefValidateMemoryHotplug(def, qemuCaps, NULL) < 0)
         return -1;
@@ -7194,7 +7229,7 @@ qemuBuildMemCommandLine(virCommandPtr cmd,
      * there is no numa node specified.
      */
     if (!virDomainNumaGetNodeCount(def->numa) &&
-        qemuBuildMemPathStr(cfg, def, qemuCaps, cmd) < 0)
+        qemuBuildMemPathStr(cfg, def, qemuCaps, cmd, auto_nodeset) < 0)
         return -1;
 
     if (def->mem.locked && !virQEMUCapsGet(qemuCaps, QEMU_CAPS_REALTIME_MLOCK)) {
@@ -7331,7 +7366,7 @@ qemuBuildNumaArgStr(virQEMUDriverConfigPtr cfg,
     }
 
     if (!needBackend &&
-        qemuBuildMemPathStr(cfg, def, qemuCaps, cmd) < 0)
+        qemuBuildMemPathStr(cfg, def, qemuCaps, cmd, auto_nodeset) < 0)
         goto cleanup;
 
     for (i = 0; i < ncells; i++) {
@@ -9381,7 +9416,7 @@ qemuBuildCommandLine(virQEMUDriverPtr driver,
     if (!migrateURI && !snapshot && qemuDomainAlignMemorySizes(def) < 0)
         goto error;
 
-    if (qemuBuildMemCommandLine(cmd, cfg, def, qemuCaps) < 0)
+    if (qemuBuildMemCommandLine(cmd, cfg, def, qemuCaps, nodeset) < 0)
         goto error;
 
     if (qemuBuildSmpCommandLine(cmd, def) < 0)
diff --git a/src/qemu/qemu_command.h b/src/qemu/qemu_command.h
index 2f2a6ff..1e97d76 100644
--- a/src/qemu/qemu_command.h
+++ b/src/qemu/qemu_command.h
@@ -126,7 +126,7 @@ int qemuBuildMemoryBackendStr(unsigned long long size,
                               int guestNode,
                               virBitmapPtr userNodeset,
                               virBitmapPtr autoNodeset,
-                              virDomainDefPtr def,
+                              const virDomainDefPtr def,
                               virQEMUCapsPtr qemuCaps,
                               virQEMUDriverConfigPtr cfg,
                               const char **backendType,
-- 
2.10.0.297.gf6727b0
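For reference, a rough sketch of the two command-line shapes being compared in this thread (the object id, size, path and host node below are illustrative, not taken from the patch):

    # current behaviour: hugepage backing, no way to attach NUMA policy
    qemu-system-ppc64 ... -mem-prealloc -mem-path /dev/hugepages/libvirt/qemu ...

    # proposed behaviour when a global <numatune> nodeset is set and the QEMU
    # binary supports memory-backend-file: the policy rides on the backend object
    qemu-system-ppc64 ... -object memory-backend-file,id=mem,size=2048M,mem-path=/dev/hugepages/libvirt/qemu,prealloc=yes,host-nodes=0,policy=bind ...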

On Wed, Oct 12, 2016 at 15:04:53 +1100, Sam Bobroff wrote:
At the moment, guests that are backed by hugepages in the host are only able to use policy to control the placement of those hugepages on a per-(guest-)CPU basis. Policy applied globally is ignored.
Such guests would use <memoryBacking><hugepages/></memoryBacking> and a <numatune> block with <memory mode=... nodeset=.../> but no <memnode .../> elements.
This patch corrects this by, in this specific case, changing the QEMU command line from "-mem-prealloc -mem-path=..." (which cannot specify NUMA policy) to "-object memory-backend-file ..." (which can).
Note: This is not visible to the guest and does not appear to create a migration incompatibility.
Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com> --- There was some discussion leading up to this patch, here:
https://www.redhat.com/archives/libvir-list/2016-October/msg00033.html
During that discussion it seemed that fixing this issue would cause migration incompatibility but after some careful testing, it appears that it only does when a memory-backend object is attached to a guest NUMA node and that is not the case here. If only one is created, and used globally (not attached via mem=<id>), the migration data does not seem to be changed and so it seems reasonable to try something like this patch.
This patch does work for my test cases but I don't claim a deep understanding of the libvirt code so this is at least partly a RFC. Comments welcome :-)
Cheers, Sam.
src/qemu/qemu_command.c | 55 ++++++++++++++++++++++++++++++++++++++++--------- src/qemu/qemu_command.h | 2 +- 2 files changed, 46 insertions(+), 11 deletions(-)
diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index 0804133..c28c8f2 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -3143,7 +3143,7 @@ qemuBuildMemoryBackendStr(unsigned long long size, int guestNode, virBitmapPtr userNodeset, virBitmapPtr autoNodeset, - virDomainDefPtr def, + const virDomainDefPtr def,
This does not work as expected. Using 'const' with types that hide the pointer makes the pointer const and not the structure that the pointer points to.
virQEMUCapsPtr qemuCaps, virQEMUDriverConfigPtr cfg, const char **backendType, @@ -7129,12 +7129,18 @@ qemuBuildSmpCommandLine(virCommandPtr cmd,
static int qemuBuildMemPathStr(virQEMUDriverConfigPtr cfg, - const virDomainDef *def, + const virDomainDefPtr def,
Same as above. This change is wrong.
virQEMUCapsPtr qemuCaps, - virCommandPtr cmd) + virCommandPtr cmd, + virBitmapPtr auto_nodeset) { const long system_page_size = virGetSystemPageSizeKB(); char *mem_path = NULL; + virBitmapPtr nodemask = NULL; + const char *backendType = NULL; + char *backendStr = NULL; + virJSONValuePtr props; + int rv = -1;
/* * No-op if hugepages were not requested. @@ -7159,18 +7165,47 @@ qemuBuildMemPathStr(virQEMUDriverConfigPtr cfg, if (qemuGetHupageMemPath(cfg, def->mem.hugepages[0].size, &mem_path) < 0) return -1;
- virCommandAddArgList(cmd, "-mem-prealloc", "-mem-path", mem_path, NULL); + if (virDomainNumatuneMaybeGetNodeset(def->numa, auto_nodeset, + &nodemask, -1) < 0) + return -1; + if (nodemask && virQEMUCapsGet(qemuCaps, QEMU_CAPS_OBJECT_MEMORY_FILE)) { + props = virJSONValueNewObject();
This is leaked, qemuBuildMemoryBackendStr allocates one itself.
+ if (qemuBuildMemoryBackendStr(virDomainDefGetMemoryInitial(def), + 0, -1, NULL, auto_nodeset, + def, qemuCaps, cfg, &backendType, + &props, false) < 0) + goto cleanup; + if (!(backendStr = virQEMUBuildObjectCommandlineFromJSON(backendType, + "mem", + props))) + goto cleanup;
So you create a memory object that does not get used ...
+ virCommandAddArgList(cmd, "-object", backendStr, NULL);
So does qemu magically use memory objects that are not referenced anywhere as its memory? That seems odd. Could you please elaborate on this part and how it's supposed to work? Without a proper elaboration of how this is supposed to work with qemu, this patch is not acceptable.
+ rv = 0; +cleanup:
This won't pass syntax check. Labels have to be spaced out.
+ VIR_FREE(backendStr); + VIR_FREE(props);
This is a JSON value object, you need to free it with the proper function.
+ } + else {
This breaks coding style.
+ if (nodemask || 1)
Erm, a tautology?
+ virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("Memory file backend objects are " + "unsupported by QEMU binary. Global NUMA " + "hugepage policy will be ignored."));
Reporting an error but not terminating the startup is not acceptable. Errors should not be reported unless something fails.
+ virCommandAddArgList(cmd, "-mem-prealloc", "-mem-path", mem_path, NULL); + rv = 0; + } VIR_FREE(mem_path);
- return 0; + return rv; }
Missing a change to the test files. Please make sure that you run 'make check' and 'make syntax-check' and send only patches that pass those. Peter

On Wed, Oct 12, 2016 at 09:48:07AM +0200, Peter Krempa wrote:
On Wed, Oct 12, 2016 at 15:04:53 +1100, Sam Bobroff wrote:
At the moment, guests that are backed by hugepages in the host are only able to use policy to control the placement of those hugepages on a per-(guest-)CPU basis. Policy applied globally is ignored.
Such guests would use <memoryBacking><hugepages/></memoryBacking> and a <numatune> block with <memory mode=... nodeset=.../> but no <memnode .../> elements.
This patch corrects this by, in this specific case, changing the QEMU command line from "-mem-prealloc -mem-path=..." (which cannot specify NUMA policy) to "-object memory-backend-file ..." (which can).
Note: This is not visible to the guest and does not appear to create a migration incompatibility.
Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com> --- There was some discussion leading up to this patch, here:
https://www.redhat.com/archives/libvir-list/2016-October/msg00033.html
During that discussion it seemed that fixing this issue would cause migration incompatibility but after some careful testing, it appears that it only does when a memory-backend object is attached to a guest NUMA node and that is not the case here. If only one is created, and used globally (not attached via mem=<id>), the migration data does not seem to be changed and so it seems reasonable to try something like this patch.
This patch does work for my test cases but I don't claim a deep understanding of the libvirt code so this is at least partly a RFC. Comments welcome :-)
Cheers, Sam.
src/qemu/qemu_command.c | 55 ++++++++++++++++++++++++++++++++++++++++--------- src/qemu/qemu_command.h | 2 +- 2 files changed, 46 insertions(+), 11 deletions(-)
diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index 0804133..c28c8f2 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -3143,7 +3143,7 @@ qemuBuildMemoryBackendStr(unsigned long long size, int guestNode, virBitmapPtr userNodeset, virBitmapPtr autoNodeset, - virDomainDefPtr def, + const virDomainDefPtr def,
This does not work as expected. Using 'const' with types that hide the pointer makes the pointer const and not the structure that the pointer points to.
Ah, of course you're right.
virQEMUCapsPtr qemuCaps, virQEMUDriverConfigPtr cfg, const char **backendType, @@ -7129,12 +7129,18 @@ qemuBuildSmpCommandLine(virCommandPtr cmd,
static int qemuBuildMemPathStr(virQEMUDriverConfigPtr cfg, - const virDomainDef *def, + const virDomainDefPtr def,
Same as above. This change is wrong.
OK.
virQEMUCapsPtr qemuCaps, - virCommandPtr cmd) + virCommandPtr cmd, + virBitmapPtr auto_nodeset) { const long system_page_size = virGetSystemPageSizeKB(); char *mem_path = NULL; + virBitmapPtr nodemask = NULL; + const char *backendType = NULL; + char *backendStr = NULL; + virJSONValuePtr props; + int rv = -1;
/* * No-op if hugepages were not requested. @@ -7159,18 +7165,47 @@ qemuBuildMemPathStr(virQEMUDriverConfigPtr cfg, if (qemuGetHupageMemPath(cfg, def->mem.hugepages[0].size, &mem_path) < 0) return -1;
- virCommandAddArgList(cmd, "-mem-prealloc", "-mem-path", mem_path, NULL); + if (virDomainNumatuneMaybeGetNodeset(def->numa, auto_nodeset, + &nodemask, -1) < 0) + return -1; + if (nodemask && virQEMUCapsGet(qemuCaps, QEMU_CAPS_OBJECT_MEMORY_FILE)) { + props = virJSONValueNewObject();
This is leaked, qemuBuildMemoryBackendStr allocates one itself.
Ah of course, sorry. I think I can just remove that line.
+ if (qemuBuildMemoryBackendStr(virDomainDefGetMemoryInitial(def), + 0, -1, NULL, auto_nodeset, + def, qemuCaps, cfg, &backendType, + &props, false) < 0) + goto cleanup; + if (!(backendStr = virQEMUBuildObjectCommandlineFromJSON(backendType, + "mem", + props))) + goto cleanup;
So you create a memory object that does not get used ...
Yes, that is correct.
+ virCommandAddArgList(cmd, "-object", backendStr, NULL);
So does qemu magically use memory objects that are not referenced anywhere as its memory? That seems odd. Could you please elaborate on this part and how it's supposed to work? Without a proper elaboration of how this is supposed to work with qemu, this patch is not acceptable.
Yes, it did seem odd to me, but I didn't know if it was odd to anyone else! ;-) I'll elaborate properly in the commit message if I end up doing a version 2 (and if the fix ends up using that method).

But briefly, QEMU seems to treat a single memory backend object that is not linked to any NUMA node just as it would "-mem-path" and "-mem-prealloc", except that NUMA policy can also be set. It seems to "know" about this usage because it is careful to avoid changing the "memory region" name to the ID in this case, and it's that quirk that allows migrations to be unaffected.
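To make that distinction concrete, a rough sketch of the two cases (ids, sizes and paths are made up; the behaviour described is as reported in this thread, not independently verified):

    # backend attached to a guest NUMA node via memdev= -- this renames the
    # memory region after the object ID and so changes the migration data
    -object memory-backend-file,id=mem0,size=2048M,mem-path=/dev/hugepages/libvirt/qemu,host-nodes=0,policy=bind \
    -numa node,nodeid=0,cpus=0-1,memdev=mem0

    # single global backend, not referenced by any -numa node -- treated like
    # -mem-path/-mem-prealloc plus policy, with the migration data unchanged
    -object memory-backend-file,id=mem,size=2048M,mem-path=/dev/hugepages/libvirt/qemu,prealloc=yes,host-nodes=0,policy=bind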
+ rv = 0; +cleanup:
This won't pass syntax check. Labels have to be spaced out.
Thanks, I'll use syntax check as noted below!
+ VIR_FREE(backendStr); + VIR_FREE(props);
This is a JSON value object, you need to free it with the proper function.
Ah, virJSONValueFree(). Thanks.
+ } + else {
This breaks coding style.
OK.
+ if (nodemask || 1)
Erm, a tautology?
Yes, an embarrassing bit of debug from the last round of testing :-P (Testing the warning.)
+ virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("Memory file backend objects are " + "unsupported by QEMU binary. Global NUMA " + "hugepage policy will be ignored."));
Reporting an error but not terminating the startup is not acceptable. Errors should not be reported unless something fails.
Right, if I do another version I'll change this to a warning.
+ virCommandAddArgList(cmd, "-mem-prealloc", "-mem-path", mem_path, NULL); + rv = 0; + } VIR_FREE(mem_path);
- return 0; + return rv; }
Missing a change to the test files. Please make sure that you run 'make check' and 'make syntax-check' and send only patches that pass those.
Ah, thanks. I will always use them in future.
Peter
Thanks for your review! Cheers, Sam.

On Wed, Oct 12, 2016 at 03:04:53PM +1100, Sam Bobroff wrote:
At the moment, guests that are backed by hugepages in the host are only able to use policy to control the placement of those hugepages on a per-(guest-)CPU basis. Policy applied globally is ignored.
Such guests would use <memoryBacking><hugepages/></memoryBacking> and a <numatune> block with <memory mode=... nodeset=.../> but no <memnode .../> elements.
This patch corrects this by, in this specific case, changing the QEMU command line from "-mem-prealloc -mem-path=..." (which cannot specify NUMA policy) to "-object memory-backend-file ..." (which can).
Note: This is not visible to the guest and does not appear to create a migration incompatibility.
It could make sense, I haven't tried yet, though. However, I still don't see the point in using memory-backend-file. Is it that when you don't have cpuset cgroup the allocation doesn't work well? Because it certainly does work for me.
Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com> --- There was some discussion leading up to this patch, here:
https://www.redhat.com/archives/libvir-list/2016-October/msg00033.html
During that discussion it seemed that fixing this issue would cause migration incompatibility but after some careful testing, it appears that it only does when a memory-backend object is attached to a guest NUMA node and that is not the case here. If only one is created, and used globally (not attached via mem=<id>), the migration data does not seem to be changed and so it seems reasonable to try something like this patch.
This patch does work for my test cases but I don't claim a deep understanding of the libvirt code so this is at least partly a RFC. Comments welcome :-)
Cheers, Sam.
src/qemu/qemu_command.c | 55 ++++++++++++++++++++++++++++++++++++++++--------- src/qemu/qemu_command.h | 2 +- 2 files changed, 46 insertions(+), 11 deletions(-)
diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index 0804133..c28c8f2 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -3143,7 +3143,7 @@ qemuBuildMemoryBackendStr(unsigned long long size, int guestNode, virBitmapPtr userNodeset, virBitmapPtr autoNodeset, - virDomainDefPtr def, + const virDomainDefPtr def, virQEMUCapsPtr qemuCaps, virQEMUDriverConfigPtr cfg, const char **backendType, @@ -7129,12 +7129,18 @@ qemuBuildSmpCommandLine(virCommandPtr cmd,
static int qemuBuildMemPathStr(virQEMUDriverConfigPtr cfg, - const virDomainDef *def, + const virDomainDefPtr def,
These two hunks have nothing to do with the change. If you want to do them for some reason, explain the reasoning and put it into separate patch, please.
virQEMUCapsPtr qemuCaps, - virCommandPtr cmd) + virCommandPtr cmd, + virBitmapPtr auto_nodeset) { const long system_page_size = virGetSystemPageSizeKB(); char *mem_path = NULL; + virBitmapPtr nodemask = NULL; + const char *backendType = NULL; + char *backendStr = NULL; + virJSONValuePtr props; + int rv = -1;
/* * No-op if hugepages were not requested. @@ -7159,18 +7165,47 @@ qemuBuildMemPathStr(virQEMUDriverConfigPtr cfg, if (qemuGetHupageMemPath(cfg, def->mem.hugepages[0].size, &mem_path) < 0) return -1;
- virCommandAddArgList(cmd, "-mem-prealloc", "-mem-path", mem_path, NULL); + if (virDomainNumatuneMaybeGetNodeset(def->numa, auto_nodeset, + &nodemask, -1) < 0) + return -1; + if (nodemask && virQEMUCapsGet(qemuCaps, QEMU_CAPS_OBJECT_MEMORY_FILE)) { + props = virJSONValueNewObject(); + if (qemuBuildMemoryBackendStr(virDomainDefGetMemoryInitial(def), + 0, -1, NULL, auto_nodeset, + def, qemuCaps, cfg, &backendType, + &props, false) < 0) + goto cleanup; + if (!(backendStr = virQEMUBuildObjectCommandlineFromJSON(backendType, + "mem", + props))) + goto cleanup; + virCommandAddArgList(cmd, "-object", backendStr, NULL);
So now a function that's called BuildMemPathStr does more than that. It is also called when the memory is added in a different way in which case this function would just add even more memory which is clearly wrong.
+ rv = 0; +cleanup: + VIR_FREE(backendStr); + VIR_FREE(props); + } + else { + if (nodemask || 1)
" || 1" ? Leftover from debugging?
+ virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("Memory file backend objects are " + "unsupported by QEMU binary. Global NUMA " + "hugepage policy will be ignored."));
Reporting error without erroring out is wrong. VIR_WARN is probably what you wanted to do here. As I said, I don't feel the need for the change. If there is a need for it, it should clearly be done differently, so NACK to this particular patch. Martin

On Wed, Oct 12, 2016 at 10:27:50AM +0200, Martin Kletzander wrote:
On Wed, Oct 12, 2016 at 03:04:53PM +1100, Sam Bobroff wrote:
At the moment, guests that are backed by hugepages in the host are only able to use policy to control the placement of those hugepages on a per-(guest-)CPU basis. Policy applied globally is ignored.
Such guests would use <memoryBacking><hugepages/></memoryBacking> and a <numatune> block with <memory mode=... nodeset=.../> but no <memnode .../> elements.
This patch corrects this by, in this specific case, changing the QEMU command line from "-mem-prealloc -mem-path=..." (which cannot specify NUMA policy) to "-object memory-backend-file ..." (which can).
Note: This is not visible to the guest and does not appear to create a migration incompatibility.
It could make sense, I haven't tried yet, though. However, I still don't see the point in using memory-backend-file. Is it that when you don't have cpuset cgroup the allocation doesn't work well? Because it certainly does work for me.
Thanks for taking a look at this :-)

The point of using a memory-backend-file is that with it, the NUMA policy can be specified to QEMU, but with -mem-path it can't. It seems to be a way to tell QEMU to apply NUMA policy in the right place. It does seem odd to me to use memory-backend-file without attaching the backend to a guest NUMA node, but it seems to do the right thing in this case. (If there are guest NUMA nodes, or if hugepages aren't being used, policy is correctly applied.)

I'll describe my test case in detail, perhaps there's something I don't understand happening.

* I set up a machine with two (fake) NUMA nodes (0 and 1), with 2G of hugepages on node 1, and none on node 0.
* I create a 2G guest using virt-install:

  virt-install --name ppc --memory=2048 --disk ~/tmp/tmp.qcow2 --cdrom ~/tmp/ubuntu-16.04-server-ppc64el.iso --wait 0 --virt-type qemu --memorybacking hugepages=on --graphics vnc --arch ppc64le

* I "virsh destroy" and then "virsh edit" to add this block to the guest XML:

  <numatune>
    <memory mode='strict' nodeset='0'/>
  </numatune>

* "virsh start", and the machine starts (I believe it should fail due to insufficient memory satisfying the policy).
* "numastat -p $(pidof qemu-system-ppc64)" shows something like this:

  Per-node process memory usage (in MBs) for PID 8048 (qemu-system-ppc)
                            Node 0          Node 1           Total
                   --------------- --------------- ---------------
  Huge                        0.00         2048.00         2048.00
  Heap                        8.12            0.00            8.12
  Stack                       0.03            0.00            0.03
  Private                    35.80            6.10           41.90
  ----------------  --------------- --------------- ---------------
  Total                      43.95         2054.10         2098.05

So it looks like it's allocated hugepages from node 1, isn't this violating the policy I set via numatune?
Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com> --- There was some discussion leading up to this patch, here:
https://www.redhat.com/archives/libvir-list/2016-October/msg00033.html
During that discussion it seemed that fixing this issue would cause migration incompatibility but after some careful testing, it appears that it only does when a memory-backend object is attached to a guest NUMA node and that is not the case here. If only one is created, and used globally (not attached via mem=<id>), the migration data does not seem to be changed and so it seems reasonable to try something like this patch.
This patch does work for my test cases but I don't claim a deep understanding of the libvirt code so this is at least partly a RFC. Comments welcome :-)
Cheers, Sam.
src/qemu/qemu_command.c | 55 ++++++++++++++++++++++++++++++++++++++++--------- src/qemu/qemu_command.h | 2 +- 2 files changed, 46 insertions(+), 11 deletions(-)
diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index 0804133..c28c8f2 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -3143,7 +3143,7 @@ qemuBuildMemoryBackendStr(unsigned long long size, int guestNode, virBitmapPtr userNodeset, virBitmapPtr autoNodeset, - virDomainDefPtr def, + const virDomainDefPtr def, virQEMUCapsPtr qemuCaps, virQEMUDriverConfigPtr cfg, const char **backendType, @@ -7129,12 +7129,18 @@ qemuBuildSmpCommandLine(virCommandPtr cmd,
static int qemuBuildMemPathStr(virQEMUDriverConfigPtr cfg, - const virDomainDef *def, + const virDomainDefPtr def,
These two hunks have nothing to do with the change. If you want to do them for some reason, explain the reasoning and put it into separate patch, please.
OK.
virQEMUCapsPtr qemuCaps, - virCommandPtr cmd) + virCommandPtr cmd, + virBitmapPtr auto_nodeset) { const long system_page_size = virGetSystemPageSizeKB(); char *mem_path = NULL; + virBitmapPtr nodemask = NULL; + const char *backendType = NULL; + char *backendStr = NULL; + virJSONValuePtr props; + int rv = -1;
/* * No-op if hugepages were not requested. @@ -7159,18 +7165,47 @@ qemuBuildMemPathStr(virQEMUDriverConfigPtr cfg, if (qemuGetHupageMemPath(cfg, def->mem.hugepages[0].size, &mem_path) < 0) return -1;
- virCommandAddArgList(cmd, "-mem-prealloc", "-mem-path", mem_path, NULL); + if (virDomainNumatuneMaybeGetNodeset(def->numa, auto_nodeset, + &nodemask, -1) < 0) + return -1; + if (nodemask && virQEMUCapsGet(qemuCaps, QEMU_CAPS_OBJECT_MEMORY_FILE)) { + props = virJSONValueNewObject(); + if (qemuBuildMemoryBackendStr(virDomainDefGetMemoryInitial(def), + 0, -1, NULL, auto_nodeset, + def, qemuCaps, cfg, &backendType, + &props, false) < 0) + goto cleanup; + if (!(backendStr = virQEMUBuildObjectCommandlineFromJSON(backendType, + "mem", + props))) + goto cleanup; + virCommandAddArgList(cmd, "-object", backendStr, NULL);
So now a function that's called BuildMemPathStr does more than that. It is also called when the memory is added in a different way in which case this function would just add even more memory which is clearly wrong.
I'm not sure if I understand this; this function used to add "-mem-path" and it would now add either "-mem-path" or a memory backend (with the same path), both doing the same kind of configuration (is that not the case?). I did see that it must not be called twice (which was already the case, I think), but I convinced myself that it would only be called once: There are only two callers of qemuBuildMemPathStr(). One is in qemuBuildMemCommandLine() and it is protected by "if (!virDomainNumaGetNodeCount(def->numa)", and the other is in qemuBuildNumaArgStr() and the (only) call to that is protected by "if (virDomainNumaGetNodeCount(def->numa)". Is that safe enough?
+ rv = 0; +cleanup: + VIR_FREE(backendStr); + VIR_FREE(props); + } + else { + if (nodemask || 1)
" || 1" ? Leftover from debugging?
Oops, yes.
+ virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("Memory file backend objects are " + "unsupported by QEMU binary. Global NUMA " + "hugepage policy will be ignored."));
Reporting error without erroring out is wrong. VIR_WARN is probably what you wanted to do here.
OK.
As I said, I don't feel the need for the change. If there is a need for it, it should clearly be done differently, so NACK to this particular patch.
Martin
I understand. I'm interested to see what you think of the test case :-) Cheers, Sam.

On Thu, Oct 13, 2016 at 11:34:16AM +1100, Sam Bobroff wrote:
On Wed, Oct 12, 2016 at 10:27:50AM +0200, Martin Kletzander wrote:
On Wed, Oct 12, 2016 at 03:04:53PM +1100, Sam Bobroff wrote:
At the moment, guests that are backed by hugepages in the host are only able to use policy to control the placement of those hugepages on a per-(guest-)CPU basis. Policy applied globally is ignored.
Such guests would use <memoryBacking><hugepages/></memoryBacking> and a <numatune> block with <memory mode=... nodeset=.../> but no <memnode .../> elements.
This patch corrects this by, in this specific case, changing the QEMU command line from "-mem-prealloc -mem-path=..." (which cannot specify NUMA policy) to "-object memory-backend-file ..." (which can).
Note: This is not visible to the guest and does not appear to create a migration incompatibility.
It could make sense, I haven't tried yet, though. However, I still don't see the point in using memory-backend-file. Is it that when you don't have cpuset cgroup the allocation doesn't work well? Because it certainly does work for me.
Thanks for taking a look at this :-)
The point of using a memory-backend-file is that with it, the NUMA policy can be specified to QEMU, but with -mem-path it can't. It seems to be a way to tell QEMU to apply NUMA policy in the right place. It does seem odd to me to use memory-backend-file without attaching the backend to a guest NUMA node, but it seems to do the right thing in this case. (If there are guest NUMA nodes, or if hugepages aren't being used, policy is correctly applied.)
I'll describe my test case in detail, perhaps there's something I don't understand happening.
* I set up a machine with two (fake) NUMA nodes (0 and 1), with 2G of hugepages on node 1, and none on node 0.
* I create a 2G guest using virt-install:
virt-install --name ppc --memory=2048 --disk ~/tmp/tmp.qcow2 --cdrom ~/tmp/ubuntu-16.04-server-ppc64el.iso --wait 0 --virt-type qemu --memorybacking hugepages=on --graphics vnc --arch ppc64le
* I "virsh destroy" and then "virsh edit" to add this block to the guest XML:
<numatune> <memory mode='strict' nodeset='0'/> </numatune>
* "virsh start", and the machine starts (I believe it should fail due to insufficient memory satasfying the policy). * "numastat -p $(pidof qemu-system-ppc64)" shows something like this:
Per-node process memory usage (in MBs) for PID 8048 (qemu-system-ppc) Node 0 Node 1 Total --------------- --------------- --------------- Huge 0.00 2048.00 2048.00 Heap 8.12 0.00 8.12 Stack 0.03 0.00 0.03 Private 35.80 6.10 41.90 ---------------- --------------- --------------- --------------- Total 43.95 2054.10 2098.05
So it looks like it's allocated hugepages from node 1, isn't this violating the policy I set via numatune?
Oh, now I get it. We are doing our best to apply that policy to qemu even when we don't have this option. However, using this works even better (which is probably* what we want). And that's the reasoning behind this.

* I'm saying probably because when I was adding numactl binding to be used together with cgroups, I was told that we couldn't change the binding afterwards and that it's bad. I feel like we could do something with that and it would help us in the future, but there needs to be a discussion, I guess. Because I might be one of the few =)

So to recapitulate, there are three options for how to affect the allocation of qemu's memory:

1) numactl (libnuma): it works as expected, but cannot be changed later.

2) cgroups: so strict it has to be applied after qemu has started; due to that it doesn't work right, especially for stuff that gets all pre-allocated (like hugepages). It can be changed later, but it won't always mean the memory will migrate, so upon change there is no guarantee. If it's unavailable, we fall back to (1) anyway.

3) memory-backend-file's host-nodes=: this works as expected, but cannot be used with older QEMUs, cannot be changed later, and in some cases (not your particular one) it might screw up migration if it wasn't used before.

Selecting the best option from these, plus making the code work with every possibility (erroring out when you want to change the memory node and we had to use (1), for example), is a pain. We should really think about that and reorganize these things for the better of the future. Otherwise we're going to overwhelm ourselves.

Cc'ing Peter to get his thoughts as well, as he worked on some parts of this.

Martin

On Thu, Oct 13, 2016 at 11:34:43AM +0200, Martin Kletzander wrote:
On Thu, Oct 13, 2016 at 11:34:16AM +1100, Sam Bobroff wrote:
On Wed, Oct 12, 2016 at 10:27:50AM +0200, Martin Kletzander wrote:
On Wed, Oct 12, 2016 at 03:04:53PM +1100, Sam Bobroff wrote:
At the moment, guests that are backed by hugepages in the host are only able to use policy to control the placement of those hugepages on a per-(guest-)CPU basis. Policy applied globally is ignored.
Such guests would use <memoryBacking><hugepages/></memoryBacking> and a <numatune> block with <memory mode=... nodeset=.../> but no <memnode .../> elements.
This patch corrects this by, in this specific case, changing the QEMU command line from "-mem-prealloc -mem-path=..." (which cannot specify NUMA policy) to "-object memory-backend-file ..." (which can).
Note: This is not visible to the guest and does not appear to create a migration incompatibility.
It could make sense, I haven't tried yet, though. However, I still don't see the point in using memory-backend-file. Is it that when you don't have cpuset cgroup the allocation doesn't work well? Because it certainly does work for me.
Thanks for taking a look at this :-)
The point of using a memory-backend-file is that with it, the NUMA policy can be specified to QEMU, but with -mem-path it can't. It seems to be a way to tell QEMU to apply NUMA policy in the right place. It does seem odd to me to use memory-backend-file without attaching the backend to a guest NUMA node, but it seems to do the right thing in this case. (If there are guest NUMA nodes, or if hugepages aren't being used, policy is correctly applied.)
I'll describe my test case in detail, perhaps there's something I don't understand happening.
* I set up a machine with two (fake) NUMA nodes (0 and 1), with 2G of hugepages on node 1, and none on node 0.
* I create a 2G guest using virt-install:
virt-install --name ppc --memory=2048 --disk ~/tmp/tmp.qcow2 --cdrom ~/tmp/ubuntu-16.04-server-ppc64el.iso --wait 0 --virt-type qemu --memorybacking hugepages=on --graphics vnc --arch ppc64le
* I "virsh destroy" and then "virsh edit" to add this block to the guest XML:
<numatune> <memory mode='strict' nodeset='0'/> </numatune>
* "virsh start", and the machine starts (I believe it should fail due to insufficient memory satasfying the policy). * "numastat -p $(pidof qemu-system-ppc64)" shows something like this:
Per-node process memory usage (in MBs) for PID 8048 (qemu-system-ppc) Node 0 Node 1 Total --------------- --------------- --------------- Huge 0.00 2048.00 2048.00 Heap 8.12 0.00 8.12 Stack 0.03 0.00 0.03 Private 35.80 6.10 41.90 ---------------- --------------- --------------- --------------- Total 43.95 2054.10 2098.05
So it looks like it's allocated hugepages from node 1, isn't this violating the policy I set via numatune?
Oh, now I get it. We are doing our best to apply that policy to qemu even when we don't have this option. However, using this works even better (which is probably* what we want). And that's the reasoning behind this.
* I'm saying probably because when I was adding numactl binding to be used together with cgroups, I was told that we couldn't change the binding afterwards and it's bad. I feel like we could do something with that and it would help us in the future, but there needs to be a discussion, I guess. Because I might be one of the few =)
So to recapitulate that, there are three options how to affect the allocation of qemu's memory:
1) numactl (libnuma): it works as expected, but cannot be changed later
2) cgroups: so strict it has to be applied after qemu has started; due to that it doesn't work right, especially for stuff that gets all pre-allocated (like hugepages). It can be changed later, but it won't always mean the memory will migrate, so upon change there is no guarantee. If it's unavailable, we fall back to (1) anyway.
3) memory-backend-file's host-nodes=: this works as expected, but cannot be used with older QEMUs, cannot be changed later, and in some cases (not your particular one) it might screw up migration if it wasn't used before.
Selecting the best option from these, plus making the code work with every possibility (erroring out when you want to change the memory node and we had to use (1), for example), is a pain. We should really think about that and reorganize these things for the better of the future. Otherwise we're going to overwhelm ourselves. Cc'ing Peter to get his thoughts as well, as he worked on some parts of this.
Martin
Thanks for the explanation, and I agree (I'm already a bit overwhelmed!) :-)

What do you mean by "changed later"? Do you mean, if the domain XML is changed while the machine is running?

I did look at the libnuma and cgroups approaches, but I was concerned they wouldn't work in this case, because of the way QEMU allocates memory when mem-prealloc is used: the memory is allocated in the main process, before the CPU threads are created. (This is based only on a bit of hacking and debugging in QEMU, but it does seem to explain the behaviour I've seen so far.)

If this is the case, it would seem to be a significant problem: if policy is set on the main thread, it will affect all allocations, not just the VCPU memory, and if it's set on the VCPU threads it won't catch the pre-allocation at all. (Is this what you were referring to by "it doesn't work right"?)

That was my reasoning for trying to use the backend object in this case; it was the only method that worked and did not require changes to QEMU. I'd prefer the other approaches if they could be made to work.

I think QEMU could be altered to move the preallocations into the VCPU threads, but it didn't seem trivial and I suspected the QEMU community would point out that there was already a way to do it using backend objects. Another option would be to add a -host-nodes parameter to QEMU so that the policy could be given without adding a memory backend object. (That seems like a more reasonable change to QEMU.)

Cheers,
Sam.

On Fri, Oct 14, 2016 at 11:52:22AM +1100, Sam Bobroff wrote:
On Thu, Oct 13, 2016 at 11:34:43AM +0200, Martin Kletzander wrote:
On Thu, Oct 13, 2016 at 11:34:16AM +1100, Sam Bobroff wrote:
On Wed, Oct 12, 2016 at 10:27:50AM +0200, Martin Kletzander wrote:
On Wed, Oct 12, 2016 at 03:04:53PM +1100, Sam Bobroff wrote:
At the moment, guests that are backed by hugepages in the host are only able to use policy to control the placement of those hugepages on a per-(guest-)CPU basis. Policy applied globally is ignored.
Such guests would use <memoryBacking><hugepages/></memoryBacking> and a <numatune> block with <memory mode=... nodeset=.../> but no <memnode .../> elements.
This patch corrects this by, in this specific case, changing the QEMU command line from "-mem-prealloc -mem-path=..." (which cannot specify NUMA policy) to "-object memory-backend-file ..." (which can).
Note: This is not visible to the guest and does not appear to create a migration incompatibility.
It could make sense, I haven't tried yet, though. However, I still don't see the point in using memory-backend-file. Is it that when you don't have cpuset cgroup the allocation doesn't work well? Because it certainly does work for me.
Thanks for taking a look at this :-)
The point of using a memory-backend-file is that with it, the NUMA policy can be specified to QEMU, but with -mem-path it can't. It seems to be a way to tell QEMU to apply NUMA policy in the right place. It does seem odd to me to use memory-backend-file without attaching the backend to a guest NUMA node, but it seems to do the right thing in this case. (If there are guest NUMA nodes, or if hugepages aren't being used, policy is correctly applied.)
I'll describe my test case in detail, perhaps there's something I don't understand happening.
* I set up a machine with two (fake) NUMA nodes (0 and 1), with 2G of hugepages on node 1, and none on node 0.
* I create a 2G guest using virt-install:
virt-install --name ppc --memory=2048 --disk ~/tmp/tmp.qcow2 --cdrom ~/tmp/ubuntu-16.04-server-ppc64el.iso --wait 0 --virt-type qemu --memorybacking hugepages=on --graphics vnc --arch ppc64le
* I "virsh destroy" and then "virsh edit" to add this block to the guest XML:
<numatune> <memory mode='strict' nodeset='0'/> </numatune>
* "virsh start", and the machine starts (I believe it should fail due to insufficient memory satasfying the policy). * "numastat -p $(pidof qemu-system-ppc64)" shows something like this:
Per-node process memory usage (in MBs) for PID 8048 (qemu-system-ppc) Node 0 Node 1 Total --------------- --------------- --------------- Huge 0.00 2048.00 2048.00 Heap 8.12 0.00 8.12 Stack 0.03 0.00 0.03 Private 35.80 6.10 41.90 ---------------- --------------- --------------- --------------- Total 43.95 2054.10 2098.05
So it looks like it's allocated hugepages from node 1, isn't this violating the policy I set via numatune?
Oh, now I get it. We are doing our best to apply that policy to qemu even when we don't have this option. However, using this works even better (which is probably* what we want). And that's the reasoning behind this.
* I'm saying probably because when I was adding numactl binding to be used together with cgroups, I was told that we couldn't change the binding afterwards and it's bad. I feel like we could do something with that and it would help us in the future, but there needs to be a discussion, I guess. Because I might be one of the few =)
So to recapitulate that, there are three options how to affect the allocation of qemu's memory:
1) numactl (libnuma): it works as expected, but cannot be changed later
2) cgroups: so strict it has to be applied after qemu has started; due to that it doesn't work right, especially for stuff that gets all pre-allocated (like hugepages). It can be changed later, but it won't always mean the memory will migrate, so upon change there is no guarantee. If it's unavailable, we fall back to (1) anyway.
3) memory-backend-file's host-nodes=: this works as expected, but cannot be used with older QEMUs, cannot be changed later, and in some cases (not your particular one) it might screw up migration if it wasn't used before.
Selecting the best option from these, plus making the code work with every possibility (erroring out when you want to change the memory node and we had to use (1), for example), is a pain. We should really think about that and reorganize these things for the better of the future. Otherwise we're going to overwhelm ourselves. Cc'ing Peter to get his thoughts as well, as he worked on some parts of this.
Martin
Thanks for the explanation, and I agree (I'm already a bit overwhelmed!) :-)
What do you mean by "changed later"? Do you mean, if the domain XML is changed while the machine is running?
E.g. by 'virsh numatune domain 1-2'
I did look at the libnuma and cgroups approaches, but I was concerned they wouldn't work in this case, because of the way QEMU allocates memory when mem-prealloc is used: the memory is allocated in the main process, before the CPU threads are created. (This is based only on a bit of hacking and debugging in QEMU, but it does seem explain the behaviour I've seen so far.)
But we use numactl before QEMU is exec()'d.
If this is the case, it would seem to be a significant problem: if policy is set on the main thread, it will affect all allocations not just the VCPU memory and if it's set on the VCPU threads it won't catch the pre-allocation at all. (Is this what you were referring to by "it doesn't work right"?)
Kind of, yes.
That was my reasoning for trying to use the backend object in this case; it was the only method that worked and did not require changes to QEMU. I'd prefer the other approaches if they could be made to work.
There is a workaround, you can disable the cpuset cgroup in libvirtd.conf, but that's not what you want, I guess.
I think QEMU could be altered to move the preallocations into the VCPU threads but it didn't seem trivial and I suspected the QEMU community would point out that there was already a way to do it using backend objects. Another option would be to add a -host-nodes parameter to QEMU so that the policy can be given without adding a memory backend object. (That seems like a more reasonable change to QEMU.)
I think upstream won't like that, mostly because there is already a way. And that is using memory-backend object. I think we could just use that and disable changing it live. But upstream will probably want that to be configurable or something.
Cheers, Sam.

On Fri, Oct 14, 2016 at 10:19:42AM +0200, Martin Kletzander wrote:
On Fri, Oct 14, 2016 at 11:52:22AM +1100, Sam Bobroff wrote:
On Thu, Oct 13, 2016 at 11:34:43AM +0200, Martin Kletzander wrote:
On Thu, Oct 13, 2016 at 11:34:16AM +1100, Sam Bobroff wrote:
On Wed, Oct 12, 2016 at 10:27:50AM +0200, Martin Kletzander wrote:
On Wed, Oct 12, 2016 at 03:04:53PM +1100, Sam Bobroff wrote:
At the moment, guests that are backed by hugepages in the host are only able to use policy to control the placement of those hugepages on a per-(guest-)CPU basis. Policy applied globally is ignored.
Such guests would use <memoryBacking><hugepages/></memoryBacking> and a <numatune> block with <memory mode=... nodeset=.../> but no <memnode .../> elements.
This patch corrects this by, in this specific case, changing the QEMU command line from "-mem-prealloc -mem-path=..." (which cannot specify NUMA policy) to "-object memory-backend-file ..." (which can).
Note: This is not visible to the guest and does not appear to create a migration incompatibility.
It could make sense, I haven't tried yet, though. However, I still don't see the point in using memory-backend-file. Is it that when you don't have cpuset cgroup the allocation doesn't work well? Because it certainly does work for me.
Thanks for taking a look at this :-)
The point of using a memory-backend-file is that with it, the NUMA policy can be specified to QEMU, but with -mem-path it can't. It seems to be a way to tell QEMU to apply NUMA policy in the right place. It does seem odd to me to use memory-backend-file without attaching the backend to a guest NUMA node, but it seems to do the right thing in this case. (If there are guest NUMA nodes, or if hugepages aren't being used, policy is correctly applied.)
I'll describe my test case in detail, perhaps there's something I don't understand happening.
* I set up a machine with two (fake) NUMA nodes (0 and 1), with 2G of hugepages on node 1, and none on node 0.
* I create a 2G guest using virt-install:
virt-install --name ppc --memory=2048 --disk ~/tmp/tmp.qcow2 --cdrom ~/tmp/ubuntu-16.04-server-ppc64el.iso --wait 0 --virt-type qemu --memorybacking hugepages=on --graphics vnc --arch ppc64le
* I "virsh destroy" and then "virsh edit" to add this block to the guest XML:
<numatune> <memory mode='strict' nodeset='0'/> </numatune>
* "virsh start", and the machine starts (I believe it should fail due to insufficient memory satasfying the policy). * "numastat -p $(pidof qemu-system-ppc64)" shows something like this:
Per-node process memory usage (in MBs) for PID 8048 (qemu-system-ppc) Node 0 Node 1 Total --------------- --------------- --------------- Huge 0.00 2048.00 2048.00 Heap 8.12 0.00 8.12 Stack 0.03 0.00 0.03 Private 35.80 6.10 41.90 ---------------- --------------- --------------- --------------- Total 43.95 2054.10 2098.05
So it looks like it's allocated hugepages from node 1, isn't this violating the policy I set via numatune?
Oh, now I get it. We are doing our best to apply that policy to qemu even when we don't have this option. However, using this works even better (which is probably* what we want). And that's the reasoning behind this.
* I'm saying probably because when I was adding numactl binding to be used together with cgroups, I was told that we couldn't change the binding afterwards and it's bad. I feel like we could do something with that and it would help us in the future, but there needs to be a discussion, I guess. Because I might be one of the few =)
So to recapitulate that, there are three options how to affect the allocation of qemu's memory:
1) numactl (libnuma): it works as expected, but cannot be changed later
2) cgroups: so strict it has to be applied after qemu has started; due to that it doesn't work right, especially for stuff that gets all pre-allocated (like hugepages). It can be changed later, but it won't always mean the memory will migrate, so upon change there is no guarantee. If it's unavailable, we fall back to (1) anyway.
3) memory-backend-file's host-nodes=: this works as expected, but cannot be used with older QEMUs, cannot be changed later, and in some cases (not your particular one) it might screw up migration if it wasn't used before.
Selecting the best option from these, plus making the code work with every possibility (erroring out when you want to change the memory node and we had to use (1), for example), is a pain. We should really think about that and reorganize these things for the better of the future. Otherwise we're going to overwhelm ourselves. Cc'ing Peter to get his thoughts as well, as he worked on some parts of this.
Martin
Thanks for the explanation, and I agree (I'm already a bit overwhelmed!) :-)
What do you mean by "changed later"? Do you mean, if the domain XML is changed while the machine is running?
E.g. by 'virsh numatune domain 1-2'
Ah, thanks!
I did look at the libnuma and cgroups approaches, but I was concerned they wouldn't work in this case, because of the way QEMU allocates memory when mem-prealloc is used: the memory is allocated in the main process, before the CPU threads are created. (This is based only on a bit of hacking and debugging in QEMU, but it does seem explain the behaviour I've seen so far.)
But we use numactl before QEMU is exec()'d.
Sorry, I jumped ahead a bit. I'll try to explain what I mean:

I think the problem with using this method would be that the NUMA policy is applied to all allocations by QEMU, not just the ones related to the memory backing. I'm not sure if that would cause a serious problem, but it seems untidy, and it doesn't happen in other situations (i.e. with separate memory backend objects, QEMU sets up the policy specifically for each one and other allocations aren't affected, AFAIK). Presumably, if memory were very restricted, it could prevent the guest from starting.
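A rough way to picture that difference (paths and node numbers are illustrative): a process-wide policy applied before exec() covers everything QEMU allocates, while a backend object scopes the policy to the guest RAM it backs:

    # process-wide: every allocation qemu makes (heap, stack, internal buffers)
    # is restricted to host node 0
    numactl --membind=0 qemu-system-ppc64 -mem-prealloc -mem-path /dev/hugepages/libvirt/qemu ...

    # per-backend: only the memory backed by this object is bound to host node 0
    qemu-system-ppc64 -object memory-backend-file,id=mem,size=2048M,mem-path=/dev/hugepages/libvirt/qemu,prealloc=yes,host-nodes=0,policy=bind ...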
If this is the case, it would seem to be a significant problem: if policy is set on the main thread, it will affect all allocations not just the VCPU memory and if it's set on the VCPU threads it won't catch the pre-allocation at all. (Is this what you were referring to by "it doesn't work right"?)
Kind of, yes.
That was my reasoning for trying to use the backend object in this case; it was the only method that worked and did not require changes to QEMU. I'd prefer the other approaches if they could be made to work.
There is a workaround, you can disable the cpuset cgroup in libvirtd.conf, but that's not what you want, I guess.
Thanks, but no it doesn't seem to be what I want due to the above issue.
I think QEMU could be altered to move the preallocations into the VCPU threads but it didn't seem trivial and I suspected the QEMU community would point out that there was already a way to do it using backend objects. Another option would be to add a -host-nodes parameter to QEMU so that the policy can be given without adding a memory backend object. (That seems like a more reasonable change to QEMU.)
I think upstream won't like that, mostly because there is already a way. And that is using memory-backend object. I think we could just use that and disable changing it live. But upstream will probably want that to be configurable or something.
Right, but isn't this already an issue in the cases where libvirt is already using memory backend objects and NUMA policy? (Or does libvirt already disable changing it live in those situations?)
Cheers, Sam.

On Mon, Oct 17, 2016 at 03:45:09PM +1100, Sam Bobroff wrote:
On Fri, Oct 14, 2016 at 10:19:42AM +0200, Martin Kletzander wrote:
On Fri, Oct 14, 2016 at 11:52:22AM +1100, Sam Bobroff wrote:
I did look at the libnuma and cgroups approaches, but I was concerned they wouldn't work in this case, because of the way QEMU allocates memory when mem-prealloc is used: the memory is allocated in the main process, before the CPU threads are created. (This is based only on a bit of hacking and debugging in QEMU, but it does seem explain the behaviour I've seen so far.)
But we use numactl before QEMU is exec()'d.
Sorry, I jumped ahead a bit. I'll try to explain what I mean:
I think the problem with using this method would be that the NUMA policy is applied to all allocations by QEMU, not just ones related to the memory backing. I'm not sure if that would cause a serious problem but it seems untidy, and it doesn't happen in other situations (i.e. with separate memory backend objects, QEMU sets up the policy specifically for each one and other allocations aren't affected, AFAIK). Presumably, if memory were very restricted it could prevent the guest from starting.
Yes, it is, that's what <numatune><memory/> does if you don't have any other (<memnode/>) specifics set.
I think QEMU could be altered to move the preallocations into the VCPU threads but it didn't seem trivial and I suspected the QEMU community would point out that there was already a way to do it using backend objects. Another option would be to add a -host-nodes parameter to QEMU so that the policy can be given without adding a memory backend object. (That seems like a more reasonable change to QEMU.)
I think upstream won't like that, mostly because there is already a way. And that is using memory-backend object. I think we could just use that and disable changing it live. But upstream will probably want that to be configurable or something.
Right, but isn't this already an issue in the cases where libvirt is already using memory backend objects and NUMA policy? (Or does libvirt already disable changing it live in those situations?)
It is. I'm not trying to say libvirt is perfect. There are bugs, e.g. like this one. The problem is that we tried to do *everything*, but it's not currently possible. I'm trying to explain how stuff works now. It definitely needs some fixing, though.

On Tue, Oct 18, 2016 at 10:43:31PM +0200, Martin Kletzander wrote:
On Mon, Oct 17, 2016 at 03:45:09PM +1100, Sam Bobroff wrote:
On Fri, Oct 14, 2016 at 10:19:42AM +0200, Martin Kletzander wrote:
On Fri, Oct 14, 2016 at 11:52:22AM +1100, Sam Bobroff wrote:
I did look at the libnuma and cgroups approaches, but I was concerned they wouldn't work in this case, because of the way QEMU allocates memory when mem-prealloc is used: the memory is allocated in the main process, before the CPU threads are created. (This is based only on a bit of hacking and debugging in QEMU, but it does seem explain the behaviour I've seen so far.)
But we use numactl before QEMU is exec()'d.
Sorry, I jumped ahead a bit. I'll try to explain what I mean:
I think the problem with using this method would be that the NUMA policy is applied to all allocations by QEMU, not just ones related to the memory backing. I'm not sure if that would cause a serious problem but it seems untidy, and it doesn't happen in other situations (i.e. with separate memory backend objects, QEMU sets up the policy specifically for each one and other allocations aren't affected, AFAIK). Presumably, if memory were very restricted it could prevent the guest from starting.
Yes, it is, that's what <numatune><memory/> does if you don't have any other (<memnode/>) specifics set.
I think QEMU could be altered to move the preallocations into the VCPU threads but it didn't seem trivial and I suspected the QEMU community would point out that there was already a way to do it using backend objects. Another option would be to add a -host-nodes parameter to QEMU so that the policy can be given without adding a memory backend object. (That seems like a more reasonable change to QEMU.)
I think upstream won't like that, mostly because there is already a way. And that is using memory-backend object. I think we could just use that and disable changing it live. But upstream will probably want that to be configurable or something.
Right, but isn't this already an issue in the cases where libvirt is already using memory backend objects and NUMA policy? (Or does libvirt already disable changing it live in those situations?)
It is. I'm not trying to say libvirt is perfect. There are bugs, e.g. like this one. The problem is that we tried to do *everything*, but it's not currently possible. I'm trying to explain how stuff works now. It definitely needs some fixing, though.
OK :-) Well, given our discussion, do you think it's worth a v2 of my original patch or would it be better to drop it in favour of some broader change? Cheers, Sam.

On Tue, Oct 25, 2016 at 01:10:23PM +1100, Sam Bobroff wrote:
On Tue, Oct 18, 2016 at 10:43:31PM +0200, Martin Kletzander wrote:
On Mon, Oct 17, 2016 at 03:45:09PM +1100, Sam Bobroff wrote:
On Fri, Oct 14, 2016 at 10:19:42AM +0200, Martin Kletzander wrote:
On Fri, Oct 14, 2016 at 11:52:22AM +1100, Sam Bobroff wrote:
I did look at the libnuma and cgroups approaches, but I was concerned they wouldn't work in this case, because of the way QEMU allocates memory when mem-prealloc is used: the memory is allocated in the main process, before the CPU threads are created. (This is based only on a bit of hacking and debugging in QEMU, but it does seem explain the behaviour I've seen so far.)
But we use numactl before QEMU is exec()'d.
Sorry, I jumped ahead a bit. I'll try to explain what I mean:
I think the problem with using this method would be that the NUMA policy is applied to all allocations by QEMU, not just ones related to the memory backing. I'm not sure if that would cause a serious problem but it seems untidy, and it doesn't happen in other situations (i.e. with separate memory backend objects, QEMU sets up the policy specifically for each one and other allocations aren't affected, AFAIK). Presumably, if memory were very restricted it could prevent the guest from starting.
Yes, it is, that's what <numatune><memory/> does if you don't have any other (<memnode/>) specifics set.
I think QEMU could be altered to move the preallocations into the VCPU threads but it didn't seem trivial and I suspected the QEMU community would point out that there was already a way to do it using backend objects. Another option would be to add a -host-nodes parameter to QEMU so that the policy can be given without adding a memory backend object. (That seems like a more reasonable change to QEMU.)
I think upstream won't like that, mostly because there is already a way. And that is using memory-backend object. I think we could just use that and disable changing it live. But upstream will probably want that to be configurable or something.
Right, but isn't this already an issue in the cases where libvirt is already using memory backend objects and NUMA policy? (Or does libvirt already disable changing it live in those situations?)
It is. I'm not trying to say libvirt is perfect. There are bugs, e.g. like this one. The problem is that we tried to do *everything*, but it's not currently possible. I'm trying to explain how stuff works now. It definitely needs some fixing, though.
OK :-)
Well, given our discussion, do you think it's worth a v2 of my original patch or would it be better to drop it in favour of some broader change?
Honestly, I thought about the approaches so much I'm now not sure I'll make a good decision. RFC could do. If I were to pick, I would go with a new setting that would control whether we want the binding to be changeable throughout the domain's lifetime or not so that we can make better decisions (and don't feel bad about the bad ones).
Cheers, Sam.

On Tue, Oct 25, 2016 at 02:13:07PM +0200, Martin Kletzander wrote:
On Tue, Oct 25, 2016 at 01:10:23PM +1100, Sam Bobroff wrote:
On Tue, Oct 18, 2016 at 10:43:31PM +0200, Martin Kletzander wrote:
On Mon, Oct 17, 2016 at 03:45:09PM +1100, Sam Bobroff wrote:
On Fri, Oct 14, 2016 at 10:19:42AM +0200, Martin Kletzander wrote:
On Fri, Oct 14, 2016 at 11:52:22AM +1100, Sam Bobroff wrote:
I did look at the libnuma and cgroups approaches, but I was concerned they wouldn't work in this case, because of the way QEMU allocates memory when mem-prealloc is used: the memory is allocated in the main process, before the CPU threads are created. (This is based only on a bit of hacking and debugging in QEMU, but it does seem explain the behaviour I've seen so far.)
But we use numactl before QEMU is exec()'d.
Sorry, I jumped ahead a bit. I'll try to explain what I mean:
I think the problem with using this method would be that the NUMA policy is applied to all allocations by QEMU, not just ones related to the memory backing. I'm not sure if that would cause a serious problem but it seems untidy, and it doesn't happen in other situations (i.e. with separate memory backend objects, QEMU sets up the policy specifically for each one and other allocations aren't affected, AFAIK). Presumably, if memory were very restricted it could prevent the guest from starting.
Yes, it is, that's what <numatune><memory/> does if you don't have any other (<memnode/>) specifics set.
I think QEMU could be altered to move the preallocations into the VCPU threads but it didn't seem trivial and I suspected the QEMU community would point out that there was already a way to do it using backend objects. Another option would be to add a -host-nodes parameter to QEMU so that the policy can be given without adding a memory backend object. (That seems like a more reasonable change to QEMU.)
I think upstream won't like that, mostly because there is already a way. And that is using memory-backend object. I think we could just use that and disable changing it live. But upstream will probably want that to be configurable or something.
Right, but isn't this already an issue in the cases where libvirt is already using memory backend objects and NUMA policy? (Or does libvirt already disable changing it live in those situations?)
It is. I'm not trying to say libvirt is perfect. There are bugs, e.g. like this one. The problem is that we tried to do *everything*, but it's not currently possible. I'm trying to explain how stuff works now. It definitely needs some fixing, though.
OK :-)
Well, given our discussion, do you think it's worth a v2 of my original patch or would it be better to drop it in favour of some broader change?
Honestly, I thought about the approaches so much I'm now not sure I'll make a good decision. RFC could do. If I were to pick, I would go with a new setting that would control whether we want the binding to be changeable throughout the domain's lifetime or not so that we can make better decisions (and don't feel bad about the bad ones).
I feel the same way. OK, I'll try an RFC patch with a lot of description.

I'm specifically trying to address the issue I originally raised, which isn't quite the same thing as the changeability of the bindings, but I'll keep that in mind. I think your point about changing the bindings will apply in the same way whenever QEMU's memory-backend objects are used with their "host-nodes" attribute (since they are what causes QEMU to apply policy), so I don't think I'm suggesting any significant change there.

If you want to add the new setting you mention above, I'd be happy to base my patch on top of that work. ;-)

Cheers,
Sam.