On Wed, Oct 12, 2016 at 10:27:50AM +0200, Martin Kletzander wrote:
On Wed, Oct 12, 2016 at 03:04:53PM +1100, Sam Bobroff wrote:
>At the moment, guests that are backed by hugepages in the host are
>only able to use policy to control the placement of those hugepages
>on a per-(guest-)CPU basis. Policy applied globally is ignored.
>
>Such guests would use <memoryBacking><hugepages/></memoryBacking>
and
>a <numatune> block with <memory mode=... nodeset=.../> but no
<memnode
>.../> elements.
>
>This patch corrects this by, in this specific case, changing the QEMU
>command line from "-mem-prealloc -mem-path=..." (which cannot
>specify NUMA policy) to "-object memory-backend-file ..." (which can).
>
>Note: This is not visible to the guest and does not appear to create
>a migration incompatibility.
>
It could make sense, I haven't tried yet, though. However, I still
don't see the point in using memory-backend-file. Is it that when you
don't have cpuset cgroup the allocation doesn't work well? Because it
certainly does work for me.
Thanks for taking a look at this :-)
The point of using a memory-backend-file is that with it, the NUMA policy can
be specified to QEMU, but with -mem-path it can't. It seems to be a way to tell
QEMU to apply NUMA policy in the right place. It does seem odd to me to use
memory-backend-file without attaching the backend to a guest NUMA node, but it
seems to do the right thing in this case. (If there are guest NUMA nodes, or if
hugepages aren't being used, policy is correctly applied.)
I'll describe my test case in detail, perhaps there's something I don't
understand
happening.
* I set up a machine with two (fake) NUMA nodes (0 and 1), with 2G of hugepages
on node 1, and none on node 0.
* I create a 2G guest using virt-install:
virt-install --name ppc --memory=2048 --disk ~/tmp/tmp.qcow2 --cdrom
~/tmp/ubuntu-16.04-server-ppc64el.iso --wait 0 --virt-type qemu --memorybacking
hugepages=on --graphics vnc --arch ppc64le
* I "virsh destroy" and then "virsh edit" to add this block to the
guest XML:
<numatune>
<memory mode='strict' nodeset='0'/>
</numatune>
* "virsh start", and the machine starts (I believe it should fail due to
insufficient memory satasfying the policy).
* "numastat -p $(pidof qemu-system-ppc64)" shows something like this:
Per-node process memory usage (in MBs) for PID 8048 (qemu-system-ppc)
Node 0 Node 1 Total
--------------- --------------- ---------------
Huge 0.00 2048.00 2048.00
Heap 8.12 0.00 8.12
Stack 0.03 0.00 0.03
Private 35.80 6.10 41.90
---------------- --------------- --------------- ---------------
Total 43.95 2054.10 2098.05
So it looks like it's allocated hugepages from node 1, isn't this violating the
policy I set via numatune?
>Signed-off-by: Sam Bobroff <sam.bobroff(a)au1.ibm.com>
>---
>There was some discussion leading up to this patch, here:
>
>https://www.redhat.com/archives/libvir-list/2016-October/msg00033.html
>
>During that discussion it seemed that fixing this issue would cause
>migration incompatibility but after some careful testing, it appears
>that it only does when a memory-backend object is attached to a guest
>NUMA node and that is not the case here. If only one is created, and
>used globally (not attached via mem=<id>), the migration data does not
>seem to be changed and so it seems reasonable to try something like
>this patch.
>
>This patch does work for my test cases but I don't claim a deep
>understanding of the libvirt code so this is at least partly a RFC.
>Comments welcome :-)
>
>Cheers,
>Sam.
>
>src/qemu/qemu_command.c | 55 ++++++++++++++++++++++++++++++++++++++++---------
>src/qemu/qemu_command.h | 2 +-
>2 files changed, 46 insertions(+), 11 deletions(-)
>
>diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
>index 0804133..c28c8f2 100644
>--- a/src/qemu/qemu_command.c
>+++ b/src/qemu/qemu_command.c
>@@ -3143,7 +3143,7 @@ qemuBuildMemoryBackendStr(unsigned long long size,
> int guestNode,
> virBitmapPtr userNodeset,
> virBitmapPtr autoNodeset,
>- virDomainDefPtr def,
>+ const virDomainDefPtr def,
> virQEMUCapsPtr qemuCaps,
> virQEMUDriverConfigPtr cfg,
> const char **backendType,
>@@ -7129,12 +7129,18 @@ qemuBuildSmpCommandLine(virCommandPtr cmd,
>
>static int
>qemuBuildMemPathStr(virQEMUDriverConfigPtr cfg,
>- const virDomainDef *def,
>+ const virDomainDefPtr def,
These two hunks have nothing to do with the change. If you want to do
them for some reason, explain the reasoning and put it into separate
patch, please.
OK.
> virQEMUCapsPtr qemuCaps,
>- virCommandPtr cmd)
>+ virCommandPtr cmd,
>+ virBitmapPtr auto_nodeset)
>{
> const long system_page_size = virGetSystemPageSizeKB();
> char *mem_path = NULL;
>+ virBitmapPtr nodemask = NULL;
>+ const char *backendType = NULL;
>+ char *backendStr = NULL;
>+ virJSONValuePtr props;
>+ int rv = -1;
>
> /*
> * No-op if hugepages were not requested.
>@@ -7159,18 +7165,47 @@ qemuBuildMemPathStr(virQEMUDriverConfigPtr cfg,
> if (qemuGetHupageMemPath(cfg, def->mem.hugepages[0].size, &mem_path) <
0)
> return -1;
>
>- virCommandAddArgList(cmd, "-mem-prealloc", "-mem-path",
mem_path, NULL);
>+ if (virDomainNumatuneMaybeGetNodeset(def->numa, auto_nodeset,
>+ &nodemask, -1) < 0)
>+ return -1;
>+ if (nodemask && virQEMUCapsGet(qemuCaps, QEMU_CAPS_OBJECT_MEMORY_FILE))
{
>+ props = virJSONValueNewObject();
>+ if (qemuBuildMemoryBackendStr(virDomainDefGetMemoryInitial(def),
>+ 0, -1, NULL, auto_nodeset,
>+ def, qemuCaps, cfg, &backendType,
>+ &props, false) < 0)
>+ goto cleanup;
>+ if (!(backendStr = virQEMUBuildObjectCommandlineFromJSON(backendType,
>+ "mem",
>+ props)))
>+ goto cleanup;
>+ virCommandAddArgList(cmd, "-object", backendStr, NULL);
So now a function that's called BuildMemPathStr does more than that. It
is also called when the memory is added in a different way in which case
this function would just add even more memory which is clearly wrong.
I'm not sure if I understand this; this function used to add "-mem-path"
and it would now add either "-mem-path" or a memory backend (with the same
path), both doing the same kind of configuration (is that not the case?). I did
see that it must not be called twice (which was already the case, I think), but
I convinced myself that it would only be called once: There are only two
callers of qemuBuildMemPathStr(). One is in qemuBuildMemCommandLine() and it is
protected by "if (!virDomainNumaGetNodeCount(def->numa)", and the other is
in
qemuBuildNumaArgStr() and the (only) call to that is protected by "if
(virDomainNumaGetNodeCount(def->numa)". Is that safe enough?
>+ rv = 0;
>+cleanup:
>+ VIR_FREE(backendStr);
>+ VIR_FREE(props);
>+ }
>+ else {
>+ if (nodemask || 1)
" || 1" ? Leftover from debugging?
Oops, yes.
>+ virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
"%s",
>+ _("Memory file backend objects are "
>+ "unsupported by QEMU binary. Global NUMA "
>+ "hugepage policy will be ignored."));
Reporting error without erroring out is wrong. VIR_WARN is probably
what you wanted to do here.
OK.
As I said, I don't feel the need for the change. If there is a
need for
it, it should clearly be done differently, so NACK to this particular
patch.
Martin
I understand. I'm interested to see what you think of the test case :-)
Cheers,
Sam.