Re: [libvirt] [PATCH 1/1] qemu: host NUMA hugepage policy without guest NUMA

13 Oct 2016

On Thu, Oct 13, 2016 at 11:34:16AM +1100, Sam Bobroff wrote:
...
On Wed, Oct 12, 2016 at 10:27:50AM +0200, Martin Kletzander wrote:
...
On Wed, Oct 12, 2016 at 03:04:53PM +1100, Sam Bobroff wrote:
...
At the moment, guests that are backed by hugepages in the host are
only able to use policy to control the placement of those hugepages
on a per-(guest-)CPU basis. Policy applied globally is ignored.
Such guests would use <memoryBacking><hugepages/></memoryBacking> and
a <numatune> block with <memory mode=... nodeset=.../> but no <memnode
.../> elements.
This patch corrects this by, in this specific case, changing the QEMU
command line from "-mem-prealloc -mem-path=..." (which cannot
specify NUMA policy) to "-object memory-backend-file ..." (which can).
Note: This is not visible to the guest and does not appear to create
a migration incompatibility.
It could make sense, I haven't tried yet, though.  However, I still
don't see the point in using memory-backend-file.  Is it that when you
don't have cpuset cgroup the allocation doesn't work well?  Because it
certainly does work for me.
Thanks for taking a look at this :-)
The point of using a memory-backend-file is that with it, the NUMA policy can
be specified to QEMU, but with -mem-path it can't. It seems to be a way to tell
QEMU to apply NUMA policy in the right place. It does seem odd to me to use
memory-backend-file without attaching the backend to a guest NUMA node, but it
seems to do the right thing in this case. (If there are guest NUMA nodes, or if
hugepages aren't being used, policy is correctly applied.)
I'll describe my test case in detail; perhaps there's something happening that
I don't understand.
* I set up a machine with two (fake) NUMA nodes (0 and 1), with 2G of hugepages
 on node 1, and none on node 0.
* I create a 2G guest using virt-install:
virt-install --name ppc --memory=2048 --disk ~/tmp/tmp.qcow2 --cdrom ~/tmp/ubuntu-16.04-server-ppc64el.iso --wait 0 --virt-type qemu --memorybacking hugepages=on --graphics vnc --arch ppc64le
* I "virsh destroy" and then "virsh edit" to add this block to the guest XML:
<numatune>
  <memory mode='strict' nodeset='0'/>
</numatune>
* "virsh start", and the machine starts (I believe it should fail because there is not enough memory that satisfies the policy).
* "numastat -p $(pidof qemu-system-ppc64)" shows something like this:
Per-node process memory usage (in MBs) for PID 8048 (qemu-system-ppc)
                          Node 0          Node 1           Total
                 --------------- --------------- ---------------
Huge                         0.00         2048.00         2048.00
Heap                         8.12            0.00            8.12
Stack                        0.03            0.00            0.03
Private                     35.80            6.10           41.90
----------------  --------------- --------------- ---------------
Total                       43.95         2054.10         2098.05
So it looks like it has allocated hugepages from node 1. Isn't this violating
the policy I set via numatune?
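
For anyone wanting to repeat the check, here is a minimal sketch that reads
the "Huge" row of the numastat output above and reports nodes holding
hugepage memory outside the configured nodeset. It assumes the exact
"numastat -p" layout shown above; the function names are hypothetical and
not part of libvirt or numastat.

```python
# Sample text copied from the numastat output quoted in this thread.
SAMPLE = """\
Per-node process memory usage (in MBs) for PID 8048 (qemu-system-ppc)
                          Node 0          Node 1           Total
                 --------------- --------------- ---------------
Huge                         0.00         2048.00         2048.00
Heap                         8.12            0.00            8.12
Stack                        0.03            0.00            0.03
Private                     35.80            6.10           41.90
----------------  --------------- --------------- ---------------
Total                       43.95         2054.10         2098.05
"""


def hugepage_mb_by_node(numastat_text):
    """Return {node_index: MB} taken from the 'Huge' row."""
    nodes = []
    for line in numastat_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Node":
            # Header row, e.g. "Node 0  Node 1  Total" -> [0, 1]
            nodes = [int(f) for f in fields if f.isdigit()]
        elif fields[0] == "Huge":
            values = [float(f) for f in fields[1:]]
            # zip() stops at the last node, so the trailing Total is ignored.
            return dict(zip(nodes, values))
    return {}


def nodes_violating(numastat_text, nodeset):
    """Return the nodes that hold hugepage memory but are not in nodeset."""
    usage = hugepage_mb_by_node(numastat_text)
    return sorted(n for n, mb in usage.items() if mb > 0 and n not in nodeset)


# With nodeset='0' as in the <numatune> block above:
print(nodes_violating(SAMPLE, {0}))  # prints [1]
```

An empty list would mean that all hugepage memory sits inside the nodeset.
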
Oh, now I get it.  We are doing our best to apply that policy to qemu
even when we don't have this option.  However, using this works even
better (which is probably* what we want).  And that's the reasoning
behind this.

 * I'm saying probably because when I was adding numactl binding to be
   used together with cgroups, I was told that we couldn't change the
   binding afterwards and it's bad.  I feel like we could do something
   with that and it would help us in the future, but there needs to be a
   discussion, I guess.  Because I might be one of the few =)

So to recapitulate, there are three ways to affect the allocation of
qemu's memory:

 1) numactl (libnuma): it works as expected, but cannot be changed later

 2) cgroups: so strict that it has to be applied after qemu has started;
    because of that it doesn't work right, especially for memory that
    gets pre-allocated (like hugepages).  It can be changed later, but
    the memory won't always migrate, so there is no guarantee after a
    change.  If cgroups are unavailable, we fall back to (1) anyway

 3) memory-backend-file's host-nodes=: this works as expected, but
    cannot be used with older QEMUs, cannot be changed later and in some
    cases (not your particular one) it might screw up migration if it
    wasn't used before.
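
To make the conflict between these options concrete, here is a rough sketch
that encodes only the properties listed above and filters on them.  The
class, field and function names are hypothetical and this is not how
libvirt selects a mechanism; it only illustrates why no single option
covers every case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mechanism:
    name: str
    works_for_prealloc: bool  # policy is honoured for pre-allocated memory
    changeable_later: bool    # binding can be changed after qemu starts
    needs_newer_qemu: bool    # unavailable with older QEMUs


# Properties as stated in the list above, nothing more.
MECHANISMS = [
    Mechanism("numactl", True, False, False),
    Mechanism("cgroups", False, True, False),
    Mechanism("memory-backend-file", True, False, True),
]


def candidates(preallocated, need_change_later, qemu_has_memory_backend):
    """Return the names of mechanisms that satisfy all the given constraints."""
    names = []
    for m in MECHANISMS:
        if preallocated and not m.works_for_prealloc:
            continue
        if need_change_later and not m.changeable_later:
            continue
        if m.needs_newer_qemu and not qemu_has_memory_backend:
            continue
        names.append(m.name)
    return names


# Hugepages, no later change needed, recent QEMU:
print(candidates(True, False, True))   # prints ['numactl', 'memory-backend-file']
# Hugepages and a binding that must be changeable later:
print(candidates(True, True, True))    # prints []
```

The empty result in the second call is the awkward case: pre-allocated
memory combined with a binding that has to change later is not served by
any of the three.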

Selecting the best option from these, plus making the code work with
every possibility (erroring out when you want to change the memory node
and we had to use (1), for example), is a pain.  We should really think
about that and reorganize these things for the sake of the future.
Otherwise we're going to overwhelm ourselves.  Cc'ing Peter to get
his thoughts too, since he worked on some parts of this.

Martin