On Fri, Oct 14, 2016 at 10:19:42AM +0200, Martin Kletzander wrote:
On Fri, Oct 14, 2016 at 11:52:22AM +1100, Sam Bobroff wrote:
>On Thu, Oct 13, 2016 at 11:34:43AM +0200, Martin Kletzander wrote:
>>On Thu, Oct 13, 2016 at 11:34:16AM +1100, Sam Bobroff wrote:
>>>On Wed, Oct 12, 2016 at 10:27:50AM +0200, Martin Kletzander wrote:
>>>>On Wed, Oct 12, 2016 at 03:04:53PM +1100, Sam Bobroff wrote:
>>>>>At the moment, guests that are backed by hugepages in the host are
>>>>>only able to use policy to control the placement of those hugepages
>>>>>on a per-(guest-)CPU basis. Policy applied globally is ignored.
>>>>>
>>>>>Such guests would use <memoryBacking><hugepages/></memoryBacking> and
>>>>>a <numatune> block with <memory mode=... nodeset=.../> but no <memnode
>>>>>.../> elements.
>>>>>
>>>>>This patch corrects this by, in this specific case, changing the QEMU
>>>>>command line from "-mem-prealloc -mem-path=..." (which cannot
>>>>>specify NUMA policy) to "-object memory-backend-file ..." (which can).
>>>>>
>>>>>Note: This is not visible to the guest and does not appear to create
>>>>>a migration incompatibility.
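
As a rough illustration of the command-line change described above, here is a minimal Python sketch that turns a numatune mode and nodeset into the value of a QEMU "-object memory-backend-file" option. It is not libvirt's code. The helper name, the backend id "ram0", the hugepage mount point and the size are all assumptions made up for the example; the mode-to-policy mapping (strict to bind) is my understanding of how libvirt translates it.

```python
# Sketch only: maps a libvirt <numatune><memory .../> setting onto the value
# of a QEMU "-object memory-backend-file" option. The helper name, backend id,
# path and size are hypothetical and chosen purely for illustration.

# libvirt numatune mode -> QEMU host memory backend "policy" value
# (assumed mapping; "strict" corresponds to a bind policy)
MODE_TO_POLICY = {
    "strict": "bind",
    "preferred": "preferred",
    "interleave": "interleave",
}


def backend_object_arg(mode, nodeset, mem_path, size_mib, backend_id="ram0"):
    """Build the comma-separated property list for one -object option.

    nodeset is assumed to be a single node ("0") or a single range ("0-1");
    a comma-separated nodeset would need one host-nodes= entry per part.
    """
    policy = MODE_TO_POLICY[mode]
    props = [
        "memory-backend-file",
        f"id={backend_id}",
        f"mem-path={mem_path}",
        f"size={size_mib}M",
        "prealloc=yes",
        f"host-nodes={nodeset}",
        f"policy={policy}",
    ]
    return ",".join(props)


if __name__ == "__main__":
    # Mirrors the 2G, strict, node 0 configuration discussed in this thread.
    print("-object", backend_object_arg("strict", "0", "/dev/hugepages", 2048))
```

The point of the sketch is only that the policy travels with the backend object itself, which is what the plain -mem-path form has no place for.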
>>>>>
>>>>
>>>>It could make sense, I haven't tried yet, though. However, I still
>>>>don't see the point in using memory-backend-file. Is it that when you
>>>>don't have cpuset cgroup the allocation doesn't work well? Because it
>>>>certainly does work for me.
>>>
>>>Thanks for taking a look at this :-)
>>>
>>>The point of using a memory-backend-file is that with it, the NUMA policy can
>>>be specified to QEMU, but with -mem-path it can't. It seems to be a way to tell
>>>QEMU to apply NUMA policy in the right place. It does seem odd to me to use
>>>memory-backend-file without attaching the backend to a guest NUMA node, but it
>>>seems to do the right thing in this case. (If there are guest NUMA nodes, or if
>>>hugepages aren't being used, policy is correctly applied.)
>>>
>>>I'll describe my test case in detail, perhaps there's something I don't understand
>>>happening.
>>>
>>>* I set up a machine with two (fake) NUMA nodes (0 and 1), with 2G of hugepages
>>> on node 1, and none on node 0.
>>>
>>>* I create a 2G guest using virt-install:
>>>
>>>virt-install --name ppc --memory=2048 --disk ~/tmp/tmp.qcow2 --cdrom ~/tmp/ubuntu-16.04-server-ppc64el.iso --wait 0 --virt-type qemu --memorybacking hugepages=on --graphics vnc --arch ppc64le
>>>
>>>* I "virsh destroy" and then "virsh edit" to add this block to the guest XML:
>>>
>>> <numatune>
>>>   <memory mode='strict' nodeset='0'/>
>>> </numatune>
>>>
>>>* "virsh start", and the machine starts (I believe it should fail
>>> due to insufficient memory satisfying the policy).
>>>* "numastat -p $(pidof qemu-system-ppc64)" shows something like this:
>>>
>>>Per-node process memory usage (in MBs) for PID 8048 (qemu-system-ppc)
>>>                          Node 0          Node 1           Total
>>>                 --------------- --------------- ---------------
>>>Huge                        0.00         2048.00         2048.00
>>>Heap                        8.12            0.00            8.12
>>>Stack                       0.03            0.00            0.03
>>>Private                    35.80            6.10           41.90
>>>---------------- --------------- --------------- ---------------
>>>Total                      43.95         2054.10         2098.05
>>>
>>>So it looks like it's allocated hugepages from node 1, isn't this violating the
>>>policy I set via numatune?
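
The check made by eye on the numastat output above can be sketched in Python roughly as follows. This is an illustrative helper, not part of numastat or libvirt: the function names are hypothetical, the sample text is the table quoted in this message, and it assumes the node columns appear in order starting at node 0 with a trailing Total column.

```python
# Sketch only: read one row of "numastat -p <pid>" output and report whether
# memory of that category sits on nodes outside an allowed set.
# Function names are hypothetical; columns are assumed to be Node 0..N-1
# followed by a final "Total" column.

SAMPLE = """\
Per-node process memory usage (in MBs) for PID 8048 (qemu-system-ppc)
                          Node 0          Node 1           Total
                 --------------- --------------- ---------------
Huge                        0.00         2048.00         2048.00
Heap                        8.12            0.00            8.12
Stack                       0.03            0.00            0.03
Private                    35.80            6.10           41.90
---------------- --------------- --------------- ---------------
Total                      43.95         2054.10         2098.05
"""


def per_node_mb(numastat_text, category):
    """Return the per-node MB values for one row, dropping the Total column."""
    for line in numastat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == category:
            return [float(value) for value in fields[1:-1]]
    raise KeyError(category)


def nodes_in_use(numastat_text, category="Huge"):
    """Return the set of node indexes holding any memory of this category."""
    values = per_node_mb(numastat_text, category)
    return {node for node, mb in enumerate(values) if mb > 0}


def violates_strict(numastat_text, allowed_nodes, category="Huge"):
    """True if the category uses any node outside allowed_nodes."""
    return not nodes_in_use(numastat_text, category) <= set(allowed_nodes)


if __name__ == "__main__":
    print(nodes_in_use(SAMPLE))
    print(violates_strict(SAMPLE, {0}))
```

With the sample above and nodeset '0', the hugepage row is entirely on node 1, which is the mismatch being asked about.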
>>>
>>
>>Oh, now I get it. We are doing our best to apply that policy to qemu
>>even when we don't have this option. However, using this works even
>>better (which is probably* what we want). And that's the reasoning
>>behind this.
>>
>>* I'm saying probably because when I was adding numactl binding to be
>> used together with cgroups, I was told that we couldn't change the
>> binding afterwards and it's bad. I feel like we could do something
>> with that and it would help us in the future, but there needs to be a
>> discussion, I guess. Because I might be one of the few =)
>>
>>So to recapitulate that, there are three options how to affect the
>>allocation of qemu's memory:
>>
>>1) numactl (libnuma): it works as expected, but cannot be changed later
>>
>>2) cgroups: so strict it has to be applied after qemu started, due to
>> that it doesn't work right, especially for stuff that gets all
>> pre-allocated (like hugepages). It can be changed later, but it
>> won't always mean the memory will migrate, so upon change there is
>> no guarantee. If it's unavailable, we fall back to (1) anyway
>>
>>3) memory-backend-file's host-nodes=: this works as expected, but
>> cannot be used with older QEMUs, cannot be changed later and in some
>> cases (not your particular one) it might screw up migration if it
>> wasn't used before.
>>
>>Selecting the best option from these, plus making the code work with
>>every possibility (erroring out when you want to change the memory node
>>and we had to use (1) for example) is a pain. We should really think
>>about that and reorganize these things for the better of the future.
>>Otherwise we're going to overwhelm ourselves. Cc'ing Peter to get
>>his thoughts, as he worked on some parts of this as well.
>>
>>Martin
>
>Thanks for the explanation, and I agree (I'm already a bit overwhelmed!) :-)
>
>What do you mean by "changed later"? Do you mean, if the domain XML is changed
>while the machine is running?
>
E.g. by 'virsh numatune domain 1-2'

Ah, thanks!
>I did look at the libnuma and cgroups approaches, but I was concerned they
>wouldn't work in this case, because of the way QEMU allocates memory when
>mem-prealloc is used: the memory is allocated in the main process, before the
>CPU threads are created. (This is based only on a bit of hacking and debugging
>in QEMU, but it does seem to explain the behaviour I've seen so far.)
>
But we use numactl before QEMU is exec()'d.

Sorry, I jumped ahead a bit. I'll try to explain what I mean:
I think the problem with using this method would be that the NUMA policy is
applied to all allocations by QEMU, not just ones related to the memory
backing. I'm not sure if that would cause a serious problem but it seems untidy,
and it doesn't happen in other situations (i.e. with separate memory backend
objects, QEMU sets up the policy specifically for each one and other
allocations aren't affected, AFAIK). Presumably, if memory were very
restricted it could prevent the guest from starting.
>If this is the case, it would seem to be a significant problem: if policy is
>set on the main thread, it will affect all allocations not just the VCPU
>memory and if it's set on the VCPU threads it won't catch the pre-allocation at
>all. (Is this what you were referring to by "it doesn't work right"?)
>
Kind of, yes.
>That was my reasoning for trying to use the backend object in this case; it was
>the only method that worked and did not require changes to QEMU. I'd prefer
>the other approaches if they could be made to work.
>
There is a workaround, you can disable the cpuset cgroup in
libvirtd.conf, but that's not what you want, I guess.

Thanks, but no, it doesn't seem to be what I want due to the above issue.
>I think QEMU could be altered to move the preallocations into the VCPU
>threads but it didn't seem trivial and I suspected the QEMU community would
>point out that there was already a way to do it using backend objects. Another
>option would be to add a -host-nodes parameter to QEMU so that the policy can
>be given without adding a memory backend object. (That seems like a more
>reasonable change to QEMU.)
>
I think upstream won't like that, mostly because there is already a
way. And that is using memory-backend object. I think we could just
use that and disable changing it live. But upstream will probably want
that to be configurable or something.

Right, but isn't this already an issue in the cases where libvirt is already
using memory backend objects and NUMA policy? (Or does libvirt already disable
changing it live in those situations?)
>Cheers,
>Sam.
>