On Mon, Aug 01, 2016 at 02:01:22PM +1000, Sam Bobroff wrote:
> Hi libvirt people,
>
> I've been looking at a (probable) bug and I'm not sure how to progress. The
> situation is a bit complicated and involves both QEMU and libvirt (and I think
> it may have been looked at already) so I would really appreciate some advice on
> how to approach it. I'm using a pretty recent master version of libvirt from
> git and I'm testing on a ppc64le host with a similar guest, but this doesn't
> seem to be arch-specific.
Sorry I haven't replied earlier; I'm respawning this thread now. I only
noticed it was marked for a reply after I had fixed something similar to
what you describe.
> If I create a QEMU guest (e.g. via virt-install) that requests both hugepage
> backing on the host and NUMA memory placement on the host, the NUMA placement
> seems to be ignored. If I do:
>
> # echo 0 > /proc/sys/vm/nr_hugepages
> # echo 512 > /sys/devices/system/node/node0/hugepages/hugepages-16384kB/nr_hugepages
> # virt-install --name tmp --memory=4096 --graphics none \
>       --memorybacking hugepages=yes --disk none --import --wait 0 --numatune=8
So to be clear: the guest should use 16M hugepages allocated only from
node 8, while the only preallocated ones are on node 0.
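(A quick way to see where the pages actually ended up is to read the
per-node counters back from sysfs, e.g.:

  # grep . /sys/devices/system/node/node*/hugepages/hugepages-16384kB/nr_hugepages

the hugepages-16384kB directory here matches the 16M page size from the
reproducer above.)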
> ... then hugepages are allocated on node 0 and the machine starts successfully,
> which seems like a bug.
This definitely is a bug. But I'm afraid it's not in libvirt. I'll do
some explaining first.
> I believe it should fail to start due to insufficient memory, and in fact that
> is what happens if cgroup support isn't detected in the host: there seems to be
> a fall-back path in libvirt (probably using mbind()) that works as I would
> expect.
Yes, we are using multiple things to enforce NUMA binding:

1) cgroups - This restricts all allocations made on the account of the
   process, even those that KVM does. We cannot use it until the QEMU
   process is running, because QEMU needs to allocate some data from
   the DMA region, which is usually only on one node.

2) numactl's mbind() - This doesn't apply to kernel allocations, so
   whatever KVM allocates is not restricted by it. We always apply it
   to the process before exec()-ing the QEMU binary (if compiled with
   numactl support, of course).

3) memory-backend-file's host-nodes parameter - This is the best
   option and is used when QEMU supports it, but due to migration
   compatibility we use it only when requested with <memnode/>
   elements (see the sketch below).
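Just to illustrate option 3: with <memnode/> in play, what ends up on
the QEMU command line is roughly this (a sketch only; the object id,
size and mem-path are illustrative, host-nodes= and policy= are the
parts that matter):

  -object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/dev/hugepages,size=4096M,host-nodes=8,policy=bind \
  -numa node,nodeid=0,memdev=ram-node0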
> Note: the relevant part of the guest XML seems to be this:
>
>     <memoryBacking>
>       <hugepages/>
>     </memoryBacking>
>     <numatune>
>       <memory mode='strict' nodeset='8'/>
>     </numatune>
> It seems fairly clear what is happening: although QEMU is capable of allocating
> hugepages on specific NUMA nodes (using "memory-backend-file"), libvirt is not
> passing those options to QEMU in this situation.
I'm guessing you're talking about the host-nodes= parameter.
> I investigated this line of reasoning, and if I hack libvirt to pass those
> options to QEMU it does indeed fix the problem... but it renders the machine
> state migration-incompatible with unfixed versions. This seems to have been why
> this hasn't been fixed already :-(
>
> So what can we do?
From the virt-install POV I would suggest adding some functionality that
probes whether <memnode/> works and uses it if possible. Or dig deep
down into KVM or QEMU to see why the allocations do not conform to the
mbind().
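(To check at runtime whether the mbind() is being honoured, numastat
from the numactl package can show the per-node memory usage of the
running process, e.g.:

  # numastat -p $(pidof qemu-system-ppc64)

the binary name there is just an example, of course.)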
> I assume it's not acceptable to just break migration with a bugfix, and I can
> only think of two ways to fix migration:
Unfortunately it's not, especially when it's not a bugfix, and that's
why the extra logic for it is in there.
> (a) Add a new flag to the XML and, for guests without the flag, maintain the
>     old buggy behaviour (and therefore migration compatibility).
It kinda is there, that's the <memnode/> setting.
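That is, something like this (a sketch; note that <memnode/> requires a
matching guest NUMA cell to be defined under <cpu>, and the cell layout
here is illustrative):

    <numatune>
      <memory mode='strict' nodeset='8'/>
      <memnode cellid='0' mode='strict' nodeset='8'/>
    </numatune>
    <cpu>
      <numa>
        <cell id='0' cpus='0-3' memory='4194304' unit='KiB'/>
      </numa>
    </cpu>

With that in the XML, libvirt can use the host-nodes= parameter shown
above; that's the migration-compatibility trade-off mentioned earlier.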
> (b) Hack QEMU so that migration can succeed between unfixed and fixed
>     versions. (And possibly also in the reverse direction?)
That's probably not possible to do.
> I don't like (a) because it's visible in the XML, and it would have to be
> carried forever (or at least a long time?).
>
> I don't really like (b) either because it's tricky, and even if it could be
> made to work reliably, it would add mess and risk to the migration code. I'm
> not sure how the QEMU community would feel about it either. However, I did hack
> up some code and it worked at least in some simple cases.
>
> Can anyone see a better approach? Is anyone already working on this?
Again, sorry for not replying earlier.
Martin