On Wed, 18 Nov 2015 17:26:54 +0100
Peter Krempa <pkrempa(a)redhat.com> wrote:
> On Wed, Nov 18, 2015 at 15:13:20 +0100, Andrea Bolognani wrote:
> The amount of memory a ppc64 domain might need to lock is different
> than that of an equally-sized x86 domain, so we need to check the
> domain's architecture and act accordingly.
>
> Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1273480
> ---
> src/qemu/qemu_domain.c | 80 +++++++++++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 79 insertions(+), 1 deletion(-)
>
> ACK, although I'd like to hear David's opinion (cc'd).
So, as Andrea said, the text in the comments is mostly mine, so this
pretty much matches what I suggested.
I still haven't had a chance to investigate the original failing case
more deeply to see exactly what was going on, so I am concerned I might
have missed something.
But the code presented here is certainly closer to correct than the
previous code. Even if I/we have missed some things, the version
Andrea suggests should have the right overall structure, so it will be
simpler to tweak than the old code.
I'll make a couple of extra points to help explain why Power has these
extra sources of locked memory, even without VFIO [1].
First, on x86 the guest's page tables exist within the guest's regular
memory space. On Power the PAPR paravirtualized environment has the
page table ("hash page table"[2]) outside the guest's memory space,
accessed an entry at a time via hypercalls. The hash page table cannot
be swapped or paged itself, so it should be accounted as locked memory
(although it actually isn't right now).
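
To make the hash page table point concrete, here is a rough sketch of
how one might estimate its size for accounting purposes. The helper
names are hypothetical, and the sizing rule (a fixed fraction of maximum
guest memory, rounded up to a power of two) is an assumption for
illustration only, not something taken from Andrea's patch. The ratio
is passed in by the caller for that reason.

```c
#include <stdint.h>

/* Round v up to the next power of two; returns 1 for v == 0. */
static uint64_t round_up_pow2(uint64_t v)
{
    uint64_t p = 1;

    while (p < v)
        p <<= 1;
    return p;
}

/*
 * Hypothetical helper: estimate the hash page table size in KiB.
 *
 * ASSUMPTION: the table is sized as max_mem_kib / ratio, rounded up to
 * a power of two. The real sizing policy belongs to the hypervisor;
 * 'ratio' is a caller-supplied knob, not a value from the patch.
 */
static uint64_t hpt_size_kib(uint64_t max_mem_kib, uint64_t ratio)
{
    uint64_t raw = (max_mem_kib + ratio - 1) / ratio; /* ceiling division */

    return round_up_pow2(raw);
}
```

With an assumed ratio of 128, a 4 GiB guest (4194304 KiB) would come out
at 32768 KiB of hash page table, all of which would need to count
towards the locked memory limit.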
Second, under PAPR, the guest always sees an IOMMU, and it's always
turned on (PAPR just doesn't have the concept of "no IOMMU"). On x86
although the host uses an IOMMU to implement VFIO, it's not usually
visible to the guest. Even when there is a guest-visible IOMMU on x86,
its page tables again exist within the guest memory space. With PAPR
the IOMMU page tables ("TCE tables") again exist outside the guest
memory space. Those TCE tables can end up either in normal qemu memory
or in kernel memory, depending on which combination of VFIO and our KVM
IOMMU acceleration for emulated devices is in use. But under at least
some combinations they are again unswappable memory, and so we should
account for them as locked.
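
As a back-of-the-envelope sketch of what a TCE table costs: the table
needs one entry per IOMMU page in the DMA window. The function name
below is hypothetical, and the 8-byte entry size is my assumption
(each TCE being a 64-bit value); the window and page sizes are inputs,
not values from the patch.

```c
#include <stdint.h>

/* ASSUMPTION: each TCE is a single 64-bit entry. */
#define TCE_ENTRY_BYTES 8

/*
 * Hypothetical helper: bytes of TCE table needed to map a DMA window
 * of window_bytes using IOMMU pages of iommu_page_bytes each.
 */
static uint64_t tce_table_bytes(uint64_t window_bytes,
                                uint64_t iommu_page_bytes)
{
    /* One entry per IOMMU page, rounding a partial page up. */
    uint64_t entries = (window_bytes + iommu_page_bytes - 1) /
                       iommu_page_bytes;

    return entries * TCE_ENTRY_BYTES;
}
```

Under those assumptions, a 2 GiB window with 4 KiB IOMMU pages needs
524288 entries, i.e. 4 MiB of table per window, which is the kind of
per-host-bridge overhead that would have to be added to the limit.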
[1] Or at least, it might in future: Andrea is accounting for several
things that don't actually impact locked_vm now, but probably should.
[2] Complete aside. The Power MMU works very differently from the x86
MMU (or indeed the MMU on any other arch I know of), using a hash
table to locate PTEs, rather than a radix tree (PGDs -> PUDs -> PMDs ->
PTEs). IBM Research were/are terribly proud of the design which
apparently had significant advantages for big database loads with a
widely scattered working set - advantages which have been completely
swamped by horrible cache behaviour for most of the last 15 years. It
also requires a big slab of physically contiguous memory for the hash
table, which is a bit of a pain for us. Linux actually treats the hash
page table as though it were an enormous TLB, reloading it as necessary
from radix-style page tables.
--
David Gibson <dgibson(a)redhat.com>
Senior Software Engineer, Virtualization, Red Hat