[libvirt] [PATCH 0/6] Fixes for memory locking limit

The code dealing with RLIMIT_MEMLOCK contained a few assumptions that
hold for x86 but don't necessarily work as well for other platforms,
eg. that qemu will need to lock memory only when VFIO passthrough is
involved.

This series removes such assumptions by replacing ad-hoc code with
calls to the function that contains the correct logic; it also
implements the platform-specific math needed to determine the correct
RLIMIT_MEMLOCK value for ppc64 guests.

Patches 1-4 are architecture-agnostic; patches 5-6 are specific to
ppc64.

Cheers.

Andrea Bolognani (6):
  process: Log when limiting the amount of locked memory
  qemu: Use qemuDomainRequiresMlock() in qemuBuildCommandLine()
  qemu: Use qemuDomainRequiresMlock() when attaching PCI hostdev
  qemu: Reduce memlock limit after detaching hostdev
  qemu: Always set locked memory limit for ppc64 domains
  qemu: Add ppc64-specific math to qemuDomainGetMlockLimitBytes()

 src/qemu/qemu_command.c |  9 ++---
 src/qemu/qemu_domain.c  | 89 +++++++++++++++++++++++++++++++++++++++++++++++--
 src/qemu/qemu_hotplug.c | 29 ++++++++++++----
 src/util/virprocess.c   |  4 +++
 4 files changed, 116 insertions(+), 15 deletions(-)

--
2.5.0

This can be useful for debugging.
---
 src/util/virprocess.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/src/util/virprocess.c b/src/util/virprocess.c
index 103c114..4b18903 100644
--- a/src/util/virprocess.c
+++ b/src/util/virprocess.c
@@ -766,6 +766,10 @@ virProcessSetMaxMemLock(pid_t pid, unsigned long long bytes)
             return -1;
         }
     }
+
+    VIR_DEBUG("Locked memory for process %lld limited to %llu bytes",
+              (long long int) pid, bytes);
+
     return 0;
 }
 #else /* ! (HAVE_SETRLIMIT && defined(RLIMIT_MEMLOCK)) */

--
2.5.0
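For context, virProcessSetMaxMemLock() is essentially a thin wrapper
around setrlimit(2)/prlimit(2). A minimal standalone sketch of the same
idea (simplified; the real function reports errors through
virReportSystemError(), handles the pid == 0 case via setrlimit(), and
is guarded by HAVE_SETRLIMIT && defined(RLIMIT_MEMLOCK)):

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/resource.h>
#include <sys/types.h>

/* Simplified sketch of virProcessSetMaxMemLock(): cap the amount of
 * memory @pid may lock to @bytes. bytes == 0 means "no change". */
static int
set_max_mem_lock(pid_t pid, unsigned long long bytes)
{
    struct rlimit rlim;

    if (bytes == 0)
        return 0;

    /* Raise both the soft and the hard limit */
    rlim.rlim_cur = rlim.rlim_max = bytes;

    /* prlimit(2) can adjust the limit of an already-running process,
     * which is what the hotplug paths later in this series rely on */
    if (prlimit(pid, RLIMIT_MEMLOCK, &rlim, NULL) < 0) {
        perror("prlimit");
        return -1;
    }

    /* The log message this patch adds */
    fprintf(stderr, "Locked memory for process %lld limited to %llu bytes\n",
            (long long int) pid, bytes);
    return 0;
}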

On Wed, Nov 18, 2015 at 15:13:15 +0100, Andrea Bolognani wrote:
This can be useful for debugging.
---
 src/util/virprocess.c | 4 ++++
 1 file changed, 4 insertions(+)
diff --git a/src/util/virprocess.c b/src/util/virprocess.c
index 103c114..4b18903 100644
--- a/src/util/virprocess.c
+++ b/src/util/virprocess.c
@@ -766,6 +766,10 @@ virProcessSetMaxMemLock(pid_t pid, unsigned long long bytes)
             return -1;
         }
     }
+
+    VIR_DEBUG("Locked memory for process %lld limited to %llu bytes",
+              (long long int) pid, bytes);
+
ACK

Peter

This removes a duplication of the logic used to decide whether the
memory locking limit should be set.
---
 src/qemu/qemu_command.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
index 8fdf90c..3e9eb8c 100644
--- a/src/qemu/qemu_command.c
+++ b/src/qemu/qemu_command.c
@@ -9135,7 +9135,6 @@ qemuBuildCommandLine(virConnectPtr conn,
     int usbcontroller = 0;
     int actualSerials = 0;
     bool usblegacy = false;
-    bool mlock = false;
     int contOrder[] = {
         /*
          * List of controller types that we add commandline args for,
@@ -9304,7 +9303,6 @@ qemuBuildCommandLine(virConnectPtr conn,
             virCommandAddArgFormat(cmd, "mlock=%s",
                                    def->mem.locked ? "on" : "off");
         }
-        mlock = def->mem.locked;

     virCommandAddArg(cmd, "-smp");
     if (!(smp = qemuBuildSmpArgStr(def, qemuCaps)))
@@ -10869,9 +10867,6 @@ qemuBuildCommandLine(virConnectPtr conn,
                                  "supported by this version of qemu"));
                 goto error;
             }
-            /* VFIO requires all of the guest's memory to be locked
-             * resident */
-            mlock = true;
         }

     if (virQEMUCapsGet(qemuCaps, QEMU_CAPS_DEVICE)) {
@@ -11124,7 +11119,9 @@ qemuBuildCommandLine(virConnectPtr conn,
             goto error;
     }

-    if (mlock)
+    /* In some situations, eg. VFIO passthrough, QEMU might need to lock a
+     * significant amount of memory, so we need to set the limit accordingly */
+    if (qemuDomainRequiresMlock(def))
         virCommandSetMaxMemLock(cmd, qemuDomainGetMlockLimitBytes(def));

     if (virQEMUCapsGet(qemuCaps, QEMU_CAPS_MSG_TIMESTAMP) &&
--
2.5.0

On Wed, Nov 18, 2015 at 15:13:16 +0100, Andrea Bolognani wrote:
This removes a duplication of the logic used to decide whether the
memory locking limit should be set.
---
 src/qemu/qemu_command.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)
ACK,

Peter

The function is used everywhere else to check whether the locked
memory limit should be set / updated, and it should be used here as
well.

Moreover, qemuDomainGetMlockLimitBytes() expects the hostdev to have
already been added to the domain definition, but we only do that at
the end of qemuDomainAttachHostPCIDevice(). Work around the issue by
adding the hostdev before adjusting the locked memory limit and
removing it immediately afterwards.
---
 src/qemu/qemu_hotplug.c | 21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/src/qemu/qemu_hotplug.c b/src/qemu/qemu_hotplug.c
index 89e5c0d..0bd88ce 100644
--- a/src/qemu/qemu_hotplug.c
+++ b/src/qemu/qemu_hotplug.c
@@ -1269,18 +1269,27 @@ qemuDomainAttachHostPCIDevice(virQEMUDriverPtr driver,
                              "supported by this version of qemu"));
             goto error;
         }
-
-        /* setup memory locking limits, that are necessary for VFIO */
-        if (virProcessSetMaxMemLock(vm->pid,
-                                    qemuDomainGetMlockLimitBytes(vm->def)) < 0)
-            goto error;
-
         break;

     default:
         break;
     }

+    /* Temporarily add the hostdev to the domain definition. This is needed
+     * because qemuDomainRequiresMlock() and qemuDomainGetMlockLimitBytes()
+     * require the hostdev to be already part of the domain definition, but
+     * other functions like qemuAssignDeviceHostdevAlias() used below expect
+     * it *not* to be there. A better way to handle this would be nice */
+    vm->def->hostdevs[vm->def->nhostdevs++] = hostdev;
+    if (qemuDomainRequiresMlock(vm->def)) {
+        if (virProcessSetMaxMemLock(vm->pid,
+                                    qemuDomainGetMlockLimitBytes(vm->def)) < 0) {
+            vm->def->hostdevs[--(vm->def->nhostdevs)] = NULL;
+            goto error;
+        }
+    }
+    vm->def->hostdevs[--(vm->def->nhostdevs)] = NULL;
+
     if (qemuSetupHostdevCGroup(vm, hostdev) < 0)
         goto error;
     teardowncgroup = true;
--
2.5.0

On Wed, Nov 18, 2015 at 15:13:17 +0100, Andrea Bolognani wrote:
The function is used everywhere else to check whether the locked
memory limit should be set / updated, and it should be used here as
well.

Moreover, qemuDomainGetMlockLimitBytes() expects the hostdev to have
already been added to the domain definition, but we only do that at
the end of qemuDomainAttachHostPCIDevice(). Work around the issue by
adding the hostdev before adjusting the locked memory limit and
removing it immediately afterwards.
---
 src/qemu/qemu_hotplug.c | 21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/src/qemu/qemu_hotplug.c b/src/qemu/qemu_hotplug.c
index 89e5c0d..0bd88ce 100644
--- a/src/qemu/qemu_hotplug.c
+++ b/src/qemu/qemu_hotplug.c
@@ -1269,18 +1269,27 @@ qemuDomainAttachHostPCIDevice(virQEMUDriverPtr driver,
                              "supported by this version of qemu"));
             goto error;
         }
-
-        /* setup memory locking limits, that are necessary for VFIO */
-        if (virProcessSetMaxMemLock(vm->pid,
-                                    qemuDomainGetMlockLimitBytes(vm->def)) < 0)
-            goto error;
-
         break;

     default:
         break;
     }

+    /* Temporarily add the hostdev to the domain definition. This is needed
+     * because qemuDomainRequiresMlock() and qemuDomainGetMlockLimitBytes()
+     * require the hostdev to be already part of the domain definition, but
+     * other functions like qemuAssignDeviceHostdevAlias() used below expect
+     * it *not* to be there. A better way to handle this would be nice */
+    vm->def->hostdevs[vm->def->nhostdevs++] = hostdev;
+    if (qemuDomainRequiresMlock(vm->def)) {
+        if (virProcessSetMaxMemLock(vm->pid,
+                                    qemuDomainGetMlockLimitBytes(vm->def)) < 0) {
+            vm->def->hostdevs[--(vm->def->nhostdevs)] = NULL;
+            goto error;
+        }
+    }
+    vm->def->hostdevs[--(vm->def->nhostdevs)] = NULL;
+
     if (qemuSetupHostdevCGroup(vm, hostdev) < 0)
         goto error;
     teardowncgroup = true;
Hmm, ugly, but required for the PPC64 calculation in the following
patches. I guess reworking the hotplug code would be rather ugly too,
so reluctant ACK.

Peter
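For what it's worth, if the hotplug code ever gets reworked, the dance
could at least be contained in one place. A hypothetical helper (name
and structure mine, not part of this series; assumes the usual libvirt
internal headers) that hides the temporary mutation from callers:

/* Hypothetical helper, not in the tree: temporarily extend the domain
 * definition with @hostdev so that qemuDomainRequiresMlock() and
 * qemuDomainGetMlockLimitBytes() see it, apply the resulting limit,
 * then restore the definition. Assumes hostdevs[] has a free slot,
 * exactly as the open-coded version in the patch does. */
static int
qemuDomainAdjustMaxMemLockHostdev(virDomainObjPtr vm,
                                  virDomainHostdevDefPtr hostdev)
{
    int ret = 0;

    vm->def->hostdevs[vm->def->nhostdevs++] = hostdev;

    if (qemuDomainRequiresMlock(vm->def) &&
        virProcessSetMaxMemLock(vm->pid,
                                qemuDomainGetMlockLimitBytes(vm->def)) < 0)
        ret = -1;

    /* Undo the temporary addition before anyone else sees it */
    vm->def->hostdevs[--(vm->def->nhostdevs)] = NULL;

    return ret;
}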

We increase the limit before plugging in a PCI hostdev or a memory
module because some memory might need to be locked due to eg. VFIO.

Of course we should do the opposite after unplugging a device: this
was already the case for memory modules, but not for hostdevs.
---
 src/qemu/qemu_hotplug.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/src/qemu/qemu_hotplug.c b/src/qemu/qemu_hotplug.c
index 0bd88ce..2a16f89 100644
--- a/src/qemu/qemu_hotplug.c
+++ b/src/qemu/qemu_hotplug.c
@@ -3084,6 +3084,14 @@ qemuDomainRemoveHostDevice(virQEMUDriverPtr driver,
         networkReleaseActualDevice(vm->def, net);
         virDomainNetDefFree(net);
     }
+
+    /* QEMU might no longer need to lock as much memory, eg. we just detached
+     * a VFIO device, so adjust the limit here */
+    if (qemuDomainRequiresMlock(vm->def))
+        if (virProcessSetMaxMemLock(vm->pid,
+                                    qemuDomainGetMlockLimitBytes(vm->def)) < 0)
+            VIR_WARN("Failed to adjust locked memory limit");
+
     ret = 0;

 cleanup:
--
2.5.0

On Wed, Nov 18, 2015 at 15:13:18 +0100, Andrea Bolognani wrote:
We increase the limit before plugging in a PCI hostdev or a memory
module because some memory might need to be locked due to eg. VFIO.

Of course we should do the opposite after unplugging a device: this
was already the case for memory modules, but not for hostdevs.
---
 src/qemu/qemu_hotplug.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/src/qemu/qemu_hotplug.c b/src/qemu/qemu_hotplug.c
index 0bd88ce..2a16f89 100644
--- a/src/qemu/qemu_hotplug.c
+++ b/src/qemu/qemu_hotplug.c
@@ -3084,6 +3084,14 @@ qemuDomainRemoveHostDevice(virQEMUDriverPtr driver,
         networkReleaseActualDevice(vm->def, net);
         virDomainNetDefFree(net);
     }
+
+    /* QEMU might no longer need to lock as much memory, eg. we just detached
+     * a VFIO device, so adjust the limit here */
+    if (qemuDomainRequiresMlock(vm->def))
+        if (virProcessSetMaxMemLock(vm->pid,
+                                    qemuDomainGetMlockLimitBytes(vm->def)) < 0)
+            VIR_WARN("Failed to adjust locked memory limit");
+
Hmmm, looks like we should reset it to default (64KiB afaik) if it was
required before and is not required any more. Otherwise we would not
decrease the limit after unplugging the last VFIO device (on x86).

Peter

On Wed, 2015-11-18 at 16:17 +0100, Peter Krempa wrote:
+    /* QEMU might no longer need to lock as much memory, eg. we just detached
+     * a VFIO device, so adjust the limit here */
+    if (qemuDomainRequiresMlock(vm->def))
+        if (virProcessSetMaxMemLock(vm->pid,
+                                    qemuDomainGetMlockLimitBytes(vm->def)) < 0)
+            VIR_WARN("Failed to adjust locked memory limit");
+
Hmmm, looks like we should reset it to default (64KiB afaik) if it was required before and is not required any more. Otherwise we would not decrease the limit after unplugging the last VFIO device (on x86).
I agree, and I planned to do something about that in a follow-up patch,
as this change alone is already a small improvement over the status
quo.

Would you prefer it if I pulled this patch from the series for now and
posted it again once it supports restoring the limit back to the
default after the last VFIO device has been removed from the guest? I'd
be okay with that.

Cheers.

--
Andrea Bolognani
Software Engineer - Virtualization Team

On Wed, Nov 18, 2015 at 17:04:55 +0100, Andrea Bolognani wrote:
On Wed, 2015-11-18 at 16:17 +0100, Peter Krempa wrote:
+    /* QEMU might no longer need to lock as much memory, eg. we just detached
+     * a VFIO device, so adjust the limit here */
+    if (qemuDomainRequiresMlock(vm->def))
+        if (virProcessSetMaxMemLock(vm->pid,
+                                    qemuDomainGetMlockLimitBytes(vm->def)) < 0)
+            VIR_WARN("Failed to adjust locked memory limit");
+
Hmmm, looks like we should reset it to default (64KiB afaik) if it was required before and is not required any more. Otherwise we would not decrease the limit after unplugging the last VFIO device (on x86).
I agree, and I planned to do something about that in a follow-up patch as this change alone is already a small improvement over the status quo.
Would you prefer it if I pulled this patch from the series for now and posted it again once it supports restoring the limit back to the default once the last VFIO device has been removed from the guest? I'd be okay with that.
Hmm, yeah, I'd prefer to do it the right way the first time.

Peter
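A sketch of what doing it "the right way" might look like
(hypothetical: priv->originalMemlock and qemuProcessGetMaxMemLock() do
not exist in the tree at this point; they stand in for "remember the
pre-libvirt limit" and "read a process's current RLIMIT_MEMLOCK"
respectively):

/* Hypothetical sketch, not part of this series: raise the limit when
 * the domain needs memory locked, restore the previous one when it no
 * longer does. */
static int
qemuDomainAdjustMaxMemLock(virDomainObjPtr vm)
{
    qemuDomainObjPrivatePtr priv = vm->privateData;
    unsigned long long bytes = 0;

    if (qemuDomainRequiresMlock(vm->def)) {
        /* Save the original limit the first time we raise it, so it
         * can be restored once the last VFIO device goes away */
        if (priv->originalMemlock == 0 &&
            qemuProcessGetMaxMemLock(vm->pid, &priv->originalMemlock) < 0)
            return -1;
        bytes = qemuDomainGetMlockLimitBytes(vm->def);
    } else {
        /* No mlock requirement left: restore the previous limit
         * (the 64 KiB kernel default in the common case) */
        bytes = priv->originalMemlock;
        priv->originalMemlock = 0;
    }

    /* virProcessSetMaxMemLock() treats bytes == 0 as "nothing to do" */
    return virProcessSetMaxMemLock(vm->pid, bytes);
}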

Unlike other architectures, ppc64 domains need to lock memory even
when VFIO is not used.

Change qemuDomainRequiresMlock() to reflect this fact.
---
 src/qemu/qemu_domain.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index 1f73709..2c0f5af 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -3836,8 +3836,9 @@ qemuDomainGetMlockLimitBytes(virDomainDefPtr def)
 /**
  * @def: domain definition
  *
- * Returns ture if the locked memory limit needs to be set or updated due to
- * configuration or passthrough devices.
+ * Returns true if the locked memory limit needs to be set or updated because
+ * of domain configuration, VFIO passthrough devices or architecture-specific
+ * requirements.
  * */
 bool
 qemuDomainRequiresMlock(virDomainDefPtr def)
@@ -3847,6 +3848,10 @@ qemuDomainRequiresMlock(virDomainDefPtr def)
     if (def->mem.locked)
         return true;

+    /* ppc64 domains need to lock some memory even when VFIO is not used */
+    if (ARCH_IS_PPC64(def->os.arch))
+        return true;
+
     for (i = 0; i < def->nhostdevs; i++) {
         virDomainHostdevDefPtr dev = def->hostdevs[i];

--
2.5.0

On Wed, Nov 18, 2015 at 15:13:19 +0100, Andrea Bolognani wrote:
Unlike other architectures, ppc64 domains need to lock memory even when VFIO is not used.
Change qemuDomainRequiresMlock() to reflect this fact.
---
 src/qemu/qemu_domain.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)
ACK, but I think this should go in after 6/6, which adds the magic
calculation for PPC.

Peter

On Wed, 2015-11-18 at 16:11 +0100, Peter Krempa wrote:
On Wed, Nov 18, 2015 at 15:13:19 +0100, Andrea Bolognani wrote:
Unlike other architectures, ppc64 domains need to lock memory even when VFIO is not used.
Change qemuDomainRequiresMlock() to reflect this fact.
---
 src/qemu/qemu_domain.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)
ACK, but I think this should go in after 6/6 that adds the magic calculation for PPC.
Sure, I'll change the order before pushing.

Cheers.

--
Andrea Bolognani
Software Engineer - Virtualization Team

The amount of memory a ppc64 domain might need to lock is different
than that of an equally-sized x86 domain, so we need to check the
domain's architecture and act accordingly.

Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1273480
---
 src/qemu/qemu_domain.c | 80 +++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 79 insertions(+), 1 deletion(-)

diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index 2c0f5af..1e92b9d 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -3793,7 +3793,7 @@ qemuDomainUpdateCurrentMemorySize(virQEMUDriverPtr driver,
  * @def: domain definition
  *
  * Returns the size of the memory in bytes that needs to be set as
- * RLIMIT_MEMLOCK for purpose of VFIO device passthrough.
+ * RLIMIT_MEMLOCK for the QEMU process.
  * If a mem.hard_limit is set, then that value is preferred; otherwise, the
  * value returned may depend upon the architecture or devices present.
  */
@@ -3808,6 +3808,84 @@ qemuDomainGetMlockLimitBytes(virDomainDefPtr def)
         goto done;
     }

+    if (ARCH_IS_PPC64(def->os.arch)) {
+        unsigned long long maxMemory;
+        unsigned long long memory;
+        unsigned long long baseLimit;
+        unsigned long long passthroughLimit;
+        size_t nPCIHostBridges;
+        size_t i;
+        bool usesVFIO = false;
+
+        /* TODO: Detect at runtime once we start using more than just
+         *       the default PCI Host Bridge */
+        nPCIHostBridges = 1;
+
+        for (i = 0; i < def->nhostdevs; i++) {
+            virDomainHostdevDefPtr dev = def->hostdevs[i];
+
+            if (dev->mode == VIR_DOMAIN_HOSTDEV_MODE_SUBSYS &&
+                dev->source.subsys.type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI &&
+                dev->source.subsys.u.pci.backend == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_VFIO) {
+                usesVFIO = true;
+                break;
+            }
+        }
+
+        memory = virDomainDefGetMemoryActual(def);
+
+        if (def->mem.max_memory)
+            maxMemory = def->mem.max_memory;
+        else
+            maxMemory = memory;
+
+        /* baseLimit := maxMemory / 128                                  (a)
+         *              + 4 MiB * #PHBs + 8 MiB                          (b)
+         *
+         * (a) is the hash table
+         *
+         * (b) is accounting for the 32-bit DMA window - it could be either the
+         * KVM accelerated TCE tables for emulated devices, or the VFIO
+         * userspace view. The 4 MiB per-PHB (including the default one) covers
+         * a 2GiB DMA window: default is 1GiB, but it's possible it'll be
+         * increased to help performance. The 8 MiB extra should be plenty for
+         * the TCE table index for any reasonable number of PHBs and several
+         * spapr-vlan or spapr-vscsi devices (512kB + a tiny bit each) */
+        baseLimit = maxMemory / 128 +
+                    4096 * nPCIHostBridges +
+                    8192;
+
+        /* passthroughLimit := max( 2 GiB * #PHBs,                       (c)
+         *                          memory                               (d)
+         *                          + memory * 1/512 * #PHBs + 8 MiB )   (e)
+         *
+         * (c) is the pre-DDW VFIO DMA window accounting. We're allowing 2 GiB
+         * rather than 1 GiB
+         *
+         * (d) is the with-DDW (and memory pre-registration and related
+         * features) DMA window accounting - assuming that we only account RAM
+         * once, even if mapped to multiple PHBs
+         *
+         * (e) is the with-DDW userspace view and overhead for the 64-bit DMA
+         * window. This is based a bit on expected guest behaviour, but there
+         * really isn't a way to completely avoid that. We assume the guest
+         * requests a 64-bit DMA window (per PHB) just big enough to map all
+         * its RAM. 4 kiB page size gives the 1/512; it will be less with 64
+         * kiB pages, less still if the guest is mapped with hugepages (unlike
+         * the default 32-bit DMA window, DDW windows can use large IOMMU
+         * pages). 8 MiB is for second and further level overheads, like (b) */
+        passthroughLimit = MAX(2 * 1024 * 1024 * nPCIHostBridges,
+                               memory +
+                               memory / 512 * nPCIHostBridges + 8192);
+
+        if (usesVFIO)
+            memKB = baseLimit + passthroughLimit;
+        else
+            memKB = baseLimit;
+
+        goto done;
+    }
+
     /* For device passthrough using VFIO the guest memory and MMIO memory
      * regions need to be locked persistent in order to allow DMA.
      *
--
2.5.0
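To make the formulas concrete, here is a worked example (numbers mine,
for illustration only): a 16 GiB ppc64 guest (memory = maxMemory =
16777216 KiB; like memKB, all values below are in KiB) with the single
default PHB and at least one VFIO device would get:

    baseLimit        = 16777216 / 128 + 4096 * 1 + 8192
                     = 143360 KiB                          (140 MiB)

    passthroughLimit = MAX(2097152,
                           16777216 + 16777216 / 512 * 1 + 8192)
                     = 16818176 KiB                         (16 GiB + 40 MiB)

    memKB            = baseLimit + passthroughLimit
                     = 16961536 KiB                         (~16.2 GiB)

In other words, with VFIO the limit has to cover essentially all of
guest RAM plus some overhead, while the same guest without VFIO only
needs the ~140 MiB baseLimit.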

On Wed, Nov 18, 2015 at 15:13:20 +0100, Andrea Bolognani wrote:
The amount of memory a ppc64 domain might need to lock is different
than that of an equally-sized x86 domain, so we need to check the
domain's architecture and act accordingly.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1273480
---
 src/qemu/qemu_domain.c | 80 +++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 79 insertions(+), 1 deletion(-)
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index 2c0f5af..1e92b9d 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c
[...]
+        /* baseLimit := maxMemory / 128                                  (a)
+         *              + 4 MiB * #PHBs + 8 MiB                          (b)
+         *
+         * (a) is the hash table
+         *
+         * (b) is accounting for the 32-bit DMA window - it could be either the
+         * KVM accelerated TCE tables for emulated devices, or the VFIO
+         * userspace view. The 4 MiB per-PHB (including the default one) covers
+         * a 2GiB DMA window: default is 1GiB, but it's possible it'll be
+         * increased to help performance. The 8 MiB extra should be plenty for
+         * the TCE table index for any reasonable number of PHBs and several
+         * spapr-vlan or spapr-vscsi devices (512kB + a tiny bit each) */
+        baseLimit = maxMemory / 128 +
+                    4096 * nPCIHostBridges +
+                    8192;
+
+        /* passthroughLimit := max( 2 GiB * #PHBs,                       (c)
+         *                          memory                               (d)
+         *                          + memory * 1/512 * #PHBs + 8 MiB )   (e)
+         *
+         * (c) is the pre-DDW VFIO DMA window accounting. We're allowing 2 GiB
+         * rather than 1 GiB
+         *
+         * (d) is the with-DDW (and memory pre-registration and related
+         * features) DMA window accounting - assuming that we only account RAM
+         * once, even if mapped to multiple PHBs
+         *
+         * (e) is the with-DDW userspace view and overhead for the 64-bit DMA
+         * window. This is based a bit on expected guest behaviour, but there
+         * really isn't a way to completely avoid that. We assume the guest
+         * requests a 64-bit DMA window (per PHB) just big enough to map all
+         * its RAM. 4 kiB page size gives the 1/512; it will be less with 64
+         * kiB pages, less still if the guest is mapped with hugepages (unlike
+         * the default 32-bit DMA window, DDW windows can use large IOMMU
+         * pages). 8 MiB is for second and further level overheads, like (b) */
+        passthroughLimit = MAX(2 * 1024 * 1024 * nPCIHostBridges,
+                               memory +
+                               memory / 512 * nPCIHostBridges + 8192);
+
+        if (usesVFIO)
+            memKB = baseLimit + passthroughLimit;
+        else
+            memKB = baseLimit;
Is there any public documentation you were taking the info from?
Should we link to it in the code?

Peter

On Wed, 2015-11-18 at 16:39 +0100, Peter Krempa wrote:
Is there any public documentation you were taking the info from?
Should we link to it in the code?
The information was provided by David Gibson, and AFAIU what he did is
go through the kernel / qemu code and take note of all the bits of
memory that need to be locked. So there's no comprehensive
documentation that I know of, except for the code itself :)

Cheers.

--
Andrea Bolognani
Software Engineer - Virtualization Team

On Wed, Nov 18, 2015 at 15:13:20 +0100, Andrea Bolognani wrote:
The amount of memory a ppc64 domain might need to lock is different
than that of an equally-sized x86 domain, so we need to check the
domain's architecture and act accordingly.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1273480
---
 src/qemu/qemu_domain.c | 80 +++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 79 insertions(+), 1 deletion(-)
ACK, although I'd like to hear David's opinion (cc'd).

Peter

On Wed, 18 Nov 2015 17:26:54 +0100 Peter Krempa <pkrempa@redhat.com> wrote:
On Wed, Nov 18, 2015 at 15:13:20 +0100, Andrea Bolognani wrote:
The amount of memory a ppc64 domain might need to lock is different
than that of an equally-sized x86 domain, so we need to check the
domain's architecture and act accordingly.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1273480
---
 src/qemu/qemu_domain.c | 80 +++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 79 insertions(+), 1 deletion(-)
ACK, although I'd like to hear David's opinion (cc'd).
So, as Andrea said, the text in the comments is mostly mine, so this
pretty much matches what I suggested.

I still haven't had a chance to investigate the original failing case
more deeply to see exactly what was going on, so I am concerned I might
have missed something. But the code presented here is certainly closer
to correct than the previous code, and even if I/we have missed some
things, the version Andrea suggests should have the right overall
structure, so it will be simpler to tweak than the old code.

I'll make a couple of extra points to help explain why Power has these
extra sources of locked memory, even without VFIO [1].

First, on x86 the guest's page tables exist within the guest's regular
memory space. On Power, the PAPR paravirtualized environment has the
page table ("hash page table" [2]) outside the guest's memory space,
accessed an entry at a time via hypercalls. The hash page table cannot
be swapped or paged itself, so it should be accounted as locked memory
(although it actually isn't right now).

Second, under PAPR the guest always sees an IOMMU, and it's always
turned on (PAPR just doesn't have the concept of "no IOMMU"). On x86,
although the host uses an IOMMU to implement VFIO, it's not usually
visible to the guest, and even when there is a guest-visible IOMMU on
x86, its page tables again exist within the guest memory space. With
PAPR, the IOMMU page tables ("TCE tables") again exist outside the
guest memory space. Those TCE tables can end up either in normal qemu
memory or in kernel memory, depending on which combination of VFIO and
our KVM IOMMU acceleration for emulated devices is in use. Under at
least some combinations it's again unswappable memory, and so we should
account it as locked.

[1] Or at least, it might in future: Andrea is accounting for several
things that don't actually impact locked_vm now, but probably should.

[2] Complete aside. The Power MMU works very differently from the x86
MMU (or indeed the MMU on any other arch I know of), using a hash table
to locate PTEs rather than a radix tree (PGDs -> PUDs -> PMDs -> PTEs).
IBM Research were/are terribly proud of the design, which apparently
had significant advantages for big database loads with a widely
scattered working set - advantages which have been completely swamped
by horrible cache behaviour for most of the last 15 years. It also
requires a big slab of physically contiguous memory for the hash table,
which is a bit of a pain for us. Linux actually treats the hash page
table as though it were an enormous TLB, reloading it as necessary from
radix-style page tables.

--
David Gibson <dgibson@redhat.com>
Senior Software Engineer, Virtualization, Red Hat