On 4/12/19 11:56 AM, Erik Skultety wrote:
On Thu, Apr 04, 2019 at 10:40:39AM -0300, Daniel Henrique Barboza wrote:
> The NVIDIA V100 GPU has onboard RAM that is mapped into the
> host memory and accessible as normal RAM via an NVLink2 bridge. When
> passed through to a guest, QEMU puts the NVIDIA RAM window in a
> non-contiguous area, above the PCI MMIO area that starts at 32TiB.
> This means that the NVIDIA RAM window starts at 64TiB and goes all
> the way to 128TiB.
>
> As a consequence, the guest might request a 64-bit DMA window, for
> each PCI Host Bridge, that goes all the way up to 128TiB. However,
> the NVIDIA RAM window isn't counted as regular RAM, so this window
> is considered only for the allocation of the Translation Control
> Entry (TCE) table. For more information about how NVLink2 support
> works in QEMU, refer to the accepted implementation [1].
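To make the TCE part of that a bit more concrete, here is a quick
back-of-the-envelope check (illustrative only, not libvirt code),
assuming 4 KiB IOMMU pages with one 8-byte TCE entry each, which is the
same 1/512 factor the existing comment in the code uses:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int main(void)
{
    const uint64_t KiB = 1024ULL;
    const uint64_t TiB = KiB * KiB * KiB * KiB;

    /* The 64-bit DMA window requested by the guest can go up to 128TiB. */
    uint64_t window = 128 * TiB;

    /* One 8-byte TCE entry per 4KiB IOMMU page => overhead of window/512. */
    uint64_t tcePerPHB = window / 512;

    /* Prints 256, i.e. 256GiB of TCE table accounting per PHB. */
    printf("TCE accounting per PHB: %" PRIu64 " GiB\n",
           tcePerPHB / (KiB * KiB * KiB));
    return 0;
}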
>
> This memory layout differs from the existing VFIO case, requiring its
> own formula. This patch changes the PPC64 code of
> @qemuDomainGetMemLockLimitBytes to:
>
> - detect whether an NVLink2 bridge is being passed through to the
> guest. This is done by using the @ppc64VFIODeviceIsNV2Bridge function
> added in the previous patch. The existence of the NVLink2 bridge in
> the guest means that we are dealing with the NVLink2 memory layout;
>
> - if an IBM NVLink2 bridge exists, passthroughLimit is calculated in a
> different way to account for the extra memory the TCE table can
> allocate. The 64TiB..128TiB window is more than enough to fit all
> possible GPUs, so the memLimit is the same regardless of passing
> through 1 or multiple V100 GPUs.
>
> [1] https://lists.gnu.org/archive/html/qemu-devel/2019-03/msg03700.html
For further explanation, I'll also add links to Alexey's responses on the libvirt list:
https://www.redhat.com/archives/libvir-list/2019-March/msg00660.html
https://www.redhat.com/archives/libvir-list/2019-April/msg00527.html
...
> + * passthroughLimit = maxMemory +
> + * 128TiB/512KiB * #PHBs + 8 MiB */
> + if (nvlink2Capable) {
> + passthroughLimit = maxMemory +
> + 128 * (1ULL<<30) / 512 * nPCIHostBridges +
> + 8192;
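In case it helps review, here's what that hunk evaluates to with some
example numbers plugged in (a guest with 32 GiB of maxMemory and a
single PHB; the values are arbitrary). Everything is in KiB, as in the
code:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int main(void)
{
    /* Illustrative values only: maxMemory would come from the domain
     * definition and nPCIHostBridges from counting the pseries PHBs. */
    uint64_t maxMemory = 32ULL * 1024 * 1024;   /* 32 GiB in KiB */
    size_t nPCIHostBridges = 1;

    /* Same expression as in the hunk above, all in KiB:
     * maxMemory + 128TiB/512 per PHB + 8MiB. */
    uint64_t passthroughLimit = maxMemory +
                                128 * (1ULL << 30) / 512 * nPCIHostBridges +
                                8192;

    /* 32GiB + 256GiB + 8MiB = 301998080 KiB */
    printf("passthroughLimit = %" PRIu64 " KiB\n", passthroughLimit);
    return 0;
}

So even with a single V100 the limit is dominated by the fixed 256GiB
per PHB of TCE window accounting, which is why one vs. multiple GPUs
makes no difference.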
> + } else if (usesVFIO) {
> + /* For regular (non-NVLink1 present) VFIO passthrough, the value
Shouldn't ^this be "non-NVLink2 present", since the limits are
unchanged except you need to assign the bridges too for NVLink1?
Yes, "non-NVLink2 present" is correct there. Not sure why or how, but I
managed to decrement some integers of the existing comment I moved
by 1. Here's more corrections:
> + * of passthroughLimit is:
> + *
> + * passthroughLimit := max( 2 GiB * #PHBs,                        (c)
> + *                          memory                                (d)
> + *                          + memory * 1/512 * #PHBs + 8 MiB )    (e)
> + *
> + * (c) is the pre-DDW VFIO DMA window accounting. We're allowing 1
> + * GiB rather than 0 GiB
" We're allowing 2 GiB rather than 1 GiB"
> + *
> + * (d) is the with-DDW (and memory pre-registration and related
> + * features) DMA window accounting - assuming that we only account
> + * RAM once, even if mapped to multiple PHBs
> + *
> + * (e) is the with-DDW userspace view and overhead for the 63-bit
> + * DMA window. This is based a bit on expected guest behaviour, but
64-bit DMA window
> + * there really isn't a way to completely avoid that. We assume the
> + * guest requests a 63-bit DMA window (per PHB) just big enough to
64-bit DMA window
> + * map all its RAM. 3 kiB page size gives the 1/512; it will be
4 kiB page size
> + * less with 64 kiB pages, less still if the guest is mapped with
> + * hugepages (unlike the default 31-bit DMA window, DDW windows
default 32-bit DMA window
> + * can use large IOMMU pages). 7 MiB is for second and further level
8 MiB is for second
> + * overheads, like (b) */
> passthroughLimit = MAX(2 * 1024 * 1024 * nPCIHostBridges,
> memory +
> memory / 512 * nPCIHostBridges + 8192);
> + }
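And for comparison, the same kind of worked example for the regular
VFIO branch, using the corrected constants from the comments above
(assuming 16 GiB of guest memory and 2 PHBs, again arbitrary numbers):

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

#define MAX(a, b) ((a) > (b) ? (a) : (b))

int main(void)
{
    /* Illustrative values only, everything in KiB as in the code. */
    uint64_t memory = 16ULL * 1024 * 1024;   /* 16 GiB in KiB */
    size_t nPCIHostBridges = 2;

    /* max( 2GiB * #PHBs, memory + memory/512 * #PHBs + 8MiB ) */
    uint64_t passthroughLimit = MAX(2ULL * 1024 * 1024 * nPCIHostBridges,
                                    memory +
                                    memory / 512 * nPCIHostBridges + 8192);

    /* 16GiB + 2 * 32MiB + 8MiB = 16850944 KiB */
    printf("passthroughLimit = %" PRIu64 " KiB\n", passthroughLimit);
    return 0;
}

i.e. here the limit scales with guest RAM (the 1/512 TCE overhead per
PHB), while in the NVLink2 case it is dominated by the fixed per-PHB
term for the 128TiB window.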
>
> memKB = baseLimit + passthroughLimit;
>
Let me know whether I need to adjust the commentary above before pushing:
Yes, I'd appreciate it if you could amend the comments up there before pushing.
Thanks,
DHB
Reviewed-by: Erik Skultety <eskultet(a)redhat.com>