On 4/12/19 11:56 AM, Erik Skultety wrote:
> On Thu, Apr 04, 2019 at 10:40:39AM -0300, Daniel Henrique Barboza wrote:
> > The NVIDIA V100 GPU has onboard RAM that is mapped into the
> > host memory and accessible as normal RAM via an NVLink2 bridge. When
> > passed through to a guest, QEMU puts the NVIDIA RAM window in a
> > non-contiguous area, above the PCI MMIO area that starts at 32TiB.
> > This means that the NVIDIA RAM window starts at 64TiB and goes all
> > the way to 128TiB.
> >
> > This means that the guest might request a 64-bit window, for each PCI
> > Host Bridge, that goes all the way to 128TiB. However, the NVIDIA RAM
> > window isn't counted as regular RAM, thus this window is considered
> > only for the allocation of the Translation and Control Entry (TCE).
> > For more information about how NVLink2 support works in QEMU,
> > refer to the accepted implementation [1].
> >
> > This memory layout differs from the existing VFIO case, requiring its
> > own formula. This patch changes the PPC64 code of
> > @qemuDomainGetMemLockLimitBytes to:
> >
> > - detect if we have an NVLink2 bridge being passed through to the
> > guest. This is done by using the @ppc64VFIODeviceIsNV2Bridge function
> > added in the previous patch. The existence of the NVLink2 bridge in
> > the guest means that we are dealing with the NVLink2 memory layout;
> >
> > - if an IBM NVLink2 bridge exists, passthroughLimit is calculated in a
> > different way to account for the extra memory the TCE table can
> > allocate. The 64TiB..128TiB window is more than enough to fit all
> > possible GPUs, thus the memLimit is the same regardless of passing
> > through one or multiple V100 GPUs.
> >
> > [1] https://lists.gnu.org/archive/html/qemu-devel/2019-03/msg03700.html
> For further explanation, I'll also add Alexey's responses on libvirt list:
> https://www.redhat.com/archives/libvir-list/2019-March/msg00660.html
> https://www.redhat.com/archives/libvir-list/2019-April/msg00527.html
>
> ...
>
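To make the arithmetic concrete, here is a minimal standalone sketch of the
NVLink2 branch quoted below (illustrative only, not the libvirt
implementation; values are in KiB as in qemuDomainGetMemLockLimitBytes, and
the helper name is hypothetical):

#include <stddef.h>

/* Sketch of the NVLink2 passthrough limit from the hunk quoted below. */
unsigned long long
nvlink2PassthroughLimitKiB(unsigned long long maxMemoryKiB,
                           size_t nPCIHostBridges)
{
    /* The guest's 64-bit DMA window has to reach the end of the GPU RAM
     * area at 128 TiB; with one 8-byte TCE per 4 KiB page (the 1/512 ratio
     * from the comment further down) that is 128 TiB / 512 = 256 GiB of
     * TCE table per PHB, plus 8 MiB for higher-level page tables. */
    return maxMemoryKiB +
           128 * (1ULL << 30) / 512 * nPCIHostBridges +
           8192;
}

For example, a guest with 128 GiB of maxMemory and a single PHB would get a
passthroughLimit of roughly 128 GiB + 256 GiB + 8 MiB.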
> > + * passthroughLimit = maxMemory +
> > + * 128TiB/512KiB * #PHBs + 8 MiB */
> > + if (nvlink2Capable) {
> > + passthroughLimit = maxMemory +
> > + 128 * (1ULL<<30) / 512 * nPCIHostBridges +
> > + 8192;
> > + } else if (usesVFIO) {
> > + /* For regular (non-NVLink1 present) VFIO passthrough, the value
> Shouldn't ^this be "non-NVLink2 present", since the limits
> are unchanged except you need to assign the bridges too for NVLink1?
Yes, "non-NVLink2 present" is correct there. Not sure why or how, but I
managed to decrement some integers of the existing comment I moved
by 1. Here's more corrections:
>
> > + * of passthroughLimit is:
> > + *
> > + * passthroughLimit := max( 2 GiB * #PHBs,                       (c)
> > + *                          memory                                (d)
> > + *                          + memory * 1/512 * #PHBs + 8 MiB )   (e)
> > + *
> > + * (c) is the pre-DDW VFIO DMA window accounting. We're allowing 1
> > + * GiB rather than 0 GiB
" We're allowing 2 GiB rather than 1 GiB"
> > + *
> > + * (d) is the with-DDW (and memory pre-registration and related
> > + * features) DMA window accounting - assuming that we only account
> > + * RAM once, even if mapped to multiple PHBs
> > + *
> > + * (e) is the with-DDW userspace view and overhead for the 63-bit
> > + * DMA window. This is based a bit on expected guest behaviour, but
64-bit DMA window
> > + * there really isn't a way to completely avoid that. We assume the
> > + * guest requests a 63-bit DMA window (per PHB) just big enough to
64-bit DMA window
> > + * map all its RAM. 3 kiB page size gives the 1/512; it will be
4 kiB page size
> > + * less with 64 kiB pages, less still if the guest is mapped with
> > + * hugepages (unlike the default 31-bit DMA window, DDW windows
default 32-bit DMA window
> > + * can use large IOMMU pages). 7 MiB is for second and further level
8 MiB is for second
Doh! I have to admit that ^this slipped my sight, thanks for spotting it.
The series is now pushed.
Regards,
Erik
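For completeness, the corrected non-NVLink2 formula from the comment above,
again only as an illustrative sketch (values in KiB, helper name
hypothetical, not the libvirt implementation):

#include <stddef.h>

/* max( 2 GiB * #PHBs, memory + memory/512 * #PHBs + 8 MiB ), in KiB. */
unsigned long long
vfioPassthroughLimitKiB(unsigned long long memoryKiB,
                        size_t nPCIHostBridges)
{
    /* (c) pre-DDW DMA window accounting: 2 GiB per PHB. */
    unsigned long long preDDW = 2 * (1ULL << 20) * nPCIHostBridges;

    /* (d) + (e) with-DDW accounting: RAM counted once, plus a 64-bit DMA
     * window per PHB (1/512 of RAM with 4 KiB pages) and 8 MiB of
     * higher-level page tables. */
    unsigned long long withDDW = memoryKiB +
                                 memoryKiB / 512 * nPCIHostBridges +
                                 8192;

    return preDDW > withDDW ? preDDW : withDDW;
}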