[libvirt] [PATCH v5 0/2] PPC64 support for NVIDIA V100 GPU with NVLink2 passthrough

This series adds support for a new QEMU feature for the spapr (PPC64)
machine, NVIDIA V100 + P9 passthrough. Refer to [1] for version 7 of
this feature (the version accepted upstream).

Changes in v5:

- patch 1:
  * use ARRAY_CARDINALITY instead of hard coding the array size
  * fixed leak of 'file' string inside loop
  * added ATTRIBUTE_UNUSED in the static function to allow the build
    to succeed without applying the second patch

- patch 2:
  * added QEMU reference of the memory layout of NVLink2 support
  * fixed leak of 'pciAddrStr' string inside loop
  * added curly braces in multi-line statements

The previous version can be found at [2].

[1] https://patchwork.kernel.org/patch/10848727/
[2] https://www.redhat.com/archives/libvir-list/2019-March/msg00801.html

Daniel Henrique Barboza (2):
  qemu_domain: NVLink2 bridge detection function for PPC64
  PPC64 support for NVIDIA V100 GPU with NVLink2 passthrough

 src/qemu/qemu_domain.c | 108 ++++++++++++++++++++++++++++++++++-------
 1 file changed, 90 insertions(+), 18 deletions(-)

--
2.20.1

The NVLink2 support in QEMU implements the detection of NVLink2 capable
devices by verifying the attributes of the VFIO mem region QEMU allocates
for the NVIDIA GPUs. To properly allocate an adequate amount of memLock,
Libvirt needs this information before a QEMU instance is even created,
thus querying QEMU is not possible and opening a VFIO window is too much.

An alternative is presented in this patch, making the following
assumptions:

- if we want GPU RAM to be available in the guest, an NVLink2 bridge must
  be passed through;

- an unknown PCI device can be classified as an NVLink2 bridge if its
  device tree node has 'ibm,gpu', 'ibm,nvlink', 'ibm,nvlink-speed' and
  'memory-region'.

This patch introduces a helper called @ppc64VFIODeviceIsNV2Bridge that
checks whether the device tree node of a given PCI device meets the
criteria to be an NVLink2 bridge. This new function will be used in a
follow-up patch that, using the first assumption, will set up the rlimits
of the guest accordingly.

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 src/qemu/qemu_domain.c | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index c6188b38ce..b0f301e634 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -10405,6 +10405,36 @@ qemuDomainUpdateCurrentMemorySize(virDomainObjPtr vm)
 }


+/**
+ * ppc64VFIODeviceIsNV2Bridge:
+ * @device: string with the PCI device address
+ *
+ * This function receives a string that represents a PCI device,
+ * such as '0004:04:00.0', and tells if the device is a NVLink2
+ * bridge.
+ */
+static ATTRIBUTE_UNUSED bool
+ppc64VFIODeviceIsNV2Bridge(const char *device)
+{
+    const char *nvlink2Files[] = {"ibm,gpu", "ibm,nvlink",
+                                  "ibm,nvlink-speed", "memory-region"};
+    size_t i;
+
+    for (i = 0; i < ARRAY_CARDINALITY(nvlink2Files); i++) {
+        VIR_AUTOFREE(char *) file = NULL;
+
+        if ((virAsprintf(&file, "/sys/bus/pci/devices/%s/of_node/%s",
+                         device, nvlink2Files[i])) < 0)
+            return false;
+
+        if (!virFileExists(file))
+            return false;
+    }
+
+    return true;
+}
+
+
 /**
  * getPPC64MemLockLimitBytes:
  * @def: domain definition
--
2.20.1
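As a quick way to try the same check by hand on a host, outside libvirt, a minimal standalone sketch along these lines could be used. It mirrors the sysfs/of_node file checks the helper performs, but with plain libc instead of the libvirt utility functions; the device address in main() is only the example address from the patch, not a real assumption about any host.

    /* Standalone sketch: tests whether a PCI device looks like an
     * NVLink2 bridge by checking for the same device tree properties
     * the patch checks under /sys. Not libvirt code; plain libc only. */
    #include <stdbool.h>
    #include <stdio.h>
    #include <unistd.h>

    static bool
    looks_like_nv2_bridge(const char *device)
    {
        const char *nvlink2Files[] = {"ibm,gpu", "ibm,nvlink",
                                      "ibm,nvlink-speed", "memory-region"};
        char path[256];
        size_t i;

        for (i = 0; i < sizeof(nvlink2Files) / sizeof(nvlink2Files[0]); i++) {
            snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/of_node/%s",
                     device, nvlink2Files[i]);
            if (access(path, F_OK) != 0)
                return false;   /* any missing property => not an NVLink2 bridge */
        }
        return true;
    }

    int main(void)
    {
        const char *addr = "0004:04:00.0";   /* example address from the patch */
        printf("%s: %s\n", addr,
               looks_like_nv2_bridge(addr) ? "NVLink2 bridge" : "not an NVLink2 bridge");
        return 0;
    }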

On Thu, Apr 04, 2019 at 10:40:38AM -0300, Daniel Henrique Barboza wrote:
The NVLink2 support in QEMU implements the detection of NVLink2 capable devices by verifying the attributes of the VFIO mem region QEMU allocates for the NVIDIA GPUs. To properly allocate an adequate amount of memLock, Libvirt needs this information before a QEMU instance is even created, thus querying QEMU is not possible and opening a VFIO window is too much.
An alternative is presented in this patch, making the following assumptions:
- if we want GPU RAM to be available in the guest, an NVLink2 bridge must be passed through;
- an unknown PCI device can be classified as an NVLink2 bridge if its device tree node has 'ibm,gpu', 'ibm,nvlink', 'ibm,nvlink-speed' and 'memory-region'.
This patch introduces a helper called @ppc64VFIODeviceIsNV2Bridge that checks whether the device tree node of a given PCI device meets the criteria to be an NVLink2 bridge. This new function will be used in a follow-up patch that, using the first assumption, will set up the rlimits of the guest accordingly.
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---

Reviewed-by: Erik Skultety <eskultet@redhat.com>

The NVIDIA V100 GPU has onboard RAM that is mapped into the host memory
and accessible as normal RAM via an NVLink2 bridge. When passed through
in a guest, QEMU puts the NVIDIA RAM window in a non-contiguous area,
above the PCI MMIO area that starts at 32TiB. This means that the NVIDIA
RAM window starts at 64TiB and goes all the way to 128TiB.

This means that the guest might request a 64-bit window, for each PCI
Host Bridge, that goes all the way to 128TiB. However, the NVIDIA RAM
window isn't counted as regular RAM, thus this window is considered only
for the allocation of the Translation and Control Entry (TCE). For more
information about how NVLink2 support works in QEMU, refer to the
accepted implementation [1].

This memory layout differs from the existing VFIO case, requiring its
own formula. This patch changes the PPC64 code of
@qemuDomainGetMemLockLimitBytes to:

- detect if we have an NVLink2 bridge being passed through to the guest.
  This is done by using the @ppc64VFIODeviceIsNV2Bridge function added
  in the previous patch. The existence of the NVLink2 bridge in the
  guest means that we are dealing with the NVLink2 memory layout;

- if an IBM NVLink2 bridge exists, passthroughLimit is calculated in a
  different way to account for the extra memory the TCE table can
  allocate. The 64TiB..128TiB window is more than enough to fit all
  possible GPUs, thus the memLimit is the same regardless of passing
  through 1 or multiple V100 GPUs.

[1] https://lists.gnu.org/archive/html/qemu-devel/2019-03/msg03700.html

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 src/qemu/qemu_domain.c | 80 ++++++++++++++++++++++++++++++++----------
 1 file changed, 61 insertions(+), 19 deletions(-)

diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index b0f301e634..13e54eafea 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -10413,7 +10413,7 @@ qemuDomainUpdateCurrentMemorySize(virDomainObjPtr vm)
  * such as '0004:04:00.0', and tells if the device is a NVLink2
  * bridge.
  */
-static ATTRIBUTE_UNUSED bool
+static bool
 ppc64VFIODeviceIsNV2Bridge(const char *device)
 {
     const char *nvlink2Files[] = {"ibm,gpu", "ibm,nvlink",
@@ -10451,7 +10451,9 @@ getPPC64MemLockLimitBytes(virDomainDefPtr def)
     unsigned long long maxMemory = 0;
     unsigned long long passthroughLimit = 0;
     size_t i, nPCIHostBridges = 0;
+    virPCIDeviceAddressPtr pciAddr;
     bool usesVFIO = false;
+    bool nvlink2Capable = false;

     for (i = 0; i < def->ncontrollers; i++) {
         virDomainControllerDefPtr cont = def->controllers[i];
@@ -10469,7 +10471,17 @@ getPPC64MemLockLimitBytes(virDomainDefPtr def)
             dev->source.subsys.type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI &&
             dev->source.subsys.u.pci.backend == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_VFIO) {
             usesVFIO = true;
-            break;
+
+            pciAddr = &dev->source.subsys.u.pci.addr;
+            if (virPCIDeviceAddressIsValid(pciAddr, false)) {
+                VIR_AUTOFREE(char *) pciAddrStr = NULL;
+
+                pciAddrStr = virPCIDeviceAddressAsString(pciAddr);
+                if (ppc64VFIODeviceIsNV2Bridge(pciAddrStr)) {
+                    nvlink2Capable = true;
+                    break;
+                }
+            }
         }
     }

@@ -10496,29 +10508,59 @@ getPPC64MemLockLimitBytes(virDomainDefPtr def)
                 4096 * nPCIHostBridges +
                 8192;

-    /* passthroughLimit := max( 2 GiB * #PHBs, (c)
-     *                          memory (d)
-     *                          + memory * 1/512 * #PHBs + 8 MiB ) (e)
+    /* NVLink2 support in QEMU is a special case of the passthrough
+     * mechanics explained in the usesVFIO case below. The GPU RAM
+     * is placed with a gap after maxMemory. The current QEMU
+     * implementation puts the NVIDIA RAM above the PCI MMIO, which
+     * starts at 32TiB and is the MMIO reserved for the guest main RAM.
      *
-     * (c) is the pre-DDW VFIO DMA window accounting. We're allowing 2 GiB
-     * rather than 1 GiB
+     * This window ends at 64TiB, and this is where the GPUs are being
+     * placed. The next available window size is at 128TiB, and
+     * 64TiB..128TiB will fit all possible NVIDIA GPUs.
      *
-     * (d) is the with-DDW (and memory pre-registration and related
-     * features) DMA window accounting - assuming that we only account RAM
-     * once, even if mapped to multiple PHBs
+     * The same assumption as the most common case applies here:
+     * the guest will request a 64-bit DMA window, per PHB, that is
+     * big enough to map all its RAM, which is now at 128TiB due
+     * to the GPUs.
      *
-     * (e) is the with-DDW userspace view and overhead for the 64-bit DMA
-     * window. This is based a bit on expected guest behaviour, but there
-     * really isn't a way to completely avoid that. We assume the guest
-     * requests a 64-bit DMA window (per PHB) just big enough to map all
-     * its RAM. 4 kiB page size gives the 1/512; it will be less with 64
-     * kiB pages, less still if the guest is mapped with hugepages (unlike
-     * the default 32-bit DMA window, DDW windows can use large IOMMU
-     * pages). 8 MiB is for second and further level overheads, like (b) */
-    if (usesVFIO)
+     * Note that the NVIDIA RAM window must be accounted for the TCE
+     * table size, but *not* for the main RAM (maxMemory). This gives
+     * us the following passthroughLimit for the NVLink2 case:
+     *
+     * passthroughLimit = maxMemory +
+     *                    128TiB/512KiB * #PHBs + 8 MiB */
+    if (nvlink2Capable) {
+        passthroughLimit = maxMemory +
+                           128 * (1ULL<<30) / 512 * nPCIHostBridges +
+                           8192;
+    } else if (usesVFIO) {
+        /* For regular (non-NVLink1 present) VFIO passthrough, the value
+         * of passthroughLimit is:
+         *
+         * passthroughLimit := max( 2 GiB * #PHBs, (c)
+         *                          memory (d)
+         *                          + memory * 1/512 * #PHBs + 8 MiB ) (e)
+         *
+         * (c) is the pre-DDW VFIO DMA window accounting. We're allowing 1
+         * GiB rather than 0 GiB
+         *
+         * (d) is the with-DDW (and memory pre-registration and related
+         * features) DMA window accounting - assuming that we only account
+         * RAM once, even if mapped to multiple PHBs
+         *
+         * (e) is the with-DDW userspace view and overhead for the 63-bit
+         * DMA window. This is based a bit on expected guest behaviour, but
+         * there really isn't a way to completely avoid that. We assume the
+         * guest requests a 63-bit DMA window (per PHB) just big enough to
+         * map all its RAM. 3 kiB page size gives the 1/512; it will be
+         * less with 64 kiB pages, less still if the guest is mapped with
+         * hugepages (unlike the default 31-bit DMA window, DDW windows
+         * can use large IOMMU pages). 7 MiB is for second and further level
+         * overheads, like (b) */
         passthroughLimit = MAX(2 * 1024 * 1024 * nPCIHostBridges,
                                memory +
                                memory / 512 * nPCIHostBridges + 8192);
+    }

     memKB = baseLimit + passthroughLimit;
--
2.20.1
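To make the NVLink2 formula concrete, here is a minimal standalone sketch that plugs assumed numbers into the passthroughLimit expression from the patch. The 32 GiB guest size and the single PHB are made-up example values, not taken from the series; all quantities are in KiB, as in getPPC64MemLockLimitBytes.

    /* Worked example of the NVLink2 passthroughLimit formula from the patch.
     * All quantities are in KiB. The guest size and PHB count below are
     * assumed example values only. */
    #include <stdio.h>

    int main(void)
    {
        unsigned long long maxMemory = 32ULL * 1024 * 1024;  /* 32 GiB in KiB (assumed) */
        unsigned long long nPCIHostBridges = 1;              /* assumed */

        /* passthroughLimit = maxMemory + 128TiB/512KiB * #PHBs + 8 MiB */
        unsigned long long passthroughLimit = maxMemory +
                                              128 * (1ULL << 30) / 512 * nPCIHostBridges +
                                              8192;

        /* 33554432 + 268435456 + 8192 = 301998080 KiB, i.e. 288 GiB + 8 MiB.
         * Note there is no per-GPU term: the limit is the same whether one
         * or several V100s are passed through, as the commit message says. */
        printf("passthroughLimit = %llu KiB\n", passthroughLimit);
        return 0;
    }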

On Thu, Apr 04, 2019 at 10:40:39AM -0300, Daniel Henrique Barboza wrote:
The NVIDIA V100 GPU has onboard RAM that is mapped into the host memory and accessible as normal RAM via an NVLink2 bridge. When passed through in a guest, QEMU puts the NVIDIA RAM window in a non-contiguous area, above the PCI MMIO area that starts at 32TiB. This means that the NVIDIA RAM window starts at 64TiB and goes all the way to 128TiB.
This means that the guest might request a 64-bit window, for each PCI Host Bridge, that goes all the way to 128TiB. However, the NVIDIA RAM window isn't counted as regular RAM, thus this window is considered only for the allocation of the Translation and Control Entry (TCE). For more information about how NVLink2 support works in QEMU, refer to the accepted implementation [1].
This memory layout differs from the existing VFIO case, requiring its own formula. This patch changes the PPC64 code of @qemuDomainGetMemLockLimitBytes to:
- detect if we have an NVLink2 bridge being passed through to the guest. This is done by using the @ppc64VFIODeviceIsNV2Bridge function added in the previous patch. The existence of the NVLink2 bridge in the guest means that we are dealing with the NVLink2 memory layout;
- if an IBM NVLink2 bridge exists, passthroughLimit is calculated in a different way to account for the extra memory the TCE table can allocate. The 64TiB..128TiB window is more than enough to fit all possible GPUs, thus the memLimit is the same regardless of passing through 1 or multiple V100 GPUs.
[1] https://lists.gnu.org/archive/html/qemu-devel/2019-03/msg03700.html
For further explanation, I'll also add Alexey's responses on libvirt list:
https://www.redhat.com/archives/libvir-list/2019-March/msg00660.html
https://www.redhat.com/archives/libvir-list/2019-April/msg00527.html

...
+     * passthroughLimit = maxMemory +
+     *                    128TiB/512KiB * #PHBs + 8 MiB */
+    if (nvlink2Capable) {
+        passthroughLimit = maxMemory +
+                           128 * (1ULL<<30) / 512 * nPCIHostBridges +
+                           8192;
+    } else if (usesVFIO) {
+        /* For regular (non-NVLink1 present) VFIO passthrough, the value
Shouldn't ^this be "non-NVLink2 present", since the limits are unchanged except you need to assign the bridges too for NVLink1?
+         * of passthroughLimit is:
+         *
+         * passthroughLimit := max( 2 GiB * #PHBs, (c)
+         *                          memory (d)
+         *                          + memory * 1/512 * #PHBs + 8 MiB ) (e)
+         *
+         * (c) is the pre-DDW VFIO DMA window accounting. We're allowing 1
+         * GiB rather than 0 GiB
+         *
+         * (d) is the with-DDW (and memory pre-registration and related
+         * features) DMA window accounting - assuming that we only account
+         * RAM once, even if mapped to multiple PHBs
+         *
+         * (e) is the with-DDW userspace view and overhead for the 63-bit
+         * DMA window. This is based a bit on expected guest behaviour, but
+         * there really isn't a way to completely avoid that. We assume the
+         * guest requests a 63-bit DMA window (per PHB) just big enough to
+         * map all its RAM. 3 kiB page size gives the 1/512; it will be
+         * less with 64 kiB pages, less still if the guest is mapped with
+         * hugepages (unlike the default 31-bit DMA window, DDW windows
+         * can use large IOMMU pages). 7 MiB is for second and further level
+         * overheads, like (b) */
         passthroughLimit = MAX(2 * 1024 * 1024 * nPCIHostBridges,
                                memory +
                                memory / 512 * nPCIHostBridges + 8192);
+    }
memKB = baseLimit + passthroughLimit;
Let me know whether I need to adjust the commentary above before pushing:

Reviewed-by: Erik Skultety <eskultet@redhat.com>

On 4/12/19 11:56 AM, Erik Skultety wrote:
On Thu, Apr 04, 2019 at 10:40:39AM -0300, Daniel Henrique Barboza wrote:
The NVIDIA V100 GPU has onboard RAM that is mapped into the host memory and accessible as normal RAM via an NVLink2 bridge. When passed through in a guest, QEMU puts the NVIDIA RAM window in a non-contiguous area, above the PCI MMIO area that starts at 32TiB. This means that the NVIDIA RAM window starts at 64TiB and goes all the way to 128TiB.
This means that the guest might request a 64-bit window, for each PCI Host Bridge, that goes all the way to 128TiB. However, the NVIDIA RAM window isn't counted as regular RAM, thus this window is considered only for the allocation of the Translation and Control Entry (TCE). For more information about how NVLink2 support works in QEMU, refer to the accepted implementation [1].
This memory layout differs from the existing VFIO case, requiring its own formula. This patch changes the PPC64 code of @qemuDomainGetMemLockLimitBytes to:
- detect if we have an NVLink2 bridge being passed through to the guest. This is done by using the @ppc64VFIODeviceIsNV2Bridge function added in the previous patch. The existence of the NVLink2 bridge in the guest means that we are dealing with the NVLink2 memory layout;
- if an IBM NVLink2 bridge exists, passthroughLimit is calculated in a different way to account for the extra memory the TCE table can allocate. The 64TiB..128TiB window is more than enough to fit all possible GPUs, thus the memLimit is the same regardless of passing through 1 or multiple V100 GPUs.
[1] https://lists.gnu.org/archive/html/qemu-devel/2019-03/msg03700.html

For further explanation, I'll also add Alexey's responses on libvirt list:
https://www.redhat.com/archives/libvir-list/2019-March/msg00660.html
https://www.redhat.com/archives/libvir-list/2019-April/msg00527.html
...
+     * passthroughLimit = maxMemory +
+     *                    128TiB/512KiB * #PHBs + 8 MiB */
+    if (nvlink2Capable) {
+        passthroughLimit = maxMemory +
+                           128 * (1ULL<<30) / 512 * nPCIHostBridges +
+                           8192;
+    } else if (usesVFIO) {
+        /* For regular (non-NVLink1 present) VFIO passthrough, the value

Shouldn't ^this be "non-NVLink2 present", since the limits are unchanged except you need to assign the bridges too for NVLink1?
Yes, "non-NVLink2 present" is correct there. Not sure why or how, but I managed to decrement some integers of the existing comment I moved by 1. Here's more corrections:
+         * of passthroughLimit is:
+         *
+         * passthroughLimit := max( 2 GiB * #PHBs, (c)
+         *                          memory (d)
+         *                          + memory * 1/512 * #PHBs + 8 MiB ) (e)
+         *
+         * (c) is the pre-DDW VFIO DMA window accounting. We're allowing 1
+         * GiB rather than 0 GiB
" We're allowing 2 GiB rather than 1 GiB"
+         *
+         * (d) is the with-DDW (and memory pre-registration and related
+         * features) DMA window accounting - assuming that we only account
+         * RAM once, even if mapped to multiple PHBs
+         *
+         * (e) is the with-DDW userspace view and overhead for the 63-bit
+         * DMA window. This is based a bit on expected guest behaviour, but

64-bit DMA window

+         * there really isn't a way to completely avoid that. We assume the
+         * guest requests a 63-bit DMA window (per PHB) just big enough to

64-bit DMA window

+         * map all its RAM. 3 kiB page size gives the 1/512; it will be

4 kiB page size

+         * less with 64 kiB pages, less still if the guest is mapped with
+         * hugepages (unlike the default 31-bit DMA window, DDW windows

default 32-bit DMA window

+         * can use large IOMMU pages). 7 MiB is for second and further level
8 MiB is for second
+         * overheads, like (b) */
         passthroughLimit = MAX(2 * 1024 * 1024 * nPCIHostBridges,
                                memory +
                                memory / 512 * nPCIHostBridges + 8192);
+    }
memKB = baseLimit + passthroughLimit;
Let me know whether I need to adjust the commentary above before pushing:
Yes, I'd appreciate it if you could amend the comments up there before pushing.

Thanks,

DHB
Reviewed-by: Erik Skultety <eskultet@redhat.com>

On Fri, Apr 12, 2019 at 12:29:25PM -0300, Daniel Henrique Barboza wrote:
On 4/12/19 11:56 AM, Erik Skultety wrote:
On Thu, Apr 04, 2019 at 10:40:39AM -0300, Daniel Henrique Barboza wrote:
The NVIDIA V100 GPU has onboard RAM that is mapped into the host memory and accessible as normal RAM via an NVLink2 bridge. When passed through in a guest, QEMU puts the NVIDIA RAM window in a non-contiguous area, above the PCI MMIO area that starts at 32TiB. This means that the NVIDIA RAM window starts at 64TiB and goes all the way to 128TiB.
This means that the guest might request a 64-bit window, for each PCI Host Bridge, that goes all the way to 128TiB. However, the NVIDIA RAM window isn't counted as regular RAM, thus this window is considered only for the allocation of the Translation and Control Entry (TCE). For more information about how NVLink2 support works in QEMU, refer to the accepted implementation [1].
This memory layout differs from the existing VFIO case, requiring its own formula. This patch changes the PPC64 code of @qemuDomainGetMemLockLimitBytes to:
- detect if we have an NVLink2 bridge being passed through to the guest. This is done by using the @ppc64VFIODeviceIsNV2Bridge function added in the previous patch. The existence of the NVLink2 bridge in the guest means that we are dealing with the NVLink2 memory layout;
- if an IBM NVLink2 bridge exists, passthroughLimit is calculated in a different way to account for the extra memory the TCE table can allocate. The 64TiB..128TiB window is more than enough to fit all possible GPUs, thus the memLimit is the same regardless of passing through 1 or multiple V100 GPUs.
[1] https://lists.gnu.org/archive/html/qemu-devel/2019-03/msg03700.html

For further explanation, I'll also add Alexey's responses on libvirt list:
https://www.redhat.com/archives/libvir-list/2019-March/msg00660.html
https://www.redhat.com/archives/libvir-list/2019-April/msg00527.html
...
+     * passthroughLimit = maxMemory +
+     *                    128TiB/512KiB * #PHBs + 8 MiB */
+    if (nvlink2Capable) {
+        passthroughLimit = maxMemory +
+                           128 * (1ULL<<30) / 512 * nPCIHostBridges +
+                           8192;
+    } else if (usesVFIO) {
+        /* For regular (non-NVLink1 present) VFIO passthrough, the value

Shouldn't ^this be "non-NVLink2 present", since the limits are unchanged except you need to assign the bridges too for NVLink1?
Yes, "non-NVLink2 present" is correct there. Not sure why or how, but I managed to decrement some integers of the existing comment I moved by 1. Here's more corrections:
+         * of passthroughLimit is:
+         *
+         * passthroughLimit := max( 2 GiB * #PHBs, (c)
+         *                          memory (d)
+         *                          + memory * 1/512 * #PHBs + 8 MiB ) (e)
+         *
+         * (c) is the pre-DDW VFIO DMA window accounting. We're allowing 1
+         * GiB rather than 0 GiB
" We're allowing 2 GiB rather than 1 GiB"
+         *
+         * (d) is the with-DDW (and memory pre-registration and related
+         * features) DMA window accounting - assuming that we only account
+         * RAM once, even if mapped to multiple PHBs
+         *
+         * (e) is the with-DDW userspace view and overhead for the 63-bit
+         * DMA window. This is based a bit on expected guest behaviour, but

64-bit DMA window

+         * there really isn't a way to completely avoid that. We assume the
+         * guest requests a 63-bit DMA window (per PHB) just big enough to

64-bit DMA window

+         * map all its RAM. 3 kiB page size gives the 1/512; it will be

4 kiB page size

+         * less with 64 kiB pages, less still if the guest is mapped with
+         * hugepages (unlike the default 31-bit DMA window, DDW windows

default 32-bit DMA window

+         * can use large IOMMU pages). 7 MiB is for second and further level
8 MiB is for second
Doh! I have to admit ^this slipped past me, thanks for spotting it. The series is now pushed.

Regards,
Erik
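Putting the corrections above together, the (c)-(e) part of the comment would read roughly as follows. This is a reconstruction based only on this thread, not necessarily the exact text that was pushed:

    /* (c) is the pre-DDW VFIO DMA window accounting. We're allowing 2
     * GiB rather than 1 GiB
     *
     * (d) is the with-DDW (and memory pre-registration and related
     * features) DMA window accounting - assuming that we only account
     * RAM once, even if mapped to multiple PHBs
     *
     * (e) is the with-DDW userspace view and overhead for the 64-bit
     * DMA window. This is based a bit on expected guest behaviour, but
     * there really isn't a way to completely avoid that. We assume the
     * guest requests a 64-bit DMA window (per PHB) just big enough to
     * map all its RAM. 4 kiB page size gives the 1/512; it will be
     * less with 64 kiB pages, less still if the guest is mapped with
     * hugepages (unlike the default 32-bit DMA window, DDW windows
     * can use large IOMMU pages). 8 MiB is for second and further level
     * overheads, like (b) */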