[libvirt] [PATCH v3 0/4] PPC64 support for NVIDIA V100 GPU with NVLink2 passthrough

This series includes Libvirt support for a new QEMU feature of the spapr (PPC64) machine: NVIDIA V100 + P9 passthrough. Refer to [1] for version 3 of this feature (the same version used as a reference for this series).

Changes in v3:
- added a new patch (patch 2) that isolates the PPC64-exclusive code that calculates the memLockLimit (suggested by Erik Skultety)
- fixed 'make syntax-check' errors across all patches
- v2 can be found at [2]

[1] https://patchwork.kernel.org/cover/10831413/
[2] https://www.redhat.com/archives/libvir-list/2019-March/msg00059.html

Daniel Henrique Barboza (4):
  qemu_domain: simplify non-VFIO memLockLimit calc for PPC64
  qemu_domain: add a PPC64 memLockLimit helper
  qemu_domain: NVLink2 device tree functions for PPC64
  PPC64 support for NVIDIA V100 GPU with NVLink2 passthrough

 src/qemu/qemu_domain.c | 402 ++++++++++++++++++++++++++++++++---------
 1 file changed, 321 insertions(+), 81 deletions(-)

--
2.20.1

passthroughLimit is being calculated even if usesVFIO is false. After that, an if/else conditional is used to check whether we're going to sum it up with baseLimit.

This patch initializes passthroughLimit to zero and always returns memKB = baseLimit + passthroughLimit. The conditional is then used to calculate passthroughLimit only when usesVFIO is true. This spares some cycles in the usesVFIO=false scenario, but the real motivation is to make it simpler to add an alternative passthroughLimit formula for NVLink2 passthrough.

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 src/qemu/qemu_domain.c | 13 +++++--------
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index 1487268a89..099097fe62 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -10378,7 +10378,7 @@ qemuDomainGetMemLockLimitBytes(virDomainDefPtr def)
         unsigned long long maxMemory;
         unsigned long long memory;
         unsigned long long baseLimit;
-        unsigned long long passthroughLimit;
+        unsigned long long passthroughLimit = 0;
         size_t nPCIHostBridges = 0;
         bool usesVFIO = false;

@@ -10444,15 +10444,12 @@ qemuDomainGetMemLockLimitBytes(virDomainDefPtr def)
          * kiB pages, less still if the guest is mapped with hugepages (unlike
          * the default 32-bit DMA window, DDW windows can use large IOMMU
          * pages). 8 MiB is for second and further level overheads, like (b) */
-        passthroughLimit = MAX(2 * 1024 * 1024 * nPCIHostBridges,
-                               memory +
-                               memory / 512 * nPCIHostBridges + 8192);
-
         if (usesVFIO)
-            memKB = baseLimit + passthroughLimit;
-        else
-            memKB = baseLimit;
+            passthroughLimit = MAX(2 * 1024 * 1024 * nPCIHostBridges,
+                                   memory +
+                                   memory / 512 * nPCIHostBridges + 8192);
+        memKB = baseLimit + passthroughLimit;

         goto done;
     }
--
2.20.1

On Tue, Mar 05, 2019 at 09:46:06AM -0300, Daniel Henrique Barboza wrote:
passthroughLimit is being calculated even if usesVFIO is false. After that, a if/else conditional is used to check if we're going to sum it up with baseLimit.
This patch initializes passthroughLimit to zero and always return memKB = baseLimit + passthroughLimit. The conditional is then used to calculate passthroughLimit if usesVFIO is true. This results in some cycles spared for the usesVFIO=false scenario, but the real motivation is to make the code simpler to add an alternative passthroughLimit formula for NVLink2 passthrough.
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---

Reviewed-by: Erik Skultety <eskultet@redhat.com>

There are a lot of documentation in the comments about how PPC64 handles passthrough VFIO devices to calculate the memLockLimit. And more will be added with the PPC64 NVLink2 support code. Let's remove the PPC64 code from qemuDomainGetMemLockLimitBytes body and put it into a helper function. This will simply the flow of qemuDomainGetMemLockLimitBytes that handles all other platforms and improves the readability of PPC64 specifics. Suggested-by: Erik Skultety <eskultet@redhat.com> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> --- src/qemu/qemu_domain.c | 169 ++++++++++++++++++++++------------------- 1 file changed, 91 insertions(+), 78 deletions(-) diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index 099097fe62..77548c224c 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -10343,6 +10343,95 @@ qemuDomainUpdateCurrentMemorySize(virDomainObjPtr vm) } +/** + * getPPC64MemLockLimitBytes: + * @def: domain definition + * + * A PPC64 helper that calculates the memory locking limit in order for + * the guest to operate properly. + */ +static unsigned long long +getPPC64MemLockLimitBytes(virDomainDefPtr def) +{ + unsigned long long memKB = 0; + unsigned long long baseLimit, memory, maxMemory; + unsigned long long passthroughLimit = 0; + size_t i, nPCIHostBridges = 0; + bool usesVFIO = false; + + for (i = 0; i < def->ncontrollers; i++) { + virDomainControllerDefPtr cont = def->controllers[i]; + + if (!virDomainControllerIsPSeriesPHB(cont)) + continue; + + nPCIHostBridges++; + } + + for (i = 0; i < def->nhostdevs; i++) { + virDomainHostdevDefPtr dev = def->hostdevs[i]; + + if (dev->mode == VIR_DOMAIN_HOSTDEV_MODE_SUBSYS && + dev->source.subsys.type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI && + dev->source.subsys.u.pci.backend == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_VFIO) { + usesVFIO = true; + break; + } + } + + memory = virDomainDefGetMemoryTotal(def); + + if (def->mem.max_memory) + maxMemory = def->mem.max_memory; + else + maxMemory = memory; + + /* baseLimit := maxMemory / 128 (a) + * + 4 MiB * #PHBs + 8 MiB (b) + * + * (a) is the hash table + * + * (b) is accounting for the 32-bit DMA window - it could be either the + * KVM accelerated TCE tables for emulated devices, or the VFIO + * userspace view. The 4 MiB per-PHB (including the default one) covers + * a 2GiB DMA window: default is 1GiB, but it's possible it'll be + * increased to help performance. The 8 MiB extra should be plenty for + * the TCE table index for any reasonable number of PHBs and several + * spapr-vlan or spapr-vscsi devices (512kB + a tiny bit each) */ + baseLimit = maxMemory / 128 + + 4096 * nPCIHostBridges + + 8192; + + /* passthroughLimit := max( 2 GiB * #PHBs, (c) + * memory (d) + * + memory * 1/512 * #PHBs + 8 MiB ) (e) + * + * (c) is the pre-DDW VFIO DMA window accounting. We're allowing 2 GiB + * rather than 1 GiB + * + * (d) is the with-DDW (and memory pre-registration and related + * features) DMA window accounting - assuming that we only account RAM + * once, even if mapped to multiple PHBs + * + * (e) is the with-DDW userspace view and overhead for the 64-bit DMA + * window. This is based a bit on expected guest behaviour, but there + * really isn't a way to completely avoid that. We assume the guest + * requests a 64-bit DMA window (per PHB) just big enough to map all + * its RAM. 
4 kiB page size gives the 1/512; it will be less with 64 + * kiB pages, less still if the guest is mapped with hugepages (unlike + * the default 32-bit DMA window, DDW windows can use large IOMMU + * pages). 8 MiB is for second and further level overheads, like (b) */ + if (usesVFIO) + passthroughLimit = MAX(2 * 1024 * 1024 * nPCIHostBridges, + memory + + memory / 512 * nPCIHostBridges + 8192); + + memKB = baseLimit + passthroughLimit; + + return memKB << 10; +} + + /** * qemuDomainGetMemLockLimitBytes: * @def: domain definition @@ -10374,84 +10463,8 @@ qemuDomainGetMemLockLimitBytes(virDomainDefPtr def) if (def->mem.locked) return VIR_DOMAIN_MEMORY_PARAM_UNLIMITED; - if (ARCH_IS_PPC64(def->os.arch) && def->virtType == VIR_DOMAIN_VIRT_KVM) { - unsigned long long maxMemory; - unsigned long long memory; - unsigned long long baseLimit; - unsigned long long passthroughLimit = 0; - size_t nPCIHostBridges = 0; - bool usesVFIO = false; - - for (i = 0; i < def->ncontrollers; i++) { - virDomainControllerDefPtr cont = def->controllers[i]; - - if (!virDomainControllerIsPSeriesPHB(cont)) - continue; - - nPCIHostBridges++; - } - - for (i = 0; i < def->nhostdevs; i++) { - virDomainHostdevDefPtr dev = def->hostdevs[i]; - - if (dev->mode == VIR_DOMAIN_HOSTDEV_MODE_SUBSYS && - dev->source.subsys.type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI && - dev->source.subsys.u.pci.backend == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_VFIO) { - usesVFIO = true; - break; - } - } - - memory = virDomainDefGetMemoryTotal(def); - - if (def->mem.max_memory) - maxMemory = def->mem.max_memory; - else - maxMemory = memory; - - /* baseLimit := maxMemory / 128 (a) - * + 4 MiB * #PHBs + 8 MiB (b) - * - * (a) is the hash table - * - * (b) is accounting for the 32-bit DMA window - it could be either the - * KVM accelerated TCE tables for emulated devices, or the VFIO - * userspace view. The 4 MiB per-PHB (including the default one) covers - * a 2GiB DMA window: default is 1GiB, but it's possible it'll be - * increased to help performance. The 8 MiB extra should be plenty for - * the TCE table index for any reasonable number of PHBs and several - * spapr-vlan or spapr-vscsi devices (512kB + a tiny bit each) */ - baseLimit = maxMemory / 128 + - 4096 * nPCIHostBridges + - 8192; - - /* passthroughLimit := max( 2 GiB * #PHBs, (c) - * memory (d) - * + memory * 1/512 * #PHBs + 8 MiB ) (e) - * - * (c) is the pre-DDW VFIO DMA window accounting. We're allowing 2 GiB - * rather than 1 GiB - * - * (d) is the with-DDW (and memory pre-registration and related - * features) DMA window accounting - assuming that we only account RAM - * once, even if mapped to multiple PHBs - * - * (e) is the with-DDW userspace view and overhead for the 64-bit DMA - * window. This is based a bit on expected guest behaviour, but there - * really isn't a way to completely avoid that. We assume the guest - * requests a 64-bit DMA window (per PHB) just big enough to map all - * its RAM. 4 kiB page size gives the 1/512; it will be less with 64 - * kiB pages, less still if the guest is mapped with hugepages (unlike - * the default 32-bit DMA window, DDW windows can use large IOMMU - * pages). 
8 MiB is for second and further level overheads, like (b) */ - if (usesVFIO) - passthroughLimit = MAX(2 * 1024 * 1024 * nPCIHostBridges, - memory + - memory / 512 * nPCIHostBridges + 8192); - - memKB = baseLimit + passthroughLimit; - goto done; - } + if (ARCH_IS_PPC64(def->os.arch) && def->virtType == VIR_DOMAIN_VIRT_KVM) + return getPPC64MemLockLimitBytes(def); /* For device passthrough using VFIO the guest memory and MMIO memory * regions need to be locked persistent in order to allow DMA. -- 2.20.1
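To make the formulas above concrete, here is a worked example with illustrative numbers (not taken from the patch): a guest with memory = maxMemory = 32 GiB (33554432 KiB), a single PHB and at least one VFIO hostdev. All values below are in KiB, as in the code:

    baseLimit        = 33554432 / 128 + 4096 * 1 + 8192
                     = 274432 KiB (~268 MiB)
    passthroughLimit = MAX(2 * 1024 * 1024 * 1,
                           33554432 + 33554432 / 512 * 1 + 8192)
                     = 33628160 KiB (~32 GiB + 72 MiB)
    memKB            = baseLimit + passthroughLimit = 33902592 KiB

getPPC64MemLockLimitBytes() then returns memKB << 10, i.e. the same amount expressed in bytes. With no VFIO hostdev, passthroughLimit stays at zero and the limit is just baseLimit.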

On Tue, Mar 05, 2019 at 09:46:07AM -0300, Daniel Henrique Barboza wrote:
There are a lot of documentation in the comments about how PPC64 handles passthrough VFIO devices to calculate the memLockLimit. And more will be added with the PPC64 NVLink2 support code.
Let's remove the PPC64 code from qemuDomainGetMemLockLimitBytes body and put it into a helper function. This will simply the
s/simply/simplify
flow of qemuDomainGetMemLockLimitBytes that handles all other platforms and improves the readability of PPC64 specifics.
Suggested-by: Erik Skultety <eskultet@redhat.com> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> --- src/qemu/qemu_domain.c | 169 ++++++++++++++++++++++------------------- 1 file changed, 91 insertions(+), 78 deletions(-)
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index 099097fe62..77548c224c 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -10343,6 +10343,95 @@ qemuDomainUpdateCurrentMemorySize(virDomainObjPtr vm) }
+/** + * getPPC64MemLockLimitBytes: + * @def: domain definition + * + * A PPC64 helper that calculates the memory locking limit in order for + * the guest to operate properly. + */ +static unsigned long long +getPPC64MemLockLimitBytes(virDomainDefPtr def) +{ + unsigned long long memKB = 0; + unsigned long long baseLimit, memory, maxMemory;
One variable per line.

[...]

I'll squash this in before pushing:

diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index 77548c224c..535ee78c6f 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -10354,7 +10354,9 @@ static unsigned long long
 getPPC64MemLockLimitBytes(virDomainDefPtr def)
 {
     unsigned long long memKB = 0;
-    unsigned long long baseLimit, memory, maxMemory;
+    unsigned long long baseLimit = 0;
+    unsigned long long memory = 0;
+    unsigned long long maxMemory = 0;
     unsigned long long passthroughLimit = 0;
     size_t i, nPCIHostBridges = 0;
     bool usesVFIO = false;

Reviewed-by: Erik Skultety <eskultet@redhat.com>

The NVLink2 support in QEMU implements the detection of NVLink2 capable devices by verfying the attributes of the VFIO mem region QEMU allocates for the NVIDIA GPUs. To properly allocate an adequate amount of memLock, Libvirt needs this information before a QEMU instance is even created. An alternative is presented in this patch. Given a PCI device, we'll traverse the device tree at /proc/device-tree to check if the device has a NPU bridge, retrieve the node of the NVLink2 bus, find the memory-node that is related to the bus and see if it's a NVLink2 bus by inspecting its 'reg' value. This logic is contained inside the 'device_is_nvlink2_capable' function, which uses other new helper functions to navigate and fetch values from the device tree nodes. Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> --- src/qemu/qemu_domain.c | 194 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 194 insertions(+) diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index 77548c224c..97de5793e2 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -10343,6 +10343,200 @@ qemuDomainUpdateCurrentMemorySize(virDomainObjPtr vm) } +/** + * Reads a phandle file and returns the phandle value. + */ +static int +read_dt_phandle(const char* file) +{ + unsigned int buf[1]; + size_t read; + FILE *f; + + f = fopen(file, "r"); + if (!f) + return -1; + + read = fread(buf, sizeof(unsigned int), 1, f); + + if (!read) { + VIR_CLOSE(f); + return 0; + } + + VIR_CLOSE(f); + return be32toh(buf[0]); +} + + +/** + * Reads a memory reg file and returns the first 4 int values. + * + * The caller is responsible for freeing the returned array. + */ +static unsigned int * +read_dt_memory_reg(const char *file) +{ + unsigned int *buf; + size_t read, i; + FILE *f; + + f = fopen(file, "r"); + if (!f) + return NULL; + + if (VIR_ALLOC_N(buf, 4) < 0) + return NULL; + + read = fread(buf, sizeof(unsigned int), 4, f); + + if (!read && read < 4) + /* shouldn't happen */ + VIR_FREE(buf); + else for (i = 0; i < 4; i++) + buf[i] = be32toh(buf[i]); + + VIR_CLOSE(f); + return buf; +} + + +/** + * This wrapper function receives arguments to be used in a + * 'find' call to retrieve the file names that matches + * the criteria inside the /proc/device-tree dir. + * + * A 'find' call with '-iname phandle' inside /proc/device-tree + * provides more than a thousand matches. Adding '-path' to + * narrow it down further is necessary to keep the file + * listing sane. + * + * The caller is responsible to free the buffer returned by + * this function. + */ +static char * +retrieve_dt_files_pattern(const char *path_pattern, const char *file_pattern) +{ + virCommandPtr cmd = NULL; + char *output = NULL; + + cmd = virCommandNew("find"); + virCommandAddArgList(cmd, "/proc/device-tree/", "-path", path_pattern, + "-iname", file_pattern, NULL); + virCommandSetOutputBuffer(cmd, &output); + + if (virCommandRun(cmd, NULL) < 0) + VIR_FREE(output); + + virCommandFree(cmd); + return output; +} + + +/** + * Helper function that receives a listing of file names and + * calls read_dt_phandle() on each one finding for a match + * with the given phandle argument. Returns the file name if a + * match is found, NULL otherwise. 
+ */ +static char * +find_dt_file_with_phandle(char *files, int phandle) +{ + char *line, *tmp; + int ret; + + line = strtok_r(files, "\n", &tmp); + do { + ret = read_dt_phandle(line); + if (ret == phandle) + break; + } while ((line = strtok_r(NULL, "\n", &tmp)) != NULL); + + return line; +} + + +/** + * This function receives a string that represents a PCI device, + * such as '0004:04:00.0', and tells if the device is NVLink2 capable. + * + * The logic goes as follows: + * + * 1 - get the phandle of a nvlink of the device, reading the 'ibm,npu' + * attribute; + * 2 - find the device tree node of the nvlink bus using the phandle + * found in (1) + * 3 - get the phandle of the memory region of the nvlink bus + * 4 - find the device tree node of the memory region using the + * phandle found in (3) + * 5 - read the 'reg' value of the memory region. If the value of + * the second 64 bit value is 0x02 0x00, the device is attached + * to a NVLink2 bus. + * + * If any of these steps fails, the function returns false. + */ +static bool +device_is_nvlink2_capable(const char *device) +{ + char *file, *files, *tmp; + unsigned int *reg; + int phandle; + + if ((virAsprintf(&file, "/sys/bus/pci/devices/%s/of_node/ibm,npu", + device)) < 0) + return false; + + /* Find phandles of nvlinks: */ + if ((phandle = read_dt_phandle(file)) == -1) + return false; + + /* Find a DT node for the phandle found */ + files = retrieve_dt_files_pattern("*device-tree/pci*", "phandle"); + if (!files) + return false; + + if ((file = find_dt_file_with_phandle(files, phandle)) == NULL) + goto fail; + + /* Find a phandle of the GPU memory region of the device. The + * file found above ends with '/phandle' - the memory region + * of the GPU ends with '/memory-region */ + tmp = strrchr(file, '/'); + *tmp = '\0'; + file = strcat(file, "/memory-region"); + + if ((phandle = read_dt_phandle(file)) == -1) + goto fail; + + file = NULL; + VIR_FREE(files); + + /* Find the memory node for the phandle found above */ + files = retrieve_dt_files_pattern("*device-tree/memory*", "phandle"); + if (!files) + return false; + + if ((file = find_dt_file_with_phandle(files, phandle)) == NULL) + goto fail; + + /* And see its size in the second 64bit value of 'reg'. First, + * the end of the file needs to be changed from '/phandle' to + * '/reg' */ + tmp = strrchr(file, '/'); + *tmp = '\0'; + file = strcat(file, "/reg"); + + reg = read_dt_memory_reg(file); + if (reg && reg[2] == 0x20 && reg[3] == 0x00) + return true; + + fail: + VIR_FREE(files); + VIR_FREE(reg); + return false; +} + + /** * getPPC64MemLockLimitBytes: * @def: domain definition -- 2.20.1
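For context, a rough sketch of how a caller might drive this helper from the PPC64 memLockLimit code. The loop mirrors the existing usesVFIO detection, but the variable names and the exact call site are assumptions, not something this patch defines (the actual wiring is left to the final patch of the series, which is not quoted here):

    bool nvlink2Capable = false;
    size_t i;

    for (i = 0; i < def->nhostdevs; i++) {
        virDomainHostdevDefPtr dev = def->hostdevs[i];
        char addr[32];

        if (dev->mode != VIR_DOMAIN_HOSTDEV_MODE_SUBSYS ||
            dev->source.subsys.type != VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI)
            continue;

        /* format the PCI address the way device_is_nvlink2_capable() expects,
         * e.g. "0004:04:00.0"; virPCIDeviceAddressAsString() could likely be
         * used here instead of a raw snprintf() */
        snprintf(addr, sizeof(addr), "%04x:%02x:%02x.%u",
                 dev->source.subsys.u.pci.addr.domain,
                 dev->source.subsys.u.pci.addr.bus,
                 dev->source.subsys.u.pci.addr.slot,
                 dev->source.subsys.u.pci.addr.function);

        if (device_is_nvlink2_capable(addr)) {
            nvlink2Capable = true;
            break;
        }
    }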

On Tue, Mar 05, 2019 at 09:46:08AM -0300, Daniel Henrique Barboza wrote:
The NVLink2 support in QEMU implements the detection of NVLink2 capable devices by verfying the attributes of the VFIO mem region QEMU allocates for the NVIDIA GPUs. To properly allocate an adequate amount of memLock, Libvirt needs this information before a QEMU instance is even created.
An alternative is presented in this patch. Given a PCI device, we'll traverse the device tree at /proc/device-tree to check if the device has a NPU bridge, retrieve the node of the NVLink2 bus, find the memory-node that is related to the bus and see if it's a NVLink2 bus by inspecting its 'reg' value. This logic is contained inside the 'device_is_nvlink2_capable' function, which uses other new helper functions to navigate and fetch values from the device tree nodes.
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> ---
This fails with a bunch of compilation errors, please make sure you run make check and make syntax-check on each patch of the series.
src/qemu/qemu_domain.c | 194 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 194 insertions(+)
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index 77548c224c..97de5793e2 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -10343,6 +10343,200 @@ qemuDomainUpdateCurrentMemorySize(virDomainObjPtr vm) }
+/** + * Reads a phandle file and returns the phandle value. + */ +static int +read_dt_phandle(const char* file)
..we prefer camelCase naming of functions...all functions should have a prefix, e.g. vir,<driver>, etc.
+{ + unsigned int buf[1]; + size_t read; + FILE *f; + + f = fopen(file, "r"); + if (!f) + return -1; + + read = fread(buf, sizeof(unsigned int), 1, f);
We already have safe helpers that take care of such operations.
+ + if (!read) { + VIR_CLOSE(f);
You could use VIR_AUTOCLOSE for this
+ return 0; + } + + VIR_CLOSE(f); + return be32toh(buf[0]);
Is this part of gnulib? We need to make sure it's portable otherwise we'd need to make the conversion ourselves, just like for le64toh
+} + + +/** + * Reads a memory reg file and returns the first 4 int values. + * + * The caller is responsible for freeing the returned array. + */ +static unsigned int * +read_dt_memory_reg(const char *file) +{ + unsigned int *buf; + size_t read, i; + FILE *f; + + f = fopen(file, "r"); + if (!f) + return NULL; + + if (VIR_ALLOC_N(buf, 4) < 0) + return NULL; + + read = fread(buf, sizeof(unsigned int), 4, f); + + if (!read && read < 4) + /* shouldn't happen */ + VIR_FREE(buf); + else for (i = 0; i < 4; i++) + buf[i] = be32toh(buf[i]); + + VIR_CLOSE(f); + return buf;
Same comments as above...
+} + + +/** + * This wrapper function receives arguments to be used in a + * 'find' call to retrieve the file names that matches + * the criteria inside the /proc/device-tree dir. + * + * A 'find' call with '-iname phandle' inside /proc/device-tree + * provides more than a thousand matches. Adding '-path' to + * narrow it down further is necessary to keep the file + * listing sane. + * + * The caller is responsible to free the buffer returned by + * this function. + */ +static char * +retrieve_dt_files_pattern(const char *path_pattern, const char *file_pattern) +{ + virCommandPtr cmd = NULL; + char *output = NULL; + + cmd = virCommandNew("find"); + virCommandAddArgList(cmd, "/proc/device-tree/", "-path", path_pattern, + "-iname", file_pattern, NULL); + virCommandSetOutputBuffer(cmd, &output); + + if (virCommandRun(cmd, NULL) < 0) + VIR_FREE(output); + + virCommandFree(cmd); + return output; +} + + +/** + * Helper function that receives a listing of file names and + * calls read_dt_phandle() on each one finding for a match + * with the given phandle argument. Returns the file name if a + * match is found, NULL otherwise. + */ +static char * +find_dt_file_with_phandle(char *files, int phandle) +{ + char *line, *tmp; + int ret; + + line = strtok_r(files, "\n", &tmp); + do { + ret = read_dt_phandle(line); + if (ret == phandle) + break; + } while ((line = strtok_r(NULL, "\n", &tmp)) != NULL); + + return line; +} + + +/** + * This function receives a string that represents a PCI device, + * such as '0004:04:00.0', and tells if the device is NVLink2 capable. + * + * The logic goes as follows: + * + * 1 - get the phandle of a nvlink of the device, reading the 'ibm,npu' + * attribute; + * 2 - find the device tree node of the nvlink bus using the phandle + * found in (1) + * 3 - get the phandle of the memory region of the nvlink bus + * 4 - find the device tree node of the memory region using the + * phandle found in (3) + * 5 - read the 'reg' value of the memory region. If the value of + * the second 64 bit value is 0x02 0x00, the device is attached + * to a NVLink2 bus. + * + * If any of these steps fails, the function returns false. + */ +static bool +device_is_nvlink2_capable(const char *device) +{ + char *file, *files, *tmp; + unsigned int *reg; + int phandle; + + if ((virAsprintf(&file, "/sys/bus/pci/devices/%s/of_node/ibm,npu", + device)) < 0) + return false; + + /* Find phandles of nvlinks: */ + if ((phandle = read_dt_phandle(file)) == -1) + return false; + + /* Find a DT node for the phandle found */ + files = retrieve_dt_files_pattern("*device-tree/pci*", "phandle"); + if (!files) + return false; + + if ((file = find_dt_file_with_phandle(files, phandle)) == NULL) + goto fail; + + /* Find a phandle of the GPU memory region of the device. The + * file found above ends with '/phandle' - the memory region + * of the GPU ends with '/memory-region */ + tmp = strrchr(file, '/'); + *tmp = '\0'; + file = strcat(file, "/memory-region"); + + if ((phandle = read_dt_phandle(file)) == -1) + goto fail; + + file = NULL; + VIR_FREE(files); + + /* Find the memory node for the phandle found above */ + files = retrieve_dt_files_pattern("*device-tree/memory*", "phandle"); + if (!files) + return false; + + if ((file = find_dt_file_with_phandle(files, phandle)) == NULL) + goto fail; + + /* And see its size in the second 64bit value of 'reg'. 
First, + * the end of the file needs to be changed from '/phandle' to + * '/reg' */ + tmp = strrchr(file, '/'); + *tmp = '\0'; + file = strcat(file, "/reg"); + + reg = read_dt_memory_reg(file); + if (reg && reg[2] == 0x20 && reg[3] == 0x00) + return true;
This function is very complex just to find out whether a PCI device is capable of NVLink or not. Feels wrong to do it this way; I believe it would be much easier if NVIDIA exposed a sysfs attribute saying whether a PCI device supports NVLink, so that our node-device driver would take care of this during libvirtd startup and then you'd only call a single API from the PPC64 helper you're introducing in the next patch to find out whether you need the alternative formula or not.

Honestly, apart from the coding style issues, I don't like this approach, and unless there's a really compelling reason for libvirt to do it in a way which involves spawning a 'find' process because of a complex pattern and a bunch of data necessary to filter out, I'd really suggest contacting NVIDIA about this. It's the same for mdevs: NVIDIA exposes a bunch of attributes in sysfs which we're able to read.

Thinking about it a bit more, since this is NVIDIA-specific, having an NVIDIA-only sysfs attribute doesn't help the node-device driver I mentioned above. In general, we try to avoid introducing vendor-specific code, so nodedev might not be the right place to check for the attribute (because it's not universal and would require vendor-specific elements), but I think we could have an NVLink helper reading this sysfs attribute(s) elsewhere in the libvirt codebase.

Erik
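To make the inline remarks above (camelCase naming with a prefix, the safe file-reading helpers, VIR_AUTOCLOSE, be32toh() portability) more concrete, here is one possible shape for the phandle reader. This is only a sketch under the assumption that virFileReadAll() (virfile.h) and virReadBufInt32BE() (virendian.h) are acceptable here, and the function name is merely an illustration of the naming convention, not a prescribed implementation:

    static int
    qemuDomainDeviceTreeReadPhandle(const char *file)
    {
        char *buf = NULL;
        int len;
        int ret = -1;

        /* a phandle property is a single big-endian 32-bit cell; letting
         * virFileReadAll() do the read also removes the need for a manual
         * fopen()/fread()/VIR_CLOSE() (or VIR_AUTOCLOSE) sequence */
        if ((len = virFileReadAll(file, 32, &buf)) < 0)
            return -1;

        if (len >= 4)
            ret = virReadBufInt32BE(buf);

        VIR_FREE(buf);
        return ret;
    }

The same pattern would presumably apply to read_dt_memory_reg(), reading 16 bytes and converting each cell with virReadBufInt32BE() instead of be32toh().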

On 3/7/19 12:15 PM, Erik Skultety wrote:
On Tue, Mar 05, 2019 at 09:46:08AM -0300, Daniel Henrique Barboza wrote:
The NVLink2 support in QEMU implements the detection of NVLink2 capable devices by verfying the attributes of the VFIO mem region QEMU allocates for the NVIDIA GPUs. To properly allocate an adequate amount of memLock, Libvirt needs this information before a QEMU instance is even created.
An alternative is presented in this patch. Given a PCI device, we'll traverse the device tree at /proc/device-tree to check if the device has a NPU bridge, retrieve the node of the NVLink2 bus, find the memory-node that is related to the bus and see if it's a NVLink2 bus by inspecting its 'reg' value. This logic is contained inside the 'device_is_nvlink2_capable' function, which uses other new helper functions to navigate and fetch values from the device tree nodes.
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> --- This fails with a bunch of compilation errors, please make sure you run make check and make syntax-check on each patch of the series.
Ooops, forgot to follow up make syntax-check with 'make' before submitting ... my bad.
src/qemu/qemu_domain.c | 194 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 194 insertions(+)
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index 77548c224c..97de5793e2 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -10343,6 +10343,200 @@ qemuDomainUpdateCurrentMemorySize(virDomainObjPtr vm) }
+/** + * Reads a phandle file and returns the phandle value. + */ +static int +read_dt_phandle(const char* file) ..we prefer camelCase naming of functions...all functions should have a prefix, e.g. vir,<driver>, etc.
+{ + unsigned int buf[1]; + size_t read; + FILE *f; + + f = fopen(file, "r"); + if (!f) + return -1; + + read = fread(buf, sizeof(unsigned int), 1, f); We already have safe helpers that take care of such operations.
+ + if (!read) { + VIR_CLOSE(f); You could use VIR_AUTOCLOSE for this
+ return 0; + } + + VIR_CLOSE(f); + return be32toh(buf[0]); Is this part of gnulib? We need to make sure it's portable otherwise we'd need to make the conversion ourselves, just like for le64toh
+} + + +/** + * Reads a memory reg file and returns the first 4 int values. + * + * The caller is responsible for freeing the returned array. + */ +static unsigned int * +read_dt_memory_reg(const char *file) +{ + unsigned int *buf; + size_t read, i; + FILE *f; + + f = fopen(file, "r"); + if (!f) + return NULL; + + if (VIR_ALLOC_N(buf, 4) < 0) + return NULL; + + read = fread(buf, sizeof(unsigned int), 4, f); + + if (!read && read < 4) + /* shouldn't happen */ + VIR_FREE(buf); + else for (i = 0; i < 4; i++) + buf[i] = be32toh(buf[i]); + + VIR_CLOSE(f); + return buf; Same comments as above...
+} + + +/** + * This wrapper function receives arguments to be used in a + * 'find' call to retrieve the file names that matches + * the criteria inside the /proc/device-tree dir. + * + * A 'find' call with '-iname phandle' inside /proc/device-tree + * provides more than a thousand matches. Adding '-path' to + * narrow it down further is necessary to keep the file + * listing sane. + * + * The caller is responsible to free the buffer returned by + * this function. + */ +static char * +retrieve_dt_files_pattern(const char *path_pattern, const char *file_pattern) +{ + virCommandPtr cmd = NULL; + char *output = NULL; + + cmd = virCommandNew("find"); + virCommandAddArgList(cmd, "/proc/device-tree/", "-path", path_pattern, + "-iname", file_pattern, NULL); + virCommandSetOutputBuffer(cmd, &output); + + if (virCommandRun(cmd, NULL) < 0) + VIR_FREE(output); + + virCommandFree(cmd); + return output; +} + + +/** + * Helper function that receives a listing of file names and + * calls read_dt_phandle() on each one finding for a match + * with the given phandle argument. Returns the file name if a + * match is found, NULL otherwise. + */ +static char * +find_dt_file_with_phandle(char *files, int phandle) +{ + char *line, *tmp; + int ret; + + line = strtok_r(files, "\n", &tmp); + do { + ret = read_dt_phandle(line); + if (ret == phandle) + break; + } while ((line = strtok_r(NULL, "\n", &tmp)) != NULL); + + return line; +} + + +/** + * This function receives a string that represents a PCI device, + * such as '0004:04:00.0', and tells if the device is NVLink2 capable. + * + * The logic goes as follows: + * + * 1 - get the phandle of a nvlink of the device, reading the 'ibm,npu' + * attribute; + * 2 - find the device tree node of the nvlink bus using the phandle + * found in (1) + * 3 - get the phandle of the memory region of the nvlink bus + * 4 - find the device tree node of the memory region using the + * phandle found in (3) + * 5 - read the 'reg' value of the memory region. If the value of + * the second 64 bit value is 0x02 0x00, the device is attached + * to a NVLink2 bus. + * + * If any of these steps fails, the function returns false. + */ +static bool +device_is_nvlink2_capable(const char *device) +{ + char *file, *files, *tmp; + unsigned int *reg; + int phandle; + + if ((virAsprintf(&file, "/sys/bus/pci/devices/%s/of_node/ibm,npu", + device)) < 0) + return false; + + /* Find phandles of nvlinks: */ + if ((phandle = read_dt_phandle(file)) == -1) + return false; + + /* Find a DT node for the phandle found */ + files = retrieve_dt_files_pattern("*device-tree/pci*", "phandle"); + if (!files) + return false; + + if ((file = find_dt_file_with_phandle(files, phandle)) == NULL) + goto fail; + + /* Find a phandle of the GPU memory region of the device. The + * file found above ends with '/phandle' - the memory region + * of the GPU ends with '/memory-region */ + tmp = strrchr(file, '/'); + *tmp = '\0'; + file = strcat(file, "/memory-region"); + + if ((phandle = read_dt_phandle(file)) == -1) + goto fail; + + file = NULL; + VIR_FREE(files); + + /* Find the memory node for the phandle found above */ + files = retrieve_dt_files_pattern("*device-tree/memory*", "phandle"); + if (!files) + return false; + + if ((file = find_dt_file_with_phandle(files, phandle)) == NULL) + goto fail; + + /* And see its size in the second 64bit value of 'reg'. 
First, + * the end of the file needs to be changed from '/phandle' to + * '/reg' */ + tmp = strrchr(file, '/'); + *tmp = '\0'; + file = strcat(file, "/reg"); + + reg = read_dt_memory_reg(file); + if (reg && reg[2] == 0x20 && reg[3] == 0x00) + return true; This function is very complex just to find out whether a PCI device is capable of NVLink or not. Feels wrong to do it this way, I believe it would be much easier if NVIDIA exposed a sysfs attribute saying whether a PCI device supports NVLink so that our node-device driver would take care of this during libvirtd startup and then you'd only call a single API from the PPC64 helper you're introducing in the next patch to find out whether you need the alternative formula or not.
Honestly, apart from the coding style issues, I don't like this approach and unless there's a really compelling reason for libvirt to do it in a way which involves spawning a 'find' process because of a complex pattern and a bunch of data necessary to filter out, I'd really suggest contacting NVIDIA about this. It's the same for mdevs, NVIDIA exposes a bunch of attributes in sysfs which we're able to read.
I'll contact NVIDIA and see if there is an easier way (a sysfs attribute, for example) and, if there isn't, try to provide a plausible reason to justify this detection code.
Thinking about it a bit more, since this is NVIDIA-specific, having an NVIDIA-only sysfs attribute doesn't help the node-device driver I mentioned above. In general, we try to avoid introducing vendor-specific code, so nodedev might not be the right place to check for the attribute (because it's not universal and would require vendor-specific elements), but I think we could have an NVLink helper reading this sysfs attribute(s) elsewhere in libvirt codebase.
Erik

On 3/7/19 7:39 AM, Daniel Henrique Barboza wrote:
On 3/7/19 12:15 PM, Erik Skultety wrote:
On Tue, Mar 05, 2019 at 09:46:08AM -0300, Daniel Henrique Barboza wrote:
The NVLink2 support in QEMU implements the detection of NVLink2 capable devices by verfying the attributes of the VFIO mem region QEMU allocates for the NVIDIA GPUs. To properly allocate an adequate amount of memLock, Libvirt needs this information before a QEMU instance is even created.
An alternative is presented in this patch. Given a PCI device, we'll traverse the device tree at /proc/device-tree to check if the device has a NPU bridge, retrieve the node of the NVLink2 bus, find the memory-node that is related to the bus and see if it's a NVLink2 bus by inspecting its 'reg' value. This logic is contained inside the 'device_is_nvlink2_capable' function, which uses other new helper functions to navigate and fetch values from the device tree nodes.
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> --- This fails with a bunch of compilation errors, please make sure you run make check and make syntax-check on each patch of the series.
Ooops, forgot to follow up make syntax-check with 'make' before submitting ... my bad.
src/qemu/qemu_domain.c | 194 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 194 insertions(+)
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index 77548c224c..97de5793e2 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -10343,6 +10343,200 @@ qemuDomainUpdateCurrentMemorySize(virDomainObjPtr vm) }
+/** + * Reads a phandle file and returns the phandle value. + */ +static int +read_dt_phandle(const char* file) ..we prefer camelCase naming of functions...all functions should have a prefix, e.g. vir,<driver>, etc.
+{ + unsigned int buf[1]; + size_t read; + FILE *f; + + f = fopen(file, "r"); + if (!f) + return -1; + + read = fread(buf, sizeof(unsigned int), 1, f); We already have safe helpers that take care of such operations.
+ + if (!read) { + VIR_CLOSE(f); You could use VIR_AUTOCLOSE for this
+ return 0; + } + + VIR_CLOSE(f); + return be32toh(buf[0]); Is this part of gnulib? We need to make sure it's portable otherwise we'd need to make the conversion ourselves, just like for le64toh
+} + + +/** + * Reads a memory reg file and returns the first 4 int values. + * + * The caller is responsible for freeing the returned array. + */ +static unsigned int * +read_dt_memory_reg(const char *file) +{ + unsigned int *buf; + size_t read, i; + FILE *f; + + f = fopen(file, "r"); + if (!f) + return NULL; + + if (VIR_ALLOC_N(buf, 4) < 0) + return NULL; + + read = fread(buf, sizeof(unsigned int), 4, f); + + if (!read && read < 4) + /* shouldn't happen */ + VIR_FREE(buf); + else for (i = 0; i < 4; i++) + buf[i] = be32toh(buf[i]); + + VIR_CLOSE(f); + return buf; Same comments as above...
+} + + +/** + * This wrapper function receives arguments to be used in a + * 'find' call to retrieve the file names that matches + * the criteria inside the /proc/device-tree dir. + * + * A 'find' call with '-iname phandle' inside /proc/device-tree + * provides more than a thousand matches. Adding '-path' to + * narrow it down further is necessary to keep the file + * listing sane. + * + * The caller is responsible to free the buffer returned by + * this function. + */ +static char * +retrieve_dt_files_pattern(const char *path_pattern, const char *file_pattern) +{ + virCommandPtr cmd = NULL; + char *output = NULL; + + cmd = virCommandNew("find"); + virCommandAddArgList(cmd, "/proc/device-tree/", "-path", path_pattern, + "-iname", file_pattern, NULL); + virCommandSetOutputBuffer(cmd, &output); + + if (virCommandRun(cmd, NULL) < 0) + VIR_FREE(output); + + virCommandFree(cmd); + return output; +} + + +/** + * Helper function that receives a listing of file names and + * calls read_dt_phandle() on each one finding for a match + * with the given phandle argument. Returns the file name if a + * match is found, NULL otherwise. + */ +static char * +find_dt_file_with_phandle(char *files, int phandle) +{ + char *line, *tmp; + int ret; + + line = strtok_r(files, "\n", &tmp); + do { + ret = read_dt_phandle(line); + if (ret == phandle) + break; + } while ((line = strtok_r(NULL, "\n", &tmp)) != NULL); + + return line; +} + + +/** + * This function receives a string that represents a PCI device, + * such as '0004:04:00.0', and tells if the device is NVLink2 capable. + * + * The logic goes as follows: + * + * 1 - get the phandle of a nvlink of the device, reading the 'ibm,npu' + * attribute; + * 2 - find the device tree node of the nvlink bus using the phandle + * found in (1) + * 3 - get the phandle of the memory region of the nvlink bus + * 4 - find the device tree node of the memory region using the + * phandle found in (3) + * 5 - read the 'reg' value of the memory region. If the value of + * the second 64 bit value is 0x02 0x00, the device is attached + * to a NVLink2 bus. + * + * If any of these steps fails, the function returns false. + */ +static bool +device_is_nvlink2_capable(const char *device) +{ + char *file, *files, *tmp; + unsigned int *reg; + int phandle; + + if ((virAsprintf(&file, "/sys/bus/pci/devices/%s/of_node/ibm,npu", + device)) < 0) + return false; + + /* Find phandles of nvlinks: */ + if ((phandle = read_dt_phandle(file)) == -1) + return false; + + /* Find a DT node for the phandle found */ + files = retrieve_dt_files_pattern("*device-tree/pci*", "phandle"); + if (!files) + return false; + + if ((file = find_dt_file_with_phandle(files, phandle)) == NULL) + goto fail; + + /* Find a phandle of the GPU memory region of the device. The + * file found above ends with '/phandle' - the memory region + * of the GPU ends with '/memory-region */ + tmp = strrchr(file, '/'); + *tmp = '\0'; + file = strcat(file, "/memory-region"); + + if ((phandle = read_dt_phandle(file)) == -1) + goto fail; + + file = NULL; + VIR_FREE(files); + + /* Find the memory node for the phandle found above */ + files = retrieve_dt_files_pattern("*device-tree/memory*", "phandle"); + if (!files) + return false; + + if ((file = find_dt_file_with_phandle(files, phandle)) == NULL) + goto fail; + + /* And see its size in the second 64bit value of 'reg'. 
First, + * the end of the file needs to be changed from '/phandle' to + * '/reg' */ + tmp = strrchr(file, '/'); + *tmp = '\0'; + file = strcat(file, "/reg"); + + reg = read_dt_memory_reg(file); + if (reg && reg[2] == 0x20 && reg[3] == 0x00) + return true; This function is very complex just to find out whether a PCI device is capable of NVLink or not. Feels wrong to do it this way, I believe it would be much easier if NVIDIA exposed a sysfs attribute saying whether a PCI device supports NVLink so that our node-device driver would take care of this during libvirtd startup and then you'd only call a single API from the PPC64 helper you're introducing in the next patch to find out whether you need the alternative formula or not.
Honestly, apart from the coding style issues, I don't like this approach and unless there's a really compelling reason for libvirt to do it in a way which involves spawning a 'find' process because of a complex pattern and a bunch of data necessary to filter out, I'd really suggest contacting NVIDIA about this. It's the same for mdevs, NVIDIA exposes a bunch of attributes in sysfs which we're able to read.
I'll contact NVIDIA and see if there is an easier way (a sysfs attribute, for example) and, if doesn't, try to provide a plausibe reason to justify this detection code.
Sorry for the delay in responding. The problem is that all the V100 GPUs support NVLink, but it may or may not be connected up. This is detected at runtime during GPU initialization, which seems like much too heavy of an operation to perform as part of passthrough initialization. And that's why vfio-pci pieces rely on device tree information to figure it out. Alexey, would it be possible for vfio-pci to export the information in a way more friendly to libvirt?
Thinking about it a bit more, since this is NVIDIA-specific, having an NVIDIA-only sysfs attribute doesn't help the node-device driver I mentioned above. In general, we try to avoid introducing vendor-specific code, so nodedev might not be the right place to check for the attribute (because it's not universal and would require vendor-specific elements), but I think we could have an NVLink helper reading this sysfs attribute(s) elsewhere in libvirt codebase.
Erik

On 08/03/2019 04:51, Piotr Jaroszynski wrote:
On 3/7/19 7:39 AM, Daniel Henrique Barboza wrote:
On 3/7/19 12:15 PM, Erik Skultety wrote:
On Tue, Mar 05, 2019 at 09:46:08AM -0300, Daniel Henrique Barboza wrote:
The NVLink2 support in QEMU implements the detection of NVLink2 capable devices by verfying the attributes of the VFIO mem region QEMU allocates for the NVIDIA GPUs. To properly allocate an adequate amount of memLock, Libvirt needs this information before a QEMU instance is even created.
An alternative is presented in this patch. Given a PCI device, we'll traverse the device tree at /proc/device-tree to check if the device has a NPU bridge, retrieve the node of the NVLink2 bus, find the memory-node that is related to the bus and see if it's a NVLink2 bus by inspecting its 'reg' value. This logic is contained inside the 'device_is_nvlink2_capable' function, which uses other new helper functions to navigate and fetch values from the device tree nodes.
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> --- This fails with a bunch of compilation errors, please make sure you run make check and make syntax-check on each patch of the series.
Ooops, forgot to follow up make syntax-check with 'make' before submitting ... my bad.
src/qemu/qemu_domain.c | 194 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 194 insertions(+)
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index 77548c224c..97de5793e2 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -10343,6 +10343,200 @@ qemuDomainUpdateCurrentMemorySize(virDomainObjPtr vm) }
+/** + * Reads a phandle file and returns the phandle value. + */ +static int +read_dt_phandle(const char* file) ..we prefer camelCase naming of functions...all functions should have a prefix, e.g. vir,<driver>, etc.
+{ + unsigned int buf[1]; + size_t read; + FILE *f; + + f = fopen(file, "r"); + if (!f) + return -1; + + read = fread(buf, sizeof(unsigned int), 1, f); We already have safe helpers that take care of such operations.
+ + if (!read) { + VIR_CLOSE(f); You could use VIR_AUTOCLOSE for this
+ return 0; + } + + VIR_CLOSE(f); + return be32toh(buf[0]); Is this part of gnulib? We need to make sure it's portable otherwise we'd need to make the conversion ourselves, just like for le64toh
+} + + +/** + * Reads a memory reg file and returns the first 4 int values. + * + * The caller is responsible for freeing the returned array. + */ +static unsigned int * +read_dt_memory_reg(const char *file) +{ + unsigned int *buf; + size_t read, i; + FILE *f; + + f = fopen(file, "r"); + if (!f) + return NULL; + + if (VIR_ALLOC_N(buf, 4) < 0) + return NULL; + + read = fread(buf, sizeof(unsigned int), 4, f); + + if (!read && read < 4) + /* shouldn't happen */ + VIR_FREE(buf); + else for (i = 0; i < 4; i++) + buf[i] = be32toh(buf[i]); + + VIR_CLOSE(f); + return buf; Same comments as above...
+} + + +/** + * This wrapper function receives arguments to be used in a + * 'find' call to retrieve the file names that matches + * the criteria inside the /proc/device-tree dir. + * + * A 'find' call with '-iname phandle' inside /proc/device-tree + * provides more than a thousand matches. Adding '-path' to + * narrow it down further is necessary to keep the file + * listing sane. + * + * The caller is responsible to free the buffer returned by + * this function. + */ +static char * +retrieve_dt_files_pattern(const char *path_pattern, const char *file_pattern) +{ + virCommandPtr cmd = NULL; + char *output = NULL; + + cmd = virCommandNew("find"); + virCommandAddArgList(cmd, "/proc/device-tree/", "-path", path_pattern, + "-iname", file_pattern, NULL); + virCommandSetOutputBuffer(cmd, &output); + + if (virCommandRun(cmd, NULL) < 0) + VIR_FREE(output); + + virCommandFree(cmd); + return output; +} + + +/** + * Helper function that receives a listing of file names and + * calls read_dt_phandle() on each one finding for a match + * with the given phandle argument. Returns the file name if a + * match is found, NULL otherwise. + */ +static char * +find_dt_file_with_phandle(char *files, int phandle) +{ + char *line, *tmp; + int ret; + + line = strtok_r(files, "\n", &tmp); + do { + ret = read_dt_phandle(line); + if (ret == phandle) + break; + } while ((line = strtok_r(NULL, "\n", &tmp)) != NULL); + + return line; +} + + +/** + * This function receives a string that represents a PCI device, + * such as '0004:04:00.0', and tells if the device is NVLink2 capable. + * + * The logic goes as follows: + * + * 1 - get the phandle of a nvlink of the device, reading the 'ibm,npu' + * attribute; + * 2 - find the device tree node of the nvlink bus using the phandle + * found in (1) + * 3 - get the phandle of the memory region of the nvlink bus + * 4 - find the device tree node of the memory region using the + * phandle found in (3) + * 5 - read the 'reg' value of the memory region. If the value of + * the second 64 bit value is 0x02 0x00, the device is attached + * to a NVLink2 bus. + * + * If any of these steps fails, the function returns false. + */ +static bool +device_is_nvlink2_capable(const char *device) +{ + char *file, *files, *tmp; + unsigned int *reg; + int phandle; + + if ((virAsprintf(&file, "/sys/bus/pci/devices/%s/of_node/ibm,npu", + device)) < 0) + return false; + + /* Find phandles of nvlinks: */ + if ((phandle = read_dt_phandle(file)) == -1) + return false; + + /* Find a DT node for the phandle found */ + files = retrieve_dt_files_pattern("*device-tree/pci*", "phandle"); + if (!files) + return false; + + if ((file = find_dt_file_with_phandle(files, phandle)) == NULL) + goto fail; + + /* Find a phandle of the GPU memory region of the device. The + * file found above ends with '/phandle' - the memory region + * of the GPU ends with '/memory-region */ + tmp = strrchr(file, '/'); + *tmp = '\0'; + file = strcat(file, "/memory-region"); + + if ((phandle = read_dt_phandle(file)) == -1) + goto fail; + + file = NULL; + VIR_FREE(files); + + /* Find the memory node for the phandle found above */ + files = retrieve_dt_files_pattern("*device-tree/memory*", "phandle"); + if (!files) + return false; + + if ((file = find_dt_file_with_phandle(files, phandle)) == NULL) + goto fail; + + /* And see its size in the second 64bit value of 'reg'. 
First, + * the end of the file needs to be changed from '/phandle' to + * '/reg' */ + tmp = strrchr(file, '/'); + *tmp = '\0'; + file = strcat(file, "/reg"); + + reg = read_dt_memory_reg(file); + if (reg && reg[2] == 0x20 && reg[3] == 0x00) + return true; This function is very complex just to find out whether a PCI device is capable of NVLink or not. Feels wrong to do it this way, I believe it would be much easier if NVIDIA exposed a sysfs attribute saying whether a PCI device supports NVLink so that our node-device driver would take care of this during libvirtd startup and then you'd only call a single API from the PPC64 helper you're introducing in the next patch to find out whether you need the alternative formula or not.
Honestly, apart from the coding style issues, I don't like this approach and unless there's a really compelling reason for libvirt to do it in a way which involves spawning a 'find' process because of a complex pattern and a bunch of data necessary to filter out, I'd really suggest contacting NVIDIA about this. It's the same for mdevs, NVIDIA exposes a bunch of attributes in sysfs which we're able to read.
I'll contact NVIDIA and see if there is an easier way (a sysfs attribute, for example) and, if doesn't, try to provide a plausibe reason to justify this detection code.
Sorry for the delay in responding. The problem is that all the V100 GPUs support NVLink, but it may or may not be connected up. This is detected at runtime during GPU initialization, which seems like much too heavy of an operation to perform as part of passthrough initialization. And that's why vfio-pci pieces rely on device tree information to figure it out.
Alexey, would it be possible for vfio-pci to export the information in a way more friendly to libvirt?
The only information needed here is whether a specific GPU has RAM or not. This can easily be found from the device tree, imho quite friendly already. VFIO gets to know about these new capabilities when the VFIO PCI device is opened, and we would rather avoid going that far in libvirt (open a VFIO container, attach a group, get a vfio-pci fd from it, enumerate regions - 2 PCI resets on the way, delays, meh).

btw the first "find" for "ibm,npu" can be skipped - NVLinks have to be passed too or the entire RAM thing won't work. The "find" for the memory node can also be dropped really - if the NVLink bridge OF node has "memory-region", then VFIO will most likely expose RAM and QEMU will try using it anyway.
Thinking about it a bit more, since this is NVIDIA-specific, having an NVIDIA-only sysfs attribute doesn't help the node-device driver I mentioned above. In general, we try to avoid introducing vendor-specific code, so nodedev might not be the right place to check for the attribute (because it's not universal and would require vendor-specific elements), but I think we could have an NVLink helper reading this sysfs attribute(s) elsewhere in libvirt codebase.
Erik
-- Alexey

...
Sorry for the delay in responding. The problem is that all the V100 GPUs support NVLink, but it may or may not be connected up. This is detected at runtime during GPU initialization, which seems like much too heavy of an operation to perform as part of passthrough initialization. And that's why vfio-pci pieces rely on device tree information to figure it out.
Alexey, would it be possible for vfio-pci to export the information in a way more friendly to libvirt?
The only information needed here is whether a specific GPU has RAM or not. This can easily be found from the device-tree, imho quite friendly already. VFIO gets to know about these new capabilities when the VFIO PCI device is opened, and we rather want to avoid going that far in libvirt (open a VFIO container, attach a group, get a vfio-pci fd from it, enumerate regions - 2 PCI resets on the way, delays, meh).
Agreed, we talked about this when dealing with other stuff already and we really don't want libvirt to need to open a VFIO container just to query a few attributes/settings which it then would process or pass directly to QEMU, we'd therefore need a different way consumable to libvirt...
btw the first "find" for "ibm,npu" can be skipped - NVLinks have to be passed too or the entire RAM thing won't work. "find" for the memory node can also be dropped really - if NVLink bridge OF node has "memory-region", then VFIO will most likely expose RAM and QEMU will try using it anyway.
I'm not sure I follow ^this; can you be more specific about how you imagine libvirt detecting it?

Thanks,
Erik

On 3/7/19 10:29 PM, Alexey Kardashevskiy wrote:
On 08/03/2019 04:51, Piotr Jaroszynski wrote:
On 3/7/19 7:39 AM, Daniel Henrique Barboza wrote:
On 3/7/19 12:15 PM, Erik Skultety wrote:
On Tue, Mar 05, 2019 at 09:46:08AM -0300, Daniel Henrique Barboza wrote:
The NVLink2 support in QEMU implements the detection of NVLink2 capable devices by verfying the attributes of the VFIO mem region QEMU allocates for the NVIDIA GPUs. To properly allocate an adequate amount of memLock, Libvirt needs this information before a QEMU instance is even created.
An alternative is presented in this patch. Given a PCI device, we'll traverse the device tree at /proc/device-tree to check if the device has a NPU bridge, retrieve the node of the NVLink2 bus, find the memory-node that is related to the bus and see if it's a NVLink2 bus by inspecting its 'reg' value. This logic is contained inside the 'device_is_nvlink2_capable' function, which uses other new helper functions to navigate and fetch values from the device tree nodes.
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> --- This fails with a bunch of compilation errors, please make sure you run make check and make syntax-check on each patch of the series. Ooops, forgot to follow up make syntax-check with 'make' before submitting ... my bad.
src/qemu/qemu_domain.c | 194 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 194 insertions(+)
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index 77548c224c..97de5793e2 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -10343,6 +10343,200 @@ qemuDomainUpdateCurrentMemorySize(virDomainObjPtr vm) }
+/** + * Reads a phandle file and returns the phandle value. + */ +static int +read_dt_phandle(const char* file) ..we prefer camelCase naming of functions...all functions should have a prefix, e.g. vir,<driver>, etc.
+{ + unsigned int buf[1]; + size_t read; + FILE *f; + + f = fopen(file, "r"); + if (!f) + return -1; + + read = fread(buf, sizeof(unsigned int), 1, f); We already have safe helpers that take care of such operations.
+ + if (!read) { + VIR_CLOSE(f); You could use VIR_AUTOCLOSE for this
+ return 0; + } + + VIR_CLOSE(f); + return be32toh(buf[0]); Is this part of gnulib? We need to make sure it's portable otherwise we'd need to make the conversion ourselves, just like for le64toh
+} + + +/** + * Reads a memory reg file and returns the first 4 int values. + * + * The caller is responsible for freeing the returned array. + */ +static unsigned int * +read_dt_memory_reg(const char *file) +{ + unsigned int *buf; + size_t read, i; + FILE *f; + + f = fopen(file, "r"); + if (!f) + return NULL; + + if (VIR_ALLOC_N(buf, 4) < 0) + return NULL; + + read = fread(buf, sizeof(unsigned int), 4, f); + + if (!read && read < 4) + /* shouldn't happen */ + VIR_FREE(buf); + else for (i = 0; i < 4; i++) + buf[i] = be32toh(buf[i]); + + VIR_CLOSE(f); + return buf; Same comments as above...
+} + + +/** + * This wrapper function receives arguments to be used in a + * 'find' call to retrieve the file names that matches + * the criteria inside the /proc/device-tree dir. + * + * A 'find' call with '-iname phandle' inside /proc/device-tree + * provides more than a thousand matches. Adding '-path' to + * narrow it down further is necessary to keep the file + * listing sane. + * + * The caller is responsible to free the buffer returned by + * this function. + */ +static char * +retrieve_dt_files_pattern(const char *path_pattern, const char *file_pattern) +{ + virCommandPtr cmd = NULL; + char *output = NULL; + + cmd = virCommandNew("find"); + virCommandAddArgList(cmd, "/proc/device-tree/", "-path", path_pattern, + "-iname", file_pattern, NULL); + virCommandSetOutputBuffer(cmd, &output); + + if (virCommandRun(cmd, NULL) < 0) + VIR_FREE(output); + + virCommandFree(cmd); + return output; +} + + +/** + * Helper function that receives a listing of file names and + * calls read_dt_phandle() on each one finding for a match + * with the given phandle argument. Returns the file name if a + * match is found, NULL otherwise. + */ +static char * +find_dt_file_with_phandle(char *files, int phandle) +{ + char *line, *tmp; + int ret; + + line = strtok_r(files, "\n", &tmp); + do { + ret = read_dt_phandle(line); + if (ret == phandle) + break; + } while ((line = strtok_r(NULL, "\n", &tmp)) != NULL); + + return line; +} + + +/** + * This function receives a string that represents a PCI device, + * such as '0004:04:00.0', and tells if the device is NVLink2 capable. + * + * The logic goes as follows: + * + * 1 - get the phandle of a nvlink of the device, reading the 'ibm,npu' + * attribute; + * 2 - find the device tree node of the nvlink bus using the phandle + * found in (1) + * 3 - get the phandle of the memory region of the nvlink bus + * 4 - find the device tree node of the memory region using the + * phandle found in (3) + * 5 - read the 'reg' value of the memory region. If the value of + * the second 64 bit value is 0x02 0x00, the device is attached + * to a NVLink2 bus. + * + * If any of these steps fails, the function returns false. + */ +static bool +device_is_nvlink2_capable(const char *device) +{ + char *file, *files, *tmp; + unsigned int *reg; + int phandle; + + if ((virAsprintf(&file, "/sys/bus/pci/devices/%s/of_node/ibm,npu", + device)) < 0) + return false; + + /* Find phandles of nvlinks: */ + if ((phandle = read_dt_phandle(file)) == -1) + return false; + + /* Find a DT node for the phandle found */ + files = retrieve_dt_files_pattern("*device-tree/pci*", "phandle"); + if (!files) + return false; + + if ((file = find_dt_file_with_phandle(files, phandle)) == NULL) + goto fail; + + /* Find a phandle of the GPU memory region of the device. The + * file found above ends with '/phandle' - the memory region + * of the GPU ends with '/memory-region */ + tmp = strrchr(file, '/'); + *tmp = '\0'; + file = strcat(file, "/memory-region"); + + if ((phandle = read_dt_phandle(file)) == -1) + goto fail; + + file = NULL; + VIR_FREE(files); + + /* Find the memory node for the phandle found above */ + files = retrieve_dt_files_pattern("*device-tree/memory*", "phandle"); + if (!files) + return false; + + if ((file = find_dt_file_with_phandle(files, phandle)) == NULL) + goto fail; + + /* And see its size in the second 64bit value of 'reg'. 
First, + * the end of the file needs to be changed from '/phandle' to + * '/reg' */ + tmp = strrchr(file, '/'); + *tmp = '\0'; + file = strcat(file, "/reg"); + + reg = read_dt_memory_reg(file); + if (reg && reg[2] == 0x20 && reg[3] == 0x00) + return true; This function is very complex just to find out whether a PCI device is capable of NVLink or not. Feels wrong to do it this way, I believe it would be much easier if NVIDIA exposed a sysfs attribute saying whether a PCI device supports NVLink so that our node-device driver would take care of this during libvirtd startup and then you'd only call a single API from the PPC64 helper you're introducing in the next patch to find out whether you need the alternative formula or not.
Honestly, apart from the coding style issues, I don't like this approach and unless there's a really compelling reason for libvirt to do it in a way which involves spawning a 'find' process because of a complex pattern and a bunch of data necessary to filter out, I'd really suggest contacting NVIDIA about this. It's the same for mdevs, NVIDIA exposes a bunch of attributes in sysfs which we're able to read. I'll contact NVIDIA and see if there is an easier way (a sysfs attribute, for example) and, if it doesn't, try to provide a plausible reason to justify this detection code.
Sorry for the delay in responding. The problem is that all the V100 GPUs support NVLink, but it may or may not be connected up. This is detected at runtime during GPU initialization, which seems like much too heavy of an operation to perform as part of passthrough initialization. And that's why vfio-pci pieces rely on device tree information to figure it out.
Alexey, would it be possible for vfio-pci to export the information in a way more friendly to libvirt?
The only information needed here is whether a specific GPU has RAM or not. This can easily be found from the device-tree, imho quite friendly already. VFIO gets to know about these new capabilities when the VFIO PCI device is opened, and we rather want to avoid going that far in libvirt (open a VFIO container, attach a group, get a vfio-pci fd from it, enumerate regions - 2 PCI resets on the way, delays, meh).
btw the first "find" for "ibm,npu" can be skipped - NVLinks have to be passed too or the entire RAM thing won't work. "find" for the memory node can also be dropped really - if NVLink bridge OF node has "memory-region", then VFIO will most likely expose RAM and QEMU will try using it anyway.
Hmmm, I am not sure I understood it completely... seeing the of_nodes exposed in sysfs of one of the NVLink buses we have:
/sys/bus/pci/devices/0007:00:00.0/of_node$ ls class-code ibm,gpu ibm,nvlink-speed memory-region reg device-id ibm,loc-code ibm,pci-config-space-type name revision-id ibm,device-tgt-addr ibm,nvlink interrupts phandle vendor-id
We can make a safe assumption that the V100 GPU will always be passed through with at least one NVLink2 bus. How many hops do we need to assert that a given device is a NVLink2 bus from its of_node info? For example, can we say something like:
"The device node of this device has ibm,gpu and ibm,nvlink and ibm,nvlink-speed and ibm,nvlink-speed is 0x8, so this is a NVLink2 bus. Since it is not possible to pass through the bus alone, there is a V100 GPU in this same IOMMU group. So this is a NVLink2 passthrough scenario"
Or perhaps:
"It has ibm,gpu and ibm,nvlink and the of_node of its memory-region has a reg value of 0x20, thus this is a nvlink2 bus and ..."
Both alternatives are way simpler than what I'm doing in this patch. I am not sure if they are valid though.
Thinking about it a bit more, since this is NVIDIA-specific, having an NVIDIA-only sysfs attribute doesn't help the node-device driver I mentioned above. In general, we try to avoid introducing vendor-specific code, so nodedev might not be the right place to check for the attribute (because it's not universal and would require vendor-specific elements), but I think we could have an NVLink helper reading this sysfs attribute(s) elsewhere in libvirt codebase.
Erik

On 12/03/2019 09:15, Daniel Henrique Barboza wrote:
On 3/7/19 10:29 PM, Alexey Kardashevskiy wrote:
On 08/03/2019 04:51, Piotr Jaroszynski wrote:
On 3/7/19 7:39 AM, Daniel Henrique Barboza wrote:
On 3/7/19 12:15 PM, Erik Skultety wrote:
On Tue, Mar 05, 2019 at 09:46:08AM -0300, Daniel Henrique Barboza wrote:
The NVLink2 support in QEMU implements the detection of NVLink2 capable devices by verifying the attributes of the VFIO mem region QEMU allocates for the NVIDIA GPUs. To properly allocate an adequate amount of memLock, Libvirt needs this information before a QEMU instance is even created.
An alternative is presented in this patch. Given a PCI device, we'll traverse the device tree at /proc/device-tree to check if the device has a NPU bridge, retrieve the node of the NVLink2 bus, find the memory-node that is related to the bus and see if it's a NVLink2 bus by inspecting its 'reg' value. This logic is contained inside the 'device_is_nvlink2_capable' function, which uses other new helper functions to navigate and fetch values from the device tree nodes.
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> --- This fails with a bunch of compilation errors, please make sure you run make check and make syntax-check on each patch of the series. Ooops, forgot to follow up make syntax-check with 'make' before submitting ... my bad.
src/qemu/qemu_domain.c | 194 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 194 insertions(+)
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index 77548c224c..97de5793e2 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -10343,6 +10343,200 @@ qemuDomainUpdateCurrentMemorySize(virDomainObjPtr vm) }
+/** + * Reads a phandle file and returns the phandle value. + */ +static int +read_dt_phandle(const char* file) ..we prefer camelCase naming of functions...all functions should have a prefix, e.g. vir,<driver>, etc.
+{ + unsigned int buf[1]; + size_t read; + FILE *f; + + f = fopen(file, "r"); + if (!f) + return -1; + + read = fread(buf, sizeof(unsigned int), 1, f); We already have safe helpers that take care of such operations.
+ + if (!read) { + VIR_CLOSE(f); You could use VIR_AUTOCLOSE for this
+ return 0; + } + + VIR_CLOSE(f); + return be32toh(buf[0]); Is this part of gnulib? We need to make sure it's portable otherwise we'd need to make the conversion ourselves, just like for le64toh
+} + + +/** + * Reads a memory reg file and returns the first 4 int values. + * + * The caller is responsible for freeing the returned array. + */ +static unsigned int * +read_dt_memory_reg(const char *file) +{ + unsigned int *buf; + size_t read, i; + FILE *f; + + f = fopen(file, "r"); + if (!f) + return NULL; + + if (VIR_ALLOC_N(buf, 4) < 0) + return NULL; + + read = fread(buf, sizeof(unsigned int), 4, f); + + if (!read && read < 4) + /* shouldn't happen */ + VIR_FREE(buf); + else for (i = 0; i < 4; i++) + buf[i] = be32toh(buf[i]); + + VIR_CLOSE(f); + return buf; Same comments as above...
+} + + +/** + * This wrapper function receives arguments to be used in a + * 'find' call to retrieve the file names that matches + * the criteria inside the /proc/device-tree dir. + * + * A 'find' call with '-iname phandle' inside /proc/device-tree + * provides more than a thousand matches. Adding '-path' to + * narrow it down further is necessary to keep the file + * listing sane. + * + * The caller is responsible to free the buffer returned by + * this function. + */ +static char * +retrieve_dt_files_pattern(const char *path_pattern, const char *file_pattern) +{ + virCommandPtr cmd = NULL; + char *output = NULL; + + cmd = virCommandNew("find"); + virCommandAddArgList(cmd, "/proc/device-tree/", "-path", path_pattern, + "-iname", file_pattern, NULL); + virCommandSetOutputBuffer(cmd, &output); + + if (virCommandRun(cmd, NULL) < 0) + VIR_FREE(output); + + virCommandFree(cmd); + return output; +} + + +/** + * Helper function that receives a listing of file names and + * calls read_dt_phandle() on each one finding for a match + * with the given phandle argument. Returns the file name if a + * match is found, NULL otherwise. + */ +static char * +find_dt_file_with_phandle(char *files, int phandle) +{ + char *line, *tmp; + int ret; + + line = strtok_r(files, "\n", &tmp); + do { + ret = read_dt_phandle(line); + if (ret == phandle) + break; + } while ((line = strtok_r(NULL, "\n", &tmp)) != NULL); + + return line; +} + + +/** + * This function receives a string that represents a PCI device, + * such as '0004:04:00.0', and tells if the device is NVLink2 capable. + * + * The logic goes as follows: + * + * 1 - get the phandle of a nvlink of the device, reading the 'ibm,npu' + * attribute; + * 2 - find the device tree node of the nvlink bus using the phandle + * found in (1) + * 3 - get the phandle of the memory region of the nvlink bus + * 4 - find the device tree node of the memory region using the + * phandle found in (3) + * 5 - read the 'reg' value of the memory region. If the value of + * the second 64 bit value is 0x02 0x00, the device is attached + * to a NVLink2 bus. + * + * If any of these steps fails, the function returns false. + */ +static bool +device_is_nvlink2_capable(const char *device) +{ + char *file, *files, *tmp; + unsigned int *reg; + int phandle; + + if ((virAsprintf(&file, "/sys/bus/pci/devices/%s/of_node/ibm,npu", + device)) < 0) + return false; + + /* Find phandles of nvlinks: */ + if ((phandle = read_dt_phandle(file)) == -1) + return false; + + /* Find a DT node for the phandle found */ + files = retrieve_dt_files_pattern("*device-tree/pci*", "phandle"); + if (!files) + return false; + + if ((file = find_dt_file_with_phandle(files, phandle)) == NULL) + goto fail; + + /* Find a phandle of the GPU memory region of the device. The + * file found above ends with '/phandle' - the memory region + * of the GPU ends with '/memory-region */ + tmp = strrchr(file, '/'); + *tmp = '\0'; + file = strcat(file, "/memory-region"); + + if ((phandle = read_dt_phandle(file)) == -1) + goto fail; + + file = NULL; + VIR_FREE(files); + + /* Find the memory node for the phandle found above */ + files = retrieve_dt_files_pattern("*device-tree/memory*", "phandle"); + if (!files) + return false; + + if ((file = find_dt_file_with_phandle(files, phandle)) == NULL) + goto fail; + + /* And see its size in the second 64bit value of 'reg'. 
First, + * the end of the file needs to be changed from '/phandle' to + * '/reg' */ + tmp = strrchr(file, '/'); + *tmp = '\0'; + file = strcat(file, "/reg"); + + reg = read_dt_memory_reg(file); + if (reg && reg[2] == 0x20 && reg[3] == 0x00) + return true; This function is very complex just to find out whether a PCI device is capable of NVLink or not. Feels wrong to do it this way, I believe it would be much easier if NVIDIA exposed a sysfs attribute saying whether a PCI device supports NVLink so that our node-device driver would take care of this during libvirtd startup and then you'd only call a single API from the PPC64 helper you're introducing in the next patch to find out whether you need the alternative formula or not.
Honestly, apart from the coding style issues, I don't like this approach and unless there's a really compelling reason for libvirt to do it in a way which involves spawning a 'find' process because of a complex pattern and a bunch of data necessary to filter out, I'd really suggest contacting NVIDIA about this. It's the same for mdevs, NVIDIA exposes a bunch of attributes in sysfs which we're able to read. I'll contact NVIDIA and see if there is an easier way (a sysfs attribute, for example) and, if it doesn't, try to provide a plausible reason to justify this detection code.
Sorry for the delay in responding. The problem is that all the V100 GPUs support NVLink, but it may or may not be connected up. This is detected at runtime during GPU initialization, which seems like much too heavy of an operation to perform as part of passthrough initialization. And that's why vfio-pci pieces rely on device tree information to figure it out.
Alexey, would it be possible for vfio-pci to export the information in a way more friendly to libvirt?
The only information needed here is whether a specific GPU has RAM or not. This can easily be found from the device-tree, imho quite friendly already. VFIO gets to know about these new capabilities when the VFIO PCI device is opened, and we rather want to avoid going that far in libvirt (open a VFIO container, attach a group, get a vfio-pci fd from it, enumerate regions - 2 PCI resets on the way, delays, meh).
btw the first "find" for "ibm,npu" can be skipped - NVLinks have to be passed too or the entire RAM thing won't work. "find" for the memory node can also be dropped really - if NVLink bridge OF node has "memory-region", then VFIO will most likely expose RAM and QEMU will try using it anyway.
Hmmm, I am not sure I understood it completely... seeing the of_nodes exposed in sysfs of one of the NVLink buses we have:
/sys/bus/pci/devices/0007:00:00.0/of_node$ ls class-code ibm,gpu ibm,nvlink-speed memory-region reg device-id ibm,loc-code ibm,pci-config-space-type name revision-id ibm,device-tgt-addr ibm,nvlink interrupts phandle vendor-id
We can make a safe assumption that the V100 GPU will always be passed through with at least one NVLink2 bus.
Assumption #1: if we want GPU RAM to be available in the guest, an NVLink bridge must be passed through. No bridge - no RAM - no rlimit adjustment.
How many hops do we need to assert that a given device is a NVLink2 bus from its of_node info?
For example, can we say something like:
"The device node of this device has ibm,gpu and ibm,nvlink and ibm,nvlink-speed and ibm,nvlink-speed is 0x8, so this is a NVLink2 bus. Since it is not
ibm,nvlink-speed can be different (I saw 9), you cannot rely on this.
possible to pass through the bus alone, there is a V100 GPU in this same IOMMU group. So this is a NVLink2 passthrough scenario"
Or perhaps:
"It has ibm,gpu and ibm,nvlink and the of_node of its memory-region has a reg value of 0x20 , thus this is a nvlink2 bus and ....."
0x20 is the size of the GPU RAM window, which might change in different revisions of the hw as well.
Both alternatives are way simpler than what I'm doing in this patch. I am not sure if they are valid though.
Assumption #2: if the nvlink2 bridge's DT node has memory-region, then the GPU will try to use its RAM. The adjustment you'll have to make won't depend on anything as the existing GPU RAM placement is so that whatever you can possibly throw at QEMU will fit the 128TiB window and we cannot make the window smaller anyway (the next size would be a lower power of two - 64TiB - and GPU RAM lives just above that). I could do better of course and allow adjusting the limit to cover let's say 66TiB instead of 128TiB, it just seems to be unnecessary complication (we already allocate only the required amount of TCE entries on demand).
Thinking about it a bit more, since this is NVIDIA-specific, having an NVIDIA-only sysfs attribute doesn't help the node-device driver I mentioned above. In general, we try to avoid introducing vendor-specific code, so nodedev might not be the right place to check for the attribute (because it's not universal and would require vendor-specific elements), but I think we could have an NVLink helper reading this sysfs attribute(s) elsewhere in libvirt codebase.
Erik
-- Alexey

On 3/12/19 1:05 AM, Alexey Kardashevskiy wrote:
On 12/03/2019 09:15, Daniel Henrique Barboza wrote:
On 3/7/19 10:29 PM, Alexey Kardashevskiy wrote:
On 3/7/19 12:15 PM, Erik Skultety wrote:
On Tue, Mar 05, 2019 at 09:46:08AM -0300, Daniel Henrique Barboza wrote: > The NVLink2 support in QEMU implements the detection of NVLink2 > capable devices by verfying the attributes of the VFIO mem region > QEMU allocates for the NVIDIA GPUs. To properly allocate an > adequate amount of memLock, Libvirt needs this information before > a QEMU instance is even created. > > An alternative is presented in this patch. Given a PCI device, > we'll traverse the device tree at /proc/device-tree to check if > the device has a NPU bridge, retrieve the node of the NVLink2 bus, > find the memory-node that is related to the bus and see if it's a > NVLink2 bus by inspecting its 'reg' value. This logic is contained > inside the 'device_is_nvlink2_capable' function, which uses other > new helper functions to navigate and fetch values from the device > tree nodes. > > Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> > --- This fails with a bunch of compilation errors, please make sure you run make check and make syntax-check on each patch of the series. Ooops, forgot to follow up make syntax-check with 'make' before submitting ... my bad.
> src/qemu/qemu_domain.c | 194 > +++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 194 insertions(+) > > diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c > index 77548c224c..97de5793e2 100644 > --- a/src/qemu/qemu_domain.c > +++ b/src/qemu/qemu_domain.c > @@ -10343,6 +10343,200 @@ > qemuDomainUpdateCurrentMemorySize(virDomainObjPtr vm) > } > > > +/** > + * Reads a phandle file and returns the phandle value. > + */ > +static int > +read_dt_phandle(const char* file) ..we prefer camelCase naming of functions...all functions should have a prefix, e.g. vir,<driver>, etc.
> +{ > + unsigned int buf[1]; > + size_t read; > + FILE *f; > + > + f = fopen(file, "r"); > + if (!f) > + return -1; > + > + read = fread(buf, sizeof(unsigned int), 1, f); We already have safe helpers that take care of such operations.
> + > + if (!read) { > + VIR_CLOSE(f); You could use VIR_AUTOCLOSE for this
> + return 0; > + } > + > + VIR_CLOSE(f); > + return be32toh(buf[0]); Is this part of gnulib? We need to make sure it's portable otherwise we'd need to make the conversion ourselves, just like for le64toh
> +} > + > + > +/** > + * Reads a memory reg file and returns the first 4 int values. > + * > + * The caller is responsible for freeing the returned array. > + */ > +static unsigned int * > +read_dt_memory_reg(const char *file) > +{ > + unsigned int *buf; > + size_t read, i; > + FILE *f; > + > + f = fopen(file, "r"); > + if (!f) > + return NULL; > + > + if (VIR_ALLOC_N(buf, 4) < 0) > + return NULL; > + > + read = fread(buf, sizeof(unsigned int), 4, f); > + > + if (!read && read < 4) > + /* shouldn't happen */ > + VIR_FREE(buf); > + else for (i = 0; i < 4; i++) > + buf[i] = be32toh(buf[i]); > + > + VIR_CLOSE(f); > + return buf; Same comments as above...
> +} > + > + > +/** > + * This wrapper function receives arguments to be used in a > + * 'find' call to retrieve the file names that matches > + * the criteria inside the /proc/device-tree dir. > + * > + * A 'find' call with '-iname phandle' inside /proc/device-tree > + * provides more than a thousand matches. Adding '-path' to > + * narrow it down further is necessary to keep the file > + * listing sane. > + * > + * The caller is responsible to free the buffer returned by > + * this function. > + */ > +static char * > +retrieve_dt_files_pattern(const char *path_pattern, const char > *file_pattern) > +{ > + virCommandPtr cmd = NULL; > + char *output = NULL; > + > + cmd = virCommandNew("find"); > + virCommandAddArgList(cmd, "/proc/device-tree/", "-path", > path_pattern, > + "-iname", file_pattern, NULL); > + virCommandSetOutputBuffer(cmd, &output); > + > + if (virCommandRun(cmd, NULL) < 0) > + VIR_FREE(output); > + > + virCommandFree(cmd); > + return output; > +} > + > + > +/** > + * Helper function that receives a listing of file names and > + * calls read_dt_phandle() on each one finding for a match > + * with the given phandle argument. Returns the file name if a > + * match is found, NULL otherwise. > + */ > +static char * > +find_dt_file_with_phandle(char *files, int phandle) > +{ > + char *line, *tmp; > + int ret; > + > + line = strtok_r(files, "\n", &tmp); > + do { > + ret = read_dt_phandle(line); > + if (ret == phandle) > + break; > + } while ((line = strtok_r(NULL, "\n", &tmp)) != NULL); > + > + return line; > +} > + > + > +/** > + * This function receives a string that represents a PCI device, > + * such as '0004:04:00.0', and tells if the device is NVLink2 > capable. > + * > + * The logic goes as follows: > + * > + * 1 - get the phandle of a nvlink of the device, reading the > 'ibm,npu' > + * attribute; > + * 2 - find the device tree node of the nvlink bus using the phandle > + * found in (1) > + * 3 - get the phandle of the memory region of the nvlink bus > + * 4 - find the device tree node of the memory region using the > + * phandle found in (3) > + * 5 - read the 'reg' value of the memory region. If the value of > + * the second 64 bit value is 0x02 0x00, the device is attached > + * to a NVLink2 bus. > + * > + * If any of these steps fails, the function returns false. > + */ > +static bool > +device_is_nvlink2_capable(const char *device) > +{ > + char *file, *files, *tmp; > + unsigned int *reg; > + int phandle; > + > + if ((virAsprintf(&file, > "/sys/bus/pci/devices/%s/of_node/ibm,npu", > + device)) < 0) > + return false; > + > + /* Find phandles of nvlinks: */ > + if ((phandle = read_dt_phandle(file)) == -1) > + return false; > + > + /* Find a DT node for the phandle found */ > + files = retrieve_dt_files_pattern("*device-tree/pci*", > "phandle"); > + if (!files) > + return false; > + > + if ((file = find_dt_file_with_phandle(files, phandle)) == NULL) > + goto fail; > + > + /* Find a phandle of the GPU memory region of the device. 
The > + * file found above ends with '/phandle' - the memory region > + * of the GPU ends with '/memory-region */ > + tmp = strrchr(file, '/'); > + *tmp = '\0'; > + file = strcat(file, "/memory-region"); > + > + if ((phandle = read_dt_phandle(file)) == -1) > + goto fail; > + > + file = NULL; > + VIR_FREE(files); > + > + /* Find the memory node for the phandle found above */ > + files = retrieve_dt_files_pattern("*device-tree/memory*", > "phandle"); > + if (!files) > + return false; > + > + if ((file = find_dt_file_with_phandle(files, phandle)) == NULL) > + goto fail; > + > + /* And see its size in the second 64bit value of 'reg'. First, > + * the end of the file needs to be changed from '/phandle' to > + * '/reg' */ > + tmp = strrchr(file, '/'); > + *tmp = '\0'; > + file = strcat(file, "/reg"); > + > + reg = read_dt_memory_reg(file); > + if (reg && reg[2] == 0x20 && reg[3] == 0x00) > + return true; This function is very complex just to find out whether a PCI device is capable of NVLink or not. Feels wrong to do it this way, I believe it would be much easier if NVIDIA exposed a sysfs attribute saying whether a PCI device supports NVLink so that our node-device driver would take care of this during libvirtd startup and then you'd only call a single API from the PPC64 helper you're introducing in the next patch to find out whether you need the alternative formula or not.
Honestly, apart from the coding style issues, I don't like this approach and unless there's a really compelling reason for libvirt to do it in a way which involves spawning a 'find' process because of a complex pattern and a bunch of data necessary to filter out, I'd really suggest contacting NVIDIA about this. It's the same for mdevs, NVIDIA exposes a bunch of attributes in sysfs which we're able to read. I'll contact NVIDIA and see if there is an easier way (a sysfs attribute, for example) and, if doesn't, try to provide a plausibe reason to justify this detection code. Sorry for the delay in responding. The problem is that all the V100 GPUs support NVLink, but it may or may not be connected up. This is detected at runtime during GPU initialization, which seems like much too heavy of an operation to perform as part of passthrough initialization. And
On 3/7/19 7:39 AM, Daniel Henrique Barboza wrote:
that's why vfio-pci pieces rely on device tree information to figure it out.
Alexey, would it be possible for vfio-pci to export the information in a way more friendly to libvirt? The only information needed here is whether a specific GPU has RAM or not. This can easily be found from the device-tree, imho quite friendly already. VFIO gets to know about these new capabilities when the VFIO PCI device is opened, and we rather want to avoid going that far in
On 08/03/2019 04:51, Piotr Jaroszynski wrote:
libvirt (open a VFIO container, attach a group, get a vfio-pci fd from it, enumerate regions - 2 PCI resets on the way, delays, meh).
btw the first "find" for "ibm,npu" can be skipped - NVLinks have to be passed too or the entire RAM thing won't work. "find" for the memory node can also be dropped really - if NVLink bridge OF node has "memory-region", then VFIO will most likely expose RAM and QEMU will try using it anyway.
Hmmm, I am not sure I understood it completely... seeing the of_nodes exposed in sysfs of one of the NVLink buses we have:
/sys/bus/pci/devices/0007:00:00.0/of_node$ ls class-code ibm,gpu ibm,nvlink-speed memory-region reg device-id ibm,loc-code ibm,pci-config-space-type name revision-id ibm,device-tgt-addr ibm,nvlink interrupts phandle vendor-id
We can make a safe assumption that the V100 GPU will always be passed through with at least one NVLink2 bus.
Assumption #1: if we want GPU RAM to be available in the guest, an NVLink bridge must be passed through. No bridge - no RAM - no rlimit adjustment.
How many hops do we need to assert that a given device is a NVLink2 bus from its of_node info?
For example, can we say something like:
"The device node of this device has ibm,gpu and ibm,nvlink and ibm,nvlink-speed and ibm,nvlink-speed is 0x8, so this is a NVLink2 bus. Since it is not
ibm,nvlink-speed can be different (I saw 9), you cannot rely on this.
possible to pass through the bus alone, there is a V100 GPU in this same IOMMU group. So this is a NVLink2 passthrough scenario"
Or perhaps:
"It has ibm,gpu and ibm,nvlink and the of_node of its memory-region has a reg value of 0x20 , thus this is a nvlink2 bus and ....."
0x20 is a size of the GPU RAM window which might change in different revisions of the hw as well.
Both alternatives are way simpler than what I'm doing in this patch. I am not sure if they are valid though.
Assumption #2: if the nvlink2 bridge's DT node has memory-region, then the GPU will try to use its RAM. The adjustment you'll have to make won't depend on anything as the existing GPU RAM placement is so that whatever you can possibly throw at QEMU will fit 128TiB window and we cannot make the window smaller anyway (next size would lower power of two - 64TiB - and GPU RAM lives just above that).
Ok, so, from the of_node of an unknown device that has been passed through to a VM, if the DT node has:
ibm,gpu ibm,nvlink memory-region
Is that enough to justify the rlimit adjustment for NVLink2?
I could do better of course and allow adjusting the limit to cover let's say 66TiB instead of 128TiB, it just seems to be unnecessary complication (we already allocate only required amount of TCE entries on demand).
Thinking about it a bit more, since this is NVIDIA-specific, having an NVIDIA-only sysfs attribute doesn't help the node-device driver I mentioned above. In general, we try to avoid introducing vendor-specific code, so nodedev might not be the right place to check for the attribute (because it's not universal and would require vendor-specific elements), but I think we could have an NVLink helper reading this sysfs attribute(s) elsewhere in libvirt codebase.
Erik

On 12/03/2019 20:39, Daniel Henrique Barboza wrote:
On 3/12/19 1:05 AM, Alexey Kardashevskiy wrote:
On 12/03/2019 09:15, Daniel Henrique Barboza wrote:
On 3/7/19 10:29 PM, Alexey Kardashevskiy wrote:
On 3/7/19 12:15 PM, Erik Skultety wrote: > On Tue, Mar 05, 2019 at 09:46:08AM -0300, Daniel Henrique Barboza > wrote: >> The NVLink2 support in QEMU implements the detection of NVLink2 >> capable devices by verfying the attributes of the VFIO mem region >> QEMU allocates for the NVIDIA GPUs. To properly allocate an >> adequate amount of memLock, Libvirt needs this information before >> a QEMU instance is even created. >> >> An alternative is presented in this patch. Given a PCI device, >> we'll traverse the device tree at /proc/device-tree to check if >> the device has a NPU bridge, retrieve the node of the NVLink2 bus, >> find the memory-node that is related to the bus and see if it's a >> NVLink2 bus by inspecting its 'reg' value. This logic is contained >> inside the 'device_is_nvlink2_capable' function, which uses other >> new helper functions to navigate and fetch values from the device >> tree nodes. >> >> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> >> --- > This fails with a bunch of compilation errors, please make sure you > run make > check and make syntax-check on each patch of the series. Ooops, forgot to follow up make syntax-check with 'make' before submitting ... my bad.
>> src/qemu/qemu_domain.c | 194 >> +++++++++++++++++++++++++++++++++++++++++ >> 1 file changed, 194 insertions(+) >> >> diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c >> index 77548c224c..97de5793e2 100644 >> --- a/src/qemu/qemu_domain.c >> +++ b/src/qemu/qemu_domain.c >> @@ -10343,6 +10343,200 @@ >> qemuDomainUpdateCurrentMemorySize(virDomainObjPtr vm) >> } >> >> >> +/** >> + * Reads a phandle file and returns the phandle value. >> + */ >> +static int >> +read_dt_phandle(const char* file) > ..we prefer camelCase naming of functions...all functions should > have a prefix, > e.g. vir,<driver>, etc. > >> +{ >> + unsigned int buf[1]; >> + size_t read; >> + FILE *f; >> + >> + f = fopen(file, "r"); >> + if (!f) >> + return -1; >> + >> + read = fread(buf, sizeof(unsigned int), 1, f); > We already have safe helpers that take care of such operations. > >> + >> + if (!read) { >> + VIR_CLOSE(f); > You could use VIR_AUTOCLOSE for this > >> + return 0; >> + } >> + >> + VIR_CLOSE(f); >> + return be32toh(buf[0]); > Is this part of gnulib? We need to make sure it's portable > otherwise we'd need > to make the conversion ourselves, just like for le64toh > >> +} >> + >> + >> +/** >> + * Reads a memory reg file and returns the first 4 int values. >> + * >> + * The caller is responsible for freeing the returned array. >> + */ >> +static unsigned int * >> +read_dt_memory_reg(const char *file) >> +{ >> + unsigned int *buf; >> + size_t read, i; >> + FILE *f; >> + >> + f = fopen(file, "r"); >> + if (!f) >> + return NULL; >> + >> + if (VIR_ALLOC_N(buf, 4) < 0) >> + return NULL; >> + >> + read = fread(buf, sizeof(unsigned int), 4, f); >> + >> + if (!read && read < 4) >> + /* shouldn't happen */ >> + VIR_FREE(buf); >> + else for (i = 0; i < 4; i++) >> + buf[i] = be32toh(buf[i]); >> + >> + VIR_CLOSE(f); >> + return buf; > Same comments as above... > >> +} >> + >> + >> +/** >> + * This wrapper function receives arguments to be used in a >> + * 'find' call to retrieve the file names that matches >> + * the criteria inside the /proc/device-tree dir. >> + * >> + * A 'find' call with '-iname phandle' inside /proc/device-tree >> + * provides more than a thousand matches. Adding '-path' to >> + * narrow it down further is necessary to keep the file >> + * listing sane. >> + * >> + * The caller is responsible to free the buffer returned by >> + * this function. >> + */ >> +static char * >> +retrieve_dt_files_pattern(const char *path_pattern, const char >> *file_pattern) >> +{ >> + virCommandPtr cmd = NULL; >> + char *output = NULL; >> + >> + cmd = virCommandNew("find"); >> + virCommandAddArgList(cmd, "/proc/device-tree/", "-path", >> path_pattern, >> + "-iname", file_pattern, NULL); >> + virCommandSetOutputBuffer(cmd, &output); >> + >> + if (virCommandRun(cmd, NULL) < 0) >> + VIR_FREE(output); >> + >> + virCommandFree(cmd); >> + return output; >> +} >> + >> + >> +/** >> + * Helper function that receives a listing of file names and >> + * calls read_dt_phandle() on each one finding for a match >> + * with the given phandle argument. Returns the file name if a >> + * match is found, NULL otherwise. 
>> + */ >> +static char * >> +find_dt_file_with_phandle(char *files, int phandle) >> +{ >> + char *line, *tmp; >> + int ret; >> + >> + line = strtok_r(files, "\n", &tmp); >> + do { >> + ret = read_dt_phandle(line); >> + if (ret == phandle) >> + break; >> + } while ((line = strtok_r(NULL, "\n", &tmp)) != NULL); >> + >> + return line; >> +} >> + >> + >> +/** >> + * This function receives a string that represents a PCI device, >> + * such as '0004:04:00.0', and tells if the device is NVLink2 >> capable. >> + * >> + * The logic goes as follows: >> + * >> + * 1 - get the phandle of a nvlink of the device, reading the >> 'ibm,npu' >> + * attribute; >> + * 2 - find the device tree node of the nvlink bus using the >> phandle >> + * found in (1) >> + * 3 - get the phandle of the memory region of the nvlink bus >> + * 4 - find the device tree node of the memory region using the >> + * phandle found in (3) >> + * 5 - read the 'reg' value of the memory region. If the value of >> + * the second 64 bit value is 0x02 0x00, the device is attached >> + * to a NVLink2 bus. >> + * >> + * If any of these steps fails, the function returns false. >> + */ >> +static bool >> +device_is_nvlink2_capable(const char *device) >> +{ >> + char *file, *files, *tmp; >> + unsigned int *reg; >> + int phandle; >> + >> + if ((virAsprintf(&file, >> "/sys/bus/pci/devices/%s/of_node/ibm,npu", >> + device)) < 0) >> + return false; >> + >> + /* Find phandles of nvlinks: */ >> + if ((phandle = read_dt_phandle(file)) == -1) >> + return false; >> + >> + /* Find a DT node for the phandle found */ >> + files = retrieve_dt_files_pattern("*device-tree/pci*", >> "phandle"); >> + if (!files) >> + return false; >> + >> + if ((file = find_dt_file_with_phandle(files, phandle)) == >> NULL) >> + goto fail; >> + >> + /* Find a phandle of the GPU memory region of the device. The >> + * file found above ends with '/phandle' - the memory region >> + * of the GPU ends with '/memory-region */ >> + tmp = strrchr(file, '/'); >> + *tmp = '\0'; >> + file = strcat(file, "/memory-region"); >> + >> + if ((phandle = read_dt_phandle(file)) == -1) >> + goto fail; >> + >> + file = NULL; >> + VIR_FREE(files); >> + >> + /* Find the memory node for the phandle found above */ >> + files = retrieve_dt_files_pattern("*device-tree/memory*", >> "phandle"); >> + if (!files) >> + return false; >> + >> + if ((file = find_dt_file_with_phandle(files, phandle)) == >> NULL) >> + goto fail; >> + >> + /* And see its size in the second 64bit value of 'reg'. First, >> + * the end of the file needs to be changed from '/phandle' to >> + * '/reg' */ >> + tmp = strrchr(file, '/'); >> + *tmp = '\0'; >> + file = strcat(file, "/reg"); >> + >> + reg = read_dt_memory_reg(file); >> + if (reg && reg[2] == 0x20 && reg[3] == 0x00) >> + return true; > This function is very complex just to find out whether a PCI device > is capable > of NVLink or not. Feels wrong to do it this way, I believe it would > be much > easier if NVIDIA exposed a sysfs attribute saying whether a PCI > device supports > NVLink so that our node-device driver would take care of this > during libvirtd > startup and then you'd only call a single API from the PPC64 helper > you're > introducing in the next patch to find out whether you need the > alternative > formula or not. 
> > Honestly, apart from the coding style issues, I don't like this > approach and > unless there's a really compelling reason for libvirt to do it in a > way which > involves spawning a 'find' process because of a complex pattern and > a bunch of > data necessary to filter out, I'd really suggest contacting NVIDIA > about this. > It's the same for mdevs, NVIDIA exposes a bunch of attributes in > sysfs which > we're able to read. I'll contact NVIDIA and see if there is an easier way (a sysfs attribute, for example) and, if doesn't, try to provide a plausibe reason to justify this detection code. Sorry for the delay in responding. The problem is that all the V100 GPUs support NVLink, but it may or may not be connected up. This is detected at runtime during GPU initialization, which seems like much too heavy of an operation to perform as part of passthrough initialization. And
On 3/7/19 7:39 AM, Daniel Henrique Barboza wrote:
that's why vfio-pci pieces rely on device tree information to figure it out.
Alexey, would it be possible for vfio-pci to export the information in a way more friendly to libvirt? The only information needed here is whether a specific GPU has RAM or not. This can easily be found from the device-tree, imho quite friendly already. VFIO gets to know about these new capabilities when the VFIO PCI device is opened, and we rather want to avoid going that far in
On 08/03/2019 04:51, Piotr Jaroszynski wrote:
libvirt (open a VFIO container, attach a group, get a vfio-pci fd from it, enumerate regions - 2 PCI resets on the way, delays, meh).
btw the first "find" for "ibm,npu" can be skipped - NVLinks have to be passed too or the entire RAM thing won't work. "find" for the memory node can also be dropped really - if NVLink bridge OF node has "memory-region", then VFIO will most likely expose RAM and QEMU will try using it anyway.
Hmmm, I am not sure I understood it completely... seeing the of_nodes exposed in sysfs of one of the NVLink buses we have:
/sys/bus/pci/devices/0007:00:00.0/of_node$ ls class-code ibm,gpu ibm,nvlink-speed memory-region reg device-id ibm,loc-code ibm,pci-config-space-type name revision-id ibm,device-tgt-addr ibm,nvlink interrupts phandle vendor-id
We can make a safe assumption that the V100 GPU will always be passed through with at least one NVLink2 bus.
Assumption #1: if we want GPU RAM to be available in the guest, an NVLink bridge must be passed through. No bridge - no RAM - no rlimit adjustment.
How many hops do we need to assert that a given device is a NVLink2 bus from its of_node info?
For example, can we say something like:
"The device node of this device has ibm,gpu and ibm,nvlink and ibm,nvlink-speed and ibm,nvlink-speed is 0x8, so this is a NVLink2 bus. Since it is not
ibm,nvlink-speed can be different (I saw 9), you cannot rely on this.
possible to pass through the bus alone, there is a V100 GPU in this same IOMMU group. So this is a NVLink2 passthrough scenario"
Or perhaps:
"It has ibm,gpu and ibm,nvlink and the of_node of its memory-region has a reg value of 0x20 , thus this is a nvlink2 bus and ....."
0x20 is a size of the GPU RAM window which might change in different revisions of the hw as well.
Both alternatives are way simpler than what I'm doing in this patch. I am not sure if they are valid though.
Assumption #2: if the nvlink2 bridge's DT node has memory-region, then the GPU will try to use its RAM. The adjustment you'll have to make won't depend on anything as the existing GPU RAM placement is so that whatever you can possibly throw at QEMU will fit 128TiB window and we cannot make the window smaller anyway (next size would lower power of two - 64TiB - and GPU RAM lives just above that).
Ok, so, from the of_node of an unknown device that has been passed through to a VM, if the DT node has:
ibm,gpu ibm,nvlink memory-region
Is that enough to justify the rlimit adjustment for NVLink2?
This is the idea, yes. To be on the safe side, you might want to limit the check to the IBM vendor ID (not even sure about the device id) and PCI bridges only.
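If so, the whole detection presumably collapses into something like this (rough, untested sketch; the function name is invented and the IBM vendor-id / bridge-class filtering mentioned above is left out for brevity):

    /* Sketch: the bridge's of_node must expose all of ibm,gpu,
     * ibm,nvlink and memory-region before we grow the lock limit. */
    static bool
    qemuDomainPCIDeviceIsNVLink2Bridge(const char *pciAddr)
    {
        const char *props[] = { "ibm,gpu", "ibm,nvlink", "memory-region" };
        size_t i;

        for (i = 0; i < ARRAY_CARDINALITY(props); i++) {
            VIR_AUTOFREE(char *) path = NULL;

            if (virAsprintf(&path, "/sys/bus/pci/devices/%s/of_node/%s",
                            pciAddr, props[i]) < 0)
                return false;

            if (!virFileExists(path))
                return false;
        }

        return true;
    }

The PPC64 memLock helper could then just call this for every VFIO hostdev and switch to the NVLink2 formula when it returns true, dropping the 'find' calls entirely.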
I could do better of course and allow adjusting the limit to cover let's say 66TiB instead of 128TiB, it just seems to be unnecessary complication (we already allocate only required amount of TCE entries on demand).
> Thinking about it a bit more, since this is NVIDIA-specific, > having an > NVIDIA-only sysfs attribute doesn't help the node-device driver I > mentioned > above. In general, we try to avoid introducing vendor-specific > code, so nodedev > might not be the right place to check for the attribute (because > it's not > universal and would require vendor-specific elements), but I think > we could > have an NVLink helper reading this sysfs attribute(s) elsewhere in > libvirt > codebase. > > Erik
-- Alexey

On 3/7/19 12:39 PM, Daniel Henrique Barboza wrote:
On 3/7/19 12:15 PM, Erik Skultety wrote:
On Tue, Mar 05, 2019 at 09:46:08AM -0300, Daniel Henrique Barboza wrote:
The NVLink2 support in QEMU implements the detection of NVLink2 capable devices by verifying the attributes of the VFIO mem region QEMU allocates for the NVIDIA GPUs. To properly allocate an adequate amount of memLock, Libvirt needs this information before a QEMU instance is even created.
An alternative is presented in this patch. Given a PCI device, we'll traverse the device tree at /proc/device-tree to check if the device has a NPU bridge, retrieve the node of the NVLink2 bus, find the memory-node that is related to the bus and see if it's a NVLink2 bus by inspecting its 'reg' value. This logic is contained inside the 'device_is_nvlink2_capable' function, which uses other new helper functions to navigate and fetch values from the device tree nodes.
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> --- This fails with a bunch of compilation errors, please make sure you run make check and make syntax-check on each patch of the series.
Ooops, forgot to follow up make syntax-check with 'make' before submitting ... my bad.
Erik, I've fixed the errors and applied all the review suggestions, but it appears that the detection code can be simplified. I'll wait to see if we can get that logic shorter before resending this patch.
Also, this patch won't compile because the NVLink2 detection function isn't being used anywhere - it will only be used in patch 4. One alternative to make it compilable is to put a #pragma GCC on the function to tell the compiler to ignore the Wunused-function warning and then remove it in patch 4. I can also declare it public, then turn it back to static in patch 4. Or we can ignore this warning and assume that this error is expected.
Let me know how you want to proceed, I am fine with any of the alternatives.
Daniel
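P.S.: just to illustrate, the least invasive temporary change is probably to tag the definition itself so GCC stops warning about it until patch 4 adds the caller - either with the #pragma pair around it or with libvirt's ATTRIBUTE_UNUSED (untested sketch):

    /* temporary, to be dropped again in patch 4 once the caller exists */
    static bool ATTRIBUTE_UNUSED
    device_is_nvlink2_capable(const char *device)
    {
        /* existing body unchanged */
        ...
    }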
src/qemu/qemu_domain.c | 194 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 194 insertions(+)
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index 77548c224c..97de5793e2 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -10343,6 +10343,200 @@ qemuDomainUpdateCurrentMemorySize(virDomainObjPtr vm) }
+/** + * Reads a phandle file and returns the phandle value. + */ +static int +read_dt_phandle(const char* file) ..we prefer camelCase naming of functions...all functions should have a prefix, e.g. vir,<driver>, etc.
+{ + unsigned int buf[1]; + size_t read; + FILE *f; + + f = fopen(file, "r"); + if (!f) + return -1; + + read = fread(buf, sizeof(unsigned int), 1, f); We already have safe helpers that take care of such operations.
+ + if (!read) { + VIR_CLOSE(f); You could use VIR_AUTOCLOSE for this
+ return 0; + } + + VIR_CLOSE(f); + return be32toh(buf[0]); Is this part of gnulib? We need to make sure it's portable otherwise we'd need to make the conversion ourselves, just like for le64toh
+} + + +/** + * Reads a memory reg file and returns the first 4 int values. + * + * The caller is responsible for freeing the returned array. + */ +static unsigned int * +read_dt_memory_reg(const char *file) +{ + unsigned int *buf; + size_t read, i; + FILE *f; + + f = fopen(file, "r"); + if (!f) + return NULL; + + if (VIR_ALLOC_N(buf, 4) < 0) + return NULL; + + read = fread(buf, sizeof(unsigned int), 4, f); + + if (!read && read < 4) + /* shouldn't happen */ + VIR_FREE(buf); + else for (i = 0; i < 4; i++) + buf[i] = be32toh(buf[i]); + + VIR_CLOSE(f); + return buf; Same comments as above...
+} + + +/** + * This wrapper function receives arguments to be used in a + * 'find' call to retrieve the file names that matches + * the criteria inside the /proc/device-tree dir. + * + * A 'find' call with '-iname phandle' inside /proc/device-tree + * provides more than a thousand matches. Adding '-path' to + * narrow it down further is necessary to keep the file + * listing sane. + * + * The caller is responsible to free the buffer returned by + * this function. + */ +static char * +retrieve_dt_files_pattern(const char *path_pattern, const char *file_pattern) +{ + virCommandPtr cmd = NULL; + char *output = NULL; + + cmd = virCommandNew("find"); + virCommandAddArgList(cmd, "/proc/device-tree/", "-path", path_pattern, + "-iname", file_pattern, NULL); + virCommandSetOutputBuffer(cmd, &output); + + if (virCommandRun(cmd, NULL) < 0) + VIR_FREE(output); + + virCommandFree(cmd); + return output; +} + + +/** + * Helper function that receives a listing of file names and + * calls read_dt_phandle() on each one finding for a match + * with the given phandle argument. Returns the file name if a + * match is found, NULL otherwise. + */ +static char * +find_dt_file_with_phandle(char *files, int phandle) +{ + char *line, *tmp; + int ret; + + line = strtok_r(files, "\n", &tmp); + do { + ret = read_dt_phandle(line); + if (ret == phandle) + break; + } while ((line = strtok_r(NULL, "\n", &tmp)) != NULL); + + return line; +} + + +/** + * This function receives a string that represents a PCI device, + * such as '0004:04:00.0', and tells if the device is NVLink2 capable. + * + * The logic goes as follows: + * + * 1 - get the phandle of a nvlink of the device, reading the 'ibm,npu' + * attribute; + * 2 - find the device tree node of the nvlink bus using the phandle + * found in (1) + * 3 - get the phandle of the memory region of the nvlink bus + * 4 - find the device tree node of the memory region using the + * phandle found in (3) + * 5 - read the 'reg' value of the memory region. If the value of + * the second 64 bit value is 0x02 0x00, the device is attached + * to a NVLink2 bus. + * + * If any of these steps fails, the function returns false. + */ +static bool +device_is_nvlink2_capable(const char *device) +{ + char *file, *files, *tmp; + unsigned int *reg; + int phandle; + + if ((virAsprintf(&file, "/sys/bus/pci/devices/%s/of_node/ibm,npu", + device)) < 0) + return false; + + /* Find phandles of nvlinks: */ + if ((phandle = read_dt_phandle(file)) == -1) + return false; + + /* Find a DT node for the phandle found */ + files = retrieve_dt_files_pattern("*device-tree/pci*", "phandle"); + if (!files) + return false; + + if ((file = find_dt_file_with_phandle(files, phandle)) == NULL) + goto fail; + + /* Find a phandle of the GPU memory region of the device. The + * file found above ends with '/phandle' - the memory region + * of the GPU ends with '/memory-region */ + tmp = strrchr(file, '/'); + *tmp = '\0'; + file = strcat(file, "/memory-region"); + + if ((phandle = read_dt_phandle(file)) == -1) + goto fail; + + file = NULL; + VIR_FREE(files); + + /* Find the memory node for the phandle found above */ + files = retrieve_dt_files_pattern("*device-tree/memory*", "phandle"); + if (!files) + return false; + + if ((file = find_dt_file_with_phandle(files, phandle)) == NULL) + goto fail; + + /* And see its size in the second 64bit value of 'reg'. 
First, + * the end of the file needs to be changed from '/phandle' to + * '/reg' */ + tmp = strrchr(file, '/'); + *tmp = '\0'; + file = strcat(file, "/reg"); + + reg = read_dt_memory_reg(file); + if (reg && reg[2] == 0x20 && reg[3] == 0x00) + return true; This function is very complex just to find out whether a PCI device is capable of NVLink or not. Feels wrong to do it this way, I believe it would be much easier if NVIDIA exposed a sysfs attribute saying whether a PCI device supports NVLink so that our node-device driver would take care of this during libvirtd startup and then you'd only call a single API from the PPC64 helper you're introducing in the next patch to find out whether you need the alternative formula or not.
Honestly, apart from the coding style issues, I don't like this approach and unless there's a really compelling reason for libvirt to do it in a way which involves spawning a 'find' process because of a complex pattern and a bunch of data necessary to filter out, I'd really suggest contacting NVIDIA about this. It's the same for mdevs, NVIDIA exposes a bunch of attributes in sysfs which we're able to read.
I'll contact NVIDIA and see if there is an easier way (a sysfs attribute, for example) and, if it doesn't, try to provide a plausible reason to justify this detection code.
Thinking about it a bit more, since this is NVIDIA-specific, having an NVIDIA-only sysfs attribute doesn't help the node-device driver I mentioned above. In general, we try to avoid introducing vendor-specific code, so nodedev might not be the right place to check for the attribute (because it's not universal and would require vendor-specific elements), but I think we could have an NVLink helper reading this sysfs attribute(s) elsewhere in libvirt codebase.
Erik

On 3/7/19 12:15 PM, Erik Skultety wrote:
On Tue, Mar 05, 2019 at 09:46:08AM -0300, Daniel Henrique Barboza wrote:
The NVLink2 support in QEMU implements the detection of NVLink2 capable devices by verifying the attributes of the VFIO mem region QEMU allocates for the NVIDIA GPUs. To properly allocate an adequate amount of memLock, Libvirt needs this information before a QEMU instance is even created.
An alternative is presented in this patch. Given a PCI device, we'll traverse the device tree at /proc/device-tree to check if the device has a NPU bridge, retrieve the node of the NVLink2 bus, find the memory-node that is related to the bus and see if it's a NVLink2 bus by inspecting its 'reg' value. This logic is contained inside the 'device_is_nvlink2_capable' function, which uses other new helper functions to navigate and fetch values from the device tree nodes.
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> --- This fails with a bunch of compilation errors, please make sure you run make check and make syntax-check on each patch of the series.
src/qemu/qemu_domain.c | 194 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 194 insertions(+)
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index 77548c224c..97de5793e2 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -10343,6 +10343,200 @@ qemuDomainUpdateCurrentMemorySize(virDomainObjPtr vm) }
+/** + * Reads a phandle file and returns the phandle value. + */ +static int +read_dt_phandle(const char* file) ..we prefer camelCase naming of functions...all functions should have a prefix, e.g. vir,<driver>, etc.
+{ + unsigned int buf[1]; + size_t read; + FILE *f; + + f = fopen(file, "r"); + if (!f) + return -1; + + read = fread(buf, sizeof(unsigned int), 1, f); We already have safe helpers that take care of such operations.
+ + if (!read) { + VIR_CLOSE(f); You could use VIR_AUTOCLOSE for this
+ return 0; + } + + VIR_CLOSE(f); + return be32toh(buf[0]); Is this part of gnulib? We need to make sure it's portable otherwise we'd need to make the conversion ourselves, just like for le64toh
be32toh is from glibc, from the same family as le64toh according to the man page. Guess it faces the same restrictions then. If the next patch requires be32toh I'll add a helper function to manually convert it.
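Something along these lines, I imagine (untested sketch; the name is made up and it assumes virFileReadAllQuiet() is fine to use on the device-tree nodes exposed under sysfs):

    static int
    qemuDomainDTReadPhandle(const char *file)
    {
        VIR_AUTOFREE(char *) buf = NULL;
        const unsigned char *p;

        /* a phandle property is a single big-endian 32-bit cell;
         * the Quiet variant avoids reporting a missing node as an error */
        if (virFileReadAllQuiet(file, 64, &buf) != 4)
            return -1;

        p = (const unsigned char *) buf;
        return ((unsigned int) p[0] << 24) | (p[1] << 16) | (p[2] << 8) | p[3];
    }

That would also cover Erik's points about the safe file helpers and the fclose() handling, without relying on be32toh at all.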
+} + + +/** + * Reads a memory reg file and returns the first 4 int values. + * + * The caller is responsible for freeing the returned array. + */ +static unsigned int * +read_dt_memory_reg(const char *file) +{ + unsigned int *buf; + size_t read, i; + FILE *f; + + f = fopen(file, "r"); + if (!f) + return NULL; + + if (VIR_ALLOC_N(buf, 4) < 0) + return NULL; + + read = fread(buf, sizeof(unsigned int), 4, f); + + if (!read && read < 4) + /* shouldn't happen */ + VIR_FREE(buf); + else for (i = 0; i < 4; i++) + buf[i] = be32toh(buf[i]); + + VIR_CLOSE(f); + return buf; Same comments as above...
+} + + +/** + * This wrapper function receives arguments to be used in a + * 'find' call to retrieve the file names that matches + * the criteria inside the /proc/device-tree dir. + * + * A 'find' call with '-iname phandle' inside /proc/device-tree + * provides more than a thousand matches. Adding '-path' to + * narrow it down further is necessary to keep the file + * listing sane. + * + * The caller is responsible to free the buffer returned by + * this function. + */ +static char * +retrieve_dt_files_pattern(const char *path_pattern, const char *file_pattern) +{ + virCommandPtr cmd = NULL; + char *output = NULL; + + cmd = virCommandNew("find"); + virCommandAddArgList(cmd, "/proc/device-tree/", "-path", path_pattern, + "-iname", file_pattern, NULL); + virCommandSetOutputBuffer(cmd, &output); + + if (virCommandRun(cmd, NULL) < 0) + VIR_FREE(output); + + virCommandFree(cmd); + return output; +} + + +/** + * Helper function that receives a listing of file names and + * calls read_dt_phandle() on each one finding for a match + * with the given phandle argument. Returns the file name if a + * match is found, NULL otherwise. + */ +static char * +find_dt_file_with_phandle(char *files, int phandle) +{ + char *line, *tmp; + int ret; + + line = strtok_r(files, "\n", &tmp); + do { + ret = read_dt_phandle(line); + if (ret == phandle) + break; + } while ((line = strtok_r(NULL, "\n", &tmp)) != NULL); + + return line; +} + + +/** + * This function receives a string that represents a PCI device, + * such as '0004:04:00.0', and tells if the device is NVLink2 capable. + * + * The logic goes as follows: + * + * 1 - get the phandle of a nvlink of the device, reading the 'ibm,npu' + * attribute; + * 2 - find the device tree node of the nvlink bus using the phandle + * found in (1) + * 3 - get the phandle of the memory region of the nvlink bus + * 4 - find the device tree node of the memory region using the + * phandle found in (3) + * 5 - read the 'reg' value of the memory region. If the value of + * the second 64 bit value is 0x02 0x00, the device is attached + * to a NVLink2 bus. + * + * If any of these steps fails, the function returns false. + */ +static bool +device_is_nvlink2_capable(const char *device) +{ + char *file, *files, *tmp; + unsigned int *reg; + int phandle; + + if ((virAsprintf(&file, "/sys/bus/pci/devices/%s/of_node/ibm,npu", + device)) < 0) + return false; + + /* Find phandles of nvlinks: */ + if ((phandle = read_dt_phandle(file)) == -1) + return false; + + /* Find a DT node for the phandle found */ + files = retrieve_dt_files_pattern("*device-tree/pci*", "phandle"); + if (!files) + return false; + + if ((file = find_dt_file_with_phandle(files, phandle)) == NULL) + goto fail; + + /* Find a phandle of the GPU memory region of the device. The + * file found above ends with '/phandle' - the memory region + * of the GPU ends with '/memory-region */ + tmp = strrchr(file, '/'); + *tmp = '\0'; + file = strcat(file, "/memory-region"); + + if ((phandle = read_dt_phandle(file)) == -1) + goto fail; + + file = NULL; + VIR_FREE(files); + + /* Find the memory node for the phandle found above */ + files = retrieve_dt_files_pattern("*device-tree/memory*", "phandle"); + if (!files) + return false; + + if ((file = find_dt_file_with_phandle(files, phandle)) == NULL) + goto fail; + + /* And see its size in the second 64bit value of 'reg'. 
First, + * the end of the file needs to be changed from '/phandle' to + * '/reg' */ + tmp = strrchr(file, '/'); + *tmp = '\0'; + file = strcat(file, "/reg"); + + reg = read_dt_memory_reg(file); + if (reg && reg[2] == 0x20 && reg[3] == 0x00) + return true; This function is very complex just to find out whether a PCI device is capable of NVLink or not. Feels wrong to do it this way, I believe it would be much easier if NVIDIA exposed a sysfs attribute saying whether a PCI device supports NVLink so that our node-device driver would take care of this during libvirtd startup and then you'd only call a single API from the PPC64 helper you're introducing in the next patch to find out whether you need the alternative formula or not.
Honestly, apart from the coding style issues, I don't like this approach and unless there's a really compelling reason for libvirt to do it in a way which involves spawning a 'find' process because of a complex pattern and a bunch of data necessary to filter out, I'd really suggest contacting NVIDIA about this. It's the same for mdevs, NVIDIA exposes a bunch of attributes in sysfs which we're able to read. Thinking about it a bit more, since this is NVIDIA-specific, having an NVIDIA-only sysfs attribute doesn't help the node-device driver I mentioned above. In general, we try to avoid introducing vendor-specific code, so nodedev might not be the right place to check for the attribute (because it's not universal and would require vendor-specific elements), but I think we could have an NVLink helper reading this sysfs attribute(s) elsewhere in libvirt codebase.
Erik

The NVIDIA V100 GPU has an onboard RAM that is mapped into the host memory and accessible as normal RAM via an NVLink2 bus. When passed through in a guest, QEMU puts the NVIDIA RAM window in a non-contiguous area, above the PCI MMIO area that starts at 32TiB. This means that the NVIDIA RAM window starts at 64TiB and goes all the way to 128TiB, and that the guest might request a 64-bit window, for each PCI Host Bridge, that goes all the way to 128TiB. However, the NVIDIA RAM window isn't counted as regular RAM, thus this window is considered only for the allocation of the Translation and Control Entry (TCE).

This memory layout differs from the existing VFIO case, requiring its own formula. This patch changes the PPC64 code of qemuDomainGetMemLockLimitBytes to:

- detect whether a VFIO PCI device is using NVLink2 capabilities. This is done by using the device tree inspection mechanisms that were implemented in the previous patch;

- if any device is an NVIDIA GPU using an NVLink2 bus, passthroughLimit is calculated in a different way to account for the extra memory the TCE table can allocate. The 64TiB..128TiB window is more than enough to fit all possible GPUs, thus the memLimit is the same regardless of passing through 1 or multiple V100 GPUs.

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 src/qemu/qemu_domain.c | 42 +++++++++++++++++++++++++++++++++++++++---
 1 file changed, 39 insertions(+), 3 deletions(-)

diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index 97de5793e2..c0abd6da9a 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -10551,7 +10551,9 @@ getPPC64MemLockLimitBytes(virDomainDefPtr def)
     unsigned long long baseLimit, memory, maxMemory;
     unsigned long long passthroughLimit = 0;
     size_t i, nPCIHostBridges = 0;
-    bool usesVFIO = false;
+    virPCIDeviceAddressPtr pciAddr;
+    char *pciAddrStr = NULL;
+    bool usesVFIO = false, nvlink2Capable = false;

     for (i = 0; i < def->ncontrollers; i++) {
         virDomainControllerDefPtr cont = def->controllers[i];
@@ -10569,7 +10571,15 @@ getPPC64MemLockLimitBytes(virDomainDefPtr def)
             dev->source.subsys.type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI &&
             dev->source.subsys.u.pci.backend == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_VFIO) {
             usesVFIO = true;
-            break;
+
+            pciAddr = &dev->source.subsys.u.pci.addr;
+            if (virPCIDeviceAddressIsValid(pciAddr, false)) {
+                pciAddrStr = virPCIDeviceAddressAsString(pciAddr);
+                if (device_is_nvlink2_capable(pciAddrStr)) {
+                    nvlink2Capable = true;
+                    break;
+                }
+            }
         }
     }
@@ -10596,6 +10606,32 @@ getPPC64MemLockLimitBytes(virDomainDefPtr def)
                 4096 * nPCIHostBridges +
                 8192;

+    /* NVLink2 support in QEMU is a special case of the passthrough
+     * mechanics explained in the usesVFIO case below. The GPU RAM
+     * is placed with a gap after maxMemory. The current QEMU
+     * implementation puts the NVIDIA RAM above the PCI MMIO, which
+     * starts at 32TiB and is the MMIO reserved for the guest main RAM.
+     *
+     * This window ends at 64TiB, and this is where the GPUs are being
+     * placed. The next available window size is at 128TiB, and
+     * 64TiB..128TiB will fit all possible NVIDIA GPUs.
+     *
+     * The same assumption as the most common case applies here:
+     * the guest will request a 64-bit DMA window, per PHB, that is
+     * big enough to map all its RAM, which is now at 128TiB due
+     * to the GPUs.
+     *
+     * Note that the NVIDIA RAM window must be accounted for the TCE
+     * table size, but *not* for the main RAM (maxMemory). This gives
+     * us the following passthroughLimit for the NVLink2 case:
+     *
+     * passthroughLimit = maxMemory +
+     *                    128TiB/512KiB * #PHBs + 8 MiB */
+    if (nvlink2Capable)
+        passthroughLimit = maxMemory +
+                           128 * (1ULL<<30) / 512 * nPCIHostBridges +
+                           8192;
+
     /* passthroughLimit := max( 2 GiB * #PHBs,                       (c)
      *                          memory                               (d)
      *                          + memory * 1/512 * #PHBs + 8 MiB )   (e)
@@ -10615,7 +10651,7 @@ getPPC64MemLockLimitBytes(virDomainDefPtr def)
      * kiB pages, less still if the guest is mapped with hugepages (unlike
      * the default 32-bit DMA window, DDW windows can use large IOMMU
      * pages). 8 MiB is for second and further level overheads, like (b) */
-    if (usesVFIO)
+    else if (usesVFIO)
         passthroughLimit = MAX(2 * 1024 * 1024 * nPCIHostBridges,
                                memory +
                                memory / 512 * nPCIHostBridges + 8192);
--
2.20.1
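To see what the new formula changes in practice, the standalone program below (not libvirt code; the 64 GiB guest and single PHB are arbitrary example values) plugs sample numbers into the baseLimit and passthroughLimit expressions used in this patch and in the pre-existing VFIO path. All quantities are in KiB, as in qemu_domain.c; the 128 * (1ULL<<30) / 512 term works out to 256 GiB worth of KiB per PHB, which is why the NVLink2 limit dwarfs the plain VFIO one.

    #include <stdio.h>

    #define MAX(a, b) ((a) > (b) ? (a) : (b))

    int main(void)
    {
        /* Example values only: a 64 GiB guest with one PHB. All numbers
         * are in KiB, matching the units used in qemu_domain.c. */
        unsigned long long memory = 64ULL * 1024 * 1024;
        unsigned long long maxMemory = memory;   /* no max_memory set */
        unsigned long long nPCIHostBridges = 1;

        unsigned long long baseLimit = maxMemory / 128 +
                                       4096 * nPCIHostBridges +
                                       8192;

        /* Existing VFIO formula: the 64-bit DMA window only has to map
         * guest RAM. */
        unsigned long long vfioLimit =
            MAX(2 * 1024 * 1024 * nPCIHostBridges,
                memory + memory / 512 * nPCIHostBridges + 8192);

        /* NVLink2 formula from this patch: the 64-bit DMA window has to
         * reach up to 128TiB, where the GPU RAM sits, so the TCE table
         * is sized for that instead of for guest RAM alone. */
        unsigned long long nvlinkLimit =
            maxMemory + 128 * (1ULL<<30) / 512 * nPCIHostBridges + 8192;

        printf("VFIO    memKB = %llu KiB\n", baseLimit + vfioLimit);
        printf("NVLink2 memKB = %llu KiB\n", baseLimit + nvlinkLimit);
        return 0;
    }

For this example the plain VFIO path locks roughly 64 GiB plus overhead, while the NVLink2 path locks roughly 320 GiB plus overhead, regardless of how many V100 GPUs are passed through.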

Erik,

It appears that patches 1 and 2 are good to go (and probably can be pushed separately from the other 2). Should I resend only patches 3 and 4 when I get more info from NVIDIA about the detection code in patch 3?

Thanks,

Daniel

On 3/5/19 9:46 AM, Daniel Henrique Barboza wrote:
This series includes Libvirt support for a new QEMU feature for the spapr (PPC64) machine, NVIDIA V100 + P9 passthrough. Refer to [1] for the version 3 of this feature (same version used as a reference for this series).
Changes in v3: - added a new patch (patch 2) that isolates the PPC64 exclusive code to calculate the memLockLimit (suggested by Erik Skultety) - fixed 'make syntax-check' errors across all patches - v2 can be found at [2]
[1] https://patchwork.kernel.org/cover/10831413/ [2] https://www.redhat.com/archives/libvir-list/2019-March/msg00059.html
Daniel Henrique Barboza (4): qemu_domain: simplify non-VFIO memLockLimit calc for PPC64 qemu_domain: add a PPC64 memLockLimit helper qemu_domain: NVLink2 device tree functions for PPC64 PPC64 support for NVIDIA V100 GPU with NVLink2 passthrough
src/qemu/qemu_domain.c | 402 ++++++++++++++++++++++++++++++++--------- 1 file changed, 321 insertions(+), 81 deletions(-)

On Thu, Mar 07, 2019 at 01:07:23PM -0300, Daniel Henrique Barboza wrote:
Erik,
It appears that patches 1 and 2 are good to go (and probably can be pushed separately from the other 2). Should I resend only patches 3 and 4 when I get more info from NVIDIA about the detection code in patch 3?
Hi,

Sorry for not stating it earlier: I pushed the first 2 patches right after giving you the RB, so sure, you only need to focus on the NVLink stuff.

Erik