[libvirt] [PATCH v4 0/2] PPC64 support for NVIDIA V100 GPU with NVLink2 passthrough

This series adds support for a new QEMU feature for the spapr (PPC64) machine: NVIDIA V100 + P9 passthrough. Refer to [1] for version 7 of this feature (the same version used as a reference for this series).

Changes in v4:
- only 2 patches - previous patches 1 and 2 were applied
- the detection mechanism was redesigned in patch 3. It is now simpler and shorter after the discussions from the previous version.
- previous version can be found at [2]

[1] https://patchwork.kernel.org/patch/10848727/
[2] https://www.redhat.com/archives/libvir-list/2019-March/msg00212.html

Daniel Henrique Barboza (2):
  qemu_domain: NVLink2 bridge detection function for PPC64
  PPC64 support for NVIDIA V100 GPU with NVLink2 passthrough

 src/qemu/qemu_domain.c | 71 ++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 69 insertions(+), 2 deletions(-)

-- 
2.20.1

The NVLink2 support in QEMU implements the detection of NVLink2 capable devices by verifying the attributes of the VFIO mem region QEMU allocates for the NVIDIA GPUs. To properly allocate an adequate amount of memLock, Libvirt needs this information before a QEMU instance is even created, thus querying QEMU is not possible and opening a VFIO window is too much.

An alternative is presented in this patch, making the following assumptions:

- if we want GPU RAM to be available in the guest, an NVLink2 bridge must be passed through;

- an unknown PCI device can be classified as an NVLink2 bridge if its device tree node has 'ibm,gpu', 'ibm,nvlink', 'ibm,nvlink-speed' and 'memory-region'.

This patch introduces a helper called @ppc64VFIODeviceIsNV2Bridge that checks the device tree node of a given PCI device and tells whether it meets the criteria to be an NVLink2 bridge. This new function will be used in a follow-up patch that, using the first assumption, will set up the rlimits of the guest accordingly.

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 src/qemu/qemu_domain.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index 1659e88478..dcc92d253c 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -10398,6 +10398,35 @@ qemuDomainUpdateCurrentMemorySize(virDomainObjPtr vm)
 }


+/**
+ * ppc64VFIODeviceIsNV2Bridge:
+ * @device: string with the PCI device address
+ *
+ * This function receives a string that represents a PCI device,
+ * such as '0004:04:00.0', and tells if the device is a NVLink2
+ * bridge.
+ */
+static bool
+ppc64VFIODeviceIsNV2Bridge(const char *device)
+{
+    const char *nvlink2Files[] = {"ibm,gpu", "ibm,nvlink",
+                                  "ibm,nvlink-speed", "memory-region"};
+    char *file;
+    size_t i;
+
+    for (i = 0; i < 4; i++) {
+        if ((virAsprintf(&file, "/sys/bus/pci/devices/%s/of_node/%s",
+                         device, nvlink2Files[i])) < 0)
+            return false;
+
+        if (!virFileExists(file))
+            return false;
+    }
+
+    return true;
+}
+
+
 /**
  * getPPC64MemLockLimitBytes:
  * @def: domain definition

-- 
2.20.1

On Tue, Mar 12, 2019 at 18:55:49 -0300, Daniel Henrique Barboza wrote:
The NVLink2 support in QEMU implements the detection of NVLink2 capable devices by verifying the attributes of the VFIO mem region QEMU allocates for the NVIDIA GPUs. To properly allocate an adequate amount of memLock, Libvirt needs this information before a QEMU instance is even created, thus querying QEMU is not possible and opening a VFIO window is too much.
An alternative is presented in this patch. Making the following assumptions:
- if we want GPU RAM to be available in the guest, an NVLink2 bridge must be passed through;
- an unknown PCI device can be classified as a NVLink2 bridge if its device tree node has 'ibm,gpu', 'ibm,nvlink', 'ibm,nvlink-speed' and 'memory-region'.
This patch introduces a helper called @ppc64VFIODeviceIsNV2Bridge that checks the device tree node of a given PCI device and check if it meets the criteria to be a NVLink2 bridge. This new function will be used in a follow-up patch that, using the first assumption, will set up the rlimits of the guest accordingly.
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> --- src/qemu/qemu_domain.c | 29 +++++++++++++++++++++++++++++ 1 file changed, 29 insertions(+)
disclaimer: This is a code review. I don't have much knowledge about what you are trying to implement.
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index 1659e88478..dcc92d253c 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -10398,6 +10398,35 @@ qemuDomainUpdateCurrentMemorySize(virDomainObjPtr vm)
 }

+/**
+ * ppc64VFIODeviceIsNV2Bridge:
+ * @device: string with the PCI device address
+ *
+ * This function receives a string that represents a PCI device,
+ * such as '0004:04:00.0', and tells if the device is a NVLink2
+ * bridge.
+ */
+static bool
+ppc64VFIODeviceIsNV2Bridge(const char *device)
+{
+    const char *nvlink2Files[] = {"ibm,gpu", "ibm,nvlink",
+                                  "ibm,nvlink-speed", "memory-region"};
+    char *file;
+    size_t i;
+
+    for (i = 0; i < 4; i++) {
Use ARRAY_CARDINALITY(nvlink2Files) instead of '4'.
+        if ((virAsprintf(&file, "/sys/bus/pci/devices/%s/of_node/%s",
+                         device, nvlink2Files[i])) < 0)
+            return false;
+
+        if (!virFileExists(file))
+            return false;
memory allocated to 'file' is leaked in every loop and also at exit. You can use VIR_AUTOFREE(char *) file above, but you need to free the pointer also in every single loop.
+    }
+
+    return true;
+}
This function is unused and static, this means the build will fail after this patch.
+
+
 /**
  * getPPC64MemLockLimitBytes:
  * @def: domain definition

-- 
2.20.1

On Tue, Apr 02, 2019 at 09:21:53AM +0200, Peter Krempa wrote:
On Tue, Mar 12, 2019 at 18:55:49 -0300, Daniel Henrique Barboza wrote:
The NVLink2 support in QEMU implements the detection of NVLink2 capable devices by verifying the attributes of the VFIO mem region QEMU allocates for the NVIDIA GPUs. To properly allocate an adequate amount of memLock, Libvirt needs this information before a QEMU instance is even created, thus querying QEMU is not possible and opening a VFIO window is too much.
An alternative is presented in this patch. Making the following assumptions:
- if we want GPU RAM to be available in the guest, an NVLink2 bridge must be passed through;
- an unknown PCI device can be classified as a NVLink2 bridge if its device tree node has 'ibm,gpu', 'ibm,nvlink', 'ibm,nvlink-speed' and 'memory-region'.
This patch introduces a helper called @ppc64VFIODeviceIsNV2Bridge that checks the device tree node of a given PCI device and check if it meets the criteria to be a NVLink2 bridge. This new function will be used in a follow-up patch that, using the first assumption, will set up the rlimits of the guest accordingly.
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> --- src/qemu/qemu_domain.c | 29 +++++++++++++++++++++++++++++ 1 file changed, 29 insertions(+)
disclaimer: This is a code review. I don't have much knowledge about what you are trying to implement.
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index 1659e88478..dcc92d253c 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -10398,6 +10398,35 @@ qemuDomainUpdateCurrentMemorySize(virDomainObjPtr vm) }
+/** + * ppc64VFIODeviceIsNV2Bridge: + * @device: string with the PCI device address + * + * This function receives a string that represents a PCI device, + * such as '0004:04:00.0', and tells if the device is a NVLink2 + * bridge. + */ +static bool +ppc64VFIODeviceIsNV2Bridge(const char *device) +{ + const char *nvlink2Files[] = {"ibm,gpu", "ibm,nvlink", + "ibm,nvlink-speed", "memory-region"}; + char *file; + size_t i; + + for (i = 0; i < 4; i++) {
Use ARRAY_CARDINALITY(nvlink2Files) instead of '4'.
+ if ((virAsprintf(&file, "/sys/bus/pci/devices/%s/of_node/%s", + device, nvlink2Files[i])) < 0) + return false; + + if (!virFileExists(file)) + return false;
memory allocated to 'file' is leaked in every loop and also at exit. You can use VIR_AUTOFREE(char *) file above, but you need to free the pointer also in every single loop.
Or just declare the file variable inside the loop, then it will be freed after going out of scope every loop.

Jano
+ } + + return true; +}
This function is unused and static, this means the build will fail after this patch.
+ + /** * getPPC64MemLockLimitBytes: * @def: domain definition -- 2.20.1

On 4/2/19 4:46 AM, Ján Tomko wrote:
On Tue, Apr 02, 2019 at 09:21:53AM +0200, Peter Krempa wrote:
On Tue, Mar 12, 2019 at 18:55:49 -0300, Daniel Henrique Barboza wrote:
The NVLink2 support in QEMU implements the detection of NVLink2 capable devices by verifying the attributes of the VFIO mem region QEMU allocates for the NVIDIA GPUs. To properly allocate an adequate amount of memLock, Libvirt needs this information before a QEMU instance is even created, thus querying QEMU is not possible and opening a VFIO window is too much.
An alternative is presented in this patch. Making the following assumptions:
- if we want GPU RAM to be available in the guest, an NVLink2 bridge must be passed through;
- an unknown PCI device can be classified as a NVLink2 bridge if its device tree node has 'ibm,gpu', 'ibm,nvlink', 'ibm,nvlink-speed' and 'memory-region'.
This patch introduces a helper called @ppc64VFIODeviceIsNV2Bridge that checks the device tree node of a given PCI device and check if it meets the criteria to be a NVLink2 bridge. This new function will be used in a follow-up patch that, using the first assumption, will set up the rlimits of the guest accordingly.
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> --- src/qemu/qemu_domain.c | 29 +++++++++++++++++++++++++++++ 1 file changed, 29 insertions(+)
disclaimer: This is a code review. I don't have much knowledge about what you are trying to implement.
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index 1659e88478..dcc92d253c 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -10398,6 +10398,35 @@ qemuDomainUpdateCurrentMemorySize(virDomainObjPtr vm)
 }

+/**
+ * ppc64VFIODeviceIsNV2Bridge:
+ * @device: string with the PCI device address
+ *
+ * This function receives a string that represents a PCI device,
+ * such as '0004:04:00.0', and tells if the device is a NVLink2
+ * bridge.
+ */
+static bool
+ppc64VFIODeviceIsNV2Bridge(const char *device)
+{
+    const char *nvlink2Files[] = {"ibm,gpu", "ibm,nvlink",
+                                  "ibm,nvlink-speed", "memory-region"};
+    char *file;
+    size_t i;
+
+    for (i = 0; i < 4; i++) {
Use ARRAY_CARDINALITY(nvlink2Files) instead of '4'.
+        if ((virAsprintf(&file, "/sys/bus/pci/devices/%s/of_node/%s",
+                         device, nvlink2Files[i])) < 0)
+            return false;
+
+        if (!virFileExists(file))
+            return false;
memory allocated to 'file' is leaked in every loop and also at exit. You can use VIR_AUTOFREE(char *) file above, but you need to free the pointer also in every single loop.
Or just declare the file variable inside the loop, then it will be freed after going out of scope every loop.
Jano

Thanks for this tip. I've used it in both patches to fix this leak inside the loop.
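For readers following the thread, here is a minimal sketch of what the loop could look like with both suggestions applied: ARRAY_CARDINALITY() instead of the literal '4', and a VIR_AUTOFREE'd 'file' declared inside the loop so it is released on every iteration and on the early returns. This is only an illustration of the review comments, not the exact code from the follow-up version:

static bool
ppc64VFIODeviceIsNV2Bridge(const char *device)
{
    const char *nvlink2Files[] = {"ibm,gpu", "ibm,nvlink",
                                  "ibm,nvlink-speed", "memory-region"};
    size_t i;

    for (i = 0; i < ARRAY_CARDINALITY(nvlink2Files); i++) {
        /* freed automatically at the end of each iteration and on return */
        VIR_AUTOFREE(char *) file = NULL;

        if (virAsprintf(&file, "/sys/bus/pci/devices/%s/of_node/%s",
                        device, nvlink2Files[i]) < 0)
            return false;

        if (!virFileExists(file))
            return false;
    }

    return true;
}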
+    }
+
+    return true;
+}
This function is unused and static, this means the build will fail after this patch.
+
+
 /**
  * getPPC64MemLockLimitBytes:
  * @def: domain definition

-- 
2.20.1

On 4/2/19 4:21 AM, Peter Krempa wrote:
On Tue, Mar 12, 2019 at 18:55:49 -0300, Daniel Henrique Barboza wrote:
The NVLink2 support in QEMU implements the detection of NVLink2 capable devices by verifying the attributes of the VFIO mem region QEMU allocates for the NVIDIA GPUs. To properly allocate an adequate amount of memLock, Libvirt needs this information before a QEMU instance is even created, thus querying QEMU is not possible and opening a VFIO window is too much.
An alternative is presented in this patch. Making the following assumptions:
- if we want GPU RAM to be available in the guest, an NVLink2 bridge must be passed through;
- an unknown PCI device can be classified as a NVLink2 bridge if its device tree node has 'ibm,gpu', 'ibm,nvlink', 'ibm,nvlink-speed' and 'memory-region'.
This patch introduces a helper called @ppc64VFIODeviceIsNV2Bridge that checks the device tree node of a given PCI device and check if it meets the criteria to be a NVLink2 bridge. This new function will be used in a follow-up patch that, using the first assumption, will set up the rlimits of the guest accordingly.
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> --- src/qemu/qemu_domain.c | 29 +++++++++++++++++++++++++++++ 1 file changed, 29 insertions(+) disclaimer: This is a code review. I don't have much knowledge about what you are trying to implement.
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index 1659e88478..dcc92d253c 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -10398,6 +10398,35 @@ qemuDomainUpdateCurrentMemorySize(virDomainObjPtr vm) }
+/**
+ * ppc64VFIODeviceIsNV2Bridge:
+ * @device: string with the PCI device address
+ *
+ * This function receives a string that represents a PCI device,
+ * such as '0004:04:00.0', and tells if the device is a NVLink2
+ * bridge.
+ */
+static bool
+ppc64VFIODeviceIsNV2Bridge(const char *device)
+{
+    const char *nvlink2Files[] = {"ibm,gpu", "ibm,nvlink",
+                                  "ibm,nvlink-speed", "memory-region"};
+    char *file;
+    size_t i;
+
+    for (i = 0; i < 4; i++) {

Use ARRAY_CARDINALITY(nvlink2Files) instead of '4'.
ok!
+        if ((virAsprintf(&file, "/sys/bus/pci/devices/%s/of_node/%s",
+                         device, nvlink2Files[i])) < 0)
+            return false;
+
+        if (!virFileExists(file))
+            return false;

memory allocated to 'file' is leaked in every loop and also at exit. You can use VIR_AUTOFREE(char *) file above, but you need to free the pointer also in every single loop.

+    }
+
+    return true;
+}

This function is unused and static, this means the build will fail after this patch.
I added an ATTRIBUTE_UNUSED in the function to allow the build to succeed. In the next patch, when I ended up using the function, I removed it. Is that ok?
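For readers unfamiliar with that trick, here is a hypothetical stand-alone illustration of the approach (not libvirt code; the function name is made up and the macro below simply mirrors libvirt's ATTRIBUTE_UNUSED) showing why the annotation keeps an intermediate -Wall -Werror build green until the next patch adds a caller:

#include <stdbool.h>
#include <stddef.h>

/* mirrors libvirt's ATTRIBUTE_UNUSED from internal.h */
#define ATTRIBUTE_UNUSED __attribute__((__unused__))

/* With -Wall -Werror, an uncalled static function normally breaks the
 * build via -Wunused-function; the attribute suppresses that warning
 * until a caller exists and can then be dropped again. */
static bool ATTRIBUTE_UNUSED
isNV2BridgeStub(const char *device)
{
    return device != NULL;  /* placeholder body for the demonstration */
}

int main(void)
{
    return 0;
}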
+ + /** * getPPC64MemLockLimitBytes: * @def: domain definition -- 2.20.1

On Tue, Mar 12, 2019 at 06:55:49PM -0300, Daniel Henrique Barboza wrote:
The NVLink2 support in QEMU implements the detection of NVLink2 capable devices by verifying the attributes of the VFIO mem region QEMU allocates for the NVIDIA GPUs. To properly allocate an adequate amount of memLock, Libvirt needs this information before a QEMU instance is even created, thus querying QEMU is not possible and opening a VFIO window is too much.
An alternative is presented in this patch. Making the following assumptions:
- if we want GPU RAM to be available in the guest, an NVLink2 bridge must be passed through;
- an unknown PCI device can be classified as a NVLink2 bridge if its device tree node has 'ibm,gpu', 'ibm,nvlink', 'ibm,nvlink-speed' and 'memory-region'.
Alexey mentioned that it should be enough to check for the properties ^above. I'm just wondering, knowing this is IBM's PPC8/9 if the assumptions we have made are going to stay with further revisions of PPC, NVLink and GPUs, IOW we need to be sure "ibm,nvlink" won't be renamed with further revisions, e.g. other cards than V100 in the future, because then compatibility and revision selection comes into the picture.
This patch introduces a helper called @ppc64VFIODeviceIsNV2Bridge that checks the device tree node of a given PCI device and check if it meets the criteria to be a NVLink2 bridge. This
Just out of curiosity, what about NVLink 1.0? Apart from performance, I wasn't able to find something useful in terms of compatibility, is there something to consider, since we're only relying on NVLink 2.0?
new function will be used in a follow-up patch that, using the first assumption, will set up the rlimits of the guest accordingly.
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> --- src/qemu/qemu_domain.c | 29 +++++++++++++++++++++++++++++ 1 file changed, 29 insertions(+)
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index 1659e88478..dcc92d253c 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -10398,6 +10398,35 @@ qemuDomainUpdateCurrentMemorySize(virDomainObjPtr vm)
 }

+/**
+ * ppc64VFIODeviceIsNV2Bridge:
+ * @device: string with the PCI device address
+ *
+ * This function receives a string that represents a PCI device,
+ * such as '0004:04:00.0', and tells if the device is a NVLink2
+ * bridge.
+ */
+static bool
+ppc64VFIODeviceIsNV2Bridge(const char *device)
+{
+    const char *nvlink2Files[] = {"ibm,gpu", "ibm,nvlink",
+                                  "ibm,nvlink-speed", "memory-region"};
Like I said above, if ^this is not to change, then I'm okay in principle, still I feel like this needs David's ACK (putting him on CC)

Codewise, Peter already provided you with comments.

Erik
+ char *file; + size_t i; + + for (i = 0; i < 4; i++) { + if ((virAsprintf(&file, "/sys/bus/pci/devices/%s/of_node/%s", + device, nvlink2Files[i])) < 0) + return false; + + if (!virFileExists(file)) + return false; + } + + return true; +} + + /** * getPPC64MemLockLimitBytes: * @def: domain definition -- 2.20.1

On 4/2/19 5:34 AM, Erik Skultety wrote:
On Tue, Mar 12, 2019 at 06:55:49PM -0300, Daniel Henrique Barboza wrote:
The NVLink2 support in QEMU implements the detection of NVLink2 capable devices by verifying the attributes of the VFIO mem region QEMU allocates for the NVIDIA GPUs. To properly allocate an adequate amount of memLock, Libvirt needs this information before a QEMU instance is even created, thus querying QEMU is not possible and opening a VFIO window is too much.
An alternative is presented in this patch. Making the following assumptions:
- if we want GPU RAM to be available in the guest, an NVLink2 bridge must be passed through;
- an unknown PCI device can be classified as a NVLink2 bridge if its device tree node has 'ibm,gpu', 'ibm,nvlink', 'ibm,nvlink-speed' and 'memory-region'.

Alexey mentioned that it should be enough to check for the properties ^above. I'm just wondering, knowing this is IBM's PPC8/9 if the assumptions we have made are going to stay with further revisions of PPC, NVLink and GPUs, IOW we need to be sure "ibm,nvlink" won't be renamed with further revisions, e.g. other cards than V100 in the future, because then compatibility and revision selection comes into the picture.
Perhaps Alexey or Piotr can comment on this. I can't confirm that the device node naming will remain as is in the long run.
This patch introduces a helper called @ppc64VFIODeviceIsNV2Bridge that checks the device tree node of a given PCI device and check if it meets the criteria to be a NVLink2 bridge. This

Just out of curiosity, what about NVLink 1.0? Apart from performance, I wasn't able to find something useful in terms of compatibility, is there something to consider, since we're only relying on NVLink 2.0?
NVLink1, as far as Libvirt goes, works similarly to a regular GPU passthrough. There are no changes in the memory topology in QEMU that need extra code to adjust the rlimit. I am not entirely sure that the code above will not generate a false positive, mistakenly detecting the NV2 scenario for an NV1 GPU. I think the 'memory-region' attribute won't be present in an NV1 bridge. Piotr, can you comment here?
new function will be used in a follow-up patch that, using the first assumption, will set up the rlimits of the guest accordingly.
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> --- src/qemu/qemu_domain.c | 29 +++++++++++++++++++++++++++++ 1 file changed, 29 insertions(+)
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index 1659e88478..dcc92d253c 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -10398,6 +10398,35 @@ qemuDomainUpdateCurrentMemorySize(virDomainObjPtr vm) }
+/**
+ * ppc64VFIODeviceIsNV2Bridge:
+ * @device: string with the PCI device address
+ *
+ * This function receives a string that represents a PCI device,
+ * such as '0004:04:00.0', and tells if the device is a NVLink2
+ * bridge.
+ */
+static bool
+ppc64VFIODeviceIsNV2Bridge(const char *device)
+{
+    const char *nvlink2Files[] = {"ibm,gpu", "ibm,nvlink",
+                                  "ibm,nvlink-speed", "memory-region"};

Like I said above, if ^this is not to change, then I'm okay in principle, still I feel like this needs David's ACK (putting him on CC)
Codewise, Peter already provided you with comments.
Erik
+ char *file; + size_t i; + + for (i = 0; i < 4; i++) { + if ((virAsprintf(&file, "/sys/bus/pci/devices/%s/of_node/%s", + device, nvlink2Files[i])) < 0) + return false; + + if (!virFileExists(file)) + return false; + } + + return true; +} + + /** * getPPC64MemLockLimitBytes: * @def: domain definition -- 2.20.1

On Tue, Apr 02, 2019 at 05:27:55PM -0300, Daniel Henrique Barboza wrote:
On 4/2/19 5:34 AM, Erik Skultety wrote:
On Tue, Mar 12, 2019 at 06:55:49PM -0300, Daniel Henrique Barboza wrote:
The NVLink2 support in QEMU implements the detection of NVLink2 capable devices by verifying the attributes of the VFIO mem region QEMU allocates for the NVIDIA GPUs. To properly allocate an adequate amount of memLock, Libvirt needs this information before a QEMU instance is even created, thus querying QEMU is not possible and opening a VFIO window is too much.
An alternative is presented in this patch. Making the following assumptions:
- if we want GPU RAM to be available in the guest, an NVLink2 bridge must be passed through;
- an unknown PCI device can be classified as a NVLink2 bridge if its device tree node has 'ibm,gpu', 'ibm,nvlink', 'ibm,nvlink-speed' and 'memory-region'. Alexey mentioned that it should be enough to check for the properties ^above. I'm just wondering, knowing this is IBM's PPC8/9 if the assumptions we have made are going to stay with further revisions of PPC, NVLink and GPUs, IOW we need to be sure "ibm,nvlink" won't be renamed with further revisions, e.g. other cards than V100 in the future, because then compatibility and revision selection comes into the picture.
Perhaps Alexey or Piotr can comment on this. I can't confirm that the device node naming will remain as is in the long run.
This patch introduces a helper called @ppc64VFIODeviceIsNV2Bridge that checks the device tree node of a given PCI device and check if it meets the criteria to be a NVLink2 bridge. This Just out of curiosity, what about NVLink 1.0? Apart from performance, I wasn't able to find something useful in terms of compatibility, is there something to consider, since we're only relying on NVLink 2.0?
NVLink1, as far as Libvirt goes, works similar to a regular GPU passthrough. There is no changes in the memory topology in QEMU that needs extra code to adjust rlimit.
I am not entirely sure if the code above will not generate a false-positive, mistakenly detecting NV2 scenario for a NV1 GPU. I think the 'memory-region' attribute won't be present in a NV1 bridge. Piotr, can you comment here?
I'm glad you mentioned it. Well, the worst that can happen if it generates a false positive is that we raise the rlimit even though we don't need to, which in general is a concern, since malicious guests with relaxed limits can lock enough memory so that the host doesn't have enough left for itself. This cannot be prevented completely; however, we should still follow our "best effort" approach, IOW we should make sure that we only adjust the limit when necessary.

Thanks, Erik

On 4/3/19 4:05 AM, Erik Skultety wrote:
On Tue, Apr 02, 2019 at 05:27:55PM -0300, Daniel Henrique Barboza wrote:
On 4/2/19 5:34 AM, Erik Skultety wrote:
On Tue, Mar 12, 2019 at 06:55:49PM -0300, Daniel Henrique Barboza wrote:
The NVLink2 support in QEMU implements the detection of NVLink2 capable devices by verifying the attributes of the VFIO mem region QEMU allocates for the NVIDIA GPUs. To properly allocate an adequate amount of memLock, Libvirt needs this information before a QEMU instance is even created, thus querying QEMU is not possible and opening a VFIO window is too much.
An alternative is presented in this patch. Making the following assumptions:
- if we want GPU RAM to be available in the guest, an NVLink2 bridge must be passed through;
- an unknown PCI device can be classified as a NVLink2 bridge if its device tree node has 'ibm,gpu', 'ibm,nvlink', 'ibm,nvlink-speed' and 'memory-region'. Alexey mentioned that it should be enough to check for the properties ^above. I'm just wondering, knowing this is IBM's PPC8/9 if the assumptions we have made are going to stay with further revisions of PPC, NVLink and GPUs, IOW we need to be sure "ibm,nvlink" won't be renamed with further revisions, e.g. other cards than V100 in the future, because then compatibility and revision selection comes into the picture. Perhaps Alexey or Piotr can comment on this. I can't confirm that the device node naming will remain as is in the long run.
This patch introduces a helper called @ppc64VFIODeviceIsNV2Bridge that checks the device tree node of a given PCI device and check if it meets the criteria to be a NVLink2 bridge. This Just out of curiosity, what about NVLink 1.0? Apart from performance, I wasn't able to find something useful in terms of compatibility, is there something to consider, since we're only relying on NVLink 2.0? NVLink1, as far as Libvirt goes, works similar to a regular GPU passthrough. There is no changes in the memory topology in QEMU that needs extra code to adjust rlimit.
I am not entirely sure if the code above will not generate a false-positive, mistakenly detecting NV2 scenario for a NV1 GPU. I think the 'memory-region' attribute won't be present in a NV1 bridge. Piotr, can you comment here?
I'm glad you mentioned it. Well, the worst that can happen if it generates a false positive is that we raise the rlimit even though we don't need to, which in general is a concern, since malicious guests with relaxed limits can lock enough memory so that the host doesn't have enough left for itself, this cannot be prevented completely, however, we should still follow our "best effort" approach, IOW we should make sure that we only adjust the limit if necessary.
Just verified that this code will *not* result in NV2 false positives for NV1 passthrough. As I've said before, NV1 works as a regular VFIO passthrough. The main difference is that there are *no* NVLink bridges being passed through as well, which is exactly what the code here is detecting.

For reference, the link below contains instructions on how NV1 passthrough of a Tesla K40m works in Libvirt:

https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1541902/comments/26

TL;DR, it consists simply of the <hostdev> element with the GPU and, in case multiple NV1 GPUs are being passed through, an extra global parameter. There is no way to mistake NV1 for NV2 if we're looking for NV2 bridges being passed through, like we're doing here.

Thanks, DHB
Thanks, Erik

On 04/04/2019 00:52, Daniel Henrique Barboza wrote:
On 4/3/19 4:05 AM, Erik Skultety wrote:
On Tue, Apr 02, 2019 at 05:27:55PM -0300, Daniel Henrique Barboza wrote:
On 4/2/19 5:34 AM, Erik Skultety wrote:
The NVLink2 support in QEMU implements the detection of NVLink2 capable devices by verifying the attributes of the VFIO mem region QEMU allocates for the NVIDIA GPUs. To properly allocate an adequate amount of memLock, Libvirt needs this information before a QEMU instance is even created, thus querying QEMU is not possible and opening a VFIO window is too much.
An alternative is presented in this patch. Making the following assumptions:
- if we want GPU RAM to be available in the guest, an NVLink2 bridge must be passed through;
- an unknown PCI device can be classified as a NVLink2 bridge if its device tree node has 'ibm,gpu', 'ibm,nvlink', 'ibm,nvlink-speed' and 'memory-region'.

Alexey mentioned that it should be enough to check for the properties ^above. I'm just wondering, knowing this is IBM's PPC8/9 if the assumptions we have made are going to stay with further revisions of PPC, NVLink and GPUs, IOW we need to be sure "ibm,nvlink" won't be renamed with further revisions, e.g. other cards than V100 in the future, because then compatibility and revision selection comes into the picture.

Perhaps Alexey or Piotr can comment on this. I can't confirm that the device node naming will remain as is in the long run.
This patch introduces a helper called @ppc64VFIODeviceIsNV2Bridge that checks the device tree node of a given PCI device and check if it meets the criteria to be a NVLink2 bridge. This Just out of curiosity, what about NVLink 1.0? Apart from performance, I wasn't able to find something useful in terms of compatibility, is there something to consider, since we're only relying on NVLink 2.0? NVLink1, as far as Libvirt goes, works similar to a regular GPU passthrough. There is no changes in the memory topology in QEMU that needs extra code to adjust rlimit.
I am not entirely sure if the code above will not generate a false-positive, mistakenly detecting NV2 scenario for a NV1 GPU. I think the 'memory-region' attribute won't be present in a NV1 bridge. Piotr, can you comment here?
I'm glad you mentioned it. Well, the worst that can happen if it generates a false positive is that we raise the rlimit even though we don't need to, which in general is a concern, since malicious guests with relaxed limits can lock enough memory so that the host doesn't have enough left for itself, this cannot be prevented completely, however, we should still follow our "best effort" approach, IOW we should make sure that we only adjust the limit if necessary.
Just verified that this code will *not* result in NV2 false-positives for NV1 passthrough.
As I've said before, NV1 works as a regular VFIO passthrough. The main difference is that there is *no* NVLink Bridges being passed through as well, which is exactly what the code here is detecting.
For reference, the link below contains instructions of how NV1 passthrough of a Tesla k40m works in Libvirt:
https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1541902/comments/26
TL;DR, it consists simply in the <hostdev> element with the GPU and, in case of multiple NV1 GPUs are being passed through, an extra global parameter. There is no way to mistake NV1 with NV2 if we're looking for NV2 bridges being passed on, like we're doing here.
No. The K40 is not NV1's GPU (although the same core); it is "NVIDIA Corporation Device 15fe". The difference to the V100 (from P9 boxes) is that previously NVLink1 was used for (very very fast) DMA _only_ (hence the same limits as usual), but NVLink2 provides direct access for a CPU to the GPU's RAM, so that memory appears on the host, can be DMA'ed to/from (by a network adapter, for example) and has to be mapped via the IOMMU, resulting in bigger IOMMU tables.

So, NVLink1 also required the bridges to be passed through, but there was no change to VFIO or QEMU in any way. There are still "ibm,gpu" and "ibm,npu" properties in the host device tree - the Linux powernv platform relies on this. There are no "ibm,nvlink" and "memory-region" properties in those nodes on NVLink1 systems, though, and since the device tree comes from skiboot, which is the firmware our team controls, I can say this API won't change incompatibly.

-- 
Alexey

On Tue, 2 Apr 2019 10:34:27 +0200 Erik Skultety <eskultet@redhat.com> wrote:
On Tue, Mar 12, 2019 at 06:55:49PM -0300, Daniel Henrique Barboza wrote:
The NVLink2 support in QEMU implements the detection of NVLink2 capable devices by verifying the attributes of the VFIO mem region QEMU allocates for the NVIDIA GPUs. To properly allocate an adequate amount of memLock, Libvirt needs this information before a QEMU instance is even created, thus querying QEMU is not possible and opening a VFIO window is too much.
An alternative is presented in this patch. Making the following assumptions:
- if we want GPU RAM to be available in the guest, an NVLink2 bridge must be passed through;
- an unknown PCI device can be classified as a NVLink2 bridge if its device tree node has 'ibm,gpu', 'ibm,nvlink', 'ibm,nvlink-speed' and 'memory-region'.
Alexey mentioned that it should be enough to check for the properties ^above. I'm just wondering, knowing this is IBM's PPC8/9 if the assumptions we have made are going to stay with further revisions of PPC, NVLink and GPUs, IOW we need to be sure "ibm,nvlink" won't be renamed with further revisions, e.g. other cards than V100 in the future, because then compatibility and revision selection comes into the picture.
This patch introduces a helper called @ppc64VFIODeviceIsNV2Bridge that checks the device tree node of a given PCI device and check if it meets the criteria to be a NVLink2 bridge. This
Just out of curiosity, what about NVLink 1.0? Apart from performance, I wasn't able to find something useful in terms of compatibility, is there something to consider, since we're only relying on NVLink 2.0?
Probably not. NVLink 1 is entirely incompatible, and AFAIK support for it has been postponed until probably never.
new function will be used in a follow-up patch that, using the first assumption, will set up the rlimits of the guest accordingly.
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> --- src/qemu/qemu_domain.c | 29 +++++++++++++++++++++++++++++ 1 file changed, 29 insertions(+)
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index 1659e88478..dcc92d253c 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -10398,6 +10398,35 @@ qemuDomainUpdateCurrentMemorySize(virDomainObjPtr vm) }
+/** + * ppc64VFIODeviceIsNV2Bridge: + * @device: string with the PCI device address + * + * This function receives a string that represents a PCI device, + * such as '0004:04:00.0', and tells if the device is a NVLink2 + * bridge. + */ +static bool +ppc64VFIODeviceIsNV2Bridge(const char *device) +{ + const char *nvlink2Files[] = {"ibm,gpu", "ibm,nvlink", + "ibm,nvlink-speed", "memory-region"};
Like I said above, if ^this is not to change, then I'm okay in principle, still I feel like this needs David's ACK (putting him on CC)
Codewise, Peter already provided you with comments.
Hm, I don't feel I can answer this definitively. My guess would be to ask Alexey, and AFAIK you already have information from him.

-- 
David Gibson <dgibson@redhat.com>
Principal Software Engineer, Virtualization, Red Hat

The NVIDIA V100 GPU has an onboard RAM that is mapped into the host memory and accessible as normal RAM via an NVLink2 bridge. When passed through in a guest, QEMU puts the NVIDIA RAM window in a non-contiguous area, above the PCI MMIO area that starts at 32TiB. This means that the NVIDIA RAM window starts at 64TiB and goes all the way to 128TiB.

This means that the guest might request a 64-bit window, for each PCI Host Bridge, that goes all the way to 128TiB. However, the NVIDIA RAM window isn't counted as regular RAM, thus this window is considered only for the allocation of the Translation and Control Entry (TCE).

This memory layout differs from the existing VFIO case, requiring its own formula. This patch changes the PPC64 code of @qemuDomainGetMemLockLimitBytes to:

- detect if we have an NVLink2 bridge being passed through to the guest. This is done by using the @ppc64VFIODeviceIsNV2Bridge function added in the previous patch. The existence of the NVLink2 bridge in the guest means that we are dealing with the NVLink2 memory layout;

- if an IBM NVLink2 bridge exists, passthroughLimit is calculated in a different way to account for the extra memory the TCE table can allocate. The 64TiB..128TiB window is more than enough to fit all possible GPUs, thus the memLimit is the same regardless of passing through 1 or multiple V100 GPUs.

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 src/qemu/qemu_domain.c | 42 ++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 40 insertions(+), 2 deletions(-)

diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index dcc92d253c..6d1a69491d 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -10443,7 +10443,10 @@ getPPC64MemLockLimitBytes(virDomainDefPtr def)
     unsigned long long maxMemory = 0;
     unsigned long long passthroughLimit = 0;
     size_t i, nPCIHostBridges = 0;
+    virPCIDeviceAddressPtr pciAddr;
+    char *pciAddrStr = NULL;
     bool usesVFIO = false;
+    bool nvlink2Capable = false;

     for (i = 0; i < def->ncontrollers; i++) {
         virDomainControllerDefPtr cont = def->controllers[i];
@@ -10461,7 +10464,16 @@ getPPC64MemLockLimitBytes(virDomainDefPtr def)
             dev->source.subsys.type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI &&
             dev->source.subsys.u.pci.backend == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_VFIO) {
             usesVFIO = true;
-            break;
+
+            pciAddr = &dev->source.subsys.u.pci.addr;
+            if (virPCIDeviceAddressIsValid(pciAddr, false)) {
+                pciAddrStr = virPCIDeviceAddressAsString(pciAddr);
+                if (ppc64VFIODeviceIsNV2Bridge(pciAddrStr)) {
+                    nvlink2Capable = true;
+                    break;
+                }
+            }
+
         }
     }

@@ -10488,6 +10500,32 @@ getPPC64MemLockLimitBytes(virDomainDefPtr def)
                 4096 * nPCIHostBridges +
                 8192;

+    /* NVLink2 support in QEMU is a special case of the passthrough
+     * mechanics explained in the usesVFIO case below. The GPU RAM
+     * is placed with a gap after maxMemory. The current QEMU
+     * implementation puts the NVIDIA RAM above the PCI MMIO, which
+     * starts at 32TiB and is the MMIO reserved for the guest main RAM.
+     *
+     * This window ends at 64TiB, and this is where the GPUs are being
+     * placed. The next available window size is at 128TiB, and
+     * 64TiB..128TiB will fit all possible NVIDIA GPUs.
+     *
+     * The same assumption as the most common case applies here:
+     * the guest will request a 64-bit DMA window, per PHB, that is
+     * big enough to map all its RAM, which is now at 128TiB due
+     * to the GPUs.
+     *
+     * Note that the NVIDIA RAM window must be accounted for the TCE
+     * table size, but *not* for the main RAM (maxMemory). This gives
+     * us the following passthroughLimit for the NVLink2 case:
+     *
+     * passthroughLimit = maxMemory +
+     *                    128TiB/512KiB * #PHBs + 8 MiB */
+    if (nvlink2Capable)
+        passthroughLimit = maxMemory +
+                           128 * (1ULL<<30) / 512 * nPCIHostBridges +
+                           8192;
+
     /* passthroughLimit := max( 2 GiB * #PHBs,                       (c)
      *                          memory                               (d)
      *                          + memory * 1/512 * #PHBs + 8 MiB )   (e)
@@ -10507,7 +10545,7 @@ getPPC64MemLockLimitBytes(virDomainDefPtr def)
      * kiB pages, less still if the guest is mapped with hugepages (unlike
      * the default 32-bit DMA window, DDW windows can use large IOMMU
      * pages). 8 MiB is for second and further level overheads, like (b) */
-    if (usesVFIO)
+    else if (usesVFIO)
         passthroughLimit = MAX(2 * 1024 * 1024 * nPCIHostBridges,
                                memory +
                                memory / 512 * nPCIHostBridges + 8192);

-- 
2.20.1

On Tue, Mar 12, 2019 at 18:55:50 -0300, Daniel Henrique Barboza wrote:
The NVIDIA V100 GPU has an onboard RAM that is mapped into the host memory and accessible as normal RAM via an NVLink2 bridge. When passed through in a guest, QEMU puts the NVIDIA RAM window in a non-contiguous area, above the PCI MMIO area that starts at 32TiB. This means that the NVIDIA RAM window starts at 64TiB and go all the way to 128TiB.
This means that the guest might request a 64-bit window, for each PCI Host Bridge, that goes all the way to 128TiB. However, the NVIDIA RAM window isn't counted as regular RAM, thus this window is considered only for the allocation of the Translation and Control Entry (TCE).
This memory layout differs from the existing VFIO case, requiring its own formula. This patch changes the PPC64 code of @qemuDomainGetMemLockLimitBytes to:
- detect if we have a NVLink2 bridge being passed through to the guest. This is done by using the @ppc64VFIODeviceIsNV2Bridge function added in the previous patch. The existence of the NVLink2 bridge in the guest means that we are dealing with the NVLink2 memory layout;
- if an IBM NVLink2 bridge exists, passthroughLimit is calculated in a different way to account for the extra memory the TCE table can alloc. The 64TiB..128TiB window is more than enough to fit all possible GPUs, thus the memLimit is the same regardless of passing through 1 or multiple V100 GPUs.
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> --- src/qemu/qemu_domain.c | 42 ++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 40 insertions(+), 2 deletions(-)
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index dcc92d253c..6d1a69491d 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -10443,7 +10443,10 @@ getPPC64MemLockLimitBytes(virDomainDefPtr def)
     unsigned long long maxMemory = 0;
     unsigned long long passthroughLimit = 0;
     size_t i, nPCIHostBridges = 0;
+    virPCIDeviceAddressPtr pciAddr;
+    char *pciAddrStr = NULL;
     bool usesVFIO = false;
+    bool nvlink2Capable = false;
     for (i = 0; i < def->ncontrollers; i++) {
         virDomainControllerDefPtr cont = def->controllers[i];
@@ -10461,7 +10464,16 @@ getPPC64MemLockLimitBytes(virDomainDefPtr def)
             dev->source.subsys.type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI &&
             dev->source.subsys.u.pci.backend == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_VFIO) {
             usesVFIO = true;
-            break;
+
+            pciAddr = &dev->source.subsys.u.pci.addr;
+            if (virPCIDeviceAddressIsValid(pciAddr, false)) {
+                pciAddrStr = virPCIDeviceAddressAsString(pciAddr);
Again this leaks the PCI address string on every iteration and on exit from this function.
+                if (ppc64VFIODeviceIsNV2Bridge(pciAddrStr)) {
+                    nvlink2Capable = true;
+                    break;
+                }
+            }
+
         }
     }
@@ -10488,6 +10500,32 @@ getPPC64MemLockLimitBytes(virDomainDefPtr def)
                 4096 * nPCIHostBridges +
                 8192;
+    /* NVLink2 support in QEMU is a special case of the passthrough
+     * mechanics explained in the usesVFIO case below. The GPU RAM
+     * is placed with a gap after maxMemory. The current QEMU
+     * implementation puts the NVIDIA RAM above the PCI MMIO, which
+     * starts at 32TiB and is the MMIO reserved for the guest main RAM.
+     *
+     * This window ends at 64TiB, and this is where the GPUs are being
+     * placed. The next available window size is at 128TiB, and
+     * 64TiB..128TiB will fit all possible NVIDIA GPUs.
+     *
+     * The same assumption as the most common case applies here:
+     * the guest will request a 64-bit DMA window, per PHB, that is
+     * big enough to map all its RAM, which is now at 128TiB due
+     * to the GPUs.
+     *
+     * Note that the NVIDIA RAM window must be accounted for the TCE
+     * table size, but *not* for the main RAM (maxMemory). This gives
+     * us the following passthroughLimit for the NVLink2 case:
Citation needed. Please link a source for these claims. We have some sources for claims on x86_64 even if they are not exactly scientific.
+     *
+     * passthroughLimit = maxMemory +
+     *                    128TiB/512KiB * #PHBs + 8 MiB */
+    if (nvlink2Capable)
Please add curly braces to this condition as it's multi-line and also has big comment inside of it.
+        passthroughLimit = maxMemory +
+                           128 * (1ULL<<30) / 512 * nPCIHostBridges +
+                           8192;
I don't quite understand why this formula uses maxMemory while the vfio case uses just 'memory'.
+
     /* passthroughLimit := max( 2 GiB * #PHBs,                       (c)
      *                          memory                               (d)
      *                          + memory * 1/512 * #PHBs + 8 MiB )   (e)
@@ -10507,7 +10545,7 @@ getPPC64MemLockLimitBytes(virDomainDefPtr def)
      * kiB pages, less still if the guest is mapped with hugepages (unlike
      * the default 32-bit DMA window, DDW windows can use large IOMMU
      * pages). 8 MiB is for second and further level overheads, like (b) */
-    if (usesVFIO)
+    else if (usesVFIO)
So can't there be a case when a nvlink2 device is present but also e.g. vfio network cards?
         passthroughLimit = MAX(2 * 1024 * 1024 * nPCIHostBridges,
                                memory +
                                memory / 512 * nPCIHostBridges + 8192);
Also add curly braces here when you are at it.
-- 2.20.1

On 4/2/19 4:37 AM, Peter Krempa wrote:
On Tue, Mar 12, 2019 at 18:55:50 -0300, Daniel Henrique Barboza wrote:
The NVIDIA V100 GPU has an onboard RAM that is mapped into the host memory and accessible as normal RAM via an NVLink2 bridge. When passed through in a guest, QEMU puts the NVIDIA RAM window in a non-contiguous area, above the PCI MMIO area that starts at 32TiB. This means that the NVIDIA RAM window starts at 64TiB and go all the way to 128TiB.
This means that the guest might request a 64-bit window, for each PCI Host Bridge, that goes all the way to 128TiB. However, the NVIDIA RAM window isn't counted as regular RAM, thus this window is considered only for the allocation of the Translation and Control Entry (TCE).
This memory layout differs from the existing VFIO case, requiring its own formula. This patch changes the PPC64 code of @qemuDomainGetMemLockLimitBytes to:
- detect if we have a NVLink2 bridge being passed through to the guest. This is done by using the @ppc64VFIODeviceIsNV2Bridge function added in the previous patch. The existence of the NVLink2 bridge in the guest means that we are dealing with the NVLink2 memory layout;
- if an IBM NVLink2 bridge exists, passthroughLimit is calculated in a different way to account for the extra memory the TCE table can alloc. The 64TiB..128TiB window is more than enough to fit all possible GPUs, thus the memLimit is the same regardless of passing through 1 or multiple V100 GPUs.
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> --- src/qemu/qemu_domain.c | 42 ++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 40 insertions(+), 2 deletions(-)
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index dcc92d253c..6d1a69491d 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -10443,7 +10443,10 @@ getPPC64MemLockLimitBytes(virDomainDefPtr def) unsigned long long maxMemory = 0; unsigned long long passthroughLimit = 0; size_t i, nPCIHostBridges = 0; + virPCIDeviceAddressPtr pciAddr; + char *pciAddrStr = NULL; bool usesVFIO = false; + bool nvlink2Capable = false;
     for (i = 0; i < def->ncontrollers; i++) {
         virDomainControllerDefPtr cont = def->controllers[i];
@@ -10461,7 +10464,16 @@ getPPC64MemLockLimitBytes(virDomainDefPtr def)
             dev->source.subsys.type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI &&
             dev->source.subsys.u.pci.backend == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_VFIO) {
             usesVFIO = true;
-            break;
+
+            pciAddr = &dev->source.subsys.u.pci.addr;
+            if (virPCIDeviceAddressIsValid(pciAddr, false)) {
+                pciAddrStr = virPCIDeviceAddressAsString(pciAddr);

Again this leaks the PCI address string on every iteration and on exit from this function.
I followed Jano tip here as well (VIR_AUTOFREE and variable declared inside the loop).
+ if (ppc64VFIODeviceIsNV2Bridge(pciAddrStr)) { + nvlink2Capable = true; + break; + } + } + } }
@@ -10488,6 +10500,32 @@ getPPC64MemLockLimitBytes(virDomainDefPtr def) 4096 * nPCIHostBridges + 8192;
+    /* NVLink2 support in QEMU is a special case of the passthrough
+     * mechanics explained in the usesVFIO case below. The GPU RAM
+     * is placed with a gap after maxMemory. The current QEMU
+     * implementation puts the NVIDIA RAM above the PCI MMIO, which
+     * starts at 32TiB and is the MMIO reserved for the guest main RAM.
+     *
+     * This window ends at 64TiB, and this is where the GPUs are being
+     * placed. The next available window size is at 128TiB, and
+     * 64TiB..128TiB will fit all possible NVIDIA GPUs.
+     *
+     * The same assumption as the most common case applies here:
+     * the guest will request a 64-bit DMA window, per PHB, that is
+     * big enough to map all its RAM, which is now at 128TiB due
+     * to the GPUs.
+     *
+     * Note that the NVIDIA RAM window must be accounted for the TCE
+     * table size, but *not* for the main RAM (maxMemory). This gives
+     * us the following passthroughLimit for the NVLink2 case:

Citation needed. Please link a source for these claims. We have some sources for claims on x86_64 even if they are not exactly scientific.
The source is the QEMU implementation of the NVLink 2 support. I can link the QEMU patches in the commit msg like I did in the cover-letter. Will that suffice?
+     *
+     * passthroughLimit = maxMemory +
+     *                    128TiB/512KiB * #PHBs + 8 MiB */
+    if (nvlink2Capable)

Please add curly braces to this condition as it's multi-line and also has big comment inside of it.
+        passthroughLimit = maxMemory +
+                           128 * (1ULL<<30) / 512 * nPCIHostBridges +
+                           8192;

I don't quite understand why this formula uses maxMemory while the vfio case uses just 'memory'.
The main difference is the memory hotplug case. If there is no memory hotplug, maxMemory == memory and everything is the same.

With memory hotplug, "memory" in the calculations represents "megs" from QEMU's "-m [size=]megs[,slots=n,maxmem=size]". For VFIO, we use 'memory' because until the new memory is plugged into the guest, only "megs" are actually mapped for DMA. The actual IOMMU table window backing the 64-bit DMA window still has to cover "maxmem" though, which is why 'maxMemory' is used in the 'baseLimit' formula.

For NV2, we use maxMemory because the GPU RAM is placed after 'maxMemory', with a gap (which is now at the 64TiB window).

Thanks, dhb
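To make the difference concrete, here is a throwaway arithmetic sketch (plain C, not libvirt code) that plugs made-up numbers - 32 GiB of plugged RAM, 128 GiB maxmem, one PHB - into the two formulas being discussed:

#include <stdio.h>

int main(void)
{
    /* all values in KiB, matching the units used inside
     * getPPC64MemLockLimitBytes before the final shift to bytes */
    unsigned long long memory = 32ULL * 1024 * 1024;      /* 32 GiB guest RAM */
    unsigned long long maxMemory = 128ULL * 1024 * 1024;  /* 128 GiB maxmem */
    unsigned long long nPCIHostBridges = 1;

    /* existing VFIO formula: TCEs (8 bytes per 4 KiB IOMMU page, i.e.
     * 1/512 of the mapped range) are only needed for the plugged RAM */
    unsigned long long vfio = memory +
                              memory / 512 * nPCIHostBridges + 8192;
    if (vfio < 2ULL * 1024 * 1024 * nPCIHostBridges)
        vfio = 2ULL * 1024 * 1024 * nPCIHostBridges;

    /* NVLink2 formula from this patch: the 64-bit DMA window has to reach
     * the GPU RAM placed below 128 TiB, so the TCE tables are sized for
     * the whole 128 TiB window regardless of how many GPUs are attached */
    unsigned long long nvlink2 = maxMemory +
                                 128 * (1ULL << 30) / 512 * nPCIHostBridges +
                                 8192;

    printf("VFIO passthroughLimit:    %llu KiB\n", vfio);
    printf("NVLink2 passthroughLimit: %llu KiB\n", nvlink2);
    return 0;
}

With these example numbers the NVLink2 limit is dominated by the roughly 256 GiB of TCE tables needed to cover the whole 128 TiB window, while the plain VFIO limit stays close to the plugged 32 GiB.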
+
     /* passthroughLimit := max( 2 GiB * #PHBs,                       (c)
      *                          memory                               (d)
      *                          + memory * 1/512 * #PHBs + 8 MiB )   (e)
@@ -10507,7 +10545,7 @@ getPPC64MemLockLimitBytes(virDomainDefPtr def)
      * kiB pages, less still if the guest is mapped with hugepages (unlike
      * the default 32-bit DMA window, DDW windows can use large IOMMU
      * pages). 8 MiB is for second and further level overheads, like (b) */
-    if (usesVFIO)
+    else if (usesVFIO)

So can't there be a case when a nvlink2 device is present but also e.g. vfio network cards?
passthroughLimit = MAX(2 * 1024 * 1024 * nPCIHostBridges, memory + memory / 512 * nPCIHostBridges + 8192);
Also add curly braces here when you are at it.
-- 2.20.1

Ping

The QEMU implementation was accepted and will be present in QEMU 4.1. The accepted implementation is compatible with the work already done in this series, so this version is still current.

Thanks, dhb

On 3/12/19 6:55 PM, Daniel Henrique Barboza wrote:
This series adds support for a new QEMU feature for the spapr (PPC64) machine, NVIDIA V100 + P9 passthrough. Refer to [1] for the version 7 of this feature (same version used as a reference for this series).
changes in v4:
- only 2 patches - previous patches 1 and 2 were applied
- the detection mechanism was redesigned in patch 3. It is now simpler and shorter after the discussions from the previous version.
- previous version can be found at [2]
[1] https://patchwork.kernel.org/patch/10848727/ [2] https://www.redhat.com/archives/libvir-list/2019-March/msg00212.html
Daniel Henrique Barboza (2): qemu_domain: NVLink2 bridge detection function for PPC64 PPC64 support for NVIDIA V100 GPU with NVLink2 passthrough
src/qemu/qemu_domain.c | 71 ++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 69 insertions(+), 2 deletions(-)

On Mon, Apr 01, 2019 at 06:28:04PM -0300, Daniel Henrique Barboza wrote:
Ping
The QEMU implementation was accepted and will be presented in QEMU 4.1. The implementation accepted is compatible with the work already done in this series, so this version is still current.
We contacted NVIDIA 2 weeks ago to get more information on NVLINK itself and also whether they even have x86_64 NVLINK on their roadmap and if so, what similarities/differences the 2 arches would have which would further help us asses this patch set. Unfortunately, there hasn't been any response yet. Erik
participants (6)
- Alexey Kardashevskiy
- Daniel Henrique Barboza
- David Gibson
- Erik Skultety
- Ján Tomko
- Peter Krempa