
...
Sorry for the delay in responding. The problem is that all V100 GPUs support NVLink, but it may or may not actually be wired up on a given system. This is detected at runtime during GPU initialization, which seems like much too heavy an operation to perform as part of passthrough initialization. That's why the vfio-pci pieces rely on device-tree information to figure it out.
Alexey, would it be possible for vfio-pci to export this information in a way that's friendlier to libvirt?
The only information needed here is whether a specific GPU has RAM or not. That can easily be found in the device-tree, which imho is quite friendly already. VFIO itself only learns about these new capabilities when the VFIO PCI device is opened, and we'd rather avoid going that far in libvirt (open a VFIO container, attach a group, get a vfio-pci fd from it, enumerate regions - two PCI resets along the way, delays, meh).
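(For context, the heavyweight path being avoided is roughly the standard VFIO device-open sequence. A minimal sketch; the group number and device address are hypothetical, and error/status checks such as VFIO_GET_API_VERSION and VFIO_GROUP_GET_STATUS are omitted for brevity:)

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    int main(void)
    {
        int container = open("/dev/vfio/vfio", O_RDWR);
        int group = open("/dev/vfio/26", O_RDWR);   /* hypothetical group */

        /* A group must be attached before an IOMMU model can be set. */
        ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
        ioctl(container, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_v2_IOMMU);

        /* Getting the device fd is where the resets/delays come in. */
        int device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0004:04:00.0");

        struct vfio_device_info info = { .argsz = sizeof(info) };
        ioctl(device, VFIO_DEVICE_GET_INFO, &info);

        for (unsigned int i = 0; i < info.num_regions; i++) {
            struct vfio_region_info reg = { .argsz = sizeof(reg), .index = i };
            ioctl(device, VFIO_DEVICE_GET_REGION_INFO, &reg);
            printf("region %u: size 0x%llx\n", i, (unsigned long long)reg.size);
        }
        return 0;
    }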
Agreed; we already discussed this while dealing with other features, and we really don't want libvirt to have to open a VFIO container just to query a few attributes/settings which it would then process or pass directly to QEMU. We'd therefore need a different interface that libvirt can consume...
btw the first "find" for "ibm,npu" can be skipped - the NVLinks have to be passed through as well, or the whole RAM scheme won't work. The "find" for the memory node can also be dropped: if the NVLink bridge's OF node has a "memory-region" property, then VFIO will most likely expose the RAM and QEMU will try to use it anyway.
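(To illustrate the lighter-weight alternative: on powerpc, sysfs exposes a PCI device's device-tree node via its of_node link, and properties appear as files underneath it, so the check reduces to a property lookup. A hedged sketch; the helper name and PCI address are hypothetical, not a confirmed libvirt interface:)

    #include <stdbool.h>
    #include <stdio.h>
    #include <unistd.h>

    static bool nvlink_bridge_has_gpu_ram(const char *pci_addr)
    {
        char path[256];

        /* Per the discussion above: a "memory-region" property on the
         * NVLink bridge's OF node means VFIO will expose the GPU RAM. */
        snprintf(path, sizeof(path),
                 "/sys/bus/pci/devices/%s/of_node/memory-region", pci_addr);
        return access(path, F_OK) == 0;
    }

    int main(void)
    {
        const char *addr = "0004:00:00.0";   /* hypothetical NVLink bridge */

        printf("%s: GPU RAM %s\n", addr,
               nvlink_bridge_has_gpu_ram(addr) ? "exposed" : "not exposed");
        return 0;
    }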
I'm not sure I follow the suggestion above - can you be more specific about how you imagine libvirt detecting it?

Thanks,
Erik