
On 8/5/22 2:20 PM, Jason Gunthorpe wrote:
On Fri, Aug 05, 2022 at 11:24:08AM -0600, Alex Williamson wrote:
On Thu, 4 Aug 2022 21:11:07 -0300 Jason Gunthorpe <jgg@nvidia.com> wrote:
On Thu, Aug 04, 2022 at 01:36:24PM -0600, Alex Williamson wrote:
That is reasonable, but I'd say those three kernels only have two drivers and they both have vfio as a substring in their name - so the simple thing of just substring searching 'vfio' would get us over that gap.
Looking at the aliases for exactly "vfio_pci" isn't that much more complicated, and "feels" a lot more reliable than just doing a substring search for "vfio" in the driver's name. (It would be, uh, .... "not smart" to name a driver "vfio<anything>" if it wasn't actually a vfio variant driver (or the opposite), but I could imagine it happening; :-/)
This is still pretty hacky. I'm worried about what happens to the kernel if this becames some crazy unintended uAPI that we never really thought about carefully... This was not a use case when we designed the modules.alias stuff at least.
BTW - why not do things the normal way?
1. readlink /sys/bus/pci/devices/XX/iommu_group 2. Compute basename of #1 3. Check if /dev/vfio/#2 exists (or /sys/class/vfio/#2)
It has a small edge case where a multi-device group might give a false positive for an undrivered device, but for the purposes of libvirt that seems pretty obscure.. (while the above has false negative issues, obviously)
This is not a small edge case, it's extremely common. We have a *lot* of users assigning desktop GPUs and other consumer grade hardware, which are usually multi-function devices without isolation exposed via ACS or quirks.
The edge case is that the user has created a multi-device group, manually assigned device 1 in the group to VFIO, left device 2 with no driver and then told libvirt to manually use device 2. With the above approach libvirt won't detect this misconfiguration and qemu will fail.
libvirt will see that there is no driver at all, and recognize that, by definition, "no driver" == "not a vfio variant driver" :-). So in this case libvirt catches the misconfiguration.
The vfio group exists if any devices in the group are bound to a vfio driver, but the device is not accessible from the group unless the viability test passes. That means QEMU may not be able to get access to the device because the device we want isn't actually bound to a vfio driver or another device in the group is not in a viable state. Thanks,
This is a different misconfiguration that libvirt also won't detect, right? In this case ownership claiming in the kernel will fail and qemu will fail too, like above.
Right. If we're relying on "iommu group matching" as you suggest, then libvirt will mistakenly conclude that the driver is a vfio variant, but then qemu will fail.
This, and the above, could be handled by having libvirt also open the group FD and get the device. It would prove both correct binding and viability.
I had understood the point of this logic was to give better error reporting to users so that common misconfigurations could be diagnosed earlier.
Correct. In the end QEMU and the kernel have the final say of course, but if we can detect the problem sooner then it's more likely the user will get a meaningful error message.
When I say 'small edge case' I mean it seems like an unlikely misconfiguration that someone would know to setup VFIO but then use the wrong BDFs to do it - arguably less likely than someone would know to setup VFIO but forget to unbind the other drivers in the group?
You obviously haven't spent enough time trying to remotely troubleshoot the setups of noobs on IRC :-). If anything can be done wrong, there is certainly someone around who will do it that way. Of course we can't eliminate 100% of these, but the more misconfigurations we can catch, and the earlier we can catch them, the better. All of this without any false error reports of course. It's better that some edge cases get through (and be shot down by QEMU) rather than that we would mistakenly prevent someone from using a totally viable config (which is the case right now).
But maybe I don't get it at all ...
Nah, you get it. We just have minor differences of opinion of which choice will be the best combination of simple to implement + less user headaches (and in the end any of us could be right; personally my opinion sways from minute to minute).