On 8/5/22 2:56 PM, Alex Williamson wrote:
On Fri, 5 Aug 2022 15:20:24 -0300
Jason Gunthorpe <jgg(a)nvidia.com> wrote:
> On Fri, Aug 05, 2022 at 11:24:08AM -0600, Alex Williamson wrote:
>> On Thu, 4 Aug 2022 21:11:07 -0300
>> Jason Gunthorpe <jgg(a)nvidia.com> wrote:
>>
>>> On Thu, Aug 04, 2022 at 01:36:24PM -0600, Alex Williamson wrote:
>>>
>>>>>> That is reasonable, but I'd say those three kernels only have two
>>>>>> drivers and they both have vfio as a substring in their name - so the
>>>>>> simple thing of just substring searching 'vfio' would get us over that
>>>>>> gap.
>>>>>
>>>>> Looking at the aliases for exactly "vfio_pci" isn't that much more
>>>>> complicated, and "feels" a lot more reliable than just doing a substring
>>>>> search for "vfio" in the driver's name. (It would be, uh, .... "not
>>>>> smart" to name a driver "vfio<anything>" if it wasn't actually a vfio
>>>>> variant driver (or the opposite), but I could imagine it happening; :-/)
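
[For reference, a rough sketch of the "substring in the driver name"
heuristic being discussed above, written here in Python against sysfs;
the helper names are invented for illustration and this is not code
from libvirt:]

import os

def bound_driver(bdf):
    # Resolve /sys/bus/pci/devices/<BDF>/driver to the driver's name,
    # or return None if no driver is bound at all.
    link = f"/sys/bus/pci/devices/{bdf}/driver"
    if not os.path.islink(link):
        return None
    return os.path.basename(os.readlink(link))

def looks_like_vfio_driver(bdf):
    # The crude heuristic: any bound driver with "vfio" in its name
    # is assumed to be vfio-pci or a vfio variant driver.
    drv = bound_driver(bdf)
    return drv is not None and "vfio" in drv
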
>>>
>>> This is still pretty hacky. I'm worried about what happens to the
>>> kernel if this becomes some crazy unintended uAPI that we never really
>>> thought about carefully... This was not a use case when we designed
>>> the modules.alias stuff at least.
>>>
>>> BTW - why not do things the normal way?
>>>
>>> 1. readlink /sys/bus/pci/devices/XX/iommu_group
>>> 2. Compute basename of #1
>>> 3. Check if /dev/vfio/#2 exists (or /sys/class/vfio/#2)
>>>
>>> It has a small edge case where a multi-device group might give a false
>>> positive for an undrivered device, but for the purposes of libvirt
>>> that seems pretty obscure.. (while the above has false negative
>>> issues, obviously)
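
[The "normal way" above, as a sketch in Python (again just illustrative,
and with the false-positive caveat for multi-device groups that Jason
notes):]

import os

def vfio_group_node_exists(bdf):
    # 1. readlink the device's iommu_group, 2. take the basename,
    # 3. check whether the corresponding /dev/vfio/<group> node exists.
    group_link = f"/sys/bus/pci/devices/{bdf}/iommu_group"
    group = os.path.basename(os.readlink(group_link))
    return os.path.exists(f"/dev/vfio/{group}")
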
>>
>> This is not a small edge case, it's extremely common. We have a *lot*
>> of users assigning desktop GPUs and other consumer grade hardware, which
>> are usually multi-function devices without isolation exposed via ACS or
>> quirks.
>
> The edge case is that the user has created a multi-device group,
> manually assigned device 1 in the group to VFIO, left device 2 with no
> driver and then told libvirt to manually use device 2. With the above
> approach libvirt won't detect this misconfiguration and qemu will
> fail.
>
>> The vfio group exists if any devices in the group are bound to a vfio
>> driver, but the device is not accessible from the group unless the
>> viability test passes. That means QEMU may not be able to get access
>> to the device because the device we want isn't actually bound to a vfio
>> driver or another device in the group is not in a viable state. Thanks,
>
> This is a different misconfiguration that libvirt also won't detect,
> right? In this case ownership claiming in the kernel will fail and
> qemu will fail too, like above.
>
> This, and the above, could be handled by having libvirt also open the
> group FD and get the device. It would prove both correct binding and
> viability.
libvirt cannot do this in the group model because the group must be
isolated in a container before the device can be accessed and libvirt
cannot presume the QEMU container configuration. For direct device
access, this certainly becomes a possibility and I've been trying to
steer things in that direction; libvirt has the option to pass an fd for
the iommufd and can then pass fds for each of the devices in the new
uAPI paradigm.
> I had understood the point of this logic was to give better error
> reporting to users so that common misconfigurations could be diagnosed
> earlier. When I say 'small edge case' I mean it seems like an unlikely
> misconfiguration that someone would know to setup VFIO but then use
> the wrong BDFs to do it - arguably less likely than someone would know
> to setup VFIO but forget to unbind the other drivers in the group?
I'm not sure how much testing libvirt does of other devices in a group,
Laine?
It had been so long since I looked at that, and the code was so obtuse,
that I had to set something up to see what happened.
1) if there is another device in the same group, and that device is
bound to vfio-pci and in use by a different QEMU process, then libvirt
refuses to assign the device in question (it will allow assigning the
device to the same guest as other devices in the same group).
2) if there is another device in the same group, and that device is
bound to a driver other than vfio-pci, then libvirt doesn't notice,
tells QEMU to do the assignment, and QEMU fails.
Without looking/trying, I would have said that libvirt would check for
(2), but I guess nobody ever tried it :-/
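
[For completeness, the sort of group-wide check that would catch case
(2), sketched in Python; not the code libvirt uses, and the names are
made up:]

import os

def group_device_drivers(bdf):
    # Walk the devices in the same iommu group as <bdf> and map each
    # one to the name of its bound driver (None if unbound).
    group_dir = os.path.realpath(f"/sys/bus/pci/devices/{bdf}/iommu_group")
    drivers = {}
    for dev in os.listdir(os.path.join(group_dir, "devices")):
        drv_link = os.path.join(group_dir, "devices", dev, "driver")
        drivers[dev] = (os.path.basename(os.readlink(drv_link))
                        if os.path.islink(drv_link) else None)
    return drivers

Case (2) is then "some other device in the group maps to a non-vfio
driver"; case (1) needs more than this, since "in use by a different
QEMU process" isn't visible from sysfs alone.
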
AIUI here, libvirt has a managed='yes|no' option per device. In the
'yes' case libvirt will unbind devices from their host driver and bind
them to vfio-pci. In the 'no' case, I believe libvirt is still doing a
sanity test on the driver, but only knows about vfio-pci.
Correct.
The initial step is to then enlighten libvirt that other drivers can be
compatible for the 'no' case and later we can make smarter choices
about which driver to use or allow the user to specify (i.e. a user
should be able to use vfio-pci rather than a variant driver if they
choose) in the 'yes' case.
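
[As a rough illustration of what the managed='yes' host-side step
amounts to (not necessarily the exact sequence libvirt performs),
rebinding a device to vfio-pci, or to a chosen variant driver, can be
done through sysfs along these lines, in Python:]

import os

def rebind_to(bdf, new_driver="vfio-pci"):
    # Use driver_override so drivers_probe binds the device to the
    # requested driver rather than whatever normally matches it.
    dev = f"/sys/bus/pci/devices/{bdf}"
    with open(f"{dev}/driver_override", "w") as f:
        f.write(new_driver)
    if os.path.islink(f"{dev}/driver"):
        with open(f"{dev}/driver/unbind", "w") as f:
            f.write(bdf)
    with open("/sys/bus/pci/drivers_probe", "w") as f:
        f.write(bdf)

For the 'yes' case with variant drivers, new_driver would simply become
the selected variant driver instead of vfio-pci.
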
Yes, that's the next step. I just wanted to first add a simple (i.e.
difficult to botch up) patch to allow trying out the new variant drivers
without bypassing libvirt (which is bad PR for libvirt every time it
happens).
If libvirt is currently testing that only the target device is bound to
vfio-pci, then maybe we do have gaps for the ancillary devices in the
group, but that gap changes if instead we only test that a vfio group
exists relative to the iommu group of the target device.
Yep. Currently we miss case (2) above. With this suggested method of
detecting vfio variant drivers, we would also start missing (1).
Of course QEMU would also give an error and fail in this case, so it's
not catastrophic (but still not preferred).