On Wed, 7 Sep 2016 23:36:28 +0530
Kirti Wankhede <kwankhede(a)nvidia.com> wrote:
On 9/7/2016 10:14 PM, Alex Williamson wrote:
> On Wed, 7 Sep 2016 21:45:31 +0530
> Kirti Wankhede <kwankhede(a)nvidia.com> wrote:
>
>> On 9/7/2016 2:58 AM, Alex Williamson wrote:
>>> On Wed, 7 Sep 2016 01:05:11 +0530
>>> Kirti Wankhede <kwankhede(a)nvidia.com> wrote:
>>>
>>>> On 9/6/2016 11:10 PM, Alex Williamson wrote:
>>>>> On Sat, 3 Sep 2016 22:04:56 +0530
>>>>> Kirti Wankhede <kwankhede(a)nvidia.com> wrote:
>>>>>
>>>>>> On 9/3/2016 3:18 AM, Paolo Bonzini wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 02/09/2016 20:33, Kirti Wankhede wrote:
>>>>>>>> <Alex> We could even do:
>>>>>>>>>>
>>>>>>>>>> echo $UUID1:$GROUPA > create
>>>>>>>>>>
>>>>>>>>>> where $GROUPA is the group ID of a previously
created mdev device into
>>>>>>>>>> which $UUID1 is to be created and added to the
same group.
>>>>>>>> </Alex>
>>>>>>>
>>>>>>> From the point of view of libvirt, I think I prefer
Alex's idea.
>>>>>>> <group> could be an additional element in the
nodedev-create XML:
>>>>>>>
>>>>>>> <device>
>>>>>>> <name>my-vgpu</name>
>>>>>>> <parent>pci_0000_86_00_0</parent>
>>>>>>> <capability type='mdev'>
>>>>>>> <type id='11'/>
>>>>>>>
<uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
>>>>>>> <group>group1</group>
>>>>>>> </capability>
>>>>>>> </device>
>>>>>>>
>>>>>>> (should group also be a UUID?)
>>>>>>>
>>>>>>
>>>>>> No, this should be a unique number in a system, similar to
iommu_group.
>>>>>
>>>>> Sorry, just trying to catch up on this thread after a long weekend.
>>>>>
>>>>> We're talking about iommu groups here, we're not creating
any sort of
>>>>> parallel grouping specific to mdev devices.
>>>>
>>>> I thought we were talking about group of mdev devices and not iommu
>>>> group. IIRC, there were concerns about it (this would be similar to
>>>> UUID+instance) and that would (ab)use iommu groups.
>>>
>>> What constraints does a group, which is not an iommu group, place on the
>>> usage of the mdev devices? What happens if we put two mdev devices in
>>> the same "mdev group" and then assign them to separate VMs/users?
I
>>> believe that the answer is that this theoretical "mdev group"
doesn't
>>> actually impose any constraints on the devices within the group or how
>>> they're used.
>>>
>>
>> We feel its not a good idea to try to associate device's iommu groups
>> with mdev device groups. That adds more complications.
>>
>> As in above nodedev-create xml, 'group1' could be a unique number that
>> can be generated by libvirt. Then to create mdev device:
>>
>> echo $UUID1:group1 > create
>>
>> If user want to add more mdev devices to same group, he/she should use
>> same group number in next nodedev-create devices. So create commands
>> would be:
>> echo $UUID2:group1 > create
>> echo $UUID3:group1 > create
>
> So groups return to being static, libvirt would need to destroy and
> create mdev devices specifically for use within the predefined group?
Yes.
> This imposes limitations on how mdev devices can be used (ie. the mdev
> pool option is once again removed). We're also back to imposing
> grouping semantics on mdev devices that may not need them. Do all mdev
> devices for a given user need to be put into the same group?
Yes.
> Do groups
> span parent devices? Do they span different vendor drivers?
>
Yes and yes. Group number would be associated with mdev device
irrespective of its parent.
>> Each mdev device would store this group number in its mdev_device
>> structure.
>>
>> With this, we would add open() and close() callbacks from vfio_mdev
>> module for vendor driver to commit resources. Then we don't need
>> 'start'/'stop' or online/offline interface.
>>
>> To commit resources for all devices associated to that domain/user space
>> application, vendor driver can use 'first open()' and 'last
close()' to
>> free those. Or if vendor driver want to commit resources for each device
>> separately, they can do in each device's open() call. It will depend on
>> vendor driver how they want to implement.
>>
>> Libvirt don't have to do anything about assigned group numbers while
>> managing mdev devices.
>>
>> QEMU commandline parameter would be same as earlier (don't have to
>> mention group number here):
>>
>> -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/$UUID1 \
>> -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/$UUID2
>>
>> In case if two mdev devices from same groups are assigned to different
>> domains, we can fail open() call of second device. How would driver know
>> that those are being used by different domain? By checking <group1, pid>
>> of first device of 'group1'. The two devices in same group should have
>> same pid in their open() call.
>
> Are you assuming that the two devices are owned by the same vendor
> driver?
No. See my reply to next questions below.
> What if I put NVIDIA and Intel vGPUs both into the same group
> and give each of them to a separate VM?
It depends on where we put the logic to verify pid in open() call of
each devices in group.
If we place the logic of checking <group, pid> for devices in a group in
vendor driver, then in above case both VMs would boot.
But If we impose this logic in mdev core or vfio_mdev module, then
open() on second device should fail.
So you're proposing that the mdev layer keeps a list of mdev-groups and
wraps the vfio_device_ops.{open,release} entry points to record or
verify the user on each open, keep tallies of the open devices, and
clears that association on the last close? Is pid really the thing we
want to key on, what about multiple threads running in the same address
space? vfio-core does this my only allowing a single open on the vfio
group thus the vfio device file descriptors can be farmed out to other
threads. Using pid seems incompatible with that usage model and we'll
have a vfio group per mdev device, so we can't restrict access there.
The model seems plausible, but also significantly restricts the user's
freedom unless we can come up with a better context to use to identify
the user.
Forcing groups to be static also seems arbitrary since nothing here
demands that the mdev group cannot be changed while not in use. This
grouping is really only required for NVIDIA mdev devices, so it needs
to be as non-intrusive as possible for other vendors or it needs to
only be invoked for vendors that require it.
> How would the NVIDIA host
> driver know which <group, pid> the Intel device got?
How to make use of group number to commit resources for devices owned by
a vendor would be vendor driver's responsibility. NVIDIA driver doesn't
need to know about Intel's vGPU nor Intel driver need to know about
NVIDIA's vGPU.
So the mdev layer would be responsible for making sure that a device
within a mdev group can only be opened by the <somehow> identified user
and the vendor driver would have it's own list of mdev groups and
devices and do yet more first-open/last-closed processing.
> This is what the
> iommu groups do that a different layer of grouping cannot do. Maybe
> you're suggesting a group per vendor driver, but how does libvirt know
> the vendor driver? Do they need to go research the parent device in
> sysfs and compare driver links?
>
No, group is not associated with vendor driver. Group number is
associated iwth mdev device.
Philosophically, mdev devices should be entirely independent of one
another. A user can set the same iommu context for multiple mdevs
by placing them in the same container. A user should be able to
stop using an mdev in one place and start using it somewhere else.
It should be a fungible $TYPE device. It's an NVIDIA-only requirement
that imposes this association of mdev devices into groups and I don't
particularly see it as beneficial to the mdev architecture. So why
make it a standard part of the interface?
We could do keying at the layer you suggest, assuming we can find
something that doesn't restrict the user, but we could make that
optional. For instance, say we did key on pid, there could be an
attribute in the supported types hierarchy to indicate this type
supports(requires) pid-sets. Each mdev device with this attribute
would create a pid-group file in sysfs where libvirt could associate
the device. Only for those mdev devices requiring it.
The alternative is that we need to find some mechanism for this
association that doesn't impose arbitrary requirements, and potentially
usage restrictions on vendors that don't have this need. Thanks,
Alex