On Wed, 7 Sep 2016 08:22:05 +0000
"Tian, Kevin" <kevin.tian(a)intel.com> wrote:
> From: Alex Williamson
> Sent: Wednesday, September 07, 2016 5:29 AM
>
> On Wed, 7 Sep 2016 01:05:11 +0530
> Kirti Wankhede <kwankhede(a)nvidia.com> wrote:
>
> > On 9/6/2016 11:10 PM, Alex Williamson wrote:
> > > On Sat, 3 Sep 2016 22:04:56 +0530
> > > Kirti Wankhede <kwankhede(a)nvidia.com> wrote:
> > >
> > >> On 9/3/2016 3:18 AM, Paolo Bonzini wrote:
> > >>>
> > >>>
> > >>> On 02/09/2016 20:33, Kirti Wankhede wrote:
> > >>>> <Alex> We could even do:
> > >>>>>>
> > >>>>>> echo $UUID1:$GROUPA > create
> > >>>>>>
> > >>>>>> where $GROUPA is the group ID of a previously created mdev
> > >>>>>> device into which $UUID1 is to be created and added to the
> > >>>>>> same group.
> > >>>> </Alex>
> > >>>
> > >>> From the point of view of libvirt, I think I prefer Alex's idea.
> > >>> <group> could be an additional element in the nodedev-create XML:
> > >>>
> > >>>   <device>
> > >>>     <name>my-vgpu</name>
> > >>>     <parent>pci_0000_86_00_0</parent>
> > >>>     <capability type='mdev'>
> > >>>       <type id='11'/>
> > >>>       <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
> > >>>       <group>group1</group>
> > >>>     </capability>
> > >>>   </device>
> > >>>
> > >>> (should group also be a UUID?)
> > >>>
> > >>
> > >> No, this should be a unique number in a system, similar to iommu_group.
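For context, a minimal sketch of how such an XML would be fed to libvirt.
Only `virsh nodedev-create <xml-file>` itself exists today; the mdev
capability, <type>, and <group> elements are just Paolo's proposal above,
not something libvirt actually parses yet:

  # hypothetical usage, assuming libvirt grew support for the proposed XML
  cat > my-vgpu.xml <<'EOF'
  <device>
    <name>my-vgpu</name>
    <parent>pci_0000_86_00_0</parent>
    <capability type='mdev'>
      <type id='11'/>
      <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
      <group>group1</group>
    </capability>
  </device>
  EOF
  virsh nodedev-create my-vgpu.xml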
> > >
> > > Sorry, just trying to catch up on this thread after a long weekend.
> > >
> > > We're talking about iommu groups here; we're not creating any sort of
> > > parallel grouping specific to mdev devices.
> >
> > I thought we were talking about a group of mdev devices and not an iommu
> > group. IIRC, there were concerns about it (this would be similar to
> > UUID+instance) and that it would (ab)use iommu groups.
>
> What constraints does a group, which is not an iommu group, place on the
> usage of the mdev devices? What happens if we put two mdev devices in
> the same "mdev group" and then assign them to separate VMs/users? I
> believe that the answer is that this theoretical "mdev group" doesn't
> actually impose any constraints on the devices within the group or how
> they're used.
>
> vfio knows about iommu groups and we consider an iommu group to be the
> unit of ownership for userspace. Therefore by placing multiple mdev
> devices within the same iommu group we can be assured that there's only
> one user for that group. Furthermore, the specific case for this
> association on NVIDIA is to couple the hardware peer-to-peer resources
> for the individual mdev devices. Therefore this particular grouping
> does imply a lack of isolation between those mdev devices involved in
> the group.
>
> For mdev devices which are actually isolated from one another, where
> they don't poke these p2p holes, placing them in the same iommu group
> is definitely an abuse of the interface and is going to lead to
> problems with a single iommu context. But how does libvirt know that
> one type of mdev device needs to be grouped while another type doesn't?
Can we introduce an attribute under the specific type to indicate such a p2p
requirement, so libvirt knows that an additional grouping action is needed?

I don't have any objection to that.
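Something along these lines, perhaps; the attribute name and location below
are purely illustrative, nothing like it exists in the proposed sysfs layout
yet:

  # hypothetical per-type flag; name and path are illustrative only
  if [ -e "/sys/devices/mdev/<s:b:d.f>/types/1/requires_p2p_grouping" ]; then
      # libvirt would know that devices of this type need the extra
      # p2p/iommu grouping step before being handed to a VM
      echo "type 1 needs the extra grouping step"
  fi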
>
> There's really not much that I like about using iommu groups in this
> way, it's just that they seem to solve this particular problem of
> enforcing how such a group can be used and imposing a second form of
> grouping onto the vfio infrastructure seems much too complex.
>
> > I'm thinking about your suggestion, but I would also like to know your
> > thoughts on what the sysfs interface would look like; it's still not
> > clear to me. Or would it be better to have grouping at the mdev layer?
>
> In previous replies I had proposed that a group could be an additional
> argument when we write the mdev UUID to the create entry in sysfs.
> This is specifically why I listed only the UUID when creating the first
> mdev device and UUID:group when creating the second. The user would
> need to go determine the group ID allocated for the first entry to
> specify creating the second within that same group.
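For concreteness, a sketch of that earlier flow. The UUID:group create
syntax and the sysfs paths are only the proposal under discussion, not an
existing interface, and where exactly the allocated group ID would be read
from is an assumption:

  # first device: a new group is allocated implicitly
  echo $UUID0 > /sys/devices/mdev/<s:b:d.f>/types/1/create
  # look up the group the first device landed in (location assumed)
  GROUPA=$(basename $(readlink \
      /sys/devices/mdev/<s:b:d.f>/types/1/devices/$UUID0/iommu_group))
  # second device: created directly into that same group
  echo $UUID1:$GROUPA > /sys/devices/mdev/<s:b:d.f>/types/1/create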
>
> I have no love for this proposal; it's functional but not elegant, and it
> again leaves libvirt lost in trying to determine which devices need to
> be grouped together and which have no business being grouped together.
>
> Let's think through this further and let me make a couple assumptions
> to get started:
>
> 1) iommu groups are the way that we want to group NVIDIA vGPUs because:
> a) The peer-to-peer resources represent an isolation gap between
> mdev devices, iommu groups represent sets of isolated devices.
> b) The 1:1 mapping of an iommu group to a user matches the NVIDIA
> device model.
> c) iommu_group_for_each_dev() gives the vendor driver the
> functionality it needs to perform a first-open/last-close
> device walk for configuring these p2p resources.
>
> 2) iommu groups as used by mdev devices should contain the minimum
> number of devices in order to provide the maximum iommu context
> flexibility.
>
> Do we agree on these? The corollary is that NVIDIA is going to suffer
> reduced iommu granularity exactly because of the requirement to set up
> p2p resources between mdev devices within the same VM. This has
> implications when guest iommus are in play (viommu).
>
> So by default we want an iommu group per mdev. This works for all mdev
> devices as far as we know, including NVIDIA with the constraint that we
> only have a single NVIDIA device per VM.
>
> What if we want multiple NVIDIA devices? We either need to create the
> additional devices with a property which will place them into the same
> iommu group or allow the iommu groups to be manipulated dynamically.
>
> The trouble I see with the former (creating a device into a group) is
> that it becomes part of the "create" syntax, which is global for all
> mdev devices. It's the same functional but non-elegant solution I
> proposed previously.
>
> What if we allow groups to be manipulated dynamically? In this case I
> envision an attribute under the mdev device with read/write access.
> The existence of the attribute indicates to libvirt that this device
> requires such handling and allows reading and setting the association.
> To be clear, the attribute would only exist on mdev devices requiring
> this handling. I'm always a fan of naming things after what they do, so
> rather than making this attribute reference an iommu group, I might
> actually call it "peer_to_peer_resource_uuid". So the process might
> look something like this:
>
> # create 2 mdev devices
> echo $UUID0 > /sys/devices/mdev/<s:b:d.f>/types/1/create
> echo $UUID1 > /sys/devices/mdev/<s:b:d.f>/types/1/create
>
> # move $UUID1 to the same group as $UUID0
> P2P_UUID=$(cat \
>     /sys/devices/mdev/<s:b:d.f>/types/1/devices/$UUID0/peer_to_peer_resource_uuid)
> echo $P2P_UUID > \
>     /sys/devices/mdev/<s:b:d.f>/types/1/devices/$UUID1/peer_to_peer_resource_uuid
>
> Alternatively, we could have used uuidgen to create a new UUID and then
> moved both devices to that new UUID.
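That alternative would look roughly like this, using the same proposed
peer_to_peer_resource_uuid attribute (still hypothetical):

  # generate a fresh p2p uuid and move both devices to it
  NEW_P2P=$(uuidgen)
  echo $NEW_P2P > \
      /sys/devices/mdev/<s:b:d.f>/types/1/devices/$UUID0/peer_to_peer_resource_uuid
  echo $NEW_P2P > \
      /sys/devices/mdev/<s:b:d.f>/types/1/devices/$UUID1/peer_to_peer_resource_uuid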
>
> Within the mdev vendor driver this would walk through the mdev devices,
> find the matching peer_to_peer_resource_uuid (generated randomly at
> create time by default) and add the device to the iommu group for
> devices sharing that p2p uuid. When removed from the VM, libvirt could
> simply echo the output of uuidgen to each to split them again.
I think that could work. The binding of a p2p uuid to devices is then
decoupled from mdev_create, which is more flexible to manage.
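The teardown side Alex describes, splitting the devices again when the VM
goes away, would then just be a matter of handing each device its own
fresh uuid (same hypothetical attribute and paths as above):

  # give each device a unique p2p uuid again, dissolving the shared group
  for dev in $UUID0 $UUID1; do
      uuidgen > \
          /sys/devices/mdev/<s:b:d.f>/types/1/devices/$dev/peer_to_peer_resource_uuid
  done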
>
> So from a libvirt perspective, special handling would need to be invoked:
> when this p2p attribute is found, all devices for a given VM would need to
> share the same p2p uuid. libvirt would be free to use an existing p2p uuid
> or generate a new one.
if those devices come from two parent devices, do we expect
libvirt to use two p2p uuids here?
I expect so. AIUI, NVIDIA wants to start all the devices together,
which implies that a p2p uuid group would span parent devices. If
there are not actually any p2p resources shared between parent devices
it would be more optimal to create a p2p uuid group for each parent,
thus limiting the size of the iommu group, but that might interfere with
the internals of the NVIDIA userspace manager. It's a bit more 'abuse'
than 'use' of iommu groups if there aren't actually any p2p
resources. Whether or not there's some optimization in having mdev
devices on the same parent is going to be something that libvirt, or at
least an advanced user if we can't do it programmatically, is going to
want to know. Thanks,
Alex
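For illustration, if libvirt did choose a p2p uuid per parent for a
four-vGPU VM spread across two parents (only worthwhile if no p2p resources
actually span the parents, per the caveat above; all names and the
<parentA>/<parentB> placeholders here are illustrative):

  # one p2p uuid per parent device; $UUID0/$UUID1 sit on parent A,
  # $UUID2/$UUID3 on parent B, <parentX> stands for that parent's s:b:d.f
  P2P_A=$(uuidgen); P2P_B=$(uuidgen)
  for dev in $UUID0 $UUID1; do
      echo $P2P_A > /sys/devices/mdev/<parentA>/types/1/devices/$dev/peer_to_peer_resource_uuid
  done
  for dev in $UUID2 $UUID3; do
      echo $P2P_B > /sys/devices/mdev/<parentB>/types/1/devices/$dev/peer_to_peer_resource_uuid
  done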
> The vendor driver should enforce a write failure if the device cannot be
> added to the p2p uuid (for example, if devices within the p2p uuid are
> already opened).
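From the management side, that failure mode would presumably surface as a
failed write to the attribute, something libvirt would need to check for
(again a sketch against the proposed, not-yet-existing interface):

  # joining an in-use p2p group is expected to fail; check the write
  if ! echo $P2P_UUID > \
      /sys/devices/mdev/<s:b:d.f>/types/1/devices/$UUID1/peer_to_peer_resource_uuid
  then
      echo "cannot add $UUID1: p2p group $P2P_UUID already in use" >&2
  fi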
>
> Maybe this is similar to your proposal and even goes back to vm_uuid,
> but under the covers the vendor driver needs to be manipulating iommu
> grouping based on this parameter and there's no concept of an "mdev
> group" in the base API (nor vm_uuid), this is an extension keyed by the
> additional sysfs attribute.
>
> Are we getting closer? Thanks,
>
Looks so. :-)
Thanks,
Kevin