On Wed, 7 Sep 2016 21:45:31 +0530
Kirti Wankhede <kwankhede(a)nvidia.com> wrote:
> On 9/7/2016 2:58 AM, Alex Williamson wrote:
> > On Wed, 7 Sep 2016 01:05:11 +0530
> > Kirti Wankhede <kwankhede(a)nvidia.com> wrote:
> >
> >> On 9/6/2016 11:10 PM, Alex Williamson wrote:
> >>> On Sat, 3 Sep 2016 22:04:56 +0530
> >>> Kirti Wankhede <kwankhede(a)nvidia.com> wrote:
> >>>
> >>>> On 9/3/2016 3:18 AM, Paolo Bonzini wrote:
> >>>>>
> >>>>> On 02/09/2016 20:33, Kirti Wankhede wrote:
> >>>>>> <Alex> We could even do:
> >>>>>>>>
> >>>>>>>>    echo $UUID1:$GROUPA > create
> >>>>>>>>
> >>>>>>>> where $GROUPA is the group ID of a previously created mdev device into
> >>>>>>>> which $UUID1 is to be created and added to the same group.
> >>>>>> </Alex>
> >>>>>
> >>>>> From the point of view of libvirt, I think I prefer Alex's idea.
> >>>>> <group> could be an additional element in the nodedev-create XML:
> >>>>>
> >>>>>     <device>
> >>>>>       <name>my-vgpu</name>
> >>>>>       <parent>pci_0000_86_00_0</parent>
> >>>>>       <capability type='mdev'>
> >>>>>         <type id='11'/>
> >>>>>         <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
> >>>>>         <group>group1</group>
> >>>>>       </capability>
> >>>>>     </device>
> >>>>>
> >>>>> (should group also be a UUID?)
> >>>>>
> >>>>
> >>>> No, this should be a unique number in a system, similar to iommu_group.
> >>>
> >>> Sorry, just trying to catch up on this thread after a long weekend.
> >>>
> >>> We're talking about iommu groups here, we're not creating any sort of
> >>> parallel grouping specific to mdev devices.
> >>
> >> I thought we were talking about a group of mdev devices and not an
> >> iommu group. IIRC, there were concerns about it (this would be similar
> >> to UUID+instance) and that it would (ab)use iommu groups.
> >
> > What constraints does a group, which is not an iommu group, place on the
> > usage of the mdev devices?  What happens if we put two mdev devices in
> > the same "mdev group" and then assign them to separate VMs/users?  I
> > believe that the answer is that this theoretical "mdev group" doesn't
> > actually impose any constraints on the devices within the group or how
> > they're used.
> >
> We feel it's not a good idea to try to associate a device's iommu group
> with mdev device groups. That adds more complications.
>
> As in the nodedev-create XML above, 'group1' could be a unique number
> that can be generated by libvirt. Then to create the mdev device:
>
>     echo $UUID1:group1 > create
>
> If the user wants to add more mdev devices to the same group, he/she
> should use the same group number in the next nodedev-create devices. So
> the create commands would be:
>
>     echo $UUID2:group1 > create
>     echo $UUID3:group1 > create

So groups return to being static, libvirt would need to destroy and
create mdev devices specifically for use within the predefined group?
This imposes limitations on how mdev devices can be used (ie. the mdev
pool option is once again removed). We're also back to imposing
grouping semantics on mdev devices that may not need them. Do all mdev
devices for a given user need to be put into the same group? Do groups
span parent devices? Do they span different vendor drivers?
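
For concreteness, the proposed `$UUID:group` syntax would require the create
attribute handler to split the written string into its two parts before
creating the device. A minimal userspace sketch of that parsing follows; the
function name and return convention are hypothetical, not from the actual
patches:

```c
#include <assert.h>
#include <string.h>

/* Split a "$UUID:group" create string into UUID and group tokens.
 * Returns 0 on success, -1 if either part is missing or too long.
 * Purely illustrative of the proposed sysfs syntax. */
static int parse_create_string(const char *buf,
                               char *uuid, size_t ulen,
                               char *group, size_t glen)
{
    const char *sep = strchr(buf, ':');
    size_t n;

    if (!sep || sep == buf || *(sep + 1) == '\0')
        return -1;                 /* missing ':', UUID, or group part */

    n = (size_t)(sep - buf);
    if (n >= ulen || strlen(sep + 1) >= glen)
        return -1;                 /* a part would overflow its buffer */

    memcpy(uuid, buf, n);
    uuid[n] = '\0';
    strcpy(group, sep + 1);
    return 0;
}
```

A kernel implementation would do the equivalent split on the buffer passed to
the sysfs store callback.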
> Each mdev device would store this group number in its mdev_device
> structure.
>
> With this, we would add open() and close() callbacks from the vfio_mdev
> module for the vendor driver to commit resources. Then we don't need a
> 'start'/'stop' or online/offline interface.
>
> To commit resources for all devices associated with that domain/user
> space application, the vendor driver can use the 'first open()' to
> commit them and the 'last close()' to free them. Or if the vendor
> driver wants to commit resources for each device separately, it can do
> so in each device's open() call. How to implement it would be up to the
> vendor driver.
>
> Libvirt doesn't have to do anything about assigned group numbers while
> managing mdev devices.
>
> The QEMU command-line parameters would be the same as earlier (no need
> to mention the group number here):
>
>     -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/$UUID1 \
>     -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/$UUID2
>
> In case two mdev devices from the same group are assigned to different
> domains, we can fail the open() call of the second device. How would
> the driver know that they are being used by different domains? By
> checking the <group1, pid> of the first device of 'group1'. The two
> devices in the same group should have the same pid in their open()
> calls.

Are you assuming that the two devices are owned by the same vendor
driver? What if I put NVIDIA and Intel vGPUs both into the same group
and give each of them to a separate VM? How would the NVIDIA host
driver know which <group, pid> the Intel device got? This is what the
iommu groups do that a different layer of grouping cannot do. Maybe
you're suggesting a group per vendor driver, but how does libvirt know
the vendor driver? Do they need to go research the parent device in
sysfs and compare driver links?
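
For reference, the per-group bookkeeping described in the quoted proposal,
first open() commits, last close() frees, reject an open() from a second
domain, might look roughly like the sketch below. All names are hypothetical,
and note the objection above still applies: a single vendor driver can only
perform this check for its own devices, not across vendors:

```c
#include <assert.h>
#include <sys/types.h>

/* Hypothetical per-group state for the "first open()/last close()"
 * scheme; in the proposal this would hang off the group number stored
 * in each mdev_device. */
struct mdev_group {
    int   open_count;   /* devices of this group currently open */
    pid_t owner;        /* pid that performed the first open() */
    int   committed;    /* whether group resources are committed */
};

static int mdev_group_open(struct mdev_group *g, pid_t pid)
{
    /* Second domain/user opening a device of the same group: fail. */
    if (g->open_count > 0 && g->owner != pid)
        return -1;
    if (g->open_count++ == 0) {
        g->owner = pid;
        g->committed = 1;   /* "first open()": commit resources here */
    }
    return 0;
}

static void mdev_group_close(struct mdev_group *g)
{
    if (--g->open_count == 0)
        g->committed = 0;   /* "last close()": free resources here */
}
```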
> To hot-plug an mdev device into a domain in which there is already an
> mdev device assigned, the new mdev device should be created with the
> same group number as the existing devices and then hot-plugged. If
> there is no mdev device in that domain, then the group number should be
> a new unique number.
>
> This simplifies the mdev grouping and also provides flexibility for the
> vendor driver implementation.

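
The group-selection rule in the quoted hot-plug paragraph reduces to: reuse
the domain's existing group number if it already has mdev devices, otherwise
allocate a fresh one. A tiny sketch of that rule (function and counter names
are hypothetical; this is management-tool logic, e.g. libvirt's, not kernel
code):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical global allocator for fresh group numbers, which the
 * thread only requires to be "a unique number in a system". */
static unsigned long next_group = 1;

/* Given the group numbers of a domain's existing mdev devices, pick
 * the group for the next device created for that domain. */
static unsigned long pick_group_for_domain(const unsigned long *domain_groups,
                                           size_t n)
{
    if (n > 0)
        return domain_groups[0];   /* hot-plug: reuse the domain's group */
    return next_group++;           /* first mdev device: new unique group */
}
```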
The 'start' operation for NVIDIA mdev devices allocates peer-to-peer
resources between mdev devices.  Does this not represent some degree of
an isolation hole between those devices? Will peer-to-peer DMA between
devices honor the guest IOVA when mdev devices are placed into separate
address spaces, such as possible with vIOMMU?
I don't particularly like the iommu group solution either, which is why
in my latest proposal I've given the vendor driver a way to indicate
this grouping is required so more flexible mdev devices aren't
restricted by this. But the limited knowledge I have of the hardware
configuration which imposes this restriction on NVIDIA devices seems to
suggest that iommu grouping of these sets is appropriate. The vfio-core
infrastructure is almost entirely built for managing vfio groups, which
are just a direct mapping of iommu groups.  So the complexity of iommu
groups is already handled. Adding a new layer of grouping into mdev
seems like it's increasing the complexity further, not decreasing it.
Thanks,
Alex