Re: [libvirt] [PATCH v7 0/4] Add Mediated device support

Wednesday, 31 August 2016

...
 From: Alex Williamson [mailto:alex.williamson@redhat.com]
 Sent: Wednesday, August 31, 2016 12:17 AM

 Hi folks,

 At KVM Forum we had a BoF session primarily around the mediated device
 sysfs interface.  I'd like to share what I think we agreed on and the
 "problem areas" that still need some work so we can get the thoughts
 and ideas from those who weren't able to attend.

 DanPB expressed some concern about the mdev_supported_types sysfs
 interface, which exposes a flat csv file with fields like "type",
 "number of instance", "vendor string", and then a bunch of type
 specific fields like "framebuffer size", "resolution", "frame rate
 limit", etc.  This is not entirely machine parsing friendly and sort of
 abuses the sysfs concept of one value per file.  Example output taken
 from Neo's libvirt RFC:

 cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types
 # vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer, max_resolution
 11      ,"GRID M60-0B",      16,       2,      45,     512M,    2560x1600
 12      ,"GRID M60-0Q",      16,       2,      60,     512M,    2560x1600
 13      ,"GRID M60-1B",       8,       2,      45,    1024M,    2560x1600
 14      ,"GRID M60-1Q",       8,       2,      60,    1024M,    2560x1600
 15      ,"GRID M60-2B",       4,       2,      45,    2048M,    2560x1600
 16      ,"GRID M60-2Q",       4,       4,      60,    2048M,    2560x1600
 17      ,"GRID M60-4Q",       2,       4,      60,    4096M,    3840x2160
 18      ,"GRID M60-8Q",       1,       4,      60,    8192M,    3840x2160
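
To make the parsing concern concrete, here is a minimal sketch of what a
consumer of the flat format has to do just to recover the type id and
max_instance columns. The two sample rows are copied from the listing
above; the parse_types helper is hypothetical, and the column positions
are an assumption taken from the header comment.

```shell
# Two sample rows copied from the listing above. A real consumer would
# read the mdev_supported_types file instead of a shell variable.
supported_types='11      ,"GRID M60-0B",      16,       2,      45,     512M,    2560x1600
18      ,"GRID M60-8Q",       1,       4,      60,    8192M,    3840x2160'

# parse_types: read the CSV on stdin and print "type_id max_instance" per row.
# Column positions (1 and 3) are assumed from the header comment.
parse_types() {
    awk -F',' '
        /^[[:space:]]*#/ || NF < 7 { next }   # skip the header comment and any line without seven fields
        {
            gsub(/[[:space:]]/, "", $1)       # strip the padding around the id
            gsub(/[[:space:]]/, "", $3)       # strip the padding around max_instance
            print $1, $3
        }'
}

printf '%s\n' "$supported_types" | parse_types
```

Every consumer has to agree on the column order and the padding rules,
which is the fragility the one-value-per-file convention avoids.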

 The create/destroy then looks like this:

 echo "$mdev_UUID:vendor_specific_argument_list" >
 	/sys/bus/pci/devices/.../mdev_create

 echo "$mdev_UUID:vendor_specific_argument_list" >
 	/sys/bus/pci/devices/.../mdev_destroy

 "vendor_specific_argument_list" is nebulous.

 So the idea to fix this is to explode this into a directory structure,
 something like:

 ├── mdev_destroy
 └── mdev_supported_types
     ├── 11
     │   ├── create
     │   ├── description
     │   └── max_instances
     ├── 12
     │   ├── create
     │   ├── description
     │   └── max_instances
     └── 13
         ├── create
         ├── description
         └── max_instances
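
A minimal sketch of how a management tool might enumerate types under
this layout, using a mock tree in a temporary directory rather than real
sysfs. The add_type and list_types helpers are hypothetical, and the
contents of "description" are assumed to be the type names from the CSV
example.

```shell
# Build a mock of the proposed layout in a temp dir; the real tree would
# be provided by the vendor driver under the parent device in sysfs.
root=$(mktemp -d)

# add_type ID DESCRIPTION MAX_INSTANCES (mock-only helper)
add_type() {
    dir="$root/mdev_supported_types/$1"
    mkdir -p "$dir"
    : > "$dir/create"
    echo "$2" > "$dir/description"
    echo "$3" > "$dir/max_instances"
}
add_type 11 "GRID M60-0B" 16
add_type 12 "GRID M60-0Q" 16
add_type 13 "GRID M60-1B" 8

# list_types PARENT_DIR: one line per type, each value read from its own file.
list_types() {
    for dir in "$1"/mdev_supported_types/*/; do
        printf '%s: %s (max %s)\n' "$(basename "$dir")" \
            "$(cat "${dir}description")" "$(cat "${dir}max_instances")"
    done
}

list_types "$root"
```

No field splitting or quoting rules are needed here: each attribute is
a plain read of one file.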

 Note that I'm only exposing the minimal attributes here for simplicity,
 the other attributes would be included in separate files and we would
 require vendors to create standard attributes for common device classes. 
I like this idea. All standard attributes are reflected in this hierarchy.
In the meantime, can we still allow an optional vendor string in the create
interface? libvirt doesn't need to know its meaning, but it would allow the
upper layer to do some vendor-specific tweaks if necessary.

...

 For vGPUs like NVIDIA where we don't support multiple types
 concurrently, this directory structure would update as mdev devices are
 created, removing no longer available types.  I carried forward 
or keep the type with max_instances cleared to ZERO.

...
 max_instances here, but perhaps we really want to copy SR-IOV and
 report a max and current allocation.  Creation and deletion is 
right, cur/max_instances look reasonable.

...
 simplified as we can simply "echo $UUID > create" per type.  I don't
 understand why destroy had a parameter list, so here I imagine we can
 simply do the same... in fact, I'd actually rather see a "remove" sysfs
 entry under each mdev device, so we remove it at the device rather than
 in some central location (any objections?). 
OK with me.
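
As a rough sketch of the per-type create and per-device remove flow
discussed here, run against a mock directory rather than real sysfs. The
mdev_create and mdev_remove helpers, the UUID, the location of the
device directory, and the value written to "remove" are all assumptions
for illustration; the thread does not specify them.

```shell
# Mock parent device directory; on a real host this would be the parent
# device's sysfs directory.
parent=$(mktemp -d)
mkdir -p "$parent/mdev_supported_types/11"
: > "$parent/mdev_supported_types/11/create"

# mdev_create PARENT TYPE_ID UUID: request a new mdev of the given type.
mdev_create() {
    echo "$3" > "$1/mdev_supported_types/$2/create"
}

# mdev_remove DEVICE_DIR: remove an mdev through its own directory.
# Writing "1" is an assumption; the proposal only names the "remove" entry.
mdev_remove() {
    echo 1 > "$1/remove"
}

uuid="12345678-1234-1234-1234-123456789abc"   # hypothetical UUID
mdev_create "$parent" 11 "$uuid"

# Nothing in the mock reacts to the write, so create the device
# directory by hand in order to exercise the remove path.
device="$parent/$uuid"
mkdir -p "$device"
mdev_remove "$device"
```

With this shape, neither operation needs a parameter list: the type is
implied by the directory holding "create", and the device is implied by
the directory holding "remove".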

...

 We discussed how this might look with Intel devices which do allow
 mixed vGPU types concurrently.  We believe, but need confirmation, that
 the vendor driver could still make a finite set of supported types,
 perhaps with additional module options to the vendor driver to enable
 more "exotic" types.  So for instance if IGD vGPUs are based on
 power-of-2 portions of the framebuffer size, then the vendor driver
 could list types with 32MB, 64MB, 128MB, etc in useful and popular
 sizes.  As vGPUs are allocated, the larger sizes may become unavailable. 
Yes, Intel can define types that way. One thing I'm not sure about is the
impact across the listed types, i.e. when a new instance is created
under a given type, max_instances under the other types would be
dynamically decremented based on the available resources. Would that be
a problem for libvirt or the upper-level stack, since the natural
interpretation of max_instances is a static number?
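
To illustrate the cross-type effect, here is a small sketch that assumes
the only shared resource is a 1024MB framebuffer and that each type's
max_instances is simply the remaining space divided by the type's size.
Both the total and the division model are assumptions for illustration,
not how any driver is known to compute it.

```shell
# available_instances REMAINING_MB SIZE_MB: how many more instances of
# that size fit (integer division; an assumed accounting model).
available_instances() {
    echo $(( $1 / $2 ))
}

# report REMAINING_MB: print the value each type would expose.
report() {
    for size in 128 256 512; do
        echo "${size}MB: max_instances ($(available_instances "$1" "$size"))"
    done
}

remaining_mb=1024            # assumed total shareable framebuffer
report "$remaining_mb"

# Creating one 512MB instance lowers the count under every type,
# not only under the 512MB one.
remaining_mb=$((remaining_mb - 512))
report "$remaining_mb"
```

Under these assumptions the counts go from 8/4/2 to 4/2/1 after a single
create, so a consumer that cached max_instances for the other types
would be working with stale numbers.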

An alternative is to make max_instances configurable, so libvirt has a
chance to define a pool of available instances of different types
before creating any instance. For example, the IGD driver may initially
report max_instances only for the minimal sharing granularity:
	128MB:
		max_instances (8)
	256MB:
		max_instances (0)
	512MB:
		max_instances (0)

Then libvirt can configure more types as:
	128MB:
		max_instances (2)
	256MB:
		max_instances (1)
	512MB:
		max_instances (1)

From this point on, max_instances would be static, and mdev instances
can then be created under each type. But I'm not sure whether such an
additional configuration role is reasonable for libvirt...
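
A quick sketch of the sanity check such a configuration step would
imply, assuming the capacity is whatever the initial report adds up to
(8 x 128MB) and that a pool is valid when its total does not exceed that.
The pool_usage_mb helper and the "SIZE:COUNT" notation are hypothetical.

```shell
# pool_usage_mb "SIZE_MB:COUNT ...": total framebuffer a pool would consume.
pool_usage_mb() {
    total=0
    for entry in $1; do
        size=${entry%%:*}     # text before the colon
        count=${entry##*:}    # text after the colon
        total=$((total + size * count))
    done
    echo "$total"
}

capacity_mb=$(pool_usage_mb "128:8")                # initial report
requested_mb=$(pool_usage_mb "128:2 256:1 512:1")   # configured pool

if [ "$requested_mb" -le "$capacity_mb" ]; then
    echo "pool fits: ${requested_mb}MB of ${capacity_mb}MB"
else
    echo "pool too large: ${requested_mb}MB exceeds ${capacity_mb}MB"
fi
```

The example configuration uses exactly the capacity implied by the
initial report, which is why max_instances can stay static afterwards.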

...

 We still don't have any way for the admin to learn in advance how the
 available supported types will change once mdev devices start to be
 created.  I'm not sure how we can create a specification for this, so
 probing by creating devices may be the most flexible model.

 The other issue is the start/stop requirement, which was revealed to
 set up peer-to-peer resources between vGPUs, which are a limited hardware
 resource.  We'd really like to have these happen automatically on the
 first open of a vfio mdev device file and final release.  So we
 brainstormed how the open/release callbacks could know the other mdev
 devices for a given user.  This is where the instance number came into
 play previously.  This is an area that needs work. 
IGD doesn't have such a peer-to-peer resource setup requirement, so
it's sufficient to create/destroy an mdev instance in a single action on
IGD. However, I'd expect us to still keep the "start/stop" interface
(maybe not exposed as a sysfs node, but as a VFIO API instead), as it is
required to support future live migration usage. We have a prototype
working for KVMGT today.

...

 There was a thought that perhaps on open() the vendor driver could look
 at the user pid and use that to associate with other devices, but the
 problem here is that we open and begin access to each device, so
 devices do this discovery serially rather than in parallel as desired.
 (we might not fault in mmio space yet though, so I wonder if open()
 could set the association of mdev to pid, then the first mmio fault
 would trigger the resource allocation?  Then all the "magic" would live
 in the vendor driver.  open() could fail if the pid already has running
 mdev devices and the vendor driver chooses not to support hotplug)

 One comment was that for a GPU that only supports homogeneous vGPUs,
 libvirt may choose to create all the vGPUs in advance and handle them
 as we do SR-IOV VFs.  The UUID+instance model would preclude such a use
 case.

 We also considered whether iommu groups could be (ab)used for this use
 case; peer-to-peer would in fact be an iommu grouping constraint
 after all.  This would have the same UUID+instance constraint as above
 though and would require some sort of sysfs interface for the user to
 be able to create multiple mdevs within a group.

 Everyone was given homework to think about this on their flights home,
 so I expect plenty of ideas by now ;)

 Overall I think mediated devices were well received by the community,
 so let's keep up the development and discussion to bring it to
 fruition.  Thanks, 
Thanks a lot, Alex, for your help in driving this discussion. The mediated device
technique has the potential to be used for other types of I/O virtualization
in the future, not just GPU virtualization, so getting the core framework
ready early would be very welcome. :-)

Thanks
Kevin
