Re: [libvirt] [PATCH v7 0/4] Add Mediated device support

31 Aug 2016

      On Wed, 31 Aug 2016 15:04:13 +0800
Jike Song <jike.song@intel.com> wrote:
...
On 08/31/2016 02:12 PM, Tian, Kevin wrote:
...
...
From: Alex Williamson [mailto:alex.williamson@redhat.com]
Sent: Wednesday, August 31, 2016 12:17 AM
Hi folks,
At KVM Forum we had a BoF session primarily around the mediated device
sysfs interface.  I'd like to share what I think we agreed on and the
"problem areas" that still need some work so we can get the thoughts
and ideas from those who weren't able to attend.
DanPB expressed some concern about the mdev_supported_types sysfs
interface, which exposes a flat csv file with fields like "type",
"number of instance", "vendor string", and then a bunch of type
specific fields like "framebuffer size", "resolution", "frame rate
limit", etc.  This is not entirely machine parsing friendly and sort of
abuses the sysfs concept of one value per file.  Example output taken
from Neo's libvirt RFC:
cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types
# vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer,
max_resolution
11      ,"GRID M60-0B",      16,       2,      45,     512M,    2560x1600
12      ,"GRID M60-0Q",      16,       2,      60,     512M,    2560x1600
13      ,"GRID M60-1B",       8,       2,      45,    1024M,    2560x1600
14      ,"GRID M60-1Q",       8,       2,      60,    1024M,    2560x1600
15      ,"GRID M60-2B",       4,       2,      45,    2048M,    2560x1600
16      ,"GRID M60-2Q",       4,       4,      60,    2048M,    2560x1600
17      ,"GRID M60-4Q",       2,       4,      60,    4096M,    3840x2160
18      ,"GRID M60-8Q",       1,       4,      60,    8192M,    3840x2160
The create/destroy then looks like this:
echo "$mdev_UUID:vendor_specific_argument_list" >
  /sys/bus/pci/devices/.../mdev_create
echo "$mdev_UUID:vendor_specific_argument_list" >
  /sys/bus/pci/devices/.../mdev_destroy
"vendor_specific_argument_list" is nebulous.
So the idea to fix this is to explode this into a directory structure,
something like:
├── mdev_destroy
└── mdev_supported_types
    ├── 11
    │   ├── create
    │   ├── description
    │   └── max_instances
    ├── 12
    │   ├── create
    │   ├── description
    │   └── max_instances
    └── 13
        ├── create
        ├── description
        └── max_instances
Note that I'm only exposing the minimal attributes here for simplicity,
the other attributes would be included in separate files and we would
require vendors to create standard attributes for common device classes.
I like this idea. All standard attributes are reflected into this hierarchy.
In the meantime, can we still allow optional vendor string in create 
interface? libvirt doesn't need to know the meaning, but allows upper
layer to do some vendor specific tweak if necessary.
Not sure whether this can done within MDEV framework (attrs provided by
vendor driver of course), or must be within the vendor driver.
The purpose of the sub-directories is that libvirt doesn't need to pass
arbitrary, vendor strings to the create function, the attributes of the
mdev device created are defined by the attributes in the sysfs
directory where the create is done.  The user only provides a uuid for
the device.  Arbitrary vendor parameters are a barrier, libvirt may not
need to know the meaning, but would need to know when to apply them,
which is just as bad.  Ultimately we want libvirt to be able to
interact with sysfs without having an vendor specific knowledge.
...
...
...
For vGPUs like NVIDIA where we don't support multiple types
concurrently, this directory structure would update as mdev devices are
created, removing no longer available types.  I carried forward
or keep the type with max_instances cleared to ZERO.
+1 :)
Possible yes, but why would the vendor driver report types that the
user cannot create?  It just seems like superfluous information (well,
except for the use I discover below).
...
...
...
max_instances here, but perhaps we really want to copy SR-IOV and
report a max and current allocation.  Creation and deletion is
right, cur/max_instances look reasonable.
...
simplified as we can simply "echo $UUID > create" per type.  I don't
understand why destroy had a parameter list, so here I imagine we can
simply do the same... in fact, I'd actually rather see a "remove" sysfs
entry under each mdev device, so we remove it at the device rather than
in some central location (any objections?).
OK to me.
IIUC, "destroy" has a parameter list is only because the previous
$VM_UUID + instnace implementation. It should be safe to move the "destroy"
file under mdev now.
...
...
We discussed how this might look with Intel devices which do allow
mixed vGPU types concurrently.  We believe, but need confirmation, that
the vendor driver could still make a finite set of supported types,
perhaps with additional module options to the vendor driver to enable
more "exotic" types.  So for instance if IGD vGPUs are based on
power-of-2 portions of the framebuffer size, then the vendor driver
could list types with 32MB, 64MB, 128MB, etc in useful and popular
sizes.  As vGPUs are allocated, the larger sizes may become unavailable.
Yes, Intel can do such type of definition. One thing I'm not sure is 
about impact cross listed types, i.e. when creating a new instance
under a given type, max_instances under other types would be 
dynamically decremented based on available resource. Would it be
a problem for libvirt or upper level stack, since a natural interpretation
of max_instances should be a static number?
An alternative is to make max_instances configurable, so libvirt has
chance to define a pool of available instances with different types
before creating any instance. For example, initially IGD driver may 
report max_instances only for a minimal sharing granularity:
  128MB:
      max_instances (8)
  256MB:
      max_instances (0)
  512MB:
      max_instances (0)
Then libvirt can configure more types as:
  128MB:
      max_instances (2)
  256MB:
      max_instances (1)
  512MB:
      max_instances (1)
Starting from this point, max_instances would be static and then
mdev instance can be created under each type. But I'm not
sure whether such additional configuration role is reasonable to libvirt...
My expectation of your example, where I'm assuming you have 1G of total
memory that can be divided between the mdev devices would be:

 128M: 8
 256M: 4
 512M: 2

If a 512M mdev device is created, this becomes:

 128M: 4
 256M: 2
 512M: 1

Creating a 128M mdev device from that becomes:

 128M: 3
 256M: 1
 512M: 0

It's not great, but I don't know how to do it better without the user
having a clear understanding of the algorithm and resources required
for each mdev device.  For instance, the size here, presumably the
framebuffer size, is just one attribute in the device directory, the
user won't know that this attribute is the key to the available
instances.

I don't particularly like the idea of a writeable max_instances, the
user can simply create instances of the type and see the results.

Just thought of another thing; do we need some way to determine the
type of an mdev device from sysfs or is this implicit knowledge for the
user that created the device?  For instance, we create a 512M device
and it becomes a child device to the parent, so we can associate to the
parent, but if we come back later, how do we know it's a 512M device?
Perhaps this is a reason to keep the type directories around and we can
cross link the device to the type and create a devices subdirectory
under each type.  Perhaps then "max_instances" becomes
"available_instances" (ie. how many left we can create) and we don't
need a "current_instances" because we can simply look in the devices
directory.
...
...
...
We still don't have any way for the admin to learn in advance how the
available supported types will change once mdev devices start to be
created.  I'm not sure how we can create a specification for this, so
probing by creating devices may be the most flexible model.
The other issue is the start/stop requirement, which was revealed to
setup peer-to-peer resources between vGPUs which is a limited hardware
resource.  We'd really like to have these happen automatically on the
first open of a vfio mdev device file and final release.  So we
brainstormed how the open/release callbacks could know the other mdev
devices for a given user.  This is where the instance number came into
play previously.  This is an area that needs work.
IGD doesn't have such peer-to-peer resource setup requirement. So
it's sufficient to create/destroy a mdev instance in a single action on
IGD. However I'd expect we still keep the "start/stop" interface (
maybe not exposed as sysfs node, instead being a VFIO API), as 
required to support future live migration usage. We've made prototype
working for KVMGT today.
Great!
...
It's good for the framework to define start/stop interfaces, but as Alex
said below, it should be MDEV oriented, not VM oriented.
I don't know a lot about the peer-to-peer resource, but to me, although
VM_UUID + instance is not applicable, userspace can always achieve the
same purpose by, let us assume a mdev hierarchy, providing the VM UUID
under every mdev:
/sys/bus/pci/devices/<sbdf>/mdev/
  |-- mdev01/
  |   `-- vm_uuid
  `-- mdev02/
      `-- vm_uuid
Did I miss something?
Sure, this is just another way of doing UUID+instance.  Nit, it might
look more like:

 	/sys/bus/pci/devices/<sbdf>/mdev/
 	|-- uuid1/
 	|   `-- group_uuid
 	`-- uuid2/
 	    `-- group_uuid

Where each mdev device is actually referenced by its UUID name then
we'd have some writable attribute under the device where mdev devices
sharing the same group UUID are handled together.  There's a problem
here though that vfio doesn't know about this level of grouping, so
uuid1 and uuid2 could actually be given to different users despite the
grouping here, which results in one or both devices not working or
creating security issues.  That sort of implies that this would
necessarily need to be exposed as iommu grouping.  This factors into why
it seems like a good idea to make the start/stop implicit within the
interface.  In that way each mdev device is fungible as far as a user
like libvirt is concerned, internal details like peer-to-peer resources
are handled automatically as the devices are accessed.
...
...
...
There was a thought that perhaps on open() the vendor driver could look
at the user pid and use that to associate with other devices, but the
problem here is that we open and begin access to each device, so
devices do this discovery serially rather than in parallel as desired.
(we might not fault in mmio space yet though, so I wonder if open()
could set the association of mdev to pid, then the first mmio fault
would trigger the resource allocation?  Then all the "magic" would live
in the vendor driver.  open() could fail if the pid already has running
mdev devices and the vendor driver chooses not to support hotplug)
One comment was that for a GPU that only supports homogeneous vGPUs,
libvirt may choose to create all the vGPUs in advance and handle them
as we do SR-IOV VFs.  The UUID+instance model would preclude such a use
case.
We also considered whether iommu groups could be (ab)used for this use
case, peer-to-peer would in fact be an iommu grouping constraint
afterall.  This would have the same UUID+instance constraint as above
though and would require some sort of sysfs interface for the user to
be able to create multiple mdevs within a group.
Everyone was given homework to think about this on their flights home,
so I expect plenty of ideas by now ;)
Overall I think mediated devices were well received by the community,
so let's keep up the development and discussion to bring it to
fruition.  Thanks,
Thanks a lot Alex for your help on driving this discussion. Mediated device
technique has the potential to be used for other type of I/O virtualizations
in the future, not limited to GPU virtualization. So getting the core framework
ready earlier would be highly welcomed. :-)
I agree, there's lots of potential and it's extra incentive to create
an interface that's going to make sense long term.  Ideally we only
need to create the kernel and libvirt infrastructure once and we
can handle any type of mediated driver.  Thanks,

Alex