On 9/2/2016 1:31 AM, Alex Williamson wrote:
On Thu, 1 Sep 2016 23:52:02 +0530
Kirti Wankhede <kwankhede(a)nvidia.com> wrote:
> Alex,
> Thanks for summarizing the discussion.
>
> On 8/31/2016 9:18 PM, Alex Williamson wrote:
>> On Wed, 31 Aug 2016 15:04:13 +0800
>> Jike Song <jike.song(a)intel.com> wrote:
>>
>>> On 08/31/2016 02:12 PM, Tian, Kevin wrote:
>>>>> From: Alex Williamson [mailto:alex.williamson@redhat.com]
>>>>> Sent: Wednesday, August 31, 2016 12:17 AM
>>>>>
>>>>> Hi folks,
>>>>>
>>>>> At KVM Forum we had a BoF session primarily around the mediated device
>>>>> sysfs interface. I'd like to share what I think we agreed on and the
>>>>> "problem areas" that still need some work so we can get the thoughts
>>>>> and ideas from those who weren't able to attend.
>>>>>
>>>>> DanPB expressed some concern about the mdev_supported_types sysfs
>>>>> interface, which exposes a flat csv file with fields like "type",
>>>>> "number of instance", "vendor string", and then a bunch of type
>>>>> specific fields like "framebuffer size", "resolution", "frame rate
>>>>> limit", etc. This is not entirely machine parsing friendly and sort of
>>>>> abuses the sysfs concept of one value per file. Example output taken
>>>>> from Neo's libvirt RFC:
>>>>>
>>>>> cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types
>>>>> # vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer,
>>>>> max_resolution
>>>>> 11, "GRID M60-0B", 16, 2, 45,  512M, 2560x1600
>>>>> 12, "GRID M60-0Q", 16, 2, 60,  512M, 2560x1600
>>>>> 13, "GRID M60-1B",  8, 2, 45, 1024M, 2560x1600
>>>>> 14, "GRID M60-1Q",  8, 2, 60, 1024M, 2560x1600
>>>>> 15, "GRID M60-2B",  4, 2, 45, 2048M, 2560x1600
>>>>> 16, "GRID M60-2Q",  4, 4, 60, 2048M, 2560x1600
>>>>> 17, "GRID M60-4Q",  2, 4, 60, 4096M, 3840x2160
>>>>> 18, "GRID M60-8Q",  1, 4, 60, 8192M, 3840x2160
>>>>>
>>>>> The create/destroy then looks like this:
>>>>>
>>>>> echo "$mdev_UUID:vendor_specific_argument_list" >
>>>>> /sys/bus/pci/devices/.../mdev_create
>>>>>
>>>>> echo "$mdev_UUID:vendor_specific_argument_list" >
>>>>> /sys/bus/pci/devices/.../mdev_destroy
>>>>>
>>>>> "vendor_specific_argument_list" is nebulous.
>>>>>
>>>>> So the idea to fix this is to explode this into a directory structure,
>>>>> something like:
>>>>>
>>>>> ├── mdev_destroy
>>>>> └── mdev_supported_types
>>>>> ├── 11
>>>>> │ ├── create
>>>>> │ ├── description
>>>>> │ └── max_instances
>>>>> ├── 12
>>>>> │ ├── create
>>>>> │ ├── description
>>>>> │ └── max_instances
>>>>> └── 13
>>>>> ├── create
>>>>> ├── description
>>>>> └── max_instances
>>>>>
>>>>> Note that I'm only exposing the minimal attributes here for simplicity,
>>>>> the other attributes would be included in separate files and we would
>>>>> require vendors to create standard attributes for common device classes.
>>>>
>>>> I like this idea. All standard attributes are reflected into this hierarchy.
>>>> In the meantime, can we still allow optional vendor string in create
>>>> interface? libvirt doesn't need to know the meaning, but allows upper
>>>> layer to do some vendor specific tweak if necessary.
>>>>
>>>
>>> Not sure whether this can be done within the MDEV framework (attrs provided by
>>> vendor driver of course), or must be within the vendor driver.
>>
>> The purpose of the sub-directories is that libvirt doesn't need to pass
>> arbitrary, vendor strings to the create function, the attributes of the
>> mdev device created are defined by the attributes in the sysfs
>> directory where the create is done. The user only provides a uuid for
>> the device. Arbitrary vendor parameters are a barrier, libvirt may not
>> need to know the meaning, but would need to know when to apply them,
>> which is just as bad. Ultimately we want libvirt to be able to
>> interact with sysfs without having any vendor specific knowledge.
>>
>
> The above directory hierarchy looks fine to me. Along with the fixed set of
> parameters, an optional field for extra parameters is also required. Such
> parameters are required for some specific testing or running benchmarks,
> for example to disable FRL (framerate limiter) or to disable console vnc
> when not required. Libvirt doesn't need to know the details; it's just a
> string that the user can provide, and libvirt needs to pass the string as it
> is to the vendor driver, which would act accordingly.
Wouldn't it make more sense to enable these through the vendor driver
which would then provide additional types through the sysfs interface
that could be selected by libvirt? Or simply transparently change
these parameters within the existing types? I think we really want to
get away from adding any sort of magic vendor strings.
In the directory structure, a 'params' attribute can take optional parameters.
Libvirt can then set 'params' and create the mdev device. For example, if a
param, say 'disable_console_vnc=1', is set for type 11, then devices
created of type 11 will have that param applied unless it is cleared.
└── mdev_supported_types
    ├── 11
    │   ├── create
    │   ├── description
    │   ├── max_instances
    │   └── params
    ├── 12
    │   ├── create
    │   ├── description
    │   ├── max_instances
    │   └── params
    └── 13
        ├── create
        ├── description
        ├── max_instances
        └── params
This has to come from libvirt since such params could be different for
each mdev device.
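For illustration (the sysfs path and the param string here are just examples),
the flow from libvirt's side could be:

    cd /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types/11
    echo "disable_console_vnc=1" > params
    echo $UUID > create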
>>>>>
>>>>> For vGPUs like NVIDIA where we don't support multiple types
>>>>> concurrently, this directory structure would update as mdev devices are
>>>>> created, removing no longer available types. I carried forward
>>>>
>>>> or keep the type with max_instances cleared to ZERO.
>>>>
>>>
>>> +1 :)
>>
>> Possible yes, but why would the vendor driver report types that the
>> user cannot create? It just seems like superfluous information (well,
>> except for the use I discover below).
>>
>
> The directory structure for a physical GPU will be defined when the device
> is registered with the mdev module. It would be simpler to change the
> creatable instance count, i.e. for the types which can't be created the
> creatable instance count would be set to 0.
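To illustrate what I mean (the numbers below are hypothetical), creating a
device of one type would zero out the creatable count of the types that can
no longer be used concurrently:

    cat mdev_supported_types/11/max_instances
    16
    echo $UUID > mdev_supported_types/11/create
    cat mdev_supported_types/17/max_instances
    0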
>
>
>>>>> max_instances here, but perhaps we really want to copy SR-IOV and
>>>>> report a max and current allocation. Creation and deletion is
>>>>
>>>> right, cur/max_instances look reasonable.
>>>>
>>>>> simplified as we can simply "echo $UUID > create" per type. I don't
>>>>> understand why destroy had a parameter list, so here I imagine we can
>>>>> simply do the same... in fact, I'd actually rather see a "remove" sysfs
>>>>> entry under each mdev device, so we remove it at the device rather than
>>>>> in some central location (any objections?).
>>>>
>>>> OK to me.
>>>
>>> IIUC, "destroy" has a parameter list only because of the previous
>>> $VM_UUID + instance implementation. It should be safe to move the "destroy"
>>> file under mdev now.
>>>
>
> Sorry if that was there in the libvirt discussion, but "destroy" doesn't need
> extra parameters. Yes, it could be moved to the mdev device directory.
>
>>>>> We discussed how this might look with Intel devices which do allow
>>>>> mixed vGPU types concurrently. We believe, but need confirmation, that
>>>>> the vendor driver could still make a finite set of supported types,
>>>>> perhaps with additional module options to the vendor driver to enable
>>>>> more "exotic" types. So for instance if IGD vGPUs are based on
>>>>> power-of-2 portions of the framebuffer size, then the vendor driver
>>>>> could list types with 32MB, 64MB, 128MB, etc in useful and popular
>>>>> sizes. As vGPUs are allocated, the larger sizes may become unavailable.
>>>>
>>>> Yes, Intel can do such type of definition. One thing I'm not sure is
>>>> about impact cross listed types, i.e. when creating a new instance
>>>> under a given type, max_instances under other types would be
>>>> dynamically decremented based on available resource. Would it be
>>>> a problem for libvirt or upper level stack, since a natural interpretation
>>>> of max_instances should be a static number?
>>>>
>>>> An alternative is to make max_instances configurable, so libvirt has
>>>> chance to define a pool of available instances with different types
>>>> before creating any instance. For example, initially IGD driver may
>>>> report max_instances only for a minimal sharing granularity:
>>>> 128MB:
>>>> max_instances (8)
>>>> 256MB:
>>>> max_instances (0)
>>>> 512MB:
>>>> max_instances (0)
>>>>
>>>> Then libvirt can configure more types as:
>>>> 128MB:
>>>> max_instances (2)
>>>> 256MB:
>>>> max_instances (1)
>>>> 512MB:
>>>> max_instances (1)
>>>>
>>>> Starting from this point, max_instances would be static and then
>>>> mdev instance can be created under each type. But I'm not
>>>> sure whether such additional configuration role is reasonable to libvirt...
>>
>> My expectation of your example, where I'm assuming you have 1G of total
>> memory that can be divided between the mdev devices would be:
>>
>> 128M: 8
>> 256M: 4
>> 512M: 2
>>
>> If a 512M mdev device is created, this becomes:
>>
>> 128M: 4
>> 256M: 2
>> 512M: 1
>>
>> Creating a 128M mdev device from that becomes:
>>
>> 128M: 3
>> 256M: 1
>> 512M: 0
>>
>> It's not great, but I don't know how to do it better without the user
>> having a clear understanding of the algorithm and resources required
>> for each mdev device. For instance, the size here, presumably the
>> framebuffer size, is just one attribute in the device directory, the
>> user won't know that this attribute is the key to the available
>> instances.
>>
>> I don't particularly like the idea of a writeable max_instances, the
>> user can simply create instances of the type and see the results.
>>
>> Just thought of another thing; do we need some way to determine the
>> type of an mdev device from sysfs or is this implicit knowledge for the
>> user that created the device? For instance, we create a 512M device
>> and it becomes a child device to the parent, so we can associate to the
>> parent, but if we come back later, how do we know it's a 512M device?
>> Perhaps this is a reason to keep the type directories around and we can
>> cross link the device to the type and create a devices subdirectory
>> under each type. Perhaps then "max_instances" becomes
>> "available_instances" (ie. how many left we can create) and we
don't
>> need a "current_instances" because we can simply look in the devices
>> directory.
>>
>
> When the mdev module creates an mdev device (mdev_device_create() in the
> patch), 'mdev->dev.parent' is assigned as its parent physical device, so
> device_register() creates the child's directory inside the parent's directory.
> The directory for the mdev device is not explicitly created, so I don't think
> we can move this directory into the type directory. But we can think of adding
> a link to the type directory from the mdev device's directory.
Yes, the idea was only to add links, not to change anything about the
parent/child hierarchy in sysfs. The result would be similar to how we
have /sys/kernel/iommu_groups/$GROUP/devices/ with links to the devices
contained within that group.
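So, if I understand correctly, the resulting layout would be something like
this (illustrative only; the link names are assumptions):

    /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types/11/devices/
    └── $UUID -> ../../../$UUID

    /sys/bus/pci/devices/0000:86:00.0/$UUID/
    └── mdev_type -> ../mdev_supported_types/11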
>>>>>
>>>>> We still don't have any way for the admin to learn in advance how the
>>>>> available supported types will change once mdev devices start to be
>>>>> created. I'm not sure how we can create a specification for this, so
>>>>> probing by creating devices may be the most flexible model.
>>>>>
>
> Removing a type directory dynamically seems difficult. So the other way, as
> suggested here: when that type is not supported, the vendor driver can
> report max_instances as 0.
I'm ok with this, seems like there are enough uses for it and it's
necessary to keep the directory for the device links.
>>>>> The other issue is the start/stop requirement, which was revealed to
>>>>> setup peer-to-peer resources between vGPUs which is a limited hardware
>>>>> resource. We'd really like to have these happen automatically on the
>>>>> first open of a vfio mdev device file and final release. So we
>>>>> brainstormed how the open/release callbacks could know the other mdev
>>>>> devices for a given user. This is where the instance number came into
>>>>> play previously. This is an area that needs work.
>>>>
>>>> IGD doesn't have such peer-to-peer resource setup requirement. So
>>>> it's sufficient to create/destroy a mdev instance in a single action on
>>>> IGD. However I'd expect we still keep the "start/stop" interface (
>>>> maybe not exposed as sysfs node, instead being a VFIO API), as
>>>> required to support future live migration usage. We've made prototype
>>>> working for KVMGT today.
>>
>> Great!
>>
>
> In this v7 version of the patch, I made changes that introduce 'online'
> in the mdev device directory, as discussed in the v6 reviews. We need this to
> commit resources for the device(s).
But if we have some number of mdev devices, each with just a UUID
identifier, how are separate online callbacks for each device
associated to a single peer-to-peer context?
>>> It's good for the framework to define start/stop interfaces, but as Alex
>>> said below, it should be MDEV oriented, not VM oriented.
>>>
>>> I don't know a lot about the peer-to-peer resource, but to me, although
>>> VM_UUID + instance is not applicable, userspace can always achieve the
>>> same purpose by, let us assume a mdev hierarchy, providing the VM UUID
>>> under every mdev:
>>>
>>> /sys/bus/pci/devices/<sbdf>/mdev/
>>> |-- mdev01/
>>> | `-- vm_uuid
>>> `-- mdev02/
>>> `-- vm_uuid
>>>
>>> Did I miss something?
>>
>> Sure, this is just another way of doing UUID+instance. Nit, it might
>> look more like:
>>
>> /sys/bus/pci/devices/<sbdf>/mdev/
>> |-- uuid1/
>> | `-- group_uuid
>> `-- uuid2/
>> `-- group_uuid
>>
>> Where each mdev device is actually referenced by its UUID name then
>> we'd have some writable attribute under the device where mdev devices
>> sharing the same group UUID are handled together.
>
> A group UUID would also work; as long as it's unique and set for all
> devices in a group, it should work.
Well, except for the problem I mention in the quoted paragraph below.
>> There's a problem
>> here though that vfio doesn't know about this level of grouping, so
>> uuid1 and uuid2 could actually be given to different users despite the
>> grouping here, which results in one or both devices not working or
>> creating security issues. That sort of implies that this would
>> necessarily need to be exposed as iommu grouping. This factors into why
>> it seems like a good idea to make the start/stop implicit within the
>> interface. In that way each mdev device is fungible as far as a user
>> like libvirt is concerned, internal details like peer-to-peer resources
>> are handled automatically as the devices are accessed.
>>
>
> I understand your concerns here. But making it implicit doesn't guarantee
> that a device will not be accessed before all mdev devices are started.
This is true, start on mmio fault relies on devices being setup w/o
accessing the mmio space. It should be how QEMU works today though.
>>>>> There was a thought that perhaps on open() the vendor driver could look
>>>>> at the user pid and use that to associate with other devices, but the
>>>>> problem here is that we open and begin access to each device, so
>>>>> devices do this discovery serially rather than in parallel as desired.
>>>>> (we might not fault in mmio space yet though, so I wonder if open()
>>>>> could set the association of mdev to pid, then the first mmio fault
>>>>> would trigger the resource allocation? Then all the "magic" would live
>>>>> in the vendor driver. open() could fail if the pid already has running
>>>>> mdev devices and the vendor driver chooses not to support hotplug)
>>>>>
>
> The problem is that resources should be committed before any device is
> accessed, not at the first fault on mmio space.
It seems then that the grouping needs to affect the iommu group so that
you know that there's only a single owner for all the mdev devices
within the group. IIRC, the bus drivers don't have any visibility
to opening and releasing of the group itself to trigger the
online/offline, but they can track opening of the device file
descriptors within the group. Within the VFIO API the user cannot
access the device without the device file descriptor, so a "first
device opened" and "last device closed" trigger would provide the
trigger points you need. Some sort of new sysfs interface would need to
be invented to allow this sort of manipulation.
I like this suggestion and the thinking around it.
Also we should probably keep sight of whether we feel this is
sufficiently necessary for the complexity. If we can get by with only
doing this grouping at creation time then we could define the "create"
interface in various ways. For example:
echo $UUID0 > create
would create a single mdev named $UUID0 in its own group.
echo {$UUID0,$UUID1} > create
could create mdev devices $UUID0 and $UUID1 grouped together.
I think this would create mdev devices of the same type on the same parent
device. We need to consider the case where multiple mdev devices of
different types and with different parents are grouped together.
We could even do:
echo $UUID1:$GROUPA > create
where $GROUPA is the group ID of a previously created mdev device into
which $UUID1 is to be created and added to the same group.
I was thinking about:
    echo $UUID0 > create
would create an mdev device.
    echo $UUID0 > /sys/class/mdev/create_group
would add the created device to a group.
For the multiple-device case:
    echo $UUID0 > create
    echo $UUID1 > create
would create mdev devices which could be of different types and have
different parents.
    echo $UUID0, $UUID1 > /sys/class/mdev/create_group
would add the devices to a group.
The mdev core module would create a new group with a unique number. On mdev
device 'destroy' that mdev device would be removed from the group. When
there are no devices left in the group, the group would be deleted. With
this, the "first device opened" and "last device closed" triggers can be used
to commit resources.
Then libvirt would pass the mdev device path as an argument to QEMU, the same
as it does for VFIO. Libvirt doesn't have to care about the group number.
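To put the whole flow together under this proposal (the parent devices, sysfs
paths and the 'remove' semantics below are just illustrative assumptions):

    # create two mdev devices of different types on different parents
    echo $UUID0 > /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types/11/create
    echo $UUID1 > /sys/bus/pci/devices/0000:87:00.0/mdev_supported_types/12/create

    # group them, so resources are committed on first open and released
    # on last close of any device file descriptor in the group
    echo $UUID0, $UUID1 > /sys/class/mdev/create_group

    # removing a device drops it from its group; the group is deleted
    # once it has no devices left
    echo 1 > /sys/bus/pci/devices/0000:86:00.0/$UUID0/remove
    echo 1 > /sys/bus/pci/devices/0000:87:00.0/$UUID1/remove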
Currently iommu groups are determined at device discovery time and not
changeable, so it seems like this sort of matches that model, but it
makes life difficult for libvirt if they want to have a pool of mdev
devices that they arbitrarily assign to VMs. Also the question of
whether libvirt applies this to all mdev devices or only NVIDIA. Does it
try to use the same group across different parent devices?
Yes, a group could consist of mdev devices with different parent devices.
Does it only group devices with matching vendor strings? Much to be
specified...
I don't think it should be vendor specific.
Thanks,
Kirti
Thanks,
Alex