On Tue, 20 Sep 2016 02:05:52 +0530
Kirti Wankhede <kwankhede(a)nvidia.com> wrote:
Hi libvirt experts,
Thanks for valuable input on v1 version of RFC.
Quick brief, VFIO based mediated device framework provides a way to
virtualize their devices without SR-IOV, like NVIDIA vGPU, Intel KVMGT
and IBM's channel IO. This framework reuses VFIO APIs for all the
functionalities for mediated devices which are currently being used for
pass through devices. This framework introduces a set of new sysfs files
for device creation and its life cycle management.
Here is the summary of discussion on v1:
1. Discover mediated device:
As part of physical device initialization process, vendor driver will
register their physical devices, which will be used to create virtual
device (mediated device, aka mdev) to the mediated framework.
Vendor driver should specify mdev_supported_types in directory format.
This format is class based, for example, display class directory format
should be as below. We need to define such set for each class of devices
which would be supported by mediated device framework.
--- mdev_destroy
I thought we replaced mdev_destroy with a remove attribute for each
mdev device.
--- mdev_supported_types
|-- 11
| |-- create
| |-- name
| |-- fb_length
| |-- resolution
| |-- heads
| |-- max_instances
| |-- params
| |-- requires_group
|-- 12
| |-- create
| |-- name
| |-- fb_length
| |-- resolution
| |-- heads
| |-- max_instances
| |-- params
| |-- requires_group
|-- 13
|-- create
|-- name
|-- fb_length
|-- resolution
|-- heads
|-- max_instances
|-- params
|-- requires_group
In the above example directory '11' represents a type id of mdev device.
'name', 'fb_length', 'resolution', 'heads',
'max_instance' and
'requires_group' would be Read-Only files that vendor would provide to
describe about that type.
'create':
Write-only file. Mandatory.
Accepts string to create mediated device.
'name':
Read-Only file. Mandatory.
Returns string, the name of that type id.
'fb_length':
Read-only file. Mandatory.
Returns <number>{K,M,G}, size of framebuffer.
This can't be mandatory, it's only relevant to vGPU devices, vGPUs are
just one user of mediated devices.
'resolution':
Read-Only file. Mandatory.
Returns 'hres x vres' format. Maximum supported resolution.
Same.
'heads':
Read-Only file. Mandatory.
Returns integer. Number of maximum heads supported.
Same.
'max_instance':
Read-Only file. Mandatory.
Returns integer. Returns maximum mdev device could be created
at the moment when this file is read. This count would be updated by
vendor driver. Before creating mdev device of this type, check if
max_instance is > 0.
Didn't we discuss this being being something like "available_instances"
with a "devices" sub-directory linking current devices of that type?
'params'
Write-Only file. Optional.
String input. Libvirt would pass the string given in XML file to
this file and then create mdev device. Set empty string to clear params.
For example, set parameter 'frame_rate_limiter=0' to disable frame rate
limiter for performance benchmarking, then create device of type 11. The
device created would have that parameter set by vendor driver.
Might as well just call it "magic", there's absolutely no ability to
introspect what parameters are allowed or even valid here. Can all
parameters be changed dynamically? Do parameter changes apply to
existing devices or only future devices? This is a poorly specified
interface. I'd prefer this be done via module options on the vendor
driver.
'requires_group'
Read-Only file. Optional.
This should be provided by vendor driver if vendor driver need to
group mdev devices in one domain so that vendor driver can use 'first
open' to commit resources of all mdev devices associated to that domain
and 'last close' to free those.
The parent device would look like:
<device>
<name>pci_0000_86_00_0</name>
<capability type='pci'>
<domain>0</domain>
<bus>134</bus>
<slot>0</slot>
<function>0</function>
This is the parent device address? (I think we'd re-use the
specification for assigned devices) Is this optional? Couldn't
libvirt choose to pick any parent device supplying the specified type
if not supplied?
<capability type='mdev'>
<!-- one type element per sysfs directory -->
<type id='11'>
<!-- one element per sysfs file roughly -->
<name>GRID M60-0B</name>
<attribute name='fb_length'>512M</attribute>
<attribute name='resolution'>2560x1600</attribute>
<attribute name='heads'>2</attribute>
<attribute name='max_instances'>16</attribute>
<attribute name='requires_group'>1</attribute>
</type>
</capability>
<product id='...'>GRID M60</product>
<vendor id='0x10de'>NVIDIA</vendor>
What are these specifying?
</capability>
</device>
2. Create/destroy mediated device
With above example, vGPU device XML would look like:
<device>
<name>my-vgpu</name>
<parent>pci_0000_86_00_0</parent>
<capability type='mdev'>
<type id='11'/>
<group>1</group>
<params>'frame_rate_limiter=0'</params>
</capability>
</device>
'type id' is mandatory.
I was under the impression that the vendor supplied "name" was the one
unique identifier. We're certainly not going to create a registrar to
hand out type ids to each vendor, so what makes type id unique? I have
a hard time believing that a given vendor can even allocate unique type
ids for their own devices. Unique type id across vendors is not
practical. So which attribute are we actually using to identify which
type of mdev device we need and why would we ever specify additional
attributes like fb_length? Doesn't the vendor guarantee that "GRID
M60-0B" has a fixed setup of those attributes?
'group' is optional. It should be a unique number in the
system among
all the groups created for mdev devices. Its usage is:
- not needed if single vGPU device is being assigned to a domain.
- only need to be set if multiple vGPUs need to be assigned to a
domain and vendor driver have 'requires_group' file in type id directory.
- if type id directory include 'requires_group' and user tries to
assign multiple vGPUs to a domain without having <group> field in XML,
it will create single vGPU.
We never finished our discussion of how this gets implemented or
whether a group applies only to devices from the same vendor, same
device, how heterogeneous groups are handled, etc.
'params' is optional field. User should set this field if
extra
parameters need to be set for a particular vGPU device. Libvirt don't
need to parse these params. These are meant for vendor driver.
ie. magic black hole. nope.
Libvirt need to follow the sequence to create device:
* Read /sys/../0000\:86\:00.0/11/max_instances. If it is greater than 0,
then only proceed else fail.
* Set extra params if 'params' field exist in device XML and 'params'
file exist in type id directory
echo "frame_rate_limiter=0" > /sys/../0000\:86\:00.0/11/params
* Autogenerate UUID
* Create device:
echo "$UUID:<group>" > /sys/../0000\:86\:00.0/11/create
where <group> is optional. Group should be unique number among all
the groups created for mdev devices.
Is it an integer, a UUID, a string? How is the group for an mdev
device discovered? Can it be changed later?
* Clear params, if set earlier:
echo "" > /sys/../0000\:86\:00.0/11/params
So params only applies to the devices created while those params are
set? This is inherently racy.
* To destroy device:
echo $UUID > /sys/../0000\:86\:00.0/mdev_destroy
echo 1 > /sys/path/to/mdev/device/remove
3. Start/stop mediated device
No change or requirement for libvirt as this will be handled by open()
and close() callbacks to vendor driver. In case of multiple devices and
'requires_group' set, this will be handled in 'first open()' and
'last
close()' on device in that group.
4. Launch QEMU/VM
Pass the mdev sysfs path to QEMU as vfio-pci device.
For above vGPU device example:
-device vfio-pci,sysfsdev=/sys/bus/mdev/devices/$UUID
5. QEMU/VM Shutdown sequence
No change or requirement for libvirt.
6. VM Reset
No change or requirement for libvirt as this will be handled via VFIO
reset API and QEMU process will keep running as before.
7. Hot-plug
It is same syntax to create a virtual device for hot-plug.
How do groups work with hotplug? Can a device be creating into an
existing, running group? Can a device be removed from an existing,
running group? Thanks,
Alex