VFIO's new mediated device interface "is used for allowing
software-defined devices to be exposed through VFIO while the host
driver manages access to the interface" (quoted from
http://www.phoronix.com/scan.php?page=news_item&px=VFIO-Linux-4.10-Me...
). Now that support for mediated devices has been added to the
upstream Linux kernel, there is a stable API that libvirt can use to
support assigning mediated devices (e.g. virtual GPUs) to QEMU/KVM
guests (or presumably any other hypervisor that supports device
assignment via VFIO).
We've had a few private discussions about what should be added to
libvirt, and now have enough rough ideas to start discussing it on the list.
The major requirements we've come up with so far (in what I think is a
reasonable order of implementation) are:
1) The ability to assign an already-created mediated device to a guest
(think of "<hostdev ... managed='no'>" mode for assigning regular PCI
devices).
2) Reporting of the capabilities of a mediated device "parent"
(including, for example, the supported types, the maximum number of child
devices that are supported, and the names of all existing child devices)
and of existing child devices (via the node device APIs, e.g. virsh
nodedev-list and virsh nodedev-dumpxml).
3) The ability to create and destroy mediated devices via the NodeDevice
API (similar in function to the "virsh detach-device" and "virsh
attach-device" commands, i.e. they make a device ready to be assigned to
a guest using <hostdev>, but have no persistent config and no
"auto-start" capability).
4) Support for "managed" mediated devices - libvirt will create a new
child device as required, and destroy it when it's no longer needed
(similar to the way that standard PCI hostdevs with managed="yes" are
detached from their host driver and attached to vfio-pci as needed). I
think this is less useful than item (5), but it is simpler and may be a
good way to test all the preceding additions, as well as being useful in
some simpler configurations.
5) The ability to create and manage "pools" of mediated devices, with
persistent config and an auto-start capability so that the device pools
are automatically created when the host is booted (this will require
either some form of persistent config and lifecycle management to be
added to the nodedevice driver, or a new libvirt driver type with
functionality similar to storage pools, but used to manage pools of mdev
child devices).
=========
Going back to the beginning, with slightly more detail:
1) "Unmanaged" mediated device assignment - assigning an existing device
to a virtual machine
This will assume that the desired child device has already been created,
and can be found in /sys/bus/mdev/devices/$UUID. Here's a first attempt
at what the XML would look like:
  <hostdev mode='subsystem' type='pci' managed='no'>
    <source> <!-- (maybe add "type='mdev'" ???) -->
      <mdev uuid='$uuid'/>
    </source>
    <address type='pci' blah blah blah/> <!-- PCI address in the guest -->
  </hostdev>
In the past, the "type" attribute of hostdev described the type on both
the host and the guest. With mediated devices, the device isn't visible
on the host as a PCI device, but just as a software device. So the type
attribute in <hostdev> now gives the type of the device on the guest,
and the device type on the host is determined from the contents of <source>.
Erik had a different suggestion for this (which I think he's already
working on patches for) - that the type attribute in <hostdev> should be
the type of the device in the *host*, and the type in the guest would be
that given in the <address>. Something like this I think:
  <hostdev mode='subsystem' type='mdev' managed='no'>
    <source>
      <mdev uuid='$uuid'/>
    </source>
    <address type='pci' blah blah blah/>
  </hostdev>
(Is this correct, Erik?)
(I arrived at my suggestion by reasoning that, in other places where
there are similar attributes for the host and guest side, e.g. the IP
addresses and routes that can be added on both the host and guest side
of an <interface>, everything related to the host side is in the
<source> subelement, while things related to the guest are directly
under the toplevel of the device element. On the other hand, the
"managed" attribute isn't something related to the guest, but to the
host, and his idea has less redundancy, so maybe he's onto something...)
(NB: a mediated device could be exposed to the guest as a PCI device, a
CCW device, or anything else supported by vfio. The type of device that
the guest will see can be determined from the contents of
mdev_supported_types/<type-id>/device_api under the parent device's
directory in sysfs (it will be, e.g., "vfio-pci" or "vfio-ccw"). But
libvirt assigns guest-side addresses at the time a domain is defined,
and it's possible that the mdev child device won't be created yet at
define time (and therefore we won't know which parent device it's
associated with, and so we won't be able to look at device_api). In such
situations, it will be up to management to know something about the
device it will be creating and assume a type. Fortunately this is a
reasonably safe thing to do - on x86 platforms we can be fairly certain
that the device will be a PCI device (and, because this also makes a
difference for some machine types, that it will be a PCI Express device).
We will want to check device_api at runtime though, to validate that the
guest-side device really is a PCI device.)
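To make the runtime check a bit more concrete, here is a minimal Python
sketch (not libvirt code, just an illustration). It assumes that an
existing child device's directory contains an "mdev_type" link pointing at
the parent's mdev_supported_types/<type-id> directory, so that device_api
can be read through it; that link, the function names, and the sysfs_root
parameter (there only so the code can be tried against a fake tree) are my
assumptions, not something decided above.

```python
import os


def mdev_device_api(uuid, sysfs_root="/sys"):
    """Return the device_api string (e.g. "vfio-pci") of an existing mdev.

    Returns None when the child device has not been created yet; in that
    case the caller has to fall back on an assumed guest-side type.
    """
    dev_dir = os.path.join(sysfs_root, "bus", "mdev", "devices", uuid)
    if not os.path.isdir(dev_dir):
        return None
    # Assumption: "mdev_type" links to the parent's
    # mdev_supported_types/<type-id> directory.
    api_path = os.path.join(dev_dir, "mdev_type", "device_api")
    with open(api_path) as f:
        return f.read().strip()


def check_guest_type(uuid, expected="vfio-pci", sysfs_root="/sys"):
    """Validate at runtime that the mdev matches the assumed guest type."""
    api = mdev_device_api(uuid, sysfs_root)
    if api is None:
        raise FileNotFoundError(f"mdev {uuid} does not exist")
    if api != expected:
        raise ValueError(f"mdev {uuid} is {api}, expected {expected}")
    return api
```

At define time only the assumed type would be available; something like
check_guest_type() would run when the domain is started and the child
device is known to exist.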
==
2) Reporting parent and child mediated devices and their capabilities in
the node device API.
There are 3 stages to this:
a) add mediated child devices to the list of devices provided by "virsh
nodedev-list". These will be called "mdev_$UUID", and will show up as
descendants of their respective parent devices in "virsh nodedev-list
--tree". The list of all these devices can easily be retrieved by
enumerating the links in /sys/bus/mdev/devices/.
b) report the capabilities of parent devices in their dumpxml output.
This will include supported child device types and a list of current
children.
I don't have any experience with nodedev reporting for SCSI devices, but
recently noticed that nodedev-list can report lists of devices with
certain capabilities, e.g. "virsh nodedev-list --cap=scsi_host". Based
on this, I guess it would be useful for the parent devices to show
something like this (using the sample mtty driver as an example):
  <device>
    <name>pci_0000_02_00_0</name>
    <parent>pci_0000_00_04_0</parent>
    <driver>
      <name>mtty</name>
    </driver>
    <capability type='mdev_parent'>
      [list of supported types, each with number allowed]
      [list of current child devices (just giving uuid or device
       name ("mdev_$uuid")?)]
      [other info about parent/children?]
    </capability>
    ...
Likewise, a nodedev-dumpxml of a child device should contain a pointer
to the parent device.
c) respond to dumpxml requests for mediated child devices. This should
include at least the uuid/type of the child device, and a link back to
the parent device (and I suppose somehow include <capability
type='mdev_child'> so that it can be filtered with virsh nodedev-list?)
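For stages (a) and (b), the information gathering could look roughly like
the Python sketch below. It is only an illustration, under two
assumptions of mine: each entry in /sys/bus/mdev/devices/ is a link into
its parent's sysfs directory, and each supported type directory exposes an
"available_instances" attribute next to device_api. The function names and
the sysfs_root parameter are hypothetical.

```python
import os


def list_mdev_children(sysfs_root="/sys"):
    """Map each proposed node device name ("mdev_$UUID") to its parent dir."""
    devices_dir = os.path.join(sysfs_root, "bus", "mdev", "devices")
    children = {}
    if not os.path.isdir(devices_dir):
        return children  # no mdev support, or no child devices yet
    for uuid in sorted(os.listdir(devices_dir)):
        # Assumption: the entry is a link into the parent's directory, so
        # the parent is the directory containing the resolved child.
        real = os.path.realpath(os.path.join(devices_dir, uuid))
        children["mdev_" + uuid] = os.path.dirname(real)
    return children


def supported_types(parent_dir):
    """Return {type-id: {attribute: value}} for one parent device."""
    types_dir = os.path.join(parent_dir, "mdev_supported_types")
    result = {}
    if not os.path.isdir(types_dir):
        return result  # not an mdev parent
    for type_id in sorted(os.listdir(types_dir)):
        info = {}
        for attr in ("device_api", "available_instances"):
            path = os.path.join(types_dir, type_id, attr)
            if os.path.exists(path):
                with open(path) as f:
                    info[attr] = f.read().strip()
        result[type_id] = info
    return result
```

The first function gives what nodedev-list would need for the child
devices and their place in the tree; the second gives the per-type data
that could go inside the parent's <capability type='mdev_parent'>.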
==
(3), (4), and (5) need more thought that I haven't gotten to yet. TBD
(if anyone else has thoughts on those, please share!)