VFIO's new mediated device
interface "is used for allowing software-defined devices to be
exposed through VFIO while the host driver manages access to the
interface" (quoted from
http://www.phoronix.com/scan.php?page=news_item&px=VFIO-Linux-4.10-Mediated
). Now that the support for mediated devices has been added to the
upstream Linux kernel, there is a stable API that libvirt can use
to support assigning mediated devices (e.g. virtual GPUs) to
QEMU/KVM guests (or presumably any other hypervisor that supports
device assignment via VFIO).
We've had a few private discussions about what should be added to
libvirt, and now have enough rough ideas to start discussing it on
the list.
The major requirements we've come up with so far (in what I think
is a reasonable order of implementation) are:
1) The ability to assign an already-created mediated device to a
guest (think of "<hostdev ... managed='no'>" mode for
assigning regular PCI devices).
2) Reporting of the capabilities of a mediated device "parent"
(including, for example, the supported types and maximum number of
child devices that are supported, and the names of all existing
child devices) and of existing child devices (via the node device
APIs, e.g. virsh nodedev-list and virsh nodedev-dumpxml).
3) The ability to create and destroy mediated devices via the
NodeDevice API (similar in function to the "virsh detach-device"
and "virsh attach-device" commands - i.e. they make a device ready
to be assigned to a guest using <hostdev>, but have no
persistent config and no "auto-start" capability).
4) Support for "managed" mediated devices - libvirt will create a
new child device as required, and destroy it when it's no longer
needed (similar to the way that standard PCI hostdevs are, when
managed="yes", detached from their host driver and attached to
vfio-pci as needed). I think this is less useful than item (5),
but it is simpler and may be a good way to test all the preceding
additions (as well as being useful in some simpler
configurations).
5) The ability to create and manage "pools" of mediated devices,
with persistent config and an auto-start capability so that the
device pools are automatically created when the host is booted
(this will require either some form of persistent config and
lifecycle management to be added to the nodedevice driver, or a
new libvirt driver type with functionality similar to storage
pools, but used to manage pools of mdev child devices).
=========
Going back to the beginning, with slightly more detail:
1) "Unmanaged" mediated device assignment - assigning an existing
device to a virtual machine
This will assume that the desired child device has already been
created, and can be found in /sys/bus/mdev/devices/$UUID.
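As a rough illustration of the "already created" precondition, here is a minimal Python sketch that looks for a child device under /sys/bus/mdev/devices. The function names and the devices_dir parameter are my own (hypothetical, not libvirt API); the only thing taken from the above is the sysfs location.

```python
import os
import uuid as uuidlib

# Location given above; overridable so the sketch can be tried on a fake tree.
MDEV_DEVICES_DIR = "/sys/bus/mdev/devices"


def list_mdev_uuids(devices_dir=MDEV_DEVICES_DIR):
    """Return the sorted entry names in devices_dir that parse as UUIDs."""
    try:
        entries = os.listdir(devices_dir)
    except FileNotFoundError:
        # No mdev bus on this host (or no mdev support loaded).
        return []
    found = []
    for name in entries:
        try:
            uuidlib.UUID(name)
        except ValueError:
            continue  # skip anything that is not named by a UUID
        found.append(name)
    return sorted(found)


def mdev_exists(dev_uuid, devices_dir=MDEV_DEVICES_DIR):
    """True if a child device with this UUID is present in devices_dir."""
    return os.path.exists(os.path.join(devices_dir, dev_uuid))
```

For unmanaged assignment, a check like mdev_exists() would presumably happen when the guest is started rather than when it is defined, since the device may be created in between.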
Here's a first attempt at what the XML would look like:
  <hostdev mode='subsystem' type='pci' managed='no'>
    <source>   <!-- (maybe add "type='mdev'" ???) -->
      <mdev uuid='$uuid'/>
    </source>
    <address type='pci' blah blah blah/>   <!-- PCI address in the guest -->
  </hostdev>
In the past, the "type" attribute of hostdev described the type on
both the host and the guest. With mediated devices, the device
isn't visible on the host as a PCI device, but just as a software
device. So the type attribute in <hostdev> now gives the
type of the device on the guest, and the device type on the host
is determined from the contents of <source>.
Erik had a different suggestion for this (which I think he's
already working on patches for) - that the type attribute in
<hostdev> should be the type of the device in the *host*,
and the type in the guest would be that given in the
<address>. Something like this I think:
  <hostdev mode='subsystem' type='mdev' managed='no'>
    <source>
      <mdev uuid='$uuid'/>
    </source>
    <address type='pci' blah blah blah/>
  </hostdev>
(Is this correct, Erik?)
(I arrived at my suggestion by the thinking that, in other places
where there are similar attributes for the host and guest side,
e.g. the IP addresses and routes that can be added on both the
host and guest side of an <interface>, everything related to
the host side is in the <source> subelement, while things
related to the guest are directly under the toplevel of the device
element. On the other hand, the "managed" attribute isn't
something related to the guest, but to the host, and his idea has
less redundancy, so maybe he's onto something...)
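To make the difference between the two proposals concrete, here is a small Python sketch that generates either form. Neither schema exists yet, so this only mirrors the XML sketched above; build_hostdev and its variant names are hypothetical, and the <address> element carries only its type because the remaining attributes are left open above.

```python
import xml.etree.ElementTree as ET


def build_hostdev(dev_uuid, variant):
    """Build a <hostdev> string for one of the two proposals.

    variant="guest-type": type on <hostdev> is the guest-side type ('pci').
    variant="host-type":  type on <hostdev> is the host-side type ('mdev').
    """
    if variant == "guest-type":
        hostdev_type = "pci"
    elif variant == "host-type":
        hostdev_type = "mdev"
    else:
        raise ValueError("unknown variant: %r" % (variant,))

    hostdev = ET.Element(
        "hostdev",
        {"mode": "subsystem", "type": hostdev_type, "managed": "no"},
    )
    source = ET.SubElement(hostdev, "source")
    ET.SubElement(source, "mdev", {"uuid": dev_uuid})
    # Guest-side address; only the type is filled in for this sketch.
    ET.SubElement(hostdev, "address", {"type": "pci"})
    return ET.tostring(hostdev, encoding="unicode")
```

In both forms the <source> and <address> children are identical; the only thing that changes is what the type attribute on <hostdev> is taken to mean.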
(NB: a mediated device could be exposed to the guest as a PCI
device, a CCW device, or anything else supported by vfio. The type
of device that the guest will see can be determined from the
contents of
mdev_supported_types/<type-id>/device_api
under the parent device's directory in sysfs (it will be, e.g.,
"vfio-pci" or "vfio-ccw"). But libvirt assigns guest-side
addresses at the time a domain is defined, and it's possible that
the mdev child device won't be created yet at define time (and
therefore we won't know which parent device it's associated with,
and so we won't be able to look at device_api). In such
situations, it will be up to management to know something about
the device it will be creating and assume a type. Fortunately this
is a reasonably safe thing to do - on x86 platforms we can be
fairly certain that the device will be a PCI device. (And, because
this also makes a difference for some machinetypes, that it will
be a PCI Express device). We will want to check device_api at
runtime though, to validate that the guest-side device really is a
PCI device.)
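The runtime validation could look roughly like the following Python sketch. It reads mdev_supported_types/<type-id>/device_api under a parent device's sysfs directory, as described above, and compares it with the address type that was assumed at define time. The function names are hypothetical, and the mapping from device_api strings to guest address types is my assumption, not something settled in this proposal.

```python
import os

# Assumed mapping from device_api contents to guest-side address type.
DEVICE_API_TO_ADDRESS_TYPE = {
    "vfio-pci": "pci",
    "vfio-ccw": "ccw",
}


def read_device_api(parent_dir, type_id):
    """Return the device_api string for one supported type of a parent."""
    path = os.path.join(
        parent_dir, "mdev_supported_types", type_id, "device_api"
    )
    with open(path) as f:
        return f.read().strip()


def check_guest_address_type(parent_dir, type_id, assumed_type):
    """Raise ValueError if the assumed guest address type is wrong."""
    api = read_device_api(parent_dir, type_id)
    actual = DEVICE_API_TO_ADDRESS_TYPE.get(api)
    if actual is None:
        raise ValueError("unrecognized device_api: %r" % (api,))
    if actual != assumed_type:
        raise ValueError(
            "guest address type %r does not match device_api %r"
            % (assumed_type, api)
        )
    return actual
```

The parent directory and type id are passed in explicitly here because, as noted, they may not be known until the child device actually exists.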
==
2) Reporting parent and child mediated devices and their
capabilities in the node device API.
There are 3 stages to this:
a) add mediated child devices to the list of devices provided by
"virsh nodedev-list". These will be called "mdev_$UUID", and will
show up as descendants of their respective parent devices in
"virsh nodedev-list --tree". The list of all these devices can
easily be retrieved by enumerating the links in
/sys/bus/mdev/devices/$UUID.
b) report the capabilities of parent devices in their dumpxml
output. This will include the supported child device types and a
list of current children.
I don't have any experience with nodedev reporting for SCSI
devices, but recently noticed that nodedev-list can report lists
of devices with certain capabilities, e.g. "virsh nodedev-list
--cap=scsi_host". Based on this, I guess it would be useful for
the parent devices to show something like this (using the sample
mtty driver as an example):
<device>
  <name>pci_0000_02_00_0</name>
  <parent>pci_0000_00_04_0</parent>
  <driver>
    <name>mtty</name>
  </driver>
  <capability type='mdev_parent'>
    [list of supported types, each with number allowed]
    [list of current child devices (just giving uuid or device name ("mdev_$uuid"?))]
    [other info about parent/children?]
  </capability>
...