VFIO's new mediated device interface "is used for allowing software-defined devices to be exposed through VFIO while the host driver manages access to the interface" (quoted from http://www.phoronix.com/scan.php?page=news_item&px=VFIO-Linux-4.10-Mediated ). Now that support for mediated devices has been added to the upstream Linux kernel, there is a stable API that libvirt can use to support assigning mediated devices (e.g. virtual GPUs) to QEMU/KVM guests (or presumably any other hypervisor that supports device assignment via VFIO).

We've had a few private discussions about what should be added to libvirt, and now have enough rough ideas to start discussing it on the list.

The major requirements we've come up with so far (in what I think is a reasonable order of implementation) are:

1) The ability to assign an already-created mediated device to a guest (think of "<hostdev ... managed='no'>" mode for assigning regular PCI devices).

2) Reporting of the capabilities of a mediated device "parent" (including, for example, the supported types, the maximum number of child devices that are supported, and the names of all existing child devices) and of existing child devices (via the node device APIs, e.g. virsh nodedev-list and virsh nodedev-dumpxml).

3) The ability to create and destroy mediated devices via the NodeDevice API (similar in function to the "virsh detach-device" and "virsh attach-device" commands, i.e. they make a device ready to be assigned to a guest using <hostdev>, but have no persistent config and no "auto-start" capability).

4) Support for "managed" mediated devices - libvirt will create a new child device as required, and destroy it when it's no longer needed (similar to the way that standard PCI hostdevs with managed="yes" are detached from their host driver and attached to vfio-pci as needed). I think this is less useful than item (5), but it is simpler and may be a good way to test all the preceding additions, as well as being useful in some simpler configurations.

5) The ability to create and manage "pools" of mediated devices, with persistent config and an auto-start capability so that the device pools are automatically created when the host is booted (this will require either some form of persistent config and lifecycle management to be added to the nodedevice driver, or a new libvirt driver type with functionality similar to storage pools, but used to manage pools of mdev child devices).

=========

Going back to the beginning, with slightly more detail:

1) "Unmanaged" mediated device assignment - assigning an existing device to a virtual machine

This will assume that the desired child device has already been created, and can be found in /sys/bus/mdev/devices/$UUID. Here's a first attempt at what the XML would look like:

    <hostdev mode='subsystem' type='pci' managed='no'>
        <source>  <!-- (maybe add "type='mdev'" ???) -->
            <mdev uuid='$uuid'/>
        </source>
        <address type='pci' blah blah blah/> <!-- PCI address in the guest -->
    </hostdev>
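
For unmanaged assignment, something has to confirm that the child device named by the uuid really exists before the guest starts. Here is a minimal sketch of that check in Python, using only the /sys/bus/mdev/devices/$UUID location mentioned above. The function name and the devices_dir parameter are made up for illustration (the parameter is there so the check can be tried against a scratch directory); this is not how libvirt would necessarily implement it.

```python
import os
import uuid

MDEV_DEVICES_DIR = "/sys/bus/mdev/devices"


def mdev_device_path(mdev_uuid, devices_dir=MDEV_DEVICES_DIR):
    """Return the sysfs path of an already-created mdev child device.

    Hypothetical helper: raises ValueError for a malformed UUID and
    FileNotFoundError if no such child device exists.
    """
    # Parse and re-serialise the UUID so that only a canonical,
    # lowercase, hyphenated form is ever joined onto devices_dir.
    canonical = str(uuid.UUID(mdev_uuid))
    path = os.path.join(devices_dir, canonical)
    if not os.path.exists(path):
        raise FileNotFoundError(
            f"mediated device {canonical} not found in {devices_dir}")
    return path
```

With managed='no', a failure here would simply be reported as an error at guest start; nothing is created on the user's behalf.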


In the past, the "type" attribute of hostdev described the type on both the host and the guest. With mediated devices, the device isn't visible on the host as a PCI device, but just as a software device. So the type attribute in <hostdev> now gives the type of the device on the guest, and the device type on the host is determined from the contents of <source>.

Erik had a different suggestion for this (which I think he's already working on patches for) - that the type attribute in <hostdev> should be the type of the device in the *host*, and the type in the guest would be that given in the <address>. Something like this I think:

    <hostdev mode='subsystem' type='mdev' managed='no'>
        <source>
            <mdev uuid='$uuid'/>
        </source>
        <address type='pci' blah blah blah/>
    </hostdev>


(Is this correct, Erik?)

(I arrived at my suggestion by the thinking that, in other places where there are similar attributes for the host and guest side, e.g. the IP addresses and routes that can be added on both the host and guest side of an <interface>, everything related to the host side is in the <source> subelement, while things related to the guest are directly under the toplevel of the device element. On the other hand, the "managed" attribute isn't something related to the guest, but to the host, and his idea has less redundancy, so maybe he's onto something...)

(NB: a mediated device could be exposed to the guest as a PCI device, a CCW device, or anything else supported by vfio. The type of device that the guest will see can be determined from the contents of mdev_supported_types/<type-id>/device_api under the parent device's directory in sysfs (it will be, e.g., "vfio-pci" or "vfio-ccw"). But libvirt assigns guest-side addresses at the time a domain is defined, and it's possible that the mdev child device won't be created yet at define time (and therefore we won't know which parent device it's associated with, and so we won't be able to look at device_api). In such situations, it will be up to management to know something about the device it will be creating and assume a type. Fortunately this is a reasonably safe thing to do - on x86 platforms we can be fairly certain that the device will be a PCI device (and, because this also makes a difference for some machinetypes, that it will be a PCI Express device). We will want to check device_api at runtime though, to validate that the guest-side device really is a PCI device.)
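
To make the runtime check concrete, here is a rough Python sketch that reads mdev_supported_types/<type-id>/device_api under a parent device's sysfs directory and compares it with the guest address type from the domain config. The function names and the mapping from device_api values to address types are my assumptions for illustration; only the two device_api values mentioned above are handled.

```python
import os

# Assumed mapping from device_api contents to the guest <address type=...>.
# Only the two values named in this mail are covered.
DEVICE_API_TO_ADDRESS_TYPE = {
    "vfio-pci": "pci",
    "vfio-ccw": "ccw",
}


def guest_address_type(parent_dir, type_id):
    """Return the guest address type implied by a parent's device_api."""
    api_path = os.path.join(
        parent_dir, "mdev_supported_types", type_id, "device_api")
    with open(api_path) as f:
        device_api = f.read().strip()
    if device_api not in DEVICE_API_TO_ADDRESS_TYPE:
        raise ValueError(f"unsupported device_api {device_api!r}")
    return DEVICE_API_TO_ADDRESS_TYPE[device_api]


def validate_guest_address(parent_dir, type_id, configured_type):
    """Raise ValueError if the configured address type does not match."""
    actual = guest_address_type(parent_dir, type_id)
    if actual != configured_type:
        raise ValueError(
            f"domain config assumes a {configured_type!r} address, "
            f"but the device is exposed as {actual!r}")
```

The point of doing this at guest start rather than at define time is exactly the one above: at define time the child (and so the parent) may not be known yet.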

==

2) Reporting parent and child mediated devices and their capabilities in the node device API.

There are 3 stages to this:

a) add mediated child devices to the list of devices provided by "virsh nodedev-list". These will be called "mdev_$UUID", and will show up as descendants of their respective parent devices in "virsh nodedev-list --tree". The list of all these devices can easily be retrieved by enumerating the links in /sys/bus/mdev/devices/.

b) report the capabilities of parent devices in their dumpxml output. This will include supported child device types and a list of current children.

I don't have any experience with nodedev reporting for SCSI devices, but recently noticed that nodedev-list can report lists of devices with certain capabilities, e.g. "virsh nodedev-list --cap=scsi_host". Based on this, I guess it would be useful for the parent devices to show something like this (using the sample mtty driver as an example):

     <device>
        <name>pci_0000_02_00_0</name>
        <parent>pci_0000_00_04_0</parent>
        <driver>
          <name>mtty</name>
        </driver>
        <capability type='mdev_parent'>
          [list of supported types, each with number allowed]
          [list of current child devices (just giving uuid or device name ("mdev_$uuid"?))]
          [other info about parent/children?]
        </capability>
        ...

Likewise, a nodedev-dumpxml of a child device should contain a pointer to the parent device.

c) respond to dumpxml requests for mediated child devices. This should include at least the uuid/type of the child device, and a link back to the parent device (and I suppose somehow include <capability type='mdev_child'> so that it can be filtered with virsh nodedev-list?)
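
As a rough illustration of where the data for (a) and (b) could come from, here is a Python sketch that walks sysfs. Several things in it are assumptions on my part and not part of the proposal: the function names; that each link in /sys/bus/mdev/devices/ resolves to a directory directly under its parent device; and the per-type "available_instances" attribute and "devices" directory, which are taken from the kernel's mdev sysfs layout and are not mentioned above.

```python
import os


def _read_attr(path):
    """Read a single-line sysfs attribute."""
    with open(path) as f:
        return f.read().strip()


def list_mdev_children(devices_dir="/sys/bus/mdev/devices"):
    """Return (nodedev_name, parent_sysfs_path) for each mdev child.

    Assumes each entry is a symlink whose target sits directly under
    the parent device's sysfs directory.
    """
    children = []
    for entry in sorted(os.listdir(devices_dir)):
        link = os.path.join(devices_dir, entry)
        if not os.path.islink(link):
            continue
        target = os.path.realpath(link)
        children.append(("mdev_" + entry, os.path.dirname(target)))
    return children


def parent_capabilities(parent_dir):
    """Collect per-type info that an 'mdev_parent' capability might report."""
    types_dir = os.path.join(parent_dir, "mdev_supported_types")
    caps = []
    for type_id in sorted(os.listdir(types_dir)):
        type_dir = os.path.join(types_dir, type_id)
        # Links to the existing children of this type (assumed layout).
        devices_dir = os.path.join(type_dir, "devices")
        uuids = sorted(os.listdir(devices_dir)) if os.path.isdir(devices_dir) else []
        caps.append({
            "type_id": type_id,
            "device_api": _read_attr(os.path.join(type_dir, "device_api")),
            # How many more children of this type can still be created.
            "available_instances": int(
                _read_attr(os.path.join(type_dir, "available_instances"))),
            "children": ["mdev_" + u for u in uuids],
        })
    return caps
```

The dictionaries returned by parent_capabilities() are only a stand-in for whatever ends up inside <capability type='mdev_parent'>; the XML layout itself is still open.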

==

(3), (4), and (5) need more thought that I haven't gotten to yet. TBD (if anyone else has thoughts on those, please share!)