VFIO is a new method of doing PCI device assignment ("PCI passthrough"
aka "<hostdev>") available in newish kernels (3.6?; it's in Fedora 18
at
any rate) and via the "vfio-pci" device in qemu-1.4+. In contrast to the
traditional KVM PCI device assignment (available via the "pci-assign"
device in qemu), VFIO works properly on systems using UEFI "Secure
Boot"; it also offers other advantages, such as grouping of related
devices that must all be assigned to the same guest (or not at all).
Here's some useful reading on the subject.
http://lwn.net/Articles/474088/
http://lwn.net/Articles/509153/
Short description (from Alex Williamson's KVM Forum Presentation)
1) Assume this is the device you want to assign:
01:10.0 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
2) Find the vfio group of this device:
# readlink /sys/bus/pci/devices/0000:01:10.0/iommu_group
../../../../kernel/iommu_groups/15
==> IOMMU Group = 15
3) Check the devices in the group:
# ls /sys/bus/pci/devices/0000:01:10.0/iommu_group/devices/
0000:01:10.0
(so this group has only 1 device)
4) Unbind from device driver
# echo 0000:01:10.0 >/sys/bus/pci/devices/0000:01:10.0/driver/unbind
5) Find vendor & device ID
$ lspci -n -s 01:10.0
01:10.0 0200: 8086:10ca (rev 01)
6) Bind to vfio-pci
# echo 8086 10ca > /sys/bus/pci/drivers/vfio-pci/new_id
(this will result in a new device node "/dev/vfio/15", which is what qemu will
use to set up the device for passthrough)
7) chown the device node so it is accessible by the qemu user:
# chown qemu /dev/vfio/15; chgrp qemu /dev/vfio/15
(note that /dev/vfio/vfio, which is installed as 0600 root:root, must also be made mode
0666, still owned by root - this is supposedly not dangerous)
8) set the limit for locked memory equal to the guest's entire memory size + [some
amount large enough to encompass all of IO space]
# ulimit -l 2621440 # ((2048 + 512) * 1024)
9) pass to qemu using -device vfio-pci:
sudo qemu-system-x86_64 -m 2048 -hda rhel6vm \
-vga std -vnc :0 -net none \
-enable-kvm \
-device vfio-pci,host=01:10.0,id=net0
(qemu will then use something like step (2) to figure out which device node it needs to
use)
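(To tie steps 2-7 together, here's a rough shell sketch of the manual setup for a
single device - the device address and the "qemu" user are just the values from
the example above, and this is roughly the work libvirt would need to automate
for a managed device:)

DEV=0000:01:10.0

# steps 2 & 3: find the IOMMU group and list the other devices in it
GROUP=$(basename $(readlink /sys/bus/pci/devices/$DEV/iommu_group))
ls /sys/bus/pci/devices/$DEV/iommu_group/devices/

# step 4: unbind from the current host driver, if the device is bound to one
if [ -e /sys/bus/pci/devices/$DEV/driver ]; then
    echo $DEV > /sys/bus/pci/devices/$DEV/driver/unbind
fi

# steps 5 & 6: read the vendor/device IDs from sysfs (same values "lspci -n"
# shows, just with a 0x prefix) and bind the device to vfio-pci
VENDOR=$(sed 's/^0x//' /sys/bus/pci/devices/$DEV/vendor)
DEVICE=$(sed 's/^0x//' /sys/bus/pci/devices/$DEV/device)
echo $VENDOR $DEVICE > /sys/bus/pci/drivers/vfio-pci/new_id

# step 7: make the group's device node accessible to the qemu user, and
# open up /dev/vfio/vfio as described in the note above
chown qemu:qemu /dev/vfio/$GROUP
chmod 0666 /dev/vfio/vfio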
Why the "ulimit -l"?
--------------------
Any qemu guest that is using the old pci-assign must have *all* guest memory and IO space
locked in memory. Normally the maximum amount of locked memory allowed for a process is
controlled by "ulimit -l", but in the case of pc-assign, the kvm kernel module
has always just ignored the -l limit and locked it all anyway.
With vfio-pci, all guest memory and IO space must still be locked in memory, but the vfio
module *doesn't* ignore the process limits, so libvirt will need to set ulimit -l for
any guest that wants to do vfio-based pci passthrough. Since (due to the possibility of
hotplug) we don't know at the time the qemu process is started whether or not it might
need to do a pci passthrough, we will need to use prlimit(2) to modify the limit of the
already-running qemu.
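For illustration, the limit of an already-running qemu could be bumped with
something like the following (prlimit(1) here is just the util-linux wrapper
around the prlimit(2) syscall; libvirt itself would presumably call the
syscall directly):
# prlimit --pid $(pidof qemu-system-x86_64) --memlock=2684354560:2684354560
(2684354560 = the 2621440 KiB from step 8 expressed in bytes, since prlimit's
memlock limit is given in bytes)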
Proposed XML Changes
--------------------
To support vfio pci device assignment in libvirt, I'm thinking something
like this (note that the <driver> subelement is already used for
<interface> and <disk> to choose which backend to use for a particular
device):
  <hostdev managed='yes'>
    <driver name='vfio'/>
    ...
  </hostdev>

  <interface type='hostdev' managed='yes'>
    <driver name='vfio'/>
    ...
  </interface>
(this new use of <driver> inside <interface> wouldn't conflict with the
existing <driver name='qemu|vhost'>, since neither of those could ever
possibly be a valid choice for <interface type='hostdev'>. The one possible
problem would be if someone had an <interface type='network'> which might
point to either a hostdev network or a standard bridged network, and wanted
to make sure that in the bridged-network case <driver name='qemu'/> was used.
I suppose in this case the driver name in the network definition would
override any driver name in the interface?)
Speaking of <network>, here's how vfio would be specified in a hostdev
<network> definition:
  <network>
    <name>vfio-net</name>
    <forward mode='hostdev' managed='yes'>
      <driver name='vfio'/>
      <pf dev='eth3'/> <!-- or a list of VFs -->
    </forward>
    ...
  </network>
Another possibility for the <network> XML would be to add a
"driver='vfio'" attribute to each individual <interface> line, in case
someone wanted some devices in a pool to be assigned using vfio and some
using the old style (a rough sketch of that is just below), but that seems
highly unlikely (and could create problems in the future if we ever needed
to add a 2nd attribute to the <driver> element).
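Purely for illustration, such a per-device attribute might look something like
this (both the driver attribute and the mixed pool of VFs are hypothetical, not
existing syntax):

  <forward mode='hostdev' managed='yes'>
    <interface dev='eth3_0' driver='vfio'/>
    <interface dev='eth3_1'/> <!-- would fall back to legacy pci-assign -->
  </forward>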
Actually, at one point I considered that vfio should be turned on globally in
libvirtd.conf (or qemu.conf), but that would make switchover a tedious process, as all
existing guests using PCI passthrough would need to be shut down prior to the change. As
long as there are no technical problems with allowing both types on the same host,
it's more flexible to choose on a device-by-device basis.
Now some questions:
1) Is there any reason that we shouldn't/can't allow both pci-assign and vfio-pci
at the same time on the same host (and even in the same guest)?
2) Does it make any sense to support a "managed='no'" mode for vfio, which
would skip steps 2-6 above? This would be parallel to the existing pci-assign
managed='no', where no unbinding/binding of the device to the host's pci-stub
driver is done, and the device name is simply passed to qemu on the assumption
that all of that work was already done. Or should <driver name='vfio'/>
automatically mean that all unbinding/binding is done for each device? (A
rough sketch of the managed='no' XML is at the end of this message.)
3) Is it at all bothersome that qemu must be the one opening the device node, and that
there is apparently no way to have libvirt open it and send the fd to qemu?
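For reference, here's a rough sketch of what the managed='no' variant from
question (2) might look like for the example device above - just an
illustration of the proposal, not settled syntax:

  <hostdev mode='subsystem' type='pci' managed='no'>
    <driver name='vfio'/>
    <source>
      <address domain='0x0000' bus='0x01' slot='0x10' function='0x0'/>
    </source>
  </hostdev>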