On Tue, 2013-03-12 at 13:20 -0400, Laine Stump wrote:
On 03/11/2013 02:06 PM, Alex Williamson wrote:
> On Mon, 2013-03-11 at 13:23 -0400, Laine Stump wrote:
>> VFIO is a new method of doing PCI device assignment ("PCI passthrough"
>> aka "<hostdev>") available in newish kernels (3.6?; it's in Fedora 18 at
>> any rate) and via the "vfio-pci" device in qemu-1.4+. In contrast to the
>> traditional KVM PCI device assignment (available via the "pci-assign"
>> device in qemu), VFIO works properly on systems using UEFI "Secure
>> Boot"; it also offers other advantages, such as grouping of related
>> devices that must all be assigned to the same guest (or not at all).
>> Here's some useful reading on the subject.
>>
>>
>> http://lwn.net/Articles/474088/
>> http://lwn.net/Articles/509153/
>>
>> Short description (from Alex Williamson's KVM Forum Presentation)
>>
>> 1) Assume this is the device you want to assign:
>> 01:10.0 Ethernet controller: Intel Corporation 82576
>> Virtual Function (rev 01)
>>
>> 2) Find the vfio group of this device:
>> # readlink /sys/bus/pci/devices/0000:01:10.0/iommu_group
>> ../../../../kernel/iommu_groups/15
>>
>> ==> IOMMU Group = 15
>>
>> 3) Check the devices in the group:
>> # ls /sys/bus/pci/devices/0000:01:10.0/iommu_group/devices/
>> 0000:01:10.0
>>
>> (so this group has only 1 device)
>>
>> 4) Unbind from device driver
>> # echo 0000:01:10.0 >/sys/bus/pci/devices/0000:01:10.0/driver/unbind
>>
>> 5) Find vendor & device ID
>> $ lspci -n -s 01:10.0
>> 01:10.0 0200: 8086:10ca (rev 01)
>>
>> 6) Bind to vfio-pci
>> # echo 8086 10ca > /sys/bus/pci/drivers/vfio-pci/new_id
>>
>> (this will result in a new device node "/dev/vfio/15", which is what
>> qemu will use to set up the device for passthrough)
>>
>> 7) chown the device node so it is accessible by qemu user:
>> # chown qemu /dev/vfio/15; chgrp qemu /dev/vfio/15
>>
>> (note that /dev/vfio/vfio, which is installed as 0600 root:root, must also be
>> made mode 0666, still owned by root - this is supposedly not dangerous)
> I'll look into this, the intention has always been that /dev/vfio/vfio
> is a safe interface that's only empowered when connected to
> a /dev/vfio/$GROUP, which implies some privileges.
>
>> 8) set the limit for locked memory equal to all of guest memory size + [some
>> amount large enough to encompass all of IO space]
>> # ulimit -l 2621440 # ((2048 + 512) * 1024)
>>
>> 9) pass to qemu using -device vfio-pci:
>>
>> sudo qemu-system-x86_64 -m 2048 -hda rhel6vm \
>> -vga std -vnc :0 -net none \
>> -enable-kvm \
>> -device vfio-pci,host=01:10.0,id=net0
>>
>> (qemu will then use something like step (2) to figure out which device node it
>> needs to use)
>>
>> Why the "ulimit -l"?
>> --------------------
>>
>> Any qemu guest that is using the old pci-assign must have *all* guest
>> memory and IO space locked in memory. Normally the maximum amount of
>> locked memory allowed for a process is controlled by "ulimit -l", but
>> in the case of pci-assign, the kvm kernel module has always just
>> ignored the -l limit and locked it all anyway.
>>
>> With vfio-pci, all guest memory and IO space must still be locked in
>> memory, but the vfio module *doesn't* ignore the process limits, so
>> libvirt will need to set ulimit -l for any guest that wants to do
>> vfio-based pci passthrough. Since (due to the possibility of hotplug)
>> we don't know at the time the qemu process is started whether or not
>> it might need to do a pci passthrough, we will need to use prlimit(2)
>> to modify the limit of the already-running qemu.
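
(Purely for illustration, here is a minimal sketch of what that prlimit(2)
call could look like from libvirt's side; the pid and the memory size are
made-up example values matching the numbers in step 8, not anything libvirt
would hard-code:)

/* Sketch: raise RLIMIT_MEMLOCK on an already-running qemu via prlimit(2).
 * Hypothetical example only; the pid and size are placeholders. */
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/resource.h>
#include <stdio.h>

static int
raise_memlock(pid_t qemu_pid, unsigned long long bytes)
{
    struct rlimit lim = { .rlim_cur = bytes, .rlim_max = bytes };

    /* new limit in; pass NULL since we don't need the old limit back */
    if (prlimit(qemu_pid, RLIMIT_MEMLOCK, &lim, NULL) < 0) {
        perror("prlimit(RLIMIT_MEMLOCK)");
        return -1;
    }
    return 0;
}

int
main(void)
{
    /* (2048 + 512) MiB, the same figure as the "ulimit -l" example above
     * (note prlimit() takes bytes, while "ulimit -l" takes kilobytes) */
    return raise_memlock(12345, (2048ULL + 512) * 1024 * 1024) < 0;
}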
>>
>>
>> Proposed XML Changes
>> --------------------
>>
>> To support vfio pci device assignment in libvirt, I'm thinking something
>> like this (note that the <driver> subelement is already used for
>> <interface> and <disk> to choose which backend to use for a particular
>> device):
>>
>> <hostdev managed='yes'>
>> <driver name='vfio'/>
>> ...
>> </hostdev>
>>
>> <interface type='hostdev' managed='yes'>
>> <driver name='vfio'/>
> vfio is the overall userspace driver framework while vfio-pci is the
> specific qemu driver we're using here. Does it make more sense to call
> this 'vfio-pci'? It's possible that we could later have a device tree
> qemu driver which would need to be involved with -device vfio-dt (or
> something) and have different options.
Would this new "vfio-dt" device be used for PCI devices?
No, dt would be non-PCI devices described by device tree.
I actually left
out an important attribute in my example:
<hostdev type='pci' managed='yes'>
<driver name='vfio'/>
...
</hostdev>
I may be leaning in your direction anyway though (or maybe such a future
difference could be handled by an extra attribute to <driver name='vfio'
.../>). (and this brings up the question of whether we should give a
name to / allow specifying in a <driver> element the current pci-assign.
Do we need to recognize "<driver name='pci-assign'/>" so that in some
future world where vfio is the default, it is still possible to force
use of pci-assign instead?)
My vision is that at some point pci-assign will go away and we can
remove it from the KVM kernel module. Obviously that will be after a
long vetting period of vfio-pci and long deprecation period. Part of
that vetting may be changing the default, at which point we'd need a way
to specify pci-assign if there are any incompatibilities. So yes, we
should probably create the option to specify pci-assign now.
>> ...
>> </hostdev>
>>
>> (this new use of <driver> inside <interface> wouldn't conflict with
>> the existing <driver name='qemu|vhost'>, since neither of those could
>> ever possibly be a valid choice for <interface type='hostdev'>. The
>> one possible problem would be if someone had an <interface
>> type='network'> which might possibly point to a hostdev or standard
>> bridged network, and wanted to make sure that in the case of a bridged
>> network, <driver name='qemu'/> was used. I suppose in this case,
>> the driver name in the network definition would override any driver
>> name in the interface?)
I'm still a bit bothered by this one. Does anybody have any comment
about it?
>> Speaking of <network>, here's how vfio would be specified in a hostdev
>> <network> definition:
>>
>> <network>
>> <name>vfio-net</name>
>> <forward mode='hostdev' managed='yes'>
>> <driver name='vfio'/>
>> <pf dev='eth3'/> <!-- or a list of VFs -->
>> </forward>
>> ...
>> </network>
>>
>> Another possibility for the <network> xml would be to add a
>> "driver='vfio'" to each individual <interface> line, in
case someone
>> wanted some devices in a pool to be asigned using vfio and some using
>> the old style, but that seems highly unlikely (and could create
>> problems in the future if we ever needed to add a 2nd attribute to the
>> <driver> element).
>>
>> Actually, at one point I considered that vfio should be turned on
>> globally in libvirtd.conf (or qemu.conf), but that would make
>> switchover a tedious process, as all existing guests using PCI
>> passthrough would need to be shut down prior to the change. As long as
>> there are no technical problems with allowing both types on the same
>> host, it's more flexible to choose on a device-by-device basis.
>>
>> Now some questions:
>>
>> 1) Is there any reason that we shouldn't/can't allow both pci-assign
>> and vfio-pci at the same time on the same host (and even guest)?
> vfio-pci and pci-assign can be mixed, but don't intermix devices within
> a group. Sometimes this will work (if the grouping is for isolation
> reasons), but sometimes it won't (when the grouping is for visibility).
> Best to just avoid that scenario.
Right. My intent is to encode group membership in the activePciHostDevs
array somehow. When doing vfio assignment, if any device in the same
group as the desired device has already been assigned (using either
pci-assign or vfio-pci), the vfio assignment of the new device would be
refused (unless the other device was assigned to the same domain).
Likewise, for pci-assign, if the desired device was in a group that
already belonged to some other domain (implying that some device in that
group had been assigned with vfio-pci), the assignment would fail.
Great
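
(For illustration only, a rough sketch of that group check; every type and
function name below is hypothetical, i.e. not actual libvirt code and not
the real activePciHostDevs representation:)

#include <stddef.h>
#include <string.h>

/* Hypothetical record of an already-assigned host PCI device. */
typedef struct {
    char addr[32];          /* e.g. "0000:01:10.0" */
    int iommu_group;        /* -1 if unknown */
    const char *domain;     /* name of the guest it is assigned to */
    int used_by_vfio;       /* 1 = vfio-pci, 0 = legacy pci-assign */
} ActivePciHostDev;

/* Return 0 if a device in 'group' may be assigned to 'domain' with vfio,
 * -1 if some other device in the group blocks the assignment. */
static int
checkVfioGroupConflict(const ActivePciHostDev *active, size_t nactive,
                       int group, const char *domain)
{
    size_t i;

    for (i = 0; i < nactive; i++) {
        if (active[i].iommu_group != group)
            continue;
        /* Another device from the same group is already in use: only
         * allow the new assignment if that device was also assigned
         * with vfio and to the same domain (never intermix drivers
         * within a group). */
        if (!active[i].used_by_vfio ||
            strcmp(active[i].domain, domain) != 0)
            return -1;
    }
    return 0;
}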
>> 2) Does it make any sense to support a "managed='no'" mode for vfio,
>> which skipped steps 2-6 above? (this would be parallel to the existing
>> pci-assign managed='no' (where no unbinding/binding of the device to
>> the host's pci-stub driver is done, but the device name is simply
>> passed to qemu assuming that all that work was already done)) Or
>> should <driver name='vfio'/> automatically mean that all
>> unbinding/binding be done for each device.
> I don't think it hurts to have it, but I can't think of a use case.
> Even with pci-assign, I can only think of cases where customers have
> used it to try to work around things they shouldn't be doing with it.
That's what I've always thought about "managed" for pci-assign too -
I've *never* had a situation where it seemed proper to *not* use it. I'm
guessing that unmanaged only exists for some sort of backward
compatibility reason.
>> 3) Is it at all bothersome that qemu must be the one opening the
>> device node, and that there is apparently no way to have libvirt open
>> it and send the fd to qemu?
> I have the same question. The architecture of vfio is that the user
> will open /dev/vfio/vfio (vfiofd) and add a group to it (groupfd).
> Multiple groupfds can be added to a single vfiofd, allowing groups to
> share IOMMU domains. However, it's not guaranteed that the IOMMU driver
> will allow this (the domains may be incompatible). Qemu will therefore
> attempt to add any new group to an existing vfiofd before re-opening a
> new one.
So it's unknown until runtime just how many times /dev/vfio/vfio will
need to be opened (i.e. how many fds will be needed).
Right. On x86 we'll likely always use a single vfiofd for all the
groups, but let's not build in that assumption somewhere for it to bite
us later.
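
(For reference, a stripped-down sketch of that open/attach sequence, based
on the kernel's Documentation/vfio.txt; the group number and device address
are the ones from the example at the top of the thread, and all error
handling is omitted:)

/* Sketch of how a vfio user (e.g. qemu) ties a group to a container;
 * follows Documentation/vfio.txt, error handling omitted. */
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

int
main(void)
{
    int container, group, device;
    struct vfio_group_status status = { .argsz = sizeof(status) };

    /* The container (vfiofd); one container can hold several groups
     * if the IOMMU driver lets them share a domain. */
    container = open("/dev/vfio/vfio", O_RDWR);

    /* The group fd for IOMMU group 15 from the example above. */
    group = open("/dev/vfio/15", O_RDWR);
    ioctl(group, VFIO_GROUP_GET_STATUS, &status);
    /* status.flags must have VFIO_GROUP_FLAGS_VIABLE set, i.e. every
     * device in the group is bound to vfio-pci (or unbound). */

    /* Attach the group to the container, then pick the IOMMU model. */
    ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
    ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

    /* Finally get a file descriptor for the actual device. */
    device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:01:10.0");

    return device < 0;
}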
> There's also the problem that a group has multiple devices, so
> if device A from group X gets added with vfiofd and groupXfd and libvirt
> then passes a new vfiofd' and groupXfd' for attaching device B, also
> from group X... what's qemu to do?
>
> So in order to pass file descriptors libvirt has to either know exactly
> how things are working or just always pass a vfiofd and groupfd, which
> qemu will discard if it doesn't need. The latter implies that fds could
> live on and be required past the point where the device that added them
> has been removed (in the example above, add A and qemu uses vfiofd and
> groupXfd, hot add B and qemu discards vfiofd' and groupXfd', remove A
> and qemu continues to use vfiofd and groupXfd for B).
Ugh. That doesn't sound very conducive to the whole "libvirt opens
everything required and just sends fds to qemu" model, so I guess at
least for now we'll have to chown/chmod the device nodes, send the pci
address string and let qemu open the devices itself.
Ok, for now I'll assume you don't need it, but we could later add
vfiofd=, groupfd= with the above condition that they may or may not get
used. libvirt would have to track which files are still in use if it
wanted to do any fd garbage collection.

Thanks,
Alex