On 06/24/2013 06:06 AM, Daniel P. Berrange wrote:
On Mon, Jun 24, 2013 at 05:54:49AM -0400, Laine Stump wrote:
> When I first put in support for VFIO device assignment, I didn't
> realize that groups of devices were quite as common as they actually
> are. In particular, I didn't know that often multiple
> seemingly-unrelated devices can end up in the same VFIO iommu group
> due to unlucky circumstances of hardware - they may share a dma
> controller which means that the devices can't truly be isolated from
> each other, and thus should not be simultaneously assigned to
> different guests (or even used by the host) - all of the devices in a
> group should be either assigned to the same guest or, if not assigned
> to the guest, should be isolated off in a driver to prevent them
> from being used by the host.
>
> The following set of patches makes setting that up easier to deal
> with. The end result of all the patches is the following:
>
> 1) The virNodeDevice API will be able to detach or re-attach all the
> devices in a particular group with a single API call.
>
> 2) <hostdev managed='yes'>, <interface type='hostdev'
managed='yes'>,
> and <interface type='network' managed='yes'> devices (where
the
> network is itself a pool of SRIOV Virtual Functions) can specify:
>
> <driver name='vfio' group='auto'/>
>
> and libvirt will automatically detach (and bind to the 'vfio-pci'
> driver for assignment/isolation) all devices in the same group as
> the device being assigned. Likewise, when the device it detached
> from the guest, a check will be made and, if none of the devices in
> the same group as the device being detach is still in use by a guest
I am concerned that group='auto' is a really incredibly dangerous
setting from the POV of operation of the host OS.
I can just imagine forum postings / docs saying to use group=auto
and people blindly following it without much inclination as to
what will happen. They will be trying to assign a spare NIC to
their guest; they'll get an error saying it can't be done since it
is part of a group; they'll search google and find a recommendation
to use group=auto to "fix" the problem. libvirt will see that their
SATA controller & graphics card are part of the same group as the
NIC and automatically detach them both from the host OS. Kaboom,
the user is screwed.
Yes, I understand (and share) your concern. These patches grew out of
claims that there would be a regression in behavior if "managed='yes'"
stopped working properly with devices that were in a group, and this was
the only way I could see to make it work.
With traditional configs, even with managed=yes, you could be sure
that only the single device in the XML would ever be touched. If
there was a conflict due to other devices being on the same PCI
bridge without FLR, then the device would safely fail to be assigned
until the user had explicitly disconnected other devices from the
host. We never attempted to automatically disconnect anything that
was not part of the XML
Following on from that, how does an application determine what
other devices are present in the group associated with the device
being assigned ? Are we exposing group membership info in the node
device XML anywhere ?
Yes. nodedev-dumpxml now has the following information for every device
that's in a group (this is added in Patch 20/22):
<iommuGroup number='12'>
<address domain='0x0000' bus='0x02' slot='0x00'
function='0x0'/>
<address domain='0x0000' bus='0x02' slot='0x00'
function='0x1'/>
</iommuGroup>
I'm not sure what else to suggest, other than to say we should
not
add this attribute, and require that the application/user explicitly
disconnect any other devices in the same group from the host OS. Any
other option I can think of just sounds too dangerous.
I would suggest adding some sort of "assignment white list" to libvirt's
config, and requiring any device manipulated in any manner to be on that
white list, but in a way it's already possible to create such a list -
just manually detach all devices that will be assigned to a guest (and
any devices in the same iommu groups) and stop using managed='yes'.
(Just doing that was my original plan, but it had opposition due to the
perceived regression in behavior.)
Of course none of this device group management is required for what I
see (maybe mistakenly so?) as the most common use of PCI device
assignment - SRIOV Virtual Functions; each VF is always by itself in a
group, so managed assignment will work for them with no problem. It's
the odd devices integrated on the motherboard chipset, and cards plugged
into certain combinations of slots, that end up being in groups.
So what is usable/useful from this patchset? I think the addition of
group information to nodedev-dumpxml is useful, and possibly the
virNodeDeviceReAttachFlags() should be added in case we later come up
with a different solution that requires passing a flag to
virNodeDeviceReAttach() and need to backport it. And of course there are
many patches in here that can be considered cleanups/bugfixes (4 is a
bugfix, everything else up to 10 is a general cleanup, and 11 has
utility functions useful for 20 (which adds the group information to
nodedev-dumpxml). Really, everything except the patches marked with *
are safe/useful:
01 syntax: virPCIDeviceFree is also a NOP for NULL args
02 pci: change stubDriver from const char* to char*
03 pci: new utility functions
04 pci: eliminate memory leak in virPCIDeviceReattach
05 pci: make virPCIDeviceDetach consistent in behavior
06 pci: eliminate repetitive path constructions in
virPCIDeviceBindToStub
07 pci: eliminate unused driver arg from virPCIDeviceDetach
08 pci: update stubDriver name in virPCIDeviceBindToStub
09 pci: rename virPCIDeviceGetVFIOGroupDev to
virPCIDeviceGetIOMMUGroupDev
10 pci: make virPCIParseDeviceAddress public
11 pci: new iommu_group functions
* 12 pci: optionally detach/reattach all devices in a VFIO group
* 13 API & qemu: add ability to detach an entire VFIO group of devices
* 14 virsh: add option to detach entire group of devices
* 15 API: new virNodeDeviceReAttachFlags
* 16 API: implement RPC calls for virNodeDeviceReAttachFlags
* 17 qemu: implement virNodeDeviceReAttachFlags
* 18 xen: implement virNodeDeviceReAttachFlags
* 19 virsh: add option to attach entire group of devices
20 nodedev: add iommuGroup to node device object
* 21 conf: add <driver group='auto'> to hostdev, interface, and networks
* 22 qemu: implement backend of <driver group='auto'/>
Or does someone see a reasonable way to make some modification of the
existing stuff less dangerous? Maybe a simple whitelist could be
workable (but where should it go?) Or is there any way to determine
which devices can safely be detached from their host drivers without
disrupting the system? Alex, any ideas?