Re: [libvirt] [PATCH] Allow a per-PCI passthrough device permissive attribute

28 Jan 2010


      On Thu, Jan 28, 2010 at 08:36:02AM -0500, Chris Lalancette wrote:
...
On 01/27/2010 01:49 PM, Daniel P. Berrange wrote:
...
On Wed, Jan 27, 2010 at 09:23:52AM -0500, Chris Lalancette wrote:
...
Currently there is a global tag to let the administrator
turn off system-wide ACS checking when doing PCI device
passthrough.  However, this is too coarse-grained of an
attribute, since it doesn't allow setups where certain
guests are trusted while other ones are untrusted.  Allow
more complicated setups by making the device checking
a per-device setting.
The more detailed explanation of why this is necessary
delves deep into PCIe internals.  Ideally we'd like
to be able to probe devices and figure out whether it
is safe to assign them.  In practice, this isn't possible
because PCIe allows devices to have "hidden" bridges
that software can't discover.  If you were to have
two devices assigned to two different domains behind
one of these hidden bridges, they could do P2P traffic
and bypass all of the VT-d/IOMMU checks.
The next thing we could try to do is to have a whitelist
of devices that we know to be safe.  For instance, instead
of using a "hidden" bridge, PCI devices can multiplex functions,
which causes all traffic to head to an upstream
bridge before P2P can take place.  Additionally, some
"hidden" PCI bridges may have ACS on-board.  In both of
these cases it's safe to pass through the device(s), since
they can't P2P without the IOMMU getting involved.
However, even if we did have a whitelist, I think we still
need a permissive attribute.  For one thing, the whitelist
will always be out of date with respect to new hardware,
so we'd need to allow administrators to temporarily
override the whitelist restriction until a new version of
the whitelist came out.  Also, we want to support the case
where the administrator knows it is safe to assign possibly
unsafe devices to a domain he trusts.
A domain is only trusted until its guest OS gets exploited, at which point
this proposed change may let it escape into the host. If you don't have
any IOMMU on your host, you can't use PCI device assignment with KVM at
all, because it would not be safe in the event of guest exploit /
misbehavior. The same is true of device assignment in this non-ACS + hidden
bridge case.
While this is true, I don't think this is the sort of policy we should be
enforcing in libvirt.  Witness the users who still use Xen PV PCI passthrough,
despite the fact that it is unsafe.  We still support them.  If KVM had the
ability to do non-IOMMU passthrough, I would think we would want to allow
that usage through libvirt.  Should we be safe by default?  Absolutely.  But
should we give administrators enough rope to hang themselves?  Yes, it's the
Unix way.
Since KVM does not provide for running without an IOMMU, I don't think we
should be allowing bypass of the IOMMU in one small edge case. Particularly
since the rules on when it is safe to do so are so obscure that no sysadmin
can realistically make a well-informed decision. There are plenty of other
constraints in PCI device assignment that libvirt is already checking and
enforcing, and we don't allow bypassing those either.
...
...
Thus I don't see why we should introduce a special "permissive" flag
solely for the non-ACS edge case, while at the same time not allowing
the same permissiveness for the far more common non-IOMMU case. NB, I'm
not suggesting we allow skipping of the checks for the non-IOMMU case 
either.
Keeping a whitelist of devices up2date wrt new hardware launches is no
more troublesome than the existing problem of updating the PCI-IDs database
or actually providing updated kernel releases with new drivers.
If we use the permissive attribute, then every admin with a device that
is known to be safe has the pain of setting the permissive attribute,
every time, on every machine with this hardware. If we have a whitelist,
then 99% of the time everything will just work because it will already
be known to the whitelist. If the whitelist were an external datafile
the admin could even extend it on the rare occasion when a new device
was not known. The choice is between making everyone solve this over & over
again themselves, or solving it once for everybody.
I don't really like the idea of a whitelist, but I like it more than just
pushing the problem onto admins via per guest flags. For that matter I
don't like the host level flag we have either and would rather we removed
it. If only there were a 3rd way that was neither flags nor whitelists ...
The problem that I tried (and failed) to explain in the commit message is
that even with a whitelist, it is not sufficient.  The problem is that
with multi-function PCI devices, software cannot tell whether it is possible
to do P2P traffic.  In the case that a manufacturer tells us "no, our device
doesn't allow P2P traffic", then great, we can add it to the whitelist.  But
if a manufacturer can't or won't tell us that, then we have to assume it can
do P2P traffic, and block it.  But these multi-function PCI devices are
*exactly* the devices we want to passthrough to guests (all multi-port NICs we
have tested are multi-function).  So even with the whitelist, you still have
to leave a way for the admin to override the check.
This is all coming down to 'it's hard, let's just have the user decide'
which is not a satisfactory strategy here. New multi-function devices
will have drivers written & submitted, and if they want them used it is
not unreasonable to expect the company providing drivers to indicate
whether the device allows P2P traffic or not. Or for enterprise OS
vendors to determine this, when deciding whether to support the hardware
in their distro. I accept it can't be discovered from hardware inquiry,
but that doesn't stop this being indicated in other ways, such as kernel
drivers exposing flags, or a maintained whitelist of PCI IDs. If an admin
desperately needs to override this, a whitelist still allows them to
add a further PCI ID in a single place, rather than changing *every*
guest XML that uses the device. This is fundamentally something that
needs to be tracked against the device's PCI ID, and not the guest config. 
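
To make the whitelist idea above concrete, here is a rough sketch of a
lookup keyed on PCI vendor:device ID, with a built-in list plus an
admin-supplied extension. Everything in it is an assumption made for
illustration: the file format, the function names (parse_whitelist,
is_assignable) and the placeholder IDs are hypothetical and are not
part of libvirt or of any patch in this thread.

```python
# Hypothetical sketch only: these names and this file format are
# illustrative assumptions, not libvirt code.

def parse_whitelist(text):
    """Parse 'vvvv:dddd' hex PCI IDs, one per line; '#' starts a comment."""
    ids = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        vendor, device = line.split(":")
        ids.add((int(vendor, 16), int(device, 16)))
    return ids


# Built-in list shipped with the software. The ID below is a placeholder
# (0xffff is not a valid vendor ID), not a statement about real hardware.
BUILTIN_WHITELIST = parse_whitelist("""
ffff:0001   # placeholder entry
""")


def is_assignable(vendor, device, admin_text=""):
    """Return True if the device ID is in the built-in or admin whitelist.

    admin_text stands in for the contents of an external datafile that
    the administrator can extend in a single place.
    """
    allowed = BUILTIN_WHITELIST | parse_whitelist(admin_text)
    return (vendor, device) in allowed
```

With this shape, an override for a not-yet-listed device is one extra
line in the admin datafile, and it applies to every guest that uses the
device without touching any guest XML.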

Daniel
-- 
|: Red Hat, Engineering, London   -o-   http://people.redhat.com/berrange/ :|
|: http://libvirt.org  -o-  http://virt-manager.org  -o-  http://ovirt.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505  -o-  F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|