Re: [libvirt] Proposal PCI/PCIe device placement on PAPR guests

13 Jan 2017

      On Thu, Jan 12, 2017 at 11:03:05AM -0500, Laine Stump wrote:
...
On 01/05/2017 12:46 AM, David Gibson wrote:
...
There was a discussion back in November on the qemu list which spilled
onto the libvirt list about how to add support for PCIe devices to
POWER VMs, specifically 'pseries' machine type PAPR guests.
Here's a more concrete proposal for how to handle part of this in
future from the libvirt side.  Strictly speaking what I'm suggesting
here isn't intrinsically linked to PCIe: it will make adding PCIe
support sanely easier, as well as having a number of advantages for
both PCIe and plain-PCI devices on PAPR guests.
Background:
* Currently the pseries machine type only supports vanilla PCI
    buses.
     * This is a qemu limitation, not something inherent - PAPR guests
       running under PowerVM (the IBM hypervisor) can use passthrough
       PCIe devices (PowerVM doesn't emulate devices though).
     * In fact the way PCI access is para-virtualized in PAPR makes the
       usual distinctions between PCI and PCIe largely disappear
  * Presentation of PCIe devices to PAPR guests is unusual
     * Unlike x86 and other "bare metal" platforms, root ports are
       not made visible to the guest. i.e. all devices (typically)
       appear as though they were integrated devices on x86
     * In terms of topology all devices will appear in a way similar to
       a vanilla PCI bus, even PCIe devices
        * However PCIe extended config space is accessible
     * This means libvirt's usual placement of PCIe devices is not
       suitable for PAPR guests
  * PAPR has its own hotplug mechanism
     * This is used instead of standard PCIe hotplug
     * This mechanism works for both PCIe and vanilla-PCI devices
     * This can hotplug/unplug devices even without a root port P2P
       bridge between them and the root bus
  * Multiple independent host bridges are routine on PAPR
     * Unlike PC (where all host bridges have multiplexed access to
       configuration space) PCI host bridges (PHBs) are truly
       independent for PAPR guests (disjoint MMIO regions in system
       address space)
     * PowerVM typically presents a separate PHB to the guest for each
       host slot passed through
The Proposal:
I suggest that libvirt implement a new default algorithm for placing
(i.e. assigning addresses to) both PCI and PCIe devices for (only)
PAPR guests.
The short summary is that by default it should assign each device to a
separate vPHB, creating vPHBs as necessary.
* For passthrough sometimes a group of host devices can't be safely
     isolated from each other - this is known as a (host) Partitionable
     Endpoint (PE).  In this case, if any device in the PE is passed
     through to a guest, the whole PE must be passed through to the
     same vPHB in the guest.  From the guest POV, each vPHB has exactly
     one (guest) PE.
   * To allow for hotplugged devices, libvirt should also add a number
     of additional, empty vPHBs (the PAPR spec allows for hotplug of
     PHBs, but this is not yet implemented in qemu).  When hotplugging
     a new device (or PE) libvirt should locate a vPHB which doesn't
     currently contain anything.
   * libvirt should only (automatically) add PHBs - never root ports or
     other PCI to PCI bridges
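The placement rule summarised above could be sketched roughly as follows. This is only an illustration under stated assumptions, not libvirt code: the names VPHB, place_devices and hotplug are hypothetical, and the host PE of each passthrough device is assumed to be known already (None stands for an emulated device with no host PE).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class VPHB:
    """Hypothetical model of one virtual PCI host bridge."""
    index: int
    devices: List[str] = field(default_factory=list)

    @property
    def empty(self) -> bool:
        return not self.devices


def place_devices(pe_of: Dict[str, Optional[int]], spare: int = 2) -> List[VPHB]:
    """Give each device (or each whole host PE) its own vPHB.

    pe_of maps a device name to its host PE id, or None for an
    emulated device. 'spare' empty vPHBs are appended for hotplug.
    """
    groups: Dict[Tuple[str, object], List[str]] = {}
    for dev, pe in pe_of.items():
        # Devices sharing a host PE must land on the same vPHB;
        # everything else gets a vPHB of its own.
        key = ("pe", pe) if pe is not None else ("dev", dev)
        groups.setdefault(key, []).append(dev)
    phbs = [VPHB(i, devs) for i, devs in enumerate(groups.values())]
    phbs += [VPHB(len(phbs) + i) for i in range(spare)]
    return phbs


def hotplug(phbs: List[VPHB], devices: List[str]) -> VPHB:
    """Attach a new device (or PE) to the first vPHB that is empty."""
    for phb in phbs:
        if phb.empty:
            phb.devices.extend(devices)
            return phb
    # Only PHBs are ever added automatically, and PHB hotplug is not
    # assumed to be available, so running out of spares is an error.
    raise RuntimeError("no empty vPHB available for hotplug")
```

Only vPHBs are ever created here; no root ports or PCI-to-PCI bridges appear, which matches the last point of the list.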
It's a bit unconventional to leave all but one slot of a controller unused,
Unconventional for x86, maybe.  It's been SOP on IBM Power for a
decade or more.  Both for PAPR guests and in some cases on the
physical hardware (AIUI many, though not all, Power systems used a
separate host bridge for each physical slot to ensure better isolation
between devices).
...
but your thinking makes sense. I don't think this will be as
large/disruptive of a change as you might be expecting - we already have
different addressing rules for automatically addressed vs. manually
addressed devices, as well as a framework in place to behave differently for
different PCI controllers (e.g. some support hotplug and others don't), and
to modify behavior based on machinetype / root bus model, so it should be
straightforward to make things behave as you outline above.
Actually, I had that impression, so I was hoping it wouldn't be too
bad to implement.  I'd really like to get this underway ASAP, so we
can build the PCIe support (both qemu and Power) around that.
...
(The first item in your list sounds exactly like VFIO iommu groups. Is that
how it's exposed on PPC?
Yes, for Power hosts and guests there's a 1-1 correspondence between
PEs and IOMMU groups.  Technically speaking, I believe the PE provides
more isolation guarantees than the IOMMU group, but they're generally
close enough in practice.
...
If so, libvirt already takes care of guaranteeing
that any devices in the same group aren't used by other guests or the host
during the time a guest is using a device.
Yes, I'm aware of that, that's not an aspect I was concerned about.

Although that said, last I heard there was a bug in libvirt which
on hot *un*plug could assign devices back to the host without waiting
for other assigned devices in the group, which can crash the host.
...
It doesn't automatically assign
the other devices to the guest though, since this could have unexpected
effects on host operation (the example that kept coming up when this was
originally discussed wrt vfio device assignment was the case where a disk
device in use on the host was attached to a controller in the same iommu
group as a USB controller that was going to be assigned to a guest -
silently assigning the disk controller to the guest would cause the host's
disk to suddenly become unusable).)
Um.. what!?  If something in the group is assigned to the guest, other
devices in the group MUST NOT be used by the host, regardless of
whether they are actually assigned to the guest or not.  In the
situation you describe the guest would control the IOMMU mappings for
the host's disk device making it no more usable, and a whole lot more
dangerous.

In fact I'm pretty sure VFIO won't let you do that: it won't let you
add the group to a VFIO container until all devices in the group are
bound to the VFIO stub driver instead of whatever host driver they
were using before.
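To make the rule concrete, here is a rough sketch of checking whether every device in an IOMMU group is bound to the VFIO stub driver. It is illustrative only: group_blockers and read_group_drivers are hypothetical helper names, and the check follows the rule as stated above (the kernel's own viability check is, as far as I know, somewhat more permissive, for example for devices with no driver bound). The sysfs paths are the standard Linux ones.

```python
import os
from typing import Dict, List, Optional

VFIO_DRIVER = "vfio-pci"


def read_group_drivers(group: int, sysfs: str = "/sys") -> Dict[str, Optional[str]]:
    """Map each PCI device in an IOMMU group to its bound driver name.

    Returns None as the driver for a device with no driver bound.
    """
    devdir = os.path.join(sysfs, "kernel", "iommu_groups", str(group), "devices")
    drivers: Dict[str, Optional[str]] = {}
    for bdf in sorted(os.listdir(devdir)):
        link = os.path.join(sysfs, "bus", "pci", "devices", bdf, "driver")
        if os.path.islink(link):
            drivers[bdf] = os.path.basename(os.readlink(link))
        else:
            drivers[bdf] = None
    return drivers


def group_blockers(drivers: Dict[str, Optional[str]]) -> List[str]:
    """Return the devices that still prevent the group from being used.

    Assumption: a device counts as a blocker unless it is bound to
    vfio-pci, mirroring the description in the text above.
    """
    return [bdf for bdf, drv in drivers.items() if drv != VFIO_DRIVER]
```

On a host, group_blockers(read_group_drivers(n)) would list the devices in group n that are still held by a host driver; an empty list means the whole group is ready to be handed to a guest.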
...
In order to handle migration, the vPHBs will need to be represented in
the domain XML, which will also allow the user to override this
topology if they want.
Advantages:
There are still some details I need to figure out w.r.t. handling PCIe
devices (on both the qemu and libvirt sides).  However the fact that
PAPR guests don't typically see PCIe root ports means that the normal
libvirt PCIe allocation scheme won't work.
Well, the "normal libvirt PCIe allocation scheme" assumes "normal PCIe" :-).
My point exactly.
...
This scheme has several
advantages with or without support for PCIe devices:
* Better performance for 32-bit devices
With multiple devices on a single vPHB they all must share a (fairly
small) 32-bit DMA/IOMMU window.  With separate PHBs they each have a
separate window.  PAPR guests have an always-on guest visible IOMMU.
* Better EEH handling for passthrough devices
EEH is an IBM hardware-assisted mechanism for isolating and safely
resetting devices experiencing hardware faults so they don't bring
down other devices or the system at large.  It's roughly similar to
PCIe AER in concept, but has a different IBM specific interface, and
works on both PCI and PCIe devices.
Currently the kernel interfaces for handling EEH events on passthrough
devices will only work if there is a single (host) iommu group in the
vfio container.  While lifting that restriction would be nice, it's
quite difficult to do so (it requires keeping state synchronized
between multiple host groups).  That also means that an EEH error on
one device could stop another device where that isn't required by the
actual hardware.
The unit of EEH isolation is a PE (Partitionable Endpoint) and
currently there is only one guest PE per vPHB.  Changing this might
also be possible, but is again quite complex and may result in
confusing and/or broken distinctions between groups for EEH isolation
and IOMMU isolation purposes.
Placing separate host groups in separate vPHBs sidesteps these
problems.
* Guest NUMA node assignment of devices
PAPR does not (and can't reasonably) use the pxb device.  Instead to
allocate devices to different guest NUMA nodes they should be placed
on different vPHBs.  Placing them on different PHBs by default allows
NUMA nodes to be assigned to those PHBs in a straightforward manner.
So far libvirt doesn't try to assign PCI addresses to devices according to
NUMA node, but assumes that the management application will manually address
devices that need to be put on a particular pxb (it's only since the recent
advent of the pxb that guests have become aware of multiple NUMA nodes).
Possibly in the future libvirt will attempt to automatically place devices
on a pxb that matches their NUMA node (if it exists). We don't want to force
use of pxb for all guests on a host that has multiple NUMA nodes though.
This might make more sense on PPC though since all devices are on a PHB, and
each PHB can have a NUMA node set.
I hope the connection of pxb to NUMA allocation isn't too tight in
libvirt.  pxb is essentially an x86 specific hack.  For PAPR guests
the correct way to assign a NUMA node to PCI devices is to put them on
separate vPHBs and set the node of the vPHB.  The point above is
noting that once this proposal is implemented, all we need to do to
add NUMA awareness is allow a NUMA node property on the vPHBs.
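A minimal sketch of why one device per vPHB makes this simple: once each vPHB carries a node, a device's NUMA node is just the node of the vPHB it sits on. The names below (VPHB, numa_node, device_numa_node) are hypothetical and only model the idea; they are not an existing qemu or libvirt interface.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class VPHB:
    """Hypothetical vPHB model with an optional guest NUMA node."""
    index: int
    numa_node: Optional[int] = None


def device_numa_node(device: str,
                     phb_of: Dict[str, int],
                     phbs: Dict[int, VPHB]) -> Optional[int]:
    """A device inherits the NUMA node of the vPHB it is placed on."""
    return phbs[phb_of[device]].numa_node
```

With one device (or PE) per vPHB, setting the node on a vPHB affects exactly that device, so no separate per-device NUMA setting is needed.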

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson