Re: [libvirt] Proposal PCI/PCIe device placement on PAPR guests

Friday, 6 January 2017

On Thu, 5 Jan 2017 16:46:18 +1100
David Gibson <david(a)gibson.dropbear.id.au> wrote:

...
 There was a discussion back in November on the qemu list which spilled
 onto the libvirt list about how to add support for PCIe devices to
 POWER VMs, specifically 'pseries' machine type PAPR guests.

 Here's a more concrete proposal for how to handle part of this in
 future from the libvirt side.  Strictly speaking what I'm suggesting
 here isn't intrinsically linked to PCIe, but it will make adding PCIe
 support considerably easier, and it has a number of advantages for
 both PCIe and plain-PCI devices on PAPR guests.

 Background:

  * Currently the pseries machine type only supports vanilla PCI
    buses.
     * This is a qemu limitation, not something inherent - PAPR guests
       running under PowerVM (the IBM hypervisor) can use passthrough
       PCIe devices (PowerVM doesn't emulate devices though).
     * In fact the way PCI access is para-virtualized in PAPR makes the
       usual distinctions between PCI and PCIe largely disappear
  * Presentation of PCIe devices to PAPR guests is unusual
     * Unlike x86 and other "bare metal" platforms, root ports are
       not made visible to the guest, i.e. all devices (typically)
       appear as though they were integrated devices on x86
     * In terms of topology all devices will appear in a way similar to
       a vanilla PCI bus, even PCIe devices
        * However PCIe extended config space is accessible
     * This means libvirt's usual placement of PCIe devices is not
       suitable for PAPR guests
  * PAPR has its own hotplug mechanism
     * This is used instead of standard PCIe hotplug
     * This mechanism works for both PCIe and vanilla-PCI devices
     * This can hotplug/unplug a device even without a root port P2P
       bridge between it and the root bus
  * Multiple independent host bridges are routine on PAPR
     * Unlike PC (where all host bridges have multiplexed access to
       configuration space) PCI host bridges (PHBs) are truly
       independent for PAPR guests (disjoint MMIO regions in system
       address space)
     * PowerVM typically presents a separate PHB to the guest for each
       host slot passed through

 The Proposal:

 I suggest that libvirt implement a new default algorithm for placing
 (i.e. assigning addresses to) both PCI and PCIe devices for (only)
 PAPR guests.

 The short summary is that by default it should assign each device to a
 separate vPHB, creating vPHBs as necessary.

   * For passthrough sometimes a group of host devices can't be safely
     isolated from each other - this is known as a (host) Partitionable
     Endpoint (PE).  In this case, if any device in the PE is passed
     through to a guest, the whole PE must be passed through to the
     same vPHB in the guest.  From the guest POV, each vPHB has exactly
     one (guest) PE.
   * To allow for hotplugged devices, libvirt should also add a number
     of additional, empty vPHBs (the PAPR spec allows for hotplug of
     PHBs, but this is not yet implemented in qemu).  When hotplugging
     a new device (or PE) libvirt should locate a vPHB which doesn't
     currently contain anything.
   * libvirt should only (automatically) add PHBs - never root ports or
     other PCI to PCI bridges
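
The placement rules above can be sketched roughly as follows. This is only
an illustration, not libvirt code: the VPHB class, place_devices(),
hotplug() and the default of two spare vPHBs are all hypothetical names and
values, and each host PE is assumed to be given as a list of PCI addresses.

```python
from dataclasses import dataclass, field


@dataclass
class VPHB:
    """One virtual PCI host bridge; it holds exactly one guest PE."""
    index: int
    devices: list = field(default_factory=list)


def place_devices(host_pes, spare=2):
    """Give each host PE its own vPHB, then append `spare` empty vPHBs.

    Only PHBs are created: no root ports, no PCI-to-PCI bridges.
    """
    vphbs = [VPHB(i, list(pe)) for i, pe in enumerate(host_pes)]
    base = len(vphbs)
    vphbs += [VPHB(base + i) for i in range(spare)]
    return vphbs


def hotplug(vphbs, pe):
    """Put a hotplugged device (or whole PE) on the first empty vPHB."""
    for phb in vphbs:
        if not phb.devices:
            phb.devices = list(pe)
            return phb
    # PHB hotplug is not implemented in qemu, so we cannot add a new vPHB.
    raise RuntimeError("no empty vPHB left")
```

A PE made of two functions ends up on one vPHB, while an unrelated device
gets a vPHB of its own; once the spare vPHBs are used up, hotplug fails
rather than sharing a vPHB.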

 In order to handle migration, the vPHBs will need to be represented in
 the domain XML, which will also allow the user to override this
 topology if they want.

 Advantages:

 There are still some details I need to figure out w.r.t. handling PCIe
 devices (on both the qemu and libvirt sides).  However the fact that 
One such detail may be that PCIe devices should have the
"ibm,pci-config-space-type" property set to 1 in the DT,
for the driver to be able to access the extended config
space.

...
 PAPR guests don't typically see PCIe root ports means that the normal
 libvirt PCIe allocation scheme won't work.  This scheme has several
 advantages with or without support for PCIe devices:

  * Better performance for 32-bit devices

 With multiple devices on a single vPHB they all must share a (fairly
 small) 32-bit DMA/IOMMU window.  With separate PHBs they each have a
 separate window.  PAPR guests have an always-on guest visible IOMMU.

  * Better EEH handling for passthrough devices

 EEH is an IBM hardware-assisted mechanism for isolating and safely
 resetting devices experiencing hardware faults so they don't bring
 down other devices or the system at large.  It's roughly similar to
 PCIe AER in concept, but has a different IBM specific interface, and
 works on both PCI and PCIe devices.

 Currently the kernel interfaces for handling EEH events on passthrough
 devices will only work if there is a single (host) iommu group in the
 vfio container.  While lifting that restriction would be nice, it's
 quite difficult to do so (it requires keeping state synchronized
 between multiple host groups).  That also means that an EEH error on
 one device could stop another device where that isn't required by the
 actual hardware.

 The unit of EEH isolation is a PE (Partitionable Endpoint) and
 currently there is only one guest PE per vPHB.  Changing this might
 also be possible, but is again quite complex and may result in
 confusing and/or broken distinctions between groups for EEH isolation
 and IOMMU isolation purposes.

 Placing separate host groups in separate vPHBs sidesteps these
 problems.
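
As a rough sketch of "separate host groups in separate vPHBs", passthrough
devices could be split by host IOMMU group before placement. The helper
names below are hypothetical; the sketch assumes the host exposes each
device's group as the iommu_group symlink under /sys/bus/pci/devices, and
it takes the lookup as a parameter so it can be exercised without real
hardware.

```python
import os
from collections import defaultdict


def host_iommu_group(pci_addr, sysfs="/sys/bus/pci/devices"):
    """Return the host IOMMU group number of a PCI device, read from sysfs."""
    link = os.path.join(sysfs, pci_addr, "iommu_group")
    return int(os.path.basename(os.path.realpath(link)))


def split_by_group(pci_addrs, group_of=host_iommu_group):
    """Return {group: [devices]}; each entry is meant for its own vPHB."""
    groups = defaultdict(list)
    for addr in pci_addrs:
        groups[group_of(addr)].append(addr)
    return dict(groups)
```

Each resulting list would then go into its own vfio container and its own
vPHB, so an EEH event on one group has no reason to stop a device in
another.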

  * Guest NUMA node assignment of devices

 PAPR does not (and can't reasonably) use the pxb device.  Instead, to
 allocate devices to different guest NUMA nodes they should be placed
 on different vPHBs.  Placing them on different PHBs by default allows
 a NUMA node to be assigned to each of those PHBs in a straightforward
 manner.
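
With one vPHB per device, NUMA assignment reduces to a per-vPHB setting.
The sketch below builds one qemu -device argument per vPHB; the function
name is hypothetical, and it assumes that qemu's spapr-pci-host-bridge
device accepts "index" and "numa_node" properties, which should be checked
against the qemu version in use.

```python
def phb_device_args(phb_to_node):
    """Build one -device argument per vPHB, ordered by vPHB index.

    phb_to_node maps a vPHB index to a guest NUMA node, or to None
    when the vPHB should not be tied to any node.
    """
    args = []
    for index, node in sorted(phb_to_node.items()):
        arg = f"spapr-pci-host-bridge,index={index}"
        if node is not None:
            # Assumed property name; see the note above.
            arg += f",numa_node={node}"
        args.append(arg)
    return args
```

For example, phb_device_args({1: 0, 2: 1}) yields two host bridges, one
bound to each guest node.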
