Re: [libvirt] Proposal PCI/PCIe device placement on PAPR guests

12 Jan 2017

On Thu, 12 Jan 2017 17:19:40 +1100
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
...
On 12/01/17 14:52, David Gibson wrote:
...
On Fri, Jan 06, 2017 at 12:57:58PM +0100, Greg Kurz wrote:
...
On Thu, 5 Jan 2017 16:46:18 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:
...
There was a discussion back in November on the qemu list which spilled
onto the libvirt list about how to add support for PCIe devices to
POWER VMs, specifically 'pseries' machine type PAPR guests.
Here's a more concrete proposal for how to handle part of this in
future from the libvirt side.  Strictly speaking what I'm suggesting
here isn't intrinsically linked to PCIe: it will make adding PCIe
support sanely easier, as well as having a number of advantages for
both PCIe and plain-PCI devices on PAPR guests.
Background:
* Currently the pseries machine type only supports vanilla PCI
   buses.
    * This is a qemu limitation, not something inherent - PAPR guests
      running under PowerVM (the IBM hypervisor) can use passthrough
      PCIe devices (PowerVM doesn't emulate devices though).
    * In fact the way PCI access is para-virtalized in PAPR makes the
      usual distinctions between PCI and PCIe largely disappear
 * Presentation of PCIe devices to PAPR guests is unusual
    * Unlike x86 - and other "bare metal" platforms, root ports are
      not made visible to the guest. i.e. all devices (typically)
      appear as though they were integrated devices on x86
    * In terms of topology all devices will appear in a way similar to
      a vanilla PCI bus, even PCIe devices
       * However PCIe extended config space is accessible
    * This means libvirt's usual placement of PCIe devices is not
      suitable for PAPR guests
 * PAPR has its own hotplug mechanism
    * This is used instead of standard PCIe hotplug
    * This mechanism works for both PCIe and vanilla-PCI devices
    * This can hotplug/unplug devices even without a root port P2P
      bridge between it and the root "bus
 * Multiple independent host bridges are routine on PAPR
    * Unlike PC (where all host bridges have multiplexed access to
      configuration space) PCI host bridges (PHBs) are truly
      independent for PAPR guests (disjoint MMIO regions in system
      address space)
    * PowerVM typically presents a separate PHB to the guest for each
      host slot passed through
The Proposal:
I suggest that libvirt implement a new default algorithm for placing
(i.e. assigning addresses to) both PCI and PCIe devices for (only)
PAPR guests.
The short summary is that by default it should assign each device to a
separate vPHB, creating vPHBs as necessary.
* For passthrough sometimes a group of host devices can't be safely
    isolated from each other - this is known as a (host) Partitionable
    Endpoint (PE).  In this case, if any device in the PE is passed
    through to a guest, the whole PE must be passed through to the
    same vPHB in the guest.  From the guest POV, each vPHB has exactly
    one (guest) PE.
  * To allow for hotplugged devices, libvirt should also add a number
    of additional, empty vPHBs (the PAPR spec allows for hotplug of
    PHBs, but this is not yet implemented in qemu).  When hotplugging
    a new device (or PE) libvirt should locate a vPHB which doesn't
    currently contain anything.
  * libvirt should only (automatically) add PHBs - never root ports or
    other PCI to PCI bridges
In order to handle migration, the vPHBs will need to be represented in
the domain XML, which will also allow the user to override this
topology if they want.
Advantages:
There are still some details I need to figure out w.r.t. handling PCIe
devices (on both the qemu and libvirt sides).  However the fact that
One such detail may be that PCIe devices should have the
"ibm,pci-config-space-type" property set to 1 in the DT,
for the driver to be able to access the extended config
space.
So, we have a bit of an oddity here.  It looks like we currently set
'ibm,pci-config-space-type' to 1 in the PHB, rather than individual
device nodes.  Which, AFAICT, is simply incorrect in terms of PAPR.
I asked Paul how to read the spec and this is rather correct but not enough
- having type=1 on a PHB means that extended access requests can go behind
it but underlying devices and bridges still need to have type=1 if they
support extended space. Having type set to 0 (or none at all) on a PHB
would mean that extended config space is not available on anything under
this PHB.
I have the very same understanding of the spec (LoPAPR March 2016):

R1–9.1.8–2. All IOAs that implement PCI-X Mode 2 or PCI Express must supply the “ibm,pci-con-
fig-space-type” property (see Section B.6.5.1.1.1‚ “Properties for Children of PCI Host Bridges‚” on
page 703).

Implementation Note: The “ibm,pci-config-space-type” property in Requirement R1–9.1.8–2 is added for
platforms that support I/O fabric and IOAs that implement PCI-X Mode 2, and PCI Express. To access the
extended configuration space provided by PCI-X Mode 2 and PCI Express, all I/O fabric leading up to an IOA
must support a 12-bit register number. In other words, if a platform implementation has a conventional PCI bridge
leading up to an IOA that implements PCI-X Mode 2, the platform will not be able to provide access to the
extended configuration space of that IOA. The “ibm,config-space-type” property in the IOA's OF node
is used by device drivers to determine if an IOA’s extended configuration space can be accessed.

and

B.6.5.1.1.1 Properties for Children of PCI Host Bridges

“ibm,pci-config-space-type”
property name: Indicates if the platform supports access to an extended configuration address space from the PHB
up to and including this node.
0 = Platform supports only an eight bit register number for configuration address space accesses.
1 = Platform supports a twelve bit register number for configuration address space accesses.
This property may be provided in all PHB nodes and their children.
Note: The absence of this property implies the platform supports only an eight bit register number for configura-
tion address space accesses.


And incidentally, this is what the linux kernel currently expects. See these lines
from arch/powerpc/kernel/pci_dn.c:

struct pci_dn *pci_add_device_node_info(struct pci_controller *hose,
                                        struct device_node *dn)
{
        const __be32 *type = of_get_property(dn, "ibm,pci-config-space-type", NULL);
.
.
.
        /* Extended config space */
        pdn->pci_ext_config_space = (type && of_read_number(type, 1) == 1);


I had to rework Alexey's "spapr_pci: Create PCI-express root bus  by default"
patch to be able to see the extended config space of a vfio-pci device:

--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -1052,6 +1052,9 @@ static int spapr_populate_pci_child_dt(PCIDevice *dev, void *fdt, int offset,
     _FDT(fdt_setprop(fdt, offset, "reg", (uint8_t *)rp.reg, rp.reg_len));
     _FDT(fdt_setprop(fdt, offset, "assigned-addresses",
                      (uint8_t *)rp.assigned, rp.assigned_len));
+    if (sphb->pcie_root && pci_is_express(dev)) {
+        _FDT(fdt_setprop_cell(fdt, offset, "ibm,pci-config-space-type", 0x1));
+    }
 
     return 0;
 }