Re: [libvirt] [Qemu-ppc] Proposal PCI/PCIe device placement on PAPR guests

Wednesday, 22 February 2017

David,

I don't see the "spapr_pci: Allow PCI-Express devices" patch in your
ppc-for-2.9 tree. Do you still consider merging it ?

Cheers.

--
Greg

On Fri, 13 Jan 2017 09:58:28 +0100
Greg Kurz <groug(a)kaod.org&gt; wrote:

...
 On Fri, 13 Jan 2017 15:48:31 +1100
 David Gibson <david(a)gibson.dropbear.id.au&gt; wrote:

 > On Thu, Jan 12, 2017 at 10:09:03AM +0100, Greg Kurz wrote:  
 > > On Thu, 12 Jan 2017 17:19:40 +1100
 > > Alexey Kardashevskiy <aik(a)ozlabs.ru&gt; wrote:
 > >     
 > > > On 12/01/17 14:52, David Gibson wrote:    
 > > > > On Fri, Jan 06, 2017 at 12:57:58PM +0100, Greg Kurz wrote:      
 > > > >> On Thu, 5 Jan 2017 16:46:18 +1100
 > > > >> David Gibson <david(a)gibson.dropbear.id.au&gt; wrote:
 > > > >>      
 > > > >>> There was a discussion back in November on the qemu list
which spilled
 > > > >>> onto the libvirt list about how to add support for PCIe
devices to
 > > > >>> POWER VMs, specifically 'pseries' machine type PAPR
guests.
 > > > >>>
 > > > >>> Here's a more concrete proposal for how to handle part of
this in
 > > > >>> future from the libvirt side.  Strictly speaking what I'm
suggesting
 > > > >>> here isn't intrinsically linked to PCIe: it will make
adding PCIe
 > > > >>> support sanely easier, as well as having a number of
advantages for
 > > > >>> both PCIe and plain-PCI devices on PAPR guests.
 > > > >>>
 > > > >>> Background:
 > > > >>>
 > > > >>>  * Currently the pseries machine type only supports vanilla
PCI
 > > > >>>    buses.
 > > > >>>     * This is a qemu limitation, not something inherent -
PAPR guests
 > > > >>>       running under PowerVM (the IBM hypervisor) can use
passthrough
 > > > >>>       PCIe devices (PowerVM doesn't emulate devices
though).
 > > > >>>     * In fact the way PCI access is para-virtalized in PAPR
makes the
 > > > >>>       usual distinctions between PCI and PCIe largely
disappear
 > > > >>>  * Presentation of PCIe devices to PAPR guests is unusual
 > > > >>>     * Unlike x86 - and other "bare metal"
platforms, root ports are
 > > > >>>       not made visible to the guest. i.e. all devices
(typically)
 > > > >>>       appear as though they were integrated devices on x86
 > > > >>>     * In terms of topology all devices will appear in a way
similar to
 > > > >>>       a vanilla PCI bus, even PCIe devices
 > > > >>>        * However PCIe extended config space is accessible
 > > > >>>     * This means libvirt's usual placement of PCIe
devices is not
 > > > >>>       suitable for PAPR guests
 > > > >>>  * PAPR has its own hotplug mechanism
 > > > >>>     * This is used instead of standard PCIe hotplug
 > > > >>>     * This mechanism works for both PCIe and vanilla-PCI
devices
 > > > >>>     * This can hotplug/unplug devices even without a root
port P2P
 > > > >>>       bridge between it and the root "bus
 > > > >>>  * Multiple independent host bridges are routine on PAPR
 > > > >>>     * Unlike PC (where all host bridges have multiplexed
access to
 > > > >>>       configuration space) PCI host bridges (PHBs) are truly
 > > > >>>       independent for PAPR guests (disjoint MMIO regions in
system
 > > > >>>       address space)
 > > > >>>     * PowerVM typically presents a separate PHB to the guest
for each
 > > > >>>       host slot passed through
 > > > >>>
 > > > >>> The Proposal:
 > > > >>>
 > > > >>> I suggest that libvirt implement a new default algorithm for
placing
 > > > >>> (i.e. assigning addresses to) both PCI and PCIe devices for
(only)
 > > > >>> PAPR guests.
 > > > >>>
 > > > >>> The short summary is that by default it should assign each
device to a
 > > > >>> separate vPHB, creating vPHBs as necessary.
 > > > >>>
 > > > >>>   * For passthrough sometimes a group of host devices
can't be safely
 > > > >>>     isolated from each other - this is known as a (host)
Partitionable
 > > > >>>     Endpoint (PE).  In this case, if any device in the PE is
passed
 > > > >>>     through to a guest, the whole PE must be passed through
to the
 > > > >>>     same vPHB in the guest.  From the guest POV, each vPHB
has exactly
 > > > >>>     one (guest) PE.
 > > > >>>   * To allow for hotplugged devices, libvirt should also add
a number
 > > > >>>     of additional, empty vPHBs (the PAPR spec allows for
hotplug of
 > > > >>>     PHBs, but this is not yet implemented in qemu).  When
hotplugging
 > > > >>>     a new device (or PE) libvirt should locate a vPHB which
doesn't
 > > > >>>     currently contain anything.
 > > > >>>   * libvirt should only (automatically) add PHBs - never root
ports or
 > > > >>>     other PCI to PCI bridges
 > > > >>>
 > > > >>> In order to handle migration, the vPHBs will need to be
represented in
 > > > >>> the domain XML, which will also allow the user to override
this
 > > > >>> topology if they want.
 > > > >>>
 > > > >>> Advantages:
 > > > >>>
 > > > >>> There are still some details I need to figure out w.r.t.
handling PCIe
 > > > >>> devices (on both the qemu and libvirt sides).  However the
fact that      
 > > > >>
 > > > >> One such detail may be that PCIe devices should have the
 > > > >> "ibm,pci-config-space-type" property set to 1 in the
DT,
 > > > >> for the driver to be able to access the extended config
 > > > >> space.      
 > > > > 
 > > > > So, we have a bit of an oddity here.  It looks like we currently set
 > > > > 'ibm,pci-config-space-type' to 1 in the PHB, rather than
individual
 > > > > device nodes.  Which, AFAICT, is simply incorrect in terms of PAPR.  

 > > > 
 > > > 
 > > > I asked Paul how to read the spec and this is rather correct but not
enough
 > > > - having type=1 on a PHB means that extended access requests can go
behind
 > > > it but underlying devices and bridges still need to have type=1 if they
 > > > support extended space. Having type set to 0 (or none at all) on a PHB
 > > > would mean that extended config space is not available on anything under
 > > > this PHB.
 > > >     
 > > 
 > > I have the very same understanding of the spec (LoPAPR March 2016):
 > > 
 > > R1–9.1.8–2. All IOAs that implement PCI-X Mode 2 or PCI Express must supply the
“ibm,pci-con-
 > > fig-space-type” property (see Section B.6.5.1.1.1‚ “Properties for Children of
PCI Host Bridges‚” on
 > > page 703).
 > > 
 > > Implementation Note: The “ibm,pci-config-space-type” property in Requirement
R1–9.1.8–2 is added for
 > > platforms that support I/O fabric and IOAs that implement PCI-X Mode 2, and PCI
Express. To access the
 > > extended configuration space provided by PCI-X Mode 2 and PCI Express, all I/O
fabric leading up to an IOA
 > > must support a 12-bit register number. In other words, if a platform
implementation has a conventional PCI bridge
 > > leading up to an IOA that implements PCI-X Mode 2, the platform will not be
able to provide access to the
 > > extended configuration space of that IOA. The “ibm,config-space-type” property
in the IOA's OF node
 > > is used by device drivers to determine if an IOA’s extended configuration space
can be accessed.
 > > 
 > > and
 > > 
 > > B.6.5.1.1.1 Properties for Children of PCI Host Bridges
 > > 
 > > “ibm,pci-config-space-type”
 > > property name: Indicates if the platform supports access to an extended
configuration address space from the PHB
 > > up to and including this node.
 > > 0 = Platform supports only an eight bit register number for configuration
address space accesses.
 > > 1 = Platform supports a twelve bit register number for configuration address
space accesses.
 > > This property may be provided in all PHB nodes and their children.
 > > Note: The absence of this property implies the platform supports only an eight
bit register number for configura-
 > > tion address space accesses.
 > > 
 > > 
 > > And incidentally, this is what the linux kernel currently expects. See these
lines
 > > from arch/powerpc/kernel/pci_dn.c:
 > > 
 > > struct pci_dn *pci_add_device_node_info(struct pci_controller *hose,
 > >                                         struct device_node *dn)
 > > {
 > >         const __be32 *type = of_get_property(dn,
"ibm,pci-config-space-type", NULL);
 > > .
 > > .
 > > .
 > >         /* Extended config space */
 > >         pdn->pci_ext_config_space = (type && of_read_number(type, 1)
== 1);    
 > 
 > Ok, thanks for the information.
 >   
 > > I had to rework Alexey's "spapr_pci: Create PCI-express root bus  by
default"
 > > patch to be able to see the extended config space of a vfio-pci device:    
 > 
 > Ah!  Is there an easy command line way to verify that extended config
 > space is accessible?
 >   

 'lspci -vv' prints the capabilities from the config space with their
 hex offset between [ ]. Extended config space starts at offset 0x100.

 > > --- a/hw/ppc/spapr_pci.c
 > > +++ b/hw/ppc/spapr_pci.c
 > > @@ -1052,6 +1052,9 @@ static int spapr_populate_pci_child_dt(PCIDevice *dev,
void *fdt, int offset,
 > >      _FDT(fdt_setprop(fdt, offset, "reg", (uint8_t *)rp.reg,
rp.reg_len));
 > >      _FDT(fdt_setprop(fdt, offset, "assigned-addresses",
 > >                       (uint8_t *)rp.assigned, rp.assigned_len));
 > > +    if (sphb->pcie_root && pci_is_express(dev)) {
 > > +        _FDT(fdt_setprop_cell(fdt, offset,
"ibm,pci-config-space-type", 0x1));
 > > +    }
 > >  
 > >      return 0;
 > >  }    
 > 
 > I'm looking at merging the patch below into ppc-for-2.9 shortly.  It
 > basically combines a revised version of the pcie option part of
 > Alexey's patch with your DT patch above.  For now it only allows PCIe
 > with an explicit option, changing the default waits on more
 > investigation of how to not break things with libvirt.
 > 
 > From e14702a0bb76900077d82510280e1a023c384b72 Mon Sep 17 00:00:00 2001
 > From: David Gibson <david(a)gibson.dropbear.id.au&gt;
 > Date: Fri, 13 Jan 2017 15:45:30 +1100
 > Subject: [PATCH] spapr_pci: Allow PCI-Express devices
 > 
 > Currently the PCI host bridge (PHB) for the pseries machine type always
 > creates a plain PCI bus (TYPE_PCI_BUS).  This means that qemu won't allow
 > PCIe devices to be attached, and consequently, PCIe devices can't be used
 > on pseries.
 >   

 Well, PCIe devices can still be attached actually (at least I could attach
 a vfio-pci device linked to a PCIe device on the host). But, they appear
 as legacy PCI devices in the guest, without the extended config space.

 > This limitation isn't inherent in the PAPR specification.  In fact, because
 > access to the PCI bus is paravirtualized via hypercalls, from the guest
 > point of view there is very little difference between a plain PCI and PCIe
 > bus under PAPR.
 > 
 > So, to allow PCIe devices, add a "pcie" option to the PHB device, which
 > when enabled creates a PCIe bus.  In addition, on PCIe devices under such
 > a PHB, we publish the device tree flag which indicates that the guest may
 > access PCIe extended config space (via RTAS calls which we've already
 > implemented).
 > 
 > Note that this doesn't change the default PHB to PCIe.  We probably want to
 > do that in future, but we need to sort out some possible breakages with
 > libvirt and existing users first.  This at least lets knowledgeable users
 > experiment with PCIe devices in the meantime.
 > 
 > Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au&gt;  

 FWIW,

 Reviewed-by: Greg Kurz <groug(a)kaod.org&gt;

 > ---
 >  hw/ppc/spapr_pci.c          | 8 +++++++-
 >  include/hw/pci-host/spapr.h | 1 +
 >  2 files changed, 8 insertions(+), 1 deletion(-)
 > 
 > diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
 > index fd6fc1d..d17df1f 100644
 > --- a/hw/ppc/spapr_pci.c
 > +++ b/hw/ppc/spapr_pci.c
 > @@ -1053,6 +1053,10 @@ static int spapr_populate_pci_child_dt(PCIDevice *dev, void
*fdt, int offset,
 >      _FDT(fdt_setprop(fdt, offset, "assigned-addresses",
 >                       (uint8_t *)rp.assigned, rp.assigned_len));
 >  
 > +    if (sphb->pcie && pci_is_express(dev)) {
 > +        _FDT(fdt_setprop_cell(fdt, offset, "ibm,pci-config-space-type",
0x1));
 > +    }
 > +
 >      return 0;
 >  }
 >  
 > @@ -1434,7 +1438,8 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
 >      bus = pci_register_bus(dev, NULL,
 >                             pci_spapr_set_irq, pci_spapr_map_irq, sphb,
 >                             &sphb->memspace, &sphb->iospace,
 > -                           PCI_DEVFN(0, 0), PCI_NUM_PINS, TYPE_PCI_BUS);
 > +                           PCI_DEVFN(0, 0), PCI_NUM_PINS,
 > +                           sphb->pcie ? TYPE_PCIE_BUS : TYPE_PCI_BUS);
 >      phb->bus = bus;
 >      qbus_set_hotplug_handler(BUS(phb->bus), DEVICE(sphb), NULL);
 >  
 > @@ -1592,6 +1597,7 @@ static Property spapr_phb_properties[] = {
 >      DEFINE_PROP_UINT32("numa_node", sPAPRPHBState, numa_node, -1),
 >      DEFINE_PROP_BOOL("pre-2.8-migration", sPAPRPHBState,
 >                       pre_2_8_migration, false),
 > +    DEFINE_PROP_BOOL("pcie", sPAPRPHBState, pcie, false),
 >      DEFINE_PROP_END_OF_LIST(),
 >  };
 >  
 > diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
 > index 092294e..deed2cc 100644
 > --- a/include/hw/pci-host/spapr.h
 > +++ b/include/hw/pci-host/spapr.h
 > @@ -79,6 +79,7 @@ struct sPAPRPHBState {
 >      uint64_t dma64_win_addr;
 >  
 >      uint32_t numa_node;
 > +    bool pcie;
 >  
 >      /* Fields for migration compatibility hacks */
 >      bool pre_2_8_migration;  

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

Re: [libvirt] [Qemu-ppc] Proposal PCI/PCIe device placement on PAPR guests