Re: [libvirt] Proposal PCI/PCIe device placement on PAPR guests

Friday, 13 January 2017

On 13/01/17 15:48, David Gibson wrote:
...
 On Thu, Jan 12, 2017 at 10:09:03AM +0100, Greg Kurz wrote:
> On Thu, 12 Jan 2017 17:19:40 +1100
> Alexey Kardashevskiy <aik(a)ozlabs.ru> wrote:
>
>> On 12/01/17 14:52, David Gibson wrote:
>>> On Fri, Jan 06, 2017 at 12:57:58PM +0100, Greg Kurz wrote:  
>>>> On Thu, 5 Jan 2017 16:46:18 +1100
>>>> David Gibson <david(a)gibson.dropbear.id.au> wrote:
>>>>  
>>>>> There was a discussion back in November on the qemu list which spilled
>>>>> onto the libvirt list about how to add support for PCIe devices to
>>>>> POWER VMs, specifically 'pseries' machine type PAPR guests.
>>>>>
>>>>> Here's a more concrete proposal for how to handle part of this in
>>>>> future from the libvirt side.  Strictly speaking what I'm suggesting
>>>>> here isn't intrinsically linked to PCIe: it will make adding PCIe
>>>>> support sanely easier, as well as having a number of advantages for
>>>>> both PCIe and plain-PCI devices on PAPR guests.
>>>>>
>>>>> Background:
>>>>>
>>>>>  * Currently the pseries machine type only supports vanilla PCI
>>>>>    buses.
>>>>>     * This is a qemu limitation, not something inherent - PAPR guests
>>>>>       running under PowerVM (the IBM hypervisor) can use passthrough
>>>>>       PCIe devices (PowerVM doesn't emulate devices though).
>>>>>     * In fact the way PCI access is para-virtualized in PAPR makes the
>>>>>       usual distinctions between PCI and PCIe largely disappear
>>>>>  * Presentation of PCIe devices to PAPR guests is unusual
>>>>>     * Unlike x86 - and other "bare metal" platforms - root ports are
>>>>>       not made visible to the guest, i.e. all devices (typically)
>>>>>       appear as though they were integrated devices on x86
>>>>>     * In terms of topology all devices will appear in a way similar to
>>>>>       a vanilla PCI bus, even PCIe devices
>>>>>        * However PCIe extended config space is accessible
>>>>>     * This means libvirt's usual placement of PCIe devices is not
>>>>>       suitable for PAPR guests
>>>>>  * PAPR has its own hotplug mechanism
>>>>>     * This is used instead of standard PCIe hotplug
>>>>>     * This mechanism works for both PCIe and vanilla-PCI devices
>>>>>     * This can hotplug/unplug devices even without a root port P2P
>>>>>       bridge between it and the root "bus"
>>>>>  * Multiple independent host bridges are routine on PAPR
>>>>>     * Unlike PC (where all host bridges have multiplexed access to
>>>>>       configuration space) PCI host bridges (PHBs) are truly
>>>>>       independent for PAPR guests (disjoint MMIO regions in system
>>>>>       address space)
>>>>>     * PowerVM typically presents a separate PHB to the guest for each
>>>>>       host slot passed through
>>>>>
>>>>> The Proposal:
>>>>>
>>>>> I suggest that libvirt implement a new default algorithm for placing
>>>>> (i.e. assigning addresses to) both PCI and PCIe devices for (only)
>>>>> PAPR guests.
>>>>>
>>>>> The short summary is that by default it should assign each device to a
>>>>> separate vPHB, creating vPHBs as necessary.
>>>>>
>>>>>   * For passthrough sometimes a group of host devices can't be safely
>>>>>     isolated from each other - this is known as a (host) Partitionable
>>>>>     Endpoint (PE).  In this case, if any device in the PE is passed
>>>>>     through to a guest, the whole PE must be passed through to the
>>>>>     same vPHB in the guest.  From the guest POV, each vPHB has exactly
>>>>>     one (guest) PE.
>>>>>   * To allow for hotplugged devices, libvirt should also add a number
>>>>>     of additional, empty vPHBs (the PAPR spec allows for hotplug of
>>>>>     PHBs, but this is not yet implemented in qemu).  When hotplugging
>>>>>     a new device (or PE) libvirt should locate a vPHB which doesn't
>>>>>     currently contain anything.
>>>>>   * libvirt should only (automatically) add PHBs - never root ports or
>>>>>     other PCI to PCI bridges
>>>>>
>>>>> In order to handle migration, the vPHBs will need to be represented in
>>>>> the domain XML, which will also allow the user to override this
>>>>> topology if they want.
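
[Editor's sketch, not part of the original message.] The placement rule proposed above can be illustrated roughly as follows. This is a minimal model under stated assumptions, not libvirt code: the `VPHB` class, the `place_devices` and `hotplug` helpers, the device names, and the `spare_phbs` count are all hypothetical and chosen only for illustration. Each PE (a group of devices that cannot be isolated from each other) gets its own vPHB, a few empty vPHBs are added for later hotplug, and no root ports or bridges are ever created.

```python
from dataclasses import dataclass, field


@dataclass
class VPHB:
    """Hypothetical model of one virtual PCI host bridge."""
    index: int
    devices: list = field(default_factory=list)


def place_devices(pe_groups, spare_phbs=2):
    """Give every PE (a list of device names) its own vPHB, then append empty spares."""
    phbs = [VPHB(i, list(group)) for i, group in enumerate(pe_groups)]
    first_spare = len(phbs)
    phbs += [VPHB(first_spare + i) for i in range(spare_phbs)]
    return phbs


def hotplug(phbs, pe_group):
    """Put a hotplugged PE on the first vPHB that is still empty."""
    for phb in phbs:
        if not phb.devices:
            phb.devices.extend(pe_group)
            return phb
    # PHB hotplug itself is assumed unavailable, so running out of spares is an error.
    raise RuntimeError("no empty vPHB available for hotplug")


# Two PEs at boot (the second holds two inseparable functions), one spare vPHB.
phbs = place_devices([["net0"], ["gpu0", "gpu0-audio"]], spare_phbs=1)
target = hotplug(phbs, ["nvme0"])
print([(p.index, p.devices) for p in phbs])
```

In this model the only topology decision is "which vPHB", which is why the result would have to be recorded in the domain XML for migration.
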
>>>>>
>>>>> Advantages:
>>>>>
>>>>> There are still some details I need to figure out w.r.t. handling PCIe
>>>>> devices (on both the qemu and libvirt sides).  However the fact that 

>>>>
>>>> One such detail may be that PCIe devices should have the
>>>> "ibm,pci-config-space-type" property set to 1 in the DT,
>>>> for the driver to be able to access the extended config
>>>> space.  
>>>
>>> So, we have a bit of an oddity here.  It looks like we currently set
>>> 'ibm,pci-config-space-type' to 1 in the PHB, rather than individual
>>> device nodes.  Which, AFAICT, is simply incorrect in terms of PAPR.  
>>
>>
>> I asked Paul how to read the spec and this is rather correct but not enough
>> - having type=1 on a PHB means that extended access requests can go behind
>> it but underlying devices and bridges still need to have type=1 if they
>> support extended space. Having type set to 0 (or none at all) on a PHB
>> would mean that extended config space is not available on anything under
>> this PHB.
>>
>
> I have the very same understanding of the spec (LoPAPR March 2016):
>
> R1–9.1.8–2. All IOAs that implement PCI-X Mode 2 or PCI Express must supply the
> “ibm,pci-config-space-type” property (see Section B.6.5.1.1.1, “Properties for
> Children of PCI Host Bridges,” on page 703).
>
> Implementation Note: The “ibm,pci-config-space-type” property in Requirement
> R1–9.1.8–2 is added for platforms that support I/O fabric and IOAs that
> implement PCI-X Mode 2, and PCI Express. To access the extended configuration
> space provided by PCI-X Mode 2 and PCI Express, all I/O fabric leading up to an
> IOA must support a 12-bit register number. In other words, if a platform
> implementation has a conventional PCI bridge leading up to an IOA that
> implements PCI-X Mode 2, the platform will not be able to provide access to the
> extended configuration space of that IOA. The “ibm,config-space-type” property
> in the IOA's OF node is used by device drivers to determine if an IOA’s
> extended configuration space can be accessed.
>
> and
>
> B.6.5.1.1.1 Properties for Children of PCI Host Bridges
>
> “ibm,pci-config-space-type”
> property name: Indicates if the platform supports access to an extended
> configuration address space from the PHB up to and including this node.
> 0 = Platform supports only an eight bit register number for configuration
> address space accesses.
> 1 = Platform supports a twelve bit register number for configuration address
> space accesses.
> This property may be provided in all PHB nodes and their children.
> Note: The absence of this property implies the platform supports only an eight
> bit register number for configuration address space accesses.
>
>
> And incidentally, this is what the linux kernel currently expects. See these lines
> from arch/powerpc/kernel/pci_dn.c:
>
> struct pci_dn *pci_add_device_node_info(struct pci_controller *hose,
>                                         struct device_node *dn)
> {
>         const __be32 *type = of_get_property(dn, "ibm,pci-config-space-type", NULL);
> .
> .
> .
>         /* Extended config space */
>         pdn->pci_ext_config_space = (type && of_read_number(type, 1) == 1);
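
[Editor's sketch, not part of the original message.] The reading of the spec discussed above ("from the PHB up to and including this node") can be modelled as a small rule for which value a node's "ibm,pci-config-space-type" property should carry. This is only an illustration under assumptions: the function name `config_space_type` and the list-of-booleans input are hypothetical, and the sketch is not qemu or kernel code.

```python
def config_space_type(hops_support_12bit):
    """Return the property value for the last node on a path.

    hops_support_12bit lists, from the PHB down to the node itself, whether
    each hop can carry a 12-bit config register number (hypothetical input).
    """
    if not hops_support_12bit:
        raise ValueError("path must contain at least the PHB")
    # A single hop limited to 8-bit register numbers hides extended config
    # space for everything below it, so every hop has to qualify.
    return 1 if all(hops_support_12bit) else 0


print(config_space_type([True]))               # PHB alone
print(config_space_type([True, True]))         # PHB -> PCIe device
print(config_space_type([True, False, True]))  # conventional PCI bridge in between
```

Under this reading a guest driver only needs to look at the device's own node, which matches the check in the kernel snippet quoted above.
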

 Ok, thanks for the information.

> I had to rework Alexey's "spapr_pci: Create PCI-express root bus by default"
> patch to be able to see the extended config space of a vfio-pci device:

 Ah!  Is there an easy command line way to verify that extended config
 space is accessible? 

I do "lspci -vvs 0003:01:00.3" and look for "Capabilities: [xxx v1]"
where xxx >= 0x100.

-- 
Alexey
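
[Editor's sketch, not part of the original message.] The manual check described above could be scripted along these lines. The helper name `extended_cap_offsets` is hypothetical, and the two sample lines are illustrative text in the "Capabilities: [xxx v1]" shape mentioned in the message, not output captured from a real device. Offsets are treated as hexadecimal, and anything at or above 0x100 lies in extended config space.

```python
import re

# Matches both "Capabilities: [40] ..." and "Capabilities: [100 v1] ...".
CAP_RE = re.compile(r"Capabilities: \[([0-9a-fA-F]+)(?: v\d+)?\]")


def extended_cap_offsets(lspci_text):
    """Return capability offsets >= 0x100 found in `lspci -vv` style text."""
    offsets = [int(m.group(1), 16) for m in CAP_RE.finditer(lspci_text)]
    return [off for off in offsets if off >= 0x100]


# Illustrative sample only; real text would come from "lspci -vvs <address>".
sample = """\
    Capabilities: [40] Power Management version 3
    Capabilities: [100 v1] Advanced Error Reporting
"""
print([hex(off) for off in extended_cap_offsets(sample)])
```

An empty result suggests that no extended capability is visible from the guest, which is the symptom the thread is trying to rule out.
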
