On 13/01/17 15:48, David Gibson wrote:
On Thu, Jan 12, 2017 at 10:09:03AM +0100, Greg Kurz wrote:
> On Thu, 12 Jan 2017 17:19:40 +1100
> Alexey Kardashevskiy <aik(a)ozlabs.ru> wrote:
>
>> On 12/01/17 14:52, David Gibson wrote:
>>> On Fri, Jan 06, 2017 at 12:57:58PM +0100, Greg Kurz wrote:
>>>> On Thu, 5 Jan 2017 16:46:18 +1100
>>>> David Gibson <david(a)gibson.dropbear.id.au> wrote:
>>>>
>>>>> There was a discussion back in November on the qemu list which spilled
>>>>> onto the libvirt list about how to add support for PCIe devices to
>>>>> POWER VMs, specifically 'pseries' machine type PAPR guests.
>>>>>
>>>>> Here's a more concrete proposal for how to handle part of this in
>>>>> future from the libvirt side. Strictly speaking what I'm suggesting
>>>>> here isn't intrinsically linked to PCIe: it will make adding PCIe
>>>>> support sanely easier, as well as having a number of advantages for
>>>>> both PCIe and plain-PCI devices on PAPR guests.
>>>>>
>>>>> Background:
>>>>>
>>>>> * Currently the pseries machine type only supports vanilla PCI
>>>>> buses.
>>>>> * This is a qemu limitation, not something inherent - PAPR guests
>>>>> running under PowerVM (the IBM hypervisor) can use passthrough
>>>>> PCIe devices (PowerVM doesn't emulate devices though).
>>>>> * In fact the way PCI access is para-virtualized in PAPR makes the
>>>>> usual distinctions between PCI and PCIe largely disappear
>>>>> * Presentation of PCIe devices to PAPR guests is unusual
>>>>> * Unlike x86 (and other "bare metal" platforms), root ports are
>>>>> not made visible to the guest. i.e. all devices (typically)
>>>>> appear as though they were integrated devices on x86
>>>>> * In terms of topology all devices will appear in a way similar to
>>>>> a vanilla PCI bus, even PCIe devices
>>>>> * However PCIe extended config space is accessible
>>>>> * This means libvirt's usual placement of PCIe devices is not
>>>>> suitable for PAPR guests
>>>>> * PAPR has its own hotplug mechanism
>>>>> * This is used instead of standard PCIe hotplug
>>>>> * This mechanism works for both PCIe and vanilla-PCI devices
>>>>> * This can hotplug/unplug devices even without a root port P2P
>>>>> bridge between it and the root "bus"
>>>>> * Multiple independent host bridges are routine on PAPR
>>>>> * Unlike PC (where all host bridges have multiplexed access to
>>>>> configuration space) PCI host bridges (PHBs) are truly
>>>>> independent for PAPR guests (disjoint MMIO regions in system
>>>>> address space)
>>>>> * PowerVM typically presents a separate PHB to the guest for each
>>>>> host slot passed through
>>>>>
>>>>> The Proposal:
>>>>>
>>>>> I suggest that libvirt implement a new default algorithm for placing
>>>>> (i.e. assigning addresses to) both PCI and PCIe devices for (only)
>>>>> PAPR guests.
>>>>>
>>>>> The short summary is that by default it should assign each device to a
>>>>> separate vPHB, creating vPHBs as necessary.
>>>>>
>>>>> * For passthrough sometimes a group of host devices can't be safely
>>>>> isolated from each other - this is known as a (host) Partitionable
>>>>> Endpoint (PE). In this case, if any device in the PE is passed
>>>>> through to a guest, the whole PE must be passed through to the
>>>>> same vPHB in the guest. From the guest POV, each vPHB has exactly
>>>>> one (guest) PE.
>>>>> * To allow for hotplugged devices, libvirt should also add a number
>>>>> of additional, empty vPHBs (the PAPR spec allows for hotplug of
>>>>> PHBs, but this is not yet implemented in qemu). When hotplugging
>>>>> a new device (or PE) libvirt should locate a vPHB which doesn't
>>>>> currently contain anything.
>>>>> * libvirt should only (automatically) add PHBs - never root ports or
>>>>> other PCI to PCI bridges
>>>>>
>>>>> In order to handle migration, the vPHBs will need to be represented in
>>>>> the domain XML, which will also allow the user to override this
>>>>> topology if they want.
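
A minimal sketch of the placement rule proposed above, for illustration only. It is not libvirt code: the `VPHB` class, `place_devices()`, `hotplug()` and the `spare_phbs` default are hypothetical names chosen here, and devices are plain strings. Each group passed in stands for one host PE (often a single device) and lands on its own vPHB; a few empty vPHBs are then appended for later hotplug.

```python
from dataclasses import dataclass, field


@dataclass
class VPHB:
    """Hypothetical stand-in for one virtual PCI host bridge."""
    index: int
    devices: list = field(default_factory=list)


def place_devices(pe_groups, spare_phbs=2):
    """Give each PE (group of devices) its own vPHB, then add empty spares."""
    phbs = [VPHB(i, list(group)) for i, group in enumerate(pe_groups)]
    for _ in range(spare_phbs):
        phbs.append(VPHB(len(phbs)))
    return phbs


def hotplug(phbs, group):
    """Put a hotplugged device (or PE) on the first vPHB that is empty."""
    for phb in phbs:
        if not phb.devices:
            phb.devices.extend(group)
            return phb
    # PHB hotplug itself is not modelled in this sketch.
    raise RuntimeError("no empty vPHB left")
```

For example, `place_devices([["net0"], ["gpu0", "gpu0-audio"]], spare_phbs=1)` yields three vPHBs, the last one empty, and a later `hotplug(phbs, ["disk0"])` fills that spare. No root ports or bridges are ever created, in line with the last bullet of the proposal.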
>>>>>
>>>>> Advantages:
>>>>>
>>>>> There are still some details I need to figure out w.r.t. handling PCIe
>>>>> devices (on both the qemu and libvirt sides). However the fact that
>>>>
>>>> One such detail may be that PCIe devices should have the
>>>> "ibm,pci-config-space-type" property set to 1 in the DT,
>>>> for the driver to be able to access the extended config
>>>> space.
>>>
>>> So, we have a bit of an oddity here. It looks like we currently set
>>> 'ibm,pci-config-space-type' to 1 in the PHB, rather than individual
>>> device nodes. Which, AFAICT, is simply incorrect in terms of PAPR.
>>
>>
>> I asked Paul how to read the spec and this is rather correct but not enough
>> - having type=1 on a PHB means that extended access requests can go behind
>> it but underlying devices and bridges still need to have type=1 if they
>> support extended space. Having type set to 0 (or none at all) on a PHB
>> would mean that extended config space is not available on anything under
>> this PHB.
>>
>
> I have the very same understanding of the spec (LoPAPR March 2016):
>
> R1–9.1.8–2. All IOAs that implement PCI-X Mode 2 or PCI Express must supply the
> “ibm,pci-config-space-type” property (see Section B.6.5.1.1.1, “Properties for
> Children of PCI Host Bridges,” on page 703).
>
> Implementation Note: The “ibm,pci-config-space-type” property in Requirement
> R1–9.1.8–2 is added for platforms that support I/O fabric and IOAs that implement
> PCI-X Mode 2, and PCI Express. To access the extended configuration space provided
> by PCI-X Mode 2 and PCI Express, all I/O fabric leading up to an IOA must support a
> 12-bit register number. In other words, if a platform implementation has a
> conventional PCI bridge leading up to an IOA that implements PCI-X Mode 2, the
> platform will not be able to provide access to the extended configuration space of
> that IOA. The “ibm,config-space-type” property in the IOA's OF node is used by
> device drivers to determine if an IOA’s extended configuration space can be accessed.
>
> and
>
> B.6.5.1.1.1 Properties for Children of PCI Host Bridges
>
> “ibm,pci-config-space-type”
> property name: Indicates if the platform supports access to an extended
> configuration address space from the PHB up to and including this node.
> 0 = Platform supports only an eight bit register number for configuration address
> space accesses.
> 1 = Platform supports a twelve bit register number for configuration address space
> accesses.
> This property may be provided in all PHB nodes and their children.
> Note: The absence of this property implies the platform supports only an eight bit
> register number for configuration address space accesses.
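
To make the reading of the property discussed above concrete, here is a small sketch. It assumes each device tree node is represented as a plain dict of properties (a simplification made up for this example), and the function names are hypothetical. It treats a missing property as 0, as the note in the spec says, and requires a value of 1 on every node from the PHB down to the device, which matches the "type=1 on the PHB is necessary but not sufficient" interpretation.

```python
PROP = "ibm,pci-config-space-type"


def config_space_type(node_props):
    """Return the property value for one node; absence means 0 (eight bit)."""
    return node_props.get(PROP, 0)


def ext_config_accessible(path):
    """path: property dicts ordered from the PHB node down to the device node.

    Extended config space is considered reachable only if every node on
    the path advertises a twelve bit register number (value 1).
    """
    return bool(path) and all(config_space_type(n) == 1 for n in path)
```

With this model, `ext_config_accessible([{PROP: 1}, {}])` is false: the PHB allows extended accesses, but the device node does not advertise them.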
>
>
> And incidentally, this is what the linux kernel currently expects. See these lines
> from arch/powerpc/kernel/pci_dn.c:
>
> struct pci_dn *pci_add_device_node_info(struct pci_controller *hose,
>                                         struct device_node *dn)
> {
>         const __be32 *type = of_get_property(dn, "ibm,pci-config-space-type", NULL);
> .
> .
> .
>         /* Extended config space */
>         pdn->pci_ext_config_space = (type && of_read_number(type, 1) == 1);
Ok, thanks for the information.
> I had to rework Alexey's "spapr_pci: Create PCI-express root bus by default"
> patch to be able to see the extended config space of a vfio-pci device:
Ah! Is there an easy command line way to verify that extended config
space is accessible?
I do "lspci -vvs 0003:01:00.3" and look for "Capabilities: [xxx v1]" where
xxx >= 0x100.
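
The same check can be scripted. The sketch below walks the PCIe extended capability list, which starts at offset 0x100, in a raw config space dump. It assumes the dump comes from the sysfs `config` file of the device, and that reading past the standard header needs root; as far as I know both hold on Linux, but treat them as assumptions. The helper names are made up for this example.

```python
import struct


def ext_capabilities(config):
    """Walk the extended capability list in a raw config space dump.

    Returns a list of (offset, cap_id, version) tuples; the list is empty
    if the dump is too short or nothing is present at offset 0x100.
    """
    caps = []
    offset = 0x100
    seen = set()
    while 0x100 <= offset <= len(config) - 4 and offset not in seen:
        seen.add(offset)  # guard against a looping capability list
        (header,) = struct.unpack_from("<I", config, offset)
        if header in (0, 0xFFFFFFFF):
            break
        # Bits 15:0 are the capability ID, bits 19:16 the version.
        caps.append((offset, header & 0xFFFF, (header >> 16) & 0xF))
        # Bits 31:20 hold the next offset; the low two bits are reserved.
        offset = (header >> 20) & 0xFFC
    return caps


def read_config(bdf):
    """Read the config space of a device such as "0003:01:00.3" via sysfs."""
    with open(f"/sys/bus/pci/devices/{bdf}/config", "rb") as f:
        return f.read()
```

A non-empty result from `ext_capabilities(read_config("0003:01:00.3"))` corresponds to lspci showing at least one capability at 0x100 or above.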
--
Alexey