On Fri, 13 Jan 2017 15:48:31 +1100
David Gibson <david(a)gibson.dropbear.id.au> wrote:
On Thu, Jan 12, 2017 at 10:09:03AM +0100, Greg Kurz wrote:
> On Thu, 12 Jan 2017 17:19:40 +1100
> Alexey Kardashevskiy <aik(a)ozlabs.ru> wrote:
>
> > On 12/01/17 14:52, David Gibson wrote:
> > > On Fri, Jan 06, 2017 at 12:57:58PM +0100, Greg Kurz wrote:
> > >> On Thu, 5 Jan 2017 16:46:18 +1100
> > >> David Gibson <david(a)gibson.dropbear.id.au> wrote:
> > >>
> > >>> There was a discussion back in November on the qemu list which
spilled
> > >>> onto the libvirt list about how to add support for PCIe devices
to
> > >>> POWER VMs, specifically 'pseries' machine type PAPR
guests.
> > >>>
> > >>> Here's a more concrete proposal for how to handle part of this
in
> > >>> future from the libvirt side. Strictly speaking what I'm
suggesting
> > >>> here isn't intrinsically linked to PCIe: it will make adding
PCIe
> > >>> support sanely easier, as well as having a number of advantages
for
> > >>> both PCIe and plain-PCI devices on PAPR guests.
> > >>>
> > >>> Background:
> > >>>
> > >>> * Currently the pseries machine type only supports vanilla PCI
> > >>> buses.
> > >>> * This is a qemu limitation, not something inherent - PAPR
guests
> > >>> running under PowerVM (the IBM hypervisor) can use
passthrough
> > >>> PCIe devices (PowerVM doesn't emulate devices though).
> > >>> * In fact the way PCI access is para-virtalized in PAPR makes
the
> > >>> usual distinctions between PCI and PCIe largely disappear
> > >>> * Presentation of PCIe devices to PAPR guests is unusual
> > >>> * Unlike x86 - and other "bare metal" platforms,
root ports are
> > >>> not made visible to the guest. i.e. all devices (typically)
> > >>> appear as though they were integrated devices on x86
> > >>> * In terms of topology all devices will appear in a way
similar to
> > >>> a vanilla PCI bus, even PCIe devices
> > >>> * However PCIe extended config space is accessible
> > >>> * This means libvirt's usual placement of PCIe devices is
not
> > >>> suitable for PAPR guests
> > >>> * PAPR has its own hotplug mechanism
> > >>> * This is used instead of standard PCIe hotplug
> > >>> * This mechanism works for both PCIe and vanilla-PCI devices
> > >>> * This can hotplug/unplug devices even without a root port
P2P
> > >>> bridge between it and the root "bus
> > >>> * Multiple independent host bridges are routine on PAPR
> > >>> * Unlike PC (where all host bridges have multiplexed access
to
> > >>> configuration space) PCI host bridges (PHBs) are truly
> > >>> independent for PAPR guests (disjoint MMIO regions in
system
> > >>> address space)
> > >>> * PowerVM typically presents a separate PHB to the guest for
each
> > >>> host slot passed through
> > >>>
> > >>> The Proposal:
> > >>>
> > >>> I suggest that libvirt implement a new default algorithm for
placing
> > >>> (i.e. assigning addresses to) both PCI and PCIe devices for
(only)
> > >>> PAPR guests.
> > >>>
> > >>> The short summary is that by default it should assign each device
to a
> > >>> separate vPHB, creating vPHBs as necessary.
> > >>>
> > >>> * For passthrough sometimes a group of host devices can't be
safely
> > >>> isolated from each other - this is known as a (host)
Partitionable
> > >>> Endpoint (PE). In this case, if any device in the PE is
passed
> > >>> through to a guest, the whole PE must be passed through to
the
> > >>> same vPHB in the guest. From the guest POV, each vPHB has
exactly
> > >>> one (guest) PE.
> > >>> * To allow for hotplugged devices, libvirt should also add a
number
> > >>> of additional, empty vPHBs (the PAPR spec allows for hotplug
of
> > >>> PHBs, but this is not yet implemented in qemu). When
hotplugging
> > >>> a new device (or PE) libvirt should locate a vPHB which
doesn't
> > >>> currently contain anything.
> > >>> * libvirt should only (automatically) add PHBs - never root
ports or
> > >>> other PCI to PCI bridges
> > >>>
> > >>> In order to handle migration, the vPHBs will need to be
represented in
> > >>> the domain XML, which will also allow the user to override this
> > >>> topology if they want.
> > >>>
> > >>> Advantages:
> > >>>
> > >>> There are still some details I need to figure out w.r.t. handling
PCIe
> > >>> devices (on both the qemu and libvirt sides). However the fact
that
> > >>
> > >> One such detail may be that PCIe devices should have the
> > >> "ibm,pci-config-space-type" property set to 1 in the DT,
> > >> for the driver to be able to access the extended config
> > >> space.
> > >
> > > So, we have a bit of an oddity here. It looks like we currently set
> > > 'ibm,pci-config-space-type' to 1 in the PHB, rather than
individual
> > > device nodes. Which, AFAICT, is simply incorrect in terms of PAPR.
> >
> >
> > I asked Paul how to read the spec and this is rather correct but not enough
> > - having type=1 on a PHB means that extended access requests can go behind
> > it but underlying devices and bridges still need to have type=1 if they
> > support extended space. Having type set to 0 (or none at all) on a PHB
> > would mean that extended config space is not available on anything under
> > this PHB.
> >
>
> I have the very same understanding of the spec (LoPAPR March 2016):
>
> R1–9.1.8–2. All IOAs that implement PCI-X Mode 2 or PCI Express must supply the
“ibm,pci-con-
> fig-space-type” property (see Section B.6.5.1.1.1‚ “Properties for Children of PCI
Host Bridges‚” on
> page 703).
>
> Implementation Note: The “ibm,pci-config-space-type” property in Requirement
R1–9.1.8–2 is added for
> platforms that support I/O fabric and IOAs that implement PCI-X Mode 2, and PCI
Express. To access the
> extended configuration space provided by PCI-X Mode 2 and PCI Express, all I/O
fabric leading up to an IOA
> must support a 12-bit register number. In other words, if a platform implementation
has a conventional PCI bridge
> leading up to an IOA that implements PCI-X Mode 2, the platform will not be able to
provide access to the
> extended configuration space of that IOA. The “ibm,config-space-type” property in
the IOA's OF node
> is used by device drivers to determine if an IOA’s extended configuration space can
be accessed.
>
> and
>
> B.6.5.1.1.1 Properties for Children of PCI Host Bridges
>
> “ibm,pci-config-space-type”
> property name: Indicates if the platform supports access to an extended
configuration address space from the PHB
> up to and including this node.
> 0 = Platform supports only an eight bit register number for configuration address
space accesses.
> 1 = Platform supports a twelve bit register number for configuration address space
accesses.
> This property may be provided in all PHB nodes and their children.
> Note: The absence of this property implies the platform supports only an eight bit
register number for configura-
> tion address space accesses.
>
>
> And incidentally, this is what the linux kernel currently expects. See these lines
> from arch/powerpc/kernel/pci_dn.c:
>
> struct pci_dn *pci_add_device_node_info(struct pci_controller *hose,
> struct device_node *dn)
> {
> const __be32 *type = of_get_property(dn,
"ibm,pci-config-space-type", NULL);
> .
> .
> .
> /* Extended config space */
> pdn->pci_ext_config_space = (type && of_read_number(type, 1) ==
1);
Ok, thanks for the information.
> I had to rework Alexey's "spapr_pci: Create PCI-express root bus by
default"
> patch to be able to see the extended config space of a vfio-pci device:
Ah! Is there an easy command line way to verify that extended config
space is accessible?
'lspci -vv' prints the capabilities from the config space with their
hex offset between [ ]. Extended config space starts at offset 0x100.
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -1052,6 +1052,9 @@ static int spapr_populate_pci_child_dt(PCIDevice *dev, void
*fdt, int offset,
> _FDT(fdt_setprop(fdt, offset, "reg", (uint8_t *)rp.reg,
rp.reg_len));
> _FDT(fdt_setprop(fdt, offset, "assigned-addresses",
> (uint8_t *)rp.assigned, rp.assigned_len));
> + if (sphb->pcie_root && pci_is_express(dev)) {
> + _FDT(fdt_setprop_cell(fdt, offset, "ibm,pci-config-space-type",
0x1));
> + }
>
> return 0;
> }
I'm looking at merging the patch below into ppc-for-2.9 shortly. It
basically combines a revised version of the pcie option part of
Alexey's patch with your DT patch above. For now it only allows PCIe
with an explicit option, changing the default waits on more
investigation of how to not break things with libvirt.
From e14702a0bb76900077d82510280e1a023c384b72 Mon Sep 17 00:00:00 2001
From: David Gibson <david(a)gibson.dropbear.id.au>
Date: Fri, 13 Jan 2017 15:45:30 +1100
Subject: [PATCH] spapr_pci: Allow PCI-Express devices
Currently the PCI host bridge (PHB) for the pseries machine type always
creates a plain PCI bus (TYPE_PCI_BUS). This means that qemu won't allow
PCIe devices to be attached, and consequently, PCIe devices can't be used
on pseries.
Well, PCIe devices can still be attached actually (at least I could attach
a vfio-pci device linked to a PCIe device on the host). But, they appear
as legacy PCI devices in the guest, without the extended config space.
This limitation isn't inherent in the PAPR specification. In
fact, because
access to the PCI bus is paravirtualized via hypercalls, from the guest
point of view there is very little difference between a plain PCI and PCIe
bus under PAPR.
So, to allow PCIe devices, add a "pcie" option to the PHB device, which
when enabled creates a PCIe bus. In addition, on PCIe devices under such
a PHB, we publish the device tree flag which indicates that the guest may
access PCIe extended config space (via RTAS calls which we've already
implemented).
Note that this doesn't change the default PHB to PCIe. We probably want to
do that in future, but we need to sort out some possible breakages with
libvirt and existing users first. This at least lets knowledgeable users
experiment with PCIe devices in the meantime.
Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au>
FWIW,
Reviewed-by: Greg Kurz <groug(a)kaod.org>
---
hw/ppc/spapr_pci.c | 8 +++++++-
include/hw/pci-host/spapr.h | 1 +
2 files changed, 8 insertions(+), 1 deletion(-)
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index fd6fc1d..d17df1f 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -1053,6 +1053,10 @@ static int spapr_populate_pci_child_dt(PCIDevice *dev, void *fdt,
int offset,
_FDT(fdt_setprop(fdt, offset, "assigned-addresses",
(uint8_t *)rp.assigned, rp.assigned_len));
+ if (sphb->pcie && pci_is_express(dev)) {
+ _FDT(fdt_setprop_cell(fdt, offset, "ibm,pci-config-space-type",
0x1));
+ }
+
return 0;
}
@@ -1434,7 +1438,8 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
bus = pci_register_bus(dev, NULL,
pci_spapr_set_irq, pci_spapr_map_irq, sphb,
&sphb->memspace, &sphb->iospace,
- PCI_DEVFN(0, 0), PCI_NUM_PINS, TYPE_PCI_BUS);
+ PCI_DEVFN(0, 0), PCI_NUM_PINS,
+ sphb->pcie ? TYPE_PCIE_BUS : TYPE_PCI_BUS);
phb->bus = bus;
qbus_set_hotplug_handler(BUS(phb->bus), DEVICE(sphb), NULL);
@@ -1592,6 +1597,7 @@ static Property spapr_phb_properties[] = {
DEFINE_PROP_UINT32("numa_node", sPAPRPHBState, numa_node, -1),
DEFINE_PROP_BOOL("pre-2.8-migration", sPAPRPHBState,
pre_2_8_migration, false),
+ DEFINE_PROP_BOOL("pcie", sPAPRPHBState, pcie, false),
DEFINE_PROP_END_OF_LIST(),
};
diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
index 092294e..deed2cc 100644
--- a/include/hw/pci-host/spapr.h
+++ b/include/hw/pci-host/spapr.h
@@ -79,6 +79,7 @@ struct sPAPRPHBState {
uint64_t dma64_win_addr;
uint32_t numa_node;
+ bool pcie;
/* Fields for migration compatibility hacks */
bool pre_2_8_migration;