On 12/01/17 14:52, David Gibson wrote:
On Fri, Jan 06, 2017 at 12:57:58PM +0100, Greg Kurz wrote:
> On Thu, 5 Jan 2017 16:46:18 +1100
> David Gibson <david(a)gibson.dropbear.id.au> wrote:
>
>> There was a discussion back in November on the qemu list which spilled
>> onto the libvirt list about how to add support for PCIe devices to
>> POWER VMs, specifically 'pseries' machine type PAPR guests.
>>
>> Here's a more concrete proposal for how to handle part of this in
>> future from the libvirt side. Strictly speaking what I'm suggesting
>> here isn't intrinsically linked to PCIe: it will make adding PCIe
>> support sanely easier, as well as having a number of advantages for
>> both PCIe and plain-PCI devices on PAPR guests.
>>
>> Background:
>>
>> * Currently the pseries machine type only supports vanilla PCI
>> buses.
>> * This is a qemu limitation, not something inherent - PAPR guests
>> running under PowerVM (the IBM hypervisor) can use passthrough
>> PCIe devices (PowerVM doesn't emulate devices though).
>> * In fact the way PCI access is para-virtualized in PAPR makes the
>> usual distinctions between PCI and PCIe largely disappear
>> * Presentation of PCIe devices to PAPR guests is unusual
>> * Unlike x86 (and other "bare metal" platforms), root ports are
>> not made visible to the guest. i.e. all devices (typically)
>> appear as though they were integrated devices on x86
>> * In terms of topology all devices will appear in a way similar to
>> a vanilla PCI bus, even PCIe devices
>> * However PCIe extended config space is accessible
>> * This means libvirt's usual placement of PCIe devices is not
>> suitable for PAPR guests
>> * PAPR has its own hotplug mechanism
>> * This is used instead of standard PCIe hotplug
>> * This mechanism works for both PCIe and vanilla-PCI devices
>> * This can hotplug/unplug devices even without a root port P2P
>> bridge between it and the root "bus"
>> * Multiple independent host bridges are routine on PAPR
>> * Unlike PC (where all host bridges have multiplexed access to
>> configuration space) PCI host bridges (PHBs) are truly
>> independent for PAPR guests (disjoint MMIO regions in system
>> address space)
>> * PowerVM typically presents a separate PHB to the guest for each
>> host slot passed through
>>
>> The Proposal:
>>
>> I suggest that libvirt implement a new default algorithm for placing
>> (i.e. assigning addresses to) both PCI and PCIe devices for (only)
>> PAPR guests.
>>
>> The short summary is that by default it should assign each device to a
>> separate vPHB, creating vPHBs as necessary.
>>
>> * For passthrough sometimes a group of host devices can't be safely
>> isolated from each other - this is known as a (host) Partitionable
>> Endpoint (PE). In this case, if any device in the PE is passed
>> through to a guest, the whole PE must be passed through to the
>> same vPHB in the guest. From the guest POV, each vPHB has exactly
>> one (guest) PE.
>> * To allow for hotplugged devices, libvirt should also add a number
>> of additional, empty vPHBs (the PAPR spec allows for hotplug of
>> PHBs, but this is not yet implemented in qemu). When hotplugging
>> a new device (or PE) libvirt should locate a vPHB which doesn't
>> currently contain anything.
>> * libvirt should only (automatically) add PHBs - never root ports or
>> other PCI to PCI bridges
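
To make the placement rule above concrete, here is a minimal sketch of it in Python. It is only an illustration, not libvirt code: the names VPHB, place_cold and hotplug are hypothetical, a PE is modelled as a plain list of device names, and running out of empty vPHBs is treated as an error on the assumption that PHB hotplug is unavailable.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VPHB:
    """Hypothetical model of one virtual PCI host bridge."""
    index: int
    # Devices of the single guest PE behind this vPHB (empty = spare vPHB).
    devices: List[str] = field(default_factory=list)


def place_cold(pes: List[List[str]], spare: int) -> List[VPHB]:
    """Give each PE its own vPHB, then append `spare` empty vPHBs."""
    phbs = [VPHB(i, list(pe)) for i, pe in enumerate(pes)]
    phbs += [VPHB(len(pes) + i) for i in range(spare)]
    return phbs


def hotplug(phbs: List[VPHB], pe: List[str]) -> VPHB:
    """Place a hotplugged device (or whole PE) on the first empty vPHB."""
    for phb in phbs:
        if not phb.devices:
            phb.devices = list(pe)
            return phb
    # Assumption: vPHBs cannot be added at runtime, so this is a hard failure.
    raise RuntimeError("no empty vPHB left for hotplug")


# Two PEs at startup (one with two inseparable functions), two spare vPHBs.
phbs = place_cold([["net0"], ["gpu0", "gpu0-audio"]], spare=2)
new_home = hotplug(phbs, ["disk0"])
```

Note that no root ports or PCI-to-PCI bridges appear anywhere in the model; the only thing ever created automatically is a vPHB.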
>>
>> In order to handle migration, the vPHBs will need to be represented in
>> the domain XML, which will also allow the user to override this
>> topology if they want.
>>
>> Advantages:
>>
>> There are still some details I need to figure out w.r.t. handling PCIe
>> devices (on both the qemu and libvirt sides). However the fact that
>
> One such detail may be that PCIe devices should have the
> "ibm,pci-config-space-type" property set to 1 in the DT,
> for the driver to be able to access the extended config
> space.
So, we have a bit of an oddity here. It looks like we currently set
'ibm,pci-config-space-type' to 1 in the PHB, rather than individual
device nodes. Which, AFAICT, is simply incorrect in terms of PAPR.
I asked Paul how to read the spec. Setting it on the PHB is correct, but
not enough: type=1 on a PHB means that extended access requests can pass
through it, but the devices and bridges underneath still need type=1
themselves if they support extended config space. Type 0 (or no property
at all) on a PHB means that extended config space is not available on
anything under that PHB.
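
As a rough sketch of that reading (not taken from qemu or from the PAPR text), the check could be written as below. The function name and the list representation are hypothetical, and treating every bridge between the PHB and the device as needing type=1 is my assumption about how the rule composes.

```python
from typing import List


def extended_config_accessible(phb_type: int, path_types: List[int]) -> bool:
    """Decide whether a device's extended config space is reachable.

    phb_type:   'ibm,pci-config-space-type' on the PHB node (0 if absent).
    path_types: the same property for each node below the PHB, ending with
                the device itself (0 if absent). Requiring 1 on every bridge
                in between is an assumption, not spec wording.
    """
    if phb_type != 1:
        # Nothing under this PHB can use extended config space.
        return False
    return all(t == 1 for t in path_types)


# PHB advertises type 1 but the device node does not: not accessible.
device_only_missing = extended_config_accessible(1, [0])
# Both PHB and device advertise type 1: accessible.
both_set = extended_config_accessible(1, [1])
```

With what we do today (type=1 only on the PHB), the first case is the one every device falls into.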
--
Alexey