On Fri, 2016-12-02 at 15:18 +1100, David Gibson wrote:
> > So, would the PCIe Root Bus in a pseries guest behave
> > differently than the one in a q35 or mach-virt guest?
> 
> Yes. I had a long discussion with BenH and got a somewhat better idea
> about this.

Sorry, but I'm afraid you're going to have to break this
down even further for me :(

> If only a single host PE (== iommu group) is passed through and there
> are no emulated devices, the difference isn't too bad: basically on
> pseries you'll see the subtree that would be below the root complex on
> q35.
> 
> But if you pass through multiple groups, things get weird.

Is the difference between q35 and pseries guests with
respect to PCIe only relevant when it comes to assigned
devices, or in general? I'm asking this because you seem to
focus entirely on assigned devices.

> On q35, you'd generally expect physically separate (different slot)
> devices to appear under separate root complexes.

This part I don't get at all, so please bear with me.

The way I read it you're claiming that eg. a SCSI controller
and a network adapter, being physically separate and assigned
to separate PCI slots, should have a dedicated PCIe Root
Complex each on a q35 guest.

That doesn't match with my experience, where you would simply
assign them to separate slots of the default PCIe Root Bus
(pcie.0), eg. 00:01.0 and 00:02.0.
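
Just to make sure we're picturing the same thing, what I have in mind
is roughly this (heavily abridged, and the device models are just
examples picked for illustration):

  qemu-system-x86_64 -machine q35 ... \
    -device virtio-scsi-pci,bus=pcie.0,addr=0x1 \
    -device e1000e,bus=pcie.0,addr=0x2

with the two devices ending up at 00:01.0 and 00:02.0, both sitting
directly on pcie.0.
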
Maybe you're referring to the fact that you might want to
create multiple PCIe Root Complexes in order to assign the
host devices to separate guest NUMA nodes? How is creating
multiple PCIe Root Complexes on q35 using pxb-pcie different
than creating multiple PHBs using spapr-pci-host-bridge on
pseries?
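
To make the comparison concrete, and assuming I'm remembering the
device properties correctly, on q35 you would use something like

  -device pxb-pcie,id=pcie.1,bus=pcie.0,bus_nr=2,numa_node=1

while on pseries the rough equivalent would be

  -device spapr-pci-host-bridge,index=1

possibly minus the NUMA node assignment, which I'm not sure the
sPAPR PHB supports at the moment.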

> Whereas on pseries they'll appear as siblings on a virtual bus (which
> makes no physical sense for point-to-point PCI-E).

What is the virtual bus in question? Why would it matter
that they're siblings?
I'm possibly missing the point entirely, but so far it
looks to me like there are different configurations you
might want to use depending on your goal, and both q35
and pseries give you comparable tools to achieve such
configurations.

> I suppose we could try treating all devices on pseries as though they
> were chipset builtin devices on q35, which will appear on the root
> PCI-E bus without root complex. But I suspect that's likely to cause
> trouble with hotplug, and it will certainly need different address
> allocation from libvirt.

PCIe Integrated Endpoint Devices are not hotpluggable on
q35, which is why libvirt will follow QEMU's PCIe topology
recommendations and place a PCIe Root Port between the Root Bus
and the device;
I assume the same could be done for pseries guests as
soon as QEMU grows support for generic PCIe Root Ports,
something Marcel has already posted patches for.
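
To illustrate, for q35 guests libvirt will end up generating
something like (heavily abridged)

  <controller type='pci' model='pcie-root-port'/>
  <interface type='network'>
    <source network='default'/>
    <model type='virtio'/>
    <address type='pci' domain='0x0000' bus='0x01'
             slot='0x00' function='0x0'/>
  </interface>

so that the device sits behind a Root Port rather than directly on
pcie.0 and thus remains hotpluggable.
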
Again, sorry for clearly misunderstanding your explanation,
but I'm still not seeing the issue here. I'm sure it's very
clear in your mind, but I'm afraid you're going to have to
walk me through it :(

> > Regardless of how we decide to move forward with the
> > PCIe-enabled pseries machine type, libvirt will have to
> > know about this so it can behave appropriately.
> 
> So there are kind of two extremes of how to address this. There are a
> variety of options in between, but I suspect they're going to be even
> more muddled and hideous than the extremes.
> 
> 1) Give up. You said there's already a flag that says a PCI-E bus is
> able to accept vanilla-PCI devices. We add a hack flag that says a
> vanilla-PCI bus is able to accept PCI-E devices. We keep address
> allocation as it is now - the pseries topology really does resemble
> vanilla-PCI much better than it does PCI-E. But, we allow PCI-E
> devices, and PAPR has mechanisms for accessing the extended config
> space. PCI-E standard hotplug and error reporting will never work,
> but PAPR provides its own mechanisms for those, so that should be ok.

We can definitely special-case pseries guests and take
the "anything goes" approach to PCI vs PCIe, but it would
certainly be nicer if we could avoid presenting our users with
the head-scratching situation of PCIe devices being plugged
into legacy PCI slots and still showing up as PCIe in the
guest.

What about virtio devices, which present themselves either
as legacy PCI or PCIe depending on the kind of slot they
are plugged into? Would they show up as PCIe or legacy PCI
on a PCIe-enabled pseries guest?
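
For reference, what I have in mind is QEMU's disable-legacy /
disable-modern logic for virtio devices, where (if I remember the
details correctly) something like

  -device virtio-net-pci,disable-legacy=auto

ends up modern-only, and hence PCIe, when plugged into a PCIe Root
Port or downstream port, and as a transitional legacy/modern device
otherwise; it's not clear to me which behavior would make sense on a
pseries PHB that happily accepts both kinds of devices.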

> 2) Start exposing the PCI-E hierarchy for pseries guests much more
> like q35, root complexes and all. It's not clear that PAPR actually
> *forbids* exposing the root complex, it just doesn't require it and
> that's not what PowerVM does. But.. there are big questions about
> whether existing guests will cope with this or not. When you start
> adding in multiple passed through devices and particularly virtual
> functions as well, things could get very ugly - we might need to
> construct multiple emulated virtual root complexes or other messes.
> 
> In the short to medium term, I'm thinking option (1) seems pretty
> compelling.

Is the Root Complex not currently exposed? The Root Bus
certainly is, otherwise PCI devices wouldn't work at all, I
assume. And I can clearly see a pci.0 bus in the output
of 'info qtree' for a pseries guest, and a pci.1 too if
I add a spapr-pci-host-bridge.

Maybe I just don't quite get the relationship between Root
Complexes and Root Buses, but I guess my question is: what
is preventing us from simply doing whatever a
spapr-pci-host-bridge is doing in order to expose a legacy
PCI Root Bus (pci.*) to the guest, and create a new
spapr-pcie-host-bridge that exposes a PCIe Root Bus (pcie.*)
instead?
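
In other words, I'm wondering whether we could eventually get to
something like this (entirely hypothetical, since the second device
doesn't exist yet):

  -device spapr-pci-host-bridge,index=1    (legacy PCI bus, pci.1)
  -device spapr-pcie-host-bridge,index=2   (PCIe bus, pcie.2)

with the difference between the two being limited to what kind of
root bus they expose to the guest.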

> So, I'm not sure if the idea of a new machine type has legs or not,
> but let's think it through a bit further. Suppose we have a new
> machine type, let's call it 'papr'. I'm thinking it would be (at
> least with -nodefaults) basically a super-minimal version of pseries:
> so each PHB would have to be explicitly created, the VIO bridge would
> have to be explicitly created, likewise the NVRAM. Not sure about the
> "devices" which really represent firmware features - the RTC, RNG,
> hypervisor event source and so forth.
> 
> Might have some advantages. Then again, it doesn't really solve the
> specific problem here. It means libvirt (or the user) has to
> explicitly choose a PCI or PCI-E PHB to put things on,

libvirt would probably add a

  <controller type='pci' model='pcie-root'/>

to the guest XML by default, resulting in a
spapr-pcie-host-bridge providing pcie.0 and the same
controller / address allocation logic as q35; the user
would be able to use

  <controller type='pci' model='pci-root'/>

instead to stick with legacy PCI. This would only matter
when using '-nodefaults' anyway; when that flag is not
present, a PCIe (or legacy PCI) PHB could be created by
QEMU to make it more convenient for people who are not
using libvirt.

Maybe we should have a different model, specific to
pseries guests, instead, so that all PHBs would look the
same in the guest XML, something like

  <controller type='pci' model='phb-pcie'/>

It would require shuffling libvirt's PCI address allocation
code around quite a bit, but it should be doable. And if it
makes life easier for our users, then it's worth it.

> but libvirt's
> PCI-E address allocation will still be wrong in all probability.
> 
> Guh.
> 
> As an aside, here's a RANT.

[...]

Laine already addressed your points extensively, but I'd
like to add a few thoughts of my own.

* PCI addresses for libvirt guests don't need to be stable
only when performing migration, but also to guarantee
that no change in guest ABI will happen as a consequence
of eg. a simple power cycle.
* Even if libvirt left all PCI address assignment to QEMU,
we would need a way for users to override QEMU's choices,
because one size never fits all and users have all kinds
of crazy, yet valid, requirements. So the first time we
run QEMU, we would have to take the backend-specific
format you suggest, parse it to extract the PCI addresses
that have been assigned, and reflect them in the guest
XML so that the user can change a bunch of them (there's an
example of such an override after this list). Then I
guess we could re-encode it in the backend-specific format
and pass it to QEMU the next time we run it but, at that
point, what's the difference with simply putting the PCI
addresses on the command line directly?
* It's not just about the addresses, by the way, but also
about the controllers - what model is used, how they are
plugged together and so on. More stuff that would have to
round-trip because users need to be able to take matters
into their own hands.
* Design mistakes in any software, combined with strict
backwards compatibility requirements, make it difficult
to make changes in both related components and the
software itself, even when the changes would be very
beneficial. It can be very frustrating at times, but
it's the reality of things and unfortunately there's only
so much we can do about it.
* Eduardo's work, which you mentioned, is going to be very
beneficial in the long run; in the short run, Marcel's
PCIe device placement guidelines, a document that has seen
contributions from QEMU, OVMF and libvirt developers, have
been invaluable to improve libvirt's PCI address allocation
logic. So we're already doing better, and more improvements
are on the way :)
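
To give a concrete example of the kind of override I mentioned above:
the user pinning a device to a specific address by editing the guest
XML, eg.

  <interface type='network'>
    <source network='default'/>
    <model type='virtio'/>
    <address type='pci' domain='0x0000' bus='0x00'
             slot='0x07' function='0x0'/>
  </interface>

which libvirt then has to honor regardless of what it, or QEMU, would
have picked on its own.
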
--
Andrea Bolognani / Red Hat / Virtualization