On Fri, Nov 25, 2016 at 02:46:21PM +0100, Andrea Bolognani wrote:
> On Wed, 2016-11-23 at 16:00 +1100, David Gibson wrote:
> > > Existing libvirt versions assume that pseries guests have
> > > a legacy PCI root bus, and will base their PCI address
> > > allocation / PCI topology decisions on that fact: they
> > > will, for example, use legacy PCI bridges.
> >
> > Um.. yeah.. trouble is libvirt's PCI-E address allocation probably
> > won't work for spapr PCI-E either, because of the weird PCI-E without
> > root complex presentation we get in PAPR.
>
> So, would the PCIe Root Bus in a pseries guest behave
> differently than the one in a q35 or mach-virt guest?
Yes. I had a long discussion with BenH and got a somewhat better idea
about this.
If only a single host PE (== iommu group) is passed through and there
are no emulated devices, the difference isn't too bad: basically on
pseries you'll see the subtree that would be below the root complex on
q35.
But if you pass through multiple groups, things get weird. On q35,
you'd generally expect physically separate (different slot) devices to
appear under separate root complexes. Whereas on pseries they'll
appear as siblings on a virtual bus (which makes no physical sense for
point-to-point PCI-E).
I suppose we could try treating all devices on pseries as though they
were chipset builtin devices on q35, which will appear on the root
PCI-E bus without root complex. But I suspect that's likely to cause
trouble with hotplug, and it will certainly need different address
allocation from libvirt.
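To make the contrast concrete, here's roughly what the two presentations look like on the QEMU command line (host addresses and IDs are purely illustrative; ioh3420 is the emulated root port available on q35 at the moment):

```
# q35: each passed-through device gets its own emulated root port
qemu-system-x86_64 -M q35 \
    -device ioh3420,id=rp1,slot=1 \
    -device ioh3420,id=rp2,slot=2 \
    -device vfio-pci,host=0000:01:00.0,bus=rp1 \
    -device vfio-pci,host=0000:02:00.0,bus=rp2

# pseries: the same two devices end up as siblings directly
# on the PHB's root bus, with no root ports in between
qemu-system-ppc64 -M pseries \
    -device vfio-pci,host=0001:01:00.0 \
    -device vfio-pci,host=0002:01:00.0
```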
> Does it have a different number of slots, do we have to
> plug different controllers into them, ...?
>
> Regardless of how we decide to move forward with the
> PCIe-enabled pseries machine type, libvirt will have to
> know about this so it can behave appropriately.
So there are kind of two extremes of how to address this. There are a
variety of options in between, but I suspect they're going to be even
more muddled and hideous than the extremes.
1) Give up. You said there's already a flag that says a PCI-E bus is
able to accept vanilla-PCI devices. We add a hack flag that says a
vanilla-PCI bus is able to accept PCI-E devices. We keep address
allocation as it is now - the pseries topology really does resemble
vanilla-PCI much better than it does PCI-E. But, we allow PCI-E
devices, and PAPR has mechanisms for accessing the extended config
space. PCI-E standard hotplug and error reporting will never work,
but PAPR provides its own mechanisms for those, so that should be ok.
2) Start exposing the PCI-E hierarchy for pseries guests much more
like q35, root complexes and all. It's not clear that PAPR actually
*forbids* exposing the root complex, it just doesn't require it and
that's not what PowerVM does. But.. there are big questions about
whether existing guests will cope with this or not. When you start
adding in multiple passed through devices and particularly virtual
functions as well, things could get very ugly - we might need to
construct multiple emulated virtual root complexes or other messes.
In the short to medium term, I'm thinking option (1) seems pretty
compelling.
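To illustrate the extended config space point in option (1), here's a toy model (Python, purely illustrative; the RTAS call name is from PAPR, but the logic is my sketch, not libvirt or QEMU code) of which access path a guest uses for a given config-space offset:

```python
# Toy model of PCI config space access paths; illustrative only.
LEGACY_CONFIG_SIZE = 256    # plain PCI config space, bytes
EXT_CONFIG_SIZE = 4096      # PCI-E extended config space, bytes

def config_access_method(offset, machine):
    """Pick how a guest reaches a config-space register at `offset`.

    On q35, offsets past 255 need the memory-mapped (ECAM/MMCONFIG)
    window; on pseries, the PAPR ibm,read-pci-config RTAS call takes
    the full register offset, so the same hypercall path covers the
    extended space too -- which is why option (1) doesn't lose it.
    """
    if offset >= EXT_CONFIG_SIZE:
        raise ValueError("beyond PCI-E config space")
    if machine == "pseries":
        return "rtas ibm,read-pci-config"   # PAPR's own mechanism
    # q35 / conventional x86:
    if offset < LEGACY_CONFIG_SIZE:
        return "cf8/cfc ports or ECAM"
    return "ECAM only"
```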
> > > > I believe after we introduced the very first
> > > > pseries-pcie-X.Y, we will just stop adding new pseries-X.Y.
> > >
> > > Isn't i440fx still being updated despite the fact that q35
> > > exists? Granted, there are a lot more differences between
> > > those two machine types than just the root bus type.
> >
> > Right, there are heaps of differences between i440fx and q35, and
> > reasons to keep both updated. For pseries we have neither the impetus
> > nor the resources to maintain two different machine type variants,
> > where the only difference is between legacy PCI and weirdly presented
> > PCI-E.
> Calling the PCIe machine type either pseries-2.8 or
> pseries-pcie-2.8 would result in the very same amount of
> work, and in both cases it would be understood that the
> legacy PCI machine type is no longer going to be updated,
> but can still be used to run existing guests.
So, I'm not sure if the idea of a new machine type has legs or not,
but let's think it through a bit further. Suppose we have a new
machine type, let's call it 'papr'. I'm thinking it would be (at
least with -nodefaults) basically a super-minimal version of pseries:
so each PHB would have to be explicitly created, the VIO bridge would
have to be explicitly created, likewise the NVRAM. Not sure about the
"devices" which really represent firmware features - the RTC, RNG,
hypervisor event source and so forth.
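Purely hypothetically, a minimal invocation might then look like the below (no 'papr' machine type exists; spapr-pci-host-bridge and spapr-nvram are real QEMU devices, but whether they'd be creatable like this is a guess):

```
# entirely hypothetical -- there is no 'papr' machine type today
qemu-system-ppc64 -M papr -nodefaults \
    -device spapr-pci-host-bridge,index=0 \
    -device spapr-nvram
# (the VIO bridge, RTC, RNG and so forth would presumably need
#  similar explicit -device options, or stay implicit as firmware
#  features)
```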
Might have some advantages. Then again, it doesn't really solve the
specific problem here. It means libvirt (or the user) has to
explicitly choose a PCI or PCI-E PHB to put things on, but libvirt's
PCI-E address allocation will still be wrong in all probability.
Guh.
As an aside, here's a RANT.
libvirt address allocation. Seriously, W. T. F!
libvirt insists on owning address allocation. That's so it can
recreate the exact same machine at the far end of a migration. So far
so good, except it insists on recording that information in the domain
XML in kinda-sorta-but-not-really back end independent form. But the
thing is libvirt fundamentally CAN NOT get this right. There are all
sorts of possible machine specific address allocation constraints that
can exist - from simply which devices are already created by default
(for varying values of "default") to complicated constraints depending
on details of board wiring. The back end has to know about these - it
implements them. The ONLY way libvirt can get this (temporarily)
right is by duplicating a huge chunk of the back end's allocation
logic, which will inevitably get out of date causing problems just
like this.
Basically the back end will *always* have better information about how
to place devices than libvirt can. So, libvirt should be allowing the
back end to do the allocation, then snapshotting that in a back end
specific format which can be used for creating migration
destinations. But that breaks libvirt's the-domain-XML-is-everything
model.
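The flow I'm arguing for would look something like this (Python sketch; every name here is mine, nothing is a real libvirt or QEMU interface):

```python
# Sketch of "the back end owns allocation": management hands the back
# end an address-free device list, the back end places devices using
# its own constraint knowledge, and the result is kept only as an
# opaque, back-end-specific snapshot for recreating the machine on
# the migration destination.

def backend_allocate(devices):
    """Stand-in for the back end's placement logic (faked here):
    it knows the machine's constraints, the caller doesn't."""
    return {dev: f"pci@0:{slot}" for slot, dev in enumerate(devices, 1)}

def define_domain(devices):
    placement = backend_allocate(devices)   # back end decides
    return dict(placement)                  # opaque snapshot, not domain XML

def migrate(snapshot):
    # Destination recreates the exact machine from the snapshot,
    # without re-running (and possibly diverging from) allocation.
    return dict(snapshot)

src = define_domain(["virtio-net", "virtio-scsi"])
dst = migrate(src)
assert dst == src   # identical machine at the far end
```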
In this regard libvirt doesn't just have a design flaw, it has design
flaws which breed more design flaws like pestilent cancer. And what's
worse the consequences of those design flaws are now making sane
design decisions increasingly difficult in adjacent projects like
qemu.
I'd feel better about this if there seemed to be some recognition of
it, and some necessarily long term plan to improve it, but if there is
I haven't heard of it. Or at least the closest thing seems to be
coming from the qemu side (witness Eduardo's talk at the last KVM
forum, and mine at the one before).
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson