On 08/09/2016 04:05 AM, Daniel P. Berrange wrote:
On Mon, Aug 08, 2016 at 12:41:48PM -0400, Laine Stump wrote:
> On 08/08/2016 04:56 AM, Laine Stump wrote:
>> When faced with a guest device that requires a PCI address but doesn't
>> have one manually assigned in the config, libvirt has always insisted
>> (well... *tried* to insist) on auto-assigning an address that is on a
>> PCI controller that supports hotplug. One big problem with this is
>> that it prevents automatic use of a Q35 (or aarch64/virt) machine's
>> pcie-root (since the PCIe root complex doesn't support hotplug).
>>
>> In order to promote simpler domain configs (more devices on pcie-root
>> rather than on a pci-bridge), this patch adds a new sub-element to all
>> guest devices that have a PCI address and support hotplug:
>>
>> <hotplug require='no'/>
>>
>> For devices that have hotplug require='no', we turn off the
>> VIR_PCI_CONNECT_HOGPLUGGABLE bit in the devFlags when searching for an
>> available PCI address. Since pcie-root now allows standard PCI
>> devices, this results in those devices being placed on pcie-root
>> rather than pci-bridge.
> I've been playing around with this and, by itself, it works very well. With
> this solved, combined with taking advantage of PCIe for virtio when
> available, it's very easy to create q35 domains that have no legacy-PCI
> without needing to resort to manually assigning addresses.
>
> However, there is still another item that we need to be able to configure -
> stating a preference of legacy PCI vs. PCIe when both are available for a
> device (again, the aim is to do this *without* needing to manually assign an
> address). The following devices have this choice:
>
> 1) vfio assigned devices
> 2) virtio devices
> 3) the nec-xhci USB controller
>
> You might think that it would always be preferable to use PCIe if it's
> available, but especially in these "early days" of using PCIe in guests it
> would be useful to have to ability to *easily* force use of a legacy PCI
> slot in case some PCIe-related bug is encountered (in particular, people
> have pointed out in discussions about vfio device assignment that it could
> be possible for a guest OS to misbehave when presented with a device's PCIe
> configuration block (which hasn't been visible in the past because the
> device was attached to a legacy PCI slot)).
>
> In order maintain functionality while any such bugs are figured out and
> fixed, we need to be able to force the device onto a PCI slot. There are two
> ways of doing this:
>
> 1) manually specify the full PCI address of a legacy PCI slot in the config
> 2) provide an option in the config that simply says "use any PCI slot" or
> "use any PCIe slot".
> Assuming that (1) is too cumbersome, we need to come up with a reasonable
> name/location for a config option (providing the backend for it will be
> trivial). Some possible places:
I prefer a variant on (1), which is to specifcy an address with only the
controller index filled out. eg given a q35 bridge topology
<controller type='pci' index='0' model='pcie-root'/>
<controller type='pci' index='1'
model='dmi-to-pci-bridge'>
<model name='i82801b11-bridge'/>
</controller>
<controller type='pci' index='2' model='pci-bridge'>
<model name='pci-bridge'/>
<target chassisNr='56'/>
</controller>
A device would use
<address type="pci" controller="2"> (for pci-bridge
placement)
<address type="pci" controller="0"> (for pcie-root
placement)
This trivially expaneds to cover the NUMA use case too, without having to
invent further elements duplicating NUMA node info again.
<controller type='pci' index='0' model='pci-root'/>
<controller type='pci' index='1'
model='pci-expander-bus'>
<model name='pxb'/>
<target busNr='254'>
<node>0</node>
</target>
</controller>
<controller type='pci' index='2'
model='pci-expander-bus'>
<model name='pxb'/>
<target busNr='255'>
<node>1</node>
</target>
</controller>
eg device uses
<address type="pci" controller="1"> (for pxb on NUMA node
0)
<address type="pci" controller="2"> (for pxb on NUMA node
1)
Yes, when you first boot a guest, this means the mgmt app has to know
what controllers exist by default, and/or specify controllers,
It also has to know the rules about which controllers support what type
of devices, and which controllers can plug into which other controllers
- the full topology of PCI controllers will need to be spelled out
specifically in advance, so that the management app can give the index
number of the correct controller.
As it stands now, some amount of this is already necessary, just because
libvirt currently will only automatically add pci-bridge devices when
there aren't enough PCI slots available (the code to automatically add
pcie controllers as necessary is yet to be written). Still, this is all
it takes to get 8 properly connected downstream-ports:
<controller type='pci' model='pcie-root-port'/>
<controller type='pci' model='pcie-switch-upstream-port'/>
<controller type='pci' model='pcie-switch-downstream-port'/>
<controller type='pci' model='pcie-switch-downstream-port'/>
<controller type='pci' model='pcie-switch-downstream-port'/>
<controller type='pci' model='pcie-switch-downstream-port'/>
<controller type='pci' model='pcie-switch-downstream-port'/>
<controller type='pci' model='pcie-switch-downstream-port'/>
<controller type='pci' model='pcie-switch-downstream-port'/>
<controller type='pci' model='pcie-switch-downstream-port'/>
(note that you don't need to say where they are connected, nor do you
need to give an index to each one. As long as there is an available slot
of the correct type for each controller as it's encountered, libvirt
will hook them up and index them correctly)
If the controller number was needed to specify in the device address,
then not only would each controller need to be manually given an index
(if you were specifying the entire domain in a single go), but each
device would need to be given a unique setting - instead of just saying
"I want one of these" (in the future, if there isn't "one of
those"
available, everything necessary to provide it will be automatically
created), you have to say "I want *this* one, implying that you know
everything about what "this one" provides (and you will need to manually
setup "this one" and everything else required to provide it).
Also, note that since each pci-root-port and pci-switch-downstream-port
has only a single slot, in those cases this:
<address type="pci" controller="1"/>
is really just:
<address type="pci" bus="1"/>
and for the other cases (pci[e]-root, pci-bridge, dmi-to-pci-bridge) you
will still have to keep track of how many devices you've assigned to a
particular controller, to make sure that your request doesn't overflow
the 31 (or 32) device limit (which brings up another problem of doing it
this way - different controllers have different numbers of useful slots,
so the management app will have to know about that too).
but I
think that's ultimately preferrable than inventing an indefinitely
growing list of extra XML elements to provide tweaks to teh PCI address
assginment logic.
We can simplify life of mgmt apps in a different way, but using the domain
XML capabilities to provide full data on the default controllers used for
each machine type.
And pcie/pci info on every endpoint device (I agree with your assertion
that we should treat every device as potentially capable of hotplug, so
we don't need that info for endpoints).
So apps would not need to have any machine type specific
logic in them - they can write code that's entirely metadata driven based
on the domain capabilities.
The metadata would need to include:
1) upstream connection type (each type of PCI controller needs a different
connection type. I've learned through all this that each controller
has different
rules about what it can be connected to. So far there are 9
different types,
including pci-endpoint and pcie-endpoint)
2) accepted downstream connection types
3) min and max slot (currently known possibilities: 0-0, 0-31, or 0-32)
4) hotplug supported or not (yes, the management app really would need
to know about this, even if they intended for all devices to be
hotpluggable
- they need to be able to avoid pcie-root and dmi-to-pci-bridge)
Then the management app would need to keep track of which addresses are
in use for each controller, so that it would be able to avoid putting
too many devices on any controller.
Also, the management app will now need to know for *every* device
whether that device in this particular version of qemu is a PCIe device,
a legacy PCI device, or one of the silly "dual mode" devices that
changes according to the type of bus it's plugged into. Otherwise it
won't know which controller it should plug the device into.
Finally, the management app would need to add logic to grow the bus
topology correctly as needed. Sometimes this is simple and sometimes
not. For example, if there is an open slot available on pcie-root, you
can get another hotpluggable pcie slot by simply adding a
pcie-root-port. If not, then you need to find an open
pcie-switch-downstream-port or pcie-root-port, then connect a
pcie-switch-upstream-port to that, and connect one or more
pcie-switch-downstream-ports to that. (note this implies that you should
always maintain at least one unoccupied root-port, downstream-port, or
an open slot on pcie-root).
The whole point of my exercise has been to 1) get rid of as much legacy
PCI baggage as possible, and 2) do this in a way that
prevents/eliminates the need for multiple management apps to
re-implement and maintain essentially the same code (especially since my
trip through PCIe-land has shown me that it is fraught with lack of
documentation (about the controller implementations, not the PCIe spec),
misinformation (changing over time), misconceptions, and lots of
annoying little details). I started out kind of liking your alternate
idea, since it seemed to accomplish that in a simpler way that didn't
require adding yet another knob later. But after thinking it through, I
think that it provides very little extra functionality, and would
require a lot of extra work for the management app, even in a simple
case where you didn't care about PCI vs PCIe.
In the end, my opinion is that the job of creating a proper PCIe
topology and assigning devices to the correct address in that topology
is a non-trivial job that should be implemented in one place. I think
libvirt is in the proper place to do that, it just needs to be provided
with a small amount of information from the user/management app so that
libvirt won't be making any policy decisions, but just assuring that
those policies are properly applied.