On Tue, Aug 09, 2016 at 02:14:15PM -0400, Laine Stump wrote:
On 08/09/2016 04:05 AM, Daniel P. Berrange wrote:
> On Mon, Aug 08, 2016 at 12:41:48PM -0400, Laine Stump wrote:
> > On 08/08/2016 04:56 AM, Laine Stump wrote:
> > > When faced with a guest device that requires a PCI address but doesn't
> > > have one manually assigned in the config, libvirt has always insisted
> > > (well... *tried* to insist) on auto-assigning an address that is on a
> > > PCI controller that supports hotplug. One big problem with this is
> > > that it prevents automatic use of a Q35 (or aarch64/virt) machine's
> > > pcie-root (since the PCIe root complex doesn't support hotplug).
> > >
> > > In order to promote simpler domain configs (more devices on pcie-root
> > > rather than on a pci-bridge), this patch adds a new sub-element to all
> > > guest devices that have a PCI address and support hotplug:
> > >
> > > <hotplug require='no'/>
> > >
> > > For devices that have hotplug require='no', we turn off the
> > > VIR_PCI_CONNECT_HOTPLUGGABLE bit in the devFlags when searching for an
> > > available PCI address. Since pcie-root now allows standard PCI
> > > devices, this results in those devices being placed on pcie-root
> > > rather than pci-bridge.
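As an illustration of how the proposed sub-element would look in a device config (the virtio interface here is just my own example host for the new element, not taken from the patch):

```xml
<interface type='network'>
  <source network='default'/>
  <model type='virtio'/>
  <!-- proposed sub-element: this device doesn't require a hotpluggable
       slot, so on q35 it can be auto-assigned to pcie-root -->
  <hotplug require='no'/>
</interface>
```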
> > I've been playing around with this and, by itself, it works very well. With
> > this solved, combined with taking advantage of PCIe for virtio when
> > available, it's very easy to create q35 domains that have no legacy-PCI
> > without needing to resort to manually assigning addresses.
> >
> > However, there is still another item that we need to be able to configure -
> > stating a preference of legacy PCI vs. PCIe when both are available for a
> > device (again, the aim is to do this *without* needing to manually assign an
> > address). The following devices have this choice:
> >
> > 1) vfio assigned devices
> > 2) virtio devices
> > 3) the nec-xhci USB controller
> >
> > You might think that it would always be preferable to use PCIe if it's
> > available, but especially in these "early days" of using PCIe in guests it
> > would be useful to have the ability to *easily* force use of a legacy PCI
> > slot in case some PCIe-related bug is encountered (in particular, people
> > have pointed out in discussions about vfio device assignment that it could
> > be possible for a guest OS to misbehave when presented with a device's PCIe
> > configuration block (which hasn't been visible in the past because the
> > device was attached to a legacy PCI slot)).
> >
> > In order to maintain functionality while any such bugs are figured out and
> > fixed, we need to be able to force the device onto a PCI slot. There are two
> > ways of doing this:
> >
> > 1) manually specify the full PCI address of a legacy PCI slot in the config
> > 2) provide an option in the config that simply says "use any PCI slot" or
> > "use any PCIe slot".
> > Assuming that (1) is too cumbersome, we need to come up with a reasonable
> > name/location for a config option (providing the backend for it will be
> > trivial). Some possible places:
> I prefer a variant on (1), which is to specify an address with only the
> controller index filled out. eg given a q35 bridge topology
>
> <controller type='pci' index='0' model='pcie-root'/>
> <controller type='pci' index='1' model='dmi-to-pci-bridge'>
>   <model name='i82801b11-bridge'/>
> </controller>
> <controller type='pci' index='2' model='pci-bridge'>
>   <model name='pci-bridge'/>
>   <target chassisNr='56'/>
> </controller>
>
> A device would use
>
> <address type="pci" controller="2"> (for pci-bridge placement)
> <address type="pci" controller="0"> (for pcie-root placement)
>
>
> This trivially expands to cover the NUMA use case too, without having to
> invent further elements duplicating NUMA node info again.
>
>
> <controller type='pci' index='0' model='pci-root'/>
> <controller type='pci' index='1' model='pci-expander-bus'>
>   <model name='pxb'/>
>   <target busNr='254'>
>     <node>0</node>
>   </target>
> </controller>
> <controller type='pci' index='2' model='pci-expander-bus'>
>   <model name='pxb'/>
>   <target busNr='255'>
>     <node>1</node>
>   </target>
> </controller>
>
> eg device uses
>
> <address type="pci" controller="1"> (for pxb on NUMA
node 0)
> <address type="pci" controller="2"> (for pxb on NUMA
node 1)
>
>
> Yes, when you first boot a guest, this means the mgmt app has to know
> what controllers exist by default, and/or specify controllers,
It also has to know the rules about which controllers support what type of
devices, and which controllers can plug into which other controllers - the
full topology of PCI controllers will need to be spelled out specifically in
advance, so that the management app can give the index number of the correct
controller.
As it stands now, some amount of this is already necessary, just because
libvirt currently will only automatically add pci-bridge devices when there
aren't enough PCI slots available (the code to automatically add pcie
controllers as necessary is yet to be written). Still, this is all it takes
to get 8 properly connected downstream-ports:
<controller type='pci' model='pcie-root-port'/>
<controller type='pci' model='pcie-switch-upstream-port'/>
<controller type='pci' model='pcie-switch-downstream-port'/>
<controller type='pci' model='pcie-switch-downstream-port'/>
<controller type='pci' model='pcie-switch-downstream-port'/>
<controller type='pci' model='pcie-switch-downstream-port'/>
<controller type='pci' model='pcie-switch-downstream-port'/>
<controller type='pci' model='pcie-switch-downstream-port'/>
<controller type='pci' model='pcie-switch-downstream-port'/>
<controller type='pci' model='pcie-switch-downstream-port'/>
(note that you don't need to say where they are connected, nor do you need
to give an index to each one. As long as there is an available slot of the
correct type for each controller as it's encountered, libvirt will hook them
up and index them correctly)
If the controller number needed to be specified in the device address, then
not only would each controller need to be manually given an index (if you
were specifying the entire domain in a single go), but each device would
need to be given a unique setting - instead of just saying "I want one of
these" (in the future, if there isn't "one of those" available, everything
necessary to provide it will be automatically created), you have to say "I
want *this* one", implying that you know everything about what "this one"
provides (and you will need to manually set up "this one" and everything else
required to provide it).
Also, note that since each pcie-root-port and pcie-switch-downstream-port has
only a single slot, in those cases this:
<address type="pci" controller="1"/>
is really just:
<address type="pci" bus="1"/>
and for the other cases (pci[e]-root, pci-bridge, dmi-to-pci-bridge) you
will still have to keep track of how many devices you've assigned to a
particular controller, to make sure that your request doesn't overflow the
31 (or 32) device limit (which brings up another problem of doing it this
way - different controllers have different numbers of useful slots, so the
management app will have to know about that too).
> but I
> think that's ultimately preferable to inventing an indefinitely
> growing list of extra XML elements to provide tweaks to the PCI address
> assignment logic.
>
> We can simplify the life of mgmt apps in a different way, by using the domain
> XML capabilities to provide full data on the default controllers used for
> each machine type.
And pcie/pci info on every endpoint device (I agree with your assertion that
we should treat every device as potentially capable of hotplug, so we don't
need that info for endpoints).
> So apps would not need to have any machine type specific
> logic in them - they can write code that's entirely metadata driven based
> on the domain capabilities.
The metadata would need to include:
1) upstream connection type (each type of PCI controller needs a different
connection type. I've learned through all this that each controller has
different rules about what it can be connected to. So far there are 9
different types, including pci-endpoint and pcie-endpoint)
2) accepted downstream connection types
3) min and max slot (currently known possibilities: 0-0, 0-31, or 0-32)
4) hotplug supported or not (yes, the management app really would need to
know about this, even if they intended for all devices to be hotpluggable
- they need to be able to avoid pcie-root and dmi-to-pci-bridge)
Then the management app would need to keep track of which addresses are in
use for each controller, so that it would be able to avoid putting too many
devices on any controller.
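To make that concrete, a domain capabilities fragment carrying items 1-4 might look something like this (the element and attribute names are invented purely for illustration; no such schema exists today):

```xml
<domainCapabilities>
  <pciControllers>
    <controller model='pcie-root'>
      <upstream type='none'/>                                         <!-- (1) -->
      <downstream types='pcie-endpoint pcie-root-port dmi-to-pci-bridge'/> <!-- (2) -->
      <slots min='0' max='31'/>                                       <!-- (3) -->
      <hotplug supported='no'/>                                       <!-- (4) -->
    </controller>
    <controller model='pcie-root-port'>
      <upstream type='pcie-root'/>
      <downstream types='pcie-endpoint pcie-switch-upstream-port'/>
      <slots min='0' max='0'/>
      <hotplug supported='yes'/>
    </controller>
  </pciControllers>
</domainCapabilities>
```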
Also, the management app will now need to know for *every* device whether
that device in this particular version of qemu is a PCIe device, a legacy
PCI device, or one of the silly "dual mode" devices that changes according
to the type of bus it's plugged into. Otherwise it won't know which
controller it should plug the device into.
Finally, the management app would need to add logic to grow the bus topology
correctly as needed. Sometimes this is simple and sometimes not. For
example, if there is an open slot available on pcie-root, you can get
another hotpluggable pcie slot by simply adding a pcie-root-port. If not,
then you need to find an open pcie-switch-downstream-port or pcie-root-port,
then connect a pcie-switch-upstream-port to that, and connect one or more
pcie-switch-downstream-ports to that. (note this implies that you should
always maintain at least one unoccupied root-port, downstream-port, or an
open slot on pcie-root).
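Sketching that second, harder case in config terms (as with the earlier example, libvirt would fill in the indexes and addresses):

```xml
<!-- assuming an open pcie-root-port or pcie-switch-downstream-port
     already exists to receive the upstream port -->
<controller type='pci' model='pcie-switch-upstream-port'/>
<controller type='pci' model='pcie-switch-downstream-port'/>
<controller type='pci' model='pcie-switch-downstream-port'/>
```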
The whole point of my exercise has been to 1) get rid of as much legacy PCI
baggage as possible, and 2) do this in a way that prevents/eliminates the
need for multiple management apps to re-implement and maintain essentially
the same code (especially since my trip through PCIe-land has shown me that
it is fraught with lack of documentation (about the controller
implementations, not the PCIe spec), misinformation (changing over time),
misconceptions, and lots of annoying little details). I started out kind of
liking your alternate idea, since it seemed to accomplish that in a simpler
way that didn't require adding yet another knob later. But after thinking it
through, I think that it provides very little extra functionality, and would
require a lot of extra work for the management app, even in a simple case
where you didn't care about PCI vs PCIe.
In the end, my opinion is that the job of creating a proper PCIe topology
and assigning devices to the correct address in that topology is a
non-trivial job that should be implemented in one place. I think libvirt is
in the proper place to do that, it just needs to be provided with a small
amount of information from the user/management app so that libvirt won't be
making any policy decisions, but just assuring that those policies are
properly applied.
Ultimately what you are suggesting here is to stuff a massive amount of
VM configuration policy logic inside libvirt and have a bunch of extra XML
elements which control how that policy works. When some application
finds some aspect of the policy that doesn't work for them, they'll have
to solve all of the hard problems themselves anyway, or live with a policy
that doesn't work right for them. Ultimately we'll end up adding countless
new XML knobs to libvirt whose sole purpose is to be inputs to ever more
complex policy inside libvirt.
We've had a long standing rule that libvirt avoids implementing policy
rules wherever possible, because it is inevitable that you'll never
satisfy all downstream consumers of libvirt. IMHO this rule has been
one of our best decisions and so I'm not in favour of abandoning that
now.
While we do have some existing PCI addressing logic, this was unavoidable
when we first added it, due to the need to move address assignment out of
QEMU and into libvirt, while maintaining back-compat for existing apps
using libvirt. This is now talking about adding logic for cases where we
have no corresponding logic in QEMU and no need for backcompat, so there
is not a compelling reason why this has to live in libvirt.
So I'm a strong -1 on the entire approach in this patch series.
I think it needs to be re-worked so that we can expose the necessary
information in domain capabilities to allow applications using libvirt
to implement their own policies. This would also allow us to create
a shared library outside of libvirt which can implement some "standard"
policies which can be shared if apps wish to have them - eg this is
the role libvirt-designer was intended to fill: provide an opinionated
policy driver API. The attractive thing about using libvirt-designer
or an equivalent API layer outside libvirt is that all the policy
control knobs will exist as actual APIs instead of extra XML elements.
Regards,
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|