libvirt-devaddr: a new library for device address assignment

We've been doing a lot of refactoring of code in recent times, and also have plans for significant infrastructure changes. We still need to spend time delivering interesting features to users / applications. This mail is to introduce an idea for a solution to a specific area where applications have had long-term pain with libvirt's current "mechanism, not policy" approach - device addressing. This is a way for us to show brand new ideas & approaches for what the libvirt project can deliver in terms of management APIs.

To set expectations straight: I have written no code for this yet, merely identified the gap & conceptual solution.

The device addressing problem
=============================

One of the key jobs libvirt does when processing a new domain XML configuration is to assign addresses to all devices that are present. This involves adding various device controllers (PCI bridges, PCI root ports, IDE/SCSI buses, USB controllers, etc) if they are not already present, and then assigning PCI, USB, IDE, SCSI, etc, addresses to each device so they are associated with controllers. When libvirt spawns a QEMU guest, it will pass full address information to QEMU.

Libvirt, as a general rule, aims to avoid defining and implementing policy around expansion of guest configuration / defaults; however, it is inescapable in the case of device addressing due to the need to guarantee a stable hardware ABI to make live migration and save/restore to disk work. The policy that libvirt has implemented for device addressing is, as much as possible, the same as the addressing scheme QEMU would apply itself.

While libvirt succeeds in its goal of providing a stable hardware ABI, the addressing scheme used is not well suited to all deployment scenarios of QEMU. This is an inevitable result of having a specific assignment policy implemented in libvirt which has to trade off mutually incompatible use cases/goals.

When the libvirt addressing policy has not been sufficient, management applications are forced to take on address assignment themselves, which is a massive non-trivial job with many subtle problems to consider.

Places where libvirt's addressing is insufficient for PCI include:

* Setting up multiple guest NUMA nodes and associating devices to specific nodes
* Pre-emptive creation of extra PCIe root ports, to allow for later device hotplug on PCIe topologies
* Determining whether to place a device on a PCI or PCIe bridge
* Controlling whether a device is placed into a hotpluggable slot
* Controlling whether a PCIe root port supports hotplug or not
* Determining whether to place all devices on distinct slots or buses, vs grouping them all into functions on the same slot
* Ability to expand the device addressing without being on the hypervisor host

Libvirt wishes to avoid implementing many different address assignment policies. It also wishes to keep the domain XML as a representation of the virtual hardware, not add a bunch of properties to it which merely serve as tunable input parameters for device addressing algorithms.

There is thus a dilemma here. Management applications increasingly need fine-grained control over device addressing, while libvirt doesn't want to expose fine-grained policy controls via the XML.
The new libvirt-devaddr API
===========================

The way out of this is to define a brand new virt management API which tackles this specific problem in a way that addresses all the problems mgmt apps have with device addressing and explicitly provides a variety of policy impls with tunable behaviour.

By "new API", I actually mean an entirely new library, completely distinct from libvirt.so, or anything else we've delivered so far. The closest we've come to delivering something at this kind of conceptual level would be the abortive attempt we made with "libvirt-builder" to deliver a policy-driven API instead of a mechanism-based one. This proposal is still quite different from that attempt.

At a high level:

* The new API is "libvirt-devaddr" - short for "libvirt device addressing"
* As input it will take
  1. The guest CPU architecture and machine type
  2. A list of global tunables specifying desired behaviour of the address assignment policy
  3. A minimal list of devices needed in the virtual machine, with optional addresses and optional per-device tunables to override the global tunables
* As output it will emit
  1. fully expanded list of devices needed in the virtual machine, with addressing information sufficient to ensure stable hardware ABI

Initially the API would implement something that behaves the same way as libvirt's current address assignment API.

The intended usage would be:

* Mgmt application makes a minimal list of devices they want in their guest
* List of devices is fed into libvirt-devaddr API
* Mgmt application gets back a full list of devices & addresses
* Mgmt application writes a libvirt XML doc using this full list & addresses
* Mgmt application creates the guest in libvirt

IOW, this new "libvirt-devaddr" API is intended to be used prior to creating the XML that is used by libvirt. The API could also be used prior to needing to hotplug a new device to an existing guest. This API is intended to be a deliverable of the libvirt project, but it would be completely independent of the current libvirt API. Most especially note that it would NOT use the domain XML in any way. This gives applications maximum flexibility in how they consume this functionality, not trying to force a way to build domain XML.

It would have greater freedom in its API design, making different choices from libvirt.so on topics such as programming language (C vs Go vs Python etc), API stability timeframe (forever stable vs sometimes changing API), data formats (structs vs YAML/JSON vs XML etc), and of course the conceptual approach (policy vs mechanism).

The expectation is that this new API would be most likely to be consumed by KubeVirt, OpenStack, Kata, as the list of problems shown earlier is directly based on issues seen working with KubeVirt & OpenStack in particular. It is not limited to these applications and is broadly useful as a conceptual thing.

It would be a goal that this API should also be used by libvirt itself to replace its current internal device addressing impl. Essentially the new API should be seen as a way to expose/extract the current libvirt internal algorithm, making it available to applications in a flexible manner. I don't anticipate actually copying the current addressing code in libvirt as-is, but it would certainly serve as a reference for the kind of logic we need to implement, so you might consider it a "port" or "rewrite" in some very rough sense.

I think this new API concept is a good way for the project to make a start in using Go for libvirt.
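To make the shape of this more concrete, here is a purely illustrative sketch of how those inputs and outputs might look as Go types. Every name in it (package, types, fields, function) is a hypothetical assumption for the example, not an existing libvirt API.

package devaddr

// Address is a bus-specific device address (PCI, USB, SCSI, IDE, ...).
type Address struct {
    Type     string // "pci", "usb", "scsi", "ide", ...
    Domain   uint16 // PCI only
    Bus      uint8
    Slot     uint8
    Function uint8
}

// Device is one entry in the minimal input list or the expanded output list.
type Device struct {
    Type     string            // "disk", "interface", "controller", ...
    Model    string            // e.g. "virtio-blk", "pcie-root-port"
    Address  *Address          // nil on input means "please assign one"
    Tunables map[string]string // per-device overrides of the global tunables
}

// Tunables are the global knobs controlling the assignment policy, e.g. how
// many spare PCIe root ports to create for future hotplug.
type Tunables struct {
    SpareRootPorts     int
    GroupIntoFunctions bool
}

// Assign takes the guest architecture, machine type, global tunables and the
// minimal device list, and returns the fully expanded list (including any
// controllers that had to be added) with stable addresses filled in.
func Assign(arch, machine string, t Tunables, devs []Device) ([]Device, error) {
    // The real assignment logic would live here; this sketch only shows
    // the intended signature.
    return nil, nil
}

A management application would call something like Assign("x86_64", "pc-q35-4.2", tunables, devs) and then render whatever configuration format it needs from the returned list.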
The functionality covered has a clearly defined scope limit, making it practical to deliver a real impl in a reasonably short time frame. Extracting this will provide a real-world benefit to our application consumers, solving many long-standing problems they have with libvirt, and thus justify the effort of doing this work in libvirt in a non-C language.

The main question mark would be about how we might make this functionality available to Python apps if we chose Go. It is possible to expose a C API from Go, and we would need this in order to consume it from libvirt. There is then the need to manually write a Python API binding, which is tedious work.

Regards,
Daniel

--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On Wed, 4 Mar 2020, 14:51 Daniel P. Berrangé, <berrange@redhat.com> wrote:
[...]
Places where libvirt's addressing is insufficient for PCI include:

* Setting up multiple guest NUMA nodes and associating devices to specific nodes
* Pre-emptive creation of extra PCIe root ports, to allow for later device hotplug on PCIe topologies
* Determining whether to place a device on a PCI or PCIe bridge
* Controlling whether a device is placed into a hotpluggable slot
* Controlling whether a PCIe root port supports hotplug or not
* Determining whether to place all devices on distinct slots or buses, vs grouping them all into functions on the same slot
* Ability to expand the device addressing without being on the hypervisor host
(I don't understand the last bullet point)
[...]
IOW, this new "libvirt-devaddr" API is intended to be used prior to creating the XML that is used by libvirt. The API could also be used prior to needing to hotplug a new device to an existing guest. This API is intended to be a deliverable of the libvirt project, but it would be completely independent of the current libvirt API. Most especially note that it would NOT use the domain XML in any way. This gives applications maximum flexibility in how they consume this functionality, not trying to force a way to build domain XML.
This procedure forces Mgmt to learn a new language to describe device placement. Mgmt (or should I just say "we"?) currently expresses the "minimal list of devices" in XML form and passes it to libvirt. Here we are asked to pass it once to libvirt-devaddr, parse its output, and feed it as XML to libvirt.

I believe it would be easier to use the domxml as the base language for the new library, too. libvirt-devaddr would accept it with various hints (expressed as its own extension to the XML?) such as "place all of these devices in the same NUMA node", "keep on root bus" or "separate these two chattering devices to their own bus". The output of libvirt-devaddr would be a domxml with <devices> filled with controllers and addresses, readily available for consumption by libvirt.
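As a purely illustrative example of that suggestion - the "devaddr" namespace URI and the hint attribute names below are invented for this sketch, not an existing schema, while the device elements themselves are ordinary domain XML - the input might look something like:

<!-- Hypothetical minimal <devices> input, with made-up placement hints -->
<devices xmlns:devaddr="urn:example:devaddr-hints">
  <interface type='network' devaddr:numa-node='0' devaddr:hotpluggable='yes'>
    <source network='default'/>
    <model type='virtio'/>
  </interface>
  <disk type='file' device='disk' devaddr:numa-node='0'>
    <source file='/var/lib/libvirt/images/guest.qcow2'/>
    <target dev='vda' bus='virtio'/>
  </disk>
</devices>

The output would then be the same document with the <controller> elements and per-device <address> elements filled in.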
It would have greater freedom in its API design, making different choices from libvirt.so on topics such as programming language (C vs Go vs Python etc), API stability timeframe (forever stable vs sometimes changing API), data formats (structs vs YAML/JSON vs XML etc), and of course the conceptual approach (policy vs mechanism).
The expectation is that this new API would be most likely to be consumed by KubeVirt, OpenStack, Kata, as the list of problems shown earlier is directly based on issues seen working with KubeVirt & OpenStack in particular.
And thank you for that.

On Fri, Mar 13, 2020 at 11:23:44AM +0200, Dan Kenigsberg wrote:
On Wed, 4 Mar 2020, 14:51 Daniel P. Berrangé, <berrange@redhat.com> wrote:
[...]

* Ability to expand the device addressing without being on the hypervisor host
(I don't understand the last bullet point)
I'm not sure if this is still the case, but at some point in time there was a desire from KubeVirt to be able to expand the users' configuration when loaded in KubeVirt, filling in various defaults for devices. This would run when the end user YAML/JSON config was first posted to the k8s API for storage; some arbitrary amount of time later the config gets chosen to run on a virtualization host, at which point it is turned into libvirt domain XML.
[...]
This procedure forces Mgmt to learn a new language to describe device placement. Mgmt (or should I just say "we"?) currently expresses the "minimal list of devices" in XML form and passes it to libvirt. Here we are asked to pass it once to libvirt-devaddr, parse its output, and feed it as XML to libvirt.
I'm not necessarily suggesting we even need a document format at the core API level. I could easily see the API working in terms of a list of Go structs, with tunables being normal method parameters. A JSON format could be an optional way to serialize the Go structs, but if the app were written in Go the JSON may not be needed at all.
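As a purely hypothetical illustration of that split (none of these type or field names exist anywhere yet), the core data could be plain Go structs, with JSON available as an optional serialization for callers that want a document format:

package main

import (
    "encoding/json"
    "fmt"
)

// Device is a hypothetical entry in the device list. Go callers would pass
// the structs around directly; the json tags merely enable an optional
// wire/storage format for non-Go consumers.
type Device struct {
    Type    string `json:"type"`
    Model   string `json:"model,omitempty"`
    Address string `json:"address,omitempty"` // filled in by the library
}

func main() {
    devs := []Device{
        {Type: "interface", Model: "virtio"},
        {Type: "disk", Model: "virtio-blk"},
    }
    // Optional JSON view of the same data.
    out, err := json.MarshalIndent(devs, "", "  ")
    if err != nil {
        panic(err)
    }
    fmt.Println(string(out))
}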
I believe it would be easier to use the domxml as the base language for the new library, too. libvirt-devaddr would accept it with various hints (expressed as its own extension to the XML?) such as "place all of these devices in the same NUMA node", "keep on root bus" or "separate these two chattering devices to their own bus". The output of libvirt-devaddr would be a domxml with <devices> filled with controllers and addresses, readily available for consumption by libvirt.
I don't believe that using the libvirt domain XML is a good idea for this as it unnecessarily constrains the usage scenarios. Most management applications do not use the domain XML as their canonical internal storage format. KubeVirt has its JSON/YAML schema for k8s API, OpenStack/RHEV just store metadata in their DB, others vary again. Some of these applications benefit from being able to expand device topology/addressing, a long time before they get anywhere near use of domain XML - the latter only matters when you come to instantiate a VM on a particular host.

We could of course have a convenience method which optionally generates a domain XML template from the output list of devices, if someone believes that's useful to standardize on, but I don't think the domain XML should be the core format.

I would also like this library to be usable for scenarios in which libvirt is not involved at all. One of the strange things about the QEMU driver in libvirt compared to the other hypervisor drivers is that it is missing an intermediate API layer. In other drivers the hypervisor platform itself provides a full management API layer, and libvirt merely maps the libvirt APIs to the underlying mgmt API or data formats. IOW, libvirt is just a mapping layer.

QEMU though only really provides a few low level building blocks, alongside other building blocks you have to pull in from Linux. It doesn't even provide a configuration file. Libvirt pulls all these pieces together to form the complete management QEMU API, as well as mapping everything onto the libvirt domain XML & APIs. I think there is scope & interest/demand to look at creating an intermediate layer that provides a full management layer for QEMU, such that libvirt can eventually become just a mapping layer for QEMU. In such a scenario the libvirt-devaddr library is still very useful but you don't want it using the libvirt domain XML, as that's not likely to be the format in use.

Regards,
Daniel

On Fri, Mar 13, 2020 at 12:47 PM Daniel P. Berrangé <berrange@redhat.com> wrote:
On Fri, Mar 13, 2020 at 11:23:44AM +0200, Dan Kenigsberg wrote:
[...]
(I don't understand the last bullet point)
I'm not sure if this is still the case, but at some point in time there was a desire from KubeVirt to be able to expand the users' configuration when loaded in KubeVirt, filling in various defaults for devices. This would run when the end user YAML/JSON config was first posted to the k8s API for storage; some arbitrary amount of time later the config gets chosen to run on a virtualization host, at which point it is turned into libvirt domain XML.
Ah, I had not heard about this before, but I see why something like this would be useful even without libvirt-devaddr. Having something like virDomainDryRunXML() would have eliminated old race conditions we had in oVirt.
[...]
I'm not necessarily suggesting we even need a document format at the core API level. I could easily see the API working in terms of a list of Go structs, with tunables being normal method parameters. A JSON format could be an optional way to serialize the Go structs, but if the app were written in Go the JSON may not be needed at all.
I believe it would be easier to use the domxml as the base language for the new library, too. libvirt-devaddr would accept it with various hints (expressed as its own extension to the XML?) such as "place all of these devices in the same NUMA node", "keep on root bus" or "separate these two chattering devices to their own bus". The output of libvirt-devaddr would be a domxml with <devices> filled with controllers and addresses, readily available for consumption by libvirt.
I don't believe that using the libvirt domain XML is a good idea for this as it unnecessarily constrains the usage scenarios. Most management applications do not use the domain XML as their canonical internal storage format. KubeVirt has its JSON/YAML schema for k8s API, OpenStack/RHEV just store metadata in their DB, others vary again. Some of these applications benefit from being able to expand device topology/addressing, a long time before they get anywhere near use of domain XML - the latter only matters when you come to instantiate a VM on a particular host.
Nevertheless, your suggested Go struct would become a third representation of virtual devices, on top of domxml and the Mgmt-canonical one. Maybe I'm just overconservative. Let us ask kubevirt-dev what their preferred form for consuming this suggested API would be.

TL;DR - I'm not as anti-XML as the proposal seems to be, but also not pro-XML. I also (after thinking about it) understand the advantage of putting this in a separate library. So yeah, let's go for it!

On 3/13/20 6:47 AM, Daniel P. Berrangé wrote:
[...]
I'm not sure if this is still the case, but at some point in time there was a desire from KubeVirt to be able to expand the users' configuration when loaded in KubeVirt, filling in various defaults for devices. This would run when the end user YAML/JSON config was first posted to the k8s API for storage; some arbitrary amount of time later the config gets chosen to run on a virtualization host, at which point it is turned into libvirt domain XML.
If I recall the discussion properly, the context was that we wanted KubeVirt to remember all the stuff like PCI addresses, MAC addresses, and the exact machinetype to be "backfilled" from libvirt into the KubeVirt config, but for them that's a one-way street. So having all these things set by a separate API (even in a separate library) would definitely be an advantage for them, as long as all the same info was available at that time (e.g. you really need to know the machinetypes supported by the specific qemu that is going to be used in order to set the exact machinetype).
[...]
By "new API", I actually mean an entirely new library, completely distinct from libvirt.so, or anything else we've delivered so far.
I was at first against the idea of a completely separate library, since each new library means a new package to be maintained and installed. However, I do see the advantage of being completely disconnected from libvirt, since there may be scenarios where libvirt isn't needed (maybe libvirt is on a different host, or maybe something else (libvirt-ng? :-P) is being used). Keeping this separate means it can be used in other scenarios. So now I agree with this.
[...]
* The new API is "libvirt-devaddr" - short for "libvirt device addressing"
It's more than just device addresses though. (On the other hand, a name is just a name, so...)
* As input it will take
1. The guest CPU architecture and machine type
To repeat the point above - do we expect libvirt-devaddr to provide the exact machinetype? If so, what will be the mechanism for telling it exactly which machinetypes are supported? Will it need to replicate all of libvirt's qemu capabilities code? (and would that really work if, say, libvirt-devaddr is being used on a machine different from the machine where the virtual machine will eventually be run?)
2. A list of global tunables specifying desired behaviour of the address assignment policy
3. A minimal list of devices needed in the virtual machine, with optional addresses and optional per-device tunables to override the global tunables
* As output it will emit
1. fully expanded list of devices needed in the virtual machine, with addressing information sufficient to ensure stable hardware ABI
I know you already know it and it's implied in what you say, but just to make sure it's clear to anybody else, the "expanded list of devices" will also include all PCI (and SCSI and SATA and whatever) controllers needed for the entire hierarchy. (Or maybe you said that and I missed it. Wouldn't surprise me.)

This means that the library will need to know which types of which controllers are supported for the machinetype being requested (and of course what is supported by each controller). Is it going to query qemu? Which qemu - the one on the host where libvirt-devaddr is being called, I suppose, but that won't necessarily be the same as the host where the guest will eventually run.

Will libvirt-devaddr care about things all the way to the level of which type of pcie-root-port to use (for example)? And what about all the odd attributes of various controllers that libvirt sets to a default value and then stores in the XML (chassis id, etc)? I guess we need to take care of all those as well.
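To make that concrete with an invented example (the device models, the address notation and the types below are illustrative only, echoing the hypothetical sketches earlier in the thread):

package main

import "fmt"

// Minimal hypothetical device entry, for illustration only.
type Device struct {
    Type    string
    Model   string
    Address string
}

func main() {
    // Input: the only device the application actually asked for.
    in := []Device{{Type: "disk", Model: "virtio-blk"}}

    // Hypothetical expanded output: the library had to add a PCIe root port
    // (a controller) before it could give the disk a stable PCI address.
    out := []Device{
        {Type: "controller", Model: "pcie-root-port", Address: "pci 0000:00:02.0"},
        {Type: "disk", Model: "virtio-blk", Address: "pci 0000:01:00.0"},
    }

    fmt.Printf("%d device in, %d devices out\n", len(in), len(out))
}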
[...]
IOW, this new "libvirt-devaddr" API is intended to be used prior to creating the XML that is used by libvirt. The API could also be used prior to needing to hotplug a new device to an existing guest.
So everything returned from the original call would need to be kept around in that form (or the application would need to be able to reproduce it on demand), and that's then fed into the API. I guess this could just be the same API - similar to how libvirt acts now, it would accept any address info provided, and then assign addresses wherever they were omitted.
This API is intended to be a deliverable of the libvirt project, but it would be completely independent of the current libvirt API. Most especially note that it would NOT use the domain XML in any way. This gives applications maximum flexibility in how they consume this functionality, not trying to force a way to build domain XML.
I was originally going to argue in favor of using the same XML, since we otherwise have to convert back and forth. But during the extra long time I've taken to think about it, I think I agree that this isn't important, especially if the chosen format is as simple as possible.
This procedure forces Mgmt to learn a new language to describe device placement. Mgmt (or should I just say "we"?) currently expresses the "minimal list of devices" in XML form and passes it to libvirt. Here we are asked to pass it once to libvirt-devaddr, parse its output, and feed it as XML to libvirt.

I'm not necessarily suggesting we even need a document format at the core API level. I could easily see the API working in terms of a list of Go structs, with tunables being normal method parameters. A JSON format could be an optional way to serialize the Go structs, but if the app were written in Go the JSON may not be needed at all.
"Using JSON when we eventually need XML is just using XML with extra steps". Or something like that. Is JSON really that much simpler than XML? Anyway, since we aren't saddled with the precondition that "everything must be stable and backward compatible", there's freedom to experiment, so I guess it's not really necessary to spend too much time debating and trying to make the "definite 100% sure best decision". We can just pick something and try it. If it works out, great; if it doesn't then we pick something else :-)
I believe it would be easier to use the domxml as the base language for the new library, too. libvirt-devaddr would accept it with various hints (expressed as its own extension to the XML?) such as "place all of these devices in the same NUMA node", "keep on root bus" or "separate these two chattering devices to their own bus". The output of libvirt-devaddr would be a domxml with <devices> filled with controllers and addresses, readily available for consumption by libvirt.

I don't believe that using the libvirt domain XML is a good idea for this as it unnecessarily constrains the usage scenarios. Most management applications do not use the domain XML as their canonical internal storage format. KubeVirt has its JSON/YAML schema for k8s API, OpenStack/RHEV just store metadata in their DB, others vary again. Some of these applications benefit from being able to expand device topology/addressing, a long time before they get anywhere near use of domain XML - the latter only matters when you come to instantiate a VM on a particular host.
This explains why it's not necessary to use XML. But I don't see use of XML as "unnecessarily constraining" the usage scenarios. Does it make the code (on either side) unnecessarily inefficient? Does it require pulling in libraries that applications otherwise wouldn't need? Is the required code too complex?
We could of course have a convenience method which optionally generates a domain XML template from the output list of devices, if someone believes that's useful to standardize on, but I don't think the domain XML should be the core format.
I would also like this library to be usable for scenarios in which libvirt is not involved at all. One of the strange things about the QEMU driver in libvirt compared to the other hypervisor drivers is that it is missing an intermediate API layer. In other drivers the hypervisor platform itself provides a full management API layer, and libvirt merely maps the libvirt APIs to the underlying mgmt API or data formats. IOW, libvirt is just a mapping layer.
When you're just a "mapping layer", and you're expected to transparently map in both directions, it gets problematic. Especially when there are multiple ways of describing the same setup, or options supported at one end that are ignored/not supported at the other. Not sure why I'm replying to this point, just when I hear "mapping layer" I think about the fact that netcf was never able to deal with the many different ways that debian interfaces files could be written, or ignore but leave in place extra ifcfg options it didn't support (that's just a couple that come to mind, and we shouldn't derail this conversation to talk about them :-/)
QEMU though only really provides a few low level building blocks, alongside other building blocks you have to pull in from Linux. It doesn't even provide a configuration file. Libvirt pulls all these pieces together to form the complete management QEMU API, as well as mapping everything onto the libvirt domain XML & APIs. I think there is scope & interest/demand to look at creating an intermediate layer that provides a full management layer for QEMU, such that libvirt can eventually become just a mapping layer for QEMU. In such a scenario the libvirt-devaddr library is still very useful but you don't want it using the libvirt domain XML, as that's not likely to be the format in use.
My opinion would be that it's not necessary for libvirt domain XML (or a subset) to be the format, but that it also shouldn't necessarily be avoided (unless the alternative is better in some quantifiable way). Anyway, in the end I think my opinion is we should push ahead and think about consequences of the specifics later, after some experimenting. I'd love to help if there's a place for it. I'm just not sure where/how I could contribute, especially since I have only about 4 hours worth of golang knowledge :-) (certainly not against getting more though!)

On 3/19/20 4:00 PM, Laine Stump wrote:
TL;DR - I'm not as anti-XML as the proposal seems to be, but also not pro-XML. I also (after thinking about it) understand the advantage of putting this in a separate library. So yeah, let's go for it!
[...]
Anyway, in the end I think my opinion is we should push ahead and think about consequences of the specifics later, after some experimenting. I'd love to help if there's a place for it. I'm just not sure where/how I could contribute, especially since I have only about 4 hours worth of golang knowledge :-) (certainly not against getting more though!)
I'll start writing some code here and there for this initiative. Has anyone already started doing something? Otherwise I'll push code to GitHub/GitLab when I have stuff to show.

My understanding from the discussions is that the API is going to supply JSON responses instead of domxml (domXML might be supplied as an option, but it wouldn't be the default format used). Is that correct?

Thanks,

DHB

On 4/30/20 2:20 PM, Daniel Henrique Barboza wrote:
I'll start writing some code here and there for this initiative. Has anyone already started doing something? Otherwise I'll push code to GitHub/GitLab when I have stuff to show.
Since I've been involved with libvirt's PCI address assignment code for quite a long time (and so there's a lot of knowledge about it embedded in my brain) I *should be* starting to do something with libvirt-devaddr, although I've been delinquent in asking danpb for more specific direction - as I said before this is a completely new medium for me, so I don't really know where to start.

Based on your enthusiasm, I'm guessing you have more experience than me with using go and various go libraries, is that right? If so that could be very helpful in getting it off the ground.

Here are two things that would help enable me to make useful contributions:

1) a basic "source tree for a go library" setup in a libvirt-subproject on gitlab (since gitlab is the official location of libvirt projects now), including basic commit and CI hooks/test cases. I'm guessing we could borrow/steal a lot from what was done by the people who participated in "virt-blocks" last fall. Andrea - any advice/suggestions to give here?

(A side question - should we put it under the libvirt umbrella on gitlab right away? Or play around in personal trees at first and then later fork it into an official libvirt project?)

2) a more concrete idea of what the API should look like. This is always the toughest part for me, since it is what the rest of the world sees, so it needs to be intelligible and capable of expansion, and I have a long history of making questionable choices that come back to haunt me (and everybody else! :-P). Since danpb has made good decisions in this area in the past (and since the original proposal is his), I'm thinking/hoping he can help provide direction to minimize mis-steps (on the other hand, I know he's really busy, so maybe he was just hoping that someone else would grab up his proposal and run with it).

Once those things are agreed upon and mostly in place, I think it will be more practical for multiple people to contribute, and in particular I will be able to put my memories of the idiosyncrasies of libvirt and its PCI and other address allocation to better use (and hopefully I'll become more familiar with go in the process).
My understanding from the discussions is that the API is going to supply JSON responses instead of domxml (domXML might be supplied as an option, but it wouldn't be the default format used). Is that correct?
I don't think there was any hard specification of the output format, just that it doesn't need to be married to XML. I've tended to view the current love affair with JSON as similar to 200x's love affair with XML, so I don't have any assumption that it's the best choice, but if there's nothing better, then I guess why not?

On 4/30/20 4:14 PM, Laine Stump wrote:
[...]
Here are two things that would help enable me to make useful contributions:
1) a basic "source tree for a go library" setup in a libvirt-subproject on gitlab (since gitlab is the official location of libvirt projects now), including basic commit and CI hooks/test cases. I'm guessing we could borrow/steal a lot from what was done by the people who participated in "virt-blocks" last fall. Andrea - any advice/suggestions to give here?
It would be of great help if we could get "inspiration" from another project for the initial CI/unit test skeleton.
(A side question - should we put it under the libvirt umbrella on gitlab right away? Or play around in personal trees at first and then later fork it into an official libvirt project?)
I'd rather put it under a personal gitlab tree first. Putting it inside the libvirt umbrella might generate unrealistic expectations for something that's still in its infancy.
2) a more concrete idea of what the API should look like. This is always the toughest part for me, since it is what the rest of the world sees, so it needs to be intelligible and capable of expansion, and I have a long history of making questionable choices that come back to haunt me (and everybody else! :-P). Since danpb has made good decisions in this area in the past (and since the original proposal is his), I'm thinking/hoping he can help provide direction to minimize mis-steps (on the other hand, I know he's really busy, so maybe he was just hoping that someone else would grab up his proposal and run with it).
My initial plan is to get the logic/APIs design from Libvirt, rename them in a Gopher fashion, re-code it with Go and call it a day :) In all seriousness, we have some room to do "not so good" APIs and change them if necessary since it's a fresh project. At this stage we can start with simplified versions of the use cases danpb described in the first email of the thread and play it by ear. Thanks, DHB

On Thu, Apr 30, 2020 at 09:00:51PM -0300, Daniel Henrique Barboza wrote:
My initial plan is to get the logic/APIs design from Libvirt, rename them in a Gopher fashion, re-code it with Go and call it a day :)
That is really not a way I would like to go, as that means we immediately inherit the design bias of the current libvirt code. The goal is to be able to replace current libvirt code eventually, but I don't want it to just be a clone of that code, as I think it misses the opportunity to try to design something better than what we have done.

As a particular example.. the current placement code has no conceptual model of machine types present in QEMU. We've just got many "if" tests that take different codepaths based on heuristics about the machine type. I would like the new API to have an explicit conceptual model of each machine type we intend to support. ie it should have full representation of the default topology of devices that are mandated by the machine type. Ideally this modelling should be extendable without having to write code in the placement model. ie we should be able to load a "i440fx.yaml" file describing the i440fx machine type and the placement logic "just works". We should not have any tests like "if (is i440fx)" in the code itself.

The libvirt code shows us the range of features we need to support at least though.

Regards, Daniel
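[Editor's note: to make the "i440fx.yaml" idea concrete, here is a minimal sketch of a data-driven machine-type description. Every type, field and file name below is invented for illustration and is not part of any existing libvirt-devaddr code; JSON is used only to stay within the Go standard library, and a YAML file would be handled the same way.]

// Sketch only: a data-driven machine-type model. None of these names
// come from libvirt; they are purely illustrative.
package main

import (
	"encoding/json"
	"fmt"
)

// BuiltinController describes a controller the machine type provides
// by default (e.g. the i440fx PCI root bus).
type BuiltinController struct {
	Type    string `json:"type"`    // "pci-root", "pcie-root", ...
	Slots   int    `json:"slots"`   // addressable slots on this controller
	Hotplug bool   `json:"hotplug"` // whether devices can be hotplugged here
}

// MachineType is the declarative description that would live in a data
// file rather than in code.
type MachineType struct {
	Name        string              `json:"name"`
	Controllers []BuiltinController `json:"controllers"`
}

// FirstHotpluggable picks a default controller that accepts hotplug,
// with no machine-type specific branching anywhere.
func (m *MachineType) FirstHotpluggable() (*BuiltinController, bool) {
	for i := range m.Controllers {
		if m.Controllers[i].Hotplug {
			return &m.Controllers[i], true
		}
	}
	return nil, false
}

const i440fxDescription = `{
  "name": "pc-i440fx",
  "controllers": [
    {"type": "pci-root", "slots": 31, "hotplug": true}
  ]
}`

func main() {
	var mt MachineType
	if err := json.Unmarshal([]byte(i440fxDescription), &mt); err != nil {
		panic(err)
	}
	if c, ok := mt.FirstHotpluggable(); ok {
		fmt.Printf("%s: hotplug onto %s (%d slots)\n", mt.Name, c.Type, c.Slots)
	}
}

Supporting another machine type would then mean shipping another description like i440fxDescription, not adding another code path.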

On 5/1/20 7:40 AM, Daniel P. Berrangé wrote:
On Thu, Apr 30, 2020 at 09:00:51PM -0300, Daniel Henrique Barboza wrote:
My initial plan is to get the logic/APIs design from Libvirt, rename them in a Gopher fashion, re-code it with Go and call it a day :)
That is really not a way I would like to go, as that means we immediately inherit the design bias of the current libvirt code. The goal is to be able to replace current libvirt code eventually, but I don't want it to just be a clone of that code, as I think it misses the opportunity to try to design something better than what we have done.
As a particular example.. the current placement code has no conceptual model of machine types present in QEMU. We've just got many "if" tests that take different codepaths based on heuristics about the machine type. I would like the new API to have an explicit conceptual model of each machine type we intend to support. ie it should have full representation of the default topology of devices that are mandated by the machine type. Ideally this modelling should be extendable without having to write code in the placement model. ie we should be able to load a "i440fx.yaml" file describing the i440fx machine type and the placement logic "just works". We should not have any tests like "if (is i440fx)" in the code itself.
That's a sound idea. I'd say that instead of basing yourselves on the QEMU machine type addressing we should aim at implementing the machine specification instead, even as a long term goal.

E.g. let's say that Libvirt wants addressing services for a hotplug in a QEMU i440fx guest. Instead of devaddr implementing "this is how the i440fx addressing works in QEMU", devaddr should be more concerned about "this is how the i440fx processor addressing works". If QEMU does additional/different things on top of that then qemu_driver.c should operate on that. This allows devaddr to be hypervisor agnostic.
The libvirt code shows us the range of features we need to support at least though.
I'll see if I can take a look at all the "if (pseries)" checks in the Libvirt device addressing code to get an idea of how a PowerPC addressing model would work compared to x86. DHB
Regards, Daniel

On Fri, May 01, 2020 at 12:57:54PM -0300, Daniel Henrique Barboza wrote:
On 5/1/20 7:40 AM, Daniel P. Berrangé wrote:
On Thu, Apr 30, 2020 at 09:00:51PM -0300, Daniel Henrique Barboza wrote:
My initial plan is to get the logic/APIs design from Libvirt, rename them in a Gopher fashion, re-code it with Go and call it a day :)
That is really not a way I would like to go, as that means we immediately inherit the design bias of the current libvirt code. The goal is to be able to replace current libvirt code eventually, but I don't want it to just be a clone of that code, as I think it misses the opportunity to try to design something better than what we have done.
As a particular example.. the current placement code has no conceptual model of machine types present in QEMU. We've just got many "if" tests that take different codepaths based on heuristics about the machine type. I would like the new API to have an explicit conceptual model of each machine type we intend to support. ie it should have full representation of the default topology of devices that are mandated by the machine type. Ideally this modelling should be extendable without having to write code in the placement model. ie we should be able to load a "i440fx.yaml" file describing the i440fx machine type and the placement logic "just works". We should not have any tests like "if (is i440fx)" in the code itself.
That's a sound idea. I'd say that instead of basing yourselves on the QEMU machine type addressing we should aim at implementing the machine specification instead, even as a long term goal.
E.g. let's say that Libvirt wants addressing services for a hotplug in a QEMU i440fx guest. Instead of devaddr implementing "this is how the i440fx addressing works in QEMU", devaddr should be more concerned about "this is how the i440fx processor addressing works". If QEMU does additional/different things on top of that then qemu_driver.c should operate on that. This allows devaddr to be hypervisor agnostic.
Yes, it was not intended to be tied to QEMU's specific implementation either. It should be a generic modelling / addressing system.
The libvirt code shows us the range of features we need to support at least though.
I'll see if I can take a look at all the "if (pseries)" checks in the Libvirt device addressing code to get an idea of how a PowerPC addressing model would work compared to x86.
Regards, Daniel
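[Editor's note: a rough sketch of the "generic model plus hypervisor-specific adjustments" split discussed in this exchange. Every identifier and value below is invented for illustration; nothing here reflects an agreed design.]

// Sketch: separating a generic, hypervisor-agnostic machine/bus model
// from implementation-specific adjustments. All names are invented.
package main

import "fmt"

// BusKind is a hypervisor-agnostic bus classification.
type BusKind string

const (
	BusPCI   BusKind = "pci"
	BusPCIe  BusKind = "pcie"
	BusSpapr BusKind = "spapr-vio" // pseries paravirtualised bus
)

// MachineModel is the generic description: which buses the machine
// provides by default and how many usable slots each has.
type MachineModel struct {
	Name  string
	Buses map[BusKind]int
}

// HypervisorProfile layers implementation-specific quirks on top of the
// generic model (the role qemu_driver.c would play today).
type HypervisorProfile func(MachineModel) MachineModel

// qemuProfile is a made-up example of such an adjustment.
func qemuProfile(m MachineModel) MachineModel {
	buses := make(map[BusKind]int, len(m.Buses))
	for k, v := range m.Buses {
		buses[k] = v
	}
	if n, ok := buses[BusPCI]; ok {
		buses[BusPCI] = n - 1 // pretend the implementation reserves one slot
	}
	m.Buses = buses
	return m
}

func main() {
	models := []MachineModel{
		{Name: "pc-i440fx", Buses: map[BusKind]int{BusPCI: 31}},
		{Name: "pseries", Buses: map[BusKind]int{BusPCI: 31, BusSpapr: 256}},
	}
	var profile HypervisorProfile = qemuProfile
	for _, m := range models {
		adjusted := profile(m)
		fmt.Println(adjusted.Name, adjusted.Buses)
	}
}

The point is only that the "if (pseries)"-style knowledge would live in the model and the profile, not in the placement code itself.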

On Thu, Apr 30, 2020 at 03:14:09PM -0400, Laine Stump wrote:
Here are two things that would help enable me to make useful contributions:
1) a basic "source tree for a Go library" setup in a libvirt-subproject on gitlab (since gitlab is the official location of libvirt projects now), including basic commit and CI hooks/test cases. I'm guessing we could borrow/steal a lot from what was done by the people who participated in "virt-blocks" last fall. Andrea - any advice/suggestions to give here?
(A side question - should we put it under the libvirt umbrella on gitlab right away? Or play around in personal trees at first and then later fork it into an official libvirt project?)
I intended it to be under the libvirt project right from the start, and have indeed already created the repos & CI. There is no reason to hide it away in private repos. It is fine for the official repo to have zero guarantee of stability in the early days.
2) a more concrete idea of what the API should look like. This is always the toughest part for me, since it is what the rest of the world sees, so it needs to be intelligible and capable of expansion, and I have a long history of making questionable choices that come back to haunt me (and everybody else! :-P). Since danpb has made good decisions in this area in the past (and since the original proposal is his), I'm thinking/hoping he can help provide direction to minimize mis-steps (on the other hand, I know he's really busy, so maybe he was just hoping that someone else would grab up his proposal and run with it).
Yep, this is what I'm fleshing out an API skeleton for now.

Regards, Daniel
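[Editor's note: purely to anchor the discussion, a skeleton mirroring the proposal's inputs (arch/machine type, global tunables, minimal device list) and output (fully addressed device list) might look roughly like this. Every identifier is hypothetical and none of it comes from the actual skeleton being worked on.]

// Sketch of a possible top-level entry point mirroring the proposal's
// inputs and outputs. Every identifier here is hypothetical.
package devaddr

// Tunables are global knobs controlling the placement policy.
type Tunables struct {
	PreferMultifunction bool // pack devices into functions on one slot
	SpareRootPorts      int  // PCIe root ports to pre-create for later hotplug
}

// Address is a placed location on some bus.
type Address struct {
	Bus      string // controller/bus identifier
	Slot     int
	Function int
}

// DeviceRequest is the minimal description supplied by the caller,
// optionally pinned to a caller-chosen address.
type DeviceRequest struct {
	Model    string    // e.g. "virtio-net", "virtio-blk"
	Address  *Address  // nil means "assign one for me"
	Override *Tunables // optional per-device override of the global tunables
}

// PlacedDevice is a fully addressed device in the expanded output.
type PlacedDevice struct {
	Model   string
	Address Address
}

// Assign expands the minimal device list into a fully addressed one.
// A real implementation would also add any controllers the machine
// type requires; this stub only shows the shape of the call.
func Assign(arch, machine string, global Tunables, devs []DeviceRequest) ([]PlacedDevice, error) {
	out := make([]PlacedDevice, 0, len(devs))
	for i, d := range devs {
		addr := Address{Bus: "pci.0", Slot: i + 1}
		if d.Address != nil {
			addr = *d.Address
		}
		out = append(out, PlacedDevice{Model: d.Model, Address: addr})
	}
	return out, nil
}

A caller would render the returned list into whatever format it needs (domain XML, JSON, or something else entirely).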

On Thu, Apr 30, 2020 at 03:20:19PM -0300, Daniel Henrique Barboza wrote:
On 3/19/20 4:00 PM, Laine Stump wrote:
TL;DR - I'm not as anti-XML as the proposal seems to be, but also not pro-XML. I also (after thinking about it) understand the advantage of putting this in a separate library. So yeah, let's go for it!
[...]
Anyway, in the end I think my opinion is we should push ahead and think about consequences of the specifics later, after some experimenting. I'd love to help if there's a place for it. I'm just not sure where/how I could contribute, especially since I have only about 4 hours worth of golang knowledge :-) (certainly not against getting more though!)
I'll start writing some code here and there for this initiative. Has anyone already started doing something? Otherwise I'll push code to github/gitlab when I have stuff to show.
I've actually started work on a skeleton for the API this week to try to flesh out some rough ideas for the general approach to the problem.
My understanding from the discussions is that the API is going to supply JSON responses instead of domxml (domXML might be supplied as an option, but it wouldn't be the default format used). Is that correct?
I'm completely ignoring libvirt Domain XML right now. I don't want the design to be constrained by any of our historic design decisions in libvirt. IOW, I intend it to be a greenfield site code wise. One of the eventual goals is to make use of this to replace current libvirt device addressing code, so eventually attention will have to return to how this would map to Domain XML. It just isn't a short term priority.

As for JSON, I'm again not too bothered about that right now, as that's a really minor part of the problem. Primarily I want to come up with a plain Go interface, based on some data model structs and APIs. Those data model structs can trivially be mapped to JSON/YAML using the Go encoding/json package or an equivalent YAML library.

Regards, Daniel
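[Editor's note: a small illustration of the "plain Go structs first, serialization second" approach. The struct and field names are invented for the example.]

// Sketch: the same data-model structs rendered as JSON purely through
// struct tags, keeping the wire format a detail rather than a design
// driver. Names are invented for the example.
package main

import (
	"encoding/json"
	"fmt"
)

type Address struct {
	Bus      string `json:"bus"`
	Slot     int    `json:"slot"`
	Function int    `json:"function"`
}

type PlacedDevice struct {
	Model   string  `json:"model"`
	Address Address `json:"address"`
}

func main() {
	devs := []PlacedDevice{
		{Model: "virtio-net", Address: Address{Bus: "pci.0", Slot: 1}},
		{Model: "virtio-blk", Address: Address{Bus: "pci.0", Slot: 2}},
	}
	out, err := json.MarshalIndent(devs, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
	// The same structs could be fed to a YAML library (e.g. gopkg.in/yaml.v3)
	// to get a YAML rendering instead.
}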

On 4 Mar 2020, at 13:50, Daniel P. Berrangé <berrange@redhat.com> wrote:
We've been doing alot of refactoring of code in recent times, and also have plans for significant infrastructure changes. We still need to spend time delivering interesting features to users / applications. This mail is to introduce an idea for a solution to an specific area applications have had long term pain with libvirt's current "mechanism, not policy" approach - device addressing. This is a way for us to show brand new ideas & approaches for what the libvirt project can deliver in terms of management APIs.
To set expectations straight: I have written no code for this yet, merely identified the gap & conceptual solution.
The device addressing problem =============================
One of the key jobs libvirt does when processing a new domain XML configuration is to assign addresses to all devices that are present. This involves adding various device controllers (PCI bridges, PCI root ports, IDE/SCSI buses, USB controllers, etc) if they are not already present, and then assigning PCI, USB, IDE, SCSI, etc, addresses to each device so they are associated with controllers. When libvirt spawns a QEMU guest, it will pass full address information to QEMU.
Libvirt, as a general rule, aims to avoid defining and implementing policy around expansion of guest configuration / defaults, however, it is inescapable in the case of device addressing due to the need to guarantee a stable hardware ABI to make live migration and save/restore to disk work. The policy that libvirt has implemented for device addressing is, as much as possible, the same as the addressing scheme QEMU would apply itself.
While libvirt succeeds in its goal of providing a stable hardware API, the addressing scheme used is not well suited to all deployment scenarios of QEMU. This is an inevitable result of having a specific assignment policy implemented in libvirt which has to trade off mutually incompatible use cases/goals.
When the libvirt addressing policy is not been sufficient, management applications are forced to take on address assignment themselves, which is a massive non-trivial job with many subtle problems to consider.
Places where libvirt's addressing is insufficient for PCI include
* Setting up multiple guest NUMA nodes and associating devices to specific nodes * Pre-emptive creation of extra PCIe root ports, to allow for later device hotplug on PCIe topologies * Determining whether to place a device on a PCI or PCIe bridge * Controlling whether a device is placed into a hotpluggable slot * Controlling whether a PCIe root port supports hotplug or not * Determining whether to places all devices on distinct slots or buses, vs grouping them all into functions on the same slot * Ability to expand the device addressing without being on the hypervisor host
Libvirt wishes to avoid implementing many different address assignment policies. It also wishes to keep the domain XML as a representation of the virtual hardware, not add a bunch of properties to it which merely serve as tunable input parameters for device addressing algorithms.
There is thus a dilemma here. Management applications increasingly need fine grained control over device addressing, while libvirt doesn't want to expose fine grained policy controls via the XML.
The new libvirt-devaddr API ===========================
The way out of this is to define a brand new virt management API which tackles this specific problem in a way that addresses all the problems mgmt apps have with device addressing and explicitly provides a variety of policy impls with tunable behaviour.
By "new API", I actually mean an entirely new library, completely distinct from libvirt.so, or anything else we've delivered so far. The closest we've come to delivering something at this kind of conceptual level, would be the abortive attempt we made with "libvirt-builder" to deliver a policy-driven API instead of mechanism based. This proposal is still quite different from that attempt.
At a high level
* The new API is "libvirt-devaddr" - short for "libvirt device addressing"
* As input it will take
1. The guest CPU architecture and machine type 2. A list of global tunables specifying desired behaviour of the address assignment policy 3. A minimal list of devices needed in the virtual machine, with optional addresses and optional per-device tunables to override the global tunables
* As output it will emit
1. fully expanded list of devices needed in the virtual machine, with addressing information sufficient to ensure stable hardware ABI
Initially the API would implement something that behaves the same way as libvirt's current address assignment API.
The intended usage would be
* Mgmt application makes a minimal list of devices they want in their guest * List of devices is fed into libvirt-devaddr API * Mgmt application gets back a full list of devices & addresses * Mgmt application writes a libvirt XML doc using this full list & addresses * Mgmt application creates the guest in libvirt
+Adrian, +Andrea, +Michal

It dawned on me that kata may provide an additional “borderline” usage model for this new API. Specifically, it might be a case where the tunables may be “relayed” through kata-runtime, but really originate from OpenShift.

Also, what about in-guest device naming / assignment?

Adrian, do you think that the iommu group issues you ran into could help Dan validate that the new library has all the input it needs to make a sane choice in that case?

Do you think that it would be possible to call the library twice with different tunables in order to get the host and guest device names?
IOW, this new "libvirt-devaddr" API is intended to be used prior to creating the XML that is used by libvirt. The API could also be used prior to needing to hotplug a new device to an existing guest. This API is intended to be a deliverable of the libvirt project, but it would be completely independent of the current libvirt API. Most especially note that it would NOT use the domain XML in any way. This gives applications maximum flexibility in how they consume this functionality, not trying to force a way to build domain XML.
It would have greater freedom in its API design, making different choices from libvirt.so on topics such as programming language (C vs Go vs Python etc), API stability timeframe (forever stable vs sometimes changing API), data formats (structs, vs YAML/JSON vs XML etc), and of course the conceptual approach (policy vs mechanism)
The expectation is that this new API would be most likely to be consumed by KubeVirt, OpenStack, Kata, as the list of problems shown earlier is directly based on issues seen working with KubeVirt & OpenStack in particular. It is not limited to these applications and is broadly useful as conceptual thing.
It would be a goal that this API should also be used by libvirt itself to replace its current internal device addressing impl. Essentially the new API should be seen as a way to expose/extract the current libvirt internal algorithm, making it available to applications in a flexible manner. I don't anticipate actually copying the current addressing code in libvirt as-is, but it would certainly serve as reference for the kind of logic we need to implement, so you might consider it a "port" or "rewrite" in some very rough sense.
I think this new API concept is a good way for the project make a start in using Go for libvirt. The functionality covered has a clearly defined scope limit, making it practical to deliver a real impl in a reasonably short time frame. Extracting this will provide a real world benefit to our application consumers, solving many long standing problems they have with libvirt, and thus justify the effort in doing this work in libvirt in a non-C language. The main question mark would be about how we might make this functionality available to Python apps if we chose Go. It is possible to expose a C API from Go, and we would need this to consume it from libvirt. There is then the need to manually write a Python API binding which is tedious work.
Regards, Daniel

On 5/4/20 5:15 PM, Christophe de Dinechin wrote:
On 4 Mar 2020, at 13:50, Daniel P. Berrangé <berrange@redhat.com> wrote:
We've been doing alot of refactoring of code in recent times, and also have plans for significant infrastructure changes. We still need to spend time delivering interesting features to users / applications. This mail is to introduce an idea for a solution to an specific area applications have had long term pain with libvirt's current "mechanism, not policy" approach - device addressing. This is a way for us to show brand new ideas & approaches for what the libvirt project can deliver in terms of management APIs.
To set expectations straight: I have written no code for this yet, merely identified the gap & conceptual solution.
The device addressing problem =============================
One of the key jobs libvirt does when processing a new domain XML configuration is to assign addresses to all devices that are present. This involves adding various device controllers (PCI bridges, PCI root ports, IDE/SCSI buses, USB controllers, etc) if they are not already present, and then assigning PCI, USB, IDE, SCSI, etc, addresses to each device so they are associated with controllers. When libvirt spawns a QEMU guest, it will pass full address information to QEMU.
Libvirt, as a general rule, aims to avoid defining and implementing policy around expansion of guest configuration / defaults, however, it is inescapable in the case of device addressing due to the need to guarantee a stable hardware ABI to make live migration and save/restore to disk work. The policy that libvirt has implemented for device addressing is, as much as possible, the same as the addressing scheme QEMU would apply itself.
While libvirt succeeds in its goal of providing a stable hardware API, the addressing scheme used is not well suited to all deployment scenarios of QEMU. This is an inevitable result of having a specific assignment policy implemented in libvirt which has to trade off mutually incompatible use cases/goals.
When the libvirt addressing policy is not been sufficient, management applications are forced to take on address assignment themselves, which is a massive non-trivial job with many subtle problems to consider.
Places where libvirt's addressing is insufficient for PCI include
* Setting up multiple guest NUMA nodes and associating devices to specific nodes * Pre-emptive creation of extra PCIe root ports, to allow for later device hotplug on PCIe topologies * Determining whether to place a device on a PCI or PCIe bridge * Controlling whether a device is placed into a hotpluggable slot * Controlling whether a PCIe root port supports hotplug or not * Determining whether to places all devices on distinct slots or buses, vs grouping them all into functions on the same slot * Ability to expand the device addressing without being on the hypervisor host
Libvirt wishes to avoid implementing many different address assignment policies. It also wishes to keep the domain XML as a representation of the virtual hardware, not add a bunch of properties to it which merely serve as tunable input parameters for device addressing algorithms.
There is thus a dilemma here. Management applications increasingly need fine grained control over device addressing, while libvirt doesn't want to expose fine grained policy controls via the XML.
The new libvirt-devaddr API ===========================
The way out of this is to define a brand new virt management API which tackles this specific problem in a way that addresses all the problems mgmt apps have with device addressing and explicitly provides a variety of policy impls with tunable behaviour.
By "new API", I actually mean an entirely new library, completely distinct from libvirt.so, or anything else we've delivered so far. The closest we've come to delivering something at this kind of conceptual level, would be the abortive attempt we made with "libvirt-builder" to deliver a policy-driven API instead of mechanism based. This proposal is still quite different from that attempt.
At a high level
* The new API is "libvirt-devaddr" - short for "libvirt device addressing"
* As input it will take
1. The guest CPU architecture and machine type 2. A list of global tunables specifying desired behaviour of the address assignment policy 3. A minimal list of devices needed in the virtual machine, with optional addresses and optional per-device tunables to override the global tunables
* As output it will emit
1. fully expanded list of devices needed in the virtual machine, with addressing information sufficient to ensure stable hardware ABI
Initially the API would implement something that behaves the same way as libvirt's current address assignment API.
The intended usage would be
* Mgmt application makes a minimal list of devices they want in their guest * List of devices is fed into libvirt-devaddr API * Mgmt application gets back a full list of devices & addresses * Mgmt application writes a libvirt XML doc using this full list & addresses * Mgmt application creates the guest in libvirt
+Adrian, +Andrea, +Michal
It dawned on me that kata may provide an additional “borderline” usage model for this new API. Specifically, it might be a case where the tunables may be “relayed” through kata-runtime, but really originate from OpenShift.
The OCI Device specification is mknod-based [1], having no bus-specific information, so I think all of the logic would be implemented by kata-runtime. However, the Device Plugin specifies an ENV variable with the host PCI address.
Also, what about in-guest device naming / assignment?

This is a problem because the ENV var will not match the guest's device address. I don't see a way around this without having a deterministic way of addressing devices and modifying/complementing that higher level information.
Adrian, do you think that the iommu group issues you ran into could help Dan validate that the new library has all the input it needs to make a sane choice in that case?
I don't think the iommu group problem would require interaction with the library. The Kata agent was just mknod-ing the devices. Fixed in [2].
Do you think that it would be possible to call the library twice with different tunables in order to get the host and guest device names?

I don't think I fully understand your proposal. Once qemu is called with a specific set of device addresses, what could possibly be done in the guest?
In order to be able to consume the devices, the application would need to know the host->guest address mappings. Whether that mapping is exposed via kata-agent, an ENV var or other means is yet to be discussed.

WRT the library itself, I think it would alleviate some of the logic currently being implemented in kata-runtime, which includes things like:

- Determining whether the device's BAR size is small enough for it to be hot-plugged into a PCI bridge
- Determining whether the machine type supports hotplugging on the root bus, or whether root ports need to be pre-allocated.

Related work: [4] [5] and associated PRs
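[Editor's note: a sketch of how such kata-runtime heuristics might instead be expressed as inputs to the library. Every name, field and threshold below is invented for illustration and is not a real kata or libvirt-devaddr interface.]

// Sketch: per-device hints a runtime such as kata-runtime could hand to
// the library instead of implementing placement heuristics itself.
// All names, fields and thresholds are invented for illustration.
package main

import "fmt"

// DeviceHints captures what the runtime knows about a passthrough device.
type DeviceHints struct {
	HostAddress string // host PCI address, e.g. from the device plugin env var
	BARSizeMB   int    // total BAR size of the device
	Hotplug     bool   // whether the device arrives after boot
}

// Placement is the decision handed back; the guest address lets the
// runtime report the host->guest mapping to the workload.
type Placement struct {
	GuestAddress string
	ViaRootPort  bool // true if a pre-allocated PCIe root port was used
}

// place is a stand-in for the real policy: a machine-type model plus
// tunables would drive this, not the hard-coded values used here.
func place(machine string, h DeviceHints) Placement {
	onRootPort := machine == "q35" && h.Hotplug
	// Illustrative threshold only: a large-BAR device may not be
	// hot-pluggable behind a conventional PCI bridge, one of the checks
	// kata-runtime performs today.
	if h.BARSizeMB > 256 {
		onRootPort = true
	}
	slot := 3 // a free slot chosen by the (omitted) allocator
	return Placement{
		GuestAddress: fmt.Sprintf("0000:00:%02x.0", slot),
		ViaRootPort:  onRootPort,
	}
}

func main() {
	h := DeviceHints{HostAddress: "0000:3b:00.0", BARSizeMB: 512, Hotplug: true}
	p := place("q35", h)
	fmt.Printf("host %s -> guest %s (via root port: %v)\n", h.HostAddress, p.GuestAddress, p.ViaRootPort)
}

The host-to-guest mapping the workload needs would then fall out of the returned placement rather than being reconstructed by the runtime.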
IOW, this new "libvirt-devaddr" API is intended to be used prior to creating the XML that is used by libvirt. The API could also be used prior to needing to hotplug a new device to an existing guest. This API is intended to be a deliverable of the libvirt project, but it would be completely independent of the current libvirt API. Most especially note that it would NOT use the domain XML in any way. This gives applications maximum flexibility in how they consume this functionality, not trying to force a way to build domain XML.
It would have greater freedom in its API design, making different choices from libvirt.so on topics such as programming language (C vs Go vs Python etc), API stability timeframe (forever stable vs sometimes changing API), data formats (structs, vs YAML/JSON vs XML etc), and of course the conceptual approach (policy vs mechanism)
The expectation is that this new API would be most likely to be consumed by KubeVirt, OpenStack, Kata, as the list of problems shown earlier is directly based on issues seen working with KubeVirt & OpenStack in particular. It is not limited to these applications and is broadly useful as conceptual thing.
It would be a goal that this API should also be used by libvirt itself to replace its current internal device addressing impl. Essentially the new API should be seen as a way to expose/extract the current libvirt internal algorithm, making it available to applications in a flexible manner. I don't anticipate actually copying the current addressing code in libvirt as-is, but it would certainly serve as reference for the kind of logic we need to implement, so you might consider it a "port" or "rewrite" in some very rough sense.
I think this new API concept is a good way for the project make a start in using Go for libvirt. The functionality covered has a clearly defined scope limit, making it practical to deliver a real impl in a reasonably short time frame. Extracting this will provide a real world benefit to our application consumers, solving many long standing problems they have with libvirt, and thus justify the effort in doing this work in libvirt in a non-C language. The main question mark would be about how we might make this functionality available to Python apps if we chose Go. It is possible to expose a C API from Go, and we would need this to consume it from libvirt. There is then the need to manually write a Python API binding which is tedious work.
Regards, Daniel
[1] https://github.com/opencontainers/runtime-spec/blob/2a060269036678148a707a92...
[2] https://github.com/kata-containers/runtime/pull/2550/commits/4d2574a7230e5a1...
[3] https://github.com/kata-containers/runtime/issues/115
[4] https://github.com/kata-containers/runtime/issues/2432
[5] https://github.com/kata-containers/runtime/issues/2460
participants (7)
- Adrian Moreno
- Christophe de Dinechin
- Dan Kenigsberg
- Daniel Henrique Barboza
- Daniel P. Berrangé
- Laine Stump
- Laine Stump