On Fri, Mar 13, 2020 at 12:47 PM Daniel P. Berrangé <berrange(a)redhat.com> wrote:
On Fri, Mar 13, 2020 at 11:23:44AM +0200, Dan Kenigsberg wrote:
> On Wed, 4 Mar 2020, 14:51 Daniel P. Berrangé, <berrange(a)redhat.com> wrote:
> >
> > We've been doing a lot of refactoring of code in recent times, and also
> > have plans for significant infrastructure changes. We still need to
> > spend time delivering interesting features to users / applications.
> > This mail is to introduce an idea for a solution to a specific
> > area applications have had long term pain with libvirt's current
> > "mechanism, not policy" approach - device addressing. This is a way
> > for us to show brand new ideas & approaches for what the libvirt
> > project can deliver in terms of management APIs.
> >
> > To set expectations straight: I have written no code for this yet,
> > merely identified the gap & conceptual solution.
> >
> >
> > The device addressing problem
> > =============================
> >
> > One of the key jobs libvirt does when processing a new domain XML
> > configuration is to assign addresses to all devices that are present.
> > This involves adding various device controllers (PCI bridges, PCI root
> > ports, IDE/SCSI buses, USB controllers, etc) if they are not already
> > present, and then assigning PCI, USB, IDE, SCSI, etc, addresses to each
> > device so they are associated with controllers. When libvirt spawns a
> > QEMU guest, it will pass full address information to QEMU.
> >
> > Libvirt, as a general rule, aims to avoid defining and implementing
> > policy around expansion of guest configuration / defaults, however, it
> > is inescapable in the case of device addressing due to the need to
> > guarantee a stable hardware ABI to make live migration and save/restore
> > to disk work. The policy that libvirt has implemented for device
> > addressing is, as much as possible, the same as the addressing scheme
> > QEMU would apply itself.
> >
> > While libvirt succeeds in its goal of providing a stable hardware ABI,
> > the addressing scheme used is not well suited to all deployment
> > scenarios of QEMU. This is an inevitable result of having a specific
> > assignment policy implemented in libvirt which has to trade off mutually
> > incompatible use cases/goals.
> >
> > When the libvirt addressing policy is not sufficient, management
> > applications are forced to take on address assignment themselves,
> > which is a massive non-trivial job with many subtle problems to
> > consider.
> >
> > Places where libvirt's addressing is insufficient for PCI include
> >
> > * Setting up multiple guest NUMA nodes and associating devices to
> > specific nodes
> > * Pre-emptive creation of extra PCIe root ports, to allow for later
> > device hotplug on PCIe topologies
> > * Determining whether to place a device on a PCI or PCIe bridge
> > * Controlling whether a device is placed into a hotpluggable slot
> > * Controlling whether a PCIe root port supports hotplug or not
> > * Determining whether to place all devices on distinct slots or
> > buses, vs grouping them all into functions on the same slot
> > * Ability to expand the device addressing without being on the
> > hypervisor host
>
> (I don't understand the last bullet point)
I'm not sure if this is still the case, but at some point in time
there was a desire from KubeVirt to be able to expand the users'
configuration when loaded in KubeVirt, filling in various defaults
for devices. This would run when the end user YAML/JSON config
was first posted to the k8s API for storage, some arbitrary amount
of time later the config gets chosen to run on a virtualization
host at which point it is turned into libvirt domain XML.
Ah, I did not hear about this before, but I see why something like
this would be useful even without libvirt-devaddr. Having something
like virDomainDryRunXML() would have eliminated old race conditions we
had in oVirt.
> > Libvirt wishes to avoid implementing many different address assignment
> > policies. It also wishes to keep the domain XML as a representation
> > of the virtual hardware, not add a bunch of properties to it which
> > merely serve as tunable input parameters for device addressing
> > algorithms.
> >
> > There is thus a dilemma here. Management applications increasingly
> > need fine grained control over device addressing, while libvirt
> > doesn't want to expose fine grained policy controls via the XML.
> >
> >
> > The new libvirt-devaddr API
> > ===========================
> >
> > The way out of this is to define a brand new virt management API
> > which tackles this specific problem in a way that addresses all the
> > problems mgmt apps have with device addressing and explicitly
> > provides a variety of policy impls with tunable behaviour.
> >
> > By "new API", I actually mean an entirely new library, completely
> > distinct from libvirt.so, or anything else we've delivered so
> > far. The closest we've come to delivering something at this kind
> > of conceptual level, would be the abortive attempt we made with
> > "libvirt-builder" to deliver a policy-driven API instead of mechanism
> > based. This proposal is still quite different from that attempt.
> >
> > At a high level
> >
> > * The new API is "libvirt-devaddr" - short for "libvirt device
> > addressing"
> >
> > * As input it will take
> >
> > 1. The guest CPU architecture and machine type
> > 2. A list of global tunables specifying desired behaviour of the
> > address assignment policy
> > 3. A minimal list of devices needed in the virtual machine, with
> > optional addresses and optional per-device tunables to override
> > the global tunables
> >
> > * As output it will emit
> >
> > 1. fully expanded list of devices needed in the virtual machine,
> > with addressing information sufficient to ensure stable hardware ABI
> >
> > Initially the API would implement something that behaves the same
> > way as libvirt's current address assignment API.
> >
> > The intended usage would be
> >
> > * Mgmt application makes a minimal list of devices they want in
> > their guest
> > * List of devices is fed into libvirt-devaddr API
> > * Mgmt application gets back a full list of devices & addresses
> > * Mgmt application writes a libvirt XML doc using this full list &
> > addresses
> > * Mgmt application creates the guest in libvirt
> >
> > IOW, this new "libvirt-devaddr" API is intended to be used prior to
> > creating the XML that is used by libvirt. The API could also be used
> > prior to needing to hotplug a new device to an existing guest.
> > This API is intended to be a deliverable of the libvirt project, but
> > it would be completely independent of the current libvirt API. Most
> > especially note that it would NOT use the domain XML in any way.
> > This gives applications maximum flexibility in how they consume this
> > functionality, not trying to force a way to build domain XML.
>
> This procedure forces Mgmt to learn a new language to describe device
> placement. Mgmt (or should I just say "we"?) currently expresses the
> "minimal list of devices" in XML form and passes it to libvirt. Here we
> are asked to pass it once to libvirt-devaddr, parse its output, and
> feed it as XML to libvirt.
I'm not necessarily suggesting we even need a document format at the
core API level. I could easily see the API working in terms of a
list of Go structs, with tunables being normal method parameters.
A JSON format could be an optional way to serialize the Go structs,
but if the app were written in Go the JSON may not be needed at all.
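
A minimal sketch of what "a list of Go structs, with tunables being normal
method parameters" could look like. Everything named here (`Device`,
`PCIAddress`, `Tunables`, `Expand`) is hypothetical, since no code exists
for this yet, and the lowest-free-slot policy is a toy stand-in rather than
libvirt's real assignment algorithm. It only models PCI slots on bus 0.

```go
package main

import "fmt"

// PCIAddress is a hypothetical address type; only PCI is modelled.
type PCIAddress struct {
	Domain, Bus, Slot, Function uint
}

// Device is a hypothetical entry in the device list. Address stays nil
// until the caller or Expand assigns one.
type Device struct {
	Name    string
	Address *PCIAddress
}

// Tunables are hypothetical policy knobs, passed as a plain parameter.
type Tunables struct {
	FirstSlot uint // lowest slot the policy may hand out
	LastSlot  uint // highest slot available on the bus
}

// Expand returns a copy of devs in which every device has an address.
// Caller-supplied addresses are kept; the rest get the lowest free slot
// on bus 0. arch and machine are accepted but unused in this toy policy.
func Expand(arch, machine string, t Tunables, devs []Device) ([]Device, error) {
	used := map[uint]bool{}
	for _, d := range devs {
		if d.Address == nil {
			continue
		}
		if used[d.Address.Slot] {
			return nil, fmt.Errorf("slot %d requested twice", d.Address.Slot)
		}
		used[d.Address.Slot] = true
	}

	out := make([]Device, len(devs))
	next := t.FirstSlot
	for i, d := range devs {
		out[i] = d
		if d.Address != nil {
			continue
		}
		// Skip over slots already taken by earlier assignments.
		for next <= t.LastSlot && used[next] {
			next++
		}
		if next > t.LastSlot {
			return nil, fmt.Errorf("no free PCI slot for %s", d.Name)
		}
		used[next] = true
		out[i].Address = &PCIAddress{Slot: next}
	}
	return out, nil
}

func main() {
	devs := []Device{
		{Name: "net0"},
		{Name: "disk0", Address: &PCIAddress{Slot: 3}},
		{Name: "balloon0"},
	}
	full, err := Expand("x86_64", "pc", Tunables{FirstSlot: 2, LastSlot: 31}, devs)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, d := range full {
		a := d.Address
		fmt.Printf("%s -> %04x:%02x:%02x.%x\n", d.Name, a.Domain, a.Bus, a.Slot, a.Function)
	}
}
```

The caller keeps the address it supplied for disk0, and the two devices
without addresses are placed around it. A Go application would consume the
returned slice directly, with no document format involved.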
> I believe it would be easier to use the domxml as the base language
> for the new library, too. libvirt-devaddr would accept it with various
> hints (expressed as its own extension to the XML?) such as "place all
> of these devices in the same NUMA node", "keep on root bus" or
> "separate these two chattering devices to their own bus". The output
> of libvirt-devaddr would be a domxml with <devices> filled with
> controllers and addresses, readily available for consumption by
> libvirt.
I don't believe that using the libvirt domain XML is a good idea for
this as it unnecessarily constrains the usage scenarios. Most management
applications do not use the domain XML as their canonical internal storage
format. KubeVirt has its JSON/YAML schema for k8s API, OpenStack/RHEV just
store metadata in their DB, others vary again. Some of these applications
benefit from being able to expand device topology/addressing, a long time
before they get anywhere near use of domain XML - the latter only matters
when you come to instantiate a VM on a particular host.
Nevertheless, your suggested Go struct would become a third
representation of virtual devices, on top of domxml and the
Mgmt-canonical one. Maybe I'm just overconservative. Let us ask
kubevirt-dev what would be their preferable form to consume this
suggested API.
We could of course have a convenience method which optionally generates
a domain XML template from the output list of devices, if someone believes
that's useful to standardize on, but I don't think the domain XML should
be the core format.
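
Such a convenience method might look roughly like the sketch below, which
renders the PCI `<address>` element that domain XML uses for one
already-addressed device. The `Device` type and the `AddressXML` function
are hypothetical names invented for illustration; only the element and
attribute names follow the existing domain XML. Go's encoding/xml writes an
explicit closing tag rather than a self-closing element, which is
equivalent XML.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Device is a hypothetical, already-addressed entry from the output list.
type Device struct {
	Alias                       string
	Domain, Bus, Slot, Function uint
}

// xmlAddress mirrors the attributes of a PCI <address> element.
type xmlAddress struct {
	XMLName  xml.Name `xml:"address"`
	Type     string   `xml:"type,attr"`
	Domain   string   `xml:"domain,attr"`
	Bus      string   `xml:"bus,attr"`
	Slot     string   `xml:"slot,attr"`
	Function string   `xml:"function,attr"`
}

// AddressXML renders the <address> element for one device, formatted
// with the hex widths conventionally seen in domain XML.
func AddressXML(d Device) (string, error) {
	a := xmlAddress{
		Type:     "pci",
		Domain:   fmt.Sprintf("0x%04x", d.Domain),
		Bus:      fmt.Sprintf("0x%02x", d.Bus),
		Slot:     fmt.Sprintf("0x%02x", d.Slot),
		Function: fmt.Sprintf("0x%x", d.Function),
	}
	b, err := xml.Marshal(a)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	s, err := AddressXML(Device{Alias: "net0", Slot: 3})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(s)
}
```

An application that does build domain XML could splice this fragment into
its own device elements, while applications with a different canonical
format would ignore the helper entirely.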
I would also like this library to be usable for scenarios in which libvirt
is not involved at all. One of the strange things about the QEMU driver
in libvirt compared to the other hypervisor drivers is that it is missing
an intermediate API layer. In other drivers the hypervisor platform itself
provides a full management API layer, and libvirt merely maps the libvirt
APIs to the underlying mgmt API or data formats. IOW, libvirt is just a
mapping layer.
QEMU though only really provides a few low level building blocks, alongside
other building blocks you have to pull in from Linux. It doesn't even provide
a configuration file. Libvirt pulls all these pieces together to form the
complete QEMU management API, as well as mapping everything onto the libvirt
domain XML & APIs. I think there is scope & interest/demand to look at
creating an intermediate layer that provides a full management layer for
QEMU, such that libvirt can eventually become just a mapping layer for
QEMU. In such a scenario the libvirt-devaddr library is still very useful
but you don't want it using the libvirt domain XML, as that's not likely
to be the format in use.
Regards,
Daniel
--
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|