On 5/4/20 5:15 PM, Christophe de Dinechin wrote:
> On 4 Mar 2020, at 13:50, Daniel P. Berrangé <berrange(a)redhat.com> wrote:
>
> We've been doing a lot of refactoring of code in recent times, and also
> have plans for significant infrastructure changes. We still need to
> spend time delivering interesting features to users / applications.
> This mail is to introduce an idea for a solution to a specific
> area where applications have had long-term pain with libvirt's current
> "mechanism, not policy" approach - device addressing. This is a way
> for us to show brand new ideas & approaches for what the libvirt
> project can deliver in terms of management APIs.
>
> To set expectations straight: I have written no code for this yet,
> merely identified the gap & conceptual solution.
>
>
> The device addressing problem
> =============================
>
> One of the key jobs libvirt does when processing a new domain XML
> configuration is to assign addresses to all devices that are present.
> This involves adding various device controllers (PCI bridges, PCI root
> ports, IDE/SCSI buses, USB controllers, etc) if they are not already
> present, and then assigning PCI, USB, IDE, SCSI, etc, addresses to each
> device so they are associated with controllers. When libvirt spawns a
> QEMU guest, it will pass full address information to QEMU.
>
> Libvirt, as a general rule, aims to avoid defining and implementing
> policy around expansion of guest configuration / defaults, however, it
> is inescapable in the case of device addressing due to the need to
> guarantee a stable hardware ABI to make live migration and save/restore
> to disk work. The policy that libvirt has implemented for device
> addressing is, as much as possible, the same as the addressing scheme
> QEMU would apply itself.
>
> While libvirt succeeds in its goal of providing a stable hardware ABI,
> the addressing scheme used is not well suited to all deployment
> scenarios of QEMU. This is an inevitable result of having a specific
> assignment policy implemented in libvirt which has to trade off mutually
> incompatible use cases/goals.
>
> When the libvirt addressing policy has not been sufficient, management
> applications are forced to take on address assignment themselves,
> which is a massive non-trivial job with many subtle problems to
> consider.
>
> Places where libvirt's addressing is insufficient for PCI include
>
> * Setting up multiple guest NUMA nodes and associating devices to
> specific nodes
> * Pre-emptive creation of extra PCIe root ports, to allow for later
> device hotplug on PCIe topologies
> * Determining whether to place a device on a PCI or PCIe bridge
> * Controlling whether a device is placed into a hotpluggable slot
> * Controlling whether a PCIe root port supports hotplug or not
> * Determining whether to place all devices on distinct slots or
> buses, vs grouping them all into functions on the same slot
> * Ability to expand the device addressing without being on the
> hypervisor host
>
> Libvirt wishes to avoid implementing many different address assignment
> policies. It also wishes to keep the domain XML as a representation
> of the virtual hardware, not add a bunch of properties to it which
> merely serve as tunable input parameters for device addressing
> algorithms.
>
> There is thus a dilemma here. Management applications increasingly
> need fine grained control over device addressing, while libvirt
> doesn't want to expose fine grained policy controls via the XML.
>
>
> The new libvirt-devaddr API
> ===========================
>
> The way out of this is to define a brand new virt management API
> which tackles this specific problem in a way that addresses all the
> problems mgmt apps have with device addressing and explicitly
> provides a variety of policy impls with tunable behaviour.
>
> By "new API", I actually mean an entirely new library, completely
> distinct from libvirt.so, or anything else we've delivered so
> far. The closest we've come to delivering something at this kind
> of conceptual level, would be the abortive attempt we made with
> "libvirt-builder" to deliver a policy-driven API instead of mechanism
> based. This proposal is still quite different from that attempt.
>
> At a high level
>
> * The new API is "libvirt-devaddr" - short for "libvirt device
> addressing"
>
> * As input it will take
>
> 1. The guest CPU architecture and machine type
> 2. A list of global tunables specifying desired behaviour of the
> address assignment policy
> 3. A minimal list of devices needed in the virtual machine, with
> optional addresses and optional per-device tunables to override
> the global tunables
>
> * As output it will emit
>
> 1. fully expanded list of devices needed in the virtual machine,
> with addressing information sufficient to ensure stable hardware ABI
>
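To make the shape of that input and output concrete, here is a minimal Go sketch. Every name in it (Request, Device, PCIAddr, Expand) is hypothetical and not part of any existing libvirt API. The policy is only a placeholder: one device per slot on bus 0, with slot 0 assumed to be reserved for the host bridge. The Tunables field is carried but deliberately ignored.

```go
package main

import "fmt"

// PCIAddr is a guest PCI address (domain omitted for brevity).
type PCIAddr struct {
	Bus, Slot, Function uint8
}

// Device is one entry in the minimal device list. Addr is optional:
// nil means "let the policy pick an address".
type Device struct {
	Name string
	Addr *PCIAddr
}

// Request bundles the three inputs: arch + machine type, global
// tunables, and the minimal device list. All names are hypothetical.
type Request struct {
	Arch, Machine string
	Tunables      map[string]string // unused by this placeholder policy
	Devices       []Device
}

// Expand returns a copy of the device list in which every device has
// an address. Caller-supplied addresses are kept; the rest get the
// lowest free slot on bus 0 (slot 0 is assumed to be taken).
func Expand(req Request) ([]Device, error) {
	used := map[uint8]bool{0: true}
	for _, d := range req.Devices {
		if d.Addr != nil && d.Addr.Bus == 0 {
			used[d.Addr.Slot] = true
		}
	}
	out := make([]Device, 0, len(req.Devices))
	next := uint8(1)
	for _, d := range req.Devices {
		if d.Addr == nil {
			for next < 32 && used[next] {
				next++
			}
			if next >= 32 { // a PCI bus has 32 slots
				return nil, fmt.Errorf("no free slot for %s", d.Name)
			}
			used[next] = true
			d.Addr = &PCIAddr{Bus: 0, Slot: next}
		}
		out = append(out, d)
	}
	return out, nil
}

func main() {
	devs, err := Expand(Request{
		Arch: "x86_64", Machine: "pc",
		Devices: []Device{
			{Name: "net0"},
			{Name: "disk0", Addr: &PCIAddr{Slot: 1}},
			{Name: "rng0"},
		},
	})
	if err != nil {
		panic(err)
	}
	for _, d := range devs {
		fmt.Printf("%s %02x:%02x.%x\n", d.Name, d.Addr.Bus, d.Addr.Slot, d.Addr.Function)
	}
}
```

A real implementation would of course add controllers (bridges, root ports) to the output as well; this sketch only fills in missing addresses.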
> Initially the API would implement something that behaves the same
> way as libvirt's current address assignment API.
>
> The intended usage would be
>
> * Mgmt application makes a minimal list of devices they want in
> their guest
> * List of devices is fed into libvirt-devaddr API
> * Mgmt application gets back a full list of devices & addresses
> * Mgmt application writes a libvirt XML doc using this full list &
> addresses
> * Mgmt application creates the guest in libvirt
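For the "writes a libvirt XML doc" step, the address part could be rendered along these lines. This is only a sketch: PCIAddr and AddressXML are hypothetical names, and the only thing taken from libvirt is the form of the PCI <address> element in the domain XML.

```go
package main

import "fmt"

// PCIAddr is a fully specified guest PCI address.
type PCIAddr struct {
	Domain              uint16
	Bus, Slot, Function uint8
}

// AddressXML renders the address in the form used for PCI devices in
// libvirt's domain XML.
func AddressXML(a PCIAddr) string {
	return fmt.Sprintf(
		"<address type='pci' domain='0x%04x' bus='0x%02x' slot='0x%02x' function='0x%x'/>",
		a.Domain, a.Bus, a.Slot, a.Function)
}

func main() {
	// The mgmt application would place this inside the matching
	// device element of the XML document it builds.
	fmt.Println(AddressXML(PCIAddr{Bus: 0, Slot: 3}))
}
```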
+Adrian, +Andrea, +Michal

It dawned on me that kata may provide an additional “borderline”
usage model for this new API. Specifically, it might be a case where
the tunables may be “relayed” through kata-runtime, but really
originate from OpenShift.

The OCI Device specification is mknod-based [1] and carries no bus-specific
information, so I think all of the logic would be implemented by kata-runtime.
However, the Device Plugin specifies an ENV variable with the host PCI address.

Also, what about in-guest device naming / assignment?

This is a problem because the ENV var will not match the guest's device address.
I don't see a way around this without having a deterministic way of addressing
devices and modifying/complementing that higher level information.

Adrian, do you think that the iommu group issues you ran into
could help Dan validate that the new library has all the input it
needs to make a sane choice in that case?

I don't think the iommu group problem would require interaction with the
library. Kata agent was just mknod-ing the devices. Fixed in [2].

Do you think that it would be possible to call the library twice
with different tunables in order to get the host and guest device
names?

I don't think I fully understand your proposal. Once QEMU is called with a
specific set of device addresses, what could possibly be done in the guest?
In order to be able to consume the devices, the application would need to know
the host->guest address mappings. Whether that mapping is exposed via
kata-agent, an ENV var or other means is yet to be discussed.
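As an illustration of what such a host->guest mapping might look like once addresses are assigned deterministically, here is a rough Go sketch. AddrMap and Translate are made-up names, and the comma-separated list format is an assumption for the example, not something the Device Plugin API is claimed to use.

```go
package main

import (
	"fmt"
	"strings"
)

// AddrMap translates host PCI addresses (as the application sees them)
// to the addresses the same devices were given inside the guest.
type AddrMap map[string]string

// Translate rewrites a comma-separated list of host addresses into the
// corresponding guest addresses. The list format is an assumption.
func (m AddrMap) Translate(hostList string) (string, error) {
	hosts := strings.Split(hostList, ",")
	guests := make([]string, 0, len(hosts))
	for _, h := range hosts {
		g, ok := m[strings.TrimSpace(h)]
		if !ok {
			return "", fmt.Errorf("no guest address known for host device %q", h)
		}
		guests = append(guests, g)
	}
	return strings.Join(guests, ","), nil
}

func main() {
	// Example values only; a real map would be built from the
	// addresses the library handed out.
	m := AddrMap{"0000:3b:00.0": "0000:00:02.0"}
	guest, err := m.Translate("0000:3b:00.0")
	if err != nil {
		panic(err)
	}
	fmt.Println(guest)
}
```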
With regard to the library itself, I think it would alleviate some of the logic
currently being implemented in kata-runtime, which includes things like:
- Determining whether the device's BAR size is small enough for it to be
  hot-plugged into a PCI bridge
- Determining whether the machine type supports hotplugging on the root bus, or
  whether root ports need to be pre-allocated

Related work: [4], [5] and associated PRs
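The two decisions listed above could be expressed roughly as follows. This is a simplified sketch, not the kata-runtime code: Place, the maxBridgeBAR threshold and the per-machine-type table are all assumptions made up for illustration.

```go
package main

import "fmt"

// Placement says where a hot-pluggable device should be attached.
type Placement int

const (
	RootBus Placement = iota
	PCIBridge
	RootPort
)

// rootBusHotplug records whether a machine type allows hotplug directly
// on its root bus. The entries are assumptions for this example; an
// unknown machine type is treated as "no".
var rootBusHotplug = map[string]bool{"pc": true, "q35": false}

// maxBridgeBAR is a hypothetical limit on the BAR size that still fits
// behind a PCI bridge. The value is illustrative only.
const maxBridgeBAR = 2 << 20 // 2 MiB

// Place picks a placement from the machine type and the device's
// largest BAR size in bytes.
func Place(machine string, barSize uint64) Placement {
	if !rootBusHotplug[machine] {
		// No hotplug on the root bus: a root port must have been
		// pre-allocated for the device.
		return RootPort
	}
	if barSize <= maxBridgeBAR {
		return PCIBridge
	}
	return RootBus
}

func main() {
	fmt.Println(Place("q35", 16<<10) == RootPort)
	fmt.Println(Place("pc", 16<<10) == PCIBridge)
	fmt.Println(Place("pc", 1<<30) == RootBus)
}
```

Having this kind of rule in one shared library is exactly what would save each mgmt application from maintaining its own copy.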
>
> IOW, this new "libvirt-devaddr" API is intended to be used prior to
> creating the XML that is used by libvirt. The API could also be used
> prior to needing to hotplug a new device to an existing guest.
> This API is intended to be a deliverable of the libvirt project, but
> it would be completely independent of the current libvirt API. Most
> especially note that it would NOT use the domain XML in any way.
> This gives applications maximum flexibility in how they consume this
> functionality, not trying to force a way to build domain XML.
>
>
> It would have greater freedom in its API design, making different
> choices from libvirt.so on topics such as programming language (C vs
> Go vs Python etc), API stability timeframe (forever stable vs sometimes
> changing API), data formats (structs, vs YAML/JSON vs XML etc), and of
> course the conceptual approach (policy vs mechanism)
>
> The expectation is that this new API would be most likely to be
> consumed by KubeVirt, OpenStack, Kata, as the list of problems shown
> earlier is directly based on issues seen working with KubeVirt &
> OpenStack in particular. It is not limited to these applications and
> is broadly useful as a conceptual thing.
>
> It would be a goal that this API should also be used by libvirt
> itself to replace its current internal device addressing impl.
> Essentially the new API should be seen as a way to expose/extract
> the current libvirt internal algorithm, making it available to
> applications in a flexible manner. I don't anticipate actually copying
> the current addressing code in libvirt as-is, but it would certainly
> serve as reference for the kind of logic we need to implement, so you
> might consider it a "port" or "rewrite" in some very rough
> sense.
>
> I think this new API concept is a good way for the project to make a start
> in using Go for libvirt. The functionality covered has a clearly defined
> scope limit, making it practical to deliver a real impl in a reasonably
> short time frame. Extracting this will provide a real world benefit to
> our application consumers, solving many long standing problems they have
> with libvirt, and thus justify the effort in doing this work in libvirt
> in a non-C language. The main question mark would be about how we might
> make this functionality available to Python apps if we chose Go. It is
> possible to expose a C API from Go, and we would need this to consume it
> from libvirt. There is then the need to manually write a Python API binding
> which is tedious work.
>
> Regards,
> Daniel
> --
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
>
[1] https://github.com/opencontainers/runtime-spec/blob/2a060269036678148a707...
[2] https://github.com/kata-containers/runtime/pull/2550/commits/4d2574a7230e...
[3] https://github.com/kata-containers/runtime/issues/115
[4] https://github.com/kata-containers/runtime/issues/2432
[5] https://github.com/kata-containers/runtime/issues/2460