libvirt-devaddr: a new library for device address assignment

We've been doing a lot of refactoring of code in recent times, and also have plans for significant infrastructure changes. We still need to spend time delivering interesting features to users / applications. This mail is to introduce an idea for a solution to a specific area where applications have had long-term pain with libvirt's current "mechanism, not policy" approach - device addressing. This is a way for us to show brand new ideas & approaches for what the libvirt project can deliver in terms of management APIs.

To set expectations straight: I have written no code for this yet, merely identified the gap & conceptual solution.

The device addressing problem
=============================

One of the key jobs libvirt does when processing a new domain XML configuration is to assign addresses to all devices that are present. This involves adding various device controllers (PCI bridges, PCI root ports, IDE/SCSI buses, USB controllers, etc) if they are not already present, and then assigning PCI, USB, IDE, SCSI, etc, addresses to each device so they are associated with controllers. When libvirt spawns a QEMU guest, it will pass full address information to QEMU.

Libvirt, as a general rule, aims to avoid defining and implementing policy around expansion of guest configuration / defaults; however, it is inescapable in the case of device addressing due to the need to guarantee a stable hardware ABI to make live migration and save/restore to disk work. The policy that libvirt has implemented for device addressing is, as much as possible, the same as the addressing scheme QEMU would apply itself.

While libvirt succeeds in its goal of providing a stable hardware ABI, the addressing scheme used is not well suited to all deployment scenarios of QEMU. This is an inevitable result of having a specific assignment policy implemented in libvirt which has to trade off mutually incompatible use cases/goals.

When the libvirt addressing policy has not been sufficient, management applications are forced to take on address assignment themselves, which is a massive non-trivial job with many subtle problems to consider.

Places where libvirt's addressing is insufficient for PCI include:

* Setting up multiple guest NUMA nodes and associating devices to specific nodes
* Pre-emptive creation of extra PCIe root ports, to allow for later device hotplug on PCIe topologies
* Determining whether to place a device on a PCI or PCIe bridge
* Controlling whether a device is placed into a hotpluggable slot
* Controlling whether a PCIe root port supports hotplug or not
* Determining whether to place all devices on distinct slots or buses, vs grouping them all into functions on the same slot
* Ability to expand the device addressing without being on the hypervisor host

Libvirt wishes to avoid implementing many different address assignment policies. It also wishes to keep the domain XML as a representation of the virtual hardware, not add a bunch of properties to it which merely serve as tunable input parameters for device addressing algorithms.

There is thus a dilemma here. Management applications increasingly need fine-grained control over device addressing, while libvirt doesn't want to expose fine-grained policy controls via the XML.
The new libvirt-devaddr API
===========================

The way out of this is to define a brand new virt management API which tackles this specific problem in a way that addresses all the problems mgmt apps have with device addressing and explicitly provides a variety of policy impls with tunable behaviour.

By "new API", I actually mean an entirely new library, completely distinct from libvirt.so, or anything else we've delivered so far. The closest we've come to delivering something at this kind of conceptual level would be the abortive attempt we made with "libvirt-builder" to deliver a policy-driven API instead of a mechanism-based one. This proposal is still quite different from that attempt.

At a high level:

* The new API is "libvirt-devaddr" - short for "libvirt device addressing"
* As input it will take
  1. The guest CPU architecture and machine type
  2. A list of global tunables specifying desired behaviour of the address assignment policy
  3. A minimal list of devices needed in the virtual machine, with optional addresses and optional per-device tunables to override the global tunables
* As output it will emit
  1. fully expanded list of devices needed in the virtual machine, with addressing information sufficient to ensure stable hardware ABI

Initially the API would implement something that behaves the same way as libvirt's current address assignment API.

The intended usage would be:

* Mgmt application makes a minimal list of devices they want in their guest
* List of devices is fed into libvirt-devaddr API
* Mgmt application gets back a full list of devices & addresses
* Mgmt application writes a libvirt XML doc using this full list & addresses
* Mgmt application creates the guest in libvirt

IOW, this new "libvirt-devaddr" API is intended to be used prior to creating the XML that is used by libvirt. The API could also be used prior to needing to hotplug a new device to an existing guest. This API is intended to be a deliverable of the libvirt project, but it would be completely independent of the current libvirt API. Most especially note that it would NOT use the domain XML in any way. This gives applications maximum flexibility in how they consume this functionality, not trying to force a way to build domain XML.

It would have greater freedom in its API design, making different choices from libvirt.so on topics such as programming language (C vs Go vs Python etc), API stability timeframe (forever stable vs sometimes changing API), data formats (structs vs YAML/JSON vs XML etc), and of course the conceptual approach (policy vs mechanism).

The expectation is that this new API would be most likely to be consumed by KubeVirt, OpenStack, Kata, as the list of problems shown earlier is directly based on issues seen working with KubeVirt & OpenStack in particular. It is not limited to these applications and is broadly useful as a conceptual thing.

It would be a goal that this API should also be used by libvirt itself to replace its current internal device addressing impl. Essentially the new API should be seen as a way to expose/extract the current libvirt internal algorithm, making it available to applications in a flexible manner. I don't anticipate actually copying the current addressing code in libvirt as-is, but it would certainly serve as a reference for the kind of logic we need to implement, so you might consider it a "port" or "rewrite" in some very rough sense.

I think this new API concept is a good way for the project to make a start in using Go for libvirt.
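To make the shape of this more concrete, here is a purely illustrative sketch of how those inputs and outputs might look as Go types. Every name in it (package, types, fields, function) is a hypothetical assumption for the example, not an existing libvirt API.

package devaddr

// Address is a bus-specific device address (PCI, USB, SCSI, IDE, ...).
type Address struct {
    Type     string // "pci", "usb", "scsi", "ide", ...
    Domain   uint16 // PCI only
    Bus      uint8
    Slot     uint8
    Function uint8
}

// Device is one entry in the minimal input list or the expanded output list.
type Device struct {
    Type     string            // "disk", "interface", "controller", ...
    Model    string            // e.g. "virtio-blk", "pcie-root-port"
    Address  *Address          // nil on input means "please assign one"
    Tunables map[string]string // per-device overrides of the global tunables
}

// Tunables are the global knobs controlling the assignment policy, e.g. how
// many spare PCIe root ports to create for future hotplug.
type Tunables struct {
    SpareRootPorts     int
    GroupIntoFunctions bool
}

// Assign takes the guest architecture, machine type, global tunables and the
// minimal device list, and returns the fully expanded list (including any
// controllers that had to be added) with stable addresses filled in.
func Assign(arch, machine string, t Tunables, devs []Device) ([]Device, error) {
    // The real assignment logic would live here; this sketch only shows
    // the intended signature.
    return nil, nil
}

A management application would call something like Assign("x86_64", "pc-q35-4.2", tunables, devs) and then render whatever configuration format it needs from the returned list.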
The functionality covered has a clearly defined scope limit, making it practical to deliver a real impl in a reasonably short time frame. Extracting this will provide a real-world benefit to our application consumers, solving many long-standing problems they have with libvirt, and thus justify the effort of doing this work in libvirt in a non-C language.

The main question mark would be about how we might make this functionality available to Python apps if we chose Go. It is possible to expose a C API from Go, and we would need this in order to consume it from libvirt. There is then the need to manually write a Python API binding, which is tedious work.

Regards,
Daniel

--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On Wed, 4 Mar 2020, 14:51 Daniel P. Berrangé, <berrange@redhat.com> wrote:
[...]
Places where libvirt's addressing is insufficient for PCI include:

* Setting up multiple guest NUMA nodes and associating devices to specific nodes
* Pre-emptive creation of extra PCIe root ports, to allow for later device hotplug on PCIe topologies
* Determining whether to place a device on a PCI or PCIe bridge
* Controlling whether a device is placed into a hotpluggable slot
* Controlling whether a PCIe root port supports hotplug or not
* Determining whether to place all devices on distinct slots or buses, vs grouping them all into functions on the same slot
* Ability to expand the device addressing without being on the hypervisor host
(I don't understand the last bullet point)
[...]
IOW, this new "libvirt-devaddr" API is intended to be used prior to creating the XML that is used by libvirt. The API could also be used prior to needing to hotplug a new device to an existing guest. This API is intended to be a deliverable of the libvirt project, but it would be completely independent of the current libvirt API. Most especially note that it would NOT use the domain XML in any way. This gives applications maximum flexibility in how they consume this functionality, not trying to force a way to build domain XML.
This procedure forces Mgmt to learn a new language to describe device placement. Mgmt (or should I just say "we"?) currently expresses the "minimal list of devices" in XML form and passes it to libvirt. Here we are asked to pass it once to libvirt-devaddr, parse its output, and feed it as XML to libvirt.

I believe it would be easier to use the domxml as the base language for the new library, too. libvirt-devaddr would accept it with various hints (expressed as its own extension to the XML?) such as "place all of these devices in the same NUMA node", "keep on root bus" or "separate these two chattering devices to their own bus". The output of libvirt-devaddr would be a domxml with <devices> filled with controllers and addresses, readily available for consumption by libvirt.
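As a purely illustrative example of that suggestion - the "devaddr" namespace URI and the hint attribute names below are invented for this sketch, not an existing schema, while the device elements themselves are ordinary domain XML - the input might look something like:

<!-- Hypothetical minimal <devices> input, with made-up placement hints -->
<devices xmlns:devaddr="urn:example:devaddr-hints">
  <interface type='network' devaddr:numa-node='0' devaddr:hotpluggable='yes'>
    <source network='default'/>
    <model type='virtio'/>
  </interface>
  <disk type='file' device='disk' devaddr:numa-node='0'>
    <source file='/var/lib/libvirt/images/guest.qcow2'/>
    <target dev='vda' bus='virtio'/>
  </disk>
</devices>

The output would then be the same document with the <controller> elements and per-device <address> elements filled in.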
It would have greater freedom in its API design, making different choices from libvirt.so on topics such as programming language (C vs Go vs Python etc), API stability timeframe (forever stable vs sometimes changing API), data formats (structs vs YAML/JSON vs XML etc), and of course the conceptual approach (policy vs mechanism).
The expectation is that this new API would be most likely to be consumed by KubeVirt, OpenStack, Kata, as the list of problems shown earlier is directly based on issues seen working with KubeVirt & OpenStack in particular.
And thank you for that.

On Fri, Mar 13, 2020 at 11:23:44AM +0200, Dan Kenigsberg wrote:
On Wed, 4 Mar 2020, 14:51 Daniel P. Berrangé, <berrange@redhat.com> wrote:
[...]

* Ability to expand the device addressing without being on the hypervisor host
(I don't understand the last bullet point)
I'm not sure if this is still the case, but at some point in time there was a desire from KubeVirt to be able to expand the users' configuration when loaded in KubeVirt, filling in various defaults for devices. This would run when the end user YAML/JSON config was first posted to the k8s API for storage; some arbitrary amount of time later the config gets chosen to run on a virtualization host, at which point it is turned into libvirt domain XML.
[...]
This procedure forces Mgmt to learn a new language to describe device placement. Mgmt (or should I just say "we"?) currently expresses the "minimal list of devices" in XML form and passes it to libvirt. Here we are asked to pass it once to libvirt-devaddr, parse its output, and feed it as XML to libvirt.
I'm not necessarily suggesting we even need a document format at the core API level. I could easily see the API working in terms of a list of Go structs, with tunables being normal method parameters. A JSON format could be an optional way to serialize the Go structs, but if the app were written in Go the JSON may not be needed at all.
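As a purely hypothetical illustration of that split (none of these type or field names exist anywhere yet), the core data could be plain Go structs, with JSON available as an optional serialization for callers that want a document format:

package main

import (
    "encoding/json"
    "fmt"
)

// Device is a hypothetical entry in the device list. Go callers would pass
// the structs around directly; the json tags merely enable an optional
// wire/storage format for non-Go consumers.
type Device struct {
    Type    string `json:"type"`
    Model   string `json:"model,omitempty"`
    Address string `json:"address,omitempty"` // filled in by the library
}

func main() {
    devs := []Device{
        {Type: "interface", Model: "virtio"},
        {Type: "disk", Model: "virtio-blk"},
    }
    // Optional JSON view of the same data.
    out, err := json.MarshalIndent(devs, "", "  ")
    if err != nil {
        panic(err)
    }
    fmt.Println(string(out))
}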
I believe it would be easier to use the domxml as the base language for the new library, too. libvirt-devaddr would accept it with various hints (expressed as its own extension to the XML?) such as "place all of these devices in the same NUMA node", "keep on root bus" or "separate these two chattering devices to their own bus". The output of libvirt-devaddr would be a domxml with <devices> filled with controllers and addresses, readily available for consumption by libvirt.
I don't believe that using the libvirt domain XML is a good idea for this as it unnecessarily constrains the usage scenarios. Most management applications do not use the domain XML as their canonical internal storage format. KubeVirt has its JSON/YAML schema for k8s API, OpenStack/RHEV just store metadata in their DB, others vary again. Some of these applications benefit from being able to expand device topology/addressing, a long time before they get anywhere near use of domain XML - the latter only matters when you come to instantiate a VM on a particular host.

We could of course have a convenience method which optionally generates a domain XML template from the output list of devices, if someone believes that's useful to standardize on, but I don't think the domain XML should be the core format.

I would also like this library to be usable for scenarios in which libvirt is not involved at all. One of the strange things about the QEMU driver in libvirt compared to the other hypervisor drivers is that it is missing an intermediate API layer. In other drivers the hypervisor platform itself provides a full management API layer, and libvirt merely maps the libvirt APIs to the underlying mgmt API or data formats. IOW, libvirt is just a mapping layer.

QEMU though only really provides a few low level building blocks, alongside other building blocks you have to pull in from Linux. It doesn't even provide a configuration file. Libvirt pulls all these pieces together to form the complete management QEMU API, as well as mapping everything onto the libvirt domain XML & APIs. I think there is scope & interest/demand to look at creating an intermediate layer that provides a full management layer for QEMU, such that libvirt can eventually become just a mapping layer for QEMU. In such a scenario the libvirt-devaddr library is still very useful but you don't want it using the libvirt domain XML, as that's not likely to be the format in use.

Regards,
Daniel

On Fri, Mar 13, 2020 at 12:47 PM Daniel P. Berrangé <berrange@redhat.com> wrote:
On Fri, Mar 13, 2020 at 11:23:44AM +0200, Dan Kenigsberg wrote:
[...]
(I don't understand the last bullet point)
I'm not sure if this is still the case, but at some point in time there was a desire from KubeVirt to be able to expand the users' configuration when loaded in KubeVirt, filling in various defaults for devices. This would run when the end user YAML/JSON config was first posted to the k8s API for storage; some arbitrary amount of time later the config gets chosen to run on a virtualization host, at which point it is turned into libvirt domain XML.
Ah, I had not heard about this before, but I see why something like this would be useful even without libvirt-devaddr. Having something like virDomainDryRunXML() would have eliminated old race conditions we had in oVirt.
[...]
I'm not necessarily suggesting we even need a document format at the core API level. I could easily see the API working in terms of a list of Go structs, with tunables being normal method parameters. A JSON format could be an optional way to serialize the Go structs, but if the app were written in Go the JSON may not be needed at all.
I believe it would be easier to use the domxml as the base language for the new library, too. libvirt-devaddr would accept it with various hints (expressed as its own extension to the XML?) such as "place all of these devices in the same NUMA node", "keep on root bus" or "separate these two chattering devices to their own bus". The output of libvirt-devaddr would be a domxml with <devices> filled with controllers and addresses, readily available for consumption by libvirt.
I don't believe that using the libvirt domain XML is a good idea for this as it unnecessarily constrains the usage scenarios. Most management applications do not use the domain XML as their canonical internal storage format. KubeVirt has its JSON/YAML schema for k8s API, OpenStack/RHEV just store metadata in their DB, others vary again. Some of these applications benefit from being able to expand device topology/addressing, a long time before they get anywhere near use of domain XML - the latter only matters when you come to instantiate a VM on a particular host.
Nevertheless, your suggested Go struct would become a third representation of virtual devices, on top of domxml and the Mgmt-canonical one. Maybe I'm just overconservative. Let us ask kubevirt-dev what their preferred form for consuming this suggested API would be.

TL;DR - I'm not as anti-XML as the proposal seems to be, but also not pro-XML. I also (after thinking about it) understand the advantage of putting this in a separate library. So yeah, let's go for it!

On 3/13/20 6:47 AM, Daniel P. Berrangé wrote:
[...]
I'm not sure if this is still the case, but at some point in time there was a desire from KubeVirt to be able to expand the users' configuration when loaded in KubeVirt, filling in various defaults for devices. This would run when the end user YAML/JSON config was first posted to the k8s API for storage; some arbitrary amount of time later the config gets chosen to run on a virtualization host, at which point it is turned into libvirt domain XML.
If I recall the discussion properly, the context was that we wanted KubeVirt to remember all the stuff like PCI addresses, MAC addresses, and the exact machinetype to be "backfilled" from libvirt into the KubeVirt config, but for them that's a one-way street. So having all these things set by a separate API (even in a separate library) would definitely be an advantage for them, as long as all the same info was available at that time (e.g. you really need to know the machinetypes supported by the specific qemu that is going to be used in order to set the exact machinetype).
[...]
By "new API", I actually mean an entirely new library, completely distinct from libvirt.so, or anything else we've delivered so far.
I was at first against the idea of a completely separate library, since each new library means a new package to be maintained and installed. However, I do see the advantage of being completely disconnected from libvirt, since there may be scenarios where libvirt isn't needed (maybe libvirt is on a different host, or maybe something else (libvirt-ng? :-P) is being used). Keeping this separate means it can be used in other scenarios. So now I agree with this.
[...]
* The new API is "libvirt-devaddr" - short for "libvirt device addressing"
It's more than just device addresses though. (On the other hand, a name is just a name, so...)
* As input it will take
1. The guest CPU architecture and machine type
To repeat the point above - do we expect libvirt-devaddr to provide the exact machinetype? If so, what will be the mechanism for telling it exactly which machinetypes are supported? Will it need to replicate all of libvirt's qemu capabilities code? (and would that really work if, say, libvirt-devaddr is being used on a machine different from the machine where the virtual machine will eventually be run?)
2. A list of global tunables specifying desired behaviour of the address assignment policy
3. A minimal list of devices needed in the virtual machine, with optional addresses and optional per-device tunables to override the global tunables
* As output it will emit
1. fully expanded list of devices needed in the virtual machine, with addressing information sufficient to ensure stable hardware ABI
I know you already know it and it's implied in what you say, but just to make sure it's clear to anybody else, the "expanded list of devices" will also include all PCI (and SCSI and SATA and whatever) controllers needed for the entire hierarchy. (Or maybe you said that and I missed it. Wouldn't surprise me.)

This means that the library will need to know which types of which controllers are supported for the machinetype being requested (and of course what is supported by each controller). Is it going to query qemu? Which qemu - the one on the host where libvirt-devaddr is being called, I suppose, but that won't necessarily be the same as the host where the guest will eventually run.

Will libvirt-devaddr care about things all the way to the level of which type of pcie-root-port to use (for example)? And what about all the odd attributes of various controllers that libvirt sets to a default value and then stores in the XML (chassis id, etc)? I guess we need to take care of all those as well.
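To make that concrete with an invented example (the device models, the address notation and the types below are illustrative only, echoing the hypothetical sketches earlier in the thread):

package main

import "fmt"

// Minimal hypothetical device entry, for illustration only.
type Device struct {
    Type    string
    Model   string
    Address string
}

func main() {
    // Input: the only device the application actually asked for.
    in := []Device{{Type: "disk", Model: "virtio-blk"}}

    // Hypothetical expanded output: the library had to add a PCIe root port
    // (a controller) before it could give the disk a stable PCI address.
    out := []Device{
        {Type: "controller", Model: "pcie-root-port", Address: "pci 0000:00:02.0"},
        {Type: "disk", Model: "virtio-blk", Address: "pci 0000:01:00.0"},
    }

    fmt.Printf("%d device in, %d devices out\n", len(in), len(out))
}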
[...]
IOW, this new "libvirt-devaddr" API is intended to be used prior to creating the XML that is used by libvirt. The API could also be used prior to needing to hotplug a new device to an existing guest.
So everything returned from the original call would need to be kept around in that form (or the application would need to be able to reproduce it on demand), and that's then fed into the API. I guess this could just be the same API - similar to how libvirt acts now, it would accept any address info provided, and then assign addresses wherever they were omitted.
This API is intended to be a deliverable of the libvirt project, but it would be completely independent of the current libvirt API. Most especially note that it would NOT use the domain XML in any way. This gives applications maximum flexibility in how they consume this functionality, not trying to force a way to build domain XML.
I was originally going to argue in favor of using the same XML, since we otherwise have to convert back and forth. But during the extra long time I've taken to think about it, I think I agree that this isn't important, especially if the chosen format is as simple as possible.
This procedure forces Mgmt to learn a new language to describe device placement. Mgmt (or should I just say "we"?) currently expresses the "minimal list of devices" in XML form and passes it to libvirt. Here we are asked to pass it once to libvirt-devaddr, parse its output, and feed it as XML to libvirt.

I'm not necessarily suggesting we even need a document format at the core API level. I could easily see the API working in terms of a list of Go structs, with tunables being normal method parameters. A JSON format could be an optional way to serialize the Go structs, but if the app were written in Go the JSON may not be needed at all.
"Using JSON when we eventually need XML is just using XML with extra steps". Or something like that. Is JSON really that much simpler than XML? Anyway, since we aren't saddled with the precondition that "everything must be stable and backward compatible", there's freedom to experiment, so I guess it's not really necessary to spend too much time debating and trying to make the "definite 100% sure best decision". We can just pick something and try it. If it works out, great; if it doesn't then we pick something else :-)
I believe it would be easier to use the domxml as the base language for the new library, too. libvirt-devaddr would accept it with various hints (expressed as its own extension to the XML?) such as "place all of these devices in the same NUMA node", "keep on root bus" or "separate these two chattering devices to their own bus". The output of libvirt-devaddr would be a domxml with <devices> filled with controllers and addresses, readily available for consumption by libvirt.

I don't believe that using the libvirt domain XML is a good idea for this as it unnecessarily constrains the usage scenarios. Most management applications do not use the domain XML as their canonical internal storage format. KubeVirt has its JSON/YAML schema for k8s API, OpenStack/RHEV just store metadata in their DB, others vary again. Some of these applications benefit from being able to expand device topology/addressing, a long time before they get anywhere near use of domain XML - the latter only matters when you come to instantiate a VM on a particular host.
This explains why it's not necessary to use XML. But I don't see use of XML as "unnecessarily constraining" the usage scenarios. Does it make the code (on either side) unnecessarily inefficient? Does it require pulling in libraries that applications otherwise wouldn't need? Is the required code too complex?
We could of course have a convenience method which optionally generates a domain XML template from the output list of devices, if someone believes that's useful to standardize on, but I don't think the domain XML should be the core format.
I would also like this library to be usable for scenarios in which libvirt is not involved at all. One of the strange things about the QEMU driver in libvirt compared to the other hypervisor drivers is that it is missing an intermediate API layer. In other drivers the hypervisor platform itself provides a full management API layer, and libvirt merely maps the libvirt APIs to the underlying mgmt API or data formats. IOW, libvirt is just a mapping layer.
When you're just a "mapping layer", and you're expected to transparently map in both directions, it gets problematic. Especially when there are multiple ways of describing the same setup, or options supported at one end that are ignored/not supported at the other. Not sure why I'm replying to this point, just when I hear "mapping layer" I think about the fact that netcf was never able to deal with the many different ways that debian interfaces files could be written, or ignore but leave in place extra ifcfg options it didn't support (that's just a couple that come to mind, and we shouldn't derail this conversation to talk about them :-/)
QEMU though only really provides a few low level building blocks, alongside other building blocks you have to pull in from Linux. It doesn't even provide a configuration file. Libvirt pulls all these pieces together to form the complete management QEMU API, as well as mapping everything onto the libvirt domain XML & APIs. I think there is scope & interest/demand to look at creating an intermediate layer that provides a full management layer for QEMU, such that libvirt can eventually become just a mapping layer for QEMU. In such a scenario the libvirt-devaddr library is still very useful but you don't want it using the libvirt domain XML, as that's not likely to be the format in use.
My opinion would be that it's not necessary for libvirt domain XML (or a subset) to be the format, but that it also shouldn't necessarily be avoided (unless the alternative is better in some quantifiable way). Anyway, in the end I think my opinion is we should push ahead and think about consequences of the specifics later, after some experimenting. I'd love to help if there's a place for it. I'm just not sure where/how I could contribute, especially since I have only about 4 hours worth of golang knowledge :-) (certainly not against getting more though!)

On 3/19/20 4:00 PM, Laine Stump wrote:
TL;DR - I'm not as anti-XML as the proposal seems to be, but also not pro-XML. I also (after thinking about it) understand the advantage of putting this in a separate library. So yeah, let's go for it!
[...]
Anyway, in the end I think my opinion is we should push ahead and think about consequences of the specifics later, after some experimenting. I'd love to help if there's a place for it. I'm just not sure where/how I could contribute, especially since I have only about 4 hours worth of golang knowledge :-) (certainly not against getting more though!)
I'll start writing some code here and there for this initiative. Has anyone already started doing something? Otherwise I'll push code to GitHub/GitLab when I have stuff to show.

My understanding from the discussions is that the API is going to supply JSON responses instead of domxml (domXML might be supplied as an option, but it wouldn't be the default format used). Is that correct?

Thanks,

DHB

On 4/30/20 2:20 PM, Daniel Henrique Barboza wrote:
I'll start writing some code here and there for this initiative. Has anyone already started doing something? Otherwise I'll push code to GitHub/GitLab when I have stuff to show.
Since I've been involved with libvirt's PCI address assignment code for quite a long time (and so there's a lot of knowledge about it embedded in my brain) I *should be* starting to do something with libvirt-devaddr, although I've been delinquent in asking danpb for more specific direction - as I said before this is a completely new medium for me, so I don't really know where to start.

Based on your enthusiasm, I'm guessing you have more experience than me with using go and various go libraries, is that right? If so that could be very helpful in getting it off the ground.

Here are two things that would help enable me to make useful contributions:

1) a basic "source tree for a go library" setup in a libvirt-subproject on gitlab (since gitlab is the official location of libvirt projects now), including basic commit and CI hooks/test cases. I'm guessing we could borrow/steal a lot from what was done by the people who participated in "virt-blocks" last fall. Andrea - any advice/suggestions to give here?

(A side question - should we put it under the libvirt umbrella on gitlab right away? Or play around in personal trees at first and then later fork it into an official libvirt project?)

2) a more concrete idea of what the API should look like. This is always the toughest part for me, since it is what the rest of the world sees, so it needs to be intelligible and capable of expansion, and I have a long history of making questionable choices that come back to haunt me (and everybody else! :-P). Since danpb has made good decisions in this area in the past (and since the original proposal is his), I'm thinking/hoping he can help provide direction to minimize mis-steps (on the other hand, I know he's really busy, so maybe he was just hoping that someone else would grab up his proposal and run with it).

Once those things are agreed upon and mostly in place, I think it will be more practical for multiple people to contribute, and in particular I will be able to put my memories of the idiosyncrasies of libvirt and its PCI and other address allocation to better use (and hopefully I'll become more familiar with go in the process).
My understanding from the discussions is that the API is going to supply JSON responses instead of domxml (domXML might be supplied as an option, but it wouldn't be the default format used). Is that correct?
I don't think there was any hard specification of the output format, just that it doesn't need to be married to XML. I've tended to view the current love affair with JSON as similar to 200x's love affair with XML, so I don't have any assumption that it's the best choice, but if there's nothing better, then I guess why not?

On 4/30/20 4:14 PM, Laine Stump wrote:
[...]
Here are two things that would help enable me to make useful contributions:
1) a basic "source tree for a go library" setup in a libvirt-subproject on gitlab (since gitlab is the official location of libvirt projects now), including basic commit and CI hooks/test cases. I'm guessing we could borrow/steal a lot from what was done by the people who participated in "virt-blocks" last fall. Andrea - any advice/suggestions to give here?
It would be of great help if we could get "inspiration" from another project for the initial CI/unit test skeleton.
(A side question - should we put it under the libvirt umbrella on gitlab right away? Or play around in personal trees at first and then later fork it into an official libvirt project?)
I'd rather put it under a personal gitlab tree first. Putting it inside the libvirt umbrella might generate unrealistic expectations for something that's still in its infancy.
2) a more concrete idea of what the API should look like. This is always the toughest part for me, since it is what the rest of the world sees, so it needs to be intelligible and capable of expansion, and I have a long history of making questionable choices that come back to haunt me (and everybody else! :-P). Since danpb has made good decisions in this area in the past (and since the original proposal is his), I'm thinking/hoping he can help provide direction to minimize mis-steps (on the other hand, I know he's really busy, so maybe he was just hoping that someone else would grab up his proposal and run with it).
My initial plan is to get the logic/APIs design from Libvirt, rename them in a Gopher fashion, re-code it with Go and call it a day :) In all seriousness, we have some room to do "not so good" APIs and change them if necessary since it's a fresh project. At this stage we can start with simplified versions of the use cases danpb described in the first email of the thread and play it by ear. Thanks, DHB

On Thu, Apr 30, 2020 at 09:00:51PM -0300, Daniel Henrique Barboza wrote:
My initial plan is to get the logic/APIs design from Libvirt, rename them in a Gopher fashion, re-code it with Go and call it a day :)
That is really not a way I would like to go, as that means we immediately inherit the design bias of the current libvirt code. The goal is to be able to replace current libvirt code eventually, but I don't want it to just be a clone of that code, as I think it misses the opportunity to try to design something better than what we have done.

As a particular example.. the current placement code has no conceptual model of machine types present in QEMU. We've just got many "if" tests that take different codepaths based on heuristics about the machine type. I would like the new API to have an explicit conceptual model of each machine type we intend to support. ie it should have full representation of the default topology of devices that are mandated by the machine type. Ideally this modelling should be extendable without having to write code in the placement model. ie we should be able to load a "i440fx.yaml" file describing the i440fx machine type and the placement logic "just works". We should not have any tests like "if (is i440fx)" in the code itself.

The libvirt code shows us the range of features we need to support at least though.

Regards, Daniel
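[Editor's note: to make the "i440fx.yaml" idea concrete, here is a minimal sketch of a data-driven machine-type description. Every type, field and file name below is invented for illustration and is not part of any existing libvirt-devaddr code; JSON is used only to stay within the Go standard library, and a YAML file would be handled the same way.]

// Sketch only: a data-driven machine-type model. None of these names
// come from libvirt; they are purely illustrative.
package main

import (
	"encoding/json"
	"fmt"
)

// BuiltinController describes a controller the machine type provides
// by default (e.g. the i440fx PCI root bus).
type BuiltinController struct {
	Type    string `json:"type"`    // "pci-root", "pcie-root", ...
	Slots   int    `json:"slots"`   // addressable slots on this controller
	Hotplug bool   `json:"hotplug"` // whether devices can be hotplugged here
}

// MachineType is the declarative description that would live in a data
// file rather than in code.
type MachineType struct {
	Name        string              `json:"name"`
	Controllers []BuiltinController `json:"controllers"`
}

// FirstHotpluggable picks a default controller that accepts hotplug,
// with no machine-type specific branching anywhere.
func (m *MachineType) FirstHotpluggable() (*BuiltinController, bool) {
	for i := range m.Controllers {
		if m.Controllers[i].Hotplug {
			return &m.Controllers[i], true
		}
	}
	return nil, false
}

const i440fxDescription = `{
  "name": "pc-i440fx",
  "controllers": [
    {"type": "pci-root", "slots": 31, "hotplug": true}
  ]
}`

func main() {
	var mt MachineType
	if err := json.Unmarshal([]byte(i440fxDescription), &mt); err != nil {
		panic(err)
	}
	if c, ok := mt.FirstHotpluggable(); ok {
		fmt.Printf("%s: hotplug onto %s (%d slots)\n", mt.Name, c.Type, c.Slots)
	}
}

Supporting another machine type would then mean shipping another description like i440fxDescription, not adding another code path.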

On 5/1/20 7:40 AM, Daniel P. Berrangé wrote:
On Thu, Apr 30, 2020 at 09:00:51PM -0300, Daniel Henrique Barboza wrote:
My initial plan is to get the logic/APIs design from Libvirt, rename them in a Gopher fashion, re-code it with Go and call it a day :)
That is really not a way I would like to go, as that means we immediately inherit the design bias of the current libvirt code. The goal is to be able to replace current libvirt code eventually, but I don't want it to just be a clone of that code, as I think it misses the opportunity to try to design something better than what we have done.
As a particular example.. the current placement code has no conceptual model of machine types present in QEMU. We've just got many "if" tests that take different codepaths based on heuristics about the machine type. I would like the new API to have an explicit conceptual model of each machine type we intend to support. ie it should have full representation of the default topology of devices that are mandated by the machine type. Ideally this modelling should be extendable without having to write code in the placement model. ie we should be able to load a "i440fx.yaml" file describing the i440fx machine type and the placement logic "just works". We should not have any tests like "if (is i440fx)" in the code itself.
That's a sound idea. I'd say that instead of basing yourselves on the QEMU machine type addressing we should aim at implementing the machine specification instead, even as a long term goal.

E.g. let's say that Libvirt wants addressing services for a hotplug in a QEMU i440fx guest. Instead of devaddr implementing "this is how the i440fx addressing works in QEMU", devaddr should be more concerned about "this is how the i440fx processor addressing works". If QEMU does additional/different things on top of that then qemu_driver.c should operate on that. This allows devaddr to be hypervisor agnostic.
The libvirt code shows us the range of features we need to support at least though.
I'll see if I can take a look at all the "if (pseries)" checks in the Libvirt device addressing code to get an idea of how a PowerPC addressing model would work compared to x86. DHB
Regards, Daniel

On Fri, May 01, 2020 at 12:57:54PM -0300, Daniel Henrique Barboza wrote:
On 5/1/20 7:40 AM, Daniel P. Berrangé wrote:
On Thu, Apr 30, 2020 at 09:00:51PM -0300, Daniel Henrique Barboza wrote:
My initial plan is to get the logic/APIs design from Libvirt, rename them in a Gopher fashion, re-code it with Go and call it a day :)
That is really not a way I would like to go, as that means we immediately inherit the design bias of the current libvirt code. The goal is to be able to replace current libvirt code eventually, but I don't want it to just be a clone of that code, as I think it misses the opportunity to try to design something better than what we have done.
As a particular example.. the current placement code has no conceptual model of machine types present in QEMU. We've just got many "if" tests that take different codepaths based on heuristics about the machine type. I would like the new API to have an explicit conceptual model of each machine type we intend to support. ie it should have full representation of the default topology of devices that are mandated by the machine type. Ideally this modelling should be extendable without having to write code in the placement model. ie we should be able to load a "i440fx.yaml" file describing the i440fx machine type and the placement logic "just works". We should not have any tests like "if (is i440fx)" in the code itself.
That's a sound idea. I'd say that instead of basing yourselves on the QEMU machine type addressing we should aim at implementing the machine specification instead, even as a long term goal.
E.g. let's say that Libvirt wants addressing services for a hotplug in a QEMU i440fx guest. Instead of devaddr implementing "this is how the i440fx addressing works in QEMU", devaddr should be more concerned about "this is how the i440fx processor addressing works". If QEMU does additional/different things on top of that then qemu_driver.c should operate on that. This allows devaddr to be hypervisor agnostic.
Yes, it was not intended to be tied to QEMU's specific implementation either. It should be a generic modelling / addressing system.
The libvirt code shows us the range of features we need to support at least though.
I'll see if I can take a look at all the "if (pseries)" checks in the Libvirt device addressing code to get an idea of how a PowerPC addressing model would work compared to x86.
Regards, Daniel
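[Editor's note: a rough sketch of the "generic model plus hypervisor-specific adjustments" split discussed in this exchange. Every identifier and value below is invented for illustration; nothing here reflects an agreed design.]

// Sketch: separating a generic, hypervisor-agnostic machine/bus model
// from implementation-specific adjustments. All names are invented.
package main

import "fmt"

// BusKind is a hypervisor-agnostic bus classification.
type BusKind string

const (
	BusPCI   BusKind = "pci"
	BusPCIe  BusKind = "pcie"
	BusSpapr BusKind = "spapr-vio" // pseries paravirtualised bus
)

// MachineModel is the generic description: which buses the machine
// provides by default and how many usable slots each has.
type MachineModel struct {
	Name  string
	Buses map[BusKind]int
}

// HypervisorProfile layers implementation-specific quirks on top of the
// generic model (the role qemu_driver.c would play today).
type HypervisorProfile func(MachineModel) MachineModel

// qemuProfile is a made-up example of such an adjustment.
func qemuProfile(m MachineModel) MachineModel {
	buses := make(map[BusKind]int, len(m.Buses))
	for k, v := range m.Buses {
		buses[k] = v
	}
	if n, ok := buses[BusPCI]; ok {
		buses[BusPCI] = n - 1 // pretend the implementation reserves one slot
	}
	m.Buses = buses
	return m
}

func main() {
	models := []MachineModel{
		{Name: "pc-i440fx", Buses: map[BusKind]int{BusPCI: 31}},
		{Name: "pseries", Buses: map[BusKind]int{BusPCI: 31, BusSpapr: 256}},
	}
	var profile HypervisorProfile = qemuProfile
	for _, m := range models {
		adjusted := profile(m)
		fmt.Println(adjusted.Name, adjusted.Buses)
	}
}

The point is only that the "if (pseries)"-style knowledge would live in the model and the profile, not in the placement code itself.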

On Thu, Apr 30, 2020 at 03:14:09PM -0400, Laine Stump wrote:
Here are two things that would help enable me to make useful contributions:
1) a basic "source tree for a Go library" setup in a libvirt-subproject on gitlab (since gitlab is the official location of libvirt projects now), including basic commit and CI hooks/test cases. I'm guessing we could borrow/steal a lot from what was done by the people who participated in "virt-blocks" last fall. Andrea - any advice/suggestions to give here?
(A side question - should we put it under the libvirt umbrella on gitlab right away? Or play around in personal trees at first and then later fork it into an official libvirt project?)
I intended it to be under the libvirt project right from the start, and have indeed already created the repos & CI. There is no reason to hide it away in private repos. It is fine for the official repo to have zero guarantee of stability in the early days.
2) a more concrete idea of what the API should look like. This is always the toughest part for me, since it is what the rest of the world sees, so it needs to be intelligible and capable of expansion, and I have a long history of making questionable choices that come back to haunt me (and everybody else! :-P). Since danpb has made good decisions in this area in the past (and since the original proposal is his), I'm thinking/hoping he can help provide direction to minimize mis-steps (on the other hand, I know he's really busy, so maybe he was just hoping that someone else would grab up his proposal and run with it).
Yep, this is what I'm fleshing out an API skeleton for now.

Regards, Daniel
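[Editor's note: purely to anchor the discussion, a skeleton mirroring the proposal's inputs (arch/machine type, global tunables, minimal device list) and output (fully addressed device list) might look roughly like this. Every identifier is hypothetical and none of it comes from the actual skeleton being worked on.]

// Sketch of a possible top-level entry point mirroring the proposal's
// inputs and outputs. Every identifier here is hypothetical.
package devaddr

// Tunables are global knobs controlling the placement policy.
type Tunables struct {
	PreferMultifunction bool // pack devices into functions on one slot
	SpareRootPorts      int  // PCIe root ports to pre-create for later hotplug
}

// Address is a placed location on some bus.
type Address struct {
	Bus      string // controller/bus identifier
	Slot     int
	Function int
}

// DeviceRequest is the minimal description supplied by the caller,
// optionally pinned to a caller-chosen address.
type DeviceRequest struct {
	Model    string    // e.g. "virtio-net", "virtio-blk"
	Address  *Address  // nil means "assign one for me"
	Override *Tunables // optional per-device override of the global tunables
}

// PlacedDevice is a fully addressed device in the expanded output.
type PlacedDevice struct {
	Model   string
	Address Address
}

// Assign expands the minimal device list into a fully addressed one.
// A real implementation would also add any controllers the machine
// type requires; this stub only shows the shape of the call.
func Assign(arch, machine string, global Tunables, devs []DeviceRequest) ([]PlacedDevice, error) {
	out := make([]PlacedDevice, 0, len(devs))
	for i, d := range devs {
		addr := Address{Bus: "pci.0", Slot: i + 1}
		if d.Address != nil {
			addr = *d.Address
		}
		out = append(out, PlacedDevice{Model: d.Model, Address: addr})
	}
	return out, nil
}

A caller would render the returned list into whatever format it needs (domain XML, JSON, or something else entirely).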

On Thu, Apr 30, 2020 at 03:20:19PM -0300, Daniel Henrique Barboza wrote:
On 3/19/20 4:00 PM, Laine Stump wrote:
TL;DR - I'm not as anti-XML as the proposal seems to be, but also not pro-XML. I also (after thinking about it) understand the advantage of putting this in a separate library. So yeah, let's go for it!
[...]
Anyway, in the end I think my opinion is we should push ahead and think about consequences of the specifics later, after some experimenting. I'd love to help if there's a place for it. I'm just not sure where/how I could contribute, especially since I have only about 4 hours worth of golang knowledge :-) (certainly not against getting more though!)
I'll start writing some code here and there for this initiative. Has anyone already started doing something? Otherwise I'll push code to github/gitlab when I have stuff to show.
I've actually started work on a skeleton for the API this week to try to flesh out some rough ideas for the general approach to the problem.
My understanding from the discussions is that the API is going to supply JSON responses instead of domxml (domXML might be supplied as an option, but it wouldn't be the default format used). Is that correct?
I'm completely ignoring libvirt Domain XML right now. I don't want the design to be constrained by any of our historic design decisions in libvirt. IOW, I intend it to be a greenfield site code wise. One of the eventual goals is to make use of this to replace current libvirt device addressing code, so eventually attention will have to return to how this would map to Domain XML. It just isn't a short term priority.

As for JSON, I'm again not too bothered about that right now, as that's a really minor part of the problem. Primarily I want to come up with a plain Go interface, based on some data model structs and APIs. Those data model structs can trivially be mapped to JSON/YAML using the Go encoding/json package or an equivalent YAML library.

Regards, Daniel
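[Editor's note: a small illustration of the "plain Go structs first, serialization second" approach. The struct and field names are invented for the example.]

// Sketch: the same data-model structs rendered as JSON purely through
// struct tags, keeping the wire format a detail rather than a design
// driver. Names are invented for the example.
package main

import (
	"encoding/json"
	"fmt"
)

type Address struct {
	Bus      string `json:"bus"`
	Slot     int    `json:"slot"`
	Function int    `json:"function"`
}

type PlacedDevice struct {
	Model   string  `json:"model"`
	Address Address `json:"address"`
}

func main() {
	devs := []PlacedDevice{
		{Model: "virtio-net", Address: Address{Bus: "pci.0", Slot: 1}},
		{Model: "virtio-blk", Address: Address{Bus: "pci.0", Slot: 2}},
	}
	out, err := json.MarshalIndent(devs, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
	// The same structs could be fed to a YAML library (e.g. gopkg.in/yaml.v3)
	// to get a YAML rendering instead.
}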

On 4 Mar 2020, at 13:50, Daniel P. Berrangé <berrange@redhat.com> wrote:
We've been doing alot of refactoring of code in recent times, and also have plans for significant infrastructure changes. We still need to spend time delivering interesting features to users / applications. This mail is to introduce an idea for a solution to an specific area applications have had long term pain with libvirt's current "mechanism, not policy" approach - device addressing. This is a way for us to show brand new ideas & approaches for what the libvirt project can deliver in terms of management APIs.
To set expectations straight: I have written no code for this yet, merely identified the gap & conceptual solution.
The device addressing problem =============================
One of the key jobs libvirt does when processing a new domain XML configuration is to assign addresses to all devices that are present. This involves adding various device controllers (PCI bridges, PCI root ports, IDE/SCSI buses, USB controllers, etc) if they are not already present, and then assigning PCI, USB, IDE, SCSI, etc, addresses to each device so they are associated with controllers. When libvirt spawns a QEMU guest, it will pass full address information to QEMU.
Libvirt, as a general rule, aims to avoid defining and implementing policy around expansion of guest configuration / defaults, however, it is inescapable in the case of device addressing due to the need to guarantee a stable hardware ABI to make live migration and save/restore to disk work. The policy that libvirt has implemented for device addressing is, as much as possible, the same as the addressing scheme QEMU would apply itself.
While libvirt succeeds in its goal of providing a stable hardware API, the addressing scheme used is not well suited to all deployment scenarios of QEMU. This is an inevitable result of having a specific assignment policy implemented in libvirt which has to trade off mutually incompatible use cases/goals.
When the libvirt addressing policy is not been sufficient, management applications are forced to take on address assignment themselves, which is a massive non-trivial job with many subtle problems to consider.
Places where libvirt's addressing is insufficient for PCI include
* Setting up multiple guest NUMA nodes and associating devices to specific nodes * Pre-emptive creation of extra PCIe root ports, to allow for later device hotplug on PCIe topologies * Determining whether to place a device on a PCI or PCIe bridge * Controlling whether a device is placed into a hotpluggable slot * Controlling whether a PCIe root port supports hotplug or not * Determining whether to places all devices on distinct slots or buses, vs grouping them all into functions on the same slot * Ability to expand the device addressing without being on the hypervisor host
Libvirt wishes to avoid implementing many different address assignment policies. It also wishes to keep the domain XML as a representation of the virtual hardware, not add a bunch of properties to it which merely serve as tunable input parameters for device addressing algorithms.
There is thus a dilemma here. Management applications increasingly need fine grained control over device addressing, while libvirt doesn't want to expose fine grained policy controls via the XML.
The new libvirt-devaddr API ===========================
The way out of this is to define a brand new virt management API which tackles this specific problem in a way that addresses all the problems mgmt apps have with device addressing and explicitly provides a variety of policy impls with tunable behaviour.
By "new API", I actually mean an entirely new library, completely distinct from libvirt.so, or anything else we've delivered so far. The closest we've come to delivering something at this kind of conceptual level, would be the abortive attempt we made with "libvirt-builder" to deliver a policy-driven API instead of mechanism based. This proposal is still quite different from that attempt.
At a high level
* The new API is "libvirt-devaddr" - short for "libvirt device addressing"
* As input it will take
1. The guest CPU architecture and machine type 2. A list of global tunables specifying desired behaviour of the address assignment policy 3. A minimal list of devices needed in the virtual machine, with optional addresses and optional per-device tunables to override the global tunables
* As output it will emit
1. fully expanded list of devices needed in the virtual machine, with addressing information sufficient to ensure stable hardware ABI
Initially the API would implement something that behaves the same way as libvirt's current address assignment API.
The intended usage would be
* Mgmt application makes a minimal list of devices they want in their guest * List of devices is fed into libvirt-devaddr API * Mgmt application gets back a full list of devices & addresses * Mgmt application writes a libvirt XML doc using this full list & addresses * Mgmt application creates the guest in libvirt
+Adrian, +Andrea, +Michal

It dawned on me that kata may provide an additional “borderline” usage model for this new API. Specifically, it might be a case where the tunables may be “relayed” through kata-runtime, but really originate from OpenShift.

Also, what about in-guest device naming / assignment?

Adrian, do you think that the iommu group issues you ran into could help Dan validate that the new library has all the input it needs to make a sane choice in that case?

Do you think that it would be possible to call the library twice with different tunables in order to get the host and guest device names?
IOW, this new "libvirt-devaddr" API is intended to be used prior to creating the XML that is used by libvirt. The API could also be used prior to needing to hotplug a new device to an existing guest. This API is intended to be a deliverable of the libvirt project, but it would be completely independent of the current libvirt API. Most especially note that it would NOT use the domain XML in any way. This gives applications maximum flexibility in how they consume this functionality, not trying to force a way to build domain XML.
It would have greater freedom in its API design, making different choices from libvirt.so on topics such as programming language (C vs Go vs Python etc), API stability timeframe (forever stable vs sometimes changing API), data formats (structs, vs YAML/JSON vs XML etc), and of course the conceptual approach (policy vs mechanism)
The expectation is that this new API would be most likely to be consumed by KubeVirt, OpenStack, Kata, as the list of problems shown earlier is directly based on issues seen working with KubeVirt & OpenStack in particular. It is not limited to these applications and is broadly useful as conceptual thing.
It would be a goal that this API should also be used by libvirt itself to replace its current internal device addressing impl. Essentially the new API should be seen as a way to expose/extract the current libvirt internal algorithm, making it available to applications in a flexible manner. I don't anticipate actually copying the current addressing code in libvirt as-is, but it would certainly serve as reference for the kind of logic we need to implement, so you might consider it a "port" or "rewrite" in some very rough sense.
I think this new API concept is a good way for the project make a start in using Go for libvirt. The functionality covered has a clearly defined scope limit, making it practical to deliver a real impl in a reasonably short time frame. Extracting this will provide a real world benefit to our application consumers, solving many long standing problems they have with libvirt, and thus justify the effort in doing this work in libvirt in a non-C language. The main question mark would be about how we might make this functionality available to Python apps if we chose Go. It is possible to expose a C API from Go, and we would need this to consume it from libvirt. There is then the need to manually write a Python API binding which is tedious work.
Regards, Daniel

On 5/4/20 5:15 PM, Christophe de Dinechin wrote:
On 4 Mar 2020, at 13:50, Daniel P. Berrangé <berrange@redhat.com> wrote:
We've been doing alot of refactoring of code in recent times, and also have plans for significant infrastructure changes. We still need to spend time delivering interesting features to users / applications. This mail is to introduce an idea for a solution to an specific area applications have had long term pain with libvirt's current "mechanism, not policy" approach - device addressing. This is a way for us to show brand new ideas & approaches for what the libvirt project can deliver in terms of management APIs.
To set expectations straight: I have written no code for this yet, merely identified the gap & conceptual solution.
The device addressing problem =============================
One of the key jobs libvirt does when processing a new domain XML configuration is to assign addresses to all devices that are present. This involves adding various device controllers (PCI bridges, PCI root ports, IDE/SCSI buses, USB controllers, etc) if they are not already present, and then assigning PCI, USB, IDE, SCSI, etc, addresses to each device so they are associated with controllers. When libvirt spawns a QEMU guest, it will pass full address information to QEMU.
Libvirt, as a general rule, aims to avoid defining and implementing policy around expansion of guest configuration / defaults, however, it is inescapable in the case of device addressing due to the need to guarantee a stable hardware ABI to make live migration and save/restore to disk work. The policy that libvirt has implemented for device addressing is, as much as possible, the same as the addressing scheme QEMU would apply itself.
While libvirt succeeds in its goal of providing a stable hardware API, the addressing scheme used is not well suited to all deployment scenarios of QEMU. This is an inevitable result of having a specific assignment policy implemented in libvirt which has to trade off mutually incompatible use cases/goals.
When the libvirt addressing policy is not been sufficient, management applications are forced to take on address assignment themselves, which is a massive non-trivial job with many subtle problems to consider.
Places where libvirt's addressing is insufficient for PCI include
* Setting up multiple guest NUMA nodes and associating devices to specific nodes * Pre-emptive creation of extra PCIe root ports, to allow for later device hotplug on PCIe topologies * Determining whether to place a device on a PCI or PCIe bridge * Controlling whether a device is placed into a hotpluggable slot * Controlling whether a PCIe root port supports hotplug or not * Determining whether to places all devices on distinct slots or buses, vs grouping them all into functions on the same slot * Ability to expand the device addressing without being on the hypervisor host
Libvirt wishes to avoid implementing many different address assignment policies. It also wishes to keep the domain XML as a representation of the virtual hardware, not add a bunch of properties to it which merely serve as tunable input parameters for device addressing algorithms.
There is thus a dilemma here. Management applications increasingly need fine grained control over device addressing, while libvirt doesn't want to expose fine grained policy controls via the XML.
The new libvirt-devaddr API ===========================
The way out of this is to define a brand new virt management API which tackles this specific problem in a way that addresses all the problems mgmt apps have with device addressing and explicitly provides a variety of policy impls with tunable behaviour.
By "new API", I actually mean an entirely new library, completely distinct from libvirt.so, or anything else we've delivered so far. The closest we've come to delivering something at this kind of conceptual level, would be the abortive attempt we made with "libvirt-builder" to deliver a policy-driven API instead of mechanism based. This proposal is still quite different from that attempt.
At a high level
* The new API is "libvirt-devaddr" - short for "libvirt device addressing"
* As input it will take
1. The guest CPU architecture and machine type 2. A list of global tunables specifying desired behaviour of the address assignment policy 3. A minimal list of devices needed in the virtual machine, with optional addresses and optional per-device tunables to override the global tunables
* As output it will emit
1. fully expanded list of devices needed in the virtual machine, with addressing information sufficient to ensure stable hardware ABI
Initially the API would implement something that behaves the same way as libvirt's current address assignment API.
The intended usage would be
* Mgmt application makes a minimal list of devices they want in their guest * List of devices is fed into libvirt-devaddr API * Mgmt application gets back a full list of devices & addresses * Mgmt application writes a libvirt XML doc using this full list & addresses * Mgmt application creates the guest in libvirt
+Adrian, +Andrea, +Michal
It dawned on me that kata may provide an additional “borderline” usage model for this new API. Specifically, it might be a case where the tunables may be “relayed” through kata-runtime, but really originate from OpenShift.
The OCI Device specification is mknod-based [1], having no bus-specific information, so I think all of the logic would be implemented by kata-runtime. However, the Device Plugin specifies an ENV variable with the host PCI address.
Also, what about in-guest device naming / assignment?

This is a problem because the ENV var will not match the guest's device address. I don't see a way around this without having a deterministic way of addressing devices and modifying/complementing that higher level information.
Adrian, do you think that the iommu group issues you ran into could help Dan validate that the new library has all the input it needs to make a sane choice in that case?
I don't think the iommu group problem would require interaction with the library. The Kata agent was just mknod-ing the devices. Fixed in [2].
Do you think that it would be possible to call the library twice with different tunables in order to get the host and guest device names?

I don't think I fully understand your proposal. Once qemu is called with a specific set of device addresses, what could possibly be done in the guest?
In order to be able to consume the devices, the application would need to know the host->guest address mappings. Whether that mapping is exposed via kata-agent, an ENV var or other means is yet to be discussed.

WRT the library itself, I think it would alleviate some of the logic currently being implemented in kata-runtime, which includes things like:

- Determining whether the device's BAR size is small enough for it to be hot-plugged into a PCI bridge
- Determining whether the machine type supports hotplugging on the root bus, or whether root ports need to be pre-allocated.

Related work: [4] [5] and associated PRs
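[Editor's note: a sketch of how such kata-runtime heuristics might instead be expressed as inputs to the library. Every name, field and threshold below is invented for illustration and is not a real kata or libvirt-devaddr interface.]

// Sketch: per-device hints a runtime such as kata-runtime could hand to
// the library instead of implementing placement heuristics itself.
// All names, fields and thresholds are invented for illustration.
package main

import "fmt"

// DeviceHints captures what the runtime knows about a passthrough device.
type DeviceHints struct {
	HostAddress string // host PCI address, e.g. from the device plugin env var
	BARSizeMB   int    // total BAR size of the device
	Hotplug     bool   // whether the device arrives after boot
}

// Placement is the decision handed back; the guest address lets the
// runtime report the host->guest mapping to the workload.
type Placement struct {
	GuestAddress string
	ViaRootPort  bool // true if a pre-allocated PCIe root port was used
}

// place is a stand-in for the real policy: a machine-type model plus
// tunables would drive this, not the hard-coded values used here.
func place(machine string, h DeviceHints) Placement {
	onRootPort := machine == "q35" && h.Hotplug
	// Illustrative threshold only: a large-BAR device may not be
	// hot-pluggable behind a conventional PCI bridge, one of the checks
	// kata-runtime performs today.
	if h.BARSizeMB > 256 {
		onRootPort = true
	}
	slot := 3 // a free slot chosen by the (omitted) allocator
	return Placement{
		GuestAddress: fmt.Sprintf("0000:00:%02x.0", slot),
		ViaRootPort:  onRootPort,
	}
}

func main() {
	h := DeviceHints{HostAddress: "0000:3b:00.0", BARSizeMB: 512, Hotplug: true}
	p := place("q35", h)
	fmt.Printf("host %s -> guest %s (via root port: %v)\n", h.HostAddress, p.GuestAddress, p.ViaRootPort)
}

The host-to-guest mapping the workload needs would then fall out of the returned placement rather than being reconstructed by the runtime.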
IOW, this new "libvirt-devaddr" API is intended to be used prior to creating the XML that is used by libvirt. The API could also be used prior to needing to hotplug a new device to an existing guest. This API is intended to be a deliverable of the libvirt project, but it would be completely independent of the current libvirt API. Most especially note that it would NOT use the domain XML in any way. This gives applications maximum flexibility in how they consume this functionality, not trying to force a way to build domain XML.
It would have greater freedom in its API design, making different choices from libvirt.so on topics such as programming language (C vs Go vs Python etc), API stability timeframe (forever stable vs sometimes changing API), data formats (structs, vs YAML/JSON vs XML etc), and of course the conceptual approach (policy vs mechanism)
The expectation is that this new API would be most likely to be consumed by KubeVirt, OpenStack, Kata, as the list of problems shown earlier is directly based on issues seen working with KubeVirt & OpenStack in particular. It is not limited to these applications and is broadly useful as conceptual thing.
It would be a goal that this API should also be used by libvirt itself to replace its current internal device addressing impl. Essentially the new API should be seen as a way to expose/extract the current libvirt internal algorithm, making it available to applications in a flexible manner. I don't anticipate actually copying the current addressing code in libvirt as-is, but it would certainly serve as reference for the kind of logic we need to implement, so you might consider it a "port" or "rewrite" in some very rough sense.
I think this new API concept is a good way for the project make a start in using Go for libvirt. The functionality covered has a clearly defined scope limit, making it practical to deliver a real impl in a reasonably short time frame. Extracting this will provide a real world benefit to our application consumers, solving many long standing problems they have with libvirt, and thus justify the effort in doing this work in libvirt in a non-C language. The main question mark would be about how we might make this functionality available to Python apps if we chose Go. It is possible to expose a C API from Go, and we would need this to consume it from libvirt. There is then the need to manually write a Python API binding which is tedious work.
Regards, Daniel
[1] https://github.com/opencontainers/runtime-spec/blob/2a060269036678148a707a92...
[2] https://github.com/kata-containers/runtime/pull/2550/commits/4d2574a7230e5a1...
[3] https://github.com/kata-containers/runtime/issues/115
[4] https://github.com/kata-containers/runtime/issues/2432
[5] https://github.com/kata-containers/runtime/issues/2460
participants (7)
- Adrian Moreno
- Christophe de Dinechin
- Dan Kenigsberg
- Daniel Henrique Barboza
- Daniel P. Berrangé
- Laine Stump
- Laine Stump