[libvirt] RFC: New CPU hot(un)plug API and XML

Hi list,

I'm planning on adding an API that will be used instead of virDomainSetVcpus and will allow more granular control of which virtual CPUs are enabled for a guest. The new approach will allow using CPU hotplug properly with NUMA guests, as the old APIs would not allow adding CPUs to very specific cgroups. The old APIs should still work fine with the current approach, although the final implementation should also allow unplugging vcpus from the guest by using new qemu features. I'm still not sure whether it will be possible to use this in a backward compatible fashion, depending on how this stuff will exactly need to be set up in qemu.

# API #

As for the new API I'm thinking of the following design:

    int virDomainVcpu(virDomainPtr domain,
                      unsigned int id,
                      unsigned int flags);

The flags for this API would be the following:

- usual domain modification impact:
  * VIR_DOMAIN_SET_VCPU_CURRENT
  * VIR_DOMAIN_SET_VCPU_LIVE
  * VIR_DOMAIN_SET_VCPU_CONFIG
- for specifying the operation, as the default operation would query the cpu state:
  * VIR_DOMAIN_SET_VCPU_ENABLE
  * VIR_DOMAIN_SET_VCPU_DISABLE
- misc:
  * VIR_DOMAIN_SET_VCPU_GUEST - use the guest agent instead of ACPI hotplug
  * VIR_DOMAIN_SET_VCPU_NUMA_NODE - 'id' is the ID of a NUMA node where the cpu should be enabled/disabled rather than a CPU id. This is a convenience flag that will allow adding a cpu to a given NUMA node rather than having to find the correct ID.
  * VIR_DOMAIN_SET_VCPU_CORE - use core-level hotplug (see [1]). This makes sure that the CPU will be plugged in on platforms that require plugging in multiple threads at once.

VIR_DOMAIN_SET_VCPU_NUMA_NODE and VIR_DOMAIN_SET_VCPU_GUEST are mutually exclusive, as the guest agent doesn't report which guest NUMA node a CPU belongs to.

If the idea of one API that will both query and set is too nonconformist for our existing API design, I have no problem adding Get/Set versions and/or exploding the ADD/REMOVE flags into a separate parameter.

# XML #

The new API will require us to add new XML that will allow tracking the state of vCPUs individually. Internally we now have a data structure allowing us to keep the relevant data in one place. Currently we are setting data relevant to vCPUs in many places:

    <domain>
      [...]
      <vcpu current='1'>3</vcpu>
      [...]
      <cputune>
        <cpupin ... />
      </cputune>
      [...]
      <cpu>
        <numa>
          <cell id='0' cpus='0' memory='102400' unit='KiB'/>
          <cell id='1' cpus='1-2' memory='102400' unit='KiB'/>
        </numa>
      </cpu>

As we'll be required to keep the state of every single cpu, I'm thinking of adding a new subelement called '<vcpus>' to <domain>. This will have a '<vcpu>' subelement for every configured cpu.

I'm specifically not going to add any of the cpupin or numa node ids to /domain/vcpus/vcpu as input parameters, to avoid introducing the very complicated checking code that would be required to keep the data in sync. I'm thinking of adding the numa node id as an output-only attribute, since it's relevant to the hotplug case and misplaced otherwise. I certainly can add the duplicated data as output-only attributes.

The XML with the new elements should look like:

    <domain>
      [...]
      <vcpu current='1'>3</vcpu>
      <vcpus>
        <vcpu id='0' state='enabled'/>           <!-- option 1, no extra data -->
        <vcpu id='1' state='disabled' cell='1'/> <!-- option 2, just the numa node, since it's non-obvious -->
        <vcpu id='2' state='disabled' cell='1' pin='1-2' scheduler='...'/>
                                                 <!-- option 3, all the data duplicated -->
      </vcpus>
      [...]
      <cputune>
        <cpupin ... />
      </cputune>
      [...]
      <cpu>
        <numa>
          <cell id='0' cpus='0' memory='102400' unit='KiB'/>
          <cell id='1' cpus='1-2' memory='102400' unit='KiB'/>
        </numa>
      </cpu>

# migration #

To ensure migration compatibility, a new libvirt will set a new migration feature flag in cases where a sparse topology was created by any means. Older versions of libvirt will reject it. As the new cpu data will be ignored by the parser in older libvirt, we don't need to stop formatting it on migration. (Fortunately, schemas are not validated during migration.)

# qemu/platform implementation caveats #

When starting the VM for the first time it might be necessary to start a throw-away qemu process to query some details that we'll need to pass on the command line. I'm not sure if this is still necessary, and I'll try to avoid it at all costs.

[1] Some architectures (ppc64) don't directly support thread-level hotplug and thus require us to plug in a whole core, which translates into multiple threads (8 in the case of POWER8).

Possibly other yet unknown problems.

Thanks for your feedback.

Peter
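To illustrate the intended call pattern, here is a minimal sketch of how a management app might drive the proposed API. Everything in it is hypothetical: virDomainVcpu() and the VIR_DOMAIN_SET_VCPU_* flags exist only in this RFC, not in libvirt.

    #include <libvirt/libvirt.h>

    /* Hypothetical sketch only: virDomainVcpu() and the flags below are
     * the RFC proposal above, not an existing libvirt interface. */
    static int
    enable_vcpu_live(virDomainPtr dom, unsigned int vcpu_id)
    {
        /* Hot-plug the given vCPU into the running guest via ACPI. */
        return virDomainVcpu(dom, vcpu_id,
                             VIR_DOMAIN_SET_VCPU_LIVE |
                             VIR_DOMAIN_SET_VCPU_ENABLE);
    }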

On Mon, Jun 13, 2016 at 02:48:51PM +0200, Peter Krempa wrote:
Hi list,
I'm planning on adding an API that will be used instead of virDomainSetVcpus and will allow more granular control of which virtual CPUs are enabled for a guest.
The new approach will allow using CPU hotplug properly with NUMA guests, as the old APIs would not allow adding CPUs to very specific cgroups.
Great! We need that... Er, mgmt apps need that =)
The old APIs should still work fine with the current approach, although the final implementation should also allow unplugging vcpus from the guest by using new qemu features.
I'm still not sure whether it will be possible to use this in a backward compatible fashion, depending on how this stuff will exactly need to be set up in qemu.
If the worst comes to the worst, we can say the old API is deprecated and it'll just do basic things (as it does now). I haven't studied the code like you probably did for some time before sending this, but I don't see why it should cause any major problems.
# API #
As for the new API I'm thinking of the following design:
    int virDomainVcpu(virDomainPtr domain,
                      unsigned int id,
                      unsigned int flags);
The flags for this API would be the following:

- usual domain modification impact:
  * VIR_DOMAIN_SET_VCPU_CURRENT
  * VIR_DOMAIN_SET_VCPU_LIVE
  * VIR_DOMAIN_SET_VCPU_CONFIG
- for specifying the operation, as the default operation would query the cpu state:
  * VIR_DOMAIN_SET_VCPU_ENABLE
  * VIR_DOMAIN_SET_VCPU_DISABLE
- misc:
  * VIR_DOMAIN_SET_VCPU_GUEST - use the guest agent instead of ACPI hotplug
  * VIR_DOMAIN_SET_VCPU_NUMA_NODE - 'id' is the ID of a NUMA node where the cpu should be enabled/disabled rather than a CPU id. This is a convenience flag that will allow adding a cpu to a given NUMA node rather than having to find the correct ID.
  * VIR_DOMAIN_SET_VCPU_CORE - use core-level hotplug (see [1]). This makes sure that the CPU will be plugged in on platforms that require plugging in multiple threads at once.
VIR_DOMAIN_SET_VCPU_NUMA_NODE and VIR_DOMAIN_SET_VCPU_GUEST are mutually exclusive, as the guest agent doesn't report which guest NUMA node a CPU belongs to.
So since the agent can only receive the number of vcpus, no new feature will be usable with this flag until that command is added to the guest agent, right? Does it make sense to have this flag in the new API then?
If the idea of one API that will both query and set is too nonconformist for our existing API design, I have no problem adding Get/Set versions and/or exploding the ADD/REMOVE flags into a separate parameter.
I thought there already was a consensus reached about what the default choice for new APIs should be. I don't remember it, though, as I don't feel strongly about any of those.
# XML #
The new API will require us to add new XML that will allow tracking the state of vCPUs individually. Internally we now have a data structure allowing us to keep the relevant data in one place.
Currently we are setting data relevant to VCPUs in many places.
    <domain>
      [...]
      <vcpu current='1'>3</vcpu>
      [...]
      <cputune>
        <cpupin ... />
      </cputune>
      [...]
      <cpu>
        <numa>
          <cell id='0' cpus='0' memory='102400' unit='KiB'/>
          <cell id='1' cpus='1-2' memory='102400' unit='KiB'/>
        </numa>
      </cpu>
As we'll be required to keep the state of every single cpu, I'm thinking of adding a new subelement called '<vcpus>' to <domain>. This will have a '<vcpu>' subelement for every configured cpu.
I'm specifically not going to add any of the cpupin or numa node ids to /domain/vcpus/vcpu as input parameters, to avoid introducing the very complicated checking code that would be required to keep the data in sync.
I'm thinking of adding the numa node id as an output-only attribute, since it's relevant to the hotplug case and misplaced otherwise. I certainly can add the duplicated data as output-only attributes.
The XML with the new elements should look like:
    <domain>
      [...]
      <vcpu current='1'>3</vcpu>
      <vcpus>
        <vcpu id='0' state='enabled'/>           <!-- option 1, no extra data -->
        <vcpu id='1' state='disabled' cell='1'/> <!-- option 2, just the numa node, since it's non-obvious -->
        <vcpu id='2' state='disabled' cell='1' pin='1-2' scheduler='...'/>
                                                 <!-- option 3, all the data duplicated -->
It is nice to have all the info in there, but won't it confuse users if it is output-only? Wait, let me rephrase that question. Won't it confuse users? Wait, most of our XML does already, so scratch that =) Anyway, how much duplicated info do we already have? I can now only think of the memory device, which we had to have anyway. Would it be too confusing to just add a <cpu/> device with all the info? That would require all the checks and a lot of unnecessary code. But it would be consistent with the memory, and it actually is a device. Most probably not worth the pain. But OTOH, if all the data are output-only... Sorry for the ramble, just my 2 cents.
      </vcpus>
      [...]
      <cputune>
        <cpupin ... />
      </cputune>
      [...]
      <cpu>
        <numa>
          <cell id='0' cpus='0' memory='102400' unit='KiB'/>
          <cell id='1' cpus='1-2' memory='102400' unit='KiB'/>
        </numa>
      </cpu>
# migration #
To ensure migration compatibility, a new libvirt will set a new migration feature flag in cases where a sparse topology was created by any means. Older versions of libvirt will reject it.
As the new cpu data will be ignored by the parser in older libvirt, we don't need to stop formatting it on migration. (Fortunately, schemas are not validated during migration.)
Unless there are some of those loops through all child elements/attributes, but either you'll come across that or it will bite you in the ass during the first migration trial ;)
# qemu/platform implementation caveats #
When starting the VM for the first time it might be necessary to start a throw-away qemu process to query some details that we'll need to pass on the command line. I'm not sure if this is still necessary, and I'll try to avoid it at all costs.
I hope capabilities will tell us what we need. If not, I hope it can be added.
[1] Some architectures (ppc64) don't directly support thread-level hotplug and thus require us to plug in a whole core, which translates into multiple threads (8 in the case of POWER8).
Possibly other yet unknown problems.
Fingers crossed for least amount of those.
Thanks for your feedback.
Peter

On Mon, Jun 13, 2016 at 02:48:51PM +0200, Peter Krempa wrote:
# API #
As for the new API I'm thinking of the following design:
    int virDomainVcpu(virDomainPtr domain,
                      unsigned int id,
                      unsigned int flags);
The flags for this API would be the following:

- usual domain modification impact:
  * VIR_DOMAIN_SET_VCPU_CURRENT
  * VIR_DOMAIN_SET_VCPU_LIVE
  * VIR_DOMAIN_SET_VCPU_CONFIG
- for specifying the operation, as the default operation would query the cpu state:
  * VIR_DOMAIN_SET_VCPU_ENABLE
  * VIR_DOMAIN_SET_VCPU_DISABLE
We generally use flags to control behaviour of a specific action, not to actually choose which action to run. IOW, I'd prefer to see us have separate APIs for this:

    virDomainVCPUPlug / virDomainVCPUUnplug

Or have a single API with a separate argument to reflect the state, e.g.

    virDomainVCPUSetHardware(virDomainPtr domain,
                             unsigned int id,
                             unsigned int present, /* 0 == unplugged, 1 == plugged */
                             unsigned int flags);

Of course, inside the QEMU driver it's fine if you want to have these separate APIs call into a common code path to do their work.
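For comparison, a hedged sketch of a call using this shape. Only the signature suggested above is assumed (virDomainVCPUSetHardware() is hypothetical); VIR_DOMAIN_AFFECT_LIVE and virGetLastErrorMessage() are existing libvirt API.

    #include <stdio.h>
    #include <libvirt/libvirt.h>
    #include <libvirt/virterror.h>

    /* Hypothetical: plug vCPU 3 into the running domain; per the comment
     * in the suggested signature, 'present' == 1 means plugged. */
    static int
    plug_vcpu3(virDomainPtr dom)
    {
        if (virDomainVCPUSetHardware(dom, 3, 1, VIR_DOMAIN_AFFECT_LIVE) < 0) {
            fprintf(stderr, "vCPU hotplug failed: %s\n",
                    virGetLastErrorMessage());
            return -1;
        }
        return 0;
    }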
- misc:
  * VIR_DOMAIN_SET_VCPU_GUEST - use the guest agent instead of ACPI hotplug
I think in general it was/is a mistake to overload guest agent CPU online/offline-ing with real CPU hotplug/unplug. The two operations are really distinct from each other, and having them in the same API gives the misleading impression that they're just two ways of doing the same thing, whereas really they are two completely different beasts, with very different security implications too.

IOW, I'd suggest we make this more explicit by having another method:

    virDomainVCPUSetGuestUsage(virDomainPtr domain,
                               unsigned int id,
                               unsigned int online, /* 0 == disabled, 1 == enabled */
                               unsigned int flags);
* VIR_DOMAIN_SET_VCPU_NUMA_NODE - 'id' is the ID of a NUMA node where the cpu should be enabled/disabled rather than a CPU id. This is a convenience flag that will allow adding a cpu to a given NUMA node rather than having to find the correct ID.
Using flags is fine if we want to retrofit this kind of thing into an existing API, but if we're designing a new API, we should aim to represent the NUMA node ID explicitly, I think. Then again, I wonder if the desire for this flag is really just a reflection of us making it too hard to figure out the correct vCPU ID for a NUMA node.
* VIR_DOMAIN_SET_VCPU_CORE - use core-level hotplug (see [1]). This makes sure that the CPU will be plugged in on platforms that require plugging in multiple threads at once.
So the use of 'int id' rather strongly suggests we're only plugging a single vCPU at a time. Given that with PPC you're forced to plug an entire set of threads at once, in a single "core", I think perhaps we should have a bitmask instead of an 'int id', e.g.

    unsigned char *cpumask, unsigned int cpumasklen

The domain guest capabilities XML could describe the minimum required granularity for hotplugging/unplugging CPUs. So the app would query that and discover whether they need to plug 1 or 8 cpus at once. The app would be responsible for setting the CPU mask to reflect how many CPUs they want to plug at once. That would remove the need to have an explicit VIR_DOMAIN_SET_VCPU_CORE flag, I think. We'd simply return an error if the app didn't pay attention to the required CPU hotplug granularity described in the capabilities.
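To make the bitmask idea concrete, a sketch of the caller side for a ppc64 guest that has to plug a whole 8-thread core. This is hypothetical throughout: the cpumask-based variant of virDomainVCPUSetHardware() is only suggested above, and the thread count would come from the capabilities XML rather than a hard-coded macro.

    #include <string.h>

    #define VCPU_MASK_LEN 32  /* covers vCPU IDs 0..255 */

    /* Set the bits for one full core's worth of threads starting at
     * 'first', which is assumed to be aligned to the hotplug granularity. */
    static void
    mask_core(unsigned char *mask, unsigned int first, unsigned int threads)
    {
        unsigned int i;

        memset(mask, 0, VCPU_MASK_LEN);
        for (i = first; i < first + threads; i++)
            mask[i / 8] |= 1 << (i % 8);
    }

    /* Usage (hypothetical API shape):
     *   unsigned char mask[VCPU_MASK_LEN];
     *   mask_core(mask, 16, 8);   // plug the core holding vCPUs 16-23
     *   virDomainVCPUSetHardware(dom, mask, VCPU_MASK_LEN, 1,
     *                            VIR_DOMAIN_AFFECT_LIVE);
     */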
VIR_DOMAIN_SET_VCPU_NUMA_NODE and VIR_DOMAIN_SET_VCPU_GUEST are mutually exclusive, as the guest agent doesn't report which guest NUMA node a CPU belongs to.
If the idea of one API that will both query and set is too nonconformist for our existing API design, I have no problem adding Get/Set versions and/or exploding the ADD/REMOVE flags into a separate parameter.
Yep, I'd rather do that.
The new API will require us to add new XML that will allow tracking the state of vCPUs individually. Internally we now have a data structure allowing us to keep the relevant data in one place.
Presumably we're only going to track the plugged/unplugged state, and *not* the guest OS online/offline state. This would be another reason to deal with guest OS online/offline via a separate API from the real hotplug/unplug API.
Currently we are setting data relevant to VCPUs in many places.
    <domain>
      [...]
      <vcpu current='1'>3</vcpu>
      [...]
      <cputune>
        <cpupin ... />
      </cputune>
      [...]
      <cpu>
        <numa>
          <cell id='0' cpus='0' memory='102400' unit='KiB'/>
          <cell id='1' cpus='1-2' memory='102400' unit='KiB'/>
        </numa>
      </cpu>
As we'll be required to keep the state of every single cpu, I'm thinking of adding a new subelement called '<vcpus>' to <domain>. This will have a '<vcpu>' subelement for every configured cpu.
I'm specifically not going to add any of the cpupin or numa node ids to /domain/vcpus/vcpu as input parameters, to avoid introducing the very complicated checking code that would be required to keep the data in sync.
I'm thinking of adding the numa node id as an output-only attribute, since it's relevant to the hotplug case and misplaced otherwise. I certainly can add the duplicated data as output-only attributes.
The XML with the new elements should look like:
    <domain>
      [...]
      <vcpu current='1'>3</vcpu>
      <vcpus>
        <vcpu id='0' state='enabled'/>           <!-- option 1, no extra data -->
        <vcpu id='1' state='disabled' cell='1'/> <!-- option 2, just the numa node, since it's non-obvious -->
        <vcpu id='2' state='disabled' cell='1' pin='1-2' scheduler='...'/>
                                                 <!-- option 3, all the data duplicated -->
      </vcpus>
      [...]
      <cputune>
        <cpupin ... />
      </cputune>
      [...]
      <cpu>
        <numa>
          <cell id='0' cpus='0' memory='102400' unit='KiB'/>
          <cell id='1' cpus='1-2' memory='102400' unit='KiB'/>
        </numa>
      </cpu>
In the 'virsh capabilities' XML, the CPUs are listed underneath the <cell>. So rather than adding <vcpu> as children of <vcpus>, I think perhaps we should just put them under the <cell> here too. That would avoid the need to add a cell=NN attribute, e.g.

    <cpu>
      <numa>
        <cell id='0' cpus='0' memory='102400' unit='KiB'>
          <cpus>
            <cpu id="0" state="enabled"/>
          </cpus>
        </cell>
        <cell id='1' cpus='1-2' memory='102400' unit='KiB'>
          <cpus>
            <cpu id="1" state="disabled"/>
            <cpu id="2" state="enabled"/>
          </cpus>
        </cell>
      </numa>
    </cpu>
# qemu/platform implementation caveats #
When starting the VM for the first time it might be necessary to start a throw-away qemu process to query some details that we'll need to pass on the command line. I'm not sure if this is still necessary, and I'll try to avoid it at all costs.
If that is needed, then I think we would need to fix QEMU to let us introspect that at the time we normally probe capabilities. Spawning an extra throw-away QEMU is absolutely not an option.

Regards, Daniel

On Wed, Jun 15, 2016 at 18:10:01 +0100, Daniel Berrange wrote:
On Mon, Jun 13, 2016 at 02:48:51PM +0200, Peter Krempa wrote:
TL;DR: I mostly agree with your suggestions, although there's a slight problem with the suggested XML format. See [1] at the bottom.
# API #
[...]
We generally use flags to control behaviour of a specific action, not to actually choose which action to run.
IOW, I'd prefer to see us have separate APIs for this
virDomainVCPUPlug / virDomainVCPUUnplug
I thought so :)
Or, have a single API with separate argument to reflect the state, eg
    virDomainVCPUSetHardware(virDomainPtr domain,
                             unsigned int id,
                             unsigned int present, /* 0 == unplugged, 1 == plugged */
                             unsigned int flags);
I probably prefer this one, mostly to avoid adding yet more APIs :)
Of course, inside the QEMU driver it's fine if you want to have these separate APIs call into a common code path to do their work.
- misc:
  * VIR_DOMAIN_SET_VCPU_GUEST - use the guest agent instead of ACPI hotplug
I think in general it was/is a mistake to overload guest agent CPU online/offline-ing with real CPU hotplug/unplug. The two operations are really distinct from each other, and having them in the same API gives the misleading impression that they're just two ways of doing the same thing, whereas really they are two completely different beasts, with very different security implications too.
IOW, I'd suggest we make this more explicit by having another method
    virDomainVCPUSetGuestUsage(virDomainPtr domain,
                               unsigned int id,
                               unsigned int online, /* 0 == disabled, 1 == enabled */
                               unsigned int flags);
Agreed. I'll add this separately then.
* VIR_DOMAIN_SET_VCPU_NUMA_NODE - 'id' is the ID of a NUMA node where the cpu should be enabled/disabled rather than a CPU id. This is a convenience flag that will allow adding a cpu to a given NUMA node rather than having to find the correct ID.
Using flags is fine if we want to retrofit this kind of thing into an existing API, but if we're designing a new API, we should aim to represent the NUMA node ID explicitly, I think.
Then again, I wonder if the desire for this flag is really just a reflection of us making it too hard to figure out the correct vCPU ID for a NUMA node.
IMO it's pretty obvious how to get the data; you just need to collate it appropriately. For some reason I wanted to add syntactic sugar for that, but that's certainly not needed, and avoiding it will decrease code complexity.
* VIR_DOMAIN_SET_VCPU_CORE - use core-level hotplug (see [1]). This makes sure that the CPU will be plugged in on platforms that require plugging in multiple threads at once.
So the use of 'int id' rather strongly suggests we're only plugging a single vCPU at a time. Given that with PPC you're forced to plug an entire set of threads at once, in a single "core", I think perhaps we should have a bitmask instead of an 'int id', e.g.
    unsigned char *cpumask, unsigned int cpumasklen
This is a good idea. I'll use this approach, but I'll limit it in the docs and code so that it will (un)plug just one entity at a time, so that we don't run into issues with partial failures.
The domain guest capabilities XML could describe the minimum required granularity for hotplugging/unplugging CPUs. So the app would query that and discover whether they need to plug 1 or 8 cpus at once. The app would be responsible for setting the CPU mask to reflect how many CPUs they want to plug at once. That would remove the need to have an explicit VIR_DOMAIN_SET_VCPU_CORE flag, I think. We'd simply return an error if the app didn't pay attention to the required CPU hotplug granularity described in the capabilities.

You also need to make sure that you are plugging the correct cpus together, but that should be an easy check.
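A possible shape for that check, as a hedged sketch: the helper is hypothetical (not existing libvirt code), and 'granularity' is assumed to come from the capabilities XML (e.g. 8 on ppc64).

    #include <stdbool.h>

    /* Hypothetical helper: reject masks that select only part of a
     * granularity-aligned group of vCPUs (e.g. a partial 8-thread core). */
    static bool
    vcpu_mask_is_aligned(const unsigned char *mask, unsigned int masklen,
                         unsigned int granularity)
    {
        unsigned int i, j, set;

        for (i = 0; i < masklen * 8; i += granularity) {
            for (j = 0, set = 0; j < granularity && i + j < masklen * 8; j++) {
                if (mask[(i + j) / 8] & (1 << ((i + j) % 8)))
                    set++;
            }
            if (set != 0 && set != granularity)
                return false;  /* partially selected core -> error */
        }
        return true;
    }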
[...]
The new API will require us to add new XML that will allow tracking the state of vCPUs individually. Internally we now have a data structure allowing us to keep the relevant data in one place.
Presumably we're only going to track the plugged/unplugged state, and *not* the guest OS online/offline state. This would be another reason to deal with guest OS online/offline via a separate API from the real hotplug/unplug API.
Yes, dealing with the guest state would require adding guest agent events to notify us of changes. [...]
    <domain>
      [...]
      <vcpu current='1'>3</vcpu>
      <vcpus>
        <vcpu id='0' state='enabled'/>           <!-- option 1, no extra data -->
        <vcpu id='1' state='disabled' cell='1'/> <!-- option 2, just the numa node, since it's non-obvious -->
        <vcpu id='2' state='disabled' cell='1' pin='1-2' scheduler='...'/>
                                                 <!-- option 3, all the data duplicated -->
      </vcpus>
      [...]
      <cputune>
        <cpupin ... />
      </cputune>
      [...]
      <cpu>
        <numa>
          <cell id='0' cpus='0' memory='102400' unit='KiB'/>
          <cell id='1' cpus='1-2' memory='102400' unit='KiB'/>
        </numa>
      </cpu>
In the 'virsh capabilities' XML, the CPUs are listed underneath the <cell>. So rather than adding <vcpu> as children of <vcpus>, I think perhaps we should just put them under the <cell> here too. That would avoid the need to add a cell=NN attribute, e.g.
    <cpu>
      <numa>
        <cell id='0' cpus='0' memory='102400' unit='KiB'>
          <cpus>
            <cpu id="0" state="enabled"/>
          </cpus>
        </cell>
        <cell id='1' cpus='1-2' memory='102400' unit='KiB'>
          <cpus>
            <cpu id="1" state="disabled"/>
            <cpu id="2" state="enabled"/>
          </cpus>
        </cell>
      </numa>
    </cpu>
[1] I was thinking of doing the same at first, but there's one caveat: for non-NUMA guests you don't get the <numa> element at all. In such a case the online/offline state would need to be kept somewhere else, and I wanted to avoid having two places to put the data. I've opted for adding a new element which satisfies even non-NUMA guests, at the expense that you'd need to collate data from two places in the case of NUMA guests.
# qemu/platform implementation caveats #
When starting the VM for the first time it might be necessary to start a throw-away qemu process to query some details that we'll need to pass on the command line. I'm not sure if this is still necessary, and I'll try to avoid it at all costs.
If that is needed, then I think we would need to fix QEMU to let us introspect that at the time we normally probe capabilities. Spawning an extra throw-away QEMU is absolutely not an option.
I was advocating for this to the qemu developers when I was discussing this stuff with them. I hope they paid attention.

On Thu, Jun 16, 2016 at 09:14:12AM +0200, Peter Krempa wrote:
On Wed, Jun 15, 2016 at 18:10:01 +0100, Daniel Berrange wrote:
On Mon, Jun 13, 2016 at 02:48:51PM +0200, Peter Krempa wrote:
TL;DR: I mostly agree with your suggestions, although there's a slight problem with the suggested XML format. See [1] at the bottom.
    <domain>
      [...]
      <vcpu current='1'>3</vcpu>
      <vcpus>
        <vcpu id='0' state='enabled'/>           <!-- option 1, no extra data -->
        <vcpu id='1' state='disabled' cell='1'/> <!-- option 2, just the numa node, since it's non-obvious -->
        <vcpu id='2' state='disabled' cell='1' pin='1-2' scheduler='...'/>
                                                 <!-- option 3, all the data duplicated -->
      </vcpus>
      [...]
      <cputune>
        <cpupin ... />
      </cputune>
      [...]
      <cpu>
        <numa>
          <cell id='0' cpus='0' memory='102400' unit='KiB'/>
          <cell id='1' cpus='1-2' memory='102400' unit='KiB'/>
        </numa>
      </cpu>
In the 'virsh capabilities' XML, the CPUs are listed underneath the <cell>. So rather than adding <vcpu> as children of <vcpus>, I think perhaps we should just put them under the <cell> here too. That would avoid the need to add a cell=NN attribute, e.g.
    <cpu>
      <numa>
        <cell id='0' cpus='0' memory='102400' unit='KiB'>
          <cpus>
            <cpu id="0" state="enabled"/>
          </cpus>
        </cell>
        <cell id='1' cpus='1-2' memory='102400' unit='KiB'>
          <cpus>
            <cpu id="1" state="disabled"/>
            <cpu id="2" state="enabled"/>
          </cpus>
        </cell>
      </numa>
    </cpu>
[1]
I was thinking of doing the same at first, but there's one caveat: for non-NUMA guests you don't get the <numa> element at all. In such a case the online/offline state would need to be kept somewhere else, and I wanted to avoid having two places to put the data.
I've opted for adding a new element which satisfies even non-NUMA guests, at the expense that you'd need to collate data from two places in the case of NUMA guests.
Hmm, yes, I forgot about that, oh well.

Regards, Daniel