On Wed, Jun 15, 2016 at 18:10:01 +0100, Daniel Berrange wrote:
On Mon, Jun 13, 2016 at 02:48:51PM +0200, Peter Krempa wrote:
TL;DR: I mostly agree with your suggestions, although there's a slight
problem with the suggested XML format. See [1] at the bottom.
> # API #
[...]
We generally use flags to control behaviour of a specific action,
not to actually choose which action to run.
IOW, I'd prefer to see us have separate APIs for this
virDomainVCPUPlug / virDomainVCPUUnplug
I thought so :)
Or, have a single API with a separate argument to reflect
the state, eg
  virDomainVCPUSetHardware(virDomainPtr domain,
                           unsigned int id,
                           unsigned int present, /* 0 == unplugged, 1 == plugged */
                           unsigned int flags);
I probably prefer this one, mostly to avoid adding yet more APIs :)
Of course, inside the QEMU driver, it's fine if you want to have
these separate APIs call into a common code path to do their work.
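For illustration, a caller of that single-API variant might look roughly
like the sketch below. The function is only a proposal in this thread, so
its name, signature and semantics are assumptions taken from the text
above; only virDomainPtr and the error-handling pattern are existing
libvirt conventions.

  #include <stdio.h>
  #include <libvirt/libvirt.h>

  /* Plug vCPU 3 into a running domain using the *proposed*
   * virDomainVCPUSetHardware() call; "present" follows the convention
   * suggested above (0 == unplugged, 1 == plugged). */
  static int
  plug_vcpu(virDomainPtr dom, unsigned int id)
  {
      if (virDomainVCPUSetHardware(dom, id, 1, 0) < 0) {
          fprintf(stderr, "failed to plug vCPU %u\n", id);
          return -1;
      }
      return 0;
  }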
> - misc:
> * VIR_DOMAIN_SET_VCPU_GUEST - use the guest agent instead of ACPI hotplug
I think in general it was/is a mistake to overload guest agent
CPU onlining/offlining with real CPU hotplug/unplug. The two
operations are really distinct from each other, and having them
in the same API gives the misleading impression that they're
just two ways of doing the same thing, whereas really they are
two completely different beasts with very different security
implications too.
IOW, I'd suggest we make this more explicit by having another
method
  virDomainVCPUSetGuestUsage(virDomainPtr domain,
                             unsigned int id,
                             unsigned int online, /* 0 == disabled, 1 == enabled */
                             unsigned int flags);
Agreed. I'll add this separately then.
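To make the distinction concrete, an application would end up issuing two
clearly separate calls, along these lines (both functions are proposals
from this thread, not existing libvirt API):

  /* Real ACPI hotplug: make vCPU 3 exist in the virtual hardware. */
  virDomainVCPUSetHardware(dom, 3, 1, 0);

  /* Guest agent path: ask the guest OS to online the already present
   * vCPU 3 from inside the guest. */
  virDomainVCPUSetGuestUsage(dom, 3, 1, 0);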
> * VIR_DOMAIN_SET_VCPU_NUMA_NODE - 'id' is the ID of a numa node where the
>   cpu should be enabled/disabled rather than CPU id. This is a convenience
>   flag that will allow to add cpu to a given numa node rather than having
>   to find the correct ID.
Using flags is fine if we want to retrofit this kind of thing into an
existing API, but if we're designing a new API, we should aim to represent
the NUMA node ID explicitly, I think.
Then again, I wonder if the desire for this flag is really just a reflection
of us making it too hard to figure out the correct vCPU ID for a NUMA node.
IMO it's pretty obvious how to get the data. You just need to collate it
appropriately. For some reason I wanted to add syntax sugar for that but
that's certainly not needed and avoiding it will decrease code
complexity.
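For what it's worth, the collation itself is trivial once the per-vCPU
data has been pulled out of the domain XML; a minimal sketch, assuming a
made-up record that an application would fill from <numa>/<cell cpus=...>
and the per-vCPU state:

  #include <stdbool.h>
  #include <stddef.h>

  /* Hypothetical per-vCPU record filled from the domain XML. */
  struct vcpu_info {
      unsigned int id;
      unsigned int cell;    /* NUMA cell the vCPU belongs to */
      bool enabled;         /* currently plugged in */
  };

  /* Return the id of the first unplugged vCPU in @cell, or -1. */
  static int
  find_pluggable_vcpu(const struct vcpu_info *vcpus, size_t nvcpus,
                      unsigned int cell)
  {
      size_t i;
      for (i = 0; i < nvcpus; i++) {
          if (vcpus[i].cell == cell && !vcpus[i].enabled)
              return vcpus[i].id;
      }
      return -1;
  }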
> * VIR_DOMAIN_SET_VCPU_CORE - use thread level hotplug (see [1]). This
>   makes sure that the CPU will be plugged in on platforms that require
>   to plug in multiple threads at once.
So the use of 'int id' rather strongly suggests we're only plugging
a single vCPU at a time. Given that with PPC you're forced to plug
an entire set of threads at once, in a single "core", I think perhaps
we should have a bitmask instead of an 'int id'. eg
  unsigned char *cpumask,
  unsigned int cpumasklen
This is a good idea. I'll use this approach, but I'll limit it in the
docs and code so that it will (un)plug just one entity at a time, so that
we don't run into issues with partial failures.
The domain guest capabilities XML could describe the minimum required
granularity for hotplugging/unplugging CPUs. So the app would query
that and discover whether they need to plug 1 or 8 cpus at once.
The app would be responsible for setting the CPU mask to reflect
how many CPUs they want to plug at once. That would remove the
need to have an explicit VIR_DOMAIN_SET_VCPU_CORE flag I think.
We'd simply return an error if the app didn't pay attention to the
required CPU hotplug granularity described in the capabilities.

You also need to make sure that you are plugging the correct cpus
together, but that should be an easy check.
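A rough sketch of the caller side on a platform with 8-thread cores; the
cpumask-taking signature is only the variant suggested above (it does not
exist yet), and the granularity would come from the proposed capabilities
information rather than being hard-coded:

  #include <stdlib.h>
  #include <libvirt/libvirt.h>

  /* Plug a whole core's worth of threads at once. VIR_CPU_MAPLEN and
   * VIR_USE_CPU are existing libvirt helpers for building CPU maps;
   * virDomainVCPUSetHardware taking a cpumask is the proposed,
   * hypothetical API from this thread. */
  static int
  plug_core(virDomainPtr dom, unsigned int first, unsigned int granularity)
  {
      int cpumasklen = VIR_CPU_MAPLEN(first + granularity);
      unsigned char *cpumask = calloc(cpumasklen, 1);
      size_t i;
      int ret = -1;

      if (!cpumask)
          return -1;

      for (i = first; i < first + granularity; i++)
          VIR_USE_CPU(cpumask, i);   /* one bit per thread of the core */

      if (virDomainVCPUSetHardware(dom, cpumask, cpumasklen, 1, 0) == 0)
          ret = 0;

      free(cpumask);
      return ret;
  }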
[...]
> The new API will require us to add new XML that will allow to track the state
> of VCPUs individually. Internally we now have a data structure allowing to
> keep the relevant data in one place.
Presumably we're only going to track the plugged/unplugged state,
and *not* the guest OS online/offline state. This would be another
reason to deal with guest OS online/offline via a separate API
from the real hotplug/unplug API.
Yes, dealing with the guest state would require adding guest agent
events for notifying us of changes.
[...]
> <domain>
>   [...]
>   <vcpu current='1'>3</vcpu>
>   <vcpus>
>     <vcpu id='0' state='enabled'/>  <-- option 1, no extra data
>     <vcpu id='1' state='disabled' cell='1'/>  <-- option 2, just numa node,
>                                                   since it's non-obvious
>     <vcpu id='2' state='disabled' cell='1' pin='1-2' scheduler='...'/>
>     <!-- option 3, all the data duplicated -->
>   </vcpus>
>   [...]
>   <cputune>
>     <cpupin ... />
>   </cputune>
>   [...]
>   <cpu>
>     <numa>
>       <cell id='0' cpus='0' memory='102400' unit='KiB'/>
>       <cell id='1' cpus='1-2' memory='102400' unit='KiB'/>
>     </numa>
In the 'virsh capabilities' XML, the CPUs are listed underneath the
<cell>. So rather than adding <vcpu> as children of <vcpus>, I think
perhaps we should just put them under the <cell> here too. That would
avoid the need to add a cell=NN attribute, eg
  <cpu>
    <numa>
      <cell id='0' cpus='0' memory='102400' unit='KiB'>
        <cpus>
          <cpu id="0" state="enabled"/>
        </cpus>
      </cell>
      <cell id='1' cpus='1-2' memory='102400' unit='KiB'>
        <cpus>
          <cpu id="1" state="disabled"/>
          <cpu id="2" state="enabled"/>
        </cpus>
      </cell>
    </numa>
[1]
I was thinking of doing the same at first, but there's one caveat. For
non-numa guests you don't get the <numa> element at all. In such a case
the online/offline state would need to be kept somewhere else, and I
wanted to avoid having two places to put the data.
I've opted for adding a new element which works even for non-numa
guests, at the expense that you'd need to collate data from two places
in the case of numa guests.
> # qemu/platform implementation caveats #
>
> When starting the VM for the first time it might be necessary to start a
> throw-away qemu process to query some details that we'll need to pass in on a
> command line. I'm not sure if this is still necessary and I'll try to avoid it
> at all cost.
If that is needed, then I think we would need to fix QEMU to let us
introspect that at the time we normally probe capabilities. Spawning an
extra throw-away QEMU is absolutely not an option.
I was advocating for this to the qemu developers when I was discussing
this stuff with them. I hope they paid attention.