On 05/05/2016 02:48 PM, Andrea Bolognani wrote:
On Fri, 2016-01-29 at 01:32 -0500, Shivaprasad G Bhat wrote:
> The nodeinfo output was fixed earlier to reflect the actual cpus available in
> KVM mode on PPC64. The earlier fixes covered the aspect of not making a host
> look overcommitted when it's not. The current fixes are aimed at helping
> users make better decisions about the kind of guest cpu topology that can be
> supported on a given subcores_per_core setting of the KVM host, and also hint
> at the way to pin the guest vcpus efficiently.
>
> I am planning to add some test cases once the approach is accepted.
>
> With respect to Patch 2:
> The second patch adds a new element to the cpus tag and I need your input on
> whether that is okay, or if there is a better way. I am not sure if existing
> clients have RNG checks that might fail with this approach, or if the checks
> are enforced only on the tags and not on the elements.
>
> With my approach, if the RNG checks pass, the new element "capacity", even if
> ignored by many clients, would have no impact except for PPC64.
>
> To the extent I have looked at the code, the siblings changes don't affect existing
> libvirt functionality. Please do let me know otherwise.
So, I've been going through this old thread trying to figure out
a way to improve the status quo. I'd like to collect as much
feedback as possible, especially from people who have worked in
this area of libvirt before or have written tools based on it.
As hinted above, this series is really trying to address two
different issues, and I think it's helpful to reason about them
separately.
** Guest threads limit **
My dual-core laptop will happily run a guest configured with
<cpu>
<topology sockets='1' cores='1' threads='128'/>
</cpu>
but POWER guests are limited to 8/subcores_per_core threads.
How is it limited? Does something explicitly fail (libvirt, qemu, guest OS)?
Or are the threads just not usable in the VM?
Is it specific to PPC64 KVM, or does it apply to emulated PPC64 as well?
We need to report this information to the user somehow, and
I can't see an existing place where it would fit nicely. We
definitely don't want to overload the meaning of an existing
element/attribute with this. It should also only appear in
the (dom)capabilities XML of ppc64 hosts.
I don't think this is too problematic or controversial; we
just need to pick a nice place to display this information.
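Just to make that concrete, a strawman (the element name and placement
below are entirely made up, nothing like this exists today) could be an
extra element under <cpu> in the domcapabilities XML, emitted only on
ppc64 KVM hosts:

<cpu>
  ...
  <maxThreadsPerCore>4</maxThreadsPerCore>
</cpu>

with the value being 8/subcores_per_core (so 8, 4 or 2) and the element
simply absent everywhere else.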
** Efficient guest topology **
To achieve optimal performance, you want to match guest
threads with host threads.
On x86, you can choose suitable host threads by looking at
the capabilities XML: the presence of elements like
<cpu id='2' socket_id='0' core_id='1'
siblings='2-3'/>
<cpu id='3' socket_id='0' core_id='1'
siblings='2-3'/>
means you should configure your guest to use
<vcpu placement='static' cpuset='2-3'>2</vcpu>
<cpu>
<topology sockets='1' cores='1' threads='2'/>
</cpu>
Notice how siblings can be found either by looking at the
attribute with the same name, or by matching them using the
value of the core_id attribute. Also notice how you are
supposed to pin as many vCPUs as the number of elements in
the cpuset - one guest thread per host thread.
Ahh, I see that threads are implicitly reported by the fact that socket_id and
core_id are identical across the different cpu ids... that took me a couple
minutes :)
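To make the sibling matching concrete, here's a rough sketch in Python
(xml.etree over a hand-pasted excerpt; a real client would parse the
output of virConnect.getCapabilities() instead, and this is not meant to
reflect how nova or vdsm actually do it):

import xml.etree.ElementTree as ET
from collections import defaultdict

# Tiny excerpt of the capabilities XML from the example above.
CAPS = """
<cpus num='2'>
  <cpu id='2' socket_id='0' core_id='1' siblings='2-3'/>
  <cpu id='3' socket_id='0' core_id='1' siblings='2-3'/>
</cpus>
"""

root = ET.fromstring(CAPS)

# Approach 1: trust the siblings attribute as-is (it is a range/list
# string like '2-3' that still needs expanding).
for cpu in root.iter('cpu'):
    print(cpu.get('id'), '->', cpu.get('siblings'))

# Approach 2: group CPUs that share the same (socket_id, core_id).
groups = defaultdict(list)
for cpu in root.iter('cpu'):
    groups[(cpu.get('socket_id'), cpu.get('core_id'))].append(cpu.get('id'))
for (socket, core), ids in groups.items():
    print('socket %s core %s -> %s' % (socket, core, ','.join(ids)))

On x86 the two approaches should always agree; the interesting part is
what happens on POWER, below.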
On POWER, this gets much trickier: only the *primary* thread
of each (sub)core appears to be online in the host, but all
threads can actually have a vCPU running on them. So
<cpu id='0' socket_id='0' core_id='32'
siblings='0,4'/>
<cpu id='4' socket_id='0' core_id='32'
siblings='0,4'/>
which is what you'd get with subcores_per_core=2, is very
confusing.
Okay, this bit took me _more_ than a couple minutes. Is this saying the
topology is

socket #0
  core #32
    subcore #1
      cpu id='0'  thread #1
      cpu id='1'  thread #2 (offline)
      cpu id='2'  thread #3 (offline)
      cpu id='3'  thread #4 (offline)
    subcore #2
      cpu id='4'  thread #1
      cpu id='5'  thread #2 (offline)
      cpu id='6'  thread #3 (offline)
      cpu id='7'  thread #4 (offline)
...

What would the hypothetical physical_core_id value look like in that example?
The optimal guest topology in this case would be
<vcpu placement='static' cpuset='4'>4</vcpu>
<cpu>
<topology sockets='1' cores='1' threads='4'/>
</cpu>
So when we pin to logical CPU #4, ppc KVM is smart enough to see that it's the
primary thread of a subcore, and will then make use of the offline threads in
the same subcore?
Or does libvirt do anything fancy to facilitate this case?
but neither approach mentioned above works to figure out the
correct value for the cpuset attribute.
In this case, a possible solution would be to alter the values
of the core_id and siblings attributes so that both would be
the same as the id attribute, which would naturally make both
approaches described above work.
Additionally, a new attribute would be introduced to serve as
a multiplier for the "one guest thread per host thread" rule
mentioned earlier: the resulting XML would look like
<cpu id='0' socket_id='0' core_id='0' siblings='0'
capacity='4'/>
<cpu id='4' socket_id='0' core_id='4' siblings='4'
capacity='4'/>
which contains all the information needed to build the right
guest topology. The capacity attribute would have value 1 on
all architectures except for ppc64.
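As a quick illustration of how a client could consume that (the capacity
attribute is only a proposal at this point, so this is nothing more than
a sketch): pick one online CPU and size the guest to fill its (sub)core,
one vCPU per unit of capacity, all pinned to the primary thread.

import xml.etree.ElementTree as ET

# Excerpt of the proposed XML above; 'capacity' does not exist in
# libvirt today.
CAPS = """
<cpus num='2'>
  <cpu id='0' socket_id='0' core_id='0' siblings='0' capacity='4'/>
  <cpu id='4' socket_id='0' core_id='4' siblings='4' capacity='4'/>
</cpus>
"""

cpu = ET.fromstring(CAPS).find('cpu')
capacity = int(cpu.get('capacity', '1'))  # would be 1 everywhere but ppc64

print("<vcpu placement='static' cpuset='%s'>%d</vcpu>"
      % (cpu.get('id'), capacity))
print("<topology sockets='1' cores='1' threads='%d'/>" % capacity)

which for the XML above prints a guest configuration equivalent to the
ppc64 example earlier, just pinned to CPU #0 instead of #4.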
capacity is pretty generic-sounding... not sure if that's good or not in this
case. Maybe thread_capacity?
We could arguably use the capacity attribute to cover the
use case described in the first part as well, by declaring that
any value other than 1 means there's a limit to the number of
threads a guest core can have. I think doing so has the
potential to produce much grief in the future, so I'd rather
keep them separate - even if it means inventing a new element.
It's also been proposed to add a physical_core_id attribute,
which would contain the real core id and allow tools to figure
out which subcores belong to the same core - it would be the
same as core_id for all other architectures and for ppc64
when subcores_per_core=1. It's not clear whether having this
attribute would be useful or just confusing.
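For reference, with the subcores_per_core=2 example from earlier (and
assuming the proposed capacity attribute as well; none of this exists
today) it might look like

<cpu id='0' socket_id='0' core_id='0' physical_core_id='32'
     siblings='0' capacity='4'/>
<cpu id='4' socket_id='0' core_id='4' physical_core_id='32'
     siblings='4' capacity='4'/>

which would let tools tell that the two subcores are carved out of the
same physical core #32 (the value asked about in the question earlier).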
IMO it seems like something worth adding since it is a pertinent piece of the
topology, even if there isn't a clear programmatic use for it yet.
This is all I have for now. Please let me know what you think
about it.
FWIW virt-manager basically doesn't consume the host topology XML, so there's
no concern there.
A quick grep seems to indicate that both nova (openstack) and vdsm
(ovirt/rhev) _do_ consume this XML for their numa magic (git grep sibling),
but I can't speak to the details of how it's consumed.
- Cole