On Mon, 08 Feb 2016 20:11:41 +0100
Andrea Bolognani <abologna@redhat.com> wrote:
On Fri, 2016-01-29 at 01:32 -0500, Shivaprasad G Bhat wrote:
> The nodeinfo output was fixed earlier to reflect the actual CPUs available in
> KVM mode on PPC64. The earlier fixes covered the aspect of not making a host
> look overcommitted when it's not. The current fixes are aimed at helping
> users make better decisions about the kind of guest CPU topology that can be
> supported with the given subcores-per-core setting of the KVM host, and also
> at hinting at how to pin the guest vcpus efficiently.
>
> I am planning to add some test cases once the approach is accepted.
>
> With respect to Patch 2:
> The second patch adds a new element to the cpus tag and I need your input on
> whether that is okay, or if there is a better way. I am not sure if the existing
> clients have RNG checks that might fail with the approach, or if the checks
> are not enforced on the elements but only on the tags.
>
> With my approach, if the RNG checks pass, the new element "capacity", even if
> ignored by many clients, would have no impact except for PPC64.
>
> To the extent I looked at the code, the siblings changes don't affect existing
> libvirt functionality. Please do let me know otherwise.
I've looked at your patches and I can clearly see the reasoning
behind them: basically, subcores would be presented the same way as
proper cores, so that the user doesn't have to worry about ppc64-specific
details and can just use the same code as on x86 to optimally
arrange their guests.
As you mention, the way we report nodeinfo has been changed not to
make the host look prematurely overcommitted; however, this change
has caused a weird disconnect between nodeinfo and capabilities,
which should in theory always agree with each other.
I'm going to be way verbose in this message, stating stuff we
all know about, so that it can be used as a reference for people not
that familiar with the details and most importantly so that, if I
got anything wrong, I can be corrected :)
This is what nodeinfo currently looks like for a ppc64 host with
4 NUMA nodes, 1 socket per node, 5 cores per socket and 8 threads
per core, when subcores-per-core=1 and smt=off:
CPU(s): 160
CPU socket(s): 1
Core(s) per socket: 5
Thread(s) per core: 8
NUMA cell(s): 4
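For reference, a minimal sketch (assuming libvirt-python and a local
qemu:///system connection; not part of the series) of where those numbers
come from: virsh nodeinfo is a thin wrapper around virNodeGetInfo(), and
on this host cells * sockets * cores * threads matches the advertised
CPU count:

    # Minimal sketch, assuming libvirt-python and qemu:///system.
    import libvirt

    conn = libvirt.open('qemu:///system')
    model, mem_mb, cpus, mhz, nodes, sockets, cores, threads = conn.getInfo()
    print("CPU(s) reported:", cpus)                    # 160 in the example
    print("cells * sockets * cores * threads =",
          nodes * sockets * cores * threads)           # 4 * 1 * 5 * 8 = 160
    conn.close()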
And this is part of the capabilities XML:
<cells num='4'>
<cell id='0'>
<memory unit='KiB'>67108864</memory>
<pages unit='KiB' size='64'>1048576</pages>
<pages unit='KiB' size='16384'>0</pages>
<distances>
<sibling id='0' value='10'/>
<sibling id='1' value='20'/>
<sibling id='16' value='40'/>
<sibling id='17' value='40'/>
</distances>
<cpus num='5'>
<cpu id='0' socket_id='0' core_id='32' siblings='0'/>
<cpu id='8' socket_id='0' core_id='40' siblings='8'/>
<cpu id='16' socket_id='0' core_id='48' siblings='16'/>
<cpu id='24' socket_id='0' core_id='96' siblings='24'/>
<cpu id='32' socket_id='0' core_id='104' siblings='32'/>
</cpus>
</cell>
So the information doesn't seem to add up: we claim that the
host has 160 online CPUs, but the NUMA topology only contains
information about 5 of them per node, so 20 in total.
That's of course because Linux only provides us with topology
information for online CPUs, but KVM on ppc64 wants secondary
threads to be offline in the host. The secondary threads are
still used to schedule guest threads, so reporting 160 online
CPUs is correct for the purpose of planning the number of
guests based on the capacity of the host; the problem is that
the detailed NUMA topology doesn't match with that.
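To make the disconnect concrete, here's a rough sketch (same assumptions
as above; the XPath follows the capabilities layout shown earlier) that
compares the two numbers:

    # Rough sketch: CPUs according to virNodeGetInfo() vs. <cpu> elements
    # actually listed in the capabilities NUMA topology.
    import libvirt
    import xml.etree.ElementTree as ET

    conn = libvirt.open('qemu:///system')
    nodeinfo_cpus = conn.getInfo()[2]                  # 160 in the example
    caps = ET.fromstring(conn.getCapabilities())
    listed = caps.findall('./host/topology/cells/cell/cpus/cpu')
    print("nodeinfo CPU(s):      ", nodeinfo_cpus)     # 160
    print("<cpu> elements listed:", len(listed))       # 20 (5 per cell)
    conn.close()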
Yeah, that's rather unfortunate. We do want to list all the threads in
the capabilities, I think, since they're capable of running vcpus.
As a second example, here's the nodeinfo reported on the same
host when subcores-per-core=2:
CPU(s): 160
CPU socket(s): 1
Core(s) per socket: 5
Thread(s) per core: 8
NUMA cell(s): 4
and the corresponding capabilities XML:
<cells num='4'>
<cell id='0'>
<memory unit='KiB'>67108864</memory>
<pages unit='KiB' size='64'>1048576</pages>
<pages unit='KiB' size='16384'>0</pages>
<distances>
<sibling id='0' value='10'/>
<sibling id='1' value='20'/>
<sibling id='16' value='40'/>
<sibling id='17' value='40'/>
</distances>
<cpus num='10'>
<cpu id='0' socket_id='0' core_id='32' siblings='0,4'/>
<cpu id='4' socket_id='0' core_id='32' siblings='0,4'/>
<cpu id='8' socket_id='0' core_id='40' siblings='8,12'/>
<cpu id='12' socket_id='0' core_id='40' siblings='8,12'/>
<cpu id='16' socket_id='0' core_id='48' siblings='16,20'/>
<cpu id='20' socket_id='0' core_id='48' siblings='16,20'/>
<cpu id='24' socket_id='0' core_id='96' siblings='24,28'/>
<cpu id='28' socket_id='0' core_id='96' siblings='24,28'/>
<cpu id='32' socket_id='0' core_id='104' siblings='32,36'/>
<cpu id='36' socket_id='0' core_id='104' siblings='32,36'/>
</cpus>
</cell>
Once again we only get information about the primary thread,
resulting in something that's probably very confusing unless you
know how threading works in the ppc64 world. This is, however,
exactly what the kernel exposes to userspace.
Ugh, yeah, that's definitely wrong. It's describing the subcores as if
they were threads.
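For anyone who wants to see what the kernel actually hands us, here's a
small sketch reading the sysfs topology files; at least on the kernels in
question the topology directory simply isn't there for offline CPUs, which
is why the secondary threads never make it into the XML:

    # Small sketch: topology is only exposed for online CPUs.
    from pathlib import Path

    base = Path('/sys/devices/system/cpu')
    print('online:', (base / 'online').read_text().strip())  # e.g. 0,8,16,...
    for cpu in sorted(base.glob('cpu[0-9]*'), key=lambda p: int(p.name[3:])):
        topo = cpu / 'topology'
        if topo.exists():                              # missing when offline
            core = (topo / 'core_id').read_text().strip()
            sibs = (topo / 'thread_siblings_list').read_text().strip()
            print(cpu.name, 'core_id', core, 'siblings', sibs)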
How would that change with your patches? Here's the nodeinfo:
CPU(s): 160
CPU socket(s): 1
Core(s) per socket: 10
Thread(s) per core: 4
NUMA cell(s): 4
and here's the capabilities XML:
<cells num='4'>
<cell id='0'>
<memory unit='KiB'>67108864</memory>
<pages unit='KiB' size='64'>1048576</pages>
<pages unit='KiB' size='16384'>0</pages>
<distances>
<sibling id='0' value='10'/>
<sibling id='1' value='20'/>
<sibling id='16' value='40'/>
<sibling id='17' value='40'/>
</distances>
<cpus num='10'>
<cpu id='0' capacity='4' socket_id='0' core_id='32' siblings='0'/>
<cpu id='4' capacity='4' socket_id='0' core_id='32' siblings='4'/>
<cpu id='8' capacity='4' socket_id='0' core_id='40' siblings='8'/>
<cpu id='12' capacity='4' socket_id='0' core_id='40' siblings='12'/>
<cpu id='16' capacity='4' socket_id='0' core_id='48' siblings='16'/>
<cpu id='20' capacity='4' socket_id='0' core_id='48' siblings='20'/>
<cpu id='24' capacity='4' socket_id='0' core_id='96' siblings='24'/>
<cpu id='28' capacity='4' socket_id='0' core_id='96' siblings='28'/>
<cpu id='32' capacity='4' socket_id='0' core_id='104' siblings='32'/>
<cpu id='36' capacity='4' socket_id='0' core_id='104' siblings='36'/>
</cpus>
</cell>
That definitely looks much better to me.
So now we're basically reporting each subcore as if it were a
physical core, and once again we're only reporting information
about the primary thread of each subcore, so the output when
subcores are in use is closer to the output when they are not.
The additional information, in the 'capacity' attribute, can be
used by users to figure out how many vCPUs can be scheduled on
each CPU, and it can safely be ignored by existing clients; for
x86 hosts, we can just set it to 1.
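As a sketch of what a client could do with it (purely illustrative, since
'capacity' is only being proposed in this series): treat a missing
capacity as 1 and sum it per cell to get the number of schedulable vCPUs.

    # Hypothetical client-side use of the proposed 'capacity' attribute:
    # schedulable vCPUs per NUMA cell = sum of capacity, defaulting to 1.
    import xml.etree.ElementTree as ET

    def vcpu_capacity_per_cell(capabilities_xml):
        caps = ET.fromstring(capabilities_xml)
        return {cell.get('id'): sum(int(cpu.get('capacity', '1'))
                                    for cpu in cell.findall('./cpus/cpu'))
                for cell in caps.findall('./host/topology/cells/cell')}

    # With the XML above, every cell ends up with 10 * 4 = 40 schedulable
    # vCPUs, i.e. 160 for the whole host, matching nodeinfo.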
The new information is more complete than it was before, and
this series certainly would help users make better guest
allocation choices. On the other hand, it doesn't really solve
the problem of nodeinfo and capabilities disagreeing with each
other, and pushes the NUMA topology reported by libvirt a
little farther from the one reported by the kernel.
Uh.. I don't really see how nodeinfo and capabilities disagree here.
Nor how the topology differs from the kernel.
It may also break some assumptions, e.g. CPUs 0 and 4 both have
the same value for 'core_id', so I'd expect them to be among
each other's 'siblings', but they no longer are.
Ah, yes, that's wrong. With this setup the core_id should be set to
the id of the subcore's first thread, rather than the physical core's
first thread.
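The broken expectation can be spelled out as a quick check (just a sketch
against the capabilities XML, not something in the patches): every CPU
sharing a socket_id/core_id pair should list the whole group as siblings.

    # Quick consistency check (sketch): CPUs sharing socket_id/core_id are
    # expected to list each other as siblings; print the ones that don't.
    # (Range notation like '0-7' in 'siblings' is not handled here.)
    import xml.etree.ElementTree as ET
    from collections import defaultdict

    def check_siblings(capabilities_xml):
        caps = ET.fromstring(capabilities_xml)
        by_core = defaultdict(list)
        for cpu in caps.findall('./host/topology/cells/cell/cpus/cpu'):
            key = (cpu.get('socket_id'), cpu.get('core_id'))
            by_core[key].append((int(cpu.get('id')), cpu.get('siblings')))
        for key, members in by_core.items():
            ids = sorted(cpu_id for cpu_id, _ in members)
            expected = ','.join(str(i) for i in ids)
            for cpu_id, sibs in members:
                if sibs != expected:
                    # e.g. above, CPUs 0 and 4 share core_id 32 but each
                    # lists only itself
                    print('core', key, 'cpu', cpu_id,
                          'siblings', sibs, 'expected', expected)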
I have a different proposal: since we're already altering the
NUMA topology information reported by the kernel by counting
secondary threads as online, we might as well go all the way
and rebuild the entire NUMA topology as if they were.
So the capabilities XML when subcores-per-core=1 would look
like:
<cells num='4'>
<cell id='0'>
<memory unit='KiB'>67108864</memory>
<pages unit='KiB' size='64'>1048576</pages>
<pages unit='KiB' size='16384'>0</pages>
<distances>
<sibling id='0' value='10'/>
<sibling id='1' value='20'/>
<sibling id='16' value='40'/>
<sibling id='17' value='40'/>
</distances>
<cpus num='40'>
<cpu id='0' socket_id='0' core_id='32' subcore_id='0' siblings='0-7'/>
I don't think adding a subcore_id is a good idea. Because it's only
ever likely to mean something on ppc64, tools are likely to just
ignore it and use the core_id. Most of the time that will be wrong:
behaviorally subcores act like cores in almost all regards.
That could be worked around by instead having core_id give the subcore
address and adding a new "supercore_id" or "core_group_id" or
something.
But frankly, I don't think there's actually much point exposing the
physical topology in addition to the logical (subcore) topology. Yes,
subcores will have different performance characteristics to real cores
which will be relevant in some situations. But if you're manually
setting the host's subcore mode then you're already into the realm of
manually tweaking parameters based on knowledge of your system and
workload. Basically I don't see anything upper layers would do with
the subcore vs. core information that isn't likely to be overridden by
manual tweaks in any case where you're setting the subcore mode at all.
Bear in mind that now that we have dynamic split core merged, I'm not
sure there are *any* realistic use cases for using manual subcore
splitting.
<cpu id='1' socket_id='0' core_id='32' subcore_id='0' siblings='0-7'/>
<cpu id='2' socket_id='0' core_id='32' subcore_id='0' siblings='0-7'/>
<cpu id='3' socket_id='0' core_id='32' subcore_id='0' siblings='0-7'/>
<cpu id='4' socket_id='0' core_id='32' subcore_id='0' siblings='0-7'/>
<cpu id='5' socket_id='0' core_id='32' subcore_id='0' siblings='0-7'/>
<cpu id='6' socket_id='0' core_id='32' subcore_id='0' siblings='0-7'/>
<cpu id='7' socket_id='0' core_id='32' subcore_id='0' siblings='0-7'/>
<cpu id='8' socket_id='0' core_id='40' subcore_id='1' siblings='8-15'/>
<cpu id='9' socket_id='0' core_id='40' subcore_id='1' siblings='8-15'/>
<cpu id='10' socket_id='0' core_id='40' subcore_id='1' siblings='8-15'/>
<cpu id='11' socket_id='0' core_id='40' subcore_id='1' siblings='8-15'/>
<cpu id='12' socket_id='0' core_id='40' subcore_id='1' siblings='8-15'/>
<cpu id='13' socket_id='0' core_id='40' subcore_id='1' siblings='8-15'/>
<cpu id='14' socket_id='0' core_id='40' subcore_id='1' siblings='8-15'/>
<cpu id='15' socket_id='0' core_id='40' subcore_id='1' siblings='8-15'/>
<cpu id='16' socket_id='0' core_id='48' subcore_id='2' siblings='16-23'/>
<cpu id='17' socket_id='0' core_id='48' subcore_id='2' siblings='16-23'/>
<cpu id='18' socket_id='0' core_id='48' subcore_id='2' siblings='16-23'/>
<cpu id='19' socket_id='0' core_id='48' subcore_id='2' siblings='16-23'/>
<cpu id='20' socket_id='0' core_id='48' subcore_id='2' siblings='16-23'/>
<cpu id='21' socket_id='0' core_id='48' subcore_id='2' siblings='16-23'/>
<cpu id='22' socket_id='0' core_id='48' subcore_id='2' siblings='16-23'/>
<cpu id='23' socket_id='0' core_id='48' subcore_id='2' siblings='16-23'/>
<cpu id='24' socket_id='0' core_id='96' subcore_id='3' siblings='24-31'/>
<cpu id='25' socket_id='0' core_id='96' subcore_id='3' siblings='24-31'/>
<cpu id='26' socket_id='0' core_id='96' subcore_id='3' siblings='24-31'/>
<cpu id='27' socket_id='0' core_id='96' subcore_id='3' siblings='24-31'/>
<cpu id='28' socket_id='0' core_id='96' subcore_id='3' siblings='24-31'/>
<cpu id='29' socket_id='0' core_id='96' subcore_id='3' siblings='24-31'/>
<cpu id='30' socket_id='0' core_id='96' subcore_id='3' siblings='24-31'/>
<cpu id='31' socket_id='0' core_id='96' subcore_id='3' siblings='24-31'/>
<cpu id='32' socket_id='0' core_id='104' subcore_id='4' siblings='32-39'/>
<cpu id='33' socket_id='0' core_id='104' subcore_id='4' siblings='32-39'/>
<cpu id='34' socket_id='0' core_id='104' subcore_id='4' siblings='32-39'/>
<cpu id='35' socket_id='0' core_id='104' subcore_id='4' siblings='32-39'/>
<cpu id='36' socket_id='0' core_id='104' subcore_id='4' siblings='32-39'/>
<cpu id='37' socket_id='0' core_id='104' subcore_id='4' siblings='32-39'/>
<cpu id='38' socket_id='0' core_id='104' subcore_id='4' siblings='32-39'/>
<cpu id='39' socket_id='0' core_id='104' subcore_id='4' siblings='32-39'/>
</cpus>
</cell>
and when subcores-per-core=2, it would look like:
<cells num='4'>
<cell id='0'>
<memory unit='KiB'>67108864</memory>
<pages unit='KiB' size='64'>1048576</pages>
<pages unit='KiB' size='16384'>0</pages>
<distances>
<sibling id='0' value='10'/>
<sibling id='1' value='20'/>
<sibling id='16' value='40'/>
<sibling id='17' value='40'/>
</distances>
<cpus num='40'>
<cpu id='0' socket_id='0' core_id='32' subcore_id='0' siblings='0-7'/>
<cpu id='1' socket_id='0' core_id='32' subcore_id='0' siblings='0-7'/>
<cpu id='2' socket_id='0' core_id='32' subcore_id='0' siblings='0-7'/>
<cpu id='3' socket_id='0' core_id='32' subcore_id='0' siblings='0-7'/>
In this case, I think siblings should be 0-3. Again subcores act like
cores for most purposes.
<cpu id='4' socket_id='0' core_id='32' subcore_id='1' siblings='0-7'/>
<cpu id='5' socket_id='0' core_id='32' subcore_id='1' siblings='0-7'/>
<cpu id='6' socket_id='0' core_id='32' subcore_id='1' siblings='0-7'/>
<cpu id='7' socket_id='0' core_id='32' subcore_id='1' siblings='0-7'/>
<cpu id='8' socket_id='0' core_id='40' subcore_id='2' siblings='8-15'/>
<cpu id='9' socket_id='0' core_id='40' subcore_id='2' siblings='8-15'/>
<cpu id='10' socket_id='0' core_id='40' subcore_id='2' siblings='8-15'/>
<cpu id='11' socket_id='0' core_id='40' subcore_id='2' siblings='8-15'/>
<cpu id='12' socket_id='0' core_id='40' subcore_id='3' siblings='8-15'/>
<cpu id='13' socket_id='0' core_id='40' subcore_id='3' siblings='8-15'/>
<cpu id='14' socket_id='0' core_id='40' subcore_id='3' siblings='8-15'/>
<cpu id='15' socket_id='0' core_id='40' subcore_id='3' siblings='8-15'/>
<cpu id='16' socket_id='0' core_id='48' subcore_id='4' siblings='16-23'/>
<cpu id='17' socket_id='0' core_id='48' subcore_id='4' siblings='16-23'/>
<cpu id='18' socket_id='0' core_id='48' subcore_id='4' siblings='16-23'/>
<cpu id='19' socket_id='0' core_id='48' subcore_id='4' siblings='16-23'/>
<cpu id='20' socket_id='0' core_id='48' subcore_id='5' siblings='16-23'/>
<cpu id='21' socket_id='0' core_id='48' subcore_id='5' siblings='16-23'/>
<cpu id='22' socket_id='0' core_id='48' subcore_id='5' siblings='16-23'/>
<cpu id='23' socket_id='0' core_id='48' subcore_id='5' siblings='16-23'/>
<cpu id='24' socket_id='0' core_id='96' subcore_id='6' siblings='24-31'/>
<cpu id='25' socket_id='0' core_id='96' subcore_id='6' siblings='24-31'/>
<cpu id='26' socket_id='0' core_id='96' subcore_id='6' siblings='24-31'/>
<cpu id='27' socket_id='0' core_id='96' subcore_id='6' siblings='24-31'/>
<cpu id='28' socket_id='0' core_id='96' subcore_id='7' siblings='24-31'/>
<cpu id='29' socket_id='0' core_id='96' subcore_id='7' siblings='24-31'/>
<cpu id='30' socket_id='0' core_id='96' subcore_id='7' siblings='24-31'/>
<cpu id='31' socket_id='0' core_id='96' subcore_id='7' siblings='24-31'/>
<cpu id='32' socket_id='0' core_id='104' subcore_id='8' siblings='32-39'/>
<cpu id='33' socket_id='0' core_id='104' subcore_id='8' siblings='32-39'/>
<cpu id='34' socket_id='0' core_id='104' subcore_id='8' siblings='32-39'/>
<cpu id='35' socket_id='0' core_id='104' subcore_id='8' siblings='32-39'/>
<cpu id='36' socket_id='0' core_id='104' subcore_id='9' siblings='32-39'/>
<cpu id='37' socket_id='0' core_id='104' subcore_id='9' siblings='32-39'/>
<cpu id='38' socket_id='0' core_id='104' subcore_id='9' siblings='32-39'/>
<cpu id='39' socket_id='0' core_id='104' subcore_id='9' siblings='32-39'/>
</cpus>
</cell>
which, incidentally, is pretty much the same information that
the ppc64_cpu utility reports:
Core 0:
Subcore 0: 0* 1* 2* 3*
Subcore 1: 4* 5* 6* 7*
Core 1:
Subcore 2: 8* 9* 10* 11*
Subcore 3: 12* 13* 14* 15*
Core 2:
Subcore 4: 16* 17* 18* 19*
Subcore 5: 20* 21* 22* 23*
Core 3:
Subcore 6: 24* 25* 26* 27*
Subcore 7: 28* 29* 30* 31*
Core 4:
Subcore 8: 32* 33* 34* 35*
Subcore 9: 36* 37* 38* 39*
and that libvirt itself reports when subcores-per-core=2 and smt=on.
Other architectures can simply use the same value for both 'core_id'
and 'subcore_id', with no extra effort needed; the presence of a
new attribute can also be used by higher level tools to figure out
whether secondary threads are reported as online or offline.
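A trivial sketch of that detection (again assuming the proposed, not yet
existing subcore_id attribute):

    # Sketch: if the proposed subcore_id attribute is present on any <cpu>,
    # a tool can assume secondary threads are being reported as online.
    import xml.etree.ElementTree as ET

    def secondary_threads_reported(capabilities_xml):
        caps = ET.fromstring(capabilities_xml)
        return any(cpu.get('subcore_id') is not None
                   for cpu in caps.findall('./host/topology/cells/cell/cpus/cpu'))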
I'm CCing both David Gibson, who might have an opinion on this due
to his work on KVM on ppc64, and Martin Polednik, for the same
reason with s/KVM/oVirt/ :)
But of course I'd love to have as much feedback as possible :)
So, as noted, I actually prefer Shivaprasad's original approach of
treating subcores as cores. The implementation of that does need some
adjustment as noted above, basically treating subcores as cores even
more universally than the current patches do.
Basically I see manually setting the subcores-per-core as telling the
system you want it treated as having more cores. So libvirt shouldn't
decide it knows better and report physical cores instead.
Tangentially related is the question of how to deal with threads which
are offline in the host but can still be used for vcpus, a question
which arises with or without subcores.
I have no very strong opinion between the options of (1) adding <cpu>
entries for the offline threads (whose siblings should be set based on
the subcore) or (2) adding a capacity tag to the online threads
indicating that you can put more vcpus on them than the host online
thread count suggests.
(1) is more likely to do the right thing with existing tools, but (2)
is more accurately expressive.
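To illustrate the difference for a client working out how many vcpus fit
in a cell (a sketch only; option 2 relies on the capacity attribute
proposed earlier in the thread): with (1) it just counts the <cpu>
entries, with (2) it has to know to sum capacity instead.

    # Sketch of the two options from a client's point of view, per cell:
    # (1) all threads listed -> counting <cpu> entries is enough;
    # (2) only online threads listed -> the proposed capacity is summed.
    import xml.etree.ElementTree as ET  # 'cell' is an Element from the XML

    def cell_vcpu_capacity(cell):
        cpus = cell.findall('./cpus/cpu')
        option1 = len(cpus)
        option2 = sum(int(c.get('capacity', '1')) for c in cpus)
        return option1, option2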
--
David Gibson <dgibson@redhat.com>
Senior Software Engineer, Virtualization, Red Hat