On Fri, 2016-01-29 at 01:32 -0500, Shivaprasad G Bhat wrote:
> The nodeinfo output was fixed earlier to reflect the actual CPUs
> available in KVM mode on PPC64. The earlier fixes covered the aspect
> of not making a host look overcommitted when it's not. The current
> fixes are aimed at helping users make better decisions about the kind
> of guest CPU topology that can be supported with the given
> subcores-per-core setting of the KVM host, and at hinting how to pin
> the guest vCPUs efficiently.
>
> I am planning to add some test cases once the approach is accepted.
>
> With respect to Patch 2:
> The second patch adds a new element to the cpus tag and I need your
> input on whether that is okay, or whether there is a better way. I am
> not sure if existing clients have RNG checks that might fail with
> this approach, or if the checks are enforced only on the tags and not
> on the elements.
> With my approach, if the RNG checks pass, the new element "capacity",
> even if ignored by many clients, would have no impact except on PPC64.
>
> As far as I have looked at the code, the siblings changes don't
> affect existing libvirt functionality. Please do let me know
> otherwise.
I've looked at your patches and I can clearly see the reasoning
behind them: basically, subcores would be presented the same way as
proper cores, so that users don't have to worry about ppc64-specific
details and can just use the same code as on x86 to optimally
arrange their guests.
As you mention, the way we report nodeinfo has been changed not to
make the host look prematurely overcommitted; however, this change
has caused a weird disconnect between nodeinfo and capabilities,
which should in theory always agree with each other.
I'm going to be way verbose in this message, stating stuff we
all know about, so that it can be used as a reference for people not
that familiar with the details and most importantly so that, if I
got anything wrong, I can be corrected :)
This is what nodeinfo currently looks like for a ppc64 host with
4 NUMA nodes, 1 socket per node, 5 cores per socket and 8 threads
per core, when subcores-per-core=1 and smt=off:
CPU(s): 160
CPU socket(s): 1
Core(s) per socket: 5
Thread(s) per core: 8
NUMA cell(s): 4
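These numbers at least multiply out consistently:

  4 cells x 1 socket x 5 cores/socket x 8 threads/core = 160 CPUs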
And this is part of the capabilities XML:
<cells num='4'>
<cell id='0'>
<memory unit='KiB'>67108864</memory>
<pages unit='KiB' size='64'>1048576</pages>
<pages unit='KiB' size='16384'>0</pages>
<distances>
<sibling id='0' value='10'/>
<sibling id='1' value='20'/>
<sibling id='16' value='40'/>
<sibling id='17' value='40'/>
</distances>
<cpus num='5'>
<cpu id='0' socket_id='0' core_id='32' siblings='0'/>
<cpu id='8' socket_id='0' core_id='40' siblings='8'/>
<cpu id='16' socket_id='0' core_id='48' siblings='16'/>
<cpu id='24' socket_id='0' core_id='96' siblings='24'/>
<cpu id='32' socket_id='0' core_id='104' siblings='32'/>
</cpus>
</cell>
So the information doesn't add up: we claim that the host has 160
online CPUs, but the NUMA topology only contains information about
5 of them per node, so 20 in total.
That's of course because Linux only provides us with topology
information for online CPUs, but KVM on ppc64 wants secondary
threads to be offline in the host. The secondary threads are
still used to schedule guest threads, so reporting 160 online
CPUs is correct for the purpose of planning the number of
guests based on the capacity of the host; the problem is that
the detailed NUMA topology doesn't match it.
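To make the mismatch concrete, here's a rough sketch (Python,
assuming the libvirt Python bindings and a local qemu:///system
connection) of how a client would see the two numbers disagree:

  import libvirt
  import xml.etree.ElementTree as ET

  conn = libvirt.open('qemu:///system')

  # getInfo() -> [model, memory, cpus, mhz, nodes, sockets,
  #               cores, threads]
  nodeinfo_cpus = conn.getInfo()[2]

  caps = ET.fromstring(conn.getCapabilities())
  topology_cpus = len(
      caps.findall('./host/topology/cells/cell/cpus/cpu'))

  # On the host above this prints 160 vs 20: nodeinfo counts the
  # offline secondary threads, the NUMA topology does not.
  print(nodeinfo_cpus, topology_cpus)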
As a second example, here's the nodeinfo reported on the same
host when subcores-per-core=2:
CPU(s): 160
CPU socket(s): 1
Core(s) per socket: 5
Thread(s) per core: 8
NUMA cell(s): 4
and the corresponding capabilities XML:
<cells num='4'>
<cell id='0'>
<memory unit='KiB'>67108864</memory>
<pages unit='KiB' size='64'>1048576</pages>
<pages unit='KiB' size='16384'>0</pages>
<distances>
<sibling id='0' value='10'/>
<sibling id='1' value='20'/>
<sibling id='16' value='40'/>
<sibling id='17' value='40'/>
</distances>
<cpus num='10'>
<cpu id='0' socket_id='0' core_id='32' siblings='0,4'/>
<cpu id='4' socket_id='0' core_id='32' siblings='0,4'/>
<cpu id='8' socket_id='0' core_id='40' siblings='8,12'/>
<cpu id='12' socket_id='0' core_id='40' siblings='8,12'/>
<cpu id='16' socket_id='0' core_id='48' siblings='16,20'/>
<cpu id='20' socket_id='0' core_id='48' siblings='16,20'/>
<cpu id='24' socket_id='0' core_id='96' siblings='24,28'/>
<cpu id='28' socket_id='0' core_id='96' siblings='24,28'/>
<cpu id='32' socket_id='0' core_id='104' siblings='32,36'/>
<cpu id='36' socket_id='0' core_id='104' siblings='32,36'/>
</cpus>
</cell>
Once again we only get information about the primary thread of each
subcore, resulting in something that's probably very confusing unless
you know how threading works in the ppc64 world. This is, however,
exactly what the kernel exposes to userspace.
How would that change with your patches? Here's the nodeinfo:
CPU(s): 160
CPU socket(s): 1
Core(s) per socket: 10
Thread(s) per core: 4
NUMA cell(s): 4
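The total still multiplies out to the same value, only the
core/thread split changes:

  4 cells x 1 socket x 10 cores/socket x 4 threads/core = 160 CPUs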
and here's the capabilities XML:
<cells num='4'>
<cell id='0'>
<memory unit='KiB'>67108864</memory>
<pages unit='KiB' size='64'>1048576</pages>
<pages unit='KiB' size='16384'>0</pages>
<distances>
<sibling id='0' value='10'/>
<sibling id='1' value='20'/>
<sibling id='16' value='40'/>
<sibling id='17' value='40'/>
</distances>
<cpus num='10'>
<cpu id='0' capacity='4' socket_id='0' core_id='32' siblings='0'/>
<cpu id='4' capacity='4' socket_id='0' core_id='32' siblings='4'/>
<cpu id='8' capacity='4' socket_id='0' core_id='40' siblings='8'/>
<cpu id='12' capacity='4' socket_id='0' core_id='40' siblings='12'/>
<cpu id='16' capacity='4' socket_id='0' core_id='48' siblings='16'/>
<cpu id='20' capacity='4' socket_id='0' core_id='48' siblings='20'/>
<cpu id='24' capacity='4' socket_id='0' core_id='96' siblings='24'/>
<cpu id='28' capacity='4' socket_id='0' core_id='96' siblings='28'/>
<cpu id='32' capacity='4' socket_id='0' core_id='104' siblings='32'/>
<cpu id='36' capacity='4' socket_id='0' core_id='104' siblings='36'/>
</cpus>
</cell>
So now we're basically reporting each subcore as if it were a
physical core, and once again we're only reporting information
about the primary thread of the subcore, so the output when
subcores are in use is closer to the output when they are not.
The additional information, in the 'capacity' attribute, can be
used by users to figure out how many vCPUs can be scheduled on
each CPU, and it can safely be ignored by existing clients; for
x86 hosts, we can just set it to 1.
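If we went this way, a client could, for example, sum up 'capacity'
per NUMA cell to figure out how many vCPUs each cell can accommodate.
A quick sketch (Python with the libvirt bindings again; keep in mind
that 'capacity' is only a proposal at this point):

  import libvirt
  import xml.etree.ElementTree as ET

  conn = libvirt.open('qemu:///system')
  caps = ET.fromstring(conn.getCapabilities())

  for cell in caps.findall('./host/topology/cells/cell'):
      # 'capacity' is the attribute proposed by these patches; fall
      # back to 1 where it's absent (e.g. existing x86 hosts).
      vcpus = sum(int(cpu.get('capacity', '1'))
                  for cpu in cell.findall('./cpus/cpu'))
      print('cell %s: %s schedulable vCPUs' % (cell.get('id'), vcpus))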
The new information is more complete than it was before, and
this series certainly would help users make better guest
allocation choices. On the other hand, it doesn't really solve
the problem of nodeinfo and capabilities disagreeing with each
other, and pushes the NUMA topology reported by libvirt a
little farther from the one reported by the kernel.
It may also break some assumptions, e.g. CPUs 0 and 4 both have
the same value for 'core_id', so I'd expect them to be among
each other's 'siblings', but they no longer are.
I have a different proposal: since we're already altering the
NUMA topology information reported by the kernel by counting
secondary threads as online, we might as well go all the way
and rebuild the entire NUMA topology as if they were.
So the capabilities XML when subcores-per-core=1 would look
like:
<cells num='4'>
<cell id='0'>
<memory unit='KiB'>67108864</memory>
<pages unit='KiB' size='64'>1048576</pages>
<pages unit='KiB' size='16384'>0</pages>
<distances>
<sibling id='0' value='10'/>
<sibling id='1' value='20'/>
<sibling id='16' value='40'/>
<sibling id='17' value='40'/>
</distances>
<cpus num='40'>
<cpu id='0' socket_id='0' core_id='32' subcore_id='0' siblings='0-7'/>
<cpu id='1' socket_id='0' core_id='32' subcore_id='0' siblings='0-7'/>
<cpu id='2' socket_id='0' core_id='32' subcore_id='0' siblings='0-7'/>
<cpu id='3' socket_id='0' core_id='32' subcore_id='0' siblings='0-7'/>
<cpu id='4' socket_id='0' core_id='32' subcore_id='0' siblings='0-7'/>
<cpu id='5' socket_id='0' core_id='32' subcore_id='0' siblings='0-7'/>
<cpu id='6' socket_id='0' core_id='32' subcore_id='0' siblings='0-7'/>
<cpu id='7' socket_id='0' core_id='32' subcore_id='0' siblings='0-7'/>
<cpu id='8' socket_id='0' core_id='40' subcore_id='1' siblings='8-15'/>
<cpu id='9' socket_id='0' core_id='40' subcore_id='1' siblings='8-15'/>
<cpu id='10' socket_id='0' core_id='40' subcore_id='1' siblings='8-15'/>
<cpu id='11' socket_id='0' core_id='40' subcore_id='1' siblings='8-15'/>
<cpu id='12' socket_id='0' core_id='40' subcore_id='1' siblings='8-15'/>
<cpu id='13' socket_id='0' core_id='40' subcore_id='1' siblings='8-15'/>
<cpu id='14' socket_id='0' core_id='40' subcore_id='1' siblings='8-15'/>
<cpu id='15' socket_id='0' core_id='40' subcore_id='1' siblings='8-15'/>
<cpu id='16' socket_id='0' core_id='48' subcore_id='2' siblings='16-23'/>
<cpu id='17' socket_id='0' core_id='48' subcore_id='2' siblings='16-23'/>
<cpu id='18' socket_id='0' core_id='48' subcore_id='2' siblings='16-23'/>
<cpu id='19' socket_id='0' core_id='48' subcore_id='2' siblings='16-23'/>
<cpu id='20' socket_id='0' core_id='48' subcore_id='2' siblings='16-23'/>
<cpu id='21' socket_id='0' core_id='48' subcore_id='2' siblings='16-23'/>
<cpu id='22' socket_id='0' core_id='48' subcore_id='2' siblings='16-23'/>
<cpu id='23' socket_id='0' core_id='48' subcore_id='2' siblings='16-23'/>
<cpu id='24' socket_id='0' core_id='96' subcore_id='3' siblings='24-31'/>
<cpu id='25' socket_id='0' core_id='96' subcore_id='3' siblings='24-31'/>
<cpu id='26' socket_id='0' core_id='96' subcore_id='3' siblings='24-31'/>
<cpu id='27' socket_id='0' core_id='96' subcore_id='3' siblings='24-31'/>
<cpu id='28' socket_id='0' core_id='96' subcore_id='3' siblings='24-31'/>
<cpu id='29' socket_id='0' core_id='96' subcore_id='3' siblings='24-31'/>
<cpu id='30' socket_id='0' core_id='96' subcore_id='3' siblings='24-31'/>
<cpu id='31' socket_id='0' core_id='96' subcore_id='3' siblings='24-31'/>
<cpu id='32' socket_id='0' core_id='104' subcore_id='4' siblings='32-39'/>
<cpu id='33' socket_id='0' core_id='104' subcore_id='4' siblings='32-39'/>
<cpu id='34' socket_id='0' core_id='104' subcore_id='4' siblings='32-39'/>
<cpu id='35' socket_id='0' core_id='104' subcore_id='4' siblings='32-39'/>
<cpu id='36' socket_id='0' core_id='104' subcore_id='4' siblings='32-39'/>
<cpu id='37' socket_id='0' core_id='104' subcore_id='4' siblings='32-39'/>
<cpu id='38' socket_id='0' core_id='104' subcore_id='4' siblings='32-39'/>
<cpu id='39' socket_id='0' core_id='104' subcore_id='4' siblings='32-39'/>
</cpus>
</cell>
and when subcores-per-core=2, it would look like:
<cells num='4'>
<cell id='0'>
<memory unit='KiB'>67108864</memory>
<pages unit='KiB' size='64'>1048576</pages>
<pages unit='KiB' size='16384'>0</pages>
<distances>
<sibling id='0' value='10'/>
<sibling id='1' value='20'/>
<sibling id='16' value='40'/>
<sibling id='17' value='40'/>
</distances>
<cpus num='40'>
<cpu id='0' socket_id='0' core_id='32' subcore_id='0' siblings='0-7'/>
<cpu id='1' socket_id='0' core_id='32' subcore_id='0' siblings='0-7'/>
<cpu id='2' socket_id='0' core_id='32' subcore_id='0' siblings='0-7'/>
<cpu id='3' socket_id='0' core_id='32' subcore_id='0' siblings='0-7'/>
<cpu id='4' socket_id='0' core_id='32' subcore_id='1' siblings='0-7'/>
<cpu id='5' socket_id='0' core_id='32' subcore_id='1' siblings='0-7'/>
<cpu id='6' socket_id='0' core_id='32' subcore_id='1' siblings='0-7'/>
<cpu id='7' socket_id='0' core_id='32' subcore_id='1' siblings='0-7'/>
<cpu id='8' socket_id='0' core_id='40' subcore_id='2' siblings='8-15'/>
<cpu id='9' socket_id='0' core_id='40' subcore_id='2' siblings='8-15'/>
<cpu id='10' socket_id='0' core_id='40' subcore_id='2' siblings='8-15'/>
<cpu id='11' socket_id='0' core_id='40' subcore_id='2' siblings='8-15'/>
<cpu id='12' socket_id='0' core_id='40' subcore_id='3' siblings='8-15'/>
<cpu id='13' socket_id='0' core_id='40' subcore_id='3' siblings='8-15'/>
<cpu id='14' socket_id='0' core_id='40' subcore_id='3' siblings='8-15'/>
<cpu id='15' socket_id='0' core_id='40' subcore_id='3' siblings='8-15'/>
<cpu id='16' socket_id='0' core_id='48' subcore_id='4' siblings='16-23'/>
<cpu id='17' socket_id='0' core_id='48' subcore_id='4' siblings='16-23'/>
<cpu id='18' socket_id='0' core_id='48' subcore_id='4' siblings='16-23'/>
<cpu id='19' socket_id='0' core_id='48' subcore_id='4' siblings='16-23'/>
<cpu id='20' socket_id='0' core_id='48' subcore_id='5' siblings='16-23'/>
<cpu id='21' socket_id='0' core_id='48' subcore_id='5' siblings='16-23'/>
<cpu id='22' socket_id='0' core_id='48' subcore_id='5' siblings='16-23'/>
<cpu id='23' socket_id='0' core_id='48' subcore_id='5' siblings='16-23'/>
<cpu id='24' socket_id='0' core_id='96' subcore_id='6' siblings='24-31'/>
<cpu id='25' socket_id='0' core_id='96' subcore_id='6' siblings='24-31'/>
<cpu id='26' socket_id='0' core_id='96' subcore_id='6' siblings='24-31'/>
<cpu id='27' socket_id='0' core_id='96' subcore_id='6' siblings='24-31'/>
<cpu id='28' socket_id='0' core_id='96' subcore_id='7' siblings='24-31'/>
<cpu id='29' socket_id='0' core_id='96' subcore_id='7' siblings='24-31'/>
<cpu id='30' socket_id='0' core_id='96' subcore_id='7' siblings='24-31'/>
<cpu id='31' socket_id='0' core_id='96' subcore_id='7' siblings='24-31'/>
<cpu id='32' socket_id='0' core_id='104' subcore_id='8' siblings='32-39'/>
<cpu id='33' socket_id='0' core_id='104' subcore_id='8' siblings='32-39'/>
<cpu id='34' socket_id='0' core_id='104' subcore_id='8' siblings='32-39'/>
<cpu id='35' socket_id='0' core_id='104' subcore_id='8' siblings='32-39'/>
<cpu id='36' socket_id='0' core_id='104' subcore_id='9' siblings='32-39'/>
<cpu id='37' socket_id='0' core_id='104' subcore_id='9' siblings='32-39'/>
<cpu id='38' socket_id='0' core_id='104' subcore_id='9' siblings='32-39'/>
<cpu id='39' socket_id='0' core_id='104' subcore_id='9' siblings='32-39'/>
</cpus>
</cell>
which, incidentally, is pretty much the same information the
ppc64_cpu utility reports:
Core 0:
  Subcore 0: 0* 1* 2* 3*
  Subcore 1: 4* 5* 6* 7*
Core 1:
  Subcore 2: 8* 9* 10* 11*
  Subcore 3: 12* 13* 14* 15*
Core 2:
  Subcore 4: 16* 17* 18* 19*
  Subcore 5: 20* 21* 22* 23*
Core 3:
  Subcore 6: 24* 25* 26* 27*
  Subcore 7: 28* 29* 30* 31*
Core 4:
  Subcore 8: 32* 33* 34* 35*
  Subcore 9: 36* 37* 38* 39*
and also pretty much what libvirt itself reports when
subcores-per-core=2 and smt=on.
Other architectures can simply use the same value for both 'core_id'
and 'subcore_id', with no extra effort needed; the presence of the
new attribute can also be used by higher-level tools to figure out
whether secondary threads are reported as online or offline.
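Just to sketch what that would enable on the consumer side, here's
how a higher-level tool might group host CPUs by the proposed
'subcore_id' (Python with the libvirt bindings again; the attribute
is of course hypothetical until something along these lines is
merged):

  from collections import defaultdict
  import libvirt
  import xml.etree.ElementTree as ET

  conn = libvirt.open('qemu:///system')
  caps = ET.fromstring(conn.getCapabilities())

  # Key on (cell, subcore) in case subcore ids repeat across cells;
  # fall back to core_id on architectures without the new attribute.
  subcores = defaultdict(list)
  for cell in caps.findall('./host/topology/cells/cell'):
      for cpu in cell.findall('./cpus/cpu'):
          key = (cell.get('id'),
                 cpu.get('subcore_id', cpu.get('core_id')))
          subcores[key].append(int(cpu.get('id')))

  # With subcores-per-core=2 the first entries would be
  # ('0', '0') -> [0, 1, 2, 3], ('0', '1') -> [4, 5, 6, 7], ...
  # i.e. each subcore is a natural unit for pinning a guest's vCPUs.
  for (cell_id, subcore), cpus in sorted(subcores.items()):
      print('cell %s, subcore %s -> host CPUs %s'
            % (cell_id, subcore, cpus))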
I'm CCing both David Gibson, who might have an opinion on this due
to his work on KVM on ppc64, and Martin Polednik, because of the
same reason with s/KVM/oVirt/ :)
But of course I'd love to have as much feedback as possible :)
Cheers.
--
Andrea Bolognani
Software Engineer - Virtualization Team