On Tue, 2016-02-16 at 13:04 +1100, David Gibson wrote:
> > So the information doesn't seem to add up: we claim that the
> > host has 160 online CPUs, but the NUMA topology only contains
> > information about 5 of them for each node, so 20 in total.
> > 
> > That's of course because Linux only provides us with topology
> > information for online CPUs, but KVM on ppc64 wants secondary
> > threads to be offline in the host. The secondary threads are
> > still used to schedule guest threads, so reporting 160 online
> > CPUs is correct for the purpose of planning the number of
> > guests based on the capacity of the host; the problem is that
> > the detailed NUMA topology doesn't match with that.
> Yeah, that's rather unfortunate. We do want to list all the threads in
> the capabilities, I think, since they're capable of running vcpus.

There's a problem with that, though, that I didn't think about
when writing the original message. See below.
> > The new information is more complete than it was before, and
> > this series certainly would help users make better guest
> > allocation choices. On the other hand, it doesn't really solve
> > the problem of nodeinfo and capabilities disagreeing with each
> > other, and pushes the NUMA topology reported by libvirt a
> > little farther from the one reported by the kernel.
> Uh.. I don't really see how nodeinfo and capabilities disagree here.
Because nodeinfo tells us that the host has 160 online CPUs, while
capabilities tell us that only 40 CPUs are actually online.

Just to clarify, since the naming is not very explicit: nodeinfo
reports *online* CPUs only, and each <cpu> element in capabilities
is supposed to represent an *online* CPU. Offline CPUs are not
listed or counted anywhere.
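Roughly speaking, using the figures above:

  virsh nodeinfo  ->  CPU(s): 160          (threads usable for vcpus)
  capabilities    ->  40 <cpu> elements    (one per host-online CPU)

and it's that 160 vs. 40 gap that tools currently have no way to
reconcile.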
> Nor how the topology differs from the kernel.
Because the kernel reports physical information, so online CPUs
that are in the same physical core (as evidenced by the core_id
value) are counted as siblings regardless of the subcore
configuration.
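Eg. with a core split in two subcores and only the first thread of
each subcore online, the kernel's own view is something like this
(values purely illustrative):

  /sys/devices/system/cpu/cpu0/topology/core_id               32
  /sys/devices/system/cpu/cpu4/topology/core_id               32
  /sys/devices/system/cpu/cpu0/topology/thread_siblings_list  0,4

ie. CPU 0 and CPU 4 are still siblings of the same physical core as
far as Linux is concerned, even though they belong to different
subcores.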
> > It may also break some assumptions, eg. CPU 0 and 4 both have
> > the same value for 'core_id', so I'd expect them to be among
> > each other's 'siblings', but they no longer are.
> Ah, yes, that's wrong. With this setup the core_id should be set to
> the id of the subcore's first thread, rather than the physical core's
> first thread.
This would be a pretty wild departure from what the kernel is
exposing to userspace.
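If I'm reading the suggestion right, for the second subcore of a
split core we would end up reporting something like (ids illustrative)

  <cpu id='4' socket_id='0' core_id='4' siblings='4-7'/>

even though the kernel itself reports core_id 32 for that thread.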
> > I have a different proposal: since we're already altering the
> > NUMA topology information reported by the kernel by counting
> > secondary threads as online, we might as well go all the way
> > and rebuild the entire NUMA topology as if they were.
> > 
> > So the capabilities XML when subcores-per-core=1 would look
> > like:
> > 
> >   <cells num='4'>
> >     <cell id='0'>
> >       <memory unit='KiB'>67108864</memory>
> >       <pages unit='KiB' size='64'>1048576</pages>
> >       <pages unit='KiB' size='16384'>0</pages>
> >       <distances>
> >         <sibling id='0' value='10'/>
> >         <sibling id='1' value='20'/>
> >         <sibling id='16' value='40'/>
> >         <sibling id='17' value='40'/>
> >       </distances>
> >       <cpus num='10'>
> >         <cpu id='0' socket_id='0' core_id='32' subcore_id='0' siblings='0-7'/>
> I don't think adding a subcore_id is a good idea. Because it's only
> ever likely to mean something on ppc64, tools are likely to just
> ignore it and use the core_id.

Fair enough, and something we definitely need to take into account.

> Most of the time that will be wrong:
> behaviorally subcores act like cores in almost all regards.
> That could be worked around by instead having core_id give the subcore
> address and adding a new "supercore_id" or "core_group_id" or
> something.
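Just to make sure we're picturing the same thing, I guess that would
turn out as something like (attribute name and ids purely
illustrative):

  <cpu id='0' socket_id='0' core_id='0' core_group_id='32' siblings='0-3'/>
  <cpu id='4' socket_id='0' core_id='4' core_group_id='32' siblings='4-7'/>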
Well, those would be just as likely to be ignored by tools as
subcore_id and capacity, wouldn't they? :)
> But frankly, I don't think there's actually much point exposing the
> physical topology in addition to the logical (subcore) topology. Yes,
> subcores will have different performance characteristics to real cores
> which will be relevant in some situations. But if you're manually
> setting the host's subcore mode then you're already into the realm of
> manually tweaking parameters based on knowledge of your system and
> workload. Basically I don't see anything upper layers would do with
> the subcore vs. core information that isn't likely to be overridden by
> manual tweaks in any case where you're setting the subcore mode at all.
> 
> Bear in mind that now that we have dynamic split core merged, I'm not
> sure there are *any* realistic use cases for using manual subcore
> splitting.
As far as I know, and of course correct me if I'm wrong, dynamic split
cores are a way to avoid heavy performance drops when the host is
overcommitted, but they have some overhead of their own because of the
need to split the core when entering KVM and restore the previous
configuration when exiting it.

Splitting cores manually, on the other hand, is less flexible but
gives the admin complete control over resource allocation and has
no overhead, which makes it critical for fine-grained performance tuning.
So unless my understanding is off, I believe we need to expose the
topology in a way that enables both use cases.
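(For reference, the manual setup I have in mind is the one obtained
through powerpc-utils, ie. something along the lines of

  ppc64_cpu --subcores-per-core=2
  ppc64_cpu --smt=off

which, if I'm not mistaken, leaves only the first thread of each
subcore online in the host.)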
> So, as noted, I actually prefer Shivaprasad's original approach of
> treating subcores as cores. The implementation of that does need some
> adjustment as noted above, basically treating subcores as cores even
> more universally than the current patches do.
> 
> Basically I see manually setting the subcores-per-core as telling the
> system you want it treated as having more cores. So libvirt shouldn't
> decide it knows better and report physical cores instead.
It's not really a case of libvirt thinking it knows better, it's more
a case of the kernel always reporting information about the physical
topology, and of x86 not having (AFAIK) any configuration where the
physical and logical topology disagree, whereas on ppc64 they do as
soon as you split a core.
> Tangentially related is the question of how to deal with threads which
> are offline in the host but can be used for vcpus, which appears with
> or without subcores.
This is exactly the issue I was thinking about at the beginning of
the message: if I see a <cpu> element that means the CPU is online,
and I expect to be able to pin a vCPU to it, but that would no
longer be the case if we started adding <cpu> elements for CPUs
that are offline in the host - any attempt to pin vCPUs to such
CPUs would fail.
> I have no very strong opinion between the options of (1) adding <cpu>
> entries for the offline threads (whose siblings should be set based on
> the subcore) or (2) adding a capacity tag to the online threads
> indicating that you can put more vcpus on them than the host online
> thread count indicates.
> 
> (1) is more likely to do the right thing with existing tools, but (2)
> is more accurately expressive.
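Just so we're talking about the same thing, I read (1) as listing
every thread of the subcore, eg. (ids illustrative)

  <cpu id='0' socket_id='0' core_id='0' siblings='0-3'/>
  <cpu id='1' socket_id='0' core_id='0' siblings='0-3'/>
  <cpu id='2' socket_id='0' core_id='0' siblings='0-3'/>
  <cpu id='3' socket_id='0' core_id='0' siblings='0-3'/>

and (2) as keeping only the host-online thread and annotating it
with the extra capacity.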
I kinda think it would be the other way around, as I expect existing
tools to make some of the assumptions described above.
So after discussing for a while with Shivaprasad on IRC, reading
your reply and thinking about it some more, I'm starting to think
that we should report the same information as the kernel whenever
possible, and enhance it with hints such as capacity for tools
and admins, but carefully avoid being too clever for our own good.
We can paper over most of the differences between x86 and ppc64,
but not all of them, I believe. Documentation is key here, and
there's definitely room for improvement both when it comes to
reasonable assumptions and ppc64 specific quirks - this is
something I'm planning to address.
So my proposal is to drop patch 1/2 entirely and rework patch 2/2
to just add the capacity attribute (maybe with a more descriptive
name?), without touching the list of siblings or any other
information retrieved from the kernel.
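Ie. something along these lines (attribute name and value just a
sketch):

  <cpu id='0' socket_id='0' core_id='32' siblings='0,4' capacity='4'/>

with the topology itself left exactly as the kernel reports it.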
Cheers.
--
Andrea Bolognani
Software Engineer - Virtualization Team