On 5/26/20 4:51 PM, Igor Mammedov wrote:
On Mon, 25 May 2020 10:05:08 +0200
Michal Privoznik <mprivozn@redhat.com> wrote:
>
> This is a problem. The domain XML that is provided can't be changed,
> mostly because mgmt apps construct it on the fly and then just pass it
> as a RO string to libvirt. While libvirt could create a separate cache,
> there has to be a better way.
>
> I mean, I can add some more code that, once the guest is running,
> preserves the mapping during migration. But that assumes a running QEMU.
> When starting a domain from scratch, is it acceptable if the vCPU
> topology changes? I suspect it is not.
I'm not sure I got you, but the vCPU topology isn't changing; rather, when
starting QEMU, the user has to map concrete vCPUs to specific NUMA nodes.
The issue here is that to specify concrete vCPUs, the user needs to get the
layout from QEMU first, as it's a function of target/machine/-smp and
possibly the cpu type.
Assume the following config: 4 vCPUs (2 sockets, 2 cores, 1 thread
topology) and 2 NUMA nodes, with the following assignment of vCPUs to nodes:
node 0: cpus=0-1
node 1: cpus=2-3
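(In CLI terms, that's the now-deprecated syntax, roughly:

  -smp 4,sockets=2,cores=2,threads=1 \
  -numa node,nodeid=0,cpus=0-1 \
  -numa node,nodeid=1,cpus=2-3

modulo the exact machine and cpu options.)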
With old libvirt & qemu (and assuming x86_64 - not EPYC), I assume the
following topology is going to be used:
node 0: socket=0,core=0,thread=0 (vCPU0) socket=0,core=1,thread=0 (vCPU1)
node 1: socket=1,core=0,thread=0 (vCPU2) socket=1,core=1,thread=0 (vCPU3)
Now, user upgrades libvirt & qemu but doesn't change the config. And on
a fresh new start (no migration), they might get a different topology:
node 0: socket=0,core=0,thread=0 (vCPU0) socket=1,core=0,thread=0 (vCPU1)
node 1: socket=0,core=1,thread=0 (vCPU2) socket=1,core=1,thread=0 (vCPU3)
(This is a very trivial example that I am intentionally making look bad,
but the thing is, there are some CPUs with very weird vCPU ->
socket/core/thread mappings).
The problem here is that with this new version it is libvirt who
configures the vCPU -> NUMA mapping (using -numa cpu). Why did it get it
wrong? Well, it had no way to ask QEMU how the mapping used to look. Okay,
so we add an interface to QEMU (say -preconfig + query-hotpluggable-cpus)
which will do the mapping and keep it there indefinitely. But if the
interface is already there (and "always" will be), I don't see the need
for the extra step (libvirt asking QEMU for the old mapping).
The problem here is not how to assign vCPUs to NUMA nodes, the problem
is how to translate vCPU IDs to socket=,core=,thread=.
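For instance, just to reproduce the first (old) mapping above with the new
syntax, libvirt would have to emit something like (if I read the docs
right):

  -numa node,nodeid=0 -numa node,nodeid=1 \
  -numa cpu,node-id=0,socket-id=0,core-id=0,thread-id=0 \
  -numa cpu,node-id=0,socket-id=0,core-id=1,thread-id=0 \
  -numa cpu,node-id=1,socket-id=1,core-id=0,thread-id=0 \
  -numa cpu,node-id=1,socket-id=1,core-id=1,thread-id=0

and coming up with those socket/core/thread values is exactly the
translation step in question.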
That applies not only to '-numa cpu' but also to -device cpufoo;
that's why query-hotpluggable-cpus was introduced, to let the
user get the list of possible CPUs (including the topo properties needed
to create them) for a given set of CLI options.
If I recall right, libvirt uses the topo properties during cpu hotplug but
treats them mainly as opaque info so it can feed them back to QEMU.
>>> tries to avoid that as much as it can.
>>>
>>>>
>>>> How to present it to libvirt user I'm not sure (give them that list
>>>> perhaps and let them select from it???)
>>>
>>> This is what I am trying to figure out in the cover letter. Maybe we
>>> need to let users configure the topology (well, the vCPU id to [socket,
>>> die, core, thread] mapping), but then again, in my testing the guest
>>> ignored that and displayed a different topology (true, I was testing
>>> with -cpu host, so maybe that's why).
>> there is an ongoing issue with EPYC vCPU topology, but otherwise it should work.
>> Just report a bug to qemu-devel if it's broken.
>>
>>>
>>>> But it's irrelevant to the patch: magical IDs for socket/core/...whatever
>>>> should not be generated by libvirt anymore, but rather taken from QEMU
>>>> for a given machine + -smp combination.
>>>
>>> Taken when? We can do this for running machines, but not for freshly
>>> started ones, can we?
>>
>> it can be used for freshly started machines as well:
>> QEMU -S -preconfig -M pc -smp ...
>> (QMP) query-hotpluggable-cpus
>> (QMP) set-numa-node ...
>> ...
>> (QMP) exit-preconfig
>> (QMP) other stuff libvirt does (like hot-plugging CPUs , ...)
>> (QMP) cont
>
> I'm not sure this works. query-hotpluggable-cpus does not map vCPU ID
> <-> socket/core/thread. For '-smp 2,sockets=2,cores=1,threads=1' the
> 'query-hotpluggable-cpus' command returns:
>
> {"return": [{"props": {"core-id": 0, "thread-id": 0, "socket-id": 1},
>              "vcpus-count": 1, "type": "qemu64-x86_64-cpu"},
>             {"props": {"core-id": 0, "thread-id": 0, "socket-id": 0},
>              "vcpus-count": 1, "type": "qemu64-x86_64-cpu"}]}
that's the list I was talking about, which is implicitly ordered by cpu_index
Aha! So in this case it would be:
vCPU0 -> socket=1,core=0,thread=0
vCPU1 -> socket=0,core=0,thread=0
But that doesn't feel right. Is the cpu_index increasing or decreasing
as I go through the array? Also, how is this able to express holes? E.g.
there might be some CPUs that don't have linear topology, and for
instance while socket=0,core=0,thread=0 and socket=0,core=0,thread=2
exist, socket=0,core=0,thread=1 does not. How am I supposed to know that
by just looking at the array?
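For a made-up example: if the array contained only

  {"props": {"socket-id": 0, "core-id": 0, "thread-id": 0}, ...}
  {"props": {"socket-id": 0, "core-id": 0, "thread-id": 2}, ...}

nothing in it would tell me whether thread-id=1 is invalid on purpose or
whether I'm supposed to infer some ordering across the gap.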
> And 'query-cpus' or 'query-cpus-fast', which map vCPU ID onto
> socket/core/thread, are not allowed in preconfig state.
these 2 commands apply to present CPUs only, if I'm not mistaken.
query-hotpluggable-cpus shows not only present CPUs but also CPUs that
could be hotplugged with device_add or used with -device.
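E.g. for a present vCPU, query-cpus-fast returns something like this
(trimmed, from memory):

  {"return": [{"cpu-index": 0,
               "props": {"socket-id": 0, "core-id": 0, "thread-id": 0},
               "qom-path": "/machine/unattached/device[0]",
               "thread-id": 12345,
               "target": "x86_64"}]}

where the outer "thread-id" is the host thread ID, not the topology one.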
Fair enough. I haven't looked into the code that much.
> But if I take a step back, the whole point of deprecating -numa
> node,cpus= is that QEMU no longer wants to do vCPU ID <->
> socket/core/thread mapping because it's ambiguous. So it feels a bit
> weird to design a solution where libvirt would ask QEMU to provide the
> mapping only so that it can be configured back. Not only because of the
> extra step, but also because QEMU can't then remove the mapping anyway.
> I might be misunderstanding the issue though.
if '-numa node,cpus' is removed, we will no longer be using cpu_index as
a configuration interface with the user, which would allow QEMU to start
pruning it from HMP/QMP interfaces and then probably remove it internally.
(I haven't explored yet whether we could get rid of it completely, but
I'd expect the migration stream to be the only reason to keep it internally.)
I'm quite reluctant to add cpu_index to the modern query-hotpluggable-cpus
output, since the whole goal is to get rid of the index, which doesn't
actually work with SPAPR, where the CPU entity is a core and threads are an
internal impl. detail (while cpu_index has a 1:1 mapping with threads).
However, if it will let QEMU drop '-numa node,cpus=', we can discuss adding
an optional 'x-cpu-index' to query-hotpluggable-cpus that will be available
for old machine types, for the sole purpose of helping libvirt map the old
CLI to the new one. New machines shouldn't care about the index though,
since they should be using '-numa cpu'.
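I.e., purely hypothetically, an entry could then look like:

  {"props": {"core-id": 0, "thread-id": 0, "socket-id": 0},
   "vcpus-count": 1, "type": "qemu64-x86_64-cpu", "x-cpu-index": 0}

but again, that's only a discussion point, nothing like this exists today.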
The problem here is that so far, all that libvirt users see are vCPU
IDs. They use them to assign vCPUs to NUMA nodes. And in order for libvirt
to switch to the new command line, it needs a way to map those IDs to
socket=,core=,thread=. I will play more with -preconfig and let you know.
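For the record, the flow I plan to try looks roughly like this (untested
sketch; argument names are from the QAPI schema as I remember it, and the
exit command is spelled 'x-exit-preconfig' IIRC):

  {"execute": "qmp_capabilities"}
  {"execute": "query-hotpluggable-cpus"}
  {"execute": "set-numa-node",
   "arguments": {"type": "node", "nodeid": 0}}
  {"execute": "set-numa-node",
   "arguments": {"type": "cpu", "node-id": 0,
                 "socket-id": 0, "core-id": 0, "thread-id": 0}}
  {"execute": "x-exit-preconfig"}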
Michal