
On Wed, 20 Oct 2021 13:07:59 +0200 Michal Prívozník <mprivozn@redhat.com> wrote:
On 10/6/21 3:32 PM, Igor Mammedov wrote:
On Thu, 30 Sep 2021 14:08:34 +0200 Peter Krempa <pkrempa@redhat.com> wrote:
On Tue, Sep 21, 2021 at 16:50:31 +0200, Michal Privoznik wrote:
QEMU is trying to obsolete -numa node,cpus= because that uses ambiguous vCPU id to [socket, die, core, thread] mapping. The new form is:
-numa cpu,node-id=N,socket-id=S,die-id=D,core-id=C,thread-id=T
which is repeated for every vCPU and places it at [S, D, C, T] into guest NUMA node N.
While this mapping is in general opaque, we can deal with it. Firstly, with QEMU 2.7 or newer, libvirt ensures that if a topology is given then maxvcpus must equal sockets * dies * cores * threads (i.e. there are no 'holes'). Secondly, if no topology is given, libvirt itself places each vCPU into a different socket (basically, it fakes a topology of [maxvcpus, 1, 1, 1]). Thirdly, we can copy whatever QEMU does when mapping vCPUs onto the topology, to make sure vCPUs don't start to move around.
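(For illustration only: a minimal sketch of the kind of arithmetic that "copying whatever QEMU is doing" boils down to, under the assumption, questioned in the rest of this thread, that thread-id varies fastest, then core-id, die-id and socket-id. The function name is made up and this is not libvirt code:)

  # Hypothetical sketch: translate a legacy vCPU ID into a
  # [socket, die, core, thread] tuple, assuming thread-id varies fastest.
  def vcpu_id_to_topo(vcpu_id, sockets, dies, cores, threads):
      return {
          "socket-id": vcpu_id // (threads * cores * dies),
          "die-id":    (vcpu_id // (threads * cores)) % dies,
          "core-id":   (vcpu_id // threads) % cores,
          "thread-id": vcpu_id % threads,
      }

  # e.g. for -smp 120,sockets=2,dies=3,cores=4,threads=5:
  #   vcpu_id_to_topo(0, 2, 3, 4, 5)   -> socket 0, die 0, core 0, thread 0
  #   vcpu_id_to_topo(119, 2, 3, 4, 5) -> socket 1, die 2, core 3, thread 4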
There's a problem with this premise though, and unfortunately we don't seem to have a qemuxml2argvtest case for it.
On PPC64, in certain situations the CPU can be configured such that threads are visible only to VMs. This has a substantial impact on how CPUs are configured using the modern parameters (until now used only for CPU hotplug purposes, which is the reason vCPU hotplug needs such complicated incantations when starting the VM).
In the above situation a CPU with a topology of sockets=1, cores=4, threads=8 (thus 32 CPUs) will expose only 4 CPU "devices":
core-id: 0, core-id: 8, core-id: 16 and core-id: 24
yet the guest will correctly see 32 CPUs when used as such.
You can see this in:
tests/qemuhotplugtestcpus/ppc64-modern-individual-monitor.json
Also note that the 'props' object does _not_ have any socket-id, and management apps are supposed to pass in 'props' as is. (There's a bunch of code to do that on hotplug).
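(A toy illustration of the PPC64 case above, not taken from QEMU code: each hotpluggable entity is a whole core, addressed by core-id alone, and the core-id values are spaced by the thread count:)

  # Toy illustration (assumption: core-id is simply core index * threads).
  # With sockets=1, cores=4, threads=8 this yields the four entries Peter
  # quotes above; no socket-id or thread-id is part of the address.
  threads, cores = 8, 4
  hotpluggable = [{"core-id": c * threads} for c in range(cores)]
  # -> [{'core-id': 0}, {'core-id': 8}, {'core-id': 16}, {'core-id': 24}]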
The problem is that you need to query the topology first (unless we want to duplicate all of the QEMU code that deals with topology state and keep up with changes to it) to know how it behaves on the current machine. Historically this was not possible. The supposed solution for this was the preconfig state, where we'd be able to query and set it up via QMP, but I have not kept up sufficiently with that work, so I don't know whether it's possible.
If preconfig is a viable option we IMO should start using it sooner rather than later and avoid duplicating qemu's logic here.
Using preconfig is the preferable variant; otherwise libvirt would end up duplicating topology logic which differs not only between targets but also between machine/CPU types.
The closest example of how to use preconfig is the pc_dynamic_cpu_cfg() test case. It uses query-hotpluggable-cpus only for verification, but one can use the command at the preconfig stage to get the topology for a given -smp/-machine type combination.
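(To make that flow concrete, here is a hand-written sketch of driving it over QMP from Python; it is not libvirt code, and the socket path, helper names and the choice of binding everything to guest node 0 are made up for the example. It assumes QEMU was started with -preconfig and a QMP UNIX socket, queries the topology QEMU itself computed, feeds each returned 'props' object back verbatim via set-numa-node, and only then leaves preconfig:)

  # Hand-written sketch, not libvirt code.  Assumes QEMU was started with
  # -preconfig and e.g. '-qmp unix:/tmp/qmp.sock,server=on,wait=off';
  # the socket path and the node choice are made up for the example.
  import json, socket

  def qmp_open(path):
      s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
      s.connect(path)
      f = s.makefile("rw")            # QMP is newline-separated JSON
      json.loads(f.readline())        # consume the greeting
      return f

  def qmp_cmd(f, name, args=None):
      msg = {"execute": name}
      if args is not None:
          msg["arguments"] = args
      f.write(json.dumps(msg) + "\n")
      f.flush()
      while True:                     # skip asynchronous events
          reply = json.loads(f.readline())
          if "return" in reply or "error" in reply:
              return reply

  f = qmp_open("/tmp/qmp.sock")
  qmp_cmd(f, "qmp_capabilities")

  # Ask QEMU, still in preconfig, how it laid out the vCPUs for the given
  # -smp/-machine combination ...
  cpus = qmp_cmd(f, "query-hotpluggable-cpus")["return"]

  # ... and bind each vCPU to guest NUMA node 0, passing 'props' back as-is.
  for cpu in cpus:
      args = {"type": "cpu", "node-id": 0}
      args.update(cpu["props"])
      qmp_cmd(f, "set-numa-node", args)

  qmp_cmd(f, "x-exit-preconfig")      # leave preconfig; the machine is built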
Alright, -preconfig should be pretty easy. However, I do have some points to raise/ask:
1) currently, exit-preconfig is marked as experimental (hence its "x-" prefix). Before libvirt consumes it, QEMU should make it stable. Is there anything that stops QEMU from doing so or is it just a matter of sending patches (I volunteer to do that)?
If I recall correctly, it was made experimental due to the lack of actual users (it was assumed that libvirt would consume it once available, but that didn't happen for quite a long time). So patches to make it a stable interface should be fine.
2) In my experiments I try to mimic what libvirt does. Here's my cmd line:
qemu-system-x86_64 \
  -S \
  -preconfig \
  -cpu host \
  -smp 120,sockets=2,dies=3,cores=4,threads=5 \
  -object '{"qom-type":"memory-backend-memfd","id":"ram-node0","size":4294967296,"host-nodes":[0],"policy":"bind"}' \
  -numa node,nodeid=0,memdev=ram-node0 \
  -no-user-config \
  -nodefaults \
  -no-shutdown \
  -qmp stdio
and here is my QMP log:
{"QMP": {"version": {"qemu": {"micro": 50, "minor": 1, "major": 6}, "package": "v6.1.0-1552-g362534a643"}, "capabilities": ["oob"]}}
{"execute":"qmp_capabilities"}
{"return": {}}
{"execute":"query-hotpluggable-cpus"}
{"return": [
 {"props": {"core-id": 3, "thread-id": 4, "die-id": 2, "socket-id": 1}, "vcpus-count": 1, "type": "host-x86_64-cpu"},
 {"props": {"core-id": 3, "thread-id": 3, "die-id": 2, "socket-id": 1}, "vcpus-count": 1, "type": "host-x86_64-cpu"},
 {"props": {"core-id": 3, "thread-id": 2, "die-id": 2, "socket-id": 1}, "vcpus-count": 1, "type": "host-x86_64-cpu"},
 {"props": {"core-id": 3, "thread-id": 1, "die-id": 2, "socket-id": 1}, "vcpus-count": 1, "type": "host-x86_64-cpu"},
 {"props": {"core-id": 3, "thread-id": 0, "die-id": 2, "socket-id": 1}, "vcpus-count": 1, "type": "host-x86_64-cpu"},
 {"props": {"core-id": 2, "thread-id": 4, "die-id": 2, "socket-id": 1}, "vcpus-count": 1, "type": "host-x86_64-cpu"},
 <snip/>
 {"props": {"core-id": 0, "thread-id": 0, "die-id": 0, "socket-id": 0}, "vcpus-count": 1, "type": "host-x86_64-cpu"}]}
I can see that query-hotpluggable-cpus returns an array. Can I safely assume that vCPU ID == index in the array? I mean, if I did have -numa node,cpus=X, can I do array[X] to obtain the mapping onto Core/Thread/Die/Socket which would then be fed to the 'set-numa-node' command? If not, what is the proper way to do it?
From QEMU's point of view, you shouldn't assume anything about vCPU ordering within the returned array. It's an internal implementation detail and subject to change without notice. What you can assume is that the CPU descriptions in the array will be stable for a given combination of [machine version, smp option, CPU type].
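(For illustration, and purely as one possible approach rather than anything QEMU prescribes: instead of relying on the array index, a management application could derive its own deterministic ordering from the topology properties themselves, e.g. by sorting on the tuple; treating members that are absent, such as socket-id on PPC64, as 0 is an assumption of this sketch:)

  # Hypothetical sketch: build a deterministic ordering from 'props'
  # rather than from the array position ('cpus' as queried in the
  # preconfig sketch above).
  def topo_key(cpu):
      p = cpu["props"]
      return tuple(p.get(k, 0) for k in
                   ("socket-id", "die-id", "core-id", "thread-id"))

  ordered = sorted(cpus, key=topo_key)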
And one more thing - if QEMU has to keep the vCPU ID mapping code anyway, what's the point in obsoleting -numa node,cpus=? In the end it is still QEMU that does the ID -> [Core,Thread,Die,Socket] translation, just with extra steps for mgmt applications.
The point is that cpu_index is ambiguous and it's practically impossible for the user to tell which vCPU exactly it refers to, unless the user re-implements and keeps in sync the topology code for f(board, machine version, smp option, CPU type). So even if cpu_index is still used inside QEMU for other purposes, the external interfaces and API will use only the consistent topology tuple [Core,Thread,Die,Socket] to describe and address vCPUs, same as device_add does.
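(For completeness, that is the addressing scheme CPU hotplug already uses: the type and 'props' returned by query-hotpluggable-cpus are passed to device_add as-is. A made-up example reusing the qmp_cmd() helper from the preconfig sketch above, assuming the machine has already been built and was started with fewer online vCPUs than maxcpus, so there is something left to plug; the device id is hypothetical:)

  # Re-query once the machine is built; entries without a "qom-path" are
  # not plugged yet.  The id "vcpu-hot0" is made up for the example.
  cpus = qmp_cmd(f, "query-hotpluggable-cpus")["return"]
  free = next(c for c in cpus if "qom-path" not in c)
  args = {"driver": free["type"], "id": "vcpu-hot0"}
  args.update(free["props"])          # pass the topology tuple verbatim
  qmp_cmd(f, "device_add", args)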
Michal