On Wed, 20 Oct 2021 13:07:59 +0200
Michal Prívozník <mprivozn@redhat.com> wrote:
On 10/6/21 3:32 PM, Igor Mammedov wrote:
> On Thu, 30 Sep 2021 14:08:34 +0200
> Peter Krempa <pkrempa@redhat.com> wrote:
>
>> On Tue, Sep 21, 2021 at 16:50:31 +0200, Michal Privoznik wrote:
>>> QEMU is trying to obsolete -numa node,cpus= because that uses
>>> ambiguous vCPU id to [socket, die, core, thread] mapping. The new
>>> form is:
>>>
>>> -numa cpu,node-id=N,socket-id=S,die-id=D,core-id=C,thread-id=T
>>>
>>> which is repeated for every vCPU and places it at [S, D, C, T]
>>> into guest NUMA node N.
>>>
>>> While in general this is a magic mapping, we can deal with it.
>>> Firstly, with QEMU 2.7 or newer, libvirt ensures that if topology
>>> is given then maxvcpus must be sockets * dies * cores * threads
>>> (i.e. there are no 'holes').
>>> Secondly, if no topology is given then libvirt itself places each
>>> vCPU into a different socket (basically, it fakes topology of:
>>> [maxvcpus, 1, 1, 1])
>>> Thirdly, we can copy whatever QEMU is doing when mapping vCPUs
>>> onto topology, to make sure vCPUs don't start to move around.
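
For reference, the difference between the two forms on a trivial, made-up
2-socket/1-core/1-thread topology would look something like this (which
properties are mandatory depends on the machine type):

  # deprecated form: vCPUs addressed by cpu_index
  -smp 2,sockets=2,cores=1,threads=1 \
  -numa node,nodeid=0,cpus=0-1

  # new form: each vCPU addressed by its topology tuple
  -smp 2,sockets=2,cores=1,threads=1 \
  -numa cpu,node-id=0,socket-id=0,die-id=0,core-id=0,thread-id=0 \
  -numa cpu,node-id=0,socket-id=1,die-id=0,core-id=0,thread-id=0
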
>>
>> There's a problem with this premise though, and unfortunately we don't
>> seem to have a qemuxml2argvtest for it.
>>
>> On PPC64, in certain situations the CPU can be configured such that
>> threads are visible only to VMs. This has a substantial impact on how CPUs
>> are configured using the modern parameters (until now used only for
>> CPU hotplug purposes, which is the reason vCPU hotplug has such
>> complicated incantations when starting the VM).
>>
>> In the above situation a CPU with topology of:
>> sockets=1, cores=4, threads=8 (thus 32 cpus)
>>
>> will only expose 4 CPU "devices".
>>
>> core-id: 0, core-id: 8, core-id: 16 and core-id: 24
>>
>> yet the guest will correctly see 32 cpus when used as such.
>>
>> You can see this in:
>>
>> tests/qemuhotplugtestcpus/ppc64-modern-individual-monitor.json
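
(If I remember that file right, the entries there look roughly like this,
i.e. 'props' carries only a core-id and vcpus-count spans the whole core;
don't take the exact values below as authoritative:

  {"props": {"core-id": 8}, "vcpus-count": 8, "type": "host-spapr-cpu-core"}
)
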
>>
>> Also note that the 'props' object does _not_ have any socket-id, and
>> management apps are supposed to pass in 'props' as is. (There's a bunch
>> of code to do that on hotplug).
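
(In other words, as far as I understand it, the hotplug path merges the
returned 'props' verbatim into device_add, along the lines of the
following; the "id" is made up for the example:

  {"execute": "device_add",
   "arguments": {"driver": "host-spapr-cpu-core", "id": "core8",
                 "core-id": 8}}
)
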
>>
>> The problem is that you need to query the topology first (unless we want
>> to duplicate all of qemu's code that has to do with topology state and
>> keep up with changes to it) to know how it behaves on the current
>> machine. Historically this was not possible. The supposed solution for
>> this was the pre-config state where we'd be able to query and set it up
>> via QMP, but I was not keeping up sufficiently with that work, so I
>> don't know if it's possible.
>>
>> If preconfig is a viable option we IMO should start using it sooner
>> rather than later and avoid duplicating qemu's logic here.
>
> using preconfig is the preferable variant, otherwise libvirt
> would end up duplicating topology logic which differs not only
> between targets but also between machine/cpu types.
>
> The closest example of how to use preconfig is the pc_dynamic_cpu_cfg()
> test case. It uses query-hotpluggable-cpus only for
> verification, but one can use the command at the preconfig
> stage to get the topology for a given -smp/-machine type combination.
Alright, -preconfig should be pretty easy. However, I do have some
points to raise/ask:
1) Currently, exit-preconfig is marked as experimental (hence its "x-"
prefix). Before libvirt consumes it, QEMU should make it stable. Is
there anything that stops QEMU from doing so, or is it just a matter of
sending patches (I volunteer to do that)?
If I recall correctly, it was made experimental due to the lack of
actual users (it was assumed that libvirt would consume it
once available, but that didn't happen for quite a long time).
So patches to make it a stable interface should be fine.
2) In my experiments I try to mimic what libvirt does. Here's my cmd
line:
qemu-system-x86_64 \
-S \
-preconfig \
-cpu host \
-smp 120,sockets=2,dies=3,cores=4,threads=5 \
-object '{"qom-type":"memory-backend-memfd","id":"ram-node0","size":4294967296,"host-nodes":[0],"policy":"bind"}' \
-numa node,nodeid=0,memdev=ram-node0 \
-no-user-config \
-nodefaults \
-no-shutdown \
-qmp stdio
and here is my QMP log:
{"QMP": {"version": {"qemu": {"micro": 50,
"minor": 1, "major": 6}, "package":
"v6.1.0-1552-g362534a643"}, "capabilities": ["oob"]}}
{"execute":"qmp_capabilities"}
{"return": {}}
{"execute":"query-hotpluggable-cpus"}
{"return": [{"props": {"core-id": 3, "thread-id":
4, "die-id": 2, "socket-id": 1}, "vcpus-count": 1,
"type": "host-x86_64-cpu"}, {"props": {"core-id":
3, "thread-id": 3, "die-id": 2, "socket-id": 1},
"vcpus-count": 1, "type": "host-x86_64-cpu"},
{"props": {"core-id": 3, "thread-id": 2, "die-id":
2, "socket-id": 1}, "vcpus-count": 1, "type":
"host-x86_64-cpu"}, {"props": {"core-id": 3,
"thread-id": 1, "die-id": 2, "socket-id": 1},
"vcpus-count": 1, "type": "host-x86_64-cpu"},
{"props": {"core-id": 3, "thread-id": 0, "die-id":
2, "socket-id": 1}, "vcpus-count": 1, "type":
"host-x86_64-cpu"}, {"props": {"core-id": 2,
"thread-id": 4, "die-id": 2, "socket-id": 1},
"vcpus-count": 1, "type": "host-x86_64-cpu"},
<snip/>
{"props": {"core-id": 0, "thread-id": 0,
"die-id": 0, "socket-id": 0}, "vcpus-count": 1,
"type": "host-x86_64-cpu"}]}
I can see that query-hotpluggable-cpus returns an array. Can I safely
assume that vCPU ID == index in the array? I mean, if I had -numa
node,cpus=X, could I do array[X] to obtain the mapping onto
Core/Thread/Die/Socket, which would then be fed to the 'set-numa-node'
command? If not, what is the proper way to do it?
From QEMU's point of view, you shouldn't assume anything about vCPU
ordering within the returned array. It's an internal implementation
detail and subject to change without notice.
What you can assume is that the CPU descriptions in the array will be
stable for a given combination of [machine version, smp option, CPU type].
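So the idea is to take each entry's 'props' exactly as returned and feed
it to set-numa-node, without deriving any index from it; roughly (the
node assignment here is made up for the example):

  {"execute": "set-numa-node",
   "arguments": {"type": "cpu", "node-id": 0,
                 "socket-id": 1, "die-id": 2, "core-id": 3, "thread-id": 4}}
  ... repeated for every entry from query-hotpluggable-cpus ...
  {"execute": "x-exit-preconfig"}
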
And one more thing - if QEMU has to keep the vCPU ID mapping code,
what's the point in obsoleting -numa node,cpus=? In the end it is still
QEMU that does the ID -> [Core,Thread,Die,Socket] translation, just with
extra steps for mgmt applications.
The point is that cpu_index is ambiguous and it's practically impossible
for the user to tell which vCPU exactly it is dealing with, unless the
user re-implements (and keeps in sync) the topology code for
f(board, machine version, smp option, CPU type).
So even if cpu_index is still used inside of QEMU for
other purposes, the external interfaces and API will
use only the consistent topology tuple [Core,Thread,Die,Socket]
to describe and address vCPUs, the same as device_add.
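(For illustration, with a made-up device id:

  {"execute": "device_add",
   "arguments": {"driver": "host-x86_64-cpu", "id": "vcpu24",
                 "socket-id": 1, "die-id": 2, "core-id": 3, "thread-id": 4}}
)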
Michal