On 10/6/21 3:32 PM, Igor Mammedov wrote:
On Thu, 30 Sep 2021 14:08:34 +0200
Peter Krempa <pkrempa(a)redhat.com> wrote:
> On Tue, Sep 21, 2021 at 16:50:31 +0200, Michal Privoznik wrote:
>> QEMU is trying to obsolete -numa node,cpus= because it uses an
>> ambiguous vCPU ID to [socket, die, core, thread] mapping. The new
>> form is:
>>
>> -numa cpu,node-id=N,socket-id=S,die-id=D,core-id=C,thread-id=T
>>
>> which is repeated for every vCPU and places it at [S, D, C, T]
>> into guest NUMA node N.
>>
>> While in general this is a magic mapping, we can deal with it.
>> Firstly, with QEMU 2.7 or newer, libvirt ensures that if a topology
>> is given then maxvcpus must be sockets * dies * cores * threads
>> (i.e. there are no 'holes').
>> Secondly, if no topology is given then libvirt itself places each
>> vCPU into a different socket (basically, it fakes a topology of
>> [maxvcpus, 1, 1, 1])
>> Thirdly, we can copy whatever QEMU is doing when mapping vCPUs
>> onto topology, to make sure vCPUs don't start to move around.
>
> There's a problem with this premise, though, and unfortunately we don't
> seem to have a qemuxml2argvtest for it.
>
> On PPC64, in certain situations the CPU can be configured such that
> threads are visible only to VMs. This has substantial impact on how CPUs
> are configured using the modern parameters (until now used only for
> cpu hotplug purposes, and that's the reason vCPU hotplug has such
> complicated incantations when starting the VM).
>
> In the above situation a CPU with topology of:
> sockets=1, cores=4, threads=8 (thus 32 cpus)
>
> will only expose 4 CPU "devices".
>
> core-id: 0, core-id: 8, core-id: 16 and core-id: 24
>
> yet the guest will correctly see 32 cpus when used as such.
>
> You can see this in:
>
> tests/qemuhotplugtestcpus/ppc64-modern-individual-monitor.json
>
> Also note that the 'props' object does _not_ have any socket-id, and
> management apps are supposed to pass in 'props' as is. (There's a bunch
> of code to do that on hotplug).
>
> The problem is that you need to query the topology first (unless we want
> to duplicate all of qemu's code that has to do with topology state and
> keep up with changes to it) to know how it behaves on the current
> machine. This historically was not possible. The supposed solution for
> this was the pre-config state, where we'd be able to query and set it up
> via QMP, but I was not keeping up sufficiently with that work, so I
> don't know if it's possible.
>
> If preconfig is a viable option we IMO should start using it sooner
> rather than later and avoid duplicating qemu's logic here.
Using preconfig is the preferable variant; otherwise libvirt would
end up duplicating topology logic, which differs not only between
targets but also between machine/cpu types.
The closest example of how to use preconfig is the pc_dynamic_cpu_cfg()
test case. It uses query-hotpluggable-cpus only for verification, but
one can use the same command at the preconfig stage to get the
topology for a given -smp/-machine type combination.
Alright, -preconfig should be pretty easy. However, I do have some
points to raise/ask:
1) Currently, x-exit-preconfig is marked as experimental (hence the "x-"
prefix). Before libvirt consumes it, QEMU should make it stable. Is
there anything that stops QEMU from doing so, or is it just a matter of
sending patches (I volunteer to do that)?
2) In my experiments I try to mimic what libvirt does. Here's my cmd
line:
qemu-system-x86_64 \
-S \
-preconfig \
-cpu host \
-smp 120,sockets=2,dies=3,cores=4,threads=5 \
-object '{"qom-type":"memory-backend-memfd","id":"ram-node0","size":4294967296,"host-nodes":[0],"policy":"bind"}' \
-numa node,nodeid=0,memdev=ram-node0 \
-no-user-config \
-nodefaults \
-no-shutdown \
-qmp stdio
and here is my QMP log:
{"QMP": {"version": {"qemu": {"micro": 50,
"minor": 1, "major": 6}, "package":
"v6.1.0-1552-g362534a643"}, "capabilities": ["oob"]}}
{"execute":"qmp_capabilities"}
{"return": {}}
{"execute":"query-hotpluggable-cpus"}
{"return": [{"props": {"core-id": 3, "thread-id":
4, "die-id": 2, "socket-id": 1}, "vcpus-count": 1,
"type": "host-x86_64-cpu"}, {"props": {"core-id":
3, "thread-id": 3, "die-id": 2, "socket-id": 1},
"vcpus-count": 1, "type": "host-x86_64-cpu"},
{"props": {"core-id": 3, "thread-id": 2, "die-id":
2, "socket-id": 1}, "vcpus-count": 1, "type":
"host-x86_64-cpu"}, {"props": {"core-id": 3,
"thread-id": 1, "die-id": 2, "socket-id": 1},
"vcpus-count": 1, "type": "host-x86_64-cpu"},
{"props": {"core-id": 3, "thread-id": 0, "die-id":
2, "socket-id": 1}, "vcpus-count": 1, "type":
"host-x86_64-cpu"}, {"props": {"core-id": 2,
"thread-id": 4, "die-id": 2, "socket-id": 1},
"vcpus-count": 1, "type": "host-x86_64-cpu"},
<snip/>
{"props": {"core-id": 0, "thread-id": 0, "die-id":
0, "socket-id": 0}, "vcpus-count": 1, "type":
"host-x86_64-cpu"}]}
I can see that query-hotpluggable-cpus returns an array. Can I safely
assume that vCPU ID == index in the array? I mean, if I had -numa
node,cpus=X, can I do array[X] to obtain the mapping onto Core/Thread/
Die/Socket, which would then be fed to the 'set-numa-node' command? If
not, what is the proper way to do it?
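If I understand the intended flow correctly, it would look roughly like
this at the preconfig stage (a sketch only; one set-numa-node call per
vCPU, with the props copied verbatim from the query result - shown here
just for the all-zero entry from the log above):

{"execute": "query-hotpluggable-cpus"}
{"execute": "set-numa-node",
 "arguments": {"type": "cpu", "node-id": 0,
               "socket-id": 0, "die-id": 0, "core-id": 0, "thread-id": 0}}
{"execute": "x-exit-preconfig"}

Is that the expected usage?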
And one more thing: if QEMU has to keep the vCPU ID mapping code anyway,
what's the point in obsoleting -numa node,cpus=? In the end it is still
QEMU that does the ID -> [Core,Thread,Die,Socket] translation, just with
extra steps for mgmt applications.
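For illustration, assuming the linear, socket-major ordering that the
x86 output above appears to follow (which, as Igor notes, may differ
between targets and machine/cpu types), the translation for
-smp 120,sockets=2,dies=3,cores=4,threads=5 boils down to (integer
division):

  thread-id = ID % 5         e.g. 119 % 5        = 4
  core-id   = (ID / 5) % 4   e.g. (119 / 5) % 4  = 3
  die-id    = (ID / 20) % 3  e.g. (119 / 20) % 3 = 2
  socket-id = ID / 60        e.g. 119 / 60       = 1

which matches the first entry of the array above - exactly the kind of
per-target arithmetic libvirt would rather not duplicate.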
Michal