On Wed, 20 Oct 2021 13:07:59 +0200
Michal Prívozník <mprivozn@redhat.com> wrote:
On 10/6/21 3:32 PM, Igor Mammedov wrote:
> On Thu, 30 Sep 2021 14:08:34 +0200
> Peter Krempa <pkrempa@redhat.com> wrote:
>
>> On Tue, Sep 21, 2021 at 16:50:31 +0200, Michal Privoznik wrote:
>>> QEMU is trying to obsolete -numa node,cpus= because that uses
>>> ambiguous vCPU id to [socket, die, core, thread] mapping. The new
>>> form is:
>>>
>>> -numa cpu,node-id=N,socket-id=S,die-id=D,core-id=C,thread-id=T
>>>
>>> which is repeated for every vCPU and places it at [S, D, C, T]
>>> into guest NUMA node N.
>>>
>>> While in general this is a magic mapping, we can deal with it.
>>> Firstly, with QEMU 2.7 or newer, libvirt ensures that if topology
>>> is given then maxvcpus must be sockets * dies * cores * threads
>>> (i.e. there are no 'holes').
>>> Secondly, if no topology is given then libvirt itself places each
>>> vCPU into a different socket (basically, it fakes topology of:
>>> [maxvcpus, 1, 1, 1])
>>> Thirdly, we can copy whatever QEMU is doing when mapping vCPUs
>>> onto topology, to make sure vCPUs don't start to move around.
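
For reference, the difference between the two forms on a trivial, made-up
2-socket/1-core/1-thread topology would look something like this (which
properties are mandatory depends on the machine type):

  # deprecated form: vCPUs addressed by cpu_index
  -smp 2,sockets=2,cores=1,threads=1 \
  -numa node,nodeid=0,cpus=0-1

  # new form: each vCPU addressed by its topology tuple
  -smp 2,sockets=2,cores=1,threads=1 \
  -numa cpu,node-id=0,socket-id=0,die-id=0,core-id=0,thread-id=0 \
  -numa cpu,node-id=0,socket-id=1,die-id=0,core-id=0,thread-id=0
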
>>
>> There's a problem with this premise though, and unfortunately we don't
>> seem to have a qemuxml2argvtest for it.
>>
>> On PPC64, in certain situations the CPU can be configured such that
>> threads are visible only to VMs. This has a substantial impact on how CPUs
>> are configured using the modern parameters (until now used only for
>> CPU hotplug purposes, which is the reason vCPU hotplug has such
>> complicated incantations when starting the VM).
>>
>> In the above situation a CPU with topology of:
>> sockets=1, cores=4, threads=8 (thus 32 cpus)
>>
>> will only expose 4 CPU "devices".
>>
>> core-id: 0, core-id: 8, core-id: 16 and core-id: 24
>>
>> yet the guest will correctly see 32 cpus when used as such.
>>
>> You can see this in:
>>
>> tests/qemuhotplugtestcpus/ppc64-modern-individual-monitor.json
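
(If I remember that file right, the entries there look roughly like this,
i.e. 'props' carries only a core-id and vcpus-count spans the whole core;
don't take the exact values below as authoritative:

  {"props": {"core-id": 8}, "vcpus-count": 8, "type": "host-spapr-cpu-core"}
)
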
>>
>> Also note that the 'props' object does _not_ have any socket-id, and
>> management apps are supposed to pass in 'props' as is. (There's a bunch
>> of code to do that on hotplug).
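
(In other words, as far as I understand it, the hotplug path merges the
returned 'props' verbatim into device_add, along the lines of the
following; the "id" is made up for the example:

  {"execute": "device_add",
   "arguments": {"driver": "host-spapr-cpu-core", "id": "core8",
                 "core-id": 8}}
)
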
>>
>> The problem is that you need to query the topology first (unless we want
>> to duplicate all of qemu's code that has to do with topology state and
>> keep up with changes to it) to know how it behaves on the current
>> machine. Historically this was not possible. The supposed solution for
>> this was the pre-config state where we'd be able to query and set it up
>> via QMP, but I was not keeping up sufficiently with that work, so I
>> don't know if it's possible.
>>
>> If preconfig is a viable option we IMO should start using it sooner
>> rather than later and avoid duplicating qemu's logic here.
>
> using preconfig is the preferable variant, otherwise libvirt
> would end up duplicating topology logic which differs not only
> between targets but also between machine/cpu types.
>
> The closest example of how to use preconfig is the pc_dynamic_cpu_cfg()
> test case. It uses query-hotpluggable-cpus only for
> verification, but one can use the command at the preconfig
> stage to get the topology for a given -smp/-machine type combination.
Alright, -preconfig should be pretty easy. However, I do have some
points to raise/ask:
1) Currently, exit-preconfig is marked as experimental (hence its "x-"
prefix). Before libvirt consumes it, QEMU should make it stable. Is
there anything that stops QEMU from doing so, or is it just a matter of
sending patches (I volunteer to do that)?
If I recall correctly, it was made experimental due to the lack of
actual users (it was assumed that libvirt would consume it
once available, but that didn't happen for quite a long time).
So patches to make it a stable interface should be fine.
2) In my experiments I try to mimic what libvirt does. Here's my cmd
line:
qemu-system-x86_64 \
-S \
-preconfig \
-cpu host \
-smp 120,sockets=2,dies=3,cores=4,threads=5 \
-object '{"qom-type":"memory-backend-memfd","id":"ram-node0","size":4294967296,"host-nodes":[0],"policy":"bind"}' \
-numa node,nodeid=0,memdev=ram-node0 \
-no-user-config \
-nodefaults \
-no-shutdown \
-qmp stdio
and here is my QMP log:
{"QMP": {"version": {"qemu": {"micro": 50,
"minor": 1, "major": 6}, "package":
"v6.1.0-1552-g362534a643"}, "capabilities": ["oob"]}}
{"execute":"qmp_capabilities"}
{"return": {}}
{"execute":"query-hotpluggable-cpus"}
{"return": [{"props": {"core-id": 3, "thread-id":
4, "die-id": 2, "socket-id": 1}, "vcpus-count": 1,
"type": "host-x86_64-cpu"}, {"props": {"core-id":
3, "thread-id": 3, "die-id": 2, "socket-id": 1},
"vcpus-count": 1, "type": "host-x86_64-cpu"},
{"props": {"core-id": 3, "thread-id": 2, "die-id":
2, "socket-id": 1}, "vcpus-count": 1, "type":
"host-x86_64-cpu"}, {"props": {"core-id": 3,
"thread-id": 1, "die-id": 2, "socket-id": 1},
"vcpus-count": 1, "type": "host-x86_64-cpu"},
{"props": {"core-id": 3, "thread-id": 0, "die-id":
2, "socket-id": 1}, "vcpus-count": 1, "type":
"host-x86_64-cpu"}, {"props": {"core-id": 2,
"thread-id": 4, "die-id": 2, "socket-id": 1},
"vcpus-count": 1, "type": "host-x86_64-cpu"},
<snip/>
{"props": {"core-id": 0, "thread-id": 0,
"die-id": 0, "socket-id": 0}, "vcpus-count": 1,
"type": "host-x86_64-cpu"}]}
I can see that query-hotpluggable-cpus returns an array. Can I safely
assume that vCPU ID == index in the array? I mean, if I had -numa
node,cpus=X, could I do array[X] to obtain the mapping onto
Core/Thread/Die/Socket, which would then be fed to the 'set-numa-node'
command? If not, what is the proper way to do it?
From QEMU's point of view, you shouldn't assume anything about vCPU
ordering within the returned array. It's an internal implementation
detail and subject to change without notice.
What you can assume is that the CPU descriptions in the array will be
stable for a given combination of [machine version, smp option, CPU type].
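So the idea is to take each entry's 'props' exactly as returned and feed
it to set-numa-node, without deriving any index from it; roughly (the
node assignment here is made up for the example):

  {"execute": "set-numa-node",
   "arguments": {"type": "cpu", "node-id": 0,
                 "socket-id": 1, "die-id": 2, "core-id": 3, "thread-id": 4}}
  ... repeated for every entry from query-hotpluggable-cpus ...
  {"execute": "x-exit-preconfig"}
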
And one more thing - if QEMU has to keep the vCPU ID mapping code,
what's the point in obsoleting -numa node,cpus=? In the end it is still
QEMU that does the ID -> [Core,Thread,Die,Socket] translation, just with
extra steps for mgmt applications.
The point is that cpu_index is ambiguous and it's practically impossible
for the user to tell which vCPU exactly it is dealing with, unless the
user re-implements (and keeps in sync) the topology code for
f(board, machine version, smp option, CPU type).
So even if cpu_index is still used inside of QEMU for
other purposes, the external interfaces and API will
use only the consistent topology tuple [Core,Thread,Die,Socket]
to describe and address vCPUs, the same as device_add.
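(For illustration, with a made-up device id:

  {"execute": "device_add",
   "arguments": {"driver": "host-x86_64-cpu", "id": "vcpu24",
                 "socket-id": 1, "die-id": 2, "core-id": 3, "thread-id": 4}}
)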
Michal