
On Wed, 20 Oct 2021 13:07:59 +0200 Michal Prívozník <mprivozn@redhat.com> wrote:
On 10/6/21 3:32 PM, Igor Mammedov wrote:
On Thu, 30 Sep 2021 14:08:34 +0200 Peter Krempa <pkrempa@redhat.com> wrote:
On Tue, Sep 21, 2021 at 16:50:31 +0200, Michal Privoznik wrote:
QEMU is trying to obsolete -numa node,cpus= because that uses ambiguous vCPU id to [socket, die, core, thread] mapping. The new form is:
-numa cpu,node-id=N,socket-id=S,die-id=D,core-id=C,thread-id=T
which is repeated for every vCPU and places it at [S, D, C, T] into guest NUMA node N.
While this mapping is in general opaque, we can deal with it. Firstly, with QEMU 2.7 or newer, libvirt ensures that if a topology is given then maxvcpus must equal sockets * dies * cores * threads (i.e. there are no 'holes'). Secondly, if no topology is given, libvirt itself places each vCPU into a different socket (basically, it fakes a topology of [maxvcpus, 1, 1, 1]). Thirdly, we can copy whatever QEMU does when mapping vCPUs onto the topology, to make sure vCPUs don't start to move around.
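(For illustration only: a minimal sketch of the kind of arithmetic that "copying whatever QEMU is doing" boils down to, under the assumption, questioned in the rest of this thread, that thread-id varies fastest, then core-id, die-id and socket-id. The function name is made up and this is not libvirt code:)

  # Hypothetical sketch: translate a legacy vCPU ID into a
  # [socket, die, core, thread] tuple, assuming thread-id varies fastest.
  def vcpu_id_to_topo(vcpu_id, sockets, dies, cores, threads):
      return {
          "socket-id": vcpu_id // (threads * cores * dies),
          "die-id":    (vcpu_id // (threads * cores)) % dies,
          "core-id":   (vcpu_id // threads) % cores,
          "thread-id": vcpu_id % threads,
      }

  # e.g. for -smp 120,sockets=2,dies=3,cores=4,threads=5:
  #   vcpu_id_to_topo(0, 2, 3, 4, 5)   -> socket 0, die 0, core 0, thread 0
  #   vcpu_id_to_topo(119, 2, 3, 4, 5) -> socket 1, die 2, core 3, thread 4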
There's a problem with this premise though, and unfortunately we don't seem to have a qemuxml2argvtest case for it.
On PPC64, in certain situations the CPU can be configured such that threads are visible only to VMs. This has a substantial impact on how CPUs are configured using the modern parameters (until now used only for CPU hotplug purposes, which is the reason vCPU hotplug needs such complicated incantations when starting the VM).
In the above situation a CPU with a topology of sockets=1, cores=4, threads=8 (thus 32 CPUs) will expose only 4 CPU "devices":
core-id: 0, core-id: 8, core-id: 16 and core-id: 24
yet the guest will correctly see 32 CPUs when used as such.
You can see this in:
tests/qemuhotplugtestcpus/ppc64-modern-individual-monitor.json
Also note that the 'props' object does _not_ have any socket-id, and management apps are supposed to pass in 'props' as is. (There's a bunch of code to do that on hotplug).
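(A toy illustration of the PPC64 case above, not taken from QEMU code: each hotpluggable entity is a whole core, addressed by core-id alone, and the core-id values are spaced by the thread count:)

  # Toy illustration (assumption: core-id is simply core index * threads).
  # With sockets=1, cores=4, threads=8 this yields the four entries Peter
  # quotes above; no socket-id or thread-id is part of the address.
  threads, cores = 8, 4
  hotpluggable = [{"core-id": c * threads} for c in range(cores)]
  # -> [{'core-id': 0}, {'core-id': 8}, {'core-id': 16}, {'core-id': 24}]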
The problem is that you need to query the topology first (unless we want to duplicate all of the QEMU code that deals with topology state and keep up with changes to it) to know how it behaves on the current machine. Historically this was not possible. The supposed solution for this was the preconfig state, where we'd be able to query and set it up via QMP, but I have not kept up sufficiently with that work, so I don't know whether it's possible.
If preconfig is a viable option we IMO should start using it sooner rather than later and avoid duplicating qemu's logic here.
Using preconfig is the preferable variant; otherwise libvirt would end up duplicating topology logic which differs not only between targets but also between machine/CPU types.
The closest example of how to use preconfig is the pc_dynamic_cpu_cfg() test case. It uses query-hotpluggable-cpus only for verification, but one can use the command at the preconfig stage to get the topology for a given -smp/-machine type combination.
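(To make that flow concrete, here is a hand-written sketch of driving it over QMP from Python; it is not libvirt code, and the socket path, helper names and the choice of binding everything to guest node 0 are made up for the example. It assumes QEMU was started with -preconfig and a QMP UNIX socket, queries the topology QEMU itself computed, feeds each returned 'props' object back verbatim via set-numa-node, and only then leaves preconfig:)

  # Hand-written sketch, not libvirt code.  Assumes QEMU was started with
  # -preconfig and e.g. '-qmp unix:/tmp/qmp.sock,server=on,wait=off';
  # the socket path and the node choice are made up for the example.
  import json, socket

  def qmp_open(path):
      s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
      s.connect(path)
      f = s.makefile("rw")            # QMP is newline-separated JSON
      json.loads(f.readline())        # consume the greeting
      return f

  def qmp_cmd(f, name, args=None):
      msg = {"execute": name}
      if args is not None:
          msg["arguments"] = args
      f.write(json.dumps(msg) + "\n")
      f.flush()
      while True:                     # skip asynchronous events
          reply = json.loads(f.readline())
          if "return" in reply or "error" in reply:
              return reply

  f = qmp_open("/tmp/qmp.sock")
  qmp_cmd(f, "qmp_capabilities")

  # Ask QEMU, still in preconfig, how it laid out the vCPUs for the given
  # -smp/-machine combination ...
  cpus = qmp_cmd(f, "query-hotpluggable-cpus")["return"]

  # ... and bind each vCPU to guest NUMA node 0, passing 'props' back as-is.
  for cpu in cpus:
      args = {"type": "cpu", "node-id": 0}
      args.update(cpu["props"])
      qmp_cmd(f, "set-numa-node", args)

  qmp_cmd(f, "x-exit-preconfig")      # leave preconfig; the machine is built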
Alright, -preconfig should be pretty easy. However, I do have some points to raise/ask:
1) currently, exit-preconfig is marked as experimental (hence its "x-" prefix). Before libvirt consumes it, QEMU should make it stable. Is there anything that stops QEMU from doing so or is it just a matter of sending patches (I volunteer to do that)?
If I recall correctly, it was made experimental due to the lack of actual users (it was assumed that libvirt would consume it once available, but that didn't happen for quite a long time). So patches to make it a stable interface should be fine.
2) In my experiments I try to mimic what libvirt does. Here's my cmd line:
qemu-system-x86_64 \
  -S \
  -preconfig \
  -cpu host \
  -smp 120,sockets=2,dies=3,cores=4,threads=5 \
  -object '{"qom-type":"memory-backend-memfd","id":"ram-node0","size":4294967296,"host-nodes":[0],"policy":"bind"}' \
  -numa node,nodeid=0,memdev=ram-node0 \
  -no-user-config \
  -nodefaults \
  -no-shutdown \
  -qmp stdio
and here is my QMP log:
{"QMP": {"version": {"qemu": {"micro": 50, "minor": 1, "major": 6}, "package": "v6.1.0-1552-g362534a643"}, "capabilities": ["oob"]}}
{"execute":"qmp_capabilities"}
{"return": {}}
{"execute":"query-hotpluggable-cpus"}
{"return": [
 {"props": {"core-id": 3, "thread-id": 4, "die-id": 2, "socket-id": 1}, "vcpus-count": 1, "type": "host-x86_64-cpu"},
 {"props": {"core-id": 3, "thread-id": 3, "die-id": 2, "socket-id": 1}, "vcpus-count": 1, "type": "host-x86_64-cpu"},
 {"props": {"core-id": 3, "thread-id": 2, "die-id": 2, "socket-id": 1}, "vcpus-count": 1, "type": "host-x86_64-cpu"},
 {"props": {"core-id": 3, "thread-id": 1, "die-id": 2, "socket-id": 1}, "vcpus-count": 1, "type": "host-x86_64-cpu"},
 {"props": {"core-id": 3, "thread-id": 0, "die-id": 2, "socket-id": 1}, "vcpus-count": 1, "type": "host-x86_64-cpu"},
 {"props": {"core-id": 2, "thread-id": 4, "die-id": 2, "socket-id": 1}, "vcpus-count": 1, "type": "host-x86_64-cpu"},
 <snip/>
 {"props": {"core-id": 0, "thread-id": 0, "die-id": 0, "socket-id": 0}, "vcpus-count": 1, "type": "host-x86_64-cpu"}]}
I can see that query-hotpluggable-cpus returns an array. Can I safely assume that vCPU ID == index in the array? I mean, if I did have -numa node,cpus=X, can I do array[X] to obtain the mapping onto Core/Thread/Die/Socket which would then be fed to the 'set-numa-node' command? If not, what is the proper way to do it?
From QEMU's point of view, you shouldn't assume anything about vCPU ordering within the returned array. It's an internal implementation detail and subject to change without notice. What you can assume is that the CPU descriptions in the array will be stable for a given combination of [machine version, smp option, CPU type].
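(For illustration, and purely as one possible approach rather than anything QEMU prescribes: instead of relying on the array index, a management application could derive its own deterministic ordering from the topology properties themselves, e.g. by sorting on the tuple; treating members that are absent, such as socket-id on PPC64, as 0 is an assumption of this sketch:)

  # Hypothetical sketch: build a deterministic ordering from 'props'
  # rather than from the array position ('cpus' as queried in the
  # preconfig sketch above).
  def topo_key(cpu):
      p = cpu["props"]
      return tuple(p.get(k, 0) for k in
                   ("socket-id", "die-id", "core-id", "thread-id"))

  ordered = sorted(cpus, key=topo_key)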
And one more thing - if QEMU has to keep the vCPU ID mapping code anyway, what's the point in obsoleting -numa node,cpus=? In the end it is still QEMU that does the ID -> [Core,Thread,Die,Socket] translation, just with extra steps for mgmt applications.
The point is that cpu_index is ambiguous and it's practically impossible for the user to tell which vCPU exactly it refers to, unless the user re-implements and keeps in sync the topology code for f(board, machine version, smp option, CPU type). So even if cpu_index is still used inside QEMU for other purposes, the external interfaces and API will use only the consistent topology tuple [Core,Thread,Die,Socket] to describe and address vCPUs, same as device_add does.
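(For completeness, that is the addressing scheme CPU hotplug already uses: the type and 'props' returned by query-hotpluggable-cpus are passed to device_add as-is. A made-up example reusing the qmp_cmd() helper from the preconfig sketch above, assuming the machine has already been built and was started with fewer online vCPUs than maxcpus, so there is something left to plug; the device id is hypothetical:)

  # Re-query once the machine is built; entries without a "qom-path" are
  # not plugged yet.  The id "vcpu-hot0" is made up for the example.
  cpus = qmp_cmd(f, "query-hotpluggable-cpus")["return"]
  free = next(c for c in cpus if "qom-path" not in c)
  args = {"driver": free["type"], "id": "vcpu-hot0"}
  args.update(free["props"])          # pass the topology tuple verbatim
  qmp_cmd(f, "device_add", args)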
Michal