On 5/22/20 7:18 PM, Igor Mammedov wrote:
> On Fri, 22 May 2020 18:28:31 +0200
> Michal Privoznik <mprivozn(a)redhat.com> wrote:
> On 5/22/20 6:07 PM, Igor Mammedov wrote:
>> On Fri, 22 May 2020 16:14:14 +0200
>> Michal Privoznik <mprivozn(a)redhat.com> wrote:
>>
>>> QEMU is trying to obsolete -numa node,cpus= because that uses
>>> ambiguous vCPU id to [socket, die, core, thread] mapping. The new
>>> form is:
>>>
>>> -numa cpu,node-id=N,socket-id=S,die-id=D,core-id=C,thread-id=T
>>>
>>> which is repeated for every vCPU and places it at [S, D, C, T]
>>> into guest NUMA node N.
>>>
>>> While in general this is a magic mapping, we can deal with it.
>>> Firstly, with QEMU 2.7 or newer, libvirt ensures that if topology
>>> is given then maxvcpus must be sockets * dies * cores * threads
>>> (i.e. there are no 'holes').
>>> Secondly, if no topology is given then libvirt itself places each
>>> vCPU into a different socket (basically, it fakes topology of:
>>> [maxvcpus, 1, 1, 1])
>>> Thirdly, we can copy whatever QEMU is doing when mapping vCPUs
>>> onto topology, to make sure vCPUs don't start to move around.
>>>
>>> Note, migration from old to new cmd line works and therefore
>>> doesn't need any special handling.
>>>
>>> Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1678085
>>>
>>> Signed-off-by: Michal Privoznik <mprivozn(a)redhat.com>
>>> ---
>>> src/qemu/qemu_command.c | 108 +++++++++++++++++-
>>> .../hugepages-nvdimm.x86_64-latest.args | 4 +-
>>> ...memory-default-hugepage.x86_64-latest.args | 10 +-
>>> .../memfd-memory-numa.x86_64-latest.args | 10 +-
>>> ...y-hotplug-nvdimm-access.x86_64-latest.args | 4 +-
>>> ...ry-hotplug-nvdimm-align.x86_64-latest.args | 4 +-
>>> ...ry-hotplug-nvdimm-label.x86_64-latest.args | 4 +-
>>> ...ory-hotplug-nvdimm-pmem.x86_64-latest.args | 4 +-
>>> ...ory-hotplug-nvdimm-ppc64.ppc64-latest.args | 4 +-
>>> ...hotplug-nvdimm-readonly.x86_64-latest.args | 4 +-
>>> .../memory-hotplug-nvdimm.x86_64-latest.args | 4 +-
>>> ...vhost-user-fs-fd-memory.x86_64-latest.args | 4 +-
>>> ...vhost-user-fs-hugepages.x86_64-latest.args | 4 +-
>>> ...host-user-gpu-secondary.x86_64-latest.args | 3 +-
>>> .../vhost-user-vga.x86_64-latest.args | 3 +-
>>> 15 files changed, 158 insertions(+), 16 deletions(-)
>>>
>>> diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
>>> index 7d84fd8b5e..0de4fe4905 100644
>>> --- a/src/qemu/qemu_command.c
>>> +++ b/src/qemu/qemu_command.c
>>> @@ -7079,6 +7079,91 @@ qemuBuildNumaOldCPUs(virBufferPtr buf,
>>> }
>>>
>>>
>>> +/**
>>> + * qemuTranslatevCPUID:
>>> + *
>>> + * For a given vCPU @id and vCPU topology (@cpu) compute the corresponding
>>> + * @socket, @die, @core and @thread. This assumes a linear topology,
>>> + * that is every [socket, die, core, thread] combination is valid vCPU
>>> + * ID and there are no 'holes'. This is ensured by
>>> + * qemuValidateDomainDef() if QEMU_CAPS_QUERY_HOTPLUGGABLE_CPUS is
>>> + * set.
>> I wouldn't make this assumption; each machine can have (and has) its own
>> layout, and now it's not hard to change that per machine version if necessary.
>>
>> I'd suppose one could pull the list of possible CPUs from QEMU started
>> in preconfig mode with desired -smp x,y,z using QUERY_HOTPLUGGABLE_CPUS
>> and then continue to configure numa with QMP commands using provided
>> CPUs layout.
>
> Continue where? At the 'preconfig mode' the guest is already started,
> isn't it? Are you suggesting that libvirt starts a dummy QEMU process,
> fetches the CPU topology from it and then starts it for real? Libvirt
> QEMU has started, but it's very far from starting the guest; at that time
> it's possible to configure the numa mapping at runtime and continue to -S
> or the running state without restarting QEMU. For follow-up starts, the
> topology and numa options that were used can be cached and reused at CLI
> time as long as the machine/-smp combination stays the same.
This is a problem. The domain XML that is provided can't be changed,
mostly because mgmt apps construct it on the fly and then just pass it
as a RO string to libvirt. While libvirt could create a separate cache,
there has to be a better way.
I mean, I can add some more code that once the guest is running
preserves the mapping during migration. But that assumes a running QEMU.
When starting a domain from scratch, is it acceptable if the vCPU topology
changes? I suspect it is not.
> tries to avoid that as much as it can.
>
>>
>> How to present it to the libvirt user I'm not sure (give them that list
>> perhaps and let them select from it???)
>
> This is what I am trying to figure out in the cover letter. Maybe we
> need to let users configure the topology (well, vCPU id to [socket, die,
> core, thread] mapping), but then again, in my testing the guest ignored
> that and displayed a different topology (true, I was testing with -cpu
> host, so maybe that's why).
> There is an ongoing issue with EPYC vCPU topology, but otherwise it should
> work. Just report a bug to qemu-devel if it's broken.
>
>> But it's irrelevant to the patch; magical IDs for socket/core/...whatever
>> should not be generated by libvirt anymore, but rather taken from QEMU for given
>> machine + -smp combination.
>
> Taken when? We can do this for running machines, but not for freshly
> started ones, can we?
> It can be used for freshly started guests as well:
>
>   QEMU -S -preconfig -M pc -smp ...
>   (QMP) query-hotpluggable-cpus
>   (QMP) set-numa-node ...
>   ...
>   (QMP) exit-preconfig
>   (QMP) other stuff libvirt does (like hot-plugging CPUs, ...)
>   (QMP) cont
I'm not sure this works. query-hotpluggable-cpus does not map vCPU ID
<-> socket/core/thread. For '-smp 2,sockets=2,cores=1,threads=1'
'query-hotpluggable-cpus' returns:

{"return": [
    {"props": {"core-id": 0, "thread-id": 0, "socket-id": 1},
     "vcpus-count": 1, "type": "qemu64-x86_64-cpu"},
    {"props": {"core-id": 0, "thread-id": 0, "socket-id": 0},
     "vcpus-count": 1, "type": "qemu64-x86_64-cpu"}]}
And 'query-cpus' or 'query-cpus-fast', which do map vCPU ID onto
socket/core/thread, are not allowed in preconfig state.
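That said, if the ordering question were solved, here is a rough sketch (in
Python, just to illustrate; 'numa_cpu_args' and the round-robin placement
are hypothetical, not existing libvirt code) of what the libvirt side could
do with that reply to build set-numa-node arguments:

```python
def numa_cpu_args(hotpluggable_cpus, num_nodes):
    """Assign each CPU slot from a query-hotpluggable-cpus reply to a
    guest NUMA node (round-robin here, purely for illustration) and
    return one set-numa-node argument dict per slot."""
    # Sort for a stable order; QEMU may list sockets in any order,
    # as the reply above shows (socket 1 comes first).
    slots = sorted(hotpluggable_cpus,
                   key=lambda c: (c["props"].get("socket-id", 0),
                                  c["props"].get("die-id", 0),
                                  c["props"].get("core-id", 0),
                                  c["props"].get("thread-id", 0)))
    commands = []
    for i, slot in enumerate(slots):
        args = {"type": "cpu", "node-id": i % num_nodes}
        args.update(slot["props"])  # socket-id/core-id/thread-id as given
        commands.append(args)
    return commands
```

Each returned dict would then go out as the arguments of one set-numa-node
call before exit-preconfig. But note it still has to invent an ordering of
the slots, which is exactly the ambiguity being deprecated.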
But if I take a step back, the whole point of deprecating -numa
node,cpus= is that QEMU no longer wants to do vCPU ID <->
socket/core/thread mapping because it's ambiguous. So it feels a bit
weird to design a solution where libvirt would ask QEMU to provide the
mapping only so that it can be configured back. Not only because of the
extra step, but also because QEMU can't then remove the mapping anyway.
I might be misunderstanding the issue though.
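For reference, the linear mapping the patch computes boils down to this
(a Python sketch of the arithmetic, assuming the thread index varies
fastest; as you point out, that ordering is not guaranteed for every
machine type):

```python
def vcpu_topology(vcpu_id, sockets, dies, cores, threads):
    """Decompose a linear vCPU ID into [socket, die, core, thread],
    thread varying fastest.  Assumes maxvcpus == sockets * dies *
    cores * threads, i.e. no holes in the topology."""
    thread = vcpu_id % threads
    core = (vcpu_id // threads) % cores
    die = (vcpu_id // (threads * cores)) % dies
    socket = vcpu_id // (threads * cores * dies)
    assert socket < sockets, "vCPU ID out of range for this topology"
    return socket, die, core, thread
```

With the libvirt fallback topology of [maxvcpus, 1, 1, 1] this degenerates
to every vCPU ID landing in its own socket, matching what libvirt fakes
today when no topology is given.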
Michal