Re: [PATCH 4/5] qemu: Prefer -numa cpu over -numa node,cpus=

26 May 2020

On Mon, 25 May 2020 10:05:08 +0200
Michal Privoznik <mprivozn@redhat.com> wrote:
...
On 5/22/20 7:18 PM, Igor Mammedov wrote:
...
On Fri, 22 May 2020 18:28:31 +0200
Michal Privoznik <mprivozn@redhat.com> wrote:
...
On 5/22/20 6:07 PM, Igor Mammedov wrote:
...
On Fri, 22 May 2020 16:14:14 +0200
Michal Privoznik <mprivozn@redhat.com> wrote:
...
QEMU is trying to obsolete -numa node,cpus= because that uses
ambiguous vCPU id to [socket, die, core, thread] mapping. The new
form is:
-numa cpu,node-id=N,socket-id=S,die-id=D,core-id=C,thread-id=T
which is repeated for every vCPU and places it at [S, D, C, T]
into guest NUMA node N.
While in general this is magic mapping, we can deal with it.
Firstly, with QEMU 2.7 or newer, libvirt ensures that if topology
is given then maxvcpus must be sockets * dies * cores * threads
(i.e. there are no 'holes').
Secondly, if no topology is given then libvirt itself places each
vCPU into a different socket (basically, it fakes topology of:
[maxvcpus, 1, 1, 1])
Thirdly, we can copy whatever QEMU is doing when mapping vCPUs
onto topology, to make sure vCPUs don't start to move around.
Note, migration from old to new cmd line works and therefore
doesn't need any special handling.
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1678085
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
   src/qemu/qemu_command.c                       | 108 +++++++++++++++++-
   .../hugepages-nvdimm.x86_64-latest.args       |   4 +-
   ...memory-default-hugepage.x86_64-latest.args |  10 +-
   .../memfd-memory-numa.x86_64-latest.args      |  10 +-
   ...y-hotplug-nvdimm-access.x86_64-latest.args |   4 +-
   ...ry-hotplug-nvdimm-align.x86_64-latest.args |   4 +-
   ...ry-hotplug-nvdimm-label.x86_64-latest.args |   4 +-
   ...ory-hotplug-nvdimm-pmem.x86_64-latest.args |   4 +-
   ...ory-hotplug-nvdimm-ppc64.ppc64-latest.args |   4 +-
   ...hotplug-nvdimm-readonly.x86_64-latest.args |   4 +-
   .../memory-hotplug-nvdimm.x86_64-latest.args  |   4 +-
   ...vhost-user-fs-fd-memory.x86_64-latest.args |   4 +-
   ...vhost-user-fs-hugepages.x86_64-latest.args |   4 +-
   ...host-user-gpu-secondary.x86_64-latest.args |   3 +-
   .../vhost-user-vga.x86_64-latest.args         |   3 +-
   15 files changed, 158 insertions(+), 16 deletions(-)

diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
index 7d84fd8b5e..0de4fe4905 100644
--- a/src/qemu/qemu_command.c
+++ b/src/qemu/qemu_command.c
@@ -7079,6 +7079,91 @@ qemuBuildNumaOldCPUs(virBufferPtr buf,
   }
+/**
+ * qemuTranlsatevCPUID:
+ *
+ * For given vCPU @id and vCPU topology (@cpu) compute corresponding
+ * @socket, @die, @core and @thread). This assumes linear topology,
+ * that is every [socket, die, core, thread] combination is valid vCPU
+ * ID and there are no 'holes'. This is ensured by
+ * qemuValidateDomainDef() if QEMU_CAPS_QUERY_HOTPLUGGABLE_CPUS is
+ * set.  
I wouldn't make this assumption, each machine can have (and has) it's own layout,
and now it's not hard to change that per machine version if necessary.
I'd suppose one could pull the list of possible CPUs from QEMU started
in preconfig mode with desired -smp x,y,z using QUERY_HOTPLUGGABLE_CPUS
and then continue to configure numa with QMP commands using provided
CPUs layout.
Continue where? At the 'preconfig mode' the guest is already started,
isn't it? Are you suggesting that libvirt starts a dummy QEMU process,
fetches the CPU topology from it an then starts if for real? Libvirt  
QEMU started but it's very far from starting guest, at that time it's possible
configure numa mapping at runtime and continue to -S or running state
without restarting QEMU. For the follow up starts, used topology and numa options
can be cached and reused at CLI time as long as machine/-smp combination stays
the same.
This is a problem. The domain XML that is provided can't be changed, 
mostly because mgmt apps construct it on the fly and then just pass it 
as a RO string to libvirt. While libvirt could create a separate cache, 
there has to be a better way.
I mean, I can add some more code that once the guest is running 
preserves the mapping during migration. But that assumes a running QEMU. 
When starting a domain from scratch, is it acceptable it vCPU topology 
changes? I suspect it is not.
I'm not sure I got you but
vCPU topology isn't changnig but when starting QEMU, user has to map
'concrete vCPUs' to spencific numa nodes. The issue here is that
to specify concrete vCPUs user needs to get layout from QEMU first
as it's a function of target/machine/-smp and possibly cpu type.
that applies not only '-numa cpu' but also to -device cpufoo,
that's why query-hotpluggable-cpus was introduced to let
user get the list of possible CPUs (including topo properties needed to
create them) for a given set of CLI options.

If I recall right libvirt uses topo properies during cpu hotplug but
treats it mainly as opaqueue info so it could feed it back to QEMU.
...
...
...
tries to avoid that as much as it can.
...
How to present it to libvirt user I'm not sure (give them that list perhaps
and let select from it???)
This is what I am trying to figure out in the cover letter. Maybe we
need to let users configure the topology (well, vCPU id to [socket, die,
core, thread] mapping), but then again, in my testing the guest ignored
that and displayed different topology (true, I was testing with -cpu
host, so maybe that's why).  
there is ongiong issue with EPYC VCPUs topology, but I otherwise it should work.
Just report bug to qemu-devel, if it's broken.
...
...
But it's irrelevant, to the patch, magical IDs for socket/core/...whatever
should not be generated by libvirt anymore, but rather taken from QEMU for given
machine + -smp combination.
Taken when? We can do this for running machines, but not for freshly
started ones, can we?
it can be used for freshly started as well,
QEMU -S -preconfig -M pc -smp ...
(QMP) query-hotpluggable-cpus
(QMP) set-numa-node ...
...
(QMP) exit-preconfig
(QMP) other stuff libvirt does (like hot-plugging CPUs , ...)
(QMP) cont
I'm not sure this works. query-hotpluggable-cpus does not map vCPU ID 
<-> socket/core/thread, For '-smp 2,sockets=2,cores=1,threads=1' the 
'query-hotpluggable-cpus' returns:
{"return": [{"props": {"core-id": 0, "thread-id": 0, "socket-id": 1}, 
"vcpus-count": 1, "type": "qemu64-x86_64-cpu"}, {"props": {"core-id": 0, 
"thread-id": 0, "socket-id": 0}, "vcpus-count": 1, "type": 
"qemu64-x86_64-cpu"}]}
...
And 'query-cpus' or 'query-cpus-fast' which map vCPU ID onto 
socket/core/thread are not allowed in preconfig state.
that's the list I was taling about, which is implicitly ordered by cpu_index

these 2 commands apply to present cpu only, if I'm not mistaken.
query-hotpluggable-cpus shows not only present but also CPUs that
could be hotplugged with device_add or used with -device.
...
But if I take a step back, the whole point of deprecating -numa 
node,cpus= is that QEMU no longer wants to do vCPU ID <-> 
socket/core/thread mapping because it's ambiguous. So it feels a bit 
weird to design a solution where libvirt would ask QEMU to provide the 
mapping only so that it can be configured back. Not only because of the 
extra step, but also because QEMU can't then remove the mapping anyway. 
I might be misunderstanding the issue though.
if '-numa node,cpus' is removed, we no longer will be using cpu_index as
configuration interface with user, that would allow QEMU start pruning
it from HMP/QMP interfaces and then probably remove it internally.
(I haven't explored yet if we could get rid of it completely but
I'd expect migration stream would be the only reason to keep it intrenally).
I'm quite reluctant to add cpu_index to modern query-hotpluggable-cpus output,
since the whole goal is to get rid of the index, which don't actually work
with SPAPR where CPU entity is a core and threads are internal impl. detail
(while cpu_index has 1:1 mapping with threads).

However if it will let QEMU to drop '-numa node,cpus=', we can discuss
adding optional 'x-cpu-index' to query-hotpluggable-cpus, that will be available
for old machine types for the sole purpose to help libvirt map old CLI to new one.
New machines shouldn't care about index though, since they should be using
'-numa cpu'.

PS:
Since libvirt would like to avoid restarting QEMU, after discussing between
libvirt and QEMU floks, how to approach chicken/egg problem (started qemu vs
cpu descriptors being available after start), -preconfig mode (however ugly) was
introduced as a compromise.

It was discussed on qemu-devel under subject
"enable numa configuration before  machine_init() from HMP/QMP"
which started with RFC and eventually was merged as v7.

I hope above clarifies questions you (would) have.
...
Michal