On 5/22/20 7:18 PM, Igor Mammedov wrote:
> On Fri, 22 May 2020 18:28:31 +0200
> Michal Privoznik <mprivozn(a)redhat.com> wrote:
> On 5/22/20 6:07 PM, Igor Mammedov wrote:
>> On Fri, 22 May 2020 16:14:14 +0200
>> Michal Privoznik <mprivozn(a)redhat.com> wrote:
>>
>>> QEMU is trying to obsolete -numa node,cpus= because that uses
>>> ambiguous vCPU id to [socket, die, core, thread] mapping. The new
>>> form is:
>>>
>>> -numa cpu,node-id=N,socket-id=S,die-id=D,core-id=C,thread-id=T
>>>
>>> which is repeated for every vCPU and places it at [S, D, C, T]
>>> into guest NUMA node N.
>>>
>>> While in general this is a magic mapping, we can deal with it.
>>> Firstly, with QEMU 2.7 or newer, libvirt ensures that if topology
>>> is given then maxvcpus must be sockets * dies * cores * threads
>>> (i.e. there are no 'holes').
>>> Secondly, if no topology is given then libvirt itself places each
>>> vCPU into a different socket (basically, it fakes topology of:
>>> [maxvcpus, 1, 1, 1])
>>> Thirdly, we can copy whatever QEMU is doing when mapping vCPUs
>>> onto topology, to make sure vCPUs don't start to move around.
>>>
>>> Note, migration from old to new cmd line works and therefore
>>> doesn't need any special handling.
>>>
>>> Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1678085
>>>
>>> Signed-off-by: Michal Privoznik <mprivozn(a)redhat.com>
>>> ---
>>> src/qemu/qemu_command.c | 108 +++++++++++++++++-
>>> .../hugepages-nvdimm.x86_64-latest.args | 4 +-
>>> ...memory-default-hugepage.x86_64-latest.args | 10 +-
>>> .../memfd-memory-numa.x86_64-latest.args | 10 +-
>>> ...y-hotplug-nvdimm-access.x86_64-latest.args | 4 +-
>>> ...ry-hotplug-nvdimm-align.x86_64-latest.args | 4 +-
>>> ...ry-hotplug-nvdimm-label.x86_64-latest.args | 4 +-
>>> ...ory-hotplug-nvdimm-pmem.x86_64-latest.args | 4 +-
>>> ...ory-hotplug-nvdimm-ppc64.ppc64-latest.args | 4 +-
>>> ...hotplug-nvdimm-readonly.x86_64-latest.args | 4 +-
>>> .../memory-hotplug-nvdimm.x86_64-latest.args | 4 +-
>>> ...vhost-user-fs-fd-memory.x86_64-latest.args | 4 +-
>>> ...vhost-user-fs-hugepages.x86_64-latest.args | 4 +-
>>> ...host-user-gpu-secondary.x86_64-latest.args | 3 +-
>>> .../vhost-user-vga.x86_64-latest.args | 3 +-
>>> 15 files changed, 158 insertions(+), 16 deletions(-)
>>>
>>> diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
>>> index 7d84fd8b5e..0de4fe4905 100644
>>> --- a/src/qemu/qemu_command.c
>>> +++ b/src/qemu/qemu_command.c
>>> @@ -7079,6 +7079,91 @@ qemuBuildNumaOldCPUs(virBufferPtr buf,
>>> }
>>>
>>>
>>> +/**
>>> + * qemuTranslatevCPUID:
>>> + *
>>> + * For a given vCPU @id and vCPU topology (@cpu) compute the corresponding
>>> + * @socket, @die, @core and @thread. This assumes a linear topology,
>>> + * that is every [socket, die, core, thread] combination is valid vCPU
>>> + * ID and there are no 'holes'. This is ensured by
>>> + * qemuValidateDomainDef() if QEMU_CAPS_QUERY_HOTPLUGGABLE_CPUS is
>>> + * set.
>> I wouldn't make this assumption; each machine can have (and has) its own
>> layout, and now it's not hard to change that per machine version if necessary.
>>
>> I'd suppose one could pull the list of possible CPUs from QEMU started
>> in preconfig mode with desired -smp x,y,z using QUERY_HOTPLUGGABLE_CPUS
>> and then continue to configure numa with QMP commands using provided
>> CPUs layout.
>
> Continue where? At the 'preconfig mode' the guest is already started,
> isn't it? Are you suggesting that libvirt starts a dummy QEMU process,
> fetches the CPU topology from it and then starts it for real? Libvirt
> QEMU has started, but it's very far from starting the guest; at that time
> it's possible to configure the numa mapping at runtime and continue to -S
> or the running state without restarting QEMU. For follow-up starts, the
> topology and numa options that were used can be cached and reused at CLI
> time as long as the machine/-smp combination stays the same.
This is a problem. The domain XML that is provided can't be changed,
mostly because mgmt apps construct it on the fly and then just pass it
as a RO string to libvirt. While libvirt could create a separate cache,
there has to be a better way.
I mean, I can add some more code that once the guest is running
preserves the mapping during migration. But that assumes a running QEMU.
When starting a domain from scratch, is it acceptable if the vCPU topology
changes? I suspect it is not.
> tries to avoid that as much as it can.
>
>>
>> How to present it to the libvirt user I'm not sure (give them that list
>> perhaps and let them select from it???)
>
> This is what I am trying to figure out in the cover letter. Maybe we
> need to let users configure the topology (well, vCPU id to [socket, die,
> core, thread] mapping), but then again, in my testing the guest ignored
> that and displayed a different topology (true, I was testing with -cpu
> host, so maybe that's why).
> There is an ongoing issue with EPYC vCPU topology, but otherwise it should
> work. Just report a bug to qemu-devel if it's broken.
>
>> But it's irrelevant to the patch; magical IDs for socket/core/...whatever
>> should not be generated by libvirt anymore, but rather taken from QEMU for given
>> machine + -smp combination.
>
> Taken when? We can do this for running machines, but not for freshly
> started ones, can we?
> It can be used for freshly started guests as well:
>
>   QEMU -S -preconfig -M pc -smp ...
>   (QMP) query-hotpluggable-cpus
>   (QMP) set-numa-node ...
>   ...
>   (QMP) exit-preconfig
>   (QMP) other stuff libvirt does (like hot-plugging CPUs, ...)
>   (QMP) cont
I'm not sure this works. query-hotpluggable-cpus does not map vCPU ID
<-> socket/core/thread. For '-smp 2,sockets=2,cores=1,threads=1'
'query-hotpluggable-cpus' returns:

{"return": [
    {"props": {"core-id": 0, "thread-id": 0, "socket-id": 1},
     "vcpus-count": 1, "type": "qemu64-x86_64-cpu"},
    {"props": {"core-id": 0, "thread-id": 0, "socket-id": 0},
     "vcpus-count": 1, "type": "qemu64-x86_64-cpu"}]}
And 'query-cpus' or 'query-cpus-fast', which do map vCPU ID onto
socket/core/thread, are not allowed in preconfig state.
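That said, if the ordering question were solved, here is a rough sketch (in
Python, just to illustrate; 'numa_cpu_args' and the round-robin placement
are hypothetical, not existing libvirt code) of what the libvirt side could
do with that reply to build set-numa-node arguments:

```python
def numa_cpu_args(hotpluggable_cpus, num_nodes):
    """Assign each CPU slot from a query-hotpluggable-cpus reply to a
    guest NUMA node (round-robin here, purely for illustration) and
    return one set-numa-node argument dict per slot."""
    # Sort for a stable order; QEMU may list sockets in any order,
    # as the reply above shows (socket 1 comes first).
    slots = sorted(hotpluggable_cpus,
                   key=lambda c: (c["props"].get("socket-id", 0),
                                  c["props"].get("die-id", 0),
                                  c["props"].get("core-id", 0),
                                  c["props"].get("thread-id", 0)))
    commands = []
    for i, slot in enumerate(slots):
        args = {"type": "cpu", "node-id": i % num_nodes}
        args.update(slot["props"])  # socket-id/core-id/thread-id as given
        commands.append(args)
    return commands
```

Each returned dict would then go out as the arguments of one set-numa-node
call before exit-preconfig. But note it still has to invent an ordering of
the slots, which is exactly the ambiguity being deprecated.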
But if I take a step back, the whole point of deprecating -numa
node,cpus= is that QEMU no longer wants to do vCPU ID <->
socket/core/thread mapping because it's ambiguous. So it feels a bit
weird to design a solution where libvirt would ask QEMU to provide the
mapping only so that it can be configured back. Not only because of the
extra step, but also because QEMU can't then remove the mapping anyway.
I might be misunderstanding the issue though.
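For reference, the linear mapping the patch computes boils down to this
(a Python sketch of the arithmetic, assuming the thread index varies
fastest; as you point out, that ordering is not guaranteed for every
machine type):

```python
def vcpu_topology(vcpu_id, sockets, dies, cores, threads):
    """Decompose a linear vCPU ID into [socket, die, core, thread],
    thread varying fastest.  Assumes maxvcpus == sockets * dies *
    cores * threads, i.e. no holes in the topology."""
    thread = vcpu_id % threads
    core = (vcpu_id // threads) % cores
    die = (vcpu_id // (threads * cores)) % dies
    socket = vcpu_id // (threads * cores * dies)
    assert socket < sockets, "vCPU ID out of range for this topology"
    return socket, die, core, thread
```

With the libvirt fallback topology of [maxvcpus, 1, 1, 1] this degenerates
to every vCPU ID landing in its own socket, matching what libvirt fakes
today when no topology is given.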
Michal