On Mon, May 13, 2024 at 10:22:22PM +0800, Zhao Liu wrote:
Cc Paolo for x86 topology part
Hi Daniel,
On Mon, May 13, 2024 at 01:33:57PM +0100, Daniel P. Berrangé wrote:
> Date: Mon, 13 May 2024 13:33:57 +0100
> From: "Daniel P. Berrangé" <berrange(a)redhat.com>
> Subject: [PATCH 1/2] hw/core: allow parameter=1 for SMP topology on any
> machine
>
> This effectively reverts
>
> commit 54c4ea8f3ae614054079395842128a856a73dbf9
> Author: Zhao Liu <zhao1.liu(a)intel.com>
> Date: Sat Mar 9 00:01:37 2024 +0800
>
> hw/core/machine-smp: Deprecate unsupported "parameter=1" SMP
configurations
>
> but is not done as a 'git revert' since the part of the changes to the
> file hw/core/machine-smp.c which add 'has_XXX' checks remain desirable.
> Furthermore, we have to tweak the subsequently added unit test to
> account for differing warning message.
>
> The rationale for the original deprecation was:
>
> "Currently, it was allowed for users to specify the unsupported
> topology parameter as "1". For example, x86 PC machine doesn't
> support drawer/book/cluster topology levels, but user could specify
> "-smp drawers=1,books=1,clusters=1".
>
> This is meaningless and confusing, so that the support for this kind
> of configurations is marked deprecated since 9.0."
>
> There are varying POVs on the topic of 'unsupported' topology levels.
>
> It is common to say that on a system without hyperthreading, that there
> is always 1 thread. Likewise when new CPUs introduced a concept of
> multiple "dies', it was reasonable to say that all historical CPUs
> before that implicitly had 1 'die'. Likewise for the more recently
> introduced 'modules' and 'clusters' parameter'. From this POV,
it is
> valid to set 'parameter=1' on the -smp command line for any machine,
> only a value > 1 is strictly an error condition.
Currently QEMU has become more and more difficult to maintain a general
topology hierarchy, there are two recent examples:
1. as you mentioned "module" v.s. "cluster", one reason for
introducing
"module" is because it is difficult to define what "cluster" is for
x86,
the cluster in the device tree can be nested, then it can correspond to
an x86 die, or it can correspond to an x86 module. Therefore, specifying
"clusters=1" for x86 is ambiguous.
2. s390 introduces book and drawer, which are above socket/package
level, but for x86, the level above the package names "cluster" (yeah,
"cluster" again :-(). So if user sets "books=1" or
"drawers=1" for x86,
then it's meaningless. Similarly, "clusters=1" is also very confusing for
x86 machine.
I think that only thread/core/socket are architecturally general, the
other topology levels are hard to define across architectures, then
allowing unsupported topology=1 is always confusing...
Moreover, QEMU currently requires a clear topology containment
relationship when defining a topology, after which it will become
increasingly difficult to define a generic topology containment
relationship when new topology levels are introduced in the future...
I'm failing to see what real world technical problems QEMU faces
with a parameter being set to '1' by a mgmt app, when QEMU itself
treats all omitted values as being '1' anyway.
If we're trying to faithfully model the real world, then restricting
the topology against machine types though still looks inherantly wrong.
The valid topology ought to be constrained based on the named CPU model.
eg it doesn't make sense to allow 'dies=4' with a Skylake CPU model,
only an EPYC CPU model, especially if we want to model cache info in
a way that matches the real world silicon better.
> It doesn't cause any functional difficulty for QEMU, because
internally
> the QEMU code is itself assuming that all "unsupported" parameters
> implicitly have a value of '1'.
>
> At the libvirt level, we've allowed applications to set 'parameter=1'
> when configuring a guest, and pass that through to QEMU.
>
> Deprecating this creates extra difficulty for because there's no info
> exposed from QEMU about which machine types "support" which parameters.
> Thus, libvirt can't know whether it is valid to pass 'parameter=1' for
> a given machine type, or whether it will trigger deprecation messages.
I understand that libvirt is having trouble because there is no interface
to expose which topology levels the current machine supports. As a
workaround to eliminate the difficulties at the libvirt level, it's
ok for me.
But I believe deprecating the unsupported topology is necessary, so do
you think it's acceptable to include an interface to expose the supported
topology if it's going to be deprecated again later?
As above, I think that restrictions based on machine type, while nice and
simple, are incorrect long term. If we did impose restrictions based on
CPU model, then we could trivially expose this info to mgmt apps via the
existing mechanism for querying supported CPU models. Limiting based on
CPU model, however, has potentially greater back compat issues, though
it would be strictly more faithful to hardware.
With regards,
Daniel
--
|:
https://berrange.com -o-
https://www.flickr.com/photos/dberrange :|
|:
https://libvirt.org -o-
https://fstop138.berrange.com :|
|:
https://entangle-photo.org -o-
https://www.instagram.com/dberrange :|