On Fri, Feb 16, 2024 at 11:30:09 +0000, Daniel P. Berrangé wrote:
> On Wed, Feb 14, 2024 at 02:16:58PM +0100, Jiri Denemark wrote:
> > On Wed, Feb 14, 2024 at 13:23:09 +0100, Tim Wiederhake wrote:
> > > My knowledge about migration is limited, hence I am hesitant to make
> > > factual claims. That being said, my understanding is that by requesting
> > > e.g. a Skylake-Client cpu 'mpx' was never actually enabled in the VM as
> > > qemu's version of the same cpu model did not include that feature.
> >
> > Well, if our CPU model has mpx then QEMU must have had it too at some
> > point. The question is whether it was ever released. And also whether
> > the feature could have ever been enabled in any running domain or trying
> > to enable it caused the domain to fail to start anyway.
>
> QEMU CPUs originally had 'mpx', and it was later turned off by
> the 4.0 machine type versions.

Right, by commit ecb85fe48cacb2f8740186e81f2f38a2e02bd963. Machine types
3.1 and older will still get the CPU models with mpx enabled, if I
understand the compat code correctly.

I tried to analyze all possible combinations during migration and came
up with the following table. To make it reasonably small I use some
definitions:

- "CPU def" is the relevant part of the CPU definition transferred as
  part of the domain XML during migration from the source to the
  destination host. In other words, it describes the actual virtual CPU
  created by QEMU on the source host.
- "$M" stands for any model affected by the change in this patch.
- "+mpx" means mpx was explicitly enabled.
- "-mpx" means mpx was explicitly disabled.
- "old" means a libvirt release without this patch applied.
- "new" means libvirt with this patch applied.

CPU def  | src | dest | result
---------|-----|------|------------------------------------------------
anything | old | old  |\ no issue, both sides share
anything | new | new  |/ the same definition of $M
---------|-----|------|------------------------------------------------
$M +mpx  | old | new  |\  the definition of $M is irrelevant because
$M +mpx  | new | old  | \ mpx is explicitly disabled or enabled;
$M -mpx  | old | new  | / the destination knows what the virtual CPU
$M -mpx  | new | old  |/  looks like
---------|-----|------|------------------------------------------------
$M       | old | new  | migration gets aborted, see below the table
---------|-----|------|------------------------------------------------
$M       | new | old  | this will never happen; mpx is marked as
         |     |      | "removed" in libvirt's CPU map, which means
         |     |      | new libvirt will always explicitly mention
         |     |      | the state of mpx in the definition
---------|-----|------|------------------------------------------------

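For anyone who prefers code to tables, the rows can be written down as a
small lookup function. This is only an illustrative sketch in Python, not
anything from libvirt: the function name, its arguments and the returned
strings are all made up for this mail.

```python
def migration_outcome(mpx_in_def, src, dest):
    """Classify one row of the table (hypothetical helper, not libvirt code).

    mpx_in_def: "+mpx", "-mpx", or None when the CPU def is just $M
    src, dest:  "old" or "new" libvirt on the source/destination host
    """
    if src == dest:
        # Both sides share the same definition of $M.
        return "ok: same definition of $M on both sides"
    if mpx_in_def in ("+mpx", "-mpx"):
        # The state of mpx is spelled out, so the definition of $M
        # does not matter.
        return "ok: mpx state is explicit"
    if src == "old" and dest == "new":
        # The problematic row of the table.
        return "problem: migration gets aborted"
    # src == "new", dest == "old": new libvirt always mentions mpx
    # explicitly, so a bare $M is never sent in this direction.
    return "impossible: new libvirt always states mpx explicitly"
```

For example, migration_outcome(None, "old", "new") selects the one row
that needs the longer explanation.
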
The only problematic case is migrating a domain that uses $M without
explicitly enabling or disabling mpx from an old libvirt to a new one.
If QEMU actually enables mpx when asked for $M, i.e., for machine types
3.1 and older running on a host that supports mpx, the old libvirt
thinks $M already contains mpx and does not explicitly mention it in
the XML.

The new libvirt then instructs QEMU to start a machine with CPU model $M
(without explicitly mentioning mpx) and the same old machine type as
used on the source. QEMU on the destination host therefore uses the
compatibility code, which adds mpx to $M and enables it. But the new
libvirt on the destination host has mpx already removed from CPU model
$M, so when it checks which features QEMU enabled or disabled, it
complains about the unexpected mpx=on and aborts the migration.

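The mismatch described above can be reproduced with a toy model. The
Python sketch below is heavily simplified and entirely hypothetical: it
tracks only mpx, and none of the names correspond to real libvirt or
QEMU functions. It merely assumes, as described above, that the
destination compares what it expects from the CPU def with what QEMU
actually enabled.

```python
# What each libvirt version believes $M contains (only mpx is tracked).
LIBVIRT_MODEL = {
    "old": {"mpx"},  # without the patch: $M includes mpx
    "new": set(),    # with the patch: mpx removed from $M
}


def apply_explicit(features, explicit):
    """Apply explicit +feature / -feature requests from the CPU def."""
    result = set(features)
    for name, enabled in explicit.items():
        if enabled:
            result.add(name)
        else:
            result.discard(name)
    return result


def qemu_actual(old_machine_type, host_has_mpx, explicit):
    """Features QEMU ends up enabling for $M on the destination."""
    base = set()
    if old_machine_type and host_has_mpx:
        # Compat code for machine types 3.1 and older adds mpx to $M.
        base.add("mpx")
    return apply_explicit(base, explicit)


def destination_accepts(dest_libvirt, explicit, old_machine_type, host_has_mpx):
    """True if the destination's expectation matches QEMU's result."""
    expected = apply_explicit(LIBVIRT_MODEL[dest_libvirt], explicit)
    actual = qemu_actual(old_machine_type, host_has_mpx, explicit)
    return expected == actual  # a mismatch means the migration is aborted
```

With a bare $M, an old machine type and a host supporting mpx,
destination_accepts("new", {}, True, True) is False, which is the abort
described above. Passing {"mpx": True} or {"mpx": False} as the explicit
part makes the check succeed again.
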
That said, we can only safely remove CPU features from existing
(released) CPU models if the problematic case can never happen. For
example, this holds if the removed feature is completely broken and it's
impossible to start a domain using it, or if QEMU never enables the
feature without being explicitly asked to do so.

On the other hand, while thinking about all this I got an idea for how
we could make removing features safe even in the problematic case. But I
need to think about it more first. I'll either send a patch for it after
I'm done or I'll reply here that the idea was wrong :-)

Jirka