Re: Ways to deal with broken machine types

29 Mar 2021


      * Igor Mammedov (imammedo@redhat.com) wrote:
...
On Tue, 23 Mar 2021 17:40:36 +0000
Daniel P. Berrangé <berrange@redhat.com> wrote:
...
On Tue, Mar 23, 2021 at 05:54:47PM +0100, Igor Mammedov wrote:
...
Let me hijack this thread for beyond this case scope.
I agree that for this particular bug we've done all we could, but
there is broader issue to discuss here.
We have machine versions to deal with hw compatibility issues and that covers most of the cases,
but occasionally we notice a problem well after the release(s),
so users may be stuck with a broken VM and need to manually fix the configuration (and/or VM).
Figuring out what's wrong and how to fix it is far from trivial. So let's discuss if we
can help to ease this pain. Yes, it will be late for the first victims, but it's still
better than never.
To summarize the problem situation
 - We rely on a machine type version to encode a precise guest ABI.
 - Due to a bug, we are in a situation where the same machine type
   encodes two distinct guest ABIs due to a mistake introduced
   between QEMU N-2 and N-1
 - We want to fix the bug in QEMU N
 - For incoming migration there is no way to distinguish between
   the ABIs used in N-2 and N-1, to pick the right one
So we're left with an unwinnable problem:
 - Not fixing the bug =>
       a) users migrating N-2 to N-1 have ABI change
       b) users migrating N-2 to N have ABI change
       c) users migrating N-1 to N are fine
   No mitigation for (a) or (b)
 - Fixing the bug =>
       a) users migrating N-2 to N-1 have ABI change
       b) users migrating N-2 to N are fine
       c) users migrating N-1 to N have ABI change
   Bad situations (a) and (c) are mitigated by
   backporting fix to N-1-stable too.
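
The two outcome lists above can be reproduced with a small sketch. It only models the reasoning, not anything in QEMU: each release is assigned an opaque ABI label ("old" and "new" are made-up names), and a migration path has an ABI change when the labels at its two ends differ.

```python
from itertools import combinations


def abi_changes(abi_by_release):
    """Map every forward migration path (src, dst) to True if the ABI changes.

    abi_by_release must be ordered from oldest to newest release.
    """
    releases = list(abi_by_release)
    return {
        (src, dst): abi_by_release[src] != abi_by_release[dst]
        for src, dst in combinations(releases, 2)
    }


# "old" is the ABI shipped in N-2, "new" the one introduced by mistake in N-1.
not_fixed = abi_changes({"N-2": "old", "N-1": "new", "N": "new"})
fixed = abi_changes({"N-2": "old", "N-1": "new", "N": "old"})

for name, result in (("not fixed", not_fixed), ("fixed", fixed)):
    for (src, dst), changed in result.items():
        print(f"{name}: {src} -> {dst}: {'ABI change' if changed else 'fine'}")
```

Whichever way QEMU N goes, two of the three paths are affected. The fix only moves the breakage from N-2 -> N to N-1 -> N, which is why the backport to N-1-stable matters.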
Generally we have preferred to fix the bug, because we have
usually identified them fairly quickly after release, and
backporting the fix to stable has been sufficient mitigation
against ill effects. Basically the people left broken are a
relatively small set out of the total userbase.
The real challenge arises when we are slow to identify the
problem, such that we have a large number of people impacted.
...
I'll try to sum up idea Michael suggested (here comes my unorganized brain-dump),
1. We can keep in the VM's config the QEMU version it was created on
   and as a minimum warn the user with a pointer to known issues if the version in
   the config mismatches the version of the QEMU actually used, with a knob to silence
   it for a particular mismatch.
When an issue becomes known and resolved, we know for sure how and what
changed and can embed instructions on what options to use for fixing up the VM's
config to preserve the old HW config depending on the QEMU version the VM was installed on.
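
A minimal sketch of idea 1, assuming the management layer stores the creating QEMU version next to the VM config. The function name, the shape of the known-issues table, and the silencing set are all hypothetical; nothing like this exists in QEMU or libvirt today.

```python
def version_warnings(created_on, running_on, known_issues, silenced=frozenset()):
    """Return warning lines for a VM created on one QEMU and run on another.

    known_issues: {(created_on, running_on): [free-text pointers]} (hypothetical)
    silenced:     set of (created_on, running_on) pairs the user chose to ignore
    """
    pair = (created_on, running_on)
    if created_on == running_on or pair in silenced:
        return []
    header = f"VM was created on QEMU {created_on} but is running on QEMU {running_on}"
    return [header] + [f"known issue: {note}" for note in known_issues.get(pair, [])]


# Placeholder data: the versions and the issue text are invented for illustration.
issues = {("N-2", "N"): ["see release notes for machine type fixups"]}

for line in version_warnings("N-2", "N", issues):
    print(line)
```

The knob to silence a particular mismatch maps onto the `silenced` set: passing `{("N-2", "N")}` suppresses the warning for that pair only.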
...
some more ideas:
   2. let the mgmt layer keep a fixup list and apply the fixups to the config if available
       (user would need to upgrade mgmt or update the fixup list somehow)
   3. let the mgmt layer pass the VM's QEMU version to the currently used QEMU, so
      that QEMU could maintain and apply fixups based on QEMU version + machine type.
      The user will have to upgrade to a newer QEMU to get/use new fixups.
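
Ideas 2 and 3 differ in who owns the list, but the lookup itself could look the same in both. The sketch below assumes a table keyed on (QEMU version the VM was created on, machine type); the table contents, the config layout and the rule that explicit user settings win over fixups are assumptions made for illustration.

```python
# Hypothetical fixup table: "foo"/"bar" are placeholder property names.
FIXUPS = {
    ("N-1", "pc-i440fx-5.0"): {"foo": "bar"},
}


def apply_fixups(config, fixups=FIXUPS):
    """Return a copy of config with fixup properties merged in.

    Properties the user set explicitly take precedence over fixups.
    """
    key = (config["created_on"], config["machine"])
    overrides = fixups.get(key, {})
    props = {**overrides, **config.get("props", {})}
    return {**config, "props": props}


vm = {"created_on": "N-1", "machine": "pc-i440fx-5.0", "props": {}}
print(apply_fixups(vm)["props"])
```

Keying on the QEMU version is exactly what turns the machine type check from an equality comparison into a matrix, as pointed out in the reply that follows.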
The nice thing about machine type versioning is that we are treating the
versions as opaque strings which represent a specific ABI, regardless of
the QEMU version. This means that even if distros backport fixes for bugs
or even new features, the machine type compatibility check remains a
simple equality comparison.
As soon as you introduce the QEMU version though, we have created a
large matrix for compatibility. This matrix is expanded if a distro
chooses to backport fixes for any of the machine type bugs to their
stable streams. This can get particularly expensive when there are
multiple streams a distro is maintaining.
*IF* the original N-1 qemu has a property that could be queried by
the mgmt app to identify a machine type bug, then we could potentially
apply a fixup automatically.
eg query-machines command in QEMU version N could report against
"pc-i440fx-5.0", that there was a regression fix that has to be
applied if property "foo" had value "bar".
Now, the mgmt app wants to migrate from QEMU N-2 or N-1 to QEMU N.
It can query the value of "foo" on the source QEMU with qom-get.
It now knows whether it has to override this property "foo" when
spawning QEMU N on the target host.
Of course this doesn't help us if neither N-1 nor N-2 QEMU had a
property that can be queried to identify the bug - ie if the
property in question was newly introduced in QEMU N to fix the
bug.
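
The query-based flow above can be sketched from the management side as follows. The QMP commands query-machines and qom-get exist, but the regression-fix metadata is only a proposal in this thread, so its shape here (`path`, `property`, `buggy_value`) is invented, and the source QEMU is replaced by a stub instead of a real QMP connection.

```python
def overrides_for_target(machine, regression_fixes, source_qom_get):
    """Work out which properties must be overridden when spawning the target QEMU.

    regression_fixes: {machine: [{"path", "property", "buggy_value"}]} (hypothetical)
    source_qom_get:   callable(path, property) -> value, raising KeyError if the
                      source QEMU has no such property
    """
    overrides = {}
    for fix in regression_fixes.get(machine, []):
        try:
            value = source_qom_get(fix["path"], fix["property"])
        except KeyError:
            continue  # the property is not queryable on the source, bug undetectable
        if value == fix["buggy_value"]:
            overrides[fix["property"]] = value  # keep what the guest already sees
    return overrides


# Stub standing in for qom-get on the source; path and values are placeholders.
source_state = {("/machine", "foo"): "bar"}


def stub_qom_get(path, prop):
    return source_state[(path, prop)]


fixes = {"pc-i440fx-5.0": [{"path": "/machine", "property": "foo", "buggy_value": "bar"}]}
print(overrides_for_target("pc-i440fx-5.0", fixes, stub_qom_get))
```

The `except KeyError` branch is the limitation described above: if the property was only introduced in QEMU N, there is nothing to query on the source and no override can be derived.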
...
In my opinion both would lead to explosion of 'possibly needed' properties for each
change we introduce in hw/firmware(read ACPI) and very possibly a lot of conditional
branches in QEMU code. And I'm afraid it will become hard to maintain QEMU =>
more bugs in future.
Also it will lead to explosion of test matrix for downstreams who care about testing.
If we proactively gate changes on properties, we can just update fixup lists in mgmt,
without need to update QEMU (aka Insite rules) at a cost of complexity on QEMU side.
Alternatively we can be conservative in spawning new properties, that means creating
them only when issue is fixed and require users to update QEMU, so that fixups could
be applied to VM.
Feel free to shoot the messenger down or suggest ways how we can deal with the problem.
The best solution is of course to not have introduced the ABI change in
the first place. We have lots of testing, but upstream at least, I don't
think we have anything that is explicitly recording the ABI associated
with each machine type and validating that it hasn't changed. We rely on
the developers to follow the coding practices wrt setting machine type
defaults for back compat, and while we're good, we inevitably screw up
every now & then.
Downstreams do have some of this ABI testing - several problems like the
one we have there, have been identified when RHEL downstream QE did
migration tests and found a change in RHEL machine types, which then
was traced back to upstream.
I feel like we need some standard tool which can be run inside a VM
that dumps all the possible ABI relevant information about the virtual
machine in a nice data format.
We would have to run this for each machine type, and save the
results to git immediately after release. Then for every change to
master, we would have to run the test again for every historic
machine type version and compare to the recorded ABI record.
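
A rough sketch of the comparison step, assuming the in-guest tool dumps a flat JSON object per machine type. No such tool exists yet, so the keys below are made up; the only point is that the check against the record saved in git reduces to a diff of two dictionaries.

```python
import json


def abi_diff(recorded, current):
    """Return {key: (recorded_value, current_value)} for every key that differs."""
    keys = sorted(set(recorded) | set(current))
    return {
        key: (recorded.get(key), current.get(key))
        for key in keys
        if recorded.get(key) != current.get(key)
    }


# Invented ABI records: one saved at release time, one dumped from current master.
recorded = json.loads('{"acpi_tables_digest": "aaaa", "pci_slots": [0, 1, 2]}')
current = json.loads('{"acpi_tables_digest": "bbbb", "pci_slots": [0, 1, 2]}')

changes = abi_diff(recorded, current)
for key, (old, new) in changes.items():
    print(f"ABI change in {key}: {old!r} -> {new!r}")
```

A CI job would run this for every historic machine type and fail when `changes` is non-empty.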
Like Michael said, we don't know that something is broken until it's
too late, and in this particular case it's not even broken (strictly speaking
the change is correct) and is not even a part of ABI (it's ACPI code, i.e. firmware).
The problem is in the way virtio drivers enumerate devices, which makes the same
device appear as a new one. We can work around the issue on the hypervisor side so the user
won't lose network connectivity or would be able to boot the guest after a QEMU upgrade.
We can suggest users re-install their Windows (a method that fixes almost all Win issues)
or try to make it painless for the user in these rare cases, by upgrading to a
new QEMU (or fixed stable) which has the workaround, so only the first few have to suffer.
(I think downstreams would even more benefit from this, there were similar problems
there before).
Yes, it surely will expand the test matrix, but it should be limited to the specific cases
we implemented fixups for.
My suggestion from a long while ago (which no one liked) was to
include the source qemu version and then have a quirks list of things to
fix up.

Dave
...
...
Regards,
Daniel
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK