On Fri, Mar 11, 2022 at 04:31:49PM +0100, Christian Borntraeger wrote:
> On 11.03.22 at 16:24, Christian Borntraeger wrote:
>
>
> On 11.03.22 at 15:56, Daniel P. Berrangé wrote:
> > On Fri, Mar 11, 2022 at 03:52:57PM +0100, Christian Borntraeger wrote:
> > >
> > >
> > > On 11.03.22 at 14:08, Daniel P. Berrangé wrote:
> > > > On Fri, Mar 11, 2022 at 12:37:46PM +0000, Daniel P. Berrangé wrote:
> > > > > On Fri, Mar 11, 2022 at 01:12:35PM +0100, Christian Borntraeger wrote:
> > > > > >
> > > > > >
> > > > > > On 11.03.22 at 10:23, David Hildenbrand wrote:
> > > > > > > On 11.03.22 10:17, Daniel P. Berrangé wrote:
> > > > > > > > On Thu, Mar 10, 2022 at 11:17:38PM -0500, Collin Walling wrote:
> > > > > > > > > CPU models past gen16a will no longer support the csske
> > > > > > > > > feature. In order to secure migration of guests running on
> > > > > > > > > machines that still support this feature to machines that do
> > > > > > > > > not, let's disable csske in the host-model.
> > > > > > >
> > > > > > > Sorry to say, removing CPU features is a no-go when wanting to
> > > > > > > guarantee forward migration without taking care of CPU model
> > > > > > > details manually and simply using the host model. Self-made HW
> > > > > > > vendor problem.
> > > > > >
> > > > > > And this simply does not reflect reality. Intel and Power have
> > > > > > removed TX, for example. We can now sit back and please ourselves
> > > > > > with how we live in our world of dreams. Or we can try to define
> > > > > > an interface that deals with reality and actually solves problems.
> > > > >
> > > > > This proposal wouldn't have helped in the case of Intel removing
> > > > > TSX, because it was removed without prior warning in the middle
> > > > > of the product lifecycle. At that time there were already millions
> > > > > of VMs in existence using the removed feature.
> > > > >
> > > > > > > > The problem scenario you describe is the intended semantics
> > > > > > > > of host-model though. It enables all features available in
> > > > > > > > the host that you launched on. It lets you live migrate to a
> > > > > > > > target host with the same, or a greater, number of features.
> > > > > > > > If the target has a greater number of features, it should
> > > > > > > > restrict the VM to the subset of features that were present
> > > > > > > > on the original source CPU. If the target has fewer features,
> > > > > > > > then you simply can't live migrate a VM using host-model.
> > > > > > > >
> > > > > > > > To get live migration in both directions across CPUs with
> > > > > > > > differing featuresets, the VM needs to be configured with a
> > > > > > > > named CPU model that is a subset of both, rather than
> > > > > > > > host-model.
> > > > > > >
> > > > > > > Right, and cpu-model-baseline does that job for you if you're
> > > > > > > too lazy to look up the proper model.
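For context, libvirt exposes this baselining through `virsh hypervisor-cpu-baseline`, which takes a file containing several `<cpu>` descriptions and computes a model runnable on all of the corresponding hosts. A rough sketch follows; the model and feature names, and the resulting baseline, are purely illustrative, not real output:

```xml
<!-- cpus.xml: CPU descriptions gathered from two hosts
     (model/feature names are hypothetical placeholders) -->
<cpu mode='custom'>
  <model>gen16a</model>
  <feature name='csske'/>
</cpu>
<cpu mode='custom'>
  <model>gen15a</model>
</cpu>
```

Running `virsh hypervisor-cpu-baseline cpus.xml` would then print a single `<cpu>` element describing the common subset, which can be used in the guest XML instead of host-model.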
> > > > > >
> > > > > > Yes, baseline will work, but this requires tooling like
> > > > > > OpenStack. The normal user will just use the default, and this
> > > > > > is host-model.
> > > > > >
> > > > > > Let me explain the use case for this feature: migration between
> > > > > > different versions.
> > > > > > baseline: always works
> > > > > > host-passthrough: you get what you deserve
> > > > > > default model: works
> > > > > > We have disabled CSSKE from our default models (-cpu gen15a will
> > > > > > not present csske), so that works as well.
> > > > > > host-model: also works for all machines that have csske.
> > > > > > Now: let's say gen17 will no longer support this. That means that
> > > > > > we cannot migrate host-model from gen16 or gen15, because those
> > > > > > will have csske.
> > > > > > What options do we have? If we disable csske in the host
> > > > > > capabilities, that would mean that a host compare against an XML
> > > > > > from an older QEMU would fail (even if you move from gen14 to
> > > > > > gen14). So this is not a good option.
> > > > > >
> > > > > > By disabling deprecated features ONLY for the _initial_ expansion
> > > > > > of host-model, but keeping them in the host capabilities, you can
> > > > > > migrate existing guests (with the feature), as we only disable in
> > > > > > the expansion; manually asking for the feature still works. AND it
> > > > > > will allow moving this instantiation of the guest to future
> > > > > > machines without the feature. Basically everything works.
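In domain XML terms, the proposal might look like the following sketch. The model name and the idea that csske would be dropped from the initial expansion come from the discussion above; the exact XML shape is illustrative, not an implemented behaviour:

```xml
<!-- What the user configures: -->
<cpu mode='host-model'/>

<!-- Hypothetical expansion at first guest start on a host that still
     supports csske: under the proposal, the deprecated feature would be
     left out of this one-time expansion even though the host has it. -->
<cpu mode='custom' match='exact'>
  <model fallback='forbid'>gen16a</model>
  <!-- csske deliberately absent; a user who really wants it could still
       request it explicitly, e.g.:
       <feature policy='require' name='csske'/> -->
</cpu>
```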
> > > > >
> > > > > The change you propose works functionally, but nonetheless it is
> > > > > changing the semantics of host-model. It is defined to expose all
> > > > > the features in the host, and the proposal changes that. If an app
> > > > > actually /wants/ to use the deprecated feature and it exists in the
> > > > > host, then host-model should be allowing that, as it does today.
> > > > >
> > > > > The problem scenario you describe is ultimately that OpenStack
> > > > > does not have a future-proof default CPU choice. Libvirt and QEMU
> > > > > provide a mechanism for them to pick other CPU models that would
> > > > > address the problem, but they're not using it. The challenge is
> > > > > that OpenStack's defaults are currently a zero-interaction thing.
> > > > >
> > > > > They could retain their zero-interaction defaults if, at install
> > > > > time, they queried the libvirt capabilities to learn which named
> > > > > CPU models are available, whereupon they could decide to use
> > > > > gen15a. The main challenge here is that the list of named CPU
> > > > > models is an unordered set, so it is hard to programmatically
> > > > > figure out which of the available named CPU models is the
> > > > > newest/best/recommended.
> > > > >
> > > > > IOW, what's missing is a way for apps to easily identify that
> > > > > 'gen15a' is the best CPU to use on the host, without needing human
> > > > > interaction.
> > > >
> > > > I think this could be solved with a change to query-cpu-definitions
> > > > in QEMU, to add an extra 'recommended: bool' attribute to the
> > > > CpuDefinitionInfo struct. This would be defined to be set for only
> > > > one CPU model in the list, and would reflect the recommended CPU
> > > > model given the current version of QEMU, kernel and hardware. Or we
> > > > could allow 'recommended' to be set for more than one CPU, provided
> > > > we define an explicit ordering of returned CPU models.
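Concretely, the proposal might surface in a QMP exchange like the following sketch. The 'recommended' key is the proposed addition, not an existing QEMU field; the model list is abbreviated, and the names and other field values are examples only:

```json
{ "execute": "query-cpu-definitions" }

{ "return": [
    { "name": "gen16a", "static": true, "migration-safe": true,
      "recommended": true },
    { "name": "gen15a", "static": true, "migration-safe": true },
    { "name": "host", "static": false, "migration-safe": false }
] }
```

A management app could then scan the returned list for the single entry with "recommended": true instead of guessing an ordering.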
> > >
> > > I like the recommended: bool attribute. It should provide what we need.
> > >
> > > Would you then also suggest using this for host-model, or only for a
> > > new type like "host-recommended"?
> >
> > Neither of those. Libvirt would simply report this attribute in
> > the information it exposes about CPUs.
> >
> > OpenStack would explicitly extract this and set it in the XML
> > for the guest, so that each guest's view of "recommended" is
> > fixed from the time that guest is first created, rather than
> > potentially changing on each later boot.
>
> OpenStack is one thing, but I think this flag would really be useful
> for instantiation without OpenStack.
> To make things more clear: I would like to have a way where a virsh start
> of a guest XML without a CPU model would work for migration in as many
> scenarios as possible. And if the default model (today host-model) would
> ignore features that are not recommended, this would be perfect.
Libvirt's ABI/API guarantee policy is to not change the semantics of
historical configuration. So anything we might do in this regard would
require an explicit XML addition compared to today. If someone makes a lot
of use of features like live migration, then they will be using a serious
mgmt app, not virsh.
Regards,
Daniel
--
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|