On Fri, Mar 11, 2022 at 04:24:22PM +0100, Christian Borntraeger wrote:
> On 11.03.22 at 15:56, Daniel P. Berrangé wrote:
> > On Fri, Mar 11, 2022 at 03:52:57PM +0100, Christian Borntraeger wrote:
> > >
> > >
> > > On 11.03.22 at 14:08, Daniel P. Berrangé wrote:
> > > > On Fri, Mar 11, 2022 at 12:37:46PM +0000, Daniel P. Berrangé wrote:
> > > > > On Fri, Mar 11, 2022 at 01:12:35PM +0100, Christian Borntraeger wrote:
> > > > > >
> > > > > >
> > > > > > On 11.03.22 at 10:23, David Hildenbrand wrote:
> > > > > > > On 11.03.22 10:17, Daniel P. Berrangé wrote:
> > > > > > > > On Thu, Mar 10, 2022 at 11:17:38PM -0500, Collin Walling wrote:
> > > > > > > > > CPU models past gen16a will no longer support the csske feature. In
> > > > > > > > > order to secure migration of guests running on machines that still
> > > > > > > > > support this feature to machines that do not, let's disable csske
> > > > > > > > > in the host-model.
> > > > > > >
> > > > > > > Sorry to say, removing CPU features is a no-go when wanting to guarantee
> > > > > > > forward migration without taking care about CPU model details manually
> > > > > > > and simply using the host model. Self-made HW vendor problem.
> > > > > >
> > > > > > And this simply does not reflect reality. Intel and Power have removed TX,
> > > > > > for example. We can now sit back and please ourselves with how we live in
> > > > > > our world of dreams. Or we can try to define an interface that deals with
> > > > > > reality and actually solves problems.
> > > > >
> > > > > This proposal wouldn't have helped in the case of Intel removing
> > > > > TSX, because it was removed without prior warning in the middle
> > > > > of the product lifecycle. At that time there were already millions
> > > > > of VMs in existence using the removed feature.
> > > > >
> > > > > > > > The problem scenario you describe is the intended semantics of
> > > > > > > > host-model though. It enables all features available in the host
> > > > > > > > that you launched on. It lets you live migrate to a target host
> > > > > > > > with the same, or a greater, number of features. If the target has
> > > > > > > > a greater number of features, it should restrict the VM to the
> > > > > > > > subset of features that were present on the original source CPU.
> > > > > > > > If the target has fewer features, then you simply can't live
> > > > > > > > migrate a VM using host-model.
> > > > > > > >
> > > > > > > > To get live migration in both directions across CPUs with differing
> > > > > > > > featuresets, the VM needs to be configured with a named CPU
> > > > > > > > model that is a subset of both, rather than host-model.
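> > > > > > > >
> > > > > > > > For example, rather than host-model, such a guest would be pinned to a
> > > > > > > > named model that both hosts support, roughly along these lines (the
> > > > > > > > gen15a name and the attribute values are just one plausible choice):
> > > > > > > >
> > > > > > > >   <!-- host-model: everything the source host has -->
> > > > > > > >   <cpu mode='host-model'/>
> > > > > > > >
> > > > > > > >   <!-- pinned named model supported by both hosts -->
> > > > > > > >   <cpu mode='custom' match='exact'>
> > > > > > > >     <model fallback='forbid'>gen15a</model>
> > > > > > > >   </cpu>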
> > > > > > >
> > > > > > > Right, and cpu-model-baseline does that job for you if you're too lazy
> > > > > > > to look up the proper model.
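> > > > > > >
> > > > > > > On s390x that corresponds to the query-cpu-model-baseline QMP command;
> > > > > > > a request roughly like this (model names taken from this thread) returns
> > > > > > > a model that can run on both machines:
> > > > > > >
> > > > > > >   { "execute": "query-cpu-model-baseline",
> > > > > > >     "arguments": { "modela": { "name": "gen16a" },
> > > > > > >                    "modelb": { "name": "gen15a" } } }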
> > > > > >
> > > > > > Yes, baseline will work, but this requires tooling like OpenStack. The
> > > > > > normal user will just use the default, and this is host-model.
> > > > > >
> > > > > > Let me explain the use case for this feature. Migration between different
> > > > > > versions:
> > > > > > baseline: always works
> > > > > > host-passthrough: you get what you deserve
> > > > > > default model: works
> > > > > > We have disabled CSSKE from our default models (-cpu gen15a will not
> > > > > > present csske). So that works as well.
> > > > > > host-model: Also works for all machines that have csske.
> > > > > > Now: let's say gen17 will no longer support this. That means that we cannot
> > > > > > migrate host-model from gen16 or gen15, because those will have csske.
> > > > > > What options do we have? If we disable csske in the host capabilities, that
> > > > > > would mean that a host comparison against an XML from an older QEMU would
> > > > > > fail (even if you move from gen14 to gen14). So this is not a good option.
> > > > > >
> > > > > > By disabling deprecated features ONLY for the _initial_ expansion of
> > > > > > host-model, but keeping them in the host capabilities, you can migrate
> > > > > > existing guests (with the feature), as we only disable them in the
> > > > > > expansion, and manually asking for the feature still works. AND it allows
> > > > > > moving this instantiation of the guest to future machines without the
> > > > > > feature. Basically everything works.
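> > > > > >
> > > > > > Concretely, the idea is that an expansion of the host model no longer
> > > > > > lists the deprecated feature, while explicitly requesting it keeps
> > > > > > working; roughly:
> > > > > >
> > > > > >   { "execute": "query-cpu-model-expansion",
> > > > > >     "arguments": { "type": "static", "model": { "name": "host" } } }
> > > > > >   # with the change, "props" in the reply would not list "csske": true
> > > > > >
> > > > > >   # but explicitly asking for the feature still works:
> > > > > >   qemu-system-s390x -cpu gen15a,csske=on ...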
> > > > >
> > > > > The change you propose works functionally, but nonetheless it is
> > > > > changing the semantics of host-model. It is defined to expose all the
> > > > > features in the host, and the proposal changes that. If an app actually
> > > > > /wants/ to use the deprecated feature and it exists in the host, then
> > > > > host-model should be allowing that, as it does today.
> > > > >
> > > > > The problem scenario you describe is ultimately that OpenStack does
> > > > > not have a future-proof default CPU choice. Libvirt and QEMU provide
> > > > > a mechanism for them to pick other CPU models that would address the
> > > > > problem, but they're not using that. The challenge is that OpenStack
> > > > > defaults currently are a zero-interaction thing.
> > > > >
> > > > > They could retain their zero-interaction defaults if, at install time,
> > > > > they queried the libvirt capabilities to learn which named CPU models
> > > > > are available, whereupon they could decide to use gen15a. The main
> > > > > challenge here is that the list of named CPU models is an unordered
> > > > > set, so it is hard to programmatically figure out which of the
> > > > > available named CPU models is the newest/best/recommended.
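> > > > >
> > > > > The relevant data is already in the domain capabilities; trimmed, virsh
> > > > > domcapabilities on such a host reports something like this (the names
> > > > > and usable flags here are illustrative):
> > > > >
> > > > >   <cpu>
> > > > >     <mode name='host-model' supported='yes'>
> > > > >       <model fallback='forbid'>gen16a-base</model>
> > > > >       ...
> > > > >     </mode>
> > > > >     <mode name='custom' supported='yes'>
> > > > >       <model usable='yes'>gen15a</model>
> > > > >       <model usable='yes'>gen14</model>
> > > > >       ...
> > > > >     </mode>
> > > > >   </cpu>
> > > > >
> > > > > That shows what exists and what is usable on the host, but nothing in
> > > > > it says which model should be preferred.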
> > > > >
> > > > > IOW, what's missing is a way for apps to easily identify that 'gen15a'
> > > > > is the best CPU to use on the host, without needing human interaction.
> > > >
> > > > I think this could be solved with a change to query-cpu-definitions
> > > > in QEMU, to add an extra 'recommended: bool' attribute to the
> > > > CpuDefinitionInfo struct. This would be defined to be only set for
> > > > 1 CPU model in the list, and would reflect the recommended CPU model
> > > > given the current version of QEMU, kernel and hardware. Or we could
> > > > allow 'recommended' to be set for more than 1 CPU, provided we define
> > > > an explicit ordering of returned CPU models.
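> > > >
> > > > As a sketch, a query-cpu-definitions reply would then look roughly like
> > > > this; the 'recommended' member is the attribute proposed here, the other
> > > > fields already exist in CpuDefinitionInfo:
> > > >
> > > >   { "execute": "query-cpu-definitions" }
> > > >   { "return": [
> > > >       { "name": "gen16a", "typename": "gen16a-s390x-cpu",
> > > >         "migration-safe": true, "deprecated": false,
> > > >         "recommended": true },
> > > >       { "name": "gen15a", "typename": "gen15a-s390x-cpu",
> > > >         "migration-safe": true, "deprecated": false,
> > > >         "recommended": false },
> > > >       ... ] }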
> > >
> > > I like the recommended: bool attribute. It should provide what we need.
> > >
> > > Would you then also suggest to use this for host-model or only for a new
> > > type like "host-recommended"?
> >
> > Neither of those. Libvirt would simply report this attribute in
> > the information it exposes about CPUs.
> >
> > OpenStack would explicitly extract this and set it in the XML
> > for the guest, so that each guest's view of "recommended" is
> > fixed from the time that guest is first created, rather than
> > potentially changing on each later boot.
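> >
> > Roughly: libvirt would expose the flag in its CPU model listing (the
> > recommended attribute below is purely illustrative of that idea), and the
> > management app copies the chosen name into the guest XML once, at
> > creation time:
> >
> >   <!-- hypothetical model listing with the proposed flag -->
> >   <model usable='yes' recommended='yes'>gen15a</model>
> >
> >   <!-- written into the guest XML at creation, never recomputed -->
> >   <cpu mode='custom' match='exact'>
> >     <model fallback='forbid'>gen15a</model>
> >   </cpu>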
> OpenStack is one thing, but I think this flag would really be useful
> for instantiation without OpenStack.

Sure, any mgmt app using libvirt that provisions guests can use
this approach. I just mentioned openstack as that was what you
mentioned at the start of this thread.
Regards,
Daniel
--
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|