Re: device compatibility interface for live migration with assigned devices

Tuesday, 14 July 2020

On Tue, 14 Jul 2020 13:33:24 +0100
Sean Mooney <smooney(a)redhat.com> wrote:

...
 On Tue, 2020-07-14 at 11:21 +0100, Daniel P. Berrangé wrote:
 > On Tue, Jul 14, 2020 at 07:29:57AM +0800, Yan Zhao wrote:  
 > > hi folks,
 > > we are defining a device migration compatibility interface that helps upper
 > > layer stack like openstack/ovirt/libvirt to check if two devices are
 > > live migration compatible.
 > > The "devices" here could be MDEVs, physical devices, or hybrid of the two.
 > > e.g. we could use it to check whether
 > > - a src MDEV can migrate to a target MDEV,  
 mdev live migration is completely possible to do, but I agree with Dan Berrangé's
 comments. from the point of view of openstack integration I don't see calling out to a
 vendor-specific tool to be an acceptable 
As I replied to Dan, I'm hoping Yan was referring more to vendor
specific knowledge rather than actual tools.

...
 solution for device compatibility checking. the sys filesystem
 that describes the mdevs that can be created should also
 contain the relevant information, such
 that nova could integrate it via the libvirt xml representation or directly retrieve
 the info from sysfs.
 > > - a src VF in SRIOV can migrate to a target VF in SRIOV,  
 so vf to vf migration is not possible in the general case, as there is no standardised
 way to transfer the device state as part of the sriov specs produced by the pci-sig.
 as such there is no vendor-neutral way to support sriov live migration.  
We're not talking about a general case, we're talking about physical
devices which have vfio wrappers or hooks with device specific
knowledge in order to support the vfio migration interface.  The point
is that a discussion around vfio device migration cannot be limited to
mdev devices.

...
 > > - a src MDEV can migrate to a target VF in SRIOV.  
 that also makes this unviable
 > >   (e.g. SIOV/SRIOV backward compatibility case)
 > > 
 > > The upper layer stack could use this interface as the last step to check
 > > if one device is able to migrate to another device before triggering a real
 > > live migration procedure.  
 well, actually that is already too late. ideally we would want to do this compatibility
 check much sooner to avoid the migration failing. in an openstack environment at least,
 by the time we invoke libvirt (assuming you're using the libvirt driver) to do the
 migration we have already finished scheduling the instance to the new host. if we do
 the compatibility check at this point and it fails, then the live migration is aborted
 and will not be retried. these types of late checks lead to a poor user experience:
 unless you check the migration details it basically looks like the migration was
 ignored, as it starts to migrate and then continues running on the original host.

 when using generic pci passthrough with openstack, the pci alias is intended to
 reference a single vendor id/product id, so you will have 1+ alias for each type of
 device. that allows openstack to schedule based on the availability of a compatible
 device, because we track inventories of pci devices and can query that when selecting
 a host.

 if we were to support mdev live migration in the future we would want to take the same
 declarative approach:
 1 introspect the capabilities of the devices we manage
 2 create inventories of the allocatable devices and their capabilities
 3 schedule the instance to a host based on the device-type/capabilities and claim it
   atomically to prevent races
 4 have the lower level hypervisors do additional validation if needed pre live migration.
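
A minimal Python sketch of steps 1 to 3 above, for illustration only: hosts report their
allocatable devices ahead of time, and the scheduler picks a target and claims a device
under a lock so two requests cannot take the same one. The `DeviceInventory` class, its
methods and the host names are hypothetical; none of this is nova's or libvirt's API.

```python
import threading


class DeviceInventory:
    """Hypothetical scheduler-side inventory of allocatable devices."""

    def __init__(self):
        # (host, device_type) -> number of free devices of that type
        self._free = {}
        self._lock = threading.Lock()

    def report(self, host, device_type, count):
        """Steps 1 and 2: a host publishes what it introspected locally."""
        with self._lock:
            self._free[(host, device_type)] = count

    def claim(self, device_type, exclude_host):
        """Step 3: pick a host with a free matching device and claim it.

        Returns the chosen host name, or None if nothing suitable is free.
        The lock makes the check-and-decrement atomic.
        """
        with self._lock:
            for (host, dev_type), count in self._free.items():
                if host != exclude_host and dev_type == device_type and count > 0:
                    self._free[(host, dev_type)] = count - 1
                    return host
        return None


inventory = DeviceInventory()
inventory.report("host-a", "i915-GVTg_V5_8", 1)
inventory.report("host-b", "i915-GVTg_V5_8", 1)
target = inventory.claim("i915-GVTg_V5_8", exclude_host="host-a")
```

The point of the sketch is that the decision is made from data the scheduler already
holds, without contacting the compute hosts at scheduling time.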

 this proposal seems to be targeting extending step 4, whereas ideally we should focus
 on providing the info that would be relevant in step 1, preferably in a vendor neutral
 way via a kernel interface like /sys.

I think this is reading a whole lot into the phrase "last step".  We
want to make the information available for a management engine to
consume as needed to make informed decisions regarding likely
compatible target devices.

...
 > > we are not sure if this interface is of value or help to you. please don't
 > > hesitate to drop your valuable comments.
 > > 
 > > 
 > > (1) interface definition
 > > The interface is defined as follows:
 > > 
 > >              __    userspace
 > >               /\              \
 > >              /                 \write
 > >             / read              \
 > >    ________/__________       ___\|/_____________
 > >   | migration_version |     | migration_version |-->check migration
 > >   ---------------------     ---------------------   compatibility
 > >      device A                    device B
 > > 
 > > 
 > > a device attribute named migration_version is defined under each device's
 > > sysfs node. e.g. (/sys/bus/pci/devices/0000\:00\:02.0/$mdev_UUID/migration_version).  
 this might be useful as we could tag the inventory with the migration version and only
 migrate to devices with the same version 
Is cross version compatibility something that you'd consider using?

...
 > > userspace tools read the migration_version as a string from the source device,
 > > and write it to the migration_version sysfs attribute in the target device.  
 this would not be useful as the scheduler cannot directly connect to the compute host,
 and even if it could it would be extremely slow to do this for 1000s of hosts and
 potentially multiple devices per host. 
Seems similar to Dan's requirement, looks like the 'read for version,
write for compatibility' test idea isn't really viable.

...
 > > 
 > > The userspace should treat ANY of below conditions as two devices not compatible:
 > > - any one of the two devices does not have a migration_version attribute
 > > - error when reading from migration_version attribute of one device
 > > - error when writing migration_version string of one device to
 > >   migration_version attribute of the other device
 > > 
 > > The string read from migration_version attribute is defined by device vendor
 > > driver and is completely opaque to the userspace.  
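
A minimal Python sketch of the read-then-write test as the proposal describes it, treating
a missing attribute, a read error or a write error as "not compatible". The function name
and its arguments are hypothetical; only the `migration_version` attribute name comes from
the proposal, and the version string is passed through without being interpreted.

```python
import os


def is_migration_compatible(src_dev_dir, dst_dev_dir):
    """Return True only if the target accepts the source's migration_version.

    src_dev_dir and dst_dev_dir are the sysfs directories of the two devices
    (assumed to be reachable from wherever this runs).
    """
    src_attr = os.path.join(src_dev_dir, "migration_version")
    dst_attr = os.path.join(dst_dev_dir, "migration_version")

    # Either device lacking the attribute means "not compatible".
    if not (os.path.exists(src_attr) and os.path.exists(dst_attr)):
        return False

    try:
        # An error while reading from the source means "not compatible".
        with open(src_attr, "rb") as src:
            version = src.read()
        # The target driver signals incompatibility by failing the write.
        # Unbuffered, so the string reaches the driver in one write call.
        with open(dst_attr, "wb", buffering=0) as dst:
            dst.write(version)
    except OSError:
        return False
    return True
```

Note that this only works where both sysfs trees are visible to the caller, which is one
of the objections raised in the replies.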
 opaque vendor specific strings that higher level orchestrators have to pass from host
 to host and can't reason about are evil. when allowed they proliferate and
 make any idea of a vendor neutral abstraction and interoperability between systems
 impossible to reason about. that said, there is a way to make it opaque but still useful
 to userspace. see below
 > > for an Intel vGPU, string format can be defined like
 > > "parent device PCI ID" + "version of gvt driver" + "mdev type" + "aggregator count".
 > > 
 > > for an NVMe VF connecting to a remote storage. it could be
 > > "PCI ID" + "driver version" + "configured remote storage URL"
 > > 
 > > for a QAT VF, it may be
 > > "PCI ID" + "driver version" + "supported encryption set".
 > > 
 > > (to avoid namespace conflicts between vendors, we may prefix a driver name to
 > > each migration_version string. e.g. i915-v1-8086-591d-i915-GVTg_V5_8-1)  
 honestly I would much prefer if the version string was just a semver string,
 e.g. {major}.{minor}.{bugfix} 

 if you do a driver/firmware update and break compatibility with an older version, bump
 the major version.

 if you add an optional feature that does not break backwards compatibility if you migrate
 an older instance to the new host, then just bump the minor/feature number.

 if you have a fix for a bug that does not change the feature set or compatibility
 backwards or forwards, then bump the bugfix number

 then the check is as simple as 
 1.) is the mdev type the same
 2.) is the major version the same
 3.) am I going from the same version to same version or same version to newer version

 if all 3 are true we can migrate.
 e.g. 
 2.0.1 -> 2.1.1 (ok: same major version and migrating from older feature release to
                 newer feature release)
 2.1.1 -> 2.0.1 (not ok: same major version and migrating from new feature release to
                 old feature release may be incompatible)
 2.0.0 -> 3.0.0 (not ok: changing major version)
 2.0.1 -> 2.0.0 (ok: same major and minor version, all bugfixes in the same minor release
                 should be compatible) 
What's the value of the bugfix field in this scheme?
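
For reference, a minimal Python sketch of the three-step check Sean proposes, assuming the
version is a plain "{major}.{minor}.{bugfix}" string. The function and type names are
hypothetical. As written, the bugfix field is parsed but never takes part in the decision,
which is what the question above is getting at.

```python
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int
    bugfix: int


def parse_version(text):
    """Parse a '{major}.{minor}.{bugfix}' string into a Version."""
    major, minor, bugfix = (int(part) for part in text.strip().split("."))
    return Version(major, minor, bugfix)


def can_migrate(src_type, src_version, dst_type, dst_version):
    """Apply the three checks: same mdev type, same major, minor not older."""
    src = parse_version(src_version)
    dst = parse_version(dst_version)
    return (
        src_type == dst_type          # 1.) is the mdev type the same
        and src.major == dst.major    # 2.) is the major version the same
        and src.minor <= dst.minor    # 3.) same or newer feature release
    )


MDEV_TYPE = "i915-GVTg_V5_8"  # type name taken from the example string in the proposal
results = [
    can_migrate(MDEV_TYPE, "2.0.1", MDEV_TYPE, "2.1.1"),
    can_migrate(MDEV_TYPE, "2.1.1", MDEV_TYPE, "2.0.1"),
    can_migrate(MDEV_TYPE, "2.0.0", MDEV_TYPE, "3.0.0"),
    can_migrate(MDEV_TYPE, "2.0.1", MDEV_TYPE, "2.0.0"),
]
```

The four calls correspond to the four examples listed above, in the same order.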

The simplicity is good, but is it too simple?  It's not immediately
clear to me whether all features can be hidden behind a minor version.
For instance, consider an mdev device that supports this notion of
aggregation, which is proposed as a solution to the problem that
physical hardware might support lots and lots of assignable interfaces
which can be combined into arbitrary sets for mdev devices, making it
impractical to expose an mdev type for every possible enumeration of
assignable interfaces within a device.  We therefore expose a base type
where the aggregation is built later.  This essentially puts us in a
scenario where even within an mdev type running on the same driver,
there are devices that are not directly compatible with each other.

...
 we don't need vendors to re-encode the driver name or vendor id and product id in the
 string. that info is already available both to the device driver and to userspace via
 /sys; we just need to know if versions of the same mdev are compatible, so a simple
 semver version string, which is well known in the software world at least, is a clean
 abstraction we can reuse. 
This presumes there's no cross device migration.  An mdev type can only
be migrated to the same mdev type, all of the devices within that type
have some base compatibility, a physical device can only be migrated to
the same physical device.  In the latter case what defines the type?  If
it's a PCI device, is it only vendor:device IDs?  What about revision?
What about subsystem IDs?  What about possibly an onboard ROM or
internal firmware?  The information may be available, but which things
are relevant to migration?  We already see desires to allow migration
between physical and mdev, but also to expose mdev types that might be
composable to be compatible with other types.  Thanks,

Alex
