Re: [libvirt] [PATCH 0/3] sample: vfio mdev display devices.

26 Apr 2018

      * Alex Williamson (alex.williamson@redhat.com) wrote:
...
On Wed, 25 Apr 2018 21:00:39 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:
...
On 4/25/2018 4:29 AM, Alex Williamson wrote:
...
On Wed, 25 Apr 2018 01:20:08 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:
...
On 4/24/2018 3:10 AM, Alex Williamson wrote:
...
On Wed, 18 Apr 2018 12:31:53 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:
...
On Mon,  9 Apr 2018 12:35:10 +0200
Gerd Hoffmann <kraxel@redhat.com> wrote:
> This little series adds three drivers, for demo-ing and testing vfio
> display interface code.  There is one mdev device for each interface
> type (mdpy.ko for region and mbochs.ko for dmabuf).
Erik Skultety brought up a good question today regarding how libvirt is
meant to handle these different flavors of display interfaces and
knowing whether a given mdev device has display support at all.  It
seems that we cannot simply use the default display=auto because
libvirt needs to specifically configure gl support for a dmabuf type
interface versus not having such a requirement for a region interface,
perhaps even removing the emulated graphics in some cases (though I
don't think we have boot graphics through either solution yet).
Additionally, GVT-g seems to need the x-igd-opregion support
enabled(?), which is a non-starter for libvirt as it's an experimental
option!
Currently the only way to determine display support is through the
VFIO_DEVICE_QUERY_GFX_PLANE ioctl, but for libvirt to probe that on
their own they'd need to get to the point where they could open the
vfio device and perform the ioctl.  That means opening a vfio
container, adding the group, setting the iommu type, and getting the
device.  I was initially a bit appalled at asking libvirt to do that,
but the alternative is to put this information in sysfs, but doing that
we risk that we need to describe every nuance of the mdev device
through sysfs and it becomes a dumping ground for every possible
feature an mdev device might have.
...    
So I was ready to return and suggest that maybe libvirt should probe
the device to know about these ancillary configuration details, but
then I remembered that both mdev vGPU vendors had external dependencies
to even allow probing the device.  KVMGT will fail to open the device
if it's not associated with an instance of KVM and NVIDIA vGPU, I
believe, will fail if the vGPU manager process cannot find the QEMU
instance to extract the VM UUID.  (Both of these were bad ideas)
Here's another proposal that's really growing on me:
* Fix the vendor drivers!  Allow devices to be opened and probed
   without these external dependencies.
 * Libvirt uses the existing vfio API to open the device and probe the
   necessary ioctls, if it can't probe the device, the feature is
   unavailable, ie. display=off, no migration.
I'm trying to think simpler mechanism using sysfs that could work for
any feature and knowing source-destination migration compatibility check
by libvirt before initiating migration.
I have another proposal:
* Add a ioctl VFIO_DEVICE_PROBE_FEATURES
struct vfio_device_features {
    __u32 argsz;
    __u32 features;
}
Define bit for each feature:
#define VFIO_DEVICE_FEATURE_DISPLAY_REGION	(1 << 0)
#define VFIO_DEVICE_FEATURE_DISPLAY_DMABUF	(1 << 1)
#define VFIO_DEVICE_FEATURE_MIGRATION		(1 << 2)
* Vendor driver returns bitmask of supported features during
initialization phase.
* In vfio core module, trap this ioctl for each device  in
vfio_device_fops_unl_ioctl(),
Whoops, chicken and egg problem, VFIO_GROUP_GET_DEVICE_FD is our
blocking point with mdev drivers, we can't get a device fd, so we can't
call an ioctl on the device fd.
I'm sorry, I thought we could expose features when QEMU initialize, but
libvirt needs to know supported features before QEMU initialize.
...
...
check features bitmask returned by vendor
driver and add a sysfs file if feature is supported that device. This
sysfs file would return 0/1.
I don't understand why we have an ioctl interface, if the user can get
to the device fd then we have existing interfaces to probe these
things, it seems like you're just wanting to pass a features bitmap
through to vfio_add_group_dev() that vfio-core would expose through
sysfs, but a list of feature bits doesn't convey enough info except for
the most basic uses.
Yes, vfio_add_group_dev() seems to be better way to convey features to
vfio core.
...
...
For migration this bit will only indicate if host driver supports
migration feature.
For source and destination compatibility check libvirt would need more
data/variables to check like,
* if same type of 'mdev_type' device create-able at destination,
   i.e. if ('mdev_type'->available_instances > 0)
* if host_driver_version at source and destination are compatible.
Host driver from same release branch should be mostly compatible, but if
there are major changes in structures or APIs, host drivers from
different branches might not be compatible, for example, if source and
destination are from different branches and one of the structure had
changed, then data collected at source might not be compatible with
structures at destination and typecasting it to changed structures would
mess up migrated data during restoration.
Of course now you're asking that libvirt understand the release
versioning scheme of every vendor driver and that it remain
programatically consistent.  We can't even do this with in-kernel
drivers.  And in the end, still the best we can do is guess.
Libvirt doesn't need to understand the version, libvirt need to do
strcmp version string from source and destination. If those are equal,
then libvirt would understand that they are compatible.
Who's to say that the driver version and migration compatibility have
any relation at all?  Some drivers might focus on designing their own
migration interface that can maintain compatibility across versions
(QEMU does this), some drivers may only allow identical version
migration (which is going to frustrate upper level management tools and
customers - RHEL goes to great extents to support cross version
migration).  We cannot have a one size fits all here that driver version
defines completely the migration compatibility.
I'll agree; I don't know enough about these devices, but to give you
some example of things I'd expect to work:
   a) User adds new machines to their data centre with larger/newer
version of the same vendors GPU; in some cases that should work
(depending on vendor details etc)
   b) The same thing but with identical hardware but a newer driver on
the destination.

Obviously there will be some cut offs that say some versions are
incompatible;  but for normal migration we jump through serious hoops
to make sure stuff works; customers will expect the same with some
VFIO devices.
...
...
...
...
* if guest_driver_version is compatible with host driver at destination.
For mdev devices, guest driver communicates with host driver in some
form. If there are changes in structures/APIs of such communication,
guest driver at source might not be compatible with host driver at
destination.
And another guess plus now the guest driver is involved which libvirt
has no visibility to.
Like above libvirt need to do strcmp.
Insufficient, imo
...
...
...
'available_instances' sysfs already exist, later two should be added by
vendor driver which libvirt can use for migration compatibility check.
As noted previously, display and migration are not necessarily
mdev-only features, it's possible that vfio-pci or vfio-platform could
also implement these, so the sysfs interface cannot be restricted to
the mdev template and lifecycle interface.
I agree.
Feature bitmask passed to vfio core is not mdev specific. But here
'available_instances' for migration compatibility check is mdev
specific. If mdev device is not create-able at destination, there is no
point in initiating migration by libvirt.
'available_instances' for migration compatibility check...?  We use
available_instances to know whether we have the resources to create a
given mdev type.  It's certainly a prerequisite to have a device of the
identical type at the migration target and how we define what is an
identical device for a directly assigned PCI device is yet another
overly complicated rat hole.  But an identical device doesn't
necessarily imply migration compatibility and I think that's the
problem we're tackling.  We cannot assume based only on the device type
that migration is compatible, that's basically saying we're never going
to have any bugs or oversights or new features in the migration stream.
Those things certainly happen; state that we forgot to transfer, new
features enables on devices, devices configured in different ways.
...
Chatting with Laine, it may be worth a step back to include migration
experts and people up the stack with more visibility to how openstack
operates.  The issue here is that if vfio gains migration support then
we have a portion of the migration stream that is not under the control
of QEMU, we cannot necessarily tie it to a QEMU machine type and we
cannot necessarily dictate how the vfio bus driver (vendor driver)
handles versioning and compatibility.  My intent was to expose some
sort of migration information through the vfio API so that upper level
tools could determine source and target compatibility, but this in
itself is I think something new that those tools need to agree how it
might be done.  How would something like openstack want to handle not
only finding a migration target with a compatible device, but also
verifying if the device supports the migration format of the source
device?
Alternatively, should we do anything?  Is the problem too hard and we
should let the driver return an error when it receives an incompatible
migration stream, aborting the migration?
It's a bit nasty; if you've hit the 'evacuate host' button then what
happens when you've got some incompatible hosts.

Dave
...
...
...
One more try... we have a vfio_group fd.  This is created by the bus
drivers calling vfio_add_group_dev() and registers a struct device, a
struct vfio_device_ops, and private data.  Typically we only wire the
device_ops to the resulting file descriptor we get from
VFIO_GROUP_GET_DEVICE_FD, but could we enable sort of a nested ioctl
through the group fd?  The ioctl would need to take a string arg to
match to a device name, plus an ioctl cmd and arg for the device_ops
ioctl.  The group ioctl would need to filter cmds to known, benign
queries.  We'd also need to verify that the allowed ioctls have no
dependencies on setup done in device_ops.open().
So these ioctls would be called without devices open() call, doesn't
this seem to be against file operations standard?
vfio_device_ops is modeled largely after file operations, but I don't
think we're bound by that for the interaction between vfio-core and the
vfio bus drivers.  We could make a separate callback for unprivileged
ioctls, but that seems like more work per driver when we really want to
maintain the identical API, we just want to provide a more limited
interface and change the calling point.
An issue I thought of for migration though is that this path wouldn't
have access to the migration region and therefore if we place a header
within that region containing the compatibility and versioning
information, the user still couldn't access it.  This doesn't seem to
be a blocker though as we could put that information within the region
capability that defines the region as used for migration.  Possibly a
device could have multiple migration regions with different formats
for backwards compatibility, of course then we'd need a way to
determine which to use and which combinations have been validated.
Thanks,
Alex
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK