On 4/26/2018 1:22 AM, Dr. David Alan Gilbert wrote:
* Alex Williamson (alex.williamson(a)redhat.com) wrote:
> On Wed, 25 Apr 2018 21:00:39 +0530
> Kirti Wankhede <kwankhede(a)nvidia.com> wrote:
>
>> On 4/25/2018 4:29 AM, Alex Williamson wrote:
>>> On Wed, 25 Apr 2018 01:20:08 +0530
>>> Kirti Wankhede <kwankhede(a)nvidia.com> wrote:
>>>
>>>> On 4/24/2018 3:10 AM, Alex Williamson wrote:
>>>>> On Wed, 18 Apr 2018 12:31:53 -0600
>>>>> Alex Williamson <alex.williamson(a)redhat.com> wrote:
>>>>>
>>>>>> On Mon, 9 Apr 2018 12:35:10 +0200
>>>>>> Gerd Hoffmann <kraxel(a)redhat.com> wrote:
>>>>>>
>>>>>>> This little series adds three drivers, for demo-ing and testing vfio
>>>>>>> display interface code. There is one mdev device for each interface
>>>>>>> type (mdpy.ko for region and mbochs.ko for dmabuf).
>>>>>>
>>>>>> Erik Skultety brought up a good question today regarding how libvirt
>>>>>> is meant to handle these different flavors of display interfaces and
>>>>>> knowing whether a given mdev device has display support at all. It
>>>>>> seems that we cannot simply use the default display=auto because
>>>>>> libvirt needs to specifically configure gl support for a dmabuf type
>>>>>> interface versus not having such a requirement for a region
>>>>>> interface, perhaps even removing the emulated graphics in some cases
>>>>>> (though I don't think we have boot graphics through either solution
>>>>>> yet). Additionally, GVT-g seems to need the x-igd-opregion support
>>>>>> enabled(?), which is a non-starter for libvirt as it's an
>>>>>> experimental option!
>>>>>>
>>>>>> Currently the only way to determine display support is through the
>>>>>> VFIO_DEVICE_QUERY_GFX_PLANE ioctl, but for libvirt to probe that on
>>>>>> their own they'd need to get to the point where they could open the
>>>>>> vfio device and perform the ioctl. That means opening a vfio
>>>>>> container, adding the group, setting the iommu type, and getting the
>>>>>> device. I was initially a bit appalled at asking libvirt to do that,
>>>>>> but the alternative is to put this information in sysfs, and doing
>>>>>> that we risk that we need to describe every nuance of the mdev device
>>>>>> through sysfs and it becomes a dumping ground for every possible
>>>>>> feature an mdev device might have.
> ...
>>>>>> So I was ready to return and suggest that maybe libvirt should probe
>>>>>> the device to know about these ancillary configuration details, but
>>>>>> then I remembered that both mdev vGPU vendors had external
>>>>>> dependencies to even allow probing the device. KVMGT will fail to
>>>>>> open the device if it's not associated with an instance of KVM and
>>>>>> NVIDIA vGPU, I believe, will fail if the vGPU manager process cannot
>>>>>> find the QEMU instance to extract the VM UUID. (Both of these were
>>>>>> bad ideas)
>>>>>
>>>>> Here's another proposal that's really growing on me:
>>>>>
>>>>> * Fix the vendor drivers! Allow devices to be opened and probed
>>>>> without these external dependencies.
>>>>> * Libvirt uses the existing vfio API to open the device and probe the
>>>>> necessary ioctls; if it can't probe the device, the feature is
>>>>> unavailable, i.e. display=off, no migration.
>>>>>
>>>>
>>>> I'm trying to think of a simpler mechanism using sysfs that could work
>>>> for any feature and that would let libvirt check source-destination
>>>> migration compatibility before initiating migration.
>>>>
>>>> I have another proposal:
>>>> * Add an ioctl VFIO_DEVICE_PROBE_FEATURES
>>>> struct vfio_device_features {
>>>>     __u32 argsz;
>>>>     __u32 features;
>>>> };
>>>>
>>>> Define bit for each feature:
>>>> #define VFIO_DEVICE_FEATURE_DISPLAY_REGION (1 << 0)
>>>> #define VFIO_DEVICE_FEATURE_DISPLAY_DMABUF (1 << 1)
>>>> #define VFIO_DEVICE_FEATURE_MIGRATION (1 << 2)
>>>>
>>>> * Vendor driver returns bitmask of supported features during
>>>> initialization phase.
>>>>
>>>> * In vfio core module, trap this ioctl for each device in
>>>> vfio_device_fops_unl_ioctl(),
>>>
>>> Whoops, chicken and egg problem, VFIO_GROUP_GET_DEVICE_FD is our
>>> blocking point with mdev drivers, we can't get a device fd, so we can't
>>> call an ioctl on the device fd.
>>>
>>
>> I'm sorry, I thought we could expose features when QEMU initializes, but
>> libvirt needs to know the supported features before QEMU initializes.
>>
>>
>>>> check the features bitmask returned by the vendor
>>>> driver and add a sysfs file if the feature is supported by that
>>>> device. This sysfs file would return 0/1.
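>>>>
>>>> Just to make that concrete, a rough sketch of the hunk in
>>>> vfio_device_fops_unl_ioctl() (untested; the 'features' field on
>>>> struct vfio_device is hypothetical and would be filled in from the
>>>> bitmask the vendor driver registers at init):
>>>>
>>>> case VFIO_DEVICE_PROBE_FEATURES:
>>>> {
>>>>     struct vfio_device_features feat;
>>>>     unsigned long minsz;
>>>>
>>>>     minsz = offsetofend(struct vfio_device_features, features);
>>>>
>>>>     if (copy_from_user(&feat, (void __user *)arg, minsz))
>>>>         return -EFAULT;
>>>>
>>>>     if (feat.argsz < minsz)
>>>>         return -EINVAL;
>>>>
>>>>     /* bitmask the vendor driver supplied at registration */
>>>>     feat.features = device->features;
>>>>
>>>>     return copy_to_user((void __user *)arg, &feat, minsz) ?
>>>>            -EFAULT : 0;
>>>> }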
>>>
>>> I don't understand why we have an ioctl interface: if the user can get
>>> to the device fd then we have existing interfaces to probe these
>>> things. It seems like you're just wanting to pass a features bitmap
>>> through to vfio_add_group_dev() that vfio-core would expose through
>>> sysfs, but a list of feature bits doesn't convey enough info except for
>>> the most basic uses.
>>>
>>
>> Yes, vfio_add_group_dev() seems to be a better way to convey features
>> to vfio core.
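>>
>> As a sketch of what I mean (hypothetical; today's vfio_add_group_dev()
>> takes no such 'features' argument):
>>
>> /* bus/vendor driver passes a feature bitmap that vfio-core could
>>  * then expose through a sysfs attribute per device */
>> int vfio_add_group_dev(struct device *dev,
>>                        const struct vfio_device_ops *ops,
>>                        void *device_data, u32 features);
>>
>> /* e.g. from the mdev vendor driver's probe path: */
>> ret = vfio_add_group_dev(dev, &vfio_mdev_dev_ops, mdev,
>>                          VFIO_DEVICE_FEATURE_DISPLAY_REGION |
>>                          VFIO_DEVICE_FEATURE_MIGRATION);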
>>
>>>> For migration this bit will only indicate whether the host driver
>>>> supports the migration feature.
>>>>
>>>> For the source-destination compatibility check, libvirt would need
>>>> more data/variables to check, like:
>>>> * whether the same 'mdev_type' device is create-able at the
>>>> destination, i.e. if ('mdev_type'->available_instances > 0)
>>>>
>>>> * whether host_driver_version at source and destination are
>>>> compatible. Host drivers from the same release branch should be mostly
>>>> compatible, but if there are major changes in structures or APIs, host
>>>> drivers from different branches might not be compatible. For example,
>>>> if source and destination are from different branches and one of the
>>>> structures had changed, then data collected at the source might not be
>>>> compatible with the structures at the destination, and typecasting it
>>>> to the changed structures would mess up the migrated data during
>>>> restoration.
>>>
>>> Of course now you're asking that libvirt understand the release
>>> versioning scheme of every vendor driver and that it remain
>>> programmatically consistent. We can't even do this with in-kernel
>>> drivers. And in the end, still the best we can do is guess.
>>>
>>
>> Libvirt doesn't need to understand the version; libvirt just needs to
>> strcmp the version strings from source and destination. If those are
>> equal, then libvirt would know that they are compatible.
>
> Who's to say that the driver version and migration compatibility have
> any relation at all? Some drivers might focus on designing their own
> migration interface that can maintain compatibility across versions
> (QEMU does this), some drivers may only allow identical version
> migration (which is going to frustrate upper level management tools and
> customers - RHEL goes to great lengths to support cross-version
> migration). We cannot have a one-size-fits-all rule here where the
> driver version completely defines migration compatibility.
I'll agree; I don't know enough about these devices, but to give you
some examples of things I'd expect to work:
  a) User adds new machines to their data centre with a larger/newer
     version of the same vendor's GPU; in some cases that should work
     (depending on vendor details etc.)
  b) The same thing but with identical hardware and a newer driver on
     the destination.
Obviously there will be some cut-offs that say some versions are
incompatible; but for normal migration we jump through serious hoops to
make sure stuff works; customers will expect the same with some VFIO
devices.
How does libvirt check that cut-off where some versions are incompatible?
>>>> * whether guest_driver_version is compatible with the host driver at
>>>> the destination.
>>>> For mdev devices, the guest driver communicates with the host driver
>>>> in some form. If there are changes in the structures/APIs of such
>>>> communication, the guest driver at the source might not be compatible
>>>> with the host driver at the destination.
>>>
>>> And another guess, plus now the guest driver is involved, which
>>> libvirt has no visibility into.
>>>
>>
>> Like above, libvirt just needs to strcmp the version strings.
>
> Insufficient, imo
>
>>>> The 'available_instances' sysfs file already exists; the latter two
>>>> should be added by the vendor driver, and libvirt can use them for the
>>>> migration compatibility check.
>>>
>>> As noted previously, display and migration are not necessarily
>>> mdev-only features, it's possible that vfio-pci or vfio-platform could
>>> also implement these, so the sysfs interface cannot be restricted to
>>> the mdev template and lifecycle interface.
>>>
>>
>> I agree.
>> The feature bitmask passed to vfio core is not mdev specific. But here
>> 'available_instances' for the migration compatibility check is mdev
>> specific. If the mdev device is not create-able at the destination,
>> there is no point in libvirt initiating the migration.
>
> 'available_instances' for migration compatibility check...? We use
> available_instances to know whether we have the resources to create a
> given mdev type. It's certainly a prerequisite to have a device of the
> identical type at the migration target and how we define what is an
> identical device for a directly assigned PCI device is yet another
> overly complicated rat hole. But an identical device doesn't
> necessarily imply migration compatibility and I think that's the
> problem we're tackling. We cannot assume based only on the device type
> that migration is compatible, that's basically saying we're never going
> to have any bugs or oversights or new features in the migration stream.
Those things certainly happen; state that we forgot to transfer, new
features enabled on devices, devices configured in different ways.
How does libvirt check migration compatibility for other devices across
QEMU versions, where the source supports a device and the destination,
running an older QEMU version, doesn't support that device or the device
doesn't exist on that system?
Thanks,
Kirti
> Chatting with Laine, it may be worth a step back to include migration
> experts and people up the stack with more visibility to how openstack
> operates. The issue here is that if vfio gains migration support then
> we have a portion of the migration stream that is not under the control
> of QEMU, we cannot necessarily tie it to a QEMU machine type and we
> cannot necessarily dictate how the vfio bus driver (vendor driver)
> handles versioning and compatibility. My intent was to expose some
> sort of migration information through the vfio API so that upper level
> tools could determine source and target compatibility, but this in
> itself is, I think, something new that those tools need to agree on how
> it might be done.
> only finding a migration target with a compatible device, but also
> verifying if the device supports the migration format of the source
> device?
>
> Alternatively, should we do anything? Is the problem too hard and we
> should let the driver return an error when it receives an incompatible
> migration stream, aborting the migration?
It's a bit nasty; if you've hit the 'evacuate host' button, then what
happens when you've got some incompatible hosts?
Dave
>>> One more try... we have a vfio_group fd. This is created by the bus
>>> drivers calling vfio_add_group_dev() and registers a struct device, a
>>> struct vfio_device_ops, and private data. Typically we only wire the
>>> device_ops to the resulting file descriptor we get from
>>> VFIO_GROUP_GET_DEVICE_FD, but could we enable sort of a nested ioctl
>>> through the group fd? The ioctl would need to take a string arg to
>>> match to a device name, plus an ioctl cmd and arg for the device_ops
>>> ioctl. The group ioctl would need to filter cmds to known, benign
>>> queries. We'd also need to verify that the allowed ioctls have no
>>> dependencies on setup done in device_ops.open().
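>>>
>>> As a strawman (names and ioctl number made up), the group-level query
>>> might look something like:
>>>
>>> struct vfio_group_device_query {
>>>     __u32 argsz;
>>>     __u32 flags;
>>>     __u32 cmd;     /* device ioctl to forward */
>>>     __u64 arg;     /* user pointer to that ioctl's argument */
>>>     __u8  name[];  /* device name to match within the group */
>>> };
>>> #define VFIO_GROUP_DEVICE_QUERY _IO(VFIO_TYPE, VFIO_BASE + 20)
>>>
>>> /* vfio-core would forward only a whitelist of benign queries: */
>>> switch (query.cmd) {
>>> case VFIO_DEVICE_GET_INFO:
>>> case VFIO_DEVICE_GET_REGION_INFO:
>>> case VFIO_DEVICE_QUERY_GFX_PLANE:
>>>     return device->ops->ioctl(device->device_data, query.cmd,
>>>                               (unsigned long)query.arg);
>>> default:
>>>     return -ENOTTY;
>>> }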
>>
>> So these ioctls would be called without the device's open() call;
>> doesn't this seem to be against the file operations standard?
>
> vfio_device_ops is modeled largely after file operations, but I don't
> think we're bound by that for the interaction between vfio-core and the
> vfio bus drivers. We could make a separate callback for unprivileged
> ioctls, but that seems like more work per driver when we really want to
> maintain the identical API, we just want to provide a more limited
> interface and change the calling point.
>
> An issue I thought of for migration though is that this path wouldn't
> have access to the migration region and therefore if we place a header
> within that region containing the compatibility and versioning
> information, the user still couldn't access it. This doesn't seem to
> be a blocker though as we could put that information within the region
> capability that defines the region as used for migration. Possibly a
> device could have multiple migration regions with different formats
> for backwards compatibility, of course then we'd need a way to
> determine which to use and which combinations have been validated.
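>
> For instance, piggybacking on the existing vfio_info_cap_header
> chaining (the two version fields below are hypothetical):
>
> struct vfio_region_info_cap_migration {
>     struct vfio_info_cap_header header;
>     __u32 stream_version;  /* vendor-defined migration stream version */
>     __u32 compat_version;  /* oldest stream version still accepted */
> };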
> Thanks,
>
> Alex
--
Dr. David Alan Gilbert / dgilbert(a)redhat.com / Manchester, UK