Re: [libvirt] [PATCH 0/3] sample: vfio mdev display devices.

On Mon, 9 Apr 2018 12:35:10 +0200 Gerd Hoffmann <kraxel@redhat.com> wrote:
This little series adds three drivers, for demo-ing and testing vfio display interface code. There is one mdev device for each interface type (mdpy.ko for region and mbochs.ko for dmabuf).
Erik Skultety brought up a good question today regarding how libvirt is meant to handle these different flavors of display interfaces and knowing whether a given mdev device has display support at all. It seems that we cannot simply use the default display=auto because libvirt needs to specifically configure gl support for a dmabuf type interface versus not having such a requirement for a region interface, perhaps even removing the emulated graphics in some cases (though I don't think we have boot graphics through either solution yet). Additionally, GVT-g seems to need the x-igd-opregion support enabled(?), which is a non-starter for libvirt as it's an experimental option!

Currently the only way to determine display support is through the VFIO_DEVICE_QUERY_GFX_PLANE ioctl, but for libvirt to probe that on their own they'd need to get to the point where they could open the vfio device and perform the ioctl. That means opening a vfio container, adding the group, setting the iommu type, and getting the device (see the sketch below). I was initially a bit appalled at asking libvirt to do that, but the alternative is to put this information in sysfs, and doing that risks needing to describe every nuance of the mdev device through sysfs, turning it into a dumping ground for every possible feature an mdev device might have.

So I was ready to return and suggest that maybe libvirt should probe the device to know about these ancillary configuration details, but then I remembered that both mdev vGPU vendors had external dependencies to even allow probing the device. KVMGT will fail to open the device if it's not associated with an instance of KVM, and NVIDIA vGPU, I believe, will fail if the vGPU manager process cannot find the QEMU instance to extract the VM UUID. (Both of these were bad ideas.)

Therefore, how can libvirt know if a given mdev device supports a display, which type of display it supports, and potentially which vendor specific options might be required to further enable that display (if they weren't experimental)? A terrible solution would be that libvirt hard codes that NVIDIA works with regions and Intel works with dmabufs, but even then there's a backwards and forwards compatibility problem: libvirt needs to support older kernels and drivers where display support is not present, and newer drivers where perhaps Intel is now doing regions and NVIDIA is supporting dmabuf, so it cannot simply be assumed based on the vendor. The only solution I see down that path would be identifying specific {vendor,type} pairs that support a predefined display type, but it's absurd to think that vendors would rev their mdev types to expose this and that libvirt would keep a database mapping types to features.

We also have the name and description attributes, but these are currently free form, so libvirt rightfully ignores them entirely. I don't know if we could create a defined feature string within those free form strings.

Otherwise, it seems we have no choice but to dive into the pool of exposing such features via sysfs, and we'll need to be vigilant of feature creep or vendor specific features (ex. we're not adding a feature to indicate an opregion requirement). How should we do this? Perhaps a bar we can set is that if a feature cannot be discovered through a standard vfio API, then it is not suitable for this sysfs API. Such things can be described via our existing mdev vendor specific attribute interface.
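For concreteness, the container/group/device dance mentioned above looks roughly like this (a sketch only: the group number and mdev UUID are made up, and error handling is omitted):

  #include <fcntl.h>
  #include <sys/ioctl.h>
  #include <linux/vfio.h>

  int container = open("/dev/vfio/vfio", O_RDWR);
  int group = open("/dev/vfio/26", O_RDWR);           /* hypothetical group */
  ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
  ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);
  int device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD,
                     "83b8f4f2-509f-382f-3c1e-e6bfe0fa1001"); /* mdev UUID */

  /* only now can we probe for display support */
  struct vfio_device_gfx_plane_info probe = {
      .argsz = sizeof(probe),
      .flags = VFIO_GFX_PLANE_TYPE_PROBE | VFIO_GFX_PLANE_TYPE_DMABUF,
  };
  if (ioctl(device, VFIO_DEVICE_QUERY_GFX_PLANE, &probe) == 0)
      ; /* dmabuf display interface supported */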
We currently have this sysfs interface:

  mdev_supported_types/
  |-- $VENDOR_TYPE
  |   |-- available_instances
  |   |-- create
  |   |-- description
  |   |-- device_api
  |   |-- devices
  |   `-- name

ioctls for vfio devices which only provide information include:

  VFIO_DEVICE_GET_INFO
  VFIO_DEVICE_GET_REGION_INFO
  VFIO_DEVICE_GET_IRQ_INFO
  VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  VFIO_DEVICE_QUERY_GFX_PLANE

We don't need to support all of these initially, but here's a starting idea for what this may look like in sysfs:

  $VENDOR_TYPE/
  |-- available_instances
  |-- create
  |-- description
  |-- device_api
  |-- devices
  |-- name
  `-- vfio-pci
      `-- device
          |-- gfx_plane
          |   |-- dmabuf
          |   `-- region
          |-- irqs
          |   |-- 0
          |   |   |-- count
          |   |   `-- flags
          |   `-- 1
          |       |-- count
          |       `-- flags
          `-- regions
              |-- 0
              |   |-- flags
              |   |-- offset
              |   `-- size
              `-- 3
                  |-- flags
                  |-- offset
                  `-- size

The existing device_api file reports "vfio-pci", so we base the device API info in a directory named vfio-pci. We're specifically exposing device information, so we have a device directory. We have a GFX_PLANE query ioctl, so we have a gfx_plane sub-directory. I imagine the dmabuf and region files here expose either Y/N or 1/0.

I continue the example with how we might expose irqs and regions, but even with regions we can dig down into how sparse mmap is exposed, how device specific regions are described, etc. Filling this in to completion without a specific userspace need to expose the information is just an exercise in bloating the kernel.

That almost begins to look reasonable, but then we can only expose this for mdev devices. What if we were to hack a back door into a directly assigned GPU that tracks the location of the active display in the framebuffer and implement the GFX_PLANE interface for that? We have no sysfs representation, for either the template or the actual device, for anything other than mdev. This inconsistency with physically assigned devices has been one of my arguments against enhancing mdev sysfs.

Thanks to anyone still reading this. Any ideas how we might help libvirt fill this information void so that they can actually configure a VM with a display device? Thanks,

Alex

Hi,
Erik Skultety brought up a good question today regarding how libvirt is meant to handle these different flavors of display interfaces and knowing whether a given mdev device has display support at all. It seems that we cannot simply use the default display=auto because libvirt needs to specifically configure gl support for a dmabuf type interface versus not having such a requirement for a region interface, perhaps even removing the emulated graphics in some cases (though I don't think we have boot graphics through either solution yet).
Correct, no boot graphics yet. The option to disable emulated graphics should be added nevertheless. It's an option after all, you don't have to use it. But after install things usually work just fine, it just takes a little longer for the guest display to show up. There is also the option to add a serial console to the guest for boot loader access.
Additionally, GVT-g seems to need the x-igd-opregion support enabled(?), which is a non-starter for libvirt as it's an experimental option!
Windows guests need it, yes. And it seems we still have to add igd opregion support to ovmf, as only bios guests are working. Or hack up an efi rom doing that. But patching ovmf is probably a lot easier because it already has support code for fw_cfg access. Linux i915.ko is happy without opregion.
So I was ready to return and suggest that maybe libvirt should probe the device to know about these ancillary configuration details, but then I remembered that both mdev vGPU vendors had external dependencies to even allow probing the device. KVMGT will fail to open the device if it's not associated with an instance of KVM and NVIDIA vGPU, I believe, will fail if the vGPU manager process cannot find the QEMU instance to extract the VM UUID. (Both of these were bad ideas)
Oops. I've run into the kvm issue too. Wondering what the reason is; shouldn't this work with tcg too? But, yes, that indeed pretty much kills the "just let libvirt use the probe ioctl" idea.
The existing device_api file reports "vfio-pci", so we base the device API info in a directory named vfio-pci. We're specifically exposing device information, so we have a device directory. We have a GFX_PLANE query ioctl, so we have a gfx_plane sub-directory. I imagine the dmabuf and region files here expose either Y/N or 1/0.
Do we want to tie this to vfio-pci? All existing devices are actually pci, and the qemu code only works for vfio-pci devices too. But at the vfio api level there is no vfio-pci dependency I'm aware of, and I think we shouldn't add one without a good reason. Should we maybe just add a gfx_plane_api file, which would be a comma-separated list of interfaces, listed in order of preference in case multiple are supported?
anything other than mdev. This inconsistency with physically assigned devices has been one of my arguments against enhancing mdev sysfs.
Thanks to anyone still reading this. Ideas how we might help libvirt fill this information void so that they can actually configure a VM with a display device? Thanks,
Well, no good idea for the physically assigned device case.

cheers, Gerd

PS: Any comment on the sample driver patches? Or should I take the lack of comments as "no news is good news, they are queued up already"?

On 2018.04.19 10:40:18 +0200, Gerd Hoffmann wrote:
Hi,
Erik Skultety brought up a good question today regarding how libvirt is meant to handle these different flavors of display interfaces and knowing whether a given mdev device has display support at all. It seems that we cannot simply use the default display=auto because libvirt needs to specifically configure gl support for a dmabuf type interface versus not having such a requirement for a region interface, perhaps even removing the emulated graphics in some cases (though I don't think we have boot graphics through either solution yet).
Correct, no boot graphics yet. The option to disable emulated graphics should be added nevertheless. It's an option after all, you don't have to use it.
But after install things usually work just fine, it just takes a little longer for the guest display to show up.. There is also the option to add a serial console to the guest for boot loader access.
Additionally, GVT-g seems to need the x-igd-opregion support enabled(?), which is a non-starter for libvirt as it's an experimental option!
Windows guests need it, yes. And it seems we have still have to add igd opregion support to ovmf as only bios guests are working. Or hack up a efi rom doing that. But patching ovmf is probably alot easier because it already has support code for fw_cfg access.
Linux i915.ko is happy without opregion.
yeah, that's true.
So I was ready to return and suggest that maybe libvirt should probe the device to know about these ancillary configuration details, but then I remembered that both mdev vGPU vendors had external dependencies to even allow probing the device. KVMGT will fail to open the device if it's not associated with an instance of KVM and NVIDIA vGPU, I believe, will fail if the vGPU manager process cannot find the QEMU instance to extract the VM UUID. (Both of these were bad ideas)
Oops. I've trapped into the kvm issue too. Wondering what the reason is, shouldn't this work with tcg too?
But, yes, that indeed pretty much kills the "just let libvirt use the probe ioctl" idea.
I also don't like that strict link; although KVM is currently the only upstream hypervisor GVT supports, we shouldn't require an available instance just for device info access.
The existing device_api file reports "vfio-pci", so we base the device API info in a directory named vfio-pci. We're specifically exposing device information, so we have a device directory. We have a GFX_PLANE query ioctl, so we have a gfx_plane sub-directory. I imagine the dmabuf and region files here expose either Y/N or 1/0.
Do we want tie this to vfio-pci? All existing devices are actually pci, and the qemu code only works for vfio-pci devices too. But at vfio api level there is no vfio-pci dependency I'm aware of, and I think we shouldn't add one without a good reason.
Should we just add a gfx_plane_api file maybe? Which would be a comma-separated list of interfaces, listed in order of preference in case multiple are supported.
Or a 'feature' file with a defined string list for those capabilities? Might be easier to extend in the future.
anything other than mdev. This inconsistency with physically assigned devices has been one of my arguments against enhancing mdev sysfs.
Thanks to anyone still reading this. Ideas how we might help libvirt fill this information void so that they can actually configure a VM with a display device? Thanks,
Well, no good idea for the physical assigned device case.
cheers, Gerd
PS: Any comment on the sample driver patches? Or should I take the lack of comments as "no news is good news, they are queued up already"?
-- Open Source Technology Center, Intel ltd.

On Thu, 19 Apr 2018 10:40:18 +0200 Gerd Hoffmann <kraxel@redhat.com> wrote:
So I was ready to return and suggest that maybe libvirt should probe the device to know about these ancillary configuration details, but then I remembered that both mdev vGPU vendors had external dependencies to even allow probing the device. KVMGT will fail to open the device if it's not associated with an instance of KVM and NVIDIA vGPU, I believe, will fail if the vGPU manager process cannot find the QEMU instance to extract the VM UUID. (Both of these were bad ideas)
Oops. I've trapped into the kvm issue too. Wondering what the reason is, shouldn't this work with tcg too?
It's used for some sort of page tracking backdoor. Yes, I think vfio devices, including mdev, should work with tcg. Separating device assignment to not be integrally tied to kvm is something I've strived for with vfio.
But, yes, that indeed pretty much kills the "just let libvirt use the probe ioctl" idea.
The existing device_api file reports "vfio-pci", so we base the device API info in a directory named vfio-pci. We're specifically exposing device information, so we have a device directory. We have a GFX_PLANE query ioctl, so we have a gfx_plane sub-directory. I imagine the dmabuf and region files here expose either Y/N or 1/0.
Do we want tie this to vfio-pci? All existing devices are actually pci, and the qemu code only works for vfio-pci devices too. But at vfio api level there is no vfio-pci dependency I'm aware of, and I think we shouldn't add one without a good reason.
The intention was to tie it to 'device_api' which reports 'vfio-pci', so the user would read the device_api, learn that it uses vfio-pci, then look for attributes in a vfio-pci sub-directory. If device_api reported vfio-ccw, they'd look for a vfio-ccw directory.
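In code, that lookup might look like this (a sketch; the gfx_plane layout is the proposal above, not an existing ABI, and the parent device and type names are just examples):

  #include <stdio.h>

  char api[32], path[256];
  FILE *f = fopen("/sys/class/mdev_bus/0000:00:02.0/mdev_supported_types/"
                  "i915-GVTg_V5_4/device_api", "r");
  fscanf(f, "%31s", api);                  /* e.g. "vfio-pci" */
  fclose(f);
  /* then look for attributes under a sub-directory named after the API */
  snprintf(path, sizeof(path), "%s/device/gfx_plane/dmabuf", api);
  /* ...and read Y/N or 1/0 from that file under the type directory */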
Should we just add a gfx_plane_api file maybe? Which would be a comma-separated list of interfaces, listed in order of preference in case multiple are supported.
I'm afraid that as soon as we get away from a strict representation of the vfio API, we're going to see feature creep with such a solution. Ex. which hw encoders are supported, frame rate limiters, number of heads, etc.
anything other than mdev. This inconsistency with physically assigned devices has been one of my arguments against enhancing mdev sysfs.
Thanks to anyone still reading this. Ideas how we might help libvirt fill this information void so that they can actually configure a VM with a display device? Thanks,
Well, no good idea for the physical assigned device case.
Minimally, I think anything we decide needs to be placed into the instantiated device sysfs hierarchy rather than the template directory for a given mdev type, otherwise we have no hope of supporting it with physical devices.
PS: Any comment on the sample driver patches? Or should I take the lack of comments as "no news is good news, they are queued up already"?
I do not have them queued yet; I'll take a closer look at them shortly and let you know if I find any issues. Thanks for doing these! I think they'll be very helpful, especially for the task above to provide reference implementations for whatever API exposure we design. Thanks, Alex

On 19/04/2018 10:40, Gerd Hoffmann wrote:
So I was ready to return and suggest that maybe libvirt should probe the device to know about these ancillary configuration details, but then I remembered that both mdev vGPU vendors had external dependencies to even allow probing the device. KVMGT will fail to open the device if it's not associated with an instance of KVM and NVIDIA vGPU, I believe, will fail if the vGPU manager process cannot find the QEMU instance to extract the VM UUID. (Both of these were bad ideas) Oops. I've trapped into the kvm issue too. Wondering what the reason is, shouldn't this work with tcg too?
As far as I understand, KVMGT requires KVM support in order to track writes to guest memory. It's a kernel API provided by the kvm.ko module, so no TCG support. Paolo

On Wed, 18 Apr 2018 12:31:53 -0600 Alex Williamson <alex.williamson@redhat.com> wrote:
On Mon, 9 Apr 2018 12:35:10 +0200 Gerd Hoffmann <kraxel@redhat.com> wrote:
This little series adds three drivers, for demo-ing and testing vfio display interface code. There is one mdev device for each interface type (mdpy.ko for region and mbochs.ko for dmabuf).
Erik Skultety brought up a good question today regarding how libvirt is meant to handle these different flavors of display interfaces and knowing whether a given mdev device has display support at all. It seems that we cannot simply use the default display=auto because libvirt needs to specifically configure gl support for a dmabuf type interface versus not having such a requirement for a region interface, perhaps even removing the emulated graphics in some cases (though I don't think we have boot graphics through either solution yet). Additionally, GVT-g seems to need the x-igd-opregion support enabled(?), which is a non-starter for libvirt as it's an experimental option!
Currently the only way to determine display support is through the VFIO_DEVICE_QUERY_GFX_PLANE ioctl, but for libvirt to probe that on their own they'd need to get to the point where they could open the vfio device and perform the ioctl. That means opening a vfio container, adding the group, setting the iommu type, and getting the device. I was initially a bit appalled at asking libvirt to do that, but the alternative is to put this information in sysfs, but doing that we risk that we need to describe every nuance of the mdev device through sysfs and it becomes a dumping ground for every possible feature an mdev device might have.
So I was ready to return and suggest that maybe libvirt should probe the device to know about these ancillary configuration details, but then I remembered that both mdev vGPU vendors had external dependencies to even allow probing the device. KVMGT will fail to open the device if it's not associated with an instance of KVM and NVIDIA vGPU, I believe, will fail if the vGPU manager process cannot find the QEMU instance to extract the VM UUID. (Both of these were bad ideas)
Here's another proposal that's really growing on me:

* Fix the vendor drivers! Allow devices to be opened and probed without these external dependencies.
* Libvirt uses the existing vfio API to open the device and probe the necessary ioctls; if it can't probe the device, the feature is unavailable, ie. display=off, no migration.

I'm really having a hard time getting behind inventing a secondary API just to work around arbitrary requirements from mdev vendor drivers. vfio was never intended to be locked to QEMU or KVM; these two vendor drivers are the only examples of such requirements, and we're only encouraging this behavior if we add a redundant API for device probing. Any solution on the table currently would require changes to the mdev vendor drivers, so why not this change? Please defend why each driver needs these external dependencies and why the device open callback is the best, or only, place in the stack to enforce that dependency. Let's see what we're really dealing with here. Thanks,

Alex

Hi,
Here's another proposal that's really growing on me:
* Fix the vendor drivers! Allow devices to be opened and probed without these external dependencies.
Hmm. If you try to use gvt with tcg, wouldn't qemu think "device probed ok, all green" even though that isn't the case?

cheers, Gerd

On Tue, 24 Apr 2018 09:17:37 +0200 Gerd Hoffmann <kraxel@redhat.com> wrote:
Hi,
Here's another proposal that's really growing on me:
* Fix the vendor drivers! Allow devices to be opened and probed without these external dependencies.
Hmm. If you try use gvt with tcg then, wouldn't qemu think "device probed ok, all green" then even though that isn't the case?
Well, is there a way to make it work with tcg? That would be the best solution. Perhaps KVM could be handled as an accelerator rather than a required component. I don't really understand how the page tracking interface is used and why it's not required by NVIDIA if it's so fundamental to GVT-g. Otherwise, are there other points at which the device could refuse to be enabled? For instance, what if the write to enable bus-master in the PCI command register returned an error if the device isn't fully configured?

Paolo had suggested offline that maybe there could be a read-only mode of the device that allows probing. I think that would be a fair bit of work and complexity to support, but I'm open to those sorts of ideas. I can't be sure the NVIDIA requirement isn't purely for accounting purposes within their own proprietary userspace manager, without any real technical requirement. Hoping Intel and NVIDIA can comment on these so we really understand why these are in place before we bend over backwards for a secondary API interface. Thanks, Alex

On Wed, 25 Apr 2018, Alex Williamson <alex.williamson@redhat.com> wrote:
On Tue, 24 Apr 2018 09:17:37 +0200 Gerd Hoffmann <kraxel@redhat.com> wrote:
Hi,
Here's another proposal that's really growing on me:
* Fix the vendor drivers! Allow devices to be opened and probed without these external dependencies.
Hmm. If you try use gvt with tcg then, wouldn't qemu think "device probed ok, all green" then even though that isn't the case?
Well, is there a way to make it work with tcg? That would be the best solution. Perhaps KVM could be handled as an accelerator rather than a required component. I don't really understand how the page tracking interface is used and why it's not required by NVIDIA if it's so fundamental to GVT-g. Otherwise,
GVT-g needs the hypervisor's (Xen or KVM) help to trap guest GPU page table updates, so that GVT-g can update the shadow page table correctly in the host, with host physical addresses rather than guest physical addresses. As this page table is in memory, GVT-g needs the hypervisor's help to make it write-protected, so that it can trap the updates in time.
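Roughly, the kvm.ko hooks involved look like this (a sketch based on the page-track API in arch/x86/kvm/page_track.c and its use in the kvmgt code; the callback body here is illustrative):

  static void gvt_track_write(struct kvm_vcpu *vcpu, gpa_t gpa,
                              const u8 *new, int bytes,
                              struct kvm_page_track_notifier_node *node)
  {
          /* guest wrote to a write-protected GTT page: replay the update
           * into the shadow page table using host physical addresses */
  }

  /* register the notifier, then write-protect guest page-table pages: */
  node->track_write = gvt_track_write;
  kvm_page_track_register_notifier(kvm, node);
  kvm_slot_page_track_add_page(kvm, slot, gfn, KVM_PAGE_TRACK_WRITE);

Note that all of this takes a struct kvm, which is exactly where the hard KVM dependency comes from.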
are there other points at which the device could refuse to be enabled, for instance what if the write to enable bus-master in the PCI command register returned an error if the device isn't fully configured. Paolo had suggested offline
If we add some logic to let GVT-g support basic VFIO APIs even in the tcg use case, could the following be reasonable?

1. A dummy vGPU is created with a UUID.
2. When VFIO_DEVICE_GET_INFO is invoked by libvirt, GVT-g tells that this vGPU is actually a dummy one and cannot work.
3. Then libvirt chooses not to boot a VM with this dummy vGPU.
4. Maybe we also need some logic to let a VM with this dummy vGPU boot and work just as if there were no vGPU support.

Thanks.

BR, Tina
that maybe there could be a read-only mode of the device that allows probing. I think that would be a fair bit of work and complexity to support, but I'm open to those sorts of ideas. I can't be sure the NVIDIA requirement isn't purely for accounting purposes within their own proprietary userspace manager, without any real technical requirement. Hoping Intel and NVIDIA can comment on these so we really understand why these are in place before we bend over backwards for a secondary API interface. Thanks,
Alex

On 4/24/2018 3:10 AM, Alex Williamson wrote:
On Wed, 18 Apr 2018 12:31:53 -0600 Alex Williamson <alex.williamson@redhat.com> wrote:
On Mon, 9 Apr 2018 12:35:10 +0200 Gerd Hoffmann <kraxel@redhat.com> wrote:
This little series adds three drivers, for demo-ing and testing vfio display interface code. There is one mdev device for each interface type (mdpy.ko for region and mbochs.ko for dmabuf).
Erik Skultety brought up a good question today regarding how libvirt is meant to handle these different flavors of display interfaces and knowing whether a given mdev device has display support at all. It seems that we cannot simply use the default display=auto because libvirt needs to specifically configure gl support for a dmabuf type interface versus not having such a requirement for a region interface, perhaps even removing the emulated graphics in some cases (though I don't think we have boot graphics through either solution yet). Additionally, GVT-g seems to need the x-igd-opregion support enabled(?), which is a non-starter for libvirt as it's an experimental option!
Currently the only way to determine display support is through the VFIO_DEVICE_QUERY_GFX_PLANE ioctl, but for libvirt to probe that on their own they'd need to get to the point where they could open the vfio device and perform the ioctl. That means opening a vfio container, adding the group, setting the iommu type, and getting the device. I was initially a bit appalled at asking libvirt to do that, but the alternative is to put this information in sysfs, but doing that we risk that we need to describe every nuance of the mdev device through sysfs and it becomes a dumping ground for every possible feature an mdev device might have.
One or two sysfs files for each feature shouldn't be that much overhead? In the kernel, other subsystem modules expose capabilities through sysfs; for example, the PCI subsystem adds a 'boot_vga' file for VGA devices which returns 0/1 depending on whether it's the boot VGA device. Similarly 'd3cold_allowed', 'msi_bus'...
So I was ready to return and suggest that maybe libvirt should probe the device to know about these ancillary configuration details, but then I remembered that both mdev vGPU vendors had external dependencies to even allow probing the device. KVMGT will fail to open the device if it's not associated with an instance of KVM and NVIDIA vGPU, I believe, will fail if the vGPU manager process cannot find the QEMU instance to extract the VM UUID. (Both of these were bad ideas)
Here's another proposal that's really growing on me:
* Fix the vendor drivers! Allow devices to be opened and probed without these external dependencies. * Libvirt uses the existing vfio API to open the device and probe the necessary ioctls, if it can't probe the device, the feature is unavailable, ie. display=off, no migration.
I'm trying to think of a simpler mechanism using sysfs that could work for any feature, and a way for libvirt to check source-destination migration compatibility before initiating migration.

I have another proposal:
* Add an ioctl VFIO_DEVICE_PROBE_FEATURES

    struct vfio_device_features {
        __u32 argsz;
        __u32 features;
    };

  Define a bit for each feature:

    #define VFIO_DEVICE_FEATURE_DISPLAY_REGION (1 << 0)
    #define VFIO_DEVICE_FEATURE_DISPLAY_DMABUF (1 << 1)
    #define VFIO_DEVICE_FEATURE_MIGRATION      (1 << 2)

* The vendor driver returns a bitmask of supported features during its initialization phase.
* In the vfio core module, trap this ioctl for each device in vfio_device_fops_unl_ioctl(), check the features bitmask returned by the vendor driver, and add a sysfs file if the feature is supported by that device. This sysfs file would return 0/1.

For migration this bit will only indicate whether the host driver supports the migration feature. For source and destination compatibility checks libvirt would need more data/variables to check, like:

* if the same 'mdev_type' device is create-able at the destination, i.e. if ('mdev_type'->available_instances > 0)

* if the host_driver_version at source and destination are compatible. Host drivers from the same release branch should be mostly compatible, but if there are major changes in structures or APIs, host drivers from different branches might not be compatible. For example, if source and destination are from different branches and one of the structures changed, then data collected at the source might not be compatible with the structures at the destination, and typecasting it to the changed structures would mess up the migrated data during restoration.

* if the guest_driver_version is compatible with the host driver at the destination. For mdev devices, the guest driver communicates with the host driver in some form. If there are changes in the structures/APIs of such communication, the guest driver at the source might not be compatible with the host driver at the destination.

The 'available_instances' sysfs file already exists; the latter two should be added by the vendor driver, which libvirt can use for the migration compatibility check.

Thanks, Kirti

On Wed, 25 Apr 2018 01:20:08 +0530 Kirti Wankhede <kwankhede@nvidia.com> wrote:
On 4/24/2018 3:10 AM, Alex Williamson wrote:
On Wed, 18 Apr 2018 12:31:53 -0600 Alex Williamson <alex.williamson@redhat.com> wrote:
On Mon, 9 Apr 2018 12:35:10 +0200 Gerd Hoffmann <kraxel@redhat.com> wrote:
This little series adds three drivers, for demo-ing and testing vfio display interface code. There is one mdev device for each interface type (mdpy.ko for region and mbochs.ko for dmabuf).
Erik Skultety brought up a good question today regarding how libvirt is meant to handle these different flavors of display interfaces and knowing whether a given mdev device has display support at all. It seems that we cannot simply use the default display=auto because libvirt needs to specifically configure gl support for a dmabuf type interface versus not having such a requirement for a region interface, perhaps even removing the emulated graphics in some cases (though I don't think we have boot graphics through either solution yet). Additionally, GVT-g seems to need the x-igd-opregion support enabled(?), which is a non-starter for libvirt as it's an experimental option!
Currently the only way to determine display support is through the VFIO_DEVICE_QUERY_GFX_PLANE ioctl, but for libvirt to probe that on their own they'd need to get to the point where they could open the vfio device and perform the ioctl. That means opening a vfio container, adding the group, setting the iommu type, and getting the device. I was initially a bit appalled at asking libvirt to do that, but the alternative is to put this information in sysfs, but doing that we risk that we need to describe every nuance of the mdev device through sysfs and it becomes a dumping ground for every possible feature an mdev device might have.
One or two sysfs file for each feature shouldn't be that much of over head? In kernel, other subsystem modules expose capability through sysfs, like PCI subsystem adds 'boot_vga' file for VGA device which returns 0/1 depending on if its boot VGA device. Similarly 'd3cold_allowed', 'msi_bus'...
Obviously we could add sysfs files, but unlike properties that the PCI core exposes about struct pci_dev fields, the idea of a vfio_device is much more abstract. Each bus driver creates its own device representation, so we have a top level vfio_device referencing through an opaque pointer a vfio_pci_device, vfio_platform_device, or mdev_device, and each mdev vendor driver creates its own private data structure below the mdev_device. So it's not quite as simple as one new attribute "show" function to handle all devices of that bus_type. We need a consistent implementation in each bus driver and vendor driver or we need to figure out how to percolate the information up to the vfio core. Your idea below seems to take the percolate approach.
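The layering in question, roughly (a simplified sketch; the real struct is private to drivers/vfio/vfio.c and has more fields):

  struct vfio_device {
          struct device                 *dev;
          const struct vfio_device_ops  *ops;
          struct vfio_group             *group;
          void                          *device_data;  /* -> vfio_pci_device,
                                                        *    vfio_platform_device,
                                                        *    or mdev_device */
  };

vfio-core only ever sees the opaque device_data pointer, so there is no generic place to hang per-device attributes without each bus driver's cooperation.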
So I was ready to return and suggest that maybe libvirt should probe the device to know about these ancillary configuration details, but then I remembered that both mdev vGPU vendors had external dependencies to even allow probing the device. KVMGT will fail to open the device if it's not associated with an instance of KVM and NVIDIA vGPU, I believe, will fail if the vGPU manager process cannot find the QEMU instance to extract the VM UUID. (Both of these were bad ideas)
Here's another proposal that's really growing on me:
* Fix the vendor drivers! Allow devices to be opened and probed without these external dependencies. * Libvirt uses the existing vfio API to open the device and probe the necessary ioctls, if it can't probe the device, the feature is unavailable, ie. display=off, no migration.
I'm trying to think simpler mechanism using sysfs that could work for any feature and knowing source-destination migration compatibility check by libvirt before initiating migration.
I have another proposal:
* Add an ioctl VFIO_DEVICE_PROBE_FEATURES

    struct vfio_device_features {
        __u32 argsz;
        __u32 features;
    };

Define a bit for each feature:

    #define VFIO_DEVICE_FEATURE_DISPLAY_REGION (1 << 0)
    #define VFIO_DEVICE_FEATURE_DISPLAY_DMABUF (1 << 1)
    #define VFIO_DEVICE_FEATURE_MIGRATION      (1 << 2)
* Vendor driver returns bitmask of supported features during initialization phase.
* In vfio core module, trap this ioctl for each device in vfio_device_fops_unl_ioctl(),
Whoops, chicken and egg problem: VFIO_GROUP_GET_DEVICE_FD is our blocking point with mdev drivers; we can't get a device fd, so we can't call an ioctl on the device fd.
check features bitmask returned by vendor driver and add a sysfs file if feature is supported that device. This sysfs file would return 0/1.
I don't understand why we have an ioctl interface: if the user can get to the device fd then we have existing interfaces to probe these things. It seems like you're just wanting to pass a features bitmap through to vfio_add_group_dev() that vfio-core would expose through sysfs, but a list of feature bits doesn't convey enough info except for the most basic uses.
For migration this bit will only indicate if host driver supports migration feature.
For source and destination compatibility check libvirt would need more data/variables to check like, * if same type of 'mdev_type' device create-able at destination, i.e. if ('mdev_type'->available_instances > 0)
* if host_driver_version at source and destination are compatible. Host driver from same release branch should be mostly compatible, but if there are major changes in structures or APIs, host drivers from different branches might not be compatible, for example, if source and destination are from different branches and one of the structure had changed, then data collected at source might not be compatible with structures at destination and typecasting it to changed structures would mess up migrated data during restoration.
Of course now you're asking that libvirt understand the release versioning scheme of every vendor driver and that it remain programmatically consistent. We can't even do this with in-kernel drivers. And in the end, still the best we can do is guess.
* if guest_driver_version is compatible with host driver at destination. For mdev devices, guest driver communicates with host driver in some form. If there are changes in structures/APIs of such communication, guest driver at source might not be compatible with host driver at destination.
And another guess, plus now the guest driver is involved, which libvirt has no visibility into.
'available_instances' sysfs already exist, later two should be added by vendor driver which libvirt can use for migration compatibility check.
As noted previously, display and migration are not necessarily mdev-only features; it's possible that vfio-pci or vfio-platform could also implement these, so the sysfs interface cannot be restricted to the mdev template and lifecycle interface.

One more try... we have a vfio_group fd. This is created by the bus drivers calling vfio_add_group_dev() and registers a struct device, a struct vfio_device_ops, and private data. Typically we only wire the device_ops to the resulting file descriptor we get from VFIO_GROUP_GET_DEVICE_FD, but could we enable sort of a nested ioctl through the group fd? The ioctl would need to take a string arg to match to a device name, plus an ioctl cmd and arg for the device_ops ioctl. The group ioctl would need to filter cmds to known, benign queries. We'd also need to verify that the allowed ioctls have no dependencies on setup done in device_ops.open(). *_INFO and QUERY_GFX_PLANE ioctls would be the only candidates.

Bus drivers could of course keep an open count in their private data so they know how the ioctl is being called (if necessary), and the group fd only allows a single open, so there's no risk that another user could interact with the group in bad ways once the device is opened (and of course we use file level access control on the group device file anyway). This is sort of a rethink of Paolo's suggestion of a read-only fd, but the fd is the existing group fd and any references to the device would only be held around the calling of the nested ioctl. Could it work? Thanks,

Alex
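For illustration, such a nested query might look something like this at the uAPI level (purely hypothetical; the struct and its name are invented here):

  /* hypothetical: called on the *group* fd, forwards one whitelisted
   * query to the named device without VFIO_GROUP_GET_DEVICE_FD */
  struct vfio_group_device_query {
          __u32 argsz;
          __u32 flags;
          __u32 cmd;       /* e.g. VFIO_DEVICE_GET_INFO,
                            * VFIO_DEVICE_QUERY_GFX_PLANE */
          __u64 arg;       /* user pointer to cmd's own argument struct */
          __u8  device[];  /* device name to match, e.g. the mdev UUID */
  };

vfio-core would look up the named device within the group, filter the cmd against the benign whitelist, and forward it to the bus driver's device_ops.ioctl() without ever calling open().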

On 4/25/2018 4:29 AM, Alex Williamson wrote:
On Wed, 25 Apr 2018 01:20:08 +0530 Kirti Wankhede <kwankhede@nvidia.com> wrote:
On 4/24/2018 3:10 AM, Alex Williamson wrote:
On Wed, 18 Apr 2018 12:31:53 -0600 Alex Williamson <alex.williamson@redhat.com> wrote:
On Mon, 9 Apr 2018 12:35:10 +0200 Gerd Hoffmann <kraxel@redhat.com> wrote:
This little series adds three drivers, for demo-ing and testing vfio display interface code. There is one mdev device for each interface type (mdpy.ko for region and mbochs.ko for dmabuf).
Erik Skultety brought up a good question today regarding how libvirt is meant to handle these different flavors of display interfaces and knowing whether a given mdev device has display support at all. It seems that we cannot simply use the default display=auto because libvirt needs to specifically configure gl support for a dmabuf type interface versus not having such a requirement for a region interface, perhaps even removing the emulated graphics in some cases (though I don't think we have boot graphics through either solution yet). Additionally, GVT-g seems to need the x-igd-opregion support enabled(?), which is a non-starter for libvirt as it's an experimental option!
Currently the only way to determine display support is through the VFIO_DEVICE_QUERY_GFX_PLANE ioctl, but for libvirt to probe that on their own they'd need to get to the point where they could open the vfio device and perform the ioctl. That means opening a vfio container, adding the group, setting the iommu type, and getting the device. I was initially a bit appalled at asking libvirt to do that, but the alternative is to put this information in sysfs, but doing that we risk that we need to describe every nuance of the mdev device through sysfs and it becomes a dumping ground for every possible feature an mdev device might have.
One or two sysfs file for each feature shouldn't be that much of over head? In kernel, other subsystem modules expose capability through sysfs, like PCI subsystem adds 'boot_vga' file for VGA device which returns 0/1 depending on if its boot VGA device. Similarly 'd3cold_allowed', 'msi_bus'...
Obviously we could add sysfs files, but unlike properties that the PCI core exposes about struct pci_dev fields, the idea of a vfio_device is much more abstract. Each bus driver creates its own device representation, so we have a top level vfio_device referencing through an opaque pointer a vfio_pci_device, vfio_platform_device, or mdev_device, and each mdev vendor driver creates its own private data structure below the mdev_device. So it's not quite a simple as one new attribute "show" function to handle all devices of that bus_type. We need a consistent implementation in each bus driver and vendor driver or we need to figure out how to percolate the information up to the vfio core. Your idea below seems to take the percolate approach.
So I was ready to return and suggest that maybe libvirt should probe the device to know about these ancillary configuration details, but then I remembered that both mdev vGPU vendors had external dependencies to even allow probing the device. KVMGT will fail to open the device if it's not associated with an instance of KVM and NVIDIA vGPU, I believe, will fail if the vGPU manager process cannot find the QEMU instance to extract the VM UUID. (Both of these were bad ideas)
Here's another proposal that's really growing on me:
* Fix the vendor drivers! Allow devices to be opened and probed without these external dependencies. * Libvirt uses the existing vfio API to open the device and probe the necessary ioctls, if it can't probe the device, the feature is unavailable, ie. display=off, no migration.
I'm trying to think simpler mechanism using sysfs that could work for any feature and knowing source-destination migration compatibility check by libvirt before initiating migration.
I have another proposal:
* Add an ioctl VFIO_DEVICE_PROBE_FEATURES

    struct vfio_device_features {
        __u32 argsz;
        __u32 features;
    };

Define a bit for each feature:

    #define VFIO_DEVICE_FEATURE_DISPLAY_REGION (1 << 0)
    #define VFIO_DEVICE_FEATURE_DISPLAY_DMABUF (1 << 1)
    #define VFIO_DEVICE_FEATURE_MIGRATION      (1 << 2)
* Vendor driver returns bitmask of supported features during initialization phase.
* In vfio core module, trap this ioctl for each device in vfio_device_fops_unl_ioctl(),
Whoops, chicken and egg problem, VFIO_GROUP_GET_DEVICE_FD is our blocking point with mdev drivers, we can't get a device fd, so we can't call an ioctl on the device fd.
I'm sorry, I thought we could expose features when QEMU initializes, but libvirt needs to know the supported features before QEMU initializes.
check features bitmask returned by vendor driver and add a sysfs file if feature is supported that device. This sysfs file would return 0/1.
I don't understand why we have an ioctl interface, if the user can get to the device fd then we have existing interfaces to probe these things, it seems like you're just wanting to pass a features bitmap through to vfio_add_group_dev() that vfio-core would expose through sysfs, but a list of feature bits doesn't convey enough info except for the most basic uses.
Yes, vfio_add_group_dev() seems to be a better way to convey features to the vfio core.
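Something like this, perhaps (the first prototype is the existing entry point from include/linux/vfio.h; the second is only a sketch of the proposed extension, not an existing API):

  /* existing: */
  int vfio_add_group_dev(struct device *dev,
                         const struct vfio_device_ops *ops,
                         void *device_data);

  /* hypothetical variant letting vfio-core expose features via sysfs: */
  int vfio_add_group_dev_features(struct device *dev,
                                  const struct vfio_device_ops *ops,
                                  void *device_data, u32 features);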
For migration this bit will only indicate if host driver supports migration feature.
For source and destination compatibility check libvirt would need more data/variables to check like, * if same type of 'mdev_type' device create-able at destination, i.e. if ('mdev_type'->available_instances > 0)
* if host_driver_version at source and destination are compatible. Host driver from same release branch should be mostly compatible, but if there are major changes in structures or APIs, host drivers from different branches might not be compatible, for example, if source and destination are from different branches and one of the structure had changed, then data collected at source might not be compatible with structures at destination and typecasting it to changed structures would mess up migrated data during restoration.
Of course now you're asking that libvirt understand the release versioning scheme of every vendor driver and that it remain programatically consistent. We can't even do this with in-kernel drivers. And in the end, still the best we can do is guess.
Libvirt doesn't need to understand the version; libvirt just needs to strcmp the version strings from source and destination. If they are equal, libvirt would consider them compatible.
* if guest_driver_version is compatible with host driver at destination. For mdev devices, guest driver communicates with host driver in some form. If there are changes in structures/APIs of such communication, guest driver at source might not be compatible with host driver at destination.
And another guess plus now the guest driver is involved which libvirt has no visibility to.
Like above, libvirt needs to do a strcmp.
'available_instances' sysfs already exist, later two should be added by vendor driver which libvirt can use for migration compatibility check.
As noted previously, display and migration are not necessarily mdev-only features, it's possible that vfio-pci or vfio-platform could also implement these, so the sysfs interface cannot be restricted to the mdev template and lifecycle interface.
I agree. The feature bitmask passed to the vfio core is not mdev specific. But here 'available_instances' for the migration compatibility check is mdev specific. If the mdev device is not create-able at the destination, there is no point in libvirt initiating the migration.
One more try... we have a vfio_group fd. This is created by the bus drivers calling vfio_add_group_dev() and registers a struct device, a struct vfio_device_ops, and private data. Typically we only wire the device_ops to the resulting file descriptor we get from VFIO_GROUP_GET_DEVICE_FD, but could we enable sort of a nested ioctl through the group fd? The ioctl would need to take a string arg to match to a device name, plus an ioctl cmd and arg for the device_ops ioctl. The group ioctl would need to filter cmds to known, benign queries. We'd also need to verify that the allowed ioctls have no dependencies on setup done in device_ops.open().
So these ioctls would be called without the device's open() call; doesn't this seem to go against the standard file-operations model? Thanks, Kirti
*_INFO and QUERY_GFX_PLANE ioctls would be the only candidates. Bus drivers could of course keep an open count in their private data so they know how the ioctl is being called (if necessary) and the group fd only allows a single open, so there's no risk that another user could interact with the group in bad ways once the device is opened (and of course we use file level access control on the group device file anyway). This is sort of a rethink of Paolo's suggestion of a read-only fd, but the fd is the existing group fd and any references to the device would only be held around the calling of the nested ioctl. Could it work? Thanks,

On Wed, 25 Apr 2018 21:00:39 +0530 Kirti Wankhede <kwankhede@nvidia.com> wrote:
On 4/25/2018 4:29 AM, Alex Williamson wrote:
On Wed, 25 Apr 2018 01:20:08 +0530 Kirti Wankhede <kwankhede@nvidia.com> wrote:
On 4/24/2018 3:10 AM, Alex Williamson wrote:
On Wed, 18 Apr 2018 12:31:53 -0600 Alex Williamson <alex.williamson@redhat.com> wrote:
On Mon, 9 Apr 2018 12:35:10 +0200 Gerd Hoffmann <kraxel@redhat.com> wrote:
This little series adds three drivers, for demo-ing and testing vfio display interface code. There is one mdev device for each interface type (mdpy.ko for region and mbochs.ko for dmabuf).
Erik Skultety brought up a good question today regarding how libvirt is meant to handle these different flavors of display interfaces and knowing whether a given mdev device has display support at all. It seems that we cannot simply use the default display=auto because libvirt needs to specifically configure gl support for a dmabuf type interface versus not having such a requirement for a region interface, perhaps even removing the emulated graphics in some cases (though I don't think we have boot graphics through either solution yet). Additionally, GVT-g seems to need the x-igd-opregion support enabled(?), which is a non-starter for libvirt as it's an experimental option!
Currently the only way to determine display support is through the VFIO_DEVICE_QUERY_GFX_PLANE ioctl, but for libvirt to probe that on their own they'd need to get to the point where they could open the vfio device and perform the ioctl. That means opening a vfio container, adding the group, setting the iommu type, and getting the device. I was initially a bit appalled at asking libvirt to do that, but the alternative is to put this information in sysfs, but doing that we risk that we need to describe every nuance of the mdev device through sysfs and it becomes a dumping ground for every possible feature an mdev device might have. ... So I was ready to return and suggest that maybe libvirt should probe the device to know about these ancillary configuration details, but then I remembered that both mdev vGPU vendors had external dependencies to even allow probing the device. KVMGT will fail to open the device if it's not associated with an instance of KVM and NVIDIA vGPU, I believe, will fail if the vGPU manager process cannot find the QEMU instance to extract the VM UUID. (Both of these were bad ideas)
Here's another proposal that's really growing on me:
* Fix the vendor drivers! Allow devices to be opened and probed without these external dependencies. * Libvirt uses the existing vfio API to open the device and probe the necessary ioctls, if it can't probe the device, the feature is unavailable, ie. display=off, no migration.
I'm trying to think simpler mechanism using sysfs that could work for any feature and knowing source-destination migration compatibility check by libvirt before initiating migration.
I have another proposal:
* Add an ioctl VFIO_DEVICE_PROBE_FEATURES

    struct vfio_device_features {
        __u32 argsz;
        __u32 features;
    };

Define a bit for each feature:

    #define VFIO_DEVICE_FEATURE_DISPLAY_REGION (1 << 0)
    #define VFIO_DEVICE_FEATURE_DISPLAY_DMABUF (1 << 1)
    #define VFIO_DEVICE_FEATURE_MIGRATION      (1 << 2)
* Vendor driver returns bitmask of supported features during initialization phase.
* In vfio core module, trap this ioctl for each device in vfio_device_fops_unl_ioctl(),
Whoops, chicken and egg problem, VFIO_GROUP_GET_DEVICE_FD is our blocking point with mdev drivers, we can't get a device fd, so we can't call an ioctl on the device fd.
I'm sorry, I thought we could expose features when QEMU initialize, but libvirt needs to know supported features before QEMU initialize.
check features bitmask returned by vendor driver and add a sysfs file if feature is supported that device. This sysfs file would return 0/1.
I don't understand why we have an ioctl interface, if the user can get to the device fd then we have existing interfaces to probe these things, it seems like you're just wanting to pass a features bitmap through to vfio_add_group_dev() that vfio-core would expose through sysfs, but a list of feature bits doesn't convey enough info except for the most basic uses.
Yes, vfio_add_group_dev() seems to be better way to convey features to vfio core.
For migration this bit will only indicate if host driver supports migration feature.
For source and destination compatibility check libvirt would need more data/variables to check like, * if same type of 'mdev_type' device create-able at destination, i.e. if ('mdev_type'->available_instances > 0)
* if host_driver_version at source and destination are compatible. Host driver from same release branch should be mostly compatible, but if there are major changes in structures or APIs, host drivers from different branches might not be compatible, for example, if source and destination are from different branches and one of the structure had changed, then data collected at source might not be compatible with structures at destination and typecasting it to changed structures would mess up migrated data during restoration.
Of course now you're asking that libvirt understand the release versioning scheme of every vendor driver and that it remain programatically consistent. We can't even do this with in-kernel drivers. And in the end, still the best we can do is guess.
Libvirt doesn't need to understand the version, libvirt need to do strcmp version string from source and destination. If those are equal, then libvirt would understand that they are compatible.
Who's to say that the driver version and migration compatibility have any relation at all? Some drivers might focus on designing their own migration interface that can maintain compatibility across versions (QEMU does this), some drivers may only allow identical-version migration (which is going to frustrate upper level management tools and customers - RHEL goes to great lengths to support cross-version migration). We cannot have a one-size-fits-all here where the driver version completely defines migration compatibility.
* if guest_driver_version is compatible with host driver at destination. For mdev devices, guest driver communicates with host driver in some form. If there are changes in structures/APIs of such communication, guest driver at source might not be compatible with host driver at destination.
And another guess plus now the guest driver is involved which libvirt has no visibility to.
Like above libvirt need to do strcmp.
Insufficient, imo
'available_instances' sysfs already exist, later two should be added by vendor driver which libvirt can use for migration compatibility check.
As noted previously, display and migration are not necessarily mdev-only features, it's possible that vfio-pci or vfio-platform could also implement these, so the sysfs interface cannot be restricted to the mdev template and lifecycle interface.
I agree. Feature bitmask passed to vfio core is not mdev specific. But here 'available_instances' for migration compatibility check is mdev specific. If mdev device is not create-able at destination, there is no point in initiating migration by libvirt.
'available_instances' for migration compatibility check...? We use available_instances to know whether we have the resources to create a given mdev type. It's certainly a prerequisite to have a device of the identical type at the migration target, and how we define what is an identical device for a directly assigned PCI device is yet another overly complicated rat hole. But an identical device doesn't necessarily imply migration compatibility, and I think that's the problem we're tackling. We cannot assume based only on the device type that migration is compatible; that's basically saying we're never going to have any bugs or oversights or new features in the migration stream.

Chatting with Laine, it may be worth a step back to include migration experts and people up the stack with more visibility to how openstack operates. The issue here is that if vfio gains migration support then we have a portion of the migration stream that is not under the control of QEMU; we cannot necessarily tie it to a QEMU machine type and we cannot necessarily dictate how the vfio bus driver (vendor driver) handles versioning and compatibility. My intent was to expose some sort of migration information through the vfio API so that upper level tools could determine source and target compatibility, but this in itself is I think something new that those tools need to agree how it might be done. How would something like openstack want to handle not only finding a migration target with a compatible device, but also verifying if the device supports the migration format of the source device?

Alternatively, should we do anything? Is the problem too hard and we should let the driver return an error when it receives an incompatible migration stream, aborting the migration?
One more try... we have a vfio_group fd. This is created by the bus drivers calling vfio_add_group_dev() and registers a struct device, a struct vfio_device_ops, and private data. Typically we only wire the device_ops to the resulting file descriptor we get from VFIO_GROUP_GET_DEVICE_FD, but could we enable sort of a nested ioctl through the group fd? The ioctl would need to take a string arg to match to a device name, plus an ioctl cmd and arg for the device_ops ioctl. The group ioctl would need to filter cmds to known, benign queries. We'd also need to verify that the allowed ioctls have no dependencies on setup done in device_ops.open().
So these ioctls would be called without devices open() call, doesn't this seem to be against file operations standard?
vfio_device_ops is modeled largely after file operations, but I don't think we're bound by that for the interaction between vfio-core and the vfio bus drivers. We could make a separate callback for unprivileged ioctls, but that seems like more work per driver when we really want to maintain the identical API; we just want to provide a more limited interface and change the calling point.

An issue I thought of for migration though is that this path wouldn't have access to the migration region, and therefore if we place a header within that region containing the compatibility and versioning information, the user still couldn't access it. This doesn't seem to be a blocker though, as we could put that information within the region capability that defines the region as used for migration. Possibly a device could have multiple migration regions with different formats for backwards compatibility; of course then we'd need a way to determine which to use and which combinations have been validated. Thanks, Alex
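Something along these lines, perhaps (a sketch modeled on the existing vfio_info_cap chain; the migration capability itself is hypothetical):

  /* hypothetical region capability carrying migration compatibility
   * info, reported via VFIO_DEVICE_GET_REGION_INFO */
  struct vfio_region_info_cap_migration {
          struct vfio_info_cap_header header;  /* existing cap header */
          __u32 stream_version;   /* vendor migration stream format */
          __u32 flags;
          /* plus whatever else compatibility checks require */
  };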

* Alex Williamson (alex.williamson@redhat.com) wrote:
On Wed, 25 Apr 2018 21:00:39 +0530 Kirti Wankhede <kwankhede@nvidia.com> wrote:
On 4/25/2018 4:29 AM, Alex Williamson wrote:
On Wed, 25 Apr 2018 01:20:08 +0530 Kirti Wankhede <kwankhede@nvidia.com> wrote:
On 4/24/2018 3:10 AM, Alex Williamson wrote:
On Wed, 18 Apr 2018 12:31:53 -0600 Alex Williamson <alex.williamson@redhat.com> wrote:
Here's another proposal that's really growing on me:
* Fix the vendor drivers! Allow devices to be opened and probed without these external dependencies.
* Libvirt uses the existing vfio API to open the device and probe the necessary ioctls; if it can't probe the device, the feature is unavailable, ie. display=off, no migration.
I'm trying to think of a simpler mechanism, using sysfs, that could work for any feature and would let libvirt check source-destination migration compatibility before initiating a migration.
I have another proposal:
* Add an ioctl VFIO_DEVICE_PROBE_FEATURES

  struct vfio_device_features {
          __u32 argsz;
          __u32 features;
  };

Define a bit for each feature:

  #define VFIO_DEVICE_FEATURE_DISPLAY_REGION (1 << 0)
  #define VFIO_DEVICE_FEATURE_DISPLAY_DMABUF (1 << 1)
  #define VFIO_DEVICE_FEATURE_MIGRATION      (1 << 2)
* The vendor driver returns a bitmask of supported features during its initialization phase.
* In the vfio core module, trap this ioctl for each device in vfio_device_fops_unl_ioctl(),
Whoops, chicken and egg problem: VFIO_GROUP_GET_DEVICE_FD is our blocking point with mdev drivers; we can't get a device fd, so we can't call an ioctl on the device fd.
I'm sorry, I thought we could expose features when QEMU initializes, but libvirt needs to know the supported features before QEMU initializes.
check the features bitmask returned by the vendor driver and add a sysfs file if the feature is supported by that device. This sysfs file would return 0/1.
I don't understand why we have an ioctl interface: if the user can get to the device fd then we have existing interfaces to probe these things. It seems like you're just wanting to pass a features bitmap through to vfio_add_group_dev() that vfio-core would expose through sysfs, but a list of feature bits doesn't convey enough info except for the most basic uses.
Yes, vfio_add_group_dev() seems to be a better way to convey features to the vfio core.
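[A sketch of that direction — hypothetical: today's vfio_add_group_dev() takes no feature argument, the 'features' field doesn't exist on struct vfio_device, and the sysfs plumbing is abbreviated:]

/* Hypothetical feature bits a vfio bus driver would pass at registration. */
#define VFIO_DEV_FEAT_DISPLAY_REGION    (1 << 0)
#define VFIO_DEV_FEAT_DISPLAY_DMABUF    (1 << 1)
#define VFIO_DEV_FEAT_MIGRATION         (1 << 2)

/* Extended registration: vfio-core records 'features' on the vfio_device
 * and creates one read-only sysfs attribute per set bit. */
int vfio_add_group_dev(struct device *dev, const struct vfio_device_ops *ops,
                       void *device_data, unsigned long features);

/* vfio-core side: each attribute reports 0/1, as suggested above. */
static ssize_t migration_show(struct device *dev,
                              struct device_attribute *attr, char *buf)
{
        struct vfio_device *vdev = dev_get_drvdata(dev);  /* simplified lookup */

        return sprintf(buf, "%d\n",
                       !!(vdev->features & VFIO_DEV_FEAT_MIGRATION));
}
static DEVICE_ATTR_RO(migration);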
For migration, this bit would only indicate whether the host driver supports the migration feature.
For the source-destination compatibility check, libvirt would need more data to verify, such as:
* whether the same 'mdev_type' is create-able at the destination, i.e. whether 'mdev_type'->available_instances > 0 (see the sketch below)
* whether host_driver_version at the source and destination are compatible. Host drivers from the same release branch should mostly be compatible, but if there are major changes in structures or APIs, host drivers from different branches might not be; for example, if the source and destination are from different branches and one of the structures has changed, then data collected at the source might not match the structures at the destination, and casting it to the changed structures would corrupt the migrated data during restoration.
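[As an aside, a minimal userspace sketch of the first check, assuming the documented mdev sysfs layout; the parent address and type name below are only examples:]

#include <stdio.h>

/* Read e.g.
 * /sys/class/mdev_bus/0000:00:02.0/mdev_supported_types/i915-GVTg_V5_4/available_instances
 * and return the count, or -1 on error. A migration target would need this
 * to be > 0 for the source device's type. */
static int mdev_available_instances(const char *parent, const char *type)
{
        char path[256];
        FILE *f;
        int n = -1;

        snprintf(path, sizeof(path),
                 "/sys/class/mdev_bus/%s/mdev_supported_types/%s/available_instances",
                 parent, type);
        f = fopen(path, "r");
        if (!f)
                return -1;
        if (fscanf(f, "%d", &n) != 1)
                n = -1;
        fclose(f);
        return n;
}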
Of course now you're asking that libvirt understand the release versioning scheme of every vendor driver and that it remain programmatically consistent. We can't even do this with in-kernel drivers. And in the end, still the best we can do is guess.
Libvirt doesn't need to understand the version; it only needs to strcmp() the version strings from the source and destination. If those are equal, libvirt would treat them as compatible.
Who's to say that the driver version and migration compatibility have any relation at all? Some drivers might focus on designing their own migration interface that can maintain compatibility across versions (QEMU does this), some drivers may only allow identical-version migration (which is going to frustrate upper level management tools and customers - RHEL goes to great lengths to support cross-version migration). We cannot have a one-size-fits-all rule here where the driver version completely defines migration compatibility.
I'll agree; I don't know enough about these devices, but to give you some examples of things I'd expect to work:
a) User adds new machines to their data centre with a larger/newer version of the same vendor's GPU; in some cases that should work (depending on vendor details etc.)
b) The same thing but with identical hardware and a newer driver on the destination.
Obviously there will be some cut-offs that say some versions are incompatible; but for normal migration we jump through serious hoops to make sure stuff works; customers will expect the same with some VFIO devices.
* whether guest_driver_version is compatible with the host driver at the destination. For mdev devices, the guest driver communicates with the host driver in some form; if the structures/APIs of that communication change, the guest driver at the source might not be compatible with the host driver at the destination.
And another guess, plus now the guest driver is involved, to which libvirt has no visibility.
As above, libvirt only needs to do a strcmp.
Insufficient, imo
The 'available_instances' sysfs attribute already exists; the latter two should be added by the vendor driver so libvirt can use them for the migration compatibility check.
As noted previously, display and migration are not necessarily mdev-only features, it's possible that vfio-pci or vfio-platform could also implement these, so the sysfs interface cannot be restricted to the mdev template and lifecycle interface.
'available_instances' for migration compatibility check...? We use available_instances to know whether we have the resources to create a given mdev type. It's certainly a prerequisite to have a device of the identical type at the migration target and how we define what is an identical device for a directly assigned PCI device is yet another overly complicated rat hole. But an identical device doesn't necessarily imply migration compatibility and I think that's the problem we're tackling. We cannot assume based only on the device type that migration is compatible, that's basically saying we're never going to have any bugs or oversights or new features in the migration stream.
Those things certainly happen; state that we forgot to transfer, new features enabled on devices, devices configured in different ways.
Alternatively, should we do anything? Is the problem too hard and we should let the driver return an error when it receives an incompatible migration stream, aborting the migration?
It's a bit nasty; if you've hit the 'evacuate host' button then what happens when you've got some incompatible hosts. Dave
-- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

On 4/26/2018 1:22 AM, Dr. David Alan Gilbert wrote:
I'll agree; I don't know enough about these devices, but to give you some examples of things I'd expect to work:
a) User adds new machines to their data centre with a larger/newer version of the same vendor's GPU; in some cases that should work (depending on vendor details etc.)
b) The same thing but with identical hardware and a newer driver on the destination.
Obviously there will be some cut-offs that say some versions are incompatible; but for normal migration we jump through serious hoops to make sure stuff works; customers will expect the same with some VFIO devices.
How does libvirt check the cut-off where some versions are incompatible?
Those things certainly happen; state that we forgot to transfer, new features enabled on devices, devices configured in different ways.
How does libvirt check migration compatibility for other devices across QEMU versions, where the source supports a device and the destination, running an older QEMU version, doesn't support that device or doesn't have it at all? Thanks, Kirti

* Kirti Wankhede (kwankhede@nvidia.com) wrote:
How does libvirt check the cut-off where some versions are incompatible?
We have versioned 'machine types' - so for example QEMU has pc-i440fx-2.11 and pc-i440fx-2.10 machine types; any version of qemu that supports machine type pc-i440fx-2.10 should behave the same to its emulated devices. If we change the behaviour then we tie it to a new machine type, so the behaviour of a device in pc-i440fx-2.11 might be a bit different. Occasionally we'll kill off old machine types (actually we should do it more!) - and certainly when we do downstream versions we tie those to machine types as well. We also have some migration-capability flags, so some features can only be used if both sides have the flag, and libvirt also has some checking of host CPU flags.
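[For illustration, this is roughly how QEMU ties device behaviour to a versioned machine type via compat properties. The "acme-dev" device and its property are invented, and the helper names vary across QEMU versions, but the GlobalProperty mechanism is the real one:]

/* Guests started with pc-i440fx-2.10 keep the 2.10-era device behaviour
 * even on newer QEMU binaries, so the migration stream stays compatible. */
static GlobalProperty compat_props_2_10[] = {
    { "acme-dev", "new-ring-layout", "off" },   /* invented example */
};

static void pc_i440fx_2_10_machine_options(MachineClass *m)
{
    pc_i440fx_2_11_machine_options(m);          /* inherit newer defaults */
    compat_props_add(m->compat_props, compat_props_2_10,
                     G_N_ELEMENTS(compat_props_2_10));
}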
How does libvirt check migration compatibility for other devices across QEMU versions, where the source supports a device and the destination, running an older QEMU version, doesn't support that device or doesn't have it at all?
Libvirt inspects the qemu binary to get lists of devices and capabilities; I'll leave it to the libvirt guys to add more detail if needed. Dave
-- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

On Thu, 26 Apr 2018 19:55:23 +0100 "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
We also have some migration-capability flags, so some features can only be used if both sides have the flag, and libvirt also has some checking of host CPU flags.
I think this sort of host compatibility checking for CPU flags is the part where we need some libvirt input on how they'd like to extend this for device compatibility. A complication here is whether it's reasonable for libvirt to collect migration compatibility data except for the actual target device. For instance, if the user model is to create mdev devices on demand, the vendor driver might be upgraded between system startup and migration, I don't think we can assume the migration information remains static or is necessarily the same for each mdev type provided by the vendor driver, or maybe for each parent device. Is it possible that libvirt would evaluate a migration target device to this extent immediately before the migration? How would openstack handle managing a datacenter with such a model?
How does libvirt check migration compatibility for other devices across QEMU versions, where the source supports a device and the destination, running an older QEMU version, doesn't support that device or doesn't have it at all?
Libvirt inspects the qemu binary to get lists of devices and capabilities; I'll leave it to the libvirt guys to add more detail if needed.
Right, so do we need a way to invoke QEMU with a device to report the migration capabilities of that device? To this point, I think the migration viability of a target system has been entirely encompassed within QEMU's ability to support the versioned machine type and the compatibility of CPU flags; devices have not been considered, as their compatibility is guaranteed within a machine type and version. Thanks, Alex

Hi,

The previous discussion hasn't produced results, so let's start over. Here's the situation:

- We currently have kernel and QEMU support for the QEMU vfio-pci display option.
- The default for this option is 'auto', so the device will attempt to generate a display if the underlying device supports it, currently only GVTg and some future release of NVIDIA vGPU (plus Gerd's sample mdpy and mbochs).
- The display option is implemented via two different mechanisms, a vfio region (NVIDIA, mdpy) or a dma-buf (GVTg, mbochs).
- Displays using dma-buf require OpenGL support, displays making use of region support do not.
- Enabling OpenGL support requires specific VM configurations, which libvirt /may/ want to facilitate.
- Probing display support for a given device is complicated by the fact that GVTg and NVIDIA both impose requirements on the process opening the device file descriptor through the vfio API:
  - GVTg requires a KVM association or will fail to allow the device to be opened.
  - NVIDIA requires that their vgpu-manager process can locate a UUID for the VM via the process commandline.
  - These are both horrible impositions and prevent libvirt from simply probing the device itself.

The above has pressed the need for investigating some sort of alternative API through which libvirt might introspect a vfio device, and with vfio device migration on the horizon, it's natural that some sort of support for migration state compatibility for the device need be considered as a second user of such an API. However, we currently have no concept of migration compatibility on a per-device level, as there are no migratable devices that live outside of the QEMU code base. It's therefore assumed that per-device migration compatibility is encompassed by the versioned machine type for the overall VM. We need participation all the way to the top of the VM management stack to resolve this issue, and it's dragging down the (possibly) more simple question of how we resolve the display situation. Therefore I'm looking for alternatives for display that work within what we have available to us at the moment.

Erik Skultety, who initially raised the display question, has identified one possible solution, which is to simply make the display configuration the user's problem (apologies if I've misinterpreted Erik). I believe this would work something like:

- libvirt identifies a version of QEMU that includes 'display' support for vfio-pci devices and defaults to adding display=off for every vfio-pci device [have we chosen the wrong default (auto) in QEMU?].
- New XML support would allow a user to enable display support on the vfio device.
- Resolving any OpenGL dependencies of that change would be left to the user.

A nice aspect of this is that policy decisions are left to the user and clearly no interface changes are necessary, perhaps with the exception of deciding whether we've made the wrong default choice for vfio-pci devices in QEMU.

On the other hand, if we do want to give libvirt a mechanism to probe the display support for a device, we can make a simplified QEMU instance be the mechanism through which we do that. For example the script[1] can be provided with either a PCI device or sysfs path to an mdev device and run a minimal VM instance meeting the requirements of both GVTg and NVIDIA to report the display support and GL requirements for a device.
There are clearly some unrefined and atrocious bits of this script, but it's only a proof of concept; the process management can be improved and we can decide whether we want to provide a qmp mechanism to introspect the device rather than grep'ing error messages. The goal is simply to show that we could choose to embrace QEMU and use it not as a VM, but simply as a tool for poking at a device given the restrictions the mdev vendor drivers have already imposed.

So I think the question bounces back to libvirt: does libvirt want enough information about the display requirements for a given device to automatically attempt to add GL support for it, effectively a policy of 'if it's supported try to enable it', or should we leave well enough alone and let the user choose to enable it?

Maybe some guiding questions:

- Will dma-buf always require GL support?
- Does GL support limit our ability to have a display over a remote connection?
- Do region-based displays also work with GL support, even if not required?

Furthermore, should QEMU vfio-pci flip the default to 'off' for compatibility? Thanks,

Alex

[1] https://gist.github.com/awilliam/2ccd31e85923ac8135694a7db2306646
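[For a sense of the shape of such a probe - illustrative only, the actual script may differ - it's a throwaway QEMU run that satisfies both vendors' constraints (KVM for GVTg, a VM UUID on the command line for NVIDIA) without booting a guest, e.g.:

  qemu-system-x86_64 -nodefaults -display none -S \
      -machine pc,accel=kvm -uuid "$(uuidgen)" \
      -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/<mdev uuid>,display=auto

then tear the process down and inspect the error output (or, better, QMP) to see whether display=auto succeeded and whether GL is required.]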

On Thu, May 03, 2018 at 12:58:00PM -0600, Alex Williamson wrote:
- GVTg requires a KVM association or will fail to allow the device to be opened.
How exactly is this association checked?
- NVIDIA requires that their vgpu-manager process can locate a UUID for the VM via the process commandline.
- These are both horrible impositions and prevent libvirt from simply probing the device itself.
So I feel like we're trying to solve a problem coming from one layer on a bunch of different layers, which inherently prevents us from producing a viable long-term solution without dragging in a significant amount of hacky, nasty code - and it is not the missing sysfs attributes I have in mind.

Why does NVIDIA's vgpu-manager need to locate a UUID of a qemu VM? I assume that's to prevent multiple VM instances trying to use the same mdev device, in which case can't the vgpu-manager track references to how many "open" and "close" calls have been made to the same device?

This is just from a layman's perspective, but it would allow the following:
- when libvirt starts, it initializes all its drivers (let's focus on QEMU)
- as part of this initialization, libvirt probes QEMU for capabilities and caches them in order to use them when spawning VMs

Now, if we (theoretically) can settle on easing the restrictions Alex has mentioned, we in fact could introduce a QMP command to probe these devices and provide libvirt with useful information at that point in time. Of course, since the 3rd party vendor is "de-coupled" from qemu, libvirt would have no way to find out that the driver has changed in the meantime, thus still using the old information we gathered, ergo potentially causing the QEMU process to fail eventually. But then again, there's very often a strong recommendation to reboot your host after a driver update, especially in NVIDIA's case, which means this fact wouldn't matter.

However, there's also a significant drawback to my proposal which probably renders it completely useless (but we can continue from there...), and that is that the devices would either have to be present already (not an option) or QEMU would need to be enhanced in a way that it would create a dummy device during QMP probing, open it, collect the information libvirt needs, close it, and remove it. If the driver doesn't change in the meantime, this should be sufficient for a VM to be successfully instantiated with a display, right?
- libvirt identifies a version of QEMU that includes 'display' support for vfio-pci devices and defaults to adding display=off for every vfio-pci device [have we chosen the wrong default (auto) in QEMU?].
From libvirt's POV, having a new XML attribute 'display' on the mdev host device type, with a default value of 'off', should work; we could potentially extend this to 'auto' once we have enough information to base our decision on. We'll need to combine this with a new attribute value for the <video> element that would prevent adding an emulated VGA any time <graphics> (spice, VNC) is requested, but that's something we'd need to do anyway, so I'm just mentioning it.
- New XML support would allow a user to enable display support on the vfio device.
- Resolving any OpenGL dependencies of that change would be left to the user.
A nice aspect of this is that policy decisions are left to the user and clearly no interface changes are necessary, perhaps with the exception of deciding whether we've made the wrong default choice for vfio-pci devices in QEMU.
It's a common practice that we offload decisions like this to users (including the management layer, e.g. openstack, ovirt).
On the other hand, if we do want to give libvirt a mechanism to probe the display support for a device, we can make a simplified QEMU instance be the mechanism through which we do that. For example the script[1] can be provided with either a PCI device or sysfs path to an mdev device and run a minimal VM instance meeting the requirements of both GVTg and NVIDIA to report the display support and GL requirements for a device. There are clearly some unrefined and atrocious bits of this script, but it's only a proof of concept; the process management can be improved and we can decide whether we want to provide a QMP mechanism to introspect the device rather than grep'ing error messages. The goal is simply to show that we could choose to embrace
If nothing else, error messages change, so that's not a viable way; QMP is a much more standardized approach. But then again, as I mentioned above, at the moment libvirt probes for capabilities during its start.
QEMU and use it not as a VM, but simply as a tool for poking at a device given the restrictions the mdev vendor drivers have already imposed.
So I think the question bounces back to libvirt, does libvirt want enough information about the display requirements for a given device to automatically attempt to add GL support for it, effectively a policy of 'if it's supported try to enable it', or should we leave well enough alone and let the user choose to enable it?
Maybe some guiding questions:
- Will dma-buf always require GL support?
- Does GL support limit our ability to have a display over a remote connection?
- Do region-based displays also work with GL support, even if not required?
Yeah, these are IMHO really tough to answer because we can't really predict the future, which again favours a new libvirt attribute more. Even if we decided that we truly need a dummy VM as a tool for libvirt to probe this info, I still feel like this should be done higher up in the virtualization stack and libvirt again would be just a tool to do stuff the way it's told to do it. But I'd very much like to hear Dan's opinion, since besides libvirt he can cover openstack too. Regards, Erik

On Fri, 4 May 2018 09:49:44 +0200 Erik Skultety <eskultet@redhat.com> wrote:
On Thu, May 03, 2018 at 12:58:00PM -0600, Alex Williamson wrote:
Hi,
The previous discussion hasn't produced results, so let's start over. Here's the situation:
- We currently have kernel and QEMU support for the QEMU vfio-pci display option.
- The default for this option is 'auto', so the device will attempt to generate a display if the underlying device supports it, currently only GVTg and some future release of NVIDIA vGPU (plus Gerd's sample mdpy and mbochs).
- The display option is implemented via two different mechanisms, a vfio region (NVIDIA, mdpy) or a dma-buf (GVTg, mbochs).
- Displays using dma-buf require OpenGL support, displays making use of region support do not.
- Enabling OpenGL support requires specific VM configurations, which libvirt /may/ want to facilitate.
- Probing display support for a given device is complicated by the fact that GVTg and NVIDIA both impose requirements on the process opening the device file descriptor through the vfio API:
- GVTg requires a KVM association or will fail to allow the device to be opened.
How exactly is this association checked?
The intel_vgpu_open() callback for the mdev device registers a vfio group notifier for VFIO_GROUP_NOTIFY_SET_KVM events. The KVM pointer is already registered via the addition of the vfio group to the vfio-kvm pseudo device, so the registration synchronously triggers the notifier callback and the result is tested slightly later in the open path in kvmgt_guest_init().
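In code, the association check Alex describes looks roughly like the sketch below. This is a simplified illustration, not the verbatim kvmgt code; the struct layout is a stand-in and the error code is illustrative, but the notifier calls are the real vfio kernel interfaces of that era:

    #include <linux/kernel.h>
    #include <linux/notifier.h>
    #include <linux/mdev.h>
    #include <linux/vfio.h>
    #include <linux/errno.h>

    struct intel_vgpu {                     /* simplified stand-in */
        struct notifier_block group_notifier;
        struct kvm *kvm;
    };

    static int intel_vgpu_group_notifier(struct notifier_block *nb,
                                         unsigned long action, void *data)
    {
        struct intel_vgpu *vgpu =
            container_of(nb, struct intel_vgpu, group_notifier);

        if (action == VFIO_GROUP_NOTIFY_SET_KVM)
            vgpu->kvm = data;               /* NULL on KVM detach */
        return NOTIFY_OK;
    }

    static int intel_vgpu_open(struct mdev_device *mdev)
    {
        struct intel_vgpu *vgpu = mdev_get_drvdata(mdev);
        unsigned long events = VFIO_GROUP_NOTIFY_SET_KVM;

        vgpu->group_notifier.notifier_call = intel_vgpu_group_notifier;
        /* fires synchronously if a KVM was already attached to the
         * group via the vfio-kvm pseudo device */
        vfio_register_notifier(mdev_dev(mdev), VFIO_GROUP_NOTIFY,
                               &events, &vgpu->group_notifier);

        if (!vgpu->kvm)
            return -ESRCH;                  /* no KVM association, open fails */
        return 0;
    }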
- NVIDIA requires that their vgpu-manager process can locate a UUID for the VM via the process commandline.
- These are both horrible impositions and prevent libvirt from simply probing the device itself.
So I feel like we're trying to solve a problem coming from one layer on a bunch of different layers, which inherently prevents us from producing a viable long-term solution without dragging in a significant amount of hacky, nasty code, and it's not the missing sysfs attributes I have in mind. Why does NVIDIA's vgpu-manager need to locate a UUID of a qemu VM? I assume that's to prevent multiple VM instances trying to use the same mdev device, in which case can't the vgpu-manager track references to how many "open" and "close" calls have been made
Hard to say, NVIDIA hasn't been terribly forthcoming about this requirement, but probably not multiple users of the same mdev device as that's already prevented through vfio in general. Intel has discussed that their requirement is to be able to track VM page table updates so they can update their shadow tables, so effectively rather than mediating interactions directly with the device, they're using a KVM back channel to manage the DMA translation address space for the device. The flip side is that while these requirements are annoying and hard for non-VM users to deal with, is there a next logical point in the interaction with the vfio device where the vendor driver can reasonably impose those requirements? For instance, both vendors expose a vfio-pci interface, so they could prevent the user driver from enabling bus master in the PCI command register, but that's a fairly subtle failure: typically drivers wouldn't even bother to read back after a write to the bus master bit to see if it sticks, and this sort of enabling is done by the guest, not the hypervisor. There's really no error path for a write to the device.
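For illustration, the read-back test that drivers typically skip would look something like this through the vfio config region (a minimal sketch, assuming a vfio device fd is already open; register offsets are the standard ones from linux/pci_regs.h):

    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>
    #include <linux/pci_regs.h>

    /* Returns 1 if the bus master bit sticks, 0 if the write was
     * silently dropped by the vendor driver. */
    static int bus_master_sticks(int device)
    {
        struct vfio_region_info info = {
            .argsz = sizeof(info),
            .index = VFIO_PCI_CONFIG_REGION_INDEX,
        };
        __u16 cmd;

        ioctl(device, VFIO_DEVICE_GET_REGION_INFO, &info);

        pread(device, &cmd, sizeof(cmd), info.offset + PCI_COMMAND);
        cmd |= PCI_COMMAND_MASTER;
        pwrite(device, &cmd, sizeof(cmd), info.offset + PCI_COMMAND);
        pread(device, &cmd, sizeof(cmd), info.offset + PCI_COMMAND);

        return !!(cmd & PCI_COMMAND_MASTER);
    }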
to the same device? This is just from a layman's perspective, but it would allow the following:
- when libvirt starts, it initializes all its drivers (let's focus on QEMU)
- as part of this initialization, libvirt probes QEMU for capabilities and caches them in order to use them when spawning VMs
Now, if we (theoretically) can settle on easing the restrictions Alex has mentioned, we could in fact introduce a QMP command to probe these devices and provide libvirt with useful information at that point in time. Of course, since the 3rd party vendor is "de-coupled" from qemu, libvirt would have no way to find out that the driver has changed in the meantime, thus still using the old information we gathered, ergo potentially causing the QEMU process to fail eventually. But then again, there's very often a strong recommendation to reboot your host after a driver update, especially in NVIDIA's case, which means this fact wouldn't matter. However, there's also a significant drawback to my proposal which probably renders it completely useless (but we can continue from there...), and that is that the devices would either have to be present already (not an option) or QEMU would need to be enhanced so that it would create a dummy device during QMP probing, open it, collect the information libvirt needs, close it, and remove it. If the driver doesn't change in the meantime, this should be sufficient for a VM to be successfully instantiated with a display, right?
I don't think this last requirement is possible, QEMU is as clueless about the capabilities of an mdev device as anyone else until that device is opened and probed, so how would we invent this "dummy device"? I don't really see how there's any ability for pre-determination of the device capabilities, we can only probe the actual device we intend to use.
The above has pressed the need for investigating some sort of alternative API through which libvirt might introspect a vfio device and with vfio device migration on the horizon, it's natural that some sort of support for migration state compatibility for the device need be considered as a second user of such an API. However, we currently have no concept of migration compatibility on a per-device level as there are no migratable devices that live outside of the QEMU code base. It's therefore assumed that per device migration compatibility is encompassed by the versioned machine type for the overall VM. We need participation all the way to the top of the VM management stack to resolve this issue and it's dragging down the (possibly) more simple question of how do we resolve the display situation. Therefore I'm looking for alternatives for display that work within what we have available to us at the moment.
Erik Skultety, who initially raised the display question, has identified one possible solution, which is to simply make the display configuration the user's problem (apologies if I've misinterpreted Erik). I believe this would work something like:
- libvirt identifies a version of QEMU that includes 'display' support for vfio-pci devices and defaults to adding display=off for every vfio-pci device [have we chosen the wrong default (auto) in QEMU?].
From libvirt's POV, having a new XML attribute 'display' on the mdev host device type, with a default value of 'off', should work; we could potentially extend this to 'auto' once we have enough information to base our decision on. We'll need to combine this with a new attribute value for the <video> element that would prevent adding an emulated VGA any time <graphics> (spice, VNC) is requested, but that's something we'd need to do anyway, so I'm just mentioning it.
This raises another question: is the configuration of the emulated graphics a factor in handling the mdev device's display option? AFAIK, neither vGPU vendor provides a VBIOS for boot graphics, so even with a display option, we're mostly targeting a secondary graphics head; otherwise the user will be running headless until the guest OS drivers initialize.
- New XML support would allow a user to enable display support on the vfio device.
- Resolving any OpenGL dependencies of that change would be left to the user.
A nice aspect of this is that policy decisions are left to the user and clearly no interface changes are necessary, perhaps with the exception of deciding whether we've made the wrong default choice for vfio-pci devices in QEMU.
It's a common practice that we offload decisions like this to users (including the management layer, e.g. openstack, ovirt).
On the other hand, if we do want to give libvirt a mechanism to probe the display support for a device, we can make a simplified QEMU instance be the mechanism through which we do that. For example the script[1] can be provided with either a PCI device or sysfs path to an mdev device and run a minimal VM instance meeting the requirements of both GVTg and NVIDIA to report the display support and GL requirements for a device. There are clearly some unrefined and atrocious bits of this script, but it's only a proof of concept; the process management can be improved and we can decide whether we want to provide a QMP mechanism to introspect the device rather than grep'ing error messages. The goal is simply to show that we could choose to embrace
If nothing else, error messages change, so that's not a viable way; QMP is a much more standardized approach. But then again, as I mentioned above, at the moment libvirt probes for capabilities during its start.
Right, and none of these device capabilities are currently present via qmp, and in fact the VM fails to start in my example script when GL is needed but not present, so there's no QMP interface to probe until a configuration is found that the VM at least initializes w/o error.
QEMU and use it not as a VM, but simply as a tool for poking at a device given the restrictions the mdev vendor drivers have already imposed.
So I think the question bounces back to libvirt, does libvirt want enough information about the display requirements for a given device to automatically attempt to add GL support for it, effectively a policy of 'if it's supported try to enable it', or should we leave well enough alone and let the user choose to enable it?
Maybe some guiding questions:
- Will dma-buf always require GL support?
- Does GL support limit our ability to have a display over a remote connection?
- Do region-based displays also work with GL support, even if not required?
Yeah, these are IMHO really tough to answer because we can't really predict the future, which again favours a new libvirt attribute more. Even if we decided that we truly need a dummy VM as a tool for libvirt to probe this info, I still feel like this should be done higher up in the virtualization stack and libvirt again would be just a tool to do stuff the way it's told to do it. But I'd very much like to hear Dan's opinion, since besides libvirt he can cover openstack too.
I've learned from Gerd offline that remote connections are possible, requiring maybe yet a different set of options, so I'm leaning even further in the direction that libvirt can really only provide the user with options, but cannot reasonably infer the intentions of the user's configuration even if device capabilities were exposed. Thanks, Alex

Hi,
This raises another question: is the configuration of the emulated graphics a factor in handling the mdev device's display option? AFAIK, neither vGPU vendor provides a VBIOS for boot graphics, so even with a display option, we're mostly targeting a secondary graphics head; otherwise the user will be running headless until the guest OS drivers initialize.
Right now yes, no boot display for vgpu devices. I'm trying to fix that with ramfb. There are still a bunch of rough edges and details to be hashed out. It'll probably be UEFI only. cheers, Gerd

Hi Gerd, Can I know your status on the boot display support work? I'm interested to try it in some real use cases. Thanks, Henry
-----Original Message----- From: Gerd Hoffmann <kraxel@redhat.com> Sent: Monday, May 7, 2018 2:26 PM Subject: Re: Expose vfio device display/migration to libvirt and above, was Re: [PATCH 0/3] sample: vfio mdev display devices.
Hi,
This raises another question: is the configuration of the emulated graphics a factor in handling the mdev device's display option? AFAIK, neither vGPU vendor provides a VBIOS for boot graphics, so even with a display option, we're mostly targeting a secondary graphics head; otherwise the user will be running headless until the guest OS drivers initialize.
Right now yes, no boot display for vgpu devices. I'm trying to fix that with ramfb. There are still a bunch of rough edges and details to be hashed out. It'll probably be UEFI only.
cheers, Gerd

On Fri, Jul 20, 2018 at 04:56:15AM +0000, Yuan, Hang wrote:
Hi Gerd,
Can I know your status on the boot display support work? I'm interested to try it in some real use cases.
https://git.kraxel.org/cgit/qemu/log/?h=sirius/ramfb-vfio Most of the bits needed (general ramfb support) are merged upstream and will be in 3.0. Wiring up ramfb for vfio display devices is in the branch listed above and should follow in 3.1. cheers, Gerd

...
Now, if we (theoretically) can settle on easing the restrictions Alex has mentioned, we could in fact introduce a QMP command to probe these devices and provide libvirt with useful information at that point in time. Of course, since the 3rd party vendor is "de-coupled" from qemu, libvirt would have no way to find out that the driver has changed in the meantime, thus still using the old information we gathered, ergo potentially causing the QEMU process to fail eventually. But then again, there's very often a strong recommendation to reboot your host after a driver update, especially in NVIDIA's case, which means this fact wouldn't matter. However, there's also a significant drawback to my proposal which probably renders it completely useless (but we can continue from there...), and that is that the devices would either have to be present already (not an option) or QEMU would need to be enhanced so that it would create a dummy device during QMP probing, open it, collect the information libvirt needs, close it, and remove it. If the driver doesn't change in the meantime, this should be sufficient for a VM to be successfully instantiated with a display, right?
I don't think this last requirement is possible, QEMU is as clueless about the capabilities of an mdev device as anyone else until that device is opened and probed, so how would we invent this "dummy device"? I don't really see how there's any ability for pre-determination of the device capabilities, we can only probe the actual device we intend to use.
Hmm, let's say libvirt is able to create mdevs. Do the vendor drivers impose any kind of limitations on whether a specific device-type or a specific instance of a type does or does not present certain features like display or migration in comparison to the other types/instances? IOW I would assume that once the driver version supports display/migration, any mdev instance of any mdev type the driver supports will "inherit" the support for display/migration. If this assumption works, libvirt, knowing there are some mdev-capable parent devices, could technically create a dummy instance of the first type it can for each parent device, passing the UUID to a qemu QMP query command; qemu would then open and probe the device, returning the capabilities, which libvirt would then cache. Next time a VM is due to start, libvirt can use the device UUID to check the capabilities we cached and try setting appropriate config options. However, as you've mentioned, this approach is fairly policy-driven, which doesn't fit with what libvirt's goal is. Would such a suggestion help at all from QEMU's POV?
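For what it's worth, creating such a dummy instance is just a sysfs write; a minimal sketch follows (the parent address, type name, and UUID are examples, the paths follow the standard mdev sysfs layout):

    #include <stdio.h>

    static int create_dummy_mdev(void)
    {
        const char *path = "/sys/class/mdev_bus/0000:00:02.0/"
                           "mdev_supported_types/i915-GVTg_V5_8/create";
        const char *uuid = "c0ffee00-0000-0000-0000-000000000000";
        FILE *f = fopen(path, "w");

        if (!f)
            return -1;
        fprintf(f, "%s\n", uuid);   /* kernel instantiates the mdev */
        return fclose(f);
        /* probe it via QMP or vfio, then dispose of it by writing 1 to
         * /sys/bus/mdev/devices/<uuid>/remove */
    }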
The above has pressed the need for investigating some sort of alternative API through which libvirt might introspect a vfio device and with vfio device migration on the horizon, it's natural that some sort of support for migration state compatibility for the device need be considered as a second user of such an API. However, we currently have no concept of migration compatibility on a per-device level as there are no migratable devices that live outside of the QEMU code base. It's therefore assumed that per device migration compatibility is encompassed by the versioned machine type for the overall VM. We need participation all the way to the top of the VM management stack to resolve this issue and it's dragging down the (possibly) more simple question of how do we resolve the display situation. Therefore I'm looking for alternatives for display that work within what we have available to us at the moment.
Erik Skultety, who initially raised the display question, has identified one possible solution, which is to simply make the display configuration the user's problem (apologies if I've misinterpreted Erik). I believe this would work something like:
- libvirt identifies a version of QEMU that includes 'display' support for vfio-pci devices and defaults to adding display=off for every vfio-pci device [have we chosen the wrong default (auto) in QEMU?].
From libvirt's POV, having a new XML attribute 'display' on the mdev host device type, with a default value of 'off', should work; we could potentially extend this to 'auto' once we have enough information to base our decision on. We'll need to combine this with a new attribute value for the <video> element that would prevent adding an emulated VGA any time <graphics> (spice, VNC) is requested, but that's something we'd need to do anyway, so I'm just mentioning it.
This raises another question: is the configuration of the emulated graphics a factor in handling the mdev device's display option? AFAIK, neither vGPU vendor provides a VBIOS for boot graphics, so even
Good point, I forgot about the fact that we don't have boot graphics yet, in which case no, having the 'none' value isn't a factor here, libvirt can continue adding an emulated VGA device just to have some boot output. I'm also curious how the display on the secondary GPU is going to be presented to the end user, but that's out of scope for libvirt.
with a display option, we're mostly targeting a secondary graphics head, otherwise the user will be running headless until the guest OS drivers initialize.
- New XML support would allow a user to enable display support on the vfio device.
- Resolving any OpenGL dependencies of that change would be left to the user.
A nice aspect of this is that policy decisions are left to the user and clearly no interface changes are necessary, perhaps with the exception of deciding whether we've made the wrong default choice for vfio-pci devices in QEMU.
It's a common practice that we offload decisions like this to users (including the management layer, e.g. openstack, ovirt).
On the other hand, if we do want to give libvirt a mechanism to probe the display support for a device, we can make a simplified QEMU instance be the mechanism through which we do that. For example the script[1] can be provided with either a PCI device or sysfs path to an mdev device and run a minimal VM instance meeting the requirements of both GVTg and NVIDIA to report the display support and GL requirements for a device. There are clearly some unrefined and atrocious bits of this script, but it's only a proof of concept; the process management can be improved and we can decide whether we want to provide a QMP mechanism to introspect the device rather than grep'ing error messages. The goal is simply to show that we could choose to embrace
If nothing else, error messages change, so that's not a viable way; QMP is a much more standardized approach. But then again, as I mentioned above, at the moment libvirt probes for capabilities during its start.
Right, and none of these device capabilities are currently present via qmp, and in fact the VM fails to start in my example script when GL is needed but not present, so there's no QMP interface to probe until a configuration is found that the VM at least initializes w/o error.
QEMU and use it not as a VM, but simply as a tool for poking at a device given the restrictions the mdev vendor drivers have already imposed.
So I think the question bounces back to libvirt, does libvirt want enough information about the display requirements for a given device to automatically attempt to add GL support for it, effectively a policy of 'if it's supported try to enable it', or should we leave well enough alone and let the user choose to enable it?
Maybe some guiding questions:
- Will dma-buf always require GL support?
- Does GL support limit our ability to have a display over a remote connection?
- Do region-based displays also work with GL support, even if not required?
Yeah, these are IMHO really tough to answer because we can't really predict the future, which again favours a new libvirt attribute more. Even if we decided that we truly need a dummy VM as a tool for libvirt to probe this info, I still feel like this should be done higher up in the virtualization stack and libvirt again would be just a tool to do stuff the way it's told to do it. But I'd very much like to hear Dan's opinion, since besides libvirt he can cover openstack too.
I've learned from Gerd offline that remote connections are possible, requiring maybe yet a different set of options, so I'm leaning even further in the direction that libvirt can really only provide the user with options, but cannot reasonably infer the intentions of the user's configuration even if device capabilities were exposed. Thanks,
Agreed, this would turn out to be extremely policy-based, but like Daniel, I'm really not sure whether these can be determined in an automated way on any level. Sure, ovirt could present a set of contextual menus so a 'human' user would make the call (even a wrong one for that matter), but not as much for openstack I guess. Erik

On Thu, 10 May 2018 13:00:29 +0200 Erik Skultety <eskultet@redhat.com> wrote:
...
Now, if we (theoretically) can settle on easing the restrictions Alex has mentioned, we could in fact introduce a QMP command to probe these devices and provide libvirt with useful information at that point in time. Of course, since the 3rd party vendor is "de-coupled" from qemu, libvirt would have no way to find out that the driver has changed in the meantime, thus still using the old information we gathered, ergo potentially causing the QEMU process to fail eventually. But then again, there's very often a strong recommendation to reboot your host after a driver update, especially in NVIDIA's case, which means this fact wouldn't matter. However, there's also a significant drawback to my proposal which probably renders it completely useless (but we can continue from there...), and that is that the devices would either have to be present already (not an option) or QEMU would need to be enhanced so that it would create a dummy device during QMP probing, open it, collect the information libvirt needs, close it, and remove it. If the driver doesn't change in the meantime, this should be sufficient for a VM to be successfully instantiated with a display, right?
I don't think this last requirement is possible, QEMU is as clueless about the capabilities of an mdev device as anyone else until that device is opened and probed, so how would we invent this "dummy device"? I don't really see how there's any ability for pre-determination of the device capabilities, we can only probe the actual device we intend to use.
Hmm, let's say libvirt is able to create mdevs. Do the vendor drivers impose any kind of limitations on whether a specific device-type or a specific instance of a type does or does not present certain features like display or migration in comparison to the other types/instances? IOW I would assume that once the driver version supports display/migration, any mdev instance of any mdev type the driver supports will "inherit" the support for display/migration. If this assumption works, libvirt, knowing there are some mdev-capable parent devices, could technically create a dummy instance of the first type it can for each parent device, passing the UUID to a qemu QMP query command; qemu would then open and probe the device, returning the capabilities, which libvirt would then cache. Next time a VM is due to start, libvirt can use the device UUID to check the capabilities we cached and try setting appropriate config options. However, as you've mentioned, this approach is fairly policy-driven, which doesn't fit with what libvirt's goal is. Would such a suggestion help at all from QEMU's POV?
There is no guarantee that all mdevs are equal for a given vendor. For instance we know that the smallest vGPU instance for Intel is intended for compute offload, it's configured with barely enough framebuffer and screen resolution for a working desktop. Does it necessarily make sense that it would support all of the same capabilities as a more desktop focused mdev instance? For that matter, can we necessarily guarantee that all mdev types for a given parent device are the same class of device? For a GPU parent device we might have some VGA class devices supporting a display and some 3D controllers which don't. So I think the operative word above is "assumption". You can make whatever assumptions you want, but they're only that, there's nothing that binds the mdev vendor driver to those assumptions.
The above has pressed the need for investigating some sort of alternative API through which libvirt might introspect a vfio device and with vfio device migration on the horizon, it's natural that some sort of support for migration state compatibility for the device need be considered as a second user of such an API. However, we currently have no concept of migration compatibility on a per-device level as there are no migratable devices that live outside of the QEMU code base. It's therefore assumed that per device migration compatibility is encompassed by the versioned machine type for the overall VM. We need participation all the way to the top of the VM management stack to resolve this issue and it's dragging down the (possibly) more simple question of how do we resolve the display situation. Therefore I'm looking for alternatives for display that work within what we have available to us at the moment.
Erik Skultety, who initially raised the display question, has identified one possible solution, which is to simply make the display configuration the user's problem (apologies if I've misinterpreted Erik). I believe this would work something like:
- libvirt identifies a version of QEMU that includes 'display' support for vfio-pci devices and defaults to adding display=off for every vfio-pci device [have we chosen the wrong default (auto) in QEMU?].
From libvirt's POV, having a new XML attribute 'display' on the mdev host device type, with a default value of 'off', should work; we could potentially extend this to 'auto' once we have enough information to base our decision on. We'll need to combine this with a new attribute value for the <video> element that would prevent adding an emulated VGA any time <graphics> (spice, VNC) is requested, but that's something we'd need to do anyway, so I'm just mentioning it.
This raises another question: is the configuration of the emulated graphics a factor in handling the mdev device's display option? AFAIK, neither vGPU vendor provides a VBIOS for boot graphics, so even
Good point, I forgot about the fact that we don't have boot graphics yet, in which case no, having the 'none' value isn't a factor here, libvirt can continue adding an emulated VGA device just to have some boot output. I'm also curious how the display on the secondary GPU is going to be presented to the end user, but that's out of scope for libvirt.
I don't believe the guest behavior necessarily changes; depending on the guest OS capabilities, the emulated and assigned/mdev graphics are separate displays and the user can configure which to use. The change is that now there are ways to get to that mdev display that are in-band for the hypervisor, such as virt-viewer. I haven't actually managed to get this to work yet, but I can see that a second display should be offered when this is configured properly.
with a display option, we're mostly targeting a secondary graphics head, otherwise the user will be running headless until the guest OS drivers initialize.
- New XML support would allow a user to enable display support on the vfio device.
- Resolving any OpenGL dependencies of that change would be left to the user.
A nice aspect of this is that policy decisions are left to the user and clearly no interface changes are necessary, perhaps with the exception of deciding whether we've made the wrong default choice for vfio-pci devices in QEMU.
It's a common practice that we offload decisions like this to users (including the management layer, e.g. openstack, ovirt).
On the other hand, if we do want to give libvirt a mechanism to probe the display support for a device, we can make a simplified QEMU instance be the mechanism through which we do that. For example the script[1] can be provided with either a PCI device or sysfs path to an mdev device and run a minimal VM instance meeting the requirements of both GVTg and NVIDIA to report the display support and GL requirements for a device. There are clearly some unrefined and atrocious bits of this script, but it's only a proof of concept; the process management can be improved and we can decide whether we want to provide a QMP mechanism to introspect the device rather than grep'ing error messages. The goal is simply to show that we could choose to embrace
If nothing else, error messages change, so that's not a viable way; QMP is a much more standardized approach. But then again, as I mentioned above, at the moment libvirt probes for capabilities during its start.
Right, and none of these device capabilities are currently present via qmp, and in fact the VM fails to start in my example script when GL is needed but not present, so there's no QMP interface to probe until a configuration is found that the VM at least initializes w/o error.
QEMU and use it not as a VM, but simply as a tool for poking at a device given the restrictions the mdev vendor drivers have already imposed.
So I think the question bounces back to libvirt, does libvirt want enough information about the display requirements for a given device to automatically attempt to add GL support for it, effectively a policy of 'if it's supported try to enable it', or should we leave well enough alone and let the user choose to enable it?
Maybe some guiding questions:
- Will dma-buf always require GL support?
- Does GL support limit our ability to have a display over a remote connection?
- Do region-based displays also work with GL support, even if not required?
Yeah, these are IMHO really tough to answer because we can't really predict the future, which again favours a new libvirt attribute more. Even if we decided that we truly need a dummy VM as a tool for libvirt to probe this info, I still feel like this should be done higher up in the virtualization stack and libvirt again would be just a tool to do stuff the way it's told to do it. But I'd very much like to hear Dan's opinion, since besides libvirt he can cover openstack too.
I've learned from Gerd offline that remote connections are possible, requiring maybe yet a different set of options, so I'm leaning even further in the direction that libvirt can really only provide the user with options, but cannot reasonably infer the intentions of the user's configuration even if device capabilities were exposed. Thanks,
Agreed, this would turn out to be extremely policy-based, but like Daniel, I'm really not sure whether these can be determined in an automated way on any level. Sure, ovirt could present a set of contextual menus so a 'human' user would make the call (even a wrong one for that matter), but not as much for openstack I guess.
Perhaps the idea of a local display really has no place in either an ovirt or openstack configuration, so if everything works with GL and SPICE will use something GL compatible (and /assuming/ the overhead of enabling that thing is trivial), perhaps data center management tools would simply always direct libvirt to use such a configuration. They'd need to know then whether display is supported or have things wired such that the current default of display=auto will always work when it's available. Thanks, Alex

On Thu, May 03, 2018 at 12:58:00PM -0600, Alex Williamson wrote:
Hi,
The previous discussion hasn't produced results, so let's start over. Here's the situation:
- We currently have kernel and QEMU support for the QEMU vfio-pci display option.
- The default for this option is 'auto', so the device will attempt to generate a display if the underlying device supports it, currently only GVTg and some future release of NVIDIA vGPU (plus Gerd's sample mdpy and mbochs).
- The display option is implemented via two different mechanisms, a vfio region (NVIDIA, mdpy) or a dma-buf (GVTg, mbochs).
- Displays using dma-buf require OpenGL support, displays making use of region support do not.
- Enabling OpenGL support requires specific VM configurations, which libvirt /may/ want to facilitate.
- Probing display support for a given device is complicated by the fact that GVTg and NVIDIA both impose requirements on the process opening the device file descriptor through the vfio API:
- GVTg requires a KVM association or will fail to allow the device to be opened.
- NVIDIA requires that their vgpu-manager process can locate a UUID for the VM via the process commandline.
- These are both horrible impositions and prevent libvirt from simply probing the device itself.
Agreed, these requirements are just horrific. Probing for features should not require this level of environmental setup. I can just about understand & accept how we ended up here, because this scenario is not one that was strongly considered when the first impls were being done. I don't think we should accept it as a long term requirement though.
Erik Skultety, who initially raised the display question, has identified one possible solution, which is to simply make the display configuration the user's problem (apologies if I've misinterpreted Erik). I believe this would work something like:
- libvirt identifies a version of QEMU that includes 'display' support for vfio-pci devices and defaults to adding display=off for every vfio-pci device [have we chosen the wrong default (auto) in QEMU?].
- New XML support would allow a user to enable display support on the vfio device.
- Resolving any OpenGL dependencies of that change would be left to the user.
A nice aspect of this is that policy decisions are left to the user and clearly no interface changes are necessary, perhaps with the exception of deciding whether we've made the wrong default choice for vfio-pci devices in QEMU.
Unless I'm misunderstanding, this isn't really a solution to the problem; rather it is us simply giving up and telling someone else to try to fix the problem. The 'user' here is not a human - it is simply the next level up in the mgmt stack, eg OpenStack or oVirt. If we can't solve it acceptably in libvirt code, I don't have much hope that OpenStack can solve it in their code, since they have an even stronger need to automate everything.
On the other hand, if we do want to give libvirt a mechanism to probe the display support for a device, we can make a simplified QEMU instance be the mechanism through which we do that. For example the script[1] can be provided with either a PCI device or sysfs path to an mdev device and run a minimal VM instance meeting the requirements of both GVTg and NVIDIA to report the display support and GL requirements for a device. There are clearly some unrefined and atrocious bits of this script, but it's only a proof of concept; the process management can be improved and we can decide whether we want to provide a QMP mechanism to introspect the device rather than grep'ing error messages. The goal is simply to show that we could choose to embrace QEMU and use it not as a VM, but simply as a tool for poking at a device given the restrictions the mdev vendor drivers have already imposed.
Feels like a pretty heavyweight solution that just encourages the drivers to continue down the undesirable path they're already on, possibly making the situation even worse over time. Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On Fri, 4 May 2018 10:16:09 +0100 Daniel P. Berrangé <berrange@redhat.com> wrote:
On Thu, May 03, 2018 at 12:58:00PM -0600, Alex Williamson wrote:
Hi,
The previous discussion hasn't produced results, so let's start over. Here's the situation:
- We currently have kernel and QEMU support for the QEMU vfio-pci display option.
- The default for this option is 'auto', so the device will attempt to generate a display if the underlying device supports it, currently only GVTg and some future release of NVIDIA vGPU (plus Gerd's sample mdpy and mbochs).
- The display option is implemented via two different mechanisms, a vfio region (NVIDIA, mdpy) or a dma-buf (GVTg, mbochs).
- Displays using dma-buf require OpenGL support, displays making use of region support do not.
- Enabling OpenGL support requires specific VM configurations, which libvirt /may/ want to facilitate.
- Probing display support for a given device is complicated by the fact that GVTg and NVIDIA both impose requirements on the process opening the device file descriptor through the vfio API:
- GVTg requires a KVM association or will fail to allow the device to be opened.
- NVIDIA requires that their vgpu-manager process can locate a UUID for the VM via the process commandline.
- These are both horrible impositions and prevent libvirt from simply probing the device itself.
Agreed, these requirements are just horrific. Probing for features should not require this level of environmental setup. I can just about understand & accept how we ended up here, because this scenario is not one that was strongly considered when the first impls were being done. I don't think we should accept it as a long term requirement though.
Erik Skultety, who initially raised the display question, has identified one possible solution, which is to simply make the display configuration the user's problem (apologies if I've misinterpreted Erik). I believe this would work something like:
- libvirt identifies a version of QEMU that includes 'display' support for vfio-pci devices and defaults to adding display=off for every vfio-pci device [have we chosen the wrong default (auto) in QEMU?].
- New XML support would allow a user to enable display support on the vfio device.
- Resolving any OpenGL dependencies of that change would be left to the user.
A nice aspect of this is that policy decisions are left to the user and clearly no interface changes are necessary, perhaps with the exception of deciding whether we've made the wrong default choice for vfio-pci devices in QEMU.
Unless I'm misunderstanding, this isn't really a solution to the problem; rather it is us simply giving up and telling someone else to try to fix the problem. The 'user' here is not a human - it is simply the next level up in the mgmt stack, eg OpenStack or oVirt. If we can't solve it acceptably in libvirt code, I don't have much hope that OpenStack can solve it in their code, since they have an even stronger need to automate everything.
But to solve this at any level other than the user suggests there is one "right" answer to automatically configuring the device. Is there? If a device supports a display, does the user necessarily want to enable it? If there's a difference between enabling a display for a local user or a remote user, is there any reasonable expectation that we can automatically make that determination?
On the other hand, if we do want to give libvirt a mechanism to probe the display support for a device, we can make a simplified QEMU instance be the mechanism through which we do that. For example the script[1] can be provided with either a PCI device or sysfs path to an mdev device and run a minimal VM instance meeting the requirements of both GVTg and NVIDIA to report the display support and GL requirements for a device. There are clearly some unrefined and atrocious bits of this script, but it's only a proof of concept; the process management can be improved and we can decide whether we want to provide a QMP mechanism to introspect the device rather than grep'ing error messages. The goal is simply to show that we could choose to embrace QEMU and use it not as a VM, but simply as a tool for poking at a device given the restrictions the mdev vendor drivers have already imposed.
Feels like a pretty heavyweight solution that just encourages the drivers to continue down the undesirable path they're already on, possibly making the situation even worse over time.
I'm not getting the impression that the vendor drivers are considering a change, or necessarily can change. The NVIDIA UUID requirement certainly seems arbitrary, but page tracking via KVM seems to be more directly useful to maintaining the address space of the device relative to the VM, even if it really wasn't the intent of the mdev interface. Perhaps we could introduce vfio interfaces to replace this, but is that just adding an unnecessary layer of interaction for all but this probe activity? Maybe the KVM interface should never have been added, but given that it exists, does it make sense to say that it can't be used, or required? Thanks, Alex

Hi,
Maybe some guiding questions:
- Will dma-buf always require GL support?
Yes.
- Does GL support limit our ability to have a display over a remote connection?
Currently yes, although the plan is to support gl display remotely in spice. The workflow will be completely different though. Non-gl spice uses the classic display channel; the plan for gl spice is to feed the dma-bufs into the gpu's video encoder, then send a video stream.
- Do region-based displays also work with GL support, even if not required?
Yes. Any qemu display device works with gl-enabled UI. cheers, Gerd

On Fri, Apr 27, 2018 at 12:15:01AM +0530, Kirti Wankhede wrote:
On 4/26/2018 1:22 AM, Dr. David Alan Gilbert wrote:
* Alex Williamson (alex.williamson@redhat.com) wrote:
On Wed, 25 Apr 2018 21:00:39 +0530 Kirti Wankhede <kwankhede@nvidia.com> wrote:
On 4/25/2018 4:29 AM, Alex Williamson wrote:
On Wed, 25 Apr 2018 01:20:08 +0530 Kirti Wankhede <kwankhede@nvidia.com> wrote:
On 4/24/2018 3:10 AM, Alex Williamson wrote: > On Wed, 18 Apr 2018 12:31:53 -0600 > Alex Williamson <alex.williamson@redhat.com> wrote: > >> On Mon, 9 Apr 2018 12:35:10 +0200 >> Gerd Hoffmann <kraxel@redhat.com> wrote: >> >>> This little series adds three drivers, for demo-ing and testing vfio >>> display interface code. There is one mdev device for each interface >>> type (mdpy.ko for region and mbochs.ko for dmabuf). >> >> Erik Skultety brought up a good question today regarding how libvirt is >> meant to handle these different flavors of display interfaces and >> knowing whether a given mdev device has display support at all. It >> seems that we cannot simply use the default display=auto because >> libvirt needs to specifically configure gl support for a dmabuf type >> interface versus not having such a requirement for a region interface, >> perhaps even removing the emulated graphics in some cases (though I >> don't think we have boot graphics through either solution yet). >> Additionally, GVT-g seems to need the x-igd-opregion support >> enabled(?), which is a non-starter for libvirt as it's an experimental >> option! >> >> Currently the only way to determine display support is through the >> VFIO_DEVICE_QUERY_GFX_PLANE ioctl, but for libvirt to probe that on >> their own they'd need to get to the point where they could open the >> vfio device and perform the ioctl. That means opening a vfio >> container, adding the group, setting the iommu type, and getting the >> device. I was initially a bit appalled at asking libvirt to do that, >> but the alternative is to put this information in sysfs, but doing that >> we risk that we need to describe every nuance of the mdev device >> through sysfs and it becomes a dumping ground for every possible >> feature an mdev device might have. ... >> So I was ready to return and suggest that maybe libvirt should probe >> the device to know about these ancillary configuration details, but >> then I remembered that both mdev vGPU vendors had external dependencies >> to even allow probing the device. KVMGT will fail to open the device >> if it's not associated with an instance of KVM and NVIDIA vGPU, I >> believe, will fail if the vGPU manager process cannot find the QEMU >> instance to extract the VM UUID. (Both of these were bad ideas) > > Here's another proposal that's really growing on me: > > * Fix the vendor drivers! Allow devices to be opened and probed > without these external dependencies. > * Libvirt uses the existing vfio API to open the device and probe the > necessary ioctls, if it can't probe the device, the feature is > unavailable, ie. display=off, no migration. >
I'm trying to think of a simpler mechanism using sysfs that could work for any feature and would let libvirt check source-destination migration compatibility before initiating migration.
I have another proposal:
* Add an ioctl VFIO_DEVICE_PROBE_FEATURES:

      struct vfio_device_features {
          __u32 argsz;
          __u32 features;
      };

  Define a bit for each feature:

      #define VFIO_DEVICE_FEATURE_DISPLAY_REGION (1 << 0)
      #define VFIO_DEVICE_FEATURE_DISPLAY_DMABUF (1 << 1)
      #define VFIO_DEVICE_FEATURE_MIGRATION      (1 << 2)
* Vendor driver returns bitmask of supported features during initialization phase.
* In vfio core module, trap this ioctl for each device in vfio_device_fops_unl_ioctl(),
Whoops, chicken and egg problem, VFIO_GROUP_GET_DEVICE_FD is our blocking point with mdev drivers, we can't get a device fd, so we can't call an ioctl on the device fd.
I'm sorry, I thought we could expose features when QEMU initialize, but libvirt needs to know supported features before QEMU initialize.
check the features bitmask returned by the vendor driver and add a sysfs file if the feature is supported on that device. This sysfs file would return 0/1. (A usage sketch of the proposed ioctl follows below.)
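For concreteness, usage of the proposed interface would presumably look like the fragment below. None of these names exist in linux/vfio.h; they are purely the proposal above, and as noted, the probe still presupposes a device fd, which is the chicken-and-egg problem:

    /* hypothetical, from the proposal above, not a merged kernel API */
    struct vfio_device_features feat = { .argsz = sizeof(feat) };

    if (ioctl(device, VFIO_DEVICE_PROBE_FEATURES, &feat) == 0) {
        if (feat.features & VFIO_DEVICE_FEATURE_DISPLAY_DMABUF)
            ; /* display needs a GL-capable configuration */
        if (feat.features & VFIO_DEVICE_FEATURE_MIGRATION)
            ; /* device claims migration support */
    }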
I don't understand why we have an ioctl interface, if the user can get to the device fd then we have existing interfaces to probe these things, it seems like you're just wanting to pass a features bitmap through to vfio_add_group_dev() that vfio-core would expose through sysfs, but a list of feature bits doesn't convey enough info except for the most basic uses.
Yes, vfio_add_group_dev() seems to be better way to convey features to vfio core.
For migration, this bit will only indicate whether the host driver supports the migration feature.
For the source and destination compatibility check, libvirt would need more data/variables to check, like:
* whether the same type of 'mdev_type' device is create-able at the destination, i.e. if ('mdev_type'->available_instances > 0)
* whether host_driver_version at source and destination are compatible. Host drivers from the same release branch should be mostly compatible, but if there are major changes in structures or APIs, host drivers from different branches might not be compatible; for example, if source and destination are from different branches and one of the structures had changed, then data collected at the source might not be compatible with the structures at the destination, and typecasting it to the changed structures would mess up the migrated data during restoration.
Of course now you're asking that libvirt understand the release versioning scheme of every vendor driver and that it remain programmatically consistent. We can't even do this with in-kernel drivers. And in the end, still the best we can do is guess.
Libvirt doesn't need to understand the version; libvirt needs to strcmp the version strings from source and destination. If those are equal, then libvirt would understand that they are compatible.
Who's to say that the driver version and migration compatibility have any relation at all? Some drivers might focus on designing their own migration interface that can maintain compatibility across versions (QEMU does this), some drivers may only allow identical version migration (which is going to frustrate upper level management tools and customers - RHEL goes to great extents to support cross version migration). We cannot have a one size fits all here that driver version defines completely the migration compatibility.
I'll agree; I don't know enough about these devices, but to give you some examples of things I'd expect to work: a) User adds new machines to their data centre with a larger/newer version of the same vendor's GPU; in some cases that should work (depending on vendor details etc) b) The same thing but with identical hardware and a newer driver on the destination.
Obviously there will be some cut offs that say some versions are incompatible; but for normal migration we jump through serious hoops to make sure stuff works; customers will expect the same with some VFIO devices.
How does libvirt check that cutoff where some versions are incompatible?
* whether guest_driver_version is compatible with the host driver at the destination. For mdev devices, the guest driver communicates with the host driver in some form. If there are changes in the structures/APIs of such communication, the guest driver at the source might not be compatible with the host driver at the destination.
And another guess, plus now the guest driver is involved, which libvirt has no visibility into.
As above, libvirt would need to do a strcmp.
Insufficient, imo
The 'available_instances' sysfs attribute already exists; the latter two should be added by the vendor driver, and libvirt can use them for the migration compatibility check.
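Reading that attribute is straightforward; a minimal sketch (the paths follow the standard mdev sysfs layout, parent and type are caller-supplied examples):

    #include <stdio.h>

    /* Returns available_instances for an mdev type, or -1 on error. */
    static int available_instances(const char *parent, const char *type)
    {
        char path[256];
        int n = -1;
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/class/mdev_bus/%s/mdev_supported_types/%s/available_instances",
                 parent, type);
        f = fopen(path, "r");
        if (!f)
            return -1;
        if (fscanf(f, "%d", &n) != 1)
            n = -1;
        fclose(f);
        return n;   /* > 0: an instance of this type can still be created */
    }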
As noted previously, display and migration are not necessarily mdev-only features, it's possible that vfio-pci or vfio-platform could also implement these, so the sysfs interface cannot be restricted to the mdev template and lifecycle interface.
I agree. The feature bitmask passed to vfio core is not mdev specific. But here, 'available_instances' for the migration compatibility check is mdev specific. If the mdev device is not create-able at the destination, there is no point in libvirt initiating the migration.
'available_instances' for migration compatibility check...? We use available_instances to know whether we have the resources to create a given mdev type. It's certainly a prerequisite to have a device of the identical type at the migration target and how we define what is an identical device for a directly assigned PCI device is yet another overly complicated rat hole. But an identical device doesn't necessarily imply migration compatibility and I think that's the problem we're tackling. We cannot assume based only on the device type that migration is compatible, that's basically saying we're never going to have any bugs or oversights or new features in the migration stream.
Those things certainly happen; state that we forgot to transfer, new features enabled on devices, devices configured in different ways.
How does libvirt check migration compatibility for other devices across QEMU versions, where the source supports a device but the destination, running an older QEMU version, doesn't support that device or the device doesn't exist on that system?
We spoke about this on the call, but I'll write it down anyway so that we have a record of it. Currently, libvirt doesn't support migration of a domain with devices living outside of QEMU, therefore we'd need a completely new schema to support this. The other thing I mentioned regarding probing of the migration capabilities was that we should really consider openstack as both the consumer and a commander of libvirt, because it's openstack that maintains a global view of all the hosts in a cluster, so rather than poking libvirt to probe a random host, I assume they'd already like to have this information beforehand so that they can incorporate the logic in their scheduler; IOW at the point where migration is about to happen, openstack should imho already know which host is capable of hosting the VM to be migrated. That being said, it would most probably be libvirt who provides openstack with this information, just like openstack probes libvirt for domain capabilities, but the idea stays the same: ideally we'd need to have this information before users decide to migrate a VM. Thanks, Erik

From: Alex Williamson Sent: Thursday, April 19, 2018 2:32 AM
That almost begins to look reasonable, but then we can only expose this for mdev devices, what if we were to hack a back door into a directly assigned GPU that tracks the location of active display in the framebuffer and implement the GFX_PLANE interface for that? We have no sysfs representation for either the template or the actual device for anything other than mdev. This inconsistency with physically assigned devices has been one of my arguments against enhancing mdev sysfs.
One possible option is to wrap a directly assigned GPU into an mdev. The parent driver could be a dummy PCI driver which does basic PCI initialization and then provides hooks for vendor-specific hacks. Thanks Kevin

On Thu, Apr 26, 2018 at 03:44:15AM +0000, Tian, Kevin wrote:
From: Alex Williamson Sent: Thursday, April 19, 2018 2:32 AM
That almost begins to look reasonable, but then we can only expose this for mdev devices, what if we were to hack a back door into a directly assigned GPU that tracks the location of active display in the framebuffer and implement the GFX_PLANE interface for that? We have no sysfs representation for either the template or the actual device for anything other than mdev. This inconsistency with physically assigned devices has been one of my arguments against enhancing mdev sysfs.
One possible option is to wrap a directly assigned GPU into an mdev. The parent driver could be a dummy PCI driver which does basic PCI initialization and then provides hooks for vendor-specific hacks.
Throwing amdgpu into the mix. Looks like they have vgpu support too, but using sriov instead of mdev. Having VFIO_GFX support surely looks useful there. Adding an mdev dependency to the VFIO_GFX api would make things more complicated there for (IMHO) no good reason ... cheers, Gerd

On Thu, 26 Apr 2018 08:14:27 +0200 Gerd Hoffmann <kraxel@redhat.com> wrote:
On Thu, Apr 26, 2018 at 03:44:15AM +0000, Tian, Kevin wrote:
From: Alex Williamson Sent: Thursday, April 19, 2018 2:32 AM
That almost begins to look reasonable, but then we can only expose this for mdev devices, what if we were to hack a back door into a directly assigned GPU that tracks the location of active display in the framebuffer and implement the GFX_PLANE interface for that? We have no sysfs representation for either the template or the actual device for anything other than mdev. This inconsistency with physically assigned devices has been one of my arguments against enhancing mdev sysfs.
One possible option is to wrap a directly assigned GPU into an mdev. The parent driver could be a dummy PCI driver which does basic PCI initialization and then provides hooks for vendor-specific hacks.
Throwing amdgpu into the mix. Looks like they have vgpu support too, but using sriov instead of mdev. Having VFIO_GFX support surely looks useful there. Adding an mdev dependency to the VFIO_GFX api would make things more complicated there for (IMHO) no good reason ...
Yes, it may be that a device wanting to implement display or migration might take the mdev approach, but that should be a choice of the implementation, not a requirement imposed by the API. Thanks, Alex