On Fri, 4 May 2018 09:49:44 +0200
Erik Skultety <eskultet@redhat.com> wrote:
On Thu, May 03, 2018 at 12:58:00PM -0600, Alex Williamson wrote:
> Hi,
>
> The previous discussion hasn't produced results, so let's start over.
> Here's the situation:
>
> - We currently have kernel and QEMU support for the QEMU vfio-pci
> display option.
>
> - The default for this option is 'auto', so the device will attempt to
> generate a display if the underlying device supports it, currently
> only GVTg and some future release of NVIDIA vGPU (plus Gerd's
> sample mdpy and mbochs).
>
> - The display option is implemented via two different mechanisms, a
> vfio region (NVIDIA, mdpy) or a dma-buf (GVTg, mbochs).
>
> - Displays using dma-buf require OpenGL support; displays making
> use of region support do not.
>
> - Enabling OpenGL support requires specific VM configurations, which
> libvirt /may/ want to facilitate.
>
> - Probing display support for a given device is complicated by the
> fact that GVTg and NVIDIA both impose requirements on the process
> opening the device file descriptor through the vfio API:
>
> - GVTg requires a KVM association or will fail to allow the device
> to be opened.
How exactly is this association checked?
The intel_vgpu_open() callback for the mdev device registers a vfio
group notifier for VFIO_GROUP_NOTIFY_SET_KVM events. The KVM pointer is
already registered via the addition of the vfio group to the vfio-kvm
pseudo device, so the registration synchronously triggers the notifier
callback and the result is tested slightly later in the open path in
kvmgt_guest_init().
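For reference, a condensed sketch of that flow (paraphrased from the
kvmgt code; names and structure are approximate, not a verbatim quote):

    /* in intel_vgpu_open(): the group was already added to the vfio-kvm
     * pseudo device, so this registration fires the callback synchronously */
    events = VFIO_GROUP_NOTIFY_SET_KVM;
    vfio_register_notifier(mdev_dev(mdev), VFIO_GROUP_NOTIFY,
                           &events, &vgpu->vdev.group_notifier);

    static int intel_vgpu_group_notifier(struct notifier_block *nb,
                                         unsigned long action, void *data)
    {
            struct intel_vgpu *vgpu = container_of(nb, struct intel_vgpu,
                                                   vdev.group_notifier);

            if (action == VFIO_GROUP_NOTIFY_SET_KVM)
                    vgpu->vdev.kvm = data;  /* NULL if no KVM association */
            return NOTIFY_OK;
    }

    /* slightly later in the open path, kvmgt_guest_init() bails out: */
    kvm = vgpu->vdev.kvm;
    if (!kvm || kvm->mm != current->mm)
            return -ESRCH;      /* "KVM is required to use Intel vGPU" */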
>
> - NVIDIA requires that their vgpu-manager process can locate a
> UUID for the VM via the process commandline.
>
> - These are both horrible impositions and prevent libvirt from
> simply probing the device itself.
So I feel like we're trying to solve a problem coming from one layer
on a bunch of different layers, which inherently prevents us from
producing a viable long-term solution without dragging in a significant
amount of hacky, nasty code, and it is not the missing sysfs attributes
I have in mind. Why does NVIDIA's vgpu-manager need to locate a UUID
of a qemu VM? I assume that's to prevent multiple VM instances trying
to use the same mdev device, in which case can't the vgpu-manager
track references to how many "open" and "close" calls have been made
Hard to say, NVIDIA hasn't been terribly forthcoming about this
requirement, but it's probably not about multiple users of the same
mdev device, as that's already prevented through vfio in general.
Intel has discussed that their requirement is to be able to track VM
page table updates so they can update their shadow tables; effectively,
rather than mediating interactions directly with the device, they're using a
KVM back channel to manage the DMA translation address space for the
device.
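(For context, that back channel is KVM's page-track interface; very
roughly, and glossing over the details, the vendor driver ends up
doing something like the following, paraphrased rather than quoted:)

    /* register for write notifications on the guest pages backing the
     * translation tables, so the shadow tables can be kept in sync */
    info->track_node.track_write = kvmgt_page_track_write;
    kvm_page_track_register_notifier(kvm, &info->track_node);

    /* write-protect a guest page whose updates we need to observe */
    kvm_slot_page_track_add_page(kvm, slot, gfn, KVM_PAGE_TRACK_WRITE);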
The flip side is that while these requirements are annoying and hard
for non-VM users to deal with, is there a next logical point in the
interaction with the vfio device where the vendor driver can reasonably
impose those requirements? For instance, both vendors expose a
vfio-pci interface, so they could prevent the user driver from enabling
bus master in the PCI command register, but that's a fairly subtle
failure: typically drivers wouldn't even bother to read back after a
write to the bus master bit to see if it sticks, and this sort of
enabling is done by the guest, not the hypervisor. There's really no
error path for a write to the device.
to the same device? This is just from a layman's perspective, but it
would allow the following:
- when libvirt starts, it initializes all its drivers (let's
focus on QEMU)
- as part of this initialization, libvirt probes QEMU for
capabilities and caches them in order to use them when spawning VMs
Now, if we (theoretically) can settle on easing the restrictions Alex
has mentioned, we in fact could introduce a QMP command to probe
these devices and provide libvirt with useful information at that
point in time. Of course, since the 3rd party vendor is "de-coupled"
from qemu, libvirt would have no way to find out that the driver has
changed in the meantime, thus still using the old information we
gathered, ergo potentially causing the QEMU process to fail
eventually. But then again, there's very often a strong
recommendation to reboot your host after a driver update, especially
in NVIDIA's case, which means this fact wouldn't matter. However,
there's also a significant drawback to my proposal which probably
renders it completely useless (but we can continue from there...), and
that is that the devices would either have to be present already (not
an option) or QEMU would need to be enhanced in such a way that it
would create a dummy device during QMP probing, open it, collect the
information libvirt needs, close it, and remove it. If the driver
doesn't change in the meantime, this should be sufficient for a VM to
be successfully instantiated with a display, right?
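For concreteness, the sort of probe described above might look like
the following entirely hypothetical QMP exchange (no such command
exists today, the command and field names are made up):

    # hypothetical command, for illustration only
    -> { "execute": "x-query-mdev-display",
         "arguments": { "sysfsdev": "/sys/bus/mdev/devices/<uuid>" } }
    <- { "return": { "display": "dmabuf", "opengl-required": true } }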
I don't think this last requirement is possible; QEMU is as clueless
about the capabilities of an mdev device as anyone else until that
device is opened and probed, so how would we invent this "dummy
device"? I don't really see how there's any ability for
pre-determination of the device capabilities; we can only probe the
actual device we intend to use.
> The above has pressed the need for investigating some sort of
> alternative API through which libvirt might introspect a vfio device
> and with vfio device migration on the horizon, it's natural that
> some sort of support for migration state compatibility for the
> device need be considered as a second user of such an API.
> However, we currently have no concept of migration compatibility on
> a per-device level as there are no migratable devices that live
> outside of the QEMU code base. It's therefore assumed that per
> device migration compatibility is encompassed by the versioned
> machine type for the overall VM. We need participation all the way
> to the top of the VM management stack to resolve this issue and
> it's dragging down the (possibly) simpler question of how we
> resolve the display situation. Therefore I'm looking for
> alternatives for display that work within what we have available to
> us at the moment.
>
> Erik Skultety, who initially raised the display question, has
> identified one possible solution, which is to simply make the
> display configuration the user's problem (apologies if I've
> misinterpreted Erik). I believe this would work something like:
>
> - libvirt identifies a version of QEMU that includes 'display'
> support for vfio-pci devices and defaults to adding display=off for
> every vfio-pci device [have we chosen the wrong default (auto) in
> QEMU?].
From libvirt's POV, having a new XML attribute 'display' on the mdev
host device type should work, with a default value of 'off',
potentially extending this to 'auto' once we have enough information
to base our decision on. We'll need to combine this with a new
attribute value for the <video> element that would prevent adding an
emulated VGA any time <graphics> (SPICE, VNC) is requested, but that's
something we'd need to do anyway, so I'm just mentioning it.
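As a rough sketch of what such XML might look like (attribute and
value names here are purely illustrative, nothing is settled, and the
UUID is a placeholder):

    <hostdev mode='subsystem' type='mdev' model='vfio-pci' display='off'>
      <source>
        <address uuid='c0b1ae38-55c4-4b89-b2d8-0a1b2c3d4e5f'/>
      </source>
    </hostdev>

    <!-- illustrative only: a <video> value suppressing the emulated
         VGA that <graphics> would otherwise pull in -->
    <video>
      <model type='none'/>
    </video>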
This raises another question: is the configuration of the emulated
graphics a factor in handling the mdev device's display option?
AFAIK, neither vGPU vendor provides a VBIOS for boot graphics, so even
with a display option, we're mostly targeting a secondary graphics
head; otherwise the user will be running headless until the guest OS
drivers initialize.
> - New XML support would allow a user to enable display support on
> the vfio device.
>
> - Resolving any OpenGL dependencies of that change would be left to
> the user.
>
> A nice aspect of this is that policy decisions are left to the user
> and clearly no interface changes are necessary, perhaps with the
> exception of deciding whether we've made the wrong default choice
> for vfio-pci devices in QEMU.
It's a common practice that we offload decisions like this to users
(including the management layer, e.g. OpenStack, oVirt).
>
> On the other hand, if we do want to give libvirt a mechanism to
> probe the display support for a device, we can make a simplified
> QEMU instance be the mechanism through which we do that. For
> example, the script[1] can be provided with either a PCI device or
> sysfs path to an mdev device and run a minimal VM instance meeting
> the requirements of both GVTg and NVIDIA to report the display
> support and GL requirements for a device. There are clearly some
> unrefined and atrocious bits of this script, but it's only a proof
> of concept; the process management can be improved and we can
> decide whether we want to provide a QMP mechanism to introspect the
> device rather than grep'ing error messages. The goal is simply to
> show that we could choose to embrace
If for nothing else, error messages change, so that's not the way to
go; QMP is a much more standardized approach. But then again, as I
mentioned above, at the moment libvirt probes for capabilities during
its start.
Right, and none of these device capabilities are currently exposed via
QMP, and in fact the VM fails to start in my example script when GL is
needed but not present, so there's no QMP interface to probe until a
configuration is found with which the VM at least initializes without
error.
> QEMU and use it not as a VM, but simply a tool for poking at a
> device given the restrictions the mdev vendor drivers have already
> imposed.
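Very roughly, such a probe boils down to an invocation along these
lines (paraphrasing the idea behind the script rather than its exact
contents, with the options here only illustrative): enable KVM to
satisfy GVTg, put a UUID on the command line to satisfy NVIDIA's
vgpu-manager, and see whether the device initializes with and without
GL:

    # hypothetical probe run; exact options differ from the actual script
    UUID="$1"       # the mdev device's UUID

    qemu-system-x86_64 -machine accel=kvm -nodefaults -S \
        -uuid "$UUID" \
        -spice unix,addr=/tmp/probe.sock,disable-ticketing \
        -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/"$UUID",display=auto
    # repeat with gl=on added to the -spice options to tell a
    # region-based display apart from a dma-buf display needing OpenGL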
>
> So I think the question bounces back to libvirt: does libvirt want
> enough information about the display requirements for a given
> device to automatically attempt to add GL support for it,
> effectively a policy of 'if it's supported try to enable it', or
> should we leave well enough alone and let the user choose to enable
> it?
>
> Maybe some guiding questions:
>
> - Will dma-buf always require GL support?
>
> - Does GL support limit our ability to have a display over a remote
> connection?
>
> - Do region-based displays also work with GL support, even if not
> required?
Yeah, these are IMHO really tough to answer because we can't really
predict the future, which again favours a new libvirt attribute more.
Even if we decided that we truly need a dummy VM as a tool for libvirt
to probe this info, I still feel like this should be done higher up in
the virtualization stack and libvirt again would just be a tool to do
stuff the way it's told to do it. But I'd very much like to hear
Dan's opinion, since besides libvirt he can cover OpenStack too.
I've learned from Gerd offline that remote connections are possible,
though perhaps requiring yet another set of options, so I'm leaning
even further in the direction that libvirt can really only provide the
user with options, but cannot reasonably infer the intent of the user's
configuration even if device capabilities were exposed. Thanks,
Alex