On Tue, 22 Mar 2016 19:04:31 +0100
Andrea Bolognani <abologna(a)redhat.com> wrote:
On Tue, 2016-03-22 at 09:04 -0600, Alex Williamson wrote:
> > Could this be controlled by a kernel parameter? So you
> > can just add something like
> >
> > vfio-pci.devices=0000:02:00.0,0000:03:00.0
> >
> > to your bootloader configuration and be sure that the
> > devices you've listed will never be touched by the host.
> >
> > Note that I have no idea how much work implementing
> > something like the above would be, or whether it would
> > even be possible :)
>
> vfio-pci is typically a module; device assignment and userspace drivers
> are very niche as far as the kernel is concerned, and the driver is
> complex enough that building it statically into the kernel has some
> disadvantages. So adding module options to vfio-pci doesn't really
> help. We sort of tried this with adding the vfio-pci.ids module
> option so we'd have parity with pci-stub, but it didn't really work
> because users need to go through enough hoops to get vfio-pci loaded
> early that they might as well just use the driver_override option.
> Live and learn, pci-stub.ids is still generally more useful than the
> equivalent vfio-pci option because pci-stub is (or should be) built
> statically into the kernel.
I see. But it doesn't necessarily have to be a kernel module
option, we just need the information to get to initramfs
somehow. I believe a bunch of arguments, such as "splash", are
already processed this way.
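
As a rough illustration of how an initramfs script could pick up such an argument, here is a minimal sketch. It assumes the hypothetical parameter name vfio-pci.devices from the suggestion above; no kernel or initramfs tool is known to implement it, and the function name is made up for this example.

```python
import re

# Hypothetical parameter name taken from the suggestion in this thread;
# it is not an existing kernel or initramfs option.
PARAM = "vfio-pci.devices"

# domain:bus:slot.function, e.g. 0000:02:00.0
PCI_ADDR = re.compile(r"^[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]$")


def reserved_devices(cmdline, param=PARAM):
    """Return the PCI addresses listed under `param` on a kernel command line."""
    devices = []
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key != param or not sep:
            continue
        for addr in value.split(","):
            # Silently skip anything that does not look like a PCI address.
            if PCI_ADDR.match(addr):
                devices.append(addr.lower())
    return devices
```

On a running system the input would typically be the contents of /proc/cmdline; what the script then does with the resulting list is a separate question.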
> Also, a fun property that we get to deal with in PCI space is that bus
> numbers are under the control of the OS, so if you boot with
> pci=assign-buses, your device address may change. That means that
> specifying a device by address is not as clean of a solution as you
> might think.
That sounds... Problematic.
Wouldn't that mean that the guest XML itself might suddenly
become invalid? When we configure a device for PCI passthrough,
we identify it using domain:bus:slot.function.
Well, barring configuration or command-line changes, PCI device
addresses are generally stable. To do better we'd need to provide
device paths like ACPI does in the DMAR table. ACPI specifies the base
bus number of the root bridge, so that address is predictable. It then
gives the path to the device from one of those stable bus numbers. For
instance in this topology:
-[0000:00]-+-00.0
           +-01.0-[01]--+-00.0
           |            \-00.1
Rather than specifying 0000:01:00.0 it would be [0000:00],01.0,00.0.
Therefore if the OS renumbers the subordinate bus behind 01.0, the path
is still valid. The support issues induced by needing to come up with
something like that are probably greater than the rare occasion that
device addresses change.
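
To make the path idea concrete, here is a minimal sketch of resolving such a path back to a conventional address. It uses the [0000:00],01.0,00.0 notation from the example above; the function name and the secondary_bus mapping (bridge address to the bus number currently assigned behind it) are assumptions made for illustration, not an existing interface.

```python
def resolve_path(path, secondary_bus):
    """Resolve '[DDDD:BB],SS.F,SS.F,...' to a 'DDDD:BB:SS.F' address.

    `secondary_bus` maps each bridge's current address to the bus number
    (two hex digits) that the OS has assigned behind that bridge.
    """
    root, *hops = path.split(",")
    domain, bus = root.strip("[]").split(":")
    if not hops:
        raise ValueError("path needs at least one slot.function hop")
    # Every hop except the last one must be a bridge; follow it downstream.
    for hop in hops[:-1]:
        bridge = f"{domain}:{bus}:{hop}"
        bus = secondary_bus[bridge]  # KeyError if the hop is not a known bridge
    return f"{domain}:{bus}:{hops[-1]}"
```

With the topology above, resolve_path("[0000:00],01.0,00.0", {"0000:00:01.0": "01"}) gives 0000:01:00.0. If the OS renumbers the bus behind 01.0 to, say, 05, the same path resolves to 0000:05:00.0 without the path itself changing.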
> It seems like this quickly devolves into a question of
> whether the kernel is even the right place to do this, if you have a
> properly modular kernel then userspace really has the control of how
> modules are loaded and can intervene to exclude certain devices.
> The kink is that we need to do that early in boot, which circles right
> back around to the initramfs scripts, but afaik those are generally
> managed by each distro with only loose similarities, so we'd need to
> hope that we contribute something that becomes a de facto standard.
Yeah, once you get in initramfs territory you can hardly count
on standardization. IIRC some distros even provide *several*
alternative tools for the job :)
Then again, it seems quite weird to me that some sort of
mechanism for setting driver_override during early boot is not
available... Wouldn't it make the most sense for it to be set
before the kernel has had a chance to bind devices to the default
drivers?
Of course. There are mechanisms available; the one I typically use is
to exploit modprobe.d's install script option and load a small script
into the initramfs which sets driver_override before a known
conflicting driver is loaded. This falls into site-specific hacking
though.
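
For illustration only, the core of such a script amounts to a single sysfs write. The sketch below is not the script described above; the function name and the sysfs_root parameter (there so it can be pointed at a scratch directory) are assumptions for this example. The driver_override attribute under /sys/bus/pci/devices/<address>/ is the real interface.

```python
import os


def set_driver_override(addr, driver="vfio-pci", sysfs_root="/sys"):
    """Write `driver` to the driver_override attribute of PCI device `addr`.

    Returns the path that was written. Against the real /sys this needs
    root privileges and must run before the conflicting driver binds.
    """
    path = os.path.join(sysfs_root, "bus", "pci", "devices", addr,
                        "driver_override")
    with open(path, "w") as f:
        f.write(driver + "\n")
    return path
```

Setting the override only controls which driver is allowed to bind; the device still has to be probed afterwards, for example when the vfio-pci module is loaded.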
> > > The
> > > problem is more that this is not an uncommon scenario. In one of my
> > > previous replies I listed all of the driver issues that I know about.
> > > In some cases we can't fix them because the driver is proprietary, in
> > > others we just don't have the bandwidth. Users have far more devices
> > > at their disposal to test than we do, so they're likely going to
> > > continue to run into these issues.
> > Summing up the issues you listed in that message, along with
> > my thoughts:
> >
> > * stealing audio device from host when running a game in a
> > guest, giving it back afterwards
> >
> > - desktop use case
> > - seems to be working fine with managed='yes'
> > - if something goes wrong when reattaching to the host,
> > the user can probably still go about his business;
> > rebooting the host is going to be annoying but not a
> > huge deal
> > - limited number of in-kernel slots means you can only
> > do this a number of times before it stops working;
> > again, having to reboot the host once every so-many
> > guest boots is probably not a deal breaker
> > - managed='detach' is not helpful here, and neither is
> > the proposal above
> >
> > * NIC reserved for guest use
> >
> > - both server and desktop use case
> > - may or may not handle going back and forth between
> > the host and the guest
> > - managed='detach' would still prevent most issues,
> > assuming the host driver detaches cleanly
> > - best would be if the host never touched the device
> > - a solution like the one outlined above seems perfect
> > - for oVirt / OpenStack, the management layer could
> > take care of updating the bootloader configuration
> > - for pure libvirt, that responsibility would fall on
> > the user
> > - should virt-manager etc. rather default to
> > managed='detach' when assigning a NIC to a guest?
> >
> > * GPU with binary driver
> >
> > - both server and desktop use case
> > - doesn't handle dynamic unbind gracefully
> > - best would be if the host never touched the device
> > - a solution like the one outlined above seems perfect
> > - for oVirt / OpenStack, the management layer could
> > take care of updating the bootloader configuration
> > - for pure libvirt, that responsibility would fall on
> > the user
> > - managed='detach' can prevent some failures, but not
> > those that happen on host driver unbind
> >
> > * Primary Intel GPU
> >
> > - desktop use case (any reason to do this for servers?)
> > - will probably fail on reattach
> > - managed='detach' would prevent host crashes and the
> > like
> > - must assume the user has some other way to connect to
> > the host, because at some point they're going to be
> > unable to use the GPU from the host anyway
> >
> > Have I missed anything?
>
> I think the only thing missing is that libvirt sort of becomes a
> natural management point for an assigned device because it's specified
> with some management characteristics. Users set managed='yes',
> assuming libvirt will do the right thing
People who assume that managed='yes' will always do the right
thing are probably going to be disappointed no matter what :)
> and because the managed='no'
> path requires them to come up with their own solution. So maybe we
> just need to create some framework for users to be able to take that
> managed='no' path without feeling like they're stepping off a cliff into
> their own ad hoc management scripts.
That's what I'm thinking.
If, as argued above, the "proper" solution is in most cases
to prevent devices from being bound to the host driver in the
first place, then libvirt enters the picture way too late to
be able to take care of it.
If we can figure out a way to give users the ability to mark
a bunch of devices as "reserved for assignment" and have early
boot set driver_override to vfio-pci for us, we'll make life
easier for people currently relying on pci-stub.ids as well.
The simple stuff, that already works fine today with
managed='yes', will of course keep working.
Does this sound sensible / doable?
Sure, it's "just" a matter of knowing what package to start hacking on
and proposing patches to the downstreams. We could effectively also
duplicate the functionality of managed='detach' with libvirt hook
scripts, but they also lack any sort of standard framework. Last I
looked they still only called out to a monolithic script, rather than
making site local modifications easy with some sort of libvirt-hooks.d
directory. Thanks,
Alex