Re: [libvirt] [RFC PATCH] hostdev: add support for "managed='detach'"

22 Mar 2016

On Tue, 2016-03-22 at 09:04 -0600, Alex Williamson wrote:
...
Could this be controlled by a kernel parameter? So you
can just add something like

   vfio-pci.devices=0000:02:00.0,0000:03:00.0

to your bootloader configuration and be sure that the
devices you've listed will never be touched by the host.

Note that I have no idea how much work implementing
something like the above would be, or whether it would
even be possible :)
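
To make the idea concrete, here is a minimal sketch of how an initramfs
helper might pick such a list out of the kernel command line. The
vfio-pci.devices parameter is only the hypothetical name floated above;
no kernel or initramfs tool is known to implement it, and the function
name is made up for illustration.

```python
import re

# Hypothetical parameter name taken from the proposal above; it is an
# assumption, not an existing kernel or initramfs option.
PARAM = "vfio-pci.devices"

# Full PCI address: domain:bus:slot.function, all hexadecimal.
PCI_ADDR = re.compile(r"^[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]$")


def reserved_devices(cmdline):
    """Return the PCI addresses listed in the hypothetical parameter."""
    devices = []
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key != PARAM or not sep:
            continue
        # Silently skip entries that do not look like a PCI address.
        for addr in value.split(","):
            if PCI_ADDR.match(addr):
                devices.append(addr.lower())
    return devices
```

On a running system the input would presumably come from /proc/cmdline,
which is how other boot arguments reach early userspace.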

vfio-pci is typically a module; device assignment and userspace drivers
are very niche as far as the kernel is concerned and the driver is
complex enough that building it statically into the kernel has some
disadvantages.  So adding module options to vfio-pci doesn't really
help.  We sort of tried this with adding the vfio-pci.ids module
option so we'd have parity with pci-stub, but it didn't really work
because users need to go through enough hoops to get vfio-pci loaded
early that they might as well just use the driver_override option.
Live and learn; pci-stub.ids is still generally more useful than the
equivalent vfio-pci option because pci-stub is (or should be) built
statically into the kernel.
I see. But it doesn't necessarily have to be a kernel module
option, we just need the information to get to initramfs
somehow. I believe a bunch of arguments, such as "splash", are
already processed this way.
...
Also, a fun property that we get to deal with in PCI space is that bus
numbers are under the control of the OS, so if you boot with
pci=assign-buses, your device address may change.  That means that
specifying a device by address is not as clean a solution as you
might think.
That sounds... Problematic.

Wouldn't that mean that the guest XML itself might suddenly
become invalid? When we configure a device for PCI passthrough,
we identify it using domain:bus:slot.function.
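
As a rough illustration of why renumbering would hurt, the sketch below
turns a host PCI address into the <address/> element used inside a
<hostdev> source. The bus number ends up baked into the XML, so it would
go stale if the host renumbered its buses. The helper name is made up
for this example.

```python
def libvirt_pci_address(addr):
    """Format 'dddd:bb:ss.f' as a libvirt <address/> element."""
    domain, bus, rest = addr.split(":")
    slot, function = rest.split(".")
    # All four fields are hexadecimal in the host notation.
    return (
        "<address domain='0x%04x' bus='0x%02x' slot='0x%02x' function='0x%x'/>"
        % (int(domain, 16), int(bus, 16), int(slot, 16), int(function, 16))
    )
```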
...
It seems like this quickly devolves into a question of
whether the kernel is even the right place to do this, if you have a
properly modular kernel then userspace really has the control of how
modules are loaded and can intervene to exclude certain devices.
The kink is that we need to do that early in boot, which circles right
back around to the initramfs scripts, but afaik those are generally
managed by each distro with only loose similarities, so we'd need to
hope that we contribute something that becomes a de facto standard.
Yeah, once you get in initramfs territory you can hardly count
on standardization. IIRC some distros even provide *several*
alternative tools for the job :)

Then again, it seems quite weird to me that some sort of
mechanism for setting driver_override during early boot is not
available... Wouldn't it make the most sense for it to be set
before the kernel has had a chance to bind devices to the default
drivers?
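
For reference, this is roughly what setting driver_override looks like
from userspace today, written as a sketch rather than a tested tool. It
uses the per-device driver_override and unbind files and the bus-wide
drivers_probe file in sysfs; the function name and the sysfs_root
parameter (there only so the logic can be tried against a scratch
directory) are assumptions of this example. Since it runs in userspace,
it can only happen as early as whatever initramfs script calls it, which
is exactly the limitation discussed above.

```python
import os


def reserve_for_vfio(addr, sysfs_root="/sys"):
    """Point one PCI device at vfio-pci via driver_override.

    Must run as root, and the vfio-pci module must already be loaded
    for the final probe to bind anything.
    """
    dev_dir = os.path.join(sysfs_root, "bus", "pci", "devices", addr)

    # From now on only vfio-pci is allowed to bind to this device.
    with open(os.path.join(dev_dir, "driver_override"), "w") as f:
        f.write("vfio-pci")

    # If a host driver already claimed the device, release it first.
    unbind = os.path.join(dev_dir, "driver", "unbind")
    if os.path.exists(unbind):
        with open(unbind, "w") as f:
            f.write(addr)

    # Ask the PCI core to re-probe the device with the override in place.
    with open(os.path.join(sysfs_root, "bus", "pci", "drivers_probe"), "w") as f:
        f.write(addr)
```

Note that the unbind step is the one that binary or fragile host drivers
may not survive, so this does not replace keeping the host driver away
from the device in the first place.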
...
...
...
The problem is more that this is not an uncommon scenario.  In one of my
previous replies I listed all of the driver issues that I know about.
In some cases we can't fix them because the driver is proprietary, in
others we just don't have the bandwidth.  Users have far more devices
at their disposal to test than we do, so they're likely going to
continue to run into these issues.
Summing up the issues you listed in that message, along with
my thoughts:

   * stealing audio device from host when running a game in a
     guest, giving it back afterwards

     - desktop use case
     - seems to be working fine with managed='yes'
     - if something goes wrong when reattaching to the host,
       the user can probably still go about his business;
       rebooting the host is going to be annoying but not a
       huge deal
     - limited number of in-kernel slots means you can only
       do this a number of times before it stops working;
       again, having to reboot the host once every so-many
       guest boots is probably not a deal breaker
     - managed='detach' is not helpful here, and neither is
       the proposal above

   * NIC reserved for guest use

     - both server and desktop use case
     - may or may not handle going back and forth between
       the host and the guest
     - managed='detach' would still prevent most issues,
       assuming the host driver detaches cleanly
     - best would be if the host never touched the device
     - a solution like the one outlined above seems perfect
     - for oVirt / OpenStack, the management layer could
       take care of updating the bootloader configuration
     - for pure libvirt, that responsibility would fall on
       the user
     - should virt-manager etc. rather default to
       managed='detach' when assigning a NIC to a guest?

   * GPU with binary driver

     - both server and desktop use case
     - doesn't handle dynamic unbind gracefully
     - best would be if the host never touched the device
     - a solution like the one outlined above seems perfect
     - for oVirt / OpenStack, the management layer could
       take care of updating the bootloader configuration
     - for pure libvirt, that responsibility would fall on
       the user
     - managed='detach' can prevent some failures, but not
       those that happen on host driver unbind

   * Primary Intel GPU

     - desktop use case (any reason to do this for servers?)
     - will probably fail on reattach
     - managed='detach' would prevent host crashes and the
       like
     - must assume the user has some other way to connect to
       the host, because at some point they're going to be
       unable to use the GPU from the host anyway

Have I missed anything?

I think the only thing missing is that libvirt sort of becomes a
natural management point for an assigned device because it's specified
with some management characteristics.  Users set managed='yes',
assuming libvirt will do the right thing
People who assume that managed='yes' will always do the right
thing are probably going to be disappointed no matter what :)
...
and because the managed='no'
path requires them to come up with their own solution.  So maybe we
just need to create some framework for users to be able to take that
managed='no' path without feeling like they're stepping off a cliff into
their own ad hoc management scripts.
That's what I'm thinking.

If, as argued above, the "proper" solution is in most cases
to prevent devices from being bound to the host driver in the
first place, then libvirt enters the picture way too late to
be able to take care of it.

If we can figure out a way to give users the ability to mark
a bunch of devices as "reserved for assignment" and have early
boot set driver_override to vfio-pci for us, we'll make life
easier for people currently relying on pci-stub.ids as well.

The simple stuff, that already works fine today with
managed='yes', will of course keep working.

Does this sound sensible / doable?

Cheers.

-- 
Andrea Bolognani
Software Engineer - Virtualization Team