On Tue, 15 Mar 2016 14:21:35 -0400
Laine Stump <laine(a)laine.org> wrote:
On 03/15/2016 01:00 PM, Daniel P. Berrange wrote:
> On Mon, Mar 14, 2016 at 03:41:48PM -0400, Laine Stump wrote:
>> Suggested by Alex Williamson.
>>
>> If you plan to assign a GPU to a virtual machine, but that GPU happens
>> to be the host system console, you likely want it to start out using
>> the host driver (so that boot messages/etc will be displayed), then
>> later have the host driver replaced with vfio-pci for assignment to
>> the virtual machine.
>>
>> However, in at least some cases (e.g. Intel i915) once the device has
>> been detached from the host driver and attached to vfio-pci, attempts
>> to reattach to the host driver only lead to "grief" (ask Alex for
>> details). This means that simply using "managed='yes'" in libvirt
>> won't work.
>>
>> And if you set "managed='no'" in libvirt then either you have to
>> manually run virsh nodedev-detach prior to the first start of the
>> guest, or you have to have a management application intelligent enough
>> to know that it should detach from the host driver, but never reattach
>> to it.
>>
>> This patch makes it simple/automatic to deal with such a case - it
>> adds a third "managed" mode for assigned PCI devices, called
>> "detach". It will detach ("unbind" in driver parlance) the
device from
>> the host driver prior to assigning it to the guest, but when the guest
>> is finished with the device, will leave it bound to vfio-pci. This
>> allows re-using the device for another guest, without requiring
>> initial out-of-band intervention to unbind the host driver.
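For illustration, a rough sketch of the two workflows being compared (the
PCI address 0000:00:02.0 below is hypothetical):

    # With managed='no', the device has to be detached from its host driver
    # by hand before the first guest start, e.g.:
    virsh nodedev-detach pci_0000_00_02_0

    # Under the proposed patch, the detach (but not the later reattach)
    # happens automatically if the hostdev in the domain XML is marked
    # managed='detach':
    #   <hostdev mode='subsystem' type='pci' managed='detach'>
    #     <source>
    #       <address domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    #     </source>
    #   </hostdev>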
> You say that managed=yes causes pain upon re-attachment and that
> apps should use managed=detach to avoid it, but how do management
> apps know which devices are going to cause pain ? Libvirt isn't
> providing any info on whether a particular device id needs to
> use managed=yes vs managed=detach, and we don't want to be asking
> the user to choose between modes in openstack/ovirt IMHO. I think
> that's a fundamental problem with inventing a new value for managed
> here.
My suspicion is that in many/most cases users don't actually need
the device to be re-bound to the host driver after the guest is finished
with it, because they're only going to use the device to assign to a
different guest anyway. But because managed='yes' is what's supplied and
is the easiest way to get it set up for assignment to a guest, that's
what they use.
As a matter of fact, all this extra churn of changing the driver back
and forth for devices that are only actually used when they're bound to
vfio-pci just wastes time, and makes it more likely that libvirt and its
users will trip over the effects of some strange
kernel driver loading/unloading bug (there was recently a bug reported
like this; unfortunately the BZ record had customer info in it, so it's
not publicly accessible :-( )
So beyond the cases where this behavior is absolutely necessary, I
think it is useful in others too, at the user's discretion (and as I
implied above, I think that if they understood the function and the
tradeoffs, most people would choose to use managed='detach' rather than
managed='yes').
(alternatively, we could come back to the discussion of having persistent
nodedevice config, with one of the configurables being which devices
should be bound to vfio-pci when libvirtd is started. Did we maybe even
talk about exactly that in the past? I can't remember... That would of
course preclude the use case where someone 1) normally wanted to use the
device for the host, but 2) occasionally wanted to use it for a guest,
after which 3) they were well aware that they would need to reboot the
host before they could use the device on the host again. I know, I know
- "odd edge cases", and in particular "odd edge cases only encountered
by people who know other ways of working around the problem" :-))
> Can you provide more details about the problems with detaching ?
>
> Is this inherent to all VGA cards, or is it specific to the Intel
> i915, or specific to a kernel version or something else ?
>
> I feel like this is something where libvirt should "do the right
> thing", since that's really what managed=yes is all about.
>
> eg, if we have managed=yes and we see an i915, we should
> automatically skip re-attach for that device.
Alex can give a much better description of that than I can (I had told
git to Cc him on the original patch, but it seems it didn't do that; I'm
trying again). But what if there is such a behavior now for a certain
set of VGA cards, and it gets fixed in the future? Would we continue to
force avoiding re-attach for the device? I understand the allure of
always doing the right thing without requiring config (and the dislike
of adding new seemingly esoteric options), but I don't know that libvirt
has (or can get) the necessary info to make the correct decision in all
cases.
I agree, blacklisting VGA devices or any other specific device types or
host drivers is bound to be the wrong thing to do for someone or at
some point in time. I think if we look at the way devices are
typically used for device assignment, we'd probably see that they're
used exclusively for device assignment or exclusively for the host. My
guess is that it's a much less common scenario that a user actually
wants to steal a device from the host only while a VM is using it. It
is done, though; I know of folks who steal an audio device from the
host when they run their game VM and give it back when it's shut down. I
don't know that it's possible for libvirt to always just do the right
thing here; it involves inferring the intentions of the user.
So here are the types of things we're dealing with that made me suggest
this idea to Laine; in the i915 scenario, the Intel graphics device
(IGD) is typically the primary host graphics. If we want to assign it
to a VM, obviously at some point it needs to move to vfio-pci, but do
we know that the user has an alternate console configured or do they go
headless when that happens? If they go headless then they probably
don't want to use kernel boot options and blacklisting to prevent i915
from claiming the device or getting it attached to pci-stub or
vfio-pci. Often that's not even enough since efifb or vesafb might try
to claim resources of the device even if the PCI driver is prevented
from doing so. In such a case, it's pretty convenient that the user
can just set managed='yes' and the device gets snatched away from the
host when the VM starts... but then the i915 driver sometimes barfs
when the VM is shut down and i915 takes back the device. The host is
left in a mostly unusable state. Yes, the user could do a
nodedev-detach before starting the VM and yes, the i915 driver issue
may just be temporary, but this isn't the first time this has
happened.
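For concreteness, a sketch of roughly what that detach amounts to when done
by hand through sysfs (the IGD address 0000:00:02.0 is illustrative, and the
driver_override file assumes a 3.16 or newer kernel):

    # make sure the vfio-pci driver is available
    modprobe vfio-pci
    # unbind the device from i915
    echo 0000:00:02.0 > /sys/bus/pci/drivers/i915/unbind
    # steer the device to vfio-pci and trigger a re-probe
    echo vfio-pci > /sys/bus/pci/devices/0000:00:02.0/driver_override
    echo 0000:00:02.0 > /sys/bus/pci/drivers_probe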
As Laine mentioned, we've seen another customer issue where a certain
NIC is left in an inconsistent state, sometimes, when returned to the
host. They have absolutely no use for this NIC on the host, so this
was mostly a pointless operation anyway. In this case we had to use a
pci-stub.ids option to prevent the host NIC driver from touching the
devices since there was really no easy way to set managed='no' and
pre-bind the devices to vfio-pci in their ovirt/openstack environment.
NICs usually fare better in repeated attach/detach scenarios thanks to
physical hotplug support, but it's really a question of how robust the
driver is. For instance, how many people are out there hotplugging
$10 Realtek NICs vs multi-hundred dollar enterprise class NICs? Has
anyone ever done physical hotplug of a graphics card, sound card, or
USB controller?
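As a sketch of the pci-stub.ids approach mentioned above, a boot parameter
like the following (using a Realtek RTL8168 ID purely as an example) makes
pci-stub claim matching devices at boot so the host NIC driver never binds
to them:

    pci-stub.ids=10ec:8168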
Even in the scenario I mention above where the user thinks they want to
bounce their audio device back and forth between VM and host, there's
actually a fixed array of alsa cards in the kernel and unbinding a
device just leaks that slot. :-\
We also have nvidia.ko, which not only messes with the device to the
point where it may or may not work in the VM, but the vendor doesn't
support dynamically unbinding devices. It will still do it, but it
kinda forgets to tell Xorg to stop using the device. We generally
recommend folks doing GPU assignment to avoid the host driver
altogether; nouveau and radeon sometimes don't even like to do the
unbind. i915 is actually doing better than average in this case and the
fact that it's typically the primary graphics sort of breaks that rule
anyway.
So we have all sorts of driver issues that are sure to come and go over
time and all sorts of use cases that seem difficult to predict. If we
know we're in an ovirt/openstack environment, managed='detach' might
actually be a more typical use case than managed='yes'. It still
leaves a gap in that we have to hope the host driver doesn't do anything bad
when it initializes the device and that it releases the device cleanly,
but it's probably better than tempting fate by unnecessarily bouncing
it back and forth between drivers. Thanks,
Alex