On 7/18/19 2:18 PM, Laine Stump wrote:
> On 7/18/19 11:56 AM, Daniel Henrique Barboza wrote:
>>
>>
>> On 7/18/19 12:29 PM, Laine Stump wrote:
>>> On 7/18/19 10:29 AM, Daniel Henrique Barboza wrote:
>>>> Hi,
>>>>
>>>> I have a PoC that enables partial coldplug assignment of
>>>> multifunction PCI devices with managed mode. At this moment,
>>>> Libvirt can't handle this scenario - the code will detach only
>>>> the hostdevs listed in the XML, when in fact the whole IOMMU
>>>> group needs to be detached. This can be verified by the fact that
>>>> Libvirt handles the unmanaged scenario well, as long as the user
>>>> detaches the whole IOMMU group beforehand.
>>>>
>>>> I have played with 2 approaches. The one I am planning to
>>>> contribute back is a change inside virHostdevGetPCIHostDeviceList()
>>>> that adds the extra PCI devices for detach/re-attach in case a PCI
>>>> multifunction device in managed mode is present in the XML.
>>>
>>>
>>> If you're thinking of doing that automatically, then I should warn
>>> you that we had discussed that a long time ago, and decided that it
>>> was a bad idea to do it because it was likely someone would, e.g.
>>> try to assign an audio device to their guest that happened to be
>>> one function on a multifunction device that also contained a disk
>>> controller (or some other device) that the host needed for proper
>>> operation.
>>>
>>>
>>
>> Let's say that I have a multifunction PCI card with 4 functions, and
>> I want a guest to use only function 0 of that card. At this moment,
>> I'm only able to do that if I manually execute nodedev-detach on all
>> 4 functions beforehand and use function 0 as a hostdev with
>> managed=false.
>>
>> What I've implemented is a way of doing the detach/re-attach of the
>> whole IOMMU group for the user, if the hostdev is set with
>> managed=true (and perhaps I should also consider verifying the
>> 'multifunction=yes' attribute as well, for more clarity). I am not
>> trying to assign all the IOMMU group's devices to the guest - not
>> sure if that's what you were talking about up there, but I'm happy
>> to emphasize that's not the case.
> No, we're talking about the same thing. We specifically talked about
> the possibility of doing exactly this several years ago, and decided
> against it.
>>
>> Now, yes, if the user is unaware of the consequences of detaching
>> all devices of the IOMMU group from the host, bad things can happen.
>> If that's what you're saying, fair enough. I can make an argument
>> about how we can't shield the user from his/her own 'unawareness'
>> forever, but in the end it's better to be on the safe side.
> We really shouldn't do anything with any host device if it's not
> explicitly given in the config.
>>
>>
>>> It may be that in *your* particular case, you understand that the
>>> functions you don't want to assign to the guest are not otherwise
>>> used, and it's not dangerous to suddenly detach them from their
>>> host driver. But you can't assume that will always be the case.
>>>
>>>
>>> If you *really* can't accept just assigning all the devices in that
>>> IOMMU group to the guest (thus making them all explicitly listed in
>>> the config, and obvious to the administrator that they won't be
>>> available on the host) and simply not using them, then you either
>>> need to separately detach those particular functions from the host,
>>> or come up with a way of having the domain config explicitly list
>>> them as "detached from the host but not actually attached to the
>>> guest".
>>>
>>
>> I can live with that - it will automate the detach/re-attach
>> process, which is my goal here, and it forces the user to know
>> exactly what is going to be detached from the host, minimizing
>> errors. If no one is against adding an extra parameter
>> 'unassigned=true' to the hostdev in these cases, I can make this
>> happen.
> I don't have any ideological opinion against that (maybe there's a
> better name for the attribute, but I can't think of it).
I gave the attribute the first name I could think of when I wrote
that. As long as it makes sense, I have no reservations about changing
it later on.
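
Just to make the idea concrete, here is a rough sketch of what such a
hostdev entry could look like (the attribute name and the PCI address
are placeholders, nothing is settled yet):

    <hostdev mode='subsystem' type='pci' managed='yes' unassigned='yes'>
      <source>
        <address domain='0x0000' bus='0x01' slot='0x00' function='0x1'/>
      </source>
    </hostdev>

The device would be detached from the host together with the rest of
its IOMMU group, but never handed to the guest.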
> But to back up a bit - what is it about managed='yes' that makes you
> want to do it that way instead of managed='no'? Do you really ever
> need the devices to be bound to the host driver? Or are you just
> using managed='yes' because there's no standard/convenient place to
> configure devices to be permanently bound to vfio-pci immediately
> when the host boots?
It's a case of user convenience for devices that have mixed usage, at
least in my opinion.

For example, I can say from personal experience dealing with devices
that will never be used directly by the host, such as NVIDIA GPUs that
are used only as hostdevs of guests, that this code I'm developing is
pointless. In that setup the GPUs are bound to vfio-pci right after
the host boots using a /etc/rc.d script (or something equivalent). Not
sure if this is the standard way of binding a device to vfio-pci, but
it works for that environment. Thus, for that scenario, we'll never
use managed=true because it doesn't make sense.
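
For reference, a minimal sketch of what such a script can do, using
the kernel's sysfs driver_override mechanism (the PCI address is just
a placeholder):

    #!/bin/sh
    # Bind one device to vfio-pci at boot time (placeholder address).
    DEV=0000:01:00.0
    modprobe vfio-pci
    # Release the device from its current host driver, if any.
    if [ -e /sys/bus/pci/devices/$DEV/driver ]; then
        echo $DEV > /sys/bus/pci/devices/$DEV/driver/unbind
    fi
    # Make the PCI core pick vfio-pci for this device, then reprobe it.
    echo vfio-pci > /sys/bus/pci/devices/$DEV/driver_override
    echo $DEV > /sys/bus/pci/drivers_probe

A similar effect can be had without a script via a modprobe.d entry
such as "options vfio-pci ids=10de:1db4" (vendor:device ID given only
as an example), at the cost of claiming every device with that ID.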
The problem is with devices that the user expects to use both in
guests and in the host. In that case, the user will need either to
handle the nodedev detach/reattach manually or to use managed=true and
let Libvirt re-attach the devices every time the guest is destroyed.
Even if the device is going to be used in the same or another guest
right after (meaning that we re-attached the device simply to detach
it again right after), using managed=true is convenient because the
user doesn't need to think about the state of the device.
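
(The manual alternative for such a mixed-usage device is a cycle like
the following around every guest run - placeholder device name:

    # before starting the guest
    virsh nodedev-detach pci_0000_01_00_0

    # after the guest is destroyed, to use the device in the host again
    virsh nodedev-reattach pci_0000_01_00_0

and with a multifunction card, this is repeated for every function in
the IOMMU group.)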
> Truthfully, a great majority of the most troubling bugs with device
> assignment are due to use of managed='yes', since it exercises the
> kernel's device driver binding/unbinding code so much, and reveals
> strange races in the (usually kernel) code, but in almost all cases
> the devices being assigned to guests are *never* used directly by
> the host anyway, so there is no point in repeatedly rebinding the
> host driver to the device - it just sits there unused [1] until the
> next time it is needed by a guest, and at that time it gets rebound
> to vfio-pci, rinse, repeat.
>
> I think we should spend more time making it easier to have devices
> "pre-bound" to vfio-pci at boot time, so that we could discourage
> use of managed='yes'. (not "instead of" what you're doing, but "in
> addition to" it).
I think managed=true use can be called a 'bad user habit' in that
sense. I can think of some ways to alleviate it:

- a user setting in a conf file that changes how managed=true works.
Instead of detaching/re-attaching the device, Libvirt will only
detach the device, leaving it bound to vfio-pci even after the guest
is destroyed

- same idea, but with a (yet another) XML attribute "re-attach=false"
in the hostdev definition (see the sketch after this list). I like
this idea better because you can set customized behavior for each
device/guest instead of changing the managed mechanics for everyone

- for boot time (daemon start time), one way I can think of is an XML
file with the hostdev devices that are supposed to be pre-bound to
vfio-pci by libvirtd. Then the user can run the guests using
managed=false for those devices, knowing that they are already taken
care of. I don't know how this idea interacts with the new
"remember_owner" feature that Michal implemented recently though .....

- we can make a 're-definition' of what managed=false means for PCI
hostdevs. Instead of failing to execute the guest if the device isn't
bound to vfio-pci, managed=false would mean that Libvirt will detach
the device from the host if necessary, but will not re-attach it. If
such a redefinition is a massive break of API and user-expected
behavior (I suspect it is ...) we can create a "managed=mixed" with
this mechanic
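
To illustrate the second idea, a rough sketch of the hostdev entry
(the attribute name and the address are provisional):

    <hostdev mode='subsystem' type='pci' managed='yes' re-attach='false'>
      <source>
        <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
      </source>
    </hostdev>

Libvirt would detach the device from the host at guest start as usual,
but skip the re-attach when the guest is destroyed, leaving the device
bound to vfio-pci for the next run.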
Probably there are better ways to do it - this is what I can think of right
now.
Thanks,
DHB
> [1] (in the case of network device VFs, often it isn't "unused", but
> instead is *improperly* used on the host due to NetworkManager
> insisting on setting the device IFF_UP and starting up a DHCP
> client. So it's not just finding races in the kernel driver
> binding/initialization code, but also falling prey to (imho) the
> poor choice of NM to force all interfaces up and default to running
> dhcp on all unconfigured interfaces)