On Tue, 2 Feb 2021 08:28:52 +0100
Erik Skultety <eskultet(a)redhat.com> wrote:
On Mon, Feb 01, 2021 at 04:38:56PM -0700, Alex Williamson wrote:
> On Mon, 1 Feb 2021 16:57:44 -0600
> Jonathon Jongsma <jjongsma(a)redhat.com> wrote:
>
> > On Mon, 1 Feb 2021 11:33:08 +0100
> > Erik Skultety <eskultet(a)redhat.com> wrote:
> >
> > > On Mon, Feb 01, 2021 at 09:48:11AM +0000, Daniel P. Berrangé
> > > wrote:
> > > > On Mon, Feb 01, 2021 at 10:45:43AM +0100, Erik Skultety
> > > > wrote:
> > > > > On Mon, Feb 01, 2021 at 09:42:32AM +0000, Daniel P. Berrangé
> > > > > wrote:
> > > > > > On Fri, Jan 29, 2021 at 05:34:36PM -0600, Jonathon Jongsma
> > > > > > wrote:
> > > > > > > On Thu, 7 Jan 2021 17:43:54 +0100
> > > > > > > Erik Skultety <eskultet(a)redhat.com> wrote:
> > > > > > >
> > > > > > > > > Tested with v6.10.0-283-g1948d4e61e.
> > > > > > > > >
> > > > > > > > > 1.Can define/start/destroy mdev device
successfully;
> > > > > > > > >
> > > > > > > > > 2.'virsh nodedev-list' has no
'--active' option,
> > > > > > > > > which is inconsistent with the description
in the
> > > > > > > > > patch: # virsh nodedev-list --active
> > > > > > > > > error: command 'nodedev-list'
doesn't support option
> > > > > > > > > --active
> > > > > > > > >
> > > > > > > > > 3.virsh client hang when trying to destroy a
mdev
> > > > > > > > > device which is using by a vm, and after
that all
> > > > > > > > > 'virsh nodev*' cmds will hang. If
restarting
> > > > > > > > > llibvirtd after that, libvirtd will hang.
> > > > > > > >
> > > > > > > > It hangs because underneath a write to the
'remove'
> > > > > > > > sysfs attribute is now blocking for some reason
and
> > > > > > > > since we're relying on mdevctl to do it for
us, hence
> > > > > > > > "it hangs". I'm not trying to make
an excuse, it's
> > > > > > > > plain wrong. I'd love to rely on such a
basic
> > > > > > > > functionality, but it looks like we'll have
to go
> > > > > > > > with a extremely ugly workaround and try to get
the
> > > > > > > > list of active domains from the nodedev driver
and
> > > > > > > > see whether any of them has the device assigned
> > > > > > > > before we try to destroy the mdev via the
nodedev
> > > > > > > > driver.
> > > > > > >
> > > > > > > So, I've been trying to figure out a way to do
this,
> > > > > > > but as far as I know, there's no way to get a list
of
> > > > > > > active domains from within the nodedev driver, and I
> > > > > > > can't think of any better ways to handle it. Any
ideas?
> > > > > > >
> > > > > >
> > > > > > Correct, the nodedev driver isn't permitted to talk to
> > > > > > any of the virt drivers.
> > > > >
> > > > > Oh, not even via secondary connection? What makes nodedev so
> > > > > special, since we can open a secondary connection from e.g.
> > > > > the storage driver?
> > > >
> > > > It is technically possible, but it should never be done,
> > > > because it introduces a bi-directional dependancy between the
> > > > daemons which introduces the danger of deadlocking them. None
> > > > of the secondary drivers should connect to the hypervisor
> > > > drivers.
> > > > > > Is there anything in sysfs which reports whether the
> > > > > > device is in use ?
> > > > >
> > > > > Nothing that I know of, the way it used to work was that you
> > > > > tried to write to sysfs and kernel returned a write error
> > > > > with "device in use" or something like that, but that
has
> > > > > changed since :(.
> > >
> > > Without having tried this and since mdevctl is just a Bash
> > > script, can we bypass mdevctl on destroys a little bit by
> > > constructing the path to the sysfs attribute ourselves and
> > > perform a non-blocking write of zero bytes to see if we get an
> > > error? If so, we'll skip mdevctl and report an error. If we
> > > don't, we'll invoke mdevctl to remove the device in order to
> > > remain consistent. Would that be an acceptable workaround
> > > (provided it would work)?
> >
> > As far as I can tell, this doesn't work. According to my
> > tests, attempting to write zero bytes to $(mdev_sysfs_path)/remove
> > doesn't result in an error if the mdev is in use by a vm. It just
> > "successfully" writes zero bytes. Adding Alex to cc in case he
> > has an idea for a workaround here.
>
> [Cc +Connie]
>
> I'm not really sure why mdevs are unique here. When we write to
> remove, the first step is to release the device from the driver, so
> it's really the same as an unbind for a vfio-pci device. PCI
> drivers cannot return an error, an unbind is handled not as a
> request, but a directive, so when the device is in use we block
> until the unbind can complete. With vfio-pci (and added upstream
> to the mdev core - depending on vendor support), the driver remove
> callback triggers a virtual interrupt to the user asking to
> cooperatively return the device (triggering a hot-unplug in QEMU).
> Has this really worked so well in vfio-pci that we've forgotten
> that an unbind can block there too or are we better about tracking
> something with PCI devices vs mdevs?
Does any of the current vendor guest drivers for mdev support unplug?
While I'm not trying to argue that unpluging a vfio-pci cannot block,
it just works seamlessly in majority of cases nowadays, but I guess
we were in the same situation with PCI assignment in the past?
The whole point here is IMO about a massive inconvenience for a
library consumer to be blocked on an operation and not knowing why,
whereas when you return an instant error saying why the operation
cannot be completed right now that opens the door for a necessary
adjustment in their usage of the library.
>
> On idea for a solution would be that vfio only allows a single open
> of a group at a time, so if libvirt were to open the group it could
> know that it's unused. If you can manage to close the group once
> you've already triggered the remove/unbind, then I'd think the
> completion of the write would be deterministic. If the group is in
> use elsewhere, the open should get back an -EBUSY and you probably
> ought to be careful
Honestly, ^this seems like a fairly straightforward workaround to me.
Erik
Yes, thanks. This seems pretty clean and simple to to implement, and I
can't really think of any significant downsides.
Jonathon
> about removing/unbinding it anyway. It might be possible to
> implement this in mdevctl too, ie. redirect /dev/null to group file
> and fork, fork the echo 1 > remove, kill the redirect, return a
> device in use error if the initial redirect fails. Thanks,
>
> Alex