Re: [libvirt] [Qemu-devel] [PATCH] qdev: DEVICE_DELETED event

On 2013-03-08 03:15, Michael S. Tsirkin wrote:
On Thu, Mar 07, 2013 at 08:00:29PM +0100, Andreas Färber wrote:
On 07.03.2013 19:12, Michael S. Tsirkin wrote:
On Thu, Mar 07, 2013 at 06:23:46PM +0100, Markus Armbruster wrote:
"Michael S. Tsirkin"<mst@redhat.com> writes:
On Thu, Mar 07, 2013 at 03:14:15PM +0100, Markus Armbruster wrote:
Andreas Färber<afaerber@suse.de> writes:
> On 07.03.2013 11:07, Michael S. Tsirkin wrote:
>> On Thu, Mar 07, 2013 at 10:55:23AM +0100, Markus Armbruster wrote:
>>> "Michael S. Tsirkin"<mst@redhat.com> writes:
>>>
>>>> On Wed, Mar 06, 2013 at 02:57:22PM +0100, Andreas Färber wrote:
>>>>> On 06.03.2013 14:00, Michael S. Tsirkin wrote:
>>>>>> libvirt has a long-standing bug: when removing the device,
>>>>>> it can request removal but does not know when does the
>>>>>> removal complete. Add an event so we can fix this in a robust way.
>>>>>>
>>>>>> Signed-off-by: Michael S. Tsirkin<mst@redhat.com>
>>>>>
>>>>> Sounds like a good idea to me. :)
>>>>>
>>>>> [...]
>>>>>> diff --git a/hw/qdev.c b/hw/qdev.c
>>>>>> index 689cd54..f30d251 100644
>>>>>> --- a/hw/qdev.c
>>>>>> +++ b/hw/qdev.c
>>>>>> @@ -29,6 +29,7 @@
>>>>>>  #include "sysemu/sysemu.h"
>>>>>>  #include "qapi/error.h"
>>>>>>  #include "qapi/visitor.h"
>>>>>> +#include "qapi/qmp/qjson.h"
>>>>>>
>>>>>>  int qdev_hotplug = 0;
>>>>>>  static bool qdev_hot_added = false;
>>>>>> @@ -267,6 +268,11 @@ void qdev_init_nofail(DeviceState *dev)
>>>>>>  /* Unlink device from bus and free the structure. */
>>>>>>  void qdev_free(DeviceState *dev)
>>>>>>  {
>>>>>> +    if (dev->id) {
>>>>>> +        QObject *data = qobject_from_jsonf("{ 'device': %s }", dev->id);
>>>>>> +        monitor_protocol_event(QEVENT_DEVICE_DELETED, data);
>>>>>> +        qobject_decref(data);
>>>>>> +    }
>>>>>>      object_unparent(OBJECT(dev));
>>>>>>  }
>>>>>>
>>>>>
>>>>> I'm pretty sure this is the wrong place to fire the notification. We
>>>>> should rather do this when the device is actually deleted - which
>>>>> qdev_free() does *not* actually guarantee, as criticized in the s390x
>>>>> and unref'ing contexts.
>>>>> I would suggest to place your code into device_unparent() instead.
>>>>>
>>>>> Another thing to consider is what data to pass to the event: Not all
>>>>> devices have an ID.
>>>>
>>>> If they don't they were not created by management so management is
>>>> probably not interested in them being removed.
>>>>
>>>> We could always add a 'path' key later if this assumption
>>>> proves incorrect.
>>>
>>> In old qdev, ID was all we had, because paths were busted. Thus,
>>> management had no choice but use IDs.
>>>
>>> If I understand modern qdev correctly, we got a canonical path. Old
>>> APIs like device_del still accept only ID. Should new APIs still be
>>> designed that way? Or should they always accept / provide the canonical
>>> path, plus optional ID for convenience?
>>
>> What are advantages of exposing the path to users in this way?
The path is the device's canonical name. Canonical means path:device is 1:1. Path always works. Qdev ID only works when the user assigned one.
Funny case: board creates a hot-pluggable device by default (thus no qdev ID), guest ejects it, what do you put into the event? Your code simply doesn't emit one.
You could blame the user; after all he could've used -nodefaults, and added the device himself, with an ID.
I blame your design instead, which needlessly complicates the event's semantics: it gets emitted only for devices with a qdev ID. Which you neglected to document clearly, by the way.
Good point, I'll document this.
If you put the path into the event, you can emit it always, which is simpler. Feel free to throw in the qdev ID.
I don't blame anyone. User not assigning an id is a clear indication that user does not care about the lifetime of this device.
>> Looks like maintenance hassle without real benefits?
I can't see path being a greater maintenance hassle than ID.
Sure, the fewer events we emit the less we need to support. You want to expose all kinds of internal events; then management will come to depend on them and we'll have to maintain them forever.
Misunderstanding. I'm *not* asking for more events. I'm asking for the DEVICE_DELETED event to carry the device's canonical name: its QOM path.
> Anthony had rejected earlier QOM patches by Paolo related to qdev id, > saying it was deprecated in favor of those QOM paths.
More reason to put the path into the event, not just the qdev ID.
libvirt does not seem to want it there. We'll always be able to add info but will never be able to remove info, so keep it minimal.
Yes, adding members to an event is easy. Doesn't mean we should do it just for the heck of it. If we don't need a member now, and we think there's a chance we won't need it in the future, then we probably shouldn't add it now.
I believe the chance of not needing the QOM path is effectively zero.
Moreover, we'd add not just a member in this case, we'd add a *trigger*.
Before: the event gets emitted only for devices with a qdev ID.
After: the event gets emitted for all devices.
I very much prefer the latter, because it's simpler.
[...]
I still don't see why it's useful for anyone. For now I hear from the libvirt guys that this patch does exactly what they need so I'll keep it simple. You are welcome to send a follow-up patch adding a path and more triggers, I won't object.
Well, the libvirt guys have been told to poll using qom-list, which needs the path, not an ID. Using it in both places would make it symmetrical - that may qualify as useful. (I'm not aware of any id -> path lookup QMP command.)
Nonetheless, you can retain my Reviewed-by on v4+ as long as the code in hw/qdev.c doesn't change.
Andreas
I suggested retrying device_del; this has the advantage of working on more qemu versions.
I'm wondering whether it could take a long time for the device_del to complete (AFAIK from previous bugs it can, though it should be fine in most cases). If it takes too long, it will be a problem for management, because it looks like a hang. We can have a timeout for the device_del in libvirt, but the problem is that the device_del can still be in progress in qemu, which could cause inconsistency, unless qemu has some command to cancel the device_del.
Osier
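The "Before" and "After" emission rules argued over earlier in this message can be put side by side in a small Python sketch. It only models the decision of when an event is produced and what it carries. The function names, the `path` key and the example path are hypothetical; only the event name DEVICE_DELETED and the `device` key come from the posted patch.

```python
import json
from typing import Optional


def event_id_only(dev_id: Optional[str]) -> Optional[dict]:
    """Rule in the posted patch: emit only when the device has a qdev ID."""
    if dev_id is None:
        return None  # devices without an ID produce no event at all
    return {"event": "DEVICE_DELETED", "data": {"device": dev_id}}


def event_with_path(path: str, dev_id: Optional[str]) -> dict:
    """Proposed alternative: always emit, identified by QOM path.

    The 'path' key is an assumption made for this sketch.
    """
    data = {"path": path}
    if dev_id is not None:
        data["device"] = dev_id  # optional extra, for convenience
    return {"event": "DEVICE_DELETED", "data": data}


if __name__ == "__main__":
    # A board-created device without an ID: the first rule stays silent.
    print(event_id_only(None))
    # Illustrative path only; real paths depend on the machine.
    print(json.dumps(event_with_path("/machine/example/device[0]", None)))
```

With the first rule, a consumer must know that silence can mean either "not removed yet" or "removed, but the device had no ID". With the second rule every removal is visible.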

Osier Yang <jyang@redhat.com> writes:
I'm wondering if it could be long time to wait for the device_del completes (AFAIK from previous bugs, it can be, though it should be fine for most of the cases). If it's too long, it will be a problem for management, because it looks like hanging. We can have a timeout for the device_del in libvirt, but the problem is the device_del can be still in progress by qemu, which could cause the inconsistency. Unless qemu has some command to cancel the device_del.
I'm afraid cancelling isn't possible, at least not for PCI.
Here's how device_del works for PCI when it works, roughly:
1. device_del asks the device model to unplug itself.
2. PCI device models delegate the job to the device model providing their PCI bus. Let's assume it's our PIIX3/PIIX4 mongrel. That one puts an unplug request into PIIX4 function 3 where guest ACPI can see it, and triggers its interrupt. Then it immediately sends the QMP success reply.
3. Guest ACPI (SeaBIOS) services the interrupt. It finds the unplug request, and asks the guest OS nicely to give up the device.
4. If the guest OS has a working ACPI driver, and it feels like giving up the device, it does so, and tells ACPI when it's done.
5. Guest ACPI cleans up whatever it needs cleaned up, and signals successful unplug by writing the slot number to a PIIX4 function 3 register.
6. The PIIX device destroys the device in that slot.
I call this the ACPI unplug dance.
We don't control steps 3..5.
There's no way for the guest to tell us "I got your unplug request, but I'm not going to honor it". Even if there was, a guest without a working ACPI driver wouldn't use it, so we couldn't rely on it anyway.
There's no way for us to tell the guest "I changed my mind on this unplug". All we can do is wait and see. Either the device goes away, or it stays.
Native PCIe is different, I'm told, but I know even less of that than I know of PCI/ACPI.
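The unplug sequence described above can be reduced to a toy state machine. This is a sketch in Python with hypothetical class and method names, not QEMU code. It is meant only to show that the success reply belongs to step 2, while the device disappears in step 6, and that nothing forces the guest-driven steps to happen.

```python
from enum import Enum


class UnplugState(Enum):
    PLUGGED = "plugged"
    REQUESTED = "unplug requested"  # steps 1-2 done, reply already sent
    DELETED = "deleted"             # step 6 done


class Slot:
    """Toy model of one hot-pluggable PCI slot (hypothetical)."""

    def __init__(self) -> None:
        self.state = UnplugState.PLUGGED

    def device_del(self) -> str:
        # Steps 1-2: record the request where the guest can see it,
        # then reply at once. The reply says nothing about step 6.
        if self.state is UnplugState.DELETED:
            raise LookupError("device not found")
        self.state = UnplugState.REQUESTED
        return "success"

    def guest_eject(self) -> None:
        # Steps 3-6: run only if the guest cooperates; may never be called.
        if self.state is UnplugState.REQUESTED:
            self.state = UnplugState.DELETED


if __name__ == "__main__":
    slot = Slot()
    print(slot.device_del(), "->", slot.state.value)
    slot.guest_eject()
    print("after guest eject ->", slot.state.value)
```

There is deliberately no transition from REQUESTED back to PLUGGED, which mirrors the point that the request cannot be withdrawn.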

On Fri, Mar 08, 2013 at 09:50:55 +0100, Markus Armbruster wrote:
Osier Yang <jyang@redhat.com> writes:
I'm wondering if it could be long time to wait for the device_del completes (AFAIK from previous bugs, it can be, though it should be fine for most of the cases). If it's too long, it will be a problem for management, because it looks like hanging. We can have a timeout for the device_del in libvirt, but the problem is the device_del can be still in progress by qemu, which could cause the inconsistency. Unless qemu has some command to cancel the device_del.
I'm afraid cancelling isn't possible, at least not for PCI.
I don't think we need anything like that. We just need the device deletion API to return immediately without actually removing stuff from the domain definition (unless the device was really removed fast enough, e.g., USB devices are removed before device_del returns) and then remove the device from the domain definition when we get the event from QEMU or when libvirtd reconnects to a domain and sees a particular device is no longer present. After all, devices may be removed even if we didn't ask for it (when the removal is initiated by a guest OS). And we should also provide a similar event for higher level apps.
The question is whether we can make use of our existing API or if we need to introduce a new one. But that's of little relevance to qemu-devel I guess.
Jirka
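A minimal Python sketch of the bookkeeping proposed in this message might look as follows. It is not libvirt code: the class, its method names and the device IDs are hypothetical. The detach call returns immediately, and the definition is updated when the event arrives or when a reconnect finds the device missing.

```python
class DomainDevices:
    """Toy model of a domain definition with deferred device removal."""

    def __init__(self, devices):
        self.defined = set(devices)  # devices in the domain definition
        self.pending = set()         # device_del sent, removal not confirmed

    def detach(self, dev_id, removed_synchronously=False):
        """Return right away; drop from the definition only if already gone."""
        if dev_id not in self.defined:
            raise KeyError(dev_id)
        if removed_synchronously:
            self.defined.discard(dev_id)
        else:
            self.pending.add(dev_id)

    def on_device_deleted(self, dev_id):
        """Handle the event, whether or not we asked for the removal."""
        self.pending.discard(dev_id)
        self.defined.discard(dev_id)

    def reconcile(self, present_in_qemu):
        """After reconnecting, drop every device QEMU no longer reports."""
        for dev_id in self.defined - set(present_in_qemu):
            self.on_device_deleted(dev_id)


if __name__ == "__main__":
    dom = DomainDevices(["net0", "usb0", "disk1"])
    dom.detach("usb0", removed_synchronously=True)
    dom.detach("net0")
    print(sorted(dom.defined), sorted(dom.pending))
    dom.on_device_deleted("net0")
    print(sorted(dom.defined), sorted(dom.pending))
```

Because `on_device_deleted` does not require a prior `detach`, the same path covers removals initiated by the guest.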

On 2013-03-08 17:25, Jiri Denemark wrote:
On Fri, Mar 08, 2013 at 09:50:55 +0100, Markus Armbruster wrote:
Osier Yang<jyang@redhat.com> writes:
I'm wondering if it could be long time to wait for the device_del completes (AFAIK from previous bugs, it can be, though it should be fine for most of the cases). If it's too long, it will be a problem for management, because it looks like hanging. We can have a timeout for the device_del in libvirt, but the problem is the device_del can be still in progress by qemu, which could cause the inconsistency. Unless qemu has some command to cancel the device_del.
I'm afraid cancelling isn't possible, at least not for PCI.
I don't think we need anything like that. We just need the device deletion API to return immediately without actually removing stuff from domain definition (unless the device was really removed fast enough, e.g., USB devices are removed before device_del returns) and then remove the device from domain definition when we get the event from QEMU or when libvirtd reconnects to a domain and sees a particular device is no longer present. After all, devices may be removed even if we didn't ask for it (when the removal is initiated by a guest OS). And we should also provide similar event for higher level apps.
Not removing the device from the domain config until we get the event from qemu, or find by polling that the device has disappeared, makes sense. That's actually the main reason we want the event and the polling. But the problem is that our APIs should not hang for a long time. If we don't change the APIs and return quickly, as we do currently, it is confusing for the user: when he tries to attach the device again while the device_del is still in progress, he will get an error like "Device ID *** is in use", even though our detach API returned success before that. I.e., if device_del needs a long time to complete in some cases, can we live with it?
Osier

On 2013-03-08 18:37, Osier Yang wrote:
On 2013-03-08 17:25, Jiri Denemark wrote:
On Fri, Mar 08, 2013 at 09:50:55 +0100, Markus Armbruster wrote:
Osier Yang<jyang@redhat.com> writes:
I'm wondering if it could be long time to wait for the device_del completes (AFAIK from previous bugs, it can be, though it should be fine for most of the cases). If it's too long, it will be a problem for management, because it looks like hanging. We can have a timeout for the device_del in libvirt, but the problem is the device_del can be still in progress by qemu, which could cause the inconsistency. Unless qemu has some command to cancel the device_del.
I'm afraid cancelling isn't possible, at least not for PCI.
I don't think we need anything like that. We just need the device deletion API to return immediately without actually removing stuff from domain definition (unless the device was really removed fast enough, e.g., USB devices are removed before device_del returns) and then remove the device from domain definition when we get the event from QEMU or when libvirtd reconnects to a domain and sees a particular device is no longer present. After all, devices may be removed even if we didn't ask for it (when the removal is initiated by a guest OS). And we should also provide similar event for higher level apps.
Not removing the device from the domain config until we get the event from qemu, or find by polling that the device has disappeared, makes sense. That's actually the main reason we want the event and the polling.
But the problem is that our APIs should not hang for a long time. If we don't change the APIs and return quickly, as we do currently, it is confusing for the user: when he tries to attach the device again while the device_del is still in progress, he will get an error like "Device ID *** is in use", even though our detach API returned success before that.
I.e., if device_del needs a long time to complete in some cases, can we live with it?
After talking with Jirka internally on IRC, we agreed that waiting for the qemu event (or polling) before the detach APIs return success is not workable: the time device_del takes to complete varies widely and, even worse, it may never complete, which means we might break backward compatibility if we went that way. The conclusion is that we need to document that a detach API returning success doesn't mean the device is really removed, and we should also expose the qemu event in libvirt so that upper-layer management has a way to know whether the device is really gone.
Osier

On 03/08/2013 04:25 AM, Jiri Denemark wrote:
On Fri, Mar 08, 2013 at 09:50:55 +0100, Markus Armbruster wrote:
Osier Yang <jyang@redhat.com> writes:
I'm wondering if it could be long time to wait for the device_del completes (AFAIK from previous bugs, it can be, though it should be fine for most of the cases). If it's too long, it will be a problem for management, because it looks like hanging. We can have a timeout for the device_del in libvirt, but the problem is the device_del can be still in progress by qemu, which could cause the inconsistency. Unless qemu has some command to cancel the device_del.
I'm afraid cancelling isn't possible, at least not for PCI.
I don't think we need anything like that. We just need the device deletion API to return immediately without actually removing stuff from domain definition
I don't think we can do that - it changes the user-visible semantics. I think we need to continue to remove the device from the XML immediately, but internally keep track of the fact that this device (and the qemu id used to refer to it) can't yet be re-used.
The qemu driver currently has activeHostPciDevs and inactiveHostPciDevs. Maybe we also need a "zombieHostPciDevs" for devices that we've sent the device_del command for, but haven't yet received notice that they're actually removed.
(BTW, shouldn't these lists of devices be global to all of libvirt, rather than qemu-specific?)
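The "zombie" list idea can be sketched in a few lines of Python. This is an illustration with hypothetical names and error text, not the libvirt data structures mentioned above. A detached ID leaves the visible set immediately but cannot be reused until removal is confirmed.

```python
class DeviceIds:
    """Toy tracker for device IDs that are detached but not yet gone."""

    def __init__(self):
        self.active = set()  # IDs visible in the domain XML
        self.zombie = set()  # device_del sent, removal not yet confirmed

    def attach(self, dev_id):
        if dev_id in self.zombie:
            # Fail early with a clearer message than a generic ID clash.
            raise RuntimeError(f"device '{dev_id}' is still being removed")
        if dev_id in self.active:
            raise RuntimeError(f"device '{dev_id}' is already in use")
        self.active.add(dev_id)

    def detach(self, dev_id):
        # The device leaves the visible definition immediately...
        self.active.remove(dev_id)
        # ...but its ID stays reserved until removal is confirmed.
        self.zombie.add(dev_id)

    def on_device_deleted(self, dev_id):
        self.zombie.discard(dev_id)


if __name__ == "__main__":
    ids = DeviceIds()
    ids.attach("hostdev0")
    ids.detach("hostdev0")
    try:
        ids.attach("hostdev0")
    except RuntimeError as err:
        print(err)
    ids.on_device_deleted("hostdev0")
    ids.attach("hostdev0")
    print(sorted(ids.active))
```

The user-visible semantics stay as they are (the device vanishes from the definition at once), and the only new behaviour is the earlier, more specific error on reuse.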

On 2013-03-12 23:11, Laine Stump wrote:
On 03/08/2013 04:25 AM, Jiri Denemark wrote:
On Fri, Mar 08, 2013 at 09:50:55 +0100, Markus Armbruster wrote:
Osier Yang<jyang@redhat.com> writes:
I'm wondering if it could be long time to wait for the device_del completes (AFAIK from previous bugs, it can be, though it should be fine for most of the cases). If it's too long, it will be a problem for management, because it looks like hanging. We can have a timeout for the device_del in libvirt, but the problem is the device_del can be still in progress by qemu, which could cause the inconsistency. Unless qemu has some command to cancel the device_del.
I'm afraid cancelling isn't possible, at least not for PCI.
I don't think we need anything like that. We just need the device deletion API to return immediately without actually removing stuff from domain definition
I don't think we can do that - it changes the user-visible semantics. I think we need to continue to remove the device from the XML immediately, but internally keep track of the fact that this device (and the qemu id used to refer to it) can't yet be re-used.
Yeah, I think there is agreement now, either in this thread (I posted the conclusion from talking with Jirka), or in the comments of the related bug (#BZ 813752).
The qemu driver currently has activeHostPciDevs and inactiveHostPciDevs. Maybe we also need a "zombieHostPciDevs" for devices that we've sent the device_del command for, but haven't yet received notice that they're actually removed.
Having an internal list may help us improve the error message and bail out earlier instead of going through to qemu, but we will need the domain's internal XML anyway; otherwise there is no way to know which devices are pending the qemu event or need polling. OTOH, I'm wondering how much benefit we would get from the new internal list: is there any benefit other than bailing out a bit earlier and a more sensible error message than the one from qemu? If not, is it worth maintaining a hairy internal list (going by the experience with activePciHostDevs)? I would say it's not.
(BTW, shouldn't these lists of devices be global to all of libvirt, rather than qemu-specific?)
Good point, it should be global to avoid conflicts between VMs of different drivers.
Osier

On 2013-03-08 16:50, Markus Armbruster wrote:
Osier Yang<jyang@redhat.com> writes:
I'm wondering if it could be long time to wait for the device_del completes (AFAIK from previous bugs, it can be, though it should be fine for most of the cases). If it's too long, it will be a problem for management, because it looks like hanging. We can have a timeout for the device_del in libvirt, but the problem is the device_del can be still in progress by qemu, which could cause the inconsistency. Unless qemu has some command to cancel the device_del.
I'm afraid cancelling isn't possible, at least not for PCI.
Here's how device_del works for PCI when it works, roughly:
1. device_del asks the device model to unplug itself.
2. PCI device models delegate the job to the device model providing their PCI bus. Let's assume it's our PIIX3/PIIX4 mongrel. That one puts an unplug request into PIIX4 function 3 where guest ACPI can see it, and triggers its interrupt. Then it immediately sends the QMP success reply.
3. Guest ACPI (SeaBIOS) services the interrupt. It finds the unplug request, and asks the guest OS nicely to give up the device.
4. If the guest OS has a working ACPI driver, and it feels like giving up the device, it does so, and tells ACPI when it's done.
5. Guest ACPI cleans up whatever it needs cleaned up, and signals successful unplug by writing the slot number to a PIIX4 function 3 register.
6. The PIIX device destroys the device in that slot.
I call this the ACPI unplug dance.
We don't control steps 3..5.
There's no way for the guest to tell us "I got your unplug request, but I'm not going to honor it". Even if there was, a guest without a working ACPI driver wouldn't use it, so we couldn't rely on it anyway.
There's no way for us to tell the guest "I changed my mind on this unplug". All we can do is wait and see. Either the device goes away, or it stays.
Hum, as I replied to Jirka in a later mail, IMHO the libvirt detach APIs need to change to either wait for the event or find by polling that the device is really removed before returning success. But it sounds to me like how long the waiting or polling takes really depends?
Osier

Osier Yang <jyang@redhat.com> writes:
On 2013-03-08 16:50, Markus Armbruster wrote:
Osier Yang<jyang@redhat.com> writes:
I'm wondering if it could be long time to wait for the device_del completes (AFAIK from previous bugs, it can be, though it should be fine for most of the cases). If it's too long, it will be a problem for management, because it looks like hanging. We can have a timeout for the device_del in libvirt, but the problem is the device_del can be still in progress by qemu, which could cause the inconsistency. Unless qemu has some command to cancel the device_del.
I'm afraid cancelling isn't possible, at least not for PCI.
Here's how device_del works for PCI when it works, roughly:
1. device_del asks the device model to unplug itself.
2. PCI device models delegate the job to the device model providing their PCI bus. Let's assume it's our PIIX3/PIIX4 mongrel. That one puts an unplug request into PIIX4 function 3 where guest ACPI can see it, and triggers its interrupt. Then it immediately sends the QMP success reply.
3. Guest ACPI (SeaBIOS) services the interrupt. It finds the unplug request, and asks the guest OS nicely to give up the device.
4. If the guest OS has a working ACPI driver, and it feels like giving up the device, it does so, and tells ACPI when it's done.
5. Guest ACPI cleans up whatever it needs cleaned up, and signals successful unplug by writing the slot number to a PIIX4 function 3 register.
6. The PIIX device destroys the device in that slot.
I call this the ACPI unplug dance.
We don't control steps 3..5.
There's no way for the guest to tell us "I got your unplug request, but I'm not going to honor it". Even if there was, a guest without a working ACPI driver wouldn't use it, so we couldn't rely on it anyway.
There's no way for us to tell the guest "I changed my mind on this unplug". All we can do is wait and see. Either the device goes away, or it stays.
Hum, as I replied to Jirka in later mail, IMHO it needs to change libvirt detaching APIs to either wait for the event or find the device is really removed by polling before returning success. But it sounds to me that how long it takes to wait or polling is really depended?
Time between device_del and the DEVICE_DELETED event is *unbounded*. Could be instantaneous, could be never, could be anything in between. I'd expect it to be either fairly short or never most of the time in practice.
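Since the delay is unbounded, a caller that waits for the event needs a timeout, and a timeout can only mean "not confirmed yet", never "failed". A minimal Python sketch of that distinction, using `threading.Event` from the standard library, could look like this. The class and method names are hypothetical.

```python
import threading


class RemovalWaiter:
    """Wait for a removal notification without blocking forever."""

    def __init__(self) -> None:
        self._gone = threading.Event()

    def on_device_deleted(self) -> None:
        # Called from whatever thread receives the DEVICE_DELETED event.
        self._gone.set()

    def wait(self, timeout: float) -> bool:
        """Return True if removal was confirmed, False if still unknown.

        False does not mean the unplug failed; the event may still
        arrive later, or never.
        """
        return self._gone.wait(timeout)


if __name__ == "__main__":
    waiter = RemovalWaiter()
    print("confirmed:", waiter.wait(0.01))
    waiter.on_device_deleted()
    print("confirmed:", waiter.wait(0.01))
```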
participants (4): Jiri Denemark, Laine Stump, Markus Armbruster, Osier Yang