[libvirt] [PATCH] events: Fix domain event race on client disconnect

GNOME Boxes sometimes stops getting domain events from libvirtd, even after restarting it. Further investigation in libvirtd shows that events are properly queued with virDomainEventStateQueue, but the timer virDomainEventTimer which flushes the events and sends them to the clients never gets called. Looking at the event queue in gdb shows that it's non-empty and that its size increases with each new events. virDomainEventTimer is set up in virDomainEventStateRegister[ID] when going from 0 client connecte to 1 client connected, but is initially disabled. The timer is removed in virDomainEventStateRegister[ID] when the last client is disconnected (going from 1 client connected to 0). This timer (which handles sending the events to the clients) is enabled in virDomainEventStateQueue when queueing an event on an empty queue (queue containing 0 events). It's disabled in virDomainEventStateFlush after flushing the queue (ie removing all the elements from it). This way, no extra work is done when the queue is empty, and when the next event comes up, the timer will get reenabled because the queue will go from 0 event to 1 event, which triggers enabling the timer. However, with this Boxes bug, we have a client connected (Boxes), a non-empty queue (there are events waiting to be sent), but a disabled timer, so something went wrong. When Boxes connects (it's the only client connecting to the libvirtd instance I used for debugging), the event timer is not set as expected (state->timer == -1 when virDomainEventStateRegisterID is called), but at the same time the event queue is not empty. In other words, we had no clients connected, but pending events. This also explains why the timer never gets enabled as this is only done when an event is queued on an empty queue. I think this can happen if an event gets queued using virDomainEventStateQueue and the client disconnection happens before the event timer virDomainEventTimer gets a chance to run and flush the event. In this situation, virDomainEventStateDeregister[ID] will get called with a non-empty event queue, the timer will be destroyed if this was the only client connected. Then, when other clients connect at a later time, they will never get notified about domain events as the event timer will never get enabled because the timer is only enabled if the event queue is empty when virDomainEventStateRegister[ID] gets called, which will is no longer the case. To avoid this issue, this commit makes sure to remove all events from the event queue when the last client in unregistered. As there is no longer anyone interested in receiving these events, these events are stale so there is no need to keep them around. A client connecting later will have no interest in getting events that happened before it got connected. --- src/conf/domain_event.c | 24 +++++++++++++++++++++--- 1 file changed, 21 insertions(+), 3 deletions(-) diff --git a/src/conf/domain_event.c b/src/conf/domain_event.c index 43ecdcf..7ab973b 100644 --- a/src/conf/domain_event.c +++ b/src/conf/domain_event.c @@ -525,13 +525,13 @@ void virDomainEventFree(virDomainEventPtr event) } /** - * virDomainEventQueueFree: + * virDomainEventQueueClear: * @queue: pointer to the queue * - * Free the memory in the queue. We process this like a list here + * Removes all elements from the queue */ static void -virDomainEventQueueFree(virDomainEventQueuePtr queue) +virDomainEventQueueClear(virDomainEventQueuePtr queue) { int i; if (!queue) @@ -541,6 +541,22 @@ virDomainEventQueueFree(virDomainEventQueuePtr queue) virDomainEventFree(queue->events[i]); } VIR_FREE(queue->events); + queue->count = 0; +} + +/** + * virDomainEventQueueFree: + * @queue: pointer to the queue + * + * Free the memory in the queue. We process this like a list here + */ +static void +virDomainEventQueueFree(virDomainEventQueuePtr queue) +{ + if (!queue) + return; + + virDomainEventQueueClear(queue); VIR_FREE(queue); } @@ -1559,6 +1575,7 @@ virDomainEventStateDeregister(virConnectPtr conn, state->timer != -1) { virEventRemoveTimeout(state->timer); state->timer = -1; + virDomainEventQueueClear(state->queue); } virDomainEventStateUnlock(state); @@ -1596,6 +1613,7 @@ virDomainEventStateDeregisterID(virConnectPtr conn, state->timer != -1) { virEventRemoveTimeout(state->timer); state->timer = -1; + virDomainEventQueueClear(state->queue); } virDomainEventStateUnlock(state); -- 1.7.11.4

On Thu, Sep 06, 2012 at 03:06:41PM +0200, Christophe Fergeau wrote:
GNOME Boxes sometimes stops getting domain events from libvirtd, even after restarting it. Further investigation in libvirtd shows that events are properly queued with virDomainEventStateQueue, but the timer virDomainEventTimer which flushes the events and sends them to the clients never gets called. Looking at the event queue in gdb shows that it's non-empty and that its size increases with each new events.
virDomainEventTimer is set up in virDomainEventStateRegister[ID] when going from 0 client connecte to 1 client connected, but is initially disabled. The timer is removed in virDomainEventStateRegister[ID] when the last client is disconnected (going from 1 client connected to 0).
This timer (which handles sending the events to the clients) is enabled in virDomainEventStateQueue when queueing an event on an empty queue (queue containing 0 events). It's disabled in virDomainEventStateFlush after flushing the queue (ie removing all the elements from it). This way, no extra work is done when the queue is empty, and when the next event comes up, the timer will get reenabled because the queue will go from 0 event to 1 event, which triggers enabling the timer.
However, with this Boxes bug, we have a client connected (Boxes), a non-empty queue (there are events waiting to be sent), but a disabled timer, so something went wrong.
When Boxes connects (it's the only client connecting to the libvirtd instance I used for debugging), the event timer is not set as expected (state->timer == -1 when virDomainEventStateRegisterID is called), but at the same time the event queue is not empty. In other words, we had no clients connected, but pending events. This also explains why the timer never gets enabled as this is only done when an event is queued on an empty queue.
I think this can happen if an event gets queued using virDomainEventStateQueue and the client disconnection happens before the event timer virDomainEventTimer gets a chance to run and flush the event. In this situation, virDomainEventStateDeregister[ID] will get called with a non-empty event queue, the timer will be destroyed if this was the only client connected. Then, when other clients connect at a later time, they will never get notified about domain events as the event timer will never get enabled because the timer is only enabled if the event queue is empty when virDomainEventStateRegister[ID] gets called, which will is no longer the case.
A nice long detailed explanation. I agree that this scenario you outline is plausible as an explanation for why Boxes sometimes stops getting events from libvirtd. It also explains why we don't see it with VDSM - since they're a long running service, not a frequently stoppped/started desktop app, they're much less likely to hit the race.
To avoid this issue, this commit makes sure to remove all events from the event queue when the last client in unregistered. As there is no longer anyone interested in receiving these events, these events are stale so there is no need to keep them around. A client connecting later will have no interest in getting events that happened before it got connected.
Yep, makes sense.
--- src/conf/domain_event.c | 24 +++++++++++++++++++++--- 1 file changed, 21 insertions(+), 3 deletions(-)
diff --git a/src/conf/domain_event.c b/src/conf/domain_event.c index 43ecdcf..7ab973b 100644 --- a/src/conf/domain_event.c +++ b/src/conf/domain_event.c @@ -525,13 +525,13 @@ void virDomainEventFree(virDomainEventPtr event) }
/** - * virDomainEventQueueFree: + * virDomainEventQueueClear: * @queue: pointer to the queue * - * Free the memory in the queue. We process this like a list here + * Removes all elements from the queue */ static void -virDomainEventQueueFree(virDomainEventQueuePtr queue) +virDomainEventQueueClear(virDomainEventQueuePtr queue) { int i; if (!queue) @@ -541,6 +541,22 @@ virDomainEventQueueFree(virDomainEventQueuePtr queue) virDomainEventFree(queue->events[i]); } VIR_FREE(queue->events); + queue->count = 0; +} + +/** + * virDomainEventQueueFree: + * @queue: pointer to the queue + * + * Free the memory in the queue. We process this like a list here + */ +static void +virDomainEventQueueFree(virDomainEventQueuePtr queue) +{ + if (!queue) + return; + + virDomainEventQueueClear(queue); VIR_FREE(queue); }
@@ -1559,6 +1575,7 @@ virDomainEventStateDeregister(virConnectPtr conn, state->timer != -1) { virEventRemoveTimeout(state->timer); state->timer = -1; + virDomainEventQueueClear(state->queue); }
virDomainEventStateUnlock(state); @@ -1596,6 +1613,7 @@ virDomainEventStateDeregisterID(virConnectPtr conn, state->timer != -1) { virEventRemoveTimeout(state->timer); state->timer = -1; + virDomainEventQueueClear(state->queue); }
virDomainEventStateUnlock(state);
ACK Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On Fri, Sep 07, 2012 at 01:24:35PM +0100, Daniel P. Berrange wrote:
A nice long detailed explanation. I agree that this scenario you outline is plausible as an explanation for why Boxes sometimes stops getting events from libvirtd.
I've ran more tests in the mean time without this patch applied, but with the one below to add some debugging: diff --git a/src/conf/domain_event.c b/src/conf/domain_event.c index 43ecdcf..33d90fb 100644 --- a/src/conf/domain_event.c +++ b/src/conf/domain_event.c @@ -1501,7 +1501,13 @@ virDomainEventStateRegisterID(virConnectPtr conn, int ret = -1; virDomainEventStateLock(state); + VIR_WARN("RegisterID"); + if ((state->callbacks->count == 0) && (state->timer == -1)) { + if (state->queue->count != 0) { + VIR_WARN("REG: queue's not empty: %d", state->queue->count); + } + } if ((state->callbacks->count == 0) && (state->timer == -1) && (state->timer = virEventAddTimeout(-1, @@ -1584,6 +1590,7 @@ virDomainEventStateDeregisterID(virConnectPtr conn, { int ret; + VIR_WARN("DeregisterID"); virDomainEventStateLock(state); if (state->isDispatching) ret = virDomainEventCallbackListMarkDeleteID(conn, @@ -1596,6 +1603,9 @@ virDomainEventStateDeregisterID(virConnectPtr conn, state->timer != -1) { virEventRemoveTimeout(state->timer); state->timer = -1; + if (state->queue->count != 0) { + VIR_WARN("DEREG: queue's not empty: %d", state->queue->count); + } } virDomainEventStateUnlock(state); I've hit the event lost issue once, and right when this started happening, the log was: 2012-09-06 11:37:06.094+0000: 30498: warning : virDomainEventStateDeregisterID:1593 : DeregisterID 2012-09-06 11:37:06.094+0000: 30498: warning : virDomainEventStateDeregisterID:1607 : DEREG: queue's not empty: 1 2012-09-06 11:45:42.363+0000: 30502: warning : virDomainEventStateRegisterID:1504 : RegisterID 2012-09-06 11:45:42.363+0000: 30502: warning : virDomainEventStateRegisterID:1508 : REG: queue's not empty: 1 and after that, no events and these warnings kept happening with an increasing number of queued events which is consistent with the hypothesis I made in this patch. Christophe

On Fri, Sep 07, 2012 at 02:44:03PM +0200, Christophe Fergeau wrote:
On Fri, Sep 07, 2012 at 01:24:35PM +0100, Daniel P. Berrange wrote:
A nice long detailed explanation. I agree that this scenario you outline is plausible as an explanation for why Boxes sometimes stops getting events from libvirtd.
I've ran more tests in the mean time without this patch applied, but with the one below to add some debugging:
diff --git a/src/conf/domain_event.c b/src/conf/domain_event.c index 43ecdcf..33d90fb 100644 --- a/src/conf/domain_event.c +++ b/src/conf/domain_event.c @@ -1501,7 +1501,13 @@ virDomainEventStateRegisterID(virConnectPtr conn, int ret = -1;
virDomainEventStateLock(state); + VIR_WARN("RegisterID");
+ if ((state->callbacks->count == 0) && (state->timer == -1)) { + if (state->queue->count != 0) { + VIR_WARN("REG: queue's not empty: %d", state->queue->count); + } + } if ((state->callbacks->count == 0) && (state->timer == -1) && (state->timer = virEventAddTimeout(-1, @@ -1584,6 +1590,7 @@ virDomainEventStateDeregisterID(virConnectPtr conn, { int ret;
+ VIR_WARN("DeregisterID"); virDomainEventStateLock(state); if (state->isDispatching) ret = virDomainEventCallbackListMarkDeleteID(conn, @@ -1596,6 +1603,9 @@ virDomainEventStateDeregisterID(virConnectPtr conn, state->timer != -1) { virEventRemoveTimeout(state->timer); state->timer = -1; + if (state->queue->count != 0) { + VIR_WARN("DEREG: queue's not empty: %d", state->queue->count); + } }
virDomainEventStateUnlock(state);
I've hit the event lost issue once, and right when this started happening, the log was: 2012-09-06 11:37:06.094+0000: 30498: warning : virDomainEventStateDeregisterID:1593 : DeregisterID 2012-09-06 11:37:06.094+0000: 30498: warning : virDomainEventStateDeregisterID:1607 : DEREG: queue's not empty: 1 2012-09-06 11:45:42.363+0000: 30502: warning : virDomainEventStateRegisterID:1504 : RegisterID 2012-09-06 11:45:42.363+0000: 30502: warning : virDomainEventStateRegisterID:1508 : REG: queue's not empty: 1
and after that, no events and these warnings kept happening with an increasing number of queued events which is consistent with the hypothesis I made in this patch.
Great, that's pretty encouraging then. Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On 07.09.2012 14:47, Daniel P. Berrange wrote:
On Fri, Sep 07, 2012 at 02:44:03PM +0200, Christophe Fergeau wrote:
On Fri, Sep 07, 2012 at 01:24:35PM +0100, Daniel P. Berrange wrote:
A nice long detailed explanation. I agree that this scenario you outline is plausible as an explanation for why Boxes sometimes stops getting events from libvirtd.
I've ran more tests in the mean time without this patch applied, but with the one below to add some debugging:
diff --git a/src/conf/domain_event.c b/src/conf/domain_event.c index 43ecdcf..33d90fb 100644 --- a/src/conf/domain_event.c +++ b/src/conf/domain_event.c @@ -1501,7 +1501,13 @@ virDomainEventStateRegisterID(virConnectPtr conn, int ret = -1;
virDomainEventStateLock(state); + VIR_WARN("RegisterID");
[1]
+ if ((state->callbacks->count == 0) && (state->timer == -1)) { + if (state->queue->count != 0) { + VIR_WARN("REG: queue's not empty: %d", state->queue->count);
I understand the WARN level here but not in [1]. Isn't DEBUG just enough there?
+ } + } if ((state->callbacks->count == 0) && (state->timer == -1) && (state->timer = virEventAddTimeout(-1, @@ -1584,6 +1590,7 @@ virDomainEventStateDeregisterID(virConnectPtr conn, { int ret;
+ VIR_WARN("DeregisterID");
ditto
virDomainEventStateLock(state); if (state->isDispatching) ret = virDomainEventCallbackListMarkDeleteID(conn, @@ -1596,6 +1603,9 @@ virDomainEventStateDeregisterID(virConnectPtr conn, state->timer != -1) { virEventRemoveTimeout(state->timer); state->timer = -1; + if (state->queue->count != 0) { + VIR_WARN("DEREG: queue's not empty: %d", state->queue->count); + } }
virDomainEventStateUnlock(state);
I've hit the event lost issue once, and right when this started happening, the log was: 2012-09-06 11:37:06.094+0000: 30498: warning : virDomainEventStateDeregisterID:1593 : DeregisterID 2012-09-06 11:37:06.094+0000: 30498: warning : virDomainEventStateDeregisterID:1607 : DEREG: queue's not empty: 1 2012-09-06 11:45:42.363+0000: 30502: warning : virDomainEventStateRegisterID:1504 : RegisterID 2012-09-06 11:45:42.363+0000: 30502: warning : virDomainEventStateRegisterID:1508 : REG: queue's not empty: 1
and after that, no events and these warnings kept happening with an increasing number of queued events which is consistent with the hypothesis I made in this patch.
Great, that's pretty encouraging then.
Daniel

Hey, On Fri, Sep 07, 2012 at 02:55:51PM +0200, Michal Privoznik wrote:
On 07.09.2012 14:47, Daniel P. Berrange wrote:
On Fri, Sep 07, 2012 at 02:44:03PM +0200, Christophe Fergeau wrote:
On Fri, Sep 07, 2012 at 01:24:35PM +0100, Daniel P. Berrange wrote:
A nice long detailed explanation. I agree that this scenario you outline is plausible as an explanation for why Boxes sometimes stops getting events from libvirtd.
I've ran more tests in the mean time without this patch applied, but with the one below to add some debugging:
diff --git a/src/conf/domain_event.c b/src/conf/domain_event.c index 43ecdcf..33d90fb 100644 --- a/src/conf/domain_event.c +++ b/src/conf/domain_event.c @@ -1501,7 +1501,13 @@ virDomainEventStateRegisterID(virConnectPtr conn, int ret = -1;
virDomainEventStateLock(state); + VIR_WARN("RegisterID");
[1]
+ if ((state->callbacks->count == 0) && (state->timer == -1)) { + if (state->queue->count != 0) { + VIR_WARN("REG: queue's not empty: %d", state->queue->count);
I understand the WARN level here but not in [1]. Isn't DEBUG just enough there?
Oh this was just a debug patch, I'm not proposing it for inclusion, hence the WARN so that the logs I'm interested in jump to my face during the debugging ;) Christophe
participants (3)
-
Christophe Fergeau
-
Daniel P. Berrange
-
Michal Privoznik