[libvirt] [PATCH] fix Xen event handling

As was reported on IRC and found by Dan Berrange, sometimes Xen event handling could start to go wild and block processing of requests in the daemon. The fault at least on libvirt side is that we didn't filter out non read/write event requests when asking for watches in the xenstore. The provided patch seems to work for the person who reported the original problem, Daniel -- Daniel Veillard | libxml Gnome XML XSLT toolkit http://xmlsoft.org/ daniel@veillard.com | Rpmfind RPM search engine http://rpmfind.net/ http://veillard.com/ | virtualization library http://libvirt.org/

On Wed, Mar 11, 2009 at 02:27:39PM +0100, Daniel Veillard wrote:
As was reported on IRC and found by Dan Berrange, sometimes Xen event handling could start to go wild and block processing of requests in the daemon. The fault at least on libvirt side is that we didn't filter out non read/write event requests when asking for watches in the xenstore. The provided patch seems to work for the person who reported the original problem,
ACK, looks good & reporter confirmed that it works.
Index: src/xs_internal.c =================================================================== RCS file: /data/cvs/libxen/src/xs_internal.c,v retrieving revision 1.88 diff -u -u -r1.88 xs_internal.c --- src/xs_internal.c 5 Feb 2009 18:14:00 -0000 1.88 +++ src/xs_internal.c 11 Mar 2009 13:23:17 -0000 @@ -1215,7 +1215,7 @@ static void xenStoreWatchEvent(int watch ATTRIBUTE_UNUSED, int fd ATTRIBUTE_UNUSED, - int events ATTRIBUTE_UNUSED, + int events, void *data) { char **event; @@ -1226,8 +1226,12 @@
virConnectPtr conn = data; xenUnifiedPrivatePtr priv = (xenUnifiedPrivatePtr) conn->privateData; + if(!priv) return;
+ /* only set a watch on read and write events */ + if (events & (VIR_EVENT_HANDLE_ERROR | VIR_EVENT_HANDLE_HANGUP)) return; + xenUnifiedLock(priv);
if(!priv->xshandle)
-- Libvir-list mailing list Libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
-- |: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :| |: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

On Wed, Mar 11, 2009 at 01:31:38PM +0000, Daniel P. Berrange wrote:
On Wed, Mar 11, 2009 at 02:27:39PM +0100, Daniel Veillard wrote:
As was reported on IRC and found by Dan Berrange, sometimes Xen event handling could start to go wild and block processing of requests in the daemon. The fault at least on libvirt side is that we didn't filter out non read/write event requests when asking for watches in the xenstore. The provided patch seems to work for the person who reported the original problem,
ACK, looks good & reporter confirmed that it works.
Okay, commited ! thanks, Daniel -- Daniel Veillard | libxml Gnome XML XSLT toolkit http://xmlsoft.org/ daniel@veillard.com | Rpmfind RPM search engine http://rpmfind.net/ http://veillard.com/ | virtualization library http://libvirt.org/

Daniel Veillard:
As was reported on IRC and found by Dan Berrange, sometimes Xen event handling could start to go wild and block processing of requests in the daemon.
The fault at least on libvirt side is that we didn't filter out non read/write event requests when asking for watches in the xenstore. The provided patch seems to work for the person who reported the original problem,
It seems to work, although it may not have solved my problem: http://dpaste.com/13740/ There are five backed-up pause requests on this server, and it seems that four of them may be stuck in the xenstore code (apologies for the lack of debugging symbols on there. I really ought to get a debug package built, or haul over an unstripped version of libxenstore.so 3.2 from somewhere). -- Information gladly given, but safety requires Nick Moffitt avoiding unnecessary conversation. nick@zork.net

On Thu, Mar 12, 2009 at 10:04:50PM +0000, Nick Moffitt wrote:
Daniel Veillard:
As was reported on IRC and found by Dan Berrange, sometimes Xen event handling could start to go wild and block processing of requests in the daemon.
The fault at least on libvirt side is that we didn't filter out non read/write event requests when asking for watches in the xenstore. The provided patch seems to work for the person who reported the original problem,
It seems to work, although it may not have solved my problem:
There are five backed-up pause requests on this server, and it seems that four of them may be stuck in the xenstore code (apologies for the lack of debugging symbols on there. I really ought to get a debug package built, or haul over an unstripped version of libxenstore.so 3.2 from somewhere).
Ok, turns out the original patch (though worthwhile) was a red herring. The core problem was that the event loop was getting confused when we removed a monitored file handle. It would then start running the wrong event handler callbacks. So the xenstore watch handler got called even though no watch was pending :-( This also caused clients to get stuck because the watch detecting end-of-file on the socket wasn't getting called. The patch fixes the loop which dispatches callbacks, so that it does not assume the index into 'nfds' matches the index in 'handles'. They have to be tracked indepedantly, to take account of deleted handles. Daniel Index: qemud/event.c =================================================================== RCS file: /data/cvs/libvirt/qemud/event.c,v retrieving revision 1.19 diff -u -p -r1.19 event.c --- qemud/event.c 17 Feb 2009 09:44:18 -0000 1.19 +++ qemud/event.c 13 Mar 2009 15:31:33 -0000 @@ -409,25 +409,26 @@ static int virEventDispatchTimeouts(void * Returns 0 upon success, -1 if an error occurred */ static int virEventDispatchHandles(int nfds, struct pollfd *fds) { - int i; + int i, n; - for (i = 0 ; i < nfds ; i++) { + for (i = 0, n = 0 ; i < eventLoop.handlesCount && n < nfds ; i++) { if (eventLoop.handles[i].deleted) { EVENT_DEBUG("Skip deleted %d", eventLoop.handles[i].fd); continue; } - if (fds[i].revents) { + if (fds[n].revents) { virEventHandleCallback cb = eventLoop.handles[i].cb; void *opaque = eventLoop.handles[i].opaque; - int hEvents = virPollEventToEventHandleType(fds[i].revents); - EVENT_DEBUG("Dispatch %d %d %p", fds[i].fd, - fds[i].revents, eventLoop.handles[i].opaque); + int hEvents = virPollEventToEventHandleType(fds[n].revents); + EVENT_DEBUG("Dispatch %d %d %p", fds[n].fd, + fds[n].revents, eventLoop.handles[i].opaque); virEventUnlock(); (cb)(eventLoop.handles[i].watch, - fds[i].fd, hEvents, opaque); + fds[n].fd, hEvents, opaque); virEventLock(); } + n++; } return 0; -- |: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :| |: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

On Fri, Mar 13, 2009 at 03:34:32PM +0000, Daniel P. Berrange wrote:
Ok, turns out the original patch (though worthwhile) was a red herring. The core problem was that the event loop was getting confused when we removed a monitored file handle. It would then start running the wrong event handler callbacks. So the xenstore watch handler got called even though no watch was pending :-( This also caused clients to get stuck because the watch detecting end-of-file on the socket wasn't getting called.
The patch fixes the loop which dispatches callbacks, so that it does not assume the index into 'nfds' matches the index in 'handles'. They have to be tracked indepedantly, to take account of deleted handles.
ACK, this fixes a reproduceable libvirt-cim test problem I was having and we were hitting the same problem, Daniel -- Daniel Veillard | libxml Gnome XML XSLT toolkit http://xmlsoft.org/ daniel@veillard.com | Rpmfind RPM search engine http://rpmfind.net/ http://veillard.com/ | virtualization library http://libvirt.org/

Daniel P. Berrange:
On Thu, Mar 12, 2009 at 10:04:50PM +0000, Nick Moffitt wrote:
http://dpaste.com/13740/ [snip] The patch fixes the loop which dispatches callbacks, so that it does not assume the index into 'nfds' matches the index in 'handles'. They have to be tracked indepedantly, to take account of deleted handles.
Fantastic! I just got to test this patch today, and dozens of mass-suspend and mass-delete operations have gone without a hitch. Many thanks! -- BitKeeper, how quaint. Nick Moffitt -- Alan Cox nick@zork.net
participants (3)
-
Daniel P. Berrange
-
Daniel Veillard
-
Nick Moffitt