
Hollis Blanchard wrote:
On Tue, 2009-08-18 at 16:24 -0700, Kaitlin Rupert wrote:
Hi Hollis,
I applied your patch and the patches Richard submitted to do some testing. I noticed a few issues commented below...
+/* Delete all watches marked for deletion. */ +static void event_watch_free_deleted(void) +{ + struct watch *cur; + struct watch **link; + + CU_DEBUG("%s", __func__); + + pthread_mutex_lock(&watch_list_mutex); + + cur = watch_list; + link = &watch_list; + while (cur != NULL) { + struct watch *next = cur->next; + + if (cur->deleted) { + *link = next; + + cur->ff(cur->opaque); I'm seeing a seg fault here because once we reach this point, cur is NULL.
+ free(cur); + watch_count--; + } else + link = &cur->next; + + cur = next; + } + + pthread_mutex_unlock(&watch_list_mutex); +} +
Am I missing something obvious? The loop only runs while cur is not NULL.
Agreed. I meant to say in my previous email that I am confused as well. Actually, I put a check for (cur != NULL) right about the call to cur->ff - and cur is definitely NULL at that point. I haven't spent enough time debugging it, here's what little I have so far: event.c(142): event_watch_free_deleted event.c(146): +++++++++watch_list addr is 0x7fffdc014b30 Program received signal SIGSEGV, Segmentation fault. [Switching to Thread 0x7ffff61c6710 (LWP 22188)] 0x0000000000000000 in ?? () Missing separate debuginfos, use: debuginfo-install tog-pegasus-2.9.0-3.fc12.x86_64 (gdb) bt #0 0x0000000000000000 in ?? () #1 0x00000000069811f1 in event_watch_free_deleted () at event.c:156 #2 event_thread () at event.c:375 #3 0x0000000003c5e91a in start_thread () from /lib64/libpthread.so.0 #4 0x0000000004936b2d in clone () from /lib64/libc.so.6 #5 0x0000000000000000 in ?? () (gdb) up #1 0x00000000069811f1 in event_watch_free_deleted () at event.c:156 156 cur->ff(cur->opaque); (gdb) print cur $1 = (struct watch *) 0x0 (gdb)
Moreover, the list is locked, so concurrent list modifications wouldn't even explain it... unless the other user didn't take the lock (which we do on purpose sometimes). From the CU_DEBUG log, can you tell if there's any concurrency going on?
I'll investigate this more tomorrow. Richard - or maybe you can?
+/* One thread to watch all fds for all events for all libvirt threads. */ +static void *event_thread(void *ptr) +{ + while (1) { + struct watch *cur; + struct pollfd *pollfds; + struct pollfd *pollfd; + int timeout; + int i; + + pollfds = malloc(sizeof(struct pollfd) * watch_count); + + /* fill in pollfds array from our watch list */ + for (pollfd = &pollfds[0], cur = watch_list; + cur != NULL; + pollfd++, cur = cur->next) { + pollfd->fd = cur->fd; + pollfd->events = libvirt_to_poll_events(cur->events); + } + + timeout = event_next_timeout(); + + poll(pollfds, watch_count, timeout); + + /* invoke callbacks */ + for (i = 0; i < watch_count; i++) + for (cur = watch_list; cur != NULL; cur = cur->next) + if (cur->fd == pollfds[i].fd When I generate a event, poll never seems to catch it. I tried forcing this by changing the timeout value to a minute. And then generating the event. I can see from the libvirt debug that the event has been generated:
15:00:34.232: debug : virEventRunOnce:567 : Poll got 1 event 15:00:34.239: debug : virEventDispatchHandles:450 : Dispatch n=2 f=8 w=3 e=1 0x7f2a57d6e6b0
These messages are from qemud? So the event is generated on the other side of the "remote" pipe?
In the eventAddHandle() call, we have:
event.c(75): eventAddHandle event.c(82): ++++++++++++++watch->id is 0 event.c(84): ++++++++++++++watch->fd is 11 event.c(86): ++++++++++++++watch->events is 1 event.c(88): ++++++++++++++watch->cb is 0x6b8f4c0, cb is 0x6b8f4c0 event.c(90): ++++++++++++++watch->opaque is 0x7fffdc014900
These seem reasonable, so we should have a list with one entry.
I haven't tracked down what is happening here.
+ && !cur->deleted) { + invoke_callback(cur, &pollfds[i]); + break; + } + + free(pollfds); + + event_watch_free_deleted(); + event_timer_free_deleted(); + } + + return NULL; +}
Hmm, actually I'd expect the opposite: this code seems to *always* invoke the callback, even when no event is pending (oops!).
Since that's not happening, it may be that the pollfds structure isn't being created properly, in particular the fd member.
I should have said in my previous email - the value of pollfds[i].fd is always 0. In looking at it more, I think the following is happening: 1) The CIMOM loads Richard's provider and init_events() is called. 2) This starts the event_thread. At this point, we don't have a handle, so the following for loop fails: for (pollfd = &pollfds[0], cur = watch_list; cur != NULL; pollfd++, cur = cur->next) {} 3) We call poll() 4) In the meantime, the provider's ActivateFilter() function has been called and the callback is registered 5) eventUpdateHandle() is called, but it doesn't update our pollfds list 5) We generate a domain event and libvirt dispatches it 5) Our fd in pollfds[i] is still 0, so we don't have a match. event.c(388): init_events [New Thread 0x7ffff61c6710 (LWP 22188)] event.c(320): ^^^^^^^^^^in event_thread event.c(330): ^^^^^^^^^^got pollfds event.c(335): ^^^^^^^^^^cur is NULL!! event.c(352): ^^^^^^^^^timeout is 30000 std_indication.c(231): Calling handler->activate_fn Virt_GuestCrashAlertIndication.c(334): ActivateFilter for KVM_GuestCrashAlertIndication misc_util.c(76): Connecting to libvirt with uri `qemu:///system' event.c(75): eventAddHandle event.c(82): ++++++++++++++watch->id is 0 event.c(84): ++++++++++++++watch->fd is 11 event.c(86): ++++++++++++++watch->events is 1 event.c(88): ++++++++++++++watch->cb is 0x6de44c0, cb is 0x6de44c0 event.c(90): ++++++++++++++watch->opaque is 0x7fffdc014900 event.c(174): eventAddTimeout Virt_GuestCrashAlertIndication.c(338): Registering callback function event.c(108): eventUpdateHandle 0 event.c(108): eventUpdateHandle 0 Virt_GuestCrashAlertIndication.c(343): ret val: 0 std_indication.c(270): IndicationVirt_GuestCrashAlertIndicationProvider: indications enabled Virt_GuestCrashAlertIndication.c(365): EnableIndications event.c(358): ^^^^^^^^^^loop through watch_count event.c(360): ^^^^^^^^^^loop through watch_list event.c(361): ==================cur->fd is 11, pollfds[i].fd is 0 event.c(362): ==================cur->deleted is 7
Another possibility is that watch_count is somehow 0, which would also hose pollfds allocation...
I guess strace output (for poll()), CU_DEBUG logs, and some gdb work would help.
-- Kaitlin Rupert IBM Linux Technology Center kaitlin@linux.vnet.ibm.com