Hollis Blanchard wrote:
On Tue, 2009-08-18 at 16:24 -0700, Kaitlin Rupert wrote:
> Hi Hollis,
>
> I applied your patch and the patches Richard submitted to do some
> testing. I noticed a few issues commented below...
>
>
>> +/* Delete all watches marked for deletion. */
>> +static void event_watch_free_deleted(void)
>> +{
>> + struct watch *cur;
>> + struct watch **link;
>> +
>> + CU_DEBUG("%s", __func__);
>> +
>> + pthread_mutex_lock(&watch_list_mutex);
>> +
>> + cur = watch_list;
>> + link = &watch_list;
>> + while (cur != NULL) {
>> + struct watch *next = cur->next;
>> +
>> + if (cur->deleted) {
>> + *link = next;
>> +
>> + cur->ff(cur->opaque);
> I'm seeing a seg fault here because once we reach this point, cur is NULL.
>
>> + free(cur);
>> + watch_count--;
>> + } else
>> + link = &cur->next;
>> +
>> + cur = next;
>> + }
>> +
>> + pthread_mutex_unlock(&watch_list_mutex);
>> +}
>> +
Am I missing something obvious? The loop only runs while cur is not
NULL.
Agreed. I meant to say in my previous email that I am confused as well.
Actually, I put a check for (cur != NULL) right above the call to
cur->ff - and cur is definitely NULL at that point.
I haven't spent enough time debugging it; here's what little I have so far:
event.c(142): event_watch_free_deleted
event.c(146): +++++++++watch_list addr is 0x7fffdc014b30
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff61c6710 (LWP 22188)]
0x0000000000000000 in ?? ()
Missing separate debuginfos, use: debuginfo-install
tog-pegasus-2.9.0-3.fc12.x86_64
(gdb) bt
#0 0x0000000000000000 in ?? ()
#1 0x00000000069811f1 in event_watch_free_deleted () at event.c:156
#2 event_thread () at event.c:375
#3 0x0000000003c5e91a in start_thread () from /lib64/libpthread.so.0
#4 0x0000000004936b2d in clone () from /lib64/libc.so.6
#5 0x0000000000000000 in ?? ()
(gdb) up
#1 0x00000000069811f1 in event_watch_free_deleted () at event.c:156
156 cur->ff(cur->opaque);
(gdb) print cur
$1 = (struct watch *) 0x0
(gdb)
Moreover, the list is locked, so concurrent list modification wouldn't
explain it either... unless the other user didn't take the lock (which we
skip on purpose in some places). From the CU_DEBUG log, can you tell if
there's any concurrency going on?
I'll investigate this more tomorrow. Richard - or maybe you can?
>> +/* One thread to watch all fds for all events for all libvirt threads. */
>> +static void *event_thread(void *ptr)
>> +{
>> + while (1) {
>> + struct watch *cur;
>> + struct pollfd *pollfds;
>> + struct pollfd *pollfd;
>> + int timeout;
>> + int i;
>> +
>> + pollfds = malloc(sizeof(struct pollfd) * watch_count);
>> +
>> + /* fill in pollfds array from our watch list */
>> + for (pollfd = &pollfds[0], cur = watch_list;
>> + cur != NULL;
>> + pollfd++, cur = cur->next) {
>> + pollfd->fd = cur->fd;
>> + pollfd->events = libvirt_to_poll_events(cur->events);
>> + }
>> +
>> + timeout = event_next_timeout();
>> +
>> + poll(pollfds, watch_count, timeout);
>> +
>> + /* invoke callbacks */
>> + for (i = 0; i < watch_count; i++)
>> + for (cur = watch_list; cur != NULL; cur = cur->next)
>> + if (cur->fd == pollfds[i].fd
> When I generate an event, poll never seems to catch it. I tried forcing
> this by changing the timeout value to a minute and then generating the
> event. I can see from the libvirt debug output that the event has been generated:
>
> 15:00:34.232: debug : virEventRunOnce:567 : Poll got 1 event
> 15:00:34.239: debug : virEventDispatchHandles:450 : Dispatch n=2 f=8 w=3
> e=1 0x7f2a57d6e6b0
These messages are from qemud? So the event is generated on the other
side of the "remote" pipe?
> In the eventAddHandle() call, we have:
>
> event.c(75): eventAddHandle
> event.c(82): ++++++++++++++watch->id is 0
> event.c(84): ++++++++++++++watch->fd is 11
> event.c(86): ++++++++++++++watch->events is 1
> event.c(88): ++++++++++++++watch->cb is 0x6b8f4c0, cb is 0x6b8f4c0
> event.c(90): ++++++++++++++watch->opaque is 0x7fffdc014900
These seem reasonable, so we should have a list with one entry.
> I haven't tracked down what is happening here.
>
>
>> + && !cur->deleted) {
>> + invoke_callback(cur, &pollfds[i]);
>> + break;
>> + }
>> +
>> + free(pollfds);
>> +
>> + event_watch_free_deleted();
>> + event_timer_free_deleted();
>> + }
>> +
>> + return NULL;
>> +}
Hmm, actually I'd expect the opposite: this code seems to *always*
invoke the callback, even when no event is pending (oops!).
Since that's not happening, it may be that the pollfds structure isn't
being created properly, in particular the fd member.
I should have said in my previous email - the value of pollfds[i].fd is
always 0. In looking at it more, I think the following is happening:
1) The CIMOM loads Richard's provider and init_events() is called.
2) This starts the event_thread. At this point, we don't have a handle
registered, so the body of the following for loop never runs:
for (pollfd = &pollfds[0], cur = watch_list;
cur != NULL;
pollfd++, cur = cur->next) {}
3) We call poll()
4) In the meantime, the provider's ActivateFilter() function has been
called and the callback is registered
5) eventUpdateHandle() is called, but it doesn't update our pollfds list
6) We generate a domain event and libvirt dispatches it
7) Our fd in pollfds[i] is still 0, so we don't have a match.
event.c(388): init_events
[New Thread 0x7ffff61c6710 (LWP 22188)]
event.c(320): ^^^^^^^^^^in event_thread
event.c(330): ^^^^^^^^^^got pollfds
event.c(335): ^^^^^^^^^^cur is NULL!!
event.c(352): ^^^^^^^^^timeout is 30000
std_indication.c(231): Calling handler->activate_fn
Virt_GuestCrashAlertIndication.c(334): ActivateFilter for
KVM_GuestCrashAlertIndication
misc_util.c(76): Connecting to libvirt with uri `qemu:///system'
event.c(75): eventAddHandle
event.c(82): ++++++++++++++watch->id is 0
event.c(84): ++++++++++++++watch->fd is 11
event.c(86): ++++++++++++++watch->events is 1
event.c(88): ++++++++++++++watch->cb is 0x6de44c0, cb is 0x6de44c0
event.c(90): ++++++++++++++watch->opaque is 0x7fffdc014900
event.c(174): eventAddTimeout
Virt_GuestCrashAlertIndication.c(338): Registering callback function
event.c(108): eventUpdateHandle 0
event.c(108): eventUpdateHandle 0
Virt_GuestCrashAlertIndication.c(343): ret val: 0
std_indication.c(270): IndicationVirt_GuestCrashAlertIndicationProvider:
indications enabled
Virt_GuestCrashAlertIndication.c(365): EnableIndications
event.c(358): ^^^^^^^^^^loop through watch_count
event.c(360): ^^^^^^^^^^loop through watch_list
event.c(361): ==================cur->fd is 11, pollfds[i].fd is 0
event.c(362): ==================cur->deleted is 7
Another possibility is that watch_count is somehow 0, which would also
hose pollfds allocation...
I guess strace output (for poll()), CU_DEBUG logs, and some gdb work
would help.
--
Kaitlin Rupert
IBM Linux Technology Center
kaitlin(a)linux.vnet.ibm.com