Eric Blake <eblake@redhat.com> wrote on 03/22/2012 06:54:31 PM:


>
> On 03/22/2012 04:49 PM, David Stevens wrote:
> > Stefan Berger/Watson/IBM wrote on 03/22/2012 03:04:53 PM:
> >
> >>
> >> I have some concerns about the cancelation of the thread. It can
> >> hold the snoop lock and get cancelled while holding it. Next time
> >> that lock is grabbed we will get a deadlock...
> >>
> >
> > The snoop lock is acquired in virNWFilterDHCPSnoopEnd(), which
> > is where the pthread_cancel() directly (for valid leases) or the
> > freeReq()/pthread_cancel() is done. So, the canceler has the lock
> > and the canceling thread can't have it, then.
> >
> > The only other case I see is when the config goes invalid and
> > we exit the snooper loop; he frees the snoop_lock() before removing
> > his own hash entry,  which will cancel the same thread doing
> > the remove. Again, it can't have the snoop lock when canceled.
> >
> > Is there some other case you think I'm missing?
>
> pthread_cancel() tends to imply that you are properly managing signal
> blocking across threads; we haven't used it anywhere else in libvirt,
> and I'm extremely wary of pulling it in now, as there's probably a lot
> of subtle bugs that it would expose.  Are you sure you can't do this in
> some other manner without dragging in pthread_cancel()?


From the patch:

+    if (pthread_setcanceltype(PTHREAD_CANCEL_ASYNCHRONOUS, &tmp) < 0)
+        virNWFilterReportError(VIR_ERR_INTERNAL_ERROR, "pthread_setcanceltype "
+                               "failed; ifname \"%s\"", req->ifname ?
+                               req->ifname : "<NULL>");


from the man page of above function:

      PTHREAD_CANCEL_ASYNCHRONOUS
              The  thread can be canceled at any time.  (Typically, it will be
              canceled immediately upon receiving a cancellation request,  but
              the system doesn't guarantee this.)

It seems implementation dependent on when the thread actually die...
So I can construct a scenario where the canceller holds the lock and the thread doesn't, the thread gets cancelled but not immediately and can still grab the lock and then dies.  This is what I concluded to have seen from below stack trace when the snooping thread was running and in rapid succession I was shooting two SIGHUPs at libvirt:

Thread 11 (Thread 0x7f3c29ce8700 (LWP 12800)):
#0  0x00000038ec60b3b4 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00007f3c2fe2229a in virCondWait (c=<value optimized out>,
    m=<value optimized out>) at util/threads-pthread.c:117
#2  0x00007f3c2fe2295b in virThreadPoolWorker (opaque=<value optimized out>)
    at util/threadpool.c:103
#3  0x00007f3c2fe21f36 in virThreadHelper (data=<value optimized out>)
    at util/threads-pthread.c:161
#4  0x00000038ec606ccb in start_thread () from /lib64/libpthread.so.0
#5  0x00000038ec2e0c2d in clone () from /lib64/libc.so.6

Thread 10 (Thread 0x7f3c294e7700 (LWP 12801)):
#0  0x00000038ec60b3b4 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00007f3c2fe2229a in virCondWait (c=<value optimized out>,
    m=<value optimized out>) at util/threads-pthread.c:117
#2  0x00007f3c2fe2295b in virThreadPoolWorker (opaque=<value optimized out>)
    at util/threadpool.c:103
#3  0x00007f3c2fe21f36 in virThreadHelper (data=<value optimized out>)
    at util/threads-pthread.c:161
#4  0x00000038ec606ccb in start_thread () from /lib64/libpthread.so.0
---Type <return> to continue, or q <return> to quit---
#5  0x00000038ec2e0c2d in clone () from /lib64/libc.so.6

Thread 9 (Thread 0x7f3c28ce6700 (LWP 12802)):
#0  0x00000038ec60b3b4 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00007f3c2fe2229a in virCondWait (c=<value optimized out>,
    m=<value optimized out>) at util/threads-pthread.c:117
#2  0x00007f3c2fe2295b in virThreadPoolWorker (opaque=<value optimized out>)
    at util/threadpool.c:103
#3  0x00007f3c2fe21f36 in virThreadHelper (data=<value optimized out>)
    at util/threads-pthread.c:161
#4  0x00000038ec606ccb in start_thread () from /lib64/libpthread.so.0
#5  0x00000038ec2e0c2d in clone () from /lib64/libc.so.6

Thread 8 (Thread 0x7f3c284e5700 (LWP 12803)):
#0  0x00000038ec60b3b4 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00007f3c2fe2229a in virCondWait (c=<value optimized out>,
    m=<value optimized out>) at util/threads-pthread.c:117
#2  0x00007f3c2fe2295b in virThreadPoolWorker (opaque=<value optimized out>)
    at util/threadpool.c:103
#3  0x00007f3c2fe21f36 in virThreadHelper (data=<value optimized out>)
    at util/threads-pthread.c:161
---Type <return> to continue, or q <return> to quit---
#4  0x00000038ec606ccb in start_thread () from /lib64/libpthread.so.0
#5  0x00000038ec2e0c2d in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7f3c27ce4700 (LWP 12804)):
#0  0x00000038ec60b3b4 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00007f3c2fe2229a in virCondWait (c=<value optimized out>,
    m=<value optimized out>) at util/threads-pthread.c:117
#2  0x00007f3c2fe2295b in virThreadPoolWorker (opaque=<value optimized out>)
    at util/threadpool.c:103
#3  0x00007f3c2fe21f36 in virThreadHelper (data=<value optimized out>)
    at util/threads-pthread.c:161
#4  0x00000038ec606ccb in start_thread () from /lib64/libpthread.so.0
#5  0x00000038ec2e0c2d in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7f3c274e3700 (LWP 12805)):
#0  0x00000038ec60b3b4 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00007f3c2fe2229a in virCondWait (c=<value optimized out>,
    m=<value optimized out>) at util/threads-pthread.c:117
#2  0x00007f3c2fe2295b in virThreadPoolWorker (opaque=<value optimized out>)
    at util/threadpool.c:103
#3  0x00007f3c2fe21f36 in virThreadHelper (data=<value optimized out>)
---Type <return> to continue, or q <return> to quit---
    at util/threads-pthread.c:161
#4  0x00000038ec606ccb in start_thread () from /lib64/libpthread.so.0
#5  0x00000038ec2e0c2d in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7f3c26ce2700 (LWP 12806)):
#0  0x00000038ec60b3b4 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00007f3c2fe2229a in virCondWait (c=<value optimized out>,
    m=<value optimized out>) at util/threads-pthread.c:117
#2  0x00007f3c2fe2295b in virThreadPoolWorker (opaque=<value optimized out>)
    at util/threadpool.c:103
#3  0x00007f3c2fe21f36 in virThreadHelper (data=<value optimized out>)
    at util/threads-pthread.c:161
#4  0x00000038ec606ccb in start_thread () from /lib64/libpthread.so.0
#5  0x00000038ec2e0c2d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f3c264e1700 (LWP 12807)):
#0  0x00000038ec60b3b4 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00007f3c2fe2229a in virCondWait (c=<value optimized out>,
    m=<value optimized out>) at util/threads-pthread.c:117
#2  0x00007f3c2fe2295b in virThreadPoolWorker (opaque=<value optimized out>)
    at util/threadpool.c:103
---Type <return> to continue, or q <return> to quit---
#3  0x00007f3c2fe21f36 in virThreadHelper (data=<value optimized out>)
    at util/threads-pthread.c:161
#4  0x00000038ec606ccb in start_thread () from /lib64/libpthread.so.0
#5  0x00000038ec2e0c2d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f3c25ce0700 (LWP 12808)):
#0  0x00000038ec60b3b4 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00007f3c2fe2229a in virCondWait (c=<value optimized out>,
    m=<value optimized out>) at util/threads-pthread.c:117
#2  0x00007f3c2fe2295b in virThreadPoolWorker (opaque=<value optimized out>)
    at util/threadpool.c:103
#3  0x00007f3c2fe21f36 in virThreadHelper (data=<value optimized out>)
    at util/threads-pthread.c:161
#4  0x00000038ec606ccb in start_thread () from /lib64/libpthread.so.0
#5  0x00000038ec2e0c2d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f3c254df700 (LWP 12809)):
#0  0x00000038ec60b3b4 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00007f3c2fe2229a in virCondWait (c=<value optimized out>,
    m=<value optimized out>) at util/threads-pthread.c:117
#2  0x00007f3c2fe2295b in virThreadPoolWorker (opaque=<value optimized out>)
---Type <return> to continue, or q <return> to quit---
    at util/threadpool.c:103
#3  0x00007f3c2fe21f36 in virThreadHelper (data=<value optimized out>)
    at util/threads-pthread.c:161
#4  0x00000038ec606ccb in start_thread () from /lib64/libpthread.so.0
#5  0x00000038ec2e0c2d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f3c2fda2820 (LWP 12799)):
#0  0x00000038ec60dc9c in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00000038ec609125 in _L_lock_880 () from /lib64/libpthread.so.0
#2  0x00000038ec609008 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f3c2fe73818 in virNWFilterObjFindByUUID (nwfilters=0x7f3c1c015008,
    uuid=<value optimized out>) at conf/nwfilter_conf.c:2662
#4  0x00007f3c2fe738ad in virNWFilterObjAssignDef (conn=0x2055190,
    nwfilters=0x7f3c1c015008, def=0x201ad60) at conf/nwfilter_conf.c:2947
#5  0x00007f3c2fe73d6f in virNWFilterObjLoad (conn=0x2055190,
    nwfilters=0x7f3c1c015008, configDir=0x7f3c1c089460 "/etc/libvirt/nwfilter")
    at conf/nwfilter_conf.c:3045
#6  virNWFilterLoadAllConfigs (conn=0x2055190, nwfilters=0x7f3c1c015008,
    configDir=0x7f3c1c089460 "/etc/libvirt/nwfilter")
    at conf/nwfilter_conf.c:3092
#7  0x00000000004e9084 in nwfilterDriverReload ()
    at nwfilter/nwfilter_driver.c:163
#8  0x00007f3c2fea2c8f in virStateReload () at libvirt.c:891
---Type <return> to continue, or q <return> to quit---
#9  0x0000000000420572 in daemonReloadHandler (srv=<value optimized out>,
    sig=<value optimized out>, opaque=<value optimized out>) at libvirtd.c:1153
#10 0x00007f3c2fefd4ba in virNetServerSignalEvent (
    watch=<value optimized out>, fd=<value optimized out>,
    events=<value optimized out>, opaque=0x200fe50) at rpc/virnetserver.c:500
#11 0x00007f3c2fe0f651 in virEventPollDispatchHandles ()
    at util/event_poll.c:490
#12 virEventPollRunOnce () at util/event_poll.c:637
#13 0x00007f3c2fe0dd15 in virEventRunDefaultImpl () at util/event.c:247
#14 0x00007f3c2fefdffd in virNetServerRun (srv=0x200fe50)
    at rpc/virnetserver.c:736
#15 0x000000000042235d in main (argc=<value optimized out>,
    argv=<value optimized out>) at libvirtd.c:1609

I bet thread 1 was trying to grab a lock that wasn't properly unlocked due to the async cancel. All the other threads look pretty much the same and I doubt either one holds the lock.

Maybe we should go with the previous code from a while ago which was setting a flag for the thread to die. It caused other work-arounds to become necessary but at least we don't have to deal with possibly async. deaths of threads holding locks.

    Stefan