Eric Blake <eblake@redhat.com> wrote on 03/22/2012
06:54:31 PM:
>
> On 03/22/2012 04:49 PM, David Stevens wrote:
> > Stefan Berger/Watson/IBM wrote on 03/22/2012 03:04:53 PM:
> >
> >>
> >> I have some concerns about the cancelation of the thread.
It can
> >> hold the snoop lock and get cancelled while holding it. Next
time
> >> that lock is grabbed we will get a deadlock...
> >>
> >
> > The snoop lock is acquired in virNWFilterDHCPSnoopEnd(), which
> > is where the pthread_cancel() directly (for valid leases) or
the
> > freeReq()/pthread_cancel() is done. So, the canceler has the
lock
> > and the canceling thread can't have it, then.
> >
> > The only other case I see is when the config goes invalid and
> > we exit the snooper loop; he frees the snoop_lock() before removing
> > his own hash entry, which will cancel the same thread doing
> > the remove. Again, it can't have the snoop lock when canceled.
> >
> > Is there some other case you think I'm missing?
>
> pthread_cancel() tends to imply that you are properly managing signal
> blocking across threads; we haven't used it anywhere else in libvirt,
> and I'm extremely wary of pulling it in now, as there's probably a
lot
> of subtle bugs that it would expose. Are you sure you can't
do this in
> some other manner without dragging in pthread_cancel()?
From the patch:
+ if (pthread_setcanceltype(PTHREAD_CANCEL_ASYNCHRONOUS,
&tmp) < 0)
+ virNWFilterReportError(VIR_ERR_INTERNAL_ERROR,
"pthread_setcanceltype "
+
"failed; ifname \"%s\"",
req->ifname ?
+
req->ifname : "<NULL>");
from the man page of above function:
PTHREAD_CANCEL_ASYNCHRONOUS
The
thread can be canceled at any time. (Typically, it will be
canceled
immediately upon receiving a cancellation request, but
the
system doesn't guarantee this.)
It seems implementation dependent on when the thread
actually die...
So I can construct a scenario where the canceller
holds the lock and the thread doesn't, the thread gets cancelled but not
immediately and can still grab the lock and then dies. This is what
I concluded to have seen from below stack trace when the snooping thread
was running and in rapid succession I was shooting two SIGHUPs at libvirt:
Thread 11 (Thread 0x7f3c29ce8700 (LWP 12800)):
#0 0x00000038ec60b3b4 in pthread_cond_wait@@GLIBC_2.3.2
()
from /lib64/libpthread.so.0
#1 0x00007f3c2fe2229a in virCondWait (c=<value
optimized out>,
m=<value optimized out>) at util/threads-pthread.c:117
#2 0x00007f3c2fe2295b in virThreadPoolWorker
(opaque=<value optimized out>)
at util/threadpool.c:103
#3 0x00007f3c2fe21f36 in virThreadHelper (data=<value
optimized out>)
at util/threads-pthread.c:161
#4 0x00000038ec606ccb in start_thread () from
/lib64/libpthread.so.0
#5 0x00000038ec2e0c2d in clone () from /lib64/libc.so.6
Thread 10 (Thread 0x7f3c294e7700 (LWP 12801)):
#0 0x00000038ec60b3b4 in pthread_cond_wait@@GLIBC_2.3.2
()
from /lib64/libpthread.so.0
#1 0x00007f3c2fe2229a in virCondWait (c=<value
optimized out>,
m=<value optimized out>) at util/threads-pthread.c:117
#2 0x00007f3c2fe2295b in virThreadPoolWorker
(opaque=<value optimized out>)
at util/threadpool.c:103
#3 0x00007f3c2fe21f36 in virThreadHelper (data=<value
optimized out>)
at util/threads-pthread.c:161
#4 0x00000038ec606ccb in start_thread () from
/lib64/libpthread.so.0
---Type <return> to continue, or q <return>
to quit---
#5 0x00000038ec2e0c2d in clone () from /lib64/libc.so.6
Thread 9 (Thread 0x7f3c28ce6700 (LWP 12802)):
#0 0x00000038ec60b3b4 in pthread_cond_wait@@GLIBC_2.3.2
()
from /lib64/libpthread.so.0
#1 0x00007f3c2fe2229a in virCondWait (c=<value
optimized out>,
m=<value optimized out>) at util/threads-pthread.c:117
#2 0x00007f3c2fe2295b in virThreadPoolWorker
(opaque=<value optimized out>)
at util/threadpool.c:103
#3 0x00007f3c2fe21f36 in virThreadHelper (data=<value
optimized out>)
at util/threads-pthread.c:161
#4 0x00000038ec606ccb in start_thread () from
/lib64/libpthread.so.0
#5 0x00000038ec2e0c2d in clone () from /lib64/libc.so.6
Thread 8 (Thread 0x7f3c284e5700 (LWP 12803)):
#0 0x00000038ec60b3b4 in pthread_cond_wait@@GLIBC_2.3.2
()
from /lib64/libpthread.so.0
#1 0x00007f3c2fe2229a in virCondWait (c=<value
optimized out>,
m=<value optimized out>) at util/threads-pthread.c:117
#2 0x00007f3c2fe2295b in virThreadPoolWorker
(opaque=<value optimized out>)
at util/threadpool.c:103
#3 0x00007f3c2fe21f36 in virThreadHelper (data=<value
optimized out>)
at util/threads-pthread.c:161
---Type <return> to continue, or q <return>
to quit---
#4 0x00000038ec606ccb in start_thread () from
/lib64/libpthread.so.0
#5 0x00000038ec2e0c2d in clone () from /lib64/libc.so.6
Thread 7 (Thread 0x7f3c27ce4700 (LWP 12804)):
#0 0x00000038ec60b3b4 in pthread_cond_wait@@GLIBC_2.3.2
()
from /lib64/libpthread.so.0
#1 0x00007f3c2fe2229a in virCondWait (c=<value
optimized out>,
m=<value optimized out>) at util/threads-pthread.c:117
#2 0x00007f3c2fe2295b in virThreadPoolWorker
(opaque=<value optimized out>)
at util/threadpool.c:103
#3 0x00007f3c2fe21f36 in virThreadHelper (data=<value
optimized out>)
at util/threads-pthread.c:161
#4 0x00000038ec606ccb in start_thread () from
/lib64/libpthread.so.0
#5 0x00000038ec2e0c2d in clone () from /lib64/libc.so.6
Thread 6 (Thread 0x7f3c274e3700 (LWP 12805)):
#0 0x00000038ec60b3b4 in pthread_cond_wait@@GLIBC_2.3.2
()
from /lib64/libpthread.so.0
#1 0x00007f3c2fe2229a in virCondWait (c=<value
optimized out>,
m=<value optimized out>) at util/threads-pthread.c:117
#2 0x00007f3c2fe2295b in virThreadPoolWorker
(opaque=<value optimized out>)
at util/threadpool.c:103
#3 0x00007f3c2fe21f36 in virThreadHelper (data=<value
optimized out>)
---Type <return> to continue, or q <return>
to quit---
at util/threads-pthread.c:161
#4 0x00000038ec606ccb in start_thread () from
/lib64/libpthread.so.0
#5 0x00000038ec2e0c2d in clone () from /lib64/libc.so.6
Thread 5 (Thread 0x7f3c26ce2700 (LWP 12806)):
#0 0x00000038ec60b3b4 in pthread_cond_wait@@GLIBC_2.3.2
()
from /lib64/libpthread.so.0
#1 0x00007f3c2fe2229a in virCondWait (c=<value
optimized out>,
m=<value optimized out>) at util/threads-pthread.c:117
#2 0x00007f3c2fe2295b in virThreadPoolWorker
(opaque=<value optimized out>)
at util/threadpool.c:103
#3 0x00007f3c2fe21f36 in virThreadHelper (data=<value
optimized out>)
at util/threads-pthread.c:161
#4 0x00000038ec606ccb in start_thread () from
/lib64/libpthread.so.0
#5 0x00000038ec2e0c2d in clone () from /lib64/libc.so.6
Thread 4 (Thread 0x7f3c264e1700 (LWP 12807)):
#0 0x00000038ec60b3b4 in pthread_cond_wait@@GLIBC_2.3.2
()
from /lib64/libpthread.so.0
#1 0x00007f3c2fe2229a in virCondWait (c=<value
optimized out>,
m=<value optimized out>) at util/threads-pthread.c:117
#2 0x00007f3c2fe2295b in virThreadPoolWorker
(opaque=<value optimized out>)
at util/threadpool.c:103
---Type <return> to continue, or q <return>
to quit---
#3 0x00007f3c2fe21f36 in virThreadHelper (data=<value
optimized out>)
at util/threads-pthread.c:161
#4 0x00000038ec606ccb in start_thread () from
/lib64/libpthread.so.0
#5 0x00000038ec2e0c2d in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x7f3c25ce0700 (LWP 12808)):
#0 0x00000038ec60b3b4 in pthread_cond_wait@@GLIBC_2.3.2
()
from /lib64/libpthread.so.0
#1 0x00007f3c2fe2229a in virCondWait (c=<value
optimized out>,
m=<value optimized out>) at util/threads-pthread.c:117
#2 0x00007f3c2fe2295b in virThreadPoolWorker
(opaque=<value optimized out>)
at util/threadpool.c:103
#3 0x00007f3c2fe21f36 in virThreadHelper (data=<value
optimized out>)
at util/threads-pthread.c:161
#4 0x00000038ec606ccb in start_thread () from
/lib64/libpthread.so.0
#5 0x00000038ec2e0c2d in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7f3c254df700 (LWP 12809)):
#0 0x00000038ec60b3b4 in pthread_cond_wait@@GLIBC_2.3.2
()
from /lib64/libpthread.so.0
#1 0x00007f3c2fe2229a in virCondWait (c=<value
optimized out>,
m=<value optimized out>) at util/threads-pthread.c:117
#2 0x00007f3c2fe2295b in virThreadPoolWorker
(opaque=<value optimized out>)
---Type <return> to continue, or q <return>
to quit---
at util/threadpool.c:103
#3 0x00007f3c2fe21f36 in virThreadHelper (data=<value
optimized out>)
at util/threads-pthread.c:161
#4 0x00000038ec606ccb in start_thread () from
/lib64/libpthread.so.0
#5 0x00000038ec2e0c2d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f3c2fda2820 (LWP 12799)):
#0 0x00000038ec60dc9c in __lll_lock_wait ()
from /lib64/libpthread.so.0
#1 0x00000038ec609125 in _L_lock_880 () from
/lib64/libpthread.so.0
#2 0x00000038ec609008 in pthread_mutex_lock
() from /lib64/libpthread.so.0
#3 0x00007f3c2fe73818 in virNWFilterObjFindByUUID
(nwfilters=0x7f3c1c015008,
uuid=<value optimized out>) at
conf/nwfilter_conf.c:2662
#4 0x00007f3c2fe738ad in virNWFilterObjAssignDef
(conn=0x2055190,
nwfilters=0x7f3c1c015008, def=0x201ad60)
at conf/nwfilter_conf.c:2947
#5 0x00007f3c2fe73d6f in virNWFilterObjLoad
(conn=0x2055190,
nwfilters=0x7f3c1c015008, configDir=0x7f3c1c089460
"/etc/libvirt/nwfilter")
at conf/nwfilter_conf.c:3045
#6 virNWFilterLoadAllConfigs (conn=0x2055190,
nwfilters=0x7f3c1c015008,
configDir=0x7f3c1c089460 "/etc/libvirt/nwfilter")
at conf/nwfilter_conf.c:3092
#7 0x00000000004e9084 in nwfilterDriverReload
()
at nwfilter/nwfilter_driver.c:163
#8 0x00007f3c2fea2c8f in virStateReload () at
libvirt.c:891
---Type <return> to continue, or q <return>
to quit---
#9 0x0000000000420572 in daemonReloadHandler
(srv=<value optimized out>,
sig=<value optimized out>, opaque=<value
optimized out>) at libvirtd.c:1153
#10 0x00007f3c2fefd4ba in virNetServerSignalEvent
(
watch=<value optimized out>, fd=<value
optimized out>,
events=<value optimized out>,
opaque=0x200fe50) at rpc/virnetserver.c:500
#11 0x00007f3c2fe0f651 in virEventPollDispatchHandles
()
at util/event_poll.c:490
#12 virEventPollRunOnce () at util/event_poll.c:637
#13 0x00007f3c2fe0dd15 in virEventRunDefaultImpl ()
at util/event.c:247
#14 0x00007f3c2fefdffd in virNetServerRun (srv=0x200fe50)
at rpc/virnetserver.c:736
#15 0x000000000042235d in main (argc=<value optimized
out>,
argv=<value optimized out>) at
libvirtd.c:1609
I bet thread 1 was trying to grab a lock that wasn't
properly unlocked due to the async cancel. All the other threads look pretty
much the same and I doubt either one holds the lock.
Maybe we should go with the previous code from a while
ago which was setting a flag for the thread to die. It caused other work-arounds
to become necessary but at least we don't have to deal with possibly async.
deaths of threads holding locks.
Stefan