Re: [libvirt] [libvirt PATCHv6 1/1] add DHCP snooping

Stefan Berger/Watson/IBM wrote on 03/22/2012 03:04:53 PM:
I have some concerns about the cancelation of the thread. It can hold the snoop lock and get cancelled while holding it. Next time that lock is grabbed we will get a deadlock...
The snoop lock is acquired in virNWFilterDHCPSnoopEnd(), which is where the pthread_cancel() directly (for valid leases) or the freeReq()/pthread_cancel() is done. So, the canceler has the lock and the canceling thread can't have it, then. The only other case I see is when the config goes invalid and we exit the snooper loop; he frees the snoop_lock() before removing his own hash entry, which will cancel the same thread doing the remove. Again, it can't have the snoop lock when canceled. Is there some other case you think I'm missing? +-DLS

On 03/22/2012 04:49 PM, David Stevens wrote:
Stefan Berger/Watson/IBM wrote on 03/22/2012 03:04:53 PM:
I have some concerns about the cancelation of the thread. It can hold the snoop lock and get cancelled while holding it. Next time that lock is grabbed we will get a deadlock...
The snoop lock is acquired in virNWFilterDHCPSnoopEnd(), which is where the pthread_cancel() directly (for valid leases) or the freeReq()/pthread_cancel() is done. So, the canceler has the lock and the canceling thread can't have it, then.
The only other case I see is when the config goes invalid and we exit the snooper loop; he frees the snoop_lock() before removing his own hash entry, which will cancel the same thread doing the remove. Again, it can't have the snoop lock when canceled.
Is there some other case you think I'm missing?
pthread_cancel() tends to imply that you are properly managing signal blocking across threads; we haven't used it anywhere else in libvirt, and I'm extremely wary of pulling it in now, as there's probably a lot of subtle bugs that it would expose. Are you sure you can't do this in some other manner without dragging in pthread_cancel()? -- Eric Blake eblake@redhat.com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

Eric Blake <eblake@redhat.com> wrote on 03/22/2012 03:54:31 PM:
pthread_cancel() tends to imply that you are properly managing signal blocking across threads; we haven't used it anywhere else in libvirt, and I'm extremely wary of pulling it in now, as there's probably a lot of subtle bugs that it would expose. Are you sure you can't do this in some other manner without dragging in pthread_cancel()?
Well, I was trying to avoid it in the earlier versions, but we ran into races where a new snooper thread could start up on the same interface before the old one woke up and noticed it was supposed to die; the old thread would then interfere with the new thread in unpleasant ways. I certainly had no problems with it during my testing. In this case, an asynchronous signal to kill an otherwise blocked-on-a-read thread is exactly what we need. That said, I think it's possible to avoid it, but using it greatly simplified the restart and reload cases. If you think it's too much risk, I can take a shot at reworking that to avoid it again. +-DLS

Eric Blake <eblake@redhat.com> wrote on 03/22/2012 06:54:31 PM:
On 03/22/2012 04:49 PM, David Stevens wrote:
Stefan Berger/Watson/IBM wrote on 03/22/2012 03:04:53 PM:
I have some concerns about the cancelation of the thread. It can hold the snoop lock and get cancelled while holding it. Next time that lock is grabbed we will get a deadlock...
The snoop lock is acquired in virNWFilterDHCPSnoopEnd(), which is where the pthread_cancel() directly (for valid leases) or the freeReq()/pthread_cancel() is done. So, the canceler has the lock and the canceling thread can't have it, then.
The only other case I see is when the config goes invalid and we exit the snooper loop; he frees the snoop_lock() before removing his own hash entry, which will cancel the same thread doing the remove. Again, it can't have the snoop lock when canceled.
Is there some other case you think I'm missing?
pthread_cancel() tends to imply that you are properly managing signal blocking across threads; we haven't used it anywhere else in libvirt, and I'm extremely wary of pulling it in now, as there's probably a lot of subtle bugs that it would expose. Are you sure you can't do this in some other manner without dragging in pthread_cancel()?
From the patch:
+ if (pthread_setcanceltype(PTHREAD_CANCEL_ASYNCHRONOUS, &tmp) < 0) + virNWFilterReportError(VIR_ERR_INTERNAL_ERROR, "pthread_setcanceltype " + "failed; ifname \"%s\"", req->ifname ? + req->ifname : "<NULL>"); from the man page of above function: PTHREAD_CANCEL_ASYNCHRONOUS The thread can be canceled at any time. (Typically, it will be canceled immediately upon receiving a cancellation request, but the system doesn't guarantee this.) It seems implementation dependent on when the thread actually die... So I can construct a scenario where the canceller holds the lock and the thread doesn't, the thread gets cancelled but not immediately and can still grab the lock and then dies. This is what I concluded to have seen from below stack trace when the snooping thread was running and in rapid succession I was shooting two SIGHUPs at libvirt: Thread 11 (Thread 0x7f3c29ce8700 (LWP 12800)): #0 0x00000038ec60b3b4 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00007f3c2fe2229a in virCondWait (c=<value optimized out>, m=<value optimized out>) at util/threads-pthread.c:117 #2 0x00007f3c2fe2295b in virThreadPoolWorker (opaque=<value optimized out>) at util/threadpool.c:103 #3 0x00007f3c2fe21f36 in virThreadHelper (data=<value optimized out>) at util/threads-pthread.c:161 #4 0x00000038ec606ccb in start_thread () from /lib64/libpthread.so.0 #5 0x00000038ec2e0c2d in clone () from /lib64/libc.so.6 Thread 10 (Thread 0x7f3c294e7700 (LWP 12801)): #0 0x00000038ec60b3b4 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00007f3c2fe2229a in virCondWait (c=<value optimized out>, m=<value optimized out>) at util/threads-pthread.c:117 #2 0x00007f3c2fe2295b in virThreadPoolWorker (opaque=<value optimized out>) at util/threadpool.c:103 #3 0x00007f3c2fe21f36 in virThreadHelper (data=<value optimized out>) at util/threads-pthread.c:161 #4 0x00000038ec606ccb in start_thread () from /lib64/libpthread.so.0 ---Type <return> to continue, or q <return> to quit--- #5 0x00000038ec2e0c2d in clone () from /lib64/libc.so.6 Thread 9 (Thread 0x7f3c28ce6700 (LWP 12802)): #0 0x00000038ec60b3b4 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00007f3c2fe2229a in virCondWait (c=<value optimized out>, m=<value optimized out>) at util/threads-pthread.c:117 #2 0x00007f3c2fe2295b in virThreadPoolWorker (opaque=<value optimized out>) at util/threadpool.c:103 #3 0x00007f3c2fe21f36 in virThreadHelper (data=<value optimized out>) at util/threads-pthread.c:161 #4 0x00000038ec606ccb in start_thread () from /lib64/libpthread.so.0 #5 0x00000038ec2e0c2d in clone () from /lib64/libc.so.6 Thread 8 (Thread 0x7f3c284e5700 (LWP 12803)): #0 0x00000038ec60b3b4 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00007f3c2fe2229a in virCondWait (c=<value optimized out>, m=<value optimized out>) at util/threads-pthread.c:117 #2 0x00007f3c2fe2295b in virThreadPoolWorker (opaque=<value optimized out>) at util/threadpool.c:103 #3 0x00007f3c2fe21f36 in virThreadHelper (data=<value optimized out>) at util/threads-pthread.c:161 ---Type <return> to continue, or q <return> to quit--- #4 0x00000038ec606ccb in start_thread () from /lib64/libpthread.so.0 #5 0x00000038ec2e0c2d in clone () from /lib64/libc.so.6 Thread 7 (Thread 0x7f3c27ce4700 (LWP 12804)): #0 0x00000038ec60b3b4 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00007f3c2fe2229a in virCondWait (c=<value optimized out>, m=<value optimized out>) at util/threads-pthread.c:117 #2 0x00007f3c2fe2295b in virThreadPoolWorker (opaque=<value optimized out>) at util/threadpool.c:103 #3 0x00007f3c2fe21f36 in virThreadHelper (data=<value optimized out>) at util/threads-pthread.c:161 #4 0x00000038ec606ccb in start_thread () from /lib64/libpthread.so.0 #5 0x00000038ec2e0c2d in clone () from /lib64/libc.so.6 Thread 6 (Thread 0x7f3c274e3700 (LWP 12805)): #0 0x00000038ec60b3b4 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00007f3c2fe2229a in virCondWait (c=<value optimized out>, m=<value optimized out>) at util/threads-pthread.c:117 #2 0x00007f3c2fe2295b in virThreadPoolWorker (opaque=<value optimized out>) at util/threadpool.c:103 #3 0x00007f3c2fe21f36 in virThreadHelper (data=<value optimized out>) ---Type <return> to continue, or q <return> to quit--- at util/threads-pthread.c:161 #4 0x00000038ec606ccb in start_thread () from /lib64/libpthread.so.0 #5 0x00000038ec2e0c2d in clone () from /lib64/libc.so.6 Thread 5 (Thread 0x7f3c26ce2700 (LWP 12806)): #0 0x00000038ec60b3b4 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00007f3c2fe2229a in virCondWait (c=<value optimized out>, m=<value optimized out>) at util/threads-pthread.c:117 #2 0x00007f3c2fe2295b in virThreadPoolWorker (opaque=<value optimized out>) at util/threadpool.c:103 #3 0x00007f3c2fe21f36 in virThreadHelper (data=<value optimized out>) at util/threads-pthread.c:161 #4 0x00000038ec606ccb in start_thread () from /lib64/libpthread.so.0 #5 0x00000038ec2e0c2d in clone () from /lib64/libc.so.6 Thread 4 (Thread 0x7f3c264e1700 (LWP 12807)): #0 0x00000038ec60b3b4 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00007f3c2fe2229a in virCondWait (c=<value optimized out>, m=<value optimized out>) at util/threads-pthread.c:117 #2 0x00007f3c2fe2295b in virThreadPoolWorker (opaque=<value optimized out>) at util/threadpool.c:103 ---Type <return> to continue, or q <return> to quit--- #3 0x00007f3c2fe21f36 in virThreadHelper (data=<value optimized out>) at util/threads-pthread.c:161 #4 0x00000038ec606ccb in start_thread () from /lib64/libpthread.so.0 #5 0x00000038ec2e0c2d in clone () from /lib64/libc.so.6 Thread 3 (Thread 0x7f3c25ce0700 (LWP 12808)): #0 0x00000038ec60b3b4 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00007f3c2fe2229a in virCondWait (c=<value optimized out>, m=<value optimized out>) at util/threads-pthread.c:117 #2 0x00007f3c2fe2295b in virThreadPoolWorker (opaque=<value optimized out>) at util/threadpool.c:103 #3 0x00007f3c2fe21f36 in virThreadHelper (data=<value optimized out>) at util/threads-pthread.c:161 #4 0x00000038ec606ccb in start_thread () from /lib64/libpthread.so.0 #5 0x00000038ec2e0c2d in clone () from /lib64/libc.so.6 Thread 2 (Thread 0x7f3c254df700 (LWP 12809)): #0 0x00000038ec60b3b4 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00007f3c2fe2229a in virCondWait (c=<value optimized out>, m=<value optimized out>) at util/threads-pthread.c:117 #2 0x00007f3c2fe2295b in virThreadPoolWorker (opaque=<value optimized out>) ---Type <return> to continue, or q <return> to quit--- at util/threadpool.c:103 #3 0x00007f3c2fe21f36 in virThreadHelper (data=<value optimized out>) at util/threads-pthread.c:161 #4 0x00000038ec606ccb in start_thread () from /lib64/libpthread.so.0 #5 0x00000038ec2e0c2d in clone () from /lib64/libc.so.6 Thread 1 (Thread 0x7f3c2fda2820 (LWP 12799)): #0 0x00000038ec60dc9c in __lll_lock_wait () from /lib64/libpthread.so.0 #1 0x00000038ec609125 in _L_lock_880 () from /lib64/libpthread.so.0 #2 0x00000038ec609008 in pthread_mutex_lock () from /lib64/libpthread.so.0 #3 0x00007f3c2fe73818 in virNWFilterObjFindByUUID (nwfilters=0x7f3c1c015008, uuid=<value optimized out>) at conf/nwfilter_conf.c:2662 #4 0x00007f3c2fe738ad in virNWFilterObjAssignDef (conn=0x2055190, nwfilters=0x7f3c1c015008, def=0x201ad60) at conf/nwfilter_conf.c:2947 #5 0x00007f3c2fe73d6f in virNWFilterObjLoad (conn=0x2055190, nwfilters=0x7f3c1c015008, configDir=0x7f3c1c089460 "/etc/libvirt/nwfilter") at conf/nwfilter_conf.c:3045 #6 virNWFilterLoadAllConfigs (conn=0x2055190, nwfilters=0x7f3c1c015008, configDir=0x7f3c1c089460 "/etc/libvirt/nwfilter") at conf/nwfilter_conf.c:3092 #7 0x00000000004e9084 in nwfilterDriverReload () at nwfilter/nwfilter_driver.c:163 #8 0x00007f3c2fea2c8f in virStateReload () at libvirt.c:891 ---Type <return> to continue, or q <return> to quit--- #9 0x0000000000420572 in daemonReloadHandler (srv=<value optimized out>, sig=<value optimized out>, opaque=<value optimized out>) at libvirtd.c:1153 #10 0x00007f3c2fefd4ba in virNetServerSignalEvent ( watch=<value optimized out>, fd=<value optimized out>, events=<value optimized out>, opaque=0x200fe50) at rpc/virnetserver.c:500 #11 0x00007f3c2fe0f651 in virEventPollDispatchHandles () at util/event_poll.c:490 #12 virEventPollRunOnce () at util/event_poll.c:637 #13 0x00007f3c2fe0dd15 in virEventRunDefaultImpl () at util/event.c:247 #14 0x00007f3c2fefdffd in virNetServerRun (srv=0x200fe50) at rpc/virnetserver.c:736 #15 0x000000000042235d in main (argc=<value optimized out>, argv=<value optimized out>) at libvirtd.c:1609 I bet thread 1 was trying to grab a lock that wasn't properly unlocked due to the async cancel. All the other threads look pretty much the same and I doubt either one holds the lock. Maybe we should go with the previous code from a while ago which was setting a flag for the thread to die. It caused other work-arounds to become necessary but at least we don't have to deal with possibly async. deaths of threads holding locks. Stefan
participants (3)
-
David Stevens
-
Eric Blake
-
Stefan Berger