[libvirt] Two core dumps are generated in multi-thread scenarios

Hi,

I found two core dumps generated in multi-thread scenarios in the ESX part.

Case 1: libcurl multi-thread support

Core dump:
#12 0x00002aaabea89712 in addbyter () from /usr/local/lib/libcurl.so.4
#13 0x00002aaabea89b86 in dprintf_formatf () from /usr/local/lib/libcurl.so.4
#14 0x00002aaabea8b055 in curl_mvsnprintf () from /usr/local/lib/libcurl.so.4
#15 0x00002aaabea7678f in Curl_failf () from /usr/local/lib/libcurl.so.4
#16 0x00002aaabea6d871 in Curl_resolv_timeout () from /usr/local/lib/libcurl.so.4
#17 0x00000006e8a8f230 in ?? ()

Fix code: in esxVI_CURL_Connect() in esx_vi.c, I added a new line as follows:

    curl_easy_setopt(curl->handle, CURLOPT_NOSIGNAL, 1);

Case 2: libssl multi-thread support

Core dump:
#0 0x0000003f9b030265 in raise () from /lib64/libc.so.6
#1 0x0000003f9b031d10 in abort () from /lib64/libc.so.6
#2 0x0000003f9b06a84b in __libc_message () from /lib64/libc.so.6
#3 0x0000003f9b072fae in _int_malloc () from /lib64/libc.so.6
#4 0x0000003f9b074cde in malloc () from /lib64/libc.so.6
#5 0x0000003f9b07963b in strerror () from /lib64/libc.so.6
#6 0x0000003fa188032a in ERR_load_ERR_strings () from /lib64/libcrypto.so.6
#7 0x0000003fa187fde9 in ERR_load_crypto_strings () from /lib64/libcrypto.so.6
#8 0x0000003fa48309d9 in SSL_load_error_strings () from /lib64/libssl.so.6
#9 0x00002aaaba8e612e in Curl_ossl_init () from /opt/CSCOppm-unit/hypervisor/libcurl/lib/libcurl.so.4
#10 0x00002aaaba8ee6c1 in curl_global_init () from /opt/CSCOppm-unit/hypervisor/libcurl/lib/libcurl.so.4
#11 0x00002aaaba8ee6f8 in curl_easy_init () from /opt/CSCOppm-unit/hypervisor/libcurl/lib/libcurl.so.4
#12 0x00002aaaba0d932b in esxVI_RegisterVM_Task (ctx=0x2aaaba0d96d1, _this=0x5cf54b20, path=0x50e921c0 "10.74.125.50", name=0x2aaac0ae6e80 "root", asTemplate=3228119712, pool=0x5cf54b20, host=0x2aaac0693270, output=0x50e921a0) at esx/esx_vi_methods.generated.c:480

Possible problem: two callback functions (locking_function and threadid_func) need to be set.
http://www.openssl.org/docs/crypto/threads.html#DESCRIPTION

Would you help by giving some comments on these two core dumps?

B.R.
Benjamin Wang
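For readers unfamiliar with the two callbacks mentioned for Case 2, the following is a minimal sketch of their shape, assuming POSIX threads. It deliberately does not include OpenSSL headers: `SKETCH_LOCK` is a hypothetical stand-in for OpenSSL's `CRYPTO_LOCK` mode bit, and `sketch_locks_init` is a helper invented for this illustration. Whether missing callbacks were really the cause of the second core dump is not confirmed in this thread.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for OpenSSL's CRYPTO_LOCK mode bit. Real code
 * would include <openssl/crypto.h> and test CRYPTO_LOCK instead. */
#define SKETCH_LOCK 1

/* One mutex per lock slot (allocated by sketch_locks_init below). */
static pthread_mutex_t *sketch_locks;

/* Same signature as the locking callback described in the OpenSSL
 * threads documentation: lock slot n if the lock bit is set in mode,
 * otherwise unlock it. */
static void locking_function(int mode, int n, const char *file, int line)
{
    (void)file;
    (void)line;
    if (mode & SKETCH_LOCK)
        pthread_mutex_lock(&sketch_locks[n]);
    else
        pthread_mutex_unlock(&sketch_locks[n]);
}

/* Same signature as the thread-id callback: a value unique per thread.
 * Assumes pthread_t can be cast to an integer, as on Linux. */
static unsigned long threadid_func(void)
{
    return (unsigned long)pthread_self();
}

/* Hypothetical helper: allocate and initialise `count` mutexes.
 * Returns 0 on success, -1 on allocation failure. */
static int sketch_locks_init(int count)
{
    int i;

    sketch_locks = malloc((size_t)count * sizeof(*sketch_locks));
    if (sketch_locks == NULL)
        return -1;
    for (i = 0; i < count; i++)
        pthread_mutex_init(&sketch_locks[i], NULL);
    return 0;
}
```

With an OpenSSL release older than 1.1.0, the array would be sized with `CRYPTO_num_locks()` and the functions registered through `CRYPTO_set_locking_callback()` and `CRYPTO_set_id_callback()` before any other thread uses the library. OpenSSL 1.1.0 and later handle locking internally, so these callbacks are no longer needed there.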

On Sun, Sep 23, 2012 at 03:32:52AM +0000, Benjamin Wang (gendwang) wrote:
Fix code: esxVI_CURL_Connect() in esx_vi.c: I add a new line as following: curl_easy_setopt(curl->handle, CURLOPT_NOSIGNAL, 1);
Where exactly in the function? Can you send a diff of your change?

Daniel

-- 
Daniel Veillard | libxml Gnome XML XSLT toolkit http://xmlsoft.org/
daniel@veillard.com | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library http://libvirt.org/

Hi,

The old code (in esx_vi.c) is as below:

    curl_easy_setopt(curl->handle, CURLOPT_USERAGENT, "libvirt-esx");
    curl_easy_setopt(curl->handle, CURLOPT_HEADER, 0);

New code:

    curl_easy_setopt(curl->handle, CURLOPT_NOSIGNAL, 1);
    curl_easy_setopt(curl->handle, CURLOPT_USERAGENT, "libvirt-esx");
    curl_easy_setopt(curl->handle, CURLOPT_HEADER, 0);

B.R.
Benjamin Wang

-----Original Message-----
From: Daniel Veillard [mailto:veillard@redhat.com]
Sent: September 23, 2012 16:52
To: Benjamin Wang (gendwang)
Cc: Matthias Bolte; libvir-list@redhat.com; Yang Zhou (yangzho)
Subject: Re: [libvirt] Two core dumps are generated in multi-thread scenarios

Where exactly in the function? Can you send a diff of your change?

2012/9/23 Benjamin Wang (gendwang) <gendwang@cisco.com>:
Fix code: esxVI_CURL_Connect() in esx_vi.c: I add a new line as following: curl_easy_setopt(curl->handle, CURLOPT_NOSIGNAL, 1);
It took me a moment reading the libcurl code until I figured out what might be happening here.

The problem is that Curl_resolv_timeout uses SIGALRM + sigsetjmp/siglongjmp to implement the timeout logic. This implementation is not thread-safe, as the SIGALRM handler might be executed on a different thread than the one that started the call to Curl_resolv_timeout. This in turn results in the call to Curl_resolv_timeout being continued via siglongjmp (called from the SIGALRM handler) on a different thread.

Setting CURLOPT_NOSIGNAL to 1 makes libcurl avoid the SIGALRM + sigsetjmp/siglongjmp implementation. This solves the problem, but at the cost of losing the timeout capability.

In your case a DNS lookup took longer than libcurl was willing to wait and a timeout aborted it. But the call to Curl_failf (as part of the timeout error handling) was made on the wrong thread (I think), making it segfault.

IMHO there is no ideal solution here. With CURLOPT_NOSIGNAL set to 0 (the default), libcurl can perform a DNS lookup with a timeout, but the error handling might occur on the wrong thread. With CURLOPT_NOSIGNAL set to 1 the segfault is avoided, but libcurl might get stuck in a DNS lookup.

Are you able to reproduce this problem and can you confirm that setting CURLOPT_NOSIGNAL to 1 fixes it?

-- 
Matthias Bolte
http://photron.blogspot.com

Hi Matthias,

This can't be reproduced 100% of the time; I reproduced it twice. But after I set CURLOPT_NOSIGNAL to 1, I didn't see a similar core dump again, and everything seems to work well.

What do you mean by "stuck in a DNS lookup"?

B.R.
Benjamin Wang

-----Original Message-----
From: Matthias Bolte [mailto:matthias.bolte@googlemail.com]
Sent: September 30, 2012 4:20
To: Benjamin Wang (gendwang)
Cc: libvir-list@redhat.com; Yang Zhou (yangzho)
Subject: Re: Two core dumps are generated in multi-thread scenarios

But with CURLOPT_NOSIGNAL set to 1 the segfault is avoided but libcurl might get stuck in a DNS lookup.

Hi,

I pushed the proposed fix of setting CURLOPT_NOSIGNAL to 1. This effectively makes libcurl lose its timeout ability for synchronous DNS lookups. Asynchronous DNS lookups via the c-ares library are not affected.

Your backtrace shows a timeout of a synchronous DNS lookup, I think (see the Curl_resolv_timeout -> Curl_failf call sequence). This is how you found the problem. But with CURLOPT_NOSIGNAL set to 1, a call to Curl_resolv_timeout can now take longer than a given timeout or might never return at all.

So we are replacing a possible segfault with a DNS lookup that possibly takes too long or never returns.

Regards,
Matthias

2012/10/2 Benjamin Wang (gendwang) <gendwang@cisco.com>:
What do you mean by "stuck in a DNS lookup"?

-- 
Matthias Bolte
http://photron.blogspot.com
participants (3)
- Benjamin Wang (gendwang)
- Daniel Veillard
- Matthias Bolte