Tsan is a heavilly reworked version of helgrind the
data race detector based on Valgrind (at least for linux)
It is developped by Google and described/available there:
http://code.google.com/p/data-race-test/wiki/ThreadSanitizer
as libvirt code is highly threaded I though that we should try
testing it. Helgrind used to not be very practicable but Google
suggest they significantly speed up the process and decreased the
number of false positive.
I tried to build it from source, but it failed in some ways
by not finding the i386/686 libraries build when run on my x86_64,
could be related to differences in distribution, they use only
Ubuntu, and I tested on Fedora 16, I didn't tried to really solve the
issue at that point.
One of the problem of such a configuration is that the program doesn't
have what could be specific Fedora exception lists for valgrind.
So I grabbed the prebuilt version tsan-r4363-amd64-linux-self-contained.sh
(it's a self extracting binary in shell script) from
http://build.chromium.org/p/client.tsan/binaries/
renamid tsan.sh and installed it in my path, works fine...
I did a first test run with libxml2 code I know very well and
the result were good, i.e. it pointed out a data race I know about
and raised another one about some global data initialization,
so fairly positive from my POV.
Then I instrumented libvirt daemon for 0.9.12 release, launched it
as root from command line as a tsan argument:
/tmp/tsan.sh --trace-children=no /usr/sbin/libvirtd
and then played with virt-manager to exercise it a bit.
At first I tried to use --log-file=tsan.log option to collect
the logs but the logfile got corrupted so the following is
just cut and paste from the output
One of the ouput raised frequently is:
----------------------------------------------------------------------
==5838== WARNING: Possible data race during write of size 4 at 0x34723B3BD0: {{{
==5838== T1 (L{L17, L125}):
==5838== #0 ptmalloc_unlock_all2 /lib64/libc-2.14.90.so
==5838== #1 fork /lib64/libc-2.14.90.so
==5838== #2 virFork /usr/src/debug/libvirt-0.9.12/src/util/command.c:268
==5838== #3 virExecWithHook.constprop.5
/usr/src/debug/libvirt-0.9.12/src/util/command.c:489
==5838== #4 virCommandRunAsync /usr/src/debug/libvirt-0.9.12/src/util/command.c:2177
==5838== #5 virCommandRun /usr/src/debug/libvirt-0.9.12/src/util/command.c:1970
==5838== #6 qemuCapsExtractVersionInfo
/usr/src/debug/libvirt-0.9.12/src/qemu/qemu_capabilities.c:1379
==5838== #7 qemuCapsExtractVersion
/usr/src/debug/libvirt-0.9.12/src/qemu/qemu_capabilities.c:1568
==5838== #8 qemudGetVersion
/usr/src/debug/libvirt-0.9.12/src/qemu/qemu_driver.c:1320
==5838== #9 virConnectGetVersion /usr/src/debug/libvirt-0.9.12/src/libvirt.c:1640
==5838== #10 remoteDispatchGetVersionHelper
/usr/src/debug/libvirt-0.9.12/daemon/remote_dispatch.h:6862
==5838== #11 virNetServerProgramDispatch
/usr/src/debug/libvirt-0.9.12/src/rpc/virnetserverprogram.c:416
==5838== Concurrent read(s) happened at (OR AFTER) these points:
==5838== T11 (L{}):
==5838== #0 arena_thread_freeres /lib64/libc-2.14.90.so
==5838== #1 __libc_thread_freeres /lib64/libc-2.14.90.so
==5838== #2 start_thread /lib64/libpthread-2.14.90.so
==5838== Address 0x34723B3BD0 is 0 bytes inside data symbol "list_lock"
==5838== Locks involved in this report (reporting last lock sites): {L17, L125}
==5838== L17 (0x5339B00)
==5838== #0 pthread_mutex_lock
/mnt/data/build/slave/full_linux_build/build/tsan/ts_valgrind_intercepts.c:935
==5838== #1 virFork /usr/src/debug/libvirt-0.9.12/src/util/command.c:266
==5838== #2 virExecWithHook.constprop.5
/usr/src/debug/libvirt-0.9.12/src/util/command.c:489
==5838== #3 virCommandRunAsync /usr/src/debug/libvirt-0.9.12/src/util/command.c:2177
==5838== #4 virCommandRun /usr/src/debug/libvirt-0.9.12/src/util/command.c:1970
==5838== #5 qemuCapsExtractVersionInfo
/usr/src/debug/libvirt-0.9.12/src/qemu/qemu_capabilities.c:1379
==5838== #6 qemuCapsExtractVersion
/usr/src/debug/libvirt-0.9.12/src/qemu/qemu_capabilities.c:1568
==5838== #7 qemudGetVersion
/usr/src/debug/libvirt-0.9.12/src/qemu/qemu_driver.c:1320
==5838== #8 virConnectGetVersion /usr/src/debug/libvirt-0.9.12/src/libvirt.c:1640
==5838== #9 remoteDispatchGetVersionHelper
/usr/src/debug/libvirt-0.9.12/daemon/remote_dispatch.h:6862
==5838== #10 virNetServerProgramDispatch
/usr/src/debug/libvirt-0.9.12/src/rpc/virnetserverprogram.c:416
==5838== #11 virNetServerHandleJob
/usr/src/debug/libvirt-0.9.12/src/rpc/virnetserver.c:161
==5838== L125 (0x4C0A0A00)
==5838== #0 pthread_mutex_lock
/mnt/data/build/slave/full_linux_build/build/tsan/ts_valgrind_intercepts.c:935
==5838== #1 qemudGetVersion
/usr/src/debug/libvirt-0.9.12/src/qemu/qemu_driver.c:1319
==5838== #2 virConnectGetVersion /usr/src/debug/libvirt-0.9.12/src/libvirt.c:1640
==5838== #3 remoteDispatchGetVersionHelper
/usr/src/debug/libvirt-0.9.12/daemon/remote_dispatch.h:6862
==5838== #4 virNetServerProgramDispatch
/usr/src/debug/libvirt-0.9.12/src/rpc/virnetserverprogram.c:416
==5838== #5 virNetServerHandleJob
/usr/src/debug/libvirt-0.9.12/src/rpc/virnetserver.c:161
==5838== #6 virThreadPoolWorker
/usr/src/debug/libvirt-0.9.12/src/util/threadpool.c:144
==5838== #7 virThreadHelper
/usr/src/debug/libvirt-0.9.12/src/util/threads-pthread.c:161
==5838== Race verifier data: 0x347207CF98,0x347215D6BB
==5838== }}}
I don't know what "Race verifier data" means but it seems that we do
*pid = fork(); in the main thread and that another thread is accessing
without synchronization. The probram properly report that 2 locks are
held at that point, the virLogLock() and qemuDriverLock() and give full
stack trace on where those locks were taken. That's convenient but
I could not really spot where the second thread did the access, I suspect
that's somehow related to libpthread considering the minimal stack trace
of the second thread T11 ...
Unfortunately, it seems that each time we fork we hit one or two
issue like that (which could be false positive), and tend to fill logs
relatively fast :-\
I however found out a couple of issues not related to libpthread:
==5733== WARNING: Possible data race during write of size 1 at 0x34723B6B14: {{{
==5733== T5 (L{}):
==5733== #0 Replace_memcpy
/mnt/data/build/slave/full_linux_build/build/tsan/ts_replace.h:113
==5733== #1 __GI_memcpy
/mnt/data/build/slave/full_linux_build/build/tsan/ts_valgrind_intercepts.c:2359
==5733== #2 xdrmem_getbytes /lib64/libc-2.14.90.so
==5733== #3 xdr_string /lib64/libc-2.14.90.so
==5733== #4 xdr_remote_nonnull_string
/usr/src/debug/libvirt-0.9.12/daemon/../src/remote/remote_protocol.c:31
==5733== #5 xdr_remote_nonnull_domain
/usr/src/debug/libvirt-0.9.12/daemon/../src/remote/remote_protocol.c:58
==5733== #6 xdr_remote_domain_get_info_args
/usr/src/debug/libvirt-0.9.12/daemon/../src/remote/remote_protocol.c:1390
==5733== #7 virNetMessageDecodePayload
/usr/src/debug/libvirt-0.9.12/src/rpc/virnetmessage.c:382
==5733== #8 virNetServerProgramDispatch
/usr/src/debug/libvirt-0.9.12/src/rpc/virnetserverprogram.c:403
==5733== #9 virNetServerHandleJob
/usr/src/debug/libvirt-0.9.12/src/rpc/virnetserver.c:161
==5733== #10 virThreadPoolWorker
/usr/src/debug/libvirt-0.9.12/src/util/threadpool.c:144
==5733== #11 virThreadHelper
/usr/src/debug/libvirt-0.9.12/src/util/threads-pthread.c:161
==5733== Concurrent write(s) happened at (OR AFTER) these points:
==5733== T2 (L{}):
==5733== #0 xdrmem_getbytes /lib64/libc-2.14.90.so
==5733== #1 xdr_string /lib64/libc-2.14.90.so
==5733== #2 xdr_remote_nonnull_string
/usr/src/debug/libvirt-0.9.12/daemon/../src/remote/remote_protocol.c:31
==5733== #3 xdr_remote_nonnull_domain
/usr/src/debug/libvirt-0.9.12/daemon/../src/remote/remote_protocol.c:58
==5733== #4 xdr_remote_domain_get_xml_desc_args
/usr/src/debug/libvirt-0.9.12/daemon/../src/remote/remote_protocol.c:1532
==5733== #5 virNetMessageDecodePayload
/usr/src/debug/libvirt-0.9.12/src/rpc/virnetmessage.c:382
==5733== #6 virNetServerProgramDispatch
/usr/src/debug/libvirt-0.9.12/src/rpc/virnetserverprogram.c:403
==5733== #7 virNetServerHandleJob
/usr/src/debug/libvirt-0.9.12/src/rpc/virnetserver.c:161
==5733== #8 virThreadPoolWorker
/usr/src/debug/libvirt-0.9.12/src/util/threadpool.c:144
==5733== #9 virThreadHelper
/usr/src/debug/libvirt-0.9.12/src/util/threads-pthread.c:161
==5733== Address 0x34723B6B14 is 0 bytes inside data symbol "crud.6424"
==5733== Race verifier data: 0x4A091CE,0x347212579D
==5733== }}}
that one could potentially be nasty, it seems that 2 threads decoding
rpc payload in parallel hit an unsynchronized write access in
the xdr routine for xdr_remote_nonnull_string() . I didn't had the
debuginfo for glibc installed when I ran it so I don't have the
exact informations at the top of the stacks
The other one raised is actually something I had initially found
with the first run on libxml2 so it's relatively consistent:
==5733== WARNING: Possible data race during read of size 4 at 0x34723B10B4: {{{
==5733== T1 (L{L173}):
==5733== #0 random_r /lib64/libc-2.14.90.so
==5733== #1 random /lib64/libc-2.14.90.so
==5733== #2 rand /lib64/libc-2.14.90.so
==5733== #3 xmlDictCreate /usr/src/debug/libxml2-2.8.0/dict.c:491
==5733== #4 xmlInitParserCtxt /usr/src/debug/libxml2-2.8.0/parserInternals.c:1620
==5733== #5 xmlNewParserCtxt /usr/src/debug/libxml2-2.8.0/parserInternals.c:1874
==5733== #6 xmlCreateURLParserCtxt /usr/src/debug/libxml2-2.8.0/parser.c:13546
==5733== #7 xmlReadFile /usr/src/debug/libxml2-2.8.0/parser.c:14713
==5733== #8 xmlRelaxNGParse /usr/src/debug/libxml2-2.8.0/relaxng.c:7480
==5733== #9 (no symbols) /usr/lib64/libnetcf.so.1.4.0
==5733== #10 ncf_init /usr/lib64/libnetcf.so.1.4.0
==5733== #11 interfaceOpenInterface
/usr/src/debug/libvirt-0.9.12/src/interface/netcf_driver.c:144
==5733== Concurrent write(s) happened at (OR AFTER) these points:
==5733== T10 (L{L156}):
==5733== #0 random_r /lib64/libc-2.14.90.so
==5733== #1 random /lib64/libc-2.14.90.so
==5733== #2 rand /lib64/libc-2.14.90.so
==5733== #3 xmlHashCreate /usr/src/debug/libxml2-2.8.0/hash.c:193
==5733== #4 xmlRelaxNGCheckRules /usr/src/debug/libxml2-2.8.0/relaxng.c:4125
==5733== #5 xmlRelaxNGCheckRules /usr/src/debug/libxml2-2.8.0/relaxng.c:6380
==5733== #6 xmlRelaxNGCheckRules /usr/src/debug/libxml2-2.8.0/relaxng.c:6523
==5733== #7 xmlRelaxNGCheckRules /usr/src/debug/libxml2-2.8.0/relaxng.c:6319
==5733== #8 xmlRelaxNGCheckRules /usr/src/debug/libxml2-2.8.0/relaxng.c:6520
==5733== #9 xmlRelaxNGCheckRules /usr/src/debug/libxml2-2.8.0/relaxng.c:6523
==5733== Address 0x34723B10B4 is 20 bytes inside data symbol "randtbl"
==5733== Locks involved in this report (reporting last lock sites): {L156, L173}
==5733== L156 (0x402E790)
==5733== #0 pthread_mutex_lock
/mnt/data/build/slave/full_linux_build/build/tsan/ts_valgrind_intercepts.c:935
==5733== #1 remoteDispatchOpenHelper
/usr/src/debug/libvirt-0.9.12/daemon/remote.c:683
==5733== #2 virNetServerProgramDispatch
/usr/src/debug/libvirt-0.9.12/src/rpc/virnetserverprogram.c:416
==5733== #3 virNetServerHandleJob
/usr/src/debug/libvirt-0.9.12/src/rpc/virnetserver.c:161
==5733== #4 virThreadPoolWorker
/usr/src/debug/libvirt-0.9.12/src/util/threadpool.c:144
==5733== #5 virThreadHelper
/usr/src/debug/libvirt-0.9.12/src/util/threads-pthread.c:161
==5733== L173 (0x402F420)
==5733== #0 pthread_mutex_lock
/mnt/data/build/slave/full_linux_build/build/tsan/ts_valgrind_intercepts.c:935
==5733== #1 remoteDispatchOpenHelper
/usr/src/debug/libvirt-0.9.12/daemon/remote.c:683
==5733== #2 virNetServerProgramDispatch
/usr/src/debug/libvirt-0.9.12/src/rpc/virnetserverprogram.c:416
==5733== #3 virNetServerHandleJob
/usr/src/debug/libvirt-0.9.12/src/rpc/virnetserver.c:161
==5733== #4 virThreadPoolWorker
/usr/src/debug/libvirt-0.9.12/src/util/threadpool.c:144
==5733== #5 virThreadHelper
/usr/src/debug/libvirt-0.9.12/src/util/threads-pthread.c:161
==5733== Race verifier data: 0x347203A572,0x347203A565
==5733== }}}
I didn't know that netcf was doing Relax-NG validation on-the-fly :)
and hit another in libxml2 though libvirt/netcf/libxslt/libxml2
call:
==5733== WARNING: Possible data race during read of size 4 at 0x34723B10B4: {{{
==5733== T1 (L{L173}):
==5733== #0 random_r /lib64/libc-2.14.90.so
==5733== #1 random /lib64/libc-2.14.90.so
==5733== #2 rand /lib64/libc-2.14.90.so
==5733== #3 xmlDictCreate /usr/src/debug/libxml2-2.8.0/dict.c:491
==5733== #4 xmlInitParserCtxt /usr/src/debug/libxml2-2.8.0/parserInternals.c:1620
==5733== #5 xmlNewParserCtxt /usr/src/debug/libxml2-2.8.0/parserInternals.c:1874
==5733== #6 xmlCreateURLParserCtxt /usr/src/debug/libxml2-2.8.0/parser.c:13546
==5733== #7 xmlReadFile /usr/src/debug/libxml2-2.8.0/parser.c:14713
==5733== #8 xmlRelaxNGParse /usr/src/debug/libxml2-2.8.0/relaxng.c:7480
==5733== #9 (no symbols) /usr/lib64/libnetcf.so.1.4.0
==5733== #10 ncf_init /usr/lib64/libnetcf.so.1.4.0
==5733== #11 interfaceOpenInterface
/usr/src/debug/libvirt-0.9.12/src/interface/netcf_driver.c:144
==5733== Concurrent write(s) happened at (OR AFTER) these points:
==5733== T10 (L{L156}):
==5733== #0 random_r /lib64/libc-2.14.90.so
==5733== #1 random /lib64/libc-2.14.90.so
==5733== #2 rand /lib64/libc-2.14.90.so
==5733== #3 xmlHashCreate /usr/src/debug/libxml2-2.8.0/hash.c:193
==5733== #4 xmlRelaxNGCheckRules /usr/src/debug/libxml2-2.8.0/relaxng.c:4125
==5733== #5 xmlRelaxNGCheckRules /usr/src/debug/libxml2-2.8.0/relaxng.c:6380
==5733== #6 xmlRelaxNGCheckRules /usr/src/debug/libxml2-2.8.0/relaxng.c:6523
==5733== #7 xmlRelaxNGCheckRules /usr/src/debug/libxml2-2.8.0/relaxng.c:6319
==5733== #8 xmlRelaxNGCheckRules /usr/src/debug/libxml2-2.8.0/relaxng.c:6520
==5733== #9 xmlRelaxNGCheckRules /usr/src/debug/libxml2-2.8.0/relaxng.c:6523
==5733== Address 0x34723B10B4 is 20 bytes inside data symbol "randtbl"
==5733== Locks involved in this report (reporting last lock sites): {L156, L173}
Looks like another libxml2 bug calling rand, I should use rand_r() instead
if available.
I also ran virsh itself under tsan for a few functions, but that didn't
raise any issue.
So except for the parallel xdr_remote_nonnull_string() issue
I don't think there is anything I can report at this point for libvirt
code itself.
The slowdown of tsan is acceptable, virt-manager is a clearly a bit
less responsive when connecting of doing libvirt operations but it
is really usable, and that's IMHO the main point of my testing,
it's not hard (though it would be better if packaged for Fedora)
and the results are interesting but a bit heavy since the same issue
can pop-up all the time like for fork().
I guess it would be interesting to try out qemu-kvm with it though
it may turn unbearably slow, but it's worth testing !
Daniel
--
Daniel Veillard | libxml Gnome XML XSLT toolkit
http://xmlsoft.org/
daniel(a)veillard.com | Rpmfind RPM search engine
http://rpmfind.net/
http://veillard.com/ | virtualization library
http://libvirt.org/