Re: [libvirt] [PATCH 0/2] [RFC] Use the power of SystemTap to get rid of all* deadlocks forever*

10 Sep 2014


      On Tue, Sep 09, 2014 at 07:51:00PM +0200, Martin Kletzander wrote:
...
On Mon, Sep 08, 2014 at 03:59:04PM +0100, Daniel P. Berrange wrote:
...
On Mon, Sep 08, 2014 at 04:17:44PM +0200, Martin Kletzander wrote:
...
Many moons ago, I wanted to make locking more debuggable from logs.
An idea appeared that this might be better off with SystemTap.  I let
it rot for a while and some time back I had a deadlock that I wanted
to try it on.  So I started scripting it and here's the work (in
progress).
Well, the idea is simple.  For each mutex that is about to get locked,
append its symname and the pid that locked it to list A; when it
gets locked, move it to list B, and when it's unlocked, remove it.
Then whenever the user presses ^C, the script prints all the locks
that were locked (with their particular backtraces) and locks that are
waiting to be locked.  This could be enhanced so that it prints both
pieces of information only for the locks that are waiting to be
acquired.  But I don't really care what color the shed will have.
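
The bookkeeping described above can be modelled outside SystemTap.  The
following is only a Python sketch of the two lists, not the actual
lock-debug.stp script; the class name, method names, mutex names and
backtraces are all made up for illustration.

```python
class LockTracker:
    """Toy model of list A (waiting) and list B (held); names are hypothetical."""

    def __init__(self):
        self.waiting = {}  # list A: (pid, mutex) -> backtrace of the lock attempt
        self.held = {}     # list B: (pid, mutex) -> backtrace of the acquisition

    def about_to_lock(self, pid, mutex, backtrace):
        # Called when a lock call is entered.
        self.waiting[(pid, mutex)] = backtrace

    def locked(self, pid, mutex):
        # Called when the lock call returns: move the entry from A to B.
        self.held[(pid, mutex)] = self.waiting.pop((pid, mutex))

    def unlocked(self, pid, mutex):
        # Called on unlock: forget the lock entirely.
        self.held.pop((pid, mutex), None)

    def report(self):
        # What would be printed on ^C: held locks first, then waiters.
        lines = []
        for (pid, mutex), bt in sorted(self.held.items()):
            lines.append(f"held    {mutex} pid={pid}: {' <- '.join(bt)}")
        for (pid, mutex), bt in sorted(self.waiting.items()):
            lines.append(f"waiting {mutex} pid={pid}: {' <- '.join(bt)}")
        return lines


# Made-up scenario: two tasks take two mutexes in opposite order.
tracker = LockTracker()
tracker.about_to_lock(100, "mutexA", ["virMutexLock", "funcOne"])
tracker.locked(100, "mutexA")
tracker.about_to_lock(200, "mutexB", ["virMutexLock", "funcTwo"])
tracker.locked(200, "mutexB")
tracker.about_to_lock(100, "mutexB", ["virMutexLock", "funcOne"])
tracker.about_to_lock(200, "mutexA", ["virMutexLock", "funcTwo"])
print("\n".join(tracker.report()))
```

In this scenario the report lists two held and two waiting entries, which
is the pattern one would look for when hunting a lock-order deadlock.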
Here are some problems I'd love to get any help with:
- I cannot run it against the built git version, two problems with that:
- When I'm not root, staprun needs setuid bit set and that is, of
     course, incompatible with the needed LD_LIBRARY_PATH.
The example command line shows you using 'stap' to actually
launch libvirtd. This is a convenient approach since stap
automatically gets the PID of the process to trace. You do
not need to do it this way though.  You can simply run the
stap script, and then start libvirtd manually (at least for
session mode libvirtd).
Yes, but then I might miss some information on the locks.  However
it might still help sometimes.
If you launch the stap script before launching libvirtd then you
should not miss any locks.
...
...
Hmm, I just tried your example script and seemed to get the
full trace of symbols
# ./run stap --ldd -c daemon/libvirtd -d daemon/libvirtd  examples/systemtap/lock-debug.stp src/.libs/libvirt.so
WARNING: Missing unwind data for module, rerun with 'stap -d ...ge/src/virt/libvirt/src/.libs/libvirt_driver_vbox_network.so'
WARNING: Missing unwind data for module, rerun with 'stap -d ...ge/src/virt/libvirt/src/.libs/libvirt_driver_vbox_storage.so'
WARNING: Missing unwind data for module, rerun with 'stap -d ...errange/src/virt/libvirt/src/.libs/libvirt_driver_nodedev.so'
WARNING: Missing unwind data for module, rerun with 'stap -d ...6/berrange/src/virt/libvirt/src/.libs/libvirt_driver_vbox.so'
WARNING: Missing unwind data for module, rerun with 'stap -d ...errange/src/virt/libvirt/src/.libs/libvirt_driver_network.so'
This is another thing that has to be done.  I guess --ldd doesn't load
symbols for these drivers because the daemon is not dynamically linked
with them, it just loads them.
Yes, you would have to manually list these, or just turn off loadable
modules when debugging locks might be easier.
...
How did you manage to run it with the git built daemon?  Or did you
just connect to it?  If you just ran it as written above:
...
# ./run stap --ldd -c daemon/libvirtd -d daemon/libvirtd  examples/systemtap/lock-debug.stp src/.libs/libvirt.so
then I myself am impressed that it works, because for me it doesn't :)
Yes, I ran exactly that command, however, I can't remember if that was
the first time I ran it after compilation. The libtool wrapper scripts
do a lot of one-time work the first time you run after build. So it can
be helpful to launch the daemon once & shut it down, and then launch it
again for debugging.
...
...
virLogMutex+0x0/0x28 [...9576/berrange/src/virt/libvirt/src/.libs/libvirt.so.0.1002.9](23908):
0x7f226f0cd850 : virMutexLock+0x0/0x10 [...9576/berrange/src/virt/libvirt/src/.libs/libvirt.so.0.1002.9]
0x7f226fc1d163 : main+0x1f33/0x269c [...1524c9576/berrange/src/virt/libvirt/daemon/.libs/lt-libvirtd]
0x7f226b94ed65 : __libc_start_main+0xf5/0x1c0 [/usr/lib64/libc-2.18.so]
0x7f226fc1d8f5 : _start+0x29/0x34 [...1524c9576/berrange/src/virt/libvirt/daemon/.libs/lt-libvirtd]
Unfortunately these addresses appear to be absolute offsets in the process
memory, which is still useless unless you know the load address of each
ELF module.
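
To illustrate what knowing the load address buys you, here is a Python
sketch that turns an absolute address into an offset within the mapped
file, using lines in the format of /proc/<pid>/maps.  The sample mapping
(addresses, inode, path) is invented for the example; only the absolute
address is taken from the trace above.  The result is an offset into the
backing file, which often, but not always, equals the ELF virtual address
that symbol tools expect.

```python
# Invented sample in /proc/<pid>/maps format; not taken from a real process.
SAMPLE_MAPS = """\
7f226ef00000-7f226f300000 r-xp 00000000 fd:01 123456 /hypothetical/libvirt.so
7f226fc00000-7f226fc80000 r-xp 00000000 fd:01 123457 /hypothetical/lt-libvirtd
"""


def parse_maps(text):
    """Return (start, end, file_offset, path) for each file-backed mapping."""
    regions = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue  # anonymous mapping, no backing file
        start, end = (int(x, 16) for x in fields[0].split("-"))
        regions.append((start, end, int(fields[2], 16), fields[5]))
    return regions


def to_module_offset(addr, regions):
    """Map an absolute address to (path, offset into the backing file)."""
    for start, end, file_off, path in regions:
        if start <= addr < end:
            return path, addr - start + file_off
    return None  # address is not inside any file-backed mapping


path, offset = to_module_offset(0x7F226F0CD850, parse_maps(SAMPLE_MAPS))
print(path, hex(offset))
```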
I don't know what the difference is between the offsets before and
after the slash (e.g. in main+0x1f33/0x269c), maybe it's from-to?
Yeah, I'm unclear on that too - I'd have to look at the source
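
For experimenting with these frames, a small Python sketch that splits a
backtrace line into its parts may help.  It assumes the number before the
slash is the offset into the function and the number after it is the
function's size.  That reading is consistent with every frame above (the
first number is always smaller than the second), but it is not confirmed
anywhere in this thread, so treat the "size" field as an assumption.  The
function and field names are made up.

```python
import re

# address : symbol+offset/second [module]
FRAME_RE = re.compile(
    r"^(0x[0-9a-f]+) : (\S+)\+(0x[0-9a-f]+)/(0x[0-9a-f]+) \[(.+)\]$"
)


def parse_frame(line):
    """Split one backtrace line into a dict, or return None if it does not match."""
    m = FRAME_RE.match(line.strip())
    if m is None:
        return None
    addr, symbol, offset, second, module = m.groups()
    return {
        "addr": int(addr, 16),
        "symbol": symbol,
        "offset": int(offset, 16),
        "size": int(second, 16),  # ASSUMPTION: second number is the function size
        "module": module,         # kept verbatim, including any "..." truncation
    }


# Frame copied from the trace above.
frame = parse_frame(
    "0x7f226fc1d163 : main+0x1f33/0x269c "
    "[...1524c9576/berrange/src/virt/libvirt/daemon/.libs/lt-libvirtd]"
)
# Absolute address minus offset gives the absolute start of the function.
print(frame["symbol"], hex(frame["addr"] - frame["offset"]))
```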
...
...
I definitely think it ought to be possible to convert an address like
virCommandRunAsync+0x44f into a proper line number though. Perhaps we
would need to directly write code against elfutils, instead of relying
on addr2line.
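
As a rough illustration of the simpler addr2line route (not the elfutils
approach), the Python sketch below resolves a "symbol+offset" string
against a symbol table in `nm` output format and builds the matching
addr2line command line.  The symbol address and the library path are
invented, and the command is only constructed, not executed; whether
addr2line then yields a useful line number depends on debug info being
available for the binary.

```python
# Invented sample in `nm` output format: "<hex address> <type> <name>".
SAMPLE_NM = """\
00000000000a1b20 T virCommandRunAsync
00000000000a2400 T virCommandWait
"""


def parse_nm(text):
    """Map symbol name -> address from nm-style lines."""
    table = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3:
            table[parts[2]] = int(parts[0], 16)
    return table


def addr2line_command(frame, symbols, binary):
    """Turn 'symbol+0xoff' into an addr2line argv for the given binary."""
    name, _, off = frame.partition("+")
    addr = symbols[name] + int(off, 16)
    # -f prints the function name as well, -e selects the ELF file.
    return ["addr2line", "-f", "-e", binary, hex(addr)]


cmd = addr2line_command(
    "virCommandRunAsync+0x44f",
    parse_nm(SAMPLE_NM),
    "/hypothetical/libvirt.so",  # placeholder path, not from the thread
)
print(" ".join(cmd))
```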
That's out of my league for now (even though I'd love to get some more
insights on ELF structure, etc.).  But do you think it would be worth
adding even without this info (for now)?  If yes, then I'd be happy to
polish it a bit and propose it.
Yes, I think it is worth having the script in git. Even without line
numbers, simply seeing the function name stacks could be enough to
diagnose the problems, since most functions only have one lock call.


Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|