Hi,
We've recently run into an issue with libvirt 1.2.17 in the context of an
OpenStack deployment.
Occasionally, after live-migrating instances from a compute node running
libvirt 1.2.17 to a compute node running libvirt 2.0.0, we see libvirtd on the
1.2.17 side stop responding. When this happens, a command like "sudo virsh list"
just hangs waiting for a response from libvirtd.
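
For anyone who wants to detect this state from a script rather than by hand, here is a minimal sketch, assuming Python 3 is available on the compute node. The helper name `responds_within` and the 10-second timeout are illustrative choices, not anything provided by libvirt:

```python
import subprocess


def responds_within(cmd, timeout_s):
    """Return True if cmd exits within timeout_s seconds.

    Returns False if the command was still running at the deadline
    (subprocess.run kills the child before raising TimeoutExpired).
    A non-zero exit status still counts as "responded".
    """
    try:
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
        return True
    except subprocess.TimeoutExpired:
        return False
```

Something like `responds_within(["virsh", "list"], 10)` returning False would suggest libvirtd is in the hung state described above, though a slow but healthy daemon could also trip a short timeout.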
Running "ps -elfT|grep libvirtd" shows many threads waiting on a futex, but two
threads in poll_schedule_timeout() as part of the poll() syscall. On a non-hung
libvirtd I only see one thread in poll_schedule_timeout().
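
To make the comparison between a hung and a healthy libvirtd less error-prone, the per-thread wait channels can be tallied. This is a rough sketch that assumes the input comes from something like `ps -L -o lwp,wchan:32 -p <libvirtd pid>` (a header line followed by one line per thread); the function name `count_wait_channels` is hypothetical:

```python
from collections import Counter


def count_wait_channels(ps_output):
    """Count threads per kernel wait channel.

    Expects text with a header line followed by lines of the form
    "<lwp> <wchan>", as printed by `ps -L -o lwp,wchan:32 -p <pid>`.
    """
    counts = Counter()
    for line in ps_output.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) >= 2:
            counts[fields[1]] += 1
    return counts
```

On the hung daemon described above, the count for poll_schedule_timeout would be expected to be 2 instead of the usual 1.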
If I kill and restart libvirtd (this took two tries; it didn't actually die the
first time), the problem seems to go away.
I just tried attaching gdb to the "hung" libvirtd process and running
"thread apply all backtrace". This printed backtraces for the threads,
including the one that was apparently stuck in poll():
Thread 17 (Thread 0x7f0573fff700 (LWP 186865)):
#0 0x00007f05b59d769d in poll () from /lib64/libc.so.6
#1 0x00007f05b7f01b9a in virNetClientIOEventLoop () from /lib64/libvirt.so.0
#2 0x00007f05b7f0234b in virNetClientSendInternal () from /lib64/libvirt.so.0
#3 0x00007f05b7f036f3 in virNetClientSendWithReply () from /lib64/libvirt.so.0
#4 0x00007f05b7f04eb3 in virNetClientStreamSendPacket () from /lib64/libvirt.so.0
#5 0x00007f05b7ed8db5 in remoteStreamFinish () from /lib64/libvirt.so.0
#6 0x00007f05b7ec7eaa in virStreamFinish () from /lib64/libvirt.so.0
#7 0x00007f059bd9323d in qemuMigrationIOFunc () from /usr/lib64/libvirt/connection-driver/libvirt_driver_qemu.so
#8 0x00007f05b7e09aa2 in virThreadHelper () from /lib64/libvirt.so.0
#9 0x00007f05b5cb4dc5 in start_thread () from /lib64/libpthread.so.0
#10 0x00007f05b59e1ced in clone () from /lib64/libc.so.6
Interestingly, when I hit "c" to continue in the debugger, I got this:
(gdb) c
Continuing.
Program received signal SIGPIPE, Broken pipe.
[Switching to Thread 0x7f0573fff700 (LWP 186865)]
0x00007f05b5cbb1cd in write () from /lib64/libpthread.so.0
(gdb) c
Continuing.
[Thread 0x7f0573fff700 (LWP 186865) exited]
(gdb) quit
A debugging session is active.
Inferior 1 [process 37471] will be detached.
Quit anyway? (y or n) y
Detaching from program: /usr/sbin/libvirtd, process 37471
Now thread 186865 seems to be gone, and libvirtd is no longer hung.
Has anyone seen anything like this before? Anyone have an idea where to start
looking?
Thanks,
Chris