On 17.03.2017 23:21, Chris Friesen wrote:
Hi,
We've recently run into an issue with libvirt 1.2.17 in the context of
an OpenStack deployment.
Let me just say that 1.2.17 is a rather old libvirt. Can you try with one
of the latest releases to see whether the bug still reproduces?
Occasionally after doing live migrations from a compute node with
libvirt 1.2.17 to a compute node with libvirt 2.0.0 we see libvirtd on
the 1.2.17 side stop responding. When this happens, if you run a
command like "sudo virsh list" then it just hangs waiting for a response
from libvirtd.
Running "ps -elfT|grep libvirtd" shows many threads waiting on a futex,
but two threads in poll_schedule_timeout() as part of the poll()
syscall. On a non-hung libvirtd I only see one thread in
poll_schedule_timeout().
So it looks like libvirt is waiting for something.
If I kill and restart libvirtd (this took two tries, it didn't actually
die the first time) then the problem seems to go away.
I just tried attaching gdb to the "hung" libvirtd process and running
"thread apply all backtrace". This printed backtraces for the threads,
including the one that was apparently stuck in poll():
Thread 17 (Thread 0x7f0573fff700 (LWP 186865)):
#0 0x00007f05b59d769d in poll () from /lib64/libc.so.6
#1 0x00007f05b7f01b9a in virNetClientIOEventLoop () from
/lib64/libvirt.so.0
#2 0x00007f05b7f0234b in virNetClientSendInternal () from
/lib64/libvirt.so.0
#3 0x00007f05b7f036f3 in virNetClientSendWithReply () from
/lib64/libvirt.so.0
#4 0x00007f05b7f04eb3 in virNetClientStreamSendPacket () from
/lib64/libvirt.so.0
#5 0x00007f05b7ed8db5 in remoteStreamFinish () from /lib64/libvirt.so.0
#6 0x00007f05b7ec7eaa in virStreamFinish () from /lib64/libvirt.so.0
#7 0x00007f059bd9323d in qemuMigrationIOFunc () from
/usr/lib64/libvirt/connection-driver/libvirt_driver_qemu.so
#8 0x00007f05b7e09aa2 in virThreadHelper () from /lib64/libvirt.so.0
#9 0x00007f05b5cb4dc5 in start_thread () from /lib64/libpthread.so.0
#10 0x00007f05b59e1ced in clone () from /lib64/libc.so.6
This means that libvirt is trying to finish the migration stream (which
happens at the end of the migration process), but the client socket is
not writable. So perhaps the client is broken? Or the connection got
interrupted?
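For context, virStreamFinish() sends a final packet over the client
connection and then waits for the remote daemon's confirmation, so if the
peer stops reading, the calling thread sits in poll() indefinitely. Roughly,
the shape of what qemuMigrationIOFunc does is this (an untested sketch, not
the actual code; copy_fd_to_stream is just an illustrative name, and the way
the stream gets bound to the destination is omitted):

#include <unistd.h>
#include <libvirt/libvirt.h>

static int
copy_fd_to_stream(virConnectPtr conn, int fd)
{
    /* blocking stream; the real migration code ties it to the
     * tunnelled migration on the destination side */
    virStreamPtr st = virStreamNew(conn, 0);
    char buf[64 * 1024];
    ssize_t n;
    int ret = -1;

    if (!st)
        return -1;

    while ((n = read(fd, buf, sizeof(buf))) > 0) {
        if (virStreamSend(st, buf, n) < 0) {
            virStreamAbort(st);
            goto cleanup;
        }
    }

    /* Frame #6 in your backtrace: this waits for the peer's
     * confirmation, so an unresponsive peer leaves the thread
     * stuck in poll(). */
    if (virStreamFinish(st) < 0)
        goto cleanup;

    ret = 0;

 cleanup:
    virStreamFree(st);
    return ret;
}

So if the destination daemon (or the network in between) stops responding at
exactly this point, the source side hangs just like you describe.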
Interestingly, when I hit "c" to continue in the debugger, I got this:
(gdb) c
Continuing.
Program received signal SIGPIPE, Broken pipe.
[Switching to Thread 0x7f0573fff700 (LWP 186865)]
0x00007f05b5cbb1cd in write () from /lib64/libpthread.so.0
(gdb) c
Continuing.
[Thread 0x7f0573fff700 (LWP 186865) exited]
(gdb) quit
A debugging session is active.
Inferior 1 [process 37471] will be detached.
Quit anyway? (y or n) y
Detaching from program: /usr/sbin/libvirtd, process 37471
This is because there might be some keepalive traffic going on. Introduced
in 0.9.8, libvirt has a keepalive mechanism in place (repeatedly sending
ping/pong between client & server). Now, should five consecutive pings get
lost (this is configurable, of course), libvirt thinks the connection is
broken and closes it. If you attach a debugger to libvirtd, the whole
daemon is paused, along with the event loop, so the server cannot reply to
the client's pings, which in turn makes the client think the connection is
broken. Thus it closes the connection, which is observed as a broken pipe
in the daemon.
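For reference, both knobs are tunable: keepalive_interval and
keepalive_count in libvirtd.conf on the server side, and a client can
request keepalives itself via virConnectSetKeepAlive() as long as it runs
an event loop. A minimal client-side sketch (untested; the 5/5 values just
mirror the behaviour I described above):

#include <stdio.h>
#include <libvirt/libvirt.h>

int main(void)
{
    virConnectPtr conn;

    /* keepalive replies are dispatched from the event loop,
     * so one has to be registered and running */
    if (virEventRegisterDefaultImpl() < 0)
        return 1;

    conn = virConnectOpen("qemu:///system");
    if (!conn)
        return 1;

    /* ping every 5 seconds, give up after 5 unanswered pings */
    if (virConnectSetKeepAlive(conn, 5, 5) < 0)
        fprintf(stderr, "failed to enable keepalive\n");

    /* keep dispatching events; once pings go unanswered the
     * connection is flagged as broken and closed */
    while (virConnectIsAlive(conn) == 1) {
        if (virEventRunDefaultImpl() < 0)
            break;
    }

    virConnectClose(conn);
    return 0;
}

Attaching gdb freezes the daemon's event loop the same way stopping this
loop would, which is why the connection got torn down underneath the stuck
thread and you saw the SIGPIPE.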
Now thread 186865 seems to be gone, and libvirtd is no longer hung.
Has anyone seen anything like this before? Anyone have an idea where to
start looking?
I'd say logs. We need to find out where the problem actually occurs:
whether it is on one side of the migration (I bet on the destination), or
perhaps in virsh itself (are you doing a tunnelled migration?).
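To get useful logs, turn up debug logging in /etc/libvirt/libvirtd.conf on
both hosts and restart libvirtd; something along these lines usually gives
enough detail (the filters and output path are only a suggestion):

log_filters="1:qemu 1:libvirt 4:object 4:json 4:event 1:util"
log_outputs="1:file:/var/log/libvirt/libvirtd.log"

For the client side, setting LIBVIRT_DEBUG=1 in virsh's environment makes
the library dump its debug output to stderr.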
Michal