[libvirt] virsh reconnect vs. keepalive issue

I'm trying to debug a new virsh command, and had two terminals up, one running './run gdb tools/virsh' (I originally ran virsh directly, but re-ran under gdb to get a trace) and the other running './run gdb daemon/libvirtd'. Because I forgot to use the -k0 argument to disable keepalives, and took too long while at a breakpoint on the libvirtd terminal, my virsh session tried to report a keepalive failure:

[New Thread 0x7fffe7b8c700 (LWP 7143)]
2014-05-20 23:40:22.337+0000: 7138: info : libvirt version: 1.2.5
2014-05-20 23:40:22.337+0000: 7138: warning : virKeepAliveTimerInternal:143 : No response from client 0x55555586a520 after 6 keepalive messages in 35 seconds
2014-05-20 23:40:22.337+0000: 7143: warning : virKeepAliveTimerInternal:143 : No response from client 0x55555586a520 after 6 keepalive messages in 35 seconds
error: internal error: received hangup / error event on socket

but then it just hangs there. A backtrace shows that it is hanging trying to reconnect:

(gdb) bt
#0  0x00007ffff39129dd in poll () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007ffff798e3c1 in virNetClientIOEventLoop (client=0x55555586ae10, thiscall=0x55555586b090) at rpc/virnetclient.c:1514
#2  0x00007ffff798ecbf in virNetClientIO (client=0x55555586ae10, thiscall=0x55555586b090) at rpc/virnetclient.c:1786
#3  0x00007ffff798f5ab in virNetClientSendInternal (client=0x55555586ae10, msg=0x55555586ada0, expectReply=true, nonBlock=false) at rpc/virnetclient.c:1958
#4  0x00007ffff798f648 in virNetClientSendWithReply (client=0x55555586ae10, msg=0x55555586ada0) at rpc/virnetclient.c:1986
#5  0x00007ffff7990580 in virNetClientProgramCall (prog=0x55555586afd0, client=0x55555586ae10, serial=0, proc=66, noutfds=0, outfds=0x0, ninfds=0x0, infds=0x0, args_filter=0x7ffff394f270 <__GI_xdr_void>, args=0x0, ret_filter=0x7ffff798152c <xdr_remote_auth_list_ret>, ret=0x7fffffffd6f0) at rpc/virnetclientprogram.c:329
#6  0x00007ffff7975d4c in callFull (conn=0x555555869fb0, priv=0x55555586ab30, flags=0, fdin=0x0, fdinlen=0, fdout=0x0, fdoutlen=0x0, proc_nr=66, args_filter=0x7ffff394f270 <__GI_xdr_void>, args=0x0, ret_filter=0x7ffff798152c <xdr_remote_auth_list_ret>, ret=0x7fffffffd6f0 "") at remote/remote_driver.c:6544
#7  0x00007ffff7975e25 in call (conn=0x555555869fb0, priv=0x55555586ab30, flags=0, proc_nr=66, args_filter=0x7ffff394f270 <__GI_xdr_void>, args=0x0, ret_filter=0x7ffff798152c <xdr_remote_auth_list_ret>, ret=0x7fffffffd6f0 "") at remote/remote_driver.c:6566
#8  0x00007ffff795f7f1 in remoteAuthenticate (conn=0x555555869fb0, priv=0x55555586ab30, auth=0x7ffff7d71520 <virConnectAuthDefault>, authtype=0x0) at remote/remote_driver.c:3939
#9  0x00007ffff7957f0a in doRemoteOpen (conn=0x555555869fb0, priv=0x55555586ab30, auth=0x7ffff7d71520 <virConnectAuthDefault>, flags=0) at remote/remote_driver.c:973
#10 0x00007ffff795890b in remoteConnectOpen (conn=0x555555869fb0, auth=0x7ffff7d71520 <virConnectAuthDefault>, flags=0) at remote/remote_driver.c:1190
#11 0x00007ffff78fc5e5 in do_open (name=0x0, auth=0x7ffff7d71520 <virConnectAuthDefault>, flags=0) at libvirt.c:1147
#12 0x00007ffff78fd093 in virConnectOpenAuth (name=0x0, auth=0x7ffff7d71520 <virConnectAuthDefault>, flags=0) at libvirt.c:1390
#13 0x0000555555575a67 in vshConnect (ctl=0x7fffffffdbc0, uri=0x0, readonly=false) at virsh.c:337
#14 0x0000555555575c23 in vshReconnect (ctl=0x7fffffffdbc0) at virsh.c:383
#15 0x000055555557943a in vshCommandRun (ctl=0x7fffffffdbc0, cmd=0x5555557f4900) at virsh.c:1853
#16 0x000055555557d8de in main (argc=5, argv=0x7fffffffdda8) at virsh.c:3617

It seems odd that virsh is hanging rather than gracefully exiting: it is trying too hard to reconnect in spite of being forcefully disconnected due to a keepalive timeout. But I'm not sure what, if anything, we could tweak to make this failure mode not require a Ctrl-C on my end to get rid of the failed virsh.
It also reiterates the idea that we really ought to allow a user-specified timeout for virConnect, rather than blocking forever.

--
Eric Blake   eblake redhat com   +1-919-301-3266
Libvirt virtualization library http://libvirt.org

On Tue, May 20, 2014 at 05:49:54PM -0600, Eric Blake wrote:
> I'm trying to debug a new virsh command, and had two terminals up, one running './run gdb tools/virsh' (I originally ran virsh directly, but re-ran under gdb to get a trace) and the other running './run gdb daemon/libvirtd'. Because I forgot to use the -k0 argument to disable keepalives, and took too long while at a breakpoint on the libvirtd terminal, my virsh session tried to report a keepalive failure:
>
> [New Thread 0x7fffe7b8c700 (LWP 7143)]
> 2014-05-20 23:40:22.337+0000: 7138: info : libvirt version: 1.2.5
> 2014-05-20 23:40:22.337+0000: 7138: warning : virKeepAliveTimerInternal:143 : No response from client 0x55555586a520 after 6 keepalive messages in 35 seconds
> 2014-05-20 23:40:22.337+0000: 7143: warning : virKeepAliveTimerInternal:143 : No response from client 0x55555586a520 after 6 keepalive messages in 35 seconds
> error: internal error: received hangup / error event on socket
I tried reproducing that and I get no hang; the only difference is that without -k0 I *sometimes* get:

warning : virKeepAliveTimer:177 : Failed to send keepalive request to client 0x55555583d630

Other than that, everything works fine and I only see the reconnection happen after another command is called. Are you using some kind of special authentication?
> but then it just hangs there. A backtrace shows that it is hanging trying to reconnect:
>
> [backtrace snipped; see above]
> It seems odd that virsh is hanging rather than gracefully exit, because it is trying too hard to reconnect in spite of being forcefully disconnected due to a keepalive timeout. But I'm not sure what, if anything, we could tweak to make this failure mode not require a Ctrl-C on my end to get rid of the failed virsh. It also reiterates the idea that we really ought to allow for a user-specified timeout to virConnect, rather than blocking forever.
-- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list

On 05/23/2014 05:29 AM, Martin Kletzander wrote:
> On Tue, May 20, 2014 at 05:49:54PM -0600, Eric Blake wrote:
> > I'm trying to debug a new virsh command, and had two terminals up, one running './run gdb tools/virsh' (I originally ran virsh directly, but re-ran under gdb to get a trace) and the other running './run gdb daemon/libvirtd'. Because I forgot to use the -k0 argument to disable keepalives, and took too long while at a breakpoint on the libvirtd terminal, my virsh session tried to report a keepalive failure:
I also have an explicit keepalive_interval=3 in my /etc/libvirt/libvirtd.conf (rather than the default). I don't think I have any other non-default settings in my conf files at the moment, but I wonder if the choice of server settings matters for reproducing.
> I tried reproducing that and I get no hang, plus the only difference is that without -k0 I *sometimes* get:
>
> warning : virKeepAliveTimer:177 : Failed to send keepalive request to client 0x55555583d630
>
> Other than that, everything works fine and I only see the reconnection happen after another command is called. Are you using some kind of special authentication?
No, I'm just running 'virsh -c qemu:///system' from a root shell. Maybe it's the particular virsh command I was running (in my case, 'virsh blockcommit --wait' with patches not yet upstream, but upstream would be similar with 'virsh blockcopy --wait'). The single virsh command is actually a while() loop of API calls to poll for status; it could be that the reconnect works between commands that each make only one API call, but because this particular command calls the API in a while loop, it is not breaking out of the loop properly on a reconnect attempt.

--
Eric Blake   eblake redhat com   +1-919-301-3266
Libvirt virtualization library http://libvirt.org
participants (2)
- Eric Blake
- Martin Kletzander