[libvirt] virsh reconnect vs. keepalive issue

I'm trying to debug a new virsh command, and had two terminals up, one running './run gdb tools/virsh' (I originally ran virsh directly, but re-ran under gdb to get a trace) and the other running './run gdb daemon/libvirtd'. Because I forgot to use the -k0 argument to disable keepalives, and took too long while at a breakpoint on the libvirtd terminal, my virsh session tried to report a keepalive failure:

[New Thread 0x7fffe7b8c700 (LWP 7143)]
2014-05-20 23:40:22.337+0000: 7138: info : libvirt version: 1.2.5
2014-05-20 23:40:22.337+0000: 7138: warning : virKeepAliveTimerInternal:143 : No response from client 0x55555586a520 after 6 keepalive messages in 35 seconds
2014-05-20 23:40:22.337+0000: 7143: warning : virKeepAliveTimerInternal:143 : No response from client 0x55555586a520 after 6 keepalive messages in 35 seconds
error: internal error: received hangup / error event on socket

but then it just hangs there. A backtrace shows that it is hanging trying to reconnect:

(gdb) bt
#0  0x00007ffff39129dd in poll () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007ffff798e3c1 in virNetClientIOEventLoop (client=0x55555586ae10, thiscall=0x55555586b090) at rpc/virnetclient.c:1514
#2  0x00007ffff798ecbf in virNetClientIO (client=0x55555586ae10, thiscall=0x55555586b090) at rpc/virnetclient.c:1786
#3  0x00007ffff798f5ab in virNetClientSendInternal (client=0x55555586ae10, msg=0x55555586ada0, expectReply=true, nonBlock=false) at rpc/virnetclient.c:1958
#4  0x00007ffff798f648 in virNetClientSendWithReply (client=0x55555586ae10, msg=0x55555586ada0) at rpc/virnetclient.c:1986
#5  0x00007ffff7990580 in virNetClientProgramCall (prog=0x55555586afd0, client=0x55555586ae10, serial=0, proc=66, noutfds=0, outfds=0x0, ninfds=0x0, infds=0x0, args_filter=0x7ffff394f270 <__GI_xdr_void>, args=0x0, ret_filter=0x7ffff798152c <xdr_remote_auth_list_ret>, ret=0x7fffffffd6f0) at rpc/virnetclientprogram.c:329
#6  0x00007ffff7975d4c in callFull (conn=0x555555869fb0, priv=0x55555586ab30, flags=0, fdin=0x0, fdinlen=0, fdout=0x0, fdoutlen=0x0, proc_nr=66, args_filter=0x7ffff394f270 <__GI_xdr_void>, args=0x0, ret_filter=0x7ffff798152c <xdr_remote_auth_list_ret>, ret=0x7fffffffd6f0 "") at remote/remote_driver.c:6544
#7  0x00007ffff7975e25 in call (conn=0x555555869fb0, priv=0x55555586ab30, flags=0, proc_nr=66, args_filter=0x7ffff394f270 <__GI_xdr_void>, args=0x0, ret_filter=0x7ffff798152c <xdr_remote_auth_list_ret>, ret=0x7fffffffd6f0 "") at remote/remote_driver.c:6566
#8  0x00007ffff795f7f1 in remoteAuthenticate (conn=0x555555869fb0, priv=0x55555586ab30, auth=0x7ffff7d71520 <virConnectAuthDefault>, authtype=0x0) at remote/remote_driver.c:3939
#9  0x00007ffff7957f0a in doRemoteOpen (conn=0x555555869fb0, priv=0x55555586ab30, auth=0x7ffff7d71520 <virConnectAuthDefault>, flags=0) at remote/remote_driver.c:973
#10 0x00007ffff795890b in remoteConnectOpen (conn=0x555555869fb0, auth=0x7ffff7d71520 <virConnectAuthDefault>, flags=0) at remote/remote_driver.c:1190
#11 0x00007ffff78fc5e5 in do_open (name=0x0, auth=0x7ffff7d71520 <virConnectAuthDefault>, flags=0) at libvirt.c:1147
#12 0x00007ffff78fd093 in virConnectOpenAuth (name=0x0, auth=0x7ffff7d71520 <virConnectAuthDefault>, flags=0) at libvirt.c:1390
#13 0x0000555555575a67 in vshConnect (ctl=0x7fffffffdbc0, uri=0x0, readonly=false) at virsh.c:337
#14 0x0000555555575c23 in vshReconnect (ctl=0x7fffffffdbc0) at virsh.c:383
#15 0x000055555557943a in vshCommandRun (ctl=0x7fffffffdbc0, cmd=0x5555557f4900) at virsh.c:1853
#16 0x000055555557d8de in main (argc=5, argv=0x7fffffffdda8) at virsh.c:3617

It seems odd that virsh is hanging rather than gracefully exiting: it is trying too hard to reconnect in spite of being forcefully disconnected due to a keepalive timeout. But I'm not sure what, if anything, we could tweak to make this failure mode not require a Ctrl-C on my end to get rid of the failed virsh.
It also reiterates the idea that we really ought to allow a user-specified timeout for virConnect, rather than blocking forever.

--
Eric Blake   eblake redhat com   +1-919-301-3266
Libvirt virtualization library http://libvirt.org

On Tue, May 20, 2014 at 05:49:54PM -0600, Eric Blake wrote:
> I'm trying to debug a new virsh command, and had two terminals up, one running './run gdb tools/virsh' (I originally ran virsh directly, but re-ran under gdb to get a trace) and the other running './run gdb daemon/libvirtd'. Because I forgot to use the -k0 argument to disable keepalives, and took too long while at a breakpoint on the libvirtd terminal, my virsh session tried to report a keepalive failure:
>
> [New Thread 0x7fffe7b8c700 (LWP 7143)]
> 2014-05-20 23:40:22.337+0000: 7138: info : libvirt version: 1.2.5
> 2014-05-20 23:40:22.337+0000: 7138: warning : virKeepAliveTimerInternal:143 : No response from client 0x55555586a520 after 6 keepalive messages in 35 seconds
> 2014-05-20 23:40:22.337+0000: 7143: warning : virKeepAliveTimerInternal:143 : No response from client 0x55555586a520 after 6 keepalive messages in 35 seconds
> error: internal error: received hangup / error event on socket
I tried reproducing that and I get no hang; the only difference is that without -k0 I *sometimes* get:

warning : virKeepAliveTimer:177 : Failed to send keepalive request to client 0x55555583d630

Other than that, everything works fine and I only see the reconnection happen after another command is called. Are you using some kind of special authentication?
> but then it just hangs there. A backtrace shows that it is hanging trying to reconnect:
>
> [backtrace snipped; see above]
> It seems odd that virsh is hanging rather than gracefully exit, because it is trying too hard to reconnect in spite of being forcefully disconnected due to a keepalive timeout. But I'm not sure what, if anything, we could tweak to make this failure mode not require a Ctrl-C on my end to get rid of the failed virsh. It also reiterates the idea that we really ought to allow for a user-specified timeout to virConnect, rather than blocking forever.
-- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list

On 05/23/2014 05:29 AM, Martin Kletzander wrote:
> On Tue, May 20, 2014 at 05:49:54PM -0600, Eric Blake wrote:
> > I'm trying to debug a new virsh command, and had two terminals up, one running './run gdb tools/virsh' (I originally ran virsh directly, but re-ran under gdb to get a trace) and the other running './run gdb daemon/libvirtd'. Because I forgot to use the -k0 argument to disable keepalives, and took too long while at a breakpoint on the libvirtd terminal, my virsh session tried to report a keepalive failure:
I also have an explicit keepalive_interval=3 in my /etc/libvirt/libvirtd.conf (rather than the default). I don't think I have any other non-default settings in my conf files at the moment, but I wonder if the choice of server settings matters for reproducing.
> I tried reproducing that and I get no hang, plus the only difference is that without -k0 I *sometimes* get:
>
> warning : virKeepAliveTimer:177 : Failed to send keepalive request to client 0x55555583d630
>
> Other than that, everything works fine and I only see the reconnection happen after another command is called. Are you using some kind of special authentication?
No, I'm just running 'virsh -c qemu:///system' from a root shell. Maybe it's the particular virsh command I was running (in my case, 'virsh blockcommit --wait' with patches not yet upstream, but upstream would be similar with 'virsh blockcopy --wait'). The single virsh command is actually a while() loop of API calls to poll for status; it could be that the reconnect works between commands that each make only one API call, but because this particular command calls the API in a while loop, it is not breaking out of the loop properly on a reconnect attempt.

--
Eric Blake   eblake redhat com   +1-919-301-3266
Libvirt virtualization library http://libvirt.org
participants (2)
- Eric Blake
- Martin Kletzander