On 12/6/18 10:12 AM, Lentes, Bernd wrote:
> Hi,
>
> I have a two-node cluster with several domains as resources. During testing I
> tried several times to migrate some domains concurrently.
> Usually it succeeded, but rarely it failed. I found one clue in the log:
>
> Dec 03 16:03:02 ha-idg-1 libvirtd[3252]: 2018-12-03 15:03:02.758+0000: 3252:
> error : virKeepAliveTimerInternal:143 : internal error: connection closed due
> to keepalive timeout
>
> The domains are configured similarly:
> primitive vm_geneious VirtualDomain \
> params config="/mnt/san/share/config.xml" \
> params hypervisor="qemu:///system" \
> params migration_transport=ssh \
> op start interval=0 timeout=120 trace_ra=1 \
> op stop interval=0 timeout=130 trace_ra=1 \
> op monitor interval=30 timeout=25 trace_ra=1 \
> op migrate_from interval=0 timeout=300 trace_ra=1 \
> op migrate_to interval=0 timeout=300 trace_ra=1 \
> meta allow-migrate=true target-role=Started is-managed=true \
> utilization cpu=2 hv_memory=8000
>
> What is the algorithm to discover the port used for live migration ?
> I have the impression that "params migration_transport=ssh" is worthless;
> port 22 isn't involved in the live migration itself.
> My experience is that TCP ports > 49151 are used for the migration, but the
> exact procedure isn't clear to me.
> Does live migration use TCP port 49152 first, and one port higher for each
> following domain ?
> E.g. 49152, 49153 and 49154 for the concurrent live migration of three domains.
>
> Why does live migration of three domains usually succeed, although on both
> hosts only 49152 and 49153 are open ?
> Is the migration not really concurrent, but sometimes sequential ?
>
> Bernd
>
Hi,
I tried to narrow down the problem.
My first assumption was that something was wrong with the network between the hosts.
I opened ports 49152 - 49172 in the firewall - the problem persisted.
So I deactivated the firewall on both nodes - the problem persisted.
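As far as I know, with the default non-tunnelled migration the ssh transport only
carries the libvirt control connection; the memory stream itself goes over a separate
TCP connection, for which libvirt's QEMU driver picks a free port from the range
49152-49215 on the destination, one per concurrent incoming migration. A rough way to
see which port a running migration actually uses, or to pin it explicitly - hostnames
and the port are only examples:

    # on the destination, while a migration is running
    ss -tlnp | grep qemu

    # force a specific migration port for one domain
    virsh migrate --live --verbose mausdb qemu+ssh://ha-idg-2/system tcp://ha-idg-2:49152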
Then I wanted to exclude the HA cluster software (pacemaker).
I unmanaged the VirtualDomain resources in pacemaker and migrated the domains with
virsh - the problem persisted.
I wrote a script that migrates three domains sequentially from host A to host B and
vice versa via virsh.
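A loop along these lines produces the output shown below (a sketch only; the
destination URI is an assumption based on the hostnames above):

    #!/bin/bash
    # migrate three domains one after the other, printing a timestamp before each step
    DEST=qemu+ssh://ha-idg-2/system
    for dom in sim geneious mausdb; do
        date
        echo "migrate $dom"
        virsh migrate --live --verbose "$dom" "$DEST"
    done
    date
    echo "Guests on ha-idg-1:"
    virsh list --all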
I raised the log level of libvirtd and found something in the log which may be the
culprit.
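Raising the log level can be done, for example, in /etc/libvirt/libvirtd.conf,
followed by a restart of the daemon (a sketch; the values are only an example):

    # /etc/libvirt/libvirtd.conf
    log_level = 2        # 4 = errors, 3 = warnings, 2 = information, 1 = debug
    log_outputs = "2:file:/var/log/libvirt/libvirtd.log"

and then 'systemctl restart libvirtd'.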
This is the output of my script:
Thu Dec 6 17:02:53 CET 2018
migrate sim
Migration: [100 %]
Thu Dec 6 17:03:07 CET 2018
migrate geneious
Migration: [100 %]
Thu Dec 6 17:03:16 CET 2018
migrate mausdb
Migration: [ 99 %]error: operation failed: migration job: unexpectedly failed
<===== error !
Thu Dec 6 17:05:32 CET 2018 <======== time of error
Guests on ha-idg-1:
Id Name State
----------------------------------------------------
1 sim running
2 geneious running
- mausdb shut off
migrate to ha-idg-2
Thu Dec 6 17:05:32 CET 2018
This is what journalctl showed:
Dec 06 17:05:32 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:32.481+0000: 12553: info :
virKeepAliveTimerInternal:136 : RPC_KEEPALIVE_TIMEOUT: ka=0x55b2bb937740
client=0x55b2bb930d50 countToDeath=0 idle=30
Dec 06 17:05:32 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:32.481+0000: 12553: error :
virKeepAliveTimerInternal:143 : internal error: connection closed due to keepalive
timeout
Dec 06 17:05:32 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:32.481+0000: 12553: info :
virObjectUnref:259 : OBJECT_UNREF: obj=0x55b2bb937740
Dec 06 17:05:27 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:27.476+0000: 12553: info :
virKeepAliveTimerInternal:136 : RPC_KEEPALIVE_TIMEOUT: ka=0x55b2bb937740
client=0x55b2bb930d50 countToDeath=1 idle=25
Dec 06 17:05:27 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:27.476+0000: 12553: info :
virKeepAliveMessage:107 : RPC_KEEPALIVE_SEND: ka=0x55b2bb937740 client=0x55b2bb930d50
prog=1801807216 vers=1 proc=1
Dec 06 17:05:22 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:22.471+0000: 12553: info :
virKeepAliveTimerInternal:136 : RPC_KEEPALIVE_TIMEOUT: ka=0x55b2bb937740
client=0x55b2bb930d50 countToDeath=2 idle=20
Dec 06 17:05:22 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:22.471+0000: 12553: info :
virKeepAliveMessage:107 : RPC_KEEPALIVE_SEND: ka=0x55b2bb937740 client=0x55b2bb930d50
prog=1801807216 vers=1 proc=1
Dec 06 17:05:17 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:17.466+0000: 12553: info :
virKeepAliveTimerInternal:136 : RPC_KEEPALIVE_TIMEOUT: ka=0x55b2bb937740
client=0x55b2bb930d50 countToDeath=3 idle=15
Dec 06 17:05:17 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:17.466+0000: 12553: info :
virKeepAliveMessage:107 : RPC_KEEPALIVE_SEND: ka=0x55b2bb937740 client=0x55b2bb930d50
prog=1801807216 vers=1 proc=1
Dec 06 17:05:12 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:12.460+0000: 12553: info :
virKeepAliveTimerInternal:136 : RPC_KEEPALIVE_TIMEOUT: ka=0x55b2bb937740
client=0x55b2bb930d50 countToDeath=4 idle=10
Dec 06 17:05:12 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:12.460+0000: 12553: info :
virKeepAliveMessage:107 : RPC_KEEPALIVE_SEND: ka=0x55b2bb937740 client=0x55b2bb930d50
prog=1801807216 vers=1 proc=1
Dec 06 17:05:07 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:07.455+0000: 12553: info :
virKeepAliveTimerInternal:136 : RPC_KEEPALIVE_TIMEOUT: ka=0x55b2bb937740
client=0x55b2bb930d50 countToDeath=5 idle=5
Dec 06 17:05:07 ha-idg-1 libvirtd[12553]: 2018-12-06 16:05:07.455+0000: 12553: info :
virKeepAliveMessage:107 : RPC_KEEPALIVE_SEND: ka=0x55b2bb937740 client=0x55b2bb930d50
prog=1801807216 vers=1 proc=1
There seems to be a kind of countdown. From googling I found that this may be related
to libvirtd.conf:
# Keepalive settings for the admin interface
#admin_keepalive_interval = 5
#admin_keepalive_count = 5
What is meant by the "admin interface" ? virsh ?
virt-admin, which you can use to change some admin settings of libvirtd, e.g.
log_level. You are interested in the keepalive settings above those in
libvirtd.conf, specifically
#keepalive_interval = 5
#keepalive_count = 5
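For context: in the journal excerpt above, countToDeath starts at 5 and drops by one
every 5 seconds of idle time, which matches the defaults keepalive_interval = 5 and
keepalive_count = 5 - after roughly 30 seconds without a response the connection is
closed (the idle=30 in the final message). A sketch of relaxing these on both hosts,
the values are only an example:

    # /etc/libvirt/libvirtd.conf
    keepalive_interval = 10
    keepalive_count = 10

followed by a restart of libvirtd on both nodes.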
What is meant by "client" in libvirtd.conf ? virsh ?
Yes, virsh is a client, as is virt-manager or any application connecting to
libvirtd.
Why do I have regular timeouts although my two hosts are very performant ?
128 GB RAM, 16 cores, and two bonded 1 GBit/s network adapters on each host.
During migration I don't see much load, and nearly no waiting for I/O.
I'd think concurrently migrating 3 VMs on a 1G network might cause some
congestion :-).
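If congestion is the suspicion, one way to test would be to cap the bandwidth per
migration so that three parallel streams stay below the 1 Gbit/s link - a sketch, the
value (in MiB/s) is only an example:

    virsh migrate --live --verbose --bandwidth 30 mausdb qemu+ssh://ha-idg-2/system

    # or, for a migration that is already running:
    virsh migrate-setspeed mausdb 30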
Should I set admin_keepalive_interval to -1 ?
You should try 'keepalive_interval = -1'. You can also avoid sending keepalive
messages from virsh with the '-k' option, e.g. 'virsh -k 0 migrate ...'.
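For example (a sketch; the destination URI is an assumption based on the hostnames
above):

    # disable keepalive probes for this virsh client only
    virsh -k 0 migrate --live --verbose mausdb qemu+ssh://ha-idg-2/system

or, to disable them on the daemon side, on both hosts:

    # /etc/libvirt/libvirtd.conf
    keepalive_interval = -1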
If this doesn't help, are you in a position to test a newer libvirt, preferably
master or the recent 4.10.0 release?
Regards,
Jim