Background:
----------
I'm trying to debug a two-node pacemaker/corosync cluster where I
want to be able to do live migration of KVM/qemu VMs. Storage is
backed via dual-primary DRBD (yes, fencing is in place).
When moving the VM between nodes via 'pcs resource move RES NODENAME',
the live migration fails, although pacemaker does shut the VM down and
restart it on the other node.
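Concretely, the move itself is nothing special, e.g.:

  pcs resource move testvm node2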
For diagnostic purposes I've put SELinux into permissive mode and
disabled firewalld on both nodes.
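For reference, what I ran on each node was roughly the following (from
memory, so treat it as a sketch):

  setenforce 0                  # SELinux permissive for the running system
  systemctl stop firewalld      # stop the firewall now
  systemctl disable firewalld   # and keep it off across reboots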
Interesting Bit:
---------------
Debugging a bit further, I put the VM into an unmanaged state and then
tried virsh directly from the node currently running the VM:
[root@node1 ~]# virsh migrate --live --verbose testvm qemu+ssh://node2/system
error: internal error: Attempt to migrate guest to the same host node1.example.tld
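(The unmanage step itself was just the usual pcs command, something
along the lines of:

  pcs resource unmanage testvm

which is why is-managed=false shows up in the config dump further down.)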
A quick Google search points toward UUID problems; however, the two
nodes are, AFAICT, working with different UUIDs. (Substantiating info
is shown toward the end.)
I thought that since `hostname` returns only the short node name and
not the FQDN, perhaps qemu was internally confused between the short
name and the FQDN. However, fully qualifying the destination made no
difference:
[root@node1 ~]# virsh migrate --live --verbose testvm qemu+ssh://node2.example.tld/system
error: internal error: Attempt to migrate guest to the same host node1.example.tld
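In case plain name resolution is the culprit, this is roughly how I've
been sanity-checking it on node1 (and the equivalent on node2):

  hostname             # short node name
  hostname -f          # should give the FQDN, node1.example.tld
  getent hosts node2   # what 'node2' resolves to locally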
Running virsh at debug level 1 doesn't reveal anything interesting that
I can see. Running libvirtd at that level shows that node2 is seeing
node1.example.tld in the XML it receives in qemuMigrationPrepareDirect.
I'm assuming that means the wrong hostname has been calculated
somewhere before that point.
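For the curious, the extra logging came from something along these
lines (exact values may differ on your setup):

  # client side
  virsh --debug 1 migrate --live --verbose testvm qemu+ssh://node2.example.tld/system

  # daemon side, in /etc/libvirt/libvirtd.conf on both nodes, then restart libvirtd
  log_level = 1
  log_outputs = "1:file:/var/log/libvirt/libvirtd.log"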
At this point I'm grasping at straws and looking for ideas. Does anyone
have a clue-bat?
Devin
Config Info Follows:
-------------------
CentOS Linux release 7.2.1511 (Core)
libvirt on both nodes is 1.2.17-13
[root@node1 ~]# virsh sysinfo | grep uuid
<entry name='uuid'>03DE0294-0480-05A4-B906-8E0700080009</entry>
[root@node2 ~]# virsh sysinfo | grep uuid
<entry name='uuid'>03DE0294-0480-05A4-B206-320700080009</entry>
[root@node1 ~]# dmidecode -s system-uuid
03DE0294-0480-05A4-B906-8E0700080009
[root@node2 ~]# dmidecode -s system-uuid
03DE0294-0480-05A4-B206-320700080009
[root@node1 ~]# fgrep uuid /etc/libvirt/libvirtd.conf | grep -v '#'
host_uuid = "875cb1a3-437c-4cb5-a3de-9789d0233e4b"
[root@node2 ~]# fgrep uuid /etc/libvirt/libvirtd.conf | grep -v '#'
host_uuid = "643c0ef4-bb46-4dc9-9f91-13dda8d9aa33"
[root@node2 ~]# pcs config show
...
Resource: testvm (class=ocf provider=heartbeat type=VirtualDomain)
Attributes: hypervisor=qemu:///system
config=/cluster/config/libvirt/qemu/testvm.xml migration_transport=ssh
Meta Attrs: allow-migrate=true is-managed=false
Operations: start interval=0s timeout=120 (testvm-start-interval-0s)
stop interval=0s timeout=240 (testvm-stop-interval-0s)
monitor interval=10 timeout=30 (testvm-monitor-interval-10)
migrate_from interval=0 timeout=60s (testvm-migrate_from-interval-0)
migrate_to interval=0 timeout=120s (testvm-migrate_to-interval-0)
...
(The /cluster/config directory is a shared GlusterFS filesystem.)
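For completeness, the resource was originally created with something
along these lines (reconstructed, so the exact invocation may be
slightly off):

  pcs resource create testvm ocf:heartbeat:VirtualDomain \
      hypervisor="qemu:///system" \
      config="/cluster/config/libvirt/qemu/testvm.xml" \
      migration_transport=ssh \
      op start timeout=120 stop timeout=240 monitor interval=10 timeout=30 \
         migrate_from interval=0 timeout=60s migrate_to interval=0 timeout=120s \
      meta allow-migrate=true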
[root@node1 ~]# cat /etc/hosts | grep -v localhost
192.168.10.8 node1.example.tld node2
192.168.10.9 node2.example.tld node2
192.168.11.8 node1hb.example.tld node1hb
192.168.11.9 node2hb.example.tld node2hb
(node1 and node2 carry the "reachable" IPs and totem ring1; node1hb
and node2hb are a direct crossover link used for DRBD and totem
ring0.)
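(For anyone wanting to rule out the totem rings themselves, ring status
can be checked per node with:

  corosync-cfgtool -s

which prints the status of ring0 and ring1.)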