[libvirt] Clock problems on live migration

2 Apr 2014

Hi everyone,

When doing a live migration, Linux guests will frequently get stuck and 
become unresponsive, while the CPU utilization on the host for that 
guest goes to 100%. If you wait long enough, the guest seems to always 
recover, with a dmesg entry such as:
Clocksource tsc unstable (delta = 35882846234 ns)

So the TSC jumped by nearly 36 seconds. I've actually had a test VM
stay stuck for over 11 minutes before becoming responsive again
(delta = 662463064082 ns).
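
For anyone comparing their own logs: a minimal Python sketch that pulls
the delta out of such a dmesg line and converts it to seconds. The
regular expression only assumes the message format quoted above, and the
helper name delta_seconds is made up for illustration.

```python
import re

# Matches the kernel message format quoted above (an assumption: other
# kernel versions may word this message differently).
DELTA_RE = re.compile(r"Clocksource tsc unstable \(delta = (-?\d+) ns\)")


def delta_seconds(line):
    """Return the reported delta in seconds, or None if the line does not match."""
    match = DELTA_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)) / 1e9


if __name__ == "__main__":
    for line in (
        "Clocksource tsc unstable (delta = 35882846234 ns)",
        "Clocksource tsc unstable (delta = 662463064082 ns)",
    ):
        secs = delta_seconds(line)
        print(f"{secs:.1f} s = {secs / 60:.1f} min")
```

The two deltas above come out at roughly 36 seconds and 11 minutes,
which matches how long the guests were unresponsive.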

Migrations often fail when going from server A to B, but will then work 
fine in the other direction.

Both servers and guests are locked to the same NTP servers, and are well
within 1 ms of one another.

Both hosts are running Ubuntu 13.04 with these versions (from Ubuntu 
packages):

Kernel: 3.8.0-35-generic x86_64
Libvirt: 1.0.2
Qemu: 1.4.0
Gluster-fs: 3.4.2 (libvirt accesses the images via the filesystem, not
using libgfapi yet).
The interconnect between both machines (both for migration and gluster) 
is 10GbE.

We have different guests (all Ubuntu releases, 13.04 and 13.10), and 
they all seem to be affected.

Clocksource: kvm-clock on all guests.
Clock entry from the guest XML: <clock offset='utc'/>
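
The active clocksource can be checked from inside a guest via sysfs. A
small Python sketch, assuming the standard Linux sysfs location; the
function name read_clocksource is only illustrative:

```python
from pathlib import Path

# Standard Linux sysfs directory for the active clocksource (assumed to be
# present; it is missing on non-Linux systems or unusual kernel configs).
SYSFS_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksource(sysfs_dir=SYSFS_DIR):
    """Return (current, available) clocksources, or (None, []) if unreadable."""
    base = Path(sysfs_dir)
    try:
        current = (base / "current_clocksource").read_text().strip()
        available = (base / "available_clocksource").read_text().split()
    except OSError:
        return None, []
    return current, available


if __name__ == "__main__":
    current, available = read_clocksource()
    print("current:", current)
    print("available:", ", ".join(available))
```

Running this on both the source and destination side of a migration
shows whether the guest has fallen back from kvm-clock to another
clocksource.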

Now as far as I've understood the documentation of kvm-clock, it
specifically supports live migrations, so I'm a bit surprised by these 
problems. There isn't much information to be found on this issue,
although I have found postings by others who seem to have run into the
same problem, but without a solution.

Any help would be much appreciated.

Regards, Paul Boven.
-- 
Paul Boven <boven@jive.nl> +31 (0)521-596547
Unix/Linux/Networking specialist
Joint Institute for VLBI in Europe - www.jive.nl
VLBI - It's a fringe science