Hi everyone,
When doing a live migration, Linux guests will frequently get stuck and
become unresponsive, while the CPU utilization on the host for that
guest goes to 100%. If you wait long enough, the guest seems to always
recover, with a dmesg entry such as:
Clocksource tsc unstable (delta = 35882846234 ns)
So the TSC jumped by nearly 36 seconds. I've actually had a test VM
stay stuck for over 11 minutes before becoming responsive again
(delta = 662463064082 ns).
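
As a sanity check on those two figures, here is a small Python sketch that converts the reported deltas from nanoseconds into seconds and minutes. The helper name describe_delta is made up for this example; the two input values are the ones from the dmesg lines above.

```python
def describe_delta(delta_ns):
    """Convert a clocksource delta in nanoseconds to (seconds, 'Nm S.Ss' text)."""
    seconds = delta_ns / 1e9
    # Split into whole minutes and the remaining seconds.
    minutes, rest = divmod(seconds, 60)
    return seconds, f"{int(minutes)}m {rest:.1f}s"


# The two deltas reported in dmesg after the guests recovered.
for delta_ns in (35882846234, 662463064082):
    seconds, text = describe_delta(delta_ns)
    print(f"{delta_ns} ns = {seconds:.2f} s ({text})")
```

The first value comes out at just under 36 seconds and the second at a little over 11 minutes, which matches how long the guests appeared to hang.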
Migrations often fail when going from server A to B, but will then work
fine in the other direction.
Both servers and all guests are synchronized to the same NTP servers,
and their clocks are well within 1 ms of one another.
Both hosts are running Ubuntu 13.04 with these versions (from Ubuntu
packages):
Kernel: 3.8.0-35-generic x86_64
Libvirt: 1.0.2
Qemu: 1.4.0
Gluster-fs: 3.4.2 (libvirt accesses the images via the filesystem, not
using libgfapi yet).
The interconnect between both machines (both for migration and gluster)
is 10GbE.
We have different guests (all Ubuntu releases, 13.04 and 13.10), and
they all seem to be affected.
Clocksource: kvm-clock on all guests.
Clock entry from the guest XML: <clock offset='utc'/>
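
For anyone who wants to compare with their own guests: a minimal sketch of how the active clocksource can be read inside a guest, assuming the usual sysfs location on Linux. The function name read_clocksource and its base parameter are made up for this example, and the script simply reports an error if the sysfs files are not there.

```python
from pathlib import Path

# Usual sysfs location of the clocksource files on Linux (an assumption
# about the guest; adjust if your system differs).
SYSFS_CLOCKSOURCE = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksource(base=SYSFS_CLOCKSOURCE):
    """Return (current clocksource, list of available clocksources)."""
    base = Path(base)
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available


try:
    current, available = read_clocksource()
    print(f"current: {current}; available: {', '.join(available)}")
except OSError as exc:
    # Not a Linux system, or sysfs is not mounted where expected.
    print(f"could not read clocksource from sysfs: {exc}")
```

On the guests described here, the current clocksource reported this way would be kvm-clock.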
Now as far as I've understood the documentation of kvm-clock, it
specifically supports live migrations, so I'm a bit surprised by these
problems. There isn't all that much information to find on this issue,
although I have found postings by others that seem to have run into the
same issues, but without a solution.
Any help would be much appreciated.
Regards, Paul Boven.
--
Paul Boven <boven(a)jive.nl> +31 (0)521-596547
Unix/Linux/Networking specialist
Joint Institute for VLBI in Europe -
www.jive.nl
VLBI - It's a fringe science