I have a Dell R910 running RHEL 6.2 with libvirt 0.8. This machine hosts 12 virtual guests. Every 3 to 5 months the server crashes for no apparent reason. The logs show no kernel panics or other errors that would explain the crash. The sar logs show a very high context switch count (approx. 170,000) and a high runq-sz (approx. 10 to 18). At the same time the CPUs were mostly idle, memory usage was low, there was no swapping, and disk I/O was very low as well. The same thing happens on a number of my other RHEL 6.2 servers that use KVM/libvirt/qemu for virtualization. I am curious whether anyone else has reported incidents like this. After the crash the servers all come back up, but as you can imagine, this kind of behavior is troubling, especially since these machines host production guests.
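
For anyone who wants to compare numbers, below is a minimal sketch of how the two figures could be sampled outside of sar. It assumes a Linux host whose /proc/stat exposes the `ctxt` (total context switches since boot) and `procs_running` lines. The function names are hypothetical, and `procs_running` is only a rough analogue of sar's runq-sz, not the same counter.

```python
import time


def parse_proc_stat(text):
    """Return (ctxt, procs_running) parsed from /proc/stat-style text."""
    values = {}
    for line in text.splitlines():
        fields = line.split()
        # Only the two counters of interest are kept; CPU lines are skipped.
        if len(fields) >= 2 and fields[0] in ("ctxt", "procs_running"):
            values[fields[0]] = int(fields[1])
    return values["ctxt"], values["procs_running"]


def context_switch_rate(before, after, seconds):
    """Context switches per second between two cumulative ctxt readings."""
    return (after - before) / seconds


def sample(interval=1.0, path="/proc/stat"):
    """Read the file twice, `interval` seconds apart (assumes Linux /proc)."""
    with open(path) as f:
        first, _ = parse_proc_stat(f.read())
    time.sleep(interval)
    with open(path) as f:
        second, running = parse_proc_stat(f.read())
    return context_switch_rate(first, second, interval), running
```

Calling `sample()` on the host returns a (switches per second, runnable tasks) pair, which could be logged at a short interval to see whether the spike builds up before a crash.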

Any suggestions on what to look for would be appreciated.

Regards