
On Wed, Jun 22, 2011 at 16:47:18 +0100, Daniel P. Berrange wrote:
> If the QEMU process has been stopped (kill -STOP/gdb), or the QEMU process has live-locked itself, then we will never get a reply from the monitor. We should not wait forever in this case, but instead time out after a reasonable amount of time.
>
> NB, if the host has a high CPU load, or a single monitor command intentionally takes a long time, then this will cause bogus failures. In the case of high CPU load, arguably the guest should have been migrated elsewhere, since you can't effectively manage guests on a host if QEMU is taking > 30 seconds to reply to simple commands. Since we use background migration, there should no longer be any commands that take significant time to execute.
The thing I'm most concerned about is that it is far too easy to get into such situations, especially since the disk cache subsystem in the Linux kernel is not the best thing in the world. While I agree that running guests on a loaded host is not very clever and that guests should rather be migrated elsewhere, such a situation doesn't have to be intentional. In other words, in case of a malfunction of some kind (some processes go crazy, network disruptions, ...), QEMU may take longer than the timeout to respond, and we will penalize an innocent QEMU process because we won't be able to control it anymore, even after the issues are fixed.

Jirka