Re: [libvirt] [PATCH 1/2] Timeout QEMU monitor replies after 30 seconds

23 Jun 2011


On 06/22/2011 11:05 AM, Jiri Denemark wrote:
...
On Wed, Jun 22, 2011 at 16:47:18 +0100, Daniel P. Berrange wrote:
...
If the QEMU process has been stopped (kill -STOP/gdb), or the
QEMU process has live-locked itself, then we will never get a
reply from the monitor. We should not wait forever in this
case, but instead time out after a reasonable amount of time.
NB if the host has high CPU load, or a single monitor command
intentionally takes a long time, then this will cause bogus
failures. In the case of high CPU load, arguably the guest
should have been migrated elsewhere, since you can't effectively
manage guests on a host if QEMU is taking > 30 seconds to reply
to simple commands. Since we use background migration, there
should not be any commands which take significant time to
execute any more.

The thing I'm most concerned about is that it is far too easy to get into such
situations, especially since the disk cache subsystem in the Linux kernel is not
the best thing in the world. While I agree that running guests on a loaded host
is not very clever and guests should rather be migrated elsewhere, such a
situation doesn't have to be intentional. In other words, in case of a
malfunction of some kind (some processes go crazy, network disruptions, ...)
QEMU may require more than the timeout to respond, and we will penalize an
innocent QEMU process because we won't be able to control it anymore even after
the issues are fixed.

Is there any way to measure time spent by the child process, rather than
just relying on wall-time elapsed?  That is, when libvirt hits 30
seconds of wall time in waiting for a monitor, can it then check whether
the child process has accumulated any execution time (likely hung) vs.
no execution time (likely a starved system situation), and only give up
in the former case?
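
As a rough illustration of that idea (not code from the patch, and in Python
only for brevity), one could sample the child's accumulated CPU time from
/proc/<pid>/stat when the wait starts and again when the wall-clock timeout
fires. The helper names parse_cpu_ticks, read_cpu_ticks and classify_timeout,
and the min_ticks threshold, are hypothetical. This is Linux-specific, and a
process stopped with kill -STOP also accumulates no CPU time, so "no progress"
does not by itself prove starvation.

```python
def parse_cpu_ticks(stat_line: str) -> int:
    """Return utime + stime (in clock ticks) from a /proc/<pid>/stat line."""
    # Field 2 (comm) is wrapped in parentheses and may itself contain spaces
    # or ')', so split only what follows the last ')'.
    rest = stat_line[stat_line.rindex(")") + 2:].split()
    # rest[0] is field 3 (state); utime is field 14 and stime is field 15.
    return int(rest[11]) + int(rest[12])


def read_cpu_ticks(pid: int) -> int:
    """Read the CPU time accumulated so far by a process (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        return parse_cpu_ticks(f.read())


def classify_timeout(ticks_before: int, ticks_after: int, min_ticks: int = 1) -> str:
    """Apply the heuristic above once the wall-clock timeout has expired.

    CPU time consumed without a reply suggests a hung or live-locked child;
    no CPU time consumed suggests it never got to run (starved host), or
    that it is stopped.
    """
    if ticks_after - ticks_before >= min_ticks:
        return "hung"
    return "starved"
```

The caller would record read_cpu_ticks(pid) before sending the monitor
command, and give up only if classify_timeout() returns "hung" after 30
seconds of wall time.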

-- 
Eric Blake   eblake@redhat.com    +1-801-349-2682
Libvirt virtualization library http://libvirt.org