If a qemu process becomes unresponsive and libvirt accesses
its monitor, it will get no response and the accessing thread will
block indefinitely, until the qemu process resumes or is destroyed.
If users keep executing APIs against that domain, libvirt will
run out of worker threads and hang (if those APIs access the
monitor as well). Although such calls time out after approximately
30 seconds, which frees some workers, libvirt is unable to process
any request during that time. Even worse - if the number of
unresponsive qemu processes exceeds the size of the worker thread
pool, libvirt hangs forever, and even restarting the daemon will
not make it any better.
This patch set heals the daemon on several levels, so that none of
the situations described above causes it to hang:
1. RPC dispatching - all APIs are now annotated as 'high' or 'low'
priority, and a special thread pool is created. Low priority
APIs are still placed into the usual pool, but high priority
calls can be placed into this new pool if the former has no free worker.
Which APIs should be marked high and which low? The split
presented here is my best guess. It is not set in stone,
but logically it is not safe to annotate as a high priority
call any API which is NOT guaranteed to finish in reasonably
short time.
2. Job queue size limit - this bounds the number of threads
blocked by a stuck qemu. True, there is a timeout on this,
but if a user application keeps dispatching low priority calls,
it can still consume all (low priority) worker threads and therefore
affect other users/VMs, even if those calls time out after
approximately 30 seconds.
3. Run monitor re-connect in a separate thread per VM.
If libvirtd is restarted, it tries to reconnect to all running
qemu processes. This is potentially risky - one stuck qemu
blocks daemon startup. However, putting the monitor startup code
into one thread per VM allows libvirtd to start up, accept client
connections and work with all VMs whose monitor was successfully
re-opened. An unresponsive qemu holds the job until we open the
monitor, so a clever user application can destroy such a domain.
All APIs requiring the job will simply fail to acquire the lock.
Michal Privoznik (3):
daemon: Create priority workers pool
qemu: Introduce job queue size limit
  qemu: Deal with stuck qemu on daemon startup
daemon/libvirtd.aug | 1 +
daemon/libvirtd.c | 10 +-
daemon/libvirtd.conf | 6 +
daemon/remote.c | 26 ++
daemon/remote.h | 2 +
src/qemu/libvirtd_qemu.aug | 1 +
src/qemu/qemu.conf | 7 +
src/qemu/qemu_conf.c | 4 +
src/qemu/qemu_conf.h | 2 +
src/qemu/qemu_domain.c | 17 ++
src/qemu/qemu_domain.h | 2 +
src/qemu/qemu_driver.c | 23 +--
src/qemu/qemu_process.c | 89 ++++++-
src/remote/qemu_protocol.x | 13 +-
src/remote/remote_protocol.x | 544 +++++++++++++++++++++---------------------
src/rpc/gendispatch.pl | 48 ++++-
src/rpc/virnetserver.c | 32 +++-
src/rpc/virnetserver.h | 6 +-
src/util/threadpool.c | 38 ++-
src/util/threadpool.h | 1 +
20 files changed, 554 insertions(+), 318 deletions(-)
--
1.7.3.4