If an async job run on a domain will stop the domain at the end of the
job, a concurrently run query job can hang in qemu monitor nothing can
be done with that domain from this point on. An attempt to start such
domain results in "Timed out during operation: cannot acquire state
change lock" error.
However, quite a few things have to happen at the right time... There
must be an async job running, which stops a domain at the end. This race
was with dump --crash but other similar jobs, such as (managed)save and
migration, should be able to trigger this bug as well. While this async
job is processing its last monitor command, that is a query-migrate to
which qemu replies with status "completed", a new libvirt API that
results in a query job must arrive and stay waiting until the
query-migrate command finishes. Once query-migrate is done but before
the async job closes qemu monitor while stopping the domain, the other
thread needs to wake up and call qemuMonitorSend to send its command to
qemu. Before qemu gets a chance to respond to this command, the async
job needs to close the monitor. At this point, the query job thread is
waiting for a condition that no-one will ever signal so it never
finishes the job.
---
src/qemu/qemu_monitor.c | 14 ++++++++++++++
1 files changed, 14 insertions(+), 0 deletions(-)
diff --git a/src/qemu/qemu_monitor.c b/src/qemu/qemu_monitor.c
index 4141fb7..35648ae 100644
--- a/src/qemu/qemu_monitor.c
+++ b/src/qemu/qemu_monitor.c
@@ -750,6 +750,20 @@ void qemuMonitorClose(qemuMonitorPtr mon)
VIR_FORCE_CLOSE(mon->fd);
}
+ /* In case another thread is waiting for its monitor command to be
+ * processed, we need to wake it up with appropriate error set.
+ */
+ if (mon->msg) {
+ if (mon->lastError.code == VIR_ERR_OK) {
+ qemuReportError(VIR_ERR_OPERATION_FAILED,
+ _("Qemu monitor was closed"));
+ virCopyLastError(&mon->lastError);
+ virResetLastError();
+ }
+ mon->msg->finished = 1;
+ virCondSignal(&mon->notify);
+ }
+
if (qemuMonitorUnref(mon) > 0)
qemuMonitorUnlock(mon);
}
--
1.7.8