On 4/21/26 15:37, Jiří Denemark wrote:
When libvirtd reconnects to a running QEMU process that had an in-progress migration, qemuProcessReconnect first connects the monitor and only later recovers the migration job. During this window the async job is VIR_ASYNC_JOB_NONE, so any MIGRATION status events from QEMU are silently dropped by qemuProcessHandleMigrationStatus.
If the migration was already cancelled or completed by QEMU during this window, no further events will be emitted. When qemuMigrationSrcCancelUnattended later restores the async job and calls qemuMigrationSrcCancel with wait=true, the wait loop calls qemuDomainObjWait (virCondWait with no timeout) and blocks forever waiting for an event that will never arrive.
Fix this by re-querying QEMU migration state with qemuMigrationAnyRefreshStatus after restoring the async job but before calling qemuMigrationSrcCancel. If QEMU has already reached a terminal state, the cancel is skipped.
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Jiri Denemark <jdenemar@redhat.com>
CC: Peter Krempa <pkrempa@redhat.com>
CC: Michal Privoznik <mprivozn@redhat.com>
CC: Efim Shevrin <efim.shevrin@virtuozzo.com>
---
v1 -> v2: Instead of querying QEMU with query-migrate inside
qemuMigrationSrcCancel, use qemuMigrationAnyRefreshStatus in
qemuMigrationSrcCancelUnattended after restoring the async job to
re-check migration state before the actual cancel.
 src/qemu/qemu_migration.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)
On Fri, Mar 20, 2026 at 18:34:02 +0100, Denis V. Lunev wrote:

diff --git a/src/qemu/qemu_migration.c b/src/qemu/qemu_migration.c
index fec808ccfb..a4bd7efa09 100644
--- a/src/qemu/qemu_migration.c
+++ b/src/qemu/qemu_migration.c
@@ -7330,6 +7330,7 @@ int
 qemuMigrationSrcCancelUnattended(virDomainObj *vm,
                                  virDomainJobObj *oldJob)
 {
+    virDomainJobStatus migStatus = VIR_DOMAIN_JOB_STATUS_NONE;
     bool storage = false;
     size_t i;
@@ -7348,11 +7349,20 @@ qemuMigrationSrcCancelUnattended(virDomainObj *vm,
                            VIR_JOB_NONE);
     }

-    /* We're inside a MODIFY job and the restored MIGRATION_OUT async job is
-     * used only for processing migration events from QEMU. Thus we don't want
-     * to start a nested job for talking to QEMU.
+    /* Query the actual migration state from QEMU. The state passed to
+     * qemuProcessRecoverMigrationOut may be stale: QEMU could have
+     * reached a terminal state between that initial query and the async
+     * job restore above, with the corresponding event silently dropped.
      */
-    qemuMigrationSrcCancel(vm, VIR_ASYNC_JOB_NONE, true);
+    qemuMigrationAnyRefreshStatus(vm, VIR_ASYNC_JOB_NONE, &migStatus);
+
+    if (migStatus != VIR_DOMAIN_JOB_STATUS_CANCELED) {
+        /* We're inside a MODIFY job and the restored MIGRATION_OUT async
+         * job is used only for processing migration events from QEMU.
+         * Thus we don't want to start a nested job for talking to QEMU.
+         */
+        qemuMigrationSrcCancel(vm, VIR_ASYNC_JOB_NONE, true);
+    }

IMHO my original idea (described in the v1 review) would fix the issue in a similar way, but without this extra monitor call. qemuProcessRecoverMigration checks the current state of the migration and passes it to qemuProcessRecoverMigrationOut. We can just pass it one level down to qemuMigrationSrcCancelUnattended and act accordingly.

It doesn't matter whether the state is stale or not; in fact, even in your fix the migration may switch to canceled just after you checked its current state, and the code would then call qemuMigrationSrcCancel. So if this extra call to qemuMigrationAnyRefreshStatus were necessary, it would mean we still have a race somewhere and the fix would just make the affected window a tiny bit smaller (there are only a few lines of code from the first qemuMigrationAnyRefreshStatus call to here).
The important part is that the domain object is locked the whole time. Even if QEMU reaches QEMU_MONITOR_MIGRATION_STATUS_CANCELLED after we checked the migration status, qemuProcessHandleMigrationStatus will not ignore the event; it will just sit on virObjectLock(vm) and wait. The domain is unlocked inside qemuMigrationSrcCancel, at which point the async job is already set to VIR_ASYNC_JOB_MIGRATION_OUT and vm->job->current->privateData->stats.mig.status will be properly updated.
We only need to make sure the migration is not in QEMU_MONITOR_MIGRATION_STATUS_CANCELLED state before the async job is restored and doing so once is enough.
That said, technically this patch would work just as well, but the extra call to qemuMigrationAnyRefreshStatus is not necessary as we can use the state we got from the same call made a few lines of code before.
Jirka
I read the code again. Your approach should work, sent v3.

Thank you for the review,
Den