On 4/21/26 15:37, Jiří Denemark wrote:
When libvirtd reconnects to a running QEMU process that had an in-progress migration, qemuProcessReconnect first connects the monitor and only later recovers the migration job. During this window the async job is VIR_ASYNC_JOB_NONE, so any MIGRATION status events from QEMU are silently dropped by qemuProcessHandleMigrationStatus.
If the migration was already cancelled or completed by QEMU during this window, no further events will be emitted. When qemuMigrationSrcCancelUnattended later restores the async job and calls qemuMigrationSrcCancel with wait=true, the wait loop calls qemuDomainObjWait (virCondWait with no timeout) and blocks forever waiting for an event that will never arrive.
Fix this by re-querying QEMU migration state with qemuMigrationAnyRefreshStatus after restoring the async job but before calling qemuMigrationSrcCancel. If QEMU has already reached a terminal state, the cancel is skipped.
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Jiri Denemark <jdenemar@redhat.com>
CC: Peter Krempa <pkrempa@redhat.com>
CC: Michal Privoznik <mprivozn@redhat.com>
CC: Efim Shevrin <efim.shevrin@virtuozzo.com>
---
v1 -> v2: Instead of querying QEMU with query-migrate inside
qemuMigrationSrcCancel, use qemuMigrationAnyRefreshStatus in
qemuMigrationSrcCancelUnattended after restoring the async job to
re-check migration state before the actual cancel.
 src/qemu/qemu_migration.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)
On Fri, Mar 20, 2026 at 18:34:02 +0100, Denis V. Lunev wrote:

diff --git a/src/qemu/qemu_migration.c b/src/qemu/qemu_migration.c
index fec808ccfb..a4bd7efa09 100644
--- a/src/qemu/qemu_migration.c
+++ b/src/qemu/qemu_migration.c
@@ -7330,6 +7330,7 @@ int
 qemuMigrationSrcCancelUnattended(virDomainObj *vm,
                                  virDomainJobObj *oldJob)
 {
+    virDomainJobStatus migStatus = VIR_DOMAIN_JOB_STATUS_NONE;
     bool storage = false;
     size_t i;
@@ -7348,11 +7349,20 @@ qemuMigrationSrcCancelUnattended(virDomainObj *vm,
                            VIR_JOB_NONE);
     }

-    /* We're inside a MODIFY job and the restored MIGRATION_OUT async job is
-     * used only for processing migration events from QEMU. Thus we don't want
-     * to start a nested job for talking to QEMU.
+    /* Query the actual migration state from QEMU. The state passed to
+     * qemuProcessRecoverMigrationOut may be stale: QEMU could have
+     * reached a terminal state between that initial query and the async
+     * job restore above, with the corresponding event silently dropped.
      */
-    qemuMigrationSrcCancel(vm, VIR_ASYNC_JOB_NONE, true);
+    qemuMigrationAnyRefreshStatus(vm, VIR_ASYNC_JOB_NONE, &migStatus);
+
+    if (migStatus != VIR_DOMAIN_JOB_STATUS_CANCELED) {
+        /* We're inside a MODIFY job and the restored MIGRATION_OUT async
+         * job is used only for processing migration events from QEMU.
+         * Thus we don't want to start a nested job for talking to QEMU.
+         */
+        qemuMigrationSrcCancel(vm, VIR_ASYNC_JOB_NONE, true);
+    }

IMHO my original idea (described in the v1 review) would fix the issue in a similar way, but without this extra monitor call. qemuProcessRecoverMigration checks the current state of the migration and passes it to qemuProcessRecoverMigrationOut. We can just pass it one level down to qemuMigrationSrcCancelUnattended and act accordingly.

It doesn't matter whether the state is stale or not; in fact, even in your fix the migration may switch to canceled just after you checked its current state, and the code would then call qemuMigrationSrcCancel. So if this extra call to qemuMigrationAnyRefreshStatus were necessary, it would mean we still have a race somewhere and the fix would just make the affected window a tiny bit smaller (there are only a few lines of code from the first qemuMigrationAnyRefreshStatus call to here).
The important part is that the domain object is locked the whole time. Even if QEMU reaches QEMU_MONITOR_MIGRATION_STATUS_CANCELLED after we checked the migration status, qemuProcessHandleMigrationStatus will not ignore the event; it will just sit on virObjectLock(vm) and wait. The domain is unlocked inside qemuMigrationSrcCancel, at which point the async job is already set to VIR_ASYNC_JOB_MIGRATION_OUT and vm->job->current->privateData->stats.mig.status will be properly updated.
We only need to make sure the migration is not in QEMU_MONITOR_MIGRATION_STATUS_CANCELLED state before the async job is restored and doing so once is enough.
That said, technically this patch would work just as well, but the extra call to qemuMigrationAnyRefreshStatus is not necessary as we can use the state we got from the same call made a few lines of code before.
Jirka
I read the code again. Your approach should work, sent v3.

Thank you for the review,
Den