[libvirt PATCH 0/3] qemu: Fix canceling migration

This series fixes commit v8.7.0-57-g2d7b22b561 "qemu: Make qemuMigrationSrcCancel optionally synchronous" which was broken in several ways (although the overall idea was correct). Jiri Denemark (3): qemu_migration: Properly wait for migration to be canceled qemu: Do not crash when canceling migration on reconnect NEWS: Document daemon crash on reconnect NEWS.rst | 5 ++++ src/qemu/qemu_migration.c | 61 ++++++++++++++++++++++++++++----------- src/qemu/qemu_migration.h | 3 +- src/qemu/qemu_process.c | 4 +-- 4 files changed, 53 insertions(+), 20 deletions(-) -- 2.38.0

In my commit v8.7.0-57-g2d7b22b561 I attempted to make qemuMigrationSrcCancel synchronous, but failed. When we are canceling migration after some kind of error which is detected in in qemuMigrationSrcWaitForCompletion, jobData->status will be set to VIR_DOMAIN_JOB_STATUS_FAILED regardless on QEMU state. So instead of relying on the translated jobData->status in qemuMigrationSrcIsCanceled we need to check the migration status we get from QEMU MIGRATION event. Signed-off-by: Jiri Denemark <jdenemar@redhat.com> --- src/qemu/qemu_migration.c | 33 +++++++++++++++++++++------------ 1 file changed, 21 insertions(+), 12 deletions(-) diff --git a/src/qemu/qemu_migration.c b/src/qemu/qemu_migration.c index 33105cf07b..21c870334d 100644 --- a/src/qemu/qemu_migration.c +++ b/src/qemu/qemu_migration.c @@ -4597,21 +4597,30 @@ static bool qemuMigrationSrcIsCanceled(virDomainObj *vm) { virDomainJobData *jobData = vm->job->current; + qemuDomainJobDataPrivate *priv = jobData->privateData; + qemuMonitorMigrationStatus status = priv->stats.mig.status; - qemuMigrationUpdateJobType(jobData); - switch (jobData->status) { - case VIR_DOMAIN_JOB_STATUS_FAILED: - case VIR_DOMAIN_JOB_STATUS_CANCELED: - case VIR_DOMAIN_JOB_STATUS_COMPLETED: - case VIR_DOMAIN_JOB_STATUS_NONE: + switch (status) { + case QEMU_MONITOR_MIGRATION_STATUS_COMPLETED: + case QEMU_MONITOR_MIGRATION_STATUS_ERROR: + case QEMU_MONITOR_MIGRATION_STATUS_CANCELLED: + case QEMU_MONITOR_MIGRATION_STATUS_INACTIVE: + VIR_DEBUG("QEMU migration status: %s; waiting finished", + qemuMonitorMigrationStatusTypeToString(status)); return true; - case VIR_DOMAIN_JOB_STATUS_MIGRATING: - case VIR_DOMAIN_JOB_STATUS_POSTCOPY: - case VIR_DOMAIN_JOB_STATUS_PAUSED: - case VIR_DOMAIN_JOB_STATUS_HYPERVISOR_COMPLETED: - case VIR_DOMAIN_JOB_STATUS_POSTCOPY_PAUSED: - case VIR_DOMAIN_JOB_STATUS_ACTIVE: + case QEMU_MONITOR_MIGRATION_STATUS_POSTCOPY: + case QEMU_MONITOR_MIGRATION_STATUS_POSTCOPY_RECOVER: + case QEMU_MONITOR_MIGRATION_STATUS_POSTCOPY_PAUSED: + case QEMU_MONITOR_MIGRATION_STATUS_PRE_SWITCHOVER: + case QEMU_MONITOR_MIGRATION_STATUS_DEVICE: + case QEMU_MONITOR_MIGRATION_STATUS_SETUP: + case QEMU_MONITOR_MIGRATION_STATUS_ACTIVE: + case QEMU_MONITOR_MIGRATION_STATUS_CANCELLING: + case QEMU_MONITOR_MIGRATION_STATUS_WAIT_UNPLUG: + case QEMU_MONITOR_MIGRATION_STATUS_LAST: + VIR_DEBUG("QEMU migration status: %s; still waiting", + qemuMonitorMigrationStatusTypeToString(status)); break; } -- 2.38.0

When libvirtd is restarted during an active outgoing migration (or snapshot, save, or dump which are internally implemented as migration) it wants to cancel the migration. But by a mistake in commit v8.7.0-57-g2d7b22b561 the qemuMigrationSrcCancel function is called with wait == true, which leads to an instant crash by dereferencing NULL pointer stored in priv->job.current. When canceling migration to file (snapshot, save, dump), we don't need to wait until it is really canceled as no migration capabilities or parameters need to be restored. On the other hand we need to wait when canceling outgoing migration and since we don't have virDomainJobData at this point, we have to temporarily restore the migration job to make sure we can process MIGRATION events from QEMU. Signed-off-by: Jiri Denemark <jdenemar@redhat.com> --- src/qemu/qemu_migration.c | 28 +++++++++++++++++++++++----- src/qemu/qemu_migration.h | 3 ++- src/qemu/qemu_process.c | 4 ++-- 3 files changed, 27 insertions(+), 8 deletions(-) diff --git a/src/qemu/qemu_migration.c b/src/qemu/qemu_migration.c index 21c870334d..76e486fbc7 100644 --- a/src/qemu/qemu_migration.c +++ b/src/qemu/qemu_migration.c @@ -4633,8 +4633,7 @@ qemuMigrationSrcIsCanceled(virDomainObj *vm) * cancellation to complete. * * The thread (the caller itself in most cases) which is watching the migration - * will do all the cleanup once migration is canceled. If no thread is watching - * the migration, use qemuMigrationSrcCancelUnattended instead. + * will do all the cleanup once migration is canceled. */ int qemuMigrationSrcCancel(virDomainObj *vm, @@ -6979,11 +6978,12 @@ qemuMigrationSrcToFile(virQEMUDriver *driver, virDomainObj *vm, /** - * This function is supposed to be used only when no other thread is watching - * the migration. + * This function is supposed to be used only to while reconnecting to a domain + * with an active migration job. */ int -qemuMigrationSrcCancelUnattended(virDomainObj *vm) +qemuMigrationSrcCancelUnattended(virDomainObj *vm, + virDomainJobObj *oldJob) { bool storage = false; size_t i; @@ -6991,8 +6991,26 @@ qemuMigrationSrcCancelUnattended(virDomainObj *vm) VIR_DEBUG("Canceling unfinished outgoing migration of domain %s", vm->def->name); + /* Make sure MIGRATION event handler can store the current migration state + * in the job. + */ + if (!vm->job->current) { + qemuDomainObjRestoreAsyncJob(vm, VIR_ASYNC_JOB_MIGRATION_OUT, + oldJob->phase, oldJob->asyncStarted, + VIR_DOMAIN_JOB_OPERATION_MIGRATION_OUT, + QEMU_DOMAIN_JOB_STATS_TYPE_MIGRATION, + VIR_DOMAIN_JOB_STATUS_FAILED, + VIR_JOB_NONE); + } + + /* We're inside a MODIFY job and the restored MIGRATION_OUT async job is + * used only for processing migration events from QEMU. Thus we don't want + * to start a nested job for talking to QEMU. + */ qemuMigrationSrcCancel(vm, VIR_ASYNC_JOB_NONE, true); + virDomainObjEndAsyncJob(vm); + for (i = 0; i < vm->def->ndisks; i++) { virDomainDiskDef *disk = vm->def->disks[i]; qemuDomainDiskPrivate *diskPriv = QEMU_DOMAIN_DISK_PRIVATE(disk); diff --git a/src/qemu/qemu_migration.h b/src/qemu/qemu_migration.h index fbea45ad4e..3d7c2702aa 100644 --- a/src/qemu/qemu_migration.h +++ b/src/qemu/qemu_migration.h @@ -241,7 +241,8 @@ qemuMigrationSrcToFile(virQEMUDriver *driver, ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2) G_GNUC_WARN_UNUSED_RESULT; int -qemuMigrationSrcCancelUnattended(virDomainObj *vm); +qemuMigrationSrcCancelUnattended(virDomainObj *vm, + virDomainJobObj *oldJob); int qemuMigrationSrcCancel(virDomainObj *vm, diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index e6fd9395de..a8101e1233 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -3532,7 +3532,7 @@ qemuProcessRecoverMigrationOut(virQEMUDriver *driver, */ VIR_DEBUG("Cancelling unfinished migration of domain %s", vm->def->name); - if (qemuMigrationSrcCancelUnattended(vm) < 0) { + if (qemuMigrationSrcCancelUnattended(vm, job) < 0) { VIR_WARN("Could not cancel ongoing migration of domain %s", vm->def->name); } @@ -3691,7 +3691,7 @@ qemuProcessRecoverJob(virQEMUDriver *driver, case VIR_ASYNC_JOB_SAVE: case VIR_ASYNC_JOB_DUMP: case VIR_ASYNC_JOB_SNAPSHOT: - qemuMigrationSrcCancel(vm, VIR_ASYNC_JOB_NONE, true); + qemuMigrationSrcCancel(vm, VIR_ASYNC_JOB_NONE, false); /* resume the domain but only if it was paused as a result of * running a migration-to-file operation. Although we are * recovering an async job, this function is run at startup -- 2.38.0

Signed-off-by: Jiri Denemark <jdenemar@redhat.com> --- NEWS.rst | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/NEWS.rst b/NEWS.rst index b6bcb5524d..9fcca3e33f 100644 --- a/NEWS.rst +++ b/NEWS.rst @@ -44,6 +44,11 @@ v8.9.0 (unreleased) used for guest CPU definition in domain XML, but it could have caused failures to start a domain with ``check='full'`` in some cases. + * qemu: Do not crash after restart with active migration + + In 8.8.0 release libvirt daemon would crash after it was restarted during + an active outgoing migration. + v8.8.0 (2022-10-03) =================== -- 2.38.0

On a Friday in 2022, Jiri Denemark wrote:
This series fixes commit v8.7.0-57-g2d7b22b561 "qemu: Make qemuMigrationSrcCancel optionally synchronous" which was broken in several ways (although the overall idea was correct).
Jiri Denemark (3): qemu_migration: Properly wait for migration to be canceled qemu: Do not crash when canceling migration on reconnect NEWS: Document daemon crash on reconnect
NEWS.rst | 5 ++++ src/qemu/qemu_migration.c | 61 ++++++++++++++++++++++++++++----------- src/qemu/qemu_migration.h | 3 +- src/qemu/qemu_process.c | 4 +-- 4 files changed, 53 insertions(+), 20 deletions(-)
Reviewed-by: Ján Tomko <jtomko@redhat.com> Jano
participants (2)
-
Jiri Denemark
-
Ján Tomko