On Tue, Feb 11, 2025 at 14:20:27 +0100, Peter Krempa wrote:
On incoming migration qemu doesn't activate the block graph nodes right
away. This is done to properly facilitate locking of the images.
The block nodes are normally re-activated when starting the CPUs after
migration, but when the VM is left paused (e.g. when a paused VM was
migrated) qemu does not re-activate the block nodes.
This means that blockjobs which need to write to an existing backing
chain member would fail. Read-only jobs generally succeeded with older
qemu versions, but even that was not intended.
With new qemu you'll instead always get an error when attempting to
access an inactive node:
error: internal error: unable to execute QEMU command 'blockdev-mirror':
Inactive 'libvirt-1-storage' can't be a backing child of active
'#block052'
This is the case both for explicit blockjobs (virsh blockcopy) and for
non-shared-storage migration (virsh migrate --copy-storage-all).
Since qemu now provides the 'blockdev-set-active' QMP command, which can
re-activate the nodes on demand, we can re-activate them in the same
situations in which we'd be starting the vCPUs if the VM weren't left
paused.
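For reference, the same QMP command can also be poked at by hand via the
unsupported QMP passthrough interface, e.g. to inspect a paused incoming
VM. A minimal sketch follows; the QMP argument names ("active", optional
"node-name") are my reading of the QAPI schema, so treat them as an
assumption:

    /* Illustration only: re-activate all block nodes of a domain through
     * libvirt's unsupported QMP passthrough (virDomainQemuMonitorCommand).
     * The QMP argument names are an assumption based on the QAPI schema. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <libvirt/libvirt.h>
    #include <libvirt/libvirt-qemu.h>

    int main(int argc, char **argv)
    {
        virConnectPtr conn = virConnectOpen("qemu:///system");
        virDomainPtr dom = NULL;
        char *reply = NULL;
        /* omitting "node-name" acts on the whole block graph */
        const char *cmd = "{ \"execute\": \"blockdev-set-active\","
                          "  \"arguments\": { \"active\": true } }";

        if (!conn || argc < 2)
            return EXIT_FAILURE;
        if (!(dom = virDomainLookupByName(conn, argv[1])))
            return EXIT_FAILURE;
        if (virDomainQemuMonitorCommand(dom, cmd, &reply, 0) < 0)
            return EXIT_FAILURE;

        printf("%s\n", reply);
        free(reply);
        virDomainFree(dom);
        virConnectClose(conn);
        return EXIT_SUCCESS;
    }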
The only exception is the source side of a failed post-copy migration,
as the VM already ran on the destination and thus will never run on the
source again, even when the migration is recovered.
Resolves:
https://issues.redhat.com/browse/RHEL-78398
Signed-off-by: Peter Krempa <pkrempa(a)redhat.com>
---
src/qemu/qemu_migration.c | 46 +++++++++++++++++++++++++++++++++++----
1 file changed, 42 insertions(+), 4 deletions(-)
diff --git a/src/qemu/qemu_migration.c b/src/qemu/qemu_migration.c
index 53bbbee629..159521fbf5 100644
--- a/src/qemu/qemu_migration.c
+++ b/src/qemu/qemu_migration.c
...
@@ -238,11 +271,14 @@ qemuMigrationSrcRestoreDomainState(virQEMUDriver *driver, virDomainObj *vm)
     if (preMigrationState != VIR_DOMAIN_RUNNING ||
         state != VIR_DOMAIN_PAUSED ||
-        reason == VIR_DOMAIN_PAUSED_POSTCOPY_FAILED)
-        return;
+        reason == VIR_DOMAIN_PAUSED_POSTCOPY_FAILED) {
+        if (reason == VIR_DOMAIN_PAUSED_IOERROR)
+            VIR_DEBUG("Domain is paused due to I/O error, skipping resume");
+
+        /* Don't reactivate disks on post-copy failure */
+        if (reason != VIR_DOMAIN_PAUSED_POSTCOPY_FAILED)
+            qemuMigrationBlockNodesReactivate(vm, VIR_ASYNC_JOB_MIGRATION_OUT);
 
-    if (reason == VIR_DOMAIN_PAUSED_IOERROR) {
-        VIR_DEBUG("Domain is paused due to I/O error, skipping resume");
         return;
     }
The new condition is incorrect. Previously the reason ==
VIR_DOMAIN_PAUSED_IOERROR part was effectively in the "else" branch of
the first condition, while you moved it into the "then" part. So when
expanded, the old code was similar to:
    if (preMigrationState == VIR_DOMAIN_RUNNING &&
        state == VIR_DOMAIN_PAUSED &&
        reason == VIR_DOMAIN_PAUSED_IOERROR)
        ...
This combination will not be properly handled by the new condition and
the domain would be resumed.
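I.e. to keep the old behaviour the I/O error case probably needs to be
part of the early-return branch as well, something along these lines
(untested sketch, only to illustrate the idea; whether to also
reactivate the nodes on an I/O error pause is a separate question):

    if (preMigrationState != VIR_DOMAIN_RUNNING ||
        state != VIR_DOMAIN_PAUSED ||
        reason == VIR_DOMAIN_PAUSED_POSTCOPY_FAILED ||
        reason == VIR_DOMAIN_PAUSED_IOERROR) {
        if (reason == VIR_DOMAIN_PAUSED_IOERROR)
            VIR_DEBUG("Domain is paused due to I/O error, skipping resume");

        /* don't reactivate disks on post-copy failure */
        if (reason != VIR_DOMAIN_PAUSED_POSTCOPY_FAILED)
            qemuMigrationBlockNodesReactivate(vm, VIR_ASYNC_JOB_MIGRATION_OUT);

        return;
    }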
@@ -6795,6 +6831,8 @@ qemuMigrationDstFinishFresh(virQEMUDriver *driver,
The attempt to resume the domain may fail in the context missing above;
shouldn't we call Reactivate in that case as well? And in fact the same
applies to qemuMigrationSrcRestoreDomainState, where resuming might fail
too (see the sketch below the quoted hunk).
         if (*inPostCopy)
             *doKill = false;
+    } else {
+        qemuMigrationBlockNodesReactivate(vm, VIR_ASYNC_JOB_MIGRATION_IN);
     }
 
     if (mig->jobData) {
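For the source side something like the following could work (untested
sketch: the resume call stands in for the existing code in
qemuMigrationSrcRestoreDomainState and the running-reason constant is
from memory, so treat both as assumptions):

    if (qemuProcessStartCPUs(driver, vm,
                             VIR_DOMAIN_RUNNING_MIGRATION_CANCELED,
                             VIR_ASYNC_JOB_MIGRATION_OUT) < 0) {
        VIR_WARN("Failed to resume domain after failed migration");

        /* the resume failed, so the block nodes may still be inactive and
         * need the explicit re-activation here as well */
        qemuMigrationBlockNodesReactivate(vm, VIR_ASYNC_JOB_MIGRATION_OUT);
    }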
Jirka