[libvirt] [PATCH 0/2] qemu: Cancel/forbid migration on I/O error

When qemu has paused itself due to an I/O error, it's not safe to migrate it, as it might still hold state in kernel buffers. These patches forbid migrating such a domain and cancel an ongoing migration if an I/O error happens.

Peter Krempa (2):
  qemu: Cancel migration if guest encounters I/O error while migrating
  qemu: Forbid migration of machines with I/O errors

 src/qemu/qemu_migration.c | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

--
1.8.2.1

During a live migration the guest may receive a disk access I/O error. In this state the guest is unable to continue running on a remote host after migration, as some state may be present in the kernel and not migrated.

With this patch, the migration is canceled in such a case, so that the guest either continues running on the source once the I/O issues are resolved, or has to be destroyed anyway.
---
 src/qemu/qemu_migration.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/src/qemu/qemu_migration.c b/src/qemu/qemu_migration.c
index ca79bc2..8e57521 100644
--- a/src/qemu/qemu_migration.c
+++ b/src/qemu/qemu_migration.c
@@ -1686,6 +1686,7 @@ qemuMigrationWaitForCompletion(virQEMUDriverPtr driver, virDomainObjPtr vm,
 {
     qemuDomainObjPrivatePtr priv = vm->privateData;
     const char *job;
+    int pauseReason;
 
     switch (priv->job.asyncJob) {
     case QEMU_ASYNC_JOB_MIGRATION_OUT:
@@ -1707,6 +1708,12 @@ qemuMigrationWaitForCompletion(virQEMUDriverPtr driver, virDomainObjPtr vm,
         /* Poll every 50ms for progress & to allow cancellation */
         struct timespec ts = { .tv_sec = 0, .tv_nsec = 50 * 1000 * 1000ull };
 
+        /* cancel migration if disk I/O error is emitted while migrating */
+        if (priv->job.asyncJob == QEMU_ASYNC_JOB_MIGRATION_OUT &&
+            virDomainObjGetState(vm, &pauseReason) == VIR_DOMAIN_PAUSED &&
+            pauseReason == VIR_DOMAIN_PAUSED_IOERROR)
+            goto cancel;
+
         if (qemuMigrationUpdateJobStatus(driver, vm, job, asyncJob) < 0)
             goto cleanup;
 
@@ -1728,6 +1735,20 @@ cleanup:
         return 0;
     else
         return -1;
+
+cancel:
+    if (virDomainObjIsActive(vm)) {
+        if (qemuDomainObjEnterMonitorAsync(driver, vm,
+                                           priv->job.asyncJob) == 0) {
+            qemuMonitorMigrateCancel(priv->mon);
+            qemuDomainObjExitMonitor(driver, vm);
+        }
+    }
+
+    priv->job.info.type = VIR_DOMAIN_JOB_FAILED;
+    virReportError(VIR_ERR_OPERATION_FAILED,
+                   _("%s: %s"), job, _("failed due to I/O error"));
+    return -1;
 }

--
1.8.2.1
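The control flow this patch adds to qemuMigrationWaitForCompletion() can be illustrated with a small, self-contained C sketch. It is not libvirt code: the enums, poll_sample and wait_for_migration() are hypothetical stand-ins for the domain state, the job status query and the QEMU monitor, and the cancel counter takes the place of qemuMonitorMigrateCancel(). It only shows the ordering: the I/O error check runs on every poll before the job status is looked at, and only for outgoing migration jobs.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for libvirt's VIR_DOMAIN_* state and
 * VIR_DOMAIN_PAUSED_* reason values; not the real enums. */
typedef enum { DOM_RUNNING, DOM_PAUSED } dom_state;
typedef enum { PAUSE_UNKNOWN, PAUSE_MIGRATION, PAUSE_IOERROR } pause_reason;

/* What one iteration of the wait loop observes (hypothetical). */
typedef struct {
    dom_state state;
    pause_reason reason;
    bool migration_done;    /* job status query reports completion */
} poll_sample;

typedef enum {
    WAIT_COMPLETED,
    WAIT_CANCELED_IOERROR,
    WAIT_NO_MORE_SAMPLES
} wait_result;

/* Walks the samples the way the patched loop polls the domain:
 * the I/O error check runs before the job status is consulted, and it
 * only applies when the job is an outgoing migration. */
static wait_result wait_for_migration(const poll_sample *samples, int n,
                                      bool migration_out, int *cancel_calls)
{
    for (int i = 0; i < n; i++) {
        if (migration_out &&
            samples[i].state == DOM_PAUSED &&
            samples[i].reason == PAUSE_IOERROR) {
            (*cancel_calls)++;  /* stands in for qemuMonitorMigrateCancel() */
            fprintf(stderr, "migration job: failed due to I/O error\n");
            return WAIT_CANCELED_IOERROR;
        }
        if (samples[i].migration_done)
            return WAIT_COMPLETED;
        /* the real loop sleeps for 50 ms here before polling again */
    }
    return WAIT_NO_MORE_SAMPLES;
}
```

Jobs other than QEMU_ASYNC_JOB_MIGRATION_OUT that wait in the same loop are modeled by passing migration_out = false; in that case the sketch, like the patch, ignores the pause reason.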

On 11.06.2013 11:49, Peter Krempa wrote:
During a live migration the guest may receive a disk access I/O error.

I'd reword this: ... may receive an I/O error while accessing a disk.
In this state the guest is unable to continue running on a remote host after migration as some state may be present in the kernel and not migrated.
With this patch, the migration is canceled in such a case, so that the guest either continues running on the source once the I/O issues are resolved, or has to be destroyed anyway.
ACK

Michal

Such a machine can't be successfully migrated unless the I/O error has recovered, and migrating it might lead to data corruption. Forbid this kind of migration.
---
 src/qemu/qemu_migration.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/src/qemu/qemu_migration.c b/src/qemu/qemu_migration.c
index 8e57521..97daaa0 100644
--- a/src/qemu/qemu_migration.c
+++ b/src/qemu/qemu_migration.c
@@ -1423,6 +1423,7 @@ qemuMigrationIsAllowed(virQEMUDriverPtr driver, virDomainObjPtr vm,
                        virDomainDefPtr def, bool remote)
 {
     int nsnapshots;
+    int pauseReason;
     bool forbid;
     int i;
 
@@ -1445,6 +1446,15 @@ qemuMigrationIsAllowed(virQEMUDriverPtr driver, virDomainObjPtr vm,
                            nsnapshots);
             return false;
         }
+
+        /* cancel migration if disk I/O error is emitted while migrating */
+        if (virDomainObjGetState(vm, &pauseReason) == VIR_DOMAIN_PAUSED &&
+            pauseReason == VIR_DOMAIN_PAUSED_IOERROR) {
+            virReportError(VIR_ERR_OPERATION_INVALID, "%s",
+                           _("cannot migrate domain with I/O error"));
+            return false;
+        }
+
     }
 
     if (virDomainHasDiskMirror(vm)) {
--
1.8.2.1
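The precondition added to qemuMigrationIsAllowed() boils down to a single state/reason test. The sketch below is a hypothetical, self-contained model of it rather than libvirt code: dom_state, pause_reason and migration_is_allowed() are illustrative stand-ins for virDomainObjGetState(), VIR_DOMAIN_PAUSED / VIR_DOMAIN_PAUSED_IOERROR and virReportError(), with the error message copied from the patch.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins; libvirt's real values are VIR_DOMAIN_PAUSED and
 * VIR_DOMAIN_PAUSED_IOERROR, obtained via virDomainObjGetState(). */
typedef enum { DOM_RUNNING, DOM_PAUSED } dom_state;
typedef enum { PAUSE_UNKNOWN, PAUSE_USER, PAUSE_IOERROR } pause_reason;

/* Refuse to start a migration while the guest is paused because of an
 * I/O error. On refusal the message is written to errbuf (in place of
 * virReportError()) and false is returned; otherwise errbuf is cleared. */
static bool migration_is_allowed(dom_state state, pause_reason reason,
                                 char *errbuf, size_t errlen)
{
    if (state == DOM_PAUSED && reason == PAUSE_IOERROR) {
        snprintf(errbuf, errlen, "cannot migrate domain with I/O error");
        return false;
    }
    if (errlen > 0)
        errbuf[0] = '\0';
    return true;
}
```

Only the I/O error pause reason is rejected; a guest paused for any other reason (for example by the user) still passes the check, which matches the condition in the hunk above.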

On 11.06.2013 11:49, Peter Krempa wrote:
Such a machine can't be successfully migrated unless the I/O error has recovered, and migrating it might lead to data corruption. Forbid this kind of migration.
Do we want to document this behaviour change?

Michal

On 06/11/13 14:05, Michal Privoznik wrote:
On 11.06.2013 11:49, Peter Krempa wrote:
Such a machine can't be successfully migrated unless the I/O error has recovered, and migrating it might lead to data corruption. Forbid this kind of migration.
Do we want to document this behaviour change?
How about:

diff --git a/src/libvirt.c b/src/libvirt.c
index 620dbdd..6413a1e 100644
--- a/src/libvirt.c
+++ b/src/libvirt.c
@@ -5809,41 +5809,42 @@ error:
  * guest ABI,
  *
  * If a hypervisor supports renaming domains during migration,
  * the dname parameter specifies the new name for the domain.
  * Setting dname to NULL keeps the domain name the same. If domain
  * renaming is not supported by the hypervisor, dname must be NULL or
  * else an error will be returned.
  *
  * The maximum bandwidth (in MiB/s) that will be used to do migration
  * can be specified with the bandwidth parameter. If set to 0,
  * libvirt will choose a suitable default. Some hypervisors do
  * not support this feature and will return an error if bandwidth
  * is not 0.
  *
  * To see which features are supported by the current hypervisor,
  * see virConnectGetCapabilities, /capabilities/host/migration_features.
  *
  * There are many limitations on migration imposed by the underlying
  * technology - for example it may not be possible to migrate between
  * different processors even with the same architecture, or between
- * different types of hypervisor.
+ * different types of hypervisor. Migration of domains that encountered
+ * errors may not be possible.
  *
  * Returns 0 if the migration succeeded, -1 upon error.
  */
 int
 virDomainMigrateToURI2(virDomainPtr domain,
                        const char *dconnuri,
                        const char *miguri,
                        const char *dxml,
                        unsigned long flags,
                        const char *dname,
                        unsigned long bandwidth)
 {
     VIR_DOMAIN_DEBUG(domain, "dconnuri=%s, miguri=%s, dxml=%s, "
                      "flags=%lx, dname=%s, bandwidth=%lu",
                      NULLSTR(dconnuri), NULLSTR(miguri), NULLSTR(dxml),
                      flags, NULLSTR(dname), bandwidth);
 
     virResetLastError();
 
     /* First checkout the source */

On 11.06.2013 14:39, Peter Krempa wrote:
On 06/11/13 14:05, Michal Privoznik wrote:
On 11.06.2013 11:49, Peter Krempa wrote:
Such a machine can't be successfully migrated unless the I/O error has recovered, and migrating it might lead to data corruption. Forbid this kind of migration.
Do we want to document this behaviour change?
How about:
diff --git a/src/libvirt.c b/src/libvirt.c
index 620dbdd..6413a1e 100644
--- a/src/libvirt.c
+++ b/src/libvirt.c
@@ -5809,41 +5809,42 @@ error:
[...]
  * There are many limitations on migration imposed by the underlying
  * technology - for example it may not be possible to migrate between
  * different processors even with the same architecture, or between
- * different types of hypervisor.
+ * different types of hypervisor. Migration of domains that encountered
+ * errors may not be possible.
And continue with: "Moreover, in case of an I/O error, depending on the hypervisor, the migration may be canceled."

ACK

Michal

On 11/06/2013 05:49, Peter Krempa wrote:
When qemu has paused itself due to an I/O error, it's not safe to migrate it, as it might still hold state in kernel buffers. These patches forbid migrating such a domain and cancel an ongoing migration if an I/O error happens.
This actually should be supported; it certainly is in QEMU. What kind of state is in the kernel, and what is the BZ for this patch? When a system call returns, it should be all done.

Paolo
participants (3)
- Michal Privoznik
- Paolo Bonzini
- Peter Krempa