[libvirt] [PATCH v2 0/3] Propagate storage errors on migration

This is practically v2 to: https://www.redhat.com/archives/libvir-list/2015-January/msg00230.html The approach is merely the same - listening to BLOCK_JOB_* events instead of asking on the monitor repeatedly. But the code looks more polished. Michal Privoznik (3): qemuProcessHandleBlockJob: Set disk->mirrorState more often qemuProcessHandleBlockJob: Take status into account qemuMigrationDriveMirror: Listen to events src/qemu/qemu_migration.c | 37 +++++++++++++++++-------------------- src/qemu/qemu_process.c | 36 +++++++++++++++++++++--------------- 2 files changed, 38 insertions(+), 35 deletions(-) -- 2.0.5

Currently, upon BLOCK_JOB_* event, disk->mirrorState is not updated each time. The callback code handling the events checks if a blockjob was started via our public APIs prior to setting the mirrorState. However, some block jobs may be started internally (e.g. during storage migration), in which case we don't bother with setting disk->mirror (there's nothing we can set it to anyway), or other fields. But it will come handy if we update the mirrorState in these cases too. The event wasn't delivered just for fun - we've started the job after all. So, in this commit, the mirrorState is set to whatever job status we've obtained. Of course, there are some actions on some statuses that we want to perform. But instead of if {} else if {} else {} ... enumeration, let's move to switch(). Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_process.c | 35 ++++++++++++++++++++--------------- 1 file changed, 20 insertions(+), 15 deletions(-) diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index d5df60d..7dc7d2b 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -1041,7 +1041,8 @@ qemuProcessHandleBlockJob(qemuMonitorPtr mon ATTRIBUTE_UNUSED, /* If we completed a block pull or commit, then update the XML * to match. */ - if (status == VIR_DOMAIN_BLOCK_JOB_COMPLETED) { + switch ((virConnectDomainEventBlockJobStatus) status) { + case VIR_DOMAIN_BLOCK_JOB_COMPLETED: if (disk->mirrorState == VIR_DOMAIN_DISK_MIRROR_STATE_PIVOT) { if (vm->newDef) { int indx = virDomainDiskIndexByName(vm->newDef, disk->dst, @@ -1091,20 +1092,24 @@ qemuProcessHandleBlockJob(qemuMonitorPtr mon ATTRIBUTE_UNUSED, disk->mirrorJob = VIR_DOMAIN_BLOCK_JOB_TYPE_UNKNOWN; ignore_value(qemuDomainDetermineDiskChain(driver, vm, disk, true, true)); - } else if (disk->mirror && - (type == VIR_DOMAIN_BLOCK_JOB_TYPE_COPY || - type == VIR_DOMAIN_BLOCK_JOB_TYPE_ACTIVE_COMMIT)) { - if (status == VIR_DOMAIN_BLOCK_JOB_READY) { - disk->mirrorState = VIR_DOMAIN_DISK_MIRROR_STATE_READY; - save = true; - } else if (status == VIR_DOMAIN_BLOCK_JOB_FAILED || - status == VIR_DOMAIN_BLOCK_JOB_CANCELED) { - virStorageSourceFree(disk->mirror); - disk->mirror = NULL; - disk->mirrorState = VIR_DOMAIN_DISK_MIRROR_STATE_NONE; - disk->mirrorJob = VIR_DOMAIN_BLOCK_JOB_TYPE_UNKNOWN; - save = true; - } + break; + + case VIR_DOMAIN_BLOCK_JOB_READY: + disk->mirrorState = VIR_DOMAIN_DISK_MIRROR_STATE_READY; + save = true; + break; + + case VIR_DOMAIN_BLOCK_JOB_FAILED: + case VIR_DOMAIN_BLOCK_JOB_CANCELED: + virStorageSourceFree(disk->mirror); + disk->mirror = NULL; + disk->mirrorState = VIR_DOMAIN_DISK_MIRROR_STATE_NONE; + disk->mirrorJob = VIR_DOMAIN_BLOCK_JOB_TYPE_UNKNOWN; + save = true; + break; + + case VIR_DOMAIN_BLOCK_JOB_LAST: + break; } } -- 2.0.5

On Wed, Feb 11, 2015 at 13:51:09 +0100, Michal Privoznik wrote:
Currently, upon BLOCK_JOB_* event, disk->mirrorState is not updated each time. The callback code handling the events checks if a blockjob was started via our public APIs prior to setting the mirrorState. However, some block jobs may be started internally (e.g. during storage migration), in which case we don't bother with setting disk->mirror (there's nothing we can set it to anyway), or other fields. But it will come handy if we update the mirrorState in these cases too. The event wasn't delivered just for fun - we've started the job after all.
So, in this commit, the mirrorState is set to whatever job status we've obtained. Of course, there are some actions on some statuses that we want to perform. But instead of if {} else if {} else {} ... enumeration, let's move to switch().
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_process.c | 35 ++++++++++++++++++++--------------- 1 file changed, 20 insertions(+), 15 deletions(-) ...
Looks good to me. ACK Jirka

Upon BLOCK_JOB_COMPLETED event delivery, we check if the job has completed (in qemuMonitorJSONHandleBlockJobImpl()). For better image, the event looks something like this: "timestamp": {"seconds": 1423582694, "microseconds": 372666}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "drive-virtio-disk0", "len": 8412790784, "offset": 409993216, "speed": 8796093022207, "type": "mirror", "error": "No space left on device"}} If "len" does not equal "offset" it's considered an error, and we can clearly see "error" field filled in. However, later in the event processing this case was handled no differently to case of job being aborted via separate API. It's time that we start differentiate these two because of the future work. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_process.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 7dc7d2b..c739775 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -1103,7 +1103,8 @@ qemuProcessHandleBlockJob(qemuMonitorPtr mon ATTRIBUTE_UNUSED, case VIR_DOMAIN_BLOCK_JOB_CANCELED: virStorageSourceFree(disk->mirror); disk->mirror = NULL; - disk->mirrorState = VIR_DOMAIN_DISK_MIRROR_STATE_NONE; + disk->mirrorState = status == VIR_DOMAIN_BLOCK_JOB_FAILED ? + VIR_DOMAIN_DISK_MIRROR_STATE_ABORT : VIR_DOMAIN_DISK_MIRROR_STATE_NONE; disk->mirrorJob = VIR_DOMAIN_BLOCK_JOB_TYPE_UNKNOWN; save = true; break; -- 2.0.5

On Wed, Feb 11, 2015 at 13:51:10 +0100, Michal Privoznik wrote:
Upon BLOCK_JOB_COMPLETED event delivery, we check if the job has completed (in qemuMonitorJSONHandleBlockJobImpl()). For better image, the event looks something like this:
"timestamp": {"seconds": 1423582694, "microseconds": 372666}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "drive-virtio-disk0", "len": 8412790784, "offset": 409993216, "speed": 8796093022207, "type": "mirror", "error": "No space left on device"}}
If "len" does not equal "offset" it's considered an error, and we can clearly see "error" field filled in. However, later in the event processing this case was handled no differently to case of job being aborted via separate API. It's time that we start differentiate these two because of the future work.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_process.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 7dc7d2b..c739775 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -1103,7 +1103,8 @@ qemuProcessHandleBlockJob(qemuMonitorPtr mon ATTRIBUTE_UNUSED, case VIR_DOMAIN_BLOCK_JOB_CANCELED: virStorageSourceFree(disk->mirror); disk->mirror = NULL; - disk->mirrorState = VIR_DOMAIN_DISK_MIRROR_STATE_NONE; + disk->mirrorState = status == VIR_DOMAIN_BLOCK_JOB_FAILED ? + VIR_DOMAIN_DISK_MIRROR_STATE_ABORT : VIR_DOMAIN_DISK_MIRROR_STATE_NONE; disk->mirrorJob = VIR_DOMAIN_BLOCK_JOB_TYPE_UNKNOWN; save = true; break;
Heh, thanks to a limited context, it took me a bit to realize how status can be *_FAILED when it is *_CANCELED :-) ACK Jirka

When migrating with storage, libvirt iterates over domain disks and instruct qemu to migrate the ones we are interested in (shared, RO and source-less disks are skipped). The disks are migrated in series. No new disk is transferred until the previous one hasn't been quiesced. This is checked on the qemu monitor via 'query-jobs' command. If the disk has been quiesced, it practically went from copying its content to mirroring state, where all disk writes are mirrored to the other side of migration too. Having said that, there's one inherent error in the design. The monitor command we use reports only active jobs. So if the job fails for whatever reason, we will not see it anymore in the command output. And this can happen fairly simply: just try to migrate a domain with storage. If the storage migration fails (e.g. due to ENOSPC on the destination) we resume the host on the destination and let it run on partly copied disk. The proper fix is what even the comment in the code says: listen for qemu events instead of polling. If storage migration changes state an event is emitted and we can act accordingly: either consider disk copied and continue the process, or consider disk mangled and abort the migration. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_migration.c | 37 +++++++++++++++++-------------------- 1 file changed, 17 insertions(+), 20 deletions(-) diff --git a/src/qemu/qemu_migration.c b/src/qemu/qemu_migration.c index 4506d87..14a4ec6 100644 --- a/src/qemu/qemu_migration.c +++ b/src/qemu/qemu_migration.c @@ -1739,7 +1739,6 @@ qemuMigrationDriveMirror(virQEMUDriverPtr driver, for (i = 0; i < vm->def->ndisks; i++) { virDomainDiskDefPtr disk = vm->def->disks[i]; - virDomainBlockJobInfo info; /* skip shared, RO and source-less disks */ if (disk->src->shared || disk->src->readonly || @@ -1772,38 +1771,36 @@ qemuMigrationDriveMirror(virQEMUDriverPtr driver, /* Poll every 500ms for progress & to allow cancellation */ struct timespec ts = { .tv_sec = 0, .tv_nsec = 500 * 1000 * 1000ull }; - memset(&info, 0, sizeof(info)); - - if (qemuDomainObjEnterMonitorAsync(driver, vm, - QEMU_ASYNC_JOB_MIGRATION_OUT) < 0) + /* Explicitly check if domain is still alive. Maybe qemu + * died meanwhile so we won't see any event at all. */ + if (!virDomainObjIsActive(vm)) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("guest unexpectedly quit")); goto error; + } + + /* The following check should be race free as long as the variable + * is set only with domain object locked. And here we have the + * domain object locked too. */ if (priv->job.asyncAbort) { - /* explicitly do this *after* we entered the monitor, - * as this is a critical section so we are guaranteed - * priv->job.asyncAbort will not change */ - ignore_value(qemuDomainObjExitMonitor(driver, vm)); priv->job.current->type = VIR_DOMAIN_JOB_CANCELLED; virReportError(VIR_ERR_OPERATION_ABORTED, _("%s: %s"), qemuDomainAsyncJobTypeToString(priv->job.asyncJob), _("canceled by client")); goto error; } - mon_ret = qemuMonitorBlockJobInfo(priv->mon, diskAlias, &info, - NULL); - if (qemuDomainObjExitMonitor(driver, vm) < 0) - goto error; - if (mon_ret < 0) - goto error; - - if (info.cur == info.end) { + if (disk->mirrorState == VIR_DOMAIN_DISK_MIRROR_STATE_READY) { VIR_DEBUG("Drive mirroring of '%s' completed", diskAlias); break; + } else if (disk->mirrorState == VIR_DOMAIN_DISK_MIRROR_STATE_ABORT) { + virReportError(VIR_ERR_OPERATION_FAILED, + _("migration of disk %s failed"), + disk->dst); + goto error; } - /* XXX Frankly speaking, we should listen to the events, - * instead of doing this. But this works for now and we - * are doing something similar in migration itself anyway */ + /* XXX Turn this into virCond someday. */ virObjectUnlock(vm); -- 2.0.5

On Wed, Feb 11, 2015 at 13:51:11 +0100, Michal Privoznik wrote:
When migrating with storage, libvirt iterates over domain disks and instruct qemu to migrate the ones we are interested in (shared, RO and source-less disks are skipped). The disks are migrated in series. No new disk is transferred until the previous one hasn't been quiesced. This is checked on the qemu monitor via 'query-jobs' command. If the disk has been quiesced, it practically went from copying its content to mirroring state, where all disk writes are mirrored to the other side of migration too. Having said that, there's one inherent error in the design. The monitor command we use reports only active jobs. So if the job fails for whatever reason, we will not see it anymore in the command output. And this can happen fairly simply: just try to migrate a domain with storage. If the storage migration fails (e.g. due to ENOSPC on the destination) we resume the host on the destination and let it run on partly copied disk.
The proper fix is what even the comment in the code says: listen for qemu events instead of polling. If storage migration changes state an event is emitted and we can act accordingly: either consider disk copied and continue the process, or consider disk mangled and abort the migration.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_migration.c | 37 +++++++++++++++++-------------------- 1 file changed, 17 insertions(+), 20 deletions(-)
diff --git a/src/qemu/qemu_migration.c b/src/qemu/qemu_migration.c index 4506d87..14a4ec6 100644 --- a/src/qemu/qemu_migration.c +++ b/src/qemu/qemu_migration.c @@ -1739,7 +1739,6 @@ qemuMigrationDriveMirror(virQEMUDriverPtr driver, ... - /* XXX Frankly speaking, we should listen to the events, - * instead of doing this. But this works for now and we - * are doing something similar in migration itself anyway */ + /* XXX Turn this into virCond someday. */
Exactly, this is what I wanted to suggest when seeing this code. The code should be even simpler with virCond so why not doing the change right away? I think this could be done as a separate patch (in this series) to make reviewing the code easier. Hopefully, migration will follow the same path soon as well (once we the appropriate event is implemented in qemu). Jirka

On 11.02.2015 17:42, Jiri Denemark wrote:
On Wed, Feb 11, 2015 at 13:51:11 +0100, Michal Privoznik wrote:
When migrating with storage, libvirt iterates over domain disks and instruct qemu to migrate the ones we are interested in (shared, RO and source-less disks are skipped). The disks are migrated in series. No new disk is transferred until the previous one hasn't been quiesced. This is checked on the qemu monitor via 'query-jobs' command. If the disk has been quiesced, it practically went from copying its content to mirroring state, where all disk writes are mirrored to the other side of migration too. Having said that, there's one inherent error in the design. The monitor command we use reports only active jobs. So if the job fails for whatever reason, we will not see it anymore in the command output. And this can happen fairly simply: just try to migrate a domain with storage. If the storage migration fails (e.g. due to ENOSPC on the destination) we resume the host on the destination and let it run on partly copied disk.
The proper fix is what even the comment in the code says: listen for qemu events instead of polling. If storage migration changes state an event is emitted and we can act accordingly: either consider disk copied and continue the process, or consider disk mangled and abort the migration.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_migration.c | 37 +++++++++++++++++-------------------- 1 file changed, 17 insertions(+), 20 deletions(-)
diff --git a/src/qemu/qemu_migration.c b/src/qemu/qemu_migration.c index 4506d87..14a4ec6 100644 --- a/src/qemu/qemu_migration.c +++ b/src/qemu/qemu_migration.c @@ -1739,7 +1739,6 @@ qemuMigrationDriveMirror(virQEMUDriverPtr driver, ... - /* XXX Frankly speaking, we should listen to the events, - * instead of doing this. But this works for now and we - * are doing something similar in migration itself anyway */ + /* XXX Turn this into virCond someday. */
Exactly, this is what I wanted to suggest when seeing this code. The code should be even simpler with virCond so why not doing the change right away? I think this could be done as a separate patch (in this series) to make reviewing the code easier.
Exactly. It has not direct influence on this patcheset. The virCond patch should be easy and small though, so after this is merged, I can propose one. But I'd really love to fix the bug asap. Michal

On Wed, Feb 11, 2015 at 13:51:11 +0100, Michal Privoznik wrote:
When migrating with storage, libvirt iterates over domain disks and instruct qemu to migrate the ones we are interested in (shared, RO and source-less disks are skipped). The disks are migrated in series. No new disk is transferred until the previous one hasn't been quiesced.
Quiesce in the libvirt terminology is used in the meaning that the filesystem writes were frozen (see snapshot creation). In this context it's used in the meaning that the qemu block-copy job entered the synchronized phase, but writes are still possible. Please tweak the wording to avoid confusion.
This is checked on the qemu monitor via 'query-jobs' command. If the disk has been quiesced, it practically went from copying its content to mirroring state, where all disk writes are mirrored to the other side of migration too. Having said that, there's one inherent error in the design. The monitor command we use reports only active jobs. So if the job fails for whatever reason, we will not see it anymore in the command output. And this can happen fairly simply: just try to migrate a domain with storage. If the storage migration fails (e.g. due to ENOSPC on the destination) we resume the host on the destination and let it run on partly copied disk.
The proper fix is what even the comment in the code says: listen for qemu events instead of polling. If storage migration changes state an event is emitted and we can act accordingly: either consider disk copied and continue the process, or consider disk mangled and abort the migration.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_migration.c | 37 +++++++++++++++++-------------------- 1 file changed, 17 insertions(+), 20 deletions(-)
Peter
participants (3)
-
Jiri Denemark
-
Michal Privoznik
-
Peter Krempa