On Wed, May 18, 2022 at 15:40:18 +0200, Jiri Denemark wrote:
On Thu, May 12, 2022 at 14:14:15 +0200, Peter Krempa wrote:
> On Tue, May 10, 2022 at 17:21:15 +0200, Jiri Denemark wrote:
> > Mostly we just need to check whether the domain is in a failed post-copy
> > migration that can be resumed.
> >
> > Signed-off-by: Jiri Denemark <jdenemar(a)redhat.com>
> > ---
> > src/qemu/qemu_migration.c | 143 ++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 143 insertions(+)
> >
> > diff --git a/src/qemu/qemu_migration.c b/src/qemu/qemu_migration.c
> > index c111dd8686..99b1d4b88b 100644
> > --- a/src/qemu/qemu_migration.c
> > +++ b/src/qemu/qemu_migration.c
> > @@ -2717,6 +2717,143 @@ qemuMigrationSrcBeginPhase(virQEMUDriver *driver,
> > flags);
> > }
> >
> > +
> > +static bool
> > +qemuMigrationAnyCanResume(virDomainObj *vm,
> > + virDomainAsyncJob job,
> > + unsigned long flags,
> > + qemuMigrationJobPhase expectedPhase)
> > +{
> > + qemuDomainObjPrivate *priv = vm->privateData;
> > +
> > + VIR_DEBUG("vm=%p, job=%s, flags=0x%lx, expectedPhase=%s",
> > + vm, virDomainAsyncJobTypeToString(job), flags,
> > + qemuDomainAsyncJobPhaseToString(VIR_ASYNC_JOB_MIGRATION_OUT,
> > + expectedPhase));
> > +
> > + if (!(flags & VIR_MIGRATE_POSTCOPY)) {
> > + virReportError(VIR_ERR_ARGUMENT_UNSUPPORTED, "%s",
> > + _("resuming failed post-copy migration requires
"
> > + "post-copy to be enabled"));
>
> Error messages shall not have line breaks.
>
> Also ... requiring this flag seems to be a bit pointless and ...
> cosmetic at best. Since the previous migration would need to be started
> with the post-copy flag already which is checked below.
Right, the previous migration must have been started as post-copy, but
resuming it is post-copy too. We could have implied the flag based on
POSTCOPY_RESUME (which is ugly) or all relevant places checking POSTCOPY
would need to check POSTCOPY_RESUME too (which is ugly as well). And
with the expected use case (repeating the failed migration command with
all the flags and arguments and adding an extra POSTCOPY_RESUME flag) in
mind requiring POSTCOPY to be used with POSTCOPY_RESUME looks like the
the best option to me.
I was kind of referring to the fact that it's implied that postcopy is
being resumed with a POSTCOPY_RESUME flag and you need to check anyways
that the migration is indeed postcopy, so this feels redundant, but I
think it's okay.
>
> > + return false;
> > + }
> > +
> > + /* This should never happen since POSTCOPY_RESUME is newer than
> > + * CHANGE_PROTECTION, but let's check it anyway in case we're
talking to
> > + * a weired client.
> > + */
> > + if (job == VIR_ASYNC_JOB_MIGRATION_OUT &&
> > + expectedPhase < QEMU_MIGRATION_PHASE_PERFORM_RESUME &&
> > + !(flags & VIR_MIGRATE_CHANGE_PROTECTION)) {
> > + virReportError(VIR_ERR_ARGUMENT_UNSUPPORTED, "%s",
> > + _("resuming failed post-copy migration requires
"
> > + "change protection"));
>
> Same here.
>
> > + return NULL;
> > + }
>
>
> [...]
>
> > +static char *
> > +qemuMigrationSrcBeginResume(virQEMUDriver *driver,
> > + virDomainObj *vm,
> > + const char *xmlin,
> > + char **cookieout,
> > + int *cookieoutlen,
> > + unsigned long flags)
> > +{
> > + virDomainJobStatus status;
> > +
> > + if (qemuMigrationAnyRefreshStatus(driver, vm,
VIR_ASYNC_JOB_MIGRATION_OUT,
> > + &status) < 0)
> > + return NULL;
> > +
> > + if (status != VIR_DOMAIN_JOB_STATUS_POSTCOPY_PAUSED) {
> > + virReportError(VIR_ERR_OPERATION_INVALID, "%s",
> > + _("QEMU reports migration is still
running"));
> > + return NULL;
> > + }
> > +
> > + return qemuMigrationSrcBeginXML(driver, vm, xmlin,
> > + cookieout, cookieoutlen, 0, NULL, 0,
flags);
>
> Certain steps here around the non-shared-storage migration are redundant
> at this point and IMO should be skipped.
Which is what happens in qemuMigrationSrcBegin:
if (flags & VIR_MIGRATE_POSTCOPY_RESUME) {
xml = qemuMigrationSrcBeginResumePhase(...);
goto cleanup;
}
...
Or did I miss anything?
In 'qemuMigrationSrcBeginXML'
qemuMigrationSrcBeginPhaseBlockDirtyBitmaps is called which does a bunch
of setup for bitmap migration which is pointless when recovering as
storage is copied already if the CPUs are running on the destination.