Hi Jirka,
On Thu, Jan 25, 2018 at 8:43 PM, Jiri Denemark <jdenemar(a)redhat.com> wrote:
On Thu, Jan 25, 2018 at 19:51:23 +0530, Prerna Saxena wrote:
> In case of non-p2p migration, in case libvirt client gets disconnected from source libvirt
> after PERFORM phase is over, the daemon just resets the current migration job.
> However, the VM could be left paused on both source and destination in such case. In case
> the client reconnects and queries migration status, the job has been blanked out from source libvirt,
> and this reconnected client has no clear way of figuring out if an unclean migration had previously
> been aborted.
The virDomainGetState API should return VIR_DOMAIN_PAUSED with
VIR_DOMAIN_PAUSED_MIGRATION reason. Is this not enough?
I understand that a client application should poll the source libvirtd for
migration job completion using virDomainGetJobStats().
However, as you explained above, the cleanup callbacks clear the job info, so
a client would additionally have to poll virDomainGetState().
Would it not be cleaner to have a single API reflect the source of truth?
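To make the two-call check concrete, here is a rough sketch of the decision a reconnecting client has to make today. The enums and classify_source() are hypothetical stand-ins that I made up for illustration; in a real client the inputs would come from virDomainGetJobStats() and virDomainGetState(), and none of these names exist in libvirt.

```c
#include <assert.h>

/* Hypothetical stand-ins for what a client would read back from
 * virDomainGetJobStats() and virDomainGetState(). These are NOT the
 * libvirt enums, only a simplified model for illustration. */
typedef enum { JOB_NONE, JOB_ACTIVE, JOB_COMPLETED, JOB_FAILED } job_kind;
typedef enum { DOM_RUNNING, DOM_PAUSED, DOM_SHUTOFF } dom_state;
typedef enum { PAUSE_NONE, PAUSE_USER, PAUSE_MIGRATION } pause_reason;

typedef enum {
    MIG_IN_PROGRESS,
    MIG_SUCCEEDED,
    MIG_FAILED,
    MIG_INTERRUPTED,    /* paused for migration, but no job left to query */
    MIG_NOT_MIGRATING
} mig_verdict;

/* Combine the job info and the domain state into a single answer. */
mig_verdict classify_source(job_kind job, dom_state state, pause_reason reason)
{
    /* A pause reason is only meaningful while the domain is paused. */
    assert(reason == PAUSE_NONE || state == DOM_PAUSED);

    if (job == JOB_ACTIVE)
        return MIG_IN_PROGRESS;
    if (job == JOB_COMPLETED)
        return MIG_SUCCEEDED;
    if (job == JOB_FAILED)
        return MIG_FAILED;

    /* The job info has been cleared, so the domain state is the only
     * remaining hint that a migration was cut short. */
    if (state == DOM_PAUSED && reason == PAUSE_MIGRATION)
        return MIG_INTERRUPTED;

    return MIG_NOT_MIGRATING;
}
```

The last branch is the case this patch is about: the job is gone, so the client can only infer an interrupted migration from the paused state and its reason.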
> This patch calls out a "potentially" incomplete migration as a failed
> job, so that a client may
As you say it's potentially incomplete, so marking it as failed is not
completely correct. It's a split brain when the source cannot
distinguish whether the migration was successful or not.
Agreed, it might have run to completion too, as we observed in some cases.
Do you think marking the job status as "UNKNOWN" would describe the current
state better?
> diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
> index e8e0313..7c60d17 100644
> --- a/src/qemu/qemu_domain.c
> +++ b/src/qemu/qemu_domain.c
> @@ -4564,6 +4564,22 @@ qemuDomainObjDiscardAsyncJob(virQEMUDriverPtr driver, virDomainObjPtr obj)
> qemuDomainObjSaveJob(driver, obj);
> }
>
> +
> +void
> +qemuDomainObjFailAsyncJob(virQEMUDriverPtr driver, virDomainObjPtr obj)
> +{
> + qemuDomainObjPrivatePtr priv = obj->privateData;
> + VIR_FREE(priv->job.completed);
> + if (VIR_ALLOC(priv->job.completed) == 0) {
> + priv->job.current->type = VIR_DOMAIN_JOB_FAILED;
> + priv->job.completed = priv->job.current;
This will just leak the memory allocated for priv->job.completed by
overwriting the pointer to the one from priv->job.current, ...
> + } else {
> + VIR_WARN("Unable to allocate job.completed for VM %s", obj->def->name);
> + }
> + qemuDomainObjResetAsyncJob(priv);
which will point to a freed memory after this call.
Agreed, I will fix this.
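One possible shape for the fix, sketched on a simplified model rather than the real qemu driver structures: hand the current job record over to "completed" instead of allocating a new one, and clear "current" before the reset so the reset cannot free it. The types job_info and job_state and the helpers job_reset() and job_fail() are hypothetical names for illustration only; the actual patch may well end up looking different.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified, hypothetical model of the job bookkeeping; these are not
 * the real qemuDomainObjPrivate structures. */
typedef enum { JOB_NONE, JOB_FAILED } job_kind;
typedef struct { job_kind type; } job_info;
typedef struct {
    job_info *current;     /* record of the running job, owned here */
    job_info *completed;   /* record of the last finished job, owned here */
} job_state;

/* Stand-in for the reset step: releases whatever 'current' still owns. */
void job_reset(job_state *js)
{
    free(js->current);
    js->current = NULL;
}

/* Mark the running job as failed and keep its record as 'completed'. */
void job_fail(job_state *js)
{
    assert(js != NULL);

    /* Drop any stale record from an earlier job. */
    free(js->completed);
    js->completed = NULL;

    if (js->current != NULL) {
        js->current->type = JOB_FAILED;
        /* Transfer ownership: no fresh allocation, so nothing leaks... */
        js->completed = js->current;
        /* ...and the reset below no longer sees this record, so
         * 'completed' does not end up pointing at freed memory. */
        js->current = NULL;
    }

    job_reset(js);
}
```

With this ordering there is exactly one owner of the record at any time, which addresses both the leak and the dangling pointer you pointed out.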
> + qemuDomainObjEndJob(driver, obj);
And while this is almost certainly (I didn't really check though) not
something you should call from a close callback, you don't save the
changes to the status XML so once libvirtd restarts, it will think the
domain is still being migrated.
I will save this change to the status XML as well.
I am suggesting that strengthening the job data would be useful in addition.
If the daemon has not restarted, the job information can still give us a
fairly accurate migration status. Please let me know if you think this is
not useful; I will be happy to learn the rationale.
Regards,
Prerna