On Mon, Jan 29, 2018 at 15:56:29 +0530, Prerna wrote:
> Hi Jirka,
>
> On Thu, Jan 25, 2018 at 8:43 PM, Jiri Denemark <jdenemar(a)redhat.com> wrote:
> > On Thu, Jan 25, 2018 at 19:51:23 +0530, Prerna Saxena wrote:
> > > In case of non-p2p migration, if the libvirt client gets disconnected
> > > from the source libvirtd after the PERFORM phase is over, the daemon
> > > just resets the current migration job. However, the VM could be left
> > > paused on both source and destination in such a case. If the client
> > > reconnects and queries migration status, the job has been blanked out
> > > from the source libvirtd, and this reconnected client has no clear way
> > > of figuring out whether an unclean migration had previously been
> > > aborted.
> >
> > The virDomainGetState API should return VIR_DOMAIN_PAUSED with
> > VIR_DOMAIN_PAUSED_MIGRATION reason. Is this not enough?
>
> I understand that a client application should poll the source libvirtd for
> the status of migration job completion using virDomainGetJobStats().
Not really; it may poll if it wants to monitor migration progress, but
normally the client would just wait for the migration API to return
either success or failure.
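
That is, for a non-p2p migration the client typically just calls the
migration API and waits for it to return. A rough, untested sketch (error
reporting omitted; dconn is assumed to be an open connection to the
destination daemon and dom the domain on the source):

  #include <libvirt/libvirt.h>

  /* Untested sketch of client-driven (non-p2p) migration. */
  static int
  migrate_and_wait(virDomainPtr dom, virConnectPtr dconn)
  {
      virDomainPtr ddom;

      /* Blocks until the migration either succeeds or fails. */
      ddom = virDomainMigrate(dom, dconn, VIR_MIGRATE_LIVE,
                              NULL /* dname */, NULL /* uri */,
                              0 /* bandwidth */);
      if (!ddom)
          return -1;   /* failure (or a broken connection) */

      virDomainFree(ddom);
      return 0;        /* the domain now runs on the destination */
  }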
> However, as you explained above, cleanup callbacks clear the job info so a
> client should additionally be polling for virDomainGetState() too.
Well, even if virDomainGetJobStats with the VIR_DOMAIN_JOB_STATS_COMPLETED
flag were modified to report the job as VIR_DOMAIN_JOB_FAILED, the client
would still need to call virDomainGetState (on both sides in some cases)
to check whether the domain is running or was left in a paused state.
So reporting a failed job via virDomainGetJobStats does not really seem
necessary. And it would be a bit confusing too, since the flag is called
*_COMPLETED while the migration in fact did not complete. This confusion
could be fixed by introducing a new flag, but...
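
For illustration, the state check the client needs anyway is only a few
lines. A rough, untested sketch (dom is whatever handle the client looked
up on the source or destination connection):

  #include <libvirt/libvirt.h>

  /* Untested sketch: returns 1 if a previous migration left the domain
   * paused, 0 if not, -1 on error. */
  static int
  left_paused_by_migration(virDomainPtr dom)
  {
      int state;
      int reason;

      if (virDomainGetState(dom, &state, &reason, 0) < 0)
          return -1;

      return state == VIR_DOMAIN_PAUSED &&
             reason == VIR_DOMAIN_PAUSED_MIGRATION;
  }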
> Would it not be cleaner to have a single API reflect the source of truth?
Perhaps, but since there already is a way of getting the info, any
client which wants to work with more than just bleeding-edge libvirt
would still need to implement the existing way. And why would the client
bother using the new API when it can be sure the old way will still be
available? Supporting both would make the client even more complicated
for no benefit.
But as I said, just seeing that a previous migration job failed is not
enough to recover from a disconnected client which was controlling a
non-p2p migration.
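
To show what recovering actually involves, here is a deliberately
simplified, untested sketch of the decision a reconnected client has to
make; a real client needs to be far more careful (storage, racing with the
other daemon, etc.), so take it as an illustration only. src_dom and
dst_dom are the domain looked up on the source and destination
connections, with dst_dom being NULL if the domain does not exist there:

  #include <libvirt/libvirt.h>

  /* Deliberately simplified, untested sketch; illustration only. */
  static int
  recover_after_disconnect(virDomainPtr src_dom, virDomainPtr dst_dom)
  {
      int sstate, sreason;
      int dstate = VIR_DOMAIN_NOSTATE, dreason = 0;

      if (virDomainGetState(src_dom, &sstate, &sreason, 0) < 0)
          return -1;
      if (dst_dom && virDomainGetState(dst_dom, &dstate, &dreason, 0) < 0)
          return -1;

      if (dst_dom && dstate == VIR_DOMAIN_RUNNING) {
          /* Migration apparently finished; clean up the leftover copy on
           * the source if it is still paused for migration. */
          if (sstate == VIR_DOMAIN_PAUSED &&
              sreason == VIR_DOMAIN_PAUSED_MIGRATION)
              return virDomainDestroy(src_dom);
          return 0;
      }

      /* Otherwise keep the source copy: drop the half-migrated destination
       * (if any) and resume the source if migration left it paused. */
      if (dst_dom && virDomainDestroy(dst_dom) < 0)
          return -1;
      if (sstate == VIR_DOMAIN_PAUSED &&
          sreason == VIR_DOMAIN_PAUSED_MIGRATION)
          return virDomainResume(src_dom);
      return 0;
  }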
BTW, p2p migration is far less fragile in this respect. If the
connection to a client breaks, migration normally continues without any
disruption. And if the connection between libvirt daemons fails, both
sides will detect it and abort the migration. Of course, a split brain
can still happen even with p2p migration, but it is not as easy to
trigger, since the window in which the connection has to break to cause
a split brain is much shorter.
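
For comparison, a p2p migration is requested like this; untested sketch,
and the destination URI below is only an example:

  #include <libvirt/libvirt.h>

  /* Untested sketch: p2p migration, where the source daemon connects to
   * the destination itself, so a dying client does not interrupt the
   * migration in progress. */
  static int
  migrate_p2p(virDomainPtr dom)
  {
      return virDomainMigrateToURI(dom,
                                   "qemu+tcp://dst.example.com/system",
                                   VIR_MIGRATE_PEER2PEER | VIR_MIGRATE_LIVE,
                                   NULL /* dname */, 0 /* bandwidth */);
  }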
Jirka