On 10/14/19 2:18 AM, Jiri Denemark wrote:
On Fri, Oct 11, 2019 at 23:18:29 +0000, Jim Fehlig wrote:
> I've been investigating a lockd lock ordering bug in a migration error handling
> path in the libxl driver. In the perform phase, the src calls
> virDomainLockProcessPause to release the lock before sending the VM to dst. In
> this case the send fails for other reasons and an attempt is made to reacquire
> the lock with virDomainLockProcessResume. But that fails since the dst has not
> finished cleaning up the failed VM and releasing the lock it acquired when
> starting to receive the VM. My immediate reaction was "why not reacquire the
> lock in the confirm phase", but then I saw my older comment a few lines later
in
> the perform phase code
>
> /*
> * Confirm phase will not be executed if perform fails. End the
> * job started in begin phase.
> */
>
> Is that just a bug in the implementation, or is it intended to skip the confirm
> phase if perform fails?
It's intended. The Perform phase runs on the source hosts so why should
we call Confirm to let the source know about the failure?
To do any cleanup of the failed migration after the dst has done it's cleanup in
the finish phase?
But of course,
the source has to cleanup after the failed migration similarly to what
Confirm would do.
I've made slight changes to the lock ordering and it looks promising after
initial tests. I'll post a patch after further testing. Thanks!
Regards,
Jim