On Fri, Oct 11, 2019 at 23:18:29 +0000, Jim Fehlig wrote:
I've been investigating a lockd lock ordering bug in a migration
error handling
path in the libxl driver. In the perform phase, the src calls
virDomainLockProcessPause to release the lock before sending the VM to dst. In
this case the send fails for other reasons and an attempt is made to reacquire
the lock with virDomainLockProcessResume. But that fails since the dst has not
finished cleaning up the failed VM and releasing the lock it acquired when
starting to receive the VM. My immediate reaction was "why not reacquire the
lock in the confirm phase", but then I saw my older comment a few lines later in
the perform phase code
/*
* Confirm phase will not be executed if perform fails. End the
* job started in begin phase.
*/
Is that just a bug in the implementation, or is it intended to skip the confirm
phase if perform fails?
It's intended. The Perform phase runs on the source hosts so why should
we call Confirm to let the source know about the failure? But of course,
the source has to cleanup after the failed migration similarly to what
Confirm would do.
Jirka