Hi folks,
I'm looking into a problem discussed back in January 2013
wherein lock/lease state isn't properly preserved across suspend/resume.
(This situation can lead to corruption if the guest's block storage is
modified elsewhere while the original guest is paused.)
For details see:
https://www.redhat.com/archives/libvirt-users/2013-January/msg00109.html
https://bugzilla.redhat.com/show_bug.cgi?id=906590
I'm using libvirt-1.2.0 with explicit Sanlock leases defined in the domain XML.
It appears the problematic behavior is due to virDomainLockProcessPause()
and virDomainLockProcessResume() being called twice during each
suspend/resume: once by the RPC worker thread running the suspend/resume
command, and once by the main thread in response to the QEMU events
triggered by the RPC worker's actions.
In libvirt-1.2.0, the call paths for suspend are as follows:

RPC worker thread:
  qemuDomainObjBeginJob(suspend) ->
    qemuDomainSuspend() ->
      qemuProcessStopCPUs() ->
        virDomainLockProcessPause()

Main (event) thread:
  qemuMonitorJSONIOProcessEvent:143 : handle STOP ->
    qemuProcessHandleStop() ->
      virDomainLockProcessPause()
The first call -- usually the one from qemuProcessHandleStop(), though
there may be a race -- correctly saves the lease state and releases the
locks. The second call then queries the lock manager after the locks
have already been released, finds none held, and saves a null/blank
lockState in the domain object.
Before I start working on a solution, are these multiple invocations
of virDomainLockProcessPause()/virDomainLockProcessResume() intentional?
Thanks,
Adam Tilghman
UC San Diego