On Tue, Jun 9, 2020 at 3:46 PM Peter Krempa <pkrempa@redhat.com> wrote:
On Tue, Jun 09, 2020 at 15:38:53 +0300, Liran Rotenberg wrote:
> Hi all,
> Passing on Bug 1840609 <https://bugzilla.redhat.com/show_bug.cgi?id=1840609>
> - Wake up from hibernation failed:internal error: unable to execute QEMU
> command 'cont': Failed to get "write" lock.
>
> In Ovirt/RHV there is a specific flow that prevents the VM from starting on
> the first host.
> The result is:
> 2020-06-09T12:12:58.111610Z qemu-kvm: Failed to get "write" lock
> Is another process using the image
> [/rhev/data-center/3b67fb92-906b-11ea-bb36-482ae35a5f83/4fd23357-6047-46c9-aa81-ba6a12a9e8bd/images/0191384a-3e0a-472f-a889-d95622cb6916/7f553f44-db08-480e-8c86-cbdeccedfafe]?
> 2020-06-09T12:12:58.668140Z qemu-kvm: terminating on signal 15 from pid
> 177876 (<unknown process>)

This error comes from qemu's internal file locking. It usually means
that there is another qemu or qemu-img which has the given image open.

Is there anything which would access the image at that specific time or
slightly around?
I don't think so. The volumes are created and added to the volume chain on the VM metadata domxml(two snapshots created). Then the user restores the latest snapshot and deletes them (while the VM is down) - they are removed.
The VM is set with the volume and going up, restoring the memory. The mounting place (in /rhev/data-center) points to the same disk and volume. On the first run I see the new place /rhev/data-center/<uuid>... that I can't tell why or where it comes from. It is set with 'rw', while the normal destination to the shared NFS is only with 'r'.

It might be a race condition from something trying to modify the image
perhaps combined with propagation of locks via NFS.
I can't see any race from the RHV point of view. The strangest thing is why it works on the second host?
In RHV after the VM is up we remove the memory disk, but that doesn't happen in this case, since the VM wasn't up.