On Tue, Jun 9, 2020 at 4:14 PM Peter Krempa <pkrempa@redhat.com> wrote:
On Tue, Jun 09, 2020 at 16:00:02 +0300, Liran Rotenberg wrote:
> On Tue, Jun 9, 2020 at 3:46 PM Peter Krempa <pkrempa@redhat.com> wrote:
>
> > On Tue, Jun 09, 2020 at 15:38:53 +0300, Liran Rotenberg wrote:
> > > Hi all,
> > > Passing on Bug 1840609 <
> > https://bugzilla.redhat.com/show_bug.cgi?id=1840609>
> > > - Wake up from hibernation failed:internal error: unable to execute QEMU
> > > command 'cont': Failed to get "write" lock.
> > >
> > > In Ovirt/RHV there is a specific flow that prevents the VM from starting
> > on
> > > the first host.
> > > The result is:
> > > 2020-06-09T12:12:58.111610Z qemu-kvm: Failed to get "write" lock
> > > Is another process using the image
> > >
> > [/rhev/data-center/3b67fb92-906b-11ea-bb36-482ae35a5f83/4fd23357-6047-46c9-aa81-ba6a12a9e8bd/images/0191384a-3e0a-472f-a889-d95622cb6916/7f553f44-db08-480e-8c86-cbdeccedfafe]?
> > > 2020-06-09T12:12:58.668140Z qemu-kvm: terminating on signal 15 from pid
> > > 177876 (<unknown process>)
> >
> > This error comes from qemu's internal file locking. It usually means
> > that there is another qemu or qemu-img which has the given image open.
> >
> > Is there anything which would access the image at that specific time or
> > slightly around?
> >
> I don't think so. The volumes are created and added to the volume chain on
> the VM metadata domxml(two snapshots created). Then the user restores the
> latest snapshot and deletes them (while the VM is down) - they are removed.
> The VM is set with the volume and going up, restoring the memory. The
> mounting place (in /rhev/data-center) points to the same disk and volume.
> On the first run I see the new place /rhev/data-center/<uuid>... that I
> can't tell why or where it comes from. It is set with 'rw', while the
> normal destination to the shared NFS is only with 'r'.

That's definitely something not related to the locking itself.

Please attach the XML document used to start the VM in both places
(working and non-working). I can't tell if there's a difference without
seeing those.

The difference might very well be in the XMLs.
I added a snippet from the engine log. The two first domxmls are from the first attempt on 'host_mixed_1' resulting in:
"unable to execute QEMU command 'cont': Failed to get "write" lock."
From this point, it's the second attempt on 'host_mixed_2' (2020-06-09 15:13:00,308+03 in log time). 

> > It might be a race condition from something trying to modify the image
> > perhaps combined with propagation of locks via NFS.
> >
> I can't see any race from the RHV point of view. The strangest thing is why
> it works on the second host?
> In RHV after the VM is up we remove the memory disk, but that doesn't
> happen in this case, since the VM wasn't up.

I'd speculate that it's timing. There's more time for the locks to
propagate or an offending process to terminate. The error message
definitely does not look like a permission issue, and
qemuSetupImagePathCgroup:75 is in the end a no-op on on non-device
paths (makes sense only for LVM logical volumes, partitions etc).