Dear colleagues,

I am facing a problem that has been troubling me for the last week and a half. I would be grateful for any help or guidance.

I have a non-prod POC environment with 2 fully updated CentOS 7 hypervisors and an NFS filer that serves as VM image storage. The overall environment works exceptionally well. However, a few weeks ago I started trying to implement virtlock in order to prevent a VM from running on 2 hypervisors at the same time.

Here is how the environment looks in terms of virtlock configuration on both hypervisors:

-- Content of /etc/libvirt/qemu.conf --
lock_manager = "lockd"

Only the above line is uncommented for direct locking.
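
To double-check on each host that this really is the only active setting, a small Python sketch along these lines could be used. It is only an illustration: the helper name active_settings is made up here, and it understands simple single-line key = value entries only (multi-line list values in qemu.conf are not parsed properly).

```python
import re

def active_settings(conf_text):
    """Return {key: value} for every uncommented single-line `key = value` entry."""
    settings = {}
    for line in conf_text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        match = re.match(r'^(\w+)\s*=\s*(.+)$', line)
        if match:
            # Drop surrounding whitespace and double quotes from the value.
            settings[match.group(1)] = match.group(2).strip().strip('"')
    return settings

# Hypothetical excerpt standing in for /etc/libvirt/qemu.conf.
sample = '''
#security_driver = "selinux"
lock_manager = "lockd"
'''
print(active_settings(sample))  # {'lock_manager': 'lockd'}
```

On a real host the same function could be fed the file contents, for example active_settings(open("/etc/libvirt/qemu.conf").read()).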

# libvirtd --version; python -c "import platform; print(platform.platform())"; virtlockd -V
libvirtd (libvirt) 3.2.0
Linux-3.10.0-693.2.2.el7.x86_64-x86_64-with-centos-7.4.1708-Core
virtlockd (libvirt) 3.2.0

# getenforce
Permissive

Here is the issue:

h1 # virsh list
Id    Name                           State
----------------------------------------------------
1     test09                         running

h1 # virsh domblklist test09
Target     Source
------------------------------------------------
vda        /storage_nfs/images_001/test09.qcow2

h1 #

h2 # virsh list
Id    Name                           State
----------------------------------------------------

h2 # virsh list --all | grep test09
-     test09                         shut off

h2 # virsh start test09
error: Failed to start domain test09
error: resource busy: Lockspace resource '/storage_nfs/images_001/test09.qcow2' is locked

h2 # virsh list
Id    Name                           State
----------------------------------------------------

h2 #
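
As far as I understand from the libvirt documentation, virtlockd with direct locking takes POSIX fcntl() locks on the image file itself. The Python sketch below shows that mechanism in isolation on a temporary file. The names is_locked, HOLDER and demo are made up for this illustration, and nothing here talks to libvirt or to the real image.

```python
import fcntl
import os
import subprocess
import sys
import tempfile

def is_locked(path):
    """Return True if another process holds a conflicting fcntl lock on path."""
    with open(path, "r+b") as handle:
        try:
            # Non-blocking attempt to take an exclusive lock on the whole file.
            fcntl.lockf(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            return True
        fcntl.lockf(handle, fcntl.LOCK_UN)
        return False

# Child process that takes the lock and holds it until its stdin is closed.
HOLDER = (
    "import fcntl, sys\n"
    "f = open(sys.argv[1], 'r+b')\n"
    "fcntl.lockf(f, fcntl.LOCK_EX)\n"
    "print('locked', flush=True)\n"
    "sys.stdin.read()\n"
)

def demo():
    """Return (locked while the holder runs, locked after it exits)."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        path = tmp.name
    try:
        holder = subprocess.Popen(
            [sys.executable, "-c", HOLDER, path],
            stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
        )
        holder.stdout.readline()      # wait until the child has the lock
        while_held = is_locked(path)
        holder.stdin.close()          # let the child exit and drop the lock
        holder.wait()
        holder.stdout.close()
        after_release = is_locked(path)
    finally:
        os.unlink(path)
    return while_held, after_release

if __name__ == "__main__":
    print(demo())
```

The probe has to run in a different process from the lock holder, because POSIX locks are per process and a process never conflicts with its own lock.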

Before I try to start test09 on h2, I open a console to the guest running on h1 and watch what happens in it. As soon as I try to start test09 on the h2 hypervisor (and get the message about the locked resource), the following messages appear in the console and the VM goes into read-only mode:

on test09's console:

[ 567.394148] blk_update_request: I/O error, dev vda, sector 13296056
[ 567.395883] blk_update_request: I/O error, dev vda, sector 13296056

[ 572.871905] blk_update_request: I/O error, dev vda, sector 8654040
[ 572.872627] Aborting journal on device vda1-8.
[ 572.873978] blk_update_request: I/O error, dev vda, sector 8652800
[ 572.874707] Buffer I/O error on dev vda1, logical block 1081344, lost sync page write
[ 572.875472] blk_update_request: I/O error, dev vda, sector 2048
[ 572.876009] Buffer I/O error on dev vda1, logical block 0, lost sync page write
[ 572.876727] EXT4-fs error (device vda1): ext4_journal_check_start:56: Detected aborted journal
[ 572.878061] JBD2: Error -5 detected when updating journal superblock for vda1-8.
[ 572.878807] EXT4-fs (vda1): Remounting filesystem read-only
[ 572.879311] EXT4-fs (vda1): previous I/O error to superblock detected
[ 572.880937] blk_update_request: I/O error, dev vda, sector 2048
[ 572.881538] Buffer I/O error on dev vda1, logical block 0, lost sync page write

I also checked the guest's log:

-- /var/log/libvirt/qemu/test09.log --

block I/O error in device 'drive-virtio-disk0': Permission denied (13)
block I/O error in device 'drive-virtio-disk0': Permission denied (13)
block I/O error in device 'drive-virtio-disk0': Permission denied (13)
block I/O error in device 'drive-virtio-disk0': Permission denied (13)
block I/O error in device 'drive-virtio-disk0': Permission denied (13)
block I/O error in device 'drive-virtio-disk0': Permission denied (13)
block I/O error in device 'drive-virtio-disk0': Permission denied (13)

If it helps, here is the disk portion of the domain XML:

    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='none'/>
      <source file='/storage_nfs/images_001/test09.qcow2'/>
      <backingStore/>
      <target dev='vda' bus='virtio'/>
      <alias name='virtio-disk0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </disk>
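
For comparing the disk definition across guests or hosts, the relevant attributes can be pulled out of the XML with a short Python sketch like the one below. The function name summarize_disk is made up for this illustration. The shareable and readonly checks are included because, as far as I understand the libvirt docs, those sub-elements influence how a disk is locked; neither is present in my definition.

```python
import xml.etree.ElementTree as ET

# Disk element copied from the domain XML above (alias and address omitted).
DISK_XML = """
<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2' cache='none'/>
  <source file='/storage_nfs/images_001/test09.qcow2'/>
  <backingStore/>
  <target dev='vda' bus='virtio'/>
</disk>
"""

def summarize_disk(disk_xml):
    """Extract the attributes of a <disk> element that matter for locking."""
    disk = ET.fromstring(disk_xml)
    driver = disk.find("driver")
    source = disk.find("source")
    target = disk.find("target")
    return {
        "target": target.get("dev"),
        "source": source.get("file"),
        "format": driver.get("type"),
        "cache": driver.get("cache"),
        # Presence of these empty elements is what counts, not their content.
        "shareable": disk.find("shareable") is not None,
        "readonly": disk.find("readonly") is not None,
    }

print(summarize_disk(DISK_XML))
```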

I usually run SELinux in enforcing mode on hypervisors to isolate guests even further, but this time I set it to permissive mode just to rule SELinux out. The same thing happens when SELinux is in enforcing mode (virt_use_nfs is set to on in that case), and audit2why doesn't report any anomalies when parsing the audit logs.

I have also tried indirect locking via the same filer, with a separate export for the hashes, by uncommenting the following line in /etc/libvirt/qemu-lockd.conf:

file_lockspace_dir = "/var/lib/libvirt/lockd/files"

In this case the hashes are created as expected on the NFS export mounted under /var/lib/libvirt/lockd/files. I have also tried both QCOW2 and raw disk images for the VMs (and even XFS- and ext4-based guests), but the outcome is always the same. I have consulted a couple of KVM books on this topic, as well as the Red Hat and SUSE docs, but the configuration instructions are, naturally, pretty much the same everywhere. I saw that some colleagues have posted emails related to virtlock to the list (e.g. https://www.redhat.com/archives/libvirt-users/2015-September/msg00004.html), but it does not seem to be the same issue. As a last resort, I also completely disabled SELinux, rebooted both hypervisors, created a new VM and repeated all the steps listed above, with the same results.
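
Regarding the hashes mentioned above: my reading of the libvirt lockd documentation is that, with file_lockspace_dir set, the lease file is named after the SHA256 checksum of the fully qualified image path. Assuming it is the plain hex digest of the path string (an assumption on my part, not something I verified in the source), the expected lease file for an image can be worked out with a Python sketch like this. The function name lease_path is made up for illustration.

```python
import hashlib
import os

def lease_path(image_path, lockspace_dir="/var/lib/libvirt/lockd/files"):
    """Guess the lease file for an image under indirect locking.

    Assumes the lease name is the SHA256 hex digest of the image path string.
    """
    digest = hashlib.sha256(image_path.encode("utf-8")).hexdigest()
    return os.path.join(lockspace_dir, digest)

print(lease_path("/storage_nfs/images_001/test09.qcow2"))
```

Comparing the printed name with the files that actually appear on the export would confirm or refute the assumption.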

Now, I am pretty sure that I am missing something simple here, since this is a standard feature and should work out of the box if set up correctly, but so far I cannot see what it is.

I would really appreciate any tip/help.

Thank you very much!!

Regards,

Branimir