Dear colleagues,
I have been struggling with a problem for the last week and a half, and I would appreciate any help or guidance.
I have a non-prod POC environment with 2 fully updated CentOS 7 hypervisors and an NFS filer that serves as VM image storage. The overall environment works exceptionally well. However, starting a few weeks ago I have been trying to implement virtlock in order to prevent a VM from running on 2 hypervisors at the same time.
Here is how virtlock is configured on both hypervisors:
-- Content of /etc/libvirt/qemu.conf --
lock_manager = "lockd"
Only the above line is uncommented for direct locking.
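As far as I understand the direct mode, virtlockd takes a POSIX fcntl() lock on the image file itself, so it depends on fcntl() locks behaving correctly across both NFS clients. The following is only a minimal sketch for checking that, not what virtlockd actually runs; the function name try_exclusive_lock is made up for illustration, and the probe should be pointed at a scratch file on the same export, never at a live image.

```python
import errno
import fcntl
import os


def try_exclusive_lock(path):
    """Try to take a non-blocking exclusive fcntl() lock on an existing file.

    Returns the open file descriptor if the lock was acquired (keep it open
    to keep holding the lock), or None if another process already holds a
    conflicting lock. Any other error is re-raised.
    """
    fd = os.open(path, os.O_RDWR)
    try:
        # lockf() is implemented on top of fcntl() record locks.
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError as exc:
        os.close(fd)
        if exc.errno in (errno.EACCES, errno.EAGAIN):
            return None
        raise
    return fd
```

To use it, call the function on h1 against a scratch file on the NFS export and keep the process alive, then call it on h2 against the same file. If locking works across the hosts, the call on h2 should return None while h1 still holds the descriptor.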
# libvirtd --version; python -c "import platform; print(platform.platform())"; virtlockd -V
libvirtd (libvirt) 3.2.0
Linux-3.10.0-693.2.2.el7.x86_64-x86_64-with-centos-7.4.1708-Core
virtlockd (libvirt) 3.2.0
# getenforce
Permissive
Here is the issue:
h1 # virsh list
Id Name State
----------------------------------------------------
1 test09 running
h1 # virsh domblklist test09
Target Source
------------------------------------------------
vda /storage_nfs/images_001/test09.qcow2
h1 #
h2 # virsh list
Id Name State
----------------------------------------------------
h2 # virsh list --all | grep test09
- test09 shut off
h2 # virsh start test09
error: Failed to start domain test09
error: resource busy: Lockspace resource '/storage_nfs/images_001/test09.qcow2' is locked
h2 # virsh list
Id Name State
----------------------------------------------------
h2 #
Before I try to start test09 on h2, I open a console to the guest running on h1 and watch what happens in it. As soon as I try to start test09 on h2 (and get the message about the locked resource), the following messages appear on the console and the VM's filesystem goes read-only:
on test09's console:
[ 567.394148] blk_update_request: I/O error, dev vda, sector 13296056
[ 567.395883] blk_update_request: I/O error, dev vda, sector 13296056
[ 572.871905] blk_update_request: I/O error, dev vda, sector 8654040
[ 572.872627] Aborting journal on device vda1-8.
[ 572.873978] blk_update_request: I/O error, dev vda, sector 8652800
[ 572.874707] Buffer I/O error on dev vda1, logical block 1081344, lost sync page write
[ 572.875472] blk_update_request: I/O error, dev vda, sector 2048
[ 572.876009] Buffer I/O error on dev vda1, logical block 0, lost sync page write
[ 572.876727] EXT4-fs error (device vda1): ext4_journal_check_start:56: Detected aborted journal
[ 572.878061] JBD2: Error -5 detected when updating journal superblock for vda1-8.
[ 572.878807] EXT4-fs (vda1): Remounting filesystem read-only
[ 572.879311] EXT4-fs (vda1): previous I/O error to superblock detected
[ 572.880937] blk_update_request: I/O error, dev vda, sector 2048
[ 572.881538] Buffer I/O error on dev vda1, logical block 0, lost sync page write
I also checked the guest's log:
-- /var/log/libvirt/qemu/test09.log --
block I/O error in device 'drive-virtio-disk0': Permission denied (13)
block I/O error in device 'drive-virtio-disk0': Permission denied (13)
block I/O error in device 'drive-virtio-disk0': Permission denied (13)
block I/O error in device 'drive-virtio-disk0': Permission denied (13)
block I/O error in device 'drive-virtio-disk0': Permission denied (13)
block I/O error in device 'drive-virtio-disk0': Permission denied (13)
block I/O error in device 'drive-virtio-disk0': Permission denied (13)
If it helps, here is the disk portion of an XML file:
<disk type='file' device='disk'>
<driver name='qemu' type='qcow2' cache='none'/>
<source file='/storage_nfs/images_001/test09.qcow2'/>
<backingStore/>
<target dev='vda' bus='virtio'/>
<alias name='virtio-disk0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
</disk>
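To double-check which paths the lock manager is concerned with, the file-backed disk sources can be pulled out of the domain XML (for example the output of virsh dumpxml). This is just a small sketch using Python's standard XML parser; the helper name file_disk_sources is hypothetical and it only looks at disks of type 'file'.

```python
import xml.etree.ElementTree as ET


def file_disk_sources(domain_xml):
    """Map target device name to source file for file-backed disks.

    Accepts either a full domain XML document or a single <disk> element.
    """
    root = ET.fromstring(domain_xml)
    sources = {}
    # iter() also yields the root itself when the root is a <disk> element.
    for disk in root.iter("disk"):
        if disk.get("type") != "file" or disk.get("device") != "disk":
            continue
        source = disk.find("source")
        target = disk.find("target")
        if source is not None and target is not None:
            sources[target.get("dev")] = source.get("file")
    return sources
```

For the disk element above this gives vda mapped to /storage_nfs/images_001/test09.qcow2, which matches the path in the "Lockspace resource ... is locked" error.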
I normally run SELinux in enforcing mode on hypervisors to isolate guests even further, but this time I set it to permissive mode just to rule SELinux out. The same thing happens when SELinux is in enforcing mode (virt_use_nfs is set to on in that case), and audit2why doesn't report any anomalies when parsing the audit logs.
I have also tried indirect locking, both via the same filer and with a separate export for the hashes, by uncommenting the following line in /etc/libvirt/qemu-lockd.conf:
file_lockspace_dir = "/var/lib/libvirt/lockd/files"
In this case the hashes are created as expected on the NFS export mounted under /var/lib/libvirt/lockd/files. I have also tried both QCOW2 and raw disk images for the VMs (and even XFS- and ext4-based guests), but the outcome is always the same. I have a couple of KVM books and consulted them on this topic, as well as the Red Hat and SUSE docs, but the configuration instructions are, naturally, pretty much the same everywhere. I saw that some colleagues posted a few emails to the list related to virtlock (e.g. https://www.redhat.com/archives/libvirt-users/2015-September/msg00004.html), but it does not seem to be the same issue. As a last resort, I also completely disabled SELinux, rebooted both hypervisors, created a new VM, and repeated all the steps listed above, with the same results.
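For anyone wanting to match a hash file to an image: my understanding from the libvirt documentation is that in indirect mode the lock file is named after the SHA256 checksum of the fully qualified image path. The sketch below rests on that assumption (in particular that the path is hashed as a plain string, with no trailing newline); the function name indirect_lock_path is made up for illustration.

```python
import hashlib
import os


def indirect_lock_path(image_path, lockspace_dir="/var/lib/libvirt/lockd/files"):
    """Return the lock file path expected for an image in indirect mode.

    Assumption: the lock file name is the hex SHA256 digest of the
    fully qualified image path.
    """
    digest = hashlib.sha256(image_path.encode("utf-8")).hexdigest()
    return os.path.join(lockspace_dir, digest)


if __name__ == "__main__":
    print(indirect_lock_path("/storage_nfs/images_001/test09.qcow2"))
```

If the printed name exists under the lockspace directory on both hypervisors, both hosts are resolving the image to the same lock file.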
Now, I am pretty sure that I am missing something simple here since this is a standard feature and should work out of the box if set correctly but so far I cannot see what I am missing.
I would really appreciate any tip/help.
Thank you very much!!
Regards,
Branimir