[libvirt-users] Source of Qcow2 Image Corruption

14 Aug 2012

Hello,

I have two KVM virtual machine nodes in a high-availability cluster using Pacemaker + Heartbeat on Ubuntu 10.04 Server amd64. This cluster hosts a single Ubuntu 10.04 VM which uses a qcow2 image file, myvm.qcow2, with a backing file, backingfile.qcow2. This morning, the VM suddenly powered off. I attempted to start it again with virsh start domain, but it would only start briefly and then power off again. I checked the qcow2 disk image and found countless corruption errors: 

root@vmhost:/mnt/storage/vmstore/disks# qemu-img info myvm.qcow2 
image: myvm.qcow2 
file format: qcow2 
virtual size: 9.8G (10485760000 bytes) 
disk size: 13G 
cluster_size: 65536 
backing file: backingfile.qcow2 (actual path: backingfile.qcow2) 
Snapshot list: 
ID TAG VM SIZE DATE VM CLOCK 
1.5G 2056-05-05 21:01:212795663:45:42.642 
/archive/1006/20100627000/2il_root/save/archive/1002/20100204005/1 743M 1995-08-16 12:47:352289751:06:20.183 
root@vmhost:/mnt/storage/vmstore/disks# qemu-img check myvm.qcow2 2>&1 | head 
ERROR OFLAG_COPIED: offset=80000002047d0000 refcount=0 
ERROR OFLAG_COPIED: offset=8000000212e50000 refcount=0 
ERROR OFLAG_COPIED: offset=80000001ffde0000 refcount=0 
ERROR OFLAG_COPIED: offset=80000001ff710000 refcount=0 
ERROR OFLAG_COPIED: offset=8000000216ec0000 refcount=0 
ERROR OFLAG_COPIED: offset=8000000206db0000 refcount=0 
ERROR OFLAG_COPIED: offset=80000001ff720000 refcount=0 
ERROR OFLAG_COPIED: offset=80000001ffdf0000 refcount=0 
ERROR OFLAG_COPIED: offset=8000000212e60000 refcount=0 
ERROR OFLAG_COPIED: offset=8000000212e70000 refcount=0 
root@vmhost:/mnt/storage/vmstore/disks# qemu-img info backingfile.qcow2 
image: backingfile.qcow2 
file format: qcow2 
virtual size: 9.8G (10485760000 bytes) 
disk size: 4.8G 
cluster_size: 65536 
root@vmhost:/mnt/storage/vmstore/disks# qemu-img check backingfile.qcow2 
No errors were found on the image. 
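
Because the check output above is truncated by head, a tally of the full output by error type would show whether OFLAG_COPIED is the only kind of error reported. The following is a minimal sketch, not a tested diagnostic: summarize_check is a hypothetical helper name, and it assumes every error line starts with the word ERROR followed by the error type and a colon, as in the lines shown above.

```shell
#!/bin/sh
# Hypothetical helper: summarize `qemu-img check` output read from stdin.
# Assumption: error lines look like "ERROR <TYPE>: ...", as in the output above.
# Prints one line per error type in the form "<TYPE> <count>", sorted by type.
summarize_check() {
    awk '$1 == "ERROR" {
             type = $2
             sub(/:$/, "", type)   # strip the trailing colon from the type
             count[type]++
         }
         END {
             for (t in count) printf "%s %d\n", t, count[t]
         }' | sort
}

# Example usage (assumes qemu-img is on PATH and the image is in the
# current directory):
#   qemu-img check myvm.qcow2 2>&1 | summarize_check
```

Lines that do not start with ERROR (for example the final summary line) are ignored, so the function can be fed the complete, untruncated output.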

If I use qemu-img to convert the image, the resulting image is "clean": 
# qemu-img convert myvm.qcow2 -O qcow2 /tmp/test.qcow2 
# qemu-img check /tmp/test.qcow2 
No errors were found on the image. 

The same kind of corruption happened a month ago to a different VM on the same machine, but on a different physical drive, so I do not believe it is caused by a physical disk failure. I can find nothing in /var/log that gives any more information about this corruption. What other debug information can I provide to help diagnose why these images are getting corrupted and taking running VMs offline? 

Thanks, 

Andrew Martin