Hello,
I have two KVM virtual machine nodes in a high-availability cluster using Pacemaker +
Heartbeat on Ubuntu 10.04 Server amd64. This cluster hosts a single Ubuntu 10.04 VM which
uses a qcow2 image file, myvm.qcow2, with a backing file, backingfile.qcow2. This morning,
the VM suddenly powered off. I attempted to start it again with virsh start domain, but it
would run only briefly and then power off again. I checked the qcow2 disk image with
qemu-img and found a large number of corruption errors:
root@vmhost:/mnt/storage/vmstore/disks# qemu-img info myvm.qcow2
image: myvm.qcow2
file format: qcow2
virtual size: 9.8G (10485760000 bytes)
disk size: 13G
cluster_size: 65536
backing file: backingfile.qcow2 (actual path: backingfile.qcow2)
Snapshot list:
ID TAG VM SIZE DATE VM CLOCK
1.5G 2056-05-05 21:01:212795663:45:42.642
/archive/1006/20100627000/2il_root/save/archive/1002/20100204005/1 743M 1995-08-16 12:47:352289751:06:20.183
root@vmhost:/mnt/storage/vmstore/disks# qemu-img check myvm.qcow2 2>&1 | head
ERROR OFLAG_COPIED: offset=80000002047d0000 refcount=0
ERROR OFLAG_COPIED: offset=8000000212e50000 refcount=0
ERROR OFLAG_COPIED: offset=80000001ffde0000 refcount=0
ERROR OFLAG_COPIED: offset=80000001ff710000 refcount=0
ERROR OFLAG_COPIED: offset=8000000216ec0000 refcount=0
ERROR OFLAG_COPIED: offset=8000000206db0000 refcount=0
ERROR OFLAG_COPIED: offset=80000001ff720000 refcount=0
ERROR OFLAG_COPIED: offset=80000001ffdf0000 refcount=0
ERROR OFLAG_COPIED: offset=8000000212e60000 refcount=0
ERROR OFLAG_COPIED: offset=8000000212e70000 refcount=0
root@vmhost:/mnt/storage/vmstore/disks# qemu-img info backingfile.qcow2
image: backingfile.qcow2
file format: qcow2
virtual size: 9.8G (10485760000 bytes)
disk size: 4.8G
cluster_size: 65536
root@vmhost:/mnt/storage/vmstore/disks# qemu-img check backingfile.qcow2
No errors were found on the image.
If I use qemu-img to convert the image, the resulting image is "clean":
# qemu-img convert myvm.qcow2 -O qcow2 /tmp/test.qcow2
# qemu-img check /tmp/test.qcow2
No errors were found on the image.
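Because the check output for myvm.qcow2 was truncated with head, it shows only the first ten
errors. Tallying the errors by type may give a fuller picture of the damage. The following is
a minimal sketch, assuming every error line has the form "ERROR <TYPE>: ..." as in the sample
above. summarize_check is a hypothetical helper name and is not part of qemu-img.

```shell
#!/bin/sh
# Hypothetical helper: tally "ERROR <TYPE>:" lines from qemu-img check output.
# Reads the check output on stdin and prints one "<count> <type>" line per
# error type, sorted by type name. Lines that do not start with ERROR are ignored.
summarize_check() {
    awk '$1 == "ERROR" { sub(/:$/, "", $2); n[$2]++ }
         END { for (t in n) print n[t], t }' | sort -k2
}

# Usage (assumes qemu-img is installed and the image path is correct):
#   qemu-img check myvm.qcow2 2>&1 | summarize_check
```

If the image reports only the error type shown above, this prints a single line with the
total count next to OFLAG_COPIED. A clean image produces no output.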
The same kind of corruption happened a month ago to a different VM on the same machine, but
on a different physical drive, so I do not believe a physical disk failure is the cause. I
can find nothing in /var/log that gives any more information about this corruption. What
other debug information can I provide to help diagnose why these images are becoming
corrupted and taking the running VMs offline?
Thanks,
Andrew Martin