On 02/24/2015 08:05 PM, Christopher Pereira wrote:
Hi,
I'm chasing a bug related to libvirt not being able to resume a VM
paused because of an IO Error for images stored on a gluster volume.
The bug was reported in oVirt here:
https://bugzilla.redhat.com/show_bug.cgi?id=1058300#c39
To reproduce:
1) Run a VM on a gluster volume
2) Stop the gluster volume (VM will be paused)
3) Start the gluster volume
4) virsh resume VM (will fail)
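For anyone wanting to script steps 2 to 4, a rough Python sketch follows. The names `myvol` and `myvm` are placeholders, step 1 (a VM already running on the volume) is assumed, and `--mode=script` is only there to skip the gluster CLI's confirmation prompt. It defaults to a dry run that just prints the commands.

```python
import subprocess


def reproduction_commands(volume, domain):
    """Return the argv lists for steps 2-4 of the reproduction."""
    return [
        # 2) Stop the gluster volume; the running VM gets paused on I/O error.
        ["gluster", "--mode=script", "volume", "stop", volume],
        # 3) Start the gluster volume again.
        ["gluster", "volume", "start", volume],
        # 4) Try to resume the paused VM; this is the step reported to fail.
        ["virsh", "resume", domain],
    ]


def run_reproduction(volume, domain, dry_run=True):
    """Print each command and, unless dry_run is set, execute it."""
    results = []
    for argv in reproduction_commands(volume, domain):
        print(" ".join(argv))
        if not dry_run:
            # check=False: the final resume is expected to fail.
            results.append(subprocess.run(argv, check=False).returncode)
    return results


if __name__ == "__main__":
    run_reproduction("myvol", "myvm")  # placeholder names, dry run only
```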
That's not a libvirt bug, but a management bug. You can't just stop a
filesystem while some process has an open file handle (in this case, the
process is qemu, not libvirt), and still expect the handle to come back
to life when the filesystem is present again, unless the filesystem is
something like NFS that has super-long timeouts built into it to survive
temporary outages while still keeping the same handle alive.
QEMU logs will say:
block I/O error in device 'drive-virtio-disk0': Transport endpoint is
not connected (107)
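The `(107)` in that message is the errno value, which on Linux is `ENOTCONN`. A quick way to check this from Python is sketched below. The helper name `is_not_connected` is made up for illustration, and the numeric value is platform-specific.

```python
import errno
import os


def is_not_connected(exc):
    """True if exc is an OSError carrying ENOTCONN (107 on Linux)."""
    return isinstance(exc, OSError) and exc.errno == errno.ENOTCONN


if __name__ == "__main__":
    # Numeric value and symbolic name of ENOTCONN on this platform.
    print(errno.ENOTCONN, errno.errorcode[errno.ENOTCONN])
    # Human-readable message the C library associates with it.
    print(os.strerror(errno.ENOTCONN))
```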
Yeah, because you yanked the file out from under qemu's feet, and gluster
apparently doesn't have the ability to revive the handle when the
connection comes back online.
Libvirt can't know that you need to reopen the file. If there are any
patches to be made, they need to be in gluster (to revive an existing
handle rather than requiring clients to reopen) or in qemu (to attempt
reopening a handle when gluster reports no connection), not in libvirt.
At least, that's true as long as libvirt still hasn't implemented
fd-passing as the way to hand files to qemu (right now, it is qemu that
opens the gluster handle, not libvirt).
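To make the "reopen the handle" idea concrete, here is a minimal sketch of the retry a client would need: on `ENOTCONN`, discard the stale handle, open a fresh one, and retry once. This is not qemu or gluster code. `ReopeningReader` and its `opener` parameter are invented for the example, and any file-like object stands in for the image.

```python
import errno


class ReopeningReader:
    """Reads from a handle, reopening it once if it reports ENOTCONN."""

    def __init__(self, opener):
        # opener: zero-argument callable returning a fresh file-like handle.
        self._opener = opener
        self._handle = opener()

    def read_at(self, offset, length):
        try:
            return self._read(offset, length)
        except OSError as exc:
            if exc.errno != errno.ENOTCONN:
                raise
            # The old handle is dead; discard it and open a new one.
            try:
                self._handle.close()
            except OSError:
                pass
            self._handle = self._opener()
            # Retry exactly once; a second failure propagates to the caller.
            return self._read(offset, length)

    def _read(self, offset, length):
        self._handle.seek(offset)
        return self._handle.read(length)
```

With a real file this might be used as `ReopeningReader(lambda: open(path, "rb")).read_at(0, 512)`, where `path` is whatever image the caller has. The point is that only the process holding the handle can do this, which is why the fix does not belong in libvirt.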
And the VM will not be resumed.
My guess is that libvirt is not telling QEMU to reopen its file
descriptors.
Can someone please confirm or fix?
It's not libvirt's fault. Don't yank a gluster volume when clients are
still using it.
--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library
http://libvirt.org