Re: [libvirt] [Qemu-devel] qemu and qemu.git -> Migration + disk stress introduces qcow2 corruptions

14 Nov 2011


On Mon, Nov 14, 2011 at 11:08:02AM +0000, Daniel P. Berrange wrote:
...
On Mon, Nov 14, 2011 at 12:24:22PM +0200, Michael S. Tsirkin wrote:
...
On Mon, Nov 14, 2011 at 10:16:10AM +0000, Daniel P. Berrange wrote:
...
On Sat, Nov 12, 2011 at 12:25:34PM +0200, Avi Kivity wrote:
...
On 11/11/2011 12:15 PM, Kevin Wolf wrote:
...
Am 10.11.2011 22:30, schrieb Anthony Liguori:
...
Live migration with qcow2 or any other image format is just not going to work
right now, even with proper clustered storage.  I think adding a block-level
flush-cache interface and letting block devices decide how to do it is the best approach.
I would really prefer reusing the existing open/close code. It means
less (duplicated) code, is existing code that is well tested and doesn't
make migration much of a special case.
If you want to avoid reopening the file on the OS level, we can reopen
only the topmost layer (i.e. the format, but not the protocol) for now
and in 1.1 we can use bdrv_reopen().
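
To illustrate the "reopen only the topmost layer" idea, here is a minimal
Python sketch. It is a toy model, not QEMU code: ProtocolLayer, FormatLayer,
reopen_format_only and the "l1_table" key are hypothetical names chosen for
illustration. It assumes the format layer caches metadata at open time, so a
format-level close/open refreshes that cache without touching the OS-level
file handle.

```python
class ProtocolLayer:
    """Hypothetical stand-in for the protocol layer (the OS-level file)."""

    def __init__(self, storage):
        self.storage = storage   # shared dict standing in for the image file
        self.open_count = 1      # the file is opened exactly once


class FormatLayer:
    """Hypothetical stand-in for the format layer, which caches metadata."""

    def __init__(self, protocol):
        self.protocol = protocol
        self.metadata = None
        self.open()

    def open(self):
        # Metadata is snapshotted at open time; this is what can go stale.
        self.metadata = dict(self.protocol.storage)

    def close(self):
        self.metadata = None

    def reopen_format_only(self):
        # Reuse the existing close/open paths; the protocol layer is untouched.
        self.close()
        self.open()


shared = {"l1_table": "v1"}
dest = FormatLayer(ProtocolLayer(shared))  # destination opens at startup
shared["l1_table"] = "v2"                  # source keeps writing meanwhile
stale = dest.metadata["l1_table"]          # destination still sees old data
dest.reopen_format_only()
fresh = dest.metadata["l1_table"]          # cache refreshed after reopen
```

The point of the sketch is only that the refresh goes through the same
open/close code as a normal open, rather than a migration-specific path.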
Intuitively I dislike _reopen style interfaces.  If the second open
yields different results from the first, does it invalidate any
computations in between?
What's wrong with just delaying the open?
If you delay the 'open' until the mgmt app issues 'cont', then you lose
the ability to roll back to the source host upon open failure for most
deployed versions of libvirt. We only fairly recently switched to a five
stage migration handshake to cope with rollback when 'cont' fails.
Daniel
I guess reopen can fail as well, so this seems to me to be an important
fix but not a blocker.
If the initial open succeeds, then it is far more likely that a later
re-open will succeed too, because you have already eliminated the possibility
of configuration mistakes, and will have caught most storage runtime errors
too. So there is a very significant difference in reliability between doing
an 'open at startup + reopen at cont' vs just 'open at cont'.
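
The difference between the two strategies can be sketched with a toy model.
This is an illustrative Python sketch only; open_image, migrate and the
outcome strings are hypothetical and do not correspond to QEMU or libvirt
APIs. It assumes a configuration mistake shows up as a failing open, and
shows at which stage each strategy would detect it.

```python
class OpenError(Exception):
    """Raised when the (hypothetical) image open fails."""


def open_image(path, available):
    # A path missing from 'available' models a configuration mistake.
    if path not in available:
        raise OpenError(path)
    return path


def migrate(path, available, open_early):
    """Return (outcome, stage at which the failure was detected, or None)."""
    if open_early:
        try:
            open_image(path, available)    # open at destination startup
        except OpenError:
            # Detected before any state has moved: nothing to roll back.
            return ("guest keeps running on source", "startup")

    # ... guest RAM and device state would be transferred here ...

    try:
        open_image(path, available)        # reopen (or first open) at 'cont'
    except OpenError:
        # Detected only after the transfer: the mgmt app must roll back.
        return ("rollback to source required", "cont")
    return ("guest running on destination", None)


images = {"/images/guest.qcow2"}
ok = migrate("/images/guest.qcow2", images, open_early=True)
early = migrate("/images/typo.qcow2", images, open_early=True)
late = migrate("/images/typo.qcow2", images, open_early=False)
```

With the early open, the mistyped path is reported at startup; without it,
the same mistake only surfaces at 'cont', after the state transfer.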
Based on the bug reports I see, we want to be very good at detecting and
gracefully handling open errors because they are pretty frequent.
Regards,
Daniel
IIUC, the 'cont' that we were discussing is the startup of the VM
at destination after migration completes. A failure results in
migration failure, which libvirt has been able to handle since forever.
In the case of the 'cont' command on the source upon migration failure,
qemu was running there previously, so it's likely the configuration is OK.

Am I confused? If not, libvirt seems unaffected.
...