Re: [libvirt] [Qemu-devel] qemu and qemu.git -> Migration + disk stress introduces qcow2 corruptions

Monday, 14 November 2011

On Mon, Nov 14, 2011 at 11:37:27AM +0000, Daniel P. Berrange wrote:
...
 On Mon, Nov 14, 2011 at 01:34:15PM +0200, Michael S. Tsirkin wrote:
 > On Mon, Nov 14, 2011 at 11:29:18AM +0000, Daniel P. Berrange wrote:
 > > On Mon, Nov 14, 2011 at 12:21:53PM +0100, Kevin Wolf wrote:
 > > > Am 14.11.2011 12:08, schrieb Daniel P. Berrange:
 > > > > On Mon, Nov 14, 2011 at 12:24:22PM +0200, Michael S. Tsirkin wrote:
 > > > >> On Mon, Nov 14, 2011 at 10:16:10AM +0000, Daniel P. Berrange wrote:
 > > > >>> On Sat, Nov 12, 2011 at 12:25:34PM +0200, Avi Kivity wrote:
 > > > >>>> On 11/11/2011 12:15 PM, Kevin Wolf wrote:
 > > > >>>>> Am 10.11.2011 22:30, schrieb Anthony Liguori:
 > > > >>>>>> Live migration with qcow2 or any other image format is just not going to work
 > > > >>>>>> right now even with proper clustered storage.  I think doing a block level flush
 > > > >>>>>> cache interface and letting block devices decide how to do it is the best approach.
 > > > >>>>>
 > > > >>>>> I would really prefer reusing the existing open/close code. It means
 > > > >>>>> less (duplicated) code, is existing code that is well tested and doesn't
 > > > >>>>> make migration much of a special case.
 > > > >>>>>
 > > > >>>>> If you want to avoid reopening the file on the OS level, we can reopen
 > > > >>>>> only the topmost layer (i.e. the format, but not the protocol) for now
 > > > >>>>> and in 1.1 we can use bdrv_reopen().
 > > > >>>>>
 > > > >>>>
 > > > >>>> Intuitively I dislike _reopen style interfaces.  If the second open
 > > > >>>> yields different results from the first, does it invalidate any
 > > > >>>> computations in between?
 > > > >>>>
 > > > >>>> What's wrong with just delaying the open?
 > > > >>>
 > > > >>> If you delay the 'open' until the mgmt app issues 'cont', then you lose
 > > > >>> the ability to roll back to the source host upon open failure for most
 > > > >>> deployed versions of libvirt. We only fairly recently switched to a five
 > > > >>> stage migration handshake to cope with rollback when 'cont' fails.
 > > > >>>
 > > > >>> Daniel
 > > > >>
 > > > >> I guess reopen can fail as well, so this seems to me to be an important
 > > > >> fix but not a blocker.
 > > > > 
 > > > > If the initial open succeeds, then it is far more likely that a later
 > > > > re-open will succeed too, because you have already eliminated the possibility
 > > > > of configuration mistakes, and will have caught most storage runtime errors
 > > > > too. So there is a very significant difference in reliability between doing
 > > > > an 'open at startup + reopen at cont' vs just 'open at cont'.
 > > > > 
 > > > > Based on the bug reports I see, we want to be very good at detecting and
 > > > > gracefully handling open errors because they are pretty frequent.
 > > > 
 > > > Do you have some more details on the kind of errors? Missing files,
 > > > permissions, something like this? Or rather something related to the
 > > > actual content of an image file?
 > > 
 > > Missing files due to wrong/missing NFS mounts, or incorrect SAN / iSCSI
 > > setup. Access permissions due to incorrect user / group setup, or read
 > > only mounts, or SELinux denials. Actual I/O errors are less common and
 > > are not so likely to cause QEMU to fail to start anyway, since QEMU is
 > > likely to just report them to the guest OS instead.
 > 
 > Do you run qemu with -S, then give a 'cont' command to start it?

 Yes

 Daniel

Probably in an attempt to improve reliability :)

So this is in fact unrelated to migration.  So we can either ignore this
bug (assuming no distros ship cutting edge qemu with an old libvirt), or
special-case -S and do an open/close cycle on startup.
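
As a rough illustration of the second option, an open/close probe at startup might look like the Python sketch below. This is not QEMU code: the function names `probe_image` and `probe_all` are hypothetical, and the probe only checks that each image can be opened with the requested access mode. That covers the kind of failure described above (missing mounts, wrong permissions, read-only filesystems) but does not look at the image format at all.

```python
import os
import tempfile


def probe_image(path, writable=True):
    """Open and immediately close `path`.

    Returns None on success, or a human-readable error string on failure.
    Hypothetical helper: it stands in for an open/close cycle done at
    startup when the guest is launched paused (the -S case).
    """
    flags = os.O_RDWR if writable else os.O_RDONLY
    try:
        fd = os.open(path, flags)
    except OSError as exc:
        return f"{path}: {exc}"
    os.close(fd)
    return None


def probe_all(paths, writable=True):
    """Probe every image and return the list of error strings (empty if all opened)."""
    errors = []
    for path in paths:
        err = probe_image(path, writable)
        if err is not None:
            errors.append(err)
    return errors


if __name__ == "__main__":
    # Demo with one file that exists and one that does not.
    fd, good = tempfile.mkstemp(suffix=".img")
    os.close(fd)
    try:
        for line in probe_all([good, good + ".missing"]):
            print("startup probe failed:", line)
    finally:
        os.remove(good)
```

The point of probing early is that a failure is reported while rollback to the source host is still straightforward, rather than only when 'cont' is issued.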

...
 -- 
 |: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
 |: http://libvirt.org              -o-             http://virt-manager.org :|
 |: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
 |: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :| 
