On Mon, Nov 14, 2011 at 01:34:15PM +0200, Michael S. Tsirkin wrote:
On Mon, Nov 14, 2011 at 11:29:18AM +0000, Daniel P. Berrange wrote:
> On Mon, Nov 14, 2011 at 12:21:53PM +0100, Kevin Wolf wrote:
> > Am 14.11.2011 12:08, schrieb Daniel P. Berrange:
> > > On Mon, Nov 14, 2011 at 12:24:22PM +0200, Michael S. Tsirkin wrote:
> > >> On Mon, Nov 14, 2011 at 10:16:10AM +0000, Daniel P. Berrange wrote:
> > >>> On Sat, Nov 12, 2011 at 12:25:34PM +0200, Avi Kivity wrote:
> > >>>> On 11/11/2011 12:15 PM, Kevin Wolf wrote:
> > >>>>> Am 10.11.2011 22:30, schrieb Anthony Liguori:
> > >>>>>> Live migration with qcow2 or any other image format is just not
> > >>>>>> going to work right now even with proper clustered storage. I think
> > >>>>>> doing a block level flush cache interface and letting block devices
> > >>>>>> decide how to do it is the best approach.
> > >>>>>
> > >>>>> I would really prefer reusing the existing open/close code. It means
> > >>>>> less (duplicated) code, reuses code that is already well tested, and
> > >>>>> doesn't make migration much of a special case.
> > >>>>>
> > >>>>> If you want to avoid reopening the file on the OS level, we can reopen
> > >>>>> only the topmost layer (i.e. the format, but not the protocol) for now
> > >>>>> and in 1.1 we can use bdrv_reopen().
> > >>>>>
> > >>>>
> > >>>> Intuitively I dislike _reopen style interfaces. If the second open
> > >>>> yields different results from the first, does it invalidate any
> > >>>> computations in between?
> > >>>>
> > >>>> What's wrong with just delaying the open?
> > >>>
> > >>> If you delay the 'open' until the mgmt app issues 'cont', then you lose
> > >>> the ability to roll back to the source host upon open failure for most
> > >>> deployed versions of libvirt. We only fairly recently switched to a
> > >>> five-stage migration handshake to cope with rollback when 'cont' fails.
> > >>>
> > >>> Daniel
> > >>
> > >> I guess reopen can fail as well, so this seems to me to be an important
> > >> fix but not a blocker.
> > >
> > > If the initial open succeeds, then it is far more likely that a later
> > > re-open will succeed too, because you have already eliminated the
> > > possibility of configuration mistakes, and will have caught most storage
> > > runtime errors too. So there is a very significant difference in
> > > reliability between doing an 'open at startup + reopen at cont' vs just
> > > 'open at cont'.
> > >
> > > Based on the bug reports I see, we want to be very good at detecting and
> > > gracefully handling open errors because they are pretty frequent.
> >
> > Do you have some more details on the kind of errors? Missing files,
> > permissions, something like this? Or rather something related to the
> > actual content of an image file?
>
> Missing files due to wrong/missing NFS mounts, or incorrect SAN / iSCSI
> setup. Access permissions due to incorrect user / group setup, or read
> only mounts, or SELinux denials. Actual I/O errors are less common and
> are not so likely to cause QEMU to fail to start anyway, since QEMU is
> likely to just report them to the guest OS instead.

Do you run qemu with -S, then give a 'cont' command to start it?

Yes

Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|