Re: [libvirt] [RFC] Proposed API to support block device streaming

Monday, 15 November 2010

On Mon, Nov 15, 2010 at 1:05 PM, Daniel P. Berrange <berrange(a)redhat.com> wrote:
...
 On Wed, Nov 10, 2010 at 08:45:20AM -0600, Adam Litke wrote:
> On Wed, 2010-11-10 at 11:33 +0000, Daniel P. Berrange wrote:
> > On Tue, Nov 09, 2010 at 03:17:23PM -0600, Adam Litke wrote:
> > > I've been working with Anthony Liguori and Stefan Hajnoczi to enable data
> > > streaming to copy-on-read disk images in qemu.  This work is working its way
> > > through peer review and I expect it to be upstream soon as part of the support
> > > for the new QED disk image format.
> > >
> > > I would like to enable these commands in libvirt in order to support at least
> > > two compelling use cases:
> > >
> > > 1) Rapid deployment of domains:
> > > Creating a new domain from a central repository of images can be time consuming
> > > since a local copy of the image must be made before the domain can be started.
> > > With copy-on-read and streaming, up-front copy time is eliminated and the
> > > domain can be started immediately.  Streaming can run while the domain runs
> > > to fully populate the disk image.
> > >
> > > 2) Post-copy live block migration:
> > > A qemu-nbd server is started on the source host and serves the domain's block
> > > device to the destination host.  A QED image is created on the destination host
> > > with backing to the nbd server.  The domain is migrated as normal.  When
> > > migration completes, a stream command is executed to fully populate the
> > > destination QED image.  After streaming completes, the qemu-nbd server can
> > > be shut down and the domain (including local storage) is fully independent of
> > > the source host.
> > >
> > > Qemu will support two streaming modes: full device and single sector.  Full
> > > device streaming is the easiest to use because one command will cause the whole
> > > device to be streamed as fast as possible.  Single sector mode can be used if
> > > one wants to throttle streaming to reduce I/O pressure.  In this mode, the user
> > > issues individual commands to stream single sectors.
> > >
> > > To enable this support in libvirt, I propose the following API...
> > >
> > > virDomainStreamDisk() initiates either a full device stream or a single sector
> > > stream (depending on virDomainStreamDiskFlags).  For a full device stream, it
> > > returns either 0 or -1.  For a single sector stream, it returns an offset that
> > > can be used to continue streaming with a subsequent call to virDomainStreamDisk().
> > >
> > > virDomainStreamDiskInfo() returns the status of a currently-running full device
> > > stream (the device name, current streaming position, and total size).
> > >
> > > Comments on this design would be greatly appreciated.  Thanks!
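
To make the single sector mode concrete, here is a rough sketch of how a
client might throttle streaming by feeding the returned offset back into the
next call.  It is a toy model, not the proposed API: stream_one_sector() is a
hypothetical stand-in for a single-sector virDomainStreamDisk() call, and the
512-byte sector size, the dict-based disk, and the "stop at end of device"
condition are all assumptions made for illustration.

```python
import time

SECTOR_SIZE = 512  # assumption: bytes per sector


def stream_one_sector(disk, offset):
    """Hypothetical stand-in for a single-sector virDomainStreamDisk() call.

    Marks the sector at `offset` as populated and returns the offset to pass
    to the next call, or -1 on error.
    """
    if offset < 0 or offset >= disk["size"]:
        return -1
    disk["populated"].add(offset // SECTOR_SIZE)
    return offset + SECTOR_SIZE


def stream_throttled(disk, sectors_per_second, sleep=time.sleep):
    """Stream the whole disk one sector at a time, pausing between calls."""
    offset = 0
    delay = 1.0 / sectors_per_second
    while offset < disk["size"]:
        offset = stream_one_sector(disk, offset)
        if offset < 0:
            raise IOError("streaming failed")
        sleep(delay)  # the pause is what limits I/O pressure
    return offset
```

The caller owns the rate limit here, so the management layer can slow down
or speed up streaming without any extra knob in the API.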
> >
> > I'm finding it hard to say whether these APIs are suitable or not
> > because I can't see what this actually maps to in terms of
> > implementation.
>
> Please see the qemu driver piece that I will post as a reply to this
> email.  Since I am not looking for any particular code review at this
> point I decided not to post the whole series.  But I would be happy to
> do so.

 I'm not too worried about the code, I just wanted to understand what
 logical set of QEMU operations it maps to.

> > Do these calls need to be run before the QEMU process is started,
> > or after QEMU is already running ?
>
> Streaming requires a running domain and runs concurrently.

 What if you have a disk image and want to activate streaming
 without running a VM ? eg, so you can ensure the image is
 fully downloaded to the host and thus avoid a runtime problem
 which would result in IO error for the guest

The following would solve that use case:
qemu-img stream <filename>

...

> > Does the path in the arg actually need to exist on disk before
> > streaming begins, or do these APIs create the image too ?
>
> The path actually refers to the alias of the currently attached disk
> (which must be a copy-on-read disk).  For example: 'drive-virtio-disk0'.
> When started, the stream command will populate the local image file with
> blocks from the backing file until the local file is complete and the
> backing_file link can be broken.

 NB, libvirt intentionally doesn't expose the device backend
 aliases in the API. So this should refer to the device
 alias which is included in the XML.

> > If we're streaming the whole disk, is there a way to cancel/abort
> > it early ?
>
> I was thinking of adding another mode flag for this:
> VIR_STREAM_DISK_CANCEL
>
> > What happens if qemu-nbd dies before streaming is complete ?
>
> Bad things.  Same as if you deleted a qcow2 backing file.

 So a migration lifecycle based on this design has a pretty
 dangerous failure mode. The guest can lose access to the
 NBD server before the disk copy is complete, and we'd be
 unable to switch back to the original QEMU instance since
 the target has already started dirtying memory which has
 invalidated the source.

This is similar to the scenario where you base images off a master
image on NFS and lose connectivity to the NFS server.

There may be no issue if a backing read error is hit during streaming
but the guest doesn't access that region of the disk.  Streaming could
be unable to make progress while the guest continues to run
successfully within its disk working set.

The uglier case is when the guest reads the backing file and we are
unable to access it.  We can pause the guest (like for ENOSPC) and
wait for manual intervention but this is a big hammer.  We can return
I/O errors to the guest, allowing it to make progress but possibly
causing its workload to fail.
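
The two options can be sketched as a policy on the guest read path.  This is
a simplified model under stated assumptions, not qemu code: the local image
is a dict of cluster number to data, backing_read is any callable that raises
OSError when the backing file is unreachable, and the OnBackingError and
GuestPaused names are made up for illustration.

```python
from enum import Enum


class OnBackingError(Enum):
    PAUSE = "pause"    # stop the guest and wait for manual intervention
    REPORT = "report"  # hand the I/O error back to the guest


class GuestPaused(Exception):
    """Raised in this model when the guest would be paused."""


def guest_read(local, backing_read, cluster, policy):
    """Serve a guest read from a copy-on-read image."""
    if cluster in local:
        # Already populated: the backing file is not needed at all.
        return local[cluster]
    try:
        data = backing_read(cluster)
    except OSError:
        if policy is OnBackingError.PAUSE:
            raise GuestPaused(cluster)
        raise  # REPORT: the guest sees the I/O error
    local[cluster] = data  # copy-on-read: keep the data locally
    return data
```

Note that a read of an already-populated cluster never touches the backing
file, which is why a guest that stays within its working set keeps running.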

It is safe to restart streaming on the destination host after a
failure (e.g. power outage).  The image will continue streaming where
it left off.
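
A minimal model of why restarting is safe, assuming the image itself records
which clusters have already been populated (the dict below stands in for that
on-disk state; the function and parameter names are illustrative only):

```python
def stream(local, backing_read, num_clusters):
    """Populate every cluster missing from `local`.

    Returns the number of clusters copied in this run.  Clusters that are
    already present are skipped, so a rerun after a failure only copies
    what is still missing.
    """
    copied = 0
    for cluster in range(num_clusters):
        if cluster in local:
            continue
        local[cluster] = backing_read(cluster)
        copied += 1
    return copied
```

If a run dies partway through, calling stream() again on the same image
picks up with the first cluster that is not yet local.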

There needs to be a way to bring up qemu-nbd easily again if the
source host fails.

Stefan
