On Thu, Apr 07, 2011 at 04:31:58PM -0500, Adam Litke wrote:
I've been working with Anthony Liguori and Stefan Hajnoczi to enable data
streaming to copy-on-read disk images in qemu. This work is working its way
through review and I expect it to be upstream soon as part of the support for
the new QED disk image format.
Disk streaming is extremely useful when provisioning domains from a central
repository of template images. Currently the domain must be provisioned by
either: 1) copying the template image to local storage before the VM can be
started or, 2) creating a qcow2 image that backs to a base image in the remote
repository. Option 1 can introduce a significant delay when provisioning large
disks. Option 2 introduces a permanent dependency on a remote service and
increased network load to satisfy disk reads.
So the scenario we have is a thin-provisioned disk image, with a backing store
of some kind (whether a local image or an NBD server doesn't matter). The
goal is to allocate blocks in the disk image, to change it from being
thin-provisioned, to less-thin, or even fully-allocated. QEMU may be running
while this is done (requiring online copy by the QEMU process via the monitor)
or shut off (requiring offline copy with qemu-img commands).
What strikes me is that, from an API design POV, there is really no compelling
reason to restrict this to disk images with backing stores. Any disk volume
which is thin-provisioned can benefit from this, i.e. instead of copying blocks
of data from the backing store, just write blocks of zeros into unallocated
regions of the disk.
So a mgmt app can start a VM with a sparse raw file, with host storage
overcommit across all VMs, and if they later need to provide a strong
guarantee for storage allocation to a particular VM, this API can be used,
regardless of whether a backing store is present.
Qemu will support two streaming modes: full device and single sector. Full
device streaming is the easiest to use because one command will cause the whole
device to be streamed as fast as possible. Single sector mode can be used if
one wants to throttle streaming to reduce I/O pressure. In this mode, a
management tool issues individual commands to stream single sectors.
This design is needlessly restrictive IMHO - special casing the two
extremes, and not providing any intermediate capabilities. The API
should just take an offset and a length. This trivially allows for
a single sector, multiple sectors, or all sectors.
The API should also be using bytes, not sectors. Sectors are a very
ill-defined unit of measurement, with lots of potential meanings.
It could be the sector size of the underlying block device, filesystem
block size, the cluster size of the virtual disk file format, or
sector size of the virtual block device. Using bytes, specifying
the logical offset + length of the virtual disk image is clear.
In addition, all the other libvirt storage APIs use bytes, and we
want this to be consistent with them. If the internal implementation
wants to convert from bytes to sectors & round up/down to nearest
sector boundary, then that is fine - just don't expose it in the API.
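To illustrate the rounding I have in mind, here is a minimal sketch of how an
implementation might map a byte offset + length onto whole sectors internally.
The struct and function names are hypothetical (nothing here exists in libvirt
or QEMU), and it assumes offset + length does not overflow 64 bits:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type, not part of libvirt or QEMU:
 * a run of 'count' sectors starting at sector index 'first'. */
struct sector_range {
    uint64_t first;
    uint64_t count;
};

/* Convert a byte offset + length into the smallest sector range that
 * covers it: the start is rounded down, the end is rounded up.
 * Assumes offset + length does not overflow. */
static struct sector_range
bytes_to_sectors(uint64_t offset, uint64_t length, uint64_t sector_size)
{
    struct sector_range r = { 0, 0 };

    assert(sector_size > 0);
    if (length == 0)
        return r;                     /* empty request touches no sectors */

    uint64_t first = offset / sector_size;                /* round down */
    uint64_t last = (offset + length - 1) / sector_size;  /* last sector touched */

    r.first = first;
    r.count = last - first + 1;
    return r;
}
```

With a 512 byte sector, a request for 1000 bytes at offset 100 would cover
sectors 0 through 2. The caller never sees any of this, it only deals in bytes.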
Finally, while requesting allocation of the entire disk is pretty
trivial, to be able to sensibly do allocation of partial regions
or individual sectors, applications need to be able to find out
just what regions are currently allocated/missing. This will
require some kind of API to query disk allocation regions (cf
the FIEMAP/FIBMAP ioctls).
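As a rough sketch of what an application could do with such a query API:
assuming it returned the allocated regions as a sorted, non-overlapping list
of byte extents, the app could compute the gaps that still need allocating
within a range of interest. The struct and function below are purely
hypothetical and only show the shape of the data, not a proposed libvirt API:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type: one region of the virtual disk, in bytes. */
struct extent {
    uint64_t offset;
    uint64_t length;
};

/* Given allocated extents sorted by offset and non-overlapping, store the
 * unallocated gaps inside [offset, offset + length) in 'holes'.
 * Returns the number of gaps found; at most max_holes entries are stored. */
static size_t
find_holes(const struct extent *alloc, size_t nalloc,
           uint64_t offset, uint64_t length,
           struct extent *holes, size_t max_holes)
{
    uint64_t pos = offset;            /* first byte not yet accounted for */
    uint64_t end = offset + length;
    size_t n = 0;

    for (size_t i = 0; i < nalloc && pos < end; i++) {
        uint64_t a_start = alloc[i].offset;
        uint64_t a_end = a_start + alloc[i].length;

        if (a_end <= pos)
            continue;                 /* extent lies entirely before pos */
        if (a_start >= end)
            break;                    /* extent lies entirely after the range */
        if (a_start > pos) {          /* gap between pos and this extent */
            if (n < max_holes) {
                holes[n].offset = pos;
                holes[n].length = a_start - pos;
            }
            n++;
        }
        pos = a_end;
    }
    if (pos < end) {                  /* trailing gap after the last extent */
        if (n < max_holes) {
            holes[n].offset = pos;
            holes[n].length = end - pos;
        }
        n++;
    }
    return n;
}
```

Each gap returned could then be passed straight to an offset + length
allocation call, which is another reason to keep everything in bytes.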
To enable this support in libvirt, I propose the following API...
virDomainStreamDisk() will start or stop a full device stream or stream a
single sector of a device. The behavior is controlled by setting
virDomainStreamDiskFlags. When either starting or stopping a full device
stream, the return value is either 0 or -1 to indicate whether the operation
succeeded. For a single sector stream, a device offset is returned (or -1 on
failure). This value can be used to continue streaming with a subsequent call
to virDomainStreamDisk().
virDomainStreamDiskInfo() returns information about active full device streams
(the device alias, current streaming position, and total size).
I'm finding the term 'Streaming' to be quite misleading. This is really
about allocating blocks in the disk image. Thus I would use the word
'Allocate' in the API naming. I'll follow up about API design in the
next patch.
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|