On Fri, Apr 08, 2011 at 02:26:48PM -0500, Anthony Liguori wrote:
> On 04/08/2011 11:02 AM, Stefan Hajnoczi wrote:
> >On Fri, Apr 8, 2011 at 2:31 PM, Daniel P. Berrange<berrange(a)redhat.com> wrote:
> >
> >I have CCed Anthony and Kevin. Anthony drove the QED image streaming
> >and Kevin will probably be interested in the idea of allocating raw
> >images as a background activity while QEMU runs.
> >
> >> /*
> >>  * @path: fully qualified filename of the virtual disk
> >>  * @nregions: filled in with the number of @regions structs
> >>  * @regions: filled with a list of allocated regions
> >>  *
> >>  * Query the extents of allocated regions within the
> >>  * virtual disk file. The offsets in the list of regions
> >>  * are not guaranteed to be sorted in any explicit order.
> >>  */
> >> int virDomainBlockGetAllocationMap(virDomainPtr dom,
> >>                                    const char *path,
> >>                                    unsigned int *nregions,
> >>                                    virDomainBlockRegionPtr *regions);
> >
> >QEMU can provide this with its existing .bdrv_is_allocated() function.
> >Kevin, do you have any thoughts on whether this API will work well?
>
> I think the trouble with this API proposal is that it's overloading
> concepts.
>
> Sparse is not the same thing as CoW to a backing file.

I don't like to use the term "sparse", since that implies a specific disk
format (raw file with holes). Rather I use the term 'thin provisioned'
to refer to any disk format where not all physical sectors have
yet been allocated. A thin-provisioned disk can trivially be thought
of as a disk with a backing file whose sectors are all filled with
zeros.

> For instance, when you expose streaming, the result is still a
> sparse file. So you'd have a rather curious API where you called to
> "allocate" a region in the file which resulted in having a sparse
> file which you then called again to make it non sparse. But AFAICT,
> the API doesn't really tell you these details.

Copy-on-read streaming does not imply that the result is still
thin-provisioned. That is a policy decision by the management
application. Given information about the allocation pattern of
the original master image, the mgmt app can decide whether to
make a series of Allocate() calls to preserve sparseness, or
make a series of Allocate() calls which result in a fully
allocated image.
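
To make that policy choice concrete, here is a rough sketch in C of how a
mgmt app might turn the master image's allocation pattern into the list of
ranges it passes to Allocate(). Everything in it is made up for
illustration: struct region, plan_allocation() and the two policy names
are assumptions, not anything libvirt or QEMU defines, and the allocation
map is modelled as one byte per sector rather than the region list the
proposed API would return.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical region descriptor: a run of sectors [start, start + length). */
struct region {
    size_t start;
    size_t length;
};

/* Hypothetical policy knob, chosen by the management application. */
enum alloc_policy {
    PRESERVE_SPARSENESS, /* only allocate what the master image has allocated */
    FULLY_ALLOCATE       /* allocate every sector of the disk */
};

/*
 * Build the list of ranges to hand to Allocate(), one call per range.
 * master_map[i] is non-zero if sector i is allocated in the master image.
 * Writes at most max_out regions to out and returns how many were written.
 */
static size_t plan_allocation(const unsigned char *master_map, size_t nsectors,
                              enum alloc_policy policy,
                              struct region *out, size_t max_out)
{
    size_t n = 0;

    assert(master_map != NULL && out != NULL);

    if (policy == FULLY_ALLOCATE) {
        /* One range covering the whole disk. */
        if (nsectors > 0 && max_out > 0) {
            out[n].start = 0;
            out[n].length = nsectors;
            n++;
        }
        return n;
    }

    /* PRESERVE_SPARSENESS: coalesce adjacent allocated sectors into runs. */
    for (size_t i = 0; i < nsectors && n < max_out; ) {
        if (!master_map[i]) {
            i++;
            continue;
        }
        size_t start = i;
        while (i < nsectors && master_map[i])
            i++;
        out[n].start = start;
        out[n].length = i - start;
        n++;
    }
    return n;
}
```

For a nine-sector master image allocated as |X| |X|X| | |X| |X|,
PRESERVE_SPARSENESS yields four ranges (sector 0, sectors 2-3, sector 6
and sector 8), while FULLY_ALLOCATE yields a single range covering all
nine sectors. The Allocate() call itself is the same in both cases; only
the list of ranges differs.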

> Having two related APIs to expand a copy-on-read image and then to
> fill in a sparse file is certainly a reasonable thing to do. I
> think trying to make a single API that does both without having a
> flag that basically makes it two APIs is going to be cumbersome.

On the contrary, having a single API makes life *simpler*. It doesn't
require any special flag to distinguish the two use cases, since they
are fundamentally the same thing. Some examples, which include the
implicit "all zeros" backing file that every disk has, should illustrate
this:
- Make a brand new thin-provisioned disk, no backing store,
  fully allocated
    |0|0|0|0|0|0|0|0|0|
    | | | | | | | | | |  ->  |0|0|0|0|0|0|0|0|0|
- Make a brand new thin-provisioned disk, no backing store,
  1/2 allocated
    |0|0|0|0|0|0|0|0|0|      |0|0|0|0|0|0|0|0|0|
    | | | | | | | | | |  ->  |0|0|0|0|0| | | | |
- Make an existing thin-provisioned disk, no backing store,
  fully allocated
    |0|0|0|0|0|0|0|0|0|
    |X| |X|X| | |X| |X|  ->  |X|0|X|X|0|0|X|0|X|
- Make an existing thin-provisioned disk, no backing store,
  1/2 allocated
    |0|0|0|0|0|0|0|0|0|      |0|0|0|0|0|0|0|0|0|
    |X| |X|X| | |X| |X|  ->  |X|0|X|X|0| |X| |X|
- Make a brand new thin-provisioned disk, with backing store,
  independent of backing store, but still thin:
    |0|0|0|0|0|0|0|0|0|
    |X| |X|X| | |X| |X|      |0|0|0|0|0|0|0|0|0|
    | | | | | | | | | |  ->  |X| |X|X| | |X| |X|
- Make an existing thin-provisioned disk, with backing store,
  independent of backing store, but still thin
    |0|0|0|0|0|0|0|0|0|
    |X| |X|X| | |X| |X|      |0|0|0|0|0|0|0|0|0|
    |Y|Y|Y| | | | | | |  ->  |Y|Y|Y|X| | |X| |X|
- Make an existing thin-provisioned disk, with backing store,
  independent of backing store, fully allocated
    |0|0|0|0|0|0|0|0|0|
    |X| |X|X| | |X| |X|
    |Y|Y|Y| | | | | | |  ->  |Y|Y|Y|X|0|0|X|0|X|
- Make a brand new thin-provisioned disk, with 2 backing stores,
  independent of backing stores & fully allocated:
    |0|0|0|0|0|0|0|0|0|
    | | |Z|Z| | | |Z| |
    |X| |X| | | |X| |X|
    |Y|Y| |Y| | | | | |  ->  |Y|Y|X|Y|0|0|X|Z|X|
etc, etc for many more example scenarios. Copy-on-read streaming is really
not a special case - it is just one of many example scenarios, all of
which can be managed via the pair of APIs mentioned earlier.
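
For anyone who wants to play with the scenarios, the layering can be
modelled in a few lines of C. This is only a toy sketch under my own
assumptions: each layer is a string with one character per sector, ' '
means "not allocated in this layer", and backing_data(), allocate_range()
and allocate_backed_only() are hypothetical names, not real libvirt or
QEMU functions.

```c
#include <assert.h>
#include <stddef.h>

#define UNALLOCATED ' '  /* marks a sector with no data in a given layer */

/*
 * Data the backing chain provides for sector i. backing[0] is the nearest
 * backing file, backing[nbacking - 1] the deepest; each is a string with
 * one character per sector. If no file holds the sector, the implicit
 * all-zeros backing file supplies '0'.
 */
static char backing_data(const char *const *backing, size_t nbacking, size_t i)
{
    for (size_t b = 0; b < nbacking; b++) {
        if (backing[b][i] != UNALLOCATED)
            return backing[b][i];
    }
    return '0';
}

/*
 * Model of one Allocate() call on sectors [start, start + len) of image.
 * Sectors the image already holds are left alone; the rest are filled from
 * the backing chain. No flag distinguishes real backing files from zeros.
 */
static void allocate_range(char *image, const char *const *backing,
                           size_t nbacking, size_t start, size_t len)
{
    assert(image != NULL);
    for (size_t i = start; i < start + len; i++) {
        if (image[i] == UNALLOCATED)
            image[i] = backing_data(backing, nbacking, i);
    }
}

/*
 * "Independent of backing store, but still thin": the caller looks at the
 * backing files' allocation and only calls allocate_range() on sectors that
 * some real backing file holds, so all other sectors stay unallocated.
 */
static void allocate_backed_only(char *image, const char *const *backing,
                                 size_t nbacking, size_t nsectors)
{
    for (size_t i = 0; i < nsectors; i++) {
        for (size_t b = 0; b < nbacking; b++) {
            if (backing[b][i] != UNALLOCATED) {
                allocate_range(image, backing, nbacking, i, 1);
                break;
            }
        }
    }
}
```

Taking the image "YY Y     " over the backing files "X X   X X" and
"  ZZ   Z " and calling allocate_range() on all nine sectors gives
"YYXY00XZX". Calling it on only the first five sectors of "X XX  X X"
with no backing files gives "X0XX0 X X". The "still thin" cases use the
very same allocate_range(); the caller just restricts which sectors it
asks for.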
Regards,
Daniel
--
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|