[libvirt] [RFC] Storage volume clone API

Hi all, Attached is a stab at an API for cloning a storage volume. This patch implements the command for FS volumes, along with a virsh command 'vol-clone'. This builds on the previously posted 'drop pool lock during volume creation' work. The new API call looks like: /** * virStorageVolClone: * @vol: pointer to storage vol to clone * @xmldesc: description of volume to create * @flags: flags for creation (unused, pass 0) * * Clone a storage volume within the parent pool. * Information for the new volume (name, perms) * are passed via a typical volume XML description * Not all pools support cloning of volumes * * return the storage volume, or NULL on error */ virStorageVolPtr virStorageVolClone(virStorageVolPtr vol, const char *xmldesc, unsigned int flags) The user passes information about the new volume via <volume> XML. Currently we only use the passed name and permissions values, but backingStore info could also be relevant, and 'format' if we allow cloning between image formats. Current limitations: - Parsing the 'clone' xml will complain if the 'capacity' element hasn't been specified, even though it is meaningless for the new volume. Not too sure what the right thing to do here is. - A clone of a raw sparse file will turn out fully allocated. Could probably borrow code from 'cp' to try and get this right. - Doesn't handle backing stores. I think we would need to call out to 'qemu-img convert' to have any chance of getting this right. And if we do that, we could also allow cloning from one image format to another. A lot of the patch is just plumbing, which I will break out into separate patches on the next post. For now, most of the interesting bits are at near the bottom. Feedback appreciated. Thanks, Cole

On Wed, Apr 15, 2009 at 03:28:54PM -0400, Cole Robinson wrote:
Hi all,
Attached is a stab at an API for cloning a storage volume. This patch implements the command for FS volumes, along with a virsh command 'vol-clone'. This builds on the previously posted 'drop pool lock during volume creation' work.
The new API call looks like:
/** * virStorageVolClone: * @vol: pointer to storage vol to clone * @xmldesc: description of volume to create * @flags: flags for creation (unused, pass 0) * * Clone a storage volume within the parent pool. * Information for the new volume (name, perms) * are passed via a typical volume XML description * Not all pools support cloning of volumes * * return the storage volume, or NULL on error */ virStorageVolPtr virStorageVolClone(virStorageVolPtr vol, const char *xmldesc, unsigned int flags)
I was initially surprized by the xmldesc, but I think it makes perfect sense, based on existing APIs and the fact we may augment the amount of metadata and it's simpler to build the API directly with room for the extensions. I wonder if the entry point could be made to accept NULL as the input, then libvirt generate a name/uuid, keep other attributes the same and the returned object can be inspected to discover name/uuid.
The user passes information about the new volume via <volume> XML. Currently we only use the passed name and permissions values, but backingStore info could also be relevant, and 'format' if we allow cloning between image formats.
We don't have an UUID for a volume ?
Current limitations:
- Parsing the 'clone' xml will complain if the 'capacity' element hasn't been specified, even though it is meaningless for the new volume. Not too sure what the right thing to do here is.
add an argument to the parsing routine asking to disable that check ?
- A clone of a raw sparse file will turn out fully allocated. Could probably borrow code from 'cp' to try and get this right.
I would not consider this a blocker though a change in behaviour there later on might be considered disruptive (or a great improvement depending who you're asking !)
- Doesn't handle backing stores. I think we would need to call out to 'qemu-img convert' to have any chance of getting this right. And if we do that, we could also allow cloning from one image format to another.
A lot of the patch is just plumbing, which I will break out into separate patches on the next post. For now, most of the interesting bits are at near the bottom.
No need to keep the generated doc files in the patch IMHO [...]
diff --git a/include/libvirt/libvirt.h.in b/include/libvirt/libvirt.h.in index a028b21..0f4dd27 100644 --- a/include/libvirt/libvirt.h.in +++ b/include/libvirt/libvirt.h.in @@ -1047,6 +1047,9 @@ const char* virStorageVolGetKey (virStorageVolPtr vol); virStorageVolPtr virStorageVolCreateXML (virStoragePoolPtr pool, const char *xmldesc, unsigned int flags); +virStorageVolPtr virStorageVolClone (virStorageVolPtr vol, + const char *xmldesc, + unsigned int flags); int virStorageVolDelete (virStorageVolPtr vol, unsigned int flags); int virStorageVolRef (virStorageVolPtr vol);
Fine [...]
+/** + * virStorageVolClone: + * @vol: pointer to storage vol to clone + * @xmldesc: description of volume to create + * @flags: flags for creation (unused, pass 0) + * + * Clone a storage volume within the parent pool. + * Information for the new volume (name, perms) + * are passed via a typical volume XML description + * Not all pools support cloning of volumes + * + * return the storage volume, or NULL on error + */ +virStorageVolPtr +virStorageVolClone(virStorageVolPtr vol, + const char *xmldesc, + unsigned int flags) +{ + DEBUG("vol=%p, flags=%u", vol, flags); + + virResetLastError(); + + if (!VIR_IS_STORAGE_VOL(vol)) { + virLibConnError(NULL, VIR_ERR_INVALID_STORAGE_VOL, __FUNCTION__); + return (NULL); + } + + if (vol->conn->flags & VIR_CONNECT_RO) { + virLibConnError(vol->conn, VIR_ERR_OPERATION_DENIED, __FUNCTION__); + goto error; + } + + if (vol->conn->storageDriver && vol->conn->storageDriver->volClone) { + virStorageVolPtr ret; + ret = vol->conn->storageDriver->volClone (vol, xmldesc, flags); + if (!ret) + goto error; + return ret; + } + + virLibConnError (vol->conn, VIR_ERR_NO_SUPPORT, __FUNCTION__); + +error: + /* Copy to connection error object for back compatability */ + virSetConnError(vol->conn); + return NULL; +}
okay too [...]
+static int +virStorageBackendFileSystemVolClone(virConnectPtr conn, + virStorageVolDefPtr origvol, + virStorageVolDefPtr newvol, + unsigned int flags ATTRIBUTE_UNUSED) +{ + + int origfd = -1; + int newfd = -1; + int ret = -1; + size_t bytes = 1024 * 1024;
so the copy is done megabytes by megabytes That sounds a reasonable intermediate compromize between the daemon memory usage and the I/O efficiency, feedback from an I/O expert or based on cp or tar code could be interesting. In general patch looks good to me, I got a bit lost in the pool locking, but finally it sounds fine. Daniel -- Daniel Veillard | libxml Gnome XML XSLT toolkit http://xmlsoft.org/ daniel@veillard.com | Rpmfind RPM search engine http://rpmfind.net/ http://veillard.com/ | virtualization library http://libvirt.org/

Daniel Veillard wrote:
On Wed, Apr 15, 2009 at 03:28:54PM -0400, Cole Robinson wrote:
Hi all,
Attached is a stab at an API for cloning a storage volume. This patch implements the command for FS volumes, along with a virsh command 'vol-clone'. This builds on the previously posted 'drop pool lock during volume creation' work.
The new API call looks like:
/** * virStorageVolClone: * @vol: pointer to storage vol to clone * @xmldesc: description of volume to create * @flags: flags for creation (unused, pass 0) * * Clone a storage volume within the parent pool. * Information for the new volume (name, perms) * are passed via a typical volume XML description * Not all pools support cloning of volumes * * return the storage volume, or NULL on error */ virStorageVolPtr virStorageVolClone(virStorageVolPtr vol, const char *xmldesc, unsigned int flags)
I was initially surprized by the xmldesc, but I think it makes perfect sense, based on existing APIs and the fact we may augment the amount of metadata and it's simpler to build the API directly with room for the extensions. I wonder if the entry point could be made to accept NULL as the input, then libvirt generate a name/uuid, keep other attributes the same and the returned object can be inspected to discover name/uuid.
That seems reasonable. If people think it's useful behavior, I can add it.
The user passes information about the new volume via <volume> XML. Currently we only use the passed name and permissions values, but backingStore info could also be relevant, and 'format' if we allow cloning between image formats.
We don't have an UUID for a volume ?
No, the unique identifier is the <key> field which is the absolute path to the volume on the host. I don't think any of the storage drivers actually allow manually setting this field, it is either generated using the parent pool path and the new volume name, or dictated by storage (/dev/sda3 or similar).
Current limitations:
- Parsing the 'clone' xml will complain if the 'capacity' element hasn't been specified, even though it is meaningless for the new volume. Not too sure what the right thing to do here is.
add an argument to the parsing routine asking to disable that check ?
Yeah I figured i'd end up down that route. I'll check it out.
- A clone of a raw sparse file will turn out fully allocated. Could probably borrow code from 'cp' to try and get this right.
I would not consider this a blocker though a change in behaviour there later on might be considered disruptive (or a great improvement depending who you're asking !)
Agreed, not critical.
- Doesn't handle backing stores. I think we would need to call out to 'qemu-img convert' to have any chance of getting this right. And if we do that, we could also allow cloning from one image format to another.
A lot of the patch is just plumbing, which I will break out into separate patches on the next post. For now, most of the interesting bits are at near the bottom.
No need to keep the generated doc files in the patch IMHO
Okay, I'll drop them. Thanks, Cole
[...]
diff --git a/include/libvirt/libvirt.h.in b/include/libvirt/libvirt.h.in index a028b21..0f4dd27 100644 --- a/include/libvirt/libvirt.h.in +++ b/include/libvirt/libvirt.h.in @@ -1047,6 +1047,9 @@ const char* virStorageVolGetKey (virStorageVolPtr vol); virStorageVolPtr virStorageVolCreateXML (virStoragePoolPtr pool, const char *xmldesc, unsigned int flags); +virStorageVolPtr virStorageVolClone (virStorageVolPtr vol, + const char *xmldesc, + unsigned int flags); int virStorageVolDelete (virStorageVolPtr vol, unsigned int flags); int virStorageVolRef (virStorageVolPtr vol);
Fine
[...]
+/** + * virStorageVolClone: + * @vol: pointer to storage vol to clone + * @xmldesc: description of volume to create + * @flags: flags for creation (unused, pass 0) + * + * Clone a storage volume within the parent pool. + * Information for the new volume (name, perms) + * are passed via a typical volume XML description + * Not all pools support cloning of volumes + * + * return the storage volume, or NULL on error + */ +virStorageVolPtr +virStorageVolClone(virStorageVolPtr vol, + const char *xmldesc, + unsigned int flags) +{ + DEBUG("vol=%p, flags=%u", vol, flags); + + virResetLastError(); + + if (!VIR_IS_STORAGE_VOL(vol)) { + virLibConnError(NULL, VIR_ERR_INVALID_STORAGE_VOL, __FUNCTION__); + return (NULL); + } + + if (vol->conn->flags & VIR_CONNECT_RO) { + virLibConnError(vol->conn, VIR_ERR_OPERATION_DENIED, __FUNCTION__); + goto error; + } + + if (vol->conn->storageDriver && vol->conn->storageDriver->volClone) { + virStorageVolPtr ret; + ret = vol->conn->storageDriver->volClone (vol, xmldesc, flags); + if (!ret) + goto error; + return ret; + } + + virLibConnError (vol->conn, VIR_ERR_NO_SUPPORT, __FUNCTION__); + +error: + /* Copy to connection error object for back compatability */ + virSetConnError(vol->conn); + return NULL; +}
okay too
[...]
+static int +virStorageBackendFileSystemVolClone(virConnectPtr conn, + virStorageVolDefPtr origvol, + virStorageVolDefPtr newvol, + unsigned int flags ATTRIBUTE_UNUSED) +{ + + int origfd = -1; + int newfd = -1; + int ret = -1; + size_t bytes = 1024 * 1024;
so the copy is done megabytes by megabytes That sounds a reasonable intermediate compromize between the daemon memory usage and the I/O efficiency, feedback from an I/O expert or based on cp or tar code could be interesting.
In general patch looks good to me, I got a bit lost in the pool locking, but finally it sounds fine.
Daniel

On Wed, Apr 15, 2009 at 03:28:54PM -0400, Cole Robinson wrote:
Hi all,
Attached is a stab at an API for cloning a storage volume. This patch implements the command for FS volumes, along with a virsh command 'vol-clone'. This builds on the previously posted 'drop pool lock during volume creation' work.
The new API call looks like:
/** * virStorageVolClone: * @vol: pointer to storage vol to clone * @xmldesc: description of volume to create * @flags: flags for creation (unused, pass 0) * * Clone a storage volume within the parent pool. * Information for the new volume (name, perms) * are passed via a typical volume XML description * Not all pools support cloning of volumes * * return the storage volume, or NULL on error */ virStorageVolPtr virStorageVolClone(virStorageVolPtr vol, const char *xmldesc, unsigned int flags)
One potentially annoying restriction of this API is that it would not allow for cloning between storage pools. eg clone an LVM volume to a qcow2 file. While I don't think we need to actually implement that kind of cross pool cloning right now, I think it is worth allowing for it in the API contract. Thus I'd suggest taking the virStorageVolCreateXML() API as is and then adding an extra 'clone' parameter virStorageVolPtr virStorageVolCreateXMLFrom(virStoragePoolPtr pool, const char *xmldesc, unsigned int flags, virStorageVolPtr clone); One could argue we don't need this at all, and we should just add a 'copy' API int virStorageVolCopyData(virStorageVolPtr vol, virStorageVolPtr from); but being able to create + initialize data in one go might well allow for some optimizations with some types of storage.
The user passes information about the new volume via <volume> XML. Currently we only use the passed name and permissions values, but backingStore info could also be relevant, and 'format' if we allow cloning between image formats.
Current limitations:
- Parsing the 'clone' xml will complain if the 'capacity' element hasn't been specified, even though it is meaningless for the new volume. Not too sure what the right thing to do here is.
I think it depends how you define the semantics of the clone operation. My feeling is that this should follow the same rules as the normal create call in all aspect, and that the 'clone' volume is just there as a data source. There's no reason the new volume has to be forced to be the same capacity as the one being cloned. If its bigger that just means there's some uninitialized data at the end which is perfectly OK. The guest admin can extend the partition table / filesystem as desired. If smaller you could just truncate copied data at that point, though its more quesitonable whether that's useful.
- A clone of a raw sparse file will turn out fully allocated. Could probably borrow code from 'cp' to try and get this right.
It is pretty easy to do a good approximation of sparseness. You're copying in 1 MB chunks. So before write()ing the 1 MB out to the target, just do a scan to see if its all zeros. If it is, then do a lseek() instead. This is what virt-clone currently does.
- Doesn't handle backing stores. I think we would need to call out to 'qemu-img convert' to have any chance of getting this right. And if we do that, we could also allow cloning from one image format to another.
For non-raw formats you want to call out to qemu-img anyway because you need to be able to have a larger target image. This will require differnet metadata being written in the qcow/vmdk/etc headers, so a plain byte-wise clone won't be sufficient. Daniel -- |: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :| |: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

On 04/16/2009 11:10 AM, Daniel P. Berrange wrote:
On Wed, Apr 15, 2009 at 03:28:54PM -0400, Cole Robinson wrote:
Hi all,
Attached is a stab at an API for cloning a storage volume. This patch implements the command for FS volumes, along with a virsh command 'vol-clone'. This builds on the previously posted 'drop pool lock during volume creation' work.
The new API call looks like:
/** * virStorageVolClone: * @vol: pointer to storage vol to clone * @xmldesc: description of volume to create * @flags: flags for creation (unused, pass 0) * * Clone a storage volume within the parent pool. * Information for the new volume (name, perms) * are passed via a typical volume XML description * Not all pools support cloning of volumes * * return the storage volume, or NULL on error */ virStorageVolPtr virStorageVolClone(virStorageVolPtr vol, const char *xmldesc, unsigned int flags)
One potentially annoying restriction of this API is that it would not allow for cloning between storage pools. eg clone an LVM volume to a qcow2 file.
While I don't think we need to actually implement that kind of cross pool cloning right now, I think it is worth allowing for it in the API contract.
Thus I'd suggest taking the virStorageVolCreateXML() API as is and then adding an extra 'clone' parameter
virStorageVolPtr virStorageVolCreateXMLFrom(virStoragePoolPtr pool, const char *xmldesc, unsigned int flags, virStorageVolPtr clone);
Ah sorry, I guess I didn't really consider the case of cloning between storage pools. I think the above prototype looks good.
One could argue we don't need this at all, and we should just add a 'copy' API
int virStorageVolCopyData(virStorageVolPtr vol, virStorageVolPtr from);
but being able to create + initialize data in one go might well allow for some optimizations with some types of storage.
Hmm, this is interesting. The above ties into something I've been wondering about: an API to import to a volume from an arbitrary path. This would allow virt-manager users an option to, say, move install media from their homedir to a storage pool. But that's a separate discussion.
The user passes information about the new volume via <volume> XML. Currently we only use the passed name and permissions values, but backingStore info could also be relevant, and 'format' if we allow cloning between image formats.
Current limitations:
- Parsing the 'clone' xml will complain if the 'capacity' element hasn't been specified, even though it is meaningless for the new volume. Not too sure what the right thing to do here is.
I think it depends how you define the semantics of the clone operation.
My feeling is that this should follow the same rules as the normal create call in all aspect, and that the 'clone' volume is just there as a data source.
There's no reason the new volume has to be forced to be the same capacity as the one being cloned. If its bigger that just means there's some uninitialized data at the end which is perfectly OK. The guest admin can extend the partition table / filesystem as desired. If smaller you could just truncate copied data at that point, though its more quesitonable whether that's useful.
Sounds good.
- A clone of a raw sparse file will turn out fully allocated. Could probably borrow code from 'cp' to try and get this right.
It is pretty easy to do a good approximation of sparseness. You're copying in 1 MB chunks. So before write()ing the 1 MB out to the target, just do a scan to see if its all zeros. If it is, then do a lseek() instead. This is what virt-clone currently does.
I'll give that a shot.
- Doesn't handle backing stores. I think we would need to call out to 'qemu-img convert' to have any chance of getting this right. And if we do that, we could also allow cloning from one image format to another.
For non-raw formats you want to call out to qemu-img anyway because you need to be able to have a larger target image. This will require differnet metadata being written in the qcow/vmdk/etc headers, so a plain byte-wise clone won't be sufficient.
Right, with the ability to specify a different capacity for the new volume, using qemu-img is a must. I'll incorporate that into my next patch. Thanks, Cole
participants (3)
-
Cole Robinson
-
Daniel P. Berrange
-
Daniel Veillard