[libvirt] RFC API proposal: virDomainBlockRebase

Right now, the existing virDomainBlockPull API has a tough limitation - it is an all-or-none approach. In all my examples below, I'm starting from the following relationship, where '<-' means 'is a backing file of':

template <- intermediate <- current

virDomainBlockPull can only convert things in a forward direction, with the merge destination being the current image, resulting in:

merge template and intermediate into current, creating: current

Meanwhile, qemu is adding support for a partial block pull operation, still on the current image as the merge destination, but where you can now specify an optional argument to limit the pull to just the intermediate files and alter the current image to be backed by an ancestor file, as in:

merge intermediate into current, creating: template <- current

For 0.9.10, I'd like to add the following API:

/**
 * virDomainBlockRebase:
 * @dom: pointer to domain object
 * @disk: path to the block device, or device shorthand
 * @base: new base image, or NULL for entire block pull
 * @bandwidth: (optional) specify copy bandwidth limit in Mbps
 * @flags: extra flags; not used yet, so callers should always pass 0
 *
 * Populate a disk image with data from its backing image chain, and
 * set the new backing image to @base, where base is the absolute
 * path of one of the backing images in the chain. If @base is NULL,
 * then this operation is identical to virDomainBlockPull(). Once all
 * data from its backing image chain has been pulled, the disk no
 * longer depends on those intermediate backing images. This function
 * pulls data for the entire device in the background. Progress of the
 * operation can be checked with virDomainGetBlockJobInfo() and
 * the operation can be aborted with virDomainBlockJobAbort(). When
 * finished, an asynchronous event is raised to indicate the final
 * status.
 *
 * The @disk, @bandwidth, and @flags parameters are handled as in
 * virDomainBlockPull().
 *
 * Returns 0 if the operation has started, -1 on failure.
 */
int virDomainBlockRebase(virDomainPtr dom, const char *disk,
                         const char *base, unsigned long bandwidth,
                         unsigned int flags);

Given that Adam has a pending patch to support a VIR_DOMAIN_BLOCK_PULL_ASYNC flag, this same flag would have to be supported in virDomainBlockRebase.

I've also been chatting with Federico Simoncelli about how the above operation would work for VDSM purposes in doing a live block move, while preserving a common template base file:

start with: vda: template <- current1

create a disk-only snapshot, with:
tmpsnap = virDomainSnapshotCreateXML(dom,
    "<domainsnapshot>\n"
    "  <disks>\n"
    "    <disk name='vda'>\n"
    "      <source>/path/to/current2</source>\n"
    "    </disk>\n"
    "  </disks>\n"
    "</domainsnapshot>",
    VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY)
where the xml calls out the destination file name, resulting in: vda: template <- current1 <- current2

perform the block rebase, with:
virDomainBlockRebase(dom, "vda", "/path/to/template", 0, VIR_DOMAIN_BLOCK_PULL_ASYNC)
then wait for the event (or poll status) to detect completion, resulting in: vda: template <- current2

delete the disk-only snapshot metadata as no longer useful, with:
virDomainSnapshotDelete(tmpsnap, VIR_DOMAIN_SNAPSHOT_DELETE_METADATA_ONLY)

At one point, I thought of creating a single libvirt API that performs all of those steps in one call; but right now, I'm not proposing that, because qemu has no way to undo a snapshot. In other words, without an undo operation, if the snapshot phase succeeds but the block rebase phase fails, a single API would have to report failure even though the domain was altered, while the ideal scenario is that reporting failure means things are in the same state as before the API started.

Beyond 0.9.10, there are some additional useful merge patterns that might be worth exposing. All of these operations are already possible on offline images, using qemu-img; but none of them are possible on live images using current qemu, which is why I'm thinking it is something for another day. I'm also hoping to someday enhance the set of virStorageVol APIs to make backing file manipulation of offline images easier. At any rate, the additional merge operations are:

forward live merge with a non-current image as the merge destination, as in:
merge template into intermediate, creating: intermediate <- current

backward merge of a current image (that is, undoing a current snapshot):
merge current into intermediate, creating: template <- intermediate

and backward merge of a non-current image (that is, undoing an earlier snapshot, but by modifying the template rather than the current image):
merge intermediate into template, creating: template <- current

Backward merge of the current image seems like something easy to fit into my proposed API (add a new flag, maybe called VIR_DOMAIN_BLOCK_REBASE_BACKWARD). Manipulations of anything that does not involve the current image seem tougher, assuming qemu ever even reaches the point where it exposes those operations on live volumes - the user has to specify not one, but two backing file names. But even that could possibly be fit into my API, by adding a flag that states that the const char *base argument is treated as an XML snippet describing the full details of the merge, with the XML listing which image is being merged to which destination, rather than as just the name of the backing file becoming the new base of the current image. Perhaps something like:

virDomainBlockRebase(dom, block,
    "<rebase>\n"
    "  <source>/path/to/intermediate</source>\n"
    "  <dest>/path/to/template</dest>\n"
    "</rebase>",
    0, VIR_DOMAIN_BLOCK_REBASE_XML|VIR_DOMAIN_BLOCK_REBASE_BACKWARD)

as a specification to take the contents of intermediate, merge those backwards into template, and adjust the rest of the backing file chain so that whatever used to be backed by intermediate is now backed by template. Or, if qemu ever gives us the ability to merge non-current images, we may decide at that time that it is worth a new API to expose those new complexities.

Another thing I have been thinking about is virDomainSnapshotDelete. The above conversation talks about merging of a single disk, but a live disk snapshot operation can create backing file chains for multiple disks at once, all tracked by a snapshot. Additionally, the current code allows a snapshot delete of internal snapshots, but refuses to do anything useful with an external snapshot, because there is currently no way to specify if the snapshot is removed by merging the base into the new current, or by undoing the current and merging it backwards into the base. Alas, virDomainSnapshotDelete doesn't take any arguments for how to handle the situation, and use of a flag to make the decision would limit all disks to be handled in the same manner. So what I'm thinking is that when a snapshot is created (or redefined, using redefinition as the vehicle to add in the new XML), the snapshot XML itself can record the preferred direction for undoing the snapshot; for example:

<domainsnapshot>
  <disks>
    <disk name='/path/to/old_vda'>
      <source file='/path/to/new_vda'/>
      <on_delete merge='forward'/>
    </disk>
    <disk name='/path/to/old_vdb'>
      <source file='/path/to/new_vdb'/>
      <on_delete merge='backward'/>
    </disk>
  </disks>
</domainsnapshot>

then when virDomainSnapshotDelete is called on that snapshot, old_vda would be forward merged into new_vda, while new_vdb would be backward merged into old_vdb. Again, that's food for thought for post-0.9.10, and shouldn't get in the way of adding virDomainBlockRebase() now.

-- Eric Blake eblake@redhat.com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

On Tue, Jan 31, 2012 at 09:28:51AM -0700, Eric Blake wrote:
Right now, the existing virDomainBlockPull API has a tough limitation - it is an all-or-none approach. In all my examples below, I'm starting from the following relationship, where '<-' means 'is a backing file of':
template <- intermediate <- current
virDomainBlockPull can only convert things in a forward direction, with the merge destination being the current image, resulting in:
merge template and intermediate into current, creating: current
Meanwhile, qemu is adding support for a partial block pull operation, still on the current image as the merge destination, but where you can now specify an optional argument to limit the pull to just the intermediate files and altering the current image to be backed by an ancestor file, as in:
merge intermediate into current, creating: template <- current
For 0.9.10, I'd like to add the following API:
/**
 * virDomainBlockRebase:
 * @dom: pointer to domain object
 * @disk: path to the block device, or device shorthand
 * @base: new base image, or NULL for entire block pull
 * @bandwidth: (optional) specify copy bandwidth limit in Mbps
 * @flags: extra flags; not used yet, so callers should always pass 0
What is the format of the @base arg? My first thought would be a path, but what if the desired image file is not directly known to libvirt?
 * Populate a disk image with data from its backing image chain, and
 * set the new backing image to @base, where base is the absolute
 * path of one of the backing images in the chain. If @base is NULL,
 * then this operation is identical to virDomainBlockPull(). Once all
 * data from its backing image chain has been pulled, the disk no
 * longer depends on those intermediate backing images. This function
 * pulls data for the entire device in the background. Progress of the
 * operation can be checked with virDomainGetBlockJobInfo() and
 * the operation can be aborted with virDomainBlockJobAbort(). When
 * finished, an asynchronous event is raised to indicate the final
 * status.
 *
 * The @disk, @bandwidth, and @flags parameters are handled as in
 * virDomainBlockPull().
 *
 * Returns 0 if the operation has started, -1 on failure.
 */
int virDomainBlockRebase(virDomainPtr dom, const char *disk,
                         const char *base, unsigned long bandwidth,
                         unsigned int flags);
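The chain semantics in the doc comment above can be illustrated with a small model. This is only a sketch under stated assumptions: `struct chain` and `chain_rebase()` are hypothetical names that manipulate an in-memory list of image names, they are not libvirt or qemu code, and no image data is copied.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHAIN_MAX 8

/* Hypothetical model: images[0] is the oldest backing file and
 * images[len - 1] is the active ("current") image. */
struct chain {
    const char *images[CHAIN_MAX];
    size_t len;
};

/* Model of the proposed rebase: everything between base and the active
 * image is pulled into the active image. base == NULL models a full
 * pull. Returns 0 on success, -1 if base is not a backing image. */
static int chain_rebase(struct chain *c, const char *base)
{
    size_t keep = 0;   /* number of backing images that survive */
    size_t i = 0;

    assert(c->len >= 1 && c->len <= CHAIN_MAX);

    if (base != NULL) {
        /* base must name a backing image, not the active image itself */
        for (i = 0; i + 1 < c->len; i++) {
            if (strcmp(c->images[i], base) == 0)
                break;
        }
        if (i + 1 >= c->len)
            return -1;
        keep = i + 1;
    }

    /* the active image now sits directly on top of base */
    c->images[keep] = c->images[c->len - 1];
    c->len = keep + 1;
    return 0;
}
```

Starting from template <- intermediate <- current, calling `chain_rebase()` with "template" leaves template <- current in this model, and passing NULL leaves only current, matching the virDomainBlockPull() behaviour described in the comment.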
Given that Adam has a pending patch to support a VIR_DOMAIN_BLOCK_PULL_ASYNC flag, this same flag would have to be supported in virDomainBlockRebase.
That patch only applies to virDomainBlockJobCancel(). The blockJob initiators (virDomainBlockPull and this new one) already use an async mode of operation because the call simply starts the block job.
I've also been chatting with Federico Simoncelli about how the above operation would work for VDSM purposes in doing a live block move, while preserving a common template base file:
start with: vda: template <- current1
create a disk-only snapshot, with:
tmpsnap = virDomainSnapshotCreateXML(dom,
    "<domainsnapshot>\n"
    "  <disks>\n"
    "    <disk name='vda'>\n"
    "      <source>/path/to/current2</source>\n"
    "    </disk>\n"
    "  </disks>\n"
    "</domainsnapshot>",
    VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY)
where the xml calls out the destination file name, resulting in: vda: template <- current1 <- current2
perform the block rebase, with:
virDomainBlockRebase(dom, "vda", "/path/to/template", 0, VIR_DOMAIN_BLOCK_PULL_ASYNC)
then wait for the event (or poll status) to detect completion, resulting in: vda: template <- current2
delete the disk-only snapshot metadata as no longer useful, with: virDomainSnapshotDelete(tmpsnap, VIR_DOMAIN_SNAPSHOT_DELETE_METADATA_ONLY)
Yep, seems like a very good method.
At one point, I thought of creating a single libvirt API that performs all of those steps in one call; but right now, I'm not proposing that, because of the fact that qemu has no way to undo a snapshot. In other words, without an undo operation, if the snapshot phase succeeds but the block rebase phase fails, a single API would have to report failure even though the domain was altered, while the ideal scenario is that reporting failure means things were in the same state as before the API started.
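To make the atomicity concern concrete, here is a minimal sketch of the snapshot-then-rebase sequence as a single call. Everything in it is an assumption for illustration: `struct chain`, `chain_snapshot()`, `chain_rebase()`, `live_move()` and the `move_result` values are hypothetical and only model the backing chain in memory.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHAIN_MAX 8

struct chain {
    const char *images[CHAIN_MAX];  /* images[0] = oldest backing file */
    size_t len;                     /* images[len - 1] = active image */
};

enum move_result {
    MOVE_OK,                /* both steps succeeded */
    MOVE_FAILED_UNCHANGED,  /* snapshot failed, chain untouched */
    MOVE_FAILED_ALTERED     /* snapshot taken, rebase failed, no undo */
};

/* Step 1: a disk-only snapshot adds new_top as the active image. */
static int chain_snapshot(struct chain *c, const char *new_top)
{
    if (c->len >= CHAIN_MAX)
        return -1;
    c->images[c->len++] = new_top;
    return 0;
}

/* Step 2: pull everything above base into the active image. */
static int chain_rebase(struct chain *c, const char *base)
{
    size_t i;

    assert(base != NULL && c->len >= 1);
    for (i = 0; i + 1 < c->len; i++) {
        if (strcmp(c->images[i], base) == 0) {
            c->images[i + 1] = c->images[c->len - 1];
            c->len = i + 2;
            return 0;
        }
    }
    return -1;  /* base is not a backing image of this chain */
}

/* Combined call: without a way to undo step 1, a failure in step 2
 * has to be reported even though the chain has already changed. */
static enum move_result live_move(struct chain *c, const char *new_top,
                                  const char *base)
{
    if (chain_snapshot(c, new_top) < 0)
        return MOVE_FAILED_UNCHANGED;
    if (chain_rebase(c, base) < 0)
        return MOVE_FAILED_ALTERED;
    return MOVE_OK;
}
```

The `MOVE_FAILED_ALTERED` case is the one a combined API could not avoid: the caller sees a failure, yet the chain is one image longer than before the call.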
Beyond 0.9.10, there are some additional useful merge patterns that might be worth exposing. All of these operations are already possible on offline images, using qemu-img; but none of them are possible on live images using current qemu, which is why I'm thinking it is something for another day. I'm also hoping to someday enhance the set of virStorageVol APIs to make backing file manipulation of offline images easier. At any rate, the additional merge operations are:
forward live merge with a non-current image as the merge destination, as in:
merge template into intermediate, creating: intermediate <- current
backward merge of a current image (that is, undoing a current snapshot):
merge current into intermediate, creating: template <- intermediate
and backward merge of a non-current image (that is, undoing an earlier snapshot, but by modifying the template rather than the current image):
merge intermediate into template, creating: template <- current
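The three merge patterns listed above differ mainly in which file receives the data. A rough model, assuming hypothetical names (`struct chain`, `enum merge_dir`, `chain_merge()`) and tracking only image names rather than contents, might look like this:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHAIN_MAX 8

struct chain {
    const char *images[CHAIN_MAX];  /* images[0] = oldest backing file */
    size_t len;                     /* images[len - 1] = active image */
};

enum merge_dir {
    MERGE_FORWARD,   /* merge into the image backed by this one */
    MERGE_BACKWARD   /* merge into this image's own backing file */
};

/* Merge images[idx] into its neighbour and drop it from the chain.
 * Returns the name of the file that receives the data, or NULL if
 * the requested merge has no destination. */
static const char *chain_merge(struct chain *c, size_t idx,
                               enum merge_dir dir)
{
    const char *dest;
    size_t i;

    assert(c->len <= CHAIN_MAX);
    if (idx >= c->len)
        return NULL;
    if (dir == MERGE_FORWARD) {
        if (idx + 1 >= c->len)
            return NULL;            /* the active image has no child */
        dest = c->images[idx + 1];
    } else {
        if (idx == 0)
            return NULL;            /* the oldest image has no parent */
        dest = c->images[idx - 1];
    }

    /* close the gap left by the merged image */
    for (i = idx; i + 1 < c->len; i++)
        c->images[i] = c->images[i + 1];
    c->len--;
    return dest;
}
```

In every case the chain loses exactly one image; the direction only decides whether the surviving, modified file is the child or the parent, which is why a backward merge into a shared file needs extra care.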
Don't these raise some security concerns about modifying a potentially shared intermediate image?
Backward merge of the current image seems like something easy to fit into my proposed API (add a new flag, maybe called VIR_DOMAIN_BLOCK_REBASE_BACKWARD). Manipulations of anything that does not involve the current image seem tougher, assuming qemu ever even reaches the point where it exposes those operations on live volumes - the user has to specify not one, but two backing file names. But even that could possibly be fit into my API, by adding a flag that states that the const char *base argument is treated as an XML snippet describing the full details of the merge, with the XML listing which image is being merged to which destination, rather than as just the name of the backing file becoming the new base of the current image. Perhaps something like:
virDomainBlockRebase(dom, block,
    "<rebase>\n"
    "  <source>/path/to/intermediate</source>\n"
    "  <dest>/path/to/template</dest>\n"
    "</rebase>",
    0, VIR_DOMAIN_BLOCK_REBASE_XML|VIR_DOMAIN_BLOCK_REBASE_BACKWARD)
as a specification to take the contents of intermediate, merge those backwards into template, and adjust the rest of the backing file chain so that whatever used to be backed by intermediate is now backed by template. Or, if qemu ever gives us the ability to merge non-current images, we may decide at that time that it is worth a new API to expose those new complexities.
This is all starting to scare me so I will defer to the storage pros :)
Another thing I have been thinking about is virDomainSnapshotDelete. The above conversation talks about merging of a single disk, but a live disk snapshot operation can create backing file chains for multiple disks at once, all tracked by a snapshot. Additionally, the current code allows a snapshot delete of internal snapshots, but refuses to do anything useful with an external snapshot, because there is currently no way to specify if the snapshot is removed by merging the base into the new current, or by undoing the current and merging it backwards into the base. Alas, virDomainSnapshotDelete doesn't take any arguments for how to handle the situation, and use of a flag to make the decision would limit all disks to be handled in the same manner. So what I'm thinking is that when a snapshot is created (or redefined, using redefinition as the vehicle to add in the new XML), the snapshot XML itself can record the preferred direction for undoing the snapshot; for example:
<domainsnapshot>
  <disks>
    <disk name='/path/to/old_vda'>
      <source file='/path/to/new_vda'/>
      <on_delete merge='forward'/>
    </disk>
    <disk name='/path/to/old_vdb'>
      <source file='/path/to/new_vdb'/>
      <on_delete merge='backward'/>
    </disk>
  </disks>
</domainsnapshot>
then when virDomainSnapshotDelete is called on that snapshot, old_vda would be forward merged into new_vda, while new_vdb would be backward merged into old_vdb. Again, that's food for thought for post-0.9.10, and shouldn't get in the way of adding virDomainBlockRebase() now.
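A per-disk deletion policy of this kind could be represented roughly as follows. This is a sketch only: the `on_delete` element is itself just a proposal in this thread, and `struct snap_disk`, `surviving_file()` and `discarded_file()` are hypothetical names, not existing libvirt structures.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum merge_dir { MERGE_FORWARD, MERGE_BACKWARD };

/* One <disk> entry of the snapshot: the image that existed before the
 * snapshot, the image created by it, and the recorded preference. */
struct snap_disk {
    const char *old_path;
    const char *new_path;
    enum merge_dir on_delete;
};

/* File that remains in use after the snapshot is deleted. */
static const char *surviving_file(const struct snap_disk *d)
{
    assert(d != NULL);
    return d->on_delete == MERGE_FORWARD ? d->new_path : d->old_path;
}

/* File whose contents were merged away and is no longer needed. */
static const char *discarded_file(const struct snap_disk *d)
{
    assert(d != NULL);
    return d->on_delete == MERGE_FORWARD ? d->old_path : d->new_path;
}
```

Because the direction is stored per disk, one virDomainSnapshotDelete call could treat vda and vdb differently, which a single flag on the call could not express.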
-- Eric Blake eblake@redhat.com +1-919-301-3266 Libvirt virtualization library http://libvirt.org
-- Adam Litke <agl@us.ibm.com> IBM Linux Technology Center

On 01/31/2012 01:53 PM, Adam Litke wrote:
On Tue, Jan 31, 2012 at 09:28:51AM -0700, Eric Blake wrote:
Right now, the existing virDomainBlockPull API has a tough limitation - it is an all-or-none approach. In all my examples below, I'm starting from the following relationship, where '<-' means 'is a backing file of':
template <- intermediate <- current
Meanwhile, qemu is adding support for a partial block pull operation, still on the current image as the merge destination, but where you can now specify an optional argument to limit the pull to just the intermediate files and altering the current image to be backed by an ancestor file, as in:
merge intermediate into current, creating: template <- current
For 0.9.10, I'd like to add the following API:
/**
 * virDomainBlockRebase:
 * @dom: pointer to domain object
 * @disk: path to the block device, or device shorthand
 * @base: new base image, or NULL for entire block pull
 * @bandwidth: (optional) specify copy bandwidth limit in Mbps
 * @flags: extra flags; not used yet, so callers should always pass 0
What is the format of the @base arg? My first thought would be a path, but what if the desired image file is not directly known to libvirt?
Libvirt already has to know the absolute paths of all disks in the backing file chain, in order to properly SELinux label them prior to invoking qemu. So I'm envisioning the absolute path of the backing file in the chain that will be preserved. That is, with:

truncate -s 10M /path/to/template
qemu-img create -f qcow2 \
  -o backing_file=/path/to/template /path/to/intermediate
qemu-img create -f qcow2 \
  -o backing_file=/path/to/intermediate /path/to/current

followed by virDomainBlockRebase(dom, "vda", "/path/to/template", 0, 0)

would result in /path/to/current referring to /path/to/template as its primary backing file.

I also had an idea down below where, with the addition of a new flags value, base could refer to a well-formed XML block rather than a single file name, such that we could then populate that XML block with more complex instructions; but I'm not proposing doing that extension in the 0.9.10 timeframe, so much as trying to argue that this API is extensible and we won't need yet a third API for block pull if qemu ever allows more complex merging scenarios.
Given that Adam has a pending patch to support a VIR_DOMAIN_BLOCK_PULL_ASYNC flag, this same flag would have to be supported in virDomainBlockRebase.
That patch only applies to virDomainBlockJobCancel(). The blockJob initiators (virDomainBlockPull and this new one) already use an async mode of operation because the call simply starts the block job.
Ah, right. I'm getting slightly confused with all the patches that still need review :) virDomainBlockPull has always been asynchronous, so no flag is needed there or in this new API.
and backward merge of a non-current image (that is, undoing an earlier snapshot, but by modifying the template rather than the current image):
merge intermediate into template, creating: template <- current
Don't these raise some security concerns about modifying a potentially shared intermediate image?
Yes, the management app has to be careful not to remove backing files that are in use in other chains. But we already have a lock manager setup, so of course, part of the libvirt work is integrating this work so that no image is ever reverted if the lock manager says a file is in use; libvirt can also check all registered storage pools and refuse to modify a base image that a pool tracks as backing more than one other image (there's probably still a lot of work to be done in libvirt to make it bulletproof, but that's okay, since again, my ideas about reverse merges are post-0.9.10 and would also require more work from qemu).

At any rate, you responded favorably to the first half of my email (the proposal for what to implement in 0.9.10), even if you got scared by my musings about possible future extensions at later releases. I'll take that as a good sign that I have a) come up with a good API worth adding now, and b) divided things appropriately into what is reasonable to do now vs. what is complex enough to be worth delaying until we have more experience with the use cases and ramifications of adding the complexity.

-- Eric Blake eblake@redhat.com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

On Tue, Jan 31, 2012 at 03:00:40PM -0700, Eric Blake wrote:
On 01/31/2012 01:53 PM, Adam Litke wrote:
On Tue, Jan 31, 2012 at 09:28:51AM -0700, Eric Blake wrote:
Right now, the existing virDomainBlockPull API has a tough limitation - it is an all-or-none approach. In all my examples below, I'm starting from the following relationship, where '<-' means 'is a backing file of':
template <- intermediate <- current
Meanwhile, qemu is adding support for a partial block pull operation, still on the current image as the merge destination, but where you can now specify an optional argument to limit the pull to just the intermediate files and altering the current image to be backed by an ancestor file, as in:
merge intermediate into current, creating: template <- current
For 0.9.10, I'd like to add the following API:
/**
 * virDomainBlockRebase:
 * @dom: pointer to domain object
 * @disk: path to the block device, or device shorthand
 * @base: new base image, or NULL for entire block pull
 * @bandwidth: (optional) specify copy bandwidth limit in Mbps
 * @flags: extra flags; not used yet, so callers should always pass 0
What is the format of the @base arg? My first thought would be a path, but what if the desired image file is not directly known to libvirt?
Libvirt already has to know the absolute paths of all disks in the backing file chain, in order to properly SELinux label them prior to invoking qemu. So I'm envisioning the absolute path of the backing file in the chain that will be preserved. That is, with:
Ok. That is clear. Thanks.
truncate -s 10M /path/to/template
qemu-img create -f qcow2 \
  -o backing_file=/path/to/template /path/to/intermediate
qemu-img create -f qcow2 \
  -o backing_file=/path/to/intermediate /path/to/current
followed by virDomainBlockRebase(dom, "vda", "/path/to/template", 0, 0)
would result in /path/to/current referring to /path/to/template as its primary backing file.
I also had an idea down below where, with the addition of a new flags value, base could refer to a well-formed XML block rather than a single file name, such that we could then populate that XML block with more complex instructions; but I'm not proposing doing that extension in the 0.9.10 timeframe, so much as trying to argue that this API is extensible and we won't need yet a third API for block pull if qemu ever allows more complex merging scenarios.
Given that Adam has a pending patch to support a VIR_DOMAIN_BLOCK_PULL_ASYNC flag, this same flag would have to be supported in virDomainBlockRebase.
That patch only applies to virDomainBlockJobCancel(). The blockJob initiators (virDomainBlockPull and this new one) already use an async mode of operation because the call simply starts the block job.
Ah, right. I'm getting slightly confused with all the patches that still need review :)
virDomainBlockPull has always been asynchronous, so no flag is needed there or in this new API.
and backward merge of a non-current image (that is, undoing an earlier snapshot, but by modifying the template rather than the current image):
merge intermediate into template, creating: template <- current
Don't these raise some security concerns about modifying a potentially shared intermediate image?
Yes, the management app has to be careful not to remove backing files that are in use in other chains. But we already have a lock manager setup, so of course, part of the libvirt work is integrating this work so that no image is ever reverted if the lock manager says a file is in use; libvirt can also check all registered storage pools and refuse to modify a base image that a pool tracks as backing more than one other image (there's probably still a lot of work to be done in libvirt to make it bulletproof, but that's okay, since again, my ideas about reverse merges are post-0.9.10 and would also require more work from qemu).
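The storage pool check mentioned above could be approximated like this. It is a sketch under assumptions: `struct image`, `count_dependents()` and `backward_merge_allowed()` are hypothetical, and the array stands in for whatever list of volumes libvirt would gather from its registered pools; lock manager integration is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One known image and the file it is directly backed by
 * (backing == NULL for an image without a backing file). */
struct image {
    const char *path;
    const char *backing;
};

/* Count how many known images are directly backed by path. */
static size_t count_dependents(const struct image *imgs, size_t n,
                               const char *path)
{
    size_t count = 0;
    size_t i;

    assert(path != NULL);
    for (i = 0; i < n; i++) {
        if (imgs[i].backing != NULL && strcmp(imgs[i].backing, path) == 0)
            count++;
    }
    return count;
}

/* A backward merge rewrites dest, so refuse it when dest serves as
 * the base of more than one other image. Returns 1 if allowed. */
static int backward_merge_allowed(const struct image *imgs, size_t n,
                                  const char *dest)
{
    return count_dependents(imgs, n, dest) <= 1;
}
```

A check like this only sees images that libvirt knows about, so it complements the lock manager rather than replacing it.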
At any rate, you responded favorably to the first half of my email (the proposal for what to implement in 0.9.10), even if you got scared by my musings about possible future extensions at later releases. I'll take that as a good sign that I have a) come up with a good API worth adding now, and b) divided things appropriately into what is reasonable to do now vs. what is complex enough to be worth delaying until we have more experience with the use cases and ramifications of adding the complexity.
Yes, exactly. Seems like a good plan. I am happy to see that the blockJob API family will be extended as we initially intended.
-- Adam Litke <agl@us.ibm.com> IBM Linux Technology Center