[libvirt] [RFC v2] external (pull) backup API

Hi, all.

This is another RFC on pull backup API. This API provides means to read domain disks in a snapshotted state so that a client can back them up, as well as means to write domain disks to revert them to the backed-up state. The previous version of the RFC is [1]. I'll also describe the API implementation details to shed light on the usage of the various qemu dirty bitmap commands.

This API does not use the existing disk snapshots. Instead it introduces snapshots provided by qemu's blockdev-backup command. The reason is that we need the snapshotted disk state only temporarily, for the duration of the backup operation, and the newly introduced snapshots can easily be discarded at the end of the operation without a block commit operation. Technically the difference is as follows. With a usual snapshot we create a new image backed by the original, and all new data goes to the new image, thus the original image stays in a snapshotted state. With temporary snapshots we create a new image backed by the original, but all new data still goes to the original image; before new data is written, the old data about to be overwritten is popped out to the new image, thus we get the snapshotted state through the new image.

Disk snapshots, as well as the disks themselves, are available to read/write through the qemu NBD server.

Here are the typical actions on domain backup:

- create a temporary snapshot of the domain disks of interest
- export the snapshots through NBD
- back them up
- remove the disks from export
- delete the temporary snapshot

and the typical actions on domain restore:

- start the domain in paused state
- export the domain disks of interest through NBD for write
- restore them
- remove the disks from export
- resume or destroy the domain

Now let's write down the API in more detail. There are minor changes in comparison with the previous version [1].

*Temporary snapshot API*

In the previous version it was called the 'Fleece API' after the qemu term, and I'll still use the BlockSnapshot prefix for commands as in the previous RFC, instead of TmpSnapshots which I am more inclined to now.

virDomainBlockSnapshotPtr
virDomainBlockSnapshotCreateXML(virDomainPtr domain,
                                const char *xmlDesc,
                                unsigned int flags);

virDomainBlockSnapshotDelete(virDomainBlockSnapshotPtr snapshot,
                             unsigned int flags);

virDomainBlockSnapshotList(virDomainPtr domain,
                           virDomainBlockSnapshotPtr **snaps,
                           unsigned int flags);

virDomainBlockSnapshotGetXMLDesc(virDomainBlockSnapshotPtr snapshot,
                                 unsigned int flags);

virDomainBlockSnapshotPtr
virDomainBlockSnapshotLookupByName(virDomainPtr domain,
                                   const char *name,
                                   unsigned int flags);

Here is an example of a snapshot xml description:

<domainblocksnapshot>
    <name>d068765e-8b50-4d74-9b72-1e55c663cbf8</name>
    <disk name='sda' type="file">
        <fleece file="/tmp/snapshot-a.hdd"/>
    </disk>
    <disk name='sdb' type="file">
        <fleece file="/tmp/snapshot-b.hdd"/>
    </disk>
</domainblocksnapshot>

Temporary snapshots are independent, thus they are not organized in a tree structure as usual snapshots are, so the 'list snapshots' and 'lookup' functions will suffice.
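To make the intended usage concrete, here is a minimal sketch of the snapshot lifecycle driven from a client. The Python binding names are hypothetical (the API exists only in this RFC), so read it as pseudocode under that assumption:

    import libvirt  # assumes bindings for the proposed API would be generated as usual

    SNAP_XML = """
    <domainblocksnapshot>
        <name>backup-snap</name>
        <disk name='sda' type='file'>
            <fleece file='/tmp/snapshot-a.hdd'/>
        </disk>
    </domainblocksnapshot>
    """

    conn = libvirt.open('qemu:///system')
    dom = conn.lookupByName('vm1')

    snap = dom.blockSnapshotCreateXML(SNAP_XML, 0)           # hypothetical binding
    print([s.name() for s in dom.listAllBlockSnapshots()])   # hypothetical binding
    print(snap.getXMLDesc(0))                                # hypothetical binding
    snap.delete(0)                                           # hypothetical binding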
Qemu can track which of a disk's blocks have changed from the snapshotted state, so on the next backup the client can back up only the changed blocks. virDomainBlockSnapshotCreateXML accepts a VIR_DOMAIN_BLOCK_SNAPSHOT_CREATE_CHECKPOINT flag to turn this option on for a snapshot, which means tracking changes from this particular snapshot. I used the checkpoint term and not [dirty] bitmap because in the current implementation many qemu dirty bitmaps are used to provide the changed blocks from the given checkpoint to the current snapshot (see the *Implementation* section for more details). Also, a bitmap holds block changes and thus itself changes in time, while a checkpoint is a more static term: it means you can query changes from that moment in time.

Checkpoints are visible in the active domain xml:

<disk type='file' device='disk'>
  ..
  <target dev='sda' bus='scsi'/>
  <alias name='scsi0-0-0-0'/>
  <checkpoint name="93a5c045-6457-2c09-e56c-927cdf34e178">
  <checkpoint name="5768a388-c1c4-414c-ac4e-eab216ba7c0c">
  ..
</disk>

Every checkpoint requires a qemu dirty bitmap, which eats 16MiB of RAM for a 1TiB disk with the default dirty block size of 64KiB, and the same amount of disk space is used. So the client needs to manage checkpoints and delete unused ones. Thus the next API function:

int
virDomainBlockCheckpointRemove(virDomainPtr domain,
                               const char *name,
                               unsigned int flags);

*Block export API*

I guess it is natural to treat the qemu NBD server as a domain device. So we can use the virDomainAttachDeviceFlags/virDomainDetachDeviceFlags API to start/stop the NBD server and virDomainUpdateDeviceFlags to add/delete the disks to be exported. While I have no doubts about the start/stop operations, using virDomainUpdateDeviceFlags looks a bit inconvenient, so I decided to add a pair of API functions just to add/delete the disks to be exported:

int
virDomainBlockExportStart(virDomainPtr domain,
                          const char *xmlDesc,
                          unsigned int flags);

int
virDomainBlockExportStop(virDomainPtr domain,
                         const char *xmlDesc,
                         unsigned int flags);

I guess more appropriate names are virDomainBlockExportAdd and virDomainBlockExportRemove, but as I already have a patch series implementing pull backups with these names, I would like to keep them for now. These names also reflect that in the implementation I decided to start/stop the NBD server in a lazy manner. While this is a bit innovative for the libvirt API, I think it is convenient, because to refer to the NBD server when adding/removing disks we would need to identify it through its parameters like type, address, etc., until we introduce some device id (which does not look consistent with the current libvirt design). So it looks like we have all the parameters to start/stop the server within these calls, so why have extra API calls just to start/stop the server manually. If we later need to have an NBD server without disks, we can perfectly well support virDomainAttachDeviceFlags/virDomainDetachDeviceFlags.

Here is an example of the xml to add/remove disks (specifying the checkpoint attribute is not needed for removing disks, of course):

<domainblockexport type="nbd">
    <address type="ip" host="0.0.0.0" port="8000"/>
    <disk name="sda" snapshot="0044757e-1a2d-4c2c-b92f-bb403309bb17"
                     checkpoint="d068765e-8b50-4d74-9b72-1e55c663cbf8"/>
    <disk name="sdb" snapshot="0044757e-1a2d-4c2c-b92f-bb403309bb17"
                     checkpoint="d068765e-8b50-4d74-9b72-1e55c663cbf8"/>
</domainblockexport>

And this is how this NBD server will be exposed in the domain xml:

<devices>
    ...
    <blockexport type="nbd">
        <address type="ip" host="0.0.0.0" port="8000"/>
        <disk name="sda" snapshot="0044757e-1a2d-4c2c-b92f-bb403309bb17"
                         checkpoint="d068765e-8b50-4d74-9b72-1e55c663cbf8"
                         exportname="sda-0044757e-1a2d-4c2c-b92f-bb403309bb17"/>
        <disk name="sdb" snapshot="0044757e-1a2d-4c2c-b92f-bb403309bb17"
                         checkpoint="d068765e-8b50-4d74-9b72-1e55c663cbf8"
                         exportname="sdb-0044757e-1a2d-4c2c-b92f-bb403309bb17"/>
    </blockexport>
    ...
</devices>
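Putting the two APIs together, a backup client would roughly do the following (continuing the Python sketch above; the bindings and the NBD copy helper are hypothetical, and the checkpoint attribute is what makes the export advertise only the blocks changed since that checkpoint):

    CREATE_CHECKPOINT = 1  # placeholder for VIR_DOMAIN_BLOCK_SNAPSHOT_CREATE_CHECKPOINT

    EXPORT_XML = """
    <domainblockexport type='nbd'>
        <address type='ip' host='0.0.0.0' port='8000'/>
        <disk name='sda' snapshot='backup-snap'
              checkpoint='d068765e-8b50-4d74-9b72-1e55c663cbf8'/>
    </domainblockexport>
    """

    # take the snapshot with a checkpoint so that the *next* backup can be incremental
    snap = dom.blockSnapshotCreateXML(SNAP_XML, CREATE_CHECKPOINT)  # hypothetical binding
    dom.blockExportStart(EXPORT_XML, 0)                             # hypothetical binding
    try:
        # placeholder for any NBD client that honours the dirty bitmap of the
        # named checkpoint and copies only those blocks
        copy_dirty_blocks('nbd://host:8000/sda-backup-snap', '/backup/sda.inc')
    finally:
        dom.blockExportStop(EXPORT_XML, 0)                          # hypothetical binding
        snap.delete(0)
        # hypothetical; drop the checkpoint once it is no longer needed
        dom.blockCheckpointRemove('d068765e-8b50-4d74-9b72-1e55c663cbf8', 0)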
*Implementation details from qemu-libvirt interactions POV*

1. Temporary snapshot

- create snapshot
    - add a fleece blockdev backed by the disk of interest
    - start a fleece blockjob which will pop out data to be overwritten to the fleece blockdev

{
  "execute": "blockdev-add"
  "arguments": {
    "backing": "drive-scsi0-0-0-0",
    "driver": "qcow2",
    "file": {
      "driver": "file",
      "filename": "/tmp/snapshot-a.hdd"
    },
    "node-name": "snapshot-scsi0-0-0-0"
  },
}

{
  "execute": "transaction"
  "arguments": {
    "actions": [
      {
        "type": "blockdev-backup"
        "data": {
          "device": "drive-scsi0-0-0-0",
          "target": "snapshot-scsi0-0-0-0"
          "sync": "none",
        },
      }
    ]
  },
}

- delete snapshot
    - cancel the fleece blockjob
    - delete the fleece blockdev

{
  "execute": "block-job-cancel"
  "arguments": {
    "device": "drive-scsi0-0-0-0"
  },
}

{
  "execute": "blockdev-del"
  "arguments": {
    "node-name": "snapshot-scsi0-0-0-0"
  },
}

2. Block export

- add disks to export
    - start the NBD server if it is not started
    - add disks

{
  "execute": "nbd-server-start"
  "arguments": {
    "addr": {
      "type": "inet"
      "data": {
        "host": "0.0.0.0",
        "port": "49300"
      },
    }
  },
}

{
  "execute": "nbd-server-add"
  "arguments": {
    "device": "snapshot-scsi0-0-0-0",
    "name": "sda-d068765e-8b50-4d74-9b72-1e55c663cbf8",
    "writable": false
  },
}

- remove disks from export
    - remove disks
    - stop the NBD server if there are no disks left

{
  "arguments": {
    "mode": "hard",
    "name": "sda-d068765e-8b50-4d74-9b72-1e55c663cbf8"
  },
  "execute": "nbd-server-remove"
}

{
  "execute": "nbd-server-stop"
}

3. Checkpoints (the most interesting part)

First a few facts about qemu dirty bitmaps.

A bitmap can be either in the active or the disabled state. In the disabled state it does not get changed on guest writes, and conversely in the active state it tracks guest writes. This implementation uses an approach with only one active bitmap at a time. This should reduce guest write penalties in the presence of checkpoints. So on the first snapshot we create bitmap B_1; now it tracks changes from snapshot 1. On the second snapshot we create bitmap B_2 and disable bitmap B_1, and so on. Now bitmap B_1 keeps the changes from snapshot 1 to snapshot 2, B_2 the changes from snapshot 2 to snapshot 3, and so on. The last bitmap is active and accumulates the disk changes made after the latest snapshot.

Getting the bitmap of blocks changed from some checkpoint in the past up to the current snapshot is quite simple in this scheme. For example, if the last snapshot is 7, then to get the changes from snapshot 3 to the latest snapshot we need to merge bitmaps B_3, B_4, B_5 and B_6. A merge is just a logical OR on the bitmap bits.

Deleting a checkpoint somewhere in the middle of the checkpoint sequence requires merging the corresponding bitmap into the previous bitmap in this scheme.
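As an illustration of the merge rule, here is a small Python sketch. Bitmaps are modelled as sets of dirty cluster indexes; in reality qemu performs the OR internally via its merge command:

    def changed_blocks(bitmaps, k, n):
        """Blocks changed between checkpoint k and snapshot n (k < n):
           the union of B_k, ..., B_{n-1}."""
        result = set()
        for i in range(k, n):
            result |= bitmaps[i]  # merge == logical OR
        return result

    # example: changes from checkpoint 3 up to snapshot 7 -> OR of B_3..B_6
    B = {3: {1, 5}, 4: {5, 9}, 5: set(), 6: {42}}
    assert changed_blocks(B, 3, 7) == {1, 5, 9, 42}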
We use persistent bitmaps in the implementation. This means that upon qemu process termination the bitmaps are saved in the disk image metadata and restored back on qemu process start. This makes a checkpoint a persistent property, that is, we keep them across domain starts/stops. Qemu does not try hard to keep bitmaps: if something goes wrong on save, the bitmap is dropped. The same applies to the migration process too. For the backup process this is not critical: if we don't discover a checkpoint, we can always make a full backup. Also, qemu provides no special means to track the order of bitmaps. These facts are critical for an implementation with one active bitmap at a time. We need the right order of bitmaps upon merge - for snapshot N and block changes from snapshot K, K < N, we need to merge bitmaps B_{K}, ..., B_{N-1}. Also, if one of the bitmaps to be merged is missing, we can't calculate the desired block changes either.

So the implementation encodes the bitmap order in their names. For snapshot A1 the bitmap name will be A1, for snapshot A2 the bitmap name will be A2^A1, and so on. Using this naming encoding, upon domain start we can find out the bitmap order and check for missing ones. This complicates bitmap removal a bit though. For example, removing a bitmap somewhere in the middle looks like this:

- removing bitmap K (bitmap name is NAME_{K}^NAME_{K-1})
    - create a new bitmap named NAME_{K+1}^NAME_{K-1}          ---.
    - disable the new bitmap                                      | This is effectively renaming
    - merge bitmap NAME_{K+1}^NAME_{K} to the new bitmap          | bitmap K+1 to comply with
    - remove bitmap NAME_{K+1}^NAME_{K}                        ___/ the naming scheme
    - merge bitmap NAME_{K}^NAME_{K-1} to NAME_{K-1}^NAME_{K-2}
    - remove bitmap NAME_{K}^NAME_{K-1}

As you can see, we need to change the name of bitmap K+1 to keep our bitmap naming scheme. This is done by creating a new K+1 bitmap with the appropriate name and copying the old K+1 bitmap into the new one. So while it is possible to have only one active bitmap at a time, it costs some exercises at the management layer. To me it looks like qemu itself is a better place to track bitmap chain order and consistency.
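For illustration, a small sketch of how the management layer could rebuild the chain order from such names on domain start (names are shortened for readability; the real ones carry a libvirt- prefix and UUIDs):

    def order_bitmaps(names):
        """Rebuild checkpoint order from names of the form 'CUR^PREV'
           (the first checkpoint's bitmap has no '^PREV' part)."""
        by_parent = {}
        first = None
        for n in names:
            cur, _, prev = n.partition('^')
            if prev:
                by_parent[prev] = cur
            else:
                first = cur
        chain = [first]
        while chain[-1] in by_parent:  # a break in the chain means a missing bitmap
            chain.append(by_parent[chain[-1]])
        return chain

    assert order_bitmaps(['A1', 'A3^A2', 'A2^A1']) == ['A1', 'A2', 'A3']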
Now here is how exporting bitmaps looks.

- add disk snapshot N to export with changes from checkpoint K
    - add the fleece blockdev to the NBD exports
    - create a new bitmap T
    - disable bitmap T
    - merge bitmaps K, K+1, .. N-1 into T
    - add bitmap T to the nbd export

- remove disk snapshot from export
    - remove the fleece blockdev from the NBD exports
    - remove bitmap T

Here are qemu command examples for operations with checkpoints; I'll make several snapshots with checkpoints for the purpose of better illustration.

- create snapshot d068765e-8b50-4d74-9b72-1e55c663cbf8 with a checkpoint
    - same as without a checkpoint, but additionally add a bitmap on fleece blockjob start

...
{
  "execute": "transaction"
  "arguments": {
    "actions": [
      {
        "type": "blockdev-backup"
        "data": {
          "device": "drive-scsi0-0-0-0",
          "sync": "none",
          "target": "snapshot-scsi0-0-0-0"
        },
      },
      {
        "type": "block-dirty-bitmap-add"
        "data": {
          "name": "libvirt-d068765e-8b50-4d74-9b72-1e55c663cbf8",
          "node": "drive-scsi0-0-0-0",
          "persistent": true
        },
      }
    ]
  },
}

- delete snapshot d068765e-8b50-4d74-9b72-1e55c663cbf8
    - same as without checkpoints

- create snapshot 0044757e-1a2d-4c2c-b92f-bb403309bb17 with a checkpoint
    - same actions as for the first snapshot, but additionally disable the first bitmap

...
{
  "execute": "transaction"
  "arguments": {
    "actions": [
      {
        "type": "blockdev-backup"
        "data": {
          "device": "drive-scsi0-0-0-0",
          "sync": "none",
          "target": "snapshot-scsi0-0-0-0"
        },
      },
      {
        "type": "x-vz-block-dirty-bitmap-disable"
        "data": {
          "name": "libvirt-d068765e-8b50-4d74-9b72-1e55c663cbf8",
          "node": "drive-scsi0-0-0-0"
        },
      },
      {
        "type": "block-dirty-bitmap-add"
        "data": {
          "name": "libvirt-0044757e-1a2d-4c2c-b92f-bb403309bb17^d068765e-8b50-4d74-9b72-1e55c663cbf8",
          "node": "drive-scsi0-0-0-0",
          "persistent": true
        },
      }
    ]
  },
}

- delete snapshot 0044757e-1a2d-4c2c-b92f-bb403309bb17

- create snapshot 8fc02db3-166f-4de7-b7aa-1f7303e6162b with a checkpoint

- add disk snapshot 8fc02db3-166f-4de7-b7aa-1f7303e6162b to export, with a bitmap of changes from checkpoint d068765e-8b50-4d74-9b72-1e55c663cbf8
    - same as adding an export without a checkpoint, but additionally
        - form the resulting bitmap
        - add the bitmap to the NBD export

...
{
  "execute": "transaction"
  "arguments": {
    "actions": [
      {
        "type": "block-dirty-bitmap-add"
        "data": {
          "node": "drive-scsi0-0-0-0",
          "name": "libvirt-__export_temporary__",
          "persistent": false
        },
      },
      {
        "type": "x-vz-block-dirty-bitmap-disable"
        "data": {
          "node": "drive-scsi0-0-0-0"
          "name": "libvirt-__export_temporary__",
        },
      },
      {
        "type": "x-vz-block-dirty-bitmap-merge"
        "data": {
          "node": "drive-scsi0-0-0-0",
          "src_name": "libvirt-d068765e-8b50-4d74-9b72-1e55c663cbf8"
          "dst_name": "libvirt-__export_temporary__",
        },
      },
      {
        "type": "x-vz-block-dirty-bitmap-merge"
        "data": {
          "node": "drive-scsi0-0-0-0",
          "src_name": "libvirt-0044757e-1a2d-4c2c-b92f-bb403309bb17^d068765e-8b50-4d74-9b72-1e55c663cbf8"
          "dst_name": "libvirt-__export_temporary__",
        },
      }
    ]
  },
}

{
  "execute": "x-vz-nbd-server-add-bitmap"
  "arguments": {
    "name": "sda-8fc02db3-166f-4de7-b7aa-1f7303e6162b"
    "bitmap": "libvirt-__export_temporary__",
    "bitmap-export-name": "d068765e-8b50-4d74-9b72-1e55c663cbf8",
  },
}

- remove snapshot 8fc02db3-166f-4de7-b7aa-1f7303e6162b from export
    - same as without a checkpoint, but additionally remove the temporary bitmap

...
{
  "arguments": {
    "name": "libvirt-__export_temporary__",
    "node": "drive-scsi0-0-0-0"
  },
  "execute": "block-dirty-bitmap-remove"
}

- delete checkpoint 0044757e-1a2d-4c2c-b92f-bb403309bb17
  (a similar operation is described in the section about the naming scheme for bitmaps, with the difference that K+1 is N here and thus the new bitmap should not be disabled)

{
  "arguments": {
    "actions": [
      {
        "type": "block-dirty-bitmap-add"
        "data": {
          "node": "drive-scsi0-0-0-0",
          "name": "libvirt-8fc02db3-166f-4de7-b7aa-1f7303e6162b^d068765e-8b50-4d74-9b72-1e55c663cbf8",
          "persistent": true
        },
      },
      {
        "type": "x-vz-block-dirty-bitmap-merge"
        "data": {
          "node": "drive-scsi0-0-0-0",
          "src_name": "libvirt-0044757e-1a2d-4c2c-b92f-bb403309bb17^d068765e-8b50-4d74-9b72-1e55c663cbf8"
          "dst_name": "libvirt-d068765e-8b50-4d74-9b72-1e55c663cbf8",
        },
      },
      {
        "type": "x-vz-block-dirty-bitmap-merge"
        "data": {
          "node": "drive-scsi0-0-0-0",
          "src_name": "libvirt-8fc02db3-166f-4de7-b7aa-1f7303e6162b^0044757e-1a2d-4c2c-b92f-bb403309bb17"
          "dst_name": "libvirt-8fc02db3-166f-4de7-b7aa-1f7303e6162b^d068765e-8b50-4d74-9b72-1e55c663cbf8"
        },
      },
    ]
  },
  "execute": "transaction"
}

{
  "execute": "x-vz-block-dirty-bitmap-remove"
  "arguments": {
    "node": "drive-scsi0-0-0-0"
    "name": "libvirt-8fc02db3-166f-4de7-b7aa-1f7303e6162b^0044757e-1a2d-4c2c-b92f-bb403309bb17",
  },
},

{
  "execute": "x-vz-block-dirty-bitmap-remove"
  "arguments": {
    "node": "drive-scsi0-0-0-0"
    "name": "libvirt-0044757e-1a2d-4c2c-b92f-bb403309bb17^d068765e-8b50-4d74-9b72-1e55c663cbf8",
  },
}

Here is a list of the bitmap commands used in the implementation but not yet upstream (AFAIK):

x-vz-block-dirty-bitmap-remove
x-vz-block-dirty-bitmap-merge
x-vz-block-dirty-bitmap-disable
x-vz-block-dirty-bitmap-enable (not in the examples; used when removing the most recent checkpoint)
x-vz-nbd-server-add-bitmap

*Restore operation nuances*

As written above, to restore a domain one needs to start it in paused state, export the domain's disks for write, and write to them from the backup (a rough client-side sketch follows at the end of this mail). However qemu currently does not allow exporting disks for write even for a domain that never starts guest CPUs. We have an experimental qemu command line option -x-vz-nbd-restore (passed together with the -incoming option) to fix this.

*Links*

[1] Previous version of RFC
https://www.redhat.com/archives/libvir-list/2017-November/msg00514.html
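For completeness, a minimal sketch of the restore flow from the *Restore operation nuances* section above; the Python bindings, the writable-export flag and the NBD write helper are all hypothetical:

    import libvirt

    RESTORE_XML = """
    <domainblockexport type='nbd'>
        <address type='ip' host='0.0.0.0' port='8000'/>
        <disk name='sda'/>
    </domainblockexport>
    """

    conn = libvirt.open('qemu:///system')
    dom = conn.createXML(domain_xml, libvirt.VIR_DOMAIN_START_PAUSED)  # domain_xml prepared by the caller
    dom.blockExportStart(RESTORE_XML, EXPORT_WRITABLE)   # hypothetical binding and flag
    try:
        write_to_nbd('/backup/sda.full', 'nbd://host:8000/sda')  # placeholder for any NBD client able to write
    finally:
        dom.blockExportStop(RESTORE_XML, 0)              # hypothetical binding
    dom.resume()  # or dom.destroy() if only the disks were needed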

On 04/03/2018 07:01 AM, Nikolay Shirokovskiy wrote:
Hi, all.
This is another RFC on pull backup API. This API provides means to read domain disks in a snapshotted state so that client can back them up as well as means to write domain disks to revert them to backed up state. The previous version of RFC is [1]. I'll also describe the API implementation details to shed light on misc qemu dirty bitmap commands usage.
Thanks for such a detailed message! It's got enough that I want to spend some time thinking about the implications, but this is an early reply to let you know I'm at least working on it now. The first thing that caught my eye:
Here is a list of bitmap commands used in implementation but not yet in upstream (AFAIK).
x-vz-block-dirty-bitmap-remove
x-vz-block-dirty-bitmap-merge
x-vz-block-dirty-bitmap-disable
x-vz-block-dirty-bitmap-enable (not in the examples; used when removing most recent checkpoint)
x-vz-nbd-server-add-bitmap
How close are we to having upstream implementations of any of those commands? If not, what are their specifications? Libvirt is very hesitant to add code that depends on a qemu x-* command, but if we can get the actual command into qemu.git without the x-* prefix, it is easier to justify libvirt adding the API even if qemu 2.13 is not yet released.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

11.04.2018 16:56, Eric Blake wrote:
On 04/03/2018 07:01 AM, Nikolay Shirokovskiy wrote:
Hi, all.
This is another RFC on pull backup API. This API provides means to read domain disks in a snapshotted state so that client can back them up as well as means to write domain disks to revert them to backed up state. The previous version of RFC is [1]. I'll also describe the API implementation details to shed light on misc qemu dirty bitmap commands usage.

Thanks for such a detailed message! It's got enough that I want to spend some time thinking about the implications, but this is an early reply to let you know I'm at least working on it now.
The first thing that caught my eye:
Here is a list of bitmap commands used in implementation but not yet in upstream (AFAIK).
x-vz-block-dirty-bitmap-remove
we have the command block-dirty-bitmap-remove in qemu. We don't have transaction support for it; it is hard to implement because we would need to protect against operations on the same bitmap in the same transaction, so we decided not to do it, Nikolay?

x-vz-block-dirty-bitmap-merge
x-vz-block-dirty-bitmap-disable
x-vz-block-dirty-bitmap-enable (not in the examples; used when removing most recent checkpoint)

they are in
[PATCH for-2.12 0/4] qmp dirty bitmap API
https://lists.gnu.org/archive/html/qemu-devel/2017-11/msg02281.html
(long discussion)

and, I've already started v2 (preliminary refactoring):
[PATCH 0/7] Dirty bitmaps fixing and refactoring
https://lists.nongnu.org/archive/html/qemu-devel/2018-03/msg06568.html
(mostly not reviewed)
x-vz-nbd-server-add-bitmap
it is in [PATCH for-2.13 0/4] NBD export bitmaps http://lists.gnu.org/archive/html/qemu-devel/2018-03/msg05701.html (reviewed, I need to respin)
How close are we to having upstream implementations of any of those commands? If not, what are their specifications? Libvirt is very hesitant to add code that depends on a qemu x-* command, but if we can get the actual command into qemu.git without the x-* prefix, it is easier to justify libvirt adding the API even if qemu 2.13 is not yet released.
-- Best regards, Vladimir

On 11.04.2018 17:16, Vladimir Sementsov-Ogievskiy wrote:
11.04.2018 16:56, Eric Blake wrote:
On 04/03/2018 07:01 AM, Nikolay Shirokovskiy wrote:
Hi, all. This is another RFC on pull backup API. This API provides means to read domain disks in a snapshotted state so that client can back them up as well as means to write domain disks to revert them to backed up state. The previous version of RFC is [1]. I'll also describe the API implementation details to shed light on misc qemu dirty bitmap commands usage.

Thanks for such a detailed message! It's got enough that I want to spend some time thinking about the implications, but this is an early reply to let you know I'm at least working on it now.
The first thing that caught my eye:
Here is a list of bitmap commands used in implementation but not yet in upstream (AFAIK).
x-vz-block-dirty-bitmap-remove
we have command block-dirty-bitmap-remove in qemu. We don't have transaction support, but it is hard to implement it, because we need to protect from operations on same bitmap in same transaction, so, we decided to not do it, Nikolay?
Yeah, that's right. Having bitmap remove operation outside of transaction will not impact implementation of tracking bitmaps order and consistency in libvirt substantially.
x-vz-block-dirty-bitmap-merge
x-vz-block-dirty-bitmap-disable
x-vz-block-dirty-bitmap-enable (not in the examples; used when removing most recent checkpoint)

they are in
[PATCH for-2.12 0/4] qmp dirty bitmap API
https://lists.gnu.org/archive/html/qemu-devel/2017-11/msg02281.html
(long discussion)

and, I've already started v2 (preliminary refactoring):
[PATCH 0/7] Dirty bitmaps fixing and refactoring
https://lists.nongnu.org/archive/html/qemu-devel/2018-03/msg06568.html
(mostly not reviewed)
x-vz-nbd-server-add-bitmap
it is in [PATCH for-2.13 0/4] NBD export bitmaps http://lists.gnu.org/archive/html/qemu-devel/2018-03/msg05701.html (reviewed, I need to respin)
How close are we to having upstream implementations of any of those commands? If not, what are their specifications? Libvirt is very hesitant to add code that depends on a qemu x-* command, but if we can get the actual command into qemu.git without the x-* prefix, it is easier to justify libvirt adding the API even if qemu 2.13 is not yet released.

On 04/11/2018 09:56 AM, Eric Blake wrote:
On 04/03/2018 07:01 AM, Nikolay Shirokovskiy wrote:
Hi, all.
This is another RFC on pull backup API. This API provides means to read domain disks in a snapshotted state so that client can back them up as well as means to write domain disks to revert them to backed up state. The previous version of RFC is [1]. I'll also describe the API implementation details to shed light on misc qemu dirty bitmap commands usage.
Thanks for such a detailed message! It's got enough that I want to spend some time thinking about the implications, but this is an early reply to let you know I'm at least working on it now.
The first thing that caught my eye:
Here is a list of bitmap commands used in implementation but not yet in upstream (AFAIK).
x-vz-block-dirty-bitmap-remove
We have clear and transactionless remove; this is fine.
x-vz-block-dirty-bitmap-merge
Vladimir has a prototype for this I am dragging my feet on because I wanted to see the anticipated use case, which is provided here. The code is not complicated.
x-vz-block-dirty-bitmap-disable
Same story.
x-vz-block-dirty-bitmap-enable (not in the examples; used when removing most recent checkpoint)
Same.
x-vz-nbd-server-add-bitmap
<<TANGENT

This one is new to me, but I haven't been looking at the NBD code much. I have at times in the past suggested a bigger convenience command, like:

x-export-node bitmap=foo node=bar cache-node=baz id=zzyzx [...options...]

which would start an NBD server, tie the bitmap "foo" to it, export the node "bar", use the node "baz" as a backing-store for writes to "bar" while the export was active (this is starting a fleecing operation and that cache needs to be somewhere) ... and then after we're done,

x-close-export id=zzyzx [...options...]

There's been an open question in my mind if we want to expose all of this functionality through primitives (like Virtuozzo is proposing) or if we want to also wrap it up in a convenience command that allows us to offer more focused testing and declare only a subset of combinations of these primitives as supported (through what is effectively a new job command.)

...but don't worry about this suggestion too much. I'll read the full email and Eric's reply to it in a moment and re-evaluate what I think about how to expose these features.

TANGENT
How close are we to having upstream implementations of any of those commands? If not, what are their specifications? Libvirt is very hesitant to add code that depends on a qemu x-* command, but if we can get the actual command into qemu.git without the x-* prefix, it is easier to justify libvirt adding the API even if qemu 2.13 is not yet released.
Answered in part above, I was hesitant to check in new bitmap commands like "merge" without the x- prefix to QEMU before I could see the anticipated workflow and usage for these commands, so I'm going to try to read this email very carefully to offer any critique. At one point I offered an alternative workflow that was ... too complex and at odds with our existing primitives, and I decided to be a little more hands-off after that. I'll try to be brief and prudent here. --js

On 04/03/2018 07:01 AM, Nikolay Shirokovskiy wrote:
Hi, all.
This is another RFC on pull backup API. This API provides means to read domain disks in a snapshotted state so that client can back them up as well as means to write domain disks to revert them to backed up state. The previous version of RFC is [1]. I'll also describe the API implementation details to shed light on misc qemu dirty bitmap commands usage.
This is a first-pass review (making comments as I first encounter something, even if it gets explained later in the email)
This API does not use existent disks snapshots. Instead it introduces snapshots provided by qemu's blockdev-backup command. The reason is we need snapshotted disk state only temporarily for duration of backup operation and newly introduced snapshots can be easily discarded at the end of operation without block commit operation. Technically difference is next. On usual snapshot we create new image backed by original and all new data goes to the new image thus original image stays in a snapshotted state. In temporary snapshots we create new image backed by original and all new data still goes to the original image but before new data is written old data to be overwritten is popped out to the new image thus we get snapshotted state thru new image.
So, rewriting this to make sure I understand, let's start with a disk with contents A, then take a snapshot, then write B:

In the existing libvirt snapshot APIs, the data gets distributed as:

    base (contents A) <- new active (contents B)

where you want the new API:

    base, remains active (contents B)
      ~~~ backup (contents A)
Disk snapshots, as well as the disks themselves, are available to read/write through the qemu NBD server.
So the biggest reason for a new libvirt API is that we need management actions to control which NBD images from qemu are exposed and torn down at the appropriate sequences.
Here is typical actions on domain backup:
- create temporary snapshot of domain disks of interest
- export snapshots thru NBD
- back them up
- remove disks from export
- delete temporary snapshot
and typical actions on domain restore:
- start domain in paused state
- export domain disks of interest thru NBD for write
- restore them
- remove disks from export
- resume or destroy domain
Now let's write down API in more details. There are minor changes in comparison with previous version [1].
*Temporary snapshot API*
In previous version it is called 'Fleece API' after qemu terms and I'll still use BlockSnapshot prefix for commands as in previous RFC instead of TmpSnapshots which I inclined more now.
virDomainBlockSnapshotPtr virDomainBlockSnapshotCreateXML(virDomainPtr domain, const char *xmlDesc, unsigned int flags);
Just to make sure, we have the existing API of:

virDomainSnapshotPtr
virDomainSnapshotCreateXML(virDomainPtr domain,
                           const char *xmlDesc,
                           unsigned int flags);

So you are creating a new object (virDomainBlockSnapshotPtr) rather than reusing the existing virDomainSnapshotPtr, and although the two commands are similar, we get to design a new XML schema from scratch rather than trying to overload yet even more functionality onto the existing API.

Should we also have:

const char *virDomainBlockSnapshotGetName(virDomainBlockSnapshotPtr snapshot);
virDomainPtr virDomainBlockSnapshotGetDomain(virDomainBlockSnapshotPtr snapshot);
virConnectPtr virDomainBlockSnapshotGetConnect(virDomainBlockSnapshotPtr snapshot);

for symmetry with existing snapshot API?
virDomainBlockSnapshotDelete(virDomainBlockSnapshotPtr snapshot, unsigned int flags);
virDomainBlockSnapshotList(virDomainPtr domain, virDomainBlockSnapshotPtr **snaps, unsigned int flags);
I'm guessing this is the counterpart to virDomainListAllSnapshots() (the modern listing interface), and that we probably don't want counterparts for virDomainSnapshotNum/virDomainSnapshotListNames (the older listing interface, which was inherently racy as the list could change in length between the two calls).
virDomainBlockSnapshotGetXMLDesc(virDomainBlockSnapshotPtr snapshot, unsigned int flags);
virDomainBlockSnapshotPtr virDomainBlockSnapshotLookupByName(virDomainPtr domain, const char *name, unsigned int flags);
Also, the virDomainSnapshotPtr had a number of API to track a tree-like hierarchy between snapshots (that is, you very much want to know if snapshot B is a child of snapshot A), while it looks like your new virDomainBlockSnapshotPtrs are completely independent (no relationships between the snapshots, each can be independently created or torn down, without having to rewrite a relationship tree between them, and there is no need for counterparts to things like virDomainSnapshotNumChildren). Okay, I think that makes sense, and is a good reason for introducing a new object type rather than shoe-horning this into the existing API.
Here is an example of snapshot xml description:
<domainblocksnapshot>
    <name>d068765e-8b50-4d74-9b72-1e55c663cbf8</name>
    <disk name='sda' type="file">
        <fleece file="/tmp/snapshot-a.hdd"/>
    </disk>
    <disk name='sdb' type="file">
        <fleece file="/tmp/snapshot-b.hdd"/>
    </disk>
</domainblocksnapshot>
Temporary snapshots are independent, thus they are not organized in a tree structure as usual snapshots are, so the 'list snapshots' and 'lookup' functions will suffice.
So in the XML, the <fleece> element describes the destination file (back to my earlier diagram, it would be the file that is created and will hold content 'A' when the main active image is changed to hold content 'B' after the snapshot was created)?
Qemu can track what disk's blocks are changed from snapshotted state so on next backup client can backup only changed blocks. virDomainBlockSnapshotCreateXML accepts VIR_DOMAIN_BLOCK_SNAPSHOT_CREATE_CHECKPOINT flag to turn this option for snapshot which means to track changes from this particular snapshot. I used checkpoint term and not [dirty] bitmap because many qemu dirty bitmaps are used to provide changed blocks from the given checkpoint to current snapshot in current implementation (see *Implemenation* section for more details). Also bitmap keeps block changes and thus itself changes in time and checkpoint is a more statical terms means you can query changes from that moment in time.
Checkpoints are visible in active domain xml:
<disk type='file' device='disk'>
  ..
  <target dev='sda' bus='scsi'/>
  <alias name='scsi0-0-0-0'/>
  <checkpoint name="93a5c045-6457-2c09-e56c-927cdf34e178">
  <checkpoint name="5768a388-c1c4-414c-ac4e-eab216ba7c0c">
  ..
</disk>
Every checkpoint requires qemu dirty bitmap which eats 16MiB of RAM with default dirty block size of 64KiB for 1TiB disk and the same amount of disk space is used. So client need to manage checkpoints and delete unused. Thus next API function:
int virDomainBlockCheckpointRemove(virDomainPtr domain, const char *name, unsigned int flags);
I'm trying to figure out how BlockCheckpoint and BlockSnapshots relate. Maybe it will be more clear when I read the implementation section below. Is the idea that I can't create a BlockSnapshot without first having a checkpoint available? If so, where does that fit in the <domainblocksnapshot> XML?
*Block export API*
I guess it is natural to treat qemu NBD server as a domain device. So we can use virDomainAttachDeviceFlags/virDomainDetachDeviceFlags API to start/stop NBD server and virDomainUpdateDeviceFlags to add/delete disks to be exported.
This feels a bit awkward - up to now, attaching a device is something visible to the guest, but you are trying to reuse the interface to attach something tracked by the domain, but which has no impact to the guest. That is, the guest has no clue whether a block export exists pointing to a particular checkpoint, nor does it care.
While I have no doubts about the start/stop operations, using virDomainUpdateDeviceFlags looks a bit inconvenient, so I decided to add a pair of API functions just to add/delete disks to be exported:
int virDomainBlockExportStart(virDomainPtr domain, const char *xmlDesc, unsigned int flags);
int virDomainBlockExportStop(virDomainPtr domain, const char *xmlDesc, unsigned int flags);
I guess more appropriate names are virDomainBlockExportAdd and virDomainBlockExportRemove but as I already have a patch series implementing pull backups with these names I would like to keep these names now.
What does the XML look like in these calls?
These names also reflect that in the implementation I decided to start/stop the NBD server in a lazy manner. While it is a bit innovative for the libvirt API, I guess it is convenient, because to refer to the NBD server when adding/removing disks we need to identify it through its parameters like type, address, etc., until we introduce some device id (which does not look consistent with the current libvirt design).
This just reinforces my thoughts above - is the reason it doesn't make sense to assign a device id to the export due to the fact that the export is NOT guest-visible? Does it even belong under the "domain/devices/" xpath of the domain XML, or should it be a new sibling of <devices> with an xpath of "domain/blockexports/"?
So it looks like we have all parameters to start/stop server in the frame of these calls so why have extra API calls just to start/stop server manually. If we later need to have NBD server without disks we can perfectly support virDomainAttachDeviceFlags/virDomainDetachDeviceFlags.
Here is example of xml to add/remove disks (specifying checkpoint attribute is not needed for removing disks of course):
<domainblockexport type="nbd">
    <address type="ip" host="0.0.0.0" port="8000"/>
    <disk name="sda" snapshot="0044757e-1a2d-4c2c-b92f-bb403309bb17"
                     checkpoint="d068765e-8b50-4d74-9b72-1e55c663cbf8"/>
    <disk name="sdb" snapshot="0044757e-1a2d-4c2c-b92f-bb403309bb17"
                     checkpoint="d068765e-8b50-4d74-9b72-1e55c663cbf8"/>
</domainblockexport>
So this is the XML you pass to virDomainBlockExportStart, with the goal of telling qemu to start or stop an NBD export on the backing chain associated with disk "sda", where the export is serving up data tied to checkpoint "d068765e-8b50-4d74-9b72-1e55c663cbf8", and which will be associated with the destination snapshot file described by the <domainblocksnapshot> named "0044757e-1a2d-4c2c-b92f-bb403309bb17"? Why is it named <domainblockexport> here, but...
And this is how this NBD server will be exposed in domain xml:
<devices>
    ...
    <blockexport type="nbd">
<blockexport> here?
        <address type="ip" host="0.0.0.0" port="8000"/>
        <disk name="sda" snapshot="0044757e-1a2d-4c2c-b92f-bb403309bb17"
                         checkpoint="d068765e-8b50-4d74-9b72-1e55c663cbf8"
                         exportname="sda-0044757e-1a2d-4c2c-b92f-bb403309bb17"/>
The exportname property is new here compared to the earlier listing - is that something that libvirt generates, or that the user chooses?
        <disk name="sdb" snapshot="0044757e-1a2d-4c2c-b92f-bb403309bb17"
                         checkpoint="d068765e-8b50-4d74-9b72-1e55c663cbf8"
                         exportname="sdb-0044757e-1a2d-4c2c-b92f-bb403309bb17"/>
    </blockexport>
    ...
</devices>
*Implementation details from qemu-libvirt interactions POV*
1. Temporary snapshot
- create snapshot
Which libvirt API triggers this action? virDomainBlockSnapshotCreateXML?
- add fleece blockdev backed by disk of interest
- start fleece blockjob which will pop out data to be overwritten to fleece blockdev
{
  "execute": "blockdev-add"
  "arguments": {
    "backing": "drive-scsi0-0-0-0",
    "driver": "qcow2",
    "file": {
      "driver": "file",
      "filename": "/tmp/snapshot-a.hdd"
Is qemu creating this file, or is libvirt pre-creating it and qemu just opening it? I guess this is a case where libvirt would want to pre-create an empty qcow2 file (either by qemu-img, or by the new x-blockdev-create in qemu 2.12)? Okay, it looks like this file is what you listed in the XML for <domainblocksnapshot>, so libvirt is creating it. Does the new file have a backing image, or does it read as completely zeroes?
    },
    "node-name": "snapshot-scsi0-0-0-0"
  },
}
No trailing comma in JSON {}, but it's not too hard to figure out what you mean.
{
  "execute": "transaction"
  "arguments": {
    "actions": [
      {
        "type": "blockdev-backup"
        "data": {
          "device": "drive-scsi0-0-0-0",
          "target": "snapshot-scsi0-0-0-0"
          "sync": "none",
        },
      }
    ]
You showed a transaction with only one element; but presumably we are using a transaction because if we want to create a point in time for multiple disks at once, we need two separate blockdev-backup actions joined in the same transaction to cover the two disks.

So this command is telling qemu to start using a brand-new qcow2 file as its local storage for tracking that a snapshot is being taken, and that point in time is the checkpoint? Am I correct that you would then tell qemu to export an NBD view of this qcow2 snapshot which a third-party client can connect to and use NBD_CMD_BLOCK_STATUS to learn which portions of the file contain data (that is, which clusters qemu has copied into the backup because the active image has changed them since the checkpoint, while anything not dirty in this file is still identical to the last backup)?

Would libvirt ever want to use something other than "sync":"none"?
  },
}
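To make the multi-disk point above concrete, here is a sketch of what a two-disk variant of that transaction could look like, written as the Python dict a management client would send over QMP (drive/node names are illustrative):

    two_disk_snapshot = {
        "execute": "transaction",
        "arguments": {
            "actions": [
                {"type": "blockdev-backup",
                 "data": {"device": "drive-scsi0-0-0-0",
                          "target": "snapshot-scsi0-0-0-0",
                          "sync": "none"}},
                {"type": "blockdev-backup",
                 "data": {"device": "drive-scsi0-0-0-1",
                          "target": "snapshot-scsi0-0-0-1",
                          "sync": "none"}},
            ]
        }
    }
    # both backup jobs start atomically, so the two fleece images share one point in time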
- delete snapshot - cancel fleece blockjob - delete fleece blockdev
{
  "execute": "block-job-cancel"
  "arguments": {
    "device": "drive-scsi0-0-0-0"
  },
}

{
  "execute": "blockdev-del"
  "arguments": {
    "node-name": "snapshot-scsi0-0-0-0"
  },
}
2. Block export
- add disks to export
    - start NBD server if it is not started
    - add disks
{
  "execute": "nbd-server-start"
  "arguments": {
    "addr": {
      "type": "inet"
      "data": {
        "host": "0.0.0.0",
        "port": "49300"
      },
    }
  },
}

{
  "execute": "nbd-server-add"
  "arguments": {
    "device": "snapshot-scsi0-0-0-0",
    "name": "sda-d068765e-8b50-4d74-9b72-1e55c663cbf8",
    "writable": false
So this is telling qemu to export the temporary qcow2 image created in the point above. An NBD client would see the export getting progressively more blocks with data as the guest continues to write more clusters (as qemu has to copy the data from the checkpoint to the temporary file before updating the main image with the new data). If the NBD client reads a cluster that has not yet been copied by qemu (because the guest has not written to that cluster since the block job started), would it see zeroes, or the same data that the guest still sees?
  },
}
- remove disks from export
    - remove disks
    - stop NBD server if there are no disks left
{
  "arguments": {
    "mode": "hard",
    "name": "sda-d068765e-8b50-4d74-9b72-1e55c663cbf8"
  },
  "execute": "nbd-server-remove"
}

{
  "execute": "nbd-server-stop"
}
3. Checkpoints (the most interesting part)
First a few facts about qemu dirty bitmaps.
Bitmap can be either in active or disable state. In disabled state it does not get changed on guest writes. And oppositely in active state it tracks guest writes. This implementation uses approach with only one active bitmap at a time. This should reduce guest write penalties in the presence of checkpoints. So on first snapshot we create bitmap B_1. Now it tracks changes from the snapshot 1. On second snapshot we create bitmap B_2 and disable bitmap B1 and so on. Now bitmap B1 keep changes from snaphost 1 to snapshot 2, B2 - changes from snaphot 2 to snapshot 3 and so on. Last bitmap is active and gets most disk change after latest snapshot.
Getting changed blocks bitmap from some checkpoint in past till current snapshot is quite simple in this scheme. For example if the last snapshot is 7 then to get changes from snapshot 3 to latest snapshot we need to merge bitmaps B3, B4, B5 and B6. Merge is just logical OR on bitmap bits.
Deleting a checkpoint somewhere in the middle of checkpoint sequence requires merging correspondent bitmap to the previous bitmap in this scheme.
We use persitent bitmaps in the implementation. This means upon qemu process termination bitmaps are saved in disks images metadata and restored back on qemu process start. This makes checkpoint a persistent property that is we keep them across domain start/stops. Qemu does not try hard to keep bitmaps. If upon save something goes wrong bitmap is dropped. The same is applied to the migration process too. For backup process it is not critical. If we don't discover a checkpoint we always can make a full backup. Also qemu provides no special means to track order of bitmaps. These facts are critical for implementation with one active bitmap at a time. We need right order of bitmaps upon merge - for snapshot N and block changes from snanpshot K, K < N to N we need to merge bitmaps B_{K}, ..., B_{N-1}. Also if one of the bitmaps to be merged is missing we can't calculate desired block changes too.
So the implementation encode bitmap order in their names. For snapshot A1, bitmap name will be A1, for snapshot A2 bitmap name will be A2^A1 and so on. Using this naming encoding upon domain start we can find out bitmap order and check for missing ones. This complicates a bit bitmap removing though. For example removing a bitmap somewhere in the middle looks like this:
- removing bitmap K (bitmap name is NAME_{K}^NAME_{K-1})
    - create new bitmap named NAME_{K+1}^NAME_{K-1}          ---.
    - disable new bitmap                                        | This is effectively renaming
    - merge bitmap NAME_{K+1}^NAME_{K} to the new bitmap        | of bitmap K+1 to comply
    - remove bitmap NAME_{K+1}^NAME_{K}                      ___/ the naming scheme
    - merge bitmap NAME_{K}^NAME_{K-1} to NAME_{K-1}^NAME_{K-2}
    - remove bitmap NAME_{K}^NAME_{K-1}
As you can see we need to change name for bitmap K+1 to keep our bitmap naming scheme. This is done creating new K+1 bitmap with appropriate name and copying old K+1 bitmap into new.
So while it is possible to have only one active bitmap at a time, it costs some exercises at the management layer. To me it looks like qemu itself is a better place to track bitmap chain order and consistency.
Libvirt is already tracking a tree relationship between internal snapshots (the virDomainSnapshotCreateXML), because qemu does NOT track that (true, internal snapshots don't get as much attention as external snapshots) - but the fact remains that qemu is probably not the best place to track relationship between multiple persistent bitmaps, any more than it tracks relationships between internal snapshots. So having libvirt track relations between persistent bitmaps is just fine. Do we really have to rename bitmaps in the qcow2 file, or can libvirt track it all on its own? Earlier, you said that the new virDomainBlockSnapshotPtr are independent, with no relations between them. But here, you are wanting to keep incremental backups related to one another.
Now how exporting bitmaps looks like.
- add to export disk snapshot N with changes from checkpoint K
    - add fleece blockdev to NBD exports
    - create new bitmap T
    - disable bitmap T
    - merge bitmaps K, K+1, .. N-1 into T
    - add bitmap T to nbd export

- remove disk snapshot from export
    - remove fleece blockdev from NBD exports
    - remove bitmap T
Here is qemu commands examples for operation with checkpoints, I'll make several snapshots with checkpoints for purpuse of better illustration.
- create snapshot d068765e-8b50-4d74-9b72-1e55c663cbf8 with checkpoint
    - same as without checkpoint but additionally add bitmap on fleece blockjob start
...
{
  "execute": "transaction"
  "arguments": {
    "actions": [
      {
        "type": "blockdev-backup"
        "data": {
          "device": "drive-scsi0-0-0-0",
          "sync": "none",
          "target": "snapshot-scsi0-0-0-0"
        },
      },
      {
        "type": "block-dirty-bitmap-add"
        "data": {
          "name": "libvirt-d068765e-8b50-4d74-9b72-1e55c663cbf8",
          "node": "drive-scsi0-0-0-0",
          "persistent": true
        },
      }
Here, the transaction makes sense; you have to create the persistent dirty bitmap to track from the same point in time. The dirty bitmap is tied to the active image, not the backup, so that when you create the NEXT incremental backup, you have an accurate record of which sectors were touched in snapshot-scsi0-0-0-0 between this transaction and the next.
    ]
  },
}
- delete snapshot d068765e-8b50-4d74-9b72-1e55c663cbf8
    - same as without checkpoints
- create snapshot 0044757e-1a2d-4c2c-b92f-bb403309bb17 with checkpoint
    - same actions as for the first snapshot, but additionally disable the first bitmap
Again, you're showing the QMP commands that libvirt is issuing; which libvirt API calls are driving these actions?
...
{
  "execute": "transaction"
  "arguments": {
    "actions": [
      {
        "type": "blockdev-backup"
        "data": {
          "device": "drive-scsi0-0-0-0",
          "sync": "none",
          "target": "snapshot-scsi0-0-0-0"
        },
      },
      {
        "type": "x-vz-block-dirty-bitmap-disable"
        "data": {

Do you have measurements on whether having multiple active bitmaps hurts performance? I'm not yet sure that managing a chain of disabled bitmaps (and merging them as needed for restores) is more or less efficient than managing multiple bitmaps all the time. On the other hand, you do have a point that restore is a less frequent operation than backup, so making backup as lean as possible and putting more work on restore is a reasonable tradeoff, even if it adds complexity to the management for doing restores.

          "name": "libvirt-d068765e-8b50-4d74-9b72-1e55c663cbf8",
          "node": "drive-scsi0-0-0-0"
        },
      },
      {
        "type": "block-dirty-bitmap-add"
        "data": {
          "name": "libvirt-0044757e-1a2d-4c2c-b92f-bb403309bb17^d068765e-8b50-4d74-9b72-1e55c663cbf8",
          "node": "drive-scsi0-0-0-0",
          "persistent": true
        },
      }
    ]
  },
}
- delete snapshot 0044757e-1a2d-4c2c-b92f-bb403309bb17
- create snapshot 8fc02db3-166f-4de7-b7aa-1f7303e6162b with checkpoint
- add disk snapshot 8fc02db3-166f-4de7-b7aa-1f7303e6162b to export and a bitmap with changes from checkpoint d068765e-8b50-4d74-9b72-1e55c663cbf8
    - same as add export without checkpoint, but additionally
        - form result bitmap
        - add bitmap to NBD export
... { "execute": "transaction" "arguments": { "actions": [ { "type": "block-dirty-bitmap-add" "data": { "node": "drive-scsi0-0-0-0", "name": "libvirt-__export_temporary__", "persistent": false }, }, { "type": "x-vz-block-dirty-bitmap-disable" "data": { "node": "drive-scsi0-0-0-0" "name": "libvirt-__export_temporary__", }, }, { "type": "x-vz-block-dirty-bitmap-merge" "data": { "node": "drive-scsi0-0-0-0", "src_name": "libvirt-d068765e-8b50-4d74-9b72-1e55c663cbf8" "dst_name": "libvirt-__export_temporary__", }, }, { "type": "x-vz-block-dirty-bitmap-merge" "data": { "node": "drive-scsi0-0-0-0", "src_name": "libvirt-0044757e-1a2d-4c2c-b92f-bb403309bb17^d068765e-8b50-4d74-9b72-1e55c663cbf# "dst_name": "libvirt-__export_temporary__", }, } ] }, } { "execute": "x-vz-nbd-server-add-bitmap" "arguments": { "name": "sda-8fc02db3-166f-4de7-b7aa-1f7303e6162b" "bitmap": "libvirt-__export_temporary__", "bitmap-export-name": "d068765e-8b50-4d74-9b72-1e55c663cbf8", },
Adding a bitmap to a server would advertise to the NBD client that it can query the "qemu-dirty-bitmap:d068765e-8b50-4d74-9b72-1e55c663cbf8" namespace during NBD_CMD_BLOCK_STATUS, rather than just "base:allocation"?
}
- remove snapshot 8fc02db3-166f-4de7-b7aa-1f7303e6162b from export
    - same as without checkpoint but additionally remove temporary bitmap
...
{
  "arguments": {
    "name": "libvirt-__export_temporary__",
    "node": "drive-scsi0-0-0-0"
  },
  "execute": "block-dirty-bitmap-remove"
}
- delete checkpoint 0044757e-1a2d-4c2c-b92f-bb403309bb17 (similar operation is described in the section about naming scheme for bitmaps, with difference that K+1 is N here and thus new bitmap should not be disabled)
A suggestion on the examples - while UUIDs are nice and handy for management tools, they are a pain to type and for humans to quickly read. Is there any way we can document a sample transaction stream with all the actors involved (someone issues a libvirt API call XYZ, libvirt in turn issues QMP command ABC), and using shorter names that are easier to read as humans?
{
  "arguments": {
    "actions": [
      {
        "type": "block-dirty-bitmap-add"
        "data": {
          "node": "drive-scsi0-0-0-0",
          "name": "libvirt-8fc02db3-166f-4de7-b7aa-1f7303e6162b^d068765e-8b50-4d74-9b72-1e55c663cbf8",
          "persistent": true
        },
      },
      {
        "type": "x-vz-block-dirty-bitmap-merge"
        "data": {
          "node": "drive-scsi0-0-0-0",
          "src_name": "libvirt-0044757e-1a2d-4c2c-b92f-bb403309bb17^d068765e-8b50-4d74-9b72-1e55c663cbf8"
          "dst_name": "libvirt-d068765e-8b50-4d74-9b72-1e55c663cbf8",
        },
      },
      {
        "type": "x-vz-block-dirty-bitmap-merge"
        "data": {
          "node": "drive-scsi0-0-0-0",
          "src_name": "libvirt-8fc02db3-166f-4de7-b7aa-1f7303e6162b^0044757e-1a2d-4c2c-b92f-bb403309bb17"
          "dst_name": "libvirt-8fc02db3-166f-4de7-b7aa-1f7303e6162b^d068765e-8b50-4d74-9b72-1e55c663cbf8"
        },
      },
    ]
  },
  "execute": "transaction"
}

{
  "execute": "x-vz-block-dirty-bitmap-remove"
  "arguments": {
    "node": "drive-scsi0-0-0-0"
    "name": "libvirt-8fc02db3-166f-4de7-b7aa-1f7303e6162b^0044757e-1a2d-4c2c-b92f-bb403309bb17",
  },
},

{
  "execute": "x-vz-block-dirty-bitmap-remove"
  "arguments": {
    "node": "drive-scsi0-0-0-0"
    "name": "libvirt-0044757e-1a2d-4c2c-b92f-bb403309bb17^d068765e-8b50-4d74-9b72-1e55c663cbf8",
  },
}
Here is a list of bitmap commands used in implementation but not yet in upstream (AFAIK).
x-vz-block-dirty-bitmap-remove
x-vz-block-dirty-bitmap-merge
x-vz-block-dirty-bitmap-disable
x-vz-block-dirty-bitmap-enable (not in the examples; used when removing most recent checkpoint)
x-vz-nbd-server-add-bitmap
*Restore operation nuances*
As it was written above to restore a domain one needs to start it in paused state, export domain's disks and write them from backup. However qemu currently does not let export disks for write even for a domain that never starts guests CPU. We have an experimental qemu command option -x-vz-nbd-restore (passed together with -incoming option) to fix it.
Why can't restore be done while the guest is offline? (Oh right, we still haven't added decent qemu-img support for bitmap manipulation, so we need a qemu process around for any bitmap changes).

As I understand it, the point of bitmaps and snapshots is to create an NBD server that a third-party can use to read just the dirty portions of a disk in relation to a known checkpoint, to save off data in whatever form it wants; so you are right that the third party then needs a way to rewrite data from whatever internal form it stored it in back to the view that qemu can consume when rolling back to a given backup, prior to starting the guest on the restored data. Do you need additional libvirt APIs exposed for this, or do the proposed APIs for adding snapshots cover everything already with just an additional flag parameter that says whether the <domainblocksnapshot> is readonly (the third-party is using it for collecting the incremental backup data) or writable (the third-party is actively writing its backup into the file, and when it is done, then perform a block-commit to merge that data back onto the main qcow2 file)?

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

On 04/11/2018 12:32 PM, Eric Blake wrote:
On 04/03/2018 07:01 AM, Nikolay Shirokovskiy wrote:
Hi, all.
This is another RFC on pull backup API. This API provides means to read domain disks in a snapshotted state so that client can back them up as well as means to write domain disks to revert them to backed up state. The previous version of RFC is [1]. I'll also describe the API implementation details to shed light on misc qemu dirty bitmap commands usage.
This is a first-pass review (making comments as I first encounter something, even if it gets explained later in the email)
This API does not use existent disks snapshots. Instead it introduces snapshots provided by qemu's blockdev-backup command. The reason is we need snapshotted disk state only temporarily for duration of backup operation and newly introduced snapshots can be easily discarded at the end of operation without block commit operation. Technically difference is next. On usual snapshot we create new image backed by original and all new data goes to the new image thus original image stays in a snapshotted state. In temporary snapshots we create new image backed by original and all new data still goes to the original image but before new data is written old data to be overwritten is popped out to the new image thus we get snapshotted state thru new image.
Oh, I see -- you're using blockdev-backup sync=none to accomplish fleecing snapshots. It's a little confusing here without the sync=none information, as usually blockdev-backup provides... backups, not snapshots. Your cover letter would be a little clearer with that information.

For everyone else: Node fleecing is a very dumb name that means "Live copying of arbitrary blocks from a live node." In QEMU, it works like this:

(1) Create a new active layer for the node to be fleeced.

    [node] <-- [snapshot]

Note that this snapshot node is backed by the node named "node", not a file named "node". The snapshot is supported by a qcow2 file on local storage that starts empty.

(2) Start blockdev-backup sync=none FROM the node TO the snapshot:
blockdev-backup device=node target=snapshot sync=none
This means that any time 'node' is written to, the information will get written to 'snapshot' on-demand, preserving snapshot as a point-in-time snapshot of node, while leaving 'node' in-use for any other nodes/devices using it. It's effectively the opposite of an external snapshot, facilitated by a live COW job.

(3) An NBD server may be started to export the "snapshot":

    nbd-server-add device=snapshot

(4+) At this point, the NBD client can copy the snapshot data out, and the NBD export can be closed upon completion. Then, the snapshot can be removed/deleted. Unlike traditional external snapshot and commit workflow, this snapshot can be deleted at any time without jeopardizing the data it is a snapshot of.
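Putting steps (1)-(4) together, the fleecing setup boils down to roughly the following QMP commands, shown here as the Python dicts a client would send in order (node and file names are illustrative, not taken from the RFC):

    fleecing_setup = [
        # (1) create the empty active layer backed by the node to be fleeced
        {"execute": "blockdev-add",
         "arguments": {"node-name": "snapshot",
                       "driver": "qcow2",
                       "backing": "node",
                       "file": {"driver": "file", "filename": "/tmp/fleece.qcow2"}}},
        # (2) copy-before-write from the live node into the snapshot
        {"execute": "blockdev-backup",
         "arguments": {"device": "node", "target": "snapshot", "sync": "none"}},
        # (3) expose the frozen view to an external NBD client
        {"execute": "nbd-server-start",
         "arguments": {"addr": {"type": "inet",
                                "data": {"host": "0.0.0.0", "port": "10809"}}}},
        {"execute": "nbd-server-add",
         "arguments": {"device": "snapshot", "writable": False}},
    ]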
So, rewriting this to make sure I understand, let's start with a disk with contents A, then take a snapshot, then write B:
In the existing libvirt snapshot APIs, the data gets distributed as:
base (contents A) <- new active (contents B)
where you want the new API:
base, remains active (contents B) ~~~ backup (contents A)
Disks snapshots as well as disks itself are avaiable to read/write thru qemu NBD server.
So the biggest reason for a new libvirt API is that we need management actions to control which NBD images from qemu are exposed and torn down at the appropriate sequences.
Here is typical actions on domain backup:
- create temporary snapshot of domain disks of interest
- export snapshots thru NBD
- back them up
- remove disks from export
- delete temporary snapshot
and typical actions on domain restore:
- start domain in paused state
- export domain disks of interest thru NBD for write
- restore them
- remove disks from export
- resume or destroy domain
Now let's write down API in more details. There are minor changes in comparison with previous version [1].
*Temporary snapshot API*
In previous version it is called 'Fleece API' after qemu terms and I'll still use BlockSnapshot prefix for commands as in previous RFC instead of TmpSnapshots which I inclined more now.
virDomainBlockSnapshotPtr virDomainBlockSnapshotCreateXML(virDomainPtr domain, const char *xmlDesc, unsigned int flags);
Just to make sure, we have the existing API of:
virDomainSnapshotPtr virDomainSnapshotCreateXML(virDomainPtr domain, const char *xmlDesc, unsigned int flags);
So you are creating a new object (virDomainBlockSnapshotPtr) rather than reusing the existing VirDomainSnapshotPtr, and although the two commands are similar, we get to design a new XML schema from scratch rather than trying to overload yet even more functionality onto the existing API.
Should we also have:
const char *virDomainBlockSnapshotGetName(virDomainBlockSnapshotPtr snapshot); virDomainPtr virDomainBlockSnapshotGetDomain(virDomainBlockSnapshotPtr snapshot); virConnectPtr virDomainBlockSnapshotGetConnect(virDomainBlockSnapshotPtr snapshot);
for symmetry with existing snapshot API?
virDomainBlockSnapshotDelete(virDomainBlockSnapshotPtr snapshot, unsigned int flags);
virDomainBlockSnapshotList(virDomainPtr domain, virDomainBlockSnapshotPtr **snaps, unsigned int flags);
I'm guessing this is the counterpart to virDomainListAllSnapshots() (the modern listing interface), and that we probably don't want counterparts for virDomainSnapshotNum/virDomainSnapshotListNames (the older listing interface, which was inherently racy as the list could change in length between the two calls).
virDomainBlockSnapshotGetXMLDesc(virDomainBlockSnapshotPtr snapshot, unsigned int flags);
virDomainBlockSnapshotPtr virDomainBlockSnapshotLookupByName(virDomainPtr domain, const char *name, unsigned int flags);
Also, the virDomainSnapshotPtr had a number of API to track a tree-like hierarchy between snapshots (that is, you very much want to know if snapshot B is a child of snapshot A), while it looks like your new virDomainBlockSnapshotPtrs are completely independent (no relationships between the snapshots, each can be independently created or torn down, without having to rewrite a relationship tree between them, and there is no need for counterparts to things like virDomainSnapshotNumChildren). Okay, I think that makes sense, and is a good reason for introducing a new object type rather than shoe-horning this into the existing API.
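For what it's worth, a minimal sketch of how a client might drive the listing call, assuming virDomainBlockSnapshotList() follows the virDomainListAllSnapshots() convention (returns the count and allocates the array for the caller) and assuming the GetName/Free helpers suggested above exist (they are not part of the RFC text itself):

#include <stdio.h>
#include <stdlib.h>
#include <libvirt/libvirt.h>

static void dump_block_snapshots(virDomainPtr dom)
{
    virDomainBlockSnapshotPtr *snaps = NULL;
    int n, i;

    n = virDomainBlockSnapshotList(dom, &snaps, 0);
    if (n < 0)
        return;

    for (i = 0; i < n; i++) {
        /* hypothetical accessors, mirroring the virDomainSnapshot ones */
        printf("block snapshot: %s\n", virDomainBlockSnapshotGetName(snaps[i]));
        virDomainBlockSnapshotFree(snaps[i]);
    }
    free(snaps);
}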
Here is an example of snapshot xml description:
<domainblocksnapshot>
    <name>d068765e-8b50-4d74-9b72-1e55c663cbf8</name>
    <disk name='sda' type="file">
        <fleece file="/tmp/snapshot-a.hdd"/>
    </disk>
    <disk name='sdb' type="file">
        <fleece file="/tmp/snapshot-b.hdd"/>
    </disk>
</domainblocksnapshot>
Temporary snapshots are independent, thus they are not organized in a tree structure as usual snapshots are, so the 'list snapshots' and 'lookup' functions will suffice.
So in the XML, the <fleece> element describes the destination file (back to my earlier diagram, it would be the file that is created and will hold content 'A' when the main active image is changed to hold content 'B' after the snapshot was created)?
Qemu can track which disk blocks have changed since the snapshotted state, so on the next backup the client can back up only the changed blocks. virDomainBlockSnapshotCreateXML accepts a VIR_DOMAIN_BLOCK_SNAPSHOT_CREATE_CHECKPOINT flag to turn this option on for the snapshot, which means tracking changes from this particular snapshot. I used the term checkpoint and not [dirty] bitmap because in the current implementation many qemu dirty bitmaps are used to provide the changed blocks from a given checkpoint to the current snapshot (see the *Implementation* section for more details). Also, a bitmap keeps block changes and thus itself changes in time, while a checkpoint is a more static term meaning you can query changes from that moment in time.
Checkpoints are visible in the active domain XML:

<disk type='file' device='disk'>
    ..
    <target dev='sda' bus='scsi'/>
    <alias name='scsi0-0-0-0'/>
    <checkpoint name="93a5c045-6457-2c09-e56c-927cdf34e178">
    <checkpoint name="5768a388-c1c4-414c-ac4e-eab216ba7c0c">
    ..
</disk>
It makes sense to avoid the bitmap name in libvirt, but do these indeed correlate 1:1 with bitmaps? I assume each bitmap will have name=%%UUID%% ?
Every checkpoint requires a qemu dirty bitmap, which eats 16MiB of RAM for a 1TiB disk with the default dirty block size of 64KiB, and the same amount of disk space is used. So the client needs to manage checkpoints and delete unused ones. Thus the next API function:
int virDomainBlockCheckpointRemove(virDomainPtr domain, const char *name, unsigned int flags);
I'm trying to figure out how BlockCheckpoint and BlockSnapshots relate. Maybe it will be more clear when I read the implementation section below. Is the idea that I can't create a BlockSnapshot without first having a checkpoint available? If so, where does that fit in the <domainblocksnapshot> XML?
*Block export API*
I guess it is natural to treat qemu NBD server as a domain device. So we can use virDomainAttachDeviceFlags/virDomainDetachDeviceFlags API to start/stop NBD server and virDomainUpdateDeviceFlags to add/delete disks to be exported.
This feels a bit awkward - up to now, attaching a device is something visible to the guest, but you are trying to reuse the interface to attach something tracked by the domain, but which has no impact to the guest. That is, the guest has no clue whether a block export exists pointing to a particular checkpoint, nor does it care.
While I have no doubts about the start/stop operations, using virDomainUpdateDeviceFlags looks a bit inconvenient, so I decided to add a pair of API functions just to add/delete disks to be exported:
int virDomainBlockExportStart(virDomainPtr domain, const char *xmlDesc, unsigned int flags);
int virDomainBlockExportStop(virDomainPtr domain, const char *xmlDesc, unsigned int flags);
I guess more appropriate names are virDomainBlockExportAdd and virDomainBlockExportRemove but as I already have a patch series implementing pull backups with these names I would like to keep these names now.
What does the XML look like in these calls?
These names also reflect that in the implementation I decided to start/stop the NBD server in a lazy manner. While it is a bit innovative for a libvirt API, I guess it is convenient, because to refer to the NBD server when adding/removing disks we would need to identify it through its parameters like type, address etc., until we introduce some device id (which does not look consistent with the current libvirt design).
This just reinforces my thoughts above - is the reason it doesn't make sense to assign a device id to the export due to the fact that the export is NOT guest-visible? Does it even belong under the "domain/devices/" xpath of the domain XML, or should it be a new sibling of <devices> with an xpath of "domain/blockexports/"?
So it looks like we have all the parameters to start/stop the server within these calls, so why have extra API calls just to start/stop the server manually? If we later need to have an NBD server without disks we can perfectly well support virDomainAttachDeviceFlags/virDomainDetachDeviceFlags.
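The lazy behaviour amounts to simple reference counting on the exported disks; a conceptual sketch (the nbd_server_* helpers below just stand in for the corresponding QMP commands and are hypothetical, this is not libvirt driver code):

int nbd_server_start(void);
void nbd_server_stop(void);
int nbd_server_add(const char *disk);
void nbd_server_remove(const char *disk);

static int exported_disks;          /* per-domain count of exported disks */

static int block_export_add(const char *disk)
{
    if (exported_disks == 0 && nbd_server_start() < 0)
        return -1;                  /* first disk: bring the server up */
    if (nbd_server_add(disk) < 0)
        return -1;
    exported_disks++;
    return 0;
}

static void block_export_remove(const char *disk)
{
    nbd_server_remove(disk);
    if (--exported_disks == 0)
        nbd_server_stop();          /* last disk gone: tear the server down */
}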
Here is an example of the XML to add/remove disks (specifying the checkpoint attribute is not needed when removing disks, of course):
<domainblockexport type="nbd">
    <address type="ip" host="0.0.0.0" port="8000"/>
    <disk name="sda" snapshot="0044757e-1a2d-4c2c-b92f-bb403309bb17"
          checkpoint="d068765e-8b50-4d74-9b72-1e55c663cbf8"/>
    <disk name="sdb" snapshot="0044757e-1a2d-4c2c-b92f-bb403309bb17"
          checkpoint="d068765e-8b50-4d74-9b72-1e55c663cbf8"/>
</domainblockexport>
So this is the XML you pass to virDomainBlockExportStart, with the goal of telling qemu to start or stop an NBD export on the backing chain associated with disk "sda", where the export is serving up data tied to checkpoint "d068765e-8b50-4d74-9b72-1e55c663cbf8", and which will be associated with the destination snapshot file described by the <domainblocksnapshot> named "0044757e-1a2d-4c2c-b92f-bb403309bb17"?
Why is it named <domainblockexport> here, but...
And this is how this NBD server will be exposed in domain xml:
<devices>
...
<blockexport type="nbd">
<blockexport> here?
    <address type="ip" host="0.0.0.0" port="8000"/>
    <disk name="sda" snapshot="0044757e-1a2d-4c2c-b92f-bb403309bb17"
          checkpoint="d068765e-8b50-4d74-9b72-1e55c663cbf8"
          exportname="sda-0044757e-1a2d-4c2c-b92f-bb403309bb17"/>
The exportname property is new here compared to the earlier listing - is that something that libvirt generates, or that the user chooses?
    <disk name="sdb" snapshot="0044757e-1a2d-4c2c-b92f-bb403309bb17"
          checkpoint="d068765e-8b50-4d74-9b72-1e55c663cbf8"
          exportname="sdb-0044757e-1a2d-4c2c-b92f-bb403309bb17"/>
</blockexport>
...
</devices>
*Implementation details from qemu-libvirt interactions POV*
1. Temporary snapshot
- create snapshot
Which libvirt API triggers this action? virDomainBlockSnapshotCreateXML?
    - add a fleece blockdev backed by the disk of interest
    - start the fleece blockjob, which will pop out data about to be overwritten to the fleece blockdev
{ "execute": "blockdev-add" "arguments": { "backing": "drive-scsi0-0-0-0", "driver": "qcow2", "file": { "driver": "file", "filename": "/tmp/snapshot-a.hdd"
Is qemu creating this file, or is libvirt pre-creating it and qemu just opening it? I guess this is a case where libvirt would want to pre-create an empty qcow2 file (either by qemu-img, or by the new x-blockdev-create in qemu 2.12)? Okay, it looks like this file is what you listed in the XML for <domainblocksnapshot>, so libvirt is creating it. Does the new file have a backing image, or does it read as completely zeroes?
In fleecing workflow, the image can either be created by QEMU or pre-created by libvirt, but in keeping with best practices libvirt should probably create it. It should be an empty qcow2 backed by the current node of interest.
}, "node-name": "snapshot-scsi0-0-0-0" }, }
No trailing comma in JSON {}, but it's not too hard to figure out what you mean.
{ "execute": "transaction" "arguments": { "actions": [ { "type": "blockdev-backup" "data": { "device": "drive-scsi0-0-0-0", "target": "snapshot-scsi0-0-0-0" "sync": "none", }, } ]
You showed a transaction with only one element; but presumably we are using a transaction because if we want to create a point in time for multiple disks at once, we need two separate blockdev-backup actions joined in the same transaction to cover the two disks. So this command is telling qemu to start using a brand-new qcow2 file as its local storage for tracking that a snapshot is being taken, and that point in time is the checkpoint?
Am I correct that you would then tell qemu to export an NBD view of this qcow2 snapshot which a third-party client can connect to and use NBD_CMD_BLOCK_STATUS to learn which portions of the file contain data (that is, which clusters qemu has copied into the backup because the active image has changed them since the checkpoint), while anything not dirty in this file is still identical to the last backup?
Would libvirt ever want to use something other than "sync":"none"?
No; based on how fleecing is implemented in QEMU. Effectively, "blockdev-backup sync=none" _IS_ the fleecing command for QEMU, but it requires two steps:

(1) Create a new empty top layer
(2) Sync new writes from the active layer, which is the backing image(!) for the snapshot.

The empty top layer can be created whenever, but the backup action needs to happen in a transaction. In my earlier email I mentioned it'd be nice to have a convenience job that wrapped up a few steps into one:

A: Create a new top layer
B: Start the fleecing job itself
C: Tie a specified bitmap belonging to the node to the fleecing job
D: Export the snapshot via NBD

and this would become a proper "fleecing job." It would have to be, of course, transactionable.
}, }
- delete snapshot
    - cancel fleece blockjob
    - delete fleece blockdev
{ "execute": "block-job-cancel" "arguments": { "device": "drive-scsi0-0-0-0" }, } { "execute": "blockdev-del" "arguments": { "node-name": "snapshot-scsi0-0-0-0" }, }
2. Block export
- add disks to export
    - start NBD server if it is not started
    - add disks
{ "execute": "nbd-server-start" "arguments": { "addr": { "type": "inet" "data": { "host": "0.0.0.0", "port": "49300" }, } }, } { "execute": "nbd-server-add" "arguments": { "device": "snapshot-scsi0-0-0-0", "name": "sda-d068765e-8b50-4d74-9b72-1e55c663cbf8", "writable": false
So this is telling qemu to export the temporary qcow2 image created in the point above. An NBD client would see the export getting progressively more blocks with data as the guest continues to write more clusters (as qemu has to copy the data from the checkpoint to the temporary file before updating the main image with the new data). If the NBD client reads a cluster that has not yet been copied by qemu (because the guest has not written to that cluster since the block job started), would it see zeroes, or the same data that the guest still sees?
Because the temporary snapshot is backed by the live image, the guest sees an unchanging set of blocks. In essence, we are utilizing a COW mechanism to copy the point-in-time data into the snapshot layer while the backing image remains the live/active layer for the devices utilizing it.
}, }
- remove disks from export
    - remove disks
    - stop NBD server if there are no disks left
{ "arguments": { "mode": "hard", "name": "sda-d068765e-8b50-4d74-9b72-1e55c663cbf8" }, "execute": "nbd-server-remove" } { "execute": "nbd-server-stop" }
3. Checkpoints (the most interesting part)
So far so good from the QEMU end AFAICT...
First a few facts about qemu dirty bitmaps.
A bitmap can be either in the active or the disabled state. In the disabled state it does not get changed on guest writes, and conversely in the active state it tracks guest writes. This implementation uses an approach with only one active bitmap at a time. This should reduce guest write penalties in the presence of checkpoints. So on the first snapshot we create bitmap B_1. Now it tracks changes from snapshot 1. On the second snapshot we create bitmap B_2 and disable bitmap B_1, and so on. Now bitmap B_1 keeps the changes from snapshot 1 to snapshot 2, B_2 the changes from snapshot 2 to snapshot 3, and so on. The last bitmap is active and gets most of the disk changes made after the latest snapshot.
So you are trying to optimize away write penalties if you have, say, ten bitmaps representing checkpoints so we don't have to record all new writes to all ten. This makes sense, and I would have liked to formalize the concept in QEMU, but response to that idea was very poor at the time. Also my design was bad :)
Getting the changed-blocks bitmap from some checkpoint in the past up to the current snapshot is quite simple in this scheme. For example, if the last snapshot is 7, then to get the changes from snapshot 3 to the latest snapshot we need to merge bitmaps B_3, B_4, B_5 and B_6. Merging is just a logical OR of the bitmap bits.
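To spell out what such a merge amounts to, here is a toy sketch (not qemu code; a real dirty bitmap is qemu's HBitmap, not a plain byte array, and the indexing here is only illustrative):

#include <stddef.h>

/* OR one dirty bitmap into another; both cover the same disk,
 * 'nbytes' bytes of bitmap data, one bit per dirty block. */
static void bitmap_merge(unsigned char *dst, const unsigned char *src,
                         size_t nbytes)
{
    size_t i;
    for (i = 0; i < nbytes; i++)
        dst[i] |= src[i];
}

/* Changes from checkpoint 'from' up to snapshot 'latest' are the OR of
 * bitmaps B_from .. B_{latest-1}; 'result' must start out zeroed and
 * bitmaps[0] holds B_1, bitmaps[1] holds B_2, and so on. */
static void changes_since(unsigned char *result, unsigned char **bitmaps,
                          int from, int latest, size_t nbytes)
{
    int k;
    for (k = from; k < latest; k++)
        bitmap_merge(result, bitmaps[k - 1], nbytes);
}

For the example above, changes_since(result, bitmaps, 3, 7, nbytes) ORs B_3, B_4, B_5 and B_6 into result.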
Deleting a checkpoint somewhere in the middle of the checkpoint sequence requires merging the corresponding bitmap into the previous bitmap in this scheme.
Previous, or next? Say we've got bitmaps (in chronological order from oldest to newest)

A B C D E F G H

and we want to delete bitmap (or "checkpoint") 'C':

A B D E F G H

the bitmap representing checkpoint 'D' should now contain the bits that used to be in 'C', right? That way all the checkpoints still represent their appropriate points in time.

The only problem comes when you delete a checkpoint on the end and the bits have nowhere to go:

A B C
A B _

In this case you really do lose a checkpoint -- but depending on how we annotate this, it may or may not be possible to delete the most recent checkpoint. Let's assume that the currently active bitmap that doesn't represent *any* point in time yet (because it's still active and recording new writes) is noted as 'X':

A B C X

If we delete C now, then, that bitmap can get re-merged into the *active bitmap* X:

A B _ X
We use persistent bitmaps in the implementation. This means that upon qemu process termination bitmaps are saved in the disk images' metadata and restored back on qemu process start. This makes a checkpoint a persistent property, that is, we keep them across domain starts/stops. Qemu does not try hard to keep bitmaps: if something goes wrong upon save, the bitmap is dropped. The same applies to the migration process too. For the backup process this is not critical: if we don't discover a checkpoint we can always make a full backup. Also, qemu provides no special means to track the order of bitmaps. These facts are critical for the implementation with one active bitmap at a time. We need the right order of bitmaps upon merge - for snapshot N and block changes from snapshot K, K < N, to N we need to merge bitmaps B_{K}, ..., B_{N-1}. Also, if one of the bitmaps to be merged is missing we can't calculate the desired block changes either.
Right. A missing bitmap anywhere in the sequence invalidates the entire sequence.
So the implementation encodes the bitmap order in their names. For snapshot A1 the bitmap name will be A1, for snapshot A2 the bitmap name will be A2^A1, and so on. Using this name encoding, upon domain start we can find out the bitmap order and check for missing ones. This complicates bitmap removal a bit, though. For example, removing a bitmap somewhere in the middle looks like this:
- removing bitmap K (bitmap name is NAME_{K}^NAME_{K-1})
    - create new bitmap named NAME_{K+1}^NAME_{K-1}      ---.
    - disable new bitmap                                     | This is effectively renaming
    - merge bitmap NAME_{K+1}^NAME_{K} to the new bitmap     | of bitmap K+1 to comply with
    - remove bitmap NAME_{K+1}^NAME_{K}                   ___/ the naming scheme
    - merge bitmap NAME_{K}^NAME_{K-1} to NAME_{K-1}^NAME_{K-2}
    - remove bitmap NAME_{K}^NAME_{K-1}
As you can see, we need to change the name of bitmap K+1 to keep our bitmap naming scheme. This is done by creating a new K+1 bitmap with the appropriate name and copying the old K+1 bitmap into the new one.
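For illustration only, reconstructing the chain from such names on domain start could look roughly like this (a toy sketch of the naming scheme described above, with the libvirt- prefix omitted; this is not code from the patch series):

#include <stdio.h>
#include <string.h>

/* Bitmap names follow the scheme "CUR^PREV"; the oldest one is just "CUR".
 * Walk the chain oldest-to-newest and return how many bitmaps were chained;
 * a result smaller than 'n' means a link is missing, so incremental backups
 * from the affected checkpoints are not possible. */
static int walk_chain(const char *names[], int n)
{
    const char *cur = NULL;
    int i, count = 0, advanced = 1;

    for (i = 0; i < n; i++)              /* the root has no '^' separator */
        if (!strchr(names[i], '^'))
            cur = names[i];
    if (!cur)
        return 0;

    while (advanced) {
        const char *caret = strchr(cur, '^');
        size_t curlen = caret ? (size_t)(caret - cur) : strlen(cur);

        printf("checkpoint: %.*s\n", (int)curlen, cur);
        count++;
        advanced = 0;
        for (i = 0; i < n; i++) {        /* find "NEXT^<current checkpoint>" */
            const char *sep = strchr(names[i], '^');
            if (sep && strlen(sep + 1) == curlen &&
                strncmp(sep + 1, cur, curlen) == 0) {
                cur = names[i];
                advanced = 1;
                break;
            }
        }
    }
    return count;
}

For example, walk_chain() on { "A1", "A3^A2", "A2^A1" } prints A1, A2, A3 and returns 3, i.e. the chain is complete.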
That seems... unfortunate. A record could be kept in libvirt instead, couldn't it?

A : Bitmap A, Time 12:34:56, Child of (None), Parent of B
B : Bitmap B, Time 23:15:46, Child of A, Parent of (None)

I suppose in this case you can't *reconstruct* this information from the bitmap stored in the qcow2, which necessitates your naming scheme...

...Still, if you forego this requirement, deleting bitmaps in the middle becomes fairly easy.
So while it is possible to have only one active bitmap at a time, it costs some exercises at the management layer. To me it looks like qemu itself is a better place to track bitmap chain order and consistency.
If this is a hard requirement, it's certainly *easier* to track the relationship in QEMU ...
Libvirt is already tracking a tree relationship between internal snapshots (the virDomainSnapshotCreateXML), because qemu does NOT track that (true, internal snapshots don't get as much attention as external snapshots) - but the fact remains that qemu is probably not the best place to track relationship between multiple persistent bitmaps, any more than it tracks relationships between internal snapshots. So having libvirt track relations between persistent bitmaps is just fine. Do we really have to rename bitmaps in the qcow2 file, or can libvirt track it all on its own?
This is a way, really, of storing extra metadata by using the bitmap name as arbitrary data storage. I'd say either we promote QEMU to understanding checkpoints, or enhance libvirt to track what it needs independent of QEMU -- but having to rename bitmaps smells fishy to me.
Earlier, you said that the new virDomainBlockSnapshotPtr are independent, with no relations between them. But here, you are wanting to keep incremental backups related to one another.
I think the *snapshots*, as temporary objects, are independent and don't carry a relation to each other. The *checkpoints* here, however, are persistent and interrelated.
Now, here is how exporting bitmaps looks.
- add disk snapshot N to the export, with changes from checkpoint K
    - add the fleece blockdev to the NBD exports
    - create a new bitmap T
    - disable bitmap T
    - merge bitmaps K, K+1, .., N-1 into T
I see; so we compute a new slice based on previous bitmaps and back up from that arbitrary slice. So "T" is a temporary bitmap meant to be discarded at the conclusion of the operation, making it much more like a consumable object.
    - add bitmap T to the NBD export
- remove disk snapshot from export
    - remove fleece blockdev from NBD exports
    - remove bitmap T
Aha.
Here are examples of the qemu commands for operations with checkpoints. I'll make several snapshots with checkpoints for the purpose of better illustration.
- create snapshot d068765e-8b50-4d74-9b72-1e55c663cbf8 with checkpoint
    - same as without checkpoint but additionally add bitmap on fleece blockjob start
... { "execute": "transaction" "arguments": { "actions": [ { "type": "blockdev-backup" "data": { "device": "drive-scsi0-0-0-0", "sync": "none", "target": "snapshot-scsi0-0-0-0" }, }, { "type": "block-dirty-bitmap-add" "data": { "name": "libvirt-d068765e-8b50-4d74-9b72-1e55c663cbf8", "node": "drive-scsi0-0-0-0", "persistent": true }, }
So a checkpoint creates a reference point, but NOT a backup. You are manually creating checkpoint instances. In this case, though, you haven't disabled the previous checkpoint's bitmap (if any?) atomically with the creation of this one...
Here, the transaction makes sense; you have to create the persistent dirty bitmap to track from the same point in time. The dirty bitmap is tied to the active image, not the backup, so that when you create the NEXT incremental backup, you have an accurate record of which sectors were touched in snapshot-scsi0-0-0-0 between this transaction and the next.
] }, }
- delete snapshot d068765e-8b50-4d74-9b72-1e55c663cbf8
    - same as without checkpoints

- create snapshot 0044757e-1a2d-4c2c-b92f-bb403309bb17 with checkpoint
    - same actions as for the first snapshot, but additionally disable the first bitmap
Again, you're showing the QMP commands that libvirt is issuing; which libvirt API calls are driving these actions?
... { "execute": "transaction" "arguments": { "actions": [ { "type": "blockdev-backup" "data": { "device": "drive-scsi0-0-0-0", "sync": "none", "target": "snapshot-scsi0-0-0-0" }, }, { "type": "x-vz-block-dirty-bitmap-disable" "data": {
Do you have measurements on whether having multiple active bitmaps hurts performance? I'm not yet sure that managing a chain of disabled bitmaps (and merging them as needed for restores) is more or less efficient than managing multiple bitmaps all the time. On the other hand, you do have a point that restore is a less frequent operation than backup, so making backup as lean as possible and putting more work on restore is a reasonable tradeoff, even if it adds complexity to the management for doing restores.
Depending on the number of checkpoints intended to be kept... we certainly make no real promises on the efficiency of marking so many. It's at *least* a linear increase with each checkpoint...
"name": "libvirt-d068765e-8b50-4d74-9b72-1e55c663cbf8", "node": "drive-scsi0-0-0-0" }, }, { "type": "block-dirty-bitmap-add" "data": { "name": "libvirt-0044757e-1a2d-4c2c-b92f-bb403309bb17^d068765e-8b50-4d74-9b72-1e55c663cbf8", "node": "drive-scsi0-0-0-0", "persistent": true }, } ] }, }
Oh, I see, you handle the "disable old" case here.
- delete snapshot 0044757e-1a2d-4c2c-b92f-bb403309bb17

- create snapshot 8fc02db3-166f-4de7-b7aa-1f7303e6162b with checkpoint
- add disk snapshot 8fc02db3-166f-4de7-b7aa-1f7303e6162b to export, together with a bitmap of the changes from checkpoint d068765e-8b50-4d74-9b72-1e55c663cbf8
    - same as adding to export without a checkpoint, but additionally:
        - form the result bitmap
        - add the bitmap to the NBD export
... { "execute": "transaction" "arguments": { "actions": [ { "type": "block-dirty-bitmap-add" "data": { "node": "drive-scsi0-0-0-0", "name": "libvirt-__export_temporary__", "persistent": false }, }, { "type": "x-vz-block-dirty-bitmap-disable" "data": { "node": "drive-scsi0-0-0-0" "name": "libvirt-__export_temporary__", }, }, { "type": "x-vz-block-dirty-bitmap-merge" "data": { "node": "drive-scsi0-0-0-0", "src_name": "libvirt-d068765e-8b50-4d74-9b72-1e55c663cbf8" "dst_name": "libvirt-__export_temporary__", }, }, { "type": "x-vz-block-dirty-bitmap-merge" "data": { "node": "drive-scsi0-0-0-0", "src_name": "libvirt-0044757e-1a2d-4c2c-b92f-bb403309bb17^d068765e-8b50-4d74-9b72-1e55c663cbf# "dst_name": "libvirt-__export_temporary__", }, } ] }, }
OK, so in this transaction you add a new temporary bitmap for export, and merge the contents of two bitmaps into it. However, it doesn't look like you created a new checkpoint and managed that handoff here, did you?
{ "execute": "x-vz-nbd-server-add-bitmap" "arguments": { "name": "sda-8fc02db3-166f-4de7-b7aa-1f7303e6162b" "bitmap": "libvirt-__export_temporary__", "bitmap-export-name": "d068765e-8b50-4d74-9b72-1e55c663cbf8", },
And then here, once the bitmap and the data is already frozen, it's actually alright if we add the export at a later point in time.
Adding a bitmap to a server would advertise to the NBD client that it can query the "qemu-dirty-bitmap:d068765e-8b50-4d74-9b72-1e55c663cbf8" namespace during NBD_CMD_BLOCK_STATUS, rather than just "base:allocation"?
Don't know much about this, I stopped paying attention to the BLOCK STATUS patches. Is the NBD spec the best way to find out the current state right now? (Is there a less technical, briefer overview somewhere, perhaps from a commit message or a cover letter?)
}
- remove snapshot 8fc02db3-166f-4de7-b7aa-1f7303e6162b from export
    - same as without checkpoint but additionally remove temporary bitmap
... { "arguments": { "name": "libvirt-__export_temporary__", "node": "drive-scsi0-0-0-0" }, "execute": "block-dirty-bitmap-remove" }
OK, this just deletes the checkpoint. I guess we delete the node and stop the NBD server too, right?
- delete checkpoint 0044757e-1a2d-4c2c-b92f-bb403309bb17 (a similar operation is described in the section about the naming scheme for bitmaps, with the difference that K+1 is N here and thus the new bitmap should not be disabled)
A suggestion on the examples - while UUIDs are nice and handy for management tools, they are a pain to type and for humans to quickly read. Is there any way we can document a sample transaction stream with all the actors involved (someone issues a libvirt API call XYZ, libvirt in turn issues QMP command ABC), and using shorter names that are easier to read as humans?
Yeah, A-B-C-D terminology would be nice for the examples. It's fine if the actual implementation uses UUIDs.
{ "arguments": { "actions": [ { "type": "block-dirty-bitmap-add" "data": { "node": "drive-scsi0-0-0-0", "name": "libvirt-8fc02db3-166f-4de7-b7aa-1f7303e6162b^d068765e-8b50-4d74-9b72-1e55c663cbf8", "persistent": true }, }, { "type": "x-vz-block-dirty-bitmap-merge" "data": { "node": "drive-scsi0-0-0-0", "src_name": "libvirt-0044757e-1a2d-4c2c-b92f-bb403309bb17^d068765e-8b50-4d74-9b72-1e55c663cbf# "dst_name": "libvirt-d068765e-8b50-4d74-9b72-1e55c663cbf8", }, }, { "type": "x-vz-block-dirty-bitmap-merge" "data": { "node": "drive-scsi0-0-0-0", "src_name": "libvirt-8fc02db3-166f-4de7-b7aa-1f7303e6162b^0044757e-1a2d-4c2c-b92f-bb403309bb1# "dst_name": "libvirt-8fc02db3-166f-4de7-b7aa-1f7303e6162b^d068765e-8b50-4d74-9b72-1e55c663cbf# }, }, ] }, "execute": "transaction" } { "execute": "x-vz-block-dirty-bitmap-remove" "arguments": { "node": "drive-scsi0-0-0-0" "name": "libvirt-8fc02db3-166f-4de7-b7aa-1f7303e6162b^0044757e-1a2d-4c2c-b92f-bb403309bb17", }, }, { "execute": "x-vz-block-dirty-bitmap-remove" "arguments": { "node": "drive-scsi0-0-0-0" "name": "libvirt-0044757e-1a2d-4c2c-b92f-bb403309bb17^d068765e-8b50-4d74-9b72-1e55c663cbf8", }, }
Here is a list of the bitmap commands used in the implementation but not yet upstream (AFAIK).
x-vz-block-dirty-bitmap-remove
We already have this, right? It doesn't even need to be transactionable.
x-vz-block-dirty-bitmap-merge
You need this...
x-vz-block-dirty-bitmap-disable
And this we had originally but since removed, but can be re-added trivially.
x-vz-block-dirty-bitmap-enable (not in the examples; used when removing most recent checkpoint)

x-vz-nbd-server-add-bitmap
Do my comments make sense? Am I understanding you right so far? I'll try to offer a competing writeup to make sure we're on the same page with your proposed design before I waste any time trying to critique it -- in case I'm misunderstanding you.

Thank you for leading the charge and proposing new APIs for this feature. It will be very nice to expose the incremental backup functionality we've been working on in QEMU to users of libvirt.

--js
*Restore operation nuances*
As written above, to restore a domain one needs to start it in paused state, export the domain's disks and write to them from the backup. However, qemu currently does not allow exporting disks for write even for a domain that never starts the guest's CPUs. We have an experimental qemu command line option -x-vz-nbd-restore (passed together with the -incoming option) to fix this.
Why can't restore be done while the guest is offline? (Oh right, we still haven't added decent qemu-img support for bitmap manipulation, so we need a qemu process around for any bitmap changes).
I'm working on this right now, actually! I'm working on JSON format output for bitmap querying, and simple clear/delete commands. I hope to send this out very soon.
As I understand it, the point of bitmaps and snapshots is to create an NBD server that a third-party can use to read just the dirty portions of a disk in relation to a known checkpoint, to save off data in whatever form it wants; so you are right that the third party then needs a way to rewrite data from whatever internal form it stored it in back to the view that qemu can consume when rolling back to a given backup, prior to starting the guest on the restored data. Do you need additional libvirt APIs exposed for this, or do the proposed APIs for adding snapshots cover everything already with just an additional flag parameter that says whether the <domainblocksnapshot> is readonly (the third-party is using it for collecting the incremental backup data) or writable (the third-party is actively writing its backup into the file, and when it is done, then perform a block-commit to merge that data back onto the main qcow2 file)?
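For concreteness, the restore flow sketched in the overview might look roughly like this from the client side (a sketch only: it mixes existing libvirt calls with the proposed export API, abbreviates the XML, omits error handling, and leaves open how a writable export is requested, which is exactly the question above):

#include <libvirt/libvirt.h>

static int restore_disks(virConnectPtr conn, const char *domxml)
{
    virDomainPtr dom;

    /* 1. Start the domain without running guest CPUs. */
    dom = virDomainCreateXML(conn, domxml, VIR_DOMAIN_START_PAUSED);
    if (!dom)
        return -1;

    /* 2. Export the disks of interest for write (proposed API). */
    virDomainBlockExportStart(dom,
        "<domainblockexport type='nbd'>...</domainblockexport>", 0);

    /* 3. ... an external NBD client writes the backed-up data back here ... */

    /* 4. Tear down the export and let the guest run (or destroy it). */
    virDomainBlockExportStop(dom,
        "<domainblockexport type='nbd'>...</domainblockexport>", 0);
    virDomainResume(dom);
    virDomainFree(dom);
    return 0;
}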

On 12.04.2018 07:14, John Snow wrote:
On 04/11/2018 12:32 PM, Eric Blake wrote:
On 04/03/2018 07:01 AM, Nikolay Shirokovskiy wrote:
Hi, all.
This is another RFC on pull backup API. This API provides means to read domain disks in a snapshotted state so that client can back them up as well as means to write domain disks to revert them to backed up state. The previous version of RFC is [1]. I'll also describe the API implementation details to shed light on misc qemu dirty bitmap commands usage.
[snip]
Qemu can track which disk blocks have changed since the snapshotted state, so on the next backup the client can back up only the changed blocks. virDomainBlockSnapshotCreateXML accepts a VIR_DOMAIN_BLOCK_SNAPSHOT_CREATE_CHECKPOINT flag to turn this option on for the snapshot, which means tracking changes from this particular snapshot. I used the term checkpoint and not [dirty] bitmap because in the current implementation many qemu dirty bitmaps are used to provide the changed blocks from a given checkpoint to the current snapshot (see the *Implementation* section for more details). Also, a bitmap keeps block changes and thus itself changes in time, while a checkpoint is a more static term meaning you can query changes from that moment in time.
Checkpoints are visible in active domain xml:
<disk type='file' device='disk'> .. <target dev='sda' bus='scsi'/> <alias name='scsi0-0-0-0'/> <checkpoint name="93a5c045-6457-2c09-e56c-927cdf34e178"> <checkpoint name="5768a388-c1c4-414c-ac4e-eab216ba7c0c"> .. </disk>
It makes sense to avoid the bitmap name in libvirt, but do these indeed correlate 1:1 with bitmaps?
I assume each bitmap will have name=%%UUID%% ?
There is a 1:1 correlation but the names are different. Check out the checkpoints subsection of the *Implementation details* section below for the naming scheme.
Every checkpoint requires a qemu dirty bitmap, which eats 16MiB of RAM for a 1TiB disk with the default dirty block size of 64KiB, and the same amount of disk space is used. So the client needs to manage checkpoints and delete unused ones. Thus the next API function:
[snip]
First a few facts about qemu dirty bitmaps.
A bitmap can be either in the active or the disabled state. In the disabled state it does not get changed on guest writes, and conversely in the active state it tracks guest writes. This implementation uses an approach with only one active bitmap at a time. This should reduce guest write penalties in the presence of checkpoints. So on the first snapshot we create bitmap B_1. Now it tracks changes from snapshot 1. On the second snapshot we create bitmap B_2 and disable bitmap B_1, and so on. Now bitmap B_1 keeps the changes from snapshot 1 to snapshot 2, B_2 the changes from snapshot 2 to snapshot 3, and so on. The last bitmap is active and gets most of the disk changes made after the latest snapshot.
So you are trying to optimize away write penalties if you have, say, ten bitmaps representing checkpoints so we don't have to record all new writes to all ten.
This makes sense, and I would have liked to formalize the concept in QEMU, but response to that idea was very poor at the time.
Also my design was bad :)
Getting the changed-blocks bitmap from some checkpoint in the past up to the current snapshot is quite simple in this scheme. For example, if the last snapshot is 7, then to get the changes from snapshot 3 to the latest snapshot we need to merge bitmaps B_3, B_4, B_5 and B_6. Merging is just a logical OR of the bitmap bits.
Deleting a checkpoint somewhere in the middle of the checkpoint sequence requires merging the corresponding bitmap into the previous bitmap in this scheme.
Previous, or next?
In short previous.
Say we've got bitmaps (in chronological order from oldest to newest)
A B C D E F G H
and we want to delete bitmap (or "checkpoint") 'C':
A B D E F G H
the bitmap representing checkpoint 'D' should now contain the bits that used to be in 'C', right? That way all the checkpoints still represent their appropriate points in time.
I merge into the previous one due to the definition above: "A" contains the changes from point in time A to point in time B, and so on. So if you delete C, then in order for B to keep the changes from point in time B to point in time D (the next in the checkpoint chain) you need to merge C into B.
The only problem comes when you delete a checkpoint on the end and the bits have nowhere to go:
A B C
A B _
In this case you really do lose a checkpoint -- but depending on how we annotate this, it may or may not be possible to delete the most recent checkpoint. Let's assume that the currently active bitmap that doesn't represent *any* point in time yet (because it's still active and recording new writes) is noted as 'X':
A B C X
If we delete C now, then, that bitmap can get re-merged into the *active bitmap* X:
A B _ X
You can delete any bitmap (and accordingly any checkpoint). If the checkpoint is the last one, we just merge the last bitmap into the previous one and additionally make the previous bitmap active.
We use persistent bitmaps in the implementation. This means that upon qemu process termination bitmaps are saved in the disk images' metadata and restored back on qemu process start. This makes a checkpoint a persistent property, that is, we keep them across domain starts/stops. Qemu does not try hard to keep bitmaps: if something goes wrong upon save, the bitmap is dropped. The same applies to the migration process too. For the backup process this is not critical: if we don't discover a checkpoint we can always make a full backup. Also, qemu provides no special means to track the order of bitmaps. These facts are critical for the implementation with one active bitmap at a time. We need the right order of bitmaps upon merge - for snapshot N and block changes from snapshot K, K < N, to N we need to merge bitmaps B_{K}, ..., B_{N-1}. Also, if one of the bitmaps to be merged is missing we can't calculate the desired block changes either.
Right. A missing bitmap anywhere in the sequence invalidates the entire sequence.
So the implementation encodes the bitmap order in their names. For snapshot A1 the bitmap name will be A1, for snapshot A2 the bitmap name will be A2^A1, and so on. Using this name encoding, upon domain start we can find out the bitmap order and check for missing ones. This complicates bitmap removal a bit, though. For example, removing a bitmap somewhere in the middle looks like this:
- removing bitmap K (bitmap name is NAME_{K}^NAME_{K-1})
    - create new bitmap named NAME_{K+1}^NAME_{K-1}      ---.
    - disable new bitmap                                     | This is effectively renaming
    - merge bitmap NAME_{K+1}^NAME_{K} to the new bitmap     | of bitmap K+1 to comply with
    - remove bitmap NAME_{K+1}^NAME_{K}                   ___/ the naming scheme
    - merge bitmap NAME_{K}^NAME_{K-1} to NAME_{K-1}^NAME_{K-2}
    - remove bitmap NAME_{K}^NAME_{K-1}
As you can see, we need to change the name of bitmap K+1 to keep our bitmap naming scheme. This is done by creating a new K+1 bitmap with the appropriate name and copying the old K+1 bitmap into the new one.
That seems... unfortunate. A record could be kept in libvirt instead, couldn't it?
A : Bitmap A, Time 12:34:56, Child of (None), Parent of B B : Bitmap B, Time 23:15:46, Child of A, Parent of (None)
Yes, it is possible. I was reluctant to implement it this way for a couple of reasons:

- if the bitmap metadata is in libvirt we need to carefully design it for things like libvirtd crashes. If the metadata is out of sync with qemu then we can get broken incremental backups. One possible design is:

    - on bitmap deletion, save the metadata after deleting the bitmap in qemu; in case of a libvirtd crash in between, upon libvirtd restart we can drop bitmaps that are in the metadata but not in qemu as already deleted

    - on bitmap add (creating a new snapshot with a checkpoint), save the metadata with the bitmap before creating the bitmap in qemu; then again we have a way to handle libvirtd crashes in between

  So this approach has tricky parts too. The suggested approach uses qemu transactions to keep the bitmaps consistent.

- I don't like another piece of metadata which looks like it belongs to the disks and not the domain. It is like keeping the disk size in the domain xml.
I suppose in this case you can't *reconstruct* this information from the bitmap stored in the qcow2, which necessitates your naming scheme...
...Still, if you forego this requirement, deleting bitmaps in the middle becomes fairly easy.
So while it is possible to have only one active bitmap at a time, it costs some exercises at the management layer. To me it looks like qemu itself is a better place to track bitmap chain order and consistency.
If this is a hard requirement, it's certainly *easier* to track the relationship in QEMU ...
Libvirt is already tracking a tree relationship between internal snapshots (the virDomainSnapshotCreateXML), because qemu does NOT track that (true, internal snapshots don't get as much attention as external snapshots) - but the fact remains that qemu is probably not the best place to track relationship between multiple persistent bitmaps, any more than it tracks relationships between internal snapshots. So having libvirt track relations between persistent bitmaps is just fine. Do we really have to rename bitmaps in the qcow2 file, or can libvirt track it all on its own?
This is a way, really, of storing extra metadata by using the bitmap name as arbitrary data storage.
I'd say either we promote QEMU to understanding checkpoints, or enhance libvirt to track what it needs independent of QEMU -- but having to rename bitmaps smells fishy to me.
Earlier, you said that the new virDomainBlockSnapshotPtr are independent, with no relations between them. But here, you are wanting to keep incremental backups related to one another.
I think the *snapshots*, as temporary objects, are independent and don't carry a relation to each other.
The *checkpoints* here, however, are persistent and interrelated.
Now how exporting bitmaps looks like.
- add to export disk snapshot N with changes from checkpoint K - add fleece blockdev to NBD exports - create new bitmap T - disable bitmap T - merge bitmaps K, K+1, .. N-1 into T
I see; so we compute a new slice based on previous bitmaps and back up from that arbitrary slice.
So "T" is a temporary bitmap meant to be discarded at the conclusion of the operation, making it much more like a consumable object.
- add bitmap to T to nbd export
- remove disk snapshot from export - remove fleece blockdev from NBD exports - remove bitmap T
Aha.
Here is qemu commands examples for operation with checkpoints, I'll make several snapshots with checkpoints for purpuse of better illustration.
- create snapshot d068765e-8b50-4d74-9b72-1e55c663cbf8 with checkpoint - same as without checkpoint but additionally add bitmap on fleece blockjob start
... { "execute": "transaction" "arguments": { "actions": [ { "type": "blockdev-backup" "data": { "device": "drive-scsi0-0-0-0", "sync": "none", "target": "snapshot-scsi0-0-0-0" }, }, { "type": "block-dirty-bitmap-add" "data": { "name": "libvirt-d068765e-8b50-4d74-9b72-1e55c663cbf8", "node": "drive-scsi0-0-0-0", "persistent": true }, }
So a checkpoint creates a reference point, but NOT a backup. You are manually creating checkpoint instances.
In this case, though, you haven't disabled the previous checkpoint's bitmap (if any?) atomically with the creation of this one...
In the example this is the first snapshot, so there is no previous checkpoint and thus nothing to disable.
Here, the transaction makes sense; you have to create the persistent dirty bitmap to track from the same point in time. The dirty bitmap is tied to the active image, not the backup, so that when you create the NEXT incremental backup, you have an accurate record of which sectors were touched in snapshot-scsi0-0-0-0 between this transaction and the next.
] }, }
- delete snapshot d068765e-8b50-4d74-9b72-1e55c663cbf8 - same as without checkpoints
- create snapshot 0044757e-1a2d-4c2c-b92f-bb403309bb17 with checkpoint - same actions as for the first snapshot, but additionally disable the first bitmap
Again, you're showing the QMP commands that libvirt is issuing; which libvirt API calls are driving these actions?
... { "execute": "transaction" "arguments": { "actions": [ { "type": "blockdev-backup" "data": { "device": "drive-scsi0-0-0-0", "sync": "none", "target": "snapshot-scsi0-0-0-0" }, }, { "type": "x-vz-block-dirty-bitmap-disable" "data": {
Do you have measurements on whether having multiple active bitmaps hurts performance? I'm not yet sure that managing a chain of disabled bitmaps (and merging them as needed for restores) is more or less efficient than managing multiple bitmaps all the time. On the other hand, you do have a point that restore is a less frequent operation than backup, so making backup as lean as possible and putting more work on restore is a reasonable tradeoff, even if it adds complexity to the management for doing restores.
Depending on the number of checkpoints intended to be kept... we certainly make no real promises on the efficiency of marking so many. It's at *least* a linear increase with each checkpoint...
"name": "libvirt-d068765e-8b50-4d74-9b72-1e55c663cbf8", "node": "drive-scsi0-0-0-0" }, }, { "type": "block-dirty-bitmap-add" "data": { "name": "libvirt-0044757e-1a2d-4c2c-b92f-bb403309bb17^d068765e-8b50-4d74-9b72-1e55c663cbf8", "node": "drive-scsi0-0-0-0", "persistent": true }, } ] }, }
Oh, I see, you handle the "disable old" case here.
- delete snapshot 0044757e-1a2d-4c2c-b92f-bb403309bb17 - create snapshot 8fc02db3-166f-4de7-b7aa-1f7303e6162b with checkpoint
- add disk snapshot 8fc02db3-166f-4de7-b7aa-1f7303e6162b to export and bitmap with changes from checkpoint d068765e-8b50-4d74-9b72-1e55c663cbf8 - same as add export without checkpoint, but aditionally - form result bitmap - add bitmap to NBD export
... { "execute": "transaction" "arguments": { "actions": [ { "type": "block-dirty-bitmap-add" "data": { "node": "drive-scsi0-0-0-0", "name": "libvirt-__export_temporary__", "persistent": false }, }, { "type": "x-vz-block-dirty-bitmap-disable" "data": { "node": "drive-scsi0-0-0-0" "name": "libvirt-__export_temporary__", }, }, { "type": "x-vz-block-dirty-bitmap-merge" "data": { "node": "drive-scsi0-0-0-0", "src_name": "libvirt-d068765e-8b50-4d74-9b72-1e55c663cbf8" "dst_name": "libvirt-__export_temporary__", }, }, { "type": "x-vz-block-dirty-bitmap-merge" "data": { "node": "drive-scsi0-0-0-0", "src_name": "libvirt-0044757e-1a2d-4c2c-b92f-bb403309bb17^d068765e-8b50-4d74-9b72-1e55c663cbf# "dst_name": "libvirt-__export_temporary__", }, } ] }, }
OK, so in this transaction you add a new temporary bitmap for export, and merge the contents of two bitmaps into it.
However, it doesn't look like you created a new checkpoint and managed that handoff here, did you?
We don't need to create checkpoints for the purpose of exporting, only a temporary bitmap into which to merge the appropriate bitmap chain.
{ "execute": "x-vz-nbd-server-add-bitmap" "arguments": { "name": "sda-8fc02db3-166f-4de7-b7aa-1f7303e6162b" "bitmap": "libvirt-__export_temporary__", "bitmap-export-name": "d068765e-8b50-4d74-9b72-1e55c663cbf8", },
And then here, once the bitmap and the data is already frozen, it's actually alright if we add the export at a later point in time.
Adding a bitmap to a server is would would advertise to the NBD client that it can query the "qemu-dirty-bitmap:d068765e-8b50-4d74-9b72-1e55c663cbf8" namespace during NBD_CMD_BLOCK_STATUS, rather than just "base:allocation"?
Don't know much about this, I stopped paying attention to the BLOCK STATUS patches. Is the NBD spec the best way to find out the current state right now?
(Is there a less technical, briefer overview somewhere, perhaps from a commit message or a cover letter?)
}
- remove snapshot 8fc02db3-166f-4de7-b7aa-1f7303e6162b from export - same as without checkpoint but additionally remove temporary bitmap
... { "arguments": { "name": "libvirt-__export_temporary__", "node": "drive-scsi0-0-0-0" }, "execute": "block-dirty-bitmap-remove" }
OK, this just deletes the checkpoint. I guess we delete the node and
I would not call it a checkpoint. A checkpoint is something visible to the client: an ability to get CBT from that point in time. Here we create a temporary bitmap to calculate the desired CBT.
stop the NBD server too, right?
Yeah, just like in the case without a checkpoint (mentioned in this case's description).
- delete checkpoint 0044757e-1a2d-4c2c-b92f-bb403309bb17 (similar operation is described in the section about naming scheme for bitmaps, with difference that K+1 is N here and thus new bitmap should not be disabled)
A suggestion on the examples - while UUIDs are nice and handy for management tools, they are a pain to type and for humans to quickly read. Is there any way we can document a sample transaction stream with all the actors involved (someone issues a libvirt API call XYZ, libvirt in turn issues QMP command ABC), and using shorter names that are easier to read as humans?
Yeah, A-B-C-D terminology would be nice for the examples. It's fine if the actual implementation uses UUIDs.
{ "arguments": { "actions": [ { "type": "block-dirty-bitmap-add" "data": { "node": "drive-scsi0-0-0-0", "name": "libvirt-8fc02db3-166f-4de7-b7aa-1f7303e6162b^d068765e-8b50-4d74-9b72-1e55c663cbf8", "persistent": true }, }, { "type": "x-vz-block-dirty-bitmap-merge" "data": { "node": "drive-scsi0-0-0-0", "src_name": "libvirt-0044757e-1a2d-4c2c-b92f-bb403309bb17^d068765e-8b50-4d74-9b72-1e55c663cbf# "dst_name": "libvirt-d068765e-8b50-4d74-9b72-1e55c663cbf8", }, }, { "type": "x-vz-block-dirty-bitmap-merge" "data": { "node": "drive-scsi0-0-0-0", "src_name": "libvirt-8fc02db3-166f-4de7-b7aa-1f7303e6162b^0044757e-1a2d-4c2c-b92f-bb403309bb1# "dst_name": "libvirt-8fc02db3-166f-4de7-b7aa-1f7303e6162b^d068765e-8b50-4d74-9b72-1e55c663cbf# }, }, ] }, "execute": "transaction" } { "execute": "x-vz-block-dirty-bitmap-remove" "arguments": { "node": "drive-scsi0-0-0-0" "name": "libvirt-8fc02db3-166f-4de7-b7aa-1f7303e6162b^0044757e-1a2d-4c2c-b92f-bb403309bb17", }, }, { "execute": "x-vz-block-dirty-bitmap-remove" "arguments": { "node": "drive-scsi0-0-0-0" "name": "libvirt-0044757e-1a2d-4c2c-b92f-bb403309bb17^d068765e-8b50-4d74-9b72-1e55c663cbf8", }, }
Here is a list of bitmap commands used in implementation but not yet in upstream (AFAIK).
x-vz-block-dirty-bitmap-remove
We already have this, right? It doesn't even need to be transactionable.
x-vz-block-dirty-bitmap-merge
You need this...
x-vz-block-dirty-bitmap-disable
And this we had originally but since removed, but can be re-added trivially.
x-vz-block-dirty-bitmap-enable (not in the examples; used when removing most recent checkpoint) x-vz-nbd-server-add-bitmap
Do my comments make sense? Am I understanding you right so far? I'll try to offer a competing writeup to make sure we're on the same page with your proposed design before I waste any time trying to critique it -- in case I'm misunderstanding you.
Yes, looks like we are in tune.
Thank you for leading the charge and proposing new APIs for this feature. It will be very nice to expose the incremental backup functionality we've been working on in QEMU to users of libvirt.
--js
There are also patches (if the API design survives the review phase at least partially :) ).
*Restore operation nuances*
As written above, to restore a domain one needs to start it in paused state, export the domain's disks and write to them from the backup. However, qemu currently does not allow exporting disks for write even for a domain that never starts the guest's CPUs. We have an experimental qemu command line option -x-vz-nbd-restore (passed together with the -incoming option) to fix this.
Why can't restore be done while the guest is offline? (Oh right, we still haven't added decent qemu-img support for bitmap manipulation, so we need a qemu process around for any bitmap changes).
I'm working on this right now, actually!
I'm working on JSON format output for bitmap querying, and simple clear/delete commands. I hope to send this out very soon.
As I understand it, the point of bitmaps and snapshots is to create an NBD server that a third-party can use to read just the dirty portions of a disk in relation to a known checkpoint, to save off data in whatever form it wants; so you are right that the third party then needs a way to rewrite data from whatever internal form it stored it in back to the view that qemu can consume when rolling back to a given backup, prior to starting the guest on the restored data. Do you need additional libvirt APIs exposed for this, or do the proposed APIs for adding snapshots cover everything already with just an additional flag parameter that says whether the <domainblocksnapshot> is readonly (the third-party is using it for collecting the incremental backup data) or writable (the third-party is actively writing its backup into the file, and when it is done, then perform a block-commit to merge that data back onto the main qcow2 file)?

12.04.2018 15:57, Nikolay Shirokovskiy wrote:
On 12.04.2018 07:14, John Snow wrote:
On 04/11/2018 12:32 PM, Eric Blake wrote:
On 04/03/2018 07:01 AM, Nikolay Shirokovskiy wrote:
Hi, all.
This is another RFC on pull backup API. This API provides means to read domain disks in a snapshotted state so that client can back them up as well as means to write domain disks to revert them to backed up state. The previous version of RFC is [1]. I'll also describe the API implementation details to shed light on misc qemu dirty bitmap commands usage.
[snip]
Qemu can track what disk's blocks are changed from snapshotted state so on next backup client can backup only changed blocks. virDomainBlockSnapshotCreateXML accepts VIR_DOMAIN_BLOCK_SNAPSHOT_CREATE_CHECKPOINT flag to turn this option for snapshot which means to track changes from this particular snapshot. I used checkpoint term and not [dirty] bitmap because many qemu dirty bitmaps are used to provide changed blocks from the given checkpoint to current snapshot in current implementation (see *Implemenation* section for more details). Also bitmap keeps block changes and thus itself changes in time and checkpoint is a more statical terms means you can query changes from that moment in time.
Checkpoints are visible in active domain xml:
<disk type='file' device='disk'> .. <target dev='sda' bus='scsi'/> <alias name='scsi0-0-0-0'/> <checkpoint name="93a5c045-6457-2c09-e56c-927cdf34e178"> <checkpoint name="5768a388-c1c4-414c-ac4e-eab216ba7c0c"> .. </disk>
It makes sense to avoid the bitmap name in libvirt, but do these indeed correlate 1:1 with bitmaps?
I assume each bitmap will have name=%%UUID%% ?

There is a 1:1 correlation but the names are different. Check out the checkpoints subsection of the *Implementation details* section below for the naming scheme.
Every checkpoint requires qemu dirty bitmap which eats 16MiB of RAM with default dirty block size of 64KiB for 1TiB disk and the same amount of disk space is used. So client need to manage checkpoints and delete unused. Thus next API function:
[snip]
First a few facts about qemu dirty bitmaps.
Bitmap can be either in active or disable state. In disabled state it does not get changed on guest writes. And oppositely in active state it tracks guest writes. This implementation uses approach with only one active bitmap at a time. This should reduce guest write penalties in the presence of checkpoints. So on first snapshot we create bitmap B_1. Now it tracks changes from the snapshot 1. On second snapshot we create bitmap B_2 and disable bitmap B1 and so on. Now bitmap B1 keep changes from snaphost 1 to snapshot 2, B2 - changes from snaphot 2 to snapshot 3 and so on. Last bitmap is active and gets most disk change after latest snapshot. So you are trying to optimize away write penalties if you have, say, ten bitmaps representing checkpoints so we don't have to record all new writes to all ten.
This makes sense, and I would have liked to formalize the concept in QEMU, but response to that idea was very poor at the time.
Also my design was bad :)
Getting changed blocks bitmap from some checkpoint in past till current snapshot is quite simple in this scheme. For example if the last snapshot is 7 then to get changes from snapshot 3 to latest snapshot we need to merge bitmaps B3, B4, B4 and B6. Merge is just logical OR on bitmap bits.
Deleting a checkpoint somewhere in the middle of checkpoint sequence requires merging correspondent bitmap to the previous bitmap in this scheme.
Previous, or next? In short previous.
Say we've got bitmaps (in chronological order from oldest to newest)
A B C D E F G H
and we want to delete bitmap (or "checkpoint") 'C':
A B D E F G H
the bitmap representing checkpoint 'D' should now contain the bits that used to be in 'C', right? That way all the checkpoints still represent their appropriate points in time.

I merge into the previous one due to the definition above: "A" contains the changes from point in time A to point in time B, and so on. So if you delete C, then in order for B to keep the changes from point in time B to point in time D (the next in the checkpoint chain) you need to merge C into B.
The only problem comes when you delete a checkpoint on the end and the bits have nowhere to go:
A B C
A B _
In this case you really do lose a checkpoint -- but depending on how we annotate this, it may or may not be possible to delete the most recent checkpoint. Let's assume that the currently active bitmap that doesn't represent *any* point in time yet (because it's still active and recording new writes) is noted as 'X':
A B C X
If we delete C now, then, that bitmap can get re-merged into the *active bitmap* X:
A B _ X
You can delete any bitmap (and accordingly any checkpoint). If the checkpoint is the last one, we just merge the last bitmap into the previous one and additionally make the previous bitmap active.
I propose not to say that a bitmap represents a checkpoint. It is simpler to say (and it reflects reality) that a bitmap is the difference between two consecutive checkpoints. And we can say that the active state is some kind of checkpoint, the current point in time.

So, we have checkpoints (5* is the active state), which are points in time:

1 2 3 4 5*

And bitmaps, the first three disabled, the last one enabled:

"1->2", "2->3", "3->4", "4->5*"

So, to remove the first checkpoint: just remove bitmap "1->2". To remove any other checkpoint N: create a new bitmap "(N-1)->(N+1)" = merge("(N-1)->N", "N->(N+1)"), then drop bitmaps "(N-1)->N" and "N->(N+1)". If the latter was active, the new one becomes active. And we can't remove the 5* checkpoint, as it is the active state, not an actual checkpoint.
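A small sketch of that bookkeeping, just to spell the rule out (a toy model where bitmaps are plain byte arrays indexed by the interval they cover; not qemu or libvirt code):

#include <string.h>

#define NBYTES 16          /* toy bitmap size, one bit per dirty block */
#define MAXCKPT 8

/* intervals[i] is the dirty bitmap "checkpoint i+1 -> checkpoint i+2";
 * the last interval runs up to the active state and is the active bitmap. */
static unsigned char intervals[MAXCKPT][NBYTES];
static int nintervals = 4; /* checkpoints 1..4 plus the active state 5* */

/* Remove checkpoint n (1-based); the active state itself cannot be removed.
 * Per the rule above: for n == 1 just drop bitmap "1->2"; otherwise
 * "(n-1)->(n+1)" = merge("(n-1)->n", "n->(n+1)"): the older interval is
 * ORed into the newer one, which keeps its (possibly active) status, and
 * the older slot is dropped. */
static int remove_checkpoint(int n)
{
    int i, b, drop;

    if (n < 1 || n > nintervals)
        return -1;                   /* nonexistent, or the active state */

    if (n > 1)
        for (b = 0; b < NBYTES; b++)
            intervals[n - 1][b] |= intervals[n - 2][b];

    drop = (n == 1) ? 0 : n - 2;     /* slot whose bits were absorbed */
    for (i = drop; i < nintervals - 1; i++)
        memcpy(intervals[i], intervals[i + 1], NBYTES);
    nintervals--;
    return 0;
}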
We use persitent bitmaps in the implementation. This means upon qemu process termination bitmaps are saved in disks images metadata and restored back on qemu process start. This makes checkpoint a persistent property that is we keep them across domain start/stops. Qemu does not try hard to keep bitmaps. If upon save something goes wrong bitmap is dropped. The same is applied to the migration process too. For backup process it is not critical. If we don't discover a checkpoint we always can make a full backup. Also qemu provides no special means to track order of bitmaps. These facts are critical for implementation with one active bitmap at a time. We need right order of bitmaps upon merge - for snapshot N and block changes from snanpshot K, K < N to N we need to merge bitmaps B_{K}, ..., B_{N-1}. Also if one of the bitmaps to be merged is missing we can't calculate desired block changes too.
Right. A missing bitmap anywhere in the sequence invalidates the entire sequence.
So the implementation encode bitmap order in their names. For snapshot A1, bitmap name will be A1, for snapshot A2 bitmap name will be A2^A1 and so on. Using this naming encoding upon domain start we can find out bitmap order and check for missing ones. This complicates a bit bitmap removing though. For example removing a bitmap somewhere in the middle looks like this:
- removing bitmap K (bitmap name is NAME_{K}^NAME_{K-1}):
  - create a new bitmap named NAME_{K+1}^NAME_{K-1}
  - disable the new bitmap
  - merge bitmap NAME_{K+1}^NAME_{K} to the new bitmap
  - remove bitmap NAME_{K+1}^NAME_{K}
    (the four steps above are effectively renaming bitmap K+1 to comply with the naming scheme)
  - merge bitmap NAME_{K}^NAME_{K-1} to NAME_{K-1}^NAME_{K-2}
  - remove bitmap NAME_{K}^NAME_{K-1}
As you can see, we need to change the name of bitmap K+1 to keep our bitmap naming scheme. This is done by creating a new K+1 bitmap with the appropriate name and copying the old K+1 bitmap into the new one.
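As an illustration of the name encoding described above, here is a minimal Python sketch (not the actual libvirt code) that reconstructs the bitmap chain order from names of the form "A2^A1" and detects a missing bitmap; the sample names are made up.

# Sketch: reconstruct chain order from bitmap names "CUR^PREV"
# (the oldest bitmap is named just "CUR") and detect missing links.
def chain_from_names(names):
    by_prev = {}
    first = None
    for n in names:
        cur, sep, prev = n.partition("^")
        if sep:
            by_prev[prev] = n       # this bitmap follows checkpoint 'prev'
        else:
            first = n               # the oldest bitmap has no suffix
    if first is None:
        raise ValueError("the oldest bitmap is missing")
    chain = [first]
    while True:
        cur = chain[-1].partition("^")[0]
        nxt = by_prev.get(cur)
        if nxt is None:
            break
        chain.append(nxt)
    if len(chain) != len(names):
        raise ValueError("broken chain: some bitmap is missing")
    return chain

print(chain_from_names(["A1", "A3^A2", "A2^A1"]))   # ['A1', 'A2^A1', 'A3^A2']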
That seems... unfortunate. A record could be kept in libvirt instead, couldn't it?
A : Bitmap A, Time 12:34:56, Child of (None), Parent of B
B : Bitmap B, Time 23:15:46, Child of A, Parent of (None)
Yes, it is possible. I was reluctant to implement it this way for a couple of reasons:
- if the bitmap metadata is in libvirt we need to carefully design it for things like libvirtd crashes. If the metadata gets out of sync with qemu then we can get broken incremental backups. One possible design is:
- on bitmap deletion, save the metadata after deleting the bitmap in qemu; in case of a libvirtd crash in between, upon libvirtd restart we can drop bitmaps that are present in the metadata but not in qemu, as already deleted
- on bitmap addition (creating a new snapshot with a checkpoint), save the metadata with the bitmap before creating the bitmap in qemu; then again we have a way to handle libvirtd crashes in between
So this approach has tricky parts too. The suggested approach uses qemu transactions to keep bitmaps consistent.
- I don't like yet another piece of metadata which looks like it belongs to the disks and not to the domain. It is like keeping the disk size in the domain xml.
I suppose in this case you can't *reconstruct* this information from the bitmap stored in the qcow2, which necessitates your naming scheme...
...Still, if you forego this requirement, deleting bitmaps in the middle becomes fairly easy.
So while it is possible to have only one active bitmap at a time, it costs some exercise at the management layer. To me it looks like qemu itself is a better place to track bitmap chain order and consistency. If this is a hard requirement, it's certainly *easier* to track the relationship in QEMU ...
It's not easier, as we'll have to implement either a concept of checkpoints separate from bitmaps, which will be based on bitmaps, and we'll have to negotiate and implement storing these objects to qcow2 and migrating them. Or we'll go the way proposed by Kevin (if I remember correctly) of adding something like a "backing" or "parent" pointer to BdrvDirtyBitmap, and anyway store it to qcow2, migrate it and expose qapi for it. The other (harder) way is to move to multi-bit bitmaps (like in vmware), where for each granularity-chunk we store a number representing "before which checkpoint was the latest change of this chunk", and again the same: qapi+qcow2+migration. It's all not easier than calling several simple qmp commands.
Libvirt is already tracking a tree relationship between internal snapshots (the virDomainSnapshotCreateXML), because qemu does NOT track that (true, internal snapshots don't get as much attention as external snapshots) - but the fact remains that qemu is probably not the best place to track relationship between multiple persistent bitmaps, any more than it tracks relationships between internal snapshots. So having libvirt track relations between persistent bitmaps is just fine. Do we really have to rename bitmaps in the qcow2 file, or can libvirt track it all on its own?
This is a way, really, of storing extra metadata by using the bitmap name as arbitrary data storage.
I'd say either we promote QEMU to understanding checkpoints, or enhance libvirt to track what it needs independent of QEMU -- but having to rename bitmaps smells fishy to me.
Earlier, you said that the new virDomainBlockSnapshotPtr are independent, with no relations between them. But here, you are wanting to keep incremental backups related to one another.
I think the *snapshots*, as temporary objects, are independent and don't carry a relation to each other.
The *checkpoints* here, however, are persistent and interrelated.
Now, here is how exporting bitmaps looks.
- add disk snapshot N with changes from checkpoint K to export:
  - add the fleece blockdev to NBD exports
  - create a new bitmap T
  - disable bitmap T
  - merge bitmaps K, K+1, .. N-1 into T (see the sketch after this exchange)
I see; so we compute a new slice based on previous bitmaps and back up from that arbitrary slice.
So "T" is a temporary bitmap meant to be discarded at the conclusion of the operation, making it much more like a consumable object.
  - add bitmap T to the NBD export
- remove disk snapshot from export:
  - remove the fleece blockdev from NBD exports
  - remove bitmap T
Aha.
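For illustration, here is a minimal Python sketch (not code from the series) of how the set of bitmaps merged into the temporary bitmap T is chosen when exporting changes since checkpoint K; the sample data is invented.

# bitmaps[i] holds the changes between checkpoints[i] and checkpoints[i+1];
# the last bitmap is the active one (changes since the newest checkpoint).
def export_delta(checkpoints, bitmaps, since):
    k = checkpoints.index(since)
    t = set()                        # the temporary bitmap T
    for b in bitmaps[k:]:            # merge B_K, B_{K+1}, ..., including the active one
        t |= b
    return t

cps  = ["c1", "c2", "c3"]            # c3 is the most recent checkpoint
maps = [{1, 2}, {2, 3}, {7}]         # "c1->c2", "c2->c3", "c3->now" (active)
print(export_delta(cps, maps, "c2")) # {2, 3, 7}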
Here are qemu command examples for operations with checkpoints. I'll make several snapshots with checkpoints for the purpose of better illustration.
- create snapshot d068765e-8b50-4d74-9b72-1e55c663cbf8 with checkpoint
  - same as without a checkpoint, but additionally add a bitmap when the fleece blockjob starts
... { "execute": "transaction" "arguments": { "actions": [ { "type": "blockdev-backup" "data": { "device": "drive-scsi0-0-0-0", "sync": "none", "target": "snapshot-scsi0-0-0-0" }, }, { "type": "block-dirty-bitmap-add" "data": { "name": "libvirt-d068765e-8b50-4d74-9b72-1e55c663cbf8", "node": "drive-scsi0-0-0-0", "persistent": true }, } So a checkpoint creates a reference point, but NOT a backup. You are manually creating checkpoint instances.
In this case, though, you haven't disabled the previous checkpoint's bitmap (if any?) atomically with the creation of this one...
In the example this is first snapshot so there is no previous checkpoint and thus nothing to disable.
Here, the transaction makes sense; you have to create the persistent dirty bitmap to track from the same point in time. The dirty bitmap is tied to the active image, not the backup, so that when you create the NEXT incremental backup, you have an accurate record of which sectors were touched in snapshot-scsi0-0-0-0 between this transaction and the next.
    ]
  }
}
- delete snapshot d068765e-8b50-4d74-9b72-1e55c663cbf8
  - same as without checkpoints
- create snapshot 0044757e-1a2d-4c2c-b92f-bb403309bb17 with checkpoint
  - same actions as for the first snapshot, but additionally disable the first bitmap
Again, you're showing the QMP commands that libvirt is issuing; which libvirt API calls are driving these actions?
... { "execute": "transaction" "arguments": { "actions": [ { "type": "blockdev-backup" "data": { "device": "drive-scsi0-0-0-0", "sync": "none", "target": "snapshot-scsi0-0-0-0" }, }, { "type": "x-vz-block-dirty-bitmap-disable" "data": {
Do you have measurements on whether having multiple active bitmaps hurts performance? I'm not yet sure that managing a chain of disabled bitmaps (and merging them as needed for restores) is more or less efficient than managing multiple bitmaps all the time. On the other hand, you do have a point that restore is a less frequent operation than backup, so making backup as lean as possible and putting more work on restore is a reasonable tradeoff, even if it adds complexity to the management for doing restores.
Depending on the number of checkpoints intended to be kept... we certainly make no real promises on the efficiency of marking so many. It's at *least* a linear increase with each checkpoint...
"name": "libvirt-d068765e-8b50-4d74-9b72-1e55c663cbf8", "node": "drive-scsi0-0-0-0" }, }, { "type": "block-dirty-bitmap-add" "data": { "name": "libvirt-0044757e-1a2d-4c2c-b92f-bb403309bb17^d068765e-8b50-4d74-9b72-1e55c663cbf8", "node": "drive-scsi0-0-0-0", "persistent": true }, } ] }, }
Oh, I see, you handle the "disable old" case here.
- delete snapshot 0044757e-1a2d-4c2c-b92f-bb403309bb17
- create snapshot 8fc02db3-166f-4de7-b7aa-1f7303e6162b with checkpoint
- add disk snapshot 8fc02db3-166f-4de7-b7aa-1f7303e6162b to export, with the bitmap of changes from checkpoint d068765e-8b50-4d74-9b72-1e55c663cbf8
  - same as adding an export without a checkpoint, but additionally:
    - form the result bitmap
    - add the bitmap to the NBD export
... { "execute": "transaction" "arguments": { "actions": [ { "type": "block-dirty-bitmap-add" "data": { "node": "drive-scsi0-0-0-0", "name": "libvirt-__export_temporary__", "persistent": false }, }, { "type": "x-vz-block-dirty-bitmap-disable" "data": { "node": "drive-scsi0-0-0-0" "name": "libvirt-__export_temporary__", }, }, { "type": "x-vz-block-dirty-bitmap-merge" "data": { "node": "drive-scsi0-0-0-0", "src_name": "libvirt-d068765e-8b50-4d74-9b72-1e55c663cbf8" "dst_name": "libvirt-__export_temporary__", }, }, { "type": "x-vz-block-dirty-bitmap-merge" "data": { "node": "drive-scsi0-0-0-0", "src_name": "libvirt-0044757e-1a2d-4c2c-b92f-bb403309bb17^d068765e-8b50-4d74-9b72-1e55c663cbf# "dst_name": "libvirt-__export_temporary__", }, } ] }, } OK, so in this transaction you add a new temporary bitmap for export, and merge the contents of two bitmaps into it.
However, it doesn't look like you created a new checkpoint and managed that handoff here, did you?
We don't need to create checkpoints for the purpose of exporting, only a temporary bitmap to merge the appropriate bitmap chain into.
But in this case we would export something strange. Actually, we can only export data changed from some checkpoint up to a newly created one, if we start fleecing in the same transaction as creating the new checkpoint. And we can't create a backup from one checkpoint in the past to another checkpoint in the past (because the corresponding data may already have changed after the second checkpoint).
{ "execute": "x-vz-nbd-server-add-bitmap" "arguments": { "name": "sda-8fc02db3-166f-4de7-b7aa-1f7303e6162b" "bitmap": "libvirt-__export_temporary__", "bitmap-export-name": "d068765e-8b50-4d74-9b72-1e55c663cbf8", },
And then here, once the bitmap and the data is already frozen, it's actually alright if we add the export at a later point in time.
Adding a bitmap to a server would advertise to the NBD client that it can query the "qemu-dirty-bitmap:d068765e-8b50-4d74-9b72-1e55c663cbf8" namespace during NBD_CMD_BLOCK_STATUS, rather than just "base:allocation"?
Don't know much about this, I stopped paying attention to the BLOCK STATUS patches. Is the NBD spec the best way to find out the current state right now?
I'm afraid, yes, branch extension-blockstatus https://github.com/NetworkBlockDevice/nbd/blob/extension-blockstatus/doc/pro.... You can search for "Metadata querying" paragraph.
(Is there a less technical, briefer overview somewhere, perhaps from a commit message or a cover letter?)
}
- remove snapshot 8fc02db3-166f-4de7-b7aa-1f7303e6162b from export
  - same as without a checkpoint, but additionally remove the temporary bitmap
... { "arguments": { "name": "libvirt-__export_temporary__", "node": "drive-scsi0-0-0-0" }, "execute": "block-dirty-bitmap-remove" }
OK, this just deletes the checkpoint. I guess we delete the node and
I would not call it a checkpoint. A checkpoint is something visible to the client: an ability to get CBT from that point in time.
Here we create a temporary bitmap to calculate desired CBT.
stop the NBD server too, right? Yeah, just like in the case without a checkpoint (mentioned in that case's description).
- delete checkpoint 0044757e-1a2d-4c2c-b92f-bb403309bb17
  (a similar operation is described in the section about the naming scheme for bitmaps, with the difference that K+1 is N here and thus the new bitmap should not be disabled)
A suggestion on the examples - while UUIDs are nice and handy for management tools, they are a pain to type and for humans to quickly read. Is there any way we can document a sample transaction stream with all the actors involved (someone issues a libvirt API call XYZ, libvirt in turn issues QMP command ABC), and using shorter names that are easier to read as humans?
Yeah, A-B-C-D terminology would be nice for the examples. It's fine if the actual implementation uses UUIDs.
{ "arguments": { "actions": [ { "type": "block-dirty-bitmap-add" "data": { "node": "drive-scsi0-0-0-0", "name": "libvirt-8fc02db3-166f-4de7-b7aa-1f7303e6162b^d068765e-8b50-4d74-9b72-1e55c663cbf8", "persistent": true }, }, { "type": "x-vz-block-dirty-bitmap-merge" "data": { "node": "drive-scsi0-0-0-0", "src_name": "libvirt-0044757e-1a2d-4c2c-b92f-bb403309bb17^d068765e-8b50-4d74-9b72-1e55c663cbf# "dst_name": "libvirt-d068765e-8b50-4d74-9b72-1e55c663cbf8", }, }, { "type": "x-vz-block-dirty-bitmap-merge" "data": { "node": "drive-scsi0-0-0-0", "src_name": "libvirt-8fc02db3-166f-4de7-b7aa-1f7303e6162b^0044757e-1a2d-4c2c-b92f-bb403309bb1# "dst_name": "libvirt-8fc02db3-166f-4de7-b7aa-1f7303e6162b^d068765e-8b50-4d74-9b72-1e55c663cbf# }, }, ] }, "execute": "transaction" } { "execute": "x-vz-block-dirty-bitmap-remove" "arguments": { "node": "drive-scsi0-0-0-0" "name": "libvirt-8fc02db3-166f-4de7-b7aa-1f7303e6162b^0044757e-1a2d-4c2c-b92f-bb403309bb17", }, }, { "execute": "x-vz-block-dirty-bitmap-remove" "arguments": { "node": "drive-scsi0-0-0-0" "name": "libvirt-0044757e-1a2d-4c2c-b92f-bb403309bb17^d068765e-8b50-4d74-9b72-1e55c663cbf8", }, }
Here is a list of bitmap commands used in the implementation but not yet upstream (AFAIK).
x-vz-block-dirty-bitmap-remove
We already have this, right? It doesn't even need to be transactionable.
x-vz-block-dirty-bitmap-merge
You need this...
x-vz-block-dirty-bitmap-disable
And this we had originally but since removed; it can be re-added trivially.
x-vz-block-dirty-bitmap-enable (not in the examples; used when removing the most recent checkpoint)
x-vz-nbd-server-add-bitmap
Do my comments make sense? Am I understanding you right so far? I'll try to offer a competing writeup to make sure we're on the same page with your proposed design before I waste any time trying to critique it -- in case I'm misunderstanding you. Yes, looks like we are in tune.
Thank you for leading the charge and proposing new APIs for this feature. It will be very nice to expose the incremental backup functionality we've been working on in QEMU to users of libvirt.
--js
There are also patches too (if the API design survives the review phase at least partially :) )
*Restore operation nuances*
As written above, to restore a domain one needs to start it in a paused state, export the domain's disks and write them from the backup. However qemu currently does not let us export disks for write even for a domain that never starts guest CPUs. We have an experimental qemu command line option -x-vz-nbd-restore (passed together with the -incoming option) to fix it. Why can't restore be done while the guest is offline? (Oh right, we still haven't added decent qemu-img support for bitmap manipulation, so we need a qemu process around for any bitmap changes).
I'm working on this right now, actually!
I'm working on JSON format output for bitmap querying, and simple clear/delete commands. I hope to send this out very soon.
As I understand it, the point of bitmaps and snapshots is to create an NBD server that a third-party can use to read just the dirty portions of a disk in relation to a known checkpoint, to save off data in whatever form it wants; so you are right that the third party then needs a way to rewrite data from whatever internal form it stored it in back to the view that qemu can consume when rolling back to a given backup, prior to starting the guest on the restored data. Do you need additional libvirt APIs exposed for this, or do the proposed APIs for adding snapshots cover everything already with just an additional flag parameter that says whether the <domainblocksnapshot> is readonly (the third-party is using it for collecting the incremental backup data) or writable (the third-party is actively writing its backup into the file, and when it is done, then perform a block-commit to merge that data back onto the main qcow2 file)?
-- Best regards, Vladimir

On 04/12/2018 10:08 AM, Vladimir Sementsov-Ogievskiy wrote:
I propose, not to say that bitmap represents a checkpoint. It is simpler to say (and it reflects the reality) that bitmap is a difference between two consecutive checkpoints. And we can say, that active state is some kind of a checkpoint, current point in time.
So, we have checkpoints (5* is an active state) which are points in time:
1 2 3 4 5*
Oh -- the most recent checkpoint there doesn't belong to a ***specific time*** yet. It's a floating checkpoint -- it always represents the most current version. It's not really a checkpoint at all. 1, 2, 3, and 4 however are associated with a specific timestamp though.
And bitmaps, first three are disabled, last is enabled:
"1->2", "2->3", "3->4", "4->5*"
OK; so 1->2, 2->3 and 3->4 define deltas between two ***defined*** points in time. 4->5* however is only anchored by one specific point in time, and is floating just like the most recent checkpoint is floating.
So, remove first checkpoint: just remove bitmap "A->B".
I assume you mean "1->2" here. And... yes, I agree -- if you don't care about your very first checkpoint anymore, you can just delete the first bitmap, too.
Remove any other checkpoint N: create new bitmap "(N-1)->(N+1)" = merge("(N-1)->N", "N->(N+1)"), drop bitmaps "(N-1)->N" and "N->(N+1)".
err, okay, so let's say we want to drop checkpoint 3:
create: "2->4"
merge: "2->3", "3->4" [and presumably store in "2->4"]
drop: 2->3, 3->4
OK, that makes more sense to me. In essence:
(1) We could consider this 2->3 absorbing 3->4, or
(2) 3->4 absorbing 2->3
and in either case it's the same, really.
If the latter was active, the new one becomes active. And we cant remove 5* checkpoint, as it is an active state, not an actual checkpoint.
OK, crystal. --js

On 13.04.2018 03:04, John Snow wrote:
I prefer not talking of an active checkpoint. It is kind of controversial. Better to just keep in mind that the last bitmap is the active one. So for checkpoints 1 2 3 4 we have bitmaps:
1 1->2 2->3 3->4
Note the first bitmap name. When it was created, name 2 was unknown, so we'd better have this name for the first bitmap. Checkpoint 4 cannot be used without checkpoint 5 by design, so it is not a problem that 3->4 is active.
Nikolay

13.04.2018 11:51, Nikolay Shirokovskiy wrote:
so here, 1->2 is a difference between checkpoints 2 and 3? I think it's unnatural.. Of course, when we create a new active bitmap we don't know the name of the next checkpoint, but why not rename it when we create the next checkpoint? So,
1. have no checkpoints and bitmaps
2. create new checkpoint 1, and bitmap 1->?
3. create new checkpoint 2 and bitmap 2->?, disable bitmap 1->? and rename it to 1->2
and so on. This reflects the essence of things.
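A tiny Python sketch of this proposed flow (illustration only; no real qemu bitmap-rename command is implied here): the active bitmap is named "N->?" and is renamed to "N->M" and disabled when the next checkpoint M is created.

def create_checkpoint(state, name):
    # rename and disable the previously active bitmap, if any
    if state["active"] is not None:
        prev = state["active"]
        state["bitmaps"][prev + "->" + name] = state["bitmaps"].pop(prev + "->?")
        state["disabled"].add(prev + "->" + name)
    state["bitmaps"][name + "->?"] = set()   # new active bitmap
    state["active"] = name

s = {"bitmaps": {}, "disabled": set(), "active": None}
create_checkpoint(s, "1")
create_checkpoint(s, "2")
print(sorted(s["bitmaps"]), s["disabled"])   # ['1->2', '2->?'] {'1->2'}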
-- Best regards, Vladimir

On 13.04.2018 14:41, Vladimir Sementsov-Ogievskiy wrote:
13.04.2018 11:51, Nikolay Shirokovskiy wrote:
On 13.04.2018 03:04, John Snow wrote:
On 04/12/2018 10:08 AM, Vladimir Sementsov-Ogievskiy wrote:
I propose, not to say that bitmap represents a checkpoint. It is simpler to say (and it reflects the reality) that bitmap is a difference between two consecutive checkpoints. And we can say, that active state is some kind of a checkpoint, current point in time.
So, we have checkpoints (5* is an active state) which are points in time:
1 2 3 4 5*
Oh -- the most recent checkpoint there doesn't belong to a ***specific time*** yet. It's a floating checkpoint -- it always represents the most current version. It's not really a checkpoint at all.
1, 2, 3, and 4 however are associated with a specific timestamp though.
And bitmaps, first three are disabled, last is enabled:
"1->2", "2->3", "3->4", "4->5*"
OK; so 1->2, 2->3 and 3->4 define deltas between two ***defined*** points in time.
4->5* however is only anchored by one specific point in time, and is floating just like the most recent checkpoint is floating.
So, remove first checkpoint: just remove bitmap "A->B". I assume you mean "1->2" here.
And... yes, I agree -- if you don't care about your very first checkpoint anymore, you can just delete the first bitmap, too.
Remove any other checkpoint N: create new bitmap "(N-1)->(N+1)" = merge("(N-1)->N", "N->(N+1)"), drop bitmaps "(N-1)->N" and "N->(N+1)". err, okay, so let's say we want to drop checkpoint 3:
create: "2->4" merge: "2->3", "3->4" [and presumably store in "2->4"] drop: 2->3, 3->4
OK, that makes more sense to me. In essence;
(1) We could consider this 2->3 absorbing 3->4, or (2) 3->4 absorbing 2->3
and in either case it's the same, really.
If the latter was active, the new one becomes active. And we cant remove 5* checkpoint, as it is an active state, not an actual checkpoint. OK, crystal.
--js
I prefer not talking of active checkpoint. It is a kind of controversial. Better just keep in mind that last bitmap is active one. So for checkpoints 1 2 3 4 we have bitmaps:
1 1->2 2->3 3->4
Note the first bitmap name. When it was created name 2 was unknown so we'd better have this name for the first bitmap.
so here, 1->2 is a difference between checkpoints 2 and 3? I think it's unnatural.. Ofcource, when we create new active bitmap, we don't know the name of next checkpoint, but, why not rename it when we create next checkpoint?
So,
1. have no checkpoints and bitmaps 2. create new checkpoint 1, and bitmap 1->? 3. create new checkpoint 2 and bitmap 2->?, disable bitmap 1->? and rename it to 1->2 and so on.
this reflects the essence of things
Makes sense and I don't see any issue from the implementation POV. I would just use > or >> (or however many >) instead of ->. This makes it possible to restrict names by prohibiting only the > character; - is typical in UUIDs.

13.04.2018 18:05, Nikolay Shirokovskiy wrote:
On 13.04.2018 14:41, Vladimir Sementsov-Ogievskiy wrote:
13.04.2018 11:51, Nikolay Shirokovskiy wrote:
On 13.04.2018 03:04, John Snow wrote:
On 04/12/2018 10:08 AM, Vladimir Sementsov-Ogievskiy wrote:
I propose, not to say that bitmap represents a checkpoint. It is simpler to say (and it reflects the reality) that bitmap is a difference between two consecutive checkpoints. And we can say, that active state is some kind of a checkpoint, current point in time.
So, we have checkpoints (5* is an active state) which are points in time:
1 2 3 4 5*
Oh -- the most recent checkpoint there doesn't belong to a ***specific time*** yet. It's a floating checkpoint -- it always represents the most current version. It's not really a checkpoint at all.
1, 2, 3, and 4 however are associated with a specific timestamp though.
And bitmaps, first three are disabled, last is enabled:
"1->2", "2->3", "3->4", "4->5*"
OK; so 1->2, 2->3 and 3->4 define deltas between two ***defined*** points in time.
4->5* however is only anchored by one specific point in time, and is floating just like the most recent checkpoint is floating.
So, remove first checkpoint: just remove bitmap "A->B". I assume you mean "1->2" here.
And... yes, I agree -- if you don't care about your very first checkpoint anymore, you can just delete the first bitmap, too.
Remove any other checkpoint N: create new bitmap "(N-1)->(N+1)" = merge("(N-1)->N", "N->(N+1)"), drop bitmaps "(N-1)->N" and "N->(N+1)". err, okay, so let's say we want to drop checkpoint 3:
create: "2->4" merge: "2->3", "3->4" [and presumably store in "2->4"] drop: 2->3, 3->4
OK, that makes more sense to me. In essence;
(1) We could consider this 2->3 absorbing 3->4, or (2) 3->4 absorbing 2->3
and in either case it's the same, really.
If the latter was active, the new one becomes active. And we cant remove 5* checkpoint, as it is an active state, not an actual checkpoint. OK, crystal.
--js
I prefer not talking of active checkpoint. It is a kind of controversial. Better just keep in mind that last bitmap is active one. So for checkpoints 1 2 3 4 we have bitmaps:
1 1->2 2->3 3->4
Note the first bitmap name. When it was created name 2 was unknown so we'd better have this name for the first bitmap. so here, 1->2 is a difference between checkpoints 2 and 3? I think it's unnatural.. Ofcource, when we create new active bitmap, we don't know the name of next checkpoint, but, why not rename it when we create next checkpoint?
So,
1. have no checkpoints and bitmaps 2. create new checkpoint 1, and bitmap 1->? 3. create new checkpoint 2 and bitmap 2->?, disable bitmap 1->? and rename it to 1->2 and so on.
this reflects the essence of things Makes sense and I don't see any issue from implementation POV. I would only use > only or >> (or whatever times >) instead of ->. This makes possible to restrict names to prohibit > only. - is typical for UUIDs.
in this case, I think just > is ok. Fewer symbols - less electricity/paper/time overhead) And more, this does not look like a hack (may be a bit=) Why not call the bitmap representing the difference between snapshots A and B: A>B?
-- Best regards, Vladimir

On 04/12/2018 10:08 AM, Vladimir Sementsov-Ogievskiy wrote:
It's not easier, as we'll have to implement either separate of bitmaps concept of checkpoints, which will be based on bitmaps, and we'll have to negotiate and implement storing these objects to qcow2 and migrate them. Or we'll go through proposed by Kevin (If I remember correctly) way of adding something like "backing" or "parent" pointer to BdrvDirtyBitmap, and anyway store to qcow2, migrate and expose qapi for them. The other (more hard way) is move to multi-bit bitmaps (like in vmware), where for each granularity-chunk we store a number, representing "before which checkpoint was the latest change of this chunk", and the same, qapi+qcow2+migration.
It's all not easier than call several simple qmp commands.
OK, I just wanted to explore the option before we settled on using the name as metadata.
What are the downsides to actually including a predecessor/successor* pointer in QEMU?
(1) We'd need to amend the bitmap persistence format
(2) We'd need to amend some of the bitmap management commands
(3) We'd need to make sure it migrates correctly:
    (A) Shared storage should be fine; just flush to disk and pivot
    (B) Live storage needs to learn a new field to migrate.
Certainly it's not ...trivial, but not terribly difficult either. I wonder if it's the right thing to do in lieu of the naming hacks in libvirt.
There wasn't really a chorus of applause for the idea of having checkpoints more officially implemented in QEMU, but... abusing the name metadata still makes me feel like we're doing something wrong -- especially if a third party utility that doesn't understand the concept of your naming scheme comes along and modifies a bitmap.
It feels tenuous and likely to break, so I'd like to formalize it more. We can move this discussion over to the QEMU lists if you think it's worth talking about.
Or I'll just roll with it. I'll see what Eric thinks, I guess? :)
*(Uh-oh, that term is overloaded for QEMU bitmap internals... we can address that later...)

13.04.2018 23:02, John Snow wrote:
It's all not easier than call several simple qmp commands. OK, I just wanted to explore the option before we settled on using the name as metadata.
What are the downsides to actually including a predecessor/successor* pointer in QEMU?
the problem is the following: we want checkpoints, and it is bad to implement them through an additional mechanism which is definitely not "checkpoints".
With checkpoints in qemu:
- for the user, checkpoints are unrelated to dirty bitmaps, they are managed separately, it's safe
- clean api realisation in libvirt; to remove a checkpoint libvirt will call qmp block-checkpoint-remove
With an additional pointer in BdrvDirtyBitmap:
- checkpoint-related bitmaps share the same namespace with other bitmaps
- the user can still remove a bitmap from the chain without the corresponding merge, which breaks the whole thing
- we'll need to implement an api like for the block layer: bitmap-commit, bitmap-pull, etc. (or just leave my merge), but it's all not what we want, not checkpoints.
So my point is: if we are going to implement something complicated, let's implement entirely what we want, not a semi-solution. Otherwise, implement a minimal and simple thing, to just make it all work (my current solution).
So, if you agree with me that the true way is a checkpoints api for qemu, these are the things we need to implement:
1. multi-bit dirty bitmaps (if I remember correctly, such a thing is done in vmware), that is: for each data chunk we have not one bit but several, and store a number which refers to the last checkpoint after which there were changes in this area. So, if we want to get blocks changed from snapshot N up to the current time, we just take all blocks for which this number is >= N. It is more memory-efficient than storing several dirty bitmaps. On the other hand, several linked dirty bitmaps have the other advantage: we can keep in RAM only the last, active one, and the others on disk, and load them only on demand when we need to merge. So, it looks like the true way is a combination of multi- and one-bit dirty bitmaps, with the ability to load/store them to disk dynamically. So we need
2. Some link(s) in BdrvDirtyBitmap to implement relations. Maybe it's better to store two links to the checkpoints for which the bitmap defines the difference (or to several checkpoints, if it is a multi-bit bitmap).
3. Checkpoint objects, with separate management, backed by dirty bitmaps (one- or multi-bit). These bitmaps should not be directly accessible by the user, but the user should have a possibility to set up a strategy (one- or multi-bit, or their combinations; keep all in RAM, or keep inactive ones on disk and the active one in RAM, etc).
4. All these things should be stored to qcow2, all should successfully migrate, and we also need to think about NBD exporting (however, it looks like the NBD protocol is flexible enough to do it).
===
Also, we need to understand what the use cases for all this are.
case 1. Incremental restore to some point in the past: if we know which blocks are modified since this point, we can copy only these blocks from the backup. But it's obvious that this information can be extracted from the backup itself (we should know which blocks were actually backed up). So, I'm not sure that this is all worth doing.
case 2. Several inc-backup chains to different backup storages with different timesheets. Actually, we support it by just several active dirty bitmaps. But it looks inefficient: what is the reason to maintain several active dirty bitmaps which are used seldom? They eat RAM and CPU time on each write. It looks better to have only one active bitmap and several disabled ones, which we can store on disk, not in RAM. And this leads us to checkpoints.. Checkpoints are more natural for users to make backups than dirty bitmaps.
And checkpoints give a way to improve RAM and CPU usage.
As a first step the following may be done:
- Add two string fields to BdrvDirtyBitmap: checkpoint-from and checkpoint-to, which define checkpoint names. For such bitmaps the name field should be zero.
- Add these fields to the qcow2 bitmap representation and to the migration protocol.
- Add a checkpoint api (create/remove/nbd export).
- Deprecate the bitmap api (move to checkpoints for drive- and blockdev-backup commands).
We can add a "parent" or similar pointer to BdrvDirtyBitmap, but it should be only an implementation detail, not a user-seen thing.
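For illustration, here is a minimal Python sketch of the multi-bit idea from point 1 above (not qemu code): each chunk stores the number of the last checkpoint after which it was written, and "changed since checkpoint N" becomes a simple comparison.

# Per-chunk we store the number of the last checkpoint after which the chunk
# was written; 0 means the chunk was never written (checkpoints start at 1).
def mark_write(chunk_map, chunk, current_checkpoint):
    chunk_map[chunk] = current_checkpoint

def changed_since(chunk_map, checkpoint_n):
    return [c for c, last in enumerate(chunk_map) if last >= checkpoint_n]

chunks = [0] * 8                  # 8 data chunks, nothing written yet
mark_write(chunks, 1, 1)          # written after checkpoint 1
mark_write(chunks, 4, 2)          # written after checkpoint 2
mark_write(chunks, 6, 3)          # written after checkpoint 3
print(changed_since(chunks, 2))   # [4, 6]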
-- Best regards, Vladimir

On 04/16/2018 06:20 AM, Vladimir Sementsov-Ogievskiy wrote:
So my point is: if we are going to implement something complicated, let's implement entirely what we want, not a semi-solution. Otherwise, implement a minimal and simple thing, to just make it all work (my current solution).
So basically:
(1) Using bitmap names: It's a hack, but it works; and
(2) Adding parentage information to QEMU bitmaps is also a hack, but a more permanent commitment to the hack.
And further, both (1) and (2) leave the same problem that if a third party utility deletes the bitmap, they are checkpoint-unaware and will ruin the metadata.
(Though QEMU could be taught to disallow the deleting of bitmaps with parents/children, unless you specify --force or --mergeleft or --mergeright or some such. That's not an option with the name-as-metadata strategy.)
Why is option 3 unworkable, exactly?:
(3) Checkpoints exist as structures only with libvirt. They are saved and remembered in the XML entirely.
Or put another way:
Can you explain to me why it's important for libvirt to be able to reconstruct checkpoint information from a qcow2 file?
--js

On 19.04.2018 20:28, John Snow wrote:
In short, it takes extra effort for the metadata to stay consistent when libvirtd crashes occur. See a more detailed explanation in [1], starting from the words "Yes it is possible".
[1] https://www.redhat.com/archives/libvir-list/2018-April/msg01001.html
Nikolay

On 04/20/2018 08:22 AM, Nikolay Shirokovskiy wrote:
OK; I can't speak to the XML design (I'll leave that to Eric and other libvirt engineers) but the data consistency issues make sense.
ATM I am concerned that by shifting the snapshots into bitmap names you still leave yourself open to data corruption if these bitmaps are modified outside of libvirt -- those third party tools can't possibly understand the schema that they were created under.
(Though I suppose very simply that if a bitmap is missing you'd be able to detect that in libvirt and signal an error, but it's not very nice.)
I'll pick up discussion with Eric and Vladimir in the other portion of this thread where we're discussing a checkpoints API and we'll pick this up on the QEMU list if need be.
Thank you,
--John

On 04/20/2018 01:24 PM, John Snow wrote:
In short it take extra effort for metadata to be consistent when libvirtd crashes occurs. See for more detailed explanation in [1] starting from words "Yes it is possible".
[1] https://www.redhat.com/archives/libvir-list/2018-April/msg01001.html
I'd argue the converse. Libvirt already knows how to do atomic updates of XML files that it tracks. If libvirtd crashes/restarts in the middle of an API call, you already have indeterminate results of whether the API worked or failed; once libvirtd is restarted, you'll have to probably retry the command. For all other cases, the API call completes, and either no XML changes were made (the command failed and reports the failure properly), or all XML changes were made (the command created the appropriate changes to track the new checkpoint, including whatever bitmap names have to be recorded to map the relation between checkpoints and bitmaps).
Consider the case of internal snapshots. Already, we have the case where qemu itself does not track enough useful metadata about internal snapshots (right now, just a name and timestamp of creation); so libvirt additionally tracks further information in <domainsnapshot>: the name, timestamp, relationship to any previous snapshot (libvirt can then reconstruct a tree relationship between all snapshots; where a parent can have more than one child if you roll back to a snapshot and then execute the guest differently), the set of disks participating in the snapshot, and the <domain> description at the time of the snapshot (if you hotplug devices, or even the fact that creating external snapshots changes which file is the active qcow2 in a backing chain, you'll need to know how to roll back to the prior domain state as part of reverting). This is approximately the same set of information that a <domaincheckpoint> will need to track.
I'm slightly tempted to just overload <domainsnapshot> to track three modes instead of two (internal, external, and now checkpoint); but think that will probably be a bit too confusing, so more likely I will create <domaincheckpoint> as a new object, but copy a lot of coding paradigms from <domainsnapshot>.
So, from that point of view, libvirt tracking the relationship between qcow2 bitmaps in order to form checkpoint information can be done ALL with libvirt, and without NEEDING the qcow2 file to track any relations between bitmaps. BUT, libvirt's job can probably be made easier if qcow2 would, at the least, allow bitmaps to track their parent, and/or provide APIs to easily merge a parent..intermediate..child chain of related bitmaps into a single bitmap, for easy runtime creation of the temporary bitmap used to express the delta between two checkpoints.
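To illustrate the kind of bookkeeping libvirt could keep entirely on its own side, here is a conceptual Python sketch; the record fields and names are invented for illustration and are not the actual <domaincheckpoint> design.

# Conceptual sketch of per-domain checkpoint metadata kept by the management
# layer: each checkpoint records its parent and the qcow2 bitmap names backing it.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Checkpoint:
    name: str                    # checkpoint name, e.g. a UUID
    created: str                 # creation timestamp
    parent: Optional[str]        # previous checkpoint, None for the first one
    bitmaps: List[str] = field(default_factory=list)   # one bitmap name per disk

chain = [
    Checkpoint("ckpt-1", "2018-04-20T12:00:00", None, ["libvirt-ckpt-1"]),
    Checkpoint("ckpt-2", "2018-04-21T12:00:00", "ckpt-1", ["libvirt-ckpt-2"]),
]
# bitmaps that would have to be merged for "changes since ckpt-1"
print([b for c in chain for b in c.bitmaps])   # ['libvirt-ckpt-1', 'libvirt-ckpt-2']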
OK; I can't speak to the XML design (I'll leave that to Eric and other libvirt engineers) but the data consistency issues make sense.
And I'm still trying to figure out exactly what is needed, to capture everything needed to create checkpoints and take backups (both push and pull model). Reverting to data from an external backup may be a bit more manual, at least at first (after all, we STILL don't have decent libvirt support for rolling back to external snapshots, several years later). In other words, my focus right now is "how can we safely track checkpoints for capturing of point-in-time incremental backups with minimal guest downtime", rather than "given an incremental backup captured previously, how do we roll a guest back to that point in time".
ATM I am concerned that by shifting the snapshots into bitmap names that you still leave yourself open for data corruption if these bitmaps are modified outside of libvirt -- these third party tools can't possibly understand the schema that they were created under.
(Though I suppose very simply that if a bitmap is missing you'd be able to detect that in libvirt and signal an error, but it's not very nice.)
Well, we also have to realize that third-party tools shouldn't really be mucking around with bitmaps they don't understand. If you are going to manipulate a qcow2 file that contains persistent bitmaps, you should not delete a bitmap you did not create; and if the bitmap is autoloaded, you must obey the rules and amend the bitmap for any guest-visible changes you make during your data edits. Just like a third-party tool shouldn't really be deleting internal snapshots it didn't create. I don't think we have to worry as much about being robust to what a third party tool would do behind our backs (after all, the point of the pull model backups is so that third-party tools can track the backup in the format THEY choose, after reading the dirty bitmap and data over NBD, rather than having to learn qcow2).
I'll pick up discussion with Eric and Vladimir in the other portion of this thread where we're discussing a checkpoints API and we'll pick this up on QEMU list if need be.
Yes, between this thread, and some IRC chats I've had with John in the meantime, it looks like we DO want some improvements on the qcow2 side of things on the qemu list.
Other things that I need to capture from IRC:
Right now, it sounds like the incremental backup model (whether push or pull) is heavily dependent on qcow2 files for persistent bitmaps. While libvirt can perform external snapshots by creating a qcow2 wrapper around any file type, and live commit can then merge that qcow2 file back into the original file, libvirt is already insistent that internal snapshots can only be taken if all disks are qcow2. So the same logic will apply to taking backups (whether the backup is incremental by starting from a checkpoint, or full over the complete disk contents).
Also, how should checkpoints interact with external snapshots? Suppose I have:
base <- snap1
and create a checkpoint at time T1 (which really means I create a bitmap titled B1 to track all changes that occur _after_ T1). Then later I create an external snapshot, so that now I have:
base <- snap1 <- snap2
at that point, the bitmap B1 in snap1 is no longer being modified, because snap1 is read-only. But we STILL want to track changes since T1, which means we NEED a way in qemu to not only add snap2 as a new snapshot, but ALSO to create a new bitmap B2 in snap2, that tracks all changes (until the next checkpoint, of course). Whether B2 starts life empty (and libvirt just has to remember that it must merge snap1.B1 and snap2.B2 when constructing the delta), or whether B2 starts life as a clone of the final contents of snap1.B1, is something that we need to consider in qemu. And if there is more than one bitmap on snap1, do we need to bring all of those bitmaps forward into snap2, or just the one that was currently active?
Similarly, if we later decide to live commit snap2 back into snap1, we'll want to merge the changes in snap2.B2 back into snap1.B1 (now that snap1 is once again active, it needs to track all changes that were merged in, and all future changes until the next snapshot). Which means we need to at least be thinking about cross-node snapshot merges, even if, from the libvirt perspective, checkpoints are more of a per-drive attribute rather than a per-node attribute.
-- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org

21.04.2018 00:26, Eric Blake wrote:
So, from that point of view, libvirt tracking the relationship between qcow2 bitmaps in order to form checkpoint information can be done ALL with libvirt, and without NEEDING the qcow2 file to track any relations between bitmaps. BUT, libvirt's job can probably be made easier if qcow2 would, at the least, allow bitmaps to track their parent, and/or provide APIs to easily merge a parent..intermediate..child chain of related bitmaps to be merged into a single bitmap, for easy runtime creation of the temporary bitmap used to express the delta between two checkpoints.
I don't think this is a good idea: https://www.redhat.com/archives/libvir-list/2018-April/msg01306.html In short, I think, if we do something to support checkpoints in qemu (updated BdrvDirtyBitmap, qapi, qcow2 and migration stream, new nbd meta context), we'd better implement checkpoints, than .parent relationship.
Other things that I need to capture from IRC:
Right now, it sounds like the incremental backup model (whether push or pull) is heavily dependent on qcow2 files for persistent bitmaps. While libvirt can perform external snapshots by creating a qcow2 wrapper around any file type, and live commit can then merge that qcow2 file back into the original file, libvirt is already insistent that internal snapshots can only be taken if all disks are qcow2. So the same logic will apply to taking backups (whether the backup is incremental by starting from a checkpoint, or full over the complete disk contents).
Also, how should checkpoints interact with external snapshots? Suppose I have:
base <- snap1
and create a checkpoint at time T1 (which really means I create a bitmap titled B1 to track all changes that occur _after_ T1). Then later I create an external snapshot, so that now I have:
base <- snap1 <- snap2
at that point, the bitmap B1 in snap1 is no longer being modified, because snap1 is read-only. But we STILL want to track changes since T1, which means we NEED a way in qemu to not only add snap2 as a new snapshot, but ALSO to create a new bitmap B2 in snap2, that tracks all changes (until the next checkpoint, of course). Whether B2 starts life empty (and libvirt just has to remember that it must merge snap1.B1 and snap2.B2 when constructing the delta), or whether B2 starts life as a clone of the final contents of snap1.B1, is something that we need to consider in qemu.
I'm sure that the latter is the right way, in which snapshots are actually unrelated to checkpoints: we just have a "snapshot" of the bitmap in the snapshot file. Here is an additional interesting point: it works for internal snapshots too, as bitmaps will go to the state through the migration channel (if we enable the corresponding capability, of course)
And if there is more than one bitmap on snap1, do we need to bring all of those bitmaps forward into snap2, or just the one that was currently active?
Again, I think, to make snapshots unrelated, it's better to keep them all. Let a disk snapshot be a snapshot of the dirty bitmaps too.
Similarly, if we later decide to live commit snap2 back into snap1, we'll want to merge the changes in snap2.B2 back into snap1.B1 (now that snap1 is once again active, it needs to track all changes that were merged in, and all future changes until the next snapshot).
And here we will just drop older versions of bitmaps.
Which means we need to at least be thinking about cross-node snapshot merges,
hmm, what is it?
even if, from the libvirt perspective, checkpoints are more of a per-drive attribute rather than a per-node attribute.
-- Best regards, Vladimir

On 04/23/2018 04:31 AM, Vladimir Sementsov-Ogievskiy wrote:
And if there is more than one bitmap on snap1, do we need to bring all of those bitmaps forward into snap2, or just the one that was currently active?
Again, I think, to make snapshots unrelated, it's better to keep them all. Let a disk snapshot be a snapshot of the dirty bitmaps too.
So that means creating a new external snapshot (a new qcow2 wrapper) should copy all existing bitmaps from the backing file into the new active layer?
Similarly, if we later decide to live commit snap2 back into snap1, we'll want to merge the changes in snap2.B2 back into snap1.B1 (now that snap1 is once again active, it needs to track all changes that were merged in, and all future changes until the next snapshot).
And here we will just drop older versions of bitmaps.
Which means we need to at least be thinking about cross-node snapshot merges,
hmm, what is it?
By "cross-node snapshot merge", I meant the situation where we have: base <- snap1 (containing bitmap B1) <- snap2 (containing bitmap B2) If we need to create a bitmap containing the merge of B1 and B2, whether that new bitmap B3 is stored in snap1 or in snap2, we are doing a cross-node merge (because the two source bitmaps in the merge live on different nodes of the backing chain). -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org

On 04/23/2018 05:31 AM, Vladimir Sementsov-Ogievskiy wrote:
21.04.2018 00:26, Eric Blake wrote:
On 04/20/2018 01:24 PM, John Snow wrote:
Why is option 3 unworkable, exactly?:
(3) Checkpoints exist as structures only with libvirt. They are saved and remembered in the XML entirely.
Or put another way:
Can you explain to me why it's important for libvirt to be able to reconstruct checkpoint information from a qcow2 file?
In short, it takes extra effort to keep the metadata consistent when a libvirtd crash occurs. See [1] for a more detailed explanation, starting from the words "Yes it is possible".
[1] https://www.redhat.com/archives/libvir-list/2018-April/msg01001.html
I'd argue the converse. Libvirt already knows how to do atomic updates of XML files that it tracks. If libvirtd crashes/restarts in the middle of an API call, you already have indeterminate results of whether the API worked or failed; once libvirtd is restarted, you'll have to probably retry the command. For all other cases, the API call completes, and either no XML changes were made (the command failed and reports the failure properly), or all XML changes were made (the command created the appropriate changes to track the new checkpoint, including whatever bitmap names have to be recorded to map the relation between checkpoints and bitmaps).
Consider the case of internal snapshots. Already, we have the case where qemu itself does not track enough useful metadata about internal snapshots (right now, just a name and timestamp of creation); so libvirt additionally tracks further information in <domainsnapshot>: the name, timestamp, relationship to any previous snapshot (libvirt can then reconstruct a tree relationship between all snapshots; where a parent can have more than one child if you roll back to a snapshot and then execute the guest differently), the set of disks participating in the snapshot, and the <domain> description at the time of the snapshot (if you hotplug devices, or even the fact that creating external snapshots changes which file is the active qcow2 in a backing chain, you'll need to know how to roll back to the prior domain state as part of reverting). This is approximately the same set of information that a <domaincheckpoint> will need to track.
I'm slightly tempted to just overload <domainsnapshot> to track three modes instead of two (internal, external, and now checkpoint); but think that will probably be a bit too confusing, so more likely I will create <domaincheckpoint> as a new object, but copy a lot of coding paradigms from <domainsnapshot>.
So, from that point of view, libvirt tracking the relationship between qcow2 bitmaps in order to form checkpoint information can be done ALL with libvirt, and without NEEDING the qcow2 file to track any relations between bitmaps. BUT, libvirt's job can probably be made easier if qcow2 would, at the least, allow bitmaps to track their parent, and/or provide APIs to easily merge a parent..intermediate..child chain of related bitmaps to be merged into a single bitmap, for easy runtime creation of the temporary bitmap used to express the delta between two checkpoints.
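A minimal sketch of that libvirt-side bookkeeping, assuming a hypothetical checkpoint record that simply names its parent and the bitmap covering "from this checkpoint until the next one" (the structure and names are made up, not an existing format):

checkpoints = {
    # name: parent checkpoint, plus the bitmap tracking changes from this
    # checkpoint until the next one was created
    "C1": {"parent": None, "bitmap": "B1"},
    "C2": {"parent": "C1", "bitmap": "B2"},
    "C3": {"parent": "C2", "bitmap": "B3"},   # B3 is still active
}

def bitmaps_since(checkpoint, current="C3"):
    # walk from the newest checkpoint back to the requested one, collecting
    # the bitmaps that must be merged to express "changed since checkpoint"
    names, cur = [], current
    while cur is not None:
        names.append(checkpoints[cur]["bitmap"])
        if cur == checkpoint:
            return names
        cur = checkpoints[cur]["parent"]
    raise ValueError("unknown checkpoint: %s" % checkpoint)

print(bitmaps_since("C1"))   # ['B3', 'B2', 'B1'] -> merge these three
print(bitmaps_since("C3"))   # ['B3'] -> just the active bitmap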
I don't think this is a good idea: https://www.redhat.com/archives/libvir-list/2018-April/msg01306.html
In short, I think, if we do something to support checkpoints in qemu (updated BdrvDirtyBitmap, qapi, qcow2 and migration stream, new nbd meta context), we'd better implement checkpoints, than .parent relationship.
[I'm going to answer this in response to the thread you've referenced.]
OK; I can't speak to the XML design (I'll leave that to Eric and other libvirt engineers) but the data consistency issues make sense. And I'm still trying to figure out exactly what is needed to capture everything required to create checkpoints and take backups (both push and pull model). Reverting to data from an external backup may be a bit more manual, at least at first (after all, we STILL don't have decent libvirt support for rolling back to external snapshots, several years later). In other words, my focus right now is "how can we safely track checkpoints for capturing point-in-time incremental backups with minimal guest downtime", rather than "given an incremental backup captured previously, how do we roll a guest back to that point in time".
ATM I am concerned that by shifting the snapshots into bitmap names you still leave yourself open to data corruption if these bitmaps are modified outside of libvirt -- third-party tools can't possibly understand the schema they were created under.
(Though I suppose very simply that if a bitmap is missing you'd be able to detect that in libvirt and signal an error, but it's not very nice.) Well, we also have to realize that third-party tools shouldn't really be mucking around with bitmaps they don't understand. If you are going to manipulate a qcow2 file that contains persistent bitmaps, you should not delete a bitmap you did not create; and if the bitmap is autoloaded, you must obey the rules and amend the bitmap for any guest-visible changes you make during your data edits. Just like a third-party tool shouldn't really be deleting internal snapshots it didn't create. I don't think we have to worry as much about being robust to what a third party tool would do behind our backs (after all, the point of the pull model backups is so that third-party tools can track the backup in the format THEY choose, after reading the dirty bitmap and data over NBD, rather than having to learn qcow2).
I'll pick up discussion with Eric and Vladimir in the other portion of this thread where we're discussing a checkpoints API and we'll pick this up on QEMU list if need be. Yes, between this thread, and some IRC chats I've had with John in the meantime, it looks like we DO want some improvements on the qcow2 side of things on the qemu list.
Other things that I need to capture from IRC:
Right now, it sounds like the incremental backup model (whether push or pull) is heavily dependent on qcow2 files for persistent bitmaps. While libvirt can perform external snapshots by creating a qcow2 wrapper around any file type, and live commit can then merge that qcow2 file back into the original file, libvirt is already insistent that internal snapshots can only be taken if all disks are qcow2. So the same logic will apply to taking backups (whether the backup is incremental by starting from a checkpoint, or full over the complete disk contents).
Also, how should checkpoints interact with external snapshots? Suppose I have:
base <- snap1
and create a checkpoint at time T1 (which really means I create a bitmap titled B1 to track all changes that occur _after_ T1). Then later I create an external snapshot, so that now I have:
base <- snap1 <- snap2
at that point, the bitmap B1 in snap1 is no longer being modified, because snap1 is read-only. But we STILL want to track changes since T1, which means we NEED a way in qemu to not only add snap2 as a new snapshot, but ALSO to create a new bitmap B2 in snap2, that tracks all changes (until the next checkpoint, of course). Whether B2 starts life empty (and libvirt just has to remember that it must merge snap1.B1 and snap2.B2 when constructing the delta), or whether B2 starts life as a clone of the final contents of snap1.B1, is something that we need to consider in qemu.
I'm sure the latter is the right approach, in which snapshots are actually unrelated to checkpoints. We just have a "snapshot" of the bitmap in the snapshot file.
This is roughly where I came down in terms of the "quick" way. If we copy everything up into the new active layer there's not much else to do. The existing commands and API in QEMU can just continue ignorant of what happened.
Here is an additional interesting point: it works for internal snapshots too, as bitmaps will travel with the state through the migration channel (if we enable the corresponding capability, of course)
And if there is more than one bitmap on snap1, do we need to bring all of those bitmaps forward into snap2, or just the one that was currently active?
Again, I think, to make snapshots unrelated, it's better to keep them all. Let a disk snapshot be a snapshot of the dirty bitmaps too.
Agree for the same reasons -- unless we want to complicate the bitmap mechanisms ... which I think we do not.
Similarly, if we later decide to live commit snap2 back into snap1, we'll want to merge the changes in snap2.B2 back into snap1.B1 (now that snap1 is once again active, it needs to track all changes that were merged in, and all future changes until the next snapshot).
And here we will just drop older versions of bitmaps.
I think:
- Any names that conflict: the bitmap in the backing layer is dropped.
- Any existing inactive bitmaps can stay around.
- Any existing active bitmaps will need to be updated to record the new writes that were caused by the commit.
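A rough model of that commit-time rule (bitmaps modelled as name -> {active flag, set of dirty clusters}; the modelling is purely illustrative):

def commit_bitmaps(base, top, clusters_written_by_commit):
    # fold the top layer's bitmaps back into the base layer
    result = dict(base)
    for name, bm in top.items():
        result[name] = bm              # a conflicting name in base is dropped
    for bm in result.values():
        if bm["active"]:
            # active bitmaps must record the writes the commit itself caused
            bm["dirty"] |= clusters_written_by_commit
    return result

base = {"B1": {"active": False, "dirty": {1, 2}}}
top = {"B2": {"active": True, "dirty": {3}}}
print(commit_bitmaps(base, top, {3, 4}))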
Which means we need to at least be thinking about cross-node snapshot merges,
hmm, what is it?
[Assuming Eric explained in his reply.]
even if, from the libvirt perspective, checkpoints are more of a per-drive attribute rather than a per-node attribute.

On 21.04.2018 00:26, Eric Blake wrote:
On 04/20/2018 01:24 PM, John Snow wrote:
Why is option 3 unworkable, exactly?:
(3) Checkpoints exist as structures only with libvirt. They are saved and remembered in the XML entirely.
Or put another way:
Can you explain to me why it's important for libvirt to be able to reconstruct checkpoint information from a qcow2 file?
In short, it takes extra effort to keep the metadata consistent when a libvirtd crash occurs. See [1] for a more detailed explanation, starting from the words "Yes it is possible".
[1] https://www.redhat.com/archives/libvir-list/2018-April/msg01001.html
I'd argue the converse. Libvirt already knows how to do atomic updates of XML files that it tracks. If libvirtd crashes/restarts in the middle of an API call, you already have indeterminate results of whether the API worked or failed; once libvirtd is restarted, you'll have to probably retry the command. For all other cases, the API call completes, and either no XML changes were made (the command failed and reports the failure properly), or all XML changes were made (the command created the appropriate changes to track the new checkpoint, including whatever bitmap names have to be recorded to map the relation between checkpoints and bitmaps).
We can fail to save XML... Consider we have B1, B2 and create B3 bitmap in the process of creating checkpoint C3. Next qemu creates snapshot and bitmap successfully, then libvirt fails to update XML and after some time libvirt restarts (not even crashes). Now libvirt knows of B1 and B2 but not B3. What can be the consequences? For example if we ask for the bitmap from C2 we miss all changes from C3 as we don't know of B3. This will lead to corrupted backups.
This can be fixed:
- in qemu. If bitmaps have a child/parent relationship then on libvirt restart we can recover (we ask qemu for bitmaps, discover B3 and then discover B3 is a child of B2). This is basically how the implementation with the naming scheme works. On this path we don't need special metadata in libvirt (besides maybe the domain xml attached to the checkpoint etc).
- in libvirt. If we save XML before creating a snapshot with a checkpoint, this fixes the case of a successful operation followed by an XML-saving failure. But now we have another issue :) We can save XML successfully but then the operation itself can fail and we fail to revert XML back. Well, we can recover even without child/parent metadata in qemu in this case: just ask qemu for bitmaps on libvirt restart and if a bitmap is missing kick it out, as it is the case described above (successful saving of XML then unsuccessful qemu operation).
So it is possible to track bitmaps in libvirt. We just need to be extra careful not to produce invalid backups.
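The recovery pass described above could be as simple as the following sketch: after a libvirtd restart, compare the bitmaps qemu reports (e.g. via query-block) with the bitmaps recorded in the checkpoint metadata, and handle the two possible mismatches. The decision taken for each case is illustrative, not a finished policy:

def reconcile(bitmaps_in_xml, bitmaps_in_qemu):
    # stranded: qemu created the bitmap but the checkpoint was never
    # committed to XML -> the checkpoint does not exist, clean the bitmap up
    # (e.g. merge it back into its predecessor or delete it)
    stranded = set(bitmaps_in_qemu) - set(bitmaps_in_xml)
    # missing: recorded in XML but gone from qemu -> the corresponding
    # checkpoint can no longer serve as an incremental base, drop it
    missing = set(bitmaps_in_xml) - set(bitmaps_in_qemu)
    return stranded, missing

# XML update failed after a successful qemu operation:
print(reconcile({"B1", "B2"}, {"B1", "B2", "B3"}))   # B3 is stranded
# XML was saved first, then the qemu operation failed:
print(reconcile({"B1", "B2", "B3"}, {"B1", "B2"}))   # B3 is missing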
Consider the case of internal snapshots. Already, we have the case where qemu itself does not track enough useful metadata about internal snapshots (right now, just a name and timestamp of creation); so libvirt additionally tracks further information in <domainsnapshot>: the name, timestamp, relationship to any previous snapshot (libvirt can then reconstruct a tree relationship between all snapshots; where a parent can have more than one child if you roll back to a snapshot and then execute the guest differently), the set of disks participating in the snapshot, and the <domain> description at the time of the snapshot (if you hotplug devices, or even the fact that creating external snapshots changes which file is the active qcow2 in a backing chain, you'll need to know how to roll back to the prior domain state as part of reverting). This is approximately the same set of information that a <domaincheckpoint> will need to track.
I would differentiate checkpoints and backups. For example, in the case of push backups we can store additional metadata in <domainbackup> so later we can revert back to the previous state. But checkpoints (bitmaps technically) are only to make incremental backups (restores?). We can attach extra metadata to checkpoints but it looks accidental, just because bitmaps and backups relate to the same point in time. To me a backup (push) can carry all the metadata, and as to checkpoints, a backup can have an associated checkpoint or not. For example if we choose to always make full backups we don't need checkpoints at all (at least if we are not going to use them for restore).
I'm slightly tempted to just overload <domainsnapshot> to track three modes instead of two (internal, external, and now checkpoint); but think that will probably be a bit too confusing, so more likely I will create <domaincheckpoint> as a new object, but copy a lot of coding paradigms from <domainsnapshot>.
I wonder if you are going to use a tree or a list structure for backups. To me it is much easier to think of backups just as a sequence of states in time. For example consider the Grandfather-Father-Son scheme of Acronis backups [1]. A typical backup can look like:
F - I - I - I - I - D - I - I - I - I - D
Where F is the full monthly backup, I an incremental daily backup and D a differential weekly backup (no backups on Sunday and Saturday). This is the representation from the time POV. From the backup dependencies POV it looks like this:
F - I - I - I - I   D - I - I - I - I   D
 \------------------|                   |
  \-------------------------------------|
or, in a more common representation:
F - I - I - I - I
 \- D - I - I - I - I
 \- D - I - I - I - I
To me using a tree structure for snapshots is appropriate because each branching point is some semantic state ("basic OS installed") and branches are different trials from that point. In the backup case I guess we don't want branching on recovery to some backup, we just want to keep the selected backup scheme going. So for example if we recover on Wednesday to the previous week's Friday then later on Wednesday we will have the regular Wednesday backup as if we had not been recovered. This makes things simple for the client, or he will drown in dependencies (especially after a couple of recoveries).
Of course internally we need to track backup dependencies in order to properly delete backups or recover from them.
[1] https://www.acronis.com/en-us/support/documentation/AcronisBackup_11.5/index...
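To make that dependency rule concrete, here is a tiny sketch (the schedule string and the rule encoding are only for illustration, to show why both D backups point back at the same F even though the schedule itself is a flat sequence in time):

def parent_of(kind, history):
    # history is the ordered list of (kind, id) backups already taken
    if kind == "F":
        return None                    # full backup depends on nothing
    if kind == "D":
        # differential depends on the most recent full backup
        return next(bid for k, bid in reversed(history) if k == "F")
    return history[-1][1]              # incremental depends on the previous backup

history, deps = [], {}
for i, kind in enumerate("FIIIIDIIIID"):
    deps[i] = parent_of(kind, history)
    history.append((kind, i))
print(deps)   # {0: None, 1: 0, ..., 5: 0, ..., 10: 0}: both D's depend on F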
So, from that point of view, libvirt tracking the relationship between qcow2 bitmaps in order to form checkpoint information can be done ALL with libvirt, and without NEEDING the qcow2 file to track any relations between bitmaps. BUT, libvirt's job can probably be made easier if qcow2 would, at the least, allow bitmaps to track their parent, and/or provide APIs to easily merge a parent..intermediate..child chain of related bitmaps to be merged into a single bitmap, for easy runtime creation of the temporary bitmap used to express the delta between two checkpoints.
[snip] Nikolay

On 04/23/2018 05:38 AM, Nikolay Shirokovskiy wrote:
I'd argue the converse. Libvirt already knows how to do atomic updates of XML files that it tracks. If libvirtd crashes/restarts in the middle of an API call, you already have indeterminate results of whether the API worked or failed; once libvirtd is restarted, you'll have to probably retry the command. For all other cases, the API call completes, and either no XML changes were made (the command failed and reports the failure properly), or all XML changes were made (the command created the appropriate changes to track the new checkpoint, including whatever bitmap names have to be recorded to map the relation between checkpoints and bitmaps).
We can fail to save XML... Consider we have B1, B2 and create B3 bitmap in the process of creating checkpoint C3. Next qemu creates snapshot and bitmap successfully then libvirt fails to update XML and after some time libvirt restarts (not even crashes). Now libvirt knows of B1 and B2 but not B3.
Libvirt is in charge of tracking ALL state internally that it requires to restore state properly across a libvirtd restart, so that it presents the illusion of a libvirt API atomically completing or failing. If libvirt creates bitmap B3 but does not create checkpoint C3 prior to it restarting, then on restart, it should be able to correctly see that B3 is stranded and delete it (rather, merge it back into B2 so that B2 remains the only live bitmap) as part of an incomplete API that failed.
What can be the consequences? For example if we ask bitmap from C2 we miss all changes from C3 as we don't know of B3. This will lead to corrupted backups.
Checkpoint C3 does not exist if libvirt API did not complete correctly (even if bitmap B3 exists). It should merely be a matter of libvirt making proper annotations of what it plans to do prior to calling into qemu, so that if it restarts, it can recover from an intermediate state of failure to follow those plans.
This can be fixed:
- in qemu. If bitmaps have child/parent relationship then on libvirt restart we can recover (we ask qemu for bitmaps, discover B3 and then discover B3 is child of B2). This is how basically implementation with naming scheme works. Well on this way we don't need special metadata in libvirt (besides maybe domain xml attached to checkpoint etc)
- in libvirt. If we save XML before creating a snapshot with checkpoint. This fixes the issue with successful operation but saving XML failure. But now we have another issue :) We can save XML successfully but then operation itself can fail and we fail to revert XML back. Well we can recover even without child/parent metadata in qemu in this case. Just ask qemu for bitmaps on libvirt restart and if bitmap is missing kick it out as it is a case described above (successful saving XML then unsuccessful qemu operation)
So it is possible to track bitmaps in libvirt. We just need to be extra careful not to produce invalid backups.
Yes, but that's true of any interface where a single libvirt API controls multiple steps in qemu.
Consider the case of internal snapshots. Already, we have the case where qemu itself does not track enough useful metadata about internal snapshots (right now, just a name and timestamp of creation); so libvirt additionally tracks further information in <domainsnapshot>: the name, timestamp, relationship to any previous snapshot (libvirt can then reconstruct a tree relationship between all snapshots; where a parent can have more than one child if you roll back to a snapshot and then execute the guest differently), the set of disks participating in the snapshot, and the <domain> description at the time of the snapshot (if you hotplug devices, or even the fact that creating external snapshots changes which file is the active qcow2 in a backing chain, you'll need to know how to roll back to the prior domain state as part of reverting). This is approximately the same set of information that a <domaincheckpoint> will need to track.
I would differentiate checkpoints and backups. For example, in the case of push backups we can store additional metadata in <domainbackup> so later we can revert back to the previous state. But checkpoints (bitmaps technically) are only to make incremental backups (restores?). We can attach extra metadata to checkpoints but it looks accidental, just because bitmaps and backups relate to the same point in time. To me a backup (push) can carry all the metadata, and as to checkpoints, a backup can have an associated checkpoint or not. For example if we choose to always make full backups we don't need checkpoints at all (at least if we are not going to use them for restore).
It's still nice to track the state of the <domain> XML at the time of the backup, even if you aren't using checkpoints.
I'm slightly tempted to just overload <domainsnapshot> to track three modes instead of two (internal, external, and now checkpoint); but think that will probably be a bit too confusing, so more likely I will create <domaincheckpoint> as a new object, but copy a lot of coding paradigms from <domainsnapshot>.
I wonder if you are going to use a tree or a list structure for backups. To me it is much easier to think of backups just as a sequence of states in time. For example consider the Grandfather-Father-Son scheme of Acronis backups [1]. A typical backup can look like:
F - I - I - I - I - D - I - I - I - I - D
Libvirt already tracks snapshots as a tree rather than a list; so I see no reason why checkpoints should be any different. You don't branch in the tree unless you revert to an earlier point (so the tree is often linear in practice), but just because branching isn't common doesn't mean it can't happen.
Where F is the full monthly backup, I an incremental daily backup and D a differential weekly backup (no backups on Sunday and Saturday). This is the representation from the time POV. From the backup dependencies POV it looks like this:
F - I - I - I - I   D - I - I - I - I   D
 \------------------|                   |
  \-------------------------------------|
or, in a more common representation:
F - I - I - I - I
 \- D - I - I - I - I
 \- D - I - I - I - I
To me using a tree structure for snapshots is appropriate because each branching point is some semantic state ("basic OS installed") and branches are different trials from that point. In the backup case I guess we don't want branching on recovery to some backup, we just want to keep the selected backup scheme going. So for example if we recover on Wednesday to the previous week's Friday then later on Wednesday we will have the regular Wednesday backup as if we had not been recovered. This makes things simple for the client, or he will drown in dependencies (especially after a couple of recoveries).
Of course internally we need to track backup dependencies in order to properly delete backups or recover from them.
[1] https://www.acronis.com/en-us/support/documentation/AcronisBackup_11.5/index...
So, from that point of view, libvirt tracking the relationship between qcow2 bitmaps in order to form checkpoint information can be done ALL with libvirt, and without NEEDING the qcow2 file to track any relations between bitmaps. BUT, libvirt's job can probably be made easier if qcow2 would, at the least, allow bitmaps to track their parent, and/or provide APIs to easily merge a parent..intermediate..child chain of related bitmaps to be merged into a single bitmap, for easy runtime creation of the temporary bitmap used to express the delta between two checkpoints.
[snip]
Nikolay
-- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org

On 23.04.2018 16:50, Eric Blake wrote:
On 04/23/2018 05:38 AM, Nikolay Shirokovskiy wrote:
I'd argue the converse. Libvirt already knows how to do atomic updates of XML files that it tracks. If libvirtd crashes/restarts in the middle of an API call, you already have indeterminate results of whether the API worked or failed; once libvirtd is restarted, you'll have to probably retry the command. For all other cases, the API call completes, and either no XML changes were made (the command failed and reports the failure properly), or all XML changes were made (the command created the appropriate changes to track the new checkpoint, including whatever bitmap names have to be recorded to map the relation between checkpoints and bitmaps).
We can fail to save XML... Consider we have B1, B2 and create B3 bitmap in the process of creating checkpoint C3. Next qemu creates snapshot and bitmap successfully then libvirt fails to update XML and after some time libvirt restarts (not even crashes). Now libvirt knows of B1 and B2 but not B3.
Libvirt is in charge of tracking ALL state internally that it requires to restore state properly across a libvirtd restart, so that it presents the illusion of a libvirt API atomically completing or failing. If libvirt creates bitmap B3 but does not create checkpoint C3 prior to it restarting, then on restart, it should be able to correctly see that B3 is stranded and delete it (rather, merge it back into B2 so that B2 remains the only live bitmap) as part of an incomplete API that failed.
What can be the consequences? For example if we ask bitmap from C2 we miss all changes from C3 as we don't know of B3. This will lead to corrupted backups.
Checkpoint C3 does not exist if libvirt API did not complete correctly (even if bitmap B3 exists). It should merely be a matter of libvirt making proper annotations of what it plans to do prior to calling into qemu, so that if it restarts, it can recover from an intermediate state of failure to follow those plans.
This can be fixed:
- in qemu. If bitmaps have child/parent relationship then on libvirt restart we can recover (we ask qemu for bitmaps, discover B3 and then discover B3 is child of B2). This is how basically implementation with naming scheme works. Well on this way we don't need special metadata in libvirt (besides maybe domain xml attached to checkpoint etc)
- in libvirt. If we save XML before creating a snapshot with checkpoint. This fixes the issue with successful operation but saving XML failure. But now we have another issue :) We can save XML successfully but then operation itself can fail and we fail to revert XML back. Well we can recover even without child/parent metadata in qemu in this case. Just ask qemu for bitmaps on libvirt restart and if bitmap is missing kick it out as it is a case described above (successful saving XML then unsuccessful qemu operation)
So it is possible to track bitmaps in libvirt. We just need to be extra careful not to produce invalid backups.
Yes, but that's true of any interface where a single libvirt API controls multiple steps in qemu.
Consider the case of internal snapshots. Already, we have the case where qemu itself does not track enough useful metadata about internal snapshots (right now, just a name and timestamp of creation); so libvirt additionally tracks further information in <domainsnapshot>: the name, timestamp, relationship to any previous snapshot (libvirt can then reconstruct a tree relationship between all snapshots; where a parent can have more than one child if you roll back to a snapshot and then execute the guest differently), the set of disks participating in the snapshot, and the <domain> description at the time of the snapshot (if you hotplug devices, or even the fact that creating external snapshots changes which file is the active qcow2 in a backing chain, you'll need to know how to roll back to the prior domain state as part of reverting). This is approximately the same set of information that a <domaincheckpoint> will need to track.
I would differentiate checkpoints and backups. For example, in the case of push backups we can store additional metadata in <domainbackup> so later we can revert back to the previous state. But checkpoints (bitmaps technically) are only to make incremental backups (restores?). We can attach extra metadata to checkpoints but it looks accidental, just because bitmaps and backups relate to the same point in time. To me a backup (push) can carry all the metadata, and as to checkpoints, a backup can have an associated checkpoint or not. For example if we choose to always make full backups we don't need checkpoints at all (at least if we are not going to use them for restore).
It's still nice to track the state of the <domain> XML at the time of the backup, even if you aren't using checkpoints.
I'm slightly tempted to just overload <domainsnapshot> to track three modes instead of two (internal, external, and now checkpoint); but think that will probably be a bit too confusing, so more likely I will create <domaincheckpoint> as a new object, but copy a lot of coding paradigms from <domainsnapshot>.
I wonder if you are going to use a tree or a list structure for backups. To me it is much easier to think of backups just as a sequence of states in time. For example consider the Grandfather-Father-Son scheme of Acronis backups [1]. A typical backup can look like:
F - I - I - I - I - D - I - I - I - I - D
Libvirt already tracks snapshots as a tree rather than a list; so I see no reason why checkpoints should be any different. You don't branch in the tree unless you revert to an earlier point (so the tree is often linear in practice), but just because branching isn't common doesn't mean it can't happen.
My point is that trees are not that useful for backups (and as a result for checkpoints). Let's suppose we use a tree structure for backups and the tree is based on the backing-file relationship. First, as quoted below from my previous letter, we will have trees even without restore, because for example we can have differential backups in the backup schedule. Second, if after an *incremental* restore you make a backup based on the state you restored from, and not the state dictated by the backup schedule (the previous day's state for example), you get an advantage in disk space I guess, but there are also disadvantages:
- you need to keep the old state you restored from for a longer time, while backup retention policies usually tend to keep only recent states
- backup needs to be aware of restore, so that if you restored in the morning and a backup is scheduled in the evening you know that backup should be based on the state you restored from
- the backup schedule is disrupted (the Wednesday backup should be incremental but due to restore it becomes differential, for example)
Another argument is that the checkpoint for the state you restored to can be missing, and then you will need to make a *full* restore, so there is no point in having the state you restored from as parent, as you rewrite the whole disk.
In short I believe it is much simpler to think of restore as a process unrelated to backups. So after restore you make your regular scheduled backup just as if the disk was changed by the guest. So the cause of disk changes does not matter and the backup schedule continues to create VM backups day by day.
Nikolay
Where F is the full monthly backup, I an incremental daily backup and D a differential weekly backup (no backups on Sunday and Saturday). This is the representation from the time POV. From the backup dependencies POV it looks like this:
F - I - I - I - I   D - I - I - I - I   D
 \------------------|                   |
  \-------------------------------------|
or, in a more common representation:
F - I - I - I - I
 \- D - I - I - I - I
 \- D - I - I - I - I
To me using a tree structure for snapshots is appropriate because each branching point is some semantic state ("basic OS installed") and branches are different trials from that point. In the backup case I guess we don't want branching on recovery to some backup, we just want to keep the selected backup scheme going. So for example if we recover on Wednesday to the previous week's Friday then later on Wednesday we will have the regular Wednesday backup as if we had not been recovered. This makes things simple for the client, or he will drown in dependencies (especially after a couple of recoveries).
Of course internally we need to track backup dependencies in order to properly delete backups or recover from them.
[1] https://www.acronis.com/en-us/support/documentation/AcronisBackup_11.5/index...
So, from that point of view, libvirt tracking the relationship between qcow2 bitmaps in order to form checkpoint information can be done ALL with libvirt, and without NEEDING the qcow2 file to track any relations between bitmaps. BUT, libvirt's job can probably be made easier if qcow2 would, at the least, allow bitmaps to track their parent, and/or provide APIs to easily merge a parent..intermediate..child chain of related bitmaps to be merged into a single bitmap, for easy runtime creation of the temporary bitmap used to express the delta between two checkpoints.
[snip]
Nikolay

On 04/23/2018 06:38 AM, Nikolay Shirokovskiy wrote:
On 21.04.2018 00:26, Eric Blake wrote:
On 04/20/2018 01:24 PM, John Snow wrote:
Why is option 3 unworkable, exactly?:
(3) Checkpoints exist as structures only with libvirt. They are saved and remembered in the XML entirely.
Or put another way:
Can you explain to me why it's important for libvirt to be able to reconstruct checkpoint information from a qcow2 file?
In short, it takes extra effort to keep the metadata consistent when a libvirtd crash occurs. See [1] for a more detailed explanation, starting from the words "Yes it is possible".
[1] https://www.redhat.com/archives/libvir-list/2018-April/msg01001.html
I'd argue the converse. Libvirt already knows how to do atomic updates of XML files that it tracks. If libvirtd crashes/restarts in the middle of an API call, you already have indeterminate results of whether the API worked or failed; once libvirtd is restarted, you'll have to probably retry the command. For all other cases, the API call completes, and either no XML changes were made (the command failed and reports the failure properly), or all XML changes were made (the command created the appropriate changes to track the new checkpoint, including whatever bitmap names have to be recorded to map the relation between checkpoints and bitmaps).
We can fail to save XML... Consider we have B1, B2 and create B3 bitmap in the process of creating checkpoint C3. Next qemu creates snapshot and bitmap successfully then libvirt fails to update XML and after some time libvirt restarts (not even crashes). Now libvirt knows of B1 and B2 but not B3. What can be the consequences? For example if we ask bitmap from C2 we miss all changes from C3 as we don't know of B3. This will lead to corrupted backups.
This can be fixed:
- in qemu. If bitmaps have child/parent relationship then on libvirt restart we can recover (we ask qemu for bitmaps, discover B3 and then discover B3 is child of B2). This is how basically implementation with naming scheme works. Well on this way we don't need special metadata in libvirt (besides maybe domain xml attached to checkpoint etc)
- in libvirt. If we save XML before creating a snapshot with checkpoint. This fixes the issue with successful operation but saving XML failure. But now we have another issue :) We can save XML successfully but then operation itself can fail and we fail to revert XML back. Well we can recover even without child/parent metadata in qemu in this case. Just ask qemu for bitmaps on libvirt restart and if bitmap is missing kick it out as it is a case described above (successful saving XML then unsuccessful qemu operation)
This option seems perfectly workable to me...
So it is possible to track bitmaps in libvirt. We just need to be extra careful not to produce invalid backups.
Consider the case of internal snapshots. Already, we have the case where qemu itself does not track enough useful metadata about internal snapshots (right now, just a name and timestamp of creation); so libvirt additionally tracks further information in <domainsnapshot>: the name, timestamp, relationship to any previous snapshot (libvirt can then reconstruct a tree relationship between all snapshots; where a parent can have more than one child if you roll back to a snapshot and then execute the guest differently), the set of disks participating in the snapshot, and the <domain> description at the time of the snapshot (if you hotplug devices, or even the fact that creating external snapshots changes which file is the active qcow2 in a backing chain, you'll need to know how to roll back to the prior domain state as part of reverting). This is approximately the same set of information that a <domaincheckpoint> will need to track.
I would differentiate checkpoints and backups. For example, in the case of push backups we can store additional metadata in <domainbackup> so later we can revert back to the previous state. But checkpoints (bitmaps technically) are only to make incremental backups (restores?). We can attach extra metadata to checkpoints but it looks accidental, just because bitmaps and backups relate to the same point in time. To me a backup (push) can carry all the metadata, and as to checkpoints, a backup can have an associated checkpoint or not. For example if we choose to always make full backups we don't need checkpoints at all (at least if we are not going to use them for restore).
Well ... if we create checkpoints alongside full backups, then you have points to reference to create future incremental backups. You don't need checkpoints if you *NEVER* use an incremental backup. If we want the feature enabled, so to speak, you likely need to be making checkpoints alongside full backups. I'd say the cases in which we don't want them -- once the feature is enabled -- are hard to find.
I'm slightly tempted to just overload <domainsnapshot> to track three modes instead of two (internal, external, and now checkpoint); but think that will probably be a bit too confusing, so more likely I will create <domaincheckpoint> as a new object, but copy a lot of coding paradigms from <domainsnapshot>.
I wonder if you are going to use a tree or a list structure for backups. To me it is much easier to think of backups just as a sequence of states in time. For example consider the Grandfather-Father-Son scheme of Acronis backups [1]. A typical backup can look like:
F - I - I - I - I - D - I - I - I - I - D
Where F is the full monthly backup, I an incremental daily backup and D a differential weekly backup (no backups on Sunday and Saturday). This is the representation from the time POV. From the backup dependencies POV it looks like this:
F - I - I - I - I   D - I - I - I - I   D
 \------------------|                   |
  \-------------------------------------|
or, in a more common representation:
F - I - I - I - I
 \- D - I - I - I - I
 \- D - I - I - I - I
To me using a tree structure for snapshots is appropriate because each branching point is some semantic state ("basic OS installed") and branches are different trials from that point. In the backup case I guess we don't want branching on recovery to some backup, we just want to keep the selected backup scheme going. So for example if we recover on Wednesday to the previous week's Friday then later on Wednesday we will have the regular Wednesday backup as if we had not been recovered. This makes things simple for the client, or he will drown in dependencies (especially after a couple of recoveries).
But your representation is itself a tree -- is this a good argument against hierarchical information ... ? If you don't utilize the hierarchy, the degenerate form is indeed just a list: F - I - I - I - I - I - I - I - I - I ... everything has just one successor. I think Eric just feels he can get good code re-use out of the <domainsnapshot> element -- since each <snapshot> element itself references a parent ID; there's no real "cost" to tracking a tree instead of a list. There's nothing stopping you from adding three checkpoints that have the same parent, so to speak. I think this is just something that might wind up happening "for free" due to the nature of how libvirt stores relational data at all.
Of course internally we need to track backup dependencies in order to properly delete backups or recover from them.
[1] https://www.acronis.com/en-us/support/documentation/AcronisBackup_11.5/index...
So, from that point of view, libvirt tracking the relationship between qcow2 bitmaps in order to form checkpoint information can be done ALL with libvirt, and without NEEDING the qcow2 file to track any relations between bitmaps. BUT, libvirt's job can probably be made easier if qcow2 would, at the least, allow bitmaps to track their parent, and/or provide APIs to easily merge a parent..intermediate..child chain of related bitmaps to be merged into a single bitmap, for easy runtime creation of the temporary bitmap used to express the delta between two checkpoints.
[snip]
Nikolay

On 24.04.2018 23:02, John Snow wrote:
On 04/23/2018 06:38 AM, Nikolay Shirokovskiy wrote:
On 21.04.2018 00:26, Eric Blake wrote:
On 04/20/2018 01:24 PM, John Snow wrote:
Why is option 3 unworkable, exactly?:
(3) Checkpoints exist as structures only with libvirt. They are saved and remembered in the XML entirely.
Or put another way:
Can you explain to me why it's important for libvirt to be able to reconstruct checkpoint information from a qcow2 file?
In short, it takes extra effort to keep the metadata consistent when a libvirtd crash occurs. See [1] for a more detailed explanation, starting from the words "Yes it is possible".
[1] https://www.redhat.com/archives/libvir-list/2018-April/msg01001.html
I'd argue the converse. Libvirt already knows how to do atomic updates of XML files that it tracks. If libvirtd crashes/restarts in the middle of an API call, you already have indeterminate results of whether the API worked or failed; once libvirtd is restarted, you'll have to probably retry the command. For all other cases, the API call completes, and either no XML changes were made (the command failed and reports the failure properly), or all XML changes were made (the command created the appropriate changes to track the new checkpoint, including whatever bitmap names have to be recorded to map the relation between checkpoints and bitmaps).
We can fail to save XML... Consider we have B1, B2 and create B3 bitmap in the process of creating checkpoint C3. Next qemu creates snapshot and bitmap successfully then libvirt fails to update XML and after some time libvirt restarts (not even crashes). Now libvirt knows of B1 and B2 but not B3. What can be the consequences? For example if we ask bitmap from C2 we miss all changes from C3 as we don't know of B3. This will lead to corrupted backups.
This can be fixed:
- in qemu. If bitmaps have child/parent relationship then on libvirt restart we can recover (we ask qemu for bitmaps, discover B3 and then discover B3 is child of B2). This is how basically implementation with naming scheme works. Well on this way we don't need special metadata in libvirt (besides maybe domain xml attached to checkpoint etc)
- in libvirt. If we save XML before creating a snapshot with checkpoint. This fixes the issue with successful operation but saving XML failure. But now we have another issue :) We can save XML successfully but then operation itself can fail and we fail to revert XML back. Well we can recover even without child/parent metadata in qemu in this case. Just ask qemu for bitmaps on libvirt restart and if bitmap is missing kick it out as it is a case described above (successful saving XML then unsuccessful qemu operation)
This option seems perfectly workable to me...
So it is possible to track bitmaps in libvirt. We just need to be extra careful not to produce invalid backups.
Consider the case of internal snapshots. Already, we have the case where qemu itself does not track enough useful metadata about internal snapshots (right now, just a name and timestamp of creation); so libvirt additionally tracks further information in <domainsnapshot>: the name, timestamp, relationship to any previous snapshot (libvirt can then reconstruct a tree relationship between all snapshots; where a parent can have more than one child if you roll back to a snapshot and then execute the guest differently), the set of disks participating in the snapshot, and the <domain> description at the time of the snapshot (if you hotplug devices, or even the fact that creating external snapshots changes which file is the active qcow2 in a backing chain, you'll need to know how to roll back to the prior domain state as part of reverting). This is approximately the same set of information that a <domaincheckpoint> will need to track.
I would differentiate checkpoints and backups. For example, in the case of push backups we can store additional metadata in <domainbackup> so later we can revert back to the previous state. But checkpoints (bitmaps technically) are only to make incremental backups (restores?). We can attach extra metadata to checkpoints but it looks accidental, just because bitmaps and backups relate to the same point in time. To me a backup (push) can carry all the metadata, and as to checkpoints, a backup can have an associated checkpoint or not. For example if we choose to always make full backups we don't need checkpoints at all (at least if we are not going to use them for restore).
Well ... if we create checkpoints alongside full backups, then you have points to reference to create future incremental backups. You don't need checkpoints if you *NEVER* use an incremental backup. If we want the feature enabled, so to speak, you likely need to be making checkpoints alongside full backups.
I'd say the cases in which we don't want them -- once the feature is enabled -- are hard to find.
I'm slightly tempted to just overload <domainsnapshot> to track three modes instead of two (internal, external, and now checkpoint); but think that will probably be a bit too confusing, so more likely I will create <domaincheckpoint> as a new object, but copy a lot of coding paradigms from <domainsnapshot>.
I wonder if you are going to use a tree or a list structure for backups. To me it is much easier to think of backups just as a sequence of states in time. For example consider the Grandfather-Father-Son scheme of Acronis backups [1]. A typical backup can look like:
F - I - I - I - I - D - I - I - I - I - D
Where F is the full monthly backup, I an incremental daily backup and D a differential weekly backup (no backups on Sunday and Saturday). This is the representation from the time POV. From the backup dependencies POV it looks like this:
F - I - I - I - I   D - I - I - I - I   D
 \------------------|                   |
  \-------------------------------------|
or, in a more common representation:
F - I - I - I - I
 \- D - I - I - I - I
 \- D - I - I - I - I
To me using a tree structure for snapshots is appropriate because each branching point is some semantic state ("basic OS installed") and branches are different trials from that point. In the backup case I guess we don't want branching on recovery to some backup, we just want to keep the selected backup scheme going. So for example if we recover on Wednesday to the previous week's Friday then later on Wednesday we will have the regular Wednesday backup as if we had not been recovered. This makes things simple for the client, or he will drown in dependencies (especially after a couple of recoveries).
But your representation is itself a tree -- is this a good argument against hierarchical information ... ?
If you don't utilize the hierarchy, the degenerate form is indeed just a list:
F - I - I - I - I - I - I - I - I - I ...
everything has just one successor.
I think Eric just feels he can get good code re-use out of the <domainsnapshot> element -- since each <snapshot> element itself references a parent ID; there's no real "cost" to tracking a tree instead of a list.
There's nothing stopping you from adding three checkpoints that have the same parent, so to speak.
I think this is just something that might wind up happening "for free" due to the nature of how libvirt stores relational data at all.
I mean we have to store the tree structure for backups internally, of course. I suggest:
- not to expose the tree structure thru the API in the first place. For example we can have API like:
virDomainBackupList(time_t from, time_t to, virDomainBackupPtr **backups, unsigned int flags)
to list backups in some period of time, with flags like 'only full backups', 'include parent backups if they don't fit into the interval', 'include children backups if they don't fit into the interval', and
virDomainBackupListChildren(virDomainBackupPtr parent, virDomainBackupPtr **backups, unsigned int flags)
to list a backup's children
- in case of restore, not to branch from the restored state; instead just continue to back up as if the changes brought by the restore were produced by the guest
So the API has means to explore the tree structure eventually (virDomainBackupListChildren), but I suggest to think of and provide means to work with backups as a sequence in time, not a tree, in the first place.
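A small sketch of those listing semantics (the API above is only a proposal, so the flag name and record layout here are invented purely for illustration; the point is that the primary view is a flat sequence in time):

ONLY_FULL = 1 << 0     # hypothetical flag: 'only full backups'

def backup_list(backups, t_from, t_to, flags=0):
    # backups: iterable of dicts with 'time' (e.g. epoch day) and 'kind' (F/D/I)
    sel = [b for b in backups if t_from <= b["time"] <= t_to]
    if flags & ONLY_FULL:
        sel = [b for b in sel if b["kind"] == "F"]
    return sorted(sel, key=lambda b: b["time"])

week = [{"time": d, "kind": k} for d, k in enumerate("FIIIID")]
print(backup_list(week, 1, 5))              # the incrementals and the differential
print(backup_list(week, 0, 5, ONLY_FULL))   # just the full backup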
Of course internally we need to track backup dependencies in order to properly delete backups or recover from them.
[1] https://www.acronis.com/en-us/support/documentation/AcronisBackup_11.5/index...
So, from that point of view, libvirt tracking the relationship between qcow2 bitmaps in order to form checkpoint information can be done ALL with libvirt, and without NEEDING the qcow2 file to track any relations between bitmaps. BUT, libvirt's job can probably be made easier if qcow2 would, at the least, allow bitmaps to track their parent, and/or provide APIs to easily merge a parent..intermediate..child chain of related bitmaps to be merged into a single bitmap, for easy runtime creation of the temporary bitmap used to express the delta between two checkpoints.
[snip]
Nikolay

On 04/25/2018 03:19 AM, Nikolay Shirokovskiy wrote:
On 24.04.2018 23:02, John Snow wrote:
On 04/23/2018 06:38 AM, Nikolay Shirokovskiy wrote:
On 21.04.2018 00:26, Eric Blake wrote:
On 04/20/2018 01:24 PM, John Snow wrote:
Why is option 3 unworkable, exactly?:
(3) Checkpoints exist as structures only with libvirt. They are saved and remembered in the XML entirely.
Or put another way:
Can you explain to me why it's important for libvirt to be able to reconstruct checkpoint information from a qcow2 file?
In short, it takes extra effort to keep the metadata consistent when a libvirtd crash occurs. See [1] for a more detailed explanation, starting from the words "Yes it is possible".
[1] https://www.redhat.com/archives/libvir-list/2018-April/msg01001.html
I'd argue the converse. Libvirt already knows how to do atomic updates of XML files that it tracks. If libvirtd crashes/restarts in the middle of an API call, you already have indeterminate results of whether the API worked or failed; once libvirtd is restarted, you'll have to probably retry the command. For all other cases, the API call completes, and either no XML changes were made (the command failed and reports the failure properly), or all XML changes were made (the command created the appropriate changes to track the new checkpoint, including whatever bitmap names have to be recorded to map the relation between checkpoints and bitmaps).
We can fail to save XML... Consider we have B1, B2 and create B3 bitmap in the process of creating checkpoint C3. Next qemu creates snapshot and bitmap successfully then libvirt fails to update XML and after some time libvirt restarts (not even crashes). Now libvirt knows of B1 and B2 but not B3. What can be the consequences? For example if we ask bitmap from C2 we miss all changes from C3 as we don't know of B3. This will lead to corrupted backups.
This can be fixed:
- in qemu. If bitmaps have child/parent relationship then on libvirt restart we can recover (we ask qemu for bitmaps, discover B3 and then discover B3 is child of B2). This is how basically implementation with naming scheme works. Well on this way we don't need special metadata in libvirt (besides maybe domain xml attached to checkpoint etc)
- in libvirt. If we save XML before creating a snapshot with checkpoint. This fixes the issue with successful operation but saving XML failure. But now we have another issue :) We can save XML successfully but then operation itself can fail and we fail to revert XML back. Well we can recover even without child/parent metadata in qemu in this case. Just ask qemu for bitmaps on libvirt restart and if bitmap is missing kick it out as it is a case described above (successful saving XML then unsuccessful qemu operation)
This option seems perfectly workable to me...
So it is possible to track bitmaps in libvirt. We just need to be extra careful not to produce invalid backups.
Consider the case of internal snapshots. Already, we have the case where qemu itself does not track enough useful metadata about internal snapshots (right now, just a name and timestamp of creation); so libvirt additionally tracks further information in <domainsnapshot>: the name, timestamp, relationship to any previous snapshot (libvirt can then reconstruct a tree relationship between all snapshots; where a parent can have more than one child if you roll back to a snapshot and then execute the guest differently), the set of disks participating in the snapshot, and the <domain> description at the time of the snapshot (if you hotplug devices, or even the fact that creating external snapshots changes which file is the active qcow2 in a backing chain, you'll need to know how to roll back to the prior domain state as part of reverting). This is approximately the same set of information that a <domaincheckpoint> will need to track.
I would differentiate checkpoints and backups. For example, in the case of push backups we can store additional metadata in <domainbackup> so later we can revert back to the previous state. But checkpoints (bitmaps technically) are only to make incremental backups (restores?). We can attach extra metadata to checkpoints but it looks accidental, just because bitmaps and backups relate to the same point in time. To me a backup (push) can carry all the metadata, and as to checkpoints, a backup can have an associated checkpoint or not. For example if we choose to always make full backups we don't need checkpoints at all (at least if we are not going to use them for restore).
Well ... if we create checkpoints alongside full backups, then you have points to reference to create future incremental backups. You don't need checkpoints if you *NEVER* use an incremental backup. If we want the feature enabled, so to speak, you likely need to be making checkpoints alongside full backups.
I'd say the cases in which we don't want them -- once the feature is enabled -- are hard to find.
I'm slightly tempted to just overload <domainsnapshot> to track three modes instead of two (internal, external, and now checkpoint); but think that will probably be a bit too confusing, so more likely I will create <domaincheckpoint> as a new object, but copy a lot of coding paradigms from <domainsnapshot>.
I wonder if you are going to use tree or list structure for backups. To me it is much easier to think of backups just as sequence of states in time. For example consider Grandfather-Father-Son scheme of Acronis backups [1]. Typical backup can look like:
F - I - I - I - I - D - I - I - I - I - D
Where F is a full monthly backup, I an incremental daily backup and D a differential weekly backup (no backups on Sunday and Saturday). This is the representation from the time POV. From the backup dependencies POV it looks like this:
F - I - I - I - I   D - I - I - I - I   D
\-------------------|                   |
 \--------------------------------------|
or more common representation:
F - I - I - I - I
\- D - I - I - I - I
\- D - I - I - I - I
To me, using a tree structure for snapshots is appropriate because each branching point is some semantic state ("basic OS installed") and branches are different trials from that point. In the backup case I guess we don't want branching on recovery to some backup; we just want to keep the selected backup scheme going. So for example if we recover on Wednesday to the previous week's Friday, then later on Wednesday we will have the regular Wednesday backup as if we had not recovered at all. This keeps things simple for the client, or he will drown in dependencies (especially after a couple of recoveries).
But your representation is itself a tree -- is this a good argument against hierarchical information ... ?
If you don't utilize the hierarchy, the degenerate form is indeed just a list:
F - I - I - I - I - I - I - I - I - I ...
everything has just one successor.
I think Eric just feels he can get good code re-use out of the <domainsnapshot> element -- since each <snapshot> element itself references a parent ID; there's no real "cost" to tracking a tree instead of a list.
There's nothing stopping you from adding three checkpoints that have the same parent, so to speak.
I think this is just something that might wind up happening "for free" due to the nature of how libvirt stores relational data at all.
I mean we have to store tree structure for backups of course. I suggest
- not to expose tree structure thru API in the first place. For example we can have API like
- virDomainBackupList(time_t from, time_t to, virDomainBackupPtr **backups, unsigned int flags)
to list backups in some period of time, with flags like:
  - 'only full backups'
  - 'include parent backups if they don't fit into the interval'
  - 'include children backups if they don't fit into the interval'
- virDomainBackupListChildren(virDomainBackupPtr parent, virDomainBackupPtr **backups, unsigned int flags)
to list backup children
- in case of restore, don't branch from the restored state; instead just continue to back up as if the changes brought by the restore were produced by the guest
So the API has means to explore the tree structure eventually (virDomainBackupListChildren), but I suggest thinking of, and providing means to work with, backups as a sequence in time rather than a tree in the first place.
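For illustration only: a minimal sketch of how a client might drive the listing call proposed above. virDomainBackupPtr, virDomainBackupList() and the flag name are hypothetical - they exist only in this proposal (with the signature quoted above, which takes no domain argument as written), not in the current libvirt API.

    /* list the backups taken during the last week, pulling in parent
     * backups that fall outside the interval so the set is self-contained */
    time_t to = time(NULL);
    time_t from = to - 7 * 24 * 3600;
    virDomainBackupPtr *backups = NULL;                  /* hypothetical object type */
    int i, n;

    n = virDomainBackupList(from, to, &backups,
                            VIR_DOMAIN_BACKUP_LIST_PARENTS);   /* hypothetical flag */
    for (i = 0; i < n; i++) {
        /* ... inspect or restore backups[i] ... */
    }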
Oh, sure. That might be reasonable, but I'll probably defer to Eric's opinion here. The XML storage can be tree-based (as a natural occurrence) but I don't know if we need to make the API tree-based, right. I don't have a really strong stance here -- I'd say whatever makes the most sense with the implementation that best facilitates code re-use in libvirt. --js
Of course internally we need to track backup dependencies in order to properly delete backups or recover from them.
[1] https://www.acronis.com/en-us/support/documentation/AcronisBackup_11.5/index...
So, from that point of view, libvirt tracking the relationship between qcow2 bitmaps in order to form checkpoint information can be done ALL with libvirt, and without NEEDING the qcow2 file to track any relations between bitmaps. BUT, libvirt's job can probably be made easier if qcow2 would, at the least, allow bitmaps to track their parent, and/or provide APIs to easily merge a parent..intermediate..child chain of related bitmaps to be merged into a single bitmap, for easy runtime creation of the temporary bitmap used to express the delta between two checkpoints.
[snip]
Nikolay

On 04/13/2018 03:02 PM, John Snow wrote:
What are the downsides to actually including a predecessor/successor* pointer in QEMU?
(1) We'd need to amend the bitmap persistence format
Which I think is doable, since we have a size field.
(2) We'd need to amend some of the bitmap management commands
(3) We'd need to make sure it migrates correctly:
    (A) Shared storage should be fine; just flush to disk and pivot
    (B) Live storage needs to learn a new field to migrate.
Certainly it's not ...trivial, but not terribly difficult either. I wonder if it's the right thing to do in lieu of the naming hacks in libvirt.
There wasn't really a chorus of applause for the idea of having checkpoints more officially implemented in QEMU, but... abusing the name metadata still makes me feel like we're doing something wrong -- especially if a third party utility that doesn't understand the concept of your naming scheme comes along and modifies a bitmap.
Speaking of that, we really need at least read-only commands for qemu-img to show details about what bitmaps are present in a qcow2 file, at least for debugging all of this.
It feels tenuous and likely to break, so I'd like to formalize it more. We can move this discussion over to the QEMU lists if you think it's worth talking about.
Or I'll just roll with it. I'll see what Eric thinks, I guess? :)
Indeed, discussing an enhancement of qcow2 metadata to track bitmap relationships is probably appropriate on the qemu list.
*(Uh-oh, that term is overloaded for QEMU bitmap internals... we can address that later...)
-- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org

On 04/12/2018 08:57 AM, Nikolay Shirokovskiy wrote:
On 12.04.2018 07:14, John Snow wrote:
On 04/11/2018 12:32 PM, Eric Blake wrote:
On 04/03/2018 07:01 AM, Nikolay Shirokovskiy wrote:
Hi, all.
This is another RFC on pull backup API. This API provides means to read domain disks in a snapshotted state so that client can back them up as well as means to write domain disks to revert them to backed up state. The previous version of RFC is [1]. I'll also describe the API implementation details to shed light on misc qemu dirty bitmap commands usage.
[snip]
Qemu can track what disk's blocks are changed from snapshotted state so on next backup client can backup only changed blocks. virDomainBlockSnapshotCreateXML accepts VIR_DOMAIN_BLOCK_SNAPSHOT_CREATE_CHECKPOINT flag to turn this option for snapshot which means to track changes from this particular snapshot. I used checkpoint term and not [dirty] bitmap because many qemu dirty bitmaps are used to provide changed blocks from the given checkpoint to current snapshot in current implementation (see *Implemenation* section for more details). Also bitmap keeps block changes and thus itself changes in time and checkpoint is a more statical terms means you can query changes from that moment in time.
Checkpoints are visible in active domain xml:
<disk type='file' device='disk'>
  ..
  <target dev='sda' bus='scsi'/>
  <alias name='scsi0-0-0-0'/>
  <checkpoint name="93a5c045-6457-2c09-e56c-927cdf34e178">
  <checkpoint name="5768a388-c1c4-414c-ac4e-eab216ba7c0c">
  ..
</disk>
It makes sense to avoid the bitmap name in libvirt, but do these indeed correlate 1:1 with bitmaps?
I assume each bitmap will have name=%%UUID%% ?
There is a 1:1 correlation but the names are different. Check out the checkpoints subsection of the *implementation details* section below for the naming scheme.
Yeah, I saw later. You have both "checkpoints" (associated with bitmaps) and then the bitmaps themselves.
Every checkpoint requires a qemu dirty bitmap, which eats 16MiB of RAM for a 1TiB disk with the default dirty block size of 64KiB, and the same amount of disk space is used. So the client needs to manage checkpoints and delete unused ones. Thus the next API function:
[snip]
First a few facts about qemu dirty bitmaps.
A bitmap can be either in the active or the disabled state. In the disabled state it does not get changed on guest writes, and conversely in the active state it tracks guest writes. This implementation uses an approach with only one active bitmap at a time. This should reduce guest write penalties in the presence of checkpoints. So on the first snapshot we create bitmap B1; now it tracks changes from snapshot 1. On the second snapshot we create bitmap B2 and disable bitmap B1, and so on. Now bitmap B1 keeps changes from snapshot 1 to snapshot 2, B2 keeps changes from snapshot 2 to snapshot 3, and so on. The last bitmap is active and accumulates the disk changes made after the latest snapshot.
So you are trying to optimize away write penalties if you have, say, ten bitmaps representing checkpoints so we don't have to record all new writes to all ten.
This makes sense, and I would have liked to formalize the concept in QEMU, but response to that idea was very poor at the time.
Also my design was bad :)
Getting the changed-blocks bitmap from some checkpoint in the past up to the current snapshot is quite simple in this scheme. For example, if the last snapshot is 7, then to get the changes from snapshot 3 to the latest snapshot we need to merge bitmaps B3, B4, B5 and B6. Merge is just a logical OR on the bitmap bits.
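For illustration only - a minimal sketch of the "merge is just logical OR" idea on plain in-memory bit arrays. The management layer of course never touches qemu's dirty bitmaps this way; the real merge is done with the QMP merge commands shown later in the thread.

    /* OR a chain of bitmaps B_K .. B_{N-1} into a scratch bitmap, the way a
     * temporary export bitmap is formed; byte arrays stand in for qemu
     * dirty bitmaps here. */
    #include <stddef.h>
    #include <string.h>

    static void bitmap_merge_chain(unsigned char *dst,
                                   unsigned char *const *chain,
                                   size_t nbitmaps, size_t nbytes)
    {
        memset(dst, 0, nbytes);                    /* start from an empty bitmap */
        for (size_t i = 0; i < nbitmaps; i++)
            for (size_t b = 0; b < nbytes; b++)
                dst[b] |= chain[i][b];             /* merge == logical OR */
    }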
Deleting a checkpoint somewhere in the middle of the checkpoint sequence requires merging the corresponding bitmap into the previous bitmap in this scheme.
Previous, or next?
In short previous.
Say we've got bitmaps (in chronological order from oldest to newest)
A B C D E F G H
and we want to delete bitmap (or "checkpoint") 'C':
A B D E F G H
the bitmap representing checkpoint 'D' should now contain the bits that used to be in 'C', right? That way all the checkpoints still represent their appropriate points in time.
I merge into the previous one due to the definition above. "A" contains changes from point in time A to point in time B, and so on. So if you delete C, then in order for B to keep the changes from point in time B to point in time D (the next in the checkpoint chain) you need to merge C into B.
I'm not sure the way it's explained here makes sense to me, but Vladimir's explanation does.
The only problem comes when you delete a checkpoint on the end and the bits have nowhere to go:
A B C
A B _
In this case you really do lose a checkpoint -- but depending on how we annotate this, it may or may not be possible to delete the most recent checkpoint. Let's assume that the currently active bitmap that doesn't represent *any* point in time yet (because it's still active and recording new writes) is noted as 'X':
A B C X
If we delete C now, then, that bitmap can get re-merged into the *active bitmap* X:
A B _ X
You can delete any bitmap (and accordingly any checkpoint). If the checkpoint is the last one, we just merge the last bitmap into the previous one and additionally make the previous bitmap active.
We use persistent bitmaps in the implementation. This means that upon qemu process termination bitmaps are saved in the disk images' metadata and restored back on qemu process start. This makes a checkpoint a persistent property, that is, we keep them across domain starts/stops. Qemu does not try hard to keep bitmaps: if something goes wrong on save, the bitmap is dropped. The same applies to the migration process too. For the backup process this is not critical - if we don't discover a checkpoint we can always make a full backup. Also, qemu provides no special means to track the order of bitmaps. These facts are critical for an implementation with one active bitmap at a time. We need the right order of bitmaps upon merge - for snapshot N and block changes from snapshot K, K < N, to N we need to merge bitmaps B_{K}, ..., B_{N-1}. Also, if one of the bitmaps to be merged is missing we can't calculate the desired block changes either.
Right. A missing bitmap anywhere in the sequence invalidates the entire sequence.
So the implementation encodes the bitmap order in the bitmap names. For snapshot A1 the bitmap name will be A1, for snapshot A2 the bitmap name will be A2^A1, and so on. Using this name encoding, upon domain start we can find out the bitmap order and check for missing ones. This complicates bitmap removal a bit, though. For example, removing a bitmap somewhere in the middle looks like this:
- removing bitmap K (bitmap name is NAME_{K}^NAME_{K-1}):
    - create new bitmap named NAME_{K+1}^NAME_{K-1}          ---.
    - disable the new bitmap                                    |  This is effectively renaming
    - merge bitmap NAME_{K+1}^NAME_{K} into the new bitmap      |  bitmap K+1 to comply with
    - remove bitmap NAME_{K+1}^NAME_{K}                      ___/  the naming scheme
    - merge bitmap NAME_{K}^NAME_{K-1} into NAME_{K-1}^NAME_{K-2}
    - remove bitmap NAME_{K}^NAME_{K-1}
As you can see, we need to change the name of bitmap K+1 to keep our bitmap naming scheme. This is done by creating a new K+1 bitmap with the appropriate name and copying the old K+1 bitmap into the new one.
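To illustrate the naming scheme above: a minimal sketch, under the NAME_{K}^NAME_{K-1} convention, of how the chain order could be reconstructed from the bitmap names qemu reports on libvirtd restart. The helper is hypothetical; real code would also have to handle several disks and report which bitmap is missing.

    #include <string.h>

    /* names[] holds the bitmap names of one disk; chain[] receives them in
     * chronological order.  Returns the chain length, or -1 if the chain is
     * broken (no unique root, or a missing link). */
    static int order_bitmaps(const char **names, int n, const char **chain)
    {
        int len = 0;

        for (int i = 0; i < n; i++)            /* the root name has no '^' */
            if (!strchr(names[i], '^'))
                chain[len++] = names[i];
        if (len != 1)
            return -1;

        while (len < n) {
            const char *tail = chain[len - 1];
            size_t own = strcspn(tail, "^");   /* length of the own-name part */
            int found = 0;

            for (int i = 0; i < n && !found; i++) {
                const char *sep = strchr(names[i], '^');
                /* child of the tail: its suffix equals the tail's own name */
                if (sep && strncmp(sep + 1, tail, own) == 0 && sep[1 + own] == '\0') {
                    chain[len++] = names[i];
                    found = 1;
                }
            }
            if (!found)
                return -1;                     /* a bitmap is missing in the chain */
        }
        return len;
    }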
That seems... unfortunate. A record could be kept in libvirt instead, couldn't it?
A : Bitmap A, Time 12:34:56, Child of (None), Parent of B
B : Bitmap B, Time 23:15:46, Child of A, Parent of (None)
Yes, it is possible. I was reluctant to implement it this way for a couple of reasons:
- if bitmap metadata is in libvirt we need to carefully design it for things like libvirtd crashes. If the metadata gets out of sync with qemu then we can get broken incremental backups. One possible design is (a rough sketch follows after this list):
- on bitmap deletion, save metadata after deleting the bitmap in qemu; in case of a libvirtd crash in between, upon libvirtd restart we can drop bitmaps that are in the metadata but not in qemu as already deleted
- on bitmap add (creating a new snapshot with a checkpoint), save metadata with the bitmap before creating the bitmap in qemu; then again we have a way to handle libvirtd crashes in between
So this approach has tricky parts too. The suggested approach uses qemu transactions to keep bitmaps consistent.
- I don't like yet another piece of metadata that looks like it belongs to the disks and not the domain. It is like keeping the disk size in the domain xml.
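A rough sketch of the save ordering described in the first bullet, with hypothetical helpers (nothing here is existing libvirt code); the only point is *when* the libvirt-side metadata is written relative to the qemu operation, so that a libvirtd crash in between is recoverable by dropping metadata entries that have no matching bitmap in qemu:

    /* deletion: qemu first, metadata second */
    static int checkpoint_delete(void *vm, const char *name)
    {
        if (qemu_bitmap_remove(vm, name) < 0)              /* hypothetical QMP wrapper */
            return -1;
        /* crash here: metadata still lists 'name' but qemu no longer has the
         * bitmap -> on restart the stale entry is dropped as already deleted */
        return checkpoint_metadata_save(vm);               /* hypothetical */
    }

    /* addition: metadata first, qemu second */
    static int checkpoint_add(void *vm, const char *name)
    {
        if (checkpoint_metadata_save_with(vm, name) < 0)   /* hypothetical */
            return -1;
        /* crash here: metadata lists 'name' but qemu has no such bitmap ->
         * the same restart check drops the never-created entry */
        return qemu_bitmap_add(vm, name);                  /* hypothetical QMP wrapper */
    }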
Yeah, I see... Having to rename bitmaps in the middle of the chain seems unfortunate, though... and I'm still a little wary of using the names as important metadata to be really honest. It feels like a misuse of the field.
I suppose in this case you can't *reconstruct* this information from the bitmap stored in the qcow2, which necessitates your naming scheme...
...Still, if you forego this requirement, deleting bitmaps in the middle becomes fairly easy.
So while it is possible to have only one active bitmap at a time, it costs some exercise at the management layer. To me it looks like qemu itself is a better place to track bitmap chain order and consistency.
If this is a hard requirement, it's certainly *easier* to track the relationship in QEMU ...
Libvirt is already tracking a tree relationship between internal snapshots (the virDomainSnapshotCreateXML), because qemu does NOT track that (true, internal snapshots don't get as much attention as external snapshots) - but the fact remains that qemu is probably not the best place to track relationship between multiple persistent bitmaps, any more than it tracks relationships between internal snapshots. So having libvirt track relations between persistent bitmaps is just fine. Do we really have to rename bitmaps in the qcow2 file, or can libvirt track it all on its own?
This is a way, really, of storing extra metadata by using the bitmap name as arbitrary data storage.
I'd say either we promote QEMU to understanding checkpoints, or enhance libvirt to track what it needs independent of QEMU -- but having to rename bitmaps smells fishy to me.
Earlier, you said that the new virDomainBlockSnapshotPtr are independent, with no relations between them. But here, you are wanting to keep incremental backups related to one another.
I think the *snapshots*, as temporary objects, are independent and don't carry a relation to each other.
The *checkpoints* here, however, are persistent and interrelated.
Now, here is how exporting bitmaps looks.
- add to export disk snapshot N with changes from checkpoint K:
    - add fleece blockdev to NBD exports
    - create new bitmap T
    - disable bitmap T
    - merge bitmaps K, K+1, .. N-1 into T
I see; so we compute a new slice based on previous bitmaps and back up arbitrarily from that arbitrary slice.
So "T" is a temporary bitmap meant to be discarded at the conclusion of the operation, making it much more like a consumable object.
    - add bitmap T to the NBD export
- remove disk snapshot from export:
    - remove fleece blockdev from NBD exports
    - remove bitmap T
Aha.
Here are qemu command examples for operations with checkpoints. I'll make several snapshots with checkpoints for the purpose of better illustration.
- create snapshot d068765e-8b50-4d74-9b72-1e55c663cbf8 with checkpoint:
    - same as without checkpoint but additionally add bitmap on fleece blockjob start
...
{ "execute": "transaction"
  "arguments": {
    "actions": [
      { "type": "blockdev-backup"
        "data": {
          "device": "drive-scsi0-0-0-0",
          "sync": "none",
          "target": "snapshot-scsi0-0-0-0"
        },
      },
      { "type": "block-dirty-bitmap-add"
        "data": {
          "name": "libvirt-d068765e-8b50-4d74-9b72-1e55c663cbf8",
          "node": "drive-scsi0-0-0-0",
          "persistent": true
        },
      }
So a checkpoint creates a reference point, but NOT a backup. You are manually creating checkpoint instances.
In this case, though, you haven't disabled the previous checkpoint's bitmap (if any?) atomically with the creation of this one...
In the example this is first snapshot so there is no previous checkpoint and thus nothing to disable.
OK, got it!
Here, the transaction makes sense; you have to create the persistent dirty bitmap to track from the same point in time. The dirty bitmap is tied to the active image, not the backup, so that when you create the NEXT incremental backup, you have an accurate record of which sectors were touched in snapshot-scsi0-0-0-0 between this transaction and the next.
    ]
  },
}
- delete snapshot d068765e-8b50-4d74-9b72-1e55c663cbf8:
    - same as without checkpoints
- create snapshot 0044757e-1a2d-4c2c-b92f-bb403309bb17 with checkpoint:
    - same actions as for the first snapshot, but additionally disable the first bitmap
Again, you're showing the QMP commands that libvirt is issuing; which libvirt API calls are driving these actions?
...
{ "execute": "transaction"
  "arguments": {
    "actions": [
      { "type": "blockdev-backup"
        "data": {
          "device": "drive-scsi0-0-0-0",
          "sync": "none",
          "target": "snapshot-scsi0-0-0-0"
        },
      },
      { "type": "x-vz-block-dirty-bitmap-disable"
        "data": {
Do you have measurements on whether having multiple active bitmaps hurts performance? I'm not yet sure that managing a chain of disabled bitmaps (and merging them as needed for restores) is more or less efficient than managing multiple bitmaps all the time. On the other hand, you do have a point that restore is a less frequent operation than backup, so making backup as lean as possible and putting more work on restore is a reasonable tradeoff, even if it adds complexity to the management for doing restores.
Depending on the number of checkpoints intended to be kept... we certainly make no real promises on the efficiency of marking so many. It's at *least* a linear increase with each checkpoint...
          "name": "libvirt-d068765e-8b50-4d74-9b72-1e55c663cbf8",
          "node": "drive-scsi0-0-0-0"
        },
      },
      { "type": "block-dirty-bitmap-add"
        "data": {
          "name": "libvirt-0044757e-1a2d-4c2c-b92f-bb403309bb17^d068765e-8b50-4d74-9b72-1e55c663cbf8",
          "node": "drive-scsi0-0-0-0",
          "persistent": true
        },
      }
    ]
  },
}
Oh, I see, you handle the "disable old" case here.
- delete snapshot 0044757e-1a2d-4c2c-b92f-bb403309bb17
- create snapshot 8fc02db3-166f-4de7-b7aa-1f7303e6162b with checkpoint
- add disk snapshot 8fc02db3-166f-4de7-b7aa-1f7303e6162b to export, with a bitmap of changes from checkpoint d068765e-8b50-4d74-9b72-1e55c663cbf8:
    - same as adding an export without a checkpoint, but additionally:
        - form the result bitmap
        - add the bitmap to the NBD export
...
{ "execute": "transaction"
  "arguments": {
    "actions": [
      { "type": "block-dirty-bitmap-add"
        "data": {
          "node": "drive-scsi0-0-0-0",
          "name": "libvirt-__export_temporary__",
          "persistent": false
        },
      },
      { "type": "x-vz-block-dirty-bitmap-disable"
        "data": {
          "node": "drive-scsi0-0-0-0"
          "name": "libvirt-__export_temporary__",
        },
      },
      { "type": "x-vz-block-dirty-bitmap-merge"
        "data": {
          "node": "drive-scsi0-0-0-0",
          "src_name": "libvirt-d068765e-8b50-4d74-9b72-1e55c663cbf8"
          "dst_name": "libvirt-__export_temporary__",
        },
      },
      { "type": "x-vz-block-dirty-bitmap-merge"
        "data": {
          "node": "drive-scsi0-0-0-0",
          "src_name": "libvirt-0044757e-1a2d-4c2c-b92f-bb403309bb17^d068765e-8b50-4d74-9b72-1e55c663cbf8"
          "dst_name": "libvirt-__export_temporary__",
        },
      }
    ]
  },
}
OK, so in this transaction you add a new temporary bitmap for export, and merge the contents of two bitmaps into it.
However, it doesn't look like you created a new checkpoint and managed that handoff here, did you?
We don't need to create checkpoints for the purpose of exporting, only a temporary bitmap to merge the appropriate bitmap chain into.
See reply below
{ "execute": "x-vz-nbd-server-add-bitmap"
  "arguments": {
    "name": "sda-8fc02db3-166f-4de7-b7aa-1f7303e6162b"
    "bitmap": "libvirt-__export_temporary__",
    "bitmap-export-name": "d068765e-8b50-4d74-9b72-1e55c663cbf8",
  },
And then here, once the bitmap and the data is already frozen, it's actually alright if we add the export at a later point in time.
Adding a bitmap to a server would advertise to the NBD client that it can query the "qemu-dirty-bitmap:d068765e-8b50-4d74-9b72-1e55c663cbf8" namespace during NBD_CMD_BLOCK_STATUS, rather than just "base:allocation"?
Don't know much about this, I stopped paying attention to the BLOCK STATUS patches. Is the NBD spec the best way to find out the current state right now?
(Is there a less technical, briefer overview somewhere, perhaps from a commit message or a cover letter?)
}
- remove snapshot 8fc02db3-166f-4de7-b7aa-1f7303e6162b from export:
    - same as without checkpoint but additionally remove temporary bitmap
...
{ "arguments": {
    "name": "libvirt-__export_temporary__",
    "node": "drive-scsi0-0-0-0"
  },
  "execute": "block-dirty-bitmap-remove"
}
OK, this just deletes the checkpoint. I guess we delete the node and
I would not call it a checkpoint. A checkpoint is something visible to the client: an ability to get CBT from that point in time.
Here we create a temporary bitmap to calculate the desired CBT.
Aha, right. I misspoke; but it's because in my mind I feel like creating an export will *generally* be accompanied by a new checkpoint, so I was surprised to see that missing from the example. But, yes, there's no reason you *have* to create a new checkpoint when you do an export -- but I suspect that when you DO create a new checkpoint it's generally going to be accompanied by an export like this, right?
stop the NBD server too, right?
yeah, just like in case without checkpoint (mentioned in this case description)
- delete checkpoint 0044757e-1a2d-4c2c-b92f-bb403309bb17 (similar operation is described in the section about naming scheme for bitmaps, with difference that K+1 is N here and thus new bitmap should not be disabled)
A suggestion on the examples - while UUIDs are nice and handy for management tools, they are a pain to type and for humans to quickly read. Is there any way we can document a sample transaction stream with all the actors involved (someone issues a libvirt API call XYZ, libvirt in turn issues QMP command ABC), and using shorter names that are easier to read as humans?
Yeah, A-B-C-D terminology would be nice for the examples. It's fine if the actual implementation uses UUIDs.
{ "arguments": {
    "actions": [
      { "type": "block-dirty-bitmap-add"
        "data": {
          "node": "drive-scsi0-0-0-0",
          "name": "libvirt-8fc02db3-166f-4de7-b7aa-1f7303e6162b^d068765e-8b50-4d74-9b72-1e55c663cbf8",
          "persistent": true
        },
      },
      { "type": "x-vz-block-dirty-bitmap-merge"
        "data": {
          "node": "drive-scsi0-0-0-0",
          "src_name": "libvirt-0044757e-1a2d-4c2c-b92f-bb403309bb17^d068765e-8b50-4d74-9b72-1e55c663cbf8"
          "dst_name": "libvirt-d068765e-8b50-4d74-9b72-1e55c663cbf8",
        },
      },
      { "type": "x-vz-block-dirty-bitmap-merge"
        "data": {
          "node": "drive-scsi0-0-0-0",
          "src_name": "libvirt-8fc02db3-166f-4de7-b7aa-1f7303e6162b^0044757e-1a2d-4c2c-b92f-bb403309bb17"
          "dst_name": "libvirt-8fc02db3-166f-4de7-b7aa-1f7303e6162b^d068765e-8b50-4d74-9b72-1e55c663cbf8"
        },
      },
    ]
  },
  "execute": "transaction"
}
{ "execute": "x-vz-block-dirty-bitmap-remove"
  "arguments": {
    "node": "drive-scsi0-0-0-0"
    "name": "libvirt-8fc02db3-166f-4de7-b7aa-1f7303e6162b^0044757e-1a2d-4c2c-b92f-bb403309bb17",
  },
},
{ "execute": "x-vz-block-dirty-bitmap-remove"
  "arguments": {
    "node": "drive-scsi0-0-0-0"
    "name": "libvirt-0044757e-1a2d-4c2c-b92f-bb403309bb17^d068765e-8b50-4d74-9b72-1e55c663cbf8",
  },
}
Here is a list of the bitmap commands used in the implementation but not yet upstream (AFAIK).
x-vz-block-dirty-bitmap-remove
We already have this, right? It doesn't even need to be transactionable.
x-vz-block-dirty-bitmap-merge
You need this...
x-vz-block-dirty-bitmap-disable
And this we had originally but since removed, but can be re-added trivially.
x-vz-block-dirty-bitmap-enable (not in the examples; used when removing the most recent checkpoint)
x-vz-nbd-server-add-bitmap
Do my comments make sense? Am I understanding you right so far? I'll try to offer a competing writeup to make sure we're on the same page with your proposed design before I waste any time trying to critique it -- in case I'm misunderstanding you.
Yes, looks like we are in tune.
More or less. Thank you for taking the time to explain it all out to me. I think I understand the general shape of your proposal, more or less.
Thank you for leading the charge and proposing new APIs for this feature. It will be very nice to expose the incremental backup functionality we've been working on in QEMU to users of libvirt.
--js
There are also patches (if the API design survives the review phase, at least partially :) )
I can only really help (or hinder?) where QEMU primitives are concerned -- the actual libvirt API is going to be what Eric cares about. I think this looks good so far, though -- at least, it makes sense to me.
*Restore operation nuances*
As was written above, to restore a domain one needs to start it in the paused state, export the domain's disks, and write them from the backup. However qemu currently does not allow exporting disks for write even for a domain that never starts guest CPUs. We have an experimental qemu command line option -x-vz-nbd-restore (passed together with the -incoming option) to fix this.
Why can't restore be done while the guest is offline? (Oh right, we still haven't added decent qemu-img support for bitmap manipulation, so we need a qemu process around for any bitmap changes).
I'm working on this right now, actually!
I'm working on JSON format output for bitmap querying, and simple clear/delete commands. I hope to send this out very soon.
As I understand it, the point of bitmaps and snapshots is to create an NBD server that a third-party can use to read just the dirty portions of a disk in relation to a known checkpoint, to save off data in whatever form it wants; so you are right that the third party then needs a way to rewrite data from whatever internal form it stored it in back to the view that qemu can consume when rolling back to a given backup, prior to starting the guest on the restored data. Do you need additional libvirt APIs exposed for this, or do the proposed APIs for adding snapshots cover everything already with just an additional flag parameter that says whether the <domainblocksnapshot> is readonly (the third-party is using it for collecting the incremental backup data) or writable (the third-party is actively writing its backup into the file, and when it is done, then perform a block-commit to merge that data back onto the main qcow2 file)?
Thank you!

On 13.04.2018 02:53, John Snow wrote:
First a few facts about qemu dirty bitmaps.
A bitmap can be either in the active or the disabled state. In the disabled state it does not get changed on guest writes, and conversely in the active state it tracks guest writes. This implementation uses an approach with only one active bitmap at a time. This should reduce guest write penalties in the presence of checkpoints. So on the first snapshot we create bitmap B1; now it tracks changes from snapshot 1. On the second snapshot we create bitmap B2 and disable bitmap B1, and so on. Now bitmap B1 keeps changes from snapshot 1 to snapshot 2, B2 keeps changes from snapshot 2 to snapshot 3, and so on. The last bitmap is active and accumulates the disk changes made after the latest snapshot.
So you are trying to optimize away write penalties if you have, say, ten bitmaps representing checkpoints so we don't have to record all new writes to all ten.
This makes sense, and I would have liked to formalize the concept in QEMU, but response to that idea was very poor at the time.
Also my design was bad :)
Getting the changed-blocks bitmap from some checkpoint in the past up to the current snapshot is quite simple in this scheme. For example, if the last snapshot is 7, then to get the changes from snapshot 3 to the latest snapshot we need to merge bitmaps B3, B4, B5 and B6. Merge is just a logical OR on the bitmap bits.
Deleting a checkpoint somewhere in the middle of the checkpoint sequence requires merging the corresponding bitmap into the previous bitmap in this scheme.
Previous, or next?
In short previous.
Say we've got bitmaps (in chronological order from oldest to newest)
A B C D E F G H
and we want to delete bitmap (or "checkpoint") 'C':
A B D E F G H
the bitmap representing checkpoint 'D' should now contain the bits that used to be in 'C', right? That way all the checkpoints still represent their appropriate points in time.
I merge into the previous one due to the definition above. "A" contains changes from point in time A to point in time B, and so on. So if you delete C, then in order for B to keep the changes from point in time B to point in time D (the next in the checkpoint chain) you need to merge C into B.
I'm not sure the way it's explained here makes sense to me, but Vladimir's explanation does.
The only problem comes when you delete a checkpoint on the end and the bits have nowhere to go:
A B C
A B _
In this case you really do lose a checkpoint -- but depending on how we annotate this, it may or may not be possible to delete the most recent checkpoint. Let's assume that the currently active bitmap that doesn't represent *any* point in time yet (because it's still active and recording new writes) is noted as 'X':
A B C X
If we delete C now, then, that bitmap can get re-merged into the *active bitmap* X:
A B _ X
You can delete any bitmap (and accordingly any checkpoint). If the checkpoint is the last one, we just merge the last bitmap into the previous one and additionally make the previous bitmap active.
We use persistent bitmaps in the implementation. This means that upon qemu process termination bitmaps are saved in the disk images' metadata and restored back on qemu process start. This makes a checkpoint a persistent property, that is, we keep them across domain starts/stops. Qemu does not try hard to keep bitmaps: if something goes wrong on save, the bitmap is dropped. The same applies to the migration process too. For the backup process this is not critical - if we don't discover a checkpoint we can always make a full backup. Also, qemu provides no special means to track the order of bitmaps. These facts are critical for an implementation with one active bitmap at a time. We need the right order of bitmaps upon merge - for snapshot N and block changes from snapshot K, K < N, to N we need to merge bitmaps B_{K}, ..., B_{N-1}. Also, if one of the bitmaps to be merged is missing we can't calculate the desired block changes either.
Right. A missing bitmap anywhere in the sequence invalidates the entire sequence.
So the implementation encodes the bitmap order in the bitmap names. For snapshot A1 the bitmap name will be A1, for snapshot A2 the bitmap name will be A2^A1, and so on. Using this name encoding, upon domain start we can find out the bitmap order and check for missing ones. This complicates bitmap removal a bit, though. For example, removing a bitmap somewhere in the middle looks like this:
- removing bitmap K (bitmap name is NAME_{K}^NAME_{K-1}):
    - create new bitmap named NAME_{K+1}^NAME_{K-1}          ---.
    - disable the new bitmap                                    |  This is effectively renaming
    - merge bitmap NAME_{K+1}^NAME_{K} into the new bitmap      |  bitmap K+1 to comply with
    - remove bitmap NAME_{K+1}^NAME_{K}                      ___/  the naming scheme
    - merge bitmap NAME_{K}^NAME_{K-1} into NAME_{K-1}^NAME_{K-2}
    - remove bitmap NAME_{K}^NAME_{K-1}
As you can see, we need to change the name of bitmap K+1 to keep our bitmap naming scheme. This is done by creating a new K+1 bitmap with the appropriate name and copying the old K+1 bitmap into the new one.
That seems... unfortunate. A record could be kept in libvirt instead, couldn't it?
A : Bitmap A, Time 12:34:56, Child of (None), Parent of B
B : Bitmap B, Time 23:15:46, Child of A, Parent of (None)
Yes, it is possible. I was reluctant to implement it this way for a couple of reasons:
- if bitmap metadata is in libvirt we need to carefully design it for things like libvirtd crashes. If the metadata gets out of sync with qemu then we can get broken incremental backups. One possible design is:
- on bitmap deletion, save metadata after deleting the bitmap in qemu; in case of a libvirtd crash in between, upon libvirtd restart we can drop bitmaps that are in the metadata but not in qemu as already deleted
- on bitmap add (creating a new snapshot with a checkpoint), save metadata with the bitmap before creating the bitmap in qemu; then again we have a way to handle libvirtd crashes in between
So this approach has tricky parts too. The suggested approach uses qemu transactions to keep bitmaps consistent.
- I don't like yet another piece of metadata that looks like it belongs to the disks and not the domain. It is like keeping the disk size in the domain xml.
Yeah, I see... Having to rename bitmaps in the middle of the chain seems unfortunate, though...
and I'm still a little wary of using the names as important metadata to be really honest. It feels like a misuse of the field.
Me too. I'd like to see a different offer from qemu for bitmaps so that mgmt does not have to do such heavy lifting. [snip]
- add disk snapshot 8fc02db3-166f-4de7-b7aa-1f7303e6162b to export, with a bitmap of changes from checkpoint d068765e-8b50-4d74-9b72-1e55c663cbf8:
    - same as adding an export without a checkpoint, but additionally:
        - form the result bitmap
        - add the bitmap to the NBD export
...
{ "execute": "transaction"
  "arguments": {
    "actions": [
      { "type": "block-dirty-bitmap-add"
        "data": {
          "node": "drive-scsi0-0-0-0",
          "name": "libvirt-__export_temporary__",
          "persistent": false
        },
      },
      { "type": "x-vz-block-dirty-bitmap-disable"
        "data": {
          "node": "drive-scsi0-0-0-0"
          "name": "libvirt-__export_temporary__",
        },
      },
      { "type": "x-vz-block-dirty-bitmap-merge"
        "data": {
          "node": "drive-scsi0-0-0-0",
          "src_name": "libvirt-d068765e-8b50-4d74-9b72-1e55c663cbf8"
          "dst_name": "libvirt-__export_temporary__",
        },
      },
      { "type": "x-vz-block-dirty-bitmap-merge"
        "data": {
          "node": "drive-scsi0-0-0-0",
          "src_name": "libvirt-0044757e-1a2d-4c2c-b92f-bb403309bb17^d068765e-8b50-4d74-9b72-1e55c663cbf8"
          "dst_name": "libvirt-__export_temporary__",
        },
      }
    ]
  },
}
OK, so in this transaction you add a new temporary bitmap for export, and merge the contents of two bitmaps into it.
However, it doesn't look like you created a new checkpoint and managed that handoff here, did you?
We don't need to create checkpoints for the purpose of exporting, only a temporary bitmap to merge the appropriate bitmap chain into.
See reply below
{ "execute": "x-vz-nbd-server-add-bitmap"
  "arguments": {
    "name": "sda-8fc02db3-166f-4de7-b7aa-1f7303e6162b"
    "bitmap": "libvirt-__export_temporary__",
    "bitmap-export-name": "d068765e-8b50-4d74-9b72-1e55c663cbf8",
  },
And then here, once the bitmap and the data is already frozen, it's actually alright if we add the export at a later point in time.
Adding a bitmap to a server would advertise to the NBD client that it can query the "qemu-dirty-bitmap:d068765e-8b50-4d74-9b72-1e55c663cbf8" namespace during NBD_CMD_BLOCK_STATUS, rather than just "base:allocation"?
Don't know much about this, I stopped paying attention to the BLOCK STATUS patches. Is the NBD spec the best way to find out the current state right now?
(Is there a less technical, briefer overview somewhere, perhaps from a commit message or a cover letter?)
}
- remove snapshot 8fc02db3-166f-4de7-b7aa-1f7303e6162b from export:
    - same as without checkpoint but additionally remove temporary bitmap
...
{ "arguments": {
    "name": "libvirt-__export_temporary__",
    "node": "drive-scsi0-0-0-0"
  },
  "execute": "block-dirty-bitmap-remove"
}
OK, this just deletes the checkpoint. I guess we delete the node and
I would not call it a checkpoint. A checkpoint is something visible to the client: an ability to get CBT from that point in time.
Here we create a temporary bitmap to calculate the desired CBT.
Aha, right. I misspoke; but it's because in my mind I feel like creating an export will *generally* be accompanied by a new checkpoint, so I was surprised to see that missing from the example.
But, yes, there's no reason you *have* to create a new checkpoint when you do an export -- but I suspect that when you DO create a new checkpoint it's generally going to be accompanied by an export like this, right?
Generally speaking yes, but not always. Consider the first backup. It has to be full, with no other option. We create a snapshot with a checkpoint but export only the snapshot. There is no need to export a bitmap (a bitmap from what checkpoint? the current checkpoint does not count). We create the checkpoint because we want the second backup to be incremental from the first. Later, from time to time, we will create full backups again so that the incremental chain does not get too long, I guess. Then again we create a snapshot with a checkpoint but don't export any CBT. In short, we create a checkpoint not for the purpose of export but rather to have a point in time to get CBT from in later backups. Actually, in the current implementation (not yet published) there is a restriction that a snapshot can be exported with CBT only if the snapshot was created with a checkpoint. This saves us from creating a temporary qemu bitmap which marks the end point in time for the requested CBT.
stop the NBD server too, right?
yeah, just like in case without checkpoint (mentioned in this case description)
- delete checkpoint 0044757e-1a2d-4c2c-b92f-bb403309bb17 (similar operation is described in the section about naming scheme for bitmaps, with difference that K+1 is N here and thus new bitmap should not be disabled)
A suggestion on the examples - while UUIDs are nice and handy for management tools, they are a pain to type and for humans to quickly read. Is there any way we can document a sample transaction stream with all the actors involved (someone issues a libvirt API call XYZ, libvirt in turn issues QMP command ABC), and using shorter names that are easier to read as humans?
Yeah, A-B-C-D terminology would be nice for the examples. It's fine if the actual implementation uses UUIDs.
{ "arguments": {
    "actions": [
      { "type": "block-dirty-bitmap-add"
        "data": {
          "node": "drive-scsi0-0-0-0",
          "name": "libvirt-8fc02db3-166f-4de7-b7aa-1f7303e6162b^d068765e-8b50-4d74-9b72-1e55c663cbf8",
          "persistent": true
        },
      },
      { "type": "x-vz-block-dirty-bitmap-merge"
        "data": {
          "node": "drive-scsi0-0-0-0",
          "src_name": "libvirt-0044757e-1a2d-4c2c-b92f-bb403309bb17^d068765e-8b50-4d74-9b72-1e55c663cbf8"
          "dst_name": "libvirt-d068765e-8b50-4d74-9b72-1e55c663cbf8",
        },
      },
      { "type": "x-vz-block-dirty-bitmap-merge"
        "data": {
          "node": "drive-scsi0-0-0-0",
          "src_name": "libvirt-8fc02db3-166f-4de7-b7aa-1f7303e6162b^0044757e-1a2d-4c2c-b92f-bb403309bb17"
          "dst_name": "libvirt-8fc02db3-166f-4de7-b7aa-1f7303e6162b^d068765e-8b50-4d74-9b72-1e55c663cbf8"
        },
      },
    ]
  },
  "execute": "transaction"
}
{ "execute": "x-vz-block-dirty-bitmap-remove"
  "arguments": {
    "node": "drive-scsi0-0-0-0"
    "name": "libvirt-8fc02db3-166f-4de7-b7aa-1f7303e6162b^0044757e-1a2d-4c2c-b92f-bb403309bb17",
  },
},
{ "execute": "x-vz-block-dirty-bitmap-remove"
  "arguments": {
    "node": "drive-scsi0-0-0-0"
    "name": "libvirt-0044757e-1a2d-4c2c-b92f-bb403309bb17^d068765e-8b50-4d74-9b72-1e55c663cbf8",
  },
}
Here is a list of bitmap commands used in implementation but not yet in upstream (AFAIK).
x-vz-block-dirty-bitmap-remove
We already have this, right? It doesn't even need to be transactionable.
x-vz-block-dirty-bitmap-merge
You need this...
x-vz-block-dirty-bitmap-disable
And this we had originally but since removed, but can be re-added trivially.
x-vz-block-dirty-bitmap-enable (not in the examples; used when removing the most recent checkpoint)
x-vz-nbd-server-add-bitmap
Do my comments make sense? Am I understanding you right so far? I'll try to offer a competing writeup to make sure we're on the same page with your proposed design before I waste any time trying to critique it -- in case I'm misunderstanding you.
Yes, looks like we are in tune.
More or less. Thank you for taking the time to explain it all out to me. I think I understand the general shape of your proposal, more or less.
Thank you for leading the charge and proposing new APIs for this feature. It will be very nice to expose the incremental backup functionality we've been working on in QEMU to users of libvirt.
--js
There are also patches (if the API design survives the review phase, at least partially :) )
I can only really help (or hinder?) where QEMU primitives are concerned -- the actual libvirt API is going to be what Eric cares about.
I think this looks good so far, though -- at least, it makes sense to me.
*Restore operation nuances*
As was written above, to restore a domain one needs to start it in the paused state, export the domain's disks, and write them from the backup. However qemu currently does not allow exporting disks for write even for a domain that never starts guest CPUs. We have an experimental qemu command line option -x-vz-nbd-restore (passed together with the -incoming option) to fix this.
Why can't restore be done while the guest is offline? (Oh right, we still haven't added decent qemu-img support for bitmap manipulation, so we need a qemu process around for any bitmap changes).
I'm working on this right now, actually!
I'm working on JSON format output for bitmap querying, and simple clear/delete commands. I hope to send this out very soon.
As I understand it, the point of bitmaps and snapshots is to create an NBD server that a third-party can use to read just the dirty portions of a disk in relation to a known checkpoint, to save off data in whatever form it wants; so you are right that the third party then needs a way to rewrite data from whatever internal form it stored it in back to the view that qemu can consume when rolling back to a given backup, prior to starting the guest on the restored data. Do you need additional libvirt APIs exposed for this, or do the proposed APIs for adding snapshots cover everything already with just an additional flag parameter that says whether the <domainblocksnapshot> is readonly (the third-party is using it for collecting the incremental backup data) or writable (the third-party is actively writing its backup into the file, and when it is done, then perform a block-commit to merge that data back onto the main qcow2 file)?
Thank you!

On 04/11/2018 11:14 PM, John Snow wrote:
This API does not use existent disks snapshots. Instead it introduces snapshots provided by qemu's blockdev-backup command. The reason is we need snapshotted disk state only temporarily for duration of backup operation and newly introduced snapshots can be easily discarded at the end of operation without block commit operation. Technically difference is next. On usual snapshot we create new image backed by original and all new data goes to the new image thus original image stays in a snapshotted state. In temporary snapshots we create new image backed by original and all new data still goes to the original image but before new data is written old data to be overwritten is popped out to the new image thus we get snapshotted state thru new image.
Oh, I see -- you're using blockdev-backup sync=none to accomplish fleecing snapshots. It's a little confusing here without the sync=none information, as usually blockdev-backup provides... backups, not snapshots.
Right now, I'm leaning towards a single virDomainBackupStart()/virDomainBackupEnd() pair that work for both push and pull model backups:

push model: Start() command points to the place to use as the destination that qemu will push to; it either provides the checkpoint to start from (qemu's "sync":"incremental", using the appropriate bitmap(s) to construct how much data to actually push to the destination), or omits the checkpoint (full backup, qemu's "sync":"full"); optionally creates a new checkpoint at the same time; emits a libvirt event to note when the backup job is complete; then the user calls End() to tear down any resources (qemu can no longer write to the destination).

pull model: Start() command points to the place to use as temporary storage so that the guest can continue to write, AND includes the details of the NBD server to set up for third-party access to the backup. The command optionally provides the checkpoint to expose over NBD (which gets translated into the bitmap(s) to expose as block status), or omits it (the third party has to assume that the full image is dirty); either way, the temporary storage uses "sync":"none", and it is up to the third party connecting to NBD how much of the image to read (whether a checkpoint was specified or not, the third party can read the entire image if desired, or only a fraction of the image if desired; but a checkpoint has to be provided if the third party plans on learning how much of the image to read to capture an incremental backup). There is no event emitted by libvirt; the user calls End() when the third-party client is done using the NBD connection.

-- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
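For illustration only: a sketch of the pull-model call sequence described above. virDomainBackupStart()/virDomainBackupEnd() are the calls proposed in this paragraph, not existing libvirt API, and the XML body and flag values are invented for the example.

    static int pull_backup_example(virDomainPtr dom)
    {
        const char *backup_xml =
            "<domainbackup mode='pull'>"                    /* hypothetical schema */
            "  <server type='ip' host='0.0.0.0' port='8000'/>"
            "  <incremental checkpoint='CHECKPOINT-A'/>"
            "</domainbackup>";

        if (virDomainBackupStart(dom, backup_xml, 0) < 0)   /* proposed API */
            return -1;

        /* ... the third-party client connects to the NBD export, asks for the
         * blocks dirtied since CHECKPOINT-A, and stores them elsewhere ... */

        return virDomainBackupEnd(dom, 0);                  /* proposed API: tear down
                                                               the export and scratch file */
    }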

On 11.04.2018 19:32, Eric Blake wrote:
On 04/03/2018 07:01 AM, Nikolay Shirokovskiy wrote:
Hi, all.
This is another RFC on pull backup API. This API provides means to read domain disks in a snapshotted state so that client can back them up as well as means to write domain disks to revert them to backed up state. The previous version of RFC is [1]. I'll also describe the API implementation details to shed light on misc qemu dirty bitmap commands usage.
This is a first-pass review (making comments as I first encounter something, even if it gets explained later in the email)
This API does not use existent disks snapshots. Instead it introduces snapshots provided by qemu's blockdev-backup command. The reason is we need snapshotted disk state only temporarily for duration of backup operation and newly introduced snapshots can be easily discarded at the end of operation without block commit operation. Technically difference is next. On usual snapshot we create new image backed by original and all new data goes to the new image thus original image stays in a snapshotted state. In temporary snapshots we create new image backed by original and all new data still goes to the original image but before new data is written old data to be overwritten is popped out to the new image thus we get snapshotted state thru new image.
So, rewriting this to make sure I understand, let's start with a disk with contents A, then take a snapshot, then write B:
In the existing libvirt snapshot APIs, the data gets distributed as:
base (contents A) <- new active (contents B)
where you want the new API:
base, remains active (contents B) ~~~ backup (contents A)
Exactly
Disks snapshots as well as disks itself are avaiable to read/write thru qemu NBD server.
So the biggest reason for a new libvirt API is that we need management actions to control which NBD images from qemu are exposed and torn down at the appropriate sequences.
Here is typical actions on domain backup:
- create temporary snapshot of domain disks of interest - export snaphots thru NBD - back them up - remove disks from export - delete temporary snapshot
and typical actions on domain restore:
- start domain in paused state - export domain disks of interest thru NBD for write - restore them - remove disks from export - resume or destroy domain
Now let's write down API in more details. There are minor changes in comparison with previous version [1].
*Temporary snapshot API*
In previous version it is called 'Fleece API' after qemu terms and I'll still use BlockSnapshot prefix for commands as in previous RFC instead of TmpSnapshots which I inclined more now.
virDomainBlockSnapshotPtr virDomainBlockSnapshotCreateXML(virDomainPtr domain, const char *xmlDesc, unsigned int flags);
Just to make sure, we have the existing API of:
virDomainSnapshotPtr virDomainSnapshotCreateXML(virDomainPtr domain, const char *xmlDesc, unsigned int flags);
So you are creating a new object (virDomainBlockSnapshotPtr) rather than reusing the existing VirDomainSnapshotPtr, and although the two commands are similar, we get to design a new XML schema from scratch rather than trying to overload yet even more functionality onto the existing API.
Yes. Existing snapshots are different from temporary snapshots in many ways. The former, for example, form a tree structure while the latter do not.
Should we also have:
const char *virDomainBlockSnapshotGetName(virDomainBlockSnapshotPtr snapshot);
virDomainPtr virDomainBlockSnapshotGetDomain(virDomainBlockSnapshotPtr snapshot);
virConnectPtr virDomainBlockSnapshotGetConnect(virDomainBlockSnapshotPtr snapshot);
for symmetry with existing snapshot API?
Yes. I omitted these calls in the RFC as they are trivial and don't need to be considered to grasp the picture.
virDomainBlockSnapshotDelete(virDomainBlockSnapshotPtr snapshot, unsigned int flags);
virDomainBlockSnapshotList(virDomainPtr domain, virDomainBlockSnapshotPtr **snaps, unsigned int flags);
I'm guessing this is the counterpart to virDomainListAllSnapshots() (the modern listing interface), and that we probably don't want counterparts for virDomainSnapshotNum/virDomainSnapshotListNames (the older listing interface, which was inherently racy as the list could change in length between the two calls).
That's right.
virDomainBlockSnapshotGetXMLDesc(virDomainBlockSnapshotPtr snapshot, unsigned int flags);
virDomainBlockSnapshotPtr virDomainBlockSnapshotLookupByName(virDomainPtr domain, const char *name, unsigned int flags);
Also, the virDomainSnapshotPtr had a number of API to track a tree-like hierarchy between snapshots (that is, you very much want to know if snapshot B is a child of snapshot A), while it looks like your new virDomainBlockSnapshotPtrs are completely independent (no relationships between the snapshots, each can be independently created or torn down, without having to rewrite a relationship tree between them, and there is no need for counterparts to things like virDomainSnapshotNumChildren). Okay, I think that makes sense, and is a good reason for introducing a new object type rather than shoe-horning this into the existing API.
This fact motivates me to introduce a new API too.
Here is an example of snapshot xml description:
<domainblocksnapshot>
  <name>d068765e-8b50-4d74-9b72-1e55c663cbf8</name>
  <disk name='sda' type="file">
    <fleece file="/tmp/snapshot-a.hdd"/>
  </disk>
  <disk name='sdb' type="file">
    <fleece file="/tmp/snapshot-b.hdd"/>
  </disk>
</domainblocksnapshot>
Temporary snapshots are independent, thus they are not organized in a tree structure as usual snapshots are, so the 'list snapshots' and 'lookup' functions will suffice.
So in the XML, the <fleece> element describes the destination file (back to my earlier diagram, it would be the file that is created and will hold content 'A' when the main active image is changed to hold content 'B' after the snapshot was created)?
Yes.
Qemu can track what disk's blocks are changed from snapshotted state so on next backup client can backup only changed blocks. virDomainBlockSnapshotCreateXML accepts VIR_DOMAIN_BLOCK_SNAPSHOT_CREATE_CHECKPOINT flag to turn this option for snapshot which means to track changes from this particular snapshot. I used checkpoint term and not [dirty] bitmap because many qemu dirty bitmaps are used to provide changed blocks from the given checkpoint to current snapshot in current implementation (see *Implemenation* section for more details). Also bitmap keeps block changes and thus itself changes in time and checkpoint is a more statical terms means you can query changes from that moment in time.
Checkpoints are visible in active domain xml:
<disk type='file' device='disk'>
  ..
  <target dev='sda' bus='scsi'/>
  <alias name='scsi0-0-0-0'/>
  <checkpoint name="93a5c045-6457-2c09-e56c-927cdf34e178">
  <checkpoint name="5768a388-c1c4-414c-ac4e-eab216ba7c0c">
  ..
</disk>
Every checkpoint requires a qemu dirty bitmap, which eats 16MiB of RAM for a 1TiB disk with the default dirty block size of 64KiB, and the same amount of disk space is used. So the client needs to manage checkpoints and delete unused ones. Thus the next API function:
int virDomainBlockCheckpointRemove(virDomainPtr domain, const char *name, unsigned int flags);
I'm trying to figure out how BlockCheckpoint and BlockSnapshots relate. Maybe it will be more clear when I read the implementation section below. Is the idea that I can't create a BlockSnapshot without first having a checkpoint available? If so, where does that fit in the <domainblocksnapshot> XML?
No, you can create snapshot without available checkpoints. Actually the first snapshot is like that. Now if you create a snapshot with checkpoint and then delete the snapshot the checkpoint remains, so we need an API to delete them if we wish.
*Block export API*
I guess it is natural to treat qemu NBD server as a domain device. So we can use virDomainAttachDeviceFlags/virDomainDetachDeviceFlags API to start/stop NBD server and virDomainUpdateDeviceFlags to add/delete disks to be exported.
This feels a bit awkward - up to now, attaching a device is something visible to the guest, but you are trying to reuse the interface to attach something tracked by the domain, but which has no impact to the guest. That is, the guest has no clue whether a block export exists pointing to a particular checkpoint, nor does it care.
Not entirely true. Take graphical framebuffers (vnc) or serial devices. The guest is completely unaware of vnc. A serial device is related to a guest device, but again the guest is not aware of such a relation.
While I'm have no doubts about start/stop operations using virDomainUpdateDeviceFlags looks a bit inconvinient so I decided to add a pair of API functions just to add/delete disks to be exported:
int virDomainBlockExportStart(virDomainPtr domain, const char *xmlDesc, unsigned int flags);
int virDomainBlockExportStop(virDomainPtr domain, const char *xmlDesc, unsigned int flags);
I guess more appropriate names are virDomainBlockExportAdd and virDomainBlockExportRemove but as I already have a patch series implementing pull backups with these names I would like to keep these names now.
What does the XML look like in these calls?
These names also reflect that in the implementation I decided to start/stop NBD server in a lazy manner. While it is a bit innovative for libvirt API I guess it is convinient because to refer NBD server to add/remove disks to we need to identify it thru it's parameters like type, address etc until we introduce some device id (which does not looks consistent with current libvirt design).
This just reinforces my thoughts above - is the reason it doesn't make sense to assign a device id to the export due to the fact that the export is NOT guest-visible? Does it even belong under the
By export do you mean the NBD server or a disk being exported? What is this id for? Is this the libvirt alias for devices or something different?
"domain/devices/" xpath of the domain XML, or should it be a new sibling of <devices> with an xpath of "domain/blockexports/"?
So it looks like we have all parameters to start/stop server in the frame of these calls so why have extra API calls just to start/stop server manually. If we later need to have NBD server without disks we can perfectly support virDomainAttachDeviceFlags/virDomainDetachDeviceFlags.
Here is example of xml to add/remove disks (specifying checkpoint attribute is not needed for removing disks of course):
<domainblockexport type="nbd">
  <address type="ip" host="0.0.0.0" port="8000"/>
  <disk name="sda" snapshot="0044757e-1a2d-4c2c-b92f-bb403309bb17"
        checkpoint="d068765e-8b50-4d74-9b72-1e55c663cbf8"/>
  <disk name="sdb" snapshot="0044757e-1a2d-4c2c-b92f-bb403309bb17"
        checkpoint="d068765e-8b50-4d74-9b72-1e55c663cbf8"/>
</domainblockexport>
So this is the XML you pass to virDomainBlockExportStart, with the goal of telling qemu to start or stop an NBD export on the backing chain associated with disk "sda", where the export is serving up data tied to checkpoint "d068765e-8b50-4d74-9b72-1e55c663cbf8", and which will be associated with the destination snapshot file described by the <domainblocksnapshot> named "0044757e-1a2d-4c2c-b92f-bb403309bb17"?
I would rephrase. I didn't think of arbitrary backing chains in this API. It just exports the temporary snapshot of disk "sda". The snapshot is referenced by its name "0044757e-1a2d-4c2c-b92f-bb403309bb17". Additionally you can ask to export CBT from some earlier snapshot of "sda", referenced by "d068765e-8b50-4d74-9b72-1e55c663cbf8", to the exported snapshot ("0044757e-1a2d-4c2c-b92f-bb403309bb17"). To make exporting CBT possible the earlier snapshot should be created with the VIR_DOMAIN_BLOCK_SNAPSHOT_CREATE_CHECKPOINT flag. So this export API is somewhat oriented to block snapshots. Maybe one day we will want to export the backing chain of regular snapshots; then this API will be insufficient...
Why is it named <domainblockexport> here, but...
And this is how this NBD server will be exposed in domain xml:
<devices>
  ...
  <blockexport type="nbd">
<blockexport> here?
In this case we already have the domain context, as the xpath is /domain/devices/blockexport.
    <address type="ip" host="0.0.0.0" port="8000"/>
    <disk name="sda" snapshot="0044757e-1a2d-4c2c-b92f-bb403309bb17"
          checkpoint="d068765e-8b50-4d74-9b72-1e55c663cbf8"
          exportname="sda-0044757e-1a2d-4c2c-b92f-bb403309bb17"/>
The exportname property is new here compared to the earlier listing - is that something that libvirt generates, or that the user chooses?
In the current implementation it is generated. I see no obstacle to "exportname" being specified in the input too.
<disk name="sdb" snapshot="0044757e-1a2d-4c2c-b92f-bb403309bb17" checkpoint="d068765e-8b50-4d74-9b72-1e55c663cbf8 exportname="sdb-0044757e-1a2d-4c2c-b92f-bb403309bb17"/> </blockexport> ... </devices>
*Implementation details from qemu-libvirt interactions POV*
1. Temporary snapshot
- create snapshot
Which libvirt API triggers this action? virDomainBlockSnapshotCreateXML?
Yes.
  - add fleece blockdev backed by disk of interest
  - start fleece blockjob which will pop out data to be overwritten to fleece blockdev
{ "execute": "blockdev-add" "arguments": { "backing": "drive-scsi0-0-0-0", "driver": "qcow2", "file": { "driver": "file", "filename": "/tmp/snapshot-a.hdd"
Is qemu creating this file, or is libvirt pre-creating it and qemu just opening it? I guess this is a case where libvirt would want to
The latter.
pre-create an empty qcow2 file (either by qemu-img, or by the new x-blockdev-create in qemu 2.12)? Okay, it looks like this file is what you listed in the XML for <domainblocksnapshot>, so libvirt is creating it. Does the new file have a backing image, or does it read as completely zeroes?
The file is created by qemu-img without a backing chain so it reads as zeroes. But it does not matter as the file is not meant to be read/written outside of the qemu process. I guess after the blockdev-add command the fleece image gets the active image as backing in qemu internals.
}, "node-name": "snapshot-scsi0-0-0-0" }, }
No trailing comma in JSON {}, but it's not too hard to figure out what you mean.
Oops) I used python -mjson.tool to pretty print the json grabbed from qemu logs. It sorts json keys alphabetically which is not convenient in this case - "execute" is better placed above "arguments". So I just moved the "execute" line in the editor and completely forgot about the commas)
{ "execute": "transaction" "arguments": { "actions": [ { "type": "blockdev-backup" "data": { "device": "drive-scsi0-0-0-0", "target": "snapshot-scsi0-0-0-0" "sync": "none", }, } ]
You showed a transaction with only one element; but presumably we are using a transaction because if we want to create a point in time for multiple disks at once, we need two separate blockdev-backup actions joined in the same transaction to cover the two disks. So this command
Yes, strictly speaking we don't need a transaction here. I provide here qemu logs from the current dumb implementation) I guess I'd better use a snapshot of 2 disks in the example as you suggest.
is telling qemu to start using a brand-new qcow2 file as its local storage for tracking that a snapshot is being taken, and that point in
Yes.
time is the checkpoint?
No, these actions will not create a checkpoint. Examples for checkpoints are below. In the case of checkpoints we additionally add a new dirty bitmap in the transaction for every disk and manipulate the existing dirty bitmaps.
Am I correct that you would then tell qemu to export an NBD view of this qcow2 snapshot which a third-party client can connect to and use NBD_CMD_BLOCK_STATUS to learn which portions of the file contain data (that is, which clusters qemu has copied into the backup, because the active image has changed them since the checkpoint), but anything not dirty in this file is still identical to the last backup?
No. In this example we don't talk about checkpoints, which are for incremental backups. This is a plain full backup. You create a snapshot and export it. Even if we created the snapshot with a checkpoint, the checkpoint is of no use for the first backup. The first backup can not be anything but a full copy of the snapshot. But later, if you make the first backup, delete the snapshot, and then after some time want to create another backup, you create a new snapshot, and this time, if the first snapshot was created with a checkpoint, we can tell thru NBD_CMD_BLOCK_STATUS what portions of the disk in the second snapshot are changed relative to the first snapshot. Using this info you can create an incremental backup.
Would libvirt ever want to use something other than "sync":"none"?
I don't know of use cases for other modes. Looks like "none" is sufficient for snapshot purposes.
  },
}
- delete snapshot
  - cancel fleece blockjob
  - delete fleece blockdev
{ "execute": "block-job-cancel" "arguments": { "device": "drive-scsi0-0-0-0" }, } { "execute": "blockdev-del" "arguments": { "node-name": "snapshot-scsi0-0-0-0" }, }
2. Block export
- add disks to export
  - start NBD server if it is not started
  - add disks
{ "execute": "nbd-server-start" "arguments": { "addr": { "type": "inet" "data": { "host": "0.0.0.0", "port": "49300" }, } }, } { "execute": "nbd-server-add" "arguments": { "device": "snapshot-scsi0-0-0-0", "name": "sda-d068765e-8b50-4d74-9b72-1e55c663cbf8", "writable": false
So this is telling qemu to export the temporary qcow2 image created in the point above. An NBD client would see the export getting progressively more blocks with data as the guest continues to write more clusters (as qemu has to copy the data from the checkpoint to the temporary file before updating the main image with the new data). If the NBD client reads a cluster that has not yet been copied by qemu (because the guest has not written to that cluster since the block job started), would it see zeroes, or the same data that the guest still sees?
No. The client will see a snapshotted disk state. The snapshot does not get changes at all. If the guest makes a write, first the old data is written to the fleece image and then the new data is written to the active image. There are no checkpoints in this example either. Just a snapshot of a disk, and this snapshot is exported thru NBD.
  },
}
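For illustration, this is roughly what the third-party side could look like: a minimal read-only client sketch using libnbd. libnbd is just one possible NBD client library and is not part of this proposal; the host "localhost" is a placeholder for wherever qemu's NBD server listens, the port and export name follow the example above, and error handling is reduced to a bare minimum.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <libnbd.h>

/* Read the exported snapshot "sda-..." from the qemu NBD server above. */
int main(void)
{
    struct nbd_handle *nbd = nbd_create();
    if (!nbd ||
        nbd_connect_uri(nbd,
            "nbd://localhost:49300/sda-d068765e-8b50-4d74-9b72-1e55c663cbf8") == -1) {
        fprintf(stderr, "%s\n", nbd_get_error());
        return EXIT_FAILURE;
    }

    int64_t size = nbd_get_size(nbd);
    char buf[64 * 1024];                     /* read in 64KiB chunks */

    for (int64_t off = 0; off < size; off += sizeof(buf)) {
        size_t n = (size - off) < (int64_t)sizeof(buf) ? size - off : sizeof(buf);
        if (nbd_pread(nbd, buf, n, off, 0) == -1) {
            fprintf(stderr, "%s\n", nbd_get_error());
            return EXIT_FAILURE;
        }
        /* ... write buf to the backup target here ... */
    }

    nbd_shutdown(nbd, 0);
    nbd_close(nbd);
    return EXIT_SUCCESS;
}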
- remove disks from export
  - remove disks
  - stop NBD server if there are no disks left
{ "arguments": { "mode": "hard", "name": "sda-d068765e-8b50-4d74-9b72-1e55c663cbf8" }, "execute": "nbd-server-remove" } { "execute": "nbd-server-stop" }
3. Checkpoints (the most interesting part)
First a few facts about qemu dirty bitmaps.
A bitmap can be either in active or disabled state. In the disabled state it does not get changed on guest writes; conversely, in the active state it tracks guest writes. This implementation uses an approach with only one active bitmap at a time. This should reduce guest write penalties in the presence of checkpoints. So on the first snapshot we create bitmap B1. Now it tracks changes from snapshot 1. On the second snapshot we create bitmap B2 and disable bitmap B1, and so on. Now bitmap B1 keeps changes from snapshot 1 to snapshot 2, B2 keeps changes from snapshot 2 to snapshot 3, and so on. The last bitmap is active and accumulates the disk changes made after the latest snapshot.
Getting the changed-blocks bitmap from some checkpoint in the past up to the current snapshot is quite simple in this scheme. For example, if the last snapshot is 7, then to get the changes from snapshot 3 to the latest snapshot we need to merge bitmaps B3, B4, B5 and B6. Merge is just logical OR on bitmap bits.
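Just to make the merge rule concrete, here is a tiny illustrative sketch; plain byte arrays stand in for qemu dirty bitmaps, while in the real implementation the merge is done by qemu itself via the bitmap merge command.

#include <stddef.h>

/* Changed-blocks info from checkpoint K to snapshot N is the bitwise OR
 * of the intermediate bitmaps B_K ... B_{N-1}. */
void merge_bitmaps(unsigned char *dst,            /* result bitmap    */
                   unsigned char *const *src,     /* B_K ... B_{N-1}  */
                   size_t nsrc, size_t len)
{
    for (size_t i = 0; i < nsrc; i++)
        for (size_t j = 0; j < len; j++)
            dst[j] |= src[i][j];                  /* merge == OR      */
}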
Deleting a checkpoint somewhere in the middle of the checkpoint sequence requires merging the corresponding bitmap into the previous bitmap in this scheme.
We use persistent bitmaps in the implementation. This means that upon qemu process termination bitmaps are saved in the disk images' metadata and restored back on qemu process start. This makes checkpoints a persistent property, that is, we keep them across domain starts/stops. Qemu does not try hard to keep bitmaps: if something goes wrong upon save, the bitmap is dropped. The same applies to the migration process too. For the backup process this is not critical: if we don't discover a checkpoint we can always make a full backup. Also, qemu provides no special means to track the order of bitmaps. These facts are critical for the implementation with one active bitmap at a time. We need the right order of bitmaps upon merge - for snapshot N and block changes from snapshot K, K < N, to N we need to merge bitmaps B_{K}, ..., B_{N-1}. Also, if one of the bitmaps to be merged is missing, we can't calculate the desired block changes either.
So the implementation encodes the bitmap order in their names. For snapshot A1 the bitmap name will be A1, for snapshot A2 the bitmap name will be A2^A1, and so on. Using this naming encoding, upon domain start we can find out the bitmap order and check for missing ones. This complicates bitmap removal a bit though. For example, removing a bitmap somewhere in the middle looks like this:
- removing bitmap K (bitmap name is NAME_{K}^NAME_{K-1})
    - create new bitmap named NAME_{K+1}^NAME_{K-1}       ---.
    - disable new bitmap                                      | This is effectively renaming
    - merge bitmap NAME_{K+1}^NAME_{K} to the new bitmap      | of bitmap K+1 to comply with
    - remove bitmap NAME_{K+1}^NAME_{K}                    ___/ the naming scheme
    - merge bitmap NAME_{K}^NAME_{K-1} to NAME_{K-1}^NAME_{K-2}
    - remove bitmap NAME_{K}^NAME_{K-1}
As you can see, we need to change the name of bitmap K+1 to keep our bitmap naming scheme. This is done by creating a new K+1 bitmap with the appropriate name and copying the old K+1 bitmap into the new one.
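For illustration only (this is not the libvirt code, just the naming convention spelled out), the name encoding boils down to something like:

#include <stdio.h>

/* A bitmap for checkpoint `cur` whose predecessor in the chain is `prev`
 * is named "libvirt-<cur>^<prev>"; the first checkpoint has no suffix.
 * The chain order can thus be rebuilt from the names alone on domain start. */
void checkpoint_bitmap_name(char *buf, size_t len,
                            const char *cur, const char *prev)
{
    if (prev)
        snprintf(buf, len, "libvirt-%s^%s", cur, prev);
    else
        snprintf(buf, len, "libvirt-%s", cur);
}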
So while it is possible to have only one active bitmap at a time, it costs some exercise at the management layer. To me it looks like qemu itself is a better place to track bitmap chain order and consistency.
Libvirt is already tracking a tree relationship between internal snapshots (the virDomainSnapshotCreateXML), because qemu does NOT track that (true, internal snapshots don't get as much attention as external snapshots) - but the fact remains that qemu is probably not the best place to track relationship between multiple persistent bitmaps, any more than it tracks relationships between internal snapshots. So having libvirt track relations between persistent bitmaps is just fine. Do we
The situations are different. For example you can delete internal snapshot S and this will not hurt any child snapshots. Changes from the parent snapshot of S to S itself will be merged into the children. In this sense qemu tracks snapshot relationships. Now let's consider dirty bitmaps. Say you have B1, B2, B3, B4, B5. All but B5 are disabled, and B5 is active and gets changes on guest writes. B1 keeps changes from point in time 1 to point in time 2, and so on. Now if you simply delete B3 then B2, for example, becomes invalid, as now B2 does not reflect all changes from point in time 2 to point in time 4 as we want in our scheme. Qemu does not automatically merge B3 into B2. In this sense qemu does not track bitmap relationships.
really have to rename bitmaps in the qcow2 file, or can libvirt track it all on its own?
Libvirt needs the naming scheme described above to track bitmap order across domain restarts. Thus we need to rename on deletion.
Earlier, you said that the new virDomainBlockSnapshotPtr are independent, with no relations between them. But here, you are wanting to keep incremental backups related to one another.
Yes, but backups are not snapshots. All backup relation management is on the client. In the pull backup scheme libvirt is only here to export a snapshotted disk state, optionally with a CBT from some point in time. The client itself makes backups and tracks their relationships. However, as we use a chain of disabled bitmaps with one active bitmap on the tip of the chain, and qemu does not track their order, we need to do it in libvirt.
Now, here is how exporting bitmaps looks.
- add to export disk snapshot N with changes from checkpoint K
  - add fleece blockdev to NBD exports
  - create new bitmap T
  - disable bitmap T
  - merge bitmaps K, K+1, .. N-1 into T
  - add bitmap T to NBD export
- remove disk snapshot from export
  - remove fleece blockdev from NBD exports
  - remove bitmap T
Here are example qemu commands for operations with checkpoints. I'll make several snapshots with checkpoints for the purpose of better illustration.
- create snapshot d068765e-8b50-4d74-9b72-1e55c663cbf8 with checkpoint
  - same as without checkpoint but additionally add bitmap on fleece blockjob start
... { "execute": "transaction" "arguments": { "actions": [ { "type": "blockdev-backup" "data": { "device": "drive-scsi0-0-0-0", "sync": "none", "target": "snapshot-scsi0-0-0-0" }, }, { "type": "block-dirty-bitmap-add" "data": { "name": "libvirt-d068765e-8b50-4d74-9b72-1e55c663cbf8", "node": "drive-scsi0-0-0-0", "persistent": true }, }
Here, the transaction makes sense; you have to create the persistent dirty bitmap to track from the same point in time. The dirty bitmap is tied to the active image, not the backup, so that when you create the NEXT incremental backup, you have an accurate record of which sectors were touched in snapshot-scsi0-0-0-0 between this transaction and the next.
Yes.
    ]
  },
}
- delete snapshot d068765e-8b50-4d74-9b72-1e55c663cbf8
  - same as without checkpoints

- create snapshot 0044757e-1a2d-4c2c-b92f-bb403309bb17 with checkpoint
  - same actions as for the first snapshot, but additionally disable the first bitmap
Again, you're showing the QMP commands that libvirt is issuing; which libvirt API calls are driving these actions?
Well, I thought of this section of the RFC as being more specific about the qemu commands issued by libvirt during some API call, so that one can better understand how we use the qemu API. I thought the API call and its arguments are clear from the description above. In this case it is virDomainBlockSnapshotCreateXML with the VIR_DOMAIN_BLOCK_SNAPSHOT_CREATE_CHECKPOINT flag set. The xml is next:

<domainblocksnapshot>
  <name>0044757e-1a2d-4c2c-b92f-bb403309bb17</name>
  <disk name='sda' type="file">
    <fleece file="/tmp/snapshot-a.hdd"/>
  </disk>
</domainblocksnapshot>
... { "execute": "transaction" "arguments": { "actions": [ { "type": "blockdev-backup" "data": { "device": "drive-scsi0-0-0-0", "sync": "none", "target": "snapshot-scsi0-0-0-0" }, }, { "type": "x-vz-block-dirty-bitmap-disable" "data": {
Do you have measurements on whether having multiple active bitmaps hurts performance? I'm not yet sure that managing a chain of disabled bitmaps (and merging them as needed for restores) is more or less efficient than
Vova, can you shed some light on this topic?
managing multiple bitmaps all the time. On the other hand, you do have a point that restore is a less frequent operation than backup, so making backup as lean as possible and putting more work on restore is a reasonable tradeoff, even if it adds complexity to the management for doing restores.
Sorry, I don't understand what the tradeoff is from your words.
"name": "libvirt-d068765e-8b50-4d74-9b72-1e55c663cbf8", "node": "drive-scsi0-0-0-0" }, }, { "type": "block-dirty-bitmap-add" "data": { "name": "libvirt-0044757e-1a2d-4c2c-b92f-bb403309bb17^d068765e-8b50-4d74-9b72-1e55c663cbf8", "node": "drive-scsi0-0-0-0", "persistent": true }, } ] }, }
- delete snapshot 0044757e-1a2d-4c2c-b92f-bb403309bb17

- create snapshot 8fc02db3-166f-4de7-b7aa-1f7303e6162b with checkpoint

- add disk snapshot 8fc02db3-166f-4de7-b7aa-1f7303e6162b to export, together with a bitmap with changes from checkpoint d068765e-8b50-4d74-9b72-1e55c663cbf8
  - same as adding an export without checkpoint, but additionally
    - form result bitmap
    - add bitmap to NBD export
... { "execute": "transaction" "arguments": { "actions": [ { "type": "block-dirty-bitmap-add" "data": { "node": "drive-scsi0-0-0-0", "name": "libvirt-__export_temporary__", "persistent": false }, }, { "type": "x-vz-block-dirty-bitmap-disable" "data": { "node": "drive-scsi0-0-0-0" "name": "libvirt-__export_temporary__", }, }, { "type": "x-vz-block-dirty-bitmap-merge" "data": { "node": "drive-scsi0-0-0-0", "src_name": "libvirt-d068765e-8b50-4d74-9b72-1e55c663cbf8" "dst_name": "libvirt-__export_temporary__", }, }, { "type": "x-vz-block-dirty-bitmap-merge" "data": { "node": "drive-scsi0-0-0-0", "src_name": "libvirt-0044757e-1a2d-4c2c-b92f-bb403309bb17^d068765e-8b50-4d74-9b72-1e55c663cbf# "dst_name": "libvirt-__export_temporary__", }, } ] }, } { "execute": "x-vz-nbd-server-add-bitmap" "arguments": { "name": "sda-8fc02db3-166f-4de7-b7aa-1f7303e6162b" "bitmap": "libvirt-__export_temporary__", "bitmap-export-name": "d068765e-8b50-4d74-9b72-1e55c663cbf8", },
Adding a bitmap to a server would advertise to the NBD client that it can query the "qemu-dirty-bitmap:d068765e-8b50-4d74-9b72-1e55c663cbf8" namespace during NBD_CMD_BLOCK_STATUS, rather than just "base:allocation"?
I guess so. I know neither the NBD protocol nor its extensions. Vova, can you clarify?
}
- remove snapshot 8fc02db3-166f-4de7-b7aa-1f7303e6162b from export
  - same as without checkpoint but additionally remove temporary bitmap
... { "arguments": { "name": "libvirt-__export_temporary__", "node": "drive-scsi0-0-0-0" }, "execute": "block-dirty-bitmap-remove" }
- delete checkpoint 0044757e-1a2d-4c2c-b92f-bb403309bb17 (a similar operation is described in the section about the naming scheme for bitmaps, with the difference that K+1 is N here and thus the new bitmap should not be disabled)
A suggestion on the examples - while UUIDs are nice and handy for management tools, they are a pain to type and for humans to quickly read. Is there any way we can document a sample transaction stream with all the actors involved (someone issues a libvirt API call XYZ, libvirt in turn issues QMP command ABC), and using shorter names that are easier to read as humans?
Sure. I'll definitely do so in the next round of the RFC if there is one)
{ "arguments": { "actions": [ { "type": "block-dirty-bitmap-add" "data": { "node": "drive-scsi0-0-0-0", "name": "libvirt-8fc02db3-166f-4de7-b7aa-1f7303e6162b^d068765e-8b50-4d74-9b72-1e55c663cbf8", "persistent": true }, }, { "type": "x-vz-block-dirty-bitmap-merge" "data": { "node": "drive-scsi0-0-0-0", "src_name": "libvirt-0044757e-1a2d-4c2c-b92f-bb403309bb17^d068765e-8b50-4d74-9b72-1e55c663cbf# "dst_name": "libvirt-d068765e-8b50-4d74-9b72-1e55c663cbf8", }, }, { "type": "x-vz-block-dirty-bitmap-merge" "data": { "node": "drive-scsi0-0-0-0", "src_name": "libvirt-8fc02db3-166f-4de7-b7aa-1f7303e6162b^0044757e-1a2d-4c2c-b92f-bb403309bb1# "dst_name": "libvirt-8fc02db3-166f-4de7-b7aa-1f7303e6162b^d068765e-8b50-4d74-9b72-1e55c663cbf# }, }, ] }, "execute": "transaction" } { "execute": "x-vz-block-dirty-bitmap-remove" "arguments": { "node": "drive-scsi0-0-0-0" "name": "libvirt-8fc02db3-166f-4de7-b7aa-1f7303e6162b^0044757e-1a2d-4c2c-b92f-bb403309bb17", }, }, { "execute": "x-vz-block-dirty-bitmap-remove" "arguments": { "node": "drive-scsi0-0-0-0" "name": "libvirt-0044757e-1a2d-4c2c-b92f-bb403309bb17^d068765e-8b50-4d74-9b72-1e55c663cbf8", }, }
Here is a list of bitmap commands used in the implementation but not yet in upstream (AFAIK).

x-vz-block-dirty-bitmap-remove
x-vz-block-dirty-bitmap-merge
x-vz-block-dirty-bitmap-disable
x-vz-block-dirty-bitmap-enable (not in the examples; used when removing the most recent checkpoint)
x-vz-nbd-server-add-bitmap
*Restore operation nuances*
As was written above, to restore a domain one needs to start it in paused state, export the domain's disks and write them from the backup. However qemu currently does not let us export disks for write even for a domain that never starts guest CPUs. We have an experimental qemu command line option -x-vz-nbd-restore (passed together with the -incoming option) to fix that.
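Just to sketch how this restore flow could be driven by a client: the VIR_DOMAIN_START_EXPORTABLE flag and the block export calls below are the ones proposed in this RFC, not existing libvirt API, and error handling is omitted.

#include <libvirt/libvirt.h>

/* Restore sketch: start the domain paused and exportable, let an external
 * tool write the backed-up data into the NBD exports, then resume. */
void restore_domain(virConnectPtr conn, const char *dom_xml)
{
    const char *export_xml =
        "<domainblockexport type='nbd'>"
        "  <address type='ip' host='0.0.0.0' port='8000'/>"
        "  <disk name='sda'/>"
        "  <disk name='sdb'/>"
        "</domainblockexport>";

    /* 1. start domain in paused state; guest CPUs never run */
    virDomainPtr dom = virDomainCreateXML(conn, dom_xml,
                                          VIR_DOMAIN_START_PAUSED |
                                          VIR_DOMAIN_START_EXPORTABLE);

    /* 2. export plain disks (no snapshot/checkpoint attributes) for write */
    virDomainBlockExportStart(dom, export_xml, 0);

    /* 3. ... external tool writes the backup into the NBD exports ... */

    /* 4. stop exporting, then resume (or destroy) the domain */
    virDomainBlockExportStop(dom, export_xml, 0);
    virDomainResume(dom);
}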
Why can't restore be done while the guest is offline? (Oh right, we still haven't added decent qemu-img support for bitmap manipulation, so we need a qemu process around for any bitmap changes).
As I understand it, the point of bitmaps and snapshots is to create an NBD server that a third-party can use to read just the dirty portions of a disk in relation to a known checkpoint, to save off data in whatever form it wants; so you are right that the third party then needs a way to rewrite data from whatever internal form it stored it in back to the view that qemu can consume when rolling back to a given backup, prior to starting the guest on the restored data. Do you need additional libvirt APIs exposed for this, or do the proposed APIs for adding snapshots cover everything already with just an additional flag parameter that says whether the <domainblocksnapshot> is readonly (the third-party is using it for collecting the incremental backup data) or writable (the third-party is actively writing its backup into the file, and when it is done, then perform a block-commit to merge that data back onto the main qcow2 file)?
We don't need snapshots for restore at all. Restore is described at the very top of the document:

and typical actions on domain restore:

- start domain in paused state

Here we use virDomainCreateXML/virDomainCreate with VIR_DOMAIN_START_PAUSED and VIR_DOMAIN_START_EXPORTABLE set. The latter is a new flag described in the *Nuances* section.

- export domain disks of interest thru NBD for write

Here we use the next xml for virDomainBlockExportStart/virDomainBlockExportStop:

<domainblockexport type="nbd">
  <address type="ip" host="0.0.0.0" port="8000"/>
  <disk name="sda"/>
  <disk name="sdb"/>
</domainblockexport>

- restore them
- remove disks from export
- resume or destroy domain

So to be able to restore we additionally need only VIR_DOMAIN_START_EXPORTABLE to start a domain. The export API is the same, just the xml specifies plain disks without snapshots/checkpoints.

Note that VIR_DOMAIN_START_EXPORTABLE is a kind of workaround. I am not sure this should be in the API. We would not need this flag if qemu let us export disks for write for a freshly started domain in paused state.

Nikolay

12.04.2018 11:58, Nikolay Shirokovskiy wrote:
On 11.04.2018 19:32, Eric Blake wrote:
On 04/03/2018 07:01 AM, Nikolay Shirokovskiy wrote:
Would libvirt ever want to use something other than "sync":"none"? I don't know of use cases for other modes. Looks like "none" is sufficient for snapshot purposes.
It's an interesting question. Some points:

1. It looks unsafe to use an nbd server + backup(sync=none) on the same node; synchronization is needed, like in block/replication, which uses backup_wait_for_overlapping_requests, backup_cow_request_begin, backup_cow_request_end. We have a filter driver for this thing, not yet in upstream.

2. If we use a filter driver anyway, it may be better to not use backup at all, and do all the needed things in a filter driver.

3. It may be interesting to implement something like READ_ONCE for NBD, which means that we will never read these clusters again. And after such a command, we don't need to copy the corresponding clusters to the temporary image if the guest decides to write them (as we know the client already read them and is not going to read them again).
So this is telling qemu to export the temporary qcow2 image created in the point above. An NBD client would see the export getting progressively more blocks with data as the guest continues to write more clusters (as qemu has to copy the data from the checkpoint to the temporary file before updating the main image with the new data). If the NBD client reads a cluster that has not yet been copied by qemu (because the guest has not written to that cluster since the block job started), would it see zeroes, or the same data that the guest still sees?
It would see the same data that the guest still sees (a hole in the temporary image, so we go to the backing image).
Do you have measurements on whether having multiple active bitmaps hurts performance? I'm not yet sure that managing a chain of disabled bitmaps (and merging them as needed for restores) is more or less efficient than
Vova, can you shed some light on this topic?
No. But I have another argument for it: we can drop disabled bitmaps from RAM and store them in qcow2, and load them on demand, to save RAM space. Merging is for backups, not for restores, or what am I missing? Merging increases backup creation time a bit; multiple active bitmaps increase every guest write a bit. Guest writes are for sure more frequent than backups.
managing multiple bitmaps all the time. On the other hand, you do have a point that restore is a less frequent operation than backup, so making backup as lean as possible and putting more work on restore is a reasonable tradeoff, even if it adds complexity to the management for doing restores.
Sorry, I don't understand what the tradeoff is from your words.
"name": "libvirt-d068765e-8b50-4d74-9b72-1e55c663cbf8", "node": "drive-scsi0-0-0-0" }, }, { "type": "block-dirty-bitmap-add" "data": { "name": "libvirt-0044757e-1a2d-4c2c-b92f-bb403309bb17^d068765e-8b50-4d74-9b72-1e55c663cbf8", "node": "drive-scsi0-0-0-0", "persistent": true }, } ] }, }
- delete snapshot 0044757e-1a2d-4c2c-b92f-bb403309bb17 - create snapshot 8fc02db3-166f-4de7-b7aa-1f7303e6162b with checkpoint
- add disk snapshot 8fc02db3-166f-4de7-b7aa-1f7303e6162b to export and bitmap with changes from checkpoint d068765e-8b50-4d74-9b72-1e55c663cbf8 - same as add export without checkpoint, but aditionally - form result bitmap - add bitmap to NBD export
... { "execute": "transaction" "arguments": { "actions": [ { "type": "block-dirty-bitmap-add" "data": { "node": "drive-scsi0-0-0-0", "name": "libvirt-__export_temporary__", "persistent": false }, }, { "type": "x-vz-block-dirty-bitmap-disable" "data": { "node": "drive-scsi0-0-0-0" "name": "libvirt-__export_temporary__", }, }, { "type": "x-vz-block-dirty-bitmap-merge" "data": { "node": "drive-scsi0-0-0-0", "src_name": "libvirt-d068765e-8b50-4d74-9b72-1e55c663cbf8" "dst_name": "libvirt-__export_temporary__", }, }, { "type": "x-vz-block-dirty-bitmap-merge" "data": { "node": "drive-scsi0-0-0-0", "src_name": "libvirt-0044757e-1a2d-4c2c-b92f-bb403309bb17^d068765e-8b50-4d74-9b72-1e55c663cbf# "dst_name": "libvirt-__export_temporary__", }, } ] }, } { "execute": "x-vz-nbd-server-add-bitmap" "arguments": { "name": "sda-8fc02db3-166f-4de7-b7aa-1f7303e6162b" "bitmap": "libvirt-__export_temporary__", "bitmap-export-name": "d068765e-8b50-4d74-9b72-1e55c663cbf8", },
Adding a bitmap to a server would advertise to the NBD client that it can query the "qemu-dirty-bitmap:d068765e-8b50-4d74-9b72-1e55c663cbf8" namespace during NBD_CMD_BLOCK_STATUS, rather than just "base:allocation"?
Yes.
-- Best regards, Vladimir

On 04/12/2018 08:26 AM, Vladimir Sementsov-Ogievskiy wrote:
1. It looks unsafe to use nbd server + backup(sync=none) on same node, synchronization is needed, like in block/replication, which uses backup_wait_for_overlapping_requests, backup_cow_request_begin, backup_cow_request_end. We have a filter driver for this thing, not yet in upstream.
Is it the case that blockdev-backup sync=none can race with read requests on the NBD server? i.e. we can get temporarily inconsistent data before the COW completes? Can you elaborate?
2. If we use filter driver anyway, it may be better to not use backup at all, and do all needed things in a filter driver.
if blockdev-backup sync=none isn't sufficient to get the semantics we want, it may indeed be more appropriate to just leave the entire task to a new filter node.
3. It may be interesting to implement something like READ_ONCE for NBD, which means that we will never read these clusters again. After such a command we don't need to copy the corresponding clusters to the temporary image if the guest decides to write them (as we know the client has already read them and is not going to read them again).
That would be a very interesting optimization indeed; but I don't think we have any kind of infrastructure for such things currently. It's almost like a TRIM on which regions need to perform COW for the BlockSnapshot.

13.04.2018 00:35, John Snow wrote:
On 04/12/2018 08:26 AM, Vladimir Sementsov-Ogievskiy wrote:
1. It looks unsafe to use nbd server + backup(sync=none) on same node, synchronization is needed, like in block/replication, which uses backup_wait_for_overlapping_requests, backup_cow_request_begin, backup_cow_request_end. We have a filter driver for this thing, not yet in upstream. Is it the case that blockdev-backup sync=none can race with read requests on the NBD server?
i.e. we can get temporarily inconsistent data before the COW completes? Can you elaborate?
I'm not sure, but it looks possible:
1. an NBD read starts, finds a hole in the temporary image, decides to read from the active image (or even starts the read) and yields
2. the guest writes to the same area (COW happens, but it doesn't help)
3. the read from (1) resumes and returns invalid (already updated by 2) data
The similar place in block/replication, which also uses backup(sync=none), is protected from this situation.
2. If we use filter driver anyway, it may be better to not use backup at all, and do all needed things in a filter driver. if blockdev-backup sync=none isn't sufficient to get the semantics we want, it may indeed be more appropriate to just leave the entire task to a new filter node.
3. It may be interesting to implement something like READ_ONCE for NBD, which means, that we will never read these clusters again. And after such command, we don't need to copy corresponding clusters to temporary image, if guests decides to write them (as we know, that client already read them and don't going to read again). That would be a very interesting optimization indeed; but I don't think we have any kind of infrastructure for such things currently. It's almost like a TRIM on which regions need to perform COW for the BlockSnapshot.
Hmm, READ+TRIM may be used too. And trim may be naturally implemented in special filter driver. -- Best regards, Vladimir

On 04/13/2018 08:01 AM, Vladimir Sementsov-Ogievskiy wrote:
1. It looks unsafe to use nbd server + backup(sync=none) on same node, synchronization is needed, like in block/replication, which uses backup_wait_for_overlapping_requests, backup_cow_request_begin, backup_cow_request_end. We have a filter driver for this thing, not yet in upstream. Is it the case that blockdev-backup sync=none can race with read requests on the NBD server?
i.e. we can get temporarily inconsistent data before the COW completes? Can you elaborate?
I'm not sure but looks possible:
1. an NBD read starts, finds a hole in the temporary image, decides to read from the active image (or even starts the read) and yields
2. the guest writes to the same area (COW happens, but it doesn't help)
3. the read from (1) resumes and returns invalid (already updated by 2) data
The similar place in block/replication, which also uses backup(sync=none), is protected from this situation.
I'll have to look into this one -- were you seeing problems in practice before you implemented your proprietary filter node? --js

13.04.2018 21:02, John Snow wrote:
On 04/13/2018 08:01 AM, Vladimir Sementsov-Ogievskiy wrote:
1. It looks unsafe to use nbd server + backup(sync=none) on same node, synchronization is needed, like in block/replication, which uses backup_wait_for_overlapping_requests, backup_cow_request_begin, backup_cow_request_end. We have a filter driver for this thing, not yet in upstream. Is it the case that blockdev-backup sync=none can race with read requests on the NBD server?
i.e. we can get temporarily inconsistent data before the COW completes? Can you elaborate? I'm not sure but looks possible:
1. an NBD read starts, finds a hole in the temporary image, decides to read from the active image (or even starts the read) and yields
2. the guest writes to the same area (COW happens, but it doesn't help)
3. the read from (1) resumes and returns invalid (already updated by 2) data
The similar place in block/replication, which also uses backup(sync=none), is protected from this situation. I'll have to look into this one -- were you seeing problems in practice before you implemented your proprietary filter node?
--js
I didn't see problems; I just noted that it is done in block/replication and looked through the corresponding commit messages. -- Best regards, Vladimir

On 04/12/2018 04:58 AM, Nikolay Shirokovskiy wrote:
On 11.04.2018 19:32, Eric Blake wrote:
On 04/03/2018 07:01 AM, Nikolay Shirokovskiy wrote:
[snip]
I'm trying to figure out how BlockCheckpoint and BlockSnapshots relate. Maybe it will be more clear when I read the implementation section below. Is the idea that I can't create a BlockSnapshot without first having a checkpoint available? If so, where does that fit in the <domainblocksnapshot> XML?
No, you can create a snapshot without any available checkpoints. Actually the first snapshot is like that.
Now if you create a snapshot with a checkpoint and then delete the snapshot, the checkpoint remains, so we need an API to delete checkpoints if we wish.
Hmm - OK, you are being careful to label three notions separately:
(1) Checkpoints (which are metadata objects in libvirt supported by bitmaps in QEMU, roughly)
(2) BlockSnapshots (which expose data using checkpoints as metadata)
(3) Backups (which are made by a 3rd party client using a snapshot)
In this case, though, if a snapshot is requested it probably ought to be *prepared* to create a checkpoint in case that snapshot is used by the third party client to make a backup, right? IOW, a snapshot -- though ignorant of how it is used -- can be and often will be used to accomplish an incremental backup and as such must be prepared to manipulate the checkpoints/bitmaps/etc in such a way to be able to make a new checkpoint. Right? [snip]
Earlier, you said that the new virDomainBlockSnapshotPtr are independent, with no relations between them. But here, you are wanting to keep incremental backups related to one another.
Yes, but backups are not snapshots. All backup relationship management is on the client. In the pull backup scheme libvirt is only there to export a snapshotted disk state, optionally with a CBT from some point in time. The client itself makes backups and tracks their relationships.
However, as we use a chain of disabled bitmaps with one active bitmap at the tip of the chain, and qemu does not track their order, we need to do it in libvirt.
Well, you seem to be tracking it in *qemu*, by using the name field. Should we not make a commitment to whether or not we store this lineage information in either qemu OR libvirt, but not distributed across both...?

On 13.04.2018 01:10, John Snow wrote:
On 04/12/2018 04:58 AM, Nikolay Shirokovskiy wrote:
On 11.04.2018 19:32, Eric Blake wrote:
On 04/03/2018 07:01 AM, Nikolay Shirokovskiy wrote:
[snip]
I'm trying to figure out how BlockCheckpoint and BlockSnapshots relate. Maybe it will be more clear when I read the implementation section below. Is the idea that I can't create a BlockSnapshot without first having a checkpoint available? If so, where does that fit in the <domainblocksnapshot> XML?
No, you can create snapshot without available checkpoints. Actually the first snapshot is like that.
Now if you create a snapshot with checkpoint and then delete the snapshot the checkpoint remains, so we need an API to delete them if we wish.
Hmm - OK, you are being careful to label three notions separately:
(1) Checkpoints (which are metadata objects in libvirt supported by bitmaps in QEMU, roughly) (2) BlockSnapshots (which expose data using checkpoints as metadata) (3) Backups (which are made by a 3rd party client using a snapshot)
In this case, though, if a snapshot is requested it probably ought to be *prepared* to create a checkpoint in case that snapshot is used by the third party client to make a backup, right?
Block snapshots can be used without checkpoints altogether. Imagine we always make full backups. It looks like this: create a block snapshot without a checkpoint, export it, make a backup, remove it from export and delete the snapshot. No checkpoints appear. On the next full backup the steps are the same.
IOW, a snapshot -- though ignorant of how it is used -- can be and often will be used to accomplish an incremental backup and as such must be prepared to manipulate the checkpoints/bitmaps/etc in such a way to be able to make a new checkpoint.
Right?
I don't understand what these preparations are. If we want to make incremental backups we need to create every snapshot with a checkpoint so that later we can export the CBT between the current snapshot and a snapshot in the past. If we create a snapshot with a checkpoint then internally it means we create a bitmap which starts at this point in time. Later we can use this bitmap (and probably other bitmaps corresponding to checkpoints in the middle) to calculate the CBT we are interested in.
[snip]
Earlier, you said that the new virDomainBlockSnapshotPtr are independent, with no relations between them. But here, you are wanting to keep incremental backups related to one another.
Yes, but backups are not snapshots. All backup relation management are on client. In pull backup scheme libvirt is only here to export a snapshotted disk state with optionally a CBT from some point in time. Client itself makes backups and track their relationships.
However as we use chain of disabled bitmaps with one active bitmap on tip of the chain and qemu does not track their order we need to do it in libvirt.
Well, you seem to be tracking it in *qemu*, by using the name field. Should we not make a commitment to whether or not we store this lineage information in either qemu OR libvirt, but not distributed across both...?
I don't know the actual use cases well enough to decide. A commitment that this metadata is stored in the disks as proposed can be useful IMHO, so that mgmt can expect that simply reinserting the disks into a different domain (recreated, for example) keeps all checkpoints.

On 04/13/2018 05:47 AM, Nikolay Shirokovskiy wrote:
However as we use chain of disabled bitmaps with one active bitmap on tip of the chain and qemu does not track their order we need to do it in libvirt.
Well, you seem to be tracking it in *qemu*, by using the name field. Should we not make a commitment to whether or not we store this lineage information in either qemu OR libvirt, but not distributed across both...?
I don't know actual use cases to decide. A commitment that this meta is stored in disks like proposed can be useful IMHO so that mgmt can expect that dumb reinserting disks to a different domain (recreated for example) keep all checkpoints.
What I am asking rather indirectly is if it would be useful to elevate this to a *real* metadata field in QEMU so that you don't have to hack it by using the name field, OR cease using the name field for this purpose and store the data entirely within libvirt. It sounds like you want the flexibility of the first option. I think Vladimir had an argument against this though, I need to go back and read it. --js

On 04/13/2018 04:47 AM, Nikolay Shirokovskiy wrote:
Earlier, you said that the new virDomainBlockSnapshotPtr are independent, with no relations between them. But here, you are wanting to keep incremental backups related to one another.
Yes, but backups are not snapshots. All backup relation management are on client. In pull backup scheme libvirt is only here to export a snapshotted disk state with optionally a CBT from some point in time. Client itself makes backups and track their relationships.
However as we use chain of disabled bitmaps with one active bitmap on tip of the chain and qemu does not track their order we need to do it in libvirt.
Well, you seem to be tracking it in *qemu*, by using the name field. Should we not make a commitment to whether or not we store this lineage information in either qemu OR libvirt, but not distributed across both...?
I don't know actual use cases to decide. A commitment that this meta is stored in disks like proposed can be useful IMHO so that mgmt can expect that dumb reinserting disks to a different domain (recreated for example) keep all checkpoints.
I'm still trying to figure out how to represent checkpoint metadata in libvirt XML; I'm not yet sure whether exposing it directly in <domain> makes sense, or whether checkpoints should be more like <domainsnapshot> in that they are a separate object, each containing a copy of the <domain> at the time they were created, and allowing parent->child relationships between objects that were created along the same guest-visible timeline of events. But your comment about wanting to store lineage information between checkpoints in the qcow2 metadata, so that you can recreate that lineage when inserting that qcow2 file into a different <domain>, feels rather fragile. With <domainsnapshot>, libvirt has APIs for recreating snapshot objects to (re-)teach libvirt about state that was copied from some other location. It seems like having a similar way to recreate checkpoint objects would be the proper way to plug in a qcow2 file with persistent bitmaps already existing, in order to get libvirt to know the proper <checkpoint> relationships that it can now use from that qcow2 file. -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org

On 18.04.2018 18:24, Eric Blake wrote:
On 04/13/2018 04:47 AM, Nikolay Shirokovskiy wrote:
Earlier, you said that the new virDomainBlockSnapshotPtr are independent, with no relations between them. But here, you are wanting to keep incremental backups related to one another.
Yes, but backups are not snapshots. All backup relation management are on client. In pull backup scheme libvirt is only here to export a snapshotted disk state with optionally a CBT from some point in time. Client itself makes backups and track their relationships.
However as we use chain of disabled bitmaps with one active bitmap on tip of the chain and qemu does not track their order we need to do it in libvirt.
Well, you seem to be tracking it in *qemu*, by using the name field. Should we not make a commitment to whether or not we store this lineage information in either qemu OR libvirt, but not distributed across both...?
I don't know actual use cases to decide. A commitment that this meta is stored in disks like proposed can be useful IMHO so that mgmt can expect that dumb reinserting disks to a different domain (recreated for example) keep all checkpoints.
I'm still trying to figure out how to represent checkpoint metadata in libvirt XML; I'm not yet sure whether exposing it directly in <domain> makes sense, or whether checkpoints should be more like <domainsnapshot> in that they are a separate object, each containing a copy of the <domain> at the time they were created, and allowing parent->child relationships between objects that were created along the same guest-visible timeline of events. But your comment about wanting to store lineage information between checkpoints in the qcow2 metadata, so that you can recreate that lineage when inserting that qcow2 file into a different <domain>, feels rather fragile.
With <domainsnapshot>, libvirt has APIs for recreating snapshot objects to (re-)teach libvirt about state that was copied from some other location. It seems like having a similar way to recreate checkpoint objects would be the proper way to plug in a qcow2 file with persistent bitmaps already existing, in order to get libvirt to know the proper <checkpoint> relationships that it can now use from that qcow2 file.
It is proposed in [1] to introduce checkpoints and their relationships into qemu and qcow2, so there is no need to have libvirt metadata for that. Also, checkpoints themselves are of course not enough to restore a domain, so does it make sense to store the domain config with checkpoints? It looks like a backup client responsibility. In case of push backups storing the domain config looks useful. But we can distinguish between checkpoints and backups. [1] https://www.redhat.com/archives/libvir-list/2018-April/msg01306.html

03.04.2018 15:01, Nikolay Shirokovskiy wrote:
Hi, all.
This is another RFC on pull backup API. This API provides means to read domain disks in a snapshotted state so that client can back them up as well as means to write domain disks to revert them to backed up state. The previous version of RFC is [1]. I'll also describe the API implementation details to shed light on misc qemu dirty bitmap commands usage.
[...]
3. Checkpoints (the most interesting part)
First a few facts about qemu dirty bitmaps.
A bitmap can be either in active or disabled state. In the disabled state it does not get changed on guest writes; conversely, in the active state it tracks guest writes. This implementation uses an approach with only one active bitmap at a time. This should reduce the guest write penalty in the presence of checkpoints.
And this gives us a great opportunity to store disabled bitmaps in qcow2, not in RAM
So on the first snapshot we create bitmap B1. Now it tracks changes from snapshot 1. On the second snapshot we create bitmap B2 and disable bitmap B1, and so on. Now bitmap B1 keeps changes from snapshot 1 to snapshot 2, B2 keeps changes from snapshot 2 to snapshot 3, and so on. The last bitmap is active and collects the disk changes after the latest snapshot.
Getting the changed-blocks bitmap from some checkpoint in the past till the current snapshot is quite simple in this scheme. For example, if the last snapshot is 7, then to get the changes from snapshot 3 to the latest snapshot we need to merge bitmaps B3, B4, B5 and B6. The merge is just a logical OR on the bitmap bits.
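The merge-set arithmetic spelled out as a tiny sketch (plain illustration, no qemu interaction): changes from snapshot K to the latest snapshot N are the OR of bitmaps B_K .. B_{N-1}.

def bitmaps_to_merge(k, n):
    # bitmaps whose logical OR gives the changes from snapshot k to snapshot n (k < n)
    return ["B%d" % i for i in range(k, n)]

print(bitmaps_to_merge(3, 7))   # ['B3', 'B4', 'B5', 'B6']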
Deleting a checkpoint somewhere in the middle of the checkpoint sequence requires merging the corresponding bitmap into the previous bitmap in this scheme.
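A sketch of that deletion in terms of the experimental QMP commands listed above, again using a hypothetical qmp() helper; bitmap_name(i) stands in for whatever name libvirt's scheme gives bitmap B_i. Only the command shapes follow the examples in this RFC.

def delete_middle_checkpoint(qmp, node, bitmap_name, i):
    # fold the changes tracked since checkpoint i into the previous interval ...
    qmp("x-vz-block-dirty-bitmap-merge", {
        "node": node,
        "src_name": bitmap_name(i),
        "dst_name": bitmap_name(i - 1),
    })
    # ... and drop the now redundant bitmap
    qmp("x-vz-block-dirty-bitmap-remove", {
        "node": node,
        "name": bitmap_name(i),
    })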
We use persistent bitmaps in the implementation. This means that upon qemu process termination the bitmaps are saved in the disk images' metadata and restored on the next qemu process start.
Note that this currently works only for qcow2 disks
This makes checkpoints a persistent property, that is, we keep them across domain starts/stops. Qemu does not try hard to keep bitmaps: if something goes wrong upon save, the bitmap is dropped. The same applies to the migration process. For the backup process this is not critical: if we don't discover a checkpoint we can always make a full backup. Also, qemu provides no special means to track the order of bitmaps. These facts are critical for an implementation with one active bitmap at a time. We need the right order of bitmaps upon merge - to get block changes from snapshot K (K < N) to snapshot N we need to merge bitmaps B_{K}, ..., B_{N-1}. Also, if one of the bitmaps to be merged is missing we cannot calculate the desired block changes either.
[...]
*Restore operation nuances*
As written above, to restore a domain one needs to start it in paused state, export the domain's disks for write and write them from the backup. However qemu currently does not allow exporting disks for write even for a domain that has never started its guest CPUs. We have an experimental qemu command line option -x-vz-nbd-restore (passed together with the -incoming option) to fix this.
Yes. "-incoming x-vz-nbd-restore", it works like -incoming defer, but without logic related to qmp migrate-incoming. So, it is just starting a vm in paused mode with inactive disks. Then we can call qmp cont to resume.
*Links*
[1] Previous version of RFC https://www.redhat.com/archives/libvir-list/2017-November/msg00514.html
-- Best regards, Vladimir

On 04/03/2018 08:01 AM, Nikolay Shirokovskiy wrote:
Hi, all.
This is another RFC on pull backup API. This API provides means to read domain disks in a snapshotted state so that client can back them up as well as means to write domain disks to revert them to backed up state. The previous version of RFC is [1]. I'll also describe the API implementation details to shed light on misc qemu dirty bitmap commands usage.
This API does not use existent disks snapshots. Instead it introduces snapshots provided by qemu's blockdev-backup command. The reason is we need snapshotted disk state only temporarily for duration of backup operation and newly introduced snapshots can be easily discarded at the end of operation without block commit operation. Technically difference is next. On usual snapshot we create new image backed by original and all new data goes to the new image thus original image stays in a snapshotted state. In temporary snapshots we create new image backed by original and all new data still goes to the original image but before new data is written old data to be overwritten is popped out to the new image thus we get snapshotted state thru new image.
Disks snapshots as well as disks itself are avaiable to read/write thru qemu NBD server.
[snip!]
Do you think it's possible to characterize this API proposal as two mechanisms:
(1) A mechanism for creating and manipulating "checkpoints" -- which are book-ended by bitmap objects in QEMU -- implemented by the creation, deletion, 'disabling' and 'merging' of bitmaps, and
(2) A mechanism for the consumption of said checkpoints via NBD / the "fleecing" mechanisms that allow a live export of a static view of the disk at that time (block snapshots + NBD exports)
If this is the case, do you think it is possible to consider (1) and (2) somewhat orthogonal items -- in so far as it might be possible to add support to libvirt directly to add push-model support for writing out these checkpoints? i.e. once you have created a temporary snapshot and merged the various component bitmaps into it, instead of creating an ephemeral block snapshot and exporting it via NBD, we request a `blockdev-backup` with a libvirt-specified target instead?
You don't have to add support for this right away, but I would really enjoy if any API we check in here has the capacity to support both push-and-pull paradigms without getting too ugly. Does that sound like it can easily fit in with your designs so far?

On 13.04.2018 00:16, John Snow wrote:
On 04/03/2018 08:01 AM, Nikolay Shirokovskiy wrote:
Hi, all.
This is another RFC on pull backup API. This API provides means to read domain disks in a snapshotted state so that client can back them up as well as means to write domain disks to revert them to backed up state. The previous version of RFC is [1]. I'll also describe the API implementation details to shed light on misc qemu dirty bitmap commands usage.
This API does not use existent disks snapshots. Instead it introduces snapshots provided by qemu's blockdev-backup command. The reason is we need snapshotted disk state only temporarily for duration of backup operation and newly introduced snapshots can be easily discarded at the end of operation without block commit operation. Technically difference is next. On usual snapshot we create new image backed by original and all new data goes to the new image thus original image stays in a snapshotted state. In temporary snapshots we create new image backed by original and all new data still goes to the original image but before new data is written old data to be overwritten is popped out to the new image thus we get snapshotted state thru new image.
Disks snapshots as well as disks itself are avaiable to read/write thru qemu NBD server.
[snip!]
Do you think it's possible to characterize this API proposal as two mechanisms:
(1) A mechanism for creating and manipulating "checkpoints" -- which are book-ended by bitmap objects in QEMU -- implemented by the creation, deletion, 'disabling' and 'merging' of bitmaps, and
(2) A mechanism for the consumption of said checkpoints via NBD / the "fleecing" mechanisms that allow a live export of a static view of the disk at that time (block snapshots + NBD exports)
I can't share this view because checkpoints and snapshots are created in one transaction in the pull scheme, so you can't move these two into different mechanisms. I'd rather see 2 mechanisms here, at least for the pull scheme:
1. create snapshots (and optionally checkpoints)
2. export snapshots
If this is the case, do you think it is possible to consider (1) and (2) somewhat orthogonal items -- in so far as it might be possible to add support to libvirt directly to add push-model support for writing out these checkpoints?
i.e.
once you have created a temporary snapshot and merged the various component bitmaps into it, instead of creating an ephemeral block snapshot and exporting it via NBD, we request a `blockdev-backup` with a libvirt-specified target instead?
You don't have to add support for this right away, but I would really enjoy if any API we check in here has the capacity to support both push-and-pull paradigms without getting too ugly.
Does that sound like it can easily fit in with your designs so far?
I think the push scheme requires a 3rd API (the 1st is fleece snapshots, the 2nd is exporting snapshots). First, push backup has nothing to do with exporting, of course. Second, contrary to fleece snapshots, it will require an additional parameter - a checkpoint in the past in case of incremental backup. It also has a quite different image parameter: in case of a fleece snapshot the fleece image will only store a small delta even for full backups, while in case of a push backup the image will store the full disk for full backups. Nikolay

13.04.2018 12:16, Nikolay Shirokovskiy wrote:
On 13.04.2018 00:16, John Snow wrote:
On 04/03/2018 08:01 AM, Nikolay Shirokovskiy wrote:
Hi, all.
This is another RFC on pull backup API. This API provides means to read domain disks in a snapshotted state so that client can back them up as well as means to write domain disks to revert them to backed up state. The previous version of RFC is [1]. I'll also describe the API implementation details to shed light on misc qemu dirty bitmap commands usage.
This API does not use existent disks snapshots. Instead it introduces snapshots provided by qemu's blockdev-backup command. The reason is we need snapshotted disk state only temporarily for duration of backup operation and newly introduced snapshots can be easily discarded at the end of operation without block commit operation. Technically difference is next. On usual snapshot we create new image backed by original and all new data goes to the new image thus original image stays in a snapshotted state. In temporary snapshots we create new image backed by original and all new data still goes to the original image but before new data is written old data to be overwritten is popped out to the new image thus we get snapshotted state thru new image.
Disks snapshots as well as disks itself are avaiable to read/write thru qemu NBD server. [snip!]
Do you think it's possible to characterize this API proposal as two mechanisms:
(1) A mechanism for creating and manipulating "checkpoints" -- which are book-ended by bitmap objects in QEMU -- implemented by the creation, deletion, 'disabling' and 'merging' of bitmaps, and
(2) A mechanism for the consumption of said checkpoints via NBD / the "fleecing" mechanisms that allow a live export of a static view of the disk at that time (block snapshots + NBD exports)
I can't share this view because checkpoints and snapshots are created in one transation in pull scheme so you can't not move these two to different mechs.
I'll rather see 2 mechanism here at least for pull scheme.
1. create snapshots (and optionally checkpoints) 2. export snapshots
If this is the case, do you think it is possible to consider (1) and (2) somewhat orthogonal items -- in so far as it might be possible to add support to libvirt directly to add push-model support for writing out these checkpoints?
i.e.
once you have created a temporary snapshot and merged the various component bitmaps into it, instead of creating an ephemeral block snapshot and exporting it via NBD, we request a `blockdev-backup` with a libvirt-specified target instead?
You don't have to add support for this right away, but I would really enjoy if any API we check in here has the capacity to support both push-and-pull paradigms without getting too ugly.
Does that sound like it can easily fit in with your designs so far?
I think push scheme require 3rd (1st is fleece snapshots, 2nd is exporting snapshots) API. First push backup has nothing to do with exporting of course. Second contrary to fleece snapshots it will require additional parameter - a checkpoint in past in case of incremental backup. It also have quite different image parameter. In case of fleece snapshot fleece image will only store small delta even in case of full backups. In case of push backup image will store full disk in case of full backups.
Nikolay
Hmm, to use checkpoints with push backups we just need to start normal incremental backups in a transaction with checkpoint creation. So a checkpoint is a separate thing, but it is useless if we didn't create some kind of backup (push or pull) in the same transaction with it. If we implement checkpoint support in qemu in the future, the libvirt API realization will become simpler. But anyway, we should not create checkpoints separately in libvirt, as they would be useless in that case. -- Best regards, Vladimir
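For illustration, a hedged sketch of what is described here: a single QMP transaction that creates the bitmap backing a new checkpoint and starts a push-mode incremental backup at the same point in time. The qmp() helper, the bitmap names and the target path are placeholders; block-dirty-bitmap-add and drive-backup (with sync=incremental and a bitmap) are existing transaction actions, but how the bitmaps map onto checkpoints is this RFC's scheme, not something qemu defines.

qmp("transaction", {"actions": [
    # bitmap that will back the checkpoint being created now
    {"type": "block-dirty-bitmap-add",
     "data": {"node": "drive-scsi0-0-0-0",
              "name": "bitmap-for-new-checkpoint",
              "persistent": True}},
    # push-mode incremental backup of everything dirtied since the previous
    # checkpoint; the target image is assumed to be pre-created (e.g. backed
    # by the previous backup), hence mode=existing
    {"type": "drive-backup",
     "data": {"device": "drive-scsi0-0-0-0",
              "sync": "incremental",
              "bitmap": "bitmap-for-previous-checkpoint",
              "target": "/path/to/incremental0.qcow2",
              "format": "qcow2",
              "mode": "existing"}},
]})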

On 04/13/2018 05:16 AM, Nikolay Shirokovskiy wrote:
On 13.04.2018 00:16, John Snow wrote:
On 04/03/2018 08:01 AM, Nikolay Shirokovskiy wrote:
Hi, all.
This is another RFC on pull backup API. This API provides means to read domain disks in a snapshotted state so that client can back them up as well as means to write domain disks to revert them to backed up state. The previous version of RFC is [1]. I'll also describe the API implementation details to shed light on misc qemu dirty bitmap commands usage.
This API does not use existent disks snapshots. Instead it introduces snapshots provided by qemu's blockdev-backup command. The reason is we need snapshotted disk state only temporarily for duration of backup operation and newly introduced snapshots can be easily discarded at the end of operation without block commit operation. Technically difference is next. On usual snapshot we create new image backed by original and all new data goes to the new image thus original image stays in a snapshotted state. In temporary snapshots we create new image backed by original and all new data still goes to the original image but before new data is written old data to be overwritten is popped out to the new image thus we get snapshotted state thru new image.
Disks snapshots as well as disks itself are avaiable to read/write thru qemu NBD server.
[snip!]
Do you think it's possible to characterize this API proposal as two mechanisms:
(1) A mechanism for creating and manipulating "checkpoints" -- which are book-ended by bitmap objects in QEMU -- implemented by the creation, deletion, 'disabling' and 'merging' of bitmaps, and
(2) A mechanism for the consumption of said checkpoints via NBD / the "fleecing" mechanisms that allow a live export of a static view of the disk at that time (block snapshots + NBD exports)
I can't share this view because checkpoints and snapshots are created in one transation in pull scheme so you can't not move these two to different mechs.
That's not a problem - transactions are comprised of elementary actions, so it's okay to make an artificial distinction between half of the actions and half of the others if it aids in the composition of other transaction types.
I'll rather see 2 mechanism here at least for pull scheme.
1. create snapshots (and optionally checkpoints) 2. export snapshots
You're thinking more of the Libvirt API calls instead of the component mechanisms these API manipulate, I think.
If this is the case, do you think it is possible to consider (1) and (2) somewhat orthogonal items -- in so far as it might be possible to add support to libvirt directly to add push-model support for writing out these checkpoints?
i.e.
once you have created a temporary snapshot and merged the various component bitmaps into it, instead of creating an ephemeral block snapshot and exporting it via NBD, we request a `blockdev-backup` with a libvirt-specified target instead?
You don't have to add support for this right away, but I would really enjoy if any API we check in here has the capacity to support both push-and-pull paradigms without getting too ugly.
Does that sound like it can easily fit in with your designs so far?
I think push scheme require 3rd (1st is fleece snapshots, 2nd is exporting snapshots) API. First push backup has nothing to do with exporting of course. Second contrary to fleece snapshots it will require additional parameter - a checkpoint in past in case of incremental backup. It also have quite different image parameter. In case of fleece snapshot fleece image will only store small delta even in case of full backups. In case of push backup image will store full disk in case of full backups.
Nikolay
That doesn't sound too crazy. As long as the idea of a "checkpoint" can be re-used for push-model backups, I am pretty happy with the design overall as presented, once we iron out some of the technicalities, like:
(1) What do we name the API calls?
(2) Do we store metadata in the bitmap names?
(3) What does the XML look like?
etc. -- the general idea seems okay to me AFAICT. Eric Blake is currently on a brief leave but will be back Tuesday. I think if you can resume discussions with Daniel Berrange and Eric Blake on the API design we'll be able to make progress here. Thanks for your work so far, --John

On Tue, Apr 03, 2018 at 03:01:22PM +0300, Nikolay Shirokovskiy wrote:
*Temporary snapshot API*
In previous version it is called 'Fleece API' after qemu terms and I'll still use BlockSnapshot prefix for commands as in previous RFC instead of TmpSnapshots which I inclined more now.
virDomainBlockSnapshotPtr virDomainBlockSnapshotCreateXML(virDomainPtr domain, const char *xmlDesc, unsigned int flags);
virDomainBlockSnapshotDelete(virDomainBlockSnapshotPtr snapshot, unsigned int flags);
virDomainBlockSnapshotList(virDomainPtr domain, virDomainBlockSnapshotPtr **snaps, unsigned int flags);
virDomainBlockSnapshotGetXMLDesc(virDomainBlockSnapshotPtr snapshot, unsigned int flags);
virDomainBlockSnapshotPtr virDomainBlockSnapshotLookupByName(virDomainPtr domain, const char *name, unsigned int flags);
Here is an example of snapshot xml description:
<domainblocksnapshot> <name>d068765e-8b50-4d74-9b72-1e55c663cbf8</name> <disk name='sda' type="file"> <fleece file="/tmp/snapshot-a.hdd"/>
Can we just call this <source file="....."/> which is how we name things in normal <disk> elements. 'fleece' in particular is an awful name giving no indication of what is being talked about unless you've already read the QEMU low levels, so I'd rather we don't use the word "fleece" anywhere in API or XML or docs at the libvirt level.
</disk> <disk name='sdb' type="file"> <fleece file="/tmp/snapshot-b.hdd"/> </disk> </domainblocksnapshot>
Temporary snapshots are indepentent thus they are not organized in tree structure as usual snapshots, so the 'list snapshots' and 'lookup' function will suffice.
Qemu can track what disk's blocks are changed from snapshotted state so on next backup client can backup only changed blocks. virDomainBlockSnapshotCreateXML accepts VIR_DOMAIN_BLOCK_SNAPSHOT_CREATE_CHECKPOINT flag to turn this option for snapshot which means to track changes from this particular snapshot. I used checkpoint term and not [dirty] bitmap because many qemu dirty bitmaps are used to provide changed blocks from the given checkpoint to current snapshot in current implementation (see *Implemenation* section for more details). Also bitmap keeps block changes and thus itself changes in time and checkpoint is a more statical terms means you can query changes from that moment in time.
Checkpoints are visible in active domain xml:
<disk type='file' device='disk'> .. <target dev='sda' bus='scsi'/> <alias name='scsi0-0-0-0'/> <checkpoint name="93a5c045-6457-2c09-e56c-927cdf34e178"> <checkpoint name="5768a388-c1c4-414c-ac4e-eab216ba7c0c">
How are these checkpoints recorded / associated with actual storage on disk? What happens across restarts of the VM if this is only in the live XML?
.. </disk>
*Block export API*
I guess it is natural to treat qemu NBD server as a domain device. So
A domain device is normally something that is related to the guest machine ABI. This is entirely invisible to the guest, just a backend concept, so this isn't really a natural fit as a domain device.
we can use virDomainAttachDeviceFlags/virDomainDetachDeviceFlags API to start/stop NBD server and virDomainUpdateDeviceFlags to add/delete disks to be exported. While I'm have no doubts about start/stop operations using virDomainUpdateDeviceFlags looks a bit inconvinient so I decided to add a pair of API functions just to add/delete disks to be exported:
int virDomainBlockExportStart(virDomainPtr domain, const char *xmlDesc, unsigned int flags);
int virDomainBlockExportStop(virDomainPtr domain, const char *xmlDesc, unsigned int flags);
I guess more appropriate names are virDomainBlockExportAdd and virDomainBlockExportRemove but as I already have a patch series implementing pull backups with these names I would like to keep these names now.
These names also reflect that in the implementation I decided to start/stop NBD server in a lazy manner. While it is a bit innovative for libvirt API I guess it is convinient because to refer NBD server to add/remove disks to we need to identify it thru it's parameters like type, address etc until we introduce some device id (which does not looks consistent with current libvirt design). So it looks like we have all parameters to start/stop server in the frame of these calls so why have extra API calls just to start/stop server manually. If we later need to have NBD server without disks we can perfectly support virDomainAttachDeviceFlags/virDomainDetachDeviceFlags.
Here is example of xml to add/remove disks (specifying checkpoint attribute is not needed for removing disks of course):
<domainblockexport type="nbd"> <address type="ip" host="0.0.0.0" port="8000"/> <disk name="sda" snapshot="0044757e-1a2d-4c2c-b92f-bb403309bb17" checkpoint="d068765e-8b50-4d74-9b72-1e55c663cbf8"/> <disk name="sdb" snapshot="0044757e-1a2d-4c2c-b92f-bb403309bb17" checkpoint="d068765e-8b50-4d74-9b72-1e55c663cbf8"/>
What do all these UUIDs refer to ?
</domainblockexport>
And this is how this NBD server will be exposed in domain xml:
<devices> ... <blockexport type="nbd"> <address type="ip" host="0.0.0.0" port="8000"/> <disk name="sda" snapshot="0044757e-1a2d-4c2c-b92f-bb403309bb17" checkpoint="d068765e-8b50-4d74-9b72-1e55c663cbf8" exportname="sda-0044757e-1a2d-4c2c-b92f-bb403309bb17"/> <disk name="sdb" snapshot="0044757e-1a2d-4c2c-b92f-bb403309bb17" checkpoint="d068765e-8b50-4d74-9b72-1e55c663cbf8 exportname="sdb-0044757e-1a2d-4c2c-b92f-bb403309bb17"/> </blockexport>
This feels pretty awkward to me - as mentioned above this is not really guest ABI related, so having it under <devices> is not a good fit. I question whether the NBD server address should be exposed in the XML at all. This is a transient thing that is started/stopped on demand via the APIs you show above. So I'd suggest we just have an API to query the listening address of the NBD server. At most in the XML we could have an element under each respective existing <disk/> element to say whether it is exported or not, instead of adding new <disk/> elements in a separate place. Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On 13.04.2018 12:07, Daniel P. Berrangé wrote:
On Tue, Apr 03, 2018 at 03:01:22PM +0300, Nikolay Shirokovskiy wrote:
*Temporary snapshot API*
In previous version it is called 'Fleece API' after qemu terms and I'll still use BlockSnapshot prefix for commands as in previous RFC instead of TmpSnapshots which I inclined more now.
virDomainBlockSnapshotPtr virDomainBlockSnapshotCreateXML(virDomainPtr domain, const char *xmlDesc, unsigned int flags);
virDomainBlockSnapshotDelete(virDomainBlockSnapshotPtr snapshot, unsigned int flags);
virDomainBlockSnapshotList(virDomainPtr domain, virDomainBlockSnapshotPtr **snaps, unsigned int flags);
virDomainBlockSnapshotGetXMLDesc(virDomainBlockSnapshotPtr snapshot, unsigned int flags);
virDomainBlockSnapshotPtr virDomainBlockSnapshotLookupByName(virDomainPtr domain, const char *name, unsigned int flags);
Here is an example of snapshot xml description:
<domainblocksnapshot> <name>d068765e-8b50-4d74-9b72-1e55c663cbf8</name> <disk name='sda' type="file"> <fleece file="/tmp/snapshot-a.hdd"/>
Can we just call this <source file="....."/> which is how we name things in normal <disk> elements. 'fleece' in particular is an awful name giving no indication of what is being talked about unless you've already read the QEMU low levels, so I'd rather we don't use the word "fleece" anywhere in API or XML or docs at the libvirt level.
It would be the easiest thing to do) Let me explain. "source" in plain external snapshots, for example, feels awkward to me. It reads like "make a snapshot of disk sda whose source file is like that". IMHO it would be better if the xml read like "make a snapshot of disk sda and put it into this dest|target file". Then for a block snapshot the xml would read like "make a snapshot of disk sda and put the fleece there". Fleece may be a new term, but it only costs one or two sentences to define it. And it is better to have this definition so that the user knows the nature of this image and has correct assumptions about its size, lifetime... If the word fleece itself is unfortunate then we can coin another one. It looks like "source" takes root in the domain xml where it reads well.
</disk> <disk name='sdb' type="file"> <fleece file="/tmp/snapshot-b.hdd"/> </disk> </domainblocksnapshot>
Temporary snapshots are indepentent thus they are not organized in tree structure as usual snapshots, so the 'list snapshots' and 'lookup' function will suffice.
Qemu can track what disk's blocks are changed from snapshotted state so on next backup client can backup only changed blocks. virDomainBlockSnapshotCreateXML accepts VIR_DOMAIN_BLOCK_SNAPSHOT_CREATE_CHECKPOINT flag to turn this option for snapshot which means to track changes from this particular snapshot. I used checkpoint term and not [dirty] bitmap because many qemu dirty bitmaps are used to provide changed blocks from the given checkpoint to current snapshot in current implementation (see *Implemenation* section for more details). Also bitmap keeps block changes and thus itself changes in time and checkpoint is a more statical terms means you can query changes from that moment in time.
Checkpoints are visible in active domain xml:
<disk type='file' device='disk'> .. <target dev='sda' bus='scsi'/> <alias name='scsi0-0-0-0'/> <checkpoint name="93a5c045-6457-2c09-e56c-927cdf34e178"> <checkpoint name="5768a388-c1c4-414c-ac4e-eab216ba7c0c">
How are these checkpoints recorded / associated to actual storage on disk ? What happens across restarts of the VM if this is only in the live XML.
Checkpoints reside entirely in the qcow2 image. Internally they are represented as dirty bitmaps with specially constructed names (the name scheme is explained in the *checkpoints* subsection of the *implementation details* section). After a VM restart the checkpoints are reread from qemu. Hmm, it strikes me that it may be a good idea to provide checkpoint info in the domain stats rather than the domain xml, just like image size etc. On the other hand, the disk backing chain is expanded in the live xml, so having checkpoints in the domain xml is not too unexpected...
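A hedged sketch of that rediscovery step: list the persistent bitmaps qemu loaded from the qcow2 images and pick out the libvirt-managed ones. query-block and its dirty-bitmaps field are real QMP; the qmp() helper and the assumption that libvirt's bitmap names start with "libvirt-" (as in the QMP examples earlier in this thread) are illustrative only.

def discover_checkpoint_bitmaps(qmp):
    # map device name -> names of libvirt-managed bitmaps found on it
    found = {}
    for dev in qmp("query-block"):
        names = [b.get("name", "") for b in dev.get("dirty-bitmaps", [])]
        found[dev["device"]] = [n for n in names if n.startswith("libvirt-")]
    return found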
.. </disk>
*Block export API*
I guess it is natural to treat qemu NBD server as a domain device. So
A domain device is normally something that is related to the guest machine ABI. This is entirely invisible to the guest, just a backend concept, so this isn't really a natural fit as a domain device.
I have VNC in mind as a precedent.
we can use virDomainAttachDeviceFlags/virDomainDetachDeviceFlags API to start/stop NBD server and virDomainUpdateDeviceFlags to add/delete disks to be exported. While I'm have no doubts about start/stop operations using virDomainUpdateDeviceFlags looks a bit inconvinient so I decided to add a pair of API functions just to add/delete disks to be exported:
int virDomainBlockExportStart(virDomainPtr domain, const char *xmlDesc, unsigned int flags);
int virDomainBlockExportStop(virDomainPtr domain, const char *xmlDesc, unsigned int flags);
I guess more appropriate names are virDomainBlockExportAdd and virDomainBlockExportRemove but as I already have a patch series implementing pull backups with these names I would like to keep these names now.
These names also reflect that in the implementation I decided to start/stop NBD server in a lazy manner. While it is a bit innovative for libvirt API I guess it is convinient because to refer NBD server to add/remove disks to we need to identify it thru it's parameters like type, address etc until we introduce some device id (which does not looks consistent with current libvirt design). So it looks like we have all parameters to start/stop server in the frame of these calls so why have extra API calls just to start/stop server manually. If we later need to have NBD server without disks we can perfectly support virDomainAttachDeviceFlags/virDomainDetachDeviceFlags.
Here is example of xml to add/remove disks (specifying checkpoint attribute is not needed for removing disks of course):
<domainblockexport type="nbd"> <address type="ip" host="0.0.0.0" port="8000"/> <disk name="sda" snapshot="0044757e-1a2d-4c2c-b92f-bb403309bb17" checkpoint="d068765e-8b50-4d74-9b72-1e55c663cbf8"/> <disk name="sdb" snapshot="0044757e-1a2d-4c2c-b92f-bb403309bb17" checkpoint="d068765e-8b50-4d74-9b72-1e55c663cbf8"/>
What do all these UUIDs refer to ?
Sorry for the UUIDs instead of human names, my bad. This xml exports snapshot 0044757e-1a2d-4c2c-b92f-bb403309bb17 of disk sda and the CBT from the point in time referred to by checkpoint d068765e-8b50-4d74-9b72-1e55c663cbf8 to the point in time referred to by the snapshot being exported (0044757e-1a2d-4c2c-b92f-bb403309bb17).
</domainblockexport>
And this is how this NBD server will be exposed in domain xml:
<devices> ... <blockexport type="nbd"> <address type="ip" host="0.0.0.0" port="8000"/> <disk name="sda" snapshot="0044757e-1a2d-4c2c-b92f-bb403309bb17" checkpoint="d068765e-8b50-4d74-9b72-1e55c663cbf8" exportname="sda-0044757e-1a2d-4c2c-b92f-bb403309bb17"/> <disk name="sdb" snapshot="0044757e-1a2d-4c2c-b92f-bb403309bb17" checkpoint="d068765e-8b50-4d74-9b72-1e55c663cbf8 exportname="sdb-0044757e-1a2d-4c2c-b92f-bb403309bb17"/> </blockexport>
This feels pretty awkward to me - as mentioned above this is not really guest ABI related to having it under <devices> is not a good fit.
I question whether the NBD server address should be exposed in the XML at all. This is a transient thing that is started/stopped on demand via
Such xml resembles VNC/serial ports to me. These two are not guest ABI. On the other hand they are connected to guest devices and the nbd server is not ...
the APIs you show above. So I'd suggest we just have an API to query the listening address of the NBD server.
Such an API looks like it would have too little function to justify having it...
At most in the XML we could have a element under each respective existing <disk/> element to say whether it is exported or not, instead of adding new <disk/> elements in a separate place.
Just as in the case of graphical framebuffers, I thought we could have multiple NBD servers (qemu is limited to just one now). So if we put export info under disks we need to refer to the NBD server, which basically means specifying its address. So having xml with NBD servers, each providing info on what it exports, looks simpler. Nikolay

On Fri, Apr 13, 2018 at 03:02:07PM +0300, Nikolay Shirokovskiy wrote:
On 13.04.2018 12:07, Daniel P. Berrangé wrote:
On Tue, Apr 03, 2018 at 03:01:22PM +0300, Nikolay Shirokovskiy wrote:
*Temporary snapshot API*
In previous version it is called 'Fleece API' after qemu terms and I'll still use BlockSnapshot prefix for commands as in previous RFC instead of TmpSnapshots which I inclined more now.
virDomainBlockSnapshotPtr virDomainBlockSnapshotCreateXML(virDomainPtr domain, const char *xmlDesc, unsigned int flags);
virDomainBlockSnapshotDelete(virDomainBlockSnapshotPtr snapshot, unsigned int flags);
virDomainBlockSnapshotList(virDomainPtr domain, virDomainBlockSnapshotPtr **snaps, unsigned int flags);
virDomainBlockSnapshotGetXMLDesc(virDomainBlockSnapshotPtr snapshot, unsigned int flags);
virDomainBlockSnapshotPtr virDomainBlockSnapshotLookupByName(virDomainPtr domain, const char *name, unsigned int flags);
Here is an example of snapshot xml description:
<domainblocksnapshot> <name>d068765e-8b50-4d74-9b72-1e55c663cbf8</name> <disk name='sda' type="file"> <fleece file="/tmp/snapshot-a.hdd"/>
Can we just call this <source file="....."/> which is how we name things in normal <disk> elements. 'fleece' in particular is an awful name giving no indication of what is being talked about unless you've already read the QEMU low levels, so I'd rather we don't use the word "fleece" anywhere in API or XML or docs at the libvirt level.
It would be easiest thing to do) Let me explain.
"source" in plain external snapshots for example feels awkward to me. It is read like "make a snapshot of disk sda which source file is like that". IMHO it would be better if xml is read like "make a snapshot of disk sda and put it into dest|target file. Then for block snapshot xml would read like "make a snapshot of disk sda and put fleece there". Fleece may be a new term but it only costs one-two sentences to define it. And it is better to have this definition so that user knows what the nature of this image, so that user have correct assumptions on image size, lifetime... If fleece word itself unfortunate then we can coin another one.
Looks like "source" takes root in domain xml where it reads well.
It is the "source" of the data for the snapshot, in the same way that is is the "source" of the data for the original disk.
*Block export API*
I guess it is natural to treat qemu NBD server as a domain device. So
A domain device is normally something that is related to the guest machine ABI. This is entirely invisible to the guest, just a backend concept, so this isn't really a natural fit as a domain device.
I have VNC in mind as a precedent.
Replace "precedent" with "historical mistake" and it would more accurately reflect feelings about VNC.
And this is how this NBD server will be exposed in domain xml:
<devices> ... <blockexport type="nbd"> <address type="ip" host="0.0.0.0" port="8000"/> <disk name="sda" snapshot="0044757e-1a2d-4c2c-b92f-bb403309bb17" checkpoint="d068765e-8b50-4d74-9b72-1e55c663cbf8" exportname="sda-0044757e-1a2d-4c2c-b92f-bb403309bb17"/> <disk name="sdb" snapshot="0044757e-1a2d-4c2c-b92f-bb403309bb17" checkpoint="d068765e-8b50-4d74-9b72-1e55c663cbf8 exportname="sdb-0044757e-1a2d-4c2c-b92f-bb403309bb17"/> </blockexport>
This feels pretty awkward to me - as mentioned above this is not really guest ABI related to having it under <devices> is not a good fit.
I question whether the NBD server address should be exposed in the XML at all. This is a transient thing that is started/stopped on demand via
Such xml resembles VNC/serial ports to me. These two are not guest ABI. On the other hand they connected to guest devices and nbd server is not ...
the APIs you show above. So I'd suggest we just have an API to query the listening address of the NBD server.
Such API looks like having very little function to have it...
At most in the XML we could have a element under each respective existing <disk/> element to say whether it is exported or not, instead of adding new <disk/> elements in a separate place.
Just as in case of graphical framebuffer I thought we can have multiple NBD servers (qemu limited to just one now). So if we put export info under disks we need to refer to NBD server which is basically specifying its address. So having xml with NBD servers each providing info on what it exports looks more simple.
If we ever have multiple NBD servers, then we can just assign each one a name, and under the <disk/> reference that name with <export server="$name"/> to indicate which one to export to. Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|