Hi, all.
This is another RFC on the pull backup API. The API provides means to read domain
disks in a snapshotted state so that a client can back them up, as well as means
to write domain disks to revert them to a backed-up state. The previous version
of the RFC is [1]. I'll also describe the API implementation details to shed light
on how the various qemu dirty bitmap commands are used.
This API does not use the existing disk snapshots. Instead it introduces snapshots
provided by qemu's blockdev-backup command. The reason is that we need the
snapshotted disk state only temporarily, for the duration of the backup operation,
and the newly introduced snapshots can easily be discarded at the end of the
operation without a block commit operation. Technically the difference is as
follows. With a usual snapshot we create a new image backed by the original, and
all new data goes to the new image, so the original image stays in the snapshotted
state. With a temporary snapshot we also create a new image backed by the original,
but all new data still goes to the original image. Before new data is written, the
old data about to be overwritten is copied out to the new image, so the snapshotted
state is available through the new image.
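To illustrate the difference, here is a toy model in Python. It is only a sketch, not qemu code: a disk is modelled as a dict of blocks, and the class and method names are made up for this illustration.

```python
class RegularSnapshot:
    """New writes go to an overlay; the original image stays frozen."""

    def __init__(self, original):
        self.original = original   # frozen, holds the snapshotted state
        self.overlay = {}          # receives all new guest writes

    def guest_write(self, block, data):
        self.overlay[block] = data

    def guest_read(self, block):
        return self.overlay.get(block, self.original.get(block))

    def snapshot_read(self, block):
        return self.original.get(block)


class TemporarySnapshot:
    """New writes still go to the original; old data is copied out first."""

    def __init__(self, original):
        self.original = original   # keeps receiving guest writes
        self.fleece = {}           # receives old data before it is overwritten

    def guest_write(self, block, data):
        if block not in self.fleece:
            # Copy the old contents out once, before the first overwrite.
            self.fleece[block] = self.original.get(block)
        self.original[block] = data

    def guest_read(self, block):
        return self.original.get(block)

    def snapshot_read(self, block):
        # Unchanged blocks fall through to the original (the backing image).
        if block in self.fleece:
            return self.fleece[block]
        return self.original.get(block)

    def discard(self):
        # No block commit needed: the original already holds the live data.
        self.fleece.clear()
```

Discarding the temporary snapshot only drops the fleece image, while getting rid of the regular one would require committing the overlay back into the original.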
Disk snapshots, as well as the disks themselves, are available for reading/writing
through the qemu NBD server.
Here are the typical actions for a domain backup:
- create temporary snapshot of domain disks of interest
- export snapshots through NBD
- back them up
- remove disks from export
- delete temporary snapshot
and the typical actions for a domain restore:
- start domain in paused state
- export domain disks of interest through NBD for writing
- restore them
- remove disks from export
- resume or destroy domain
Now let's write down the API in more detail. There are minor changes in comparison
with the previous version [1].
*Temporary snapshot API*
In the previous version this was called the 'Fleece API' after the qemu term. I'll
still use the BlockSnapshot prefix for commands, as in the previous RFC, instead of
TmpSnapshot, which I am more inclined towards now.
virDomainBlockSnapshotPtr
virDomainBlockSnapshotCreateXML(virDomainPtr domain,
                                const char *xmlDesc,
                                unsigned int flags);
virDomainBlockSnapshotDelete(virDomainBlockSnapshotPtr snapshot,
                             unsigned int flags);
virDomainBlockSnapshotList(virDomainPtr domain,
                           virDomainBlockSnapshotPtr **snaps,
                           unsigned int flags);
virDomainBlockSnapshotGetXMLDesc(virDomainBlockSnapshotPtr snapshot,
                                 unsigned int flags);
virDomainBlockSnapshotPtr
virDomainBlockSnapshotLookupByName(virDomainPtr domain,
                                   const char *name,
                                   unsigned int flags);
Here is an example of snapshot xml description:
<domainblocksnapshot>
    <name>d068765e-8b50-4d74-9b72-1e55c663cbf8</name>
    <disk name='sda' type="file">
        <fleece file="/tmp/snapshot-a.hdd"/>
    </disk>
    <disk name='sdb' type="file">
        <fleece file="/tmp/snapshot-b.hdd"/>
    </disk>
</domainblocksnapshot>
Temporary snapshots are independent, so they are not organized in a tree structure
like usual snapshots, and the 'list snapshots' and 'lookup' functions will
suffice.
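A client could generate such a description programmatically. The following is a minimal Python sketch that only mirrors the element and attribute names from the example above. The helper name block_snapshot_xml is made up for this illustration and is not part of the proposed API.

```python
import xml.etree.ElementTree as ET


def block_snapshot_xml(name, fleece_files):
    """Build a <domainblocksnapshot> description as a string.

    fleece_files maps a disk target (e.g. 'sda') to the path of the
    fleece image that will receive the copied-out data.
    """
    root = ET.Element("domainblocksnapshot")
    ET.SubElement(root, "name").text = name
    for disk, path in fleece_files.items():
        disk_el = ET.SubElement(root, "disk", {"name": disk, "type": "file"})
        ET.SubElement(disk_el, "fleece", {"file": path})
    return ET.tostring(root, encoding="unicode")


xml_desc = block_snapshot_xml(
    "d068765e-8b50-4d74-9b72-1e55c663cbf8",
    {"sda": "/tmp/snapshot-a.hdd", "sdb": "/tmp/snapshot-b.hdd"},
)
print(xml_desc)
```

The resulting string is what would be passed as xmlDesc to virDomainBlockSnapshotCreateXML.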
Qemu can track which disk blocks have changed since the snapshotted state, so on the
next backup the client can back up only the changed blocks.
virDomainBlockSnapshotCreateXML accepts the VIR_DOMAIN_BLOCK_SNAPSHOT_CREATE_CHECKPOINT
flag to turn this option on for a snapshot, which means tracking changes from this
particular snapshot. I used the term checkpoint and not [dirty] bitmap because in the
current implementation several qemu dirty bitmaps are used to provide the changed
blocks from a given checkpoint to the current snapshot (see the *Implementation
details* section for more details). Also, a bitmap records block changes and thus
itself changes over time, while checkpoint is a more static term: it means you can
query changes from that moment in time.
Checkpoints are visible in active domain xml:
<disk type='file' device='disk'>
    ..
    <target dev='sda' bus='scsi'/>
    <alias name='scsi0-0-0-0'/>
    <checkpoint name="93a5c045-6457-2c09-e56c-927cdf34e178"/>
    <checkpoint name="5768a388-c1c4-414c-ac4e-eab216ba7c0c"/>
    ..
</disk>
Every checkpoint requires a qemu dirty bitmap. With the default dirty block size of
64KiB a bitmap for a 1TiB disk has 2^24 bits (one per block), that is about 2MiB of
RAM, and the same amount of disk space is used. So the client needs to manage
checkpoints and delete unused ones. Hence the next API function:
int
virDomainBlockCheckpointRemove(virDomainPtr domain,
                               const char *name,
                               unsigned int flags);
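As a back-of-the-envelope check of the per-checkpoint cost, here is a small Python sketch. It assumes exactly one bit per block and ignores any internal bookkeeping qemu adds on top. The function name bitmap_bits is made up for this illustration.

```python
def bitmap_bits(disk_bytes, granularity_bytes=64 * 1024):
    """Number of bits needed when each bit covers one granularity-sized block."""
    return -(-disk_bytes // granularity_bytes)   # ceiling division


TIB = 1 << 40
bits = bitmap_bits(TIB)
print(bits)                # 16777216 bits, i.e. 2**24 (16 Mibit)
print(bits // 8 // 1024)   # 2048 KiB of raw bitmap data, i.e. 2 MiB
```

The cost scales linearly with the number of checkpoints kept, which is why deleting unused ones matters.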
*Block export API*
I guess it is natural to treat the qemu NBD server as a domain device. So we can use
the virDomainAttachDeviceFlags/virDomainDetachDeviceFlags API to start/stop the NBD
server and virDomainUpdateDeviceFlags to add/delete disks to be exported.
While I have no doubts about the start/stop operations, using
virDomainUpdateDeviceFlags looks a bit inconvenient, so I decided to add a pair of
API functions just to add/delete disks to be exported:
int
virDomainBlockExportStart(virDomainPtr domain,
                          const char *xmlDesc,
                          unsigned int flags);
int
virDomainBlockExportStop(virDomainPtr domain,
                         const char *xmlDesc,
                         unsigned int flags);
I guess more appropriate names are virDomainBlockExportAdd and
virDomainBlockExportRemove, but as I already have a patch series implementing pull
backups with the Start/Stop names I would like to keep them for now.
These names also reflect that in the implementation I decided to start/stop the NBD
server in a lazy manner. While this is a bit novel for the libvirt API, I guess it is
convenient: to refer to the NBD server when adding/removing disks we need to identify
it through its parameters like type, address etc. until we introduce some device id
(which does not look consistent with the current libvirt design). So we already have
all the parameters needed to start/stop the server within these calls, so why have
extra API calls just to start/stop the server manually? If we later need an NBD
server without disks, we can perfectly well support
virDomainAttachDeviceFlags/virDomainDetachDeviceFlags.
Here is an example of the xml to add/remove disks (specifying the checkpoint
attribute is not needed when removing disks, of course):
<domainblockexport type="nbd">
    <address type="ip" host="0.0.0.0" port="8000"/>
    <disk name="sda"
          snapshot="0044757e-1a2d-4c2c-b92f-bb403309bb17"
          checkpoint="d068765e-8b50-4d74-9b72-1e55c663cbf8"/>
    <disk name="sdb"
          snapshot="0044757e-1a2d-4c2c-b92f-bb403309bb17"
          checkpoint="d068765e-8b50-4d74-9b72-1e55c663cbf8"/>
</domainblockexport>
And this is how this NBD server will be exposed in the domain xml:
<devices>
    ...
    <blockexport type="nbd">
        <address type="ip" host="0.0.0.0"
                 port="8000"/>
        <disk name="sda"
              snapshot="0044757e-1a2d-4c2c-b92f-bb403309bb17"
              checkpoint="d068765e-8b50-4d74-9b72-1e55c663cbf8"
              exportname="sda-0044757e-1a2d-4c2c-b92f-bb403309bb17"/>
        <disk name="sdb"
              snapshot="0044757e-1a2d-4c2c-b92f-bb403309bb17"
              checkpoint="d068765e-8b50-4d74-9b72-1e55c663cbf8"
              exportname="sdb-0044757e-1a2d-4c2c-b92f-bb403309bb17"/>
    </blockexport>
    ...
</devices>
*Implementation details from qemu-libvirt interactions POV*
1. Temporary snapshot
- create snapshot
    - add fleece blockdev backed by disk of interest
    - start fleece blockjob which will copy out data to be overwritten to fleece blockdev
{
    "execute": "blockdev-add",
    "arguments": {
        "backing": "drive-scsi0-0-0-0",
        "driver": "qcow2",
        "file": {
            "driver": "file",
            "filename": "/tmp/snapshot-a.hdd"
        },
        "node-name": "snapshot-scsi0-0-0-0"
    }
}
{
    "execute": "transaction",
    "arguments": {
        "actions": [
            {
                "type": "blockdev-backup",
                "data": {
                    "device": "drive-scsi0-0-0-0",
                    "target": "snapshot-scsi0-0-0-0",
                    "sync": "none"
                }
            }
        ]
    }
}
- delete snapshot
    - cancel fleece blockjob
    - delete fleece blockdev
{
    "execute": "block-job-cancel",
    "arguments": {
        "device": "drive-scsi0-0-0-0"
    }
}
{
    "execute": "blockdev-del",
    "arguments": {
        "node-name": "snapshot-scsi0-0-0-0"
    }
}
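The create/delete sequences for one disk can be generated as follows. This is only a Python sketch of how a management layer might assemble the commands shown in this section. The function names are made up, and actually sending the commands to the QMP monitor is left out.

```python
import json


def create_fleece_commands(drive, fleece_node, fleece_file):
    """QMP commands that start a temporary snapshot of one disk."""
    return [
        {"execute": "blockdev-add",
         "arguments": {"driver": "qcow2",
                       "node-name": fleece_node,
                       "backing": drive,
                       "file": {"driver": "file", "filename": fleece_file}}},
        # sync=none: copy nothing up front, only data about to be overwritten.
        {"execute": "transaction",
         "arguments": {"actions": [
             {"type": "blockdev-backup",
              "data": {"device": drive,
                       "target": fleece_node,
                       "sync": "none"}}]}},
    ]


def delete_fleece_commands(drive, fleece_node):
    """QMP commands that drop the temporary snapshot again."""
    return [
        {"execute": "block-job-cancel", "arguments": {"device": drive}},
        {"execute": "blockdev-del", "arguments": {"node-name": fleece_node}},
    ]


create_cmds = create_fleece_commands(
    "drive-scsi0-0-0-0", "snapshot-scsi0-0-0-0", "/tmp/snapshot-a.hdd")
delete_cmds = delete_fleece_commands(
    "drive-scsi0-0-0-0", "snapshot-scsi0-0-0-0")
for cmd in create_cmds + delete_cmds:
    print(json.dumps(cmd))
```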
2. Block export
- add disks to export
    - start NBD server if it is not started
    - add disks
{
    "execute": "nbd-server-start",
    "arguments": {
        "addr": {
            "type": "inet",
            "data": {
                "host": "0.0.0.0",
                "port": "49300"
            }
        }
    }
}
{
    "execute": "nbd-server-add",
    "arguments": {
        "device": "snapshot-scsi0-0-0-0",
        "name": "sda-d068765e-8b50-4d74-9b72-1e55c663cbf8",
        "writable": false
    }
}
- remove disks from export
    - remove disks
    - stop NBD server if there are no disks left
{
    "arguments": {
        "mode": "hard",
        "name": "sda-d068765e-8b50-4d74-9b72-1e55c663cbf8"
    },
    "execute": "nbd-server-remove"
}
{
    "execute": "nbd-server-stop"
}
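The lazy start/stop logic boils down to tracking the set of current exports. Below is a minimal Python sketch of that bookkeeping. The class name LazyNbdServer is made up for this illustration and the real implementation may differ. It only produces the QMP commands and does not send them.

```python
class LazyNbdServer:
    """Start the NBD server on the first export, stop it after the last one."""

    def __init__(self, host, port):
        self.addr = {"type": "inet",
                     "data": {"host": host, "port": str(port)}}
        self.exports = set()

    def add(self, node, name, writable=False):
        cmds = []
        if not self.exports:
            # First export: the server is not running yet.
            cmds.append({"execute": "nbd-server-start",
                         "arguments": {"addr": self.addr}})
        cmds.append({"execute": "nbd-server-add",
                     "arguments": {"device": node,
                                   "name": name,
                                   "writable": writable}})
        self.exports.add(name)
        return cmds

    def remove(self, name):
        self.exports.remove(name)   # KeyError if the export is unknown
        cmds = [{"execute": "nbd-server-remove",
                 "arguments": {"name": name, "mode": "hard"}}]
        if not self.exports:
            # Last export gone: nothing left to serve.
            cmds.append({"execute": "nbd-server-stop"})
        return cmds
```

With this scheme the caller never issues start/stop explicitly. Both follow from the add/remove calls, which is the behaviour the virDomainBlockExportStart/Stop names reflect.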
3. Checkpoints (the most interesting part)
First a few facts about qemu dirty bitmaps.
A bitmap can be in either the active or the disabled state. In the disabled state it
does not change on guest writes. Conversely, in the active state it tracks guest
writes. This implementation uses an approach with only one active bitmap at
a time. This should reduce the guest write penalty in the presence of
checkpoints. So on the first snapshot we create bitmap B_1. Now it tracks changes
from snapshot 1. On the second snapshot we create bitmap B_2 and disable bitmap
B_1, and so on. Now bitmap B_1 keeps the changes from snapshot 1 to snapshot 2, B_2
keeps the changes from snapshot 2 to snapshot 3, and so on. The last bitmap is active
and records the disk changes made after the latest snapshot.
Getting the changed-blocks bitmap from some checkpoint in the past up to the current
snapshot is quite simple in this scheme. For example, if the latest snapshot is 7,
then to get the changes from snapshot 3 to the latest snapshot we need to merge
bitmaps B_3, B_4, B_5 and B_6. A merge is just a logical OR of the bitmap bits.
Deleting a checkpoint somewhere in the middle of the checkpoint sequence requires
merging the corresponding bitmap into the previous bitmap in this scheme.
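Both operations can be sketched in a few lines of Python. This is a toy model only: each bitmap is a set of dirty block numbers, so a merge (logical OR) becomes a set union, and the chain is an ordered list of (snapshot id, bitmap) pairs. The function names are made up for this illustration.

```python
def changes_between(chain, start, end):
    """Merge the bitmaps of snapshots start .. end-1 (end itself excluded)."""
    ids = [snap_id for snap_id, _ in chain]
    lo, hi = ids.index(start), ids.index(end)   # ValueError if one is missing
    merged = set()
    for _, dirty in chain[lo:hi]:
        merged |= dirty                         # logical OR of the bits
    return merged


def delete_checkpoint(chain, checkpoint):
    """Drop a checkpoint by folding its bitmap into the previous one."""
    ids = [snap_id for snap_id, _ in chain]
    pos = ids.index(checkpoint)
    _, dirty = chain.pop(pos)
    if pos > 0:
        chain[pos - 1][1].update(dirty)
    # For the oldest checkpoint there is no previous bitmap: the bits are dropped.


# B_3 .. B_7, where B_7 is the active bitmap created on the latest snapshot.
chain = [(3, {1}), (4, {2}), (5, {2, 3}), (6, {9}), (7, set())]
print(sorted(changes_between(chain, 3, 7)))   # [1, 2, 3, 9]
delete_checkpoint(chain, 5)
print(sorted(changes_between(chain, 3, 7)))   # still [1, 2, 3, 9]
```

After deleting checkpoint 5 the changes since checkpoint 3 or 4 can still be computed, because the bits of B_5 now live in B_4.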
We use persistent bitmaps in the implementation. This means that upon qemu process
termination bitmaps are saved in the disk image metadata and restored on
qemu process start. This makes a checkpoint a persistent property, that is, we
keep checkpoints across domain starts/stops. Qemu does not try hard to keep bitmaps:
if something goes wrong on save, the bitmap is dropped. The same applies to the
migration process. For the backup process this is not critical: if we don't
find a checkpoint we can always make a full backup. Also, qemu provides no
special means to track the order of bitmaps. These facts are critical for an
implementation with one active bitmap at a time. We need the right order of bitmaps
upon merge: for snapshot N and block changes from snapshot K (K < N) to N we need
to merge bitmaps B_{K}, ..., B_{N-1}. Also, if one of the bitmaps to be merged
is missing, we can't calculate the desired block changes either.
So the implementation encodes the bitmap order in the bitmap names. For snapshot A1
the bitmap name will be A1, for snapshot A2 the bitmap name will be A2^A1, and so on.
Using this naming scheme we can find out the bitmap order and check for missing
bitmaps upon domain start. This complicates bitmap removal a bit, though. For example,
removing a bitmap somewhere in the middle looks like this:
- removing bitmap K (bitmap name is NAME_{K}^NAME_{K-1})
    - create new bitmap named NAME_{K+1}^NAME_{K-1}       ---.
    - disable new bitmap                                     | This is effectively renaming
    - merge bitmap NAME_{K+1}^NAME_{K} to the new bitmap     | bitmap K+1 to comply with
    - remove bitmap NAME_{K+1}^NAME_{K}                   ___/ the naming scheme
    - merge bitmap NAME_{K}^NAME_{K-1} to NAME_{K-1}^NAME_{K-2}
    - remove bitmap NAME_{K}^NAME_{K-1}
As you can see, we need to change the name of bitmap K+1 to keep our bitmap
naming scheme. This is done by creating a new K+1 bitmap with the appropriate name
and copying the old K+1 bitmap into the new one.
So while it is possible to have only one active bitmap at a time, it costs
some exercise at the management layer. To me it looks like qemu itself is a better
place to track the order and consistency of the bitmap chain.
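For reference, recovering the order from the names and detecting a broken chain at domain start could look roughly like the Python sketch below. It assumes names of the form CUR^PREV with a single oldest bitmap named just CUR. The function name recover_order and the error messages are made up for this illustration.

```python
def recover_order(names):
    """Return checkpoint names oldest-first, or raise ValueError if the
    chain encoded in the bitmap names is incomplete or inconsistent."""
    if not names:
        return []
    parent = {}
    for name in names:
        cur, _, prev = name.partition("^")
        if cur in parent:
            raise ValueError(f"duplicate bitmap for checkpoint {cur}")
        parent[cur] = prev or None

    roots = [cur for cur, prev in parent.items() if prev is None]
    if len(roots) != 1:
        raise ValueError("expected exactly one oldest bitmap")

    child = {}
    for cur, prev in parent.items():
        if prev is None:
            continue
        if prev not in parent:
            raise ValueError(f"bitmap for checkpoint {prev} is missing")
        if prev in child:
            raise ValueError(f"two bitmaps follow checkpoint {prev}")
        child[prev] = cur

    order = [roots[0]]
    while order[-1] in child:
        order.append(child[order[-1]])
    if len(order) != len(parent):
        raise ValueError("bitmaps do not form a single chain")
    return order


print(recover_order(["A3^A2", "A1", "A2^A1"]))   # ['A1', 'A2', 'A3']
```

If any check fails, the safe fallback is the one mentioned above: treat the checkpoints as lost and make a full backup.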
Now, here is how exporting bitmaps looks.
- add to export disk snapshot N with changes from checkpoint K
    - add fleece blockdev to NBD exports
    - create new bitmap T
    - disable bitmap T
    - merge bitmaps K, K+1, .. N-1 into T
    - add bitmap T to NBD export
- remove disk snapshot from export
    - remove fleece blockdev from NBD exports
    - remove bitmap T
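Forming bitmap T is a single transaction whose length depends on how many bitmaps lie between K and N-1. Here is a Python sketch that assembles it. The function name is made up, and the x-vz-* action names are the not-yet-upstream commands used by this implementation, so they should be read as assumptions about the qemu in use.

```python
def export_bitmap_transaction(node, source_bitmaps, tmp_name):
    """Build a QMP transaction that merges source_bitmaps into a new,
    disabled, non-persistent bitmap tmp_name on the given node."""
    actions = [
        {"type": "block-dirty-bitmap-add",
         "data": {"node": node, "name": tmp_name, "persistent": False}},
        # T must not track guest writes: it is a frozen view for the export.
        {"type": "x-vz-block-dirty-bitmap-disable",
         "data": {"node": node, "name": tmp_name}},
    ]
    for src in source_bitmaps:
        actions.append(
            {"type": "x-vz-block-dirty-bitmap-merge",
             "data": {"node": node, "src_name": src, "dst_name": tmp_name}})
    return {"execute": "transaction", "arguments": {"actions": actions}}


# Hypothetical short names; real names follow the NAME_{K}^NAME_{K-1} scheme.
cmd = export_bitmap_transaction(
    "drive-scsi0-0-0-0", ["A1", "A2^A1"], "export-tmp")
print(len(cmd["arguments"]["actions"]))   # 4
```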
Here are qemu command examples for operations with checkpoints. I'll make
several snapshots with checkpoints for the purpose of better illustration.
- create snapshot d068765e-8b50-4d74-9b72-1e55c663cbf8 with checkpoint
    - same as without checkpoint but additionally add bitmap on fleece blockjob start
...
{
    "execute": "transaction",
    "arguments": {
        "actions": [
            {
                "type": "blockdev-backup",
                "data": {
                    "device": "drive-scsi0-0-0-0",
                    "sync": "none",
                    "target": "snapshot-scsi0-0-0-0"
                }
            },
            {
                "type": "block-dirty-bitmap-add",
                "data": {
                    "name":
                        "libvirt-d068765e-8b50-4d74-9b72-1e55c663cbf8",
                    "node": "drive-scsi0-0-0-0",
                    "persistent": true
                }
            }
        ]
    }
}
- delete snapshot d068765e-8b50-4d74-9b72-1e55c663cbf8
    - same as without checkpoints
- create snapshot 0044757e-1a2d-4c2c-b92f-bb403309bb17 with checkpoint
    - same actions as for the first snapshot, but additionally disable the first bitmap
...
{
    "execute": "transaction",
    "arguments": {
        "actions": [
            {
                "type": "blockdev-backup",
                "data": {
                    "device": "drive-scsi0-0-0-0",
                    "sync": "none",
                    "target": "snapshot-scsi0-0-0-0"
                }
            },
            {
                "type": "x-vz-block-dirty-bitmap-disable",
                "data": {
                    "name":
                        "libvirt-d068765e-8b50-4d74-9b72-1e55c663cbf8",
                    "node": "drive-scsi0-0-0-0"
                }
            },
            {
                "type": "block-dirty-bitmap-add",
                "data": {
                    "name":
"libvirt-0044757e-1a2d-4c2c-b92f-bb403309bb17^d068765e-8b50-4d74-9b72-1e55c663cbf8",
                    "node": "drive-scsi0-0-0-0",
                    "persistent": true
                }
            }
        ]
    }
}
- delete snapshot 0044757e-1a2d-4c2c-b92f-bb403309bb17
- create snapshot 8fc02db3-166f-4de7-b7aa-1f7303e6162b with checkpoint
- add disk snapshot 8fc02db3-166f-4de7-b7aa-1f7303e6162b to export together with
  a bitmap of changes from checkpoint d068765e-8b50-4d74-9b72-1e55c663cbf8
    - same as add export without checkpoint, but additionally
        - form result bitmap
        - add bitmap to NBD export
...
{
    "execute": "transaction",
    "arguments": {
        "actions": [
            {
                "type": "block-dirty-bitmap-add",
                "data": {
                    "node": "drive-scsi0-0-0-0",
                    "name": "libvirt-__export_temporary__",
                    "persistent": false
                }
            },
            {
                "type": "x-vz-block-dirty-bitmap-disable",
                "data": {
                    "node": "drive-scsi0-0-0-0",
                    "name": "libvirt-__export_temporary__"
                }
            },
            {
                "type": "x-vz-block-dirty-bitmap-merge",
                "data": {
                    "node": "drive-scsi0-0-0-0",
                    "src_name":
                        "libvirt-d068765e-8b50-4d74-9b72-1e55c663cbf8",
                    "dst_name": "libvirt-__export_temporary__"
                }
            },
            {
                "type": "x-vz-block-dirty-bitmap-merge",
                "data": {
                    "node": "drive-scsi0-0-0-0",
                    "src_name":
"libvirt-0044757e-1a2d-4c2c-b92f-bb403309bb17^d068765e-8b50-4d74-9b72-1e55c663cbf#
                    "dst_name": "libvirt-__export_temporary__"
                }
            }
        ]
    }
}
{
    "execute": "x-vz-nbd-server-add-bitmap",
    "arguments": {
        "name": "sda-8fc02db3-166f-4de7-b7aa-1f7303e6162b",
        "bitmap": "libvirt-__export_temporary__",
        "bitmap-export-name":
            "d068765e-8b50-4d74-9b72-1e55c663cbf8"
    }
}
- remove snapshot 8fc02db3-166f-4de7-b7aa-1f7303e6162b from export
    - same as without checkpoint but additionally remove temporary bitmap
...
{
    "arguments": {
        "name": "libvirt-__export_temporary__",
        "node": "drive-scsi0-0-0-0"
    },
    "execute": "block-dirty-bitmap-remove"
}
- delete checkpoint 0044757e-1a2d-4c2c-b92f-bb403309bb17
  (a similar operation is described in the section about the naming scheme for bitmaps,
  with the difference that K+1 is N here and thus the new bitmap should not be disabled)
{
    "arguments": {
        "actions": [
            {
                "type": "block-dirty-bitmap-add",
                "data": {
                    "node": "drive-scsi0-0-0-0",
                    "name":
"libvirt-8fc02db3-166f-4de7-b7aa-1f7303e6162b^d068765e-8b50-4d74-9b72-1e55c663cbf8",
                    "persistent": true
                }
            },
            {
                "type": "x-vz-block-dirty-bitmap-merge",
                "data": {
                    "node": "drive-scsi0-0-0-0",
                    "src_name":
"libvirt-0044757e-1a2d-4c2c-b92f-bb403309bb17^d068765e-8b50-4d74-9b72-1e55c663cbf#
                    "dst_name":
                        "libvirt-d068765e-8b50-4d74-9b72-1e55c663cbf8"
                }
            },
            {
                "type": "x-vz-block-dirty-bitmap-merge",
                "data": {
                    "node": "drive-scsi0-0-0-0",
                    "src_name":
"libvirt-8fc02db3-166f-4de7-b7aa-1f7303e6162b^0044757e-1a2d-4c2c-b92f-bb403309bb1#
                    "dst_name":
"libvirt-8fc02db3-166f-4de7-b7aa-1f7303e6162b^d068765e-8b50-4d74-9b72-1e55c663cbf#
                }
            }
        ]
    },
    "execute": "transaction"
}
{
    "execute": "x-vz-block-dirty-bitmap-remove",
    "arguments": {
        "node": "drive-scsi0-0-0-0",
        "name":
"libvirt-8fc02db3-166f-4de7-b7aa-1f7303e6162b^0044757e-1a2d-4c2c-b92f-bb403309bb17"
    }
}
{
    "execute": "x-vz-block-dirty-bitmap-remove",
    "arguments": {
        "node": "drive-scsi0-0-0-0",
        "name":
"libvirt-0044757e-1a2d-4c2c-b92f-bb403309bb17^d068765e-8b50-4d74-9b72-1e55c663cbf8"
    }
}
Here is a list of the bitmap commands used in the implementation but not yet
upstream (AFAIK):
    x-vz-block-dirty-bitmap-remove
    x-vz-block-dirty-bitmap-merge
    x-vz-block-dirty-bitmap-disable
    x-vz-block-dirty-bitmap-enable (not in the examples; used when removing the
                                    most recent checkpoint)
    x-vz-nbd-server-add-bitmap
*Restore operation nuances*
As written above, to restore a domain one needs to start it in the paused
state, export the domain's disks and write them from the backup. However, qemu
currently does not allow exporting disks for writing, even for a domain whose guest
CPUs have never been started. We have an experimental qemu command line option,
-x-vz-nbd-restore (passed together with the -incoming option), to fix this.
*Links*
[1] Previous version of RFC
https://www.redhat.com/archives/libvir-list/2017-November/msg00514.html