[libvirt] Overview of libvirt incremental backup API, part 1 (full pull mode)

The following (long) email describes a portion of the work-flow of how my proposed incremental backup APIs will work, along with the backend QMP commands that each one executes. I will reply to this thread with further examples (the first example is long enough to be its own email). This is an update to a thread last posted here: https://www.redhat.com/archives/libvir-list/2018-June/msg01066.html

I'm still pulling together pieces in the src/qemu directory of libvirt while implementing my proposed API, but you can track the status of my code (currently around 6000 lines of code and 1500 lines of documentation added) at: https://repo.or.cz/libvirt/ericb.git

The documentation below describes the end goal of my demo (which I will be presenting at KVM Forum), even if the current git checkout of my work in progress doesn't quite behave that way. My hope is that the API itself is in a stable enough state to include in the libvirt 4.9 release (end of this month - which really means upstream commit prior to KVM Forum!) by demo-ing how it is used with qemu experimental commands, even if the qemu driver portions of my series are not yet ready to be committed because they are waiting for the qemu side of incremental backups to stabilize. If we like the API and are willing to commit to it, then downstream vendors can backport whatever fixes land in the qemu driver on top of the existing API, without having to suffer from rebase barriers preventing the addition of new API.

Performing a full backup can work on any disk format, but incremental (all changes since the most recent checkpoint) and differential (all changes since an arbitrary earlier checkpoint) backups require the use of a persistent bitmap for tracking the changes between checkpoints, and that in turn requires a disk in qcow2 format. The API can handle multiple disks at the same point in time (so I'll demonstrate two at once), and is designed to handle both the push model (qemu writes to a specific destination, and the format has to be one that qemu knows) and the pull model (qemu opens up an NBD server for all disks, then you connect one or more read-only clients per export on that server to read the information of interest into a destination of your choosing).

This demo also shows how I consume the data over a pull model backup. Remember, in the pull model, you don't have to use a qemu binary as the NBD client (you merely need a client that can request the base:allocation and qemu:dirty-bitmap:name contexts) - it's just that it is easier to demonstrate everything with the tools already at hand. Thus, I use existing qemu-img 3.0 functionality to extract the dirty bitmap (the qemu:dirty-bitmap:name context) in one process, and a second qemu-io process (using base:allocation to optimize reads of holes) for extracting the actual data; the demo shows both processes accessing the read-only NBD server in parallel. While I use two processes, it is also feasible to write a single client that can get at both contexts through a single NBD connection (the qemu 3.0 server supports that, even if none of the qemu 3.0 clients can request multiple contexts). Down the road, we may further enhance tools shipped with qemu to be easier to use as such a client, but that does not affect the actual backup API (which is merely what it takes to get the NBD server up and running).
- Preliminary setup: I'm using bash as my shell, and set

$ orig1=/path/to/disk1.img orig2=/path/to/disk2.img
$ dom=my_domain qemu_img=/path/to/qemu-img
$ virsh="/path/to/virsh -k 0"

to make later steps easier to type. While the steps below should work with qemu 3.0, I found it easier to test with both a self-built qemu (modify the <emulator> line in my domain) and a self-built libvirtd (systemctl stop libvirtd, then run src/libvirtd; also note my use of $virsh with the heartbeat disabled, so that I was able to attach gdb during development without having to worry about the connection dying). Also, you may need 'setenforce 0' when using self-built binaries, since otherwise SELinux labeling gets weird (obviously, when the actual code is ready to check into libvirt, it will work with SELinux enforcing and with system-installed rather than self-installed binaries). I also used:

$ $virsh domblklist $dom

to verify that I have plugged in $orig1 and $orig2 as two of the disks of $dom (I used:

    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='none' error_policy='stop' io='native'/>
      <source file='/path/to/disk1.img'/>
      <backingStore/>
      <target dev='sdc' bus='scsi'/>
    </disk>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='none' error_policy='stop' io='native'/>
      <source file='/path/to/disk2.img'/>
      <backingStore/>
      <target dev='sdd' bus='scsi'/>
    </disk>

in my domain XML)

- First example: creating a full backup via pull model, initially with no checkpoint created

$ cat > backup.xml <<EOF
<domainbackup mode='pull'>
  <server transport='tcp' name='localhost' port='10809'/>
  <disks>
    <disk name='$orig1' type='file'>
      <scratch file='$PWD/scratch1.img'/>
    </disk>
    <disk name='sdd' type='file'>
      <scratch file='$PWD/scratch2.img'/>
    </disk>
  </disks>
</domainbackup>
EOF
$ $qemu_img create -f qcow2 -b $orig1 -F qcow2 scratch1.img
$ $qemu_img create -f qcow2 -b $orig2 -F qcow2 scratch2.img

Here, I'm explicitly requesting a pull backup (the API defaults to push otherwise), as well as explicitly requesting the NBD server to be set up (the XML should support both transport='tcp' and transport='unix'). Note that the <server> is global, but the server will support multiple export names at once, so that you can connect multiple clients to process those exports in parallel. Ideally, if <server> is omitted, libvirt should auto-generate an appropriate server name, and have a way for you to query what it generated (right now, I don't have that working in libvirt, so being explicit is necessary - but again, the goal now is to prove that the API is reasonable for inclusion in libvirt 4.9; enhancements like making <server> optional can come later even if they miss libvirt 4.9). I'm also requesting that the backup operate on only two disks of the domain, and pointing libvirt to the scratch storage it needs to use for the duration of the backup (ideally, libvirt will generate an appropriate scratch file name itself if omitted from the XML, and create the scratch files itself instead of me having to pre-create them). Note that I can give either the path to my original disk ($orig1, $orig2) or the target name in the domain XML (in my case sdc, sdd); libvirt will normalize my input and always use the target name when reposting the XML in output.
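As an aside, the <disks> list in backup.xml could also be generated rather than typed by hand; a rough sketch (nothing libvirt does for you, and the scratch-<target>.img naming is made up for illustration) that pulls the target names from 'virsh domblklist --details' and would still need trimming to just the disks you actually want backed up:

$ $virsh domblklist $dom --details | awk '$2 == "disk" { print $3 }' |
    while read target; do
      printf "    <disk name='%s' type='file'>\n" "$target"
      printf "      <scratch file='%s/scratch-%s.img'/>\n" "$PWD" "$target"
      printf "    </disk>\n"
    done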
$ $virsh backup-begin $dom backup.xml
Backup id 1 started
backup used description from 'backup.xml'

Kicks off the backup job. virsh called virDomainBackupBegin(dom, "<domainbackup ...>", NULL, 0) and in turn libvirt makes all of the following QMP calls (if any QMP call fails, libvirt attempts to unroll things so that there is no lasting change to the guest before actually reporting failure):

{"execute":"nbd-server-start",
 "arguments":{"addr":{"type":"inet",
   "data":{"host":"localhost", "port":"10809"}}}}
{"execute":"blockdev-add",
 "arguments":{"driver":"qcow2", "node-name":"backup-sdc",
   "file":{"driver":"file", "filename":"$PWD/scratch1.img"},
   "backing":"$node1"}}
{"execute":"blockdev-add",
 "arguments":{"driver":"qcow2", "node-name":"backup-sdd",
   "file":{"driver":"file", "filename":"$PWD/scratch2.img"},
   "backing":"$node2"}}
{"execute":"transaction",
 "arguments":{"actions":[
   {"type":"blockdev-backup", "data":{
     "device":"$node1", "target":"backup-sdc", "sync":"none",
     "job-id":"backup-sdc" }},
   {"type":"blockdev-backup", "data":{
     "device":"$node2", "target":"backup-sdd", "sync":"none",
     "job-id":"backup-sdd" }}
 ]}}
{"execute":"nbd-server-add",
 "arguments":{"device":"backup-sdc", "name":"sdc"}}
{"execute":"nbd-server-add",
 "arguments":{"device":"backup-sdd", "name":"sdd"}}

libvirt populated $node1 and $node2 to be the node names actually assigned by qemu; until Peter's work on libvirt using node names everywhere actually lands, libvirt is scraping the auto-generated #blockNNN name from query-block and friends (the same as it already does in other situations like write threshold). With this command complete, libvirt has now kicked off a pull backup job, which includes a single qemu NBD server with two separate exports named 'sdc' and 'sdd' that expose the state of the disks at the time of the API call (any guest writes to $orig1 or $orig2 trigger copy-on-write actions into scratch1.img and scratch2.img to preserve the fact that reading from NBD sees unchanging contents).

We can double-check what libvirt is tracking for the running backup job, including the fact that libvirt normalized the <disk> names to match the domain XML target listings, matching the names of the exports being served over the NBD server:

$ $virsh backup-dumpxml $dom 1
<domainbackup type='pull' id='1'>
  <server transport='tcp' name='localhost' port='10809'/>
  <disks>
    <disk name='sdc' type='file'>
      <scratch file='/home/eblake/scratch1.img'/>
    </disk>
    <disk name='sdd' type='file'>
      <scratch file='/home/eblake/scratch2.img'/>
    </disk>
  </disks>
</domainbackup>

where 1 on the command line would be replaced by whatever id was printed by the earlier backup-begin command (yes, my demo can hard-code things to 1, because the current qemu and initial libvirt implementations only support one backup job at a time, although we have plans to allow parallel jobs in the future). This translated to the libvirt API call virDomainBackupGetXMLDesc(dom, 1, 0) and did not have to make any QMP calls into qemu.

Now that the backup job is running, we want to scrape the data off the NBD server. The most naive way is:

$ $qemu_img convert -f raw -O $fmt nbd://localhost:10809/sdc full1.img
$ $qemu_img convert -f raw -O $fmt nbd://localhost:10809/sdd full2.img

where we hope that qemu-img convert is able to recognize the holes in the source and only write into the backup copy where actual data lives.
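While the backup job (and thus the NBD server) is still up, the naive copy can also be sanity-checked against the export; a small sketch, assuming full1.img was written in $fmt as above:

$ $qemu_img compare -f $fmt -F raw full1.img nbd://localhost:10809/sdc
Images are identical.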
You don't have to use qemu-img; it's possible to use any NBD client, such as the kernel NBD module:

$ modprobe nbd
$ qemu-nbd -c /dev/nbd0 -f raw nbd://localhost:10809/sdc
$ cp /dev/nbd0 full1.img
$ qemu-nbd -d /dev/nbd0

The above demonstrates the flexibility of the pull model (your backup file can be ANY format you choose; here I did 'cp' to copy it to a raw destination), but it is also a less efficient NBD client, since the kernel module doesn't yet know about NBD_CMD_BLOCK_STATUS for learning where the holes are, nor about NBD_OPT_STRUCTURED_REPLY for faster reads of those holes.

Of course, we don't have to blindly read the entire image, but can instead use two clients in parallel (per exported disk): one uses 'qemu-img map' to learn which parts of the export contain data, feeding the result through a bash 'while read' loop to parse out which offsets contain interesting data, and a second client is spawned per region to copy just that subset of the file. Here, I'll use 'qemu-io -C' to perform copy-on-read - that requires that my output file be qcow2 rather than any other particular format, but I'm guaranteed that my output backup file is only populated in the same places that $orig1 was populated at the time the backup started.

$ $qemu_img create -f qcow2 full1.img $size_of_orig1
$ $qemu_img rebase -u -f qcow2 -F raw -b nbd://localhost:10809/sdc \
    full1.img
$ while read line; do
    [[ $line =~ .*start.:.([0-9]*).*length.:.([0-9]*).*data.:.true.* ]] ||
      continue
    start=${BASH_REMATCH[1]} len=${BASH_REMATCH[2]}
    qemu-io -C -c "r $start $len" -f qcow2 full1.img
  done < <($qemu_img map --output=json -f raw nbd://localhost:10809/sdc)
$ $qemu_img rebase -u -f qcow2 -b '' full1.img

and the nice thing about this loop is that once you've figured out how to parse qemu-img map output as one client process, you can use any other process (such as qemu-nbd -c, then dd if=/dev/nbd0 of=$dest bs=64k skip=$((start/64/1024)) seek=$((start/64/1024)) count=$((len/64/1024)) conv=fdatasync) as the NBD client that reads the subset of data of interest (and thus, while qemu-io had to write to full1.img as qcow2, you can use an alternative client to write to raw or any other format of your choosing).

Now that we've copied off the full backup image (or just a portion of it - after all, this is a pull model where we are in charge of how much data we want to read), it's time to tell libvirt that it can conclude the backup job:

$ $virsh backup-end $dom 1
Backup id 1 completed

again, where the command line '1' came from the output of backup-begin and could change to something else rather than being hard-coded in the demo. This maps to the libvirt API call virDomainBackupEnd(dom, 1, 0), which in turn maps to the QMP commands:

{"execute":"nbd-server-remove", "arguments":{"name":"sdc"}}
{"execute":"nbd-server-remove", "arguments":{"name":"sdd"}}
{"execute":"nbd-server-stop"}
{"execute":"block-job-cancel", "arguments":{"device":"backup-sdc"}}
{"execute":"block-job-cancel", "arguments":{"device":"backup-sdd"}}
{"execute":"blockdev-del", "arguments":{"node-name":"backup-sdc"}}
{"execute":"blockdev-del", "arguments":{"node-name":"backup-sdd"}}

to clean up all the things added during backup-begin.

More to come in part 2.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.  +1-919-301-3266
Virtualization: qemu.org | libvirt.org

On 10/4/18 12:05 AM, Eric Blake wrote:
The following (long) email describes a portion of the work-flow of how my proposed incremental backup APIs will work, along with the backend QMP commands that each one executes. I will reply to this thread with further examples (the first example is long enough to be its own email). This is an update to a thread last posted here: https://www.redhat.com/archives/libvir-list/2018-June/msg01066.html
More to come in part 2.
- Second example: a sequence of incremental backups via pull model

In the first example, we did not create a checkpoint at the time of the full pull. That means we have no way to track a delta of changes since that point in time. Let's repeat the full backup (reusing the same backup.xml from before), but this time, we'll add a new parameter, a second XML file for describing the checkpoint we want to create.

Actually, it was easy enough to get virsh to write the XML for me (because it was very similar to existing code in virsh that creates XML for snapshot creation):

$ $virsh checkpoint-create-as --print-xml $dom check1 testing \
    --diskspec sdc --diskspec sdd | tee check1.xml
<domaincheckpoint>
  <name>check1</name>
  <description>testing</description>
  <disks>
    <disk name='sdc'/>
    <disk name='sdd'/>
  </disks>
</domaincheckpoint>

I had to supply two --diskspec arguments to virsh to select just the two qcow2 disks that I am using in my example (rather than every disk in the domain, which is the default when <disks> is not present). I also picked a name (mandatory) and description (optional) to be associated with the checkpoint.

The backup.xml file that we plan to reuse still mentions scratch1.img and scratch2.img as files needed for staging the pull request. However, any contents in those files could interfere with our second backup (after all, every cluster written into that file from the first backup represents a point in time that was frozen at the first backup; but our second backup will want to read the data as the guest sees it now rather than what it was at the first backup), so we MUST regenerate the scratch files. (Perhaps I should have just deleted them at the end of example 1 in my previous email, had I remembered when typing that mail.)

$ $qemu_img create -f qcow2 -b $orig1 -F qcow2 scratch1.img
$ $qemu_img create -f qcow2 -b $orig2 -F qcow2 scratch2.img

Now, to begin the full backup and create a checkpoint at the same time. Also, this time around, it would be nice if the guest had a chance to freeze I/O to the disks prior to the point chosen as the checkpoint. Assuming the guest is trusted, and running the qemu guest agent (qga), we can do that with:

$ $virsh fsfreeze $dom
$ $virsh backup-begin $dom backup.xml check1.xml
Backup id 1 started
backup used description from 'backup.xml'
checkpoint used description from 'check1.xml'
$ $virsh fsthaw $dom

and eventually, we may decide to add a VIR_DOMAIN_BACKUP_BEGIN_QUIESCE flag to combine those three steps into a single API (matching what we've done on some other existing API). In other words, the sequence of QMP operations performed during virDomainBackupBegin are quick enough that they won't stall a freeze operation (at least Windows is picky if you stall a freeze operation longer than 10 seconds).
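Since a failed backup-begin must not leave the guest frozen, it is worth scripting the three steps so that the thaw is guaranteed to run; a minimal sketch (my own wrapper, not something libvirt provides), using virsh's domfsfreeze/domfsthaw spellings of the freeze commands shown above:

$ cat > quiesced-begin.sh <<'SCRIPT'
#!/bin/bash
# Freeze guest filesystems, start the backup job, and guarantee a thaw
# even if backup-begin fails.  Assumes the qemu guest agent is running.
set -e
dom=$1 backup_xml=$2 checkpoint_xml=$3
virsh domfsfreeze "$dom"
trap 'virsh domfsthaw "$dom"' EXIT   # runs on both success and failure
virsh backup-begin "$dom" "$backup_xml" "$checkpoint_xml"
SCRIPT
$ bash quiesced-begin.sh $dom backup.xml check1.xml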
The tweaked $virsh backup-begin now results in a call to:

  virDomainBackupBegin(dom, "<domainbackup ...>", "<domaincheckpoint ...", 0)

and in turn libvirt makes a similar sequence of QMP calls as before, with a slight modification in the middle:

{"execute":"nbd-server-start",...
{"execute":"blockdev-add",...
{"execute":"transaction",
 "arguments":{"actions":[
   {"type":"blockdev-backup", "data":{
     "device":"$node1", "target":"backup-sdc", "sync":"none",
     "job-id":"backup-sdc" }},
   {"type":"blockdev-backup", "data":{
     "device":"$node2", "target":"backup-sdd", "sync":"none",
     "job-id":"backup-sdd" }},
   {"type":"block-dirty-bitmap-add", "data":{
     "node":"$node1", "name":"check1", "persistent":true}},
   {"type":"block-dirty-bitmap-add", "data":{
     "node":"$node2", "name":"check1", "persistent":true}}
 ]}}
{"execute":"nbd-server-add",...

The only change was adding more actions to the "transaction" command - in addition to kicking off the fleece image in the scratch nodes, it ALSO added a persistent bitmap to each of the original images, to track all changes made after the point of the transaction. The bitmaps are persistent - at this point (well, it's better if you wait until after backup-end), you could shut the guest down and restart it, and libvirt will still remember that the checkpoint exists, and qemu will continue to track guest writes via the bitmap. However, the backup job itself is currently live-only, and shutting down the guest while a backup operation is in effect will lose track of the backup job. What that really means is that if the guest shuts down, your current backup job is hosed (you cannot ever get back the point-in-time data from your API request - as your next API request will be a new point in time) - but you have not permanently ruined the guest, and your recovery is to just start a new backup.

Pulling the data out from the backup is unchanged from example 1; virsh backup-dumpxml will show details about the job (yes, the job id is still 1 for now), and when ready, virsh backup-end will end the job and gracefully take down the NBD server with no difference in QMP commands from before. Thus, the creation of a checkpoint didn't change any of the fundamentals of capturing the current backup, but rather is in preparation for the next step.

$ $virsh backup-end $dom 1
Backup id 1 completed
$ rm scratch1.img scratch2.img

[We have not yet designed how qemu bitmaps will interact with external snapshots - but I see two likely scenarios:

1. Down the road, I add a virDomainSnapshotCheckpointCreateXML() API, which adds a checkpointXML parameter but otherwise behaves like the existing virDomainSnapshotCreateXML - if that API is added in a different release than my current API proposals, that's yet another libvirt.so rebase to pick up the new API.

2. My current proposal of virDomainBackupBegin(dom, "<domainbackup>", "<domaincheckpoint>", flags) could instead be tweaked to a single XML parameter, virDomainBackupBegin(dom, "<domainbackup> <domaincheckpoint> ... </domaincheckpoint> </domainbackup>", flags), prior to adding my APIs to libvirt 4.9; then down the road, we also tweak <domainsnapshot> to take an optional <domaincheckpoint> sub-element, and thus reuse the existing virDomainSnapshotCreateXML() to now also create checkpoints without a further API addition.

Speak up now if you have a preference between the two ideas.]
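To make option 2 concrete, the combined input might look something like this (purely illustrative; neither form is committed yet, and the placement of the nested element is a guess):

<domainbackup mode='pull'>
  <server transport='tcp' name='localhost' port='10809'/>
  <disks>
    <disk name='sdc' type='file'>
      <scratch file='/home/eblake/scratch1.img'/>
    </disk>
    <disk name='sdd' type='file'>
      <scratch file='/home/eblake/scratch2.img'/>
    </disk>
  </disks>
  <domaincheckpoint>
    <name>check1</name>
    <description>testing</description>
    <disks>
      <disk name='sdc'/>
      <disk name='sdd'/>
    </disks>
  </domaincheckpoint>
</domainbackup>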
Now that we have concluded the full backup and created a checkpoint, we can do more things with the checkpoint (it is persistent, after all). For example:

$ $virsh checkpoint-list $dom
 Name                 Creation Time
--------------------------------------------
 check1               2018-10-04 15:02:24 -0500

This called virDomainListCheckpoints(dom, &array, 0) under the hood to get a list of virDomainCheckpointPtr objects, then called virDomainCheckpointGetXMLDesc(array[0], 0) to scrape the XML describing that checkpoint in order to display information. Or another approach, using virDomainCheckpointGetXMLDesc(virDomainCheckpointCurrent(dom, 0), 0):

$ $virsh checkpoint-current $dom | head
<domaincheckpoint>
  <name>check1</name>
  <description>testing</description>
  <creationTime>1538683344</creationTime>
  <disks>
    <disk name='vda' checkpoint='no'/>
    <disk name='sdc' checkpoint='bitmap' bitmap='check1'/>
    <disk name='sdd' checkpoint='bitmap' bitmap='check1'/>
  </disks>
  <domain type='kvm'>

which shows the current checkpoint (that is, the checkpoint owning the bitmap that is still receiving live updates), and which bitmap names in the qcow2 files are in use. For convenience, it also recorded the full <domain> description at the time the checkpoint was captured (I used head to limit the size of this email), so that if you later hot-plug things, you still have a record of what state the machine had at the time the checkpoint was created.

The XML output of a checkpoint description is normally static, but sometimes it is useful to know an approximate size of the guest data that has been dirtied since a checkpoint was created (a dynamic value that grows as a guest dirties more clusters). For that, it makes sense to have a flag to request the dynamic data; it's also useful to have a flag that suppresses the (lengthy) <domain> output:

$ $virsh checkpoint-current $dom --size --no-domain
<domaincheckpoint>
  <name>check1</name>
  <description>testing</description>
  <creationTime>1538683344</creationTime>
  <disks>
    <disk name='vda' checkpoint='no'/>
    <disk name='sdc' checkpoint='bitmap' bitmap='check1' size='1048576'/>
    <disk name='sdd' checkpoint='bitmap' bitmap='check1' size='65536'/>
  </disks>
</domaincheckpoint>

This maps to virDomainCheckpointGetXMLDesc(chk, VIR_DOMAIN_CHECKPOINT_XML_NO_DOMAIN | VIR_DOMAIN_CHECKPOINT_XML_SIZE). Under the hood, libvirt calls {"execute":"query-block"} and converts the bitmap size reported by qemu into an estimate of the number of bytes that would be required if you were to start a backup from that checkpoint right now. Note that the result is just an estimate of the storage taken by guest-visible data; you'll probably want to use 'qemu-img measure' to convert that into a size of how much a matching qcow2 image would require when metadata is added in; also remember that the number is constantly growing as the guest writes and causes more of the image to become dirty. But having a feel for how much has changed can be useful for determining if continuing a chain of incremental backups still makes more sense, or if enough of the guest data has changed that doing a full backup is smarter; it is also useful for preallocating how much storage you will need for an incremental backup.

Technically, mapping a checkpoint size request to a single {"execute":"query-block"} call works only when querying the size of the current bitmap. The command also works when querying the cumulative size since an older checkpoint, but under the hood, libvirt must juggle things to create a temporary bitmap, call a few x-block-dirty-bitmap-merge, query the size of that temporary bitmap, then clean things back up again (after all, size(A) + size(B) >= size(A|B), depending on how many clusters were touched during both A and B's tracking of dirty clusters). Again, a nice benefit of having libvirt manage multiple qemu bitmaps under a single libvirt API.
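If you want to peek at the same numbers libvirt is consuming, the raw bitmap statistics are visible via QMP; a debugging-only sketch using virsh qemu-monitor-command plus jq (jq is an assumption of this sketch, and the exact field layout of query-block varies across qemu versions):

$ $virsh qemu-monitor-command $dom '{"execute":"query-block"}' |
    jq '.return[] | select(."dirty-bitmaps") |
        {device, bitmaps: [."dirty-bitmaps"[] | {name, count}]}'

where "count" is the number of dirty bytes tracked by each bitmap (including any temporary bitmaps that exist while a backup job is running).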
Of course, the real reason we created a checkpoint with our full backup is that we want to take an incremental backup next, rather than repeatedly taking full backups. For this, we need a one-line modification to our backup XML to add an <incremental> element; we also want to update our checkpoint XML to start yet another checkpoint when we run our first incremental backup.

$ cat > backup.xml <<EOF
<domainbackup mode='pull'>
  <server transport='tcp' name='localhost' port='10809'/>
  <incremental>check1</incremental>
  <disks>
    <disk name='$orig1' type='file'>
      <scratch file='$PWD/scratch1.img'/>
    </disk>
    <disk name='sdd' type='file'>
      <scratch file='$PWD/scratch2.img'/>
    </disk>
  </disks>
</domainbackup>
EOF
$ $virsh checkpoint-create-as --print-xml $dom check2 \
    --diskspec sdc --diskspec sdd | tee check2.xml
<domaincheckpoint>
  <name>check2</name>
  <disks>
    <disk name='sdc'/>
    <disk name='sdd'/>
  </disks>
</domaincheckpoint>
$ $qemu_img create -f qcow2 -b $orig1 -F qcow2 scratch1.img
$ $qemu_img create -f qcow2 -b $orig2 -F qcow2 scratch2.img

And again, it's time to kick off the backup job:

$ $virsh backup-begin $dom backup.xml check2.xml
Backup id 1 started
backup used description from 'backup.xml'
checkpoint used description from 'check2.xml'

This time, the incremental backup causes libvirt to do a bit more work under the hood:

{"execute":"nbd-server-start",
 "arguments":{"addr":{"type":"inet",
   "data":{"host":"localhost", "port":"10809"}}}}
{"execute":"blockdev-add",
 "arguments":{"driver":"qcow2", "node-name":"backup-sdc",
   "file":{"driver":"file", "filename":"$PWD/scratch1.img"},
   "backing":"$node1"}}
{"execute":"blockdev-add",
 "arguments":{"driver":"qcow2", "node-name":"backup-sdd",
   "file":{"driver":"file", "filename":"$PWD/scratch2.img"},
   "backing":"$node2"}}
{"execute":"block-dirty-bitmap-add",
 "arguments":{"node":"$node1", "name":"backup-sdc"}}
{"execute":"x-block-dirty-bitmap-merge",
 "arguments":{"node":"$node1", "src_name":"check1", "dst_name":"backup-sdc"}}
{"execute":"block-dirty-bitmap-add",
 "arguments":{"node":"$node2", "name":"backup-sdd"}}
{"execute":"x-block-dirty-bitmap-merge",
 "arguments":{"node":"$node2", "src_name":"check1", "dst_name":"backup-sdd"}}
{"execute":"transaction",
 "arguments":{"actions":[
   {"type":"blockdev-backup", "data":{
     "device":"$node1", "target":"backup-sdc", "sync":"none",
     "job-id":"backup-sdc" }},
   {"type":"blockdev-backup", "data":{
     "device":"$node2", "target":"backup-sdd", "sync":"none",
     "job-id":"backup-sdd" }},
   {"type":"x-block-dirty-bitmap-disable", "data":{
     "node":"$node1", "name":"backup-sdc"}},
   {"type":"x-block-dirty-bitmap-disable", "data":{
     "node":"$node2", "name":"backup-sdd"}},
   {"type":"x-block-dirty-bitmap-disable", "data":{
     "node":"$node1", "name":"check1"}},
   {"type":"x-block-dirty-bitmap-disable", "data":{
     "node":"$node2", "name":"check1"}},
   {"type":"block-dirty-bitmap-add", "data":{
     "node":"$node1", "name":"check2", "persistent":true}},
   {"type":"block-dirty-bitmap-add", "data":{
     "node":"$node2", "name":"check2", "persistent":true}}
 ]}}
{"execute":"nbd-server-add",
 "arguments":{"device":"backup-sdc", "name":"sdc"}}
{"execute":"nbd-server-add",
 "arguments":{"device":"backup-sdd", "name":"sdd"}}
{"execute":"x-nbd-server-add-bitmap",
 "arguments":{"name":"sdc", "bitmap":"backup-sdc"}}
{"execute":"x-nbd-server-add-bitmap",
 "arguments":{"name":"sdd", "bitmap":"backup-sdd"}}

Two things stand out here, different from the earlier full backup. First, libvirt is now creating a temporary non-persistent bitmap, merging all data from check1 into the temporary, then freezing writes into the temporary bitmap during the transaction, and telling NBD to expose the bitmap to clients. Second, since we want this backup to start a new checkpoint, we disable the old bitmap and create a new one. The two additions are independent - it is possible to create an incremental backup [controlled by <incremental> in the backup XML] without triggering a new checkpoint [controlled by the presence of non-null checkpoint XML]. In fact, taking an incremental backup without creating a checkpoint is effectively doing differential backups, where multiple backups started at different times each contain all cumulative changes since the same original point in time, such that later backups are larger than earlier backups, but you no longer have to chain those backups to one another to reconstruct the state in any one of the backups.

Now that the pull-model backup job is running, we want to scrape the data off the NBD server. Merely reading nbd://localhost:10809/sdc will read the full contents of the disk - but that defeats the purpose of using the checkpoint in the first place to reduce the amount of data to be backed up. So, let's modify our image-scraping loop from the first example, to now have one client utilizing the x-dirty-bitmap command line extension to drive other clients. Note: that extension is marked experimental in part because it has screwy semantics: if you use it, you can't reliably read any data from the NBD server, but instead can interpret 'qemu-img map' output by treating any "data":false lines as dirty, and "data":true entries as unchanged.

$ image_opts=driver=nbd,export=sdc,server.type=inet,
$ image_opts+=server.host=localhost,server.port=10809,
$ image_opts+=x-dirty-bitmap=qemu:dirty-bitmap:backup-sdc
$ $qemu_img create -f qcow2 inc12.img $size_of_orig1
$ $qemu_img rebase -u -f qcow2 -F raw -b nbd://localhost:10809/sdc \
    inc12.img
$ while read line; do
    [[ $line =~ .*start.:.([0-9]*).*length.:.([0-9]*).*data.:.false.* ]] ||
      continue
    start=${BASH_REMATCH[1]} len=${BASH_REMATCH[2]}
    qemu-io -C -c "r $start $len" -f qcow2 inc12.img
  done < <($qemu_img map --output=json --image-opts $image_opts)
$ $qemu_img rebase -u -f qcow2 -b '' inc12.img

As captured, inc12.img is an incomplete qcow2 file (it only includes clusters touched by the guest since the last incremental or full backup); but since we output into a qcow2 file, we can easily repair the damage:

$ $qemu_img rebase -u -f qcow2 -F qcow2 -b full1.img inc12.img

creating the qcow2 chain 'full1.img <- inc12.img' that contains identical guest-visible contents as would be present in a full backup done at the same moment.
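If jq is available, the same dirty-extent parsing can be done without the bash regular expression; an equivalent sketch (run before the final rebase that strips the backing file, and with the same caveat that "data":false marks dirty extents when x-dirty-bitmap is in use):

$ $qemu_img map --output=json --image-opts $image_opts |
    jq -r '.[] | select(.data == false) | "\(.start) \(.length)"' |
    while read start len; do
      qemu-io -C -c "r $start $len" -f qcow2 inc12.img
    done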
Of course, with the backups now captured, we clean up:

$ $virsh backup-end $dom 1
Backup id 1 completed
$ rm scratch1.img scratch2.img

and this time, virDomainBackupEnd() had to do one additional bit of work to delete the temporary bitmaps:

{"execute":"nbd-server-remove", "arguments":{"name":"sdc"}}
{"execute":"nbd-server-remove", "arguments":{"name":"sdd"}}
{"execute":"nbd-server-stop"}
{"execute":"block-job-cancel", "arguments":{"device":"backup-sdc"}}
{"execute":"block-job-cancel", "arguments":{"device":"backup-sdd"}}
{"execute":"blockdev-del", "arguments":{"node-name":"backup-sdc"}}
{"execute":"blockdev-del", "arguments":{"node-name":"backup-sdd"}}
{"execute":"block-dirty-bitmap-remove", "arguments":{"node":"$node1", "name":"backup-sdc"}}
{"execute":"block-dirty-bitmap-remove", "arguments":{"node":"$node2", "name":"backup-sdd"}}

At this point, it should be fairly obvious that you can create more incremental backups, by repeatedly updating the <incremental> line in backup.xml, and adjusting the checkpoint XML to move on to a successive name. And while incremental backups are the most common (using the current active checkpoint as the <incremental> when starting the next), the scheme is also set up to permit differential backups from any existing checkpoint to the current point in time (since libvirt is already creating a temporary bitmap as its basis for the x-nbd-server-add-bitmap, all it has to do is just add an appropriate number of x-block-dirty-bitmap-merge calls to collect all bitmaps in the chain from the requested checkpoint to the current checkpoint).

More to come in part 3.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.  +1-919-301-3266
Virtualization: qemu.org | libvirt.org

On Fri, Oct 5, 2018 at 7:58 AM Eric Blake <eblake@redhat.com> wrote:
On 10/4/18 12:05 AM, Eric Blake wrote:
The following (long) email describes a portion of the work-flow of how my proposed incremental backup APIs will work, along with the backend QMP commands that each one executes. I will reply to this thread with further examples (the first example is long enough to be its own email). This is an update to a thread last posted here: https://www.redhat.com/archives/libvir-list/2018-June/msg01066.html
More to come in part 2.
- Second example: a sequence of incremental backups via pull model
In the first example, we did not create a checkpoint at the time of the full pull. That means we have no way to track a delta of changes since that point in time.
Why do we want to support backup without creating a checkpoint? If we don't have any real use case, I suggest we always require a checkpoint.
Let's repeat the full backup (reusing the same backup.xml from before), but this time, we'll add a new parameter, a second XML file for describing the checkpoint we want to create.
Actually, it was easy enough to get virsh to write the XML for me (because it was very similar to existing code in virsh that creates XML for snapshot creation):
$ $virsh checkpoint-create-as --print-xml $dom check1 testing \
    --diskspec sdc --diskspec sdd | tee check1.xml
<domaincheckpoint>
  <name>check1</name>
We should use an id, not a name, even if the name is also unique, like in most libvirt APIs. In RHV we will always use a UUID for this.
  <description>testing</description>
  <disks>
    <disk name='sdc'/>
    <disk name='sdd'/>
  </disks>
</domaincheckpoint>
I had to supply two --diskspec arguments to virsh to select just the two qcow2 disks that I am using in my example (rather than every disk in the domain, which is the default when <disks> is not present).
So is an empty <disks/> a valid configuration that selects all disks, or does only omitting the "disks" element select all disks?
I also picked a name (mandatory) and description (optional) to be associated with the checkpoint.
The backup.xml file that we plan to reuse still mentions scratch1.img and scratch2.img as files needed for staging the pull request. However, any contents in those files could interfere with our second backup (after all, every cluster written into that file from the first backup represents a point in time that was frozen at the first backup; but our second backup will want to read the data as the guest sees it now rather than what it was at the first backup), so we MUST regenerate the scratch files. (Perhaps I should have just deleted them at the end of example 1 in my previous email, had I remembered when typing that mail).
$ $qemu_img create -f qcow2 -b $orig1 -F qcow2 scratch1.img
$ $qemu_img create -f qcow2 -b $orig2 -F qcow2 scratch2.img
Now, to begin the full backup and create a checkpoint at the same time. Also, this time around, it would be nice if the guest had a chance to freeze I/O to the disks prior to the point chosen as the checkpoint. Assuming the guest is trusted, and running the qemu guest agent (qga), we can do that with:
$ $virsh fsfreeze $dom
$ $virsh backup-begin $dom backup.xml check1.xml
Backup id 1 started
backup used description from 'backup.xml'
checkpoint used description from 'check1.xml'
$ $virsh fsthaw $dom
Great, this answers my (unsent) question about freeze/thaw from part 1 :-)
and eventually, we may decide to add a VIR_DOMAIN_BACKUP_BEGIN_QUIESCE flag to combine those three steps into a single API (matching what we've done on some other existing API). In other words, the sequence of QMP operations performed during virDomainBackupBegin are quick enough that they won't stall a freeze operation (at least Windows is picky if you stall a freeze operation longer than 10 seconds).
We use fsFreeze/fsThaw directly in RHV since we need to support external snapshots (e.g. ceph), so we don't need this functionality, but it sounds like a good idea to make it work like snapshots.
The tweaked $virsh backup-begin now results in a call to:

  virDomainBackupBegin(dom, "<domainbackup ...>", "<domaincheckpoint ...", 0)

and in turn libvirt makes a similar sequence of QMP calls as before, with a slight modification in the middle:

{"execute":"nbd-server-start",...
{"execute":"blockdev-add",...
This does not work yet for network disks like "rbd" and "glusterfs". Does that mean they will not be supported for backup?
{"execute":"transaction",
 "arguments":{"actions":[
   {"type":"blockdev-backup", "data":{
     "device":"$node1", "target":"backup-sdc", "sync":"none",
     "job-id":"backup-sdc" }},
   {"type":"blockdev-backup", "data":{
     "device":"$node2", "target":"backup-sdd", "sync":"none",
     "job-id":"backup-sdd" }},
   {"type":"block-dirty-bitmap-add", "data":{
     "node":"$node1", "name":"check1", "persistent":true}},
   {"type":"block-dirty-bitmap-add", "data":{
     "node":"$node2", "name":"check1", "persistent":true}}
 ]}}
{"execute":"nbd-server-add",...
What if this sequence fails in the middle? Will libvirt handle all failures and roll back to the previous state? What are the semantics of "execute":"transaction"? Does it mean that qemu will handle all possible failures in one of the actions? (Will continue later)

On 10/9/18 8:29 AM, Nir Soffer wrote:
On Fri, Oct 5, 2018 at 7:58 AM Eric Blake <eblake@redhat.com> wrote:
On 10/4/18 12:05 AM, Eric Blake wrote:
The following (long) email describes a portion of the work-flow of how my proposed incremental backup APIs will work, along with the backend QMP commands that each one executes. I will reply to this thread with further examples (the first example is long enough to be its own email). This is an update to a thread last posted here: https://www.redhat.com/archives/libvir-list/2018-June/msg01066.html
More to come in part 2.
- Second example: a sequence of incremental backups via pull model
In the first example, we did not create a checkpoint at the time of the full pull. That means we have no way to track a delta of changes since that point in time.
Why do we want to support backup without creating a checkpoint?
Fleecing. If you want to examine a portion of the disk at a given point in time, then kicking off a pull model backup gives you access to the state of the disk at that time, and your actions are transient. Ending the job when you are done with the fleece cleans up everything needed to perform the fleece operation, and since you did not intend to capture a full (well, a complete) incremental backup, but were rather grabbing just a subset of the disk, you really don't want that point in time to be recorded as a new checkpoint. Also, incremental backups (which are what require checkpoints) are limited to qcow2 disks, but full backups can be performed on any format (including raw disks). If you have a guest that does not use qcow2 disks, you can perform a full backup, but cannot create a checkpoint.
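As a concrete sketch of such a fleecing session (reusing the demo's backup.xml and variables, passing no checkpoint XML, and merely dumping the first few sectors of one export):

$ $virsh backup-begin $dom backup.xml     # no checkpoint XML, so no new bitmap
$ qemu-io -r -f raw -c 'r -v 0 4096' nbd://localhost:10809/sdc
$ $virsh backup-end $dom 1                # tear down the fleece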
If we don't have any real use case, I suggest we always require a checkpoint.
But we do have real cases for backup without checkpoint.
Let's repeat the full backup (reusing the same backup.xml from before), but this time, we'll add a new parameter, a second XML file for describing the checkpoint we want to create.
Actually, it was easy enough to get virsh to write the XML for me (because it was very similar to existing code in virsh that creates XML for snapshot creation):
$ $virsh checkpoint-create-as --print-xml $dom check1 testing \
    --diskspec sdc --diskspec sdd | tee check1.xml
<domaincheckpoint>
  <name>check1</name>
We should use an id, not a name, even if the name is also unique, like in most libvirt APIs.
In RHV we will always use a UUID for this.
Nothing prevents you from using a UUID as your name. But this particular choice of XML (<name>) matches what already exists in the snapshot XML.
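For instance, a management layer that prefers UUIDs can simply generate one as the name; a sketch reusing the earlier command (uuidgen is an external tool, assumed to be installed):

$ $virsh checkpoint-create-as --print-xml $dom "$(uuidgen)" \
    --diskspec sdc --diskspec sdd | tee check1.xml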
  <description>testing</description>
  <disks>
    <disk name='sdc'/>
    <disk name='sdd'/>
  </disks>
</domaincheckpoint>
I had to supply two --diskspec arguments to virsh to select just the two qcow2 disks that I am using in my example (rather than every disk in the domain, which is the default when <disks> is not present).
So is an empty <disks/> a valid configuration that selects all disks, or does only omitting the "disks" element select all disks?
It's about a one-line change to get whichever behavior you find more useful. Right now, I'm leaning towards: <disks> omitted == backup all disks; <disks> present: you MUST have at least one <disk> subelement that explicitly requests a checkpoint (because any omitted <disk> when <disks> is present is skipped). A checkpoint only makes sense as long as there is at least one disk to create a checkpoint with.

But I could also go with: <disks> omitted == backup all disks; <disks> present but <disk> subelements missing: the missing elements default to being backed up, and you have to explicitly provide <disk name='foo' checkpoint='no'> to skip a particular disk.

Or even: <disks> omitted, or <disks> present but <disk> subelements missing: the missing elements defer to the hypervisor for their default state, and the qemu hypervisor defaults to qcow2 disks being backed up/checkpointed and to non-qcow2 disks being omitted. But this latter one feels like more magic, which is harder to document and liable to go wrong.

A stricter version would be: <disks> is mandatory, and no <disk> subelement can be missing (or else the API fails because you weren't explicit in your choice). But that's rather strict, especially since existing snapshot XML handling is not that strict.
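To make the second option concrete, an explicit skip would look something like this (purely illustrative, reusing the checkpoint='no' notation that already appears in the checkpoint dumps above; sdc and sdd would then be included by default):

<domaincheckpoint>
  <name>check1</name>
  <disks>
    <disk name='vda' checkpoint='no'/>
  </disks>
</domaincheckpoint>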
I also picked a name (mandatory) and description (optional) to be associated with the checkpoint.
The backup.xml file that we plan to reuse still mentions scratch1.img and scratch2.img as files needed for staging the pull request. However, any contents in those files could interfere with our second backup (after all, every cluster written into that file from the first backup represents a point in time that was frozen at the first backup; but our second backup will want to read the data as the guest sees it now rather than what it was at the first backup), so we MUST regenerate the scratch files. (Perhaps I should have just deleted them at the end of example 1 in my previous email, had I remembered when typing that mail).
$ $qemu_img create -f qcow2 -b $orig1 -F qcow2 scratch1.img
$ $qemu_img create -f qcow2 -b $orig2 -F qcow2 scratch2.img
Now, to begin the full backup and create a checkpoint at the same time. Also, this time around, it would be nice if the guest had a chance to freeze I/O to the disks prior to the point chosen as the checkpoint. Assuming the guest is trusted, and running the qemu guest agent (qga), we can do that with:
$ $virsh fsfreeze $dom
$ $virsh backup-begin $dom backup.xml check1.xml
Backup id 1 started
backup used description from 'backup.xml'
checkpoint used description from 'check1.xml'
$ $virsh fsthaw $dom
Great, this answers my (unsent) question about freeze/thaw from part 1 :-)
and eventually, we may decide to add a VIR_DOMAIN_BACKUP_BEGIN_QUIESCE flag to combine those three steps into a single API (matching what we've done on some other existing API). In other words, the sequence of QMP operations performed during virDomainBackupBegin are quick enough that they won't stall a freeze operation (at least Windows is picky if you stall a freeze operation longer than 10 seconds).
We use fsFreeze/fsThaw directly in RHV since we need to support external snapshots (e.g. ceph), so we don't need this functionality, but it sounds like a good idea to make it work like snapshots.
And indeed, a future enhancement will be figuring out how we can create a checkpoint at the same time as a snapshot (as mentioned elsewhere in the email). A snapshot and a checkpoint created at the same atomic point should obviously both be able to happen at a quiescent point in guest I/O.
The tweaked $virsh backup-begin now results in a call to:

  virDomainBackupBegin(dom, "<domainbackup ...>", "<domaincheckpoint ...", 0)

and in turn libvirt makes a similar sequence of QMP calls as before, with a slight modification in the middle:

{"execute":"nbd-server-start",...
{"execute":"blockdev-add",...
This does not work yet for network disks like "rbd" and "glusterfs". Does that mean they will not be supported for backup?
Full backups can happen regardless of underlying format. But incremental backups require checkpoints, and checkpoints require qcow2 persistent bitmaps. As long as you have a qcow2 format on rbd or glusterfs, you should be able to create checkpoints on that image, and therefore perform incremental backups. Storage-wise, during a pull model backup, you would have your qcow2 format on remote glusterfs storage which is where the persistent bitmap is written, and temporarily also have a scratch qcow2 file on the local machine for performing copy-on-write needed to preserve the point in time semantics for as long as the backup operation is running.
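For example, a qcow2 image hosted on gluster would be described with a <disk> along these lines (illustrative only; the host name and volume path are made up), and the persistent bitmaps backing any checkpoints would then live inside that remote qcow2 file:

<disk type='network' device='disk'>
  <driver name='qemu' type='qcow2'/>
  <source protocol='gluster' name='volume/path/disk3.qcow2'>
    <host name='gluster.example.com' port='24007'/>
  </source>
  <target dev='sde' bus='scsi'/>
</disk>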
{"execute":"transaction",
 "arguments":{"actions":[
   {"type":"blockdev-backup", "data":{
     "device":"$node1", "target":"backup-sdc", "sync":"none",
     "job-id":"backup-sdc" }},
   {"type":"blockdev-backup", "data":{
     "device":"$node2", "target":"backup-sdd", "sync":"none",
     "job-id":"backup-sdd" }},
   {"type":"block-dirty-bitmap-add", "data":{
     "node":"$node1", "name":"check1", "persistent":true}},
   {"type":"block-dirty-bitmap-add", "data":{
     "node":"$node2", "name":"check1", "persistent":true}}
 ]}}
{"execute":"nbd-server-add",...
What if this sequence fails in the middle? Will libvirt handle all failures and roll back to the previous state?
What are the semantics of "execute":"transaction"? Does it mean that qemu will handle all possible failures in one of the actions?
qemu already promises that a "transaction" succeeds or fails as a group. As to other failures, the full recovery sequence is handled by libvirt, and looks like:

Fail on "nbd-server-start":
 - nothing to roll back

Fail on first "blockdev-add":
 - nbd-server-stop

Fail on subsequent "blockdev-add":
 - blockdev-del on earlier scratch file additions
 - nbd-server-stop

Fail on any "block-dirty-bitmap-add" or "x-block-dirty-bitmap-merge":
 - block-dirty-bitmap-remove on any temporary bitmaps that were created
 - blockdev-del on all scratch file additions
 - nbd-server-stop

Fail on "transaction":
 - block-dirty-bitmap-remove on all temporary bitmaps
 - blockdev-del on all additions
 - nbd-server-stop

Fail on "nbd-server-add" or "x-nbd-server-add-bitmap":
 - if a checkpoint was attempted during "transaction":
   -- perform x-block-dirty-bitmap-enable to re-enable the bitmap that was in use prior to the transaction
   -- perform x-block-dirty-bitmap-merge to merge the new bitmap into the re-enabled bitmap
   -- perform block-dirty-bitmap-remove on the new bitmap
 - block-job-cancel
 - block-dirty-bitmap-remove on all temporary bitmaps
 - blockdev-del on all scratch file additions
 - nbd-server-stop
More to come in part 3.
I still need to finish writing that, but part 3 will be a demonstration of the push model (where qemu writes the backup to a given destination, without a scratch file, and without an NBD server, but where you are limited to what qemu knows how to write).

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.  +1-919-301-3266
Virtualization: qemu.org | libvirt.org