On Fri, Oct 5, 2018 at 7:58 AM Eric Blake <eblake(a)redhat.com> wrote:
On 10/4/18 12:05 AM, Eric Blake wrote:
> The following (long) email describes a portion of the work-flow of how
> my proposed incremental backup APIs will work, along with the backend
> QMP commands that each one executes. I will reply to this thread with
> further examples (the first example is long enough to be its own email).
> This is an update to a thread last posted here:
> https://www.redhat.com/archives/libvir-list/2018-June/msg01066.html
>
> More to come in part 2.
>
- Second example: a sequence of incremental backups via pull model
In the first example, we did not create a checkpoint at the time of the
full pull. That means we have no way to track a delta of changes since
that point in time.
Why do we want to support backup without creating a checkpoint?
If we don't have a real use case, I suggest always requiring a
checkpoint.
Let's repeat the full backup (reusing the same
backup.xml from before), but this time, we'll add a new parameter, a
second XML file for describing the checkpoint we want to create.
Actually, it was easy enough to get virsh to write the XML for me
(because it was very similar to existing code in virsh that creates XML
for snapshot creation):
$ $virsh checkpoint-create-as --print-xml $dom check1 testing \
--diskspec sdc --diskspec sdd | tee check1.xml
<domaincheckpoint>
<name>check1</name>
We should use an id, not a name, even if the name is also unique, as
in most libvirt APIs.
In RHV we will always use a UUID for this.
<description>testing</description>
<disks>
<disk name='sdc'/>
<disk name='sdd'/>
</disks>
</domaincheckpoint>
I had to supply two --diskspec arguments to virsh to select just the two
qcow2 disks that I am using in my example (rather than every disk in the
domain, which is the default when <disks> is not present).
So is <disks/> a valid configuration that selects all disks, or is it
omitting the "disks" element that selects all disks?
I also picked
a name (mandatory) and description (optional) to be associated with the
checkpoint.
The backup.xml file that we plan to reuse still mentions scratch1.img
and scratch2.img as files needed for staging the pull request. However,
any contents in those files could interfere with our second backup
(after all, every cluster written into that file from the first backup
represents a point in time that was frozen at the first backup; but our
second backup will want to read the data as the guest sees it now rather
than what it was at the first backup), so we MUST regenerate the scratch
files. (Perhaps I should have just deleted them at the end of example 1
in my previous email, had I remembered when typing that mail).
$ $qemu_img create -f qcow2 -b $orig1 -F qcow2 scratch1.img
$ $qemu_img create -f qcow2 -b $orig2 -F qcow2 scratch2.img
Now, to begin the full backup and create a checkpoint at the same time.
Also, this time around, it would be nice if the guest had a chance to
freeze I/O to the disks prior to the point chosen as the checkpoint.
Assuming the guest is trusted, and running the qemu guest agent (qga),
we can do that with:
$ $virsh fsfreeze $dom
$ $virsh backup-begin $dom backup.xml check1.xml
Backup id 1 started
backup used description from 'backup.xml'
checkpoint used description from 'check1.xml'
$ $virsh fsthaw $dom
Great, this answers my (unsent) question about freeze/thaw from part 1 :-)
and eventually, we may decide to add a VIR_DOMAIN_BACKUP_BEGIN_QUIESCE
flag to combine those three steps into a single API (matching what we've
done on some other existing API). In other words, the sequence of QMP
operations performed during virDomainBackupBegin are quick enough that
they won't stall a freeze operation (at least Windows is picky if you
stall a freeze operation longer than 10 seconds).
We use fsFreeze/fsThaw directly in RHV since we need to support external
snapshots (e.g. ceph), so we don't need this functionality, but it sounds
like a good idea to make it work like snapshots.
The tweaked $virsh backup-begin now results in a call to:
virDomainBackupBegin(dom, "<domainbackup ...>",
"<domaincheckpoint ...>", 0)
and in turn libvirt makes a similar sequence of QMP calls as before,
with a slight modification in the middle:
{"execute":"nbd-server-start",...
{"execute":"blockdev-add",...
This does not work yet for network disks like "rbd" and "glusterfs".
Does it mean that they will not be supported for backup?
{"execute":"transaction",
"arguments":{"actions":[
{"type":"blockdev-backup", "data":{
"device":"$node1", "target":"backup-sdc",
"sync":"none",
"job-id":"backup-sdc" }},
{"type":"blockdev-backup", "data":{
"device":"$node2", "target":"backup-sdd",
"sync":"none",
"job-id":"backup-sdd" }},
{"type":"block-dirty-bitmap-add", "data":{
"node":"$node1", "name":"check1",
"persistent":true}},
{"type":"block-dirty-bitmap-add", "data":{
"node":"$node2", "name":"check1",
"persistent":true}}
]}}
{"execute":"nbd-server-add",...
What if this sequence fails in the middle? Will libvirt handle all failures
and roll back to the previous state?
What are the semantics of "execute": "transaction"? Does it mean that
qemu will handle all possible failures in any one of the actions?
(Will continue later)
The only change was adding more actions to the "transaction" command -
in addition to kicking off the fleece image in the scratch nodes, it
ALSO added a persistent bitmap to each of the original images, to track
all changes made after the point of the transaction. The bitmaps are
persistent - at this point (well, it's better if you wait until after
backup-end), you could shut the guest down and restart it, and libvirt
will still remember that the checkpoint exists, and qemu will continue to
track guest writes via the bitmap. However, the backup job itself is
currently live-only, and shutting down the guest while a backup
operation is in effect will lose track of the backup job. What that
really means is that if the guest shuts down, your current backup job is
hosed (you cannot ever get back the point-in-time data from your API
request - as your next API request will be a new point in time) - but
you have not permanently ruined the guest, and your recovery is to just
start a new backup.
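To make the shape of that transaction concrete, here is a minimal Python sketch that assembles the same QMP command for an arbitrary set of disks. The helper name build_full_backup_transaction and its parameters are hypothetical; it only mirrors the JSON shown above and is not how libvirt itself is implemented.

```python
import json


def build_full_backup_transaction(disks, checkpoint):
    """Build a QMP 'transaction' that starts fleecing and adds a bitmap.

    disks: dict mapping a disk target (e.g. 'sdc') to its qemu node name.
    checkpoint: name given to the persistent bitmap on every disk.
    Hypothetical helper; it only reproduces the JSON layout from the email.
    """
    actions = []
    # One sync=none blockdev-backup per disk, targeting the scratch node.
    for target, node in disks.items():
        actions.append({
            "type": "blockdev-backup",
            "data": {
                "device": node,
                "target": "backup-" + target,
                "sync": "none",
                "job-id": "backup-" + target,
            },
        })
    # One persistent bitmap per disk, created in the same transaction so
    # that it starts tracking at the same point in time as the backup.
    for node in disks.values():
        actions.append({
            "type": "block-dirty-bitmap-add",
            "data": {"node": node, "name": checkpoint, "persistent": True},
        })
    return {"execute": "transaction", "arguments": {"actions": actions}}


cmd = build_full_backup_transaction({"sdc": "node1", "sdd": "node2"}, "check1")
print(json.dumps(cmd, indent=1))
```

Keeping both kinds of action in one list is the point of the sketch: the bitmaps and the fleecing jobs are meant to share a single point in time.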
Pulling the data out from the backup is unchanged from example 1; virsh
backup-dumpxml will show details about the job (yes, the job id is still
1 for now), and when ready, virsh backup-end will end the job and
gracefully take down the NBD server with no difference in QMP commands
from before. Thus, the creation of a checkpoint didn't change any of
the fundamentals of capturing the current backup, but rather is in
preparation for the next step.
$ $virsh backup-end $dom 1
Backup id 1 completed
$ rm scratch1.img scratch2.img
[We have not yet designed how qemu bitmaps will interact with external
snapshots - but I see two likely scenarios:
1. Down the road, I add a virDomainSnapshotCheckpointCreateXML() API,
which adds a checkpointXML parameter but otherwise behaves like the
existing virDomainSnapshotCreateXML - if that API is added in a
different release than my current API proposals, that's yet another
libvirt.so rebase to pick up the new API.
2. My current proposal of virDomainBackupBegin(dom, "<domainbackup>",
"<domaincheckpoint>", flags) could instead be tweaked to a single XML
parameter, virDomainBackupBegin(dom, "
<domainbackup>
<domaincheckpoint> ... </domaincheckpoint>
</domainbackup>", flags) prior to adding my APIs to libvirt 4.9, then
down the road, we also tweak <domainsnapshot> to take an optional
<domaincheckpoint> sub-element, and thus reuse the existing
virDomainSnapshotCreateXML() to now also create checkpoints without a
further API addition.
Speak up now if you have a preference between the two ideas]
Now that we have concluded the full backup and created a checkpoint, we
can do more things with the checkpoint (it is persistent, after all).
For example:
$ $virsh checkpoint-list $dom
Name Creation Time
--------------------------------------------
check1 2018-10-04 15:02:24 -0500
called virDomainListCheckpoints(dom, &array, 0) under the hood to get a
list of virDomainCheckpointPtr objects, then called
virDomainCheckpointGetXMLDesc(array[0], 0) to scrape the XML describing
that checkpoint in order to display information. Or another approach,
using virDomainCheckpointGetXMLDesc(virDomainCheckpointCurrent(dom, 0), 0):
$ $virsh checkpoint-current $dom | head
<domaincheckpoint>
<name>check1</name>
<description>testing</description>
<creationTime>1538683344</creationTime>
<disks>
<disk name='vda' checkpoint='no'/>
<disk name='sdc' checkpoint='bitmap'
bitmap='check1'/>
<disk name='sdd' checkpoint='bitmap'
bitmap='check1'/>
</disks>
<domain type='kvm'>
which shows the current checkpoint (that is, the checkpoint owning the
bitmap that is still receiving live updates), and which bitmap names in
the qcow2 files are in use. For convenience, it also recorded the full
<domain> description at the time the checkpoint was captured (I used
head to limit the size of this email), so that if you later hot-plug
things, you still have a record of what state the machine had at the
time the checkpoint was created.
The XML output of a checkpoint description is normally static, but
sometimes it is useful to know an approximate size of the guest data
that has been dirtied since a checkpoint was created (a dynamic value
that grows as a guest dirties more clusters). For that, it makes sense
to have a flag to request the dynamic data; it's also useful to have a
flag that suppresses the (lengthy) <domain> output:
$ $virsh checkpoint-current $dom --size --no-domain
<domaincheckpoint>
<name>check1</name>
<description>testing</description>
<creationTime>1538683344</creationTime>
<disks>
<disk name='vda' checkpoint='no'/>
<disk name='sdc' checkpoint='bitmap' bitmap='check1'
size='1048576'/>
<disk name='sdd' checkpoint='bitmap' bitmap='check1'
size='65536'/>
</disks>
</domaincheckpoint>
This maps to virDomainCheckpointGetXMLDesc(chk,
VIR_DOMAIN_CHECKPOINT_XML_NO_DOMAIN | VIR_DOMAIN_CHECKPOINT_XML_SIZE).
Under the hood, libvirt calls
{"execute":"query-block"}
and converts the bitmap size reported by qemu into an estimate of the
number of bytes that would be required if you were to start a backup
from that checkpoint right now. Note that the result is just an
estimate of the storage taken by guest-visible data; you'll probably
want to use 'qemu-img measure' to convert that into a size of how much a
matching qcow2 image would require when metadata is added in; also
remember that the number is constantly growing as the guest writes and
causes more of the image to become dirty. But having a feel for how
much has changed can be useful for determining if continuing a chain of
incremental backups still makes more sense, or if enough of the guest
data has changed that doing a full backup is smarter; it is also useful
for preallocating how much storage you will need for an incremental backup.
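As a rough illustration of that conversion, the Python sketch below pulls the per-device dirty byte count for one bitmap name out of a query-block reply. It assumes each device entry carries a "dirty-bitmaps" list whose items have "name" and "count" (dirty bytes) fields; the sample reply, including the device names, is invented and trimmed to just those fields. The function name is hypothetical.

```python
def dirty_bytes_by_device(query_block_reply, bitmap_name):
    """Map device name -> dirty byte count for one named bitmap.

    Assumes the reply layout described in the lead-in; devices that do
    not carry the bitmap are simply left out of the result.
    """
    result = {}
    for device in query_block_reply.get("return", []):
        for bitmap in device.get("dirty-bitmaps", []):
            if bitmap.get("name") == bitmap_name:
                result[device["device"]] = bitmap["count"]
    return result


# Invented sample reply, reduced to the fields used above.
sample = {"return": [
    {"device": "drive-sdc",
     "dirty-bitmaps": [{"name": "check1", "count": 1048576,
                        "granularity": 65536}]},
    {"device": "drive-sdd",
     "dirty-bitmaps": [{"name": "check1", "count": 65536,
                        "granularity": 65536}]},
    {"device": "drive-vda"},
]}

sizes = dirty_bytes_by_device(sample, "check1")
print(sizes)
print(sum(sizes.values()))
```

The total is only a lower bound for planning purposes, since the guest keeps dirtying clusters while you look at it.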
Technically, libvirt mapping a checkpoint size request to a single
{"execute":"query-block"} works only when querying the size of the
current bitmap. The command also works when querying the cumulative size
since an older checkpoint, but under the hood, libvirt must juggle
things to create a temporary bitmap, call
x-block-dirty-bitmap-merge a few times, query the size of that temporary bitmap,
then clean things back up again (after all, size(A) + size(B) >=
size(A|B), depending on how many clusters were touched during both A and
B's tracking of dirty clusters). Again, a nice benefit of having
libvirt manage multiple qemu bitmaps under a single libvirt API.
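The inequality is easy to see with a toy model. The Python sketch below treats each bitmap as a set of dirty cluster indices (a simplification, with an assumed 64k cluster size); clusters dirtied under both checkpoints are counted twice when the sizes are added, but only once after a merge.

```python
CLUSTER_SIZE = 65536  # assumed granularity for this toy example


def dirty_size(clusters):
    """Bytes covered by a set of dirty cluster indices."""
    return len(clusters) * CLUSTER_SIZE


a = {0, 1, 2, 5}   # clusters dirtied while checkpoint A was active
b = {2, 5, 9}      # clusters dirtied while checkpoint B was active
merged = a | b     # what a merge into a temporary bitmap would hold

print(dirty_size(a) + dirty_size(b))  # 458752: clusters 2 and 5 counted twice
print(dirty_size(merged))             # 327680: each cluster counted once
```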
Of course, the real reason we created a checkpoint with our full backup
is that we want to take an incremental backup next, rather than
repeatedly taking full backups. For this, we need a one-line
modification to our backup XML to add an <incremental> element; we also
want to update our checkpoint XML to start yet another checkpoint when
we run our first incremental backup.
$ cat > backup.xml <<EOF
<domainbackup mode='pull'>
<server transport='tcp' name='localhost' port='10809'/>
<incremental>check1</incremental>
<disks>
<disk name='$orig1' type='file'>
<scratch file='$PWD/scratch1.img'/>
</disk>
<disk name='sdd' type='file'>
<scratch file='$PWD/scratch2.img'/>
</disk>
</disks>
</domainbackup>
EOF
$ $virsh checkpoint-create-as --print-xml $dom check2 \
--diskspec sdc --diskspec sdd | tee check2.xml
<domaincheckpoint>
<name>check2</name>
<disks>
<disk name='sdc'/>
<disk name='sdd'/>
</disks>
</domaincheckpoint>
$ $qemu_img create -f qcow2 -b $orig1 -F qcow2 scratch1.img
$ $qemu_img create -f qcow2 -b $orig2 -F qcow2 scratch2.img
And again, it's time to kick off the backup job:
$ $virsh backup-begin $dom backup.xml check2.xml
Backup id 1 started
backup used description from 'backup.xml'
checkpoint used description from 'check2.xml'
This time, the incremental backup causes libvirt to do a bit more work
under the hood:
{"execute":"nbd-server-start",
"arguments":{"addr":{"type":"inet",
"data":{"host":"localhost",
"port":"10809"}}}}
{"execute":"blockdev-add",
"arguments":{"driver":"qcow2",
"node-name":"backup-sdc",
"file":{"driver":"file",
"filename":"$PWD/scratch1.img"},
"backing":"$node1"}}
{"execute":"blockdev-add",
"arguments":{"driver":"qcow2",
"node-name":"backup-sdd",
"file":{"driver":"file",
"filename":"$PWD/scratch2.img"},
"backing":"$node2"}}
{"execute":"block-dirty-bitmap-add",
"arguments":{"node":"$node1",
"name":"backup-sdc"}}
{"execute":"x-block-dirty-bitmap-merge",
"arguments":{"node":"$node1",
"src_name":"check1",
"dst_name":"backup-sdc"}}
{"execute":"block-dirty-bitmap-add",
"arguments":{"node":"$node2",
"name":"backup-sdd"}}
{"execute":"x-block-dirty-bitmap-merge",
"arguments":{"node":"$node2",
"src_name":"check1",
"dst_name":"backup-sdd"}}
{"execute":"transaction",
"arguments":{"actions":[
{"type":"blockdev-backup", "data":{
"device":"$node1", "target":"backup-sdc",
"sync":"none",
"job-id":"backup-sdc" }},
{"type":"blockdev-backup", "data":{
"device":"$node2", "target":"backup-sdd",
"sync":"none",
"job-id":"backup-sdd" }},
{"type":"x-block-dirty-bitmap-disable", "data":{
"node":"$node1", "name":"backup-sdc"}},
{"type":"x-block-dirty-bitmap-disable", "data":{
"node":"$node2", "name":"backup-sdd"}},
{"type":"x-block-dirty-bitmap-disable", "data":{
"node":"$node1", "name":"check1"}},
{"type":"x-block-dirty-bitmap-disable", "data":{
"node":"$node2", "name":"check1"}},
{"type":"block-dirty-bitmap-add", "data":{
"node":"$node1", "name":"check2",
"persistent":true}},
{"type":"block-dirty-bitmap-add", "data":{
"node":"$node2", "name":"check2",
"persistent":true}}
]}}
{"execute":"nbd-server-add",
"arguments":{"device":"backup-sdc",
"name":"sdc"}}
{"execute":"nbd-server-add",
"arguments":{"device":"backup-sdd",
"name":"sdd"}}
{"execute":"x-nbd-server-add-bitmap",
"arguments":{"name":"sdc",
"bitmap":"backup-sdc"}}
{"execute":"x-nbd-server-add-bitmap",
"arguments":{"name":"sdd",
"bitmap":"backup-sdd"}}
Two things stand out here, different from the earlier full backup. First
is that libvirt is now creating a temporary non-persistent bitmap,
merging all data from check1 into the temporary, then freezing writes
into the temporary bitmap during the transaction, and telling NBD to
expose the bitmap to clients. The second is that since we want this
backup to start a new checkpoint, we disable the old bitmap and create a
new one. The two additions are independent - it is possible to create an
incremental backup [<incremental> in backup XML] without triggering a
new checkpoint [presence of non-null checkpoint XML]. In fact, taking
an incremental backup without creating a checkpoint is effectively doing
differential backups, where multiple backups started at different times
each contain all cumulative changes since the same original point in
time, such that later backups are larger than earlier backups, but you
no longer have to chain those backups to one another to reconstruct the
state in any one of the backups.
Now that the pull-model backup job is running, we want to scrape the
data off the NBD server. Merely reading nbd://localhost:10809/sdc will
read the full contents of the disk - but that defeats the purpose of
using the checkpoint in the first place to reduce the amount of data to
be backed up. So, let's modify our image-scraping loop from the first
example, to now have one client utilizing the x-dirty-bitmap command
line extension to drive other clients. Note: that extension is marked
experimental in part because it has screwy semantics: if you use it, you
can't reliably read any data from the NBD server, but instead can
interpret 'qemu-img map' output by treating any "data":false lines as
dirty, and "data":true entries as unchanged.
$ image_opts=driver=nbd,export=sdc,server.type=inet,
$ image_opts+=server.host=localhost,server.port=10809,
$ image_opts+=x-dirty-bitmap=qemu:dirty-bitmap:backup-sdc
$ $qemu_img create -f qcow2 inc12.img $size_of_orig1
$ $qemu_img rebase -u -f qcow2 -F raw -b nbd://localhost:10809/sdc \
inc12.img
$ while read line; do
[[ $line =~ .*start.:.([0-9]*).*length.:.([0-9]*).*data.:.false.* ]] ||
continue
start=${BASH_REMATCH[1]} len=${BASH_REMATCH[2]}
qemu-io -C -c "r $start $len" -f qcow2 inc12.img
done < <($qemu_img map --output=json --image-opts "$image_opts")
$ $qemu_img rebase -u -f qcow2 -b '' inc12.img
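The regular-expression match in the loop above can also be done with a JSON parser, which is less sensitive to field order and whitespace. Here is a minimal Python sketch of just the extent selection, assuming 'qemu-img map --output=json' prints a JSON array of entries with "start", "length" and "data" fields; the sample text is invented, and the function name is hypothetical. Remember the inverted meaning under x-dirty-bitmap: "data":false marks a dirty extent.

```python
import json


def dirty_extents(map_json):
    """Return (start, length) pairs for entries reporting "data": false.

    With x-dirty-bitmap in effect, those are the extents the guest has
    dirtied since the checkpoint.
    """
    return [(entry["start"], entry["length"])
            for entry in json.loads(map_json)
            if not entry["data"]]


# Invented sample of qemu-img map JSON output.
sample = """[
{ "start": 0, "length": 65536, "depth": 0, "zero": false, "data": true},
{ "start": 65536, "length": 131072, "depth": 0, "zero": true, "data": false},
{ "start": 196608, "length": 65536, "depth": 0, "zero": false, "data": true}
]"""

for start, length in dirty_extents(sample):
    # Same argument string the shell loop hands to qemu-io -c.
    print("r %d %d" % (start, length))
```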
As captured, inc12.img is an incomplete qcow2 file (it only includes
clusters touched by the guest since the last incremental or full
backup); but since we output into a qcow2 file, we can easily repair the
damage:
$ $qemu_img rebase -u -f qcow2 -F qcow2 -b full1.img inc12.img
creating the qcow2 chain 'full1.img <- inc12.img' that contains
identical guest-visible contents as would be present in a full backup
done at the same moment.
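For anyone unfamiliar with backing chains, this toy Python model shows why the chain reads the same as a full backup: a read is served by the newest image that has the cluster allocated, and falls through to the backing file otherwise. The dictionaries stand in for allocated clusters and are purely illustrative; nothing here parses real qcow2 files.

```python
def read_cluster(chain, index):
    """Read one cluster through a backing chain, newest image first."""
    for image in chain:
        if index in image:
            return image[index]
    return b"\0"  # unallocated in every layer: reads as zeroes


full1 = {0: b"A", 1: b"B", 2: b"C"}  # full backup: every cluster present
inc12 = {1: b"b"}                    # incremental: only the dirtied cluster
chain = [inc12, full1]               # full1.img <- inc12.img

print([read_cluster(chain, i) for i in range(3)])  # [b'A', b'b', b'C']
```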
Of course, with the backups now captured, we clean up:
$ $virsh backup-end $dom 1
Backup id 1 completed
$ rm scratch1.img scratch2.img
and this time, virDomainBackupEnd() had to do one additional bit of work
to delete the temporary bitmaps:
{"execute":"nbd-server-remove",
"arguments":{"name":"sdc"}}
{"execute":"nbd-server-remove",
"arguments":{"name":"sdd"}}
{"execute":"nbd-server-stop"}
{"execute":"block-job-cancel",
"arguments":{"device":"backup-sdc"}}
{"execute":"block-job-cancel",
"arguments":{"device":"backup-sdd"}}
{"execute":"blockdev-del",
"arguments":{"node-name":"backup-sdc"}}
{"execute":"blockdev-del",
"arguments":{"node-name":"backup-sdd"}}
{"execute":"block-dirty-bitmap-remove",
"arguments":{"node":"$node1",
"name":"backup-sdc"}}
{"execute":"block-dirty-bitmap-remove",
"arguments":{"node":"$node2",
"name":"backup-sdd"}}
At this point, it should be fairly obvious that you can create more
incremental backups, by repeatedly updating the <incremental> line in
backup.xml, and adjusting the checkpoint XML to move on to a successive
name. And while incremental backups are the most common (using the
current active checkpoint as the <incremental> when starting the next),
the scheme is also set up to permit differential backups from any
existing checkpoint to the current point in time (since libvirt is
already creating a temporary bitmap as its basis for the
x-nbd-server-add-bitmap, all it has to do is just add an appropriate
number of x-block-dirty-bitmap-merge calls to collect all bitmaps in the
chain from the requested checkpoint to the current checkpoint).
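A minimal Python sketch of that bitmap collection, for one disk: given the checkpoint names in creation order and the checkpoint named in <incremental>, it emits the temporary bitmap creation followed by one merge per checkpoint from the requested one up to the current one. The function name and the list-based bookkeeping are hypothetical; only the QMP command and argument names are taken from the examples earlier in this mail.

```python
def merge_commands(checkpoints, since, node, temp_bitmap):
    """QMP commands folding every bitmap from 'since' onward into temp_bitmap.

    checkpoints: bitmap names ordered oldest to newest (newest is current).
    since: checkpoint named in <incremental>; ValueError if unknown.
    """
    first = checkpoints.index(since)
    cmds = [{"execute": "block-dirty-bitmap-add",
             "arguments": {"node": node, "name": temp_bitmap}}]
    for name in checkpoints[first:]:
        cmds.append({"execute": "x-block-dirty-bitmap-merge",
                     "arguments": {"node": node,
                                   "src_name": name,
                                   "dst_name": temp_bitmap}})
    return cmds


chain = ["check1", "check2", "check3"]
# Incremental from the current checkpoint: a single merge.
print(len(merge_commands(chain, "check3", "node1", "backup-sdc")) - 1)
# Differential from the oldest checkpoint: one merge per checkpoint.
print(len(merge_commands(chain, "check1", "node1", "backup-sdc")) - 1)
```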
More to come in part 3.
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization:  qemu.org | libvirt.org