On 06/26/2018 11:36 AM, Nir Soffer wrote:
On Wed, Jun 13, 2018 at 7:42 PM Eric Blake <eblake(a)redhat.com> wrote:
> Upcoming patches will add support for incremental backups via
> a new API; but first, we need a landing page that gives an
> overview of capturing various pieces of guest state, and which
> APIs are best suited to which tasks.
>
Needs blank line between list items for easier reading of the source.
Sure.
I think we should describe checkpoints before backups, since the
expected flow is:

- user starts a backup
- system creates a checkpoint using virDomainCheckpointCreateXML
- system queries the amount of data pointed to by the previous
  checkpoint's bitmaps
- system creates temporary storage for the backup
- system starts the backup using virDomainBackupBegin
I actually think it will be more common to create checkpoints via
virDomainBackupBegin(), and not virDomainCheckpointCreateXML (the latter
exists because it is easy, and may have a use independent from
incremental backups, but it is the former that makes chains of
incremental backups reliable).
That is, your first backup will be a full backup (no checkpoint as its
start) but will create a checkpoint at the same time; then your second
backup is an incremental backup (use the checkpoint created at the first
backup as the start) and also creates a checkpoint in anticipation of a
third incremental backup.
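Sketching that chain (the checkpoint names are illustrative, matching
the examples further below):

  backup 1 (full):        no <incremental>; creates checkpoint B1
  backup 2 (incremental): <incremental>B1</incremental>; creates B2
  backup 3 (incremental): <incremental>B2</incremental>; creates B3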
You do have an interesting step in there - the ability to query how much
data is pointed to in the delta between two checkpoints (that is, before
I actually create a backup, can I pre-guess how much data it will end up
copying). On the other hand, the size of the temporary storage for the
backup is not related to the amount of data tracked in the bitmap.
Expanding on the examples in my 1/8 reply to you:
At T3, we have:

  S1: |AAAA----| <- S2: |---BBB--|
  B1: |XXXX----|    B2: |---XXX--|
  guest sees: |AAABBB--|

where by T4 we will have:

  S1: |AAAA----| <- S2: |D--BBDD-|
  B1: |XXXX----|    B2: |---XXX--|
                    B3: |X----XX-|
  guest sees: |DAABBDD-|
Back at T3, using B2 as our dirty bitmap, there are two backup models we
can pursue to get at the data tracked by that bitmap.
The first is push-model backup (blockdev-backup with "sync":"top" to the
actual backup file) - qemu directly writes the |---BBB--| sequence into
the destination file (based on the contents of B2), whether or not S2 is
modified in the meantime; in this mode, qemu is smart enough to not
bother copying clusters to the destination that were not in the bitmap.
So the fact that B2 mentions 3 dirty clusters indeed proves to be the
right size for the destination file.
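At the libvirt API level, the push-model sequence is roughly (a sketch
only, mirroring example 2 below):

  virDomainBackupBegin()   # with <domainbackup type='push'> XML
  wait for the push-mode completion event   # |---BBB--| lands in the target
  virDomainBackupEnd()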
The second is pull-model backup (blockdev-backup with "sync":"none" to a
temporary file, coupled with a read-only NBD server on the temporary
file that also exposes bitmap B2 via NBD_CMD_BLOCK_STATUS) - here, if
qemu can guarantee that the client would read only dirty clusters, then
it only has to write to the temporary file if the guest changes a
cluster that was tracked in B2 (so at most the temporary file would
contain |-----B--| if the NBD client finishes before T4); but more
likely, qemu will play conservative and write to the temporary file for
ANY changes whether or not they are to areas covered by B2 (in which
case the temporary file could contain |A----B0-| for the three writes
done by T4). Or put another way, if qemu can guarantee a nice client,
then the size of B2 probably overestimates the size of the temporary
file; but if qemu plays conservative by assuming the client will read
even portions of the file that weren't dirty, then keeping those reads
consistent will require the temporary file to grow as large as the
amount of data the guest dirties while the backup continues, which may
be far larger
than the size of B2. [And maybe this argues that we want a way for an
NBD export to force EIO read errors for anything outside of the exported
dirty bitmap, thus making the client play nice, so that the temporary
file does not have to grow beyond the size of the bitmap - but that's a
future feature request]
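The pull-model counterpart, roughly (a sketch; the client steps are up
to the third-party tool):

  virDomainBackupBegin()   # with <domainbackup type='pull'> XML; qemu
                           # starts the sync:none job plus an NBD server
  NBD client: NBD_CMD_BLOCK_STATUS over bitmap B2   # locate dirty extents
  NBD client: NBD_CMD_READ of just those extents    # copy to backup storage
  virDomainBackupEnd()     # tear down the server, discard the temporary file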
> + <h2><a id="examples">Examples</a></h2>
> + <p>The following two sequences both capture the disk state of a
> + running guest, then complete with the guest running on its
> + original disk image; but with a difference that an unexpected
> + interruption during the first mode leaves a temporary wrapper
> + file that must be accounted for, while interruption of the
> + second mode has no impact to the guest.</p>
>
This is not clear; I read this several times and I'm still not sure
what you mean here.
I'm trying to convey the point that with example 1...
Blank line between paragraphs
> + <p>1. Backup via temporary snapshot
> + <pre>
> +virDomainFSFreeze()
> +virDomainSnapshotCreateXML(VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY)
...if you are interrupted here, your <domain> XML has changed to point
to the snapshot file...
> +virDomainFSThaw()
> +third-party copy the backing file to backup storage # most time spent here
> +virDomainBlockCommit(VIR_DOMAIN_BLOCK_COMMIT_ACTIVE) per disk
> +wait for commit ready event per disk
> +virDomainBlockJobAbort() per disk
...and it is not until here that your <domain> XML is back to its
pre-backup state. If the backup is interrupted for any reason, you have
to manually get things back to the pre-backup layout, whether or not you
were able to salvage the backup data.
> + </pre></p>
>
I think we should mention virDomainFSFreeze and virDomainFSThaw before
this examples, in the same way we mention the other apis.
Can do.
> +
> + <p>2. Direct backup
> + <pre>
> +virDomainFSFreeze()
> +virDomainBackupBegin()
> +virDomainFSThaw()
> +wait for push mode event, or pull data over NBD # most time spent here
> +virDomainBackupEnd()
In this example 2, using the new APIs, the <domain> XML is unchanged
through the entire operation. If you interrupt things in the middle,
you may have to scrap the backup data as not being viable, but you don't
have to do any manual cleanup to get your domain back to the pre-backup
layout.
> + </pre></p>
>
This means that virDomainBackupBegin will create a checkpoint, and libvirt
will have to create the temporary storage for the backup (e.g. a disk for
the push model, or a temporary snapshot for the pull model). Libvirt will
most likely use local storage, which may fail if the host does not have
enough local storage.
virDomainBackupBegin() has an optional <disks> XML element - if
provided, then YOU can control the files (the destination on push model,
ultimately including a remote network destination, such as via NBD,
gluster, sheepdog, ...; or the scratch file for pull model, which
probably only makes sense locally as the file gets thrown away as soon
as the 3rd-party NBD client finishes). Libvirt only generates a
filename if you don't provide that level of detail. You're right that
the local storage running out of space can be a concern - but also
remember that incremental backups are designed to be less invasive than
full backups, AND that if one backup fails, you can then kick off
another backup using the same starting checkpoint as the one that
failed (that is, when libvirt is using B1 as its basis for a
backup, but also created B2 at the same time, then you can use
virDomainCheckpointDelete to remove B2 by merging the B1/B2 bitmaps back
into B1, with B1 once again tracking changes from the previous
successful backup to the current point in time).
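Put as a sequence (sketch only), a failed incremental backup is retried
like this:

  virDomainBackupBegin()        # <incremental>B1</incremental>, creates B2
  ...backup fails or is abandoned...
  virDomainCheckpointDelete()   # delete B2, merging its bitmap into B1
  virDomainBackupBegin()        # retry: <incremental>B1</incremental>,
                                # creating a fresh B2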
But this may be good enough for many users, so maybe it is good to
have this.
I think we need to show here the more low-level flow that oVirt will use:

Backup using external temporary storage
- virDomainFSFreeze()
- virDomainCheckpointCreateXML()
- virDomainFSThaw()
- Here oVirt will need to query the checkpoints, to understand how much
  temporary storage is needed for the backup. I hope we have an API
  for this (I did not read the next patches yet).
I have not exposed one so far, nor do I know if qemu has that easily
available. But since it matters to you, we can make it a priority to
add that (and the API would need to be added to libvirt.so at the same
time as the other new APIs, whether or not I can make it in time for the
freeze at the end of this week).
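One possible shape for such an API (name and signature purely
hypothetical at this point):

  /* return the number of bytes tracked as dirty since the checkpoint */
  long long virDomainCheckpointGetSize(virDomainCheckpointPtr checkpoint,
                                       unsigned int flags);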
- virDomainBackupBegin()
- third party copy data...
- virDomainBackupEnd()
Again, note that oVirt will probably NOT call
virDomainCheckpointCreateXML() directly, but will instead do:

  virDomainFSFreeze();
  virDomainBackupBegin(dom, "<domainbackup type='pull'/>",
                       "<domaincheckpoint><name>B1</name></domaincheckpoint>",
                       0);
  virDomainFSThaw();
  third party copy data
  virDomainBackupEnd();
for the first full backup, then for the next incremental backup, do:

  virDomainFSFreeze();
  virDomainBackupBegin(dom,
      "<domainbackup type='pull'><incremental>B1</incremental></domainbackup>",
      "<domaincheckpoint><name>B2</name></domaincheckpoint>", 0);
  virDomainFSThaw();
  third party copy data
  virDomainBackupEnd();
where you are creating bitmap B2 at the time of the first incremental
backup (the second backup overall), and that backup consists of the data
changed since the creation of bitmap B1 at the time of the earlier full
backup.
Then, as I mentioned earlier, the minimal XML forces libvirt to generate
filenames (which may or may not match what you want), so you can
certainly pass in more verbose XML:

  <domainbackup type='pull'>
    <incremental>B1</incremental>
    <server transport='unix' socket='/path/to/server'/>
    <disks>
      <disk name='vda' type='block'>
        <scratch dev='/path/to/scratch/dev'/>
      </disk>
    </disks>
  </domainbackup>
and of course, we'll eventually want TLS thrown in the mix (my initial
implementation has completely bypassed that, other than the fact that
the <server> element is a great place to stick in the information needed
for telling qemu's server to only accept clients that know the right TLS
magic).
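For example, the <server> element might some day grow a spelling along
these lines (the attribute names here are a guess, not a design):

  <server transport='tcp' name='backup.example.com' port='10809' tls='yes'/>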
If this example helps, I can flesh out the html to give these further
insights.
And, if wrapping FSFreeze/Thaw is that common, we'll probably want to
reach the point where we add VIR_DOMAIN_BACKUP_QUIESCE as a flag
argument to automatically do it as part of virDomainBackupBegin().
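With such a flag, the freeze/thaw pair would collapse into the begin
call itself, something like (hypothetical, since the flag does not
exist yet):

  virDomainBackupBegin(dom, "<domainbackup type='pull'/>",
                       "<domaincheckpoint><name>B1</name></domaincheckpoint>",
                       VIR_DOMAIN_BACKUP_QUIESCE);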
>
This is great documentation, showing both the APIs and how they are
used together; we need more of this!
Well, and it's also been a great resource for me as I continue to hammer
out the (LOADS) of code needed to reach a working demo.
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization:  qemu.org | libvirt.org