On 06/26/2018 02:51 PM, Nir Soffer wrote:
On Wed, Jun 13, 2018 at 7:42 PM Eric Blake <eblake@redhat.com> wrote:
> Prepare for new checkpoint and backup APIs by describing the XML
> that will represent a checkpoint. This is modeled heavily after
> the XML for virDomainSnapshotPtr, since both represent a point in
> time of the guest. But while a snapshot exists with the intent
> of rolling back to that state, a checkpoint instead makes it
> possible to create an incremental backup at a later time.
>
> Add testsuite coverage of a minimal use of the XML.
> +++ b/docs/formatcheckpoint.html.in
> @@ -0,0 +1,273 @@
> +<?xml version="1.0" encoding="UTF-8"?>
> +<!DOCTYPE html>
> +<html xmlns="http://www.w3.org/1999/xhtml">
> + <body>
> + <h1>Checkpoint and Backup XML format</h1>
> +
> + <ul id="toc"></ul>
> +
> + <h2><a id="CheckpointAttributes">Checkpoint XML</a></h2>
>
id=CheckpointXML?
Matches what the existing formatsnapshot.html.in named its tag. (If you
haven't guessed, I'm heavily relying on snapshots as my template for
adding this).
> +
> + <p>
> + Domain disk backups, including incremental backups, are one form
> + of <a href="domainstatecapture.html">domain state capture</a>.
> + </p>
> + <p>
> + Libvirt is able to facilitate incremental backups by tracking
> + disk checkpoints, or points in time against which it is easy to
> + compute which portion of the disk has changed. Given a full
> + backup (a backup created from the creation of the disk to a
> + given point in time, coupled with the creation of a disk
> + checkpoint at that time),
Not clear.
> and an incremental backup (a backup
> + created from just the dirty portion of the disk between the
> + first checkpoint and the second backup operation),
Also not clear.
Okay, I will try to improve these in v2. But (other than answering
these good review emails), my current priority is a working demo (to
prove the API works) prior to further doc polish.
> it is
> + possible to do an offline reconstruction of the state of the
> + disk at the time of the second backup, without having to copy as
> + much data as a second full backup would require. Most disk
> + checkpoints are created in concert with a backup,
> + via <code>virDomainBackupBegin()</code>; however, libvirt also
> + exposes enough support to create disk checkpoints independently
> + from a backup operation,
> + via <code>virDomainCheckpointCreateXML()</code>.
>
Thanks for the extra context.
> + </p>
> + <p>
> + Attributes of libvirt checkpoints are stored as child elements of
> + the <code>domaincheckpoint</code> element. At checkpoint creation
> + time, normally only the <code>name</code>, <code>description</code>,
> + and <code>disks</code> elements are settable; the rest of the
> + fields are ignored on creation, and will be filled in by
> + libvirt for informational purposes
>
So the user is responsible for creating checkpoint names? Do we use
the same name in qcow2?
My intent is that if the user does not assign a checkpoint name, then
libvirt will default it to the current time in seconds-since-the-Epoch.
Then, whatever name is given to the checkpoint (whether chosen by
libvirt or assigned by the user) will also be the default name of the
bitmap created in each qcow2 volume, but the XML also allows you to name
the qcow2 bitmaps something different than the checkpoint name (maybe
not a wise idea in the common case, but could come in handy later if you
use the _REDEFINE flag to teach libvirt about existing bitmaps that are
already present in a qcow2 image rather than placed there by libvirt).
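For instance (the disk name, bitmap name, and timestamp below are made
up, just to show the shape of the XML), a checkpoint whose qcow2 bitmap
is named differently from the checkpoint itself might look like:

  <domaincheckpoint>
    <name>1530100000</name>
    <disks>
      <disk name='vda' checkpoint='bitmap' bitmap='pre-existing-bitmap'/>
    </disks>
  </domaincheckpoint>

where <name> defaults to the creation timestamp if omitted, and the
bitmap attribute defaults to the checkpoint name if omitted.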
> + <p>
> + Checkpoints are maintained in a hierarchy. A domain can have a
> + current checkpoint, which is the most recent checkpoint compared to
> + the current state of the domain (although a domain might have
> + checkpoints without a current checkpoint, if checkpoints have been
> + deleted in the meantime). Creating or reverting to a checkpoint
> + sets that checkpoint as current, and the prior current checkpoint is
> + the parent of the new checkpoint. Branches in the hierarchy can
> + be formed by reverting to a checkpoint with a child, then creating
> + another checkpoint.
>
This seems too complex. Why do we need arbitrary trees of checkpoints?
Because snapshots had an arbitrary tree, and it was easier to copy from
snapshots. Even if we only use a linear tree for now, it is still
feasible that in the future, we can facilitate a domain rolling back to
the disk state as captured at checkpoint C1, at which point you could
then have multiple children C2 (the bitmap created prior to rolling
back) and C3 (the bitmap created for tracking changes made after rolling
back). Again, for a first cut, I probably will punt and state that
snapshots and incremental backups do not play well together yet; but as
we get experience and add more code, the API is flexible enough that
down the road we really can offer reverting to an arbitrary snapshot and
ALSO updating checkpoints to match.
What is the meaning of reverting a checkpoint?
Hmm - right now, you can't (that was one Snapshot API that I
intentionally did not copy over to Checkpoint), so I should probably
reword that.
> + </p>
> + <p>
> + The top-level <code>domaincheckpoint</code> element may contain
> + the following elements:
> + </p>
> + <dl>
> + <dt><code>name</code></dt>
> + <dd>The name for this checkpoint. If the name is specified when
> + initially creating the checkpoint, then the checkpoint will have
> + that particular name. If the name is omitted when initially
> + creating the checkpoint, then libvirt will make up a name for
> + the checkpoint, based on the time when it was created.
> + </dd>
>
Why not simplify and require the user to provide a name?
Because we didn't require the user to provide names for snapshots, and
generating a name via the current timestamp is still fairly likely to be
usable.
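So, at least as the XML is described here, the minimal input can be as
simple as:

  <domaincheckpoint/>

after which libvirt would report a generated <name> based on the
creation timestamp.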
> + <dt><code>description</code></dt>
> + <dd>A human-readable description of the checkpoint. If the
> + description is omitted when initially creating the checkpoint,
> + then this field will be empty.
> + </dd>
> + <dt><code>disks</code></dt>
> + <dd>On input, this is an optional listing of specific
> + instructions for disk checkpoints; it is needed when making a
> + checkpoint on only a subset of the disks associated with a
> + domain (in particular, since qemu checkpoints require qcow2
> + disks, this element may be needed on input for excluding guest
> + disks that are not in qcow2 format); if omitted on input, then
> + all disks participate in the checkpoint. On output, this is
> + fully populated to show the state of each disk in the
> + checkpoint. This element has a list of <code>disk</code>
> + sub-elements, describing anywhere from one to all of the disks
> + associated with the domain.
>
Why not always specify the disks?
Because if your guest uses all qcow2 images, and you don't want to
exclude any images from the checkpoint, then not specifying <disks> does
the right thing with less typing. Just because libvirt tries to have
sane defaults doesn't mean you have to rely on them, though.
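For example, with a qcow2 disk 'vda' and a raw disk 'vdb' (names made
up), the input XML to checkpoint only vda could be:

  <domaincheckpoint>
    <disks>
      <disk name='vda' checkpoint='bitmap'/>
      <disk name='vdb' checkpoint='no'/>
    </disks>
  </domaincheckpoint>

while a guest with nothing but qcow2 disks can omit <disks> entirely.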
> + <dl>
> + <dt><code>disk</code></dt>
> + <dd>This sub-element describes the checkpoint properties of
> + a specific disk. The attribute <code>name</code> is
> + mandatory, and must match either the <code><target
> + dev='name'/></code> or an unambiguous <code><source
> + file='name'/></code> of one of
> + the <a href="formatdomain.html#elementsDisks">disk
> + devices</a> specified for the domain at the time of the
> + checkpoint. The attribute <code>checkpoint</code> is
> + optional on input; possible values are <code>no</code>
> + when the disk does not participate in this checkpoint;
> + or <code>bitmap</code> if the disk will track all changes
> + since the creation of this checkpoint via a bitmap, in
> + which case another attribute <code>bitmap</code> will be
> + the name of the tracking bitmap (defaulting to the
> + checkpoint name).
>
Seems too complicated. Why do we need to support a checkpoint
referencing a bitmap with a different name?
For the same reason that you can support an internal snapshot
referencing a qcow2 snapshot with a different name. Yeah, it's probably
not a common usage, but there are cases (such as when using _REDEFINE)
where it can prove invaluable. You're right that most users won't name
qcow2 bitmaps differently from the libvirt checkpoint name.
Instead we can have a list of disks that will participate in the checkpoint.
Anything not specified will not participate in the checkpoint. The name of
the checkpoint is always the name of the bitmap.
My worry is about future extensibility of the XML. If the XML is too
simple, then we may back ourselves into a corner of not being able to
support some other backend implementation of checkpoints (just because
qemu implements checkpoints via qcow2 bitmaps does not mean that some
other hypervisor won't come along that implements checkpoints via a
UUID, so I tried to leave room for <disk checkpoint='uuid'
uuid='..-..-...'/> as potential XML for such a hypervisor mapping - and
while a bitmap name different from the checkpoint name is unusual, it is
much more likely that UUIDs for multiple disks would have to be
different per disk).
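In other words, the same <disk> sub-element is meant to stretch from
today's qcow2 bitmap mapping to a hypothetical UUID-based backend:

  <disk name='vda' checkpoint='bitmap' bitmap='1530100000'/>
  <disk name='vda' checkpoint='uuid' uuid='..-..-...'/>

where the second form is purely speculative, and is only there to show
that the attribute set can grow.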
> + </dd>
> + </dl>
> + </dd>
> + <dt><code>creationTime</code></dt>
> + <dd>The time this checkpoint was created. The time is specified
> + in seconds since the Epoch, UTC (i.e. Unix time). Readonly.
> + </dd>
> + <dt><code>parent</code></dt>
> + <dd>The parent of this checkpoint. If present, this element
> + contains exactly one child element, name. This specifies the
> + name of the parent checkpoint of this one, and is used to
> + represent trees of checkpoints. Readonly.
> + </dd>
>
I think we are missing the size of the underlying data for every disk
here. This probably means how many dirty bits we have in the bitmaps
referenced by the checkpoint for every disk.
That would be an output-only XML element, and only if qemu were even
modified to expose that information. But yes, I can see how exposing
that could be useful.
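If qemu ever exposes it, something along these lines could be layered on
later (completely hypothetical, not part of this series; the attribute
name and value are made up):

  <disk name='vda' checkpoint='bitmap' bitmap='1530100000' size='1048576'/>

where size would report how much data the bitmap currently marks dirty.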
> + <dt><code>domain</code></dt>
> + <dd>The inactive <a href="formatdomain.html">domain
> + configuration</a> at the time the checkpoint was created.
> + Readonly.
>
What do you mean by "inactive domain configuration"?
Copy-and-paste from snapshots, but in general, what it would take to
start a new VM using a restoration of the backup images corresponding to
that checkpoint (that is, the XML is the smaller persistent form, rather
than the larger running form; my classic example used to be that the
'inactive domain configuration' omits <alias> tags while the 'running
configuration' does not - but since libvirt recently added support for
user-settable <alias> tags, that no longer holds...).
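For reference, a fully populated checkpoint on output might look
something like this (element order, names, and values are purely
illustrative, and the <domain> contents are elided):

  <domaincheckpoint>
    <name>1530100000</name>
    <description>checkpoint before weekly backup</description>
    <creationTime>1530100000</creationTime>
    <parent>
      <name>1529500000</name>
    </parent>
    <disks>
      <disk name='vda' checkpoint='bitmap' bitmap='1530100000'/>
    </disks>
    <domain>
      ... (inactive domain configuration at the time of the checkpoint) ...
    </domain>
  </domaincheckpoint>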
> + <dl>
> + <dt><code>incremental</code></dt>
> + <dd>Optional. If this element is present, it must name an
> + existing checkpoint of the domain, which will be used to make
> + this backup an incremental one (in the push model, only
> + changes since the checkpoint are written to the destination;
> + in the pull model, the NBD server uses the
> + NBD_OPT_SET_META_CONTEXT extension to advertise to the client
> + which portions of the export contain changes since the
> + checkpoint). If omitted, a full backup is performed.
>
Just to make it clear:
For example we start with:
c1 c2 [c3]
c3 is the active checkpoint.
We create a new checkpoint:
c1 c2 c3 [c4]
So
- using incremental=c2, we will get data referenced by c2?
Your incremental backup would get all changes since the point in time c2
was created (that is, the changes recorded by the merge of bitmaps c2
and c3).
- using incremental=c1, we will get data referenced by both c1 and c2?
Your incremental backup would get all changes since the point in time c1
was created (that is, the changes recorded by the merge of bitmaps c1,
c2, and c3).
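In XML terms (assuming a top-level <domainbackup> element, which is not
in the portion quoted here, so treat that spelling as illustrative), the
incremental=c2 case would be requested roughly as:

  <domainbackup>
    <incremental>c2</incremental>
  </domainbackup>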
What if we want to back up only data from c1 to c2, not including c3?
Qemu can't do that right now, so this API can't do it either. Maybe
there's a way to add it into the API (and the fact that we used XML
leaves that door wide open), but not right now.
I don't have a use case for this, but if we can specify two checkpoints
this would be possible.
For example:
<checkpoints from="c1" to="c2">
Or
<checkpoints from="c2">
Or the current proposal of <incremental> serves as the 'from', and a new
sibling element <limit> becomes the 'to', if it becomes possible to
limit a backup to an earlier point in time than the present call to the API.
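A hypothetical bounded request, if that ever materializes, might then
look like:

  <domainbackup>
    <incremental>c1</incremental>
    <limit>c2</limit>
  </domainbackup>

but again, <limit> is only a sketch of possible future XML, not
something implemented anywhere today.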
> + </dd>
> + <dt><code>server</code></dt>
> + <dd>Present only for a pull mode backup. Contains the same
> + attributes as the <code>protocol</code> element of a disk
> + attached via NBD in the domain (such as transport, socket,
> + name, port, or tls), necessary to set up an NBD server that
> + exposes the content of each disk at the time the backup
> + started.
> + </dd>
>
To get the list of changed blocks, we planned to use something like:
qemu-img map nbd+unix:/socket=server.sock
Is this possible now? Planned?
Possible via the x-nbd-server-add-bitmap command added in qemu commit
767f0c7, coupled with a client that knows how to request
NBD_OPT_SET_META_CONTEXT "qemu:dirty-bitmap:foo" then read the bitmap
with NBD_CMD_BLOCK_STATUS (I have a hack patch sitting on the qemu list
that lets qemu-img behave as such a client:
https://lists.gnu.org/archive/html/qemu-devel/2018-06/msg05993.html)
To get the actual data, oVirt needs a device to read from. We don't want
to write our own NBD client, and we cannot use qemu-img since it does not
support streaming data, and we want to stream data using HTTP to the backup
application.
I guess we will have to do this:
qemu-nbd -c /dev/nbd0 nbd+unix:/socket=server.sock
And serve the data from /dev/nbd0.
Yes, except that the kernel NBD client plugin does not have support for
NBD_CMD_BLOCK_STATUS, so reading /dev/nbd0 won't be able to find the
dirty blocks. But you could always do it in two steps: first, connect a
client that only reads the bitmap (such as qemu-img with my hack), then
connect the kernel client so that you can stream just the portions of
/dev/nbd0 referenced in the map of the first step. (Or, since both
clients would be read-only, you can have them both connected to the qemu
server at once)
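For reference, the pull-mode XML that sets up the NBD server those
commands would talk to might look roughly like this (the paths and disk
name are made up, the <domainbackup> spelling and exact sub-element
names are illustrative, and per the text above it is the presence of
<server> that makes this a pull-mode backup; <incremental> remains
optional):

  <domainbackup>
    <incremental>c2</incremental>
    <server transport='unix' socket='/path/to/server.sock'/>
    <disks>
      <disk name='vda' type='file'>
        <scratch file='/path/to/vda.scratch.qcow2'/>
      </disk>
    </disks>
  </domainbackup>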
> + <dt><code>disks</code></dt>
> + <dd>This is an optional listing of instructions for disks
> + participating in the backup (if omitted, all disks
> + participate, and libvirt attempts to generate filenames by
> + appending the current timestamp as a suffix). When provided on
> + input, disks omitted from the list do not participate in the
> + backup. On output, the list is present but contains only the
> + disks participating in the backup job. This element has a
> + list of <code>disk</code> sub-elements, describing anywhere
> + from one to all of the disks associated with the domain.
> + <dl>
> + <dt><code>disk</code></dt>
> + <dd>This sub-element describes the checkpoint properties of
> + a specific disk. The attribute <code>name</code> is
> + mandatory, and must match either the <code><target
> + dev='name'/></code> or an unambiguous <code><source
> + file='name'/></code> of one of
> + the <a href="formatdomain.html#elementsDisks">disk
> + devices</a> specified for the domain at the time of the
> + checkpoint. The optional attribute <code>type</code> can
> + be <code>file</code>, <code>block</code>,
> + or <code>network</code>, similar to a disk declaration
> + for a domain, and controls what additional sub-elements are
> + needed to describe the destination (such
> + as <code>protocol</code> for a network destination). In
> + push mode backups, the primary sub-element
> + is <code>target</code>; in pull mode, the primary sub-element
> + is <code>scratch</code>; but either way,
> + the primary sub-element describes the file name to be used
> + during the backup operation, similar to
> + the <code>source</code> sub-element of a domain disk. An
> + optional sub-element <code>driver</code> can also be used to
> + specify a destination format different from qcow2.
>
This should be similar to the way we specify disks for a vm, right?
Anything that works as a vm disk will work for pushing backups?
Ultimately, yes, I'd like to support gluster/NBD/sheepdog/...
destinations. My initial implementation is less ambitious, and supports
just local files (because those are easier to test and therefore produce
a demo with).
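For the local-file push mode case the demo covers, a single disk entry
would be along these lines (paths made up, the <domainbackup> spelling
and sub-element names still illustrative, with <driver> showing a
destination format other than the default qcow2):

  <domainbackup>
    <disks>
      <disk name='vda' type='file'>
        <target file='/path/to/vda.backup'/>
        <driver type='raw'/>
      </disk>
    </disks>
  </domainbackup>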
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org