
Here's my updated counterproposal for a backup API.

In comparison to v2 posted by Nikolay:
https://www.redhat.com/archives/libvir-list/2018-April/msg00115.html
- changed terminology a bit: Nikolay's "BlockSnapshot" is now called a "Checkpoint", and "BlockExportStart/Stop" is now "BackupBegin/End"
- fleshed out more API descriptions
- better documentation of proposed XML, for both checkpoints and backup

Barring any major issues turned up during review, I've already started to code this into libvirt, with a goal of getting an implementation ready for review this month.

Each domain will gain the ability to track a tree of Checkpoint objects (we've previously mentioned the term "system checkpoint" in the <domainsnapshot> XML as the combination of disk and RAM state; so I'll use the term "disk checkpoint" in prose as needed, to make it obvious that the checkpoints described here do not include RAM state).

I will use the virDomainSnapshot API as a guide, meaning that we will track a tree of checkpoints where each checkpoint can have 0 or 1 parent checkpoints, in part because I plan to reuse a lot of the snapshot code as a starting point for implementing checkpoint tracking. Qemu does NOT track a relationship between internal snapshots, so libvirt has to manage the backing tree all by itself; by the same argument, if qemu does not add a parent relationship to dirty bitmaps, libvirt can probably manage everything itself by copying how it manages parent relationships between internal snapshots. However, I think it will be far easier for libvirt to exploit qemu dirty bitmaps if qemu DOES add bitmap tracking; particularly if qemu adds ways to easily compose a temporary bitmap that is the union of one bitmap plus a fixed number of its parents.

Design-wise, libvirt will manage things so that there is only one enabled dirty-bitmap per qcow2 image at a time, when no backup operation is in effect.
There is a notion of a current (or most recent) checkpoint; when a new checkpoint is created, that becomes the current one and the former checkpoint becomes the parent of the new one. If there is no current checkpoint, then there is no active dirty bitmap managed by libvirt.

Representing things on a timeline, when a guest is first created, there is no dirty bitmap; later, the checkpoint "check1" is created, which in turn creates "bitmap1" in the qcow2 image for all changes past that point; when a second checkpoint "check2" is created, a qemu transaction is used to create and enable the new "bitmap2" bitmap at the same time as disabling the "bitmap1" bitmap. (Actually, it's probably easier to name the bitmap in the qcow2 file with the same name as the Checkpoint object being tracked in libvirt, but for discussion purposes, it's less confusing if I use separate names for now.)

creation ....... check1 ....... check2 ....... active
       no bitmap       bitmap1        bitmap2

When a user wants to create a backup, they select which point in time the backup starts from; the default value NULL represents a full backup (all content since disk creation to the point in time of the backup call; no bitmap is needed, use sync=full for the push model or sync=none for the pull model); any other value represents the name of a checkpoint to use as an incremental backup (all content from the checkpoint to the point in time of the backup call; libvirt forms a temporary bitmap as needed, then uses sync=incremental for the push model or sync=none plus exporting the bitmap for the pull model). For example, requesting an incremental backup from "check2" can just reuse "bitmap2", but requesting an incremental backup from "check1" requires the computation of the bitmap containing the union of "bitmap1" and "bitmap2".

Libvirt will always create a new bitmap when starting a backup operation, whether or not the user requests that a checkpoint be created.
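To make the "union of bitmaps" step concrete, here is a minimal sketch in plain C. It is only a toy model, not libvirt or qemu code: the `Checkpoint` struct, the `backup_bitmap_since()` helper, and the 64-cluster `uint64_t` bitmap are all assumptions made up for illustration. It walks from the current checkpoint back through the parent links and ORs the bitmaps together until it reaches the named starting checkpoint.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model: one dirty bitmap per checkpoint, one bit per
 * cluster, for a toy disk of at most 64 clusters. */
typedef struct Checkpoint {
    const char *name;
    uint64_t bitmap;                 /* clusters dirtied while this bitmap was enabled */
    const struct Checkpoint *parent; /* each checkpoint has 0 or 1 parent */
} Checkpoint;

/* Compute the bitmap for an incremental backup starting at checkpoint
 * @since: the union of the bitmaps of @since and of every descendant up
 * to and including @current.  A NULL starting point (full backup) needs
 * no bitmap, so callers are expected to handle that case beforehand.
 * Returns 0 and stores the result in *out, or -1 if @since is not
 * @current or one of its ancestors. */
int backup_bitmap_since(const Checkpoint *current, const char *since,
                        uint64_t *out)
{
    uint64_t acc = 0;
    const Checkpoint *c;

    assert(since != NULL && out != NULL);
    for (c = current; c != NULL; c = c->parent) {
        acc |= c->bitmap;
        if (strcmp(c->name, since) == 0) {
            *out = acc;
            return 0;
        }
    }
    return -1;
}
```

With "check1" holding bitmap 0x0F and its child "check2" holding 0x30, a backup since "check2" uses 0x30 alone, while a backup since "check1" uses the union 0x3F.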
Most users that want incremental backup sequences will create a new checkpoint every time they do a backup; the new bitmap that libvirt creates is then associated with that new checkpoint, and even after the backup operation completes, the new bitmap remains in the qcow2 file. But it is also possible to request a backup without a new checkpoint (it merely means that it is not possible to create a subsequent incremental backup from the backup just started); in that case, libvirt will have to take care of merging the new bitmap back into the previous one at the end of the backup operation. I think that it should be possible to run multiple backup operations in parallel in the long run. But in the interest of getting a proof of concept implementation out quickly, it's easier to state that for the initial implementation, libvirt supports at most one backup operation at a time (to do another backup, you have to wait for the current one to complete, or else abort and abandon the current one). As there is only one backup job running at a time, the existing virDomainGetJobInfo()/virDomainGetJobStats() will be able to report statistics about the job (insofar as such statistics are available). But in preparation for the future, when libvirt does add parallel job support, starting a backup job will return a job id; and presumably we'd add a new virDomainGetJobStatsByID() for grabbing statistics of an arbitrary (rather than the most-recently-started) job. Since live migration also acts as a job visible through virDomainGetJobStats(), I'm going to treat an active backup job and live migration as mutually exclusive. This is particularly true when we have a pull model backup ongoing: if qemu on the source is acting as an NBD server, you can't migrate away from that qemu and tell the NBD client to reconnect to the NBD server on the migration destination. So, to perform a migration, you have to cancel any pending backup operations. 
Conversely, if a migration job is underway, it will not be possible to start a new backup job until migration completes. However, we DO need to modify migration to ensure that any persistent bitmaps are migrated.

I also think that in the long run, it should be possible to start a backup operation, and while it is still ongoing, create a new external snapshot, and still be able to coordinate the transfer of bitmaps from the old image to the new overlay. But for the first implementation, it's probably easiest to state that an ongoing backup prevents creation of a new snapshot. However, a current checkpoint (which means we DO have an active bitmap, even if there is no active backup) DOES need to be transferred to the new overlay, and conversely, a block commit job needs to merge all bitmaps from the old overlay to the backing file that is now becoming the active layer again. I don't know if qemu has primitives for this in place yet; and if it does not, the only conservative thing we can do in the initial implementation is to state that the use of checkpoints is mutually exclusive with the use of snapshots (using one prevents the use of the other). Hopefully we don't have to stay in that state for long.

For now, a user wanting guest I/O to be at a safe point can manually use virDomainFSFreeze()/virDomainBackupBegin()/virDomainFSThaw(); we may decide down the road to use the flags argument of virDomainBackupBegin() to provide automatic guest quiescing through one API (I'm not doing it right away, because we have to worry about undoing effects if we fail to thaw after starting the backup).

So, to summarize, creating a backup will involve the following new APIs:

/**
 * virDomainBackupBegin:
 * @domain: a domain object
 * @diskXml: description of storage to utilize and expose during
 *           the backup, or NULL
 * @checkpointXml: description of a checkpoint to create, or NULL
 * @flags: not used yet, pass 0
 *
 * Start a point-in-time backup job for the specified disks of a
 * running domain.
 *
 * A backup job is mutually exclusive with domain migration
 * (particularly when the job sets up an NBD export, since it is not
 * possible to tell any NBD clients about a server migrating between
 * hosts). For now, backup jobs are also mutually exclusive with any
 * other block job on the same device, although this restriction may
 * be lifted in a future release. Progress of the backup job can be
 * tracked via virDomainGetJobStats(). The job remains active until a
 * subsequent call to virDomainBackupEnd(), even if it no longer has
 * anything to copy.
 *
 * There are two fundamental backup approaches. The first, called a
 * push model, instructs the hypervisor to copy the state of the guest
 * disk to the designated storage destination (which may be on the
 * local file system or a network device); in this mode, the
 * hypervisor writes the content of the guest disk to the destination,
 * then emits VIR_DOMAIN_EVENT_ID_BLOCK_JOB_2 when the backup is
 * either complete or failed (the backup image is invalid if the job
 * is ended prior to the event being emitted). The second, called a
 * pull model, instructs the hypervisor to expose the state of the
 * guest disk over an NBD export; a third-party client can then
 * connect to this export, and read whichever portions of the disk it
 * desires. In this mode, there is no event; libvirt has to be
 * informed when the third-party NBD client is done and the backup
 * resources can be released.
 *
 * The @diskXml parameter is optional but usually provided, and
 * contains details about the backup, including which backup mode to
 * use, whether the backup is incremental from a previous checkpoint,
 * which disks participate in the backup, the destination for a push
 * model backup, and the temporary storage and NBD server details for
 * a pull model backup. If omitted, the backup attempts to default to
 * a push mode full backup of all disks, where libvirt generates a
 * filename for each disk by appending a suffix of a timestamp in
 * seconds since the Epoch. virDomainBackupGetXMLDesc() can be called
 * to learn the actual values selected. For more information, see
 * formatcheckpoint.html#BackupAttributes.
 *
 * The @checkpointXml parameter is optional; if non-NULL, then libvirt
 * behaves as if virDomainCheckpointCreateXML() were called with
 * @checkpointXml, atomically covering the same guest state that will
 * be part of the backup. The creation of a new checkpoint allows for
 * future incremental backups.
 *
 * Returns a non-negative job id on success, or negative on failure.
 * This operation returns quickly, such that a user can choose to
 * start a backup job between virDomainFSFreeze() and
 * virDomainFSThaw() in order to create the backup while guest I/O is
 * quiesced.
 */
int virDomainBackupBegin(virDomainPtr domain, const char *diskXml,
                         const char *checkpointXml, unsigned int flags);

Note that this layout says that all disks participating in the backup job share the same incremental checkpoint as their starting point (no way to have one backup job where disk A copies data since check1 while disk B copies data since check2). If we need the latter, then we could get rid of the 'incremental' parameter, and instead have each <disk> element within checkpointXml call out an optional <checkpoint> name as its starting point. Also, qemu supports exposing multiple disks through a single NBD server (you then connect multiple clients to the one server to grab state from each disk). So the NBD details are listed in parallel to the <disks>.

Note that since a backup is NOT a guest-visible action, the backup job does not alter the normal <domain> XML.

/**
 * virDomainBackupGetXMLDesc:
 * @domain: a domain object
 * @id: the id of an active backup job previously started with
 *      virDomainBackupBegin()
 * @flags: not used yet, pass 0
 *
 * In some cases, a user can start a backup job without supplying all
 * details, and rely on libvirt to fill in the rest (for example,
 * selecting the port used for an NBD export). This API can then be
 * used to learn what default values were chosen.
 *
 * Returns a NUL-terminated UTF-8 encoded XML instance, or NULL in
 * case of error. The caller must free() the returned value.
 */
char *
virDomainBackupGetXMLDesc(virDomainPtr domain, int id,
                          unsigned int flags);

/**
 * virDomainBackupEnd:
 * @domain: a domain object
 * @id: the id of an active backup job previously started with
 *      virDomainBackupBegin()
 * @flags: bitwise-OR of supported virDomainBackupEndFlags
 *
 * Conclude a point-in-time backup job @id on the given domain.
 *
 * If the backup job uses the push model, but the event marking that
 * all data has been copied has not yet been emitted, then the command
 * fails unless @flags includes VIR_DOMAIN_BACKUP_END_ABORT. If the
 * event has been issued, or if the backup uses the pull model, the
 * flag has no effect.
 *
 * Returns 0 on success and -1 on failure.
 */
int virDomainBackupEnd(virDomainPtr domain, int id, unsigned int flags);

/**
 * virDomainCheckpointCreateXML:
 * @domain: a domain object
 * @xmlDesc: description of the checkpoint to create
 * @flags: bitwise-OR of supported virDomainCheckpointCreateFlags
 *
 * Create a new checkpoint using @xmlDesc on a running @domain.
 * Typically, it is more common to create a new checkpoint as part of
 * kicking off a backup job with virDomainBackupBegin(); however, it
 * is also possible to start a checkpoint without a backup.
 *
 * See formatcheckpoint.html#CheckpointAttributes document for more
 * details on @xmlDesc.
 *
 * If @flags includes VIR_DOMAIN_CHECKPOINT_CREATE_REDEFINE, then this
 * is a request to reinstate checkpoint metadata that was previously
 * discarded, rather than creating a new checkpoint. When redefining
 * checkpoint metadata, the current checkpoint will not be altered
 * unless the VIR_DOMAIN_CHECKPOINT_CREATE_CURRENT flag is also
 * present. It is an error to request the
 * VIR_DOMAIN_CHECKPOINT_CREATE_CURRENT flag without
 * VIR_DOMAIN_CHECKPOINT_CREATE_REDEFINE.
 *
 * If @flags includes VIR_DOMAIN_CHECKPOINT_CREATE_NO_METADATA, then
 * the domain's disk images are modified according to @xmlDesc, but
 * then the just-created checkpoint has its metadata deleted. This
 * flag is incompatible with VIR_DOMAIN_CHECKPOINT_CREATE_REDEFINE.
 *
 * Returns an (opaque) new virDomainCheckpointPtr on success, or NULL
 * on failure.
 */
virDomainCheckpointPtr
virDomainCheckpointCreateXML(virDomainPtr domain, const char *xmlDesc,
                             unsigned int flags);

/**
 * virDomainCheckpointDelete:
 * @checkpoint: the checkpoint to remove
 * @flags: bitwise-OR of supported virDomainCheckpointDeleteFlags
 *
 * Removes a checkpoint from the domain.
 *
 * When removing a checkpoint, the record of which portions of the
 * disk were dirtied after the checkpoint will be merged into the
 * record tracked by the parent checkpoint, if any. Likewise, if the
 * checkpoint being deleted was the current checkpoint, the parent
 * checkpoint becomes the new current checkpoint.
 *
 * If @flags includes VIR_DOMAIN_CHECKPOINT_DELETE_METADATA_ONLY, then
 * any checkpoint metadata tracked by libvirt is removed while keeping
 * the checkpoint contents intact; if a hypervisor does not require
 * any libvirt metadata to track checkpoints, then this flag is
 * silently ignored.
 *
 * Returns 0 on success, -1 on error.
 */
int virDomainCheckpointDelete(virDomainCheckpointPtr checkpoint,
                              unsigned int flags);

// Many additional functions copying heavily from virDomainSnapshot*:

int virDomainCheckpointList(virDomainPtr domain,
                            virDomainCheckpointPtr **checkpoints,
                            unsigned int flags);

char *virDomainCheckpointGetXMLDesc(virDomainCheckpointPtr checkpoint,
                                    unsigned int flags);

virDomainCheckpointPtr
virDomainCheckpointLookupByName(virDomainPtr domain, const char *name,
                                unsigned int flags);

const char *virDomainCheckpointGetName(virDomainCheckpointPtr checkpoint);

virDomainPtr virDomainCheckpointGetDomain(virDomainCheckpointPtr checkpoint);

virConnectPtr virDomainCheckpointGetConnect(virDomainCheckpointPtr checkpoint);

int virDomainHasCurrentCheckpoint(virDomainPtr domain, unsigned int flags);

virDomainCheckpointPtr
virDomainCheckpointCurrent(virDomainPtr domain, unsigned int flags);

virDomainCheckpointPtr
virDomainCheckpointGetParent(virDomainCheckpointPtr checkpoint,
                             unsigned int flags);

int virDomainCheckpointIsCurrent(virDomainCheckpointPtr checkpoint,
                                 unsigned int flags);

int virDomainCheckpointRef(virDomainCheckpointPtr checkpoint);

int virDomainCheckpointFree(virDomainCheckpointPtr checkpoint);

int virDomainCheckpointListChildren(virDomainCheckpointPtr checkpoint,
                                    virDomainCheckpointPtr **children,
                                    unsigned int flags);

Notably, this list omits the older racy list functions, such as virDomainSnapshotNum, virDomainSnapshotNumChildren, or virDomainSnapshotListChildrenNames; also, for now, there is no revert support like virDomainSnapshotRevert.

Eventually, if we add a way to roll back to the state recorded in an earlier bitmap, we'll want to tell libvirt that it needs to create a new bitmap as a child of an existing (non-current) checkpoint. That is, if we have:

check1 .... check2 .... active
      bitmap1     bitmap2

and created a backup at the same time as check2, then when we later roll back to the state of that backup, we would want to end writes to bitmap2 and declare that check2 is no longer current, and create a new current check3 with associated bitmap3 and parent check1 to track all writes since the point of the revert. Until then, I don't think it's possible to have more than one child without manually using the REDEFINE flag to create such scenarios; but the API should not lock us out of supporting multiple children in the future.

Here's my proposal for user-facing XML documentation, based on formatsnapshot.html.in:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<h1>Checkpoint and Backup XML format</h1>
<ul id="toc"></ul>
<h2><a id="CheckpointAttributes">Checkpoint XML</a></h2>
<p>
Libvirt is able to facilitate incremental backups by tracking disk checkpoints, or points in time against which it is easy to compute which portion of the disk has changed. Given a full backup (a backup created from the creation of the disk to a given point in time, coupled with the creation of a disk checkpoint at that time), and an incremental backup (a backup created from just the dirty portion of the disk between the first checkpoint and the second backup operation), it is possible to do an offline reconstruction of the state of the disk at the time of the second backup, without having to copy as much data as a second full backup would require. Most disk checkpoints are created in concert with a backup, via <code>virDomainBackupBegin()</code>; however, libvirt also exposes enough support to create disk checkpoints independently from a backup operation, via <code>virDomainCheckpointCreateXML()</code>.
</p>
<p>
Attributes of libvirt checkpoints are stored as child elements of the <code>domaincheckpoint</code> element.
At checkpoint creation time, normally only the <code>name</code>, <code>description</code>, and <code>disks</code> elements are settable; the rest of the fields are ignored on creation, and are filled in by libvirt for informational purposes in the output of <code>virDomainCheckpointGetXMLDesc()</code>. However, when redefining a checkpoint, with the <code>VIR_DOMAIN_CHECKPOINT_CREATE_REDEFINE</code> flag of <code>virDomainCheckpointCreateXML()</code>, all of the XML described here is relevant.
</p>
<p>
Checkpoints are maintained in a hierarchy. A domain can have a current checkpoint, which is the most recent checkpoint compared to the current state of the domain (although a domain might have checkpoints without a current checkpoint, if checkpoints have been deleted in the meantime). Creating or reverting to a checkpoint sets that checkpoint as current, and the prior current checkpoint is the parent of the new checkpoint. Branches in the hierarchy can be formed by reverting to a checkpoint with a child, then creating another checkpoint.
</p>
<p>
The top-level <code>domaincheckpoint</code> element may contain the following elements:
</p>
<dl>
<dt><code>name</code></dt>
<dd>The name for this checkpoint. If the name is specified when initially creating the checkpoint, then the checkpoint will have that particular name. If the name is omitted when initially creating the checkpoint, then libvirt will make up a name for the checkpoint, based on the time when it was created.
</dd>
<dt><code>description</code></dt>
<dd>A human-readable description of the checkpoint. If the description is omitted when initially creating the checkpoint, then this field will be empty.
</dd> <dt><code>disks</code></dt> <dd>On input, this is an optional listing of specific instructions for disk checkpoints; it is needed when making a checkpoint on only a subset of the disks associated with a domain (in particular, since qemu checkpoints require qcow2 disks, this element may be needed on input for excluding guest disks that are not in qcow2 format); if omitted on input, then all disks participate in the checkpoint. On output, this is fully populated to show the state of each disk in the checkpoint. This element has a list of <code>disk</code> sub-elements, describing anywhere from one to all of the disks associated with the domain. <dl> <dt><code>disk</code></dt> <dd>This sub-element describes the checkpoint properties of a specific disk. The attribute <code>name</code> is mandatory, and must match either the <code><target dev='name'/></code> or an unambiguous <code><source file='name'/></code> of one of the <a href="formatdomain.html#elementsDisks">disk devices</a> specified for the domain at the time of the checkpoint. The attribute <code>checkpoint</code> is optional on input; possible values are <code>no</code> when the disk does not participate in this checkpoint; or <code>bitmap</code> if the disk will track all changes since the creation of this checkpoint via a bitmap, in which case another attribute <code>bitmap</code> will be the name of the tracking bitmap (defaulting to the checkpoint name). </dd> </dl> </dd> <dt><code>creationTime</code></dt> <dd>The time this checkpoint was created. The time is specified in seconds since the Epoch, UTC (i.e. Unix time). Readonly. </dd> <dt><code>parent</code></dt> <dd>The parent of this checkpoint. If present, this element contains exactly one child element, name. This specifies the name of the parent checkpoint of this one, and is used to represent trees of checkpoints. Readonly. 
</dd> <dt><code>domain</code></dt> <dd>The inactive <a href="formatdomain.html">domain configuration</a> at the time the checkpoint was created. Readonly. </dd> </dl> <h2><a id="BackupAttributes">Backup XML</a></h2> <p> Creating a backup, whether full or incremental, is done via <code>virDomainBackupBegin()</code>, which takes an XML description of the actions to perform. There are two general modes for backups: a push mode (where the hypervisor writes out the data to the destination file, which may be local or remote), and a pull mode (where the hypervisor creates an NBD server that a third-party client can then read as needed, and which requires the use of temporary storage, typically local, until the backup is complete). </p> <p> The instructions for beginning a backup job are provided as attributes and elements of the top-level <code>domainbackup</code> element. This element includes an optional attribute <code>mode</code> which can be either "push" or "pull" (default push). Where elements are optional on creation, <code>virDomainBackupGetXMLDesc()</code> can be used to see the actual values selected (for example, learning which port the NBD server is using in the pull model, or what file names libvirt generated when none were supplied). The following child elements are supported: </p> <dl> <dt><code>incremental</code></dt> <dd>Optional. If this element is present, it must name an existing checkpoint of the domain, which will be used to make this backup an incremental one (in the push model, only changes since the checkpoint are written to the destination; in the pull model, the NBD server uses the NBD_OPT_SET_META_CONTEXT extension to advertise to the client which portions of the export contain changes since the checkpoint). If omitted, a full backup is performed. </dd> <dt><code>server</code></dt> <dd>Present only for a pull mode backup. 
Contains the same attributes as the <code>protocol</code> element of a disk attached via NBD in the domain (such as transport, socket, name, port, or tls), necessary to set up an NBD server that exposes the content of each disk at the time the backup started.
</dd>
<dt><code>disks</code></dt>
<dd>This is an optional listing of instructions for disks participating in the backup (if omitted, all disks participate, and libvirt attempts to generate filenames by appending the current timestamp as a suffix). When provided on input, disks omitted from the list do not participate in the backup. On output, the list is present but contains only the disks participating in the backup job. This element has a list of <code>disk</code> sub-elements, describing anywhere from one to all of the disks associated with the domain.
<dl>
<dt><code>disk</code></dt>
<dd>This sub-element describes the backup properties of a specific disk. The attribute <code>name</code> is mandatory, and must match either the <code><target dev='name'/></code> or an unambiguous <code><source file='name'/></code> of one of the <a href="formatdomain.html#elementsDisks">disk devices</a> specified for the domain at the time of the backup. The optional attribute <code>type</code> can be <code>file</code>, <code>block</code>, or <code>network</code>, and, similar to a disk declaration for a domain, controls what additional sub-elements are needed to describe the destination (such as <code>protocol</code> for a network destination). In push mode backups, the primary sub-element is <code>target</code>; in pull mode, the primary sub-element is <code>scratch</code>; but either way, the primary sub-element describes the file name to be used during the backup operation, similar to the <code>source</code> sub-element of a domain disk. An optional sub-element <code>driver</code> can also be used to specify a destination format different from qcow2.
</dd>
</dl>
</dd>
</dl>
<h2><a id="example">Examples</a></h2>
<p>Using this XML to create a checkpoint of just vda on a qemu domain with two disks and a prior checkpoint:</p>
<pre>
<domaincheckpoint>
  <description>Completion of updates after OS install</description>
  <disks>
    <disk name='vda' checkpoint='bitmap'/>
    <disk name='vdb' checkpoint='no'/>
  </disks>
</domaincheckpoint></pre>
<p>will result in XML similar to this from <code>virDomainCheckpointGetXMLDesc()</code>:</p>
<pre>
<domaincheckpoint>
  <name>1525889631</name>
  <description>Completion of updates after OS install</description>
  <creationTime>1525889631</creationTime>
  <parent>
    <name>1525111885</name>
  </parent>
  <disks>
    <disk name='vda' checkpoint='bitmap' bitmap='1525889631'/>
    <disk name='vdb' checkpoint='no'/>
  </disks>
  <domain>
    <name>fedora</name>
    <uuid>93a5c045-6457-2c09-e56c-927cdf34e178</uuid>
    <memory>1048576</memory>
    ...
    <devices>
      <disk type='file' device='disk'>
        <driver name='qemu' type='qcow2'/>
        <source file='/path/to/file1'/>
        <target dev='vda' bus='virtio'/>
      </disk>
      <disk type='file' device='disk' snapshot='external'>
        <driver name='qemu' type='raw'/>
        <source file='/path/to/file2'/>
        <target dev='vdb' bus='virtio'/>
      </disk>
      ...
    </devices>
  </domain>
</domaincheckpoint></pre>
<p>With that checkpoint created, the qcow2 image is now tracking all changes that occur in the image since the checkpoint via the persistent bitmap named <code>1525889631</code>. Now, we can make a subsequent call to <code>virDomainBackupBegin()</code> to perform an incremental backup of just this data, using the following XML to start a pull model NBD export of the vda disk:
</p>
<pre>
<domainbackup mode="pull">
  <incremental>1525889631</incremental>
  <server transport="unix" socket="/path/to/server"/>
  <disks>
    <disk name='vda' type='file'>
      <scratch file='/path/to/file1.scratch'/>
    </disk>
  </disks>
</domainbackup>
</pre>
</body>
</html>

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.
+1-919-301-3266 Virtualization: qemu.org | libvirt.org