[libvirt] [RFC v3] external (pull) backup API

Here's my updated counterproposal for a backup API. In comparison to v2 posted by Nikolay: https://www.redhat.com/archives/libvir-list/2018-April/msg00115.html

- changed terminology a bit: Nikolay's "BlockSnapshot" is now called a "Checkpoint", and "BlockExportStart/Stop" is now "BackupBegin/End"
- flesh out more API descriptions
- better documentation of proposed XML, for both checkpoints and backup

Barring any major issues turned up during review, I've already started to code this into libvirt with a goal of getting an implementation ready for review this month.

Each domain will gain the ability to track a tree of Checkpoint objects (we've previously mentioned the term "system checkpoint" in the <domainsnapshot> XML as the combination of disk and RAM state; so I'll use the term "disk checkpoint" in prose as needed, to make it obvious that the checkpoints described here do not include RAM state). I will use the virDomainSnapshot API as a guide, meaning that we will track a tree of checkpoints where each checkpoint can have 0 or 1 parent checkpoints, in part because I plan to reuse a lot of the snapshot code as a starting point for implementing checkpoint tracking.

Qemu does NOT track a relationship between internal snapshots, so libvirt has to manage the backing tree all by itself; by the same argument, if qemu does not add a parent relationship to dirty bitmaps, libvirt can probably manage everything itself by copying how it manages parent relationships between internal snapshots. However, I think it will be far easier for libvirt to exploit qemu dirty bitmaps if qemu DOES add bitmap tracking, particularly if qemu adds ways to easily compose a temporary bitmap that is the union of one bitmap plus a fixed number of its parents.

Design-wise, libvirt will manage things so that there is only one enabled dirty bitmap per qcow2 image at a time, when no backup operation is in effect. There is a notion of a current (or most recent) checkpoint; when a new checkpoint is created, that becomes the current one and the former checkpoint becomes the parent of the new one. If there is no current checkpoint, then there is no active dirty bitmap managed by libvirt.

Representing things on a timeline, when a guest is first created, there is no dirty bitmap; later, the checkpoint "check1" is created, which in turn creates "bitmap1" in the qcow2 image for all changes past that point; when a second checkpoint "check2" is created, a qemu transaction is used to create and enable the new "bitmap2" bitmap at the same time as disabling the "bitmap1" bitmap. (Actually, it's probably easier to name the bitmap in the qcow2 file with the same name as the Checkpoint object being tracked in libvirt, but for discussion purposes, it's less confusing if I use separate names for now.)

creation ....... check1 ....... check2 ....... active
no bitmap       bitmap1        bitmap2

When a user wants to create a backup, they select which point in time the backup starts from; the default value NULL represents a full backup (all content since disk creation to the point in time of the backup call; no bitmap is needed; use sync=full for the push model or sync=none for the pull model); any other value represents the name of a checkpoint to use as an incremental backup (all content from the checkpoint to the point in time of the backup call; libvirt forms a temporary bitmap as needed, then uses sync=incremental for the push model or sync=none plus exporting the bitmap for the pull model).
For example, requesting an incremental backup from "check2" can just reuse "bitmap2", but requesting an incremental backup from "check1" requires the computation of the bitmap containing the union of "bitmap1" and "bitmap2".

Libvirt will always create a new bitmap when starting a backup operation, whether or not the user requests that a checkpoint be created. Most users that want incremental backup sequences will create a new checkpoint every time they do a backup; the new bitmap that libvirt creates is then associated with that new checkpoint, and even after the backup operation completes, the new bitmap remains in the qcow2 file. But it is also possible to request a backup without a new checkpoint (it merely means that it is not possible to create a subsequent incremental backup from the backup just started); in that case, libvirt will have to take care of merging the new bitmap back into the previous one at the end of the backup operation.

I think that it should be possible to run multiple backup operations in parallel in the long run. But in the interest of getting a proof of concept implementation out quickly, it's easier to state that for the initial implementation, libvirt supports at most one backup operation at a time (to do another backup, you have to wait for the current one to complete, or else abort and abandon the current one). As there is only one backup job running at a time, the existing virDomainGetJobInfo()/virDomainGetJobStats() will be able to report statistics about the job (insofar as such statistics are available). But in preparation for the future, when libvirt does add parallel job support, starting a backup job will return a job id; and presumably we'd add a new virDomainGetJobStatsByID() for grabbing statistics of an arbitrary (rather than the most-recently-started) job.

Since live migration also acts as a job visible through virDomainGetJobStats(), I'm going to treat an active backup job and live migration as mutually exclusive. This is particularly true when we have a pull model backup ongoing: if qemu on the source is acting as an NBD server, you can't migrate away from that qemu and tell the NBD client to reconnect to the NBD server on the migration destination. So, to perform a migration, you have to cancel any pending backup operations. Conversely, if a migration job is underway, it will not be possible to start a new backup job until migration completes. However, we DO need to modify migration to ensure that any persistent bitmaps are migrated.

I also think that in the long run, it should be possible to start a backup operation, and while it is still ongoing, create a new external snapshot, and still be able to coordinate the transfer of bitmaps from the old image to the new overlay. But for the first implementation, it's probably easiest to state that an ongoing backup prevents creation of a new snapshot. However, a current checkpoint (which means we DO have an active bitmap, even if there is no active backup) DOES need to be transferred to the new overlay, and conversely, a block commit job needs to merge all bitmaps from the old overlay to the backing file that is now becoming the active layer again. I don't know if qemu has primitives for this in place yet; if it does not, the only conservative thing we can do in the initial implementation is to state that the use of checkpoints is exclusive with the use of snapshots (using one prevents the use of the other). Hopefully we don't have to stay in that state for long.
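To make that concrete, here is a minimal C sketch of a push-model incremental backup taken from the older checkpoint "check1" while also creating a new checkpoint "check3", using the virDomainBackupBegin()/virDomainBackupEnd() API and XML format proposed later in this message. Nothing here exists in libvirt yet, and the exact element and attribute spellings (such as <target file=...> and <driver type=...>) are illustrative guesses at the proposal:

#include <stdio.h>
#include <libvirt/libvirt.h>

/* Sketch only: the virDomainBackup* calls are the API proposed in this
 * RFC, not a released libvirt interface; error handling is minimal. */
static int
incremental_push_backup(virDomainPtr dom)
{
    /* Push-model backup of vda containing everything changed since
     * checkpoint "check1"; libvirt is expected to compose the union of
     * bitmap1 and bitmap2 internally. */
    const char *backupXml =
        "<domainbackup mode='push'>"
        "  <incremental>check1</incremental>"
        "  <disks>"
        "    <disk name='vda' type='file'>"
        "      <target file='/backup/vda.check1-to-now.qcow2'/>"
        "      <driver type='qcow2'/>"
        "    </disk>"
        "  </disks>"
        "</domainbackup>";

    /* Also create a new checkpoint, so that the next backup can be
     * incremental relative to this point in time. */
    const char *checkpointXml =
        "<domaincheckpoint>"
        "  <name>check3</name>"
        "</domaincheckpoint>";

    int job = virDomainBackupBegin(dom, backupXml, checkpointXml, 0);
    if (job < 0)
        return -1;

    /* ... wait for the VIR_DOMAIN_EVENT_ID_BLOCK_JOB_2 event signalling
     * that the push-model copy completed or failed ... */

    return virDomainBackupEnd(dom, job, 0);
}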
For now, a user wanting guest I/O to be at a safe point can manually use virDomainFSFreeze()/virDomainBackupBegin()/virDomainFSThaw(); we may decide down the road to use the flags argument of virDomainBackupBegin() to provide automatic guest quiescing through one API (I'm not doing it right away, because we have to worry about undoing effects if we fail to thaw after starting the backup).

So, to summarize, creating a backup will involve the following new APIs:

/**
 * virDomainBackupBegin:
 * @domain: a domain object
 * @diskXml: description of storage to utilize and expose during
 *           the backup, or NULL
 * @checkpointXml: description of a checkpoint to create, or NULL
 * @flags: not used yet, pass 0
 *
 * Start a point-in-time backup job for the specified disks of a
 * running domain.
 *
 * A backup job is mutually exclusive with domain migration
 * (particularly when the job sets up an NBD export, since it is not
 * possible to tell any NBD clients about a server migrating between
 * hosts). For now, backup jobs are also mutually exclusive with any
 * other block job on the same device, although this restriction may
 * be lifted in a future release. Progress of the backup job can be
 * tracked via virDomainGetJobStats(). The job remains active until a
 * subsequent call to virDomainBackupEnd(), even if it no longer has
 * anything to copy.
 *
 * There are two fundamental backup approaches. The first, called a
 * push model, instructs the hypervisor to copy the state of the guest
 * disk to the designated storage destination (which may be on the
 * local file system or a network device); in this mode, the
 * hypervisor writes the content of the guest disk to the destination,
 * then emits VIR_DOMAIN_EVENT_ID_BLOCK_JOB_2 when the backup is
 * either complete or failed (the backup image is invalid if the job
 * is ended prior to the event being emitted). The second, called a
 * pull model, instructs the hypervisor to expose the state of the
 * guest disk over an NBD export; a third-party client can then
 * connect to this export, and read whichever portions of the disk it
 * desires. In this mode, there is no event; libvirt has to be
 * informed when the third-party NBD client is done and the backup
 * resources can be released.
 *
 * The @diskXml parameter is optional but usually provided, and
 * contains details about the backup, including which backup mode to
 * use, whether the backup is incremental from a previous checkpoint,
 * which disks participate in the backup, the destination for a push
 * model backup, and the temporary storage and NBD server details for
 * a pull model backup. If omitted, the backup attempts to default to
 * a push mode full backup of all disks, where libvirt generates a
 * filename for each disk by appending a suffix of a timestamp in
 * seconds since the Epoch. virDomainBackupGetXMLDesc() can be called
 * to learn the actual values selected. For more information, see
 * formatcheckpoint.html#BackupAttributes.
 *
 * The @checkpointXml parameter is optional; if non-NULL, then libvirt
 * behaves as if virDomainCheckpointCreateXML() were called with
 * @checkpointXml, atomically covering the same guest state that will
 * be part of the backup. The creation of a new checkpoint allows for
 * future incremental backups.
 *
 * Returns a non-negative job id on success, or negative on failure.
 * This operation returns quickly, such that a user can choose to
 * start a backup job between virDomainFSFreeze() and
 * virDomainFSThaw() in order to create the backup while guest I/O is
 * quiesced.
 */
int virDomainBackupBegin(virDomainPtr domain, const char *diskXml,
                         const char *checkpointXml, unsigned int flags);

Note that this layout says that all disks participating in the backup job share the same incremental checkpoint as their starting point (no way to have one backup job where disk A copies data since check1 while disk B copies data since check2). If we need the latter, then we could get rid of the 'incremental' parameter, and instead have each <disk> element within diskXml call out an optional <checkpoint> name as its starting point. Also, qemu supports exposing multiple disks through a single NBD server (you then connect multiple clients to the one server to grab state from each disk). So the NBD details are listed in parallel to the <disks>. Note that since a backup is NOT a guest-visible action, the backup job does not alter the normal <domain> XML.

/**
 * virDomainBackupGetXMLDesc:
 * @domain: a domain object
 * @id: the id of an active backup job previously started with
 *      virDomainBackupBegin()
 * @flags: not used yet, pass 0
 *
 * In some cases, a user can start a backup job without supplying all
 * details, and rely on libvirt to fill in the rest (for example,
 * selecting the port used for an NBD export). This API can then be
 * used to learn what default values were chosen.
 *
 * Returns a NUL-terminated UTF-8 encoded XML instance, or NULL in
 * case of error. The caller must free() the returned value.
 */
char *virDomainBackupGetXMLDesc(virDomainPtr domain, int id,
                                unsigned int flags);

/**
 * virDomainBackupEnd:
 * @domain: a domain object
 * @id: the id of an active backup job previously started with
 *      virDomainBackupBegin()
 * @flags: bitwise-OR of supported virDomainBackupEndFlags
 *
 * Conclude a point-in-time backup job @id on the given domain.
 *
 * If the backup job uses the push model, but the event marking that
 * all data has been copied has not yet been emitted, then the command
 * fails unless @flags includes VIR_DOMAIN_BACKUP_END_ABORT. If the
 * event has been issued, or if the backup uses the pull model, the
 * flag has no effect.
 *
 * Returns 0 on success and -1 on failure.
 */
int virDomainBackupEnd(virDomainPtr domain, int id, unsigned int flags);

/**
 * virDomainCheckpointCreateXML:
 * @domain: a domain object
 * @xmlDesc: description of the checkpoint to create
 * @flags: bitwise-OR of supported virDomainCheckpointCreateFlags
 *
 * Create a new checkpoint using @xmlDesc on a running @domain.
 * Typically, it is more common to create a new checkpoint as part of
 * kicking off a backup job with virDomainBackupBegin(); however, it
 * is also possible to start a checkpoint without a backup.
 *
 * See formatcheckpoint.html#CheckpointAttributes document for more
 * details on @xmlDesc.
 *
 * If @flags includes VIR_DOMAIN_CHECKPOINT_CREATE_REDEFINE, then this
 * is a request to reinstate checkpoint metadata that was previously
 * discarded, rather than creating a new checkpoint. When redefining
 * checkpoint metadata, the current checkpoint will not be altered
 * unless the VIR_DOMAIN_CHECKPOINT_CREATE_CURRENT flag is also
 * present. It is an error to request the
 * VIR_DOMAIN_CHECKPOINT_CREATE_CURRENT flag without
 * VIR_DOMAIN_CHECKPOINT_CREATE_REDEFINE.
 *
 * If @flags includes VIR_DOMAIN_CHECKPOINT_CREATE_NO_METADATA, then
 * the domain's disk images are modified according to @xmlDesc, but
 * then the just-created checkpoint has its metadata deleted. This
 * flag is incompatible with VIR_DOMAIN_CHECKPOINT_CREATE_REDEFINE.
 *
 * Returns an (opaque) new virDomainCheckpointPtr on success, or NULL
 * on failure.
 */
virDomainCheckpointPtr virDomainCheckpointCreateXML(virDomainPtr domain,
                                                    const char *xmlDesc,
                                                    unsigned int flags);

/**
 * virDomainCheckpointDelete:
 * @checkpoint: the checkpoint to remove
 * @flags: bitwise-OR of supported virDomainCheckpointDeleteFlags
 *
 * Removes a checkpoint from the domain.
 *
 * When removing a checkpoint, the record of which portions of the
 * disk were dirtied after the checkpoint will be merged into the
 * record tracked by the parent checkpoint, if any. Likewise, if the
 * checkpoint being deleted was the current checkpoint, the parent
 * checkpoint becomes the new current checkpoint.
 *
 * If @flags includes VIR_DOMAIN_CHECKPOINT_DELETE_METADATA_ONLY, then
 * any checkpoint metadata tracked by libvirt is removed while keeping
 * the checkpoint contents intact; if a hypervisor does not require
 * any libvirt metadata to track checkpoints, then this flag is
 * silently ignored.
 *
 * Returns 0 on success, -1 on error.
 */
int virDomainCheckpointDelete(virDomainCheckpointPtr checkpoint,
                              unsigned int flags);

// Many additional functions copying heavily from virDomainSnapshot*:
int virDomainCheckpointList(virDomainPtr domain,
                            virDomainCheckpointPtr **checkpoints,
                            unsigned int flags);
char *virDomainCheckpointGetXMLDesc(virDomainCheckpointPtr checkpoint,
                                    unsigned int flags);
virDomainCheckpointPtr virDomainCheckpointLookupByName(virDomainPtr domain,
                                                       const char *name,
                                                       unsigned int flags);
const char *virDomainCheckpointGetName(virDomainCheckpointPtr checkpoint);
virDomainPtr virDomainCheckpointGetDomain(virDomainCheckpointPtr checkpoint);
virConnectPtr virDomainCheckpointGetConnect(virDomainCheckpointPtr checkpoint);
int virDomainHasCurrentCheckpoint(virDomainPtr domain, unsigned int flags);
virDomainCheckpointPtr virDomainCheckpointCurrent(virDomainPtr domain,
                                                  unsigned int flags);
virDomainCheckpointPtr virDomainCheckpointGetParent(virDomainCheckpointPtr checkpoint,
                                                    unsigned int flags);
int virDomainCheckpointIsCurrent(virDomainCheckpointPtr checkpoint,
                                 unsigned int flags);
int virDomainCheckpointRef(virDomainCheckpointPtr checkpoint);
int virDomainCheckpointFree(virDomainCheckpointPtr checkpoint);
int virDomainCheckpointListChildren(virDomainCheckpointPtr checkpoint,
                                    virDomainCheckpointPtr **children,
                                    unsigned int flags);

Notably, none of the older racy list functions, like virDomainSnapshotNum, virDomainSnapshotNumChildren, or virDomainSnapshotListChildrenNames; also, for now, there is no revert support like virDomainSnapshotRevert. Eventually, if we add a way to roll back to the state recorded in an earlier bitmap, we'll want to tell libvirt that it needs to create a new bitmap as a child of an existing (non-current) checkpoint. That is, if we have:

check1 .... check2 .... active
bitmap1     bitmap2

and created a backup at the same time as check2, then when we later roll back to the state of that backup, we would want to end writes to bitmap2 and declare that check2 is no longer current, and create a new current check3 with associated bitmap3 and parent check1 to track all writes since the point of the revert. Until then, I don't think it's possible to have more than one child without manually using the REDEFINE flag to create such scenarios; but the API should not lock us out of supporting multiple children in the future.
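As a sanity check that this list is enough for a client, here is a short C sketch (against the proposed prototypes above, which do not exist in libvirt yet) that prints the chain of checkpoints from the current one back to the root; it assumes virDomainCheckpointGetParent() returns NULL once the root checkpoint is reached:

#include <stdio.h>
#include <libvirt/libvirt.h>

/* Sketch against the proposed checkpoint API; none of the
 * virDomainCheckpoint* calls exist in a released libvirt. */
static int
print_checkpoint_chain(virDomainPtr dom)
{
    if (virDomainHasCurrentCheckpoint(dom, 0) != 1) {
        printf("no current checkpoint\n");
        return 0;
    }

    virDomainCheckpointPtr chk = virDomainCheckpointCurrent(dom, 0);
    if (!chk)
        return -1;

    while (chk) {
        printf("%s\n", virDomainCheckpointGetName(chk));
        /* Assumption: GetParent() returns NULL (possibly with an error
         * set) once we walk past the root checkpoint. */
        virDomainCheckpointPtr parent = virDomainCheckpointGetParent(chk, 0);
        virDomainCheckpointFree(chk);
        chk = parent;
    }
    return 0;
}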
Here's my proposal for user-facing XML documentation, based on formatsnapshot.html.in: <?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE html> <html xmlns="http://www.w3.org/1999/xhtml"> <body> <h1>Checkpoint and Backup XML format</h1> <ul id="toc"></ul> <h2><a id="CheckpointAttributes">Checkpoint XML</a></h2> <p> Libvirt is able to facilitate incremental backups by tracking disk checkpoints, or points in time against which it is easy to compute which portion of the disk has changed. Given a full backup (a backup created from the creation of the disk to a given point in time, coupled with the creation of a disk checkpoint at that time), and an incremental backup (a backup created from just the dirty portion of the disk between the first checkpoint and the second backup operation), it is possible to do an offline reconstruction of the state of the disk at the time of the second backup, without having to copy as much data as a second full backup would require. Most disk checkpoints are created in concert with a backup, via <code>virDomainBackupBegin()</code>; however, libvirt also exposes enough support to create disk checkpoints independently from a backup operation, via <code>virDomainCheckpointCreateXML()</code>. </p> <p> Attributes of libvirt checkpoints are stored as child elements of the <code>domaincheckpoint</code> element. At checkpoint creation time, normally only the <code>name</code>, <code>description</code>, and <code>disks</code> elements are settable; the rest of the fields are ignored on creation, and will be filled in by libvirt for informational purposes by <code>virDomainCheckpointGetXMLDesc()</code>. However, when redefining a checkpoint, with the <code>VIR_DOMAIN_CHECKPOINT_CREATE_REDEFINE</code> flag of <code>virDomainCheckpointCreateXML()</code>, all of the XML described here is relevant. </p> <p> Checkpoints are maintained in a hierarchy. A domain can have a current checkpoint, which is the most recent checkpoint compared to the current state of the domain (although a domain might have checkpoints without a current checkpoint, if checkpoints have been deleted in the meantime). Creating or reverting to a checkpoint sets that checkpoint as current, and the prior current checkpoint is the parent of the new checkpoint. Branches in the hierarchy can be formed by reverting to a checkpoint with a child, then creating another checkpoint. </p> <p> The top-level <code>domaincheckpoint</code> element may contain the following elements: </p> <dl> <dt><code>name</code></dt> <dd>The name for this checkpoint. If the name is specified when initially creating the checkpoint, then the checkpoint will have that particular name. If the name is omitted when initially creating the checkpoint, then libvirt will make up a name for the checkpoint, based on the time when it was created. </dd> <dt><code>description</code></dt> <dd>A human-readable description of the checkpoint. If the description is omitted when initially creating the checkpoint, then this field will be empty. </dd> <dt><code>disks</code></dt> <dd>On input, this is an optional listing of specific instructions for disk checkpoints; it is needed when making a checkpoint on only a subset of the disks associated with a domain (in particular, since qemu checkpoints require qcow2 disks, this element may be needed on input for excluding guest disks that are not in qcow2 format); if omitted on input, then all disks participate in the checkpoint. On output, this is fully populated to show the state of each disk in the checkpoint.
This element has a list of <code>disk</code> sub-elements, describing anywhere from one to all of the disks associated with the domain. <dl> <dt><code>disk</code></dt> <dd>This sub-element describes the checkpoint properties of a specific disk. The attribute <code>name</code> is mandatory, and must match either the <code><target dev='name'/></code> or an unambiguous <code><source file='name'/></code> of one of the <a href="formatdomain.html#elementsDisks">disk devices</a> specified for the domain at the time of the checkpoint. The attribute <code>checkpoint</code> is optional on input; possible values are <code>no</code> when the disk does not participate in this checkpoint; or <code>bitmap</code> if the disk will track all changes since the creation of this checkpoint via a bitmap, in which case another attribute <code>bitmap</code> will be the name of the tracking bitmap (defaulting to the checkpoint name). </dd> </dl> </dd> <dt><code>creationTime</code></dt> <dd>The time this checkpoint was created. The time is specified in seconds since the Epoch, UTC (i.e. Unix time). Readonly. </dd> <dt><code>parent</code></dt> <dd>The parent of this checkpoint. If present, this element contains exactly one child element, name. This specifies the name of the parent checkpoint of this one, and is used to represent trees of checkpoints. Readonly. </dd> <dt><code>domain</code></dt> <dd>The inactive <a href="formatdomain.html">domain configuration</a> at the time the checkpoint was created. Readonly. </dd> </dl> <h2><a id="BackupAttributes">Backup XML</a></h2> <p> Creating a backup, whether full or incremental, is done via <code>virDomainBackupBegin()</code>, which takes an XML description of the actions to perform. There are two general modes for backups: a push mode (where the hypervisor writes out the data to the destination file, which may be local or remote), and a pull mode (where the hypervisor creates an NBD server that a third-party client can then read as needed, and which requires the use of temporary storage, typically local, until the backup is complete). </p> <p> The instructions for beginning a backup job are provided as attributes and elements of the top-level <code>domainbackup</code> element. This element includes an optional attribute <code>mode</code> which can be either "push" or "pull" (default push). Where elements are optional on creation, <code>virDomainBackupGetXMLDesc()</code> can be used to see the actual values selected (for example, learning which port the NBD server is using in the pull model, or what file names libvirt generated when none were supplied). The following child elements are supported: </p> <dl> <dt><code>incremental</code></dt> <dd>Optional. If this element is present, it must name an existing checkpoint of the domain, which will be used to make this backup an incremental one (in the push model, only changes since the checkpoint are written to the destination; in the pull model, the NBD server uses the NBD_OPT_SET_META_CONTEXT extension to advertise to the client which portions of the export contain changes since the checkpoint). If omitted, a full backup is performed. </dd> <dt><code>server</code></dt> <dd>Present only for a pull mode backup. Contains the same attributes as the <code>protocol</code> element of a disk attached via NBD in the domain (such as transport, socket, name, port, or tls), necessary to set up an NBD server that exposes the content of each disk at the time the backup started. 
</dd> <dt><code>disks</code></dt> <dd>This is an optional listing of instructions for disks participating in the backup (if omitted, all disks participate, and libvirt attempts to generate filenames by appending the current timestamp as a suffix). When provided on input, disks omitted from the list do not participate in the backup. On output, the list is present but contains only the disks participating in the backup job. This element has a list of <code>disk</code> sub-elements, describing anywhere from one to all of the disks associated with the domain. <dl> <dt><code>disk</code></dt> <dd>This sub-element describes the backup properties of a specific disk. The attribute <code>name</code> is mandatory, and must match either the <code><target dev='name'/></code> or an unambiguous <code><source file='name'/></code> of one of the <a href="formatdomain.html#elementsDisks">disk devices</a> specified for the domain at the time of the backup. The optional attribute <code>type</code> can be <code>file</code>, <code>block</code>, or <code>network</code>, similar to a disk declaration for a domain, and controls what additional sub-elements are needed to describe the destination (such as <code>protocol</code> for a network destination). In push mode backups, the primary sub-element is <code>target</code>; in pull mode, the primary sub-element is <code>scratch</code>; but either way, the primary sub-element describes the file name to be used during the backup operation, similar to the <code>source</code> sub-element of a domain disk. An optional sub-element <code>driver</code> can also be used to specify a destination format different from qcow2. </dd> </dl> </dd> </dl>

<h2><a id="example">Examples</a></h2>

<p>Using this XML to create a checkpoint of just vda on a qemu domain with two disks and a prior checkpoint:</p>
<pre>
<domaincheckpoint>
  <description>Completion of updates after OS install</description>
  <disks>
    <disk name='vda' checkpoint='bitmap'/>
    <disk name='vdb' checkpoint='no'/>
  </disks>
</domaincheckpoint></pre>

<p>will result in XML similar to this from <code>virDomainCheckpointGetXMLDesc()</code>:</p>
<pre>
<domaincheckpoint>
  <name>1525889631</name>
  <description>Completion of updates after OS install</description>
  <creationTime>1525889631</creationTime>
  <parent>
    <name>1525111885</name>
  </parent>
  <disks>
    <disk name='vda' checkpoint='bitmap' bitmap='1525889631'/>
    <disk name='vdb' checkpoint='no'/>
  </disks>
  <domain>
    <name>fedora</name>
    <uuid>93a5c045-6457-2c09-e56c-927cdf34e178</uuid>
    <memory>1048576</memory>
    ...
    <devices>
      <disk type='file' device='disk'>
        <driver name='qemu' type='qcow2'/>
        <source file='/path/to/file1'/>
        <target dev='vda' bus='virtio'/>
      </disk>
      <disk type='file' device='disk' snapshot='external'>
        <driver name='qemu' type='raw'/>
        <source file='/path/to/file2'/>
        <target dev='vdb' bus='virtio'/>
      </disk>
      ...
    </devices>
  </domain>
</domaincheckpoint></pre>

<p>With that checkpoint created, the qcow2 image is now tracking all changes that occur in the image since the checkpoint via the persistent bitmap named <code>1525889631</code>.
Now, we can make a subsequent call to <code>virDomainBackupBegin()</code> to perform an incremental backup of just this data, using the following XML to start a pull model NBD export of the vda disk: </p>
<pre>
<domainbackup mode="pull">
  <incremental>1525889631</incremental>
  <server transport="unix" socket="/path/to/server"/>
  <disks>
    <disk name='vda' type='file'>
      <scratch file='/path/to/file1.scratch'/>
    </disk>
  </disks>
</domainbackup>
</pre>
</body> </html>

-- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
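As a usage illustration of that pull-model example, a client might drive it roughly like the C sketch below, using virDomainBackupGetXMLDesc() to discover any values libvirt filled in. This is again only a sketch against the API proposed in this thread, which does not exist in a released libvirt:

#include <stdio.h>
#include <stdlib.h>
#include <libvirt/libvirt.h>

/* Sketch only: virDomainBackup* is the API proposed in this RFC. */
static int
pull_backup(virDomainPtr dom)
{
    const char *backupXml =
        "<domainbackup mode='pull'>"
        "  <incremental>1525889631</incremental>"
        "  <server transport='unix' socket='/path/to/server'/>"
        "  <disks>"
        "    <disk name='vda' type='file'>"
        "      <scratch file='/path/to/file1.scratch'/>"
        "    </disk>"
        "  </disks>"
        "</domainbackup>";

    int job = virDomainBackupBegin(dom, backupXml, NULL, 0);
    if (job < 0)
        return -1;

    /* Learn any values libvirt filled in (for example, a TCP port if we
     * had requested transport='tcp' without naming one). */
    char *xml = virDomainBackupGetXMLDesc(dom, job, 0);
    printf("backup job %d:\n%s\n", job, xml ? xml : "(none)");
    free(xml);

    /* ... hand the NBD socket to a third-party client, which copies the
     * clusters advertised as dirty since checkpoint 1525889631 ... */

    /* Tear down the NBD export and release the scratch file. */
    return virDomainBackupEnd(dom, job, 0);
}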

On Thu, May 17, 2018 at 05:43:37PM -0500, Eric Blake wrote:
Here's my updated counterproposal for a backup API.
In comparison to v2 posted by Nikolay: https://www.redhat.com/archives/libvir-list/2018-April/msg00115.html - changed terminology a bit: Nikolay's "BlockSnapshot" is now called a "Checkpoint", and "BlockExportStart/Stop" is now "BackupBegin/End" - flesh out more API descriptions - better documentation of proposed XML, for both checkpoints and backup
Barring any major issues turned up during review, I've already starting to code this into libvirt with a goal of getting an implementation ready for review this month.
I think the key thing missing from the docs is some kind of explanation about the difference between a backup, a checkpoint, and a snapshot. I'll admit I've not read the mail in detail, but at a high level it is not immediately obvious what the difference is & thus which APIs I would want to be using for a given scenario.

Regards, Daniel

-- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On 05/18/2018 02:56 AM, Daniel P. Berrangé wrote:
On Thu, May 17, 2018 at 05:43:37PM -0500, Eric Blake wrote:
Here's my updated counterproposal for a backup API.
In comparison to v2 posted by Nikolay: https://www.redhat.com/archives/libvir-list/2018-April/msg00115.html - changed terminology a bit: Nikolay's "BlockSnapshot" is now called a "Checkpoint", and "BlockExportStart/Stop" is now "BackupBegin/End" - flesh out more API descriptions - better documentation of proposed XML, for both checkpoints and backup
Barring any major issues turned up during review, I've already starting to code this into libvirt with a goal of getting an implementation ready for review this month.
I think the key thing missing from the docs is some kind of explanation about the difference between a backup, and checkpoint and a snapshot. I'll admit I've not read the mail in detail, but at a high level it is not immediately obvious what the difference is & thus which APIs I would want to be using for a given scenario.
Indeed, and that's a fair complaint. Here's a first draft, that I'll have to polish into a formal html document that both the snapshot and checkpoint/backup pages refer to (or maybe I merge snapshots and checkpoint descriptions into a single html page, although I'm not quite sure what to name the page then).

One of the features made possible with virtual machines is live migration, or transferring all state related to the guest from one host to another, with minimal interruption to the guest's activity. A clever observer will then note that if all state is available for live migration, there is nothing stopping a user from saving that state at a given point of time, to be able to later rewind guest execution back to the state it previously had. There are several different libvirt APIs associated with capturing the state of a guest, such that the captured state can later be used to rewind that guest to the conditions it was in earlier. But since there are multiple APIs, it is best to understand the tradeoffs and differences between them, in order to choose the best API for a given task.

Timing: Capturing state can be a lengthy process, so while the captured state ideally represents an atomic point in time corresponding to something the guest was actually executing, some interfaces require up-front preparation (the state captured is not complete until the API ends, which may be some time after the command was first started), while other interfaces track the state when the command was first issued, even if it takes some time to finish capturing the state. While it is possible to freeze guest I/O around either point in time (so that the captured state is fully consistent, rather than just crash-consistent), knowing whether the state is captured at the start or end of the command may determine which approach to use. A related concept is the amount of downtime the guest will experience during the capture, particularly since freezing guest I/O has time constraints.

Amount of state: For an offline guest, only the contents of the guest disks need to be captured; restoring that state is merely a fresh boot with the disks restored to that state. But for an online guest, there is a choice between storing the guest's memory (all that is needed during live migration where the storage is shared between source and destination), the guest's disk state (all that is needed if there are no pending guest I/O transactions that would be lost without the corresponding memory state), or both together. Unless guest I/O is quiesced prior to capturing state, reverting to the captured disk state of a live guest without the corresponding memory state is comparable to booting a machine that previously lost power without a clean shutdown; but for a guest that uses appropriate journaling methods, this crash-consistent state may be sufficient to avoid the additional storage and time needed to capture memory state.

Quantity of files: When capturing state, some approaches store all state within the same file (internal), while others expand a chain of related files that must be used together (external), resulting in more files that a management application must track. There are also differences depending on whether the state is captured in the same file in use by a running guest, or whether the state is captured to a distinct file without impacting the files used to run the guest.
Third-party integration: When capturing state, particularly for a running guest, there are tradeoffs to how much of the process must be done directly by the hypervisor, and how much can be off-loaded to third-party software. Since capturing state is not instantaneous, it is essential that any third-party integration see consistent data even if the running guest continues to modify that data after the point in time of the capture.

Full vs. partial: When capturing state, it is useful to minimize the amount of state that must be captured in relation to a previous capture, by focusing only on the portions of the disk that the guest has modified since the previous capture. Some approaches are able to take advantage of checkpoints to provide an incremental backup, while others are only capable of a full backup including portions of the disk that have not changed since the previous state capture.

With those definitions, the following libvirt APIs have these properties:

virDomainSnapshotCreateXML: This API wraps several approaches for capturing guest state, with a general premise of creating a snapshot (where the current guest resources are frozen in time and a new wrapper layer is opened for tracking subsequent guest changes). It can operate on both offline and running guests, can choose whether to capture the state of memory, disk, or both when used on a running guest, and can choose between internal and external storage for captured state. However, it is geared towards post-event captures (when capturing both memory and disk state, the disk state is not captured until all memory state has been collected first). For qemu as the hypervisor, internal snapshots currently have lengthy downtime that is incompatible with freezing guest I/O, but external snapshots are quick. Since creating an external snapshot changes which disk image resource is in use by the guest, this API can be coupled with virDomainBlockCommit to restore things back to the guest using its original disk image, where a third-party tool can read the backing file prior to the live commit.

virDomainBlockCopy: This API wraps approaches for capturing the state of disks of a running guest, but does not track accompanying guest memory state. The capture is consistent only at the end of the operation, with a choice to either pivot to the new file that contains the copy (leaving the old file as the backup), or to return to the original file (leaving the new file as the backup).

virDomainBackupBegin: This API wraps approaches for capturing the state of disks of a running guest, but does not track accompanying guest memory state. The capture is consistent to the start of the operation, where the captured state is stored independently from the disk image in use with the guest, and where it can be easily integrated with a third party for capturing the disk state. Since the backup operation is stored externally from the guest resources, there is no need to commit data back in at the completion of the operation. When coupled with checkpoints, this can be used to capture incremental backups instead of full ones.

virDomainCheckpointCreateXML: This API does not actually capture guest state, so much as make it possible to track which portions of guest disks have changed between checkpoints or between a current checkpoint and the live execution of the guest.
When performing incremental backups, it is easier to create a new checkpoint at the same time as a new backup, so that the next incremental backup can refer to the incremental state since the checkpoint created during the current backup.

Putting it together: the following two sequences both capture the disk state of a running guest, then complete with the guest running on its original disk image; but with a difference that an unexpected interruption during the first mode leaves a temporary wrapper file that must be accounted for, while interruption of the second mode has no impact to the guest.

1. Backup via temporary snapshot
    virDomainFSFreeze()
    virDomainSnapshotCreateXML(VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY)
    virDomainFSThaw()
    third-party copy of the backing file to backup storage  # most time spent here
    virDomainBlockCommit(VIR_DOMAIN_BLOCK_COMMIT_ACTIVE)
    wait for commit ready event
    virDomainBlockJobAbort()

2. Direct backup
    virDomainFSFreeze()
    virDomainBackupBegin()
    virDomainFSThaw()
    wait for push mode event, or pull data over NBD  # most time spent here
    virDomainBackupEnd()

-- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
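For readers who want to map sequence 1 onto the libvirt API that already exists today, here is a minimal C sketch using only released calls. Error handling and the event loop are omitted, the disk and snapshot names are illustrative, and the block job is polled here only to keep the sketch short (the READY event is the better signal):

#include <unistd.h>
#include <libvirt/libvirt.h>

/* Rough sketch of "backup via temporary snapshot" for a single disk vda;
 * all names and paths are illustrative. */
static int
backup_via_temporary_snapshot(virDomainPtr dom)
{
    virDomainSnapshotPtr snap;
    virDomainBlockJobInfo info;

    /* Quiesce guest I/O around the instant the overlay is created. */
    virDomainFSFreeze(dom, NULL, 0, 0);
    snap = virDomainSnapshotCreateXML(dom,
                                      "<domainsnapshot>"
                                      "  <name>tmp-backup</name>"
                                      "</domainsnapshot>",
                                      VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY);
    virDomainFSThaw(dom, NULL, 0, 0);
    if (!snap)
        return -1;
    virDomainSnapshotFree(snap);

    /* ... third-party tool copies the now-read-only backing file of vda
     * to backup storage; most of the time is spent here ... */

    /* Merge the temporary overlay back into the original image. */
    if (virDomainBlockCommit(dom, "vda", NULL, NULL, 0,
                             VIR_DOMAIN_BLOCK_COMMIT_ACTIVE) < 0)
        return -1;

    /* Wait until the active commit is synchronized. */
    do {
        sleep(1);
        if (virDomainGetBlockJobInfo(dom, "vda", &info, 0) != 1)
            return -1;
    } while (info.cur != info.end);

    /* Pivot back to the original image, ending the job. */
    return virDomainBlockJobAbort(dom, "vda",
                                  VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT);
}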

18.05.2018 01:43, Eric Blake wrote:
Here's my updated counterproposal for a backup API.
[...]
Representing things on a timeline, when a guest is first created, there is no dirty bitmap; later, the checkpoint "check1" is created, which in turn creates "bitmap1" in the qcow2 image for all changes past that point; when a second checkmark "check2" is created, a qemu transaction is used to create and enable the new "bitmap2" bitmap at the same time as disabling "bitmap1" bitmap. (Actually, it's probably easier to name the bitmap in the qcow2 file with the same name as the Checkpoint object being tracked in libvirt, but for discussion purposes, it's less confusing if I use separate names for now.)
creation ....... check1 ....... check2 ....... active no bitmap bitmap1 bitmap2
When a user wants to create a backup, they select which point in time the backup starts from; the default value NULL represents a full backup (all content since disk creation to the point in time of the backup call, no bitmap is needed, use sync=full for push model or sync=none for the pull model); any other value represents the name of a checkpoint to use as an incremental backup (all content from the checkpoint to the point in time of the backup call; libvirt forms a temporary bitmap as needed, the uses sync=incremental for push model or sync=none plus exporting the bitmap for the pull model). For example, requesting an incremental backup from "check2" can just reuse "bitmap2", but requesting an incremental backup from "check1" requires the computation of the bitmap containing the union of "bitmap1" and "bitmap2".
I have a bit of criticism on this part, exactly on ability to create a backup not from last checkpoint but from any from the past. For this ability we are implementing the whole api with checkpoints, we are going to store several bitmaps in Qemu (and possibly going to implement checkpoints in Qemu in future). But personally, I don't know any real and adequate use cases for this ability.

I heard about the following cases:

1. Incremental restore: we want to rollback to some point in time (some element in incremental backup chain), and don't want to copy all the data, but only changed. - It's not real case, because information about dirtiness is already in backup chain: we just need to find allocated areas and copy them + we should copy areas, corresponding to dirty bits in active dirty bitmap in Qemu.

2. Several backup solutions backing up the same vm - Ok, if we implement checkpoints, instead of maintaining several active dirty bitmaps, we can have only one active bitmap and others disabled, which lead to performance gain and possibility to save RAM space (if we unload disabled bitmaps from RAM to qcow2). But what are real cases? What is the real benefit? I doubt that somebody will use more than 2 - 3 different backup providers on same vm, so is it worth implementing such a big feature for this? It of course worth doing if we have 100 independent backup providers.

Note: the word "independent" is important here. For example it may be two external backup tools, managed by different subsystems or different people or something like this. If we are just doing a backup weekly + daily, actually, we can synchronize them, so that weekly backup will be a merge of last 7 daily backups, so weekly backup don't need personal active dirty bitmap and even backup operation.

3. Some of backups in incremental backup chain are lost, and we want to recreate part of the chain as a new backup, instead of just dropping all chain and create full backup. In this case, I can say the following: disabled bitmaps (~ all checkpoints except the last one) are constant metadata, related to the backup chain, not to the vm. And it should be stored as constant data: may be on the same server as backup chain, maybe on the other, maybe in some database, but not in vm. VM is a dynamic structure, and I don't see any reason of storing (almost) unrelated constant metadata in it. Also, saving this constant backup-related metadata separately from vm will allow to check it's consistency with a help of checksums or something like this. Finally, I'm not a specialist in storing constant data, but I think that the vm is not the best place.

Note: Hmm, do someone have real examples of such user cases? Why backups are lost, is it often case? (I heard an assumption, that it may be a tool, checking backups (for example create a vm over the backup and check that it at least can start), which is running in background. But I'm not sure, that we must drop backup if it failed, may be it's enough to merge it up)

3.1 About external backup: we have even already exported this metadata to the third backup tool. So, this tool should store this information for future use, instead of exporting from Qemu again.

To summarize:
1. I doubt that discussed ability is really needed.
2. If it is needed, I doubt that it's a true way to store related disabled bitmaps (or checkpoints) in Qemu.

-- Best regards, Vladimir

On 05/21/2018 10:52 AM, Vladimir Sementsov-Ogievskiy wrote:
18.05.2018 01:43, Eric Blake wrote:
Here's my updated counterproposal for a backup API.
[...]
Representing things on a timeline, when a guest is first created, there is no dirty bitmap; later, the checkpoint "check1" is created, which in turn creates "bitmap1" in the qcow2 image for all changes past that point; when a second checkmark "check2" is created, a qemu transaction is used to create and enable the new "bitmap2" bitmap at the same time as disabling "bitmap1" bitmap. (Actually, it's probably easier to name the bitmap in the qcow2 file with the same name as the Checkpoint object being tracked in libvirt, but for discussion purposes, it's less confusing if I use separate names for now.)
creation ....... check1 ....... check2 ....... active no bitmap bitmap1 bitmap2
When a user wants to create a backup, they select which point in time the backup starts from; the default value NULL represents a full backup (all content since disk creation to the point in time of the backup call, no bitmap is needed, use sync=full for push model or sync=none for the pull model); any other value represents the name of a checkpoint to use as an incremental backup (all content from the checkpoint to the point in time of the backup call; libvirt forms a temporary bitmap as needed, the uses sync=incremental for push model or sync=none plus exporting the bitmap for the pull model). For example, requesting an incremental backup from "check2" can just reuse "bitmap2", but requesting an incremental backup from "check1" requires the computation of the bitmap containing the union of "bitmap1" and "bitmap2".
I have a bit of criticism on this part, exactly on ability to create a backup not from last checkpoint but from any from the past. For this ability we are implementing the whole api with checkpoints, we are going to store several bitmaps in Qemu (and possibly going to implement checkpoints in Qemu in future). But personally, I don't know any real and adequate use cases for this ability.
I heard about the following cases: 1. Incremental restore: we want to rollback to some point in time (some element in incremental backup chain), and don't want to copy all the data, but only changed. - It's not real case, because information about dirtiness is already in backup chain: we just need to find allocated areas and copy them + we should copy areas, corresponding to dirty bits in active dirty bitmap in Qemu.
If you do a pull mode backup (where the dirty bitmaps were exported over NBD), then yes, you can assume that the third-party app reading the backup data also saved the dirty bitmap in whatever form it likes, so that it only ever has to pull data from the most recent checkpoint and can reconstruct the union of changes from an earlier checkpoint offline without qemu help. But for a push mode backup (where qemu does the pushing), there is no way to expose the dirty bitmap of what the backup contains, unless you backup to something like a qcow2 image and track which clusters in the backup image were allocated as a result of the backup operation. So having a way in the libvirt API to grab an incremental backup from earlier than the most recent checkpoint may not be needed by everyone, but I don't see a problem in implementing it either.
2. Several backup solutions backing up the same vm - Ok, if we implement checkpoints, instead of maintaining several active dirty bitmaps, we can have only one active bitmap and others disabled, which lead to performance gain and possibility to save RAM space (if we unload disabled bitmaps from RAM to qcow2). But what are real cases? What is the real benefit? I doubt that somebody will use more than 2 - 3 different backup providers on same vm, so is it worth implementing such a big feature for this? It of course worth doing if we have 100 independent backup providers. Note: the word "independent" is important here. For example it may be two external backup tools, managed by different subsystems or different people or something like this. If we are just doing a backup weekly + daily, actually, we can synchronize them, so that weekly backup will be a merge of last 7 daily backups, so weekly backup don't need personal active dirty bitmap and even backup operation.
I'm not sure if this is a complaint that libvirt should allow more than one active bitmap at a time, vs. having exactly one active bitmap at a time and then reconstructing bitmaps over larger sequences of time as needed. But does that change the API that libvirt should expose to end users, or can it just be an implementation detail?
3. Some of backups in incremental backup chain are lost, and we want to recreate part of the chain as a new backup, instead of just dropping all chain and create full backup. In this case, I can say the following: disabled bitmaps (~ all checkpoints except the last one) are constant metadata, related to the backup chain, not to the vm. And it should be stored as constant data: may be on the same server as backup chain, maybe on the other, maybe in some database, but not in vm. VM is a dynamic structure, and I don't see any reason of storing (almost) unrelated constant metadata in it. Also, saving this constant backup-related metadata separately from vm will allow to check it's consistency with a help of checksums or something like this. Finally, I'm not a specialist in storing constant data, but I think that the vm is not the best place.
How does backup bitmap data get lost? If the image is only managed by libvirt, then libvirt shouldn't be losing arbitrary bitmaps. If a third-party entity is modifying the qcow2 images (presumably while the guest is offline, as editing a file that is simultaneously in use by qemu is a no-no), then all bets are off anyways, as you really shouldn't be trying to independently manage qcow2 files that are already being managed by libvirt.
Note: Hmm, do someone have real examples of such user cases? Why backups are lost, is it often case? (I heard an assumption, that it may be a tool, checking backups (for example create a vm over the backup and check that it at least can start), which is running in background. But I'm not sure, that we must drop backup if it failed, may be it's enough to merge it up)
I'm more worried about the implementation of checkpoints and dirty bitmaps across domain snapshots (where we have to consider copying one or more bitmaps from the base image to the active image when creating a snapshot, and conversely about merging bitmaps from the active image into the base when doing a live commit).
3.1 About external backup: we have even already exported this metadata to the third backup tool. So, this tool should store this information for future use, instead of exporting from Qemu again.
To summarize: 1. I doubt that discussed ability is really needed. 2. If it is needed, I doubt that it's a true way to store related disabled bitmaps (or checkpoints) in Qemu.
So, to make sure I understand, the only thing that you are debating whether we need is the ability to grab a backup image from earlier than the most-recent checkpoint? Remember, the proposal is whether we have a sufficiently powerful libvirt API to cover multiple use cases, even if not all users need all of the permutations of use cases, while still being something that is concise enough to document and implement on top of existing qemu semantics. -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org

22.05.2018 01:03, Eric Blake wrote:
On 05/21/2018 10:52 AM, Vladimir Sementsov-Ogievskiy wrote:
18.05.2018 01:43, Eric Blake wrote:
Here's my updated counterproposal for a backup API.
[...]
Representing things on a timeline, when a guest is first created, there is no dirty bitmap; later, the checkpoint "check1" is created, which in turn creates "bitmap1" in the qcow2 image for all changes past that point; when a second checkmark "check2" is created, a qemu transaction is used to create and enable the new "bitmap2" bitmap at the same time as disabling "bitmap1" bitmap. (Actually, it's probably easier to name the bitmap in the qcow2 file with the same name as the Checkpoint object being tracked in libvirt, but for discussion purposes, it's less confusing if I use separate names for now.)
creation ....... check1 ....... check2 ....... active no bitmap bitmap1 bitmap2
When a user wants to create a backup, they select which point in time the backup starts from; the default value NULL represents a full backup (all content since disk creation to the point in time of the backup call, no bitmap is needed, use sync=full for push model or sync=none for the pull model); any other value represents the name of a checkpoint to use as an incremental backup (all content from the checkpoint to the point in time of the backup call; libvirt forms a temporary bitmap as needed, the uses sync=incremental for push model or sync=none plus exporting the bitmap for the pull model). For example, requesting an incremental backup from "check2" can just reuse "bitmap2", but requesting an incremental backup from "check1" requires the computation of the bitmap containing the union of "bitmap1" and "bitmap2".
I have a bit of criticism on this part, exactly on ability to create a backup not from last checkpoint but from any from the past. For this ability we are implementing the whole api with checkpoints, we are going to store several bitmaps in Qemu (and possibly going to implement checkpoints in Qemu in future). But personally, I don't know any real and adequate use cases for this ability.
I heard about the following cases: 1. Incremental restore: we want to rollback to some point in time (some element in incremental backup chain), and don't want to copy all the data, but only changed. - It's not real case, because information about dirtiness is already in backup chain: we just need to find allocated areas and copy them + we should copy areas, corresponding to dirty bits in active dirty bitmap in Qemu.
If you do a pull mode backup (where the dirty bitmaps were exported over NBD), then yes, you can assume that the third-party app reading the backup data also saved the dirty bitmap in whatever form it likes, so that it only ever has to pull data from the most recent checkpoint and can reconstruct the union of changes from an earlier checkpoint offline without qemu help. But for a push mode backup (where qemu does the pushing), there is no way to expose the dirty bitmap of what the backup contains, unless you backup to something like a qcow2 image and track which clusters in the backup image were allocated as a result of the backup operation. So having a way in the libvirt API to grab an incremental backup from earlier than the most recent checkpoint may not be needed by everyone, but I don't see a problem in implementing it either.
If we have a chain of incrementals from a push backup, it should be possible to analyze their block status, so they should be something like qcow2; otherwise we can't create a chain of backing files. And if we back up incrementals to the same file, we can't restore to any previous point except the last one anyway.
2. Several backup solutions backing up the same vm - Ok, if we implement checkpoints, instead of maintaining several active dirty bitmaps, we can have only one active bitmap and others disabled, which lead to performance gain and possibility to save RAM space (if we unload disabled bitmaps from RAM to qcow2). But what are real cases? What is the real benefit? I doubt that somebody will use more than 2 - 3 different backup providers on same vm, so is it worth implementing such a big feature for this? It of course worth doing if we have 100 independent backup providers. Note: the word "independent" is important here. For example it may be two external backup tools, managed by different subsystems or different people or something like this. If we are just doing a backup weekly + daily, actually, we can synchronize them, so that weekly backup will be a merge of last 7 daily backups, so weekly backup don't need personal active dirty bitmap and even backup operation.
I'm not sure if this is a complaint that libvirt should allow more than one active bitmap at a time, vs. having exactly one active bitmap at a time and then reconstructing bitmaps over larger sequences of time as needed. But does that change the API that libvirt should expose to end users, or can it just be an implementation detail?
[2] is a benefit of the new API, which allows such an implementation detail (only one active bitmap). And I'm just saying that this benefit does not look very significant.
3. Some of backups in incremental backup chain are lost, and we want to recreate part of the chain as a new backup, instead of just dropping all chain and create full backup. In this case, I can say the following: disabled bitmaps (~ all checkpoints except the last one) are constant metadata, related to the backup chain, not to the vm. And it should be stored as constant data: may be on the same server as backup chain, maybe on the other, maybe in some database, but not in vm. VM is a dynamic structure, and I don't see any reason of storing (almost) unrelated constant metadata in it. Also, saving this constant backup-related metadata separately from vm will allow to check it's consistency with a help of checksums or something like this. Finally, I'm not a specialist in storing constant data, but I think that the vm is not the best place.
How does backup bitmap data get lost? If the image is only managed by libvirt, then libvirt shouldn't be losing arbitrary bitmaps. If a third-part entity is modifying the qcow2 images (presumably while the guest is offline, as editing a file that is simultaneously in use by qemu is a no-no), then all bets are off anyways, as you really shouldn't be trying to independently manage qcow2 files that are already being managed by libvirt.
Note: Hmm, does someone have real examples of such use cases? Why are backups lost, and is that a frequent occurrence? (I heard a suggestion that it might be a tool that checks backups in the background, for example by creating a VM over the backup and verifying that it at least starts. But I'm not sure we must drop a backup when such a check fails; maybe it's enough to merge it up.)
I'm more worried about the implementation of checkpoints and dirty bitmaps across domain snapshots (where we have to consider copying one or more bitmaps from the base image to the active image when creating a snapshot, and conversely about merging bitmaps from the active image into the base when doing a live commit).
Hm, I think we don't need a merge, but a copy, in both operations. Neither operation changes the data from the guest's point of view, so the bitmap should not change; it should just be copied to the topmost image. And on commit we should anyway drop all old bitmaps from the base image (or update them by dirtying the bits corresponding to the clusters allocated in the top image). One related note: we can also use the dirty-bitmaps migration capability to save bitmaps into the vmstate.
3.1 About external backup: we have already exported this metadata to the third-party backup tool, so that tool should store this information for future use instead of exporting it from qemu again.
To summarize:
1. I doubt that the discussed ability is really needed.
2. Even if it is needed, I doubt that storing the related disabled bitmaps (or checkpoints) in qemu is the right way to do it.
So, to make sure I understand, the only thing you are debating is whether we need the ability to grab a backup image from earlier than the most recent checkpoint? Remember, the proposal is about whether we have a sufficiently powerful libvirt API to cover multiple use cases, even if not all users need every permutation, while still being concise enough to document and implement on top of existing qemu semantics.
I agree that the checkpoint-based API is powerful. I just don't see the real benefits (use cases).

--
Best regards,
Vladimir

On 05/17/2018 05:43 PM, Eric Blake wrote:
Here's my updated counterproposal for a backup API.
// Many additional functions copying heavily from virDomainSnapshot*:
int virDomainCheckpointList(virDomainPtr domain, virDomainCheckpointPtr **checkpoints, unsigned int flags);
int virDomainCheckpointListChildren(virDomainCheckpointPtr checkpoint, virDomainCheckpointPtr **children, unsigned int flags);
Notably, none of the older racy list functions, like virDomainSnapshotNum, virDomainSnapshotNumChildren, or virDomainSnapshotListChildrenNames; also, for now, there is no revert support like virDomainSnapshotRevert.
I'm finding it easier to understand if I name these:

  virDomainListCheckpoints()         (find checkpoints relative to a domain)
  virDomainCheckpointListChildren()  (find children relative to a checkpoint)

The counterpart Snapshot API used virDomainListAllSnapshots(); the term 'All' was present because it was added after the initial racy virDomainSnapshotNum(), but as we are avoiding the racy API here we can skip it from the beginning.

--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org
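[For illustration, here is a minimal sketch of how a client might walk the checkpoint tree with the names proposed above. Everything in it is proposed rather than existing API: virDomainListCheckpoints(), virDomainCheckpointListChildren(), and the virDomainCheckpointGetName()/virDomainCheckpointFree() accessors are assumed to mirror the virDomainSnapshot* pattern.]

#include <stdio.h>
#include <stdlib.h>
#include <libvirt/libvirt.h>

/* Sketch only: the checkpoint functions below are the names proposed
 * in this thread, not committed libvirt API. */
static void
dump_checkpoints(virDomainPtr dom)
{
    virDomainCheckpointPtr *checkpoints = NULL;
    int n = virDomainListCheckpoints(dom, &checkpoints, 0);

    if (n < 0) {
        fprintf(stderr, "failed to list checkpoints\n");
        return;
    }

    for (int i = 0; i < n; i++) {
        /* Assumed accessor, mirroring virDomainSnapshotGetName(). */
        printf("checkpoint: %s\n", virDomainCheckpointGetName(checkpoints[i]));

        virDomainCheckpointPtr *children = NULL;
        int c = virDomainCheckpointListChildren(checkpoints[i], &children, 0);
        for (int j = 0; j < c; j++) {
            printf("  child: %s\n", virDomainCheckpointGetName(children[j]));
            virDomainCheckpointFree(children[j]);
        }
        free(children);
        virDomainCheckpointFree(checkpoints[i]);
    }
    free(checkpoints);
}

[As with the snapshot listing APIs, the caller is assumed to unref each returned object and then free() the array.]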

On Fri, May 25, 2018 at 10:26:12 -0500, Eric Blake wrote:
On 05/17/2018 05:43 PM, Eric Blake wrote:
Here's my updated counterproposal for a backup API.
I'm finding it easier to understand if I name these:
  virDomainListCheckpoints()         (find checkpoints relative to a domain)
  virDomainCheckpointListChildren()  (find children relative to a checkpoint)
If you are going to name them "checkpoints" here, we should first rename "snapshots with memory" in our docs, since we currently refer to those as checkpoints. We refer to disk-only snapshots as snapshots and wanted to emphasize the difference.

On 05/17/2018 05:43 PM, Eric Blake wrote:
Here's my updated counterproposal for a backup API.
/**
 * virDomainBackupEnd:
 * @domain: a domain object
 * @id: the id of an active backup job previously started with
 *      virDomainBackupBegin()
 * @flags: bitwise-OR of supported virDomainBackupEndFlags
 *
 * Conclude a point-in-time backup job @id on the given domain.
 *
 * If the backup job uses the push model, but the event marking that
 * all data has been copied has not yet been emitted, then the command
 * fails unless @flags includes VIR_DOMAIN_BACKUP_END_ABORT. If the
 * event has been issued, or if the backup uses the pull model, the
 * flag has no effect.
 *
 * Returns 0 on success and -1 on failure.
 */
int virDomainBackupEnd(virDomainPtr domain, int id, unsigned int flags);
For this API, I'm considering a tri-state return: 1 if the backup job completed successfully (in the push model, the backup destination file is usable); 0 if the backup job was aborted (only possible if VIR_DOMAIN_BACKUP_END_ABORT was passed; the backup destination file is untrustworthy); and -1 on failure.

--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org
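[To show how a client would consume that tri-state return, here is a minimal sketch; it assumes the revised semantics just described (1 = completed, 0 = aborted, -1 = failure) plus the proposed virDomainBackupEnd() and VIR_DOMAIN_BACKUP_END_ABORT, none of which is committed API. The finish_backup() helper and its abort_requested parameter are illustrative names.]

/* Sketch only: wraps the proposed virDomainBackupEnd() tri-state return. */
static int
finish_backup(virDomainPtr dom, int job_id, int abort_requested)
{
    unsigned int flags = abort_requested ? VIR_DOMAIN_BACKUP_END_ABORT : 0;
    int rc = virDomainBackupEnd(dom, job_id, flags);

    if (rc < 0)
        return -1;   /* API failure */
    if (rc == 0)
        return 0;    /* job aborted: a push-model destination is untrustworthy */
    return 1;        /* job completed: a push-model destination is usable */
}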

On 05/17/2018 05:43 PM, Eric Blake wrote:
Here's my updated counterproposal for a backup API.
/**
 * virDomainBackupBegin:
 *
 * There are two fundamental backup approaches. The first, called a
 * push model, instructs the hypervisor to copy the state of the guest
 * disk to the designated storage destination (which may be on the
 * local file system or a network device); in this mode, the
 * hypervisor writes the content of the guest disk to the destination,
 * then emits VIR_DOMAIN_EVENT_ID_BLOCK_JOB_2 when the backup is
 * either complete or failed (the backup image is invalid if the job
 * is ended prior to the event being emitted).
Better is VIR_DOMAIN_EVENT_ID_JOB_COMPLETED (BLOCK_JOB can only report status for one disk, while this event is intended to report on multiple disks handled in a single transaction). I'm a bit depressed at our technical debt in this area: virDomainGetJobStats() and virDomainAbortJob() don't take a job id, but only operate on the most recently started job. Still, I did mention elsewhere in my plans:
I think that it should be possible to run multiple backup operations in parallel in the long run. But in the interest of getting a proof of concept implementation out quickly, it's easier to state that for the initial implementation, libvirt supports at most one backup operation at a time (to do another backup, you have to wait for the current one to complete, or else abort and abandon the current one). As there is only one backup job running at a time, the existing virDomainGetJobInfo()/virDomainGetJobStats() will be able to report statistics about the job (insofar as such statistics are available). But in preparation for the future, when libvirt does add parallel job support, starting a backup job will return a job id; and presumably we'd add a new virDomainGetJobStatsByID() for grabbing statistics of an arbitrary (rather than the most-recently-started) job.
Since live migration also acts as a job visible through virDomainGetJobStats(), I'm going to treat an active backup job and live migration as mutually exclusive. This is particularly true when we have a pull model backup ongoing: if qemu on the source is acting as an NBD server, you can't migrate away from that qemu and tell the NBD client to reconnect to the NBD server on the migration destination. So, to perform a migration, you have to cancel any pending backup operations. Conversely, if a migration job is underway, it will not be possible to start a new backup job until migration completes. However, we DO need to modify migration to ensure that any persistent bitmaps are migrated.
Yes, this means that virDomainBackupEnd() (which takes a job id) and virDomainAbortJob() (which does not, but until we support parallel backup jobs or a mix of backup and migration at once, that does not matter) can initially both do the work of aborting a backup job.

--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org
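[For the initial single-job implementation, the two abort paths are interchangeable; the sketch below shows both. virDomainGetJobInfo(), virDomainAbortJob(), and VIR_DOMAIN_JOB_NONE are existing libvirt API; virDomainBackupEnd() and VIR_DOMAIN_BACKUP_END_ABORT are the proposed API from this RFC, and abort_current_backup() is an illustrative name.]

/* Sketch: with at most one backup job at a time, either abort path
 * below has the same effect. */
static void
abort_current_backup(virDomainPtr dom, int job_id)
{
    virDomainJobInfo info;

    /* Existing API: reports on the most recently started job, which is
     * necessarily the backup job while only one job can run at a time. */
    if (virDomainGetJobInfo(dom, &info) == 0 &&
        info.type != VIR_DOMAIN_JOB_NONE) {
        /* Path 1: the job-id aware call proposed in this series. */
        virDomainBackupEnd(dom, job_id, VIR_DOMAIN_BACKUP_END_ABORT);

        /* Path 2: equivalent for now, since there is only one job. */
        /* virDomainAbortJob(dom); */
    }
}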

On 05/17/2018 05:43 PM, Eric Blake wrote:
Here's my updated counterproposal for a backup API.
/**
 * virDomainBackupBegin:
 * @domain: a domain object
 * @diskXml: description of storage to utilize and expose during
 *           the backup, or NULL
 * @checkpointXml: description of a checkpoint to create, or NULL
 * @flags: not used yet, pass 0
 *
Actually, since I'm taking two XML documents, this should really have a VIR_DOMAIN_BACKUP_VALIDATE flag for comparison of the XML against the schema.
/**
 * virDomainCheckpointCreateXML:
 * @domain: a domain object
 * @xmlDesc: description of the checkpoint to create
 * @flags: bitwise-OR of supported virDomainCheckpointCreateFlags
 *
 */
virDomainCheckpointPtr
virDomainCheckpointCreateXML(virDomainPtr domain,
                             const char *xmlDesc,
                             unsigned int flags);
Ditto. And this was copied from virDomainSnapshotCreateXML, which should also gain a VALIDATE flag.

--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org
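[A minimal sketch of what the validation flags would look like from the caller's side. virDomainBackupBegin(), virDomainCheckpointCreateXML(), and VIR_DOMAIN_BACKUP_VALIDATE are the proposed API and flag name from this RFC; VIR_DOMAIN_CHECKPOINT_CREATE_VALIDATE is a hypothetical name for the checkpoint-side flag implied by "ditto"; the job-id return of virDomainBackupBegin() follows the earlier description in this thread. None of this is committed API.]

#include <libvirt/libvirt.h>

/* Sketch only: proposed API, validating both XML documents against
 * their schemas.  Returns the backup job id on success, -1 on error. */
static int
begin_validated_backup(virDomainPtr dom,
                       const char *disk_xml,
                       const char *checkpoint_xml)
{
    /* Begin a backup and atomically create a checkpoint. */
    return virDomainBackupBegin(dom, disk_xml, checkpoint_xml,
                                VIR_DOMAIN_BACKUP_VALIDATE);
}

/* Standalone checkpoint creation (no backup), again with validation;
 * the flag name here is hypothetical. */
static virDomainCheckpointPtr
create_validated_checkpoint(virDomainPtr dom, const char *checkpoint_xml)
{
    return virDomainCheckpointCreateXML(dom, checkpoint_xml,
                                        VIR_DOMAIN_CHECKPOINT_CREATE_VALIDATE);
}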