
On 05/18/2018 02:56 AM, Daniel P. Berrangé wrote:
> On Thu, May 17, 2018 at 05:43:37PM -0500, Eric Blake wrote:
>> Here's my updated counterproposal for a backup API.
>>
>> In comparison to v2 posted by Nikolay:
>> https://www.redhat.com/archives/libvir-list/2018-April/msg00115.html
>> - changed terminology a bit: Nikolay's "BlockSnapshot" is now called a "Checkpoint", and "BlockExportStart/Stop" is now "BackupBegin/End"
>> - flesh out more API descriptions
>> - better documentation of proposed XML, for both checkpoints and backup
>>
>> Barring any major issues turned up during review, I've already started to code this into libvirt with a goal of getting an implementation ready for review this month.
>
> I think the key thing missing from the docs is some kind of explanation about the difference between a backup, a checkpoint and a snapshot. I'll admit I've not read the mail in detail, but at a high level it is not immediately obvious what the difference is & thus which APIs I would want to be using for a given scenario.

Indeed, and that's a fair complaint. Here's a first draft, which I'll have to polish into a formal html document that both the snapshot and checkpoint/backup pages refer to (or maybe I merge snapshots and checkpoint descriptions into a single html page, although I'm not quite sure what to name the page then).

One of the features made possible with virtual machines is live migration, or transferring all state related to the guest from one host to another, with minimal interruption to the guest's activity. A clever observer will then note that if all state is available for live migration, there is nothing stopping a user from saving that state at a given point of time, to be able to later rewind guest execution back to the state it previously had.

There are several different libvirt APIs associated with capturing the state of a guest, such that the captured state can later be used to rewind that guest to the conditions it was in earlier. But since there are multiple APIs, it is best to understand the tradeoffs and differences between them, in order to choose the best API for a given task.

Timing: Capturing state can be a lengthy process, so while the captured state ideally represents an atomic point in time corresponding to something the guest was actually executing, some interfaces require up-front preparation (the state captured is not complete until the API ends, which may be some time after the command was first started), while other interfaces track the state when the command was first issued even if it takes some time to finish capturing the state. While it is possible to freeze guest I/O around either point in time (so that the captured state is fully consistent, rather than just crash-consistent), knowing whether the state is captured at the start or end of the command may determine which approach to use. A related concept is the amount of downtime the guest will experience during the capture, particularly since freezing guest I/O has time constraints.
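
To make the start-versus-end distinction concrete, here is a toy model in Python. It is only a sketch and not libvirt or qemu code: the function names, the list-of-blocks "disk", and the one-block-per-step copy loop are all assumptions made up for illustration. One function preserves old contents before a guest write lands (so the capture reflects the moment the command was issued), while the other mirrors guest writes into the copy (so the capture is only complete, and only matches the guest, when the command ends).

```python
"""Toy model of capture timing; hypothetical names, not libvirt or qemu code."""


def capture_at_start(disk, guest_writes):
    """Capture the disk as it was when the operation began.

    disk: list of block contents.
    guest_writes: dict mapping a copy step to (block_index, new_data),
    modelling a guest write that arrives during that step.
    Returns (captured_blocks, live_disk_after_capture).
    """
    live = list(disk)  # the image the guest keeps writing to
    saved = {}         # block index -> contents as of the start
    for step in range(len(live)):
        if step in guest_writes:
            target, data = guest_writes[step]
            # Copy-before-write: keep the old contents if not yet saved.
            saved.setdefault(target, live[target])
            live[target] = data
        # Background copy of one block per step.
        saved.setdefault(step, live[step])
    return [saved[i] for i in range(len(live))], live


def capture_at_end(disk, guest_writes):
    """Capture the disk as it is when the operation finishes.

    Same arguments and return shape as capture_at_start().
    """
    live = list(disk)
    copy = [None] * len(live)
    copied = set()  # blocks already transferred to the copy
    for step in range(len(live)):
        if step in guest_writes:
            target, data = guest_writes[step]
            live[target] = data
            # Mirror the write if that block was already transferred.
            if target in copied:
                copy[target] = data
        copy[step] = live[step]
        copied.add(step)
    return copy, live


if __name__ == "__main__":
    disk = ["a", "b", "c", "d"]
    # The guest rewrites block 0 during step 1 and block 3 during step 2.
    writes = {1: (0, "A"), 2: (3, "D")}
    print(capture_at_start(disk, writes))
    print(capture_at_end(disk, writes))
```

With the sample writes above, the first function returns the original contents even though the guest has moved on, while the second returns a copy identical to the live disk at completion. Either point in time can be bracketed with a guest I/O freeze; the model only shows which moment the freeze would need to surround.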

Amount of state: For an offline guest, only the contents of the guest disks need to be captured; restoring that state is merely a fresh boot with the disks restored to that state. But for an online guest, there is a choice between storing the guest's memory (all that is needed during live migration where the storage is shared between source and destination), the guest's disk state (all that is needed if there are no pending guest I/O transactions that would be lost without the corresponding memory state), or both together. Unless guest I/O is quiesced prior to capturing state, reverting to captured disk state of a live guest without the corresponding memory state is comparable to booting a machine that previously lost power without a clean shutdown; but for a guest that uses appropriate journaling methods, this crash-consistent state may be sufficient to avoid the additional storage and time needed to capture memory state.

Quantity of files: When capturing state, some approaches store all state within the same file (internal), while others expand a chain of related files that must be used together (external), which means more files that a management application must track. There are also differences depending on whether the state is captured in the same file in use by a running guest, or whether the state is captured to a distinct file without impacting the files used to run the guest.

Third-party integration: When capturing state, particularly for a running guest, there are tradeoffs to how much of the process must be done directly by the hypervisor, and how much can be off-loaded to third-party software. Since capturing state is not instantaneous, it is essential that any third-party integration see consistent data even if the running guest continues to modify that data after the point in time of the capture.

Full vs. partial: When capturing state, it is useful to minimize the amount of state that must be captured in relation to a previous capture, by focusing only on the portions of the disk that the guest has modified since the previous capture. Some approaches are able to take advantage of checkpoints to provide an incremental backup, while others are only capable of a full backup including portions of the disk that have not changed since the previous state capture.

With those definitions, the following libvirt APIs have these properties:

virDomainSnapshotCreateXML: This API wraps several approaches for capturing guest state, with a general premise of creating a snapshot (where the current guest resources are frozen in time and a new wrapper layer is opened for tracking subsequent guest changes). It can operate on both offline and running guests, can choose whether to capture the state of memory, disk, or both when used on a running guest, and can choose between internal and external storage for captured state. However, it is geared towards post-event captures (when capturing both memory and disk state, the disk state is not captured until all memory state has been collected first). For qemu as the hypervisor, internal snapshots currently have lengthy downtime that is incompatible with freezing guest I/O, but external snapshots are quick. Since creating an external snapshot changes which disk image resource is in use by the guest, this API can be coupled with virDomainBlockCommit to restore things back to the guest using its original disk image, where a third-party tool can read the backing file prior to the live commit.

virDomainBlockCopy: This API wraps approaches for capturing the state of disks of a running guest, but does not track accompanying guest memory state.
The capture is consistent only at the end of the operation, with a choice to either pivot to the new file that contains the copy (leaving the old file as the backup), or to return to the original file (leaving the new file as the backup).

virDomainBackupBegin: This API wraps approaches for capturing the state of disks of a running guest, but does not track accompanying guest memory state. The capture is consistent as of the start of the operation, where the captured state is stored independently from the disk image in use with the guest, and where it can be easily integrated with a third-party tool for capturing the disk state. Since the backup operation is stored externally from the guest resources, there is no need to commit data back in at the completion of the operation. When coupled with checkpoints, this can be used to capture incremental backups instead of full.

virDomainCheckpointCreateXML: This API does not actually capture guest state, so much as make it possible to track which portions of guest disks have changed between checkpoints or between a current checkpoint and the live execution of the guest. When performing incremental backups, it is easier to create a new checkpoint at the same time as a new backup, so that the next incremental backup can refer to the incremental state since the checkpoint created during the current backup.

Putting it together: the following two sequences both capture the disk state of a running guest, then complete with the guest running on its original disk image; but with a difference that an unexpected interruption during the first mode leaves a temporary wrapper file that must be accounted for, while interruption of the second mode has no impact on the guest.

1. Backup via temporary snapshot

virDomainFSFreeze()
virDomainSnapshotCreateXML(VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY)
virDomainFSThaw()
third-party copy the backing file to backup storage # most time spent here
virDomainBlockCommit(VIR_DOMAIN_BLOCK_COMMIT_ACTIVE)
wait for commit ready event
virDomainBlockJobAbort()

2. Direct backup

virDomainFSFreeze()
virDomainBackupBegin()
virDomainFSThaw()
wait for push mode event, or pull data over NBD # most time spent here
virDomainBackupEnd()

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org