On 10/9/18 8:29 AM, Nir Soffer wrote:
On Fri, Oct 5, 2018 at 7:58 AM Eric Blake <eblake(a)redhat.com> wrote:
> On 10/4/18 12:05 AM, Eric Blake wrote:
>> The following (long) email describes a portion of the work-flow of how
>> my proposed incremental backup APIs will work, along with the backend
>> QMP commands that each one executes. I will reply to this thread with
>> further examples (the first example is long enough to be its own email).
>> This is an update to a thread last posted here:
>>
>> https://www.redhat.com/archives/libvir-list/2018-June/msg01066.html
>>
>
>> More to come in part 2.
>>
>
> - Second example: a sequence of incremental backups via pull model
>
> In the first example, we did not create a checkpoint at the time of the
> full pull. That means we have no way to track a delta of changes since
> that point in time.
Why do we want to support backup without creating a checkpoint?
Fleecing. If you want to examine a portion of the disk at a given point
in time, then kicking off a pull model backup gives you access to the
state of the disk at that time, and your actions are transient. Ending
the job when you are done with the fleece cleans up everything needed to
perform the fleece operation, and since you did not intend to capture a
full (well, a complete) incremental backup, but were rather grabbing
just a subset of the disk, you really don't want that point in time to
be recorded as a new checkpoint.
Also, incremental backups (which are what require checkpoints) are
limited to qcow2 disks, but full backups can be performed on any format
(including raw disks). If you have a guest that does not use qcow2
disks, you can perform a full backup, but cannot create a checkpoint.
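Something along these lines (a sketch only, reusing backup.xml from example
1; the exact spelling of the command that ends the job may still change):

$ $virsh backup-begin $dom backup.xml
Backup id 1 started
 ...read whatever clusters you care about over the NBD export...
$ $virsh backup-end $dom 1

and once the job is ended, nothing persists other than the scratch files you
created yourself - no checkpoint, no bitmap.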
If we don't have any real use case, I suggest always requiring a
checkpoint.
But we do have real use cases for backup without a checkpoint.
> Let's repeat the full backup (reusing the same
> backup.xml from before), but this time, we'll add a new parameter, a
> second XML file for describing the checkpoint we want to create.
>
> Actually, it was easy enough to get virsh to write the XML for me
> (because it was very similar to existing code in virsh that creates XML
> for snapshot creation):
>
> $ $virsh checkpoint-create-as --print-xml $dom check1 testing \
> --diskspec sdc --diskspec sdd | tee check1.xml
> <domaincheckpoint>
> <name>check1</name>
>
We should use an id, not a name, even if the name is also unique, as in
most libvirt APIs.
In RHV we will always use a UUID for this.
Nothing prevents you from using a UUID as your name. But this particular
choice of XML (<name>) matches what already exists in the snapshot XML.
> <description>testing</description>
> <disks>
> <disk name='sdc'/>
> <disk name='sdd'/>
> </disks>
> </domaincheckpoint>
>
> I had to supply two --diskspec arguments to virsh to select just the two
> qcow2 disks that I am using in my example (rather than every disk in the
> domain, which is the default when <disks> is not present).
So is <disks/> a valid configuration selecting all disks, or does omitting
the <disks> element select all disks?
It's about a one-line change to get whichever behavior you find more
useful. Right now, I'm leaning towards: <disks> omitted == backup all
disks, <disks> present: you MUST have at least one <disk> subelement
that explicitly requests a checkpoint (because any omitted <disk> when
<disks> is present is skipped). A checkpoint only makes sense as long as
there is at least one disk to create a checkpoint with.
But I could also go with: <disks> omitted == backup all disks, <disks>
present but <disk> subelements missing: the missing elements default to
being backed up, and you have to explicitly provide <disk name='foo'
checkpoint='no'> to skip a particular disk (a concrete sketch of that form
appears below).
Or even: <disks> omitted, or <disks> present but <disk> subelements
missing: the missing elements defer to the hypervisor for their default
state, and the qemu hypervisor defaults to qcow2 disks being backed
up/checkpointed and to non-qcow2 disks being omitted. But this latter
one feels like more magic, which is harder to document and liable to go
wrong.
A stricter version would be <disks> is mandatory, and no <disk>
subelement can be missing (or else the API fails because you weren't
explicit in your choice). But that's rather strict, especially since
existing snapshots XML handling is not that strict.
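To make that checkpoint='no' flavor concrete: assuming the domain also had a
third disk 'sde' (hypothetical here) that we wanted to skip, the checkpoint
XML would look something like:

<domaincheckpoint>
  <name>check1</name>
  <disks>
    <disk name='sde' checkpoint='no'/>
  </disks>
</domaincheckpoint>

where sdc and sdd are picked up by default and only sde is excluded.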
> I also picked
> a name (mandatory) and description (optional) to be associated with the
> checkpoint.
>
> The backup.xml file that we plan to reuse still mentions scratch1.img
> and scratch2.img as files needed for staging the pull request. However,
> any contents in those files could interfere with our second backup
> (after all, every cluster written into that file from the first backup
> represents a point in time that was frozen at the first backup; but our
> second backup will want to read the data as the guest sees it now rather
> than what it was at the first backup), so we MUST regenerate the scratch
> files. (Perhaps I should have just deleted them at the end of example 1
> in my previous email, had I remembered when typing that mail).
>
> $ $qemu_img create -f qcow2 -b $orig1 -F qcow2 scratch1.img
> $ $qemu_img create -f qcow2 -b $orig2 -F qcow2 scratch2.img
>
> Now, to begin the full backup and create a checkpoint at the same time.
> Also, this time around, it would be nice if the guest had a chance to
> freeze I/O to the disks prior to the point chosen as the checkpoint.
> Assuming the guest is trusted, and running the qemu guest agent (qga),
> we can do that with:
>
> $ $virsh fsfreeze $dom
> $ $virsh backup-begin $dom backup.xml check1.xml
> Backup id 1 started
> backup used description from 'backup.xml'
> checkpoint used description from 'check1.xml'
> $ $virsh fsthaw $dom
>
Great, this answers my (unsent) question about freeze/thaw from part 1 :-)
>
> and eventually, we may decide to add a VIR_DOMAIN_BACKUP_BEGIN_QUIESCE
> flag to combine those three steps into a single API (matching what we've
> done on some other existing API). In other words, the sequence of QMP
> operations performed during virDomainBackupBegin is quick enough that
> they won't stall a freeze operation (at least Windows is picky if you
> stall a freeze operation longer than 10 seconds).
>
We use fsFreeze/fsThaw directly in RHV since we need to support external
snapshots (e.g. ceph), so we don't need this functionality, but it sounds
like a good idea to make it work like snapshots.
And indeed, a future enhancement will be figuring out how we can
create a checkpoint at the same time as a snapshot (as mentioned
elsewhere in the email). A snapshot and checkpoint created at the same
atomic point should obviously both be able to happen at a quiescent
point in guest I/O.
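If we do add such a flag, the freeze/begin/thaw triple above would collapse
into a single call along these lines (a sketch only; the flag is hypothetical
at this point):

 virDomainBackupBegin(dom, "<domainbackup ...>",
                      "<domaincheckpoint ...>",
                      VIR_DOMAIN_BACKUP_BEGIN_QUIESCE)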
>
> The tweaked $virsh backup-begin now results in a call to:
> virDomainBackupBegin(dom, "<domainbackup ...>",
> "<domaincheckpoint ...", 0)
> and in turn libvirt makes a similar sequence of QMP calls as before,
> with a slight modification in the middle:
> {"execute":"nbd-server-start",...
> {"execute":"blockdev-add",...
>
This does not work yet for network disks like "rbd" and "glusterfs".
Does that mean they will not be supported for backup?
Full backups can happen regardless of underlying format. But incremental
backups require checkpoints, and checkpoints require qcow2 persistent
bitmaps. As long as you have a qcow2 format on rbd or glusterfs, you
should be able to create checkpoints on that image, and therefore
perform incremental backups. Storage-wise, during a pull model backup,
you would have your qcow2 format on remote glusterfs storage which is
where the persistent bitmap is written, and temporarily also have a
scratch qcow2 file on the local machine for performing copy-on-write
needed to preserve the point in time semantics for as long as the backup
operation is running.
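For reference, the two commands elided above would take roughly this shape
(a sketch only; the listening address, port, and node names here are
arbitrary, with "backup-sdc" matching the transaction below):

{"execute":"nbd-server-start",
 "arguments":{"addr":{"type":"inet",
   "data":{"host":"localhost", "port":"10809"}}}}
{"execute":"blockdev-add",
 "arguments":{"driver":"qcow2", "node-name":"backup-sdc",
   "file":{"driver":"file", "filename":"scratch1.img"},
   "backing":"$node1"}}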
> {"execute":"transaction",
> "arguments":{"actions":[
> {"type":"blockdev-backup", "data":{
> "device":"$node1",
"target":"backup-sdc", "sync":"none",
> "job-id":"backup-sdc" }},
> {"type":"blockdev-backup", "data":{
> "device":"$node2",
"target":"backup-sdd", "sync":"none",
> "job-id":"backup-sdd" }}
> {"type":"block-dirty-bitmap-add", "data":{
> "node":"$node1", "name":"check1",
"persistent":true}},
> {"type":"block-dirty-bitmap-add", "data":{
> "node":"$node2", "name":"check1",
"persistent":true}}
> ]}}
> {"execute":"nbd-server-add",...
>
What if this sequence fails in the middle? Will libvirt handle all failures
and roll back to the previous state?
What are the semantics of "execute": "transaction"? Does it mean that qemu
will handle all possible failures in any one of the actions?
qemu already promises that a "transaction" succeeds or fails as a group.
As to other failures, the full recovery sequence is handled by libvirt,
and looks like:
Fail on "nbd-server-start":
- nothing to roll back
Fail on first "blockdev-add":
- nbd-server-stop
Fail on subsequent "blockdev-add":
- blockdev-remove on earlier scratch file additions
- nbd-server-stop
Fail on any "block-dirty-bitmap-add" or "x-block-dirty-bitmap-merge":
- block-dirty-bitmap-remove on any temporary bitmaps that were created
- blockdev-remove on all scratch file additions
- nbd-server-stop
Fail on "transaction":
- block-dirty-bitmap-remove on all temporary bitmaps
- blockdev-remove on all additions
- nbd-server-stop
Fail on "nbd-server-add" or "x-nbd-server-add-bitmap":
- if a checkpoint was attempted during "transaction":
-- perform x-block-dirty-bitmap-enable to re-enable bitmap that was
in use prior to transaction
-- perform x-block-dirty-bitmap-merge to merge new bitmap into
re-enabled bitmap
-- perform block-dirty-bitmap-remove on the new bitmap
- block-job-cancel
- block-dirty-bitmap-remove on all temporary bitmaps
- blockdev-remove on all scratch file additions
- nbd-server-stop
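In QMP terms, the simplest of these recovery paths (a failure on the second
"blockdev-add") is just a pair of commands, sketched here using the spelling
"blockdev-del" for the node removal:

{"execute":"blockdev-del", "arguments":{"node-name":"backup-sdc"}}
{"execute":"nbd-server-stop"}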
>
> More to come in part 3.
I still need to finish writing that, but part 3 will be a demonstration
of the push model (where qemu writes the backup to a given destination,
without a scratch file, and without an NBD server, but where you are
limited to what qemu knows how to write).
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization:  qemu.org | libvirt.org