The following (long) email describes a portion of the workflow of how
my proposed incremental backup APIs will work, along with the backend
QMP commands that each one executes. I will reply to this thread with
further examples (the first example is long enough to be its own email).
This is an update to a thread last posted here:
https://www.redhat.com/archives/libvir-list/2018-June/msg01066.html
I'm still pulling together pieces in the src/qemu directory of libvirt
while implementing my proposed API, but you can track the status of my
code (currently around 6000 lines of code and 1500 lines of
documentation added) at:
https://repo.or.cz/libvirt/ericb.git
The documentation below describes the end goal of my demo (which I will
be presenting at KVM Forum), even if the current git checkout of my work
in progress doesn't quite behave that way.
My hope is that the API itself is in a stable enough state to include in
the libvirt 4.9 release (end of this month - which really means upstream
commit prior to KVM Forum!) by demo-ing how it is used with qemu
experimental commands, even if the qemu driver portions of my series are
not yet ready to be committed because they are waiting for the qemu side
of incremental backups to stabilize. If we like the API and are willing
to commit to it, then downstream vendors can backport whatever fixes
land in the qemu driver on top of the existing API, without having to
suffer from rebase barriers that would prevent the addition of new API.
Performing a full backup can work on any disk format, but incremental
(all changes since the most recent checkpoint) and differential (all
changes since an arbitrary earlier checkpoint) backups require the use
of a persistent bitmap for tracking the changes between checkpoints, and
that in turn requires a disk with qcow2 format. The API can handle
multiple disks at the same point in time (so I'll demonstrate two at
once), and is designed to handle both push model (qemu writes to a
specific destination, and the format has to be one that qemu knows) and
pull model (qemu opens up an NBD server for all disks, then you connect
one or more read-only clients per export on that server to read the
information of interest into a destination of your choosing).
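As a quick sanity check before going further, you can confirm that a
given disk image is qcow2 (and therefore able to hold persistent
bitmaps) with something like:
$ qemu-img info /path/to/disk1.img | grep 'file format'
file format: qcow2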
This demo also shows how I consume the data over a pull model backup.
Remember, in the pull model, you don't have to use a qemu binary as the
NBD client (you merely need a client that can request base:allocation
and qemu:dirty-bitmap:name contexts) - it's just that it is easier to
demonstrate everything with the tools already at hand. Thus, I use
existing qemu-img 3.0 functionality to extract the dirty bitmap (the
qemu:dirty-bitmap:name context) in one process, and a second qemu-io
process (using base:allocation to optimize reads of holes) for
extracting the actual data; the demo shows both processes accessing the
read-only NBD server in parallel. While I use two processes, it is also
feasible to write a single client that can get at both contexts through
a single NBD connection (the qemu 3.0 server supports that, even if none
of the qemu 3.0 clients can request multiple contexts). Down the road,
we may further enhance tools shipped with qemu to be easier to use as
such a client, but that does not affect the actual backup API (which is
merely what it takes to get the NBD server up and running).
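For reference, the qemu-img 3.0 functionality mentioned above is the
experimental x-dirty-bitmap option of the NBD client driver. A rough
sketch of how it is pointed at an export (the bitmap name 'b1' is a
made-up placeholder, and a dirty-bitmap context only exists once a
checkpoint has been created, so this does not apply to the first
full-backup example below):
$ qemu-img map --output=json --image-opts \
driver=nbd,export=sdc,server.type=inet,server.host=localhost,server.port=10809,x-dirty-bitmap=qemu:dirty-bitmap:b1
With that option, the map output reflects what the named bitmap tracks
rather than base:allocation.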
- Preliminary setup:
I'm using bash as my shell, and set
$ orig1=/path/to/disk1.img orig2=/path/to/disk2.img
$ dom=my_domain qemu_img=/path/to/qemu-img
$ virsh="/path/to/virsh -k 0"
to make later steps easier to type. While the steps below should work
with qemu 3.0, I found it easier to test with both a self-built qemu
(modify the <emulator> line in my domain) and a self-built libvirtd
(systemctl stop libvirtd, then run src/libvirtd directly; this is also
why I run $virsh with keepalive disabled via -k 0, so that I could
attach gdb during development without worrying about the connection
dying). Also,
you may need 'setenforce 0' when using self-built binaries, since
otherwise SELinux labeling gets weird (obviously, when the actual code
is ready to check into libvirt, it will work with SELinux enforcing and
with system-installed rather than self-installed binaries). I also used:
$ $virsh domblklist $dom
to verify that I have attached $orig1 and $orig2 as two of the disks
in $dom (I used:
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='none'
              error_policy='stop' io='native'/>
      <source file='/path/to/disk1.img'/>
      <backingStore/>
      <target dev='sdc' bus='scsi'/>
    </disk>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='none'
              error_policy='stop' io='native'/>
      <source file='/path/to/disk2.img'/>
      <backingStore/>
      <target dev='sdd' bus='scsi'/>
    </disk>
in my domain XML)
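For reference, the domblklist output for a setup like mine looks
roughly like this (the first row and the exact paths here are just
illustrative):
 Target   Source
 ------------------------------------------------
 sda      /var/lib/libvirt/images/guest.img
 sdc      /path/to/disk1.img
 sdd      /path/to/disk2.img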
- First example: creating a full backup via pull model, initially with
no checkpoint created
$ cat > backup.xml <<EOF
<domainbackup mode='pull'>
  <server transport='tcp' name='localhost' port='10809'/>
  <disks>
    <disk name='$orig1' type='file'>
      <scratch file='$PWD/scratch1.img'/>
    </disk>
    <disk name='sdd' type='file'>
      <scratch file='$PWD/scratch2.img'/>
    </disk>
  </disks>
</domainbackup>
EOF
$ $qemu_img create -f qcow2 -b $orig1 -F qcow2 scratch1.img
$ $qemu_img create -f qcow2 -b $orig2 -F qcow2 scratch2.img
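If you want to double-check that a scratch overlay was created with the
right backing chain before starting the job, something like this works
(other qemu-img info output elided):
$ $qemu_img info scratch1.img
...
backing file: /path/to/disk1.img
backing file format: qcow2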
Here, I'm explicitly requesting a pull backup (the API defaults to push
otherwise), as well as explicitly requesting the NBD server to be set up
(the XML should support both transport='tcp' and transport='unix'). Note
that the <server> is global, but the server supports multiple export
names at once, so that you can connect multiple clients to process those
exports in parallel. Ideally, if <server> is omitted, libvirt should
auto-generate an appropriate server name and provide a way for you to
query what it generated (right now, I don't have that working in
libvirt, so being explicit is necessary - but again, the goal for now is
to prove that the API is reasonable enough to include in libvirt 4.9;
enhancements like making <server> optional can come later even if they
miss libvirt 4.9).
I'm also requesting that the backup operate on only two disks of the
domain, and pointing libvirt to the scratch storage it needs to use for
the duration of the backup (ideally, libvirt will generate an
appropriate scratch file name itself if omitted from the XML, and create
scratch files itself instead of me having to pre-create them). Note that
I can give either the path to my original disk ($orig1, $orig2) or the
target name from the domain XML (in my case sdc, sdd); libvirt
normalizes my input and always uses the target name when reposting the
XML in output.
$ $virsh backup-begin $dom backup.xml
Backup id 1 started
backup used description from 'backup.xml'
Kicks off the backup job. virsh called
virDomainBackupBegin(dom, "<domainbackup ...>", NULL, 0)
and in turn libvirt makes all of the following QMP calls (if any QMP
call fails, libvirt attempts to unroll things so that there is no
lasting change to the guest before actually reporting failure):
{"execute":"nbd-server-start",
"arguments":{"addr":{"type":"inet",
"data":{"host":"localhost",
"port":"10809"}}}}
{"execute":"blockdev-add",
"arguments":{"driver":"qcow2",
"node-name":"backup-sdc",
"file":{"driver":"file",
"filename":"$PWD/scratch1.img"},
"backing":"'$node1'"}}
{"execute":"blockdev-add",
"arguments":{"driver":"qcow2",
"node-name":"backup-sdd",
"file":{"driver":"file",
"filename":"$PWD/scratch2.img"},
"backing":"'$node2'"}}
{"execute":"transaction",
"arguments":{"actions":[
{"type":"blockdev-backup", "data":{
"device":"$node1", "target":"backup-sdc",
"sync":"none",
"job-id":"backup-sdc" }},
{"type":"blockdev-backup", "data":{
"device":"$node2", "target":"backup-sdd",
"sync":"none",
"job-id":"backup-sdd" }}
]}}
{"execute":"nbd-server-add",
"arguments":{"device":"backup-sdc",
"name":"sdc"}}
{"execute":"nbd-server-add",
"arguments":{"device":"backup-sdd",
"name":"sdd"}}
libvirt populated $node1 and $node2 to be the node names actually
assigned by qemu; until Peter's work on libvirt using node names
everywhere actually lands, libvirt is scraping the auto-generated
#blockNNN name from query-block and friends (the same as it already does
in other situations like write threshold).
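If you want to see which node names qemu assigned on your own setup,
you can peek at them with qemu-monitor-command (fine for debugging,
though note that libvirt marks the domain as tainted when you use it):
$ $virsh qemu-monitor-command $dom --pretty \
    '{"execute":"query-named-block-nodes"}'
and look for the "node-name" entries in the output.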
With this command complete, libvirt has now kicked off a pull backup
job, which includes a single qemu NBD server with two separate exports
named 'sdc' and 'sdd' that expose the state of the disks at the time of
the API call (any guest writes to $orig1 or $orig2 trigger copy-on-write
actions into scratch1.img and scratch2.img to preserve the fact that
reading from NBD sees unchanging contents).
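At this point any NBD client should be able to connect; for example, a
quick read-only sanity check of one export looks like:
$ qemu-io -r -f raw -c 'r 0 512' nbd://localhost:10809/sdc
which should report a successful 512-byte read without disturbing the
export.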
We can double-check what libvirt is tracking for the running backup
job, including the fact that libvirt normalized the <disk> names to
match the domain XML target listings, which in turn match the names of
the exports being served over the NBD server:
$ $virsh backup-dumpxml $dom 1
<domainbackup type='pull' id='1'>
  <server transport='tcp' name='localhost' port='10809'/>
  <disks>
    <disk name='sdc' type='file'>
      <scratch file='/home/eblake/scratch1.img'/>
    </disk>
    <disk name='sdd' type='file'>
      <scratch file='/home/eblake/scratch2.img'/>
    </disk>
  </disks>
</domainbackup>
where 1 on the command line would be replaced by whatever id was printed
by the earlier backup-begin command (yes, my demo can hard-code things
to 1, because the current qemu and initial libvirt implementations only
support one backup job at a time, although we have plans to allow
parallel jobs in the future).
This translated to the libvirt API call
virDomainBackupGetXMLDesc(dom, 1, 0)
and did not have to make any QMP calls into qemu.
Now that the backup job is running, we want to scrape the data off the
NBD server. The most naive way is:
$ $qemu_img convert -f raw -O $fmt nbd://localhost:10809/sdc full1.img
$ $qemu_img convert -f raw -O $fmt nbd://localhost:10809/sdd full2.img
where $fmt is whichever output format you prefer, and where we hope
that qemu-img convert is able to recognize the holes in the source and
only write into the backup copy where actual data lives.
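A crude way to check whether that worked is to compare the apparent
size of the result against its actual disk usage:
$ du -h --apparent-size full1.img
$ du -h full1.img
If convert preserved the holes, the second number will be noticeably
smaller for a sparsely-populated guest disk.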
You don't have to use qemu-img; it's possible to use any NBD client,
such as the kernel NBD module:
$ modprobe nbd
$ qemu-nbd -c /dev/nbd0 -f raw nbd://localhost:10809/sdc
$ cp /dev/nbd0 full1.img
$ qemu-nbd -d /dev/nbd0
The above demonstrates the flexibility of the pull model (your backup
file can be ANY format you choose; here I did 'cp' to copy it to a raw
destination), but it was also a less efficient NBD client, since the
kernel module doesn't yet know about NBD_CMD_BLOCK_STATUS for learning
where the holes are, nor about NBD_OPT_STRUCTURED_REPLY for faster reads
of those holes.
Of course, we don't have to blindly read the entire image; we can
instead use two clients in parallel (per exported disk): one that uses
'qemu-img map' to learn which parts of the export contain data, feeding
that through a bash 'while read' loop to parse out which offsets
contain interesting data, and a second client spawned per region to
copy just that subset of the file. Here, I'll use 'qemu-io -C' to
perform copy-on-read - that requires that my output file be qcow2
rather than any other particular format, but I'm guaranteed that my
output backup file is only populated in the same places that $orig1 was
populated at the time the backup started.
$ $qemu_img create -f qcow2 full1.img $size_of_orig1
$ $qemu_img rebase -u -f qcow2 -F raw -b nbd://localhost:10809/sdc \
full1.img
$ while read line; do
    [[ $line =~ .*start.:.([0-9]*).*length.:.([0-9]*).*data.:.true.* ]] ||
      continue
    start=${BASH_REMATCH[1]} len=${BASH_REMATCH[2]}
    qemu-io -C -c "r $start $len" -f qcow2 full1.img
  done < <($qemu_img map --output=json -f raw nbd://localhost:10809/sdc)
$ $qemu_img rebase -u -f qcow2 -b '' full1.img
and the nice thing about this loop is that once you've figured out how
to parse qemu-img map output as one client process, you can use any
other process (such as qemu-nbd -c, then dd if=/dev/nbd0 of=$dest bs=64k
skip=$((start/64/1024)) seek=$((start/64/1024)) count=$((len/64/1024))
conv=fdatasync) as the NBD client that reads the subset of data of
interest (and thus, while qemu-io had to write to full1.img as qcow2,
you can use an alternative client to write to raw or any other format of
your choosing).
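For completeness, here is a rough, untested sketch of that dd-based
variant; full1.raw is a new name for this example, it reuses
$size_of_orig1 from above, and the 64k divisions assume the extents
reported by qemu-img map are 64k-aligned (true for the default qcow2
cluster size used in this demo):
$ truncate -s $size_of_orig1 full1.raw
$ qemu-nbd -c /dev/nbd0 -r -f raw nbd://localhost:10809/sdc
$ while read line; do
    [[ $line =~ .*start.:.([0-9]*).*length.:.([0-9]*).*data.:.true.* ]] ||
      continue
    start=${BASH_REMATCH[1]} len=${BASH_REMATCH[2]}
    dd if=/dev/nbd0 of=full1.raw bs=64k skip=$((start/64/1024)) \
      seek=$((start/64/1024)) count=$((len/64/1024)) conv=notrunc,fdatasync
  done < <($qemu_img map --output=json -f raw nbd://localhost:10809/sdc)
$ qemu-nbd -d /dev/nbd0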
Now that we've copied off the full backup image (or just a portion of it
- after all, this is a pull model where we are in charge of how much
data we want to read), it's time to tell libvirt that it can conclude
the backup job:
$ $virsh backup-end $dom 1
Backup id 1 completed
again, where the '1' on the command line came from the output of
backup-begin and could be something other than the hard-coded value
used in this demo. This maps to the libvirt API call
virDomainBackupEnd(dom, 1, 0)
which in turn maps to the QMP commands:
{"execute":"nbd-server-remove",
"arguments":{"name":"sdc"}}
{"execute":"nbd-server-remove",
"arguments":{"name":"sdd"}}
{"execute":"nbd-server-stop"}
{"execute":"block-job-cancel",
"arguments":{"device":"sdc"}}
{"execute":"block-job-cancel",
"arguments":{"device":"sdd"}}
{"execute":"blockdev-del",
"arguments":{"node-name":"backup-sdc"}}
{"execute":"blockdev-del",
"arguments":{"node-name":"backup-sdd"}}
to clean up all the things added during backup-begin.
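Once backup-end returns, qemu no longer references the scratch files,
so the last bit of cleanup is simply:
$ rm scratch1.img scratch2.img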
More to come in part 2.
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization:  qemu.org | libvirt.org