The following (long) email describes a portion of the workflow of how
my proposed incremental backup APIs will work, along with the backend
QMP commands that each one executes. I will reply to this thread with
further examples (the first example is long enough to be its own email).
This is an update to a thread last posted here:
https://www.redhat.com/archives/libvir-list/2018-June/msg01066.html
I'm still pulling together pieces in the src/qemu directory of libvirt
while implementing my proposed API, but you can track the status of my
code (currently around 6000 lines of code and 1500 lines of
documentation added) at:
https://repo.or.cz/libvirt/ericb.git
The documentation below describes the end goal of my demo (which I will
be presenting at KVM Forum), even if the current git checkout of my work
in progress doesn't quite behave that way.
My hope is that the API itself is in a stable enough state to include in
the libvirt 4.9 release (end of this month - which really means upstream
commit prior to KVM Forum!) by demo-ing how it is used with qemu
experimental commands, even if the qemu driver portions of my series are
not yet ready to be committed because they are waiting for the qemu side
of incremental backups to stabilize. If we like the API and are willing
to commit to it, then downstream vendors can backport whatever fixes
land in the qemu driver on top of the existing API, without having to
suffer from rebase barriers that would prevent the addition of new API.
Performing a full backup can work on any disk format, but incremental
(all changes since the most recent checkpoint) and differential (all
changes since an arbitrary earlier checkpoint) backups require the use
of a persistent bitmap for tracking the changes between checkpoints, and
that in turn requires a disk with qcow2 format. The API can handle
multiple disks at the same point in time (so I'll demonstrate two at
once), and is designed to handle both push model (qemu writes to a
specific destination, and the format has to be one that qemu knows) and
pull model (qemu opens up an NBD server for all disks, then you connect
one or more read-only clients per export on that server to read the
information of interest into a destination of your choosing).
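As a quick sanity check before going further, you can confirm that a
given disk image is qcow2 (and therefore able to hold persistent
bitmaps) with something like:
$ qemu-img info /path/to/disk1.img | grep 'file format'
file format: qcow2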
This demo also shows how I consume the data over a pull model backup.
Remember, in the pull model, you don't have to use a qemu binary as the
NBD client (you merely need a client that can request base:allocation
and qemu:dirty-bitmap:name contexts) - it's just that it is easier to
demonstrate everything with the tools already at hand. Thus, I use
existing qemu-img 3.0 functionality to extract the dirty bitmap (the
qemu:dirty-bitmap:name context) in one process, and a second qemu-io
process (using base:allocation to optimize reads of holes) for
extracting the actual data; the demo shows both processes accessing the
read-only NBD server in parallel. While I use two processes, it is also
feasible to write a single client that can get at both contexts through
a single NBD connection (the qemu 3.0 server supports that, even if none
of the qemu 3.0 clients can request multiple contexts). Down the road,
we may further enhance tools shipped with qemu to be easier to use as
such a client, but that does not affect the actual backup API (which is
merely what it takes to get the NBD server up and running).
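For reference, the qemu-img 3.0 functionality mentioned above is the
experimental x-dirty-bitmap option of the NBD client driver. A rough
sketch of how it is pointed at an export (the bitmap name 'b1' is a
made-up placeholder, and a dirty-bitmap context only exists once a
checkpoint has been created, so this does not apply to the first
full-backup example below):
$ qemu-img map --output=json --image-opts \
driver=nbd,export=sdc,server.type=inet,server.host=localhost,server.port=10809,x-dirty-bitmap=qemu:dirty-bitmap:b1
With that option, the map output reflects what the named bitmap tracks
rather than base:allocation.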
- Preliminary setup:
I'm using bash as my shell, and set
$ orig1=/path/to/disk1.img orig2=/path/to/disk2.img
$ dom=my_domain qemu_img=/path/to/qemu-img
$ virsh="/path/to/virsh -k 0"
to make later steps easier to type. While the steps below should work
with qemu 3.0, I found it easier to test with both a self-built qemu
(modify the <emulator> line in my domain) and a self-built libvirtd
(systemctl stop libvirtd, then run src/libvirtd directly; this is also
why I run $virsh with keepalive disabled via -k 0, so that I could
attach gdb during development without worrying about the connection
dying). Also,
you may need 'setenforce 0' when using self-built binaries, since
otherwise SELinux labeling gets weird (obviously, when the actual code
is ready to check into libvirt, it will work with SELinux enforcing and
with system-installed rather than self-installed binaries). I also used:
$ $virsh domblklist $dom
to verify that I have attached $orig1 and $orig2 as two of the disks
in $dom (I used:
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='none'
              error_policy='stop' io='native'/>
      <source file='/path/to/disk1.img'/>
      <backingStore/>
      <target dev='sdc' bus='scsi'/>
    </disk>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='none'
              error_policy='stop' io='native'/>
      <source file='/path/to/disk2.img'/>
      <backingStore/>
      <target dev='sdd' bus='scsi'/>
    </disk>
in my domain XML)
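For reference, the domblklist output for a setup like mine looks
roughly like this (the first row and the exact paths here are just
illustrative):
 Target   Source
 ------------------------------------------------
 sda      /var/lib/libvirt/images/guest.img
 sdc      /path/to/disk1.img
 sdd      /path/to/disk2.img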
- First example: creating a full backup via pull model, initially with
no checkpoint created
$ cat > backup.xml <<EOF
<domainbackup mode='pull'>
  <server transport='tcp' name='localhost' port='10809'/>
  <disks>
    <disk name='$orig1' type='file'>
      <scratch file='$PWD/scratch1.img'/>
    </disk>
    <disk name='sdd' type='file'>
      <scratch file='$PWD/scratch2.img'/>
    </disk>
  </disks>
</domainbackup>
EOF
$ $qemu_img create -f qcow2 -b $orig1 -F qcow2 scratch1.img
$ $qemu_img create -f qcow2 -b $orig2 -F qcow2 scratch2.img
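If you want to double-check that a scratch overlay was created with the
right backing chain before starting the job, something like this works
(other qemu-img info output elided):
$ $qemu_img info scratch1.img
...
backing file: /path/to/disk1.img
backing file format: qcow2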
Here, I'm explicitly requesting a pull backup (the API defaults to push
otherwise), as well as explicitly requesting the NBD server to be set up
(the XML should support both transport='tcp' and transport='unix'). Note
that the <server> is global, but the server supports multiple export
names at once, so that you can connect multiple clients to process those
exports in parallel. Ideally, if <server> is omitted, libvirt should
auto-generate an appropriate server name and provide a way for you to
query what it generated (right now, I don't have that working in
libvirt, so being explicit is necessary - but again, the goal for now is
to prove that the API is reasonable enough to include in libvirt 4.9;
enhancements like making <server> optional can come later even if they
miss libvirt 4.9).
I'm also requesting that the backup operate on only two disks of the
domain, and pointing libvirt to the scratch storage it needs to use for
the duration of the backup (ideally, libvirt will generate an
appropriate scratch file name itself if omitted from the XML, and create
scratch files itself instead of me having to pre-create them). Note that
I can give either the path to my original disk ($orig1, $orig2) or the
target name from the domain XML (in my case sdc, sdd); libvirt
normalizes my input and always uses the target name when reposting the
XML in output.
$ $virsh backup-begin $dom backup.xml
Backup id 1 started
backup used description from 'backup.xml'
Kicks off the backup job. virsh called
virDomainBackupBegin(dom, "<domainbackup ...>", NULL, 0)
and in turn libvirt makes all of the following QMP calls (if any QMP
call fails, libvirt attempts to unroll things so that there is no
lasting change to the guest before actually reporting failure):
{"execute":"nbd-server-start",
"arguments":{"addr":{"type":"inet",
"data":{"host":"localhost",
"port":"10809"}}}}
{"execute":"blockdev-add",
"arguments":{"driver":"qcow2",
"node-name":"backup-sdc",
"file":{"driver":"file",
"filename":"$PWD/scratch1.img"},
"backing":"'$node1'"}}
{"execute":"blockdev-add",
"arguments":{"driver":"qcow2",
"node-name":"backup-sdd",
"file":{"driver":"file",
"filename":"$PWD/scratch2.img"},
"backing":"'$node2'"}}
{"execute":"transaction",
"arguments":{"actions":[
{"type":"blockdev-backup", "data":{
"device":"$node1", "target":"backup-sdc",
"sync":"none",
"job-id":"backup-sdc" }},
{"type":"blockdev-backup", "data":{
"device":"$node2", "target":"backup-sdd",
"sync":"none",
"job-id":"backup-sdd" }}
]}}
{"execute":"nbd-server-add",
"arguments":{"device":"backup-sdc",
"name":"sdc"}}
{"execute":"nbd-server-add",
"arguments":{"device":"backup-sdd",
"name":"sdd"}}
libvirt populated $node1 and $node2 to be the node names actually
assigned by qemu; until Peter's work on libvirt using node names
everywhere actually lands, libvirt is scraping the auto-generated
#blockNNN name from query-block and friends (the same as it already does
in other situations like write threshold).
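If you want to see which node names qemu assigned on your own setup,
you can peek at them with qemu-monitor-command (fine for debugging,
though note that libvirt marks the domain as tainted when you use it):
$ $virsh qemu-monitor-command $dom --pretty \
    '{"execute":"query-named-block-nodes"}'
and look for the "node-name" entries in the output.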
With this command complete, libvirt has now kicked off a pull backup
job, which includes a single qemu NBD server with two separate exports
named 'sdc' and 'sdd' that expose the state of the disks at the time of
the API call (any guest writes to $orig1 or $orig2 trigger copy-on-write
actions into scratch1.img and scratch2.img to preserve the fact that
reading from NBD sees unchanging contents).
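At this point any NBD client should be able to connect; for example, a
quick read-only sanity check of one export looks like:
$ qemu-io -r -f raw -c 'r 0 512' nbd://localhost:10809/sdc
which should report a successful 512-byte read without disturbing the
export.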
We can double-check what libvirt is tracking for the running backup
job, including the fact that libvirt normalized the <disk> names to
match the domain XML target listings, which in turn match the names of
the exports being served over the NBD server:
$ $virsh backup-dumpxml $dom 1
<domainbackup type='pull' id='1'>
  <server transport='tcp' name='localhost' port='10809'/>
  <disks>
    <disk name='sdc' type='file'>
      <scratch file='/home/eblake/scratch1.img'/>
    </disk>
    <disk name='sdd' type='file'>
      <scratch file='/home/eblake/scratch2.img'/>
    </disk>
  </disks>
</domainbackup>
where 1 on the command line would be replaced by whatever id was printed
by the earlier backup-begin command (yes, my demo can hard-code things
to 1, because the current qemu and initial libvirt implementations only
support one backup job at a time, although we have plans to allow
parallel jobs in the future).
This translated to the libvirt API call
virDomainBackupGetXMLDesc(dom, 1, 0)
and did not have to make any QMP calls into qemu.
Now that the backup job is running, we want to scrape the data off the
NBD server. The most naive way is:
$ $qemu_img convert -f raw -O $fmt nbd://localhost:10809/sdc full1.img
$ $qemu_img convert -f raw -O $fmt nbd://localhost:10809/sdd full2.img
where $fmt is whichever output format you prefer, and where we hope
that qemu-img convert is able to recognize the holes in the source and
only write into the backup copy where actual data lives.
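A crude way to check whether that worked is to compare the apparent
size of the result against its actual disk usage:
$ du -h --apparent-size full1.img
$ du -h full1.img
If convert preserved the holes, the second number will be noticeably
smaller for a sparsely-populated guest disk.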
You don't have to use qemu-img; it's possible to use any NBD client,
such as the kernel NBD module:
$ modprobe nbd
$ qemu-nbd -c /dev/nbd0 -f raw nbd://localhost:10809/sdc
$ cp /dev/nbd0 full1.img
$ qemu-nbd -d /dev/nbd0
The above demonstrates the flexibility of the pull model (your backup
file can be ANY format you choose; here I did 'cp' to copy it to a raw
destination), but it was also a less efficient NBD client, since the
kernel module doesn't yet know about NBD_CMD_BLOCK_STATUS for learning
where the holes are, nor about NBD_OPT_STRUCTURED_REPLY for faster reads
of those holes.
Of course, we don't have to blindly read the entire image; we can
instead use two clients in parallel (per exported disk): one that uses
'qemu-img map' to learn which parts of the export contain data, feeding
that through a bash 'while read' loop to parse out which offsets
contain interesting data, and a second client spawned per region to
copy just that subset of the file. Here, I'll use 'qemu-io -C' to
perform copy-on-read - that requires that my output file be qcow2
rather than any other particular format, but I'm guaranteed that my
output backup file is only populated in the same places that $orig1 was
populated at the time the backup started.
$ $qemu_img create -f qcow2 full1.img $size_of_orig1
$ $qemu_img rebase -u -f qcow2 -F raw -b nbd://localhost:10809/sdc \
full1.img
$ while read line; do
    [[ $line =~ .*start.:.([0-9]*).*length.:.([0-9]*).*data.:.true.* ]] ||
      continue
    start=${BASH_REMATCH[1]} len=${BASH_REMATCH[2]}
    qemu-io -C -c "r $start $len" -f qcow2 full1.img
  done < <($qemu_img map --output=json -f raw nbd://localhost:10809/sdc)
$ $qemu_img rebase -u -f qcow2 -b '' full1.img
and the nice thing about this loop is that once you've figured out how
to parse qemu-img map output as one client process, you can use any
other process (such as qemu-nbd -c, then dd if=/dev/nbd0 of=$dest bs=64k
skip=$((start/64/1024)) seek=$((start/64/1024)) count=$((len/64/1024))
conv=fdatasync) as the NBD client that reads the subset of data of
interest (and thus, while qemu-io had to write to full1.img as qcow2,
you can use an alternative client to write to raw or any other format of
your choosing).
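For completeness, here is a rough, untested sketch of that dd-based
variant; full1.raw is a new name for this example, it reuses
$size_of_orig1 from above, and the 64k divisions assume the extents
reported by qemu-img map are 64k-aligned (true for the default qcow2
cluster size used in this demo):
$ truncate -s $size_of_orig1 full1.raw
$ qemu-nbd -c /dev/nbd0 -r -f raw nbd://localhost:10809/sdc
$ while read line; do
    [[ $line =~ .*start.:.([0-9]*).*length.:.([0-9]*).*data.:.true.* ]] ||
      continue
    start=${BASH_REMATCH[1]} len=${BASH_REMATCH[2]}
    dd if=/dev/nbd0 of=full1.raw bs=64k skip=$((start/64/1024)) \
      seek=$((start/64/1024)) count=$((len/64/1024)) conv=notrunc,fdatasync
  done < <($qemu_img map --output=json -f raw nbd://localhost:10809/sdc)
$ qemu-nbd -d /dev/nbd0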
Now that we've copied off the full backup image (or just a portion of it
- after all, this is a pull model where we are in charge of how much
data we want to read), it's time to tell libvirt that it can conclude
the backup job:
$ $virsh backup-end $dom 1
Backup id 1 completed
again, where the '1' on the command line came from the output of
backup-begin and could be something other than the hard-coded value
used in this demo. This maps to the libvirt API call
virDomainBackupEnd(dom, 1, 0)
which in turn maps to the QMP commands:
{"execute":"nbd-server-remove",
"arguments":{"name":"sdc"}}
{"execute":"nbd-server-remove",
"arguments":{"name":"sdd"}}
{"execute":"nbd-server-stop"}
{"execute":"block-job-cancel",
"arguments":{"device":"sdc"}}
{"execute":"block-job-cancel",
"arguments":{"device":"sdd"}}
{"execute":"blockdev-del",
"arguments":{"node-name":"backup-sdc"}}
{"execute":"blockdev-del",
"arguments":{"node-name":"backup-sdd"}}
to clean up all the things added during backup-begin.
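Once backup-end returns, qemu no longer references the scratch files,
so the last bit of cleanup is simply:
$ rm scratch1.img scratch2.img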
More to come in part 2.
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization:  qemu.org | libvirt.org