Re: [libvirt] [RFC v2] external (pull) backup API

12 Apr 2018

      On 11.04.2018 19:32, Eric Blake wrote:
...
On 04/03/2018 07:01 AM, Nikolay Shirokovskiy wrote:
...
Hi, all.
This is another RFC on the pull backup API. This API provides a means to read
domain disks in a snapshotted state so that a client can back them up, as well
as a means to write domain disks to revert them to a backed up state. The
previous version of the RFC is [1]. I'll also describe the API implementation
details to shed light on the usage of various qemu dirty bitmap commands.
This is a first-pass review (making comments as I first encounter
something, even if it gets explained later in the email)
...
This API does not use existing disk snapshots. Instead it introduces snapshots
provided by qemu's blockdev-backup command. The reason is that we need the
snapshotted disk state only temporarily, for the duration of the backup
operation, and the newly introduced snapshots can easily be discarded at the
end of the operation without a block commit operation. Technically the
difference is as follows. With a usual snapshot we create a new image backed by
the original, and all new data goes to the new image, so the original image
stays in the snapshotted state. With a temporary snapshot we create a new image
backed by the original, and all new data still goes to the original image, but
before new data is written the old data about to be overwritten is copied out
to the new image, so we get the snapshotted state through the new image.
So, rewriting this to make sure I understand, let's start with a disk
with contents A, then take a snapshot, then write B:
In the existing libvirt snapshot APIs, the data gets distributed as:
base (contents A) <- new active (contents B)
where you want the new API:
base, remains active (contents B) ~~~ backup (contents A)
Exactly
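The distinction above can be illustrated with a toy model. This is only a
sketch of the copy-before-write idea, not qemu's implementation; the class and
method names are made up for illustration.

```python
class FleecedDisk:
    """Toy model of a disk with a temporary (fleece) snapshot."""

    def __init__(self, blocks):
        self.active = list(blocks)  # the guest keeps writing here
        self.fleece = {}            # old data, copied out before overwrite

    def guest_write(self, index, data):
        # Copy-before-write: preserve the old block once, then overwrite it.
        if index not in self.fleece:
            self.fleece[index] = self.active[index]
        self.active[index] = data

    def snapshot_read(self, index):
        # Snapshot view: the preserved block if it was overwritten,
        # otherwise the still unchanged block of the active image.
        if index in self.fleece:
            return self.fleece[index]
        return self.active[index]


disk = FleecedDisk(["A", "A", "A"])
disk.guest_write(1, "B")
disk.guest_write(1, "C")
print(disk.active)                                # ['A', 'C', 'A']
print([disk.snapshot_read(i) for i in range(3)])  # ['A', 'A', 'A']
```

The active image holds the newest guest data, while reads through the fleece
image keep returning the content from the moment the snapshot was taken.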
...
...
Disk snapshots as well as the disks themselves are available for read/write
through the qemu NBD server.
So the biggest reason for a new libvirt API is that we need management
actions to control which NBD images from qemu are exposed and torn down
at the appropriate sequences.
...
Here are the typical actions on domain backup:
- create temporary snapshot of domain disks of interest
- export snapshots thru NBD
- back them up
- remove disks from export
- delete temporary snapshot
and typical actions on domain restore:
- start domain in paused state                                                                   
- export domain disks of interest thru NBD for write                                             
- restore them                                                                                   
- remove disks from export                                                                       
- resume or destroy domain
Now let's write down the API in more detail. There are minor changes in
comparison with the previous version [1].
*Temporary snapshot API*
In the previous version it was called 'Fleece API' after qemu terms, and I'll
still use the BlockSnapshot prefix for commands as in the previous RFC instead
of TmpSnapshots, which I am more inclined toward now.
virDomainBlockSnapshotPtr
virDomainBlockSnapshotCreateXML(virDomainPtr domain,
                                const char *xmlDesc,
                                unsigned int flags);
Just to make sure, we have the existing API of:
virDomainSnapshotPtr virDomainSnapshotCreateXML(virDomainPtr domain,
                                                const char *xmlDesc,
                                                unsigned int flags);
So you are creating a new object (virDomainBlockSnapshotPtr) rather than
reusing the existing virDomainSnapshotPtr, and although the two commands
are similar, we get to design a new XML schema from scratch rather than
trying to overload even more functionality onto the existing API.
Yes. Existing snapshots are different from temporary snapshots in many ways.
The former, for example, form a tree structure and the latter do not.
...
Should we also have:
const char *virDomainBlockSnapshotGetName(virDomainBlockSnapshotPtr
snapshot);
virDomainPtr virDomainBlockSnapshotGetDomain(virDomainBlockSnapshotPtr
snapshot);
virConnectPtr virDomainBlockSnapshotGetConnect(virDomainBlockSnapshotPtr
snapshot);
for symmetry with existing snapshot API?
Yes. I omitted these calls from the RFC as they are trivial and are not
needed to grasp the picture.
...
...
virDomainBlockSnapshotDelete(virDomainBlockSnapshotPtr snapshot,
                             unsigned int flags);
virDomainBlockSnapshotList(virDomainPtr domain,
                           virDomainBlockSnapshotPtr **snaps,
                           unsigned int flags);
I'm guessing this is the counterpart to virDomainListAllSnapshots() (the
modern listing interface), and that we probably don't want counterparts
for virDomainSnapshotNum/virDomainSnapshotListNames (the older listing
interface, which was inherently racy as the list could change in length
between the two calls).
That's right.
...
...
virDomainBlockSnapshotGetXMLDesc(virDomainBlockSnapshotPtr snapshot,
                                 unsigned int flags);
virDomainBlockSnapshotPtr
virDomainBlockSnapshotLookupByName(virDomainPtr domain,
                                   const char *name,
                                   unsigned int flags);
Also, the virDomainSnapshotPtr had a number of API to track a tree-like
hierarchy between snapshots (that is, you very much want to know if
snapshot B is a child of snapshot A), while it looks like your new
virDomainBlockSnapshotPtrs are completely independent (no relationships
between the snapshots, each can be independently created or torn down,
without having to rewrite a relationship tree between them, and there is
no need for counterparts to things like virDomainSnapshotNumChildren).
Okay, I think that makes sense, and is a good reason for introducing a
new object type rather than shoe-horning this into the existing API.
This fact motivates me to introduce a new API too.
...
...
Here is an example of snapshot xml description:
<domainblocksnapshot>
    <name>d068765e-8b50-4d74-9b72-1e55c663cbf8</name>
    <disk name='sda' type="file">
        <fleece file="/tmp/snapshot-a.hdd"/>
    </disk>
    <disk name='sdb' type="file">
        <fleece file="/tmp/snapshot-b.hdd"/>
    </disk>
</domainblocksnapshot>
Temporary snapshots are independent, so they are not organized in a tree
structure like usual snapshots, and the 'list snapshots' and 'lookup'
functions will suffice.
So in the XML, the <fleece> element describes the destination file (back
to my earlier diagram, it would be the file that is created and will
hold content 'A' when the main active image is changed to hold content
'B' after the snapshot was created)?
Yes.
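For a client preparing the xmlDesc argument, the description above can be
generated rather than written by hand. A minimal sketch, assuming the
<domainblocksnapshot> schema stays as proposed in this RFC; the helper name
block_snapshot_xml is hypothetical.

```python
import xml.etree.ElementTree as ET


def block_snapshot_xml(name, disks):
    """Build a <domainblocksnapshot> description.

    disks maps a disk target name (e.g. 'sda') to the fleece file path.
    """
    root = ET.Element("domainblocksnapshot")
    ET.SubElement(root, "name").text = name
    for target, path in disks.items():
        disk = ET.SubElement(root, "disk", {"name": target, "type": "file"})
        ET.SubElement(disk, "fleece", {"file": path})
    return ET.tostring(root, encoding="unicode")


xml_desc = block_snapshot_xml(
    "d068765e-8b50-4d74-9b72-1e55c663cbf8",
    {"sda": "/tmp/snapshot-a.hdd", "sdb": "/tmp/snapshot-b.hdd"},
)
print(xml_desc)
```

The resulting string is what would be passed as xmlDesc to the proposed
virDomainBlockSnapshotCreateXML call.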
...
...
Qemu can track which disk blocks have changed since the snapshotted state, so
on the next backup the client can back up only the changed blocks.
virDomainBlockSnapshotCreateXML accepts the
VIR_DOMAIN_BLOCK_SNAPSHOT_CREATE_CHECKPOINT flag to turn this option on for a
snapshot, which means tracking changes from this particular snapshot. I used
the term checkpoint and not [dirty] bitmap because in the current
implementation several qemu dirty bitmaps are used to provide the changed
blocks from a given checkpoint to the current snapshot (see the
*Implementation* section for more details). Also, a bitmap keeps block changes
and thus itself changes over time, while checkpoint is a more static term
meaning that you can query changes from that moment in time.
Checkpoints are visible in the active domain xml:
<disk type='file' device='disk'>
      ..
      <target dev='sda' bus='scsi'/>
      <alias name='scsi0-0-0-0'/>
      <checkpoint name="93a5c045-6457-2c09-e56c-927cdf34e178"/>
      <checkpoint name="5768a388-c1c4-414c-ac4e-eab216ba7c0c"/>
      ..
    </disk>
Every checkpoint requires a qemu dirty bitmap. With one bit per block and the
default dirty block size of 64KiB, a 1TiB disk has 16Mi blocks, so the bitmap
eats 2MiB of RAM, and the same amount of disk space is used.
So the client needs to manage checkpoints and delete unused ones. Thus the
next API function:
int
virDomainBlockCheckpointRemove(virDomainPtr domain,
                               const char *name,
                               unsigned int flags);
I'm trying to figure out how BlockCheckpoint and BlockSnapshots relate.
Maybe it will be more clear when I read the implementation section
below.  Is the idea that I can't create a BlockSnapshot without first
having a checkpoint available?  If so, where does that fit in the
<domainblocksnapshot> XML?
No, you can create a snapshot without any checkpoints available. Actually the
first snapshot is like that.

Now if you create a snapshot with a checkpoint and then delete the snapshot,
the checkpoint remains, so we need an API to delete checkpoints if we wish.
...
...
*Block export API*
I guess it is natural to treat qemu NBD server as a domain device. So
we can use virDomainAttachDeviceFlags/virDomainDetachDeviceFlags API to start/stop NBD
server and virDomainUpdateDeviceFlags to add/delete disks to be exported.
This feels a bit awkward - up to now, attaching a device is something
visible to the guest, but you are trying to reuse the interface to
attach something tracked by the domain, but which has no impact to the
guest.  That is, the guest has no clue whether a block export exists
pointing to a particular checkpoint, nor does it care.
Not entirely true. Take graphical framebuffers (vnc) or serial devices.
The guest is completely unaware of vnc. A serial device is related to a guest
device, but again the guest is not aware of such a relation.
...
...
While I have no doubts about the start/stop operations, using
virDomainUpdateDeviceFlags looks a bit inconvenient, so I decided to add a
pair of API functions just to add/delete disks to be exported:
int
virDomainBlockExportStart(virDomainPtr domain,
                          const char *xmlDesc,
                          unsigned int flags);
int
virDomainBlockExportStop(virDomainPtr domain,
                         const char *xmlDesc,
                         unsigned int flags);
I guess more appropriate names are virDomainBlockExportAdd and
virDomainBlockExportRemove but as I already have a patch series implementing pull
backups with these names I would like to keep these names now.
What does the XML look like in these calls?
...
These names also reflect that in the implementation I decided to start/stop
the NBD server lazily. While this is a bit novel for the libvirt API, I guess
it is convenient, because to refer to the NBD server when adding/removing
disks we need to identify it through its parameters like type, address etc.
until we introduce some device id (which does not look consistent with the
current libvirt design).
This just reinforces my thoughts above - is the reason it doesn't make
sense to assign a device id to the export due to the fact that the
export is NOT guest-visible?  Does it even belong under the
By export you mean a NBD server or disk being exported? What is this id for?
Is this libvirt alias for devices or something different?
...
"domain/devices/" xpath of the domain XML, or should it be a new sibling
of <devices> with an xpath of "domain/blockexports/"?
...
So it
looks like we have all the parameters to start/stop the server within these
calls, so why have extra API calls just to start/stop the server manually? If
we later need an NBD server without disks we can perfectly well support
virDomainAttachDeviceFlags/virDomainDetachDeviceFlags.
Here is an example of xml to add/remove disks (specifying the checkpoint
attribute is not needed for removing disks, of course):
<domainblockexport type="nbd">
    <address type="ip" host="0.0.0.0" port="8000"/>
    <disk name="sda" snapshot="0044757e-1a2d-4c2c-b92f-bb403309bb17"
                     checkpoint="d068765e-8b50-4d74-9b72-1e55c663cbf8"/>
    <disk name="sdb" snapshot="0044757e-1a2d-4c2c-b92f-bb403309bb17"
                     checkpoint="d068765e-8b50-4d74-9b72-1e55c663cbf8"/>
</domainblockexport>
So this is the XML you pass to virDomainBlockExportStart, with the goal
of telling qemu to start or stop an NBD export on the backing chain
associated with disk "sda", where the export is serving up data tied to
checkpoint "d068765e-8b50-4d74-9b72-1e55c663cbf8", and which will be
associated with the destination snapshot file described by the
<domainblocksnapshot> named "0044757e-1a2d-4c2c-b92f-bb403309bb17"?
I would rephrase. I didn't think of arbitrary backing chains in this API.
It just exports the temporary snapshot of disk "sda". The snapshot is
referenced by its name "0044757e-1a2d-4c2c-b92f-bb403309bb17". Additionally
you can ask to export the CBT from some earlier snapshot of "sda", referenced
by "d068765e-8b50-4d74-9b72-1e55c663cbf8", to the exported snapshot
("0044757e-1a2d-4c2c-b92f-bb403309bb17"). To make exporting the CBT possible,
the earlier snapshot should be created with the
VIR_DOMAIN_BLOCK_SNAPSHOT_CREATE_CHECKPOINT flag.

So this export API is somewhat oriented to block snapshots. Maybe one day we
will want to export the backing chain of regular snapshots, and then this API
will be insufficient...
...
Why is it named <domainblockexport> here, but...
...
And this is how this NBD server will be exposed in domain xml:
<devices>
    ...
    <blockexport type="nbd">
<blockexport> here?
In this case we already have domain context as xpath is /domain/devices/blockexport.
...
...
<address type="ip" host="0.0.0.0" port="8000"/>
        <disk name="sda" snapshot="0044757e-1a2d-4c2c-b92f-bb403309bb17"
                         checkpoint="d068765e-8b50-4d74-9b72-1e55c663cbf8"
                         exportname="sda-0044757e-1a2d-4c2c-b92f-bb403309bb17"/>
The exportname property is new here compared to the earlier listing - is
that something that libvirt generates, or that the user chooses?
In the current implementation it is generated. I see no obstacle to letting
"exportname" be specified in the input too.
...
...
<disk name="sdb" snapshot="0044757e-1a2d-4c2c-b92f-bb403309bb17"
                         checkpoint="d068765e-8b50-4d74-9b72-1e55c663cbf8"
                         exportname="sdb-0044757e-1a2d-4c2c-b92f-bb403309bb17"/>
    </blockexport>
    ...
</devices>
*Implementation details from qemu-libvirt interactions POV*
1. Temporary snapshot
- create snapshot
Which libvirt API triggers this action? virDomainBlockSnapshotCreateXML?
Yes.
...
...
- add fleece blockdev backed by disk of interest
    - start fleece blockjob which will pop out data to be overwritten to fleece blockdev
{
        "execute": "blockdev-add"
        "arguments": {
            "backing": "drive-scsi0-0-0-0",
            "driver": "qcow2",
            "file": {
                "driver": "file",
                "filename": "/tmp/snapshot-a.hdd"
Is qemu creating this file, or is libvirt pre-creating it and qemu just
opening it?  I guess this is a case where libvirt would want to
The latter.
...
pre-create an empty qcow2 file (either by qemu-img, or by the new
x-blockdev-create in qemu 2.12)?  Okay, it looks like this file is what
you listed in the XML for <domainblocksnapshot>, so libvirt is creating
it.  Does the new file have a backing image, or does it read as
completely zeroes?
The file is created by qemu-img without a backing chain, so it reads as zeros.
But it does not matter, as the file is not meant to be read or written outside
of the qemu process. I guess after the blockdev-add command the fleece image
gets the active image as its backing in qemu internals.
...
...
},
            "node-name": "snapshot-scsi0-0-0-0"
        },
    }
No trailing comma in JSON {}, but it's not too hard to figure out what
you mean.
Oops) I used python -mjson.tool to pretty print json grabbed from qemu logs.
It sorts json keys alphabetically, which is not convenient in this case:
"execute" is better placed above "arguments". So I just moved the "execute"
line in the editor and completely forgot about the commas)
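One way to avoid this kind of hand-editing slip is to build the command as a
Python dict and let json.dumps serialize it: dicts keep insertion order
(Python 3.7+) and json.dumps does not sort keys unless asked to, so "execute"
stays first and the commas come out right. A minimal sketch reusing the
blockdev-add arguments from the example above:

```python
import json

# Keys are emitted in insertion order, so "execute" comes before "arguments".
cmd = {
    "execute": "blockdev-add",
    "arguments": {
        "node-name": "snapshot-scsi0-0-0-0",
        "driver": "qcow2",
        "file": {"driver": "file", "filename": "/tmp/snapshot-a.hdd"},
        "backing": "drive-scsi0-0-0-0",
    },
}

text = json.dumps(cmd, indent=4)
print(text)
```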
...
...
{
        "execute": "transaction",
        "arguments": {
            "actions": [
                {
                    "type": "blockdev-backup",
                    "data": {
                        "device": "drive-scsi0-0-0-0",
                        "target": "snapshot-scsi0-0-0-0",
                        "sync": "none"
                    }
                }
            ]
You showed a transaction with only one element; but presumably we are
using a transaction because if we want to create a point in time for
multiple disks at once, we need two separate blockdev-backup actions
joined in the same transaction to cover the two disks.  So this command
Yes, strictly speaking we don't need a transaction here. I am providing qemu
logs from the current naive implementation) I guess I'd better use a snapshot
of 2 disks in the example, as you suggest.
...
is telling qemu to start using a brand-new qcow2 file as its local
storage for tracking that a snapshot is being taken, and that point in
Yes.
...
time is the checkpoint?
No, these actions will not create a checkpoint. Examples for checkpoints
are below. In the case of checkpoints we additionally add a new dirty bitmap
in the transaction for every disk and manipulate the existing dirty bitmaps.
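For the two-disk case mentioned above, the transaction would carry one
blockdev-backup action per disk so that both snapshots share a point in time.
A minimal sketch of generating it; the second device and node names
(drive-scsi0-0-0-1, snapshot-scsi0-0-0-1) and the helper name are assumptions
made for illustration only.

```python
import json


def fleece_transaction(pairs):
    """Build one QMP transaction with a blockdev-backup action per disk.

    pairs is a list of (device, fleece node-name) tuples.
    """
    actions = [
        {
            "type": "blockdev-backup",
            "data": {"device": device, "target": target, "sync": "none"},
        }
        for device, target in pairs
    ]
    return {"execute": "transaction", "arguments": {"actions": actions}}


cmd = fleece_transaction([
    ("drive-scsi0-0-0-0", "snapshot-scsi0-0-0-0"),
    ("drive-scsi0-0-0-1", "snapshot-scsi0-0-0-1"),  # hypothetical second disk
])
print(json.dumps(cmd, indent=4))
```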
...
Am I correct that you would then tell qemu to export an NBD view of this
qcow2 snapshot which a third-party client can connect to and use
NBD_CMD_BLOCK_STATUS to learn which portions of the file contain data
(that is, which clusters has qemu copied into the backup, because the
active image has changed them since the checkpoint, but anything not
dirty in this file is still identical to the last backup)?
No. In this example we don't talk about checkpoints, which are for
incremental backups. This is a plain full backup. You create a snapshot
and export it.

Even if we created the snapshot with a checkpoint, the checkpoint is of no
use for the first backup. The first backup cannot be anything but a full copy
of the snapshot. But later, if you make the first backup, delete the snapshot,
and after some time want to create another backup, you create a new snapshot,
and this time, if the first snapshot was created with a checkpoint, we can
tell through NBD_CMD_BLOCK_STATUS which portions of the disk in the second
snapshot have changed relative to the first snapshot. Using this info you can
create an incremental backup.
...
Would libvirt ever want to use something other than "sync":"none"?
I don't know of use cases for other modes. It looks like "none" is sufficient
for snapshot purposes.
...
...
}
    }
- delete snapshot
    - cancel fleece blockjob
    - delete fleece blockdev
{
        "execute": "block-job-cancel",
        "arguments": {
            "device": "drive-scsi0-0-0-0"
        }
    }
    {
        "execute": "blockdev-del",
        "arguments": {
            "node-name": "snapshot-scsi0-0-0-0"
        }
    }
2. Block export
- add disks to export
    - start NBD server if it is not started
    - add disks
{
        "execute": "nbd-server-start",
        "arguments": {
            "addr": {
                "type": "inet",
                "data": {
                    "host": "0.0.0.0",
                    "port": "49300"
                }
            }
        }
    }
    {
        "execute": "nbd-server-add",
        "arguments": {
            "device": "snapshot-scsi0-0-0-0",
            "name": "sda-d068765e-8b50-4d74-9b72-1e55c663cbf8",
            "writable": false
So this is telling qemu to export the temporary qcow2 image created in
the point above.  An NBD client would see the export getting
progressively more blocks with data as the guest continues to write more
clusters (as qemu has to copy the data from the checkpoint to the
temporary file before updating the main image with the new data).  If
the NBD client reads a cluster that has not yet been copied by qemu
(because the guest has not written to that cluster since the block job
started), would it see zeroes, or the same data that the guest still sees?
No, the client will see the snapshotted disk state. The snapshot does not
change at all. If the guest makes a write, the old data is first written to
the fleece image and then the new data is written to the active image.

There are no checkpoints in this example either. It is just a snapshot of the
disk, and this snapshot is exported through NBD.
...
...
}
    }
- remove disks from export
    - remove disks
    - stop NBD server if there are no disks left
{
        "arguments": {
            "mode": "hard",
            "name": "sda-d068765e-8b50-4d74-9b72-1e55c663cbf8"
        },
        "execute": "nbd-server-remove"
    }
    {
        "execute": "nbd-server-stop"
    }
3. Checkpoints (the most interesting part)
First a few facts about qemu dirty bitmaps.
A bitmap can be in either the active or the disabled state. In the disabled
state it does not change on guest writes; conversely, in the active state it
tracks guest writes. This implementation keeps only one bitmap active at a
time. This should reduce guest write penalties in the presence of
checkpoints. So on the first snapshot we create bitmap B1. Now it tracks
changes from snapshot 1. On the second snapshot we create bitmap B2 and
disable bitmap B1, and so on. Now bitmap B1 keeps the changes from snapshot 1
to snapshot 2, B2 the changes from snapshot 2 to snapshot 3, and so on. The
last bitmap is active and collects the disk changes made after the latest
snapshot.
Getting the bitmap of changed blocks from some past checkpoint up to the
current snapshot is quite simple in this scheme. For example, if the last
snapshot is 7, then to get the changes from snapshot 3 to the latest snapshot
we need to merge bitmaps B3, B4, B5 and B6. Merge is just a logical OR of the
bitmap bits.
Deleting a checkpoint somewhere in the middle of the checkpoint sequence
requires merging the corresponding bitmap into the previous bitmap in this
scheme.
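The merge and the delete-in-the-middle rule can be shown with small integers
standing in for bitmaps (one bit per block). This is only a sketch of the
bookkeeping, with made-up bit patterns; it does not talk to qemu.

```python
def merge(*bitmaps):
    """Merge dirty bitmaps: a logical OR of their bits."""
    result = 0
    for bitmap in bitmaps:
        result |= bitmap
    return result


# B[k] marks the blocks changed between snapshot k and snapshot k+1.
B = {3: 0b0001, 4: 0b0100, 5: 0b0110, 6: 0b1000}

# Changes from snapshot 3 to the latest snapshot 7: merge B3..B6.
before = merge(B[3], B[4], B[5], B[6])
print(bin(before))  # 0b1111

# Deleting checkpoint 5: fold B5 into the previous bitmap B4 first,
# so that B4 now covers the changes from snapshot 4 to snapshot 6.
B[4] = merge(B[4], B.pop(5))
after = merge(B[3], B[4], B[6])
print(bin(after))   # 0b1111
```

Simply dropping B5 without the fold would lose the block that only B5 had
marked (0b0010 here), and the result for snapshot 3 would be incomplete.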
We use persistent bitmaps in the implementation. This means that upon qemu
process termination bitmaps are saved in the disk image metadata and restored
on qemu process start. This makes a checkpoint a persistent property, that is,
we keep checkpoints across domain starts/stops. Qemu does not try hard to keep
bitmaps. If something goes wrong upon save, the bitmap is dropped. The same
applies to the migration process. For the backup process this is not critical:
if we don't find a checkpoint we can always make a full backup. Also, qemu
provides no special means to track the order of bitmaps. These facts are
critical for an implementation with one active bitmap at a time. We need the
right order of bitmaps upon merge: for snapshot N and block changes from
snapshot K, K < N, to N we need to merge bitmaps B_{K}, ..., B_{N-1}. Also, if
one of the bitmaps to be merged is missing, we can't calculate the desired
block changes either.
So the implementation encodes the bitmap order in the bitmap names. For
snapshot A1 the bitmap name will be A1, for snapshot A2 the bitmap name will
be A2^A1, and so on. Using this naming scheme we can find out the bitmap order
and check for missing bitmaps upon domain start. This complicates bitmap
removal a bit, though. For example, removing a bitmap somewhere in the middle
looks like this:
- removing bitmap K (bitmap name is NAME_{K}^NAME_{K-1})
    - create new bitmap named NAME_{K+1}^NAME_{K-1}      ---. 
    - disable new bitmap                                    | This is effectively renaming
    - merge bitmap NAME_{K+1}^NAME_{K} to the new bitmap    | of bitmap K+1 to comply the naming scheme
    - remove bitmap NAME_{K+1}^NAME_{K}                  ___/
    - merge bitmap NAME_{K}^NAME_{K-1} to NAME_{K-1}^NAME_{K-2}
    - remove bitmap NAME_{K}^NAME_{K-1}
As you can see, we need to change the name of bitmap K+1 to keep our bitmap
naming scheme. This is done by creating a new K+1 bitmap with the appropriate
name and copying the old K+1 bitmap into the new one.
So while it is possible to have only one active bitmap at a time, it costs
some exercises at the management layer. To me it looks like qemu itself is a
better place to track the bitmap chain order and consistency.
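To make the naming scheme concrete, here is a sketch of how the order could be
recovered from the bitmap names at domain start and how a gap would be
detected. It assumes names of the form CHILD^PARENT with the "libvirt-" prefix
already stripped; the function name chain_order is hypothetical.

```python
def chain_order(bitmap_names):
    """Return checkpoint names oldest first; raise ValueError on a gap."""
    parent = {}
    for name in bitmap_names:
        child, _, prev = name.partition("^")
        parent[child] = prev or None  # the oldest bitmap has no parent

    roots = [c for c, p in parent.items() if p is None]
    if len(roots) != 1:
        raise ValueError("expected exactly one bitmap without a parent")

    # Invert the mapping and walk from the oldest checkpoint to the newest.
    child_of = {p: c for c, p in parent.items() if p is not None}
    order = [roots[0]]
    while order[-1] in child_of:
        order.append(child_of[order[-1]])

    if len(order) != len(parent):
        raise ValueError("bitmap chain is broken, some bitmap is missing")
    return order


print(chain_order(["A3^A2", "A1", "A2^A1"]))  # ['A1', 'A2', 'A3']
```

If bitmap A2^A1 were lost, A3^A2 could no longer be reached from A1 and the
function would raise, which is the case where only a full backup is possible.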
Libvirt is already tracking a tree relationship between internal
snapshots (the virDomainSnapshotCreateXML), because qemu does NOT track
that (true, internal snapshots don't get as much attention as external
snapshots) - but the fact remains that qemu is probably not the best
place to track relationship between multiple persistent bitmaps, any
more than it tracks relationships between internal snapshots.  So having
libvirt track relations between persistent bitmaps is just fine.  Do we
The situations are different. For example, you can delete internal snapshot
S and this will not hurt any child snapshots. The changes from the parent
snapshot of S to S itself will be merged into the children. In this sense
qemu tracks snapshot relationships.

Now let's consider dirty bitmaps. Say you have B1, B2, B3, B4, B5. All but
B5 are disabled, and B5 is active and gets changes on guest writes. B1 keeps
the changes from point in time 1 to point in time 2, and so on. Now if you
simply delete B3, then B2, for example, becomes invalid, as B2 does not
reflect all changes from point in time 2 to point in time 4 as we want in our
scheme. Qemu does not automatically merge B3 into B2. In this sense qemu does
not track bitmap relationships.
...
really have to rename bitmaps in the qcow2 file, or can libvirt track it
all on its own?
Libvirt needs naming scheme described above to track bitmaps order on
domain restarts. Thus we need to rename on deletion.
...
Earlier, you said that the new virDomainBlockSnapshotPtr are
independent, with no relations between them.  But here, you are wanting
to keep incremental backups related to one another.
Yes, but backups are not snapshots. All backup relation management is on the
client. In the pull backup scheme libvirt is only there to export a
snapshotted disk state, optionally with a CBT from some point in time. The
client itself makes backups and tracks their relationships.

However, as we use a chain of disabled bitmaps with one active bitmap at the
tip of the chain, and qemu does not track their order, we need to do it in
libvirt.
...
...
Now, what exporting bitmaps looks like.
- add to export disk snapshot N with changes from checkpoint K
    - add fleece blockdev to NBD exports
    - create new bitmap T
    - disable bitmap T
    - merge bitmaps K, K+1, .. N-1 into T
    - add bitmap T to NBD export
- remove disk snapshot from export
    - remove fleece blockdev from NBD exports
    - remove bitmap T
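The steps that form the temporary bitmap T can be generated mechanically from
the ordered list of bitmaps K .. N-1. A sketch, assuming the x-vz-* commands
and the temporary bitmap name as they appear in the qemu logs quoted in this
mail (they are specific to this implementation); the helper name and the short
bitmap names are made up for readability.

```python
import json

TMP = "libvirt-__export_temporary__"


def export_bitmap_transaction(node, bitmap_names):
    """Create T, disable it, then merge bitmaps K .. N-1 into it."""
    actions = [
        {
            "type": "block-dirty-bitmap-add",
            "data": {"node": node, "name": TMP, "persistent": False},
        },
        {
            "type": "x-vz-block-dirty-bitmap-disable",
            "data": {"node": node, "name": TMP},
        },
    ]
    for src in bitmap_names:
        actions.append({
            "type": "x-vz-block-dirty-bitmap-merge",
            "data": {"node": node, "src_name": src, "dst_name": TMP},
        })
    return {"execute": "transaction", "arguments": {"actions": actions}}


# Hypothetical short names instead of UUIDs: checkpoints c1 and c2.
cmd = export_bitmap_transaction(
    "drive-scsi0-0-0-0", ["libvirt-c1", "libvirt-c2^c1"]
)
print(json.dumps(cmd, indent=4))
```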
Here are examples of the qemu commands for operations with checkpoints. I'll
make several snapshots with checkpoints for the purpose of better
illustration.
- create snapshot d068765e-8b50-4d74-9b72-1e55c663cbf8 with checkpoint
    - same as without checkpoint but additionally add bitmap on fleece blockjob start
...
    {
        "execute": "transaction",
        "arguments": {
            "actions": [
                {
                    "type": "blockdev-backup",
                    "data": {
                        "device": "drive-scsi0-0-0-0",
                        "sync": "none",
                        "target": "snapshot-scsi0-0-0-0"
                    }
                },
                {
                    "type": "block-dirty-bitmap-add",
                    "data": {
                        "name": "libvirt-d068765e-8b50-4d74-9b72-1e55c663cbf8",
                        "node": "drive-scsi0-0-0-0",
                        "persistent": true
                    }
                }
Here, the transaction makes sense; you have to create the persistent
dirty bitmap to track from the same point in time.  The dirty bitmap is
tied to the active image, not the backup, so that when you create the
NEXT incremental backup, you have an accurate record of which sectors
were touched in snapshot-scsi0-0-0-0 between this transaction and the next.
Yes.
...
...
]
        }
    }
- delete snapshot d068765e-8b50-4d74-9b72-1e55c663cbf8
    - same as without checkpoints
- create snapshot 0044757e-1a2d-4c2c-b92f-bb403309bb17 with checkpoint
    - same actions as for the first snapshot, but additionally disable the first bitmap
Again, you're showing the QMP commands that libvirt is issuing; which
libvirt API calls are driving these actions?
Well, I meant this section of the RFC to be specific about the qemu commands
libvirt issues during an API call, so that one can better understand how we
use the qemu API. I thought the API call and its arguments were clear from the
description above. In this case it is

virDomainBlockSnapshotCreateXML with VIR_DOMAIN_BLOCK_SNAPSHOT_CREATE_CHECKPOINT flag set.

The xml is as follows:

<domainblocksnapshot>
    <name>0044757e-1a2d-4c2c-b92f-bb403309bb17</name>
    <disk name='sda' type="file">
        <fleece file="/tmp/snapshot-a.hdd"/>
    </disk>
</domainblocksnapshot>
...
...
...
    {
        "execute": "transaction",
        "arguments": {
            "actions": [
                {
                    "type": "blockdev-backup",
                    "data": {
                        "device": "drive-scsi0-0-0-0",
                        "sync": "none",
                        "target": "snapshot-scsi0-0-0-0"
                    }
                },
                {
                    "type": "x-vz-block-dirty-bitmap-disable",
                    "data": {
Do you have measurements on whether having multiple active bitmaps hurts
performance?  I'm not yet sure that managing a chain of disabled bitmaps
(and merging them as needed for restores) is more or less efficient than
Vova, can you shed some light on this topic?
...
managing multiple bitmaps all the time.  On the other hand, you do have
a point that restore is a less frequent operation than backup, so making
backup as lean as possible and putting more work on restore is a
reasonable tradeoff, even if it adds complexity to the management for
doing restores.
Sorry, I don't understand from your words what the tradeoff is.
...
...
"name": "libvirt-d068765e-8b50-4d74-9b72-1e55c663cbf8",
                        "node": "drive-scsi0-0-0-0"
                    }
                },
                {
                    "type": "block-dirty-bitmap-add",
                    "data": {
                        "name": "libvirt-0044757e-1a2d-4c2c-b92f-bb403309bb17^d068765e-8b50-4d74-9b72-1e55c663cbf8",
                        "node": "drive-scsi0-0-0-0",
                        "persistent": true
                    }
                }
            ]
        }
    }
- delete snapshot 0044757e-1a2d-4c2c-b92f-bb403309bb17
- create snapshot 8fc02db3-166f-4de7-b7aa-1f7303e6162b with checkpoint
- add disk snapshot 8fc02db3-166f-4de7-b7aa-1f7303e6162b to export and bitmap with
  changes from checkpoint d068765e-8b50-4d74-9b72-1e55c663cbf8
    - same as add export without checkpoint, but additionally
        - form result bitmap
        - add bitmap to NBD export
...
    {
        "execute": "transaction",
        "arguments": {
            "actions": [
                {
                    "type": "block-dirty-bitmap-add",
                    "data": {
                        "node": "drive-scsi0-0-0-0",
                        "name": "libvirt-__export_temporary__",
                        "persistent": false
                    }
                },
                {
                    "type": "x-vz-block-dirty-bitmap-disable",
                    "data": {
                        "node": "drive-scsi0-0-0-0",
                        "name": "libvirt-__export_temporary__"
                    }
                },
                {
                    "type": "x-vz-block-dirty-bitmap-merge",
                    "data": {
                        "node": "drive-scsi0-0-0-0",
                        "src_name": "libvirt-d068765e-8b50-4d74-9b72-1e55c663cbf8",
                        "dst_name": "libvirt-__export_temporary__"
                    }
                },
                {
                    "type": "x-vz-block-dirty-bitmap-merge",
                    "data": {
                        "node": "drive-scsi0-0-0-0",
                        "src_name": "libvirt-0044757e-1a2d-4c2c-b92f-bb403309bb17^d068765e-8b50-4d74-9b72-1e55c663cbf#
                        "dst_name": "libvirt-__export_temporary__"
                    }
                }
            ]
        }
    }
    {
        "execute": "x-vz-nbd-server-add-bitmap",
        "arguments": {
            "name": "sda-8fc02db3-166f-4de7-b7aa-1f7303e6162b",
            "bitmap": "libvirt-__export_temporary__",
            "bitmap-export-name": "d068765e-8b50-4d74-9b72-1e55c663cbf8"
        },
Adding a bitmap to a server is what would advertise to the NBD client
that it can query the
"qemu-dirty-bitmap:d068765e-8b50-4d74-9b72-1e55c663cbf8" namespace
during NBD_CMD_BLOCK_STATUS, rather than just "base:allocation"?
I guess so. I know neither the NBD protocol nor its extensions.

Vova, can you clarify?
...
...
}
-  remove snapshot 8fc02db3-166f-4de7-b7aa-1f7303e6162b from export
    - same as without checkpoint but additionally remove temporary bitmap
...
    {
        "arguments": {
            "name": "libvirt-__export_temporary__",
            "node": "drive-scsi0-0-0-0"
        },
        "execute": "block-dirty-bitmap-remove"
    }
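
The export sequence above (add a temporary bitmap, disable it, merge the
per-checkpoint bitmaps into it, and remove it after the export is gone) could
be generated roughly as below. This is only a sketch: the short bitmap names
"libvirt-c1" and "libvirt-c2^c1" are placeholders standing in for the UUID
based names, the helper functions are hypothetical, and the x-vz-* commands
are the non-upstream ones from the examples, so nothing here is an upstream
qemu interface.

```python
import json

NODE = "drive-scsi0-0-0-0"
TEMP = "libvirt-__export_temporary__"


def action(kind, **data):
    """Build one transaction action (hypothetical helper)."""
    return {"type": kind, "data": data}


def build_export_bitmap_transaction(node, temp_name, source_bitmaps):
    """Create a disabled temporary bitmap and merge the sources into it."""
    actions = [
        action("block-dirty-bitmap-add",
               node=node, name=temp_name, persistent=False),
        # x-vz-* commands come from the examples in this RFC, not upstream.
        action("x-vz-block-dirty-bitmap-disable", node=node, name=temp_name),
    ]
    for src in source_bitmaps:
        actions.append(action("x-vz-block-dirty-bitmap-merge",
                              node=node, src_name=src, dst_name=temp_name))
    return {"execute": "transaction", "arguments": {"actions": actions}}


def build_remove_temp_bitmap(node, temp_name):
    """Command that drops the temporary bitmap once the export is removed."""
    return {"execute": "block-dirty-bitmap-remove",
            "arguments": {"name": temp_name, "node": node}}


# Placeholder names: c1 is the checkpoint we export changes from,
# c2 is the checkpoint created after it.
cmd = build_export_bitmap_transaction(
    NODE, TEMP, ["libvirt-c1", "libvirt-c2^c1"])
print(json.dumps(cmd, indent=4))
print(json.dumps(build_remove_temp_bitmap(NODE, TEMP), indent=4))
```

Building the commands as dictionaries and serializing them with json.dumps
avoids the missing and trailing commas that are easy to introduce when such
transactions are written by hand.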
- delete checkpoint 0044757e-1a2d-4c2c-b92f-bb403309bb17
    (a similar operation is described in the section about the naming scheme for bitmaps,
     with the difference that K+1 is N here and thus the new bitmap should not be disabled)
A suggestion on the examples - while UUIDs are nice and handy for
management tools, they are a pain to type and for humans to quickly
read.  Is there any way we can document a sample transaction stream with
all the actors involved (someone issues a libvirt API call XYZ, libvirt
in turn issues QMP command ABC), and using shorter names that are easier
to read as humans?
Sure. I'll definitely do so in the next round of the RFC, if there is one)
...
...
    {
        "arguments": {
            "actions": [
                {
                    "type": "block-dirty-bitmap-add",
                    "data": {
                        "node": "drive-scsi0-0-0-0",
                        "name": "libvirt-8fc02db3-166f-4de7-b7aa-1f7303e6162b^d068765e-8b50-4d74-9b72-1e55c663cbf8",
                        "persistent": true
                    }
                },
                {
                    "type": "x-vz-block-dirty-bitmap-merge",
                    "data": {
                        "node": "drive-scsi0-0-0-0",
                        "src_name": "libvirt-0044757e-1a2d-4c2c-b92f-bb403309bb17^d068765e-8b50-4d74-9b72-1e55c663cbf#
                        "dst_name": "libvirt-d068765e-8b50-4d74-9b72-1e55c663cbf8"
                    }
                },
                {
                    "type": "x-vz-block-dirty-bitmap-merge",
                    "data": {
                        "node": "drive-scsi0-0-0-0",
                        "src_name": "libvirt-8fc02db3-166f-4de7-b7aa-1f7303e6162b^0044757e-1a2d-4c2c-b92f-bb403309bb1#
                        "dst_name": "libvirt-8fc02db3-166f-4de7-b7aa-1f7303e6162b^d068765e-8b50-4d74-9b72-1e55c663cbf#
                    }
                }
            ]
        },
        "execute": "transaction"
    }
    {
        "execute": "x-vz-block-dirty-bitmap-remove",
        "arguments": {
            "node": "drive-scsi0-0-0-0",
            "name": "libvirt-8fc02db3-166f-4de7-b7aa-1f7303e6162b^0044757e-1a2d-4c2c-b92f-bb403309bb17"
        }
    }
    {
        "execute": "x-vz-block-dirty-bitmap-remove",
        "arguments": {
            "node": "drive-scsi0-0-0-0",
            "name": "libvirt-0044757e-1a2d-4c2c-b92f-bb403309bb17^d068765e-8b50-4d74-9b72-1e55c663cbf8"
        }
    }
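
To make the effect of the checkpoint deletion above easier to follow, here is
a toy model of it. It is a sketch based on my reading of the transaction, not
qemu code: each bitmap is modelled as a set of dirty cluster indices, merge is
a set union, and c1, c2, c3 are short placeholder names for checkpoints
d068765e..., 0044757e... and 8fc02db3... respectively. The cluster numbers
are made up.

```python
def merge(bitmaps, src_name, dst_name):
    """Model of a bitmap merge: dst becomes the union of dst and src."""
    bitmaps[dst_name] |= bitmaps[src_name]


# State before deleting checkpoint c2 (chain is c1 <- c2 <- c3).
bitmaps = {
    "libvirt-c1": {0, 1},      # clusters changed between c1 and c2
    "libvirt-c2^c1": {1, 5},   # clusters changed between c2 and c3
    "libvirt-c3^c2": {7},      # active bitmap: changes since c3
}

# Transaction part: the new active bitmap is named after its new parent c1.
bitmaps["libvirt-c3^c1"] = set()
merge(bitmaps, "libvirt-c2^c1", "libvirt-c1")
merge(bitmaps, "libvirt-c3^c2", "libvirt-c3^c1")

# Follow-up commands: drop the bitmaps whose names mention c2.
del bitmaps["libvirt-c3^c2"]
del bitmaps["libvirt-c2^c1"]

for name in sorted(bitmaps):
    print(name, sorted(bitmaps[name]))
```

After the deletion "libvirt-c1" covers everything changed between c1 and c3,
and the active bitmap keeps its contents under a name that reflects its new
parent.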
Here is a list of bitmap commands used in the implementation but not yet upstream (AFAIK).
x-vz-block-dirty-bitmap-remove
x-vz-block-dirty-bitmap-merge
x-vz-block-dirty-bitmap-disable
x-vz-block-dirty-bitmap-enable (not in the examples; used when removing most recent checkpoint)
x-vz-nbd-server-add-bitmap
*Restore operation nuances*
As written above, to restore a domain one needs to start it in a paused
state, export the domain's disks and write them from the backup. However, qemu currently
does not allow exporting disks for write, even for a domain whose guest CPUs have never started.
We have an experimental qemu command-line option, -x-vz-nbd-restore (passed together
with the -incoming option), to fix this.
Why can't restore be done while the guest is offline?  (Oh right, we
still haven't added decent qemu-img support for bitmap manipulation, so
we need a qemu process around for any bitmap changes).
As I understand it, the point of bitmaps and snapshots is to create an
NBD server that a third-party can use to read just the dirty portions of
a disk in relation to a known checkpoint, to save off data in whatever
form it wants; so you are right that the third party then needs a way to
rewrite data from whatever internal form it stored it in back to the
view that qemu can consume when rolling back to a given backup, prior to
starting the guest on the restored data.  Do you need additional libvirt
APIs exposed for this, or do the proposed APIs for adding snapshots
cover everything already with just an additional flag parameter that
says whether the <domainblocksnapshot> is readonly (the third-party is
using it for collecting the incremental backup data) or writable (the
third-party is actively writing its backup into the file, and when it is
done, then perform a block-commit to merge that data back onto the main
qcow2 file)?
We don't need snapshots for restore at all. Restore is described at the very
top of the document:

and typical actions on domain restore:                                                           

- start domain in paused state

  Here we use virDomainCreateXML/virDomainCreate with VIR_DOMAIN_START_PAUSED and
  VIR_DOMAIN_START_EXPORTABLE set. The latter is a new flag described in the *Nuances* section.

- export domain disks of interest thru NBD for write

  Here we use the following XML for virDomainBlockExportStart/virDomainBlockExportStop.

  <domainblockexport type="nbd">
      <address type="ip" host="0.0.0.0" port="8000"/>
      <disk name="sda"/>
      <disk name="sdb"/>
  </domainblockexport>

- restore them                                                                                   
- remove disks from export                                                                       
- resume or destroy domain

So to be able to restore, the only addition we need is VIR_DOMAIN_START_EXPORTABLE when starting a domain.
The export API is the same; the XML just specifies plain disks without snapshots/checkpoints.

Note that VIR_DOMAIN_START_EXPORTABLE is a kind of workaround. I'm not sure it should
be in the API. We would not need this flag if qemu let us export disks for write for a
freshly started domain in paused state.
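
For illustration, a client could build the restore export XML shown above
along these lines. This is a minimal sketch: build_block_export_xml is a
hypothetical helper, and since virDomainBlockExportStart is only proposed in
this RFC, the sketch stops at producing the XML and does not call libvirt.

```python
import xml.etree.ElementTree as ET


def build_block_export_xml(host, port, disks):
    """Return <domainblockexport> XML for plain disks (no snapshots)."""
    root = ET.Element("domainblockexport", {"type": "nbd"})
    ET.SubElement(root, "address",
                  {"type": "ip", "host": host, "port": str(port)})
    for name in disks:
        # Plain disk names only: restore needs no snapshot or checkpoint.
        ET.SubElement(root, "disk", {"name": name})
    return ET.tostring(root, encoding="unicode")


xml_text = build_block_export_xml("0.0.0.0", 8000, ["sda", "sdb"])
print(xml_text)
```

The resulting string would then be passed to the proposed export call on a
domain started with VIR_DOMAIN_START_PAUSED and VIR_DOMAIN_START_EXPORTABLE.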

Nikolay