On Tue, Mar 12, 2019 at 9:49 PM Eric Blake <eblake@redhat.com> wrote:
On 3/12/19 1:10 PM, Daniel P. Berrangé wrote:
> On Mon, Mar 11, 2019 at 09:38:31PM -0500, Eric Blake wrote:
>> While looking at my work on incremental backups, Nir raised a good
>> point: if we want to recreate a set of known checkpoints on one
>> machine that will be taking over a domain from another machine,
>> my initial proposal required making multiple API calls to list the
>> XML for each checkpoint on the source, and then more API calls to
>> redefine each checkpoint on the destination; it also had the drawback
>> that the list has to be presented in topological order (the API won't
>> let you define a child checkpoint if the parent is not defined first).
>> He asked if I could instead add bulk APIs, for getting the XML for
>> all checkpoints at once, and then for redefining all checkpoints at
>> once.
>
> To me the key blocking problem here is the topological sorting one
> as it makes it incredibly painful to run individual APIs. Your other
> series on list addresses this nicely by adding the topological flag.
>
> Using that flag, in pseudo-Python-like code the process of exporting
> and importing snapshots from one host to another should look approx
> like this
>
>   snapshots = srcdom.ListAllSnapshots(libvirt.VIR_DOMAIN_SNAPSHOT_LIST_TOPOLOGICAL)
>   for snapshot in snapshots:
>       xml = snapshot.XMLDesc()
>       dstdom.SnapshotCreateXML(xml)
>   
> I don't really see a reason why this is so hard that it justifies
> adding these bulk snapshot list/define  APIs.
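For reference, that loop fleshed out with the libvirt-python method names (listAllSnapshots, getXMLDesc, snapshotCreateXML) would look roughly like the sketch below. The domain objects are duck-typed so the logic can be shown without a live hypervisor; with real libvirt you would pass libvirt.VIR_DOMAIN_SNAPSHOT_LIST_TOPOLOGICAL as list_flags and libvirt.VIR_DOMAIN_SNAPSHOT_CREATE_REDEFINE as redefine_flags.

```python
# Sketch of the export/import loop in libvirt-python style. Because the
# source list is topologically ordered, every parent snapshot is
# redefined on the destination before any of its children.

def migrate_snapshot_metadata(srcdom, dstdom, list_flags, redefine_flags):
    """Copy all snapshot metadata from srcdom to dstdom, parents first."""
    for snap in srcdom.listAllSnapshots(list_flags):
        dstdom.snapshotCreateXML(snap.getXMLDesc(), redefine_flags)
```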

Nir, do you want to chime in here?

We don't have a need to list or define snapshots, since we manage snapshots on the oVirt side.
We want an API to list and redefine checkpoints.

Our use case is:

1. Start the VM on some host
2. When oVirt starts a backup, it adds a checkpoint in the oVirt database
3. When libvirt starts a backup, libvirt adds the checkpoint oVirt created
4. Steps 2 and 3 are repeated while the VM is running...
5. Stop the VM - at this point we undefine the VM, so libvirt loses all info about checkpoints
6. Start the VM on some host
7. oVirt uses the bulk checkpoint API to redefine all checkpoints in the oVirt database
8. If there is an unknown bitmap on a disk, or a bitmap is missing, it means that someone
   modified the disk behind oVirt's back, and oVirt will delete the affected checkpoints
   and use the libvirt checkpoint API to delete the checkpoints and bitmaps. This forces
   a full backup for the next backup of the affected disks.

We would like libvirt to fail to redefine the checkpoints if the info in the checkpoints
does not match the bitmaps on the disks mentioned in the checkpoints. To do this,
libvirt must have the complete checkpoint state; it cannot validate if we define
checkpoints one by one.
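The consistency check oVirt wants can be sketched as a set comparison between the (disk, bitmap) pairs recorded in the checkpoints and the bitmaps actually present on the disks. This is a hypothetical illustration, not libvirt code: the dicts stand in for data that real code would read from checkpoint XML and from qemu bitmap queries, and all names are invented.

```python
# Compare checkpoint-recorded bitmaps against bitmaps found on disk.
# A "missing" entry means a checkpoint references a bitmap that is gone;
# an "unknown" entry means a bitmap exists that no checkpoint accounts
# for. Either way, someone modified the disk behind the manager's back.

def validate_checkpoints(checkpoints, disk_bitmaps):
    """checkpoints: list of {"name", "disk", "bitmap"} dicts.
    disk_bitmaps: dict mapping disk name -> set of bitmap names.
    Returns (missing, unknown) as sets of (disk, bitmap) pairs."""
    expected = {(c["disk"], c["bitmap"]) for c in checkpoints}
    actual = {(disk, b) for disk, bitmaps in disk_bitmaps.items()
              for b in bitmaps}
    return expected - actual, actual - expected

checkpoints = [
    {"name": "ckpt-1", "disk": "sda", "bitmap": "ckpt-1"},
    {"name": "ckpt-2", "disk": "sda", "bitmap": "ckpt-2"},
]
disk_bitmaps = {"sda": {"ckpt-1", "stray"}}

missing, unknown = validate_checkpoints(checkpoints, disk_bitmaps)
print(missing)  # {('sda', 'ckpt-2')} - bitmap lost, checkpoint invalid
print(unknown)  # {('sda', 'stray')} - bitmap nobody knows about
```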

If libvirt cannot fail, we need a way to list all checkpoints and bitmaps, so we can
sync the oVirt database with reality.

If we don't validate the bitmaps against the expected bitmaps, the next backup may fail
when starting (good), or succeed while including wrong data in the backup (bad).

It also does not make sense to redefine checkpoint state in an incremental way; this is
like building a VM XML with many devices using one call per device.

The same checkpoint management will have to be duplicated in any libvirt client that
manages more than one host (e.g. KubeVirt, OpenStack), since libvirt does not handle
more than one host.

For more info see

>
> At the same time there is a very real downside to providing these bulk
> snapshot APIs in terms of scalability. The <domain> XML for a guest
> can get very large. The <domainsnapshot> XML contains a copy of the
> <domain> XML from the point in time of the snapshot. Therefore a
> <snapshots> XML document is going to contain *many* <domain> XML
> documents concatenated.
>
> For a domain with many devices and many snapshots, the bulk list
> APIs have non-negligible risk of hitting the RPC protocol limit
> on string size.

I did not consider that, but you are definitely correct about it being a
real problem (I didn't trip over it in my small domain testing, but
combined XML is definitely a repetition of a <domain> per <domainsnapshot>).

One idea I had on the checkpoint series was adding a NO_DOMAIN flag
which outputs much more compact XML by omitting the <domain> element.
With checkpoints, the <domain> at the time of the checkpoint is less
important (useful to know if we need it, but we can't revert to that
state); but with snapshots, the <domain> at the time of the snapshot is
very important (to the point that you have to request a flag for an
unsafe revert in virDomainSnapshotRevert() if the <domain> information
was incomplete).  So using a NO_DOMAIN flag to minimize XML so that the
bulk output doesn't exceed RPC limits is a possibility, but not a very
nice one (as you'd STILL have to do a one-API-per-snapshot call to
reinstate the <domain> of any snapshot that you plan to revert to).

>
> We've seen this before with the bulk domain stats APIs we added.
> We nonetheless still added the bulk domain stats APIs because
> it was a compelling performance benefit for something that is run
> very frequently. For apps to be scalable though, they have to
> limit how many guests they request bulk stats for at any time.
> This in fact makes the code using the bulk stats APIs more complex
> than if it just called the individual APIs. The performance benefit
> is still a win though in this case.
>
> In the bulk snapshot import/export case though, I'm not seeing a
> compelling performance reason for their usage. Making the APIs
> scalable would require to limit how many snapshots are returned
> in any single call. There would then need to be repeated calls
> to fetch the rest. This makes it more complicated than just
> fetching/defining each individual snapshot.

And the frequency of migrating snapshots is less than the frequency of
querying bulk stats. It makes the most sense to speed up APIs called
frequently.

>
> IOW, overall I'm not seeing a big enough justification to add these
> new APIs, that would outweigh the downsides they bring in terms of
> scalability.

You raise some good points.

>
> The topological sorting flag solves the really big pain point well
> enough.

The topological sorting thing kind of fell into my lap as I was working on
the bulk code, but now that I've written it, it definitely solves a
problem that we previously had, even if I had not properly identified
the problem (randomized list ordering is painful to work with).  Either
redefine has to accept randomized ordering (which it currently doesn't),
or listing should output in sane ordering (which topological adds) - and
once we have topological output, subsequent input gets a LOT easier,
even if split across multiple API calls.
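The "parents before children" property the topological flag provides can be shown with a small stand-alone sketch (invented names, not libvirt code): given snapshots in arbitrary order with parent links, only a parent-first ordering lets a one-at-a-time redefine loop succeed.

```python
# Order snapshots so that every parent precedes its children, the
# property a redefine loop needs. Input: dict of name -> parent name
# (None for a root). Repeatedly emit entries whose parent is already
# placed; if a full pass makes no progress, the hierarchy is broken.

def topological_order(snapshots):
    ordered = []
    placed = set()
    pending = dict(snapshots)
    while pending:
        progress = False
        for name, parent in list(pending.items()):
            if parent is None or parent in placed:
                ordered.append(name)
                placed.add(name)
                del pending[name]
                progress = True
        if not progress:
            raise ValueError("cycle or missing parent: %s" % sorted(pending))
    return ordered

# A chain s1 <- s2 <- {s3, s4}, listed in randomized order:
print(topological_order({"s3": "s2", "s1": None, "s2": "s1", "s4": "s2"}))
```

Randomized input order is exactly why redefining one by one used to be painful: without this guarantee the caller has to re-sort before replaying the definitions.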

>
>> I'm also toying with the idea of followup patches that store all
>> snapshots alongside the persistent <domain> XML in a single file,
>> rather than in one snapshot per <domainsnapshot> file (we'd still
>> support reading from the old method at libvirtd startup, but would
>> not need to output in that manner any more).
>
> What would be the benefit to having it all in one file ? During
> initial libvirtd startup we could read the info from one files,
> but we still need code to read the info from many files so we're
> not saving code in that respect. Most of the snapshot APIs that
> save XML to disk only care about a single <domainsnapshot>
> document so wouldn't benefit from having all snapshots in one
> place.

One possible benefit: in the current
virDomainSnapshotCreateXML(REDEFINE) code, we potentially end up
changing three separate snapshots in one call: if we are redefining a
replacement snapshot (maybe supplying a <domain> subelement that was
previously missing), AND marking the redefinition as the new current
element, it requires rewriting the XML file of the previous current
snapshot (to drop an internal-only <active>1</active> marker), deleting
the XML file of the previous state of the snapshot being modified, and
writing the new XML file of the redefined snapshot.  If any one of those
steps fails, it's very hard to roll back to the earlier state.  But if we
store all snapshots in a single file using the <snapshots
current='name'> (rather than a per-snapshot <active>1</active> marker),
we only have to modify a single file when making any changes to the
snapshot hierarchy.  If we fail at any earlier step of the way, we have
not partially corrupted the snapshot hierarchy in a manner that is hard
to roll back.

Another (weak) benefit - right now, qemu snapshots cannot be named
"../blah", because that would create a filename that escapes the proper
subdirectory.  With a single <snapshots>, you can allow more flexible
snapshot names, as those names no longer influence host file names.

>
> Re-writing the entire <domain> XML means that a host failure during
> a snapshot save can put the entire domain definition at risk of loss.
> Though we do try to mitigate that by writing to a temp file and doing
> an atomic rename, that doesn't eliminate the risk entirely.

Okay, so it's probably better to store the <snapshots> XML in a file
separate from the <domain> itself (failure to write the snapshots file
may lose the snapshots, but does not lose the domain), but still, I
can't help but think that two files is potentially nicer than a
subdirectory full of one file per snapshot.
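The temp-file-plus-atomic-rename pattern Daniel mentions is what makes the single-file approach safe: the file either fully contains the old <snapshots> document or fully contains the new one, never a half-written mix. A minimal Python sketch of the pattern (libvirt's internal C helpers spell it differently, but the failure property is the same):

```python
import os
import tempfile

# Write data to path so that a crash at any point leaves either the old
# complete file or the new complete file, never a truncated one.

def atomic_write(path, data):
    dirname = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=dirname, prefix=".snapshots-")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # data reaches disk before the rename
        os.replace(tmp, path)     # atomic on POSIX filesystems
    except BaseException:
        os.unlink(tmp)            # failed mid-write: discard the temp file
        raise
```

With all snapshots in one document, a redefine that touches three snapshots becomes one atomic_write call instead of three separate file operations that can fail halfway through.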

Unless Nir speaks up with strong reasons in favor of new bulk APIs
(rather than living with the new topological sort flags), I'm okay
dropping this series. 

I'm ok with dropping bulk snapshots, but I think libvirt should provide
bulk APIs for checkpoints.

Nir
 
The patches that are already in (the head of the
v3 posting of this series, for generating/parsing <snapshots>) may still
prove to be useful if you agree with me that a single file for ALL
snapshots is more robust than one file per snapshot, particularly when
we have existing APIs that already have a hard time rolling back to a
sane state on partial failure when manipulating more than one file. But
as I have not yet posted any series showing how that would work, that
doesn't change that these v4 patches are probably not worth pursuing
further.

And, since it is more important to get new API in by 5.2 (since new
APIs are difficult to backport) than it is to do internal-only changes (a
subdirectory of multiple files vs. one <snapshots> file is not
user-visible, and can be backported without as much hassle), I'll focus
more on the checkpoint APIs rather than any refactoring of internals. So
that may mean that 5.2 is released with my <snapshots> format/parse code
with no consumers of that code, but I don't think it hurts enough to
revert them (as we may end up with a consumer for that code, even if not
in 5.2).

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org