On Tue, Mar 12, 2019 at 11:52 PM Eric Blake <eblake@redhat.com> wrote:
On 3/12/19 4:35 PM, Nir Soffer wrote:

>>> We don't have a need to list or define snapshots since we manage
>>> snapshots on the oVirt side.
>>> We want an API to list and redefine checkpoints.
>>> We want an API to list and redefine checkpoints.
>>
>> But the proposed <domaincheckpoint> also has a <domain> subelement, so
>> it has the same problem (the XML for a bulk operation can become
>> prohibitively large).
>>
>
> Why do we need <domain> in a checkpoint?

The initial design for snapshots did not have <domain>, and it bit us
hard; you cannot safely revert to a snapshot if you don't know the state
of the domain at that time. The initial design for checkpoints has thus
mandated that <domain> be present (my v5 series fails to redefine a
snapshot if you omit <domain>, even though it has a NO_DOMAIN flag
during dumpxml for reduced output).  If we are convinced that defining a
snapshot without <domain> is safe enough, then I can relax the
checkpoint code to allow redefined metadata without <domain> the way
snapshots already have to do it, even though I was hoping that
checkpoints could start life with fewer back-compats than snapshots have
had to carry along.  But I'd rather start strict and relax later
(require <domain> and then remove it when proven safe), and not start
loose (leave <domain> optional, and then wish we had made it mandatory).

You keep mentioning snapshots, but we are not going to redefine snapshots,
only checkpoints.

We need to do this only because qcow2 does not store a chain of bitmaps, but
un-ordered bitmaps without any metadata, so we need to remind libvirt which
bitmaps we have in every disk, and what the order of the bitmaps is. We would be
happy if we could just specify the bitmaps and disks in the backup API, and avoid
rebuilding the checkpoint history, just like we avoid rebuilding the snapshot
history.
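
For illustration only, a hypothetical backup request along these lines could
name the bitmaps per disk directly, so that libvirt would not need any
checkpoint history (this is not the proposed API, just a sketch of what would
be sufficient for us):

<domainbackup mode='pull'>
  <disks>
    <disk name='sda'>
      <!-- bitmaps to merge and expose, oldest first -->
      <bitmap name='1525111885'/>
      <bitmap name='1525889631'/>
    </disk>
  </disks>
</domainbackup>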

> You can see here what we store in a checkpoint (based on your patches v2 or
> v3):
> https://ovirt.org/develop/release-management/features/storage/incremental-backup.html#engine-database-changes
>
> Which is:
>
> vm_checkpoints table
>
> - checkpoint_id: UUID
> - parent: UUID
> - vm_id: UUID
> - creation_date: TIMESTAMP
>
> vm_checkpoint_disk_map table
> - disk_id: UUID
> - checkpoint_id: UUID
>

And no state of the <domain> at the time the checkpoint was created?

No, but the configuration of the VM is stored in oVirt, so the backup application
can get this configuration from oVirt at the time of the backup.

>>
>>> Our use case is:
>>>
>>> 1. Start VM on some host
>>> 2. When oVirt starts a backup, it adds a checkpoint in oVirt database
>>
>> And does this database preserve topological ordering? If not, why not?
>>
>
> Using the parent UUID we can create the correct chain in the right order,
> and this is what we send to vdsm. Based on this, vdsm will generate the XML
> for libvirt.
>
> There is no domain info related to these checkpoints, and I hope we are not
> adding such requirements at this stage.

Rather, that requirement has been there in all of my drafts, and it
sounds like you are trying to get me to relax that requirement to not
need a <domain> during checkpoint metadata redefine.

We never understood this requirement from the html documentation built from your patches.

Looking again in v3 - I see:

    domain
    The inactive domain configuration at the time the checkpoint was created. Readonly.

And we have an example:

<domaincheckpoint>
  <description>Completion of updates after OS install</description>
  <disks>
    <disk name='vda' checkpoint='bitmap'/>
    <disk name='vdb' checkpoint='no'/>
  </disks>
</domaincheckpoint>

Which does not have a domain, since it is read-only.

There is an example of the xml we get from libvirt:

<domaincheckpoint>
  <name>1525889631</name>
  <description>Completion of updates after OS install</description>
  <creationTime>1525889631</creationTime>
  <parent>
    <name>1525111885</name>
  </parent>
  <disks>
    <disk name='vda' checkpoint='bitmap' bitmap='1525889631'/>
    <disk name='vdb' checkpoint='no'/>
  </disks>
  <domain type='qemu'>
    <name>fedora</name>
    <uuid>93a5c045-6457-2c09-e56c-927cdf34e178</uuid>
    <memory>1048576</memory>
    ...
    <devices>
      <disk type='file' device='disk'>
        <driver name='qemu' type='qcow2'/>
        <source file='/path/to/file1'/>
        <target dev='vda' bus='virtio'/>
      </disk>
      <disk type='file' device='disk' snapshot='external'>
        <driver name='qemu' type='raw'/>
        <source file='/path/to/file2'/>
        <target dev='vdb' bus='virtio'/>
      </disk>
      ...
    </devices>
  </domain>
</domaincheckpoint>

But as the domain is read-only, we assumed that we don't need to pass it in the
domaincheckpoint xml. We also don't need it since we keep the disk UUIDs
included in every checkpoint, and from this we can find the information needed
to generate the tiny checkpoint xml above.
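
For example, a minimal sketch of generating that xml from what we store,
assuming records shaped like the vm_checkpoints and vm_checkpoint_disk_map
tables above (the devices mapping, disk UUID to current device name, is
sketched further below):

def checkpoint_xml(checkpoint, disks, devices):
    """Build the minimal redefine xml for one checkpoint.

    checkpoint: a vm_checkpoints row (checkpoint_id, parent,
    creation_date as a unix timestamp).
    disks: list of (disk_id, bitmap) pairs for this checkpoint.
    devices: mapping of disk UUID -> current device name ("sda").
    """
    lines = ["<domaincheckpoint>",
             "  <name>%s</name>" % checkpoint["checkpoint_id"]]
    if checkpoint["parent"]:
        lines += ["  <parent>",
                  "    <name>%s</name>" % checkpoint["parent"],
                  "  </parent>"]
    lines += ["  <creationTime>%d</creationTime>" % checkpoint["creation_date"],
              "  <disks>"]
    for disk_id, bitmap in disks:
        lines.append("    <disk name='%s' checkpoint='bitmap' bitmap='%s'/>"
                     % (devices[disk_id], bitmap))
    lines += ["  </disks>", "</domaincheckpoint>"]
    return "\n".join(lines)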

Can you explain why the historic domain xml is needed to perform a backup?

The caller of the API is responsible for giving you the list of bitmaps that should
be included for every disk. Libvirt needs to create a new active bitmap and start
the NBD server exposing the bitmaps specified by the caller. How is having the xml
for every checkpoint/bitmap needed to perform these steps?

Note also that libvirt does not have snapshot history on an oVirt host, since we manage
snapshots externally. When libvirt starts a VM, the only thing it knows about snapshots
is the chain of disks. Even if we restored the domain xml for every checkpoint, libvirt
could not validate it against the non-existing snapshots.

While we have been using persistent VMs since 4.2, our VMs are actually transient. We
started to use persistent VMs because they are better tested, and we can use the VM
<metadata> to persist stuff, instead of maintaining our own temporary metadata
persistence solution.

>> That's a different feature request - you're asking for
>> virDomainCheckpointCreateXML(REDEFINE) to have some additional flag
>> where the metadata re-creation fails if the underlying bitmap is
>> corrupt/missing. Which may still be a good flag to add, but is something
>> you can do with one-at-a-time checkpoint redefinitions, and doesn't, in
>> itself, require bulk redefinition.
>>
>
> Checking for unknown bitmap on disk is not possible unless you have the
> complete
> list of checkpoints. Unless you add something like:
>
>     virDomainCheckpointValidate()
>
> Which will fail if the checkpoints known to libvirt do not match the
> bitmaps in the
> underlying qcow2 images.

Yes, an API like that may be better than trying to validate on a
per-redefine basis whether libvirt metadata matches qcow2 metadata.

>
> But even if we add this, the user will have to do:
>
>     try:
>         for checkpoint in orderdered_checkpoints:
>            generate xml...
>            redefine checkpoint...
>     except ...:
>         remove all checkpoints...
>
>     validate checkpoints

Yes, but that's not hard. And it's no different whether the 'try/except'
code is a single libvirt API (with bulk redefine) or multiple libvirt
APIs (no bulk redefine, but easy enough to one-at-a-time redefine in
topological order).  Why you have to generate the XML instead of reusing
the XML that libvirt produces is something I'm not sure of (that is,
what's wrong with using libvirt's dumpxml output as your input, rather
than trying to reconstruct it on the fly?).

The xml libvirt produces may not be stable. For example, the disk with uuid
"22e5a7d6-5409-482d-bda4-f5939b27aad4" may be called "sda" now, but
when the vm was running on another host 2 weeks ago maybe it was called
"sdb", since the user added or removed some devices.

When we create the xml using the disk uuid, we will always refer to the current
name of the disk, regardless of the history of the vm.
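
A sketch of how the devices mapping used above could be built, assuming the
disk UUID is exposed in the <serial> element of the current domain xml (vdsm
stores the oVirt disk id there):

import xml.etree.ElementTree as ET

def current_devices(dom):
    """Map disk UUIDs to current device names (e.g. "sda") by reading
    the live domain xml; assumes the disk UUID is in <serial>."""
    devices = {}
    root = ET.fromstring(dom.XMLDesc())
    for disk in root.findall("./devices/disk"):
        serial = disk.findtext("serial")
        if serial is not None:
            devices[serial] = disk.find("target").get("dev")
    return devices
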
> Unless the xml for a single checkpoint is big, I don't see why all libvirt
> users
> need to redefine checkpoints one by one and handle all the possible errors
> (e.g. some checkpoints redefined, some not).

As currently proposed, <domaincheckpoint> includes a mandatory <domain>
sub-element on redefinition, which means that the XML _is_ big for a
series of checkpoints.  It also means that you're better off storing the
XML produced by libvirt than trying to regenerate it by hand, when it
comes to metadata redefinition.

>
> Note that vdsm may be killed in the middle of the redefine loop, and in
> this case
> we leave libvirt with partial info about checkpoints, and we need to
> redefine
> the checkpoints again handling this partial state.

But that's relatively easy - if you don't know whether libvirt might
have partial data, then wipe the data and start the redefine loop from
scratch.

But this easy boilerplate will be duplicated in all libvirt clients, instead of
having a single API to restore the state.
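
For reference, here is roughly the boilerplate every client would carry,
sketched with the API names proposed in this series (final names and flags
may differ), including rebuilding the order from the parent UUIDs:

import libvirt

REDEFINE = libvirt.VIR_DOMAIN_CHECKPOINT_CREATE_REDEFINE
METADATA_ONLY = libvirt.VIR_DOMAIN_CHECKPOINT_DELETE_METADATA_ONLY

def order_checkpoints(rows):
    """Order vm_checkpoints rows parent-first using the parent UUID;
    each checkpoint has at most one child, so the chain is linear."""
    by_parent = {row["parent"]: row for row in rows}
    ordered = []
    parent = None  # the root checkpoint has no parent
    while parent in by_parent:
        row = by_parent[parent]
        ordered.append(row)
        parent = row["checkpoint_id"]
    return ordered

def restore_checkpoints(dom, rows, build_xml):
    """Wipe whatever libvirt has (it may be partial after a crash) and
    redefine all checkpoints in topological order."""
    for cp in dom.listAllCheckpoints():
        cp.delete(METADATA_ONLY)
    try:
        for row in order_checkpoints(rows):
            dom.checkpointCreateXML(build_xml(row), REDEFINE)
    except libvirt.libvirtError:
        # Do not leave a partial chain behind.
        for cp in dom.listAllCheckpoints():
            cp.delete(METADATA_ONLY)
        raise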

Of course, if we must keep the historic domain configuration for every checkpoint,
a bulk API does not make sense, but I don't understand this requirement.

Nir