On Tue, Mar 12, 2019 at 04:52:24PM -0500, Eric Blake wrote:
On 3/12/19 4:35 PM, Nir Soffer wrote:
>>> We don't have a need to list or define snapshots since we manage snapshots
>>> on the oVirt side.
>>> We want an API to list and redefine checkpoints.
>>
>> But the proposed <domaincheckpoint> also has a <domain> subelement, so
>> it has the same problem (the XML for a bulk operation can become
>> prohibitively large).
>>
>
> Why do we need <domain> in a checkpoint?
The initial design for snapshots did not have <domain>, and it bit us
hard; you cannot safely revert to a snapshot if you don't know the state
of the domain at that time. The initial design for checkpoints has thus
mandated that <domain> be present (my v5 series fails to redefine a
snapshot if you omit <domain>, even though it has a NO_DOMAIN flag
during dumpxml for reduced output). If we are convinced that defining a
snapshot without <domain> is safe enough, then I can relax the
checkpoint code to allow redefined metadata without <domain> the way
snapshots already have to do it, even though I was hoping that
checkpoints could start life with fewer back-compats that snapshot has
had to carry along. But I'd rather start strict and relax later
(require <domain> and then remove it when proven safe), and not start
loose (leave <domain> optional, and then wish we had made it mandatory).
Given that a guest domain XML can be changed at runtime at any point,
I don't see how omitting <domain> from the checkpoint XML is safe
in general. Even if apps think it is safe now and omit it, a future
version of the app might change in a way that makes omitting the
<domain> unsafe. If we didn't historically record the <domain> in
the checkpoint in the first place, then the new version of the app
is potentially in trouble. So I think it is good that we are strict
and mandate the <domain> XML even if it is not technically required
in some use cases.
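
As a rough illustration of the strict policy discussed above, a redefine
path could reject checkpoint XML that has no <domain> child. This is only
a sketch using Python's standard XML parser; the function name
require_domain is hypothetical and is not part of libvirt.

```python
import xml.etree.ElementTree as ET


def require_domain(checkpoint_xml: str) -> ET.Element:
    """Parse checkpoint XML and insist on a <domain> child (strict policy).

    Hypothetical helper for illustration; not a libvirt API.
    """
    root = ET.fromstring(checkpoint_xml)
    if root.tag != "domaincheckpoint":
        raise ValueError(f"expected <domaincheckpoint>, got <{root.tag}>")
    # Starting strict: a missing <domain> is an error, not a silent default.
    if root.find("domain") is None:
        raise ValueError("checkpoint XML lacks <domain>; refusing to redefine")
    return root
```

Relaxing this later only means dropping the second check, whereas adding
it later would invalidate metadata that was already saved without <domain>.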
> Note that vdsm may be killed in the middle of the redefine loop, and in this
> case we leave libvirt with partial info about checkpoints, and we need to
> redefine the checkpoints again, handling this partial state.
But that's relatively easy - if you don't know whether libvirt might
have partial data, then wipe the data and start the redefine loop from
scratch.
Of course the same failure scenario applies if libvirt is doing it via
a bulk operation. The redefine loop still exists, just inside libvirt
instead, which might be killed or die part way through. So you're not
really fixing a failure scenario, just moving the failure to a different
piece. That's no net win.
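
To make the "wipe and start from scratch" recovery concrete, here is a
minimal sketch. CheckpointStore is a hypothetical in-memory stand-in for
wherever the checkpoint metadata lives; it is not the libvirt API, and the
real calls would go where wipe() and redefine() are.

```python
class CheckpointStore:
    """Hypothetical in-memory stand-in for the checkpoint metadata store."""

    def __init__(self):
        self.names = []  # checkpoints currently defined, oldest first

    def wipe(self):
        # Drop all metadata, whatever partial state was left behind.
        self.names.clear()

    def redefine(self, name):
        self.names.append(name)


def restore_checkpoints(store, wanted):
    """Wipe existing metadata, then redefine every checkpoint in order.

    Because it always starts from empty, the result is the same whether or
    not an earlier run was killed part way through.
    """
    store.wipe()
    for name in wanted:
        store.redefine(name)


# Simulate an interrupted earlier run that defined only the first checkpoint.
store = CheckpointStore()
store.redefine("c1")
restore_checkpoints(store, ["c1", "c2", "c3"])
```

The loop has the same shape whether the caller runs it or a bulk operation
runs it internally; either one can be interrupted, and the recovery is the
same.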
Regards,
Daniel
--
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|