Re: [libvirt] [PATCH v6 8/9] backup: Introduce virDomainBackup APIs

27 Mar 2019


      On 3/26/19 1:13 PM, Eric Blake wrote:

TL;DR:

After more thought, I'm reluctant to make a domain job an opaque type.
I'm also leaning towards withdrawing virDomainListJobIds() from the set
of APIs I'm proposing for 5.2, on the grounds that with a documented
magic job id of '0' that means "the currently running job" for both
virDomainBackupGetXMLDesc() and virDomainBackupEnd(), we don't lose any
functionality over what qemu currently provides (of no parallel jobs)
while still having a sane API that can be properly used with non-zero
job ids in the future when we do add parallel job support.
...
...
...
Since multiple backup jobs can be run in parallel in the future (well,
qemu doesn't support it yet, but we don't want to preclude the idea),
virDomainBackupBegin() returns a job id, which can also be queried by
virDomainListJobIds(), and this job id must be used for
virDomainBackupGetXMLDesc() and virDomainBackupEnd(). In the future,
we may also extend other jobs (migration as the default global job
impacting virDomainJobStats(), and the various block jobs) to all have
ids, where the existing APIs act like thin wrappers around more
powerful APIs that support a job id everywhere.
This is the part that has me most worried about 5.2 - anything that we
find a future libvirt supplying about job management APIs will not be
backported without a .so bump if it didn't make it into 5.2. So the hope
is that whatever lands now is usable (even if only one job at a time) by
whatever downstreams pick 5.2 as their starting point, without wishing
we had additional Job apis available.

...
...
...
@@ -3231,6 +3234,18 @@ int virDomainGetJobStats(virDomainPtr domain,
                          unsigned int flags);
 int virDomainAbortJob(virDomainPtr dom);
+typedef struct _virDomainJobId virDomainJobId;
+typedef virDomainJobId *virDomainJobIdPtr;
+struct _virDomainJobId {
Shouldn't this be called just "virDomainJob" ? Id is just
one piece of info inside the struct.
Should we be making this struct opaque, and adding
virDomainJobGetID and virDomainJobGetType accessors,
and thne passing a virDomainJobPtr to the other APIs
instead of just an id ?
It's a late request on a late API addition, but it sounds reasonable. We
have the benefit that it is NOT a type that has to be backed by on-disk
XML, so while it requires more patching to datatypes.h and figuring out
how to express the type over RPC, it should be a lot less grunt-work
than what I've been doing with CheckpointXML and to a lesser extent
Backup XML.
...
It feels safer if virDomainBackupGetXMLDesc were
given the full virDomainJobPtr, as then it can
validate that the "type" field represents an
backup job. This could detect the case where a
stale job ID was passed in that now points to a
completely different job type.
Interesting idea; and not that much harder (s/int/virDomainJobPtr/).
Actually, it turns out to make life a bit more difficult if the type is
opaque.  My demo at KVM Forum showed the following:

$ virsh backup-begin $dom backup_pull_full.xml
Backup id 1 created
backup used description from 'backup_pull_full.xml'
$ # do thired-party copy
$ virsh backup-end $dom 1
Backup id 1 completed

Note the time and process separation between creation and completion; if
we make virDomainBackupEnd() take an opaque pointer, then the first
virsh instance has to call virDomainJobFree(job) prior to exiting (well,
that depends a lot on whether we make virDomainJobPtr light-weight
scalar-only that just records type/id where a leak is harmless, or
heavy-weight where it also records the virConnectionPtr and/or
virDomainPtr it was associated for better API safety, but where freeing
the pointer is important in a long-running client).  Then, the second
virsh call would have to call virDomainJobLookupId(domain, 1) to
re-obtain the proper virDomainJobPtr object to hand to
virDomainBackupEnd(domain, job), and then call virDomainJobFree(ptr)
again to conclude things.

Thus, making it an opaque type adds a lot of API overhead to using
things, compared to the interface that virsh wants to give the user of
just printing an integer and letting the user re-input that integer; we
also have to worry about whether multiple consecutive API calls can
result in TOCTTOU races that could confuse things.

You raised the question of whether storing type/id as an opaque type
would add any safety to prevent the user from requesting an API on an
unrelated job ID. But after thinking about it more, my argument is that
it doesn't really help - when calling virDomainBackupEnd(domain, 1),
we've already implied that we want to operate on backup job 1 (and throw
an error if there is no backup job with id 1, even if there is some
other job type that overlaps namespaces and has an id 1).  Calling
virDomainBackupEnd(domain, virDomainJobCompose(domain,
VIR_DOMAIN_JOB_BACKUP, 1)) adds verbosity, but I don't see it adding safety.

Along those lines we can look at file descriptors as an example - they
are plain integers, but still serve as an opaque type (you use
fcntl(F_GETFD/F_GETFL) to learn things about the fd, and have various
errors like ESPIPE or EISDIR when operating on the wrong id compared to
the function trying to do something on an fd).

Then there's the question of what precedence we have for opaque vs.
user-visible types in public headers. We already expose
virTypedParameter as a user-visible type rather than opaque, and the
more time I've spent on the code this afternoon playing with the idea of
making jobs full-bodied and opaque, the more I'm finding it to be too
much overhead.  For example, the python generator can easily support
ints, but needs a lot more modifications to support a new opaque type.
If we insist on making virDomainBackupBegin() return an opaque type
instead of an integer id, I'm very doubtful on whether the changes to
libvirt-python would be made in time for making the 5.2 release.

Thinking for the future, when we eventually do get around to adding full
parallel job ID support, it will touch a lot of APIs:
 - virDomainGetJobInfo(), virDomainGetJobStats(), and
virDomainAbortJob() - currently operate on a default job (in particular,
the only job actually affected by these are jobs started by
virDomainMigrate() and friends)
 - virDomainBlockJobAbort(), virDomainGetBlockJobInfo(),
virDomainBlockJobSetSpeed() - currently act on a single job per disk,
where multiple jobs can be run in parallel if you have multiple disks,
but again with no id; affects jobs started by virDomainBlockCopy(),
virDomainBlockCommit(), virDomainBlockPull(), and friends
 - my new virDomainBlockBackup() - which I'm trying to design for the
future by returning a job id
 - we may want to add a virDomainJobGetDefaultID(domain) or even
virDomainBlockJobGetID(domain, diskname) to make it easy to
reverse-engineer which job id was utilized by the legacy APIs, but at
the same time would want new APIs that return the job id directly when
creating one of those jobs
 - we may want to add a virDomainJobSetDefaultID(domain, id) to select
which job (out of multiple running in parallel) will be targetted by the
existing job functions that don't take a job id
 - side note: virDomainAbortJob() vs. virDomainBlockJobAbort() is
confusing; it's hard to remember which name form to use




So, that said, what are the options for 5.2?

If we commit virDomainBackupEnd(virDomainPtr, int job, flags) in time
for Wednesday's freeze, we can still change the API (go with a different
parameter type) and/or revert the addition up until Friday.

If the release happens with virDomainBackupEnd() but no
virDomainListAllJobs(), we can still run backup jobs (you just can't
easily query what job id was chosen, but have to assume a default job).
But I could also document that although virDomainBackupBegin() always
returns a non-zero job id on success, we guarantee that the reserved job
id '0' stands for "the currently-running job". If there is no active
backup job, or if there are more than one active backup jobs, then id
"0" will fail - but we can also document that if a server does not
support the virDomainListAllJobs() API, then that server does not
support parallel active backup jobs.  Thus, the use of a job id is a
formality for 5.2 (we know that qemu won't have parallel jobs, so we can
blindly pass 0 for the job id to all calls that document a job id - but
still code virsh to allow the user to pass job id '1' or whatever the
output from backup-begin said if they don't want to rely on the default
of job '0'). At the same time, down the road (in release 6.0 or whenever
we finally add parallel job ids) that we DO finally add
virDomainListAllJobs() and all the other APIs needed for true job
management, we've avoided needing to add wrapper APIs to fix
virDomainBackupBegin/End not working with job ids (even though we have
to add wrappers for all the other legacy job-related APIs that did not
work with ids from the get-go).

We could decide that a job id is too much for right now, and state that
virDomainBackupBegin() returns 0 on success rather than a job id (for
exactly one job possible), and where virDomainBackupEnd() no longer
needs a job id parameter (in other words, making it more like existing
virDomainMigrate()/virDomainAbortJob()). Then future parallel jobs would
have to add a new API for backup as well as everything else (less work
now, and at least consistent with other APIs, but more work down the
road). I'm less in favor of this option, because it feels like adding
technical debt.

If the release happens now with both virDomainBackupEnd() taking a job
id, AND the addition of a virDomainListAllJobs() API, then we are stuck
supporting the virDomainListAllJobs() API even if it turns out we don't
like the design it locked us into. But if we think we have a
sufficiently extensible design for that one API, even without the rest
of a nice parallel job framework, and feel we can reasonably add
remaining job API enhancements later, then more power to us for getting
it right.

So having said all that, I'm leaning towards keeping job ids as an
integer, having no virDomainListAllJobs() in this release, and adding
sufficient documentation and magic around a reserved job id '0' so that
callers can start experimenting with backup jobs in 5.2.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org