This is the counter-proposal to my earlier RFC for storage migration via
snapshot mirrors[1], resulting from a NACK on the code review for that
earlier proposal[2]. In particular, this proposal fleshes out some of
Paolo's design overview on the qemu wiki[3].
[1] https://www.redhat.com/archives/libvir-list/2012-March/msg00578.html
[2] https://www.redhat.com/archives/libvir-list/2012-March/msg01033.html
[3] http://wiki.qemu.org/Features/SnapshotsMultipleDevices
My plan is to have everything in this RFC coded up in the next couple of
days (hopefully no later than Thursday); this has missed the feature
freeze for 0.9.11, so it should not be applied upstream until after the
weekend release, as one of the first patches for 0.9.12. Backport-wise,
the new flags can be backported as far back as the 0.9.10 .so API, but
the new virDomainBlockCopy() API cannot be exported when doing a
backport without breaking .so versions (although its implementation can
be used internally).
Additions
=========
The following new error code will be added:
VIR_ERR_BLOCK_COPY_IN_PROGRESS
The following new API will be added:
  int virDomainBlockCopy(virDomainPtr dom,
                         const char *disk,
                         const char *base,
                         const char *dest,
                         const char *format,
                         unsigned long bandwidth,
                         unsigned int flags);
The following new named values will be added:
enum virDomainBlockJobType (used in virDomainBlockJobInfo):
VIR_DOMAIN_BLOCK_JOB_TYPE_COPY = 2
The following new flags will be added:
for virDomainBlockRebase:
VIR_DOMAIN_BLOCK_REBASE_SHALLOW = 1 << 0
VIR_DOMAIN_BLOCK_REBASE_REUSE_EXT = 1 << 1
VIR_DOMAIN_BLOCK_REBASE_COPY = 1 << 2
for virDomainBlockCopy:
VIR_DOMAIN_BLOCK_COPY_SHALLOW = 1 << 0
VIR_DOMAIN_BLOCK_COPY_REUSE_EXT = 1 << 1
for virDomainBlockJobAbort:
VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT
The following new XML will be added:
Under //domain/devices/disk, next to <source file='...'/>, add <mirror
file='...'/>
Semantics
=========
virDomainBlockCopy sets up a BLOCK_JOB_TYPE_COPY job. 'disk' names the
disk to be copied (can be 'vda' or '/path/to/source', as with other
block commands) and must not be NULL. 'base' names the path to the
backing file in the chain of the source that will be the new backing
file of the destination; if this parameter is NULL, then the destination
file defaults to a complete block pull, but the COPY_SHALLOW flag
instead requests a pull of just the top file in the source backing
chain. 'dest' names the copy being created and must not be NULL;
normally, this file is created by the hypervisor/libvirt, but the
COPY_REUSE_EXT flag lets an application pass in a pre-created file
(allowing metadata to include a relative instead of absolute backing
file name). 'format' gives the format of the copy, or NULL to either
probe the format of a COPY_REUSE_EXT dest or to reuse the same format as
the source. flags cannot contain COPY_SHALLOW unless 'base' is NULL.
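The argument rules above can be sketched as a small validity check. This
is only an illustration: check_block_copy_args() is a hypothetical
helper, not part of libvirt, and rejecting undefined flag bits is my
assumption rather than something this RFC states.

```c
#include <stddef.h>

/* Flag values as listed in this RFC for virDomainBlockCopy. */
enum {
    BLOCK_COPY_SHALLOW   = 1 << 0,
    BLOCK_COPY_REUSE_EXT = 1 << 1
};

/* Hypothetical helper: returns 0 if the argument combination is
 * permitted by the rules described above, -1 otherwise. */
int
check_block_copy_args(const char *disk, const char *base,
                      const char *dest, unsigned int flags)
{
    /* 'disk' and 'dest' must not be NULL. */
    if (disk == NULL || dest == NULL)
        return -1;

    /* Assumption: bits not defined by this RFC are rejected. */
    if (flags & ~(unsigned int)(BLOCK_COPY_SHALLOW | BLOCK_COPY_REUSE_EXT))
        return -1;

    /* COPY_SHALLOW is only allowed when 'base' is NULL. */
    if ((flags & BLOCK_COPY_SHALLOW) && base != NULL)
        return -1;

    /* 'format' may be NULL or non-NULL, so it is not checked here. */
    return 0;
}
```

Note that COPY_REUSE_EXT together with a non-NULL 'base' passes this
check, which matches the last virDomainBlockCopy example later on.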
Once a block copy job is started, calls to virDomainGetBlockJobInfo()
for the same 'disk' will report VIR_DOMAIN_BLOCK_JOB_TYPE_COPY as the
job type. This job never completes on
its own, but must be stopped by the user (this enables mirroring to
continue until the user informs libvirt that any backing files, perhaps
located at different locations as specified by relative path names using
REUSE_EXT, have been externally copied into place). There are two
phases to a TYPE_COPY job. In the first phase, cur < end when querying
progress, calls to virDomainBlockJobAbort(dom, disk, 0) will cancel the
operation and revert to the source, and calls to
virDomainBlockJobAbort(dom, disk, VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT) will
fail with VIR_ERR_BLOCK_COPY_IN_PROGRESS. In the second phase, cur ==
end when querying progress, calls to virDomainBlockJobAbort(dom, disk,
0) will break the mirroring and revert to the source, while calls to
virDomainBlockJobAbort(dom, disk, VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT) will
break the mirroring and pivot to the destination. Use of
VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT on a non-copy block job will fail with
VIR_ERR_INVALID_ARG.
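The two phases and the effect of the PIVOT flag can be summarized as a
decision function. This is a sketch of the semantics only, not of the
implementation: classify_abort() and the outcome names are hypothetical,
and SKETCH_ABORT_PIVOT uses a placeholder bit because this RFC does not
assign a value to VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT.

```c
#include <assert.h>

/* COPY = 2 is the value proposed in this RFC; PULL = 1 is the existing
 * libvirt block job type. */
enum job_type { JOB_TYPE_UNKNOWN = 0, JOB_TYPE_PULL = 1, JOB_TYPE_COPY = 2 };

/* Placeholder bit; the RFC leaves the real value unspecified. */
#define SKETCH_ABORT_PIVOT (1u << 0)

enum abort_outcome {
    ABORT_CANCEL,               /* phase 1 copy, or any non-copy job */
    ABORT_UNMIRROR_TO_SOURCE,   /* phase 2, no flag */
    ABORT_PIVOT_TO_DEST,        /* phase 2, PIVOT */
    ABORT_ERR_COPY_IN_PROGRESS, /* phase 1, PIVOT */
    ABORT_ERR_INVALID_ARG       /* PIVOT on a non-copy job */
};

/* Hypothetical helper mapping (job type, progress, flags) to the
 * behavior described above. */
enum abort_outcome
classify_abort(enum job_type type, unsigned long long cur,
               unsigned long long end, unsigned int flags)
{
    int pivot = (flags & SKETCH_ABORT_PIVOT) != 0;

    /* Progress never runs past the end of the job. */
    assert(cur <= end);

    if (type != JOB_TYPE_COPY)
        return pivot ? ABORT_ERR_INVALID_ARG : ABORT_CANCEL;

    if (cur < end)  /* first phase: still pulling data */
        return pivot ? ABORT_ERR_COPY_IN_PROGRESS : ABORT_CANCEL;

    /* second phase: cur == end, source and destination are in sync */
    return pivot ? ABORT_PIVOT_TO_DEST : ABORT_UNMIRROR_TO_SOURCE;
}
```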
virDomainBlockRebase(dom, disk, dest, bandwidth,
VIR_DOMAIN_BLOCK_REBASE_COPY | (flags & 3)) is shorthand for
virDomainBlockCopy(dom, disk, NULL, dest, NULL, bandwidth, (flags & 3))
- that is, use of the REBASE_COPY flag treats the BlockRebase 'base'
argument as the BlockCopy 'dest' argument, creates the destination with
the same file format as the source (or probes the format of the
pre-existing destination if REBASE_REUSE_EXT is used), and passes the COPY_SHALLOW and
COPY_REUSE_EXT flags through (note that the similarly named flags were
conveniently chosen to be the same values). Attempts to use
REBASE_SHALLOW or REBASE_REUSE_EXT without also using REBASE_COPY will
fail with VIR_ERR_INVALID_ARG.
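A minimal sketch of the flag translation just described, using the
REBASE flag values listed under Additions. rebase_to_copy_flags() is a
hypothetical helper for illustration and is not part of libvirt.

```c
#include <assert.h>
#include <stddef.h>

/* Flag values as listed in this RFC for virDomainBlockRebase. */
enum {
    REBASE_SHALLOW   = 1 << 0,
    REBASE_REUSE_EXT = 1 << 1,
    REBASE_COPY      = 1 << 2
};

/* Hypothetical helper.
 * Returns  0 and stores the equivalent virDomainBlockCopy flags in
 *            *copy_flags when REBASE_COPY is set;
 * returns  1 when no copy was requested (an ordinary rebase);
 * returns -1 when SHALLOW or REUSE_EXT is used without REBASE_COPY. */
int
rebase_to_copy_flags(unsigned int rebase_flags, unsigned int *copy_flags)
{
    const unsigned int passthrough = REBASE_SHALLOW | REBASE_REUSE_EXT;

    assert(copy_flags != NULL);

    if (!(rebase_flags & REBASE_COPY))
        return (rebase_flags & passthrough) ? -1 : 1;

    /* The similarly named flags share values, so masking with 3 is
     * enough to produce the COPY_* flags. */
    *copy_flags = rebase_flags & passthrough;
    return 0;
}
```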
While a copy operation is in progress, virDomainGetXMLDesc (dumpxml) will
show the <mirror> element for that <disk>.
Initial Implementation
======================
When virDomainBlockCopy is called (perhaps via the virDomainBlockRebase
alias), it first sets up a mirror using the 'drive-mirror' monitor
command and the destination file name. The mirror is opened with the
'existing' mode if _REUSE_EXT is present; otherwise it is opened with
the 'absolute-paths' mode if _SHALLOW is present or 'no-backing-file'
mode if no flags are present. Next, the function calls the
'block_stream' monitor command to start the streaming. The streaming
command uses 'base' as its starting point, except that when _SHALLOW was
specified, libvirt will use the backing file of the source disk, rather
than NULL (this can be obtained by the 'query-block' monitor command,
although someday libvirt should start tracking this information in
<domain> XML rather than relying on qemu). At this point, control
returns to the user, and the stream proceeds in the background;
virDomainBlockJobSetSpeed can tune the speed of the block streaming.
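The mode and base selection above can be sketched as follows. Both
helpers are hypothetical and follow the paragraph literally (in
particular, a non-NULL 'base' with no flags still yields
'no-backing-file', since only the flags are consulted for the mode).

```c
#include <stddef.h>

/* Flag values as listed in this RFC for virDomainBlockCopy. */
enum {
    COPY_SHALLOW   = 1 << 0,
    COPY_REUSE_EXT = 1 << 1
};

/* Hypothetical helper: pick the 'drive-mirror' mode from the flags. */
const char *
drive_mirror_mode(unsigned int flags)
{
    if (flags & COPY_REUSE_EXT)
        return "existing";          /* caller pre-created the file */
    if (flags & COPY_SHALLOW)
        return "absolute-paths";    /* keep the source's backing chain */
    return "no-backing-file";       /* destination holds everything */
}

/* Hypothetical helper: pick the base handed to the streaming command.
 * 'source_backing' stands for the backing file of the source disk, as
 * would be learned from 'query-block'. */
const char *
stream_base(const char *base, const char *source_backing,
            unsigned int flags)
{
    return (flags & COPY_SHALLOW) ? source_backing : base;
}
```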
At least in the initial implementation, as long as the block job is
active, libvirt will prevent 'virDomainMigrate', 'virDomainSave',
'virDomainSnapshotCreateXML' in general, and 'virDomainDetachDevice' of
the disk being mirrored, all with the new VIR_ERR_BLOCK_COPY_IN_PROGRESS
error. This is because I don't have an easy way to resume mirroring
when restarting a new qemu process; preventing these actions until the
user first cancels the ongoing mirroring will result in fewer corner
cases that libvirt has to worry about. This also implies that the
initial implementation will fail for persistent domains, and only be
useful for transient domains. Attempts to define a domain with a
<mirror> element are rejected, leaving <mirror> as output-only XML
useful in restoring state when restarting libvirtd. It may be possible
to add persistent support in the future, once we determine how to make
qemu resume a mirrored block device; at that point, it would be possible
to specify <mirror> in domain xml during domain creation or during
device hotplug.
When the block stream finishes, qemu will send an event to libvirt
(libvirt will also have to manually check for completion on a libvirtd
restart, based on whether cur == end in the block job info). I'm not
yet sure whether to expose this event to the user so that they do not
have to poll the block job info, or whether to consume it internally.
At any rate, before this event occurs, the BLOCK_JOB_ABORT_PIVOT flag is
rejected, and virDomainBlockJobAbort without flags uses the
'block_job_cancel' monitor command to stop the streaming early, then the
'drive-reopen' monitor command to break the mirroring back to the
source; a race is possible in which 'block_job_cancel' is
called after the pull is complete but before the completion event
has been processed, so the code must proceed on to the 'drive-reopen'
even if the job cancel fails. After the event occurs,
virDomainBlockJobAbort only needs to use the 'drive-reopen' monitor
command, with either the source or the destination file depending on the
BLOCK_JOB_ABORT_PIVOT flag.
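The monitor command sequence for an abort can be sketched as a plan
builder. This is an illustration under assumptions: struct abort_plan
and plan_copy_abort() are hypothetical names, and the sketch only
records which commands would be issued rather than talking to a qemu
monitor.

```c
#include <stddef.h>

struct abort_plan {
    const char *cmds[2];      /* monitor commands to issue, in order */
    size_t ncmds;             /* number of valid entries in cmds */
    int reopen_to_dest;       /* 1: 'drive-reopen' targets the destination */
    int ignore_cancel_error;  /* 1: continue even if the cancel fails */
};

/* Hypothetical helper.  'stream_finished' is nonzero once the
 * completion event has been seen; 'pivot' is nonzero when
 * BLOCK_JOB_ABORT_PIVOT was requested.  Returns 0 and fills *plan, or
 * -1 when the request must be rejected. */
int
plan_copy_abort(int stream_finished, int pivot, struct abort_plan *plan)
{
    plan->cmds[0] = plan->cmds[1] = NULL;
    plan->ncmds = 0;
    plan->reopen_to_dest = 0;
    plan->ignore_cancel_error = 0;

    if (!stream_finished) {
        if (pivot)
            return -1;  /* PIVOT is rejected before the event */
        /* Stop the stream early; because of the race described
         * above, a failure here must not stop the reopen. */
        plan->cmds[plan->ncmds++] = "block_job_cancel";
        plan->ignore_cancel_error = 1;
    }

    /* Break the mirror, to the destination only when pivoting. */
    plan->cmds[plan->ncmds++] = "drive-reopen";
    plan->reopen_to_dest = stream_finished && pivot;
    return 0;
}
```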
Until 'drive-reopen' is made atomic in qemu (by adding code to support
it inside 'transaction'), the user risks a block job abort rendering the
disk unusable, because the source was closed before the destination was
opened; hopefully this situation is rare, in part because libvirt will
do stat() checks and SELinux labeling on destination files before
starting qemu monitor commands, as a sanity check that qemu will be able
to use the specified files. If qemu ever adds atomic 'drive-reopen'
support, we can add a new flag BLOCK_JOB_ABORT_ATOMIC that fails on
older qemu, and ensures the use of 'transaction' on the newer qemu that
supports an atomic reopen.
If a block mirror is aborted (whether by the user calling
virDomainBlockJobAbort with no flag, or by the qemu process ending due
to things like a guest-initiated shutdown), then the mirror can be
safely discarded; after a restart the domain is unmirrored, and
virDomainBlockRebase can be called again from scratch.
Examples
========
For some examples, starting with base <- snap1 <- snap2 <- snap3 as the
backing chain for disk 'vda',
  virDomainBlockCopy(dom, "vda", NULL, "/path/to/copy", NULL, 0, 0)
would set up a job that results in /path/to/copy being the same file
format as snap3, but containing the entire chain
  virDomainBlockCopy(dom, "vda", "/path/to/snap1",
                     "/path/to/copy",
                     "qed", 0, 0)
would set up a job that results in base <- snap1 <- copy as the mirrored
backing chain, and ensures that copy is formatted as qed regardless of
the format of snap3
  virDomainBlockCopy(dom, "vda", NULL, "/path/to/copy",
                     NULL, 0, VIR_DOMAIN_BLOCK_COPY_SHALLOW)
is shorthand for
  virDomainBlockCopy(dom, "vda", "/path/to/snap2",
                     "/path/to/copy",
                     NULL, 0, 0)
and results in base <- snap1 <- snap2 <- copy, creating copy with the
same format as snap3
  virDomainBlockCopy(dom, "vda", "/path/to/snap2",
                     "/path/to/copy",
                     NULL, 0, VIR_DOMAIN_BLOCK_COPY_REUSE_EXT)
requires /path/to/copy to already exist, probes it for existing format
(which might be different from snap3), and proceeds to mirror everything
so that snap2 is the base of copy (and the user is at fault if the
pre-existing file doesn't call out a backing file that happens to be
identical in content to snap2)
oVirt will probably use the sequence:
- use qemu-img to create an empty qcow2 file with relative backing name
to the destination storage
- call virDomainBlockRebase(dom, disk, "/path/to/copy", 0,
VIR_DOMAIN_BLOCK_REBASE_COPY | VIR_DOMAIN_BLOCK_REBASE_SHALLOW |
VIR_DOMAIN_BLOCK_REBASE_REUSE_EXT)
- copy the base files from source to destination storage (this can be
done in parallel, either before or after the virDomainBlockRebase call)
- wait for the block pull to finish (either by waiting for an event if I
propagate the event, or by polling virDomainGetBlockJobInfo, or even by
polling virDomainBlockJobAbort(VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT) and
checking for VIR_ERR_BLOCK_COPY_IN_PROGRESS)
- once the base files are in place and the block pull half of the
copy job is complete (and without regard to whether the block stream or
the external base file copying completed first), call
virDomainBlockJobAbort(dom, disk, VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT) to
reopen to the new storage domain chain
Comparison to first RFC
=======================
This proposal exposes only one disk at a time, while the earlier
virDomainSnapshotCreateXML <mirror> approach could atomically set up
mirroring on multiple disks. However, the nature of block jobs being a
background process means that parallel jobs can be run on independent
disks, so the user can do the overall block migration with the time cost
of the slowest disk, rather than having to do things serially with the
time cost of all disks added together.
This proposal avoids having to create an intermediate snapshot, so the
pull is more efficient and the source chain does not get longer, no
matter how many times the process is aborted and restarted.
This proposal can expose the no-backing-file mode, while the snapshot
approach did not.
To survive across libvirtd restarts, the snapshot approach was using
<domainsnapshot> to store the mirroring status in a user-visible
location; this approach has to modify the internal live xml (alongside
other internal data, such as the qemu pid). Or perhaps I can add a
<mirror> subelement to <disk> of a domain and make it user-visible after
all, and treat that as an output-only parameter for now.
Both approaches face the dilemma of how to start a new qemu process with
mirroring intact, and my solution in both patch series will be to
prevent any action that would force libvirt to save domain state until
after the user has first canceled all current mirroring jobs. This
limitation is not permanent - if future qemu provides better ways to
restart mirroring, and as libvirt is taught to store the full backing
chain in <domain> xml instead of probing it on the fly, we can relax
this restriction.
--
Eric Blake eblake(a)redhat.com +1-919-301-3266
Libvirt virtualization library
http://libvirt.org