tl;dr:
I am working on a series of patches to expose backing chain information
in <domain> XML. Comments are welcome, to make sure my XML design is on
the right track.
Purpose
=======
Among other things, this will help us support Peter's proposal of
enhancing the block-pull and block-commit actions to specify a
destination by relative depth in the backing chain (where "vda[0]"
represents the active image, "vda[1]" represents the backing file of the
active image, and so on).
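For instance, assuming the index syntax lands as proposed, a pull that
flattens just the top of a three-element chain might be spelled (this
invocation is a sketch of the proposal, not existing syntax):

# virsh blockpull $dom vda --base vda[2]

which would pull the contents of "vda[1]" into the active image and
leave "vda[2]" as the new direct backing file, without having to spell
out any file names.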
It will also help debug situations where libvirt and qemu disagree on
what constitutes a backing chain, which in turn causes sVirt labeling
discrepancies or prevents block-pull/block-commit actions. For
example, given the chain "base <- mid <- top", if top forgot the
backing_fmt attribute, and /etc/libvirt/qemu.conf has
allow_disk_format_probing=0 (which it does by default for security
reasons), libvirt treats 'mid' as a raw file and refuses to acknowledge
that 'base' is part of the chain, while qemu would happily treat mid as
qcow2 and therefore use 'base' if permissions allow it to. I have
helped debug this scenario several times on IRC or in bugzilla reports.
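For reference, the problematic chain typically comes from creating an
overlay without recording a backing format; a minimal reproduction with
qemu-img (file names invented for illustration) looks something like:

# qemu-img create -f qcow2 -b base.qcow2 -F qcow2 mid.qcow2
# qemu-img create -f qcow2 -b mid.qcow2 top.qcow2

where the second command omits -F, so top.qcow2 records only a backing
file name and libvirt must refuse to guess its format when probing is
disabled.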
This feature is being driven in part by
https://bugzilla.redhat.com/show_bug.cgi?id=1069407
Existing design
===============
Note that libvirt can already expose backing file details (but only one
layer; it is not recursive) when using virStorageVolGetXMLDesc(); for
example:
# virsh vol-dumpxml --pool gluster img3
<volume type='network'>
  <name>img3</name>
  <key>vol1/img3</key>
  ...
  <target>
    <path>gluster://localhost/vol1/img3</path>
    <format type='qcow2'/>
    ...
  </target>
  <backingStore>
    <path>gluster://localhost/vol1/img2</path>
    <format type='qcow2'/>
    <permissions>
      <mode>00</mode>
      <owner>0</owner>
      <group>0</group>
    </permissions>
  </backingStore>
</volume>
In the current volume representation, if a <backingStore> element is
present, it gives the <path> to the backing file. But this
representation is a bit limited: it is rather hard-coded to the
assumption that there is only one backing file, and does not do a good
job when the backing image is not in the same storage pool as the volume
it is describing. Some of the enhancements I'm proposing for <domain>
should also be applied to the information output by <volume> XML, which
means I have to be careful that the design I'm proposing will mesh well
with the storage XML to maximize code reuse.
The volume approach is a bit painful to users trying to track the
backing chain of a disk tied to a <domain> because it necessitates
creating a storage pool and making multiple calls to follow the chain,
so we need to expose the backing chain directly in the <disk> element of
a domain, and recursively show the entire chain. Furthermore, there are
some formats that require multiple resources: for example, both qemu
2.0's new quorum driver and Hyper-V VHDX images can have multiple backing
files, and these files can in turn have further backing images.
Thus, any proper representation of disk resources needs to show a full
tree of relationships. Thankfully, circular references in backing files
would form an invalid image (all known virtual disk image formats
require a DAG of relationships).
With the existing API, we still have not fully implemented 'virsh
snapshot-delete' of external snapshots. So our current advice is for
people to manually use qemu-img to alter backing chains, then update
libvirt to match. Once libvirt starts tracking backing chains, it
becomes all the more important to provide two new actions in libvirt: we
need a validation mode (check that what is recorded on disk matches what
is recorded in XML and flag an error if they differ) and a correction
mode (ignore what is recorded in XML and regenerate it to match what is
actually on disk).
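For example, the manual equivalent of deleting the external snapshot
'top' today (assuming local qcow2 files and a domain that is shut off)
is something like:

# qemu-img commit /var/lib/libvirt/images/top.qcow2
# virsh edit $dom    (point the <source> back at mid.qcow2)

and any stale <backingStore> tracked by libvirt would then need the
correction mode described above.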
Proposal
========
For each <disk> of a domain, I will be adding a new <backingStore>
element. The element is optional on input, which allows libvirt to
continue to understand input from older versions, but will always be
present on output, to show what libvirt is tracking as the backing chain.
For a file with no backing store (including raw file format), the usage
is simple:
<disk type='file' device='disk'>
  <driver name='qemu' type='raw'/>
  <source file='/path/to/somewhere'/>
  <backingStore/>
  <target dev='vda' bus='virtio'/>
</disk>
The new explicit <backingStore/> makes it clear that there is no backing
chain.
A backing chain of 3 files (base <- mid <- top) in the local file system:
<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2'/>
  <source file='/var/lib/libvirt/images/top.qcow2'/>
  <backingStore type='file'>
    <driver name='qemu' type='qcow2'/>
    <source file='/var/lib/libvirt/images/mid.qcow2'/>
    <backingStore type='file'>
      <driver name='qemu' type='qcow2'/>
      <source file='/var/lib/libvirt/images/base.qcow2'/>
      <backingStore/>
    </backingStore>
  </backingStore>
  <target dev='vda' bus='virtio'/>
</disk>
Note that this is intentionally nested, so that for file formats that
support more than one backing resource, it can list parallel
<backingStore> elements as siblings to describe those related resources
(thus leaving the door open to expose a qemu quorum as a <disk
type='quorum'> with no direct <source> but instead with three
<backingStore> sibling elements, one for each member of the quorum,
where each member can further have its own backing chain).
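As a rough sketch of where that could lead (quorum support itself is not
part of this series, and the file names and exact spelling here are
invented for illustration), such a disk might look like:

<disk type='quorum' device='disk'>
  <driver name='qemu' type='quorum'/>
  <backingStore type='file'>
    <driver name='qemu' type='qcow2'/>
    <source file='/var/lib/libvirt/images/member1.qcow2'/>
    <backingStore/>
  </backingStore>
  <backingStore type='file'>
    <driver name='qemu' type='qcow2'/>
    <source file='/var/lib/libvirt/images/member2.qcow2'/>
    <backingStore/>
  </backingStore>
  <backingStore type='file'>
    <driver name='qemu' type='raw'/>
    <source file='/var/lib/libvirt/images/member3.img'/>
    <backingStore/>
  </backingStore>
  <target dev='vdc' bus='virtio'/>
</disk>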
Design-wise, the <backingStore> element is either completely empty
(end-of-chain), or has a mandatory type='...' attribute that mirrors the
same type attribute of a <disk>. Then, within the backingStore element,
there is a <source> element (or other appropriate sub-elements) similar
to what <disk> already uses for describing a single host resource. So,
for example, here is the output for a two-element chain on gluster:
<disk type='network' device='disk'>
  <driver name='qemu' type='qcow2'/>
  <source protocol='gluster' name='vol1/img2'>
    <host name='red'/>
  </source>
  <backingStore type='network'>
    <driver name='qemu' type='qcow2'/>
    <source protocol='gluster' name='vol1/img1'>
      <host name='red'/>
    </source>
    <backingStore/>
  </backingStore>
  <target dev='vdb' bus='virtio'/>
</disk>
Or again, but this time using volume references to a storage pool
(assuming 'glusterVol1' is the storage pool wrapping gluster://red/vol1):
<disk type='volume' device='disk'>
  <driver name='qemu' type='qcow2'/>
  <source pool='glusterVol1' volume='img2'/>
  <backingStore type='volume'>
    <driver name='qemu' type='qcow2'/>
    <source pool='glusterVol1' volume='img1'/>
    <backingStore/>
  </backingStore>
  <target dev='vdb' bus='virtio'/>
</disk>
As can be seen, this design heavily reuses existing <disk type='...'>
handling, which should make it easier to reuse blocks of code both in
libvirt to handle the backing chains, and in clients when processing
backing chains to hand to libvirt up front or in inspecting the dumpxml
results. Management apps like vdsm that use transient domains should
start supplying <backingStore> elements to fully describe chains.
Implementation
==============
The following APIs will be affected:
defining domain XML (whether via define for persistent domains, or
create for transient domains): parse the new element. If the element is
already present, default to trusting the backing chain in that element
instead of reading from the disk files. If the element is absent, read
the disk files and populate the element. It is probably also worth
adding a flag to trigger validation mode: read the disk files to ensure
they match the XML, and refuse the operation if there is a mismatch (as
for updating the XML to match reality, the simplest approach is to edit
the XML, delete the <backingStore> element, and try the define again, so
I don't see the need for a flag for that action).
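In virsh terms, the validation mode might be requested along these lines
(the flag name is purely a placeholder, not committed syntax):

# virsh define dom.xml --validate-backing-chain

failing with an error if the <backingStore> elements in dom.xml do not
match the metadata actually recorded in the disk files.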
I may also need to figure out whether it is worth tainting a domain any
time libvirt detects that the backing chain recorded in the XML has
diverged from the backing chain read from the disk files.
Note that defining domain XML includes loading from saved state or from
incoming migration.
dumping domain XML: always output the new element, by default without
consulting disk files. By tracking the chain in memory ever since the
guest is defined, it should already be available for output. I'm
debating whether we need a flag (similar to virsh dumpxml --update-cpu)
that can force libvirt to re-read the disk files at the time of the dump
and regenerate the chain to reflect any changes made behind libvirt's
back.
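By analogy with --update-cpu, such a flag might be spelled (again, a
hypothetical name, just to make the idea concrete):

# virsh dumpxml $dom --update-backing-chain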
creating external snapshots: the <domainsnapshot> XML will continue to
be the picture of the domain prior to the creation of the snapshot (but
this picture will now include any <backingStore> elements already
present in the chain), but after the snapshot is taken, the <domain> XML
will also be modified to record the updated chain (the old disk source
is now the <backingStore> of the new disk source).
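To make that concrete with invented local file names: a disk that reads

<source file='/var/lib/libvirt/images/base.qcow2'/>
<backingStore/>

before an external snapshot into top.qcow2 would afterwards read:

<source file='/var/lib/libvirt/images/top.qcow2'/>
<backingStore type='file'>
  <driver name='qemu' type='qcow2'/>
  <source file='/var/lib/libvirt/images/base.qcow2'/>
  <backingStore/>
</backingStore>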
deleting external snapshots is not yet implemented, but the
implementation will have to shrink the backingStore chain to match reality.
block-pull (block-rebase in pull mode), block-commit: at the completion
of the operation, the <backingStore> needs to be updated to reflect the
new, shorter state of the chain.
block-copy (block-rebase in copy mode): the operation starts out by
creating a mirror, but during the first phase, the mirror is not usable
as an accurate copy of what the guest sees. Right now we fudge by
saying that block copy can only be done on transient domains; but even
with that, we still record a <mirror> element in the <disk> XML to track
that a block copy is underway (so that the operation survives a libvirtd
restart). The <mirror> element will now need to be taught a
<backingStore>, particularly if the user passes in a pre-existing file
to be reused as the copy destination. Then, when the second phase is
complete and the mirroring is ended, the <disk> will need another update
to select which side of the backing chain is now in force.
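A copy in flight to a pre-existing destination might then be recorded
along these lines (the exact shape of <mirror> here is a sketch with
invented file names, not settled design):

<mirror file='/var/lib/libvirt/images/copy.qcow2' format='qcow2'>
  <backingStore type='file'>
    <driver name='qemu' type='qcow2'/>
    <source file='/var/lib/libvirt/images/copy-base.qcow2'/>
    <backingStore/>
  </backingStore>
</mirror>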
virsh domblklist: should be taught a new flag to show the backing chain
in a tree format, since the command already exists to extract <disk>
information from a domain into a nicer human-readable format.
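The output might resemble (flag name and layout invented here, only to
illustrate the idea):

# virsh domblklist $dom --backing
Target     Source
------------------------------------------------------
vda        /var/lib/libvirt/images/top.qcow2
             /var/lib/libvirt/images/mid.qcow2
               /var/lib/libvirt/images/base.qcow2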
sVirt security labeling: right now, we read the disk files in order to
both apply and remove labels on a backing chain - obviously, once the
chain is tracked natively as part of the <disk>, we should be able to
label without having to read the disk files.
storage volumes - investigate how much of the backing chain code can be
reused in enhancing storage volume XML output.
Anything else you can think of in the code base that will be impacted?
--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library
http://libvirt.org