Adam Litke has been asking if I can expose watermark information from
qemu when doing block commit. Qemu still doesn't expose that
information when doing 'virsh blockcopy' (QMP drive-mirror), but DOES
expose it for regular and active 'virsh blockcommit'. The idea is that
when you are writing to more than one file at a time, management needs
to know if the file is nearing a watermark for usage that necessitates
growing the storage volume before hitting an ENOSPC error. In
particular, Adam's use is running qcow2 format on top of block devices,
where it is easy to enlarge the block device.
The current libvirt API virDomainGetBlockInfo() can only get watermark
information for the active image in a disk chain. It shows three numbers:

capacity: the disk size seen by the guest (can be grown via
virt-resize). This is usually larger than the host block device if the
guest has not used the complete disk, but it can also be smaller than
the host block device when the disk is mostly in use, because of qcow2
overhead.

allocation: the known usage of the host file/block device; it should
never be larger than the physical size (other than rounding up to file
sector sizing). For sparse files, this number is smaller than the total
size by the amount of holes in the file. For block devices with qcow2
format, this number is reported by qemu as the maximum offset in use by
the qcow2 file (without regard to whether earlier offsets are holes that
could be reused). Compare this to what 'du' would report.

physical: the total size of the host file/block device. Compare this to
what 'ls' would report.
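
To make the use of these three numbers concrete, here is a minimal
Python sketch of the check a management application might run. The
BlockInfo class, the needs_growth() helper and the headroom threshold
are all hypothetical names made up for illustration; the sample values
are invented, not measured.

```python
from dataclasses import dataclass


@dataclass
class BlockInfo:
    """The three numbers described above, all in bytes."""
    capacity: int    # disk size seen by the guest
    allocation: int  # highest offset in use in the host file/device
    physical: int    # total size of the host file/device


def needs_growth(info: BlockInfo, headroom: int) -> bool:
    """Return True when fewer than `headroom` bytes remain unallocated.

    `headroom` is a hypothetical threshold chosen by management.
    """
    return info.physical - info.allocation < headroom


# Example values only: a 3 GiB block device holding a qcow2 image
# whose guest-visible size is 12 GiB.
info = BlockInfo(capacity=12 * 1024**3,
                 allocation=2503548928,
                 physical=3 * 1024**3)
print(needs_growth(info, headroom=512 * 1024**2))  # False
print(needs_growth(info, headroom=1024**3))        # True
```

The threshold is deliberately left to the caller, since how much slack
is needed depends on how fast the guest writes and how quickly the
block device can be enlarged.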
Also, the libvirt API virStorageVolGetXMLDesc reports two of those
numbers for a top-level image: <capacity> and <allocation> are listed as
siblings of <target>. But they are not reported for a <backingStore>; you
have to call the API a second time on the backing volume.
Now that we have a common virStorageSourcePtr type in the C code, we
could do a better job of exposing full information for the entire chain
in a single API call.
I've got a couple of ideas for where we can extend existing APIs (the
extensions do not involve bumping the .so versioning, so they can also
be backported, although it gets MUCH harder to backport without
virStorageSourcePtr).
First, I think virStorageVolGetXMLDesc should show all three
numbers, by adding a <physical unit='bytes'>...</physical> element
alongside the existing <capacity> and <allocation> elements. Also, I
think it might be nice if we could enhance the API to do a full chain
recursion (probably requires an explicit flag to turn on) where it shows
details on the full backing chain, rather than just partial details on
the immediate backing file; in doing that, the <backingStore> element
would gain recursive <backingStore> (similar to what we recently did in
<domain> XML). In that mode, each layer of <backingStore> would also
report <capacity>, <allocation>, and <physical>. Something like:
# virsh vol-dumpxml --pool default f20.snap2
<volume type='file'>
  <name>f20.snap2</name>
  <key>/var/lib/libvirt/images/f20.snap2</key>
  <source>
  </source>
  <capacity unit='bytes'>12884901888</capacity>
  <allocation unit='bytes'>2503548928</allocation>
  <physical unit='bytes'>2503548928</physical>
  <target>
    <path>/var/lib/libvirt/images/f20.snap2</path>
    <format type='qcow2'/>
    <permissions>
      <mode>0600</mode>
      <owner>0</owner>
      <group>0</group>
      <label>system_u:object_r:virt_image_t:s0</label>
    </permissions>
    <timestamps>
      <atime>1407295598.583411967</atime>
      <mtime>1403064822.622766566</mtime>
      <ctime>1404318525.899951254</ctime>
    </timestamps>
    <compat>1.1</compat>
    <features/>
  </target>
  <backingStore>
    <path>/var/lib/libvirt/images/f20.snap1</path>
    <capacity unit='bytes'>12884901888</capacity>
    <allocation unit='bytes'>2503548928</allocation>
    <physical unit='bytes'>2503548928</physical>
    <format type='qcow2'/>
    <permissions>
      <mode>0600</mode>
      <owner>107</owner>
      <group>107</group>
      <label>system_u:object_r:virt_content_t:s0</label>
    </permissions>
    <timestamps>
      <atime>1407295598.623411816</atime>
      <mtime>1402005765.810488875</mtime>
      <ctime>1404318523.313955796</ctime>
    </timestamps>
    <compat>1.1</compat>
    <features/>
    <backingStore>
      <path>/var/lib/libvirt/images/f20.base</path>
      <capacity unit='bytes'>10737418240</capacity>
      <allocation unit='bytes'>2503548928</allocation>
      <physical unit='bytes'>10737418240</physical>
      <format type='raw'/>
      <permissions>
        <mode>0600</mode>
        <owner>107</owner>
        <group>107</group>
        <label>system_u:object_r:virt_content_t:s0</label>
      </permissions>
      <timestamps>
        <atime>1407295598.623411816</atime>
        <mtime>1402005765.810488875</mtime>
        <ctime>1404318523.313955796</ctime>
      </timestamps>
      <backingStore/>
    </backingStore>
  </backingStore>
</volume>
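
A client consuming this recursive form could walk the chain with a few
lines of code. The following is only a sketch against the XML proposed
above, which no libvirt release produces today; the embedded document
is a trimmed copy of the example, and walk_volume_chain() is a
hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the proposed (not yet implemented) recursive output.
PROPOSED_VOLUME_XML = """
<volume type='file'>
  <name>f20.snap2</name>
  <capacity unit='bytes'>12884901888</capacity>
  <allocation unit='bytes'>2503548928</allocation>
  <physical unit='bytes'>2503548928</physical>
  <target>
    <path>/var/lib/libvirt/images/f20.snap2</path>
    <format type='qcow2'/>
  </target>
  <backingStore>
    <path>/var/lib/libvirt/images/f20.snap1</path>
    <capacity unit='bytes'>12884901888</capacity>
    <allocation unit='bytes'>2503548928</allocation>
    <physical unit='bytes'>2503548928</physical>
    <format type='qcow2'/>
    <backingStore>
      <path>/var/lib/libvirt/images/f20.base</path>
      <capacity unit='bytes'>10737418240</capacity>
      <allocation unit='bytes'>2503548928</allocation>
      <physical unit='bytes'>10737418240</physical>
      <format type='raw'/>
      <backingStore/>
    </backingStore>
  </backingStore>
</volume>
"""


def walk_volume_chain(xml_text):
    """Yield (path, capacity, allocation, physical) per layer, top first."""
    node = ET.fromstring(xml_text)
    while node is not None:
        # The top-level path lives under <target>; in this proposal the
        # backing layers carry <path> as a direct child.
        path = node.findtext("target/path") or node.findtext("path")
        if path is None:
            break  # an empty <backingStore/> terminates the chain
        yield (path,
               int(node.findtext("capacity")),
               int(node.findtext("allocation")),
               int(node.findtext("physical")))
        node = node.find("backingStore")


chain = list(walk_volume_chain(PROPOSED_VOLUME_XML))
for layer in chain:
    print(layer)
```

The special case for the top-level path comes from the existing layout,
where <capacity> is a sibling of <target> but <path> is a child of it.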
Also, the current storage volume API is rather hard-coded to assume that
backing elements are in the same storage pool, which is not always true.
It may be time to introduce <backingStore type='file'> or <backingStore
type='network'> to allow better details of cross-pool backing elements,
while leaving plain <backingStore> as a back-compat synonym for
<backingStore type='volume'> for the current hard-coded layout that
assumes the backing element is in the same storage pool.
The other idea I've had is to expand the <domain> XML to expose more
information about backing chains, including details that are redundant
with virDomainGetBlockInfo() for the top level, or maybe even what
virDomainBlockStatsFlags() reports. Here, we have a bit of a
choice - storage volume XML was inconsistent on which elements were
siblings to <target> (such as <capacity>) vs. children (such as
<timestamps>); it might be nicer to stick all per-file elements at the
same level in <disk> XML (probably as siblings to <source>). On the
other hand, I strongly feel that <compat> is a feature of the <format>,
so it should have been a child rather than a sibling. So, as an example
of what the XML might look like:
<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2'>
    <compat>1.1</compat>
    <features/>
  </driver>
  <source file='/tmp/snap2.img'/>
  <capacity unit='bytes'>12884901888</capacity>
  <allocation unit='bytes'>2503548928</allocation>
  <physical unit='bytes'>2503548928</physical>
  <permissions>
    <mode>0600</mode>
    <owner>107</owner>
    <group>107</group>
    <label>system_u:object_r:virt_content_t:s0</label>
  </permissions>
  <timestamps>
    <atime>1407295598.623411816</atime>
    <mtime>1402005765.810488875</mtime>
    <ctime>1404318523.313955796</ctime>
  </timestamps>
  <backingStore type='file' index='1'>
    <format type='qcow2'>
      <compat>1.1</compat>
      <features/>
    </format>
    <source file='/tmp/snap1.img'/>
    <capacity unit='bytes'>12884901888</capacity>
    <allocation unit='bytes'>2503548928</allocation>
    <physical unit='bytes'>2503548928</physical>
    <permissions>
      <mode>0600</mode>
      <owner>0</owner>
      <group>0</group>
      <label>system_u:object_r:virt_image_t:s0</label>
    </permissions>
    <timestamps>
      <atime>1407295598.583411967</atime>
      <mtime>1403064822.622766566</mtime>
      <ctime>1404318525.899951254</ctime>
    </timestamps>
    <backingStore type='file' index='2'>
      <format type='raw'/>
      <capacity unit='bytes'>10737418240</capacity>
      <allocation unit='bytes'>2503548928</allocation>
      <physical unit='bytes'>10737418240</physical>
      <source file='/tmp/base.img'/>
      <permissions>
        <mode>0600</mode>
        <owner>107</owner>
        <group>107</group>
        <label>system_u:object_r:virt_content_t:s0</label>
      </permissions>
      <timestamps>
        <atime>1407295598.623411816</atime>
        <mtime>1402005765.810488875</mtime>
        <ctime>1404318523.313955796</ctime>
      </timestamps>
      <backingStore/>
    </backingStore>
  </backingStore>
  <target dev='vda' bus='virtio'/>
  <alias name='virtio-disk0'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
</disk>
Again, this is a lot of new information, so it may be wise to add a new
flag that must be turned on to request the information. But adding this
information would allow watermark tracking for a blockcommit operation -
when collapsing 'base <- snap1 <- snap2' into 'base <- snap2' by
committing snap1 into base, the <allocation> subelement of the
appropriate <backingStore> level would track the live qemu values as
more data is written into base, and thus be usable to determine whether
the block device behind base needs to be externally expanded before
hitting an ENOSPC situation.
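
As a rough illustration of that tracking, the sketch below looks up one
<backingStore> level by its index attribute in the proposed <disk> XML
and reports how many bytes remain before the host device is full. It
assumes the proposed elements exist (they do not in current <domain>
XML); the embedded document is a trimmed copy of the example above and
free_bytes() is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the proposed <disk> layout; the size elements shown
# here are part of the proposal, not of current <domain> XML.
PROPOSED_DISK_XML = """
<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2'/>
  <source file='/tmp/snap2.img'/>
  <allocation unit='bytes'>2503548928</allocation>
  <physical unit='bytes'>2503548928</physical>
  <backingStore type='file' index='1'>
    <format type='qcow2'/>
    <source file='/tmp/snap1.img'/>
    <allocation unit='bytes'>2503548928</allocation>
    <physical unit='bytes'>2503548928</physical>
    <backingStore type='file' index='2'>
      <format type='raw'/>
      <source file='/tmp/base.img'/>
      <allocation unit='bytes'>2503548928</allocation>
      <physical unit='bytes'>10737418240</physical>
      <backingStore/>
    </backingStore>
  </backingStore>
  <target dev='vda' bus='virtio'/>
</disk>
"""


def free_bytes(disk_xml, index):
    """Return physical - allocation for the <backingStore> with `index`."""
    disk = ET.fromstring(disk_xml)
    # iter() visits nested <backingStore> elements at every depth.
    for layer in disk.iter("backingStore"):
        if layer.get("index") == str(index):
            physical = int(layer.findtext("physical"))
            allocation = int(layer.findtext("allocation"))
            return physical - allocation
    raise LookupError(f"no backingStore with index {index}")


# Committing snap1 into base writes into the layer with index 2 (base).
remaining = free_bytes(PROPOSED_DISK_XML, 2)
print(remaining)  # 8233869312
```

A management application would re-fetch the XML periodically while the
commit job runs and grow the device behind base once the remaining
space drops below its chosen threshold.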
--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library
http://libvirt.org