[libvirt] RFC: Exposing backing chains in <domain> XML

tl;dr: I am working on a series of patches to expose backing chain information in <domain> XML. Comments are welcome, to make sure my XML design is on the right track.

Purpose
=======

Among other things, this will help us support Peter's proposal of enhancing the block-pull and block-commit actions to specify a destination by relative depth in the backing chain (where "vda[0]" represents the active image, "vda[1]" represents the backing file of the active image, and so on).

It will also help debug situations where libvirt and qemu disagree on what constitutes a backing chain, which in turn causes sVirt labeling discrepancies or prohibits block-pull/block-commit actions. For example, given the chain "base <- mid <- top", if top forgot the backing_fmt attribute, and /etc/libvirt/qemu.conf sets allow_disk_format_probing=0 (which it does by default, for security reasons), libvirt treats 'mid' as a raw file and refuses to acknowledge that 'base' is part of the chain, while qemu would happily treat mid as qcow2 and therefore use 'base' if permissions allow it to. I have helped debug this scenario several times on IRC or in bugzilla reports.

This feature is being driven in part by https://bugzilla.redhat.com/show_bug.cgi?id=1069407

Existing design
===============

Note that libvirt can already expose backing file details (but only one layer; it is not recursive) when using virStorageVolGetXMLDesc(); for example:

# virsh vol-dumpxml --pool gluster img3
<volume type='network'>
  <name>img3</name>
  <key>vol1/img3</key>
  ...
  <target>
    <path>gluster://localhost/vol1/img3</path>
    <format type='qcow2'/>
    ...
  </target>
  <backingStore>
    <path>gluster://localhost/vol1/img2</path>
    <format type='qcow2'/>
    <permissions>
      <mode>00</mode>
      <owner>0</owner>
      <group>0</group>
    </permissions>
  </backingStore>
</volume>

In the current volume representation, if a <backingStore> element is present, it gives the <path> to the backing file. But this representation is a bit limited: it is rather hard-coded to the assumption that there is only one backing file, and does not do a good job when the backing image is not in the same storage pool as the volume it is describing. Some of the enhancements I'm proposing for <domain> should also be applied to the information output by <volume> XML, which means I have to be careful that the design I'm proposing will mesh well with the storage XML to maximize code reuse.

The volume approach is a bit painful to users trying to track the backing chain of a disk tied to a <domain>, because it necessitates creating a storage pool and making multiple calls to follow the chain, so we need to expose the backing chain directly in the <disk> element of a domain, and recursively show the entire chain.

Furthermore, there are some formats that require multiple resources: for example, both qemu 2.0's new quorum driver and HyperV VHDX images can have multiple backing files, and where these files can in turn have more backing images. Thus, any proper representation of disk resources needs to show a full tree of relationships. Thankfully, circular references in backing files would form an invalid image (all known virtual disk image formats require a DAG of relationships).

With the existing API, we still have not fully implemented 'virsh snapshot-delete' of external snapshots. So our current advice is for people to manually use qemu-img to alter backing chains, then update libvirt to match.
Once libvirt starts tracking backing chains, it becomes all the more important to provide two new actions in libvirt: we need a validation mode (check that what is recorded on disk matches what is recorded in XML and flag an error if they differ) and a correction mode (ignore what is recorded in XML and regenerate it to match what is actually on disk).

Proposal
========

For each <disk> of a domain, I will be adding a new <backingStore> element. The element is optional on input, which allows libvirt to continue to understand input from older versions, but will always be present on output, to show what libvirt is tracking as the backing chain.

For a file with no backing store (including raw file format), the usage is simple:

<disk type='file' device='disk'>
  <driver name='qemu' type='raw'/>
  <source file='/path/to/somewhere'/>
  <backingStore/>
  <target dev='vda' bus='virtio'/>
</disk>

The new explicit <backingStore/> makes it clear that there is no backing chain.

A backing chain of 3 files (base <- mid <- top) in the local file system:

<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2'/>
  <source file='/var/lib/libvirt/images/top.qcow2'/>
  <backingStore type='file'>
    <driver name='qemu' type='qcow2'/>
    <source file='/var/lib/libvirt/images/mid.qcow2'/>
    <backingStore type='file'>
      <driver name='qemu' type='qcow2'/>
      <source file='/var/lib/libvirt/images/base.qcow2'/>
      <backingStore/>
    </backingStore>
  </backingStore>
  <target dev='vda' bus='virtio'/>
</disk>

Note that this is intentionally nested, so that for file formats that support more than one backing resource, it can list parallel <backingStore> as siblings to describe those related resources (thus leaving the door open to expose a qemu quorum as a <disk type='quorum'> with no direct <source> but instead with three <backingStore> sibling elements for each member of the quorum, and where each member of the quorum can further have its own backing chain).

Design-wise, the <backingStore> element is either completely empty (end-of-chain), or has a mandatory type='...' attribute that mirrors the same type attribute of a <disk>. Then, within the backingStore element, there is a <source> or other appropriate sub-elements similar to what <disk> already uses for describing a single host resource. So, as an example, here is the output for a 2-element chain on gluster:

<disk type='network' device='disk'>
  <driver name='qemu' type='qcow2'/>
  <source protocol='gluster' name='vol1/img2'>
    <host name='red'/>
  </source>
  <backingStore type='network'>
    <driver name='qemu' type='qcow2'/>
    <source protocol='gluster' name='vol1/img1'>
      <host name='red'/>
    </source>
    <backingStore/>
  </backingStore>
  <target dev='vdb' bus='virtio'/>
</disk>

Or again, but this time using volume references to a storage pool (assuming 'glusterVol1' is the storage pool wrapping gluster://red/vol1):

<disk type='volume' device='disk'>
  <driver name='qemu' type='qcow2'/>
  <source pool='glusterVol1' volume='img2'/>
  <backingStore type='volume'>
    <driver name='qemu' type='qcow2'/>
    <source pool='glusterVol1' volume='img1'/>
    <backingStore/>
  </backingStore>
  <target dev='vdb' bus='virtio'/>
</disk>

As can be seen, this design heavily reuses the existing <disk type='...'> handling, which should make it easier to reuse blocks of code both in libvirt to handle the backing chains, and in clients when processing backing chains to hand to libvirt up front or when inspecting the dumpxml results.
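To make the nested representation and the relative-depth addressing ("vda[0]", "vda[1]", ...) a bit more concrete, here is a minimal, self-contained C sketch. The structure and function names are hypothetical illustrations, not libvirt's actual types; it just shows one backing pointer per element, a formatter that emits the nested <backingStore> elements, and a lookup by relative depth:

#include <stdio.h>

/* Hypothetical, simplified stand-in for a per-file source element;
 * the real structure would also carry type, protocol, hosts, etc. */
struct disk_source {
    const char *file;              /* path of this image */
    const char *format;            /* "qcow2", "raw", ... */
    struct disk_source *backing;   /* NULL == explicit end of chain */
};

/* Emit <backingStore> elements, nesting one level per backing file. */
static void
format_backing(const struct disk_source *src, int indent)
{
    if (!src) {
        printf("%*s<backingStore/>\n", indent, "");
        return;
    }
    printf("%*s<backingStore type='file'>\n", indent, "");
    printf("%*s<driver name='qemu' type='%s'/>\n", indent + 2, "", src->format);
    printf("%*s<source file='%s'/>\n", indent + 2, "", src->file);
    format_backing(src->backing, indent + 2);
    printf("%*s</backingStore>\n", indent, "");
}

/* Resolve "vda[depth]": 0 is the active image, 1 its backing file, ... */
static const struct disk_source *
lookup_depth(const struct disk_source *active, unsigned int depth)
{
    while (active && depth--)
        active = active->backing;
    return active;
}

int main(void)
{
    struct disk_source base = { "/var/lib/libvirt/images/base.qcow2", "qcow2", NULL };
    struct disk_source mid  = { "/var/lib/libvirt/images/mid.qcow2",  "qcow2", &base };
    struct disk_source top  = { "/var/lib/libvirt/images/top.qcow2",  "qcow2", &mid };

    printf("<disk type='file' device='disk'>\n");
    printf("  <driver name='qemu' type='%s'/>\n", top.format);
    printf("  <source file='%s'/>\n", top.file);
    format_backing(top.backing, 2);
    printf("  <target dev='vda' bus='virtio'/>\n");
    printf("</disk>\n");

    const struct disk_source *s = lookup_depth(&top, 1);
    printf("vda[1] -> %s\n", s ? s->file : "(past end of chain)");
    return 0;
}

The recursion mirrors the XML: each element owns at most one backing pointer here, while a quorum-style format would simply hold several of them.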
Management apps like vdsm that use transient domains should start supplying <backingStore> elements to fully describe chains.

Implementation
==============

The following APIs will be affected:

defining domain XML (whether via define for persistent domains, or create for transient domains): parse the new element. If the element is already present, default to trusting the backing chain in that element instead of reading from the disk files. If the element is absent, read the disk files and populate the element. It is probably also worth adding a flag to trigger validation mode: read the disk files to ensure they match the xml, and refuse the operation if there is a mismatch (as for updating xml to match reality, the simplest is to edit the XML and delete the <backingStore> element then try the define again, so I don't see the need for a flag for that action). I may also need to figure out if it is worth tainting a domain any time where libvirt detects that the XML backing chain vs. the disk file read backing chain have diverged. Note that defining domain XML includes loading from saved state or from incoming migration.

dumping domain XML: always output the new element, by default without consulting disk files. By tracking the chain in memory ever since the guest is defined, it should already be available for output. I'm debating whether we need a flag (similar to virsh dumpxml --update-cpu) that can force libvirt to re-read the disk files at the time of the dump and regenerate the chain to match reality of any changes made behind libvirt's back.

creating external snapshots: the <domainsnapshot> XML will continue to be the picture of the domain prior to the creation of the snapshot (but this picture will now include any <backingStore> elements already present in the chain), but after the snapshot is taken, the <domain> XML will also be modified to record the updated chain (the old disk source is now the <backingStore> of the new disk source).

deleting external snapshots is not yet implemented, but the implementation will have to shrink the backingStore chain to match reality.

block-pull (block-rebase in pull mode), block-commit: at the completion of the pull, the <backingStore> needs to be updated to reflect the new shorter state of the chain.

block-copy (block-rebase in copy mode): the operation starts out by creating a mirror, but during the first phase, the mirror is not usable as an accurate copy of what the guest sees. Right now we fudge by saying that block copy can only be done on transient domains; but even with that, we still track a <mirror> element in the <disk> XML to track that a block copy is underway (so that the operation survives a libvirtd restart). The <mirror> element will now need to be taught a <backingStore>, particularly if the user passes in a pre-existing file to be reused as the copy destination. Then, when the second phase is complete and the mirroring is ended, the <disk> will need another update to select which side of the backing chain is now in force.

virsh domblklist: should be taught a new flag to show the backing chain in a tree format, since the command already exists to extract <disk> information from a domain into a nicer human format.

sVirt security labeling: right now, we read the disk files to both label and remove labels on a backing chain - obviously, once the chain is tracked natively as part of the <disk>, we should be labeling without having to read disk files.

storage volumes: investigate how much of the backing chain code can be reused in enhancing storage volume XML output.

Anything else you can think of in the code base that will be impacted?

--
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
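As a rough illustration of the validation mode mentioned in the implementation list above (hypothetical names; real code would compare full source definitions, not just path and format), the check amounts to walking the XML-recorded chain and a freshly probed chain in lockstep:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical minimal chain element, shared by both representations. */
struct chain_elem {
    const char *path;
    const char *format;
    struct chain_elem *backing;   /* NULL == end of chain */
};

/* Compare the chain recorded in the XML with the chain probed from the
 * disk files; report and return false on the first divergence. */
static bool
chain_matches(const struct chain_elem *xml, const struct chain_elem *probed)
{
    unsigned int depth = 0;

    while (xml && probed) {
        if (strcmp(xml->path, probed->path) != 0 ||
            strcmp(xml->format, probed->format) != 0) {
            fprintf(stderr, "mismatch at depth %u: XML has %s(%s), disk has %s(%s)\n",
                    depth, xml->path, xml->format, probed->path, probed->format);
            return false;
        }
        xml = xml->backing;
        probed = probed->backing;
        depth++;
    }
    if (xml || probed) {
        fprintf(stderr, "chains differ in length at depth %u\n", depth);
        return false;
    }
    return true;
}

int main(void)
{
    struct chain_elem base = { "/images/base.qcow2", "qcow2", NULL };
    struct chain_elem xml_top = { "/images/top.qcow2", "qcow2", &base };

    /* Pretend a probe of the disk files saw a 'mid' layer the XML omitted. */
    struct chain_elem probed_mid = { "/images/mid.qcow2", "qcow2", &base };
    struct chain_elem probed_top = { "/images/top.qcow2", "qcow2", &probed_mid };

    return chain_matches(&xml_top, &probed_top) ? 0 : 1;
}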

On 03/12/14 21:21, Eric Blake wrote:
tl;dr: I am working on a series of patches to expose backing chain information in <domain> XML. Comments are welcome, to make sure my XML design is on the right track.
...
Existing design ===============
...
a domain, and recursively show the entire chain. Furthermore, there are some formats that require multiple resources: for example, both qemu 2.0's new quorum driver and HyperV VHDX images can have multiple backing
With this in mind ...
files, and where these files can in turn have more backing images. Thus, any proper representation of disk resources needs to show a full tree of relationships. Thankfully, circular references in backing files would form an invalid image (all known virtual disk image formats require a DAG of relationships).
...
Proposal ======== For each <disk> of a domain, I will be adding a new <backingStore> element. The element is optional on input, which allows libvirt to continue to understand input from older versions, but will always be present on output, to show what libvirt is tracking as the backing chain.
A backing chain of 3 files (base <- mid <- top) in the local file system:
<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2'/>
  <source file='/var/lib/libvirt/images/top.qcow2'/>
  <backingStore type='file'>
    <driver name='qemu' type='qcow2'/>
... we should add an attribute with the index of the backing chain element in the backing chain. This will:

1) allow easier user retrieval of the index to be used for block_rebase

2) allow us to avoid ambiguity when a backing chain will become a backing tree without the need to invent some kind of hierarchical indexing approach. Instead we can just number the backing elements in some (internal) fashion and expect the users to provide the correct index.
    <source file='/var/lib/libvirt/images/mid.qcow2'/>
    <backingStore type='file'>
      <driver name='qemu' type='qcow2'/>
      <source file='/var/lib/libvirt/images/base.qcow2'/>
      <backingStore/>
    </backingStore>
  </backingStore>
  <target dev='vda' bus='virtio'/>
</disk>
My first impression is good though. I will go through the design again tomorrow in a more in-depth way. Peter

On 03/12/2014 02:42 PM, Peter Krempa wrote:
A backing chain of 3 files (base <- mid <- top) in the local file system:
<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2'/>
  <source file='/var/lib/libvirt/images/top.qcow2'/>
  <backingStore type='file'>
    <driver name='qemu' type='qcow2'/>
... we should add an attribute with the index of the backing chain element in the backing chain.
Hmm. Another feature coming down the pipes in qemu 2.0 is the ability to give an alias to any portion of the backing chain. Right now, we have an <alias> element tied to the <disk> as a whole (in qemu parlance, the device id), but some qemu operations will be easier if we also have a name tied to each file in the chain (in qemu parlance, a node id for the bd [block driver structure]). Maybe we kill two birds with one stone, by having each <backingStore> track an <alias> sub-element with the name of the node, when communicating with qemu 2.0 and newer.

For a specific instance, consider a quorum vs. a snapshot create action - there are two approaches: create a single qcow2 whose backing file is the quorum (that is, request the snapshot on the node tied to the quorum):

  Q[a, b, c] <- snap

or create a new quorum of three qcow2 files, with each qcow2 file wrapping a member of the old quorum (actually, a 'transaction' command that creates three files in one go):

  Q[a <- snapA, b <- snapB, c <- snapC]

or even anything in between (request a snapshot of the node tied to A, while leaving b and c alone, since node A is on the storage most amenable to copying off the snapshot for backup purposes while nodes B and C are remote).

The way qemu is exposing this is by letting you specify, when creating the new node for the snapshot, whether its backing file is the node id of the overall quorum or the node id of one of the pieces of the quorum. So while the overall <disk> alias remains constant, the quorum node is different from any of its three backing files. It's further evidence that the quorum itself does not use any file resources, but instead relies on multiple backingStores, and taking the snapshot (or snapshots) needs control over all possible nodes as the starting point that will be gaining a new qcow2 node as part of the snapshot creation.

Right now, <alias> is currently a run-time and output-only parameter, but we someday want to support offline block-pull and friends, where we'd need the index to exist even when <alias> does not. Likewise, while each <backingStore> corresponds to a qemu node, and can thus have one name, the top-level <disk> has the chance for BOTH a device alias (which moves around whenever the active image changes due to snapshots, block copy, or block commit operations) and a node index (which is tied to the file name, even if the file changes to no longer being the active image in the chain). Thanks for making me think about that!

Code-wise, I'm looking at splitting 'struct _virDomainDiskDef' into two parts. The outermost part is _virDomainDiskDef, which tracks anything tied to the guest view, or to the device as a whole (<target>, <alias>, <address>); the inner part is a new _virDomainDiskSrcDef, which tracks anything related to a host view (node name, <source>, <driver>, <backingStore>), and where each backingStore is also a _virDomainDiskSrcDef, as a recursive structure - we just special case the output so that the first _virDomainDiskSrcDef feeds the XML of <disk> element, while all other _virDomainDiskSrcDef feed the XML of a <backingStore>.

For tracking node ids, I would then add a counter of nodes created so far to the outer structure (more important for an online domain, as we want to track node names that mesh with qemu node names, and must not reuse names no matter how many snapshots or block-commits happen in between), where each inner structure grabs the next increment of the counter.
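A minimal sketch of the counter idea described above, with hypothetical structure names: the device-level state owns the alias and a monotonically increasing counter, and each per-file element derives its qemu node name from them:

#include <stdio.h>

/* Hypothetical outer/device-level state: the <alias> plus a counter of
 * nodes created so far (never reused while qemu is running). */
struct disk_dev {
    const char *alias;        /* e.g. "ide0-0-0" */
    unsigned int node_counter;
};

/* Hypothetical inner/per-file state. */
struct disk_src {
    unsigned int node_index;
    char node_name[64];
};

/* Assign the next node index and derive the qemu node name "<alias>[N]". */
static void
assign_node(struct disk_dev *dev, struct disk_src *src)
{
    src->node_index = dev->node_counter++;
    snprintf(src->node_name, sizeof(src->node_name),
             "%s[%u]", dev->alias, src->node_index);
}

int main(void)
{
    struct disk_dev dev = { "ide0-0-0", 0 };
    struct disk_src base, snap;

    assign_node(&dev, &base);   /* "ide0-0-0[0]" for the original image */
    assign_node(&dev, &snap);   /* "ide0-0-0[1]" for the snapshot overlay */

    /* Mirrors the base <- snap chain discussed in the snapshot example. */
    printf("%s <- %s\n", base.node_name, snap.node_name);
    return 0;
}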
So revisiting various operations:

On snapshot, we are going from:

  domainDiskDef (counter 1, alias "ide0-0-0")
    + domainDiskSrcDef (node "ide0-0-0[0]", source "base")

to:

  domainDiskDef (counter 2, alias "ide0-0-0")
    + domainDiskSrcDef (node "ide0-0-0[1]", source "snap")
    + domainDiskSrcDef (node "ide0-0-0[0]", source "base")

Note that the node names grow in order of creation, which is NOT the same as a top-down breadth-first numbering. <alias> and nodeid would be output only (ignored on input); as long as qemu is running we cannot reuse old nodeids, but when qemu is offline, we could rename things to start back from 0; maybe only when passed a specific flag (similar to the update cpu flag forcing us to update portions of the xml that we otherwise leave unchanged).

Do we need both a node id and a <backingStore> index? We already allow disk operations by <alias> name; so referring to the node id may be sufficient. On the other hand, having index as an attribute might make it easier to write XPath queries that resolve to a numbered node regardless of depth (I'm a bit weak on XPath, but there's bound to be a way to lookup a <disk> element whose target is named "vda" and that has a "backingStore[index=4]" sub-element).

So, for a theoretical quorum with 2/3 majority and where one of the disks is a backing chain, as in Q[a, b <- c, d], and where qemu is running, it might look like:

<disk type='quorum' device='disk'>
  <driver name='qemu' type='quorum' threshold='2' node='[4]'/>
  <backingStore type='file' index='1'>
    <driver name='qemu' type='raw' node='[0]'/>
    <source path='/path/to/a'/>
    <backingStore/>
  </backingStore>
  <backingStore type='file' index='2'>
    <driver name='qemu' type='qcow2' node='[2]'/>
    <source path='/path/to/c'/>
    <backingStore type='file' index='3' node='[1]'>
      <driver name='qemu' type='raw'/>
      <source path='/path/to/b'/>
      <backingStore/>
    </backingStore>
  </backingStore>
  <backingStore type='file' index='4'>
    <driver name='qemu' type='raw' node='[3]'/>
    <source path='/path/to/d'/>
    <backingStore/>
  </backingStore>
  <target dev='hda' bus='ide'/>
  <alias name='ide0-0-0'/>
  <address type='drive' controller='0' bus='0' target='0' unit='0'/>
</disk>

then the node names that qemu uses will be the concatenation of the <disk> alias and each DiskSrcDef node ("ide0-0-0[4]" is the quorum, "ide0-0-0[0]" is the node for file A, ...), and where you can also refer to backing stores by index ("hda" or "hda[0]" is the quorum, "hda[1]" is file A from the quorum, "hda[2]" is the active part of the chain from the second member of the quorum, ...)

--
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
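For the XPath question above, a query along the lines of //disk[target/@dev='hda']//backingStore[@index='4'] would resolve a numbered element regardless of depth. Here is a hedged, self-contained sketch using libxml2 directly; the index attribute and the trimmed XML snippet are taken from the draft example above, not from any existing libvirt schema:

/* build: cc xpath-demo.c $(xml2-config --cflags --libs) */
#include <stdio.h>
#include <string.h>
#include <libxml/parser.h>
#include <libxml/xpath.h>

int main(void)
{
    const char *xml =
        "<disk type='quorum' device='disk'>"
        "  <backingStore type='file' index='4'>"
        "    <driver name='qemu' type='raw' node='[3]'/>"
        "    <source path='/path/to/d'/>"
        "    <backingStore/>"
        "  </backingStore>"
        "  <target dev='hda' bus='ide'/>"
        "</disk>";

    xmlDocPtr doc = xmlReadMemory(xml, (int) strlen(xml), "domain.xml", NULL, 0);
    xmlXPathContextPtr ctxt = xmlXPathNewContext(doc);

    /* Find the disk whose <target dev='hda'/>, then a backingStore at any
     * depth whose index attribute is 4, and pull out its source path. */
    xmlXPathObjectPtr obj = xmlXPathEvalExpression(
        BAD_CAST "//disk[target/@dev='hda']//backingStore[@index='4']/source/@path",
        ctxt);

    if (obj && obj->nodesetval && obj->nodesetval->nodeNr > 0) {
        xmlChar *val = xmlNodeGetContent(obj->nodesetval->nodeTab[0]);
        printf("hda[4] -> %s\n", (char *) val);
        xmlFree(val);
    }

    xmlXPathFreeObject(obj);
    xmlXPathFreeContext(ctxt);
    xmlFreeDoc(doc);
    return 0;
}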

On 03/12/2014 05:34 PM, Eric Blake wrote:
Code-wise, I'm looking at splitting 'struct _virDomainDiskDef' into two parts. The outermost part is _virDomainDiskDef, which tracks anything tied to the guest view, or to the device as a whole (<target>, <alias>, <address>); the inner part is a new _virDomainDiskSrcDef, which tracks anything related to a host view (node name, <source>, <driver>, <backingStore>), and where each backingStore is also a _virDomainDiskSrcDef, as a recursive structure - we just special case the output so that the first _virDomainDiskSrcDef feeds the XML of <disk> element, while all other _virDomainDiskSrcDef feed the XML of a <backingStore>.
This won't compile, but here's the split I'm currently envisioning. Also, Peter reminded me on IRC that it is going to be nicer if the host-side source resource can be reusable in the src/util/virstorage framework, which means I need to move the inner struct out of conf/domain_conf.h (conf/ can use util/, but util/ cannot use conf/):

diff --git i/src/conf/domain_conf.h w/src/conf/domain_conf.h
index 37191a8..5a3cd77 100644
--- i/src/conf/domain_conf.h
+++ w/src/conf/domain_conf.h
@@ -688,108 +688,119 @@ enum virDomainDiskSourcePoolMode {
     ...
 };
 typedef virDomainDiskSourcePoolDef *virDomainDiskSourcePoolDefPtr;

-/* Stores the virtual disk configuration */
-struct _virDomainDiskDef {
-    int type;
-    int device;
-    int bus;
+struct _virDomainDiskSourceDef {
+    int type; /* enum virDomainDiskType */
     char *src;
-    char *dst;
-    int tray_status;
-    int removable;
-    int protocol;
+    int protocol; /* enum virDomainDiskProtocol */
     size_t nhosts;
     virDomainDiskHostDefPtr hosts;
     virDomainDiskSourcePoolDefPtr srcpool;
     struct {
         char *username;
         int secretType; /* enum virDomainDiskSecretType */
         union {
             unsigned char uuid[VIR_UUID_BUFLEN];
             char *usage;
         } secret;
     } auth;
+    virStorageEncryptionPtr encryption;
     char *driverName;
     int format; /* enum virStorageFileFormat */
     virStorageFileMetadataPtr backingChain;
+    size_t nseclabels;
+    virSecurityDeviceLabelDefPtr *seclabels;
+
+    bool noBacking;
+    virDomainDiskSourceDefPtr backing;
+};
+typedef struct _virDomainDiskSourceDef virDomainDiskSourceDef;
+typedef virDomainDiskSourceDef *virDomainDiskSourceDefPtr;
+
+/* Stores the virtual disk configuration */
+struct _virDomainDiskDef {
+    virDomainDiskSourceDef src;
+
+    int device; /* enum virDomainDiskDevice */
+    int bus; /* enum virDomainDiskBus */
+    char *dst;
+    int tray_status;
+    int removable;
+    virStorageFileMetadataPtr backingChain;
+
     char *mirror;
     int mirrorFormat; /* enum virStorageFileFormat */
     bool mirroring;
     struct {
         unsigned int cylinders;
         unsigned int heads;
         unsigned int sectors;
         int trans;
     } geometry;
     struct {
         unsigned int logical_block_size;
         unsigned int physical_block_size;
     } blockio;
     virDomainBlockIoTuneInfo blkdeviotune;
     char *serial;
     char *wwn;
     char *vendor;
     char *product;
     int cachemode;
     int error_policy; /* enum virDomainDiskErrorPolicy */
     int rerror_policy; /* enum virDomainDiskErrorPolicy */
     int iomode;
     int ioeventfd;
     int event_idx;
     int copy_on_read;
     int snapshot; /* enum virDomainSnapshotLocation, snapshot_conf.h */
     int startupPolicy; /* enum virDomainStartupPolicy */
     bool readonly;
     bool shared;
     bool transient;
     virDomainDeviceInfo info;
-    virStorageEncryptionPtr encryption;
     bool rawio_specified;
     int rawio; /* no = 0, yes = 1 */
     int sgio; /* enum virDomainDeviceSGIO */
     int discard; /* enum virDomainDiskDiscard */
-
-    size_t nseclabels;
-    virSecurityDeviceLabelDefPtr *seclabels;
 };

If I did the split right, then everything that is per-device remains in the outer struct, and everything that is per-file is in the inner struct. noBacking is required to know whether 'backing==NULL' implies <backingStore/> as an explicit end of chain, vs. omitting the subelement from older versions or user input that is still expecting libvirt to populate the backing chain into the xml.
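To illustrate the noBacking tri-state in parsing terms, here is a hedged sketch (made-up enum and helper names, plain libxml2 traversal rather than libvirt's real parser): the distinction falls out of whether a <backingStore> child exists at all, and whether it carries a type attribute:

/* build: cc backing-demo.c $(xml2-config --cflags --libs) */
#include <stdio.h>
#include <string.h>
#include <libxml/parser.h>
#include <libxml/tree.h>

/* Hypothetical result of inspecting one <disk> or <backingStore> element. */
enum backing_state {
    BACKING_UNKNOWN,   /* no <backingStore> child: older XML, probe the file */
    BACKING_NONE,      /* <backingStore/>: explicit end of chain, noBacking=true */
    BACKING_PRESENT,   /* <backingStore type='...'>: recurse into it */
};

static enum backing_state
classify_backing(xmlNodePtr parent)
{
    xmlNodePtr child;

    for (child = parent->children; child; child = child->next) {
        if (child->type != XML_ELEMENT_NODE)
            continue;
        if (!xmlStrEqual(child->name, BAD_CAST "backingStore"))
            continue;
        /* An empty <backingStore/> has no type attribute; treat it as the
         * explicit end-of-chain marker (noBacking). */
        if (!xmlHasProp(child, BAD_CAST "type"))
            return BACKING_NONE;
        return BACKING_PRESENT;
    }
    return BACKING_UNKNOWN;
}

int main(void)
{
    const char *xml = "<disk><source file='/top.qcow2'/><backingStore/></disk>";
    xmlDocPtr doc = xmlReadMemory(xml, (int) strlen(xml), "disk.xml", NULL, 0);
    xmlNodePtr disk = xmlDocGetRootElement(doc);

    printf("state = %d\n", classify_backing(disk));  /* BACKING_NONE (1) */

    xmlFreeDoc(doc);
    return 0;
}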
Another interesting observation - we have obviously not done much with chains of encrypted volumes, because just a little thought makes it obvious that <auth> and storageEncryption must be per-file attributes (it is feasible to have a chain of two separate encrypted qcow2 images[*], where the two images need SEPARATE passwords), while the current design of only one <auth> per device doesn't cope. Similarly, we can finally express the fact that the security label on backing stores is read-only while the top-most file is read-write, as well as designate when we have changed a backing store to read-write in order to update metadata, such as during a commit operation (there are some FIXMEs in qemu_driver about knowing when to revert read-write privileges of backing stores if a block commit extends over a libvirtd restart).

[*] Of course, I must give the caveat that I'd highly recommend AGAINST using qcow2 encryption - it is known to be a lousy implementation when compared to LUKS.

In making the proposed split, I noticed that we've abused the <driver> element to contain a hodgepodge of things that are per-device (for example, cache is a per-device setting, while format is a per-file setting), so I'm now trying to figure out how to tweak the XML to express the difference. I may end up keeping <driver> only at the top level, and adding a new <format> subelement inside <backingStore>, then for back-compat reasons duplicate <driver format='...'/> and <format> at the top level, or teaching the disk source formatter to merely append in a string of device-level attributes when formatting the active disk of the chain.

Peter, how does this split coincide with what you were looking at?

--
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

[adding qemu-devel]

Background for those new to the thread: previously, libvirt has been tracking a lot of disk tunables alongside the active layer of a backing chain, without regard to any backing files in the chain. However, now that qemu supports named BDS nodes anywhere in the backing chain, I'm realizing that libvirt needs to track the entire backing chain in its <domain> XML for maximum control over each qemu BDS.

On 03/13/2014 09:26 AM, Eric Blake wrote:
In making the proposed split, I noticed that we've abused the <driver> element to contain a hodgepodge of things that are per-device (for example, cache is a per-device setting, while format is a per-file setting), so I'm now trying to figure out how to tweak the XML to express the difference. I may end up keeping <driver> only at the top level, and adding a new <format> subelement inside <backingStore>, then for back-compat reasons duplicate <driver format='...'/> and <format> at the top level, or teaching the disk source formatter to merely append in a string of device-level attributes when formatting the active disk of the chain.
Among other things, libvirt can append the following to a -drive command-line option:

  cache=
  aio=
  rerror=
  werror=
  discard=
  sgio=
  bps=
  ...

Looking in the schema file, BlockdevOptionsBase supports many of these options on a per-blockdev basis. Does that mean that libvirt should allow for a different rerror= on a backing file than it does for the active file? Similarly for cache= or discard=? Or are some of these options really only sensible at the active layer, belonging more to the -drive than to each backing BDS within the drive?

Knowing which options belong where will help me partition the libvirt structure into attributes that are per-file vs. those that are per-device.

--
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

On Wed, Mar 12, 2014 at 05:34:17PM -0600, Eric Blake wrote:
Hmm. Another feature coming down the pipes in qemu 2.0 is the ability to give an alias to any portion of the backing chain. Right now, we have an <alias> element tied to the <disk> as a whole (in qemu parlance, the device id), but some qemu operations will be easier if we also have a name tied to each file in the chain (in qemu parlance, a node id for the bd [block driver structure]). Maybe we kill two birds with one stone, by having each <backingStore> track an <alias> sub-element with the name of the node, when communicating with qemu 2.0 and newer.
I like the idea of having a string alias against each node to allow unambiguous references to it. We could do an integer index, but a string alias is a bit more flexible, allowing us to tie the alias value to the QEMU name if desired.
Right now, <alias> is currently a run-time and output-only parameter, but we someday want to support offline block-pull and friends, where we'd need the index to exist even when <alias> does not.
Originally we only output <alias> when running because older QEMUs did not allow us to choose aliases ourselves; we could only decide what aliases to use once we had decided which QEMU binary to invoke. As long as we know that QEMU will honour our requested aliases, though, we could include them while the domain is shut off too.

At some point we should also probably decide to ditch support for some older QEMU versions. Backcompat is good... within reason. I don't think we necessarily need to support QEMU versions that are 7+ years old, not least because so few people use such old versions that we're not getting any real testing, so we don't really know whether they still even work.
On snapshot, we are going from:
domainDiskDef (counter 1, alias "ide0-0-0") + domainDiskSrcDef (node "ide0-0-0[0]", source "base")
to:
domainDiskDef (counter 2, alias "ide0-0-0") + domainDiskSrcDef (node "ide0-0-0[1]", source "snap") + domainDiskSrcDef (node "ide0-0-0[0]", source "base")
Note that the node names grow in order of creation, which is NOT the same as a top-down breadth-first numbering. <alias> and nodeid would be output only (ignored on input); as long as qemu is running we cannot reuse old nodeids, but when qemu is offline, we could rename things to start back from 0; maybe only when passed a specific flag (similar to the update cpu flag forcing us to update portions of the xml that we otherwise leave unchanged).
Do we need both a node id and a <backingStore> index? We already allow disk operations by <alias> name; so referring to the node id may be sufficient. On the other hand, having index as an attribute might make it easier to write XPath queries that resolve to a numbered node regardless of depth (I'm a bit weak on XPath, but there's bound to be a way to lookup a <disk> element whose target is named "vda" and that has a "backingStore[index=4]" sub-element).
Node-id feels like a very QEMU-specific concept that wouldn't map nicely to other hypervisors. Index, meanwhile, is fairly generic, as is a string-format alias. So I'd prefer either of the latter over a QEMU-specific 'node' attribute. Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On Fri, Mar 14, 2014 at 10:53:48AM +0000, Daniel P. Berrange wrote:
On Wed, Mar 12, 2014 at 05:34:17PM -0600, Eric Blake wrote:
Hmm. Another feature coming down the pipes in qemu 2.0 is the ability to give an alias to any portion of the backing chain. Right now, we have an <alias> element tied to the <disk> as a whole (in qemu parlance, the device id), but some qemu operations will be easier if we also have a name tied to each file in the chain (in qemu parlance, a node id for the bd [block driver structure]). Maybe we kill two birds with one stone, by having each <backingStore> track an <alias> sub-element with the name of the node, when communicating with qemu 2.0 and newer.
I like the idea of having a string alias against each node to allow unambiguous references to it. We could do an integer index, but a string alias is a bit more flexible, allowing us to tie the alias value to the QEMU name if desired.
You either have to force the caller to provide an alias for each node, or you have to auto-assign them. But if you auto-assign them (say "node123"), what if the user provides a label for that node later on? What if the user gives that label to a different node? Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones virt-df lists disk usage of guests without needing to install any software inside the virtual machine. Supports Linux and Windows. http://people.redhat.com/~rjones/virt-df/

On 03/14/2014 07:59 AM, Richard W.M. Jones wrote:
On Fri, Mar 14, 2014 at 10:53:48AM +0000, Daniel P. Berrange wrote:
On Wed, Mar 12, 2014 at 05:34:17PM -0600, Eric Blake wrote:
Hmm. Another feature coming down the pipes in qemu 2.0 is the ability to give an alias to any portion of the backing chain. Right now, we have an <alias> element tied to the <disk> as a whole (in qemu parlance, the device id), but some qemu operations will be easier if we also have a name tied to each file in the chain (in qemu parlance, a node id for the bd [block driver structure]). Maybe we kill two birds with one stone, by having each <backingStore> track an <alias> sub-element with the name of the node, when communicating with qemu 2.0 and newer.
I like the idea of having a string alias against each node to allow unambiguous references to it. We could do an integer index, but a string alias is a bit more flexible, allowing us to tie the alias value to the QEMU name if desired.
You either have to force the caller to provide an alias for each node, or you have to auto-assign them. But if you auto-assign them (say "node123"), what if the user provides a label for that node later on? What if the user gives that label to a different node?
Right now, we auto-assign. Another good reason to auto-assign is that the alias namespace is shared among ALL qemu objects - allowing the user to pick arbitrary names risks collisions not only with other disk objects, but with non-disk objects. -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

On Wed, Mar 12, 2014 at 02:21:46PM -0600, Eric Blake wrote:
This feature is being driven in part by https://bugzilla.redhat.com/show_bug.cgi?id=1069407
Also: https://bugzilla.redhat.com/show_bug.cgi?id=1011063 Perhaps this too? https://bugzilla.redhat.com/show_bug.cgi?id=921135 Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones libguestfs lets you edit virtual machines. Supports shell scripting, bindings from many languages. http://libguestfs.org

On Wed, Mar 12, 2014 at 02:21:46PM -0600, Eric Blake wrote:
A backing chain of 3 files (base <- mid <- top) in the local file system:
<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2'/>
  <source file='/var/lib/libvirt/images/top.qcow2'/>
  <backingStore type='file'>
    <driver name='qemu' type='qcow2'/>
    <source file='/var/lib/libvirt/images/mid.qcow2'/>
    <backingStore type='file'>
      <driver name='qemu' type='qcow2'/>
      <source file='/var/lib/libvirt/images/base.qcow2'/>
      <backingStore/>
    </backingStore>
  </backingStore>
  <target dev='vda' bus='virtio'/>
</disk>
Libguestfs gets a disk image that we usually know nothing about. Then we _either_:

(1) Add it to libvirt XML directly.

(2) Create an overlay on top of the disk image, and add the overlay to the libvirt XML.

It seems like for (1) we don't need to change anything. For (2) we might add:

<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2'/>
  <source file='overlay.qcow2'/>
  <backingStore type='file'>
    <source file='disk.img'/>
  </backingStore>
</disk>

Note not specifying disk.img's backing store (we don't know it).

Also, non-disk sources: nbd, iscsi, gluster, ceph etc. It's especially hard to discover what is in these since it may involve multiple opens [breaks nbd sometimes], from multiple processes [security context issues]; and network connections are slower than opening a local file.

Rich.

--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
virt-df lists disk usage of guests without needing to install any software inside the virtual machine. Supports Linux and Windows. http://people.redhat.com/~rjones/virt-df/

On 03/13/2014 05:18 PM, Richard W.M. Jones wrote:
It seems like for (1) we don't need to change anything. For (2) we might add:
<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2'/>
  <source file='overlay.qcow2'/>
  <backingStore type='file'>
    <source file='disk.img'/>
  </backingStore>
</disk>
Note not specifying disk.img's backing store (we don't know it).
Also, non-disk sources: nbd, iscsi, gluster, ceph etc. It's especially hard to discover what is in these since it may involve multiple opens [breaks nbd sometimes], from multiple processes [security context issues]; and network connections are slower than opening a local file.
For most network files, libvirt currently treats the file as the end of the chain and assumes that the network source is 'raw', but our recent work with gluster allows for a non-raw network file where libvirt can expand the xml to follow the chain even further.

On the input side, your proposed usage would be just fine - by omitting the nested <backingStore> element, you've admitted that you don't know/care about the rest of the chain; then libvirt will populate the rest of the chain (where it can) to show you what files it actually tweaked sVirt labels on.

I'm also hoping that this work in exposing the entire backing chain may make it easier to someday implement the <transient/> tag, where libvirt would create the overlay on your behalf.

--
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

On Wed, Mar 12, 2014 at 02:21:46PM -0600, Eric Blake wrote:
# virsh vol-dumpxml --pool gluster img3
<volume type='network'>
  <name>img3</name>
  <key>vol1/img3</key>
  ...
  <target>
    <path>gluster://localhost/vol1/img3</path>
A shame we chose this representation instead of something that matched the format used in the domain XML. At least we can add the domain XML format here without breaking compat.
Proposal ======== For each <disk> of a domain, I will be adding a new <backingStore> element. The element is optional on input, which allows libvirt to continue to understand input from older versions, but will always be present on output, to show what libvirt is tracking as the backing chain.
For a file with no backing store (including raw file format), the usage is simple:
<disk type='file' device='disk'>
  <driver name='qemu' type='raw'/>
  <source file='/path/to/somewhere'/>
  <backingStore/>
  <target dev='vda' bus='virtio'/>
</disk>
The new explicit <backingStore/> makes it clear that there is no backing chain.
A backing chain of 3 files (base <- mid <- top) in the local file system:
<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2'/>
  <source file='/var/lib/libvirt/images/top.qcow2'/>
  <backingStore type='file'>
    <driver name='qemu' type='qcow2'/>
    <source file='/var/lib/libvirt/images/mid.qcow2'/>
    <backingStore type='file'>
      <driver name='qemu' type='qcow2'/>
      <source file='/var/lib/libvirt/images/base.qcow2'/>
      <backingStore/>
    </backingStore>
  </backingStore>
  <target dev='vda' bus='virtio'/>
</disk>
Note that this is intentionally nested, so that for file formats that support more than one backing resource, it can list parallel <backingStore> as siblings to describe those related resources (thus leaving the door open to expose a qemu quorum as a <disk type='quorum'> with no direct <source> but instead with three <backingStore> sibling elements for each member of the quorum, and where each member of the quorum can further have its own backing chain).
I understand why you chose to use nesting, but I can't say I like the appearance of nesting. I think that in the common case where we have a single non-branching chain, the XML structure is kind of unpleasant and would be nicer if just a flat list. Using nesting makes it harder to extract info about backing files from the XML structure with XPath because you can't simply ask for all <source> elements at a given location. If we're going to add <alias> I wonder if we should just use that to express the nesting. eg have a flat list of <backingStore> ordered by a depth first search, and then have <alias parent="foo" name="bar"/> to express the nesting. That would allow nesting info to be extracted for the few scenarios that actually need it, but keep the common case simple.
Implementation ============== The following APIs will be affected:
defining domain XML (whether via define for persistent domains, or create for transient domains): parse the new element. If the element is already present, default to trusting the backing chain in that element instead of reading from the disk files. If the element is absent, read the disk files and populate the element. It is probably also worth adding a flag to trigger validation mode: read the disk files to ensure they match the xml, and refuse the operation if there is a mismatch (as for updating xml to match reality, the simplest is to edit the XML and delete the <backingStore> element then try the define again, so I don't see the need for a flag for that action).
I may also need to figure out if it is worth tainting a domain any time where libvirt detects that the XML backing chain vs. the disk file read backing chain have diverged.
I don't think we want to do that - there are genuine use cases where that is a reasonable thing to do. E.g. you can provide a raw file to a guest, and that guest may genuinely want to format the virtual disk it received with some other format. We don't want to taint such use cases.
dumping domain XML: always output the new element, by default without consulting disk files. By tracking the chain in memory ever since the guest is defined, it should already be available for output. I'm debating whether we need a flag (similar to virsh dumpxml --update-cpu) that can force libvirt to re-read the disk files at the time of the dump and regenerate the chain to match reality of any changes made behind libvirt's back.
It feels like apps should just query the storage pool APIs if they want to fetch a refreshed notion of backing file formats.
sVirt security labeling: right now, we read the disk files to both label and remove labels on a backing chain - obviously, once the chain is tracked natively as part of the <disk>, we should be labeling without having to read disk files
We also likely want to be able to set labels in the XML against individual backing files too, so we're not unconditionally using a read-only label for backing files which may soon need write ability. Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On Fri, Mar 14, 2014 at 11:07:21AM +0000, Daniel P. Berrange wrote:
I understand why you chose to use nesting, but I can't say I like the appearance of nesting. I think that in the common case where we have a single non-branching chain, the XML structure is kind of unpleasant and would be nicer if just a flat list. Using nesting makes it harder to extract info about backing files from the XML structure with XPath because you can't simply ask for all <source> elements at a given location.
OTOH, with nesting, existing XPath queries keep working:

https://github.com/libguestfs/libguestfs/blob/master/src/libvirt-domain.c#L4...

Have a look in this file for existing XPath queries involving /source. However, a flat list of backingStore nodes (as you suggested later) would not break anything.
I don't think we want to do that - there are genuine use cases where that is a reasonable thing to do. E.g. you can provide a raw file to a guest, and that guest may genuinely want to format the virtual disk it received with some other format. We don't want to taint such use cases.
Ewww by formatting you mean turning raw into qcow2?? Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming blog: http://rwmj.wordpress.com Fedora now supports 80 OCaml packages (the OPEN alternative to F#)

On Fri, Mar 14, 2014 at 02:05:49PM +0000, Richard W.M. Jones wrote:
On Fri, Mar 14, 2014 at 11:07:21AM +0000, Daniel P. Berrange wrote:
I understand why you chose to use nesting, but I can't say I like the appearance of nesting. I think that in the common case where we have a single non-branching chain, the XML structure is kind of unpleasant and would be nicer if just a flat list. Using nesting makes it harder to extract info about backing files from the XML structure with XPath because you can't simply ask for all <source> elements at a given location.
OTOH, with nesting, existing XPath queries keep working.
https://github.com/libguestfs/libguestfs/blob/master/src/libvirt-domain.c#L4...
Have a look in this file for existing XPath queries involving /source
However a flat list of backingStore nodes (as you suggested later) would not break anything.
I don't think we want to do that - there are genuine use cases where that is a reasonable thing to do. E.g. you can provide a raw file to a guest, and that guest may genuinely want to format the virtual disk it received with some other format. We don't want to taint such use cases.
Ewww by formatting you mean turning raw into qcow2??
Yes, RHEV for example formats block devices as QCow2. I'm not saying this is a good idea, but we know of apps which do this and so we shouldn't taint this. Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On 03/14/2014 08:34 AM, Daniel P. Berrange wrote:
I don't think we want to do that - there are genuine use cases where that is a reasonable thing to do. E.g. you can provide a raw file to a guest, and that guest may genuinely want to format the virtual disk it received with some other format. We don't want to taint such use cases.
Ewww by formatting you mean turning raw into qcow2??
Yes, RHEV for example formats block devices as QCow2. I'm not saying this is a good idea, but we know of apps which do this and so we shouldn't taint this.
RHEV is the host, not the guest - and as long as RHEV tells us <driver format='qcow2'>, then they keep libvirt in the loop on what the backing chain should be. I'm only thinking of tainting where the backing chain as explicitly stated in XML differs from the backing chain found by actual scans, _and_ where the actual scans do not probe file types from any file explicitly marked raw in the XML. There's a good reason we refuse to scan any file explicitly marked raw. -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

On 03/14/2014 05:07 AM, Daniel P. Berrange wrote:
On Wed, Mar 12, 2014 at 02:21:46PM -0600, Eric Blake wrote:
# virsh vol-dumpxml --pool gluster img3 <volume type='network'> <name>img3</name> <key>vol1/img3</key> ... <target> <path>gluster://localhost/vol1/img3</path>
A shame we chose this representation instead of something that matched the format used in the domain XML. At least we can add the domain XML format here without breaking compat.
Evidence that the storage volume and domain disk descriptions were not originally designed to be shared, although we certainly want to get to that point.
I understand why you chose to use nesting, but I can't say I like the appearance of nesting. I think that in the common case where we have a single non-branching chain, the XML structure is kind of unpleasant and would be nicer if just a flat list. Using nesting makes it harder to extract info about backing files from the XML structure with XPath because you can't simply ask for all <source> elements at a given location.
If we're going to add <alias> I wonder if we should just use that to express the nesting. eg have a flat list of <backingStore> ordered by a depth first search, and then have <alias parent="foo" name="bar"/> to express the nesting. That would allow nesting info to be extracted for the few scenarios that actually need it, but keep the common case simple.
Might work, but how does it express Rich's case where we know _part_ of a chain? If the linear list of backing files is not present, it's obvious that libvirt should populate it; but if the linear list contains only one element, how do we distinguish between the user telling us the portion of the chain they explicitly know while expecting us to probe the remainder of the chain, vs. the user telling us the entire chain and requesting that we probe no further?

We're still at the stage where getting the XML right is important, before it affects too much of the code, so I'm working on the structure first (that is, I can go ahead and code the split into a new structure in src/util that represents everything we need for the XML of a single backing chain element, whether or not we then choose to have a tree or a flat array of those structures).
I may also need to figure out if it is worth tainting a domain any time where libvirt detects that the XML backing chain vs. the disk file read backing chain have diverged.
I don't think we want to do that - there are genuine use cases where that is a reasonable thing to do. E.g. you can provide a raw file to a guest, and that guest may genuinely want to format the virtual disk it received with some other format. We don't want to taint such use cases.
No, if libvirt knows that you handed the disk to the guest as raw, then libvirt will always treat it as raw, rather than probing to see what the guest has done with that storage. It is only in the case where you hand storage to the guest without also specifying the storage format that probing becomes an issue.

--
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

On 03/12/2014 02:21 PM, Eric Blake wrote:
tl;dr: I am working on a series of patches to expose backing chain information in <domain> XML. Comments are welcome, to make sure my XML design is on the right track.
A backing chain of 3 files (base <- mid <- top) in the local file system:
<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2'/>
  <source file='/var/lib/libvirt/images/top.qcow2'/>
  <backingStore type='file'>
    <driver name='qemu' type='qcow2'/>
    <source file='/var/lib/libvirt/images/mid.qcow2'/>
    <backingStore type='file'>
      <driver name='qemu' type='qcow2'/>
      <source file='/var/lib/libvirt/images/base.qcow2'/>
      <backingStore/>
    </backingStore>
  </backingStore>
  <target dev='vda' bus='virtio'/>
</disk>
Now that I've been working on virStorageSource for a while, I think I want to modify this slightly. Most attributes of <driver> are per-device (things you only set once, such as cache='writethrough'); furthermore, the <volume> XML for storage volumes uses a dedicated <format type='qcow2'/> rather than a <driver> element. For back-compat, we can't drop the old spelling, but I'm thinking the backing chain should prefer <format>. So if we use nested format, it would look something like:

<disk type='file' device='disk'>
  <format type='qcow2'/>
  <source file='/var/lib/libvirt/images/top.qcow2'/>
  <backingStore type='file'>
    <format type='qcow2'/>
    <source file='/var/lib/libvirt/images/mid.qcow2'/>
    <backingStore type='file'>
      <format type='qcow2'/>
      <source file='/var/lib/libvirt/images/base.qcow2'/>
      <backingStore/>
    </backingStore>
  </backingStore>
  <driver name='qemu' type='qcow2' cache='writethrough'/>
  <target dev='vda' bus='virtio'/>
</disk>

Or with Dan's proposal to prefer a flat listing, something like:

<disk type='file' device='disk'>
  <id index='1'/>
  <format type='qcow2'/>
  <source file='/var/lib/libvirt/images/top.qcow2'/>
  <backingStore type='file'>
    <id index='2' parent='1'/>
    <format type='qcow2'/>
    <source file='/var/lib/libvirt/images/mid.qcow2'/>
  </backingStore>
  <backingStore type='file'>
    <id index='3' parent='2'/>
    <format type='qcow2'/>
    <source file='/var/lib/libvirt/images/base.qcow2'/>
  </backingStore>
  <backingStore/>
  <driver name='qemu' type='qcow2' cache='writethrough'/>
  <target dev='vda' bus='virtio'/>
</disk>

As I look at it, I'm also worried that a flat listing makes it hard to tell the difference between a partial chain (for example, we know 'top' has a backing of 'mid', but haven't followed the chain to see if 'mid' also has a backing file) vs. an explicit end of the chain. At least with the nested listing, a <backingStore/> marker fits in better.

--
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

On Sat, Apr 12, 2014 at 08:49:50PM -0600, Eric Blake wrote:
On 03/12/2014 02:21 PM, Eric Blake wrote:
tl;dr: I am working on a series of patches to expose backing chain information in <domain> XML. Comments are welcome, to make sure my XML design is on the right track.
A backing chain of 3 files (base <- mid <- top) in the local file system:
<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2'/>
  <source file='/var/lib/libvirt/images/top.qcow2'/>
  <backingStore type='file'>
    <driver name='qemu' type='qcow2'/>
    <source file='/var/lib/libvirt/images/mid.qcow2'/>
    <backingStore type='file'>
      <driver name='qemu' type='qcow2'/>
      <source file='/var/lib/libvirt/images/base.qcow2'/>
      <backingStore/>
    </backingStore>
  </backingStore>
  <target dev='vda' bus='virtio'/>
</disk>
Now that I've been working on virStorageSource for a while, I think I want to modify this slightly. Most attributes of <driver> are per-device (things you only set once, such as cache='writethrough'); furthermore, the <volume> XML for storage volumes uses a dedicated <format type='qcow2'/> rather than a <driver> element. For back-compat, we can't drop the old spelling, but I'm thinking the backing chain should prefer <format>. So if we use nested format, it would look something like:
<disk type='file' device='disk'>
  <format type='qcow2'/>
  <source file='/var/lib/libvirt/images/top.qcow2'/>
  <backingStore type='file'>
    <format type='qcow2'/>
    <source file='/var/lib/libvirt/images/mid.qcow2'/>
    <backingStore type='file'>
      <format type='qcow2'/>
      <source file='/var/lib/libvirt/images/base.qcow2'/>
      <backingStore/>
    </backingStore>
  </backingStore>
  <driver name='qemu' type='qcow2' cache='writethrough'/>
  <target dev='vda' bus='virtio'/>
</disk>
Or with Dan's proposal to prefer a flat listing, something like:
<disk type='file' device='disk'>
  <id index='1'/>
  <format type='qcow2'/>
  <source file='/var/lib/libvirt/images/top.qcow2'/>
  <backingStore type='file'>
    <id index='2' parent='1'/>
    <format type='qcow2'/>
    <source file='/var/lib/libvirt/images/mid.qcow2'/>
  </backingStore>
  <backingStore type='file'>
    <id index='3' parent='2'/>
    <format type='qcow2'/>
    <source file='/var/lib/libvirt/images/base.qcow2'/>
  </backingStore>
  <backingStore/>
  <driver name='qemu' type='qcow2' cache='writethrough'/>
  <target dev='vda' bus='virtio'/>
</disk>
As I look at it, I'm also worried that a flat listing makes it hard to tell the difference between a partial chain (for example, we know 'top' has a backing of 'mid', but haven't followed the chain to see if 'mid' also has a backing file) vs. an explicit end of the chain. At least with the nested listing, a <backingStore/> marker fits in better.
Ok, I won't object to using a nested listing, since that's what most folks seem to prefer. Hopefully apps won't create chains that go crazy deep :-) Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|