On 02/16, Peter Krempa wrote:
On Fri, Feb 12, 2016 at 13:22:01 -0700, Eric Blake wrote:
> On 12/02/2015 02:08 AM, Simon Kollberg wrote:
[...]
> > My initial thought was to manually set the alias of the
> > disk and add a new reference element to the backingStore:
> > <disk type='file' device='disk'>
> > ...
> > <alias name='colo1'/>
> > </disk>
> > <disk type='file' device='disk'>
> > ...
> > <backingStore type='file'>
> > ...
> > <reference name='colo1'/>
> > </backingStore>
> > </disk>
> >
> > Though, I quickly realized that setting the alias is only done by the
> > hypervisor and is therefore not an option with the current code.
> >
> > Would it be bad letting the user set the alias, and if so, do you
> > have any
> > ideas of how to solve the referencing?
>
> I'm a little bit leery of letting the user set the alias; one benefit
> we've had of NOT letting the user control it is that we could avoid name
> collisions. It's not a strong enough reason to reject the idea, but
> certainly worth thinking about.
I'd rather not allow the user to set the alias at all. The effort to
allow this will be too complex.
Since the XML allows hierarchy we certainly can use that to express this
too rather than using alias as a cross reference.
Although I'm not opposed to the idea, using a hierarchical definition
will probably not be as straight forward as having two separate disks
since the secondary disk and the replication disk buffers are separated
in the qemu command line. It would make other stuff way easier though,
see my example below.
> Another consideration, if you do 'virsh dumpxml' on a
running domain,
> the live xml contains alias names; you can then 'virsh define' that xml,
> and the aliases will be silently dropped. This is in fact useful, if we
> have to change the alias name we generate under the hood when first
> starting a domain under a newer version of qemu. If the user can set
Another example why that would be a bad idea.
> the alias, we are stuck with that name. On the other hand, as long as
> we have an alias name and use it consistently, we can just document that
> the user can't cause conflicts, making the name persistent may
> rather easy.
I certainly see more problems than gain in doing this.
Agreed, letting the user set the alias seem to be the wrong way to
handle the referencing.
> On the other hand, we DO want to make the index='1' of
<backingStore>
> something that becomes persistent. And the <target dev='...'>
attribute
> coupled with the <backingStore index='...'> is sufficiently unique to
> reference ANY element of the backing chain.
Yes. My plan is to make the index='..' eventually match a node name of a
qemu block driver tree so that field will eventually be remembered and
will stop being auto-assigned. This is unfortunately a pre-requisite for
a lot of other block stuff that needs to happen. Without node-name
support we can't really predictibly work on the backing chain especially
when it consists of networked storage. In addition it makes block jobs
(blockCopy (drive-mirror), snapshots, etc ..) unusable with eg. the
quorum driver which is currently under discussion in libvirt.
I'm planing to add node name and backing chain tracking support to
libvirt in a "short" time frame.
> That is, I would lean towards something more like this:
>
> <disk type='file' device='disk'>
> ...
> <source file='...' index='0'/>
> <backingStore/>
> <target dev='vda' bus='virtio'/>
> </disk>
> <disk type='file' device='disk'>
> ...
> <backingStore type='replication'>
> ...
> <reference dev='vda' index='0'/>
> </backingStore>
> </disk>
I don't think this is a good idea a <disk> element adds a guest visible
disk. Since the replication mechanism does not add a new drive to the
guest but just mirrors the writes of the referenced drive it should not
be stored as <disk>.
We either should add a new section, eg.. <resources> that will similarly
to <devices> allow to add images and other stuff tracked by libvirt that
does not influence the guest ABI, but is used by internal stuff, or use
a way to express it in the disk element itself. Adding a <disk> that
does not add a disk to a guest is not the right way.
Additionally I'm worried by abusing the <backingStore> element to
represent the top level data for either this purpose or for quorum
disks. Using it in that way then basically changes the meaning of the
disk element to represent a qemu filter in this case.
Unfortunately, we didn't make <backingStore> a sub-element of <source>
which would clarify this greatly. Unfortunately <source> itself has bad
design as the type of it is stolen from the parent <disk> element.
For this instance I don't think that we have to stick with the old
subelements of the <disk> elements and can make a more sane design
without trying to bend the XML to look the same. If users want to use
new features, we certainly can make the XML look different.
We can then have either multiple sources with the correct type
(basically make them look the same to <backingStore>) and nest
<backingStore> directly under source. This will then make it really
clear what files are the source for the quourum/replication/whatever and
what is the backing of that. We also have to have a way how to nest
stuff as quorum, since it may become a backing of a snapshot.
I also found it a bit strange that backingStore was used for quorum disk
sources for the same reason that you listed. But in the case of block
replication, qemu is chaining the replication buffer disks and the
secondary disk with backing. Still, using backingStore for the secondary
disk is not a great way to handle it. I don't see an issue using
backingStore for the active/hidden disk relation though.
> A couple of things to note there: I think a new
type='replication'
> (rather than reusing existing type='file') will make it obvious that we
> are adding new XML specifically for block replication; then in that new
> type, we can add a new <reference> that refers to dev='vda' and
> index='0' (we'll have to start exposing an index for the active layer,
> not just the backingStore layers), as what the device will be replicating.
Or we can change <source> or other stuff too, since this is a new
feature.
>
> > 2.
> > The format of the disk and the driver type currently shares the same
> > attribute in libvirt (the type attribute on driver XML element).
> > However,
> > with
> > the new replication disk driver you need to be able to set both the disk
> > format
> > and also the driver name.
> >
> > Example from the wiki:
> > -drive if=xxx,driver=replication,mode=secondary,\
> > file.file.filename=active_disk.qcow2,\
> > file.driver=qcow2,\
>
> So we are basically stacking TWO drivers on top of a single file. I
> think that means we'll want two layers of XML, something like:
>
> <disk type='replication'>
> <backingStore type='file'>
> <driver name='qemu' type='qcow2'>
> <source file='/path/to/active_disk.qcow2'/>
> </backingStore>
> </disk>
>
> Again, anywhere we have two layers of protocol in qemu to get to the
> underlying file, it makes sense to have two layers of XML in libvirt.
> We'll want the same sort of type='quorum' as a new disk type for
> handling quorum drives, where those 0 direct <source> elements but
> instead have multiple <backingStore> child elements. Ideally, since
I'm worried by this a bit. This bends the definition of <backingStore>,
<source> and <disk> itself.
<backingStore> is bent by becoming the top level image.
<source> is not present (this already will break existing apps in some
cases)
and <disk> becomes a synonym for the quorum filter itself. In turn this
needs to be expressed in <backingStore> too, since they can be possibly
nested. This should be partially ok, but I don't really like
<backingStore> becoming the writable image. Also it feels a bit too qemu
specific in some aspects.
> everything can be represented as a BDS tree in qemu, it should also be
> represented as a similar tree in XML in libvirt, except that libvirt has
> already taken the shortcut that a single protocol and file layer can be
> combined (that is, we show qcow2 images and source files in the same
> layer), due to historical usage.
This should be fine since it's currently a 1 to 1 mapping. Well perhaps
except for backing files
> > ...
> >
> > I saw that there was a function in libvirt called
> > virStorageFileProbeFormat
> > that could let us get the format of the disk without stating it in
> > the XML.
> > But
> > as I'm sure you know, it's strongly advised not to be used since you
can
> > trick
> > the function by modifying the disk file.
>
> Correct, any solution that requires probing rather than explicit format
> will not fly.
>
> >
> >
> > 3.
> > When using the replication driver the secondary disk is supposed
> > to be added
> > but not attached.
> > Example from the wiki:
> > -drive if=none,driver=raw,file.filename=1.raw,id=colo1 \
> > -drive if=xxx,driver=replication,mode=secondary,\
> > ...
> >
> > Clearly, trying to setup a disk without a target is not allowed at the
> > moment.
> > Is there any better way of doing it?
>
> Hmm. I'm almost wondering if <disk> is the wrong element. Most of the
> XML is trying to describe something the guest will see, but if we are
> creating a replication driver that is NOT visible to the guest, that
> almost argues that we should create an entirely new sibling element next
> to <disk>. The new element would not need a <target> (because it is
not
> guest visible), but would otherwise be similar to <disk>.
You should have written this first so that I would not have to rant
right away. Yes, using <disk> is a terrible idea in this context.
I get why using <disk> for something that is not going to be visible to
the guest is a bad idea. On the other hand, in this case the disk that
is not going to be attached to the guest is actually the normal disk
(labeled "secondary disk" on the block replication wiki). I don't think
it makes sense to make that into something that is not <disk>.
I think many issues that has been brought up would be solved by a new
disk definition where the replication and the normal disk is combined.
But again, I don't think that the quorum and the block replication
should be defined the same way since block replication acts more as a
chain while the quorum as an array.
What do you guys think about this?
<disk type='file' device='disk'>
<driver name='qemu' type='raw'/>
<source file='/opt/libvirt/1.raw'/>
<target dev='vda' bus='virtio'/>
<replication type='file' device='disk'
mode='secondary'>
<driver name='qemu' type='qcow2'/>
<source file='/opt/libvirt/active_disk.qcow2'/>
<backingStore type='file'>
<format type='qcow2'/>
<source file='/opt/libvirt/hidden_disk.qcow2'/>
</backingStore>
</replication>
</disk>
This would solve all three issues: referencing, replication driver +
file driver in the same definition and a single target. One downside is
that the <replication> element is going to be more or less a <disk>
clone.
Another option would be to allow for nested <disk> elements and add a
bogus replication target.
<disk type='file' device='disk'>
<driver name='qemu' type='raw'/>
<source file='/opt/libvirt/1.raw'/>
<target dev='replication' bus='virtio'/>
<disk type='file' device='disk' mode='secondary'>
<driver name='qemu' type='qcow2'/>
<source file='/opt/libvirt/active_disk.qcow2'/>
<target dev='vda' bus='virtio'/>
<backingStore type='file'>
<format type='qcow2'/>
<source file='/opt/libvirt/hidden_disk.qcow2'/>
</backingStore>
</disk>
</disk>
Additionally, looking through the docs for block replication posted
on
qemu devel require to use unstable commands such as "x-blockdev-change"
which, until it's stable is not an option in libvirt.
I agree that it is still way to early to push this stuff upstream. But I
think it's a good thing to start the discussion and experiment with new
features before they are marked as stable and the x-tag is removed. That
way libvirt is keeping up to date and can also impact the current work
done in qemu by doing early testing.
Also for all this to be useful, the rest of the COLO stuff needs to
be
merged too, otherwise the block replication code won't be used at all.
From what I've gathered, block replication is supposed to be an
individual feature, separate from COLO. Even so, I see your point about
not including something that is not being used at the moment.