On Fri, Feb 12, 2016 at 13:22:01 -0700, Eric Blake wrote:
On 12/02/2015 02:08 AM, Simon Kollberg wrote:
[...]
> My initial thought was to manually set the alias of the
> disk and add a new reference element to the backingStore:
> <disk type='file' device='disk'>
> ...
> <alias name='colo1'/>
> </disk>
> <disk type='file' device='disk'>
> ...
> <backingStore type='file'>
> ...
> <reference name='colo1'/>
> </backingStore>
> </disk>
>
> Though, I quickly realized that setting the alias is only done by the
> hypervisor and is therefore not an option with the current code.
>
> Would it be bad letting the user set the alias, and if so, do you have any
> ideas of how to solve the referencing?
I'm a little bit leery of letting the user set the alias; one benefit
we've had of NOT letting the user control it is that we could avoid name
collisions. It's not a strong enough reason to reject the idea, but
certainly worth thinking about.
I'd rather not allow the user to set the alias at all. The effort to
allow this will be too complex.
Since the XML allows hierarchy we certainly can use that to express this
too rather than using alias as a cross reference.
Another consideration, if you do 'virsh dumpxml' on a running
domain,
the live xml contains alias names; you can then 'virsh define' that xml,
and the aliases will be silently dropped. This is in fact useful, if we
have to change the alias name we generate under the hood when first
starting a domain under a newer version of qemu. If the user can set
Another example why that would be a bad idea.
the alias, we are stuck with that name. On the other hand, as long
as
we have an alias name and use it consistently, we can just document that
the user can't cause conflicts, making the name persistent may rather easy.
I certainly see more problems than gain in doing this.
On the other hand, we DO want to make the index='1' of
<backingStore>
something that becomes persistent. And the <target dev='...'> attribute
coupled with the <backingStore index='...'> is sufficiently unique to
reference ANY element of the backing chain.
Yes. My plan is to make the index='..' eventually match a node name of a
qemu block driver tree so that field will eventually be remembered and
will stop being auto-assigned. This is unfortunately a pre-requisite for
a lot of other block stuff that needs to happen. Without node-name
support we can't really predictibly work on the backing chain especially
when it consists of networked storage. In addition it makes block jobs
(blockCopy (drive-mirror), snapshots, etc ..) unusable with eg. the
quorum driver which is currently under discussion in libvirt.
I'm planing to add node name and backing chain tracking support to
libvirt in a "short" time frame.
That is, I would lean towards something more like this:
<disk type='file' device='disk'>
...
<source file='...' index='0'/>
<backingStore/>
<target dev='vda' bus='virtio'/>
</disk>
<disk type='file' device='disk'>
...
<backingStore type='replication'>
...
<reference dev='vda' index='0'/>
</backingStore>
</disk>
I don't think this is a good idea a <disk> element adds a guest visible
disk. Since the replication mechanism does not add a new drive to the
guest but just mirrors the writes of the referenced drive it should not
be stored as <disk>.
We either should add a new section, eg.. <resources> that will similarly
to <devices> allow to add images and other stuff tracked by libvirt that
does not influence the guest ABI, but is used by internal stuff, or use
a way to express it in the disk element itself. Adding a <disk> that
does not add a disk to a guest is not the right way.
Additionally I'm worried by abusing the <backingStore> element to
represent the top level data for either this purpose or for quorum
disks. Using it in that way then basically changes the meaning of the
disk element to represent a qemu filter in this case.
Unfortunately, we didn't make <backingStore> a sub-element of <source>
which would clarify this greatly. Unfortunately <source> itself has bad
design as the type of it is stolen from the parent <disk> element.
For this instance I don't think that we have to stick with the old
subelements of the <disk> elements and can make a more sane design
without trying to bend the XML to look the same. If users want to use
new features, we certainly can make the XML look different.
We can then have either multiple sources with the correct type
(basically make them look the same to <backingStore>) and nest
<backingStore> directly under source. This will then make it really
clear what files are the source for the quourum/replication/whatever and
what is the backing of that. We also have to have a way how to nest
stuff as quorum, since it may become a backing of a snapshot.
A couple of things to note there: I think a new
type='replication'
(rather than reusing existing type='file') will make it obvious that we
are adding new XML specifically for block replication; then in that new
type, we can add a new <reference> that refers to dev='vda' and
index='0' (we'll have to start exposing an index for the active layer,
not just the backingStore layers), as what the device will be replicating.
Or we can change <source> or other stuff too, since this is a new
feature.
> 2.
> The format of the disk and the driver type currently shares the same
> attribute in libvirt (the type attribute on driver XML element). However,
> with
> the new replication disk driver you need to be able to set both the disk
> format
> and also the driver name.
>
> Example from the wiki:
> -drive if=xxx,driver=replication,mode=secondary,\
> file.file.filename=active_disk.qcow2,\
> file.driver=qcow2,\
So we are basically stacking TWO drivers on top of a single file. I
think that means we'll want two layers of XML, something like:
<disk type='replication'>
<backingStore type='file'>
<driver name='qemu' type='qcow2'>
<source file='/path/to/active_disk.qcow2'/>
</backingStore>
</disk>
Again, anywhere we have two layers of protocol in qemu to get to the
underlying file, it makes sense to have two layers of XML in libvirt.
We'll want the same sort of type='quorum' as a new disk type for
handling quorum drives, where those 0 direct <source> elements but
instead have multiple <backingStore> child elements. Ideally, since
I'm worried by this a bit. This bends the definition of <backingStore>,
<source> and <disk> itself.
<backingStore> is bent by becoming the top level image.
<source> is not present (this already will break existing apps in some
cases)
and <disk> becomes a synonym for the quorum filter itself. In turn this
needs to be expressed in <backingStore> too, since they can be possibly
nested. This should be partially ok, but I don't really like
<backingStore> becoming the writable image. Also it feels a bit too qemu
specific in some aspects.
everything can be represented as a BDS tree in qemu, it should also
be
represented as a similar tree in XML in libvirt, except that libvirt has
already taken the shortcut that a single protocol and file layer can be
combined (that is, we show qcow2 images and source files in the same
layer), due to historical usage.
This should be fine since it's currently a 1 to 1 mapping. Well perhaps
except for backing files
> ...
>
> I saw that there was a function in libvirt called virStorageFileProbeFormat
> that could let us get the format of the disk without stating it in the XML.
> But
> as I'm sure you know, it's strongly advised not to be used since you can
> trick
> the function by modifying the disk file.
Correct, any solution that requires probing rather than explicit format
will not fly.
>
>
> 3.
> When using the replication driver the secondary disk is supposed to be added
> but not attached.
> Example from the wiki:
> -drive if=none,driver=raw,file.filename=1.raw,id=colo1 \
> -drive if=xxx,driver=replication,mode=secondary,\
> ...
>
> Clearly, trying to setup a disk without a target is not allowed at the
> moment.
> Is there any better way of doing it?
Hmm. I'm almost wondering if <disk> is the wrong element. Most of the
XML is trying to describe something the guest will see, but if we are
creating a replication driver that is NOT visible to the guest, that
almost argues that we should create an entirely new sibling element next
to <disk>. The new element would not need a <target> (because it is not
guest visible), but would otherwise be similar to <disk>.
You should have written this first so that I would not have to rant
right away. Yes, using <disk> is a terrible idea in this context.
Additionally, looking through the docs for block replication posted on
qemu devel require to use unstable commands such as "x-blockdev-change"
which, until it's stable is not an option in libvirt.
Also for all this to be useful, the rest of the COLO stuff needs to be
merged too, otherwise the block replication code won't be used at all.
Peter