On 12/02/2015 02:08 AM, Simon Kollberg wrote:
Hi!
Apologies for not noticing this mail sooner.
I'm working on supporting a new FT/HA solution for qemu called COLO
(http://wiki.qemu.org/Features/COLO). The part that is currently being
focused on for libvirt integration is Block Replication
(http://wiki.qemu.org/Features/BlockReplication), which enables guest
state synchronization for disks.
Here are some rough thoughts on the matter, although we may go through
several iterations before landing on something that everyone likes.
Right now there are three issues that I'd like to get your input on:
1.
As you can see on the block replication wiki page, we need to reference the
secondary disk id.
Example from the wiki:
-drive if=none,driver=raw,file.filename=1.raw,id=colo1 \
-drive if=xxx,driver=replication,mode=secondary,\
...
file.backing.backing=colo1
My initial thought was to manually set the alias of the
disk and add a new reference element to the backingStore:
<disk type='file' device='disk'>
...
<alias name='colo1'/>
</disk>
<disk type='file' device='disk'>
...
<backingStore type='file'>
...
<reference name='colo1'/>
</backingStore>
</disk>
Though, I quickly realized that setting the alias is only done by the
hypervisor and is therefore not an option with the current code.
Would it be bad letting the user set the alias, and if so, do you have any
ideas of how to solve the referencing?
I'm a little bit leery of letting the user set the alias; one benefit
we've had of NOT letting the user control it is that we could avoid name
collisions. It's not a strong enough reason to reject the idea, but
certainly worth thinking about.
Another consideration: if you do 'virsh dumpxml' on a running domain,
the live XML contains alias names; you can then 'virsh define' that XML,
and the aliases will be silently dropped. This is in fact useful if we
have to change the alias name we generate under the hood when first
starting a domain under a newer version of qemu. If the user can set
the alias, we are stuck with that name. Then again, as long as we have
an alias name and use it consistently, we can just document that the
user must not cause conflicts, and making the name persistent may be
rather easy.
On the other hand, we DO want to make the index='1' of <backingStore>
something that becomes persistent. And the <target dev='...'> attribute
coupled with the <backingStore index='...'> is sufficiently unique to
reference ANY element of the backing chain.
That is, I would lean towards something more like this:
<disk type='file' device='disk'>
...
<source file='...' index='0'/>
<backingStore/>
<target dev='vda' bus='virtio'/>
</disk>
<disk type='file' device='disk'>
...
<backingStore type='replication'>
...
<reference dev='vda' index='0'/>
</backingStore>
</disk>
A couple of things to note there: I think a new type='replication'
(rather than reusing existing type='file') will make it obvious that we
are adding new XML specifically for block replication. Then, in that new
type, we can add a new <reference> that names dev='vda' and index='0'
as what the device will be replicating (we'll have to start exposing an
index for the active layer, not just the backingStore layers).
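To make the lookup concrete, here is a minimal Python sketch of how a
(dev, index) pair could be resolved to one layer of a backing chain. It is
only an illustration under assumptions: the index attribute on <source>,
type='replication' and <reference> are the proposal above, not existing
libvirt schema, and resolve_reference is a hypothetical helper name. The
file names are borrowed from the wiki examples.

```python
import xml.etree.ElementTree as ET

# Hypothetical domain XML following the proposal; nothing here is
# guaranteed to match what libvirt ends up accepting.
DOMAIN_XML = """
<domain>
  <devices>
    <disk type='file' device='disk'>
      <source file='1.raw' index='0'/>
      <backingStore/>
      <target dev='vda' bus='virtio'/>
    </disk>
    <disk type='file' device='disk'>
      <source file='active_disk.qcow2'/>
      <backingStore type='replication'>
        <reference dev='vda' index='0'/>
      </backingStore>
    </disk>
  </devices>
</domain>
"""


def resolve_reference(domain, dev, index):
    """Return the file of layer `index` in the chain of the disk with target `dev`."""
    for disk in domain.iter("disk"):
        target = disk.find("target")
        if target is None or target.get("dev") != dev:
            continue
        # Walk from the active layer down through nested <backingStore>
        # elements; the index may sit on <source> (active layer, as proposed)
        # or on <backingStore> itself (as libvirt already reports it).
        layer = disk
        while layer is not None:
            source = layer.find("source")
            if source is not None:
                layer_index = source.get("index") or layer.get("index")
                if layer_index == str(index):
                    return source.get("file")
            layer = layer.find("backingStore")
    return None


domain = ET.fromstring(DOMAIN_XML)
ref = domain.find(".//reference")
print(resolve_reference(domain, ref.get("dev"), ref.get("index")))  # prints 1.raw
```

The point of the sketch is that no user-chosen alias is needed: the target
name plus the layer index is enough to find the referenced image.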
2.
The format of the disk and the driver type currently share the same
attribute in libvirt (the type attribute on the driver XML element).
However, with the new replication disk driver you need to be able to set
both the disk format and also the driver name.
Example from the wiki:
-drive if=xxx,driver=replication,mode=secondary,\
file.file.filename=active_disk.qcow2,\
file.driver=qcow2,\
So we are basically stacking TWO drivers on top of a single file. I
think that means we'll want two layers of XML, something like:
<disk type='replication'>
<backingStore type='file'>
<driver name='qemu' type='qcow2'/>
<source file='/path/to/active_disk.qcow2'/>
</backingStore>
</disk>
Again, anywhere we have two layers of protocol in qemu to get to the
underlying file, it makes sense to have two layers of XML in libvirt.
We'll want the same sort of thing with type='quorum' as a new disk type
for handling quorum drives, where those have 0 direct <source> elements
but instead have multiple <backingStore> child elements. Ideally, since
everything can be represented as a BDS tree in qemu, it should also be
represented as a similar tree in XML in libvirt, except that libvirt has
already taken the shortcut that a single protocol and file layer can be
combined (that is, we show qcow2 images and source files in the same
layer), due to historical usage.
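As a rough illustration of the "XML mirrors the BDS tree" idea, the Python
sketch below walks nested <backingStore> elements and lists one entry per
layer. The XML shapes are the ones floated in this mail (type='replication'
and type='quorum' on <disk>), so they are assumptions, not current libvirt
schema; the quorum child paths a.qcow2 and b.qcow2 and the helper name
layers are made up for the example.

```python
import xml.etree.ElementTree as ET

# Two layers stacked on one file, as suggested above (hypothetical schema).
REPLICATION_XML = """
<disk type='replication'>
  <backingStore type='file'>
    <driver name='qemu' type='qcow2'/>
    <source file='/path/to/active_disk.qcow2'/>
  </backingStore>
</disk>
"""

# A quorum disk: no direct <source>, several <backingStore> children.
QUORUM_XML = """
<disk type='quorum'>
  <backingStore type='file'>
    <driver name='qemu' type='qcow2'/>
    <source file='a.qcow2'/>
  </backingStore>
  <backingStore type='file'>
    <driver name='qemu' type='qcow2'/>
    <source file='b.qcow2'/>
  </backingStore>
</disk>
"""


def layers(node, depth=0):
    """Yield (depth, layer type, format, file) for node and each nested layer."""
    driver = node.find("driver")
    source = node.find("source")
    yield (
        depth,
        node.get("type"),
        driver.get("type") if driver is not None else None,
        source.get("file") if source is not None else None,
    )
    # Each <backingStore> child is its own subtree, so quorum-style
    # fan-out and replication-style stacking use the same walk.
    for child in node.findall("backingStore"):
        yield from layers(child, depth + 1)


for xml in (REPLICATION_XML, QUORUM_XML):
    for depth, kind, fmt, path in layers(ET.fromstring(xml)):
        print("  " * depth + f"{kind} format={fmt} file={path}")
```

One recursive walk covers both cases, which is the appeal of keeping one
XML layer per qemu layer.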
...
I saw that there was a function in libvirt called virStorageFileProbeFormat
that could let us get the format of the disk without stating it in the XML.
But as I'm sure you know, it's strongly advised not to be used since you can
trick the function by modifying the disk file.
Correct, any solution that requires probing rather than explicit format
will not fly.
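For anyone wondering why probing is a problem, here is a small Python
sketch of the trick. naive_probe is a hypothetical stand-in for any
magic-number check; it is not how virStorageFileProbeFormat is
implemented. The only fact relied on is that a qcow2 image starts with
the bytes 'Q', 'F', 'I', 0xfb.

```python
QCOW2_MAGIC = b"QFI\xfb"  # first four bytes of every qcow2 image


def naive_probe(header: bytes) -> str:
    """Guess the image format from its leading bytes (illustration only)."""
    if header[:4] == QCOW2_MAGIC:
        return "qcow2"
    return "raw"


# A raw image is just the guest's disk contents, so the guest controls
# every byte of it, including the first sector.
raw_disk = bytearray(512)
print(naive_probe(bytes(raw_disk)))  # prints raw

# The guest writes a qcow2 signature at offset 0 of its own disk.
raw_disk[0:4] = QCOW2_MAGIC
print(naive_probe(bytes(raw_disk)))  # prints qcow2
```

Once the host treats the image as qcow2, the rest of the guest-written
header is trusted too, and a qcow2 header can name a backing file on the
host. Stating the format explicitly in the XML closes that hole.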
3.
When using the replication driver the secondary disk is supposed to be added
but not attached.
Example from the wiki:
-drive if=none,driver=raw,file.filename=1.raw,id=colo1 \
-drive if=xxx,driver=replication,mode=secondary,\
...
Clearly, trying to set up a disk without a target is not allowed at the
moment.
Is there any better way of doing it?
Hmm. I'm almost wondering if <disk> is the wrong element. Most of the
XML is trying to describe something the guest will see, but if we are
creating a replication driver that is NOT visible to the guest, that
almost argues that we should create an entirely new sibling element next
to <disk>. The new element would not need a <target> (because it is not
guest visible), but would otherwise be similar to <disk>.
--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library
http://libvirt.org