Hi Peter,
I'm the current maintainer of Quorum in QEMU and I'd like to try to
answer some of your comments.
On Fri, Jan 08, 2016 at 06:20:04PM +0100, Peter Krempa wrote:
So I have a few comments/observations regarding the quorum block
driver in qemu and its usability.
First I'd like to ask you to describe your use case a bit
more. I'm currently lacking the motivation to do anything about
this, as the series is just partial and I don't really see any
advantage of using the quorum driver at all and can't come up with
any useful use case.
Also a good use case is usually a good reason to drive development
of a feature and I'm afraid that this could become abandoned without
any real use.
The original use case for which Quorum was designed was a data center
doing redundancy with storage in multiple separate rooms shared using
NFS.
One of the issues that the customer was facing was not only problems
in the file servers themselves but, mainly, data corruption across
the network. Quorum can correct this on the fly and is able to
identify which one of the file servers is causing the problem without
having to rebuild a whole array (as would be the case with RAID).
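
To illustrate how voting can point at the misbehaving file server, here
is a toy Python sketch. It is only a model of the idea and not QEMU's
actual code; the function names and the in-memory "reads" are
hypothetical.

```python
from collections import Counter

def vote(copies, threshold):
    """Return the copy that at least `threshold` members agree on."""
    winner, votes = Counter(copies).most_common(1)[0]
    if votes < threshold:
        raise IOError("quorum not reached")
    return winner

def outvoted_counts(reads, threshold):
    """Count how often each member index lost the vote.

    reads: one tuple of copies per sector, one copy per member
    (a hypothetical stand-in for reading from each file server).
    """
    blame = Counter()
    for copies in reads:
        winner = vote(copies, threshold)
        for idx, data in enumerate(copies):
            if data != winner:
                blame[idx] += 1
    return blame

# Member 1 returns corrupted data for two of the three sectors.
reads = [(b"a", b"a", b"a"), (b"b", b"x", b"b"), (b"c", b"y", b"c")]
blame = outvoted_counts(reads, threshold=2)
```

A member that keeps being outvoted (index 1 here) is the likely
culprit, and the guest still received correct data for every read.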
Quorum is also used for the COLO block replication functionality
currently being discussed in QEMU:
http://wiki.qemu.org/Features/BlockReplication
1) No tracking of integrity
As the quorum members don't have headers, failed quorum members
are not recorded and remembered. The user or management app then
has to do this externally for given storage devices.
2) No internal tracking of quorum members
Members of the quorum don't have any header marking them
as such and thus any images may be mixed together with
unforeseen/catastrophic results. Higher level management then
needs to take the role of remembering which images belong
together. Reimplementing this looks like reimplementing a
distributed storage system to me.
That's right, Quorum does not have its own file format and was
designed to work with any driver or protocol that QEMU supports, so
I'm not sure if there's much that can be done about this.
3) Lack of auto-resync:
Once the quorum gets a few inconsistencies it does not
automatically resync like the linux MD driver. With the current
implementation the only way to resync this would be to issue a
block-mirror (blockCopy) to /dev/null so that all blocks are
read and rewritten to the identical copy. This also requires a
user action.
Additionally, a member of the quorum is not ignored if it was
out of sync at some earlier point and has not been resynced, which
allows for split-brain/corruption scenarios.
Quorum can fix errors on the fly (there's the 'rewrite-corrupted' flag
for that), so in those cases no manual intervention is required.
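
As a rough illustration of what fixing errors on the fly means, here is
a toy Python model. It is an assumption-laden sketch, not the QEMU
implementation: members are plain lists of sectors and the function
name is made up.

```python
from collections import Counter

def read_and_repair(members, sector, threshold):
    """Read one sector from every member, vote, and rewrite the losers."""
    copies = [member[sector] for member in members]
    winner, votes = Counter(copies).most_common(1)[0]
    if votes < threshold:
        raise IOError("quorum not reached for sector %d" % sector)
    repaired = []
    for idx, member in enumerate(members):
        if member[sector] != winner:
            member[sector] = winner  # overwrite the outvoted copy
            repaired.append(idx)
    return winner, repaired

# Three members with two sectors each; member 1 has a bad sector 1.
members = [[b"s0", b"s1"], [b"s0", b"XX"], [b"s0", b"s1"]]
data, repaired = read_and_repair(members, sector=1, threshold=2)
```

Note that in this model only sectors that are actually read get
repaired, which is why a full resync is a separate question.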
If we want a way to auto-resync a complete image, that should be
doable. I believe it's relatively simple to implement in QEMU
(depending on the semantics).
For the manual resync I also agree that it would be good to have a
simple API to do that in case the user wants to do it manually. That
can be done.
4) Necessity for at least 3 copies
Since a majority needs to win in a vote, you need at least 3
member disks for this to be fault-tolerant.
5) Lack of speedup
Since all blocks are always read from all members and verified,
the quorum backend doesn't really add any speed to the
reads. This can be mostly attributed to the fact that fault
tracking is not present.
In other cases, due to internal error correcting codes it's very
unlikely that a storage medium would return a corrupted sector
without producing an error.
4) and 5) are part of the design of Quorum. As I said, one of the goals
is to detect (and correct) silent data corruption on the fly, not to
speed up disk access or to be space efficient.
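
The arithmetic behind the "at least 3 copies" point can be sketched as
follows, assuming the vote threshold is a strict majority of the
members (the threshold in a real setup is a configuration choice, and
the function name here is hypothetical):

```python
def tolerated_failures(members):
    """Members that may fail or disagree while a strict majority remains."""
    threshold = members // 2 + 1  # assumed: strict majority wins the vote
    return members - threshold

# Number of bad members tolerated for 2, 3 and 5 copies.
tolerance = {n: tolerated_failures(n) for n in (2, 3, 5)}
```

With two members no disagreement can be resolved, three members
tolerate one bad copy and five tolerate two.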
6) Almost every remote storage technology does quorums internally
Any distributed storage (ceph/rbd, gluster, sheepdog, etc.)
provides the quorum functionality internally, with the added benefit
that its internal workings fix problems when a network split
occurs.
7) Tools are restricted to qemu and qemu-img
It's a "proprietary" implementation so for a rebuild you have
to use one of the two tools. AFAIK qemu-img is not really
user friendly for the less common disk backends and we don't
really provide any abstraction on top of that. This means
that there really aren't any reasonable tools to do an offline
resync. (Okay, if you know which instance is okay, you can just
copy it ...)
Right. If this is important I can propose to write a tool for QEMU to
deal with this. It's probably a good idea anyway.
This series also lacks any mechanism to warn the user/management
app that a block operation didn't get 100% of the votes in the
quorum, so it's not really possible for the users to do a
rebuild/diagnostic if something fails.
I can't say much about this series because I haven't looked into the
code in detail yet, but I'm willing to help fix the existing problems,
add the missing features and improve the code (both in libvirt and
QEMU) if there are no other major blockers.
Thanks,
Berto