On Wed, Jan 20, 2016 at 16:47:56 +0200, Alberto Garcia wrote:
Hi Peter,
Hi Alberto,
I'm the current maintainer of Quorum in QEMU and I'd like to try to
answer some of your comments.
On Fri, Jan 08, 2016 at 06:20:04PM +0100, Peter Krempa wrote:
> So I have a few comments/observations regarding the quorum block
> driver in qemu and its usability.
>
> At first I'd like to ask you to describe your use case a bit
> more. I'm currently lacking the motivation to do anything about
> this, as the series is just partial and I don't really see any
> advantage of using the quorum driver at all and can't come up with
> any useful use case.
>
> Also a good use case is usually a good reason to drive development
> of a feature and I'm afraid that this could become abandoned without
> any real use.
The original use case for which Quorum was designed was a data center
doing redundancy with storage in multiple separate rooms shared using
NFS.
There are quite a few existing networked storage cluster solutions,
wouldn't that be a more reasonable option?
One of the issues that the customer was facing was not only problems
in the file servers themselves but -mainly- data corruption across
the network. Quorum can correct this on the fly and is able to
Whoah. Data corruption across the network? I'm not quite sure whether I'd
use this to cover up a problem with the storage technology or network
rather than just fix the root cause. If you have 3 copies, and manage to
have a sector where all 3 differ then the quorum driver won't help. And
it will make it even harder to find any possible problems.
identify which one of the file servers is causing the problem without
having to rebuild a whole array (like it would be the case with RAID).
Libvirt tries to stay out of doing any usage policy, so this might be
considered a feature. The series then needs polishing to add the rebuild
capability and quorum event handling so that sub-quorate failed
operations are properly reported.
I think the rebuild is actually useful in most cases, since it ensures
that all copies are the same.
Quorum is also used for the COLO block replication functionality
currently being discussed in QEMU:
http://wiki.qemu.org/Features/BlockReplication
Oh, so it actually uses the FIFO mode of quorum which I didn't know
about. So basically the quorum driver for COLO serves as a block
duplicator so that one write is sent to the "primary disk" and a second
write is sent using nbd to the arbiter rather than using a
blockdev-mirror job. Interesting approach, but the COLO stuff has not
really been considered in libvirt yet.
Btw, this series explicitly forbids using less than 2 as vote threshold.
> 1) No tracking of integrity
> As the quorum members don't have headers, failed quorum members
> are not recorded and remembered. The user or management app then
> has to do this externally for given storage devices.
>
> 2) No internal tracking of quorum members
> Members of the quorum don't have any header marking them
> as such and thus any images may be mixed together with
> unforeseen/catastrophic results. Higher level management then
> needs to take the role of remembering which images belong
> together. Reimplementing this looks like reimplementing a
> distributed storage system to me.
That's right, Quorum does not have its own file format and was
designed to work with any driver or protocol that QEMU supports, so
I'm not sure if there's much that can be done about this.
> 3) Lack of auto-resync:
> Once the quorum gets a few inconsistencies it does not
> automatically resync like the linux MD driver. With the current
> implementation the only way to resync this would be to issue a
> block-mirror (blockCopy) to /dev/null so that all blocks are
> read and rewritten to the identical copy. This also requires a
> user action.
>
> Additionally, a member of the quorum is not ignored if it was
> out of sync at any previous time without being resynced, allowing
> for split-brain/corruption scenarios.
Quorum can fix errors on the fly (there's the 'rewrite-corrupted' flag
for that), so in those cases no manual intervention is required.
If we want a way to auto-resync a complete image that should be
doable, I believe it's relatively simple to implement in QEMU
(depending on the semantics).
For the manual resync I also agree that it would be good to have a
simple API to do that in case the user wants to do it manually. That
can be done.
This would be beneficial to have if you don't have 'rewrite-corrupted'
enabled. In that case you want a way to enable it and then perhaps
initiate a full read so that every block gets checked.
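For illustration only, here is roughly what enabling it at the QEMU level
could look like. This is a sketch, not what the series does: the helper
name, node name and image paths are made up, and the exact blockdev-add
argument layout depends on the QEMU version. Only the quorum option names
('children', 'vote-threshold', 'rewrite-corrupted') are taken from the
QEMU quorum driver.

```python
import json


def quorum_blockdev_args(node_name, image_paths, vote_threshold,
                         rewrite_corrupted=True):
    """Build blockdev-add arguments for a quorum node (hypothetical helper)."""
    if not 1 <= vote_threshold <= len(image_paths):
        raise ValueError("vote-threshold must be between 1 and the number of children")
    # Each child is a raw image on a local file; any other driver or
    # protocol supported by QEMU could be used here instead.
    children = [
        {"driver": "raw", "file": {"driver": "file", "filename": path}}
        for path in image_paths
    ]
    return {
        "driver": "quorum",
        "node-name": node_name,
        "vote-threshold": vote_threshold,
        "rewrite-corrupted": rewrite_corrupted,
        "children": children,
    }


# Example paths are placeholders, not taken from any real setup.
args = quorum_blockdev_args(
    "quorum0",
    ["/nfs/room1/disk.img", "/nfs/room2/disk.img", "/nfs/room3/disk.img"],
    vote_threshold=2,
)
print(json.dumps({"execute": "blockdev-add", "arguments": args}, indent=2))
```

With 'rewrite-corrupted' set, a full read of the disk would then make
every block go through the vote and be rewritten where a copy disagrees.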
> 4) Necessity for at least 3 copies
> Since a majority needs to win in a vote, you need at least 3
> member disks for this to be fault-tolerant.
>
> 5) Lack of speedup
> Since always all blocks are read from all members and verified
> the quorum backend doesn't really add any speed to the
> reads. This can be mostly attributed to the fact that fault
> tracking is not present.
>
> In other cases, due to internal error correcting codes it's very
> unlikely that a storage medium would return a corrupted sector
> without producing an error.
4) and 5) are part of the design of Quorum, as I said one of the goals
is to detect (and correct) silent data corruption on the fly, not to
speed up disk access or to be space efficient.
I'm thinking more that it tries to cover up possible silent data
corruption. If your storage is prone to corrupt your data without
detecting it, using quorum will make the corruption less likely;
it does not fix it.
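To make the point concrete, here is a toy model of the vote. It is a
simplified sketch and not QEMU's actual implementation: the function and
exception names are invented, and each "copy" stands for the data one
member returned for the same sector.

```python
from collections import Counter


class QuorumError(Exception):
    """Raised when no value reaches the vote threshold."""


def quorum_read(copies, vote_threshold):
    """Return (winning data, indices of members that disagreed)."""
    value, votes = Counter(copies).most_common(1)[0]
    if votes < vote_threshold:
        raise QuorumError("no value reached the vote threshold")
    dissenters = [i for i, data in enumerate(copies) if data != value]
    return value, dissenters


# One bad copy out of three: the vote succeeds and points at member 2.
print(quorum_read([b"good", b"good", b"bad"], 2))

# Two members corrupted in the same way outvote the good copy, so the
# wrong data wins and the healthy member is reported as the dissenter.
print(quorum_read([b"bad", b"bad", b"good"], 2))

# All three differ (or only two members that disagree): no quorum at all.
for copies in ([b"a", b"b", b"c"], [b"a", b"b"]):
    try:
        quorum_read(copies, 2)
    except QuorumError as err:
        print(len(copies), "copies:", err)
```

The second and third cases are the ones I'm worried about: the vote only
lowers the odds of returning bad data, and with two members there is
nothing to break a tie.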
> 6) Almost every remote storage technology does quorums internally
> Any distributed storage (ceph/rbd, gluster, sheepdog, etc..)
> provides the quorum functionality internally with the added benefit
> that its internal workings fix problems when a network split
> occurs.
>
> 7) Tools are restricted to qemu and qemu-img
> It's a "proprietary" implementation so for a rebuild you have
> to use one of the two tools. AFAIK qemu-img is not really
> user friendly for the less common disk backends and we don't
> really provide any abstraction on top of that. This means
> that there really aren't any reasonable tools to do an offline
> resync. (Okay, if you know which instance is okay, you can just
> copy it ...)
Right. If this is important I can propose to write a tool for QEMU to
deal with this. It's probably a good idea anyway.
That is not that important in the end. If you can use it with
qemu-img/qemu-nbd that should be enough basically for the same use cases
as somebody would use QCOW images for.
> This series also lacks implementation of any user/management
> warning method that a block operation didn't have 100% votes in the
> quorum voting thus it's not really possible for the users to do a
> rebuild/diagnostic if something fails.
I can't say much about this series because I haven't looked into the
code in detail yet, but I'm willing to help fix the existing problems,
add the missing features and improve the code (both in libvirt and
QEMU) if there are no other major blockers.
There are two things which make me skeptical about quorums and libvirt:
1) Apart from abusing quorums in fifo mode for COLO I still don't think
that they are hugely useful. (no, data corruption on NFS didn't persuade
me)
2) The implementation in this series in its current state adds a lot of
code to maintain that wouldn't be used much and is incomplete in many
aspects:
* no support for setting the FIFO or any other possible mode
* no support for the quorum failure events and reporting
* no way to control 'rewrite-corrupted'
* since we don't use node-names yet, it's not really possible to do
block jobs on quorum disks, thus they are forbidden
* since block jobs are forbidden and rewrite-corrupted can't be
enabled, no way to do the rebuild
Peter