On Tue, Jan 26, 2016 at 05:36:36PM +0100, Peter Krempa wrote:
Hi Alberto,
Hi and sorry for the late reply,
> I'm the current maintainer of Quorum in QEMU and I'd like to try
> to answer some of your comments.
>
> > So I have a few comments/observations regarding the quorum block
> > driver in qemu and its usability.
> >
> > First, I'd like to ask you to describe your use case a bit
> > more. I'm currently lacking the motivation to do anything about
> > this, as the series is just partial and I don't really see any
> > advantage of using the quorum driver at all and can't come up
> > with any useful use case.
> >
> > Also a good use case is usually a good reason to drive
> > development of a feature and I'm afraid that this could become
> > abandoned without any real use.
>
> The original use case for which Quorum was designed was a data
> center doing redundancy with storage in multiple separate rooms
> shared using NFS.
There are quite a few existing networked storage cluster solutions,
wouldn't that be a more reasonable option?
I don't know all the details of the setup, but as far as I'm aware
the goal is to have redundancy and high availability and at the same
time keep each one of the servers independent from each other in case
the others crash. It's also easier to set up. I can try to get more
details if you want.
> One of the issues that the customer was facing was not only
> problems in the file servers themselves but -mainly- data
> corruption across the network. Quorum can correct this on the fly
> and is able to
Whoa. Data corruption across the network? I'm not quite sure whether
I'd use this to cover up a problem with the storage technology or
network rather than just fix the root cause. If you have 3 copies,
and manage to have a sector where all 3 differ then the quorum
driver won't help. And it will make it even harder to find any
possible problems.
But in that case you detect that it went wrong and you get an I/O
error. The problem with silent data corruption is that it can be hard
to detect.
If there's a bit-flip across the network Quorum can detect it,
report it and correct the faulty version without needing to rebuild
everything.
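
To make the two cases above concrete, here is a minimal sketch of a
per-sector vote in Python. It is only an illustration of the idea, not
QEMU's actual implementation; the function name `vote` and the sample
data are made up for the example.

```python
from collections import Counter


def vote(copies, threshold):
    """Pick the version of a sector that at least `threshold` copies agree on.

    Returns (winning_data, indexes_of_disagreeing_copies).
    Raises IOError if no version reaches the threshold (hypothetical
    stand-in for the I/O error the guest would see).
    """
    counts = Counter(copies)
    value, votes = counts.most_common(1)[0]
    if votes < threshold:
        raise IOError("quorum failure: no version reached the vote threshold")
    outliers = [i for i, data in enumerate(copies) if data != value]
    return value, outliers


good = b"\x00" * 8
flipped = b"\x00" * 7 + b"\x01"   # one bit flipped in transit

# Two of three copies agree: the read succeeds and copy 1 is identified
# as the faulty one, so only that copy needs to be rewritten.
winner, bad = vote([good, flipped, good], threshold=2)
print(winner == good, bad)

# All three copies differ: the vote fails and the error is reported
# instead of silently returning bad data.
try:
    vote([b"a", b"b", b"c"], threshold=2)
except IOError as err:
    print(err)
```

The list of outliers is what allows reporting which server returned the
bad copy.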
> identify which one of the file servers is causing the problem
> without having to rebuild a whole array (like it would be the case
> with RAID).
Libvirt tries to stay out of doing any usage policy, so this might
be considered a feature. The series then needs polishing to add the
rebuild capability and quorum event handling so that sub-quorate
failed operations are properly reported.
I think the rebuild is actually useful in most cases, since it
ensures that all copies are the same.
I think the ability to rebuild is in general a good feature, I will
look into it and see how to add it to QEMU first.
> Quorum is also used for the COLO block replication functionality
> currently being discussed in QEMU:
>
> http://wiki.qemu.org/Features/BlockReplication
Oh, so it actually uses the FIFO mode of quorum which I didn't know
about. So basically the quorum driver for COLO serves as a block
duplicator so that one write is sent to the "primary disk" and a
second write is sent using nbd to the arbiter rather than using a
blockdev-mirror job. Interesting approach, but the COLO stuff has not
really been considered in libvirt yet.
Btw, this series explicitly forbids using less than 2 as vote
threshold.
Yes, I just wanted to point out one other example of how Quorum is
being used. This current series of Quorum for libvirt is not taking
COLO into account at all, in fact it is still under review in QEMU.
> Quorum can fix errors on the fly (there's the 'rewrite-corrupted'
> flag for that), so in those cases no manual intervention is
> required.
>
> If we want a way to auto-resync a complete image that should be
> doable, I believe it's relatively simple to implement in QEMU
> (depending on the semantics).
>
> For the manual resync I also agree that it would be good to
> have a simple API to do that in case the user wants to do it
> manually. That can be done.
This would be beneficial to have if you don't have
'rewrite-corrupted' enabled. In that case you want a way to enable
it and then perhaps initiate a full read so that every block gets
checked.
Exactly. As I said earlier I think it's a good idea.
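
For reference, this is roughly what the options for a quorum node with
'rewrite-corrupted' enabled look like, built here as a Python
dictionary. The option names ("vote-threshold", "rewrite-corrupted",
"children") are the ones the QEMU quorum driver uses; the helper
`quorum_options` and the file paths are made up for the example, and how
the options are wrapped into a blockdev-add command depends on the QEMU
version, so that part is left out.

```python
import json


def quorum_options(filenames, vote_threshold, rewrite_corrupted=True):
    """Build the options for a quorum node over raw image files.

    `filenames` are hypothetical paths; each one becomes a child node.
    """
    children = [
        {"driver": "raw", "file": {"driver": "file", "filename": name}}
        for name in filenames
    ]
    return {
        "driver": "quorum",
        "vote-threshold": vote_threshold,
        "rewrite-corrupted": rewrite_corrupted,
        "children": children,
    }


opts = quorum_options(
    ["/mnt/nfs1/disk.img", "/mnt/nfs2/disk.img", "/mnt/nfs3/disk.img"],
    vote_threshold=2,
)
print(json.dumps(opts, indent=2))
```

With this enabled, a full read of the device would make every block go
through the vote, so any copy that loses is rewritten along the way.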
> > 4) Necessity for at least 3 copies
> > Since a majority needs to win in a vote, you need at least 3
> > member disks for this to be fault-tolerant.
> >
> > 5) Lack of speedup
> > Since always all blocks are read from all members and verified
> > the quorum backend doesn't really add any speed to the
> > reads. This can be mostly attributed to the fact that fault
> > tracking is not present.
> >
> > In other cases, due to internal error correcting codes it's very
> > unlikely that a storage medium would return a corrupted sector
> > without producing an error.
>
> 4) and 5) are part of the design of Quorum, as I said, one of the
> goals is to detect (and correct) silent data corruption on the
> fly, not to speed up disk access or to be space efficient.
I'm thinking more that it tries to cover up possible silent data
corruption. If your storage is prone to corrupt your data without
detecting it, using quorum will make the corruption less likely; it
does not fix it.
Well, if something is wrong with the hardware you cannot fix that,
but you can report where the error happened and still keep the system
running.
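
The point in 4) about needing at least three copies follows from simple
arithmetic, sketched below in Python. This is a simplification that
assumes a strict-majority vote threshold and that faulty copies do not
agree with each other; the helper names are made up for the example.

```python
def majority_threshold(n):
    """Smallest number of agreeing copies that forms a strict majority."""
    return n // 2 + 1


def tolerated_faults(n, threshold):
    """How many copies can be wrong while `threshold` copies still agree."""
    return n - threshold


for n in (2, 3, 5):
    t = majority_threshold(n)
    print(f"{n} copies: threshold {t}, tolerates {tolerated_faults(n, t)} bad")
```

With two copies a mismatch can be detected but not resolved, which is
why three is the minimum for on-the-fly correction.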
> I can't say much about this series because I haven't looked into
> the code in detail yet, but I'm willing to help fix the existing
> problems, add the missing features and improve the code (both in
> libvirt and QEMU) if there are no other major blockers.
There are two things which make me skeptical about quorums and
libvirt:
1) Apart from abusing quorums in FIFO mode for COLO I still don't
think that they are hugely useful. (no, data corruption on NFS
didn't persuade me)
It is one of the main reasons why Quorum was written. Here's one more
example of silent data corruption over the network:
https://cds.cern.ch/record/2026187/files/Adler32_Data_Corruption.pdf
2) The implementation in this series, in its current state, adds a lot
of code to maintain that wouldn't be used much and is incomplete in
many aspects:
As I said I didn't look deeply into this series yet, I just tested
it and had an overview of the code, but I'm willing to help make it
better if all else is fine.
* no support for setting the FIFO or any other possible mode
I would say FIFO is not strictly necessary (not for this use case at
least). It can be added later if needed, but it's probably easy enough
to add it now.
* no support for the quorum failure events and reporting
* no way to control 'rewrite-corrupted'
I can look into these.
* since we don't use node-names yet, it's not really possible to do
  block jobs on quorum disks, thus they are forbidden
I'm not sure what the status of node names in libvirt is; I could also
try to help make it happen.
* since block jobs are forbidden and rewrite-corrupted can't be
  enabled, no way to do the rebuild
'rewrite-corrupted' can be easily added to the series so I don't
think that's a problem. The block jobs thing I would need to see
first. Would you really need to have node names in order to rebuild a
Quorum?
Regards,
Berto