[libvirt] Quorum block driver libvirt support proposal

Hello list,

I want to implement libvirt Quorum support.
(https://github.com/qemu/qemu/commit/c88a1de51ab2f26a9a37ffc317249736de8c015c)

Quorum is a QEMU RAID-like block storage driver. Data is written to n replicas, and on each read the results from the replicas are compared. If at least threshold reads are identical, the read succeeds; otherwise it is an error.

For example, a Quorum with n = 3 and threshold = 2 would be made of three QCOW2 backing chains used as identical replicas. threshold = 2 means that at least 2 replicas must be identical when doing a read.

I want to make use of the new backingStore XML element to implement quorum.

Proposed Quorum libvirt format:
-------------------------------

<disk type='quorum' device='disk'>
  <driver name='qemu' type='quorum'/>
  <threshold value='2'/>
  <backingStore type='file'>
    <format type='qcow2'/>
    <source file='/var/lib/libvirt/images/file1.qcow2'/>
  </backingStore>
  <backingStore type='file'>
    <format type='qcow2'/>
    <source file='/var/lib/libvirt/images/file2.qcow2'/>
  </backingStore>
  <backingStore type='file'>
    <format type='qcow2'/>
    <source file='/var/lib/libvirt/images/file3.qcow2'/>
  </backingStore>
  <target dev='vda' bus='virtio'/>
</disk>

Implementation plan:
--------------------

* Add VIR_STORAGE_TYPE_QUORUM
* In src/util/virstoragefile.h, change _virStorageSource to contain a virStorageSourcePtrPtr backingStores. I think doing it at this level allows keeping a 1-1 mapping with the QEMU BlockDriverState hierarchy
* Add an int quorum_threshold field to the same structure
* Add support for parsing threshold in virDomainDiskDefParseXML
* Change virDomainDiskBackingStoreParse to virDomainDiskBackingStoresParse to parse all the backingStore elements at once and use realloc to grow the backingStores field
* Modify virDomainDiskDefFormat to call virDomainDiskBackingStoreFormat in a loop for saving
* Hook into qemuBuildDriveStr around line 3442 to create the quorum parameters

Do you feel that I am missing something?

Best regards

Benoît
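The read rule described in this message can be sketched as a small vote over the replica results. This is only an illustration, assuming "at least threshold identical reads" semantics as in the n = 3, threshold = 2 example; `quorum_read` is a hypothetical name, not a QEMU or libvirt function.

```python
from collections import Counter


def quorum_read(replica_reads, threshold):
    """Return the winning data if at least `threshold` replicas agree.

    `replica_reads` holds the bytes each replica returned for the same
    request. Raises IOError when no value reaches the threshold.
    """
    if not replica_reads:
        raise ValueError("a quorum needs at least one replica")
    # Count identical results and pick the most common one.
    winner, votes = Counter(replica_reads).most_common(1)[0]
    if votes < threshold:
        raise IOError(f"quorum failed: best vote {votes} < threshold {threshold}")
    return winner


# n = 3, threshold = 2: a single diverging replica is outvoted.
print(quorum_read([b"data", b"data", b"junk"], threshold=2))
```

With three replicas that all disagree, the same call raises IOError, which corresponds to the failed read described above.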

On Fri, May 16, 2014 at 12:33:04PM +0200, Benoît Canet wrote:
Proposed Quorum libvirt format:
-------------------------------

<disk type='quorum' device='disk'>
  <driver name='qemu' type='quorum'/>
  <threshold value='2'/>
  <backingStore type='file'>
    <format type='qcow2'/>
    <source file='/var/lib/libvirt/images/file1.qcow2'/>
  </backingStore>
  <backingStore type='file'>
    <format type='qcow2'/>
    <source file='/var/lib/libvirt/images/file2.qcow2'/>
  </backingStore>
  <backingStore type='file'>
    <format type='qcow2'/>
    <source file='/var/lib/libvirt/images/file3.qcow2'/>
  </backingStore>
  <target dev='vda' bus='virtio'/>
</disk>
It feels rather odd to have <backingStore> elements but no top-level disk images. Really, these are all top-level images.

Regards,
Daniel
-- 
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On Friday 16 May 2014 at 09:54:43 (-0400), Daniel P. Berrange wrote:
It feels rather odd to have <backingStore> elements but no top level disk images. Really these are all top level images
It reflects the way QEMU does it. A single BlockDriverState holding n quorum BlockDriverState children. There is a 1-1 mapping.

How would you see it?

Best regards

Benoît

On 05/16/14 16:05, Benoît Canet wrote:
It feels rather odd to have <backingStore> elements but no top level disk images. Really these are all top level images
It reflects the way QEMU does it. A single BlockDriverState holding n quorum BlockDriverState children. There is a 1-1 mapping.
How would you see it ?
We'd rather see multiple source elements for the top-level disk. Backing store is a property of the source image, thus every single one of those sources should have its own list (or perhaps a tree?).
Best regards
Benoît
Peter

On 05/16/2014 08:07 AM, Peter Krempa wrote:
We'd rather see multiple source elements for the top-level disk. Backing store is a property of the source image, thus every single one of those sources should have its own list (or perhaps a tree?).
I don't see how you can possibly have multiple source elements. Remember, part of the determination of what forms a valid <source> element is the type='...' attribute tied to the <disk> parent element - but you can't have duplicate attributes. As I see it, a quorum HAS to be a special chain element with 0 sources and multiple backingStore children, where each backingStore then includes the type='...' attribute for how to interpret the <source> element of that child.

-- 
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org

On Friday 16 May 2014 at 08:20:22 (-0600), Eric Blake wrote:
I don't see how you can possibly have multiple source elements. Remember, part of the determination of what forms a valid <source> element is the type='...' attribute tied to the <disk> parent element - but you can't have duplicate attributes. As I see it, a quorum HAS to be a special chain element with 0 sources and multiple backingStore children, where each backingStore then includes the type='...' attribute for how to interpret the <source> element of that child.
Additionally, quorum supports taking snapshots, so we need one entity to bind them together.

Best regards

Benoît

On 05/16/2014 07:54 AM, Daniel P. Berrange wrote:
It feels rather odd to have <backingStore> elements but no top level disk images. Really these are all top level images
Unfortunately, we are allowed to have a quorum with mixed-mode sources - I could have a quorum where file 1 is a local file, file 2 is a block device, and file 3 is a gluster protocol. But since we encode the type of file at the <disk type='...'> level, there is NO way to list three different <source> elements for those three quorum members.

I think Benoît's proposal makes sense - a quorum is a node in the backing chain with NO <source> element, but instead has MULTIPLE <backingStore> elements.

<disk type='quorum' device='disk'>
  <driver name='qemu' type='quorum'/>
  <threshold value='2'/>
  <backingStore type='file'>
    <format type='qcow2'/>
    <source file='/var/lib/libvirt/images/file1.qcow2'/>
  </backingStore>
  <backingStore type='block'>
    <format type='qcow2'/>
    <source dev='/var/lib/libvirt/images/file2.qcow2'/>
  </backingStore>
  <backingStore type='network'>
    <format type='qcow2'/>
    <source protocol='gluster' name='Volume1/Image'>
      <host name='example.org'/>
    </source>
  </backingStore>
  <target dev='vda' bus='virtio'/>
</disk>

-- 
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org
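To make the mixed-source layout concrete, here is a minimal Python sketch that walks the XML from this message and reads the per-child type='...' attribute. It is not libvirt's parser; `parse_quorum_disk` is a hypothetical helper, and the range check on the threshold is my own sanity check rather than something stated in the thread.

```python
import xml.etree.ElementTree as ET

DISK_XML = """
<disk type='quorum' device='disk'>
  <driver name='qemu' type='quorum'/>
  <threshold value='2'/>
  <backingStore type='file'>
    <format type='qcow2'/>
    <source file='/var/lib/libvirt/images/file1.qcow2'/>
  </backingStore>
  <backingStore type='block'>
    <format type='qcow2'/>
    <source dev='/var/lib/libvirt/images/file2.qcow2'/>
  </backingStore>
  <backingStore type='network'>
    <format type='qcow2'/>
    <source protocol='gluster' name='Volume1/Image'>
      <host name='example.org'/>
    </source>
  </backingStore>
  <target dev='vda' bus='virtio'/>
</disk>
"""


def parse_quorum_disk(xml_text):
    """Return (threshold, [(type, format, source attributes), ...])."""
    disk = ET.fromstring(xml_text.strip())
    if disk.get("type") != "quorum":
        raise ValueError("not a quorum disk")
    threshold = int(disk.find("threshold").get("value"))
    children = []
    for store in disk.findall("backingStore"):
        # Each child carries its own type, so mixed sources are possible.
        children.append((
            store.get("type"),
            store.find("format").get("type"),
            dict(store.find("source").attrib),
        ))
    # Assumed sanity check: the threshold cannot exceed the child count.
    if not 1 <= threshold <= len(children):
        raise ValueError("threshold must be between 1 and the child count")
    return threshold, children


threshold, children = parse_quorum_disk(DISK_XML)
for kind, fmt, source in children:
    print(kind, fmt, source)
```

Because the type lives on each <backingStore>, the loop can interpret each <source> independently, which is the point of the argument above.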

On Fri, May 16, 2014 at 08:18:36AM -0600, Eric Blake wrote:
Unfortunately, we are allowed to have a quorum with mixed-mode sources - I could have a quorum where file 1 is a local file, file 2 is a block device, and file 3 is a gluster protocol. But since we encode the type of file at the <disk type='...'> level, there is NO way to list three different <source> elements for those three quorum members. I think Benoit's proposal makes sense - a quorum is a node in the backing chain with NO <source> element, but instead has MULTIPLE <backingStore> elements.
Ok, I reluctantly agree.

Regards,
Daniel

On 05/16/2014 04:33 AM, Benoît Canet wrote:
I want to make use of the new backingStore xml element to implement quorum.
Proposed Quorum libvirt format: -------------------------------
<disk type='quorum' device='disk'>
  <driver name='qemu' type='quorum'/>
  <threshold value='2'/>

Rather than making <threshold> a sub-element, I'd stick it as an attribute, as in:

<disk type='quorum' threshold='2' device='disk'>
* Add VIR_STORAGE_TYPE_QUORUM
* In src/util/virstoragefile.h change _virStorageSource to contain a virStorageSourcePtrPtr backingStores.
PtrPtr doesn't make sense. Just keep it as a single pointer, but add an nBackingStores field and treat it as an array (all existing callers are now an array of 1, quorum is a new array of N).
I think doing it at this level allows keeping a 1-1 mapping with the QEMU BlockDriverState hierarchy
* Add an int quorum_threshold field to the same structure
size_t, not int
* Add support for parsing threshold in virDomainDiskDefParseXML
* Change virDomainDiskBackingStoreParse to virDomainDiskBackingStoresParse to parse all the backingStore elements at once and use realloc to grow the backingStores field.
* Modify virDomainDiskDefFormat to call virDomainDiskBackingStoreFormat in a loop for saving
* hook into qemuBuildDriveStr around line 3442 to create the quorum parameters
Do you feel that I am missing something ?
We'll probably find more as you go, but this sounds like a reasonable start.

-- 
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org
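As an illustration of the last plan item (creating the quorum parameters in qemuBuildDriveStr), the following Python sketch assembles a -drive option string from an array of children plus a threshold. The option names `driver=quorum`, `vote-threshold` and `children.N.*` reflect my understanding of QEMU's quorum driver and are not stated in this thread, so please check them against the QEMU documentation; `build_quorum_drive` is a hypothetical name.

```python
def build_quorum_drive(children, threshold):
    """Build a -drive argument for a quorum of (format, filename) children.

    Option names are assumptions about QEMU's quorum driver, not taken
    from the thread.
    """
    if not 1 <= threshold <= len(children):
        raise ValueError("threshold must be between 1 and the child count")
    opts = ["if=virtio", "driver=quorum", f"vote-threshold={threshold}"]
    for i, (fmt, filename) in enumerate(children):
        # QEMU option strings escape a literal comma by doubling it.
        escaped = filename.replace(",", ",,")
        opts.append(f"children.{i}.driver={fmt}")
        opts.append(f"children.{i}.file.filename={escaped}")
    return ",".join(opts)


drive = build_quorum_drive(
    [
        ("qcow2", "/var/lib/libvirt/images/file1.qcow2"),
        ("qcow2", "/var/lib/libvirt/images/file2.qcow2"),
        ("qcow2", "/var/lib/libvirt/images/file3.qcow2"),
    ],
    threshold=2,
)
print(drive)
```

The children are passed as an ordered array with a separate threshold, mirroring the "single pointer plus nBackingStores" layout suggested in the review above.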
participants (4): Benoît Canet, Daniel P. Berrange, Eric Blake, Peter Krempa