[libvirt-users] libvirt with sanlock

Hello,

I configured libvirtd with the sanlock lock manager plugin:

# rpm -qa | egrep "libvirt-0|sanlock-[01]"
libvirt-lock-sanlock-0.9.4-23.el6_2.4.x86_64
sanlock-1.8-2.el6.x86_64
libvirt-0.9.4-23.el6_2.4.x86_64

# egrep -v "^#|^$" /etc/libvirt/qemu-sanlock.conf
auto_disk_leases = 1
disk_lease_dir = "/var/lib/libvirt/sanlock"
host_id = 4

# mount | grep sanlock
/dev/mapper/kvm--shared-sanlock on /var/lib/libvirt/sanlock type gfs2 (rw,noatime,hostdata=jid=0)

# cat /etc/sysconfig/sanlock
SANLOCKOPTS="-R 1 -o 30"

I increased the sanlock io_timeout to 30 seconds (default = 10) because the sanlock directory is on a GFS2 volume, which can be blocked for some time while fencing and journal recovery take place. With the default sanlock io timeout I get lease timeouts because I/O is blocked:

Mar 5 15:37:14 raiti sanlock[5858]: 3318 s1 check_our_lease warning 79 last_success 3239
Mar 5 15:37:15 raiti sanlock[5858]: 3319 s1 check_our_lease failed 80

So far, so good. However, when I restart sanlock and libvirtd, it takes about 2 * 30 seconds = 1 minute before libvirtd is usable, and "virsh list" hangs during this time. I can still live with that... But it gets worse after a reboot, when the first "virsh list" takes a couple of minutes (around 5) before it responds. After this initial period, virsh responds normally, so it looks like an initialization issue to me.

Is this a configuration issue, a bug, or expected behavior?

Thanks for any advice...

Frido
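P.S. For completeness, the sanlock plugin itself is selected via the lock_manager setting in /etc/libvirt/qemu.conf; roughly this (a sketch of the relevant line only, not a full dump of the file):

# grep ^lock_manager /etc/libvirt/qemu.conf
lock_manager = "sanlock"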

On Tue, Mar 13, 2012 at 03:42:36PM +0100, Frido Roose wrote:
Hello,
I configured libvirtd with the sanlock lock manager plugin:
# rpm -qa | egrep "libvirt-0|sanlock-[01]"
libvirt-lock-sanlock-0.9.4-23.el6_2.4.x86_64
sanlock-1.8-2.el6.x86_64
libvirt-0.9.4-23.el6_2.4.x86_64
# egrep -v "^#|^$" /etc/libvirt/qemu-sanlock.conf
auto_disk_leases = 1
disk_lease_dir = "/var/lib/libvirt/sanlock"
host_id = 4
# mount | grep sanlock
/dev/mapper/kvm--shared-sanlock on /var/lib/libvirt/sanlock type gfs2 (rw,noatime,hostdata=jid=0)
# cat /etc/sysconfig/sanlock
SANLOCKOPTS="-R 1 -o 30"
I increased the sanlock io_timeout to 30 seconds (default = 10) because the sanlock directory is on a GFS2 volume, which can be blocked for some time while fencing and journal recovery take place. With the default sanlock io timeout I get lease timeouts because I/O is blocked:
Mar 5 15:37:14 raiti sanlock[5858]: 3318 s1 check_our_lease warning 79 last_success 3239
Mar 5 15:37:15 raiti sanlock[5858]: 3319 s1 check_our_lease failed 80
So far, so good. However, when I restart sanlock and libvirtd, it takes about 2 * 30 seconds = 1 minute before libvirtd is usable, and "virsh list" hangs during this time. I can still live with that... But it gets worse after a reboot, when the first "virsh list" takes a couple of minutes (around 5) before it responds. After this initial period, virsh responds normally, so it looks like an initialization issue to me.
Is this a configuration issue, a bug, or expected behavior?
Each libvirtd instance has a lease that it owns. When restarting libvirtd, it tries to acquire this lease. I don't really understand why, but sanlock sometimes has to wait a very long time between starting & completing its lease acquisition. You'll probably have to ask the sanlock developers for an explanation of why.

Daniel

--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On 03/13/2012 10:42 PM, Frido Roose wrote:
Hello,
I configured libvirtd with the sanlock lock manager plugin:
# rpm -qa | egrep "libvirt-0|sanlock-[01]"
libvirt-lock-sanlock-0.9.4-23.el6_2.4.x86_64
sanlock-1.8-2.el6.x86_64
libvirt-0.9.4-23.el6_2.4.x86_64
# egrep -v "^#|^$" /etc/libvirt/qemu-sanlock.conf
auto_disk_leases = 1
disk_lease_dir = "/var/lib/libvirt/sanlock"
host_id = 4
# mount | grep sanlock
/dev/mapper/kvm--shared-sanlock on /var/lib/libvirt/sanlock type gfs2 (rw,noatime,hostdata=jid=0)
# cat /etc/sysconfig/sanlock
SANLOCKOPTS="-R 1 -o 30"
I increased the sanlock io_timeout to 30 seconds (default = 10) because the sanlock directory is on a GFS2 volume, which can be blocked for some time while fencing and journal recovery take place. With the default sanlock io timeout I get lease timeouts because I/O is blocked:
Mar 5 15:37:14 raiti sanlock[5858]: 3318 s1 check_our_lease warning 79 last_success 3239
Mar 5 15:37:15 raiti sanlock[5858]: 3319 s1 check_our_lease failed 80
So far, so good. However, when I restart sanlock and libvirtd, it takes about 2 * 30 seconds = 1 minute before libvirtd is usable, and "virsh list" hangs during this time. I can still live with that... But it gets worse after a reboot, when the first "virsh list" takes a couple of minutes (around 5) before it responds. After this initial period, virsh responds normally, so it looks like an initialization issue to me.
Is this a configuration issue, a bug, or expected behavior?
Hi Frido,
I'm not sure whether you hit a sanlock AVC error in your /var/log/audit/audit.log; could you check it and provide your selinux-policy version? In addition, you should turn on the SELinux boolean for sanlock, for example:
# getsebool -a | grep sanlock
virt_use_sanlock --> off
# setsebool -P virt_use_sanlock on
# getsebool -a | grep sanlock
virt_use_sanlock --> on
Also, could you provide the libvirt log as an attachment? Please use the following configuration:
1. /etc/libvirt/libvirtd.conf
log_filters="1:libvirt 1:conf 1:locking"
log_outputs="1:file:/var/log/libvirt/libvirtd.log"
2. service libvirtd restart
3. repeat your test steps
Good luck!
Alex
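P.S. A quick way to check both (a sketch, assuming auditd is running so AVC denials end up in the audit log):

# ausearch -m avc -ts recent | grep -i sanlock
# rpm -q selinux-policy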
Thanks for any advice... Frido
_______________________________________________ libvirt-users mailing list libvirt-users@redhat.com https://www.redhat.com/mailman/listinfo/libvirt-users

On Wed, Mar 14, 2012 at 8:32 AM, Alex Jia <ajia@redhat.com> wrote:
On 03/13/2012 10:42 PM, Frido Roose wrote:
Hello,
I configured libvirtd with the sanlock lock manager plugin:
# rpm -qa | egrep "libvirt-0|sanlock-[01]"
libvirt-lock-sanlock-0.9.4-23.el6_2.4.x86_64
sanlock-1.8-2.el6.x86_64
libvirt-0.9.4-23.el6_2.4.x86_64
# egrep -v "^#|^$" /etc/libvirt/qemu-sanlock.conf
auto_disk_leases = 1
disk_lease_dir = "/var/lib/libvirt/sanlock"
host_id = 4
# mount | grep sanlock
/dev/mapper/kvm--shared-sanlock on /var/lib/libvirt/sanlock type gfs2 (rw,noatime,hostdata=jid=0)
# cat /etc/sysconfig/sanlock
SANLOCKOPTS="-R 1 -o 30"
I increased the sanlock io_timeout to 30 seconds (default = 10) because the sanlock directory is on a GFS2 volume, which can be blocked for some time while fencing and journal recovery take place. With the default sanlock io timeout I get lease timeouts because I/O is blocked:
Mar 5 15:37:14 raiti sanlock[5858]: 3318 s1 check_our_lease warning 79 last_success 3239
Mar 5 15:37:15 raiti sanlock[5858]: 3319 s1 check_our_lease failed 80
So far, so good. However, when I restart sanlock and libvirtd, it takes about 2 * 30 seconds = 1 minute before libvirtd is usable, and "virsh list" hangs during this time. I can still live with that... But it gets worse after a reboot, when the first "virsh list" takes a couple of minutes (around 5) before it responds. After this initial period, virsh responds normally, so it looks like an initialization issue to me.
Is this a configuration issue, a bug, or expected behavior?
Hi Frido,
I'm not sure whether you hit a sanlock AVC error in your /var/log/audit/audit.log; could you check it and provide your selinux-policy version? In addition, you should turn on the SELinux boolean for sanlock, for example:
# getsebool -a | grep sanlock
virt_use_sanlock --> off
# setsebool -P virt_use_sanlock on
# getsebool -a | grep sanlock
virt_use_sanlock --> on
Hello Alex,
Thanks for your suggestions! I don't have any AVC errors in audit.log, but I also disabled selinux on the nodes for now.
Also, could you provide the libvirt log as an attachment? Please use the following configuration:
1. /etc/libvirt/libvirtd.conf
log_filters="1:libvirt 1:conf 1:locking"
log_outputs="1:file:/var/log/libvirt/libvirtd.log"
2. service libvirtd restart
3. repeat your test steps
I enabled the extra debug logging, which you can find as an attachment. Instead of restarting libvirtd, I did an "echo b > /proc/sysrq-trigger" to force an unclean reboot. The result was that it took 300s to register the lockspace:

09:59:28.919: 3457: debug : virLockManagerSanlockInit:267 : version=1000000 configFile=/etc/libvirt/qemu-sanlock.conf flags=0
10:05:29.539: 3457: debug : virLockManagerSanlockSetupLockspace:247 : Lockspace /var/lib/libvirt/sanlock/__LIBVIRT__DISKS__ has been registered

I also had a little discussion about this with David Teigland on the sanlock dev list. He gave me some more details about how the delays and timeouts work, and it explains why it takes 300s. I'll quote his reply:

David: "Yes, all the timeouts are derived from the io_timeout and are dictated by the recovery requirements and the algorithm the host_id leases are based on: "Light-Weight Leases for Storage-Centric Coordination" by Gregory Chockler and Dahlia Malkhi. Here are the actual equations copied from sanlock_internal.h. "delta" refers to host_id leases that take a long time to acquire at startup; "free" corresponds to starting up after a clean shutdown; "held" corresponds to starting up after an unclean shutdown. You should find that with 30 sec io timeout these come out to 1 min / 4 min which you see when starting after a clean / unclean shutdown."

Since I configured an io_timeout of 30s in sanlock (SANLOCKOPTS="-R 1 -o 30"), the delay at sanlock startup after an unclean shutdown is defined by the delta_acquire_held_min variable, which is calculated as:

int max = host_dead_seconds;
if (delta_large_delay > max)
    max = delta_large_delay;
int delta_acquire_held_min = max;

So max is host_dead_seconds, which is calculated as:

int host_dead_seconds = id_renewal_fail_seconds + WATCHDOG_FIRE_TIMEOUT;

And id_renewal_fail_seconds is:

int id_renewal_fail_seconds = 8 * io_timeout_seconds;

WATCHDOG_FIRE_TIMEOUT = 60

So that makes 8 * 30 + 60, or a total of 300s before the lock can be acquired. And that's exactly the time shown in libvirtd.log before the lockspace is registered.

When a proper reboot is done, or when sanlock/libvirtd is just restarted, delta_acquire_free_min defines the delay:

int delta_short_delay = 2 * io_timeout_seconds;
int delta_acquire_free_min = delta_short_delay;

This is confirmed by the 60s delay in the libvirtd.log file:

10:33:54.097: 7983: debug : virLockManagerSanlockInit:267 : version=1000000 configFile=/etc/libvirt/qemu-sanlock.conf flags=0
10:34:55.111: 7983: debug : virLockManagerSanlockSetupLockspace:247 : Lockspace /var/lib/libvirt/sanlock/__LIBVIRT__DISKS__ has been registered

So the whole delay is caused by the io_timeout, which is set to 30s because the lockspace is on GFS2 and the GFS2 volume can be blocked for some time while a node gets fenced and journal recovery takes place. Depending on clean/unclean restarts, the delay may differ.

So delta_acquire_held_min is based on host_dead_seconds because sanlock wants to be sure that the lock won't be reacquired before the host is actually dead. David told me that sanlock is supposed to use a block device. I wonder if this makes sense when used on top of GFS2, which already has its own locking and safety mechanisms... It looks to me like it just adds delay for no reason in this case. Perhaps delta_acquire_held_min should be the same as delta_acquire_free_min, because we don't need this safety delay on top of GFS2.

For NFS, it may be a whole other story...

Best regards,
Frido
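P.S. For what it's worth, the lockspace state can also be watched from the sanlock side while libvirtd is waiting (a rough sketch; the path is the auto disk lease directory from my config, so adjust if yours differs):

# sanlock client status
# sanlock direct dump /var/lib/libvirt/sanlock/__LIBVIRT__DISKS__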

On 03/14/2012 05:39 PM, Frido Roose wrote:
On Wed, Mar 14, 2012 at 8:32 AM, Alex Jia <ajia@redhat.com> wrote:
On 03/13/2012 10:42 PM, Frido Roose wrote:
Hello,
I configured libvirtd with the sanlock lock manager plugin:
# rpm -qa | egrep "libvirt-0|sanlock-[01]"
libvirt-lock-sanlock-0.9.4-23.el6_2.4.x86_64
sanlock-1.8-2.el6.x86_64
libvirt-0.9.4-23.el6_2.4.x86_64
# egrep -v "^#|^$" /etc/libvirt/qemu-sanlock.conf
auto_disk_leases = 1
disk_lease_dir = "/var/lib/libvirt/sanlock"
host_id = 4
# mount | grep sanlock
/dev/mapper/kvm--shared-sanlock on /var/lib/libvirt/sanlock type gfs2 (rw,noatime,hostdata=jid=0)
# cat /etc/sysconfig/sanlock
SANLOCKOPTS="-R 1 -o 30"
I increased the sanlock io_timeout to 30 seconds (default = 10) because the sanlock directory is on a GFS2 volume, which can be blocked for some time while fencing and journal recovery take place. With the default sanlock io timeout I get lease timeouts because I/O is blocked:
Mar 5 15:37:14 raiti sanlock[5858]: 3318 s1 check_our_lease warning 79 last_success 3239
Mar 5 15:37:15 raiti sanlock[5858]: 3319 s1 check_our_lease failed 80
So far, so good. However, when I restart sanlock and libvirtd, it takes about 2 * 30 seconds = 1 minute before libvirtd is usable, and "virsh list" hangs during this time. I can still live with that... But it gets worse after a reboot, when the first "virsh list" takes a couple of minutes (around 5) before it responds. After this initial period, virsh responds normally, so it looks like an initialization issue to me.
Is this a configuration issue, a bug, or expected behavior?
Hi Frido,
I'm not sure whether you hit a sanlock AVC error in your /var/log/audit/audit.log; could you check it and provide your selinux-policy version? In addition, you should turn on the SELinux boolean for sanlock, for example:
# getsebool -a | grep sanlock
virt_use_sanlock --> off
# setsebool -P virt_use_sanlock on
# getsebool -a | grep sanlock
virt_use_sanlock --> on
Hello Alex,
Thanks for your suggestions! I don't have any AVC errors in audit.log, but I also disabled selinux on the nodes for now.
Also, could you provide the libvirt log as an attachment? Please use the following configuration:
1. /etc/libvirt/libvirtd.conf
log_filters="1:libvirt 1:conf 1:locking"
log_outputs="1:file:/var/log/libvirt/libvirtd.log"
2. service libvirtd restart
3. repeat your test steps
I enabled the extra debug logging, which you can find as an attachment. Instead of restarting libvirtd, I did an "echo b > /proc/sysrq-trigger" to force an unclean reboot. The result was that it took 300s to register the lockspace:
09:59:28.919: 3457: debug : virLockManagerSanlockInit:267 : version=1000000 configFile=/etc/libvirt/qemu-sanlock.conf flags=0
10:05:29.539: 3457: debug : virLockManagerSanlockSetupLockspace:247 : Lockspace /var/lib/libvirt/sanlock/__LIBVIRT__DISKS__ has been registered
I also had a little discussion about this with David Teigland on the sanlock dev list. He gave me some more details about how the delays and timeouts work, and it explains why it takes 300s. I'll quote his reply:
Frido, David gave a great explanation; thanks for forwarding it.
David: "Yes, all the timeouts are derived from the io_timeout and are dictated by the recovery requirements and the algorithm the host_id leases are based on: "Light-Weight Leases for Storage-Centric Coordination" by Gregory Chockler and Dahlia Malkhi.
Here are the actual equations copied from sanlock_internal.h. "delta" refers to host_id leases that take a long time to acquire at startup; "free" corresponds to starting up after a clean shutdown; "held" corresponds to starting up after an unclean shutdown.
You should find that with 30 sec io timeout these come out to 1 min / 4 min which you see when starting after a clean / unclean shutdown."
Since I configured an io_timeout of 30s in sanlock (SANLOCKOPTS="-R 1 -o 30"), the delay at sanlock startup after an unclean shutdown is defined by the delta_acquire_held_min variable, which is calculated as:
int max = host_dead_seconds;
if (delta_large_delay > max)
    max = delta_large_delay;
int delta_acquire_held_min = max;
So max is host_dead_seconds, which is calculated as:
int host_dead_seconds = id_renewal_fail_seconds + WATCHDOG_FIRE_TIMEOUT;
And id_renewal_fail_seconds is:
int id_renewal_fail_seconds = 8 * io_timeout_seconds;
WATCHDOG_FIRE_TIMEOUT = 60
So that makes 8 * 30 + 60, or a total of 300s before the lock can be acquired. And that's exactly the time that is shown in libvirtd.log before the lockspace is registered.
When a proper reboot is done, or when sanlock/libvirtd is just restarted, delta_acquire_free_min defines the delay:
int delta_short_delay = 2 * io_timeout_seconds;
int delta_acquire_free_min = delta_short_delay;
This is confirmed by the 60s delay in the libvirtd.log file:
10:33:54.097: 7983: debug : virLockManagerSanlockInit:267 : version=1000000 configFile=/etc/libvirt/qemu-sanlock.conf flags=0
10:34:55.111: 7983: debug : virLockManagerSanlockSetupLockspace:247 : Lockspace /var/lib/libvirt/sanlock/__LIBVIRT__DISKS__ has been registered
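As a quick sanity check of the arithmetic with my io_timeout of 30s (plain shell arithmetic, using the constants quoted above):
# echo "held: $((8 * 30 + 60))s, free: $((2 * 30))s"
held: 300s, free: 60s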
So the whole delay is caused by the io_timeout, which is set to 30s because the lockspace is on GFS2 and the GFS2 volume can be blocked for some time while a node gets fenced and journal recovery takes place. Depending on clean/unclean restarts, the delay may differ.
So delta_acquire_held_min is based on host_dead_seconds because sanlock wants to be sure that the lock won't be reacquired before the host is actually dead. David told me that sanlock is supposed to use a block device. I wonder if this makes sense when used on top of GFS2, which already has its own locking and safety mechanisms... It looks to me like it just adds delay for no reason in this case. Perhaps delta_acquire_held_min should be the same as delta_acquire_free_min, because we don't need this safety delay on top of GFS2.
For NFS, it may be a whole other story...
Best regards, Frido

On 03/14/2012 01:32 AM, Alex Jia wrote:
I'm not sure whether you hit a sanlock AVC error in your /var/log/audit/audit.log; could you check it and provide your selinux-policy version? In addition, you should turn on the SELinux boolean for sanlock, for example:
# getsebool -a | grep sanlock
virt_use_sanlock --> off
# setsebool -P virt_use_sanlock on
# getsebool -a | grep sanlock
virt_use_sanlock --> on
Yuck - we have a documentation bug, since http://libvirt.org/locking.html doesn't mention virt_use_sanlock at all. What sort of AVCs are expected if the bool is false, and what security implications are there in setting it to true?

For example, if virt_use_nfs is false, you can't use NFS storage for guest disk images (at least not until qemu adds better support for fd passing everywhere). If it is true, you are admitting that a compromised qemu guest can do whatever it wants to other files within the confines of your NFS mount point, rather than the normal sVirt guarantee that it can only touch the files that have been labeled for that guest. If you trust your guests, or use different NFS mount points per guest, then setting the bool to true won't pose a significant risk to you; if you don't trust your guests, then documenting the risks of this bool would be enough to convince me to use iSCSI or another shared storage alternative with more security guarantees, even though it requires more administrative setup on my part.

But I don't even know the risks of virt_use_sanlock well enough to document them, or what could be used as alternatives.

--
Eric Blake eblake@redhat.com +1-919-301-3266
Libvirt virtualization library http://libvirt.org
participants (4)
- Alex Jia
- Daniel P. Berrange
- Eric Blake
- Frido Roose