
Hi,

I am interested to find out how libvirt envisions image locking, i.e., how do we make sure multiple nodes are not trying to access the same storage volume, possibly causing image corruption.

I know this can be solved by means of a cluster, but that seems excessive (and is not possible in all scenarios). Central administration (oVirt-like) is also problematic, since unless fencing is assumed, it cannot verify the image is no longer being used.

It would help if the image/storage volume could be locked (or lease-locked), either by the storage layer (LVM/NFS) preventing a node from accessing a specific image, or by having libvirt/qemu cooperatively check the lock before accessing it (very simply, with a leased lock we can check every X minutes, and kill the process to make sure it honors the lock).

Any thoughts on the subject?

Thanks,
Itamar

On Wed, Oct 15, 2008 at 10:23:03AM -0700, Itamar Heim wrote:
Hi,
I am interested to find out how libvirt envisions image locking.
i.e., how do we make sure multiple nodes are not trying to access the same storage volume, probably causing image corruption.
I know this can be solved by means of a cluster, but it seems excessive (and not possible in all scenarios).
Central administration (ovirt-like) is also problematic, since unless fencing is assumed, it cannot verify the image is no longer being used.
If the image/storage volume could be locked (or leased lock), either by the storage/LVM/NFS preventing from a node access to a specific image, or by having libvirt/qemu mutually check the lock before accessing it (very simply, in leased lock we can check every X minutes, and kill the process to make sure it honors the lock).
In the domain XML format, the semantics are that every <disk> section added to a guest config is read-write, with an exclusive lock. To allow multiple guests to use the same disk, the intent is that you add either a <readonly/> or <shareable/> element within the <disk>. That all said, we only implement this for the Xen driver, handing off the actual logic to XenD to perform. That we don't implement this in the QEMU driver is a clear shortcoming that needs addressing.

If we only care about locking within the scope of a single host, it is trivial - libvirtd knows all VMs and their config, so can trivially ensure the appropriate exclusivity checks are done at time of VM start. As you point out, ideally this locking would be enforced across hosts too, in the case of shared storage. Cluster software can't actually magically solve this for us - it can really only make sure the same VM is not started twice. I'm not sure that libvirt can necessarily solve it in the general case either, but we can at least make an effort in some cases.

If, for instance, we were to take a proper fcntl() lock over the files, this would work for disks backed by a file on shared filesystems like NFS / GFS - and it will actually play nicely with cluster software too, since that can be made to forcibly release NFS locks when fencing a node. fcntl() locks won't work on disks backed by iSCSI/SCSI block devices though. It is possible that SCSI reservations can help in the FibreChannel case.

So as an immediate option we should perform the exclusivity checks in libvirt, and also apply fcntl() locks over file-backed disks.

Daniel
-- 
|: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

The main issue is indeed across nodes. As you pointed out, fcntl() won't solve the issue for iSCSI.

What about the leased-lock mechanism, for example with libvirt playing external watchdog and terminating the QEMU process if it cannot renew the lease, or with a central store (NFS? a central service) maintaining the global locks? (It would have been very nice if the storage played along and accepted ownership assignment/release, but I guess this is too much to ask for.)

On Wed, Oct 15, 2008 at 01:47:36PM -0700, Itamar Heim wrote:
The main issue is indeed across nodes. As you pointed out, fcntl() won't solve the issue for iSCSI. What about the leased lock mechanism, for example, with libvirt playing an external watchdog and terminating the qemu process if it cannot renew the lock, or a central (NFS? central service) storage to maintain the global locks.
When you get to that level of cleverness, it seems to me that it is verging on a complete re-implementation of DLM (distributed lock manager), which really, AFAIK, needs a proper cluster setup so it can safely fence mis-behaving nodes, and avoid quorum/split-brain problems.

Daniel

On Wed, Oct 15, 2008 at 09:59:12PM +0100, Daniel P. Berrange wrote:
When you get to that level of cleverness, it seems to me that it is verging on a complete re-implementation of DLM (distributed lock manager), which really, AFAIK, needs a proper cluster setup so it can safely fence mis-behaving nodes, and avoid quorum/split-brain problems.

I've been toying with the idea of using DLM for libvirt earlier this year [1] (but inferred from other postings on the list that this would be out of scope for libvirt - probably should have asked). I looked at VM-based locks then, but having storage-based locks is even better.
Currently you have to make sure "manually" that people using e.g. virt-manager [2] don't accidentally fire up VMs managed via e.g. rgmanager. Having cluster-wide storage-based locks would be an awesome solution.
-- Guido

[1] using the rather simple lock_resource() and unlock_resource() API: http://git.fedorahosted.org/git/cluster.git?p=cluster.git;a=blob;f=dlm/doc/l...
[2] e.g. by having virt-manager hooked to all libvirtds in the cluster and allowing each uuid to be started only once

On Thu, Oct 16, 2008 at 10:02:10AM +0200, Guido Günther wrote:
On Wed, Oct 15, 2008 at 09:59:12PM +0100, Daniel P. Berrange wrote:
When you get to that level of cleverness, it seems to me that it is verging on a complete re-implementation of DLM (distributed lock manager), which really, AFAIK, needs a proper cluster setup so it can safely fence mis-behaving nodes, and avoid quorum/split-brain problems.

I've been toying with the idea of using DLM for libvirt earlier this year [1] (but inferred from other postings on the list that this would be out of scope for libvirt - probably should have asked). I looked at VM-based locks then, but having storage-based locks is even better.
Currently you have to make sure "manually" that people using e.g. virt-manager [2] don't accidentally fire up VMs managed via e.g. rgmanager.
Having cluster-wide storage-based locks would be an awesome solution.
If libvirt is deployed in an environment where DLM is present & configured, I've no objection to libvirt making use of it. It should just be a soft dependency, where we also make a best effort for cases where DLM isn't around, even if that only works on a single host, or with a subset of storage backends. Give users flexibility in terms of how they deploy & integrate libvirt, without imposing too many constraints.

Daniel
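[Editor's note: Daniel's "soft dependency" could take the following shape - entirely hypothetical names, not libvirt API: always take the host-local lock (libvirtd knows every VM it runs, so it can refuse to start two guests on the same writable disk), and additionally take a cluster-wide lock only when a DLM-style backend happens to be configured.]

```python
class LocalLockManager:
    """Host-local exclusivity via an in-process registry of volumes
    already attached to running guests on this host."""
    def __init__(self):
        self._held = set()

    def lock(self, volume):
        if volume in self._held:
            raise RuntimeError(volume + " already in use on this host")
        self._held.add(volume)

def lock_volume(volume, local, cluster=None):
    """Always take the host-local lock; take a cluster-wide lock too
    when a DLM-like backend is configured (the soft dependency)."""
    local.lock(volume)
    if cluster is not None:
        cluster.lock(volume)
```

With no cluster backend you still get single-host safety; with one configured, the same call path enforces exclusivity across the whole cluster.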

Itamar Heim wrote:
Hi,
I am interested to find out how libvirt envisions image locking.
i.e., how do we make sure multiple nodes are not trying to access the same storage volume, probably causing image corruption.
I know this can be solved by means of a cluster, but it seems excessive (and not possible in all scenarios).
Central administration (ovirt-like) is also problematic, since unless fencing is assumed, it cannot verify the image is no longer being used.
In an oVirt network, this shouldn't be a problem. Storage can only be assigned to one VM at a time presently. (In the future we may relax this for clustered filesystems, but shared storage will be marked as such.)

Regardless of whether or not a VM is active/inactive, once an iSCSI LUN, disk image or otherwise is assigned to a VM it can't be used by other VMs. The storage is not released to the available list until the VM using it is both destroyed and undefined. We don't allow undefine until the VM has been destroyed, and we won't confirm that a VM has been destroyed if we can't contact the host that it is running on to confirm.

Now... if you start creating VMs on an oVirt network outside of the oVirt Server, or decide to share your storage pools between an oVirt network and a non-oVirt network, that is problematic. Our solution for that for the time being is 'just don't do that'. :)

Perry
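[Editor's note: the assignment rule Perry describes can be phrased as a small state machine - a sketch of the stated policy, not oVirt code: a volume is held from the moment it is assigned, and returns to the free list only after the owning VM has been both destroyed and undefined, in that order.]

```python
class Volume:
    """Tracks which VM owns a storage volume: assigned to at most one
    VM, freed only after the owner is destroyed AND undefined."""
    def __init__(self):
        self.owner = None         # VM name, or None if on the free list
        self.owner_destroyed = False

    def assign(self, vm):
        if self.owner is not None:
            raise RuntimeError("volume already assigned to " + self.owner)
        self.owner, self.owner_destroyed = vm, False

    def destroy_owner(self):
        self.owner_destroyed = True   # VM stopped, but still defined

    def undefine_owner(self):
        if not self.owner_destroyed:
            raise RuntimeError("cannot undefine before destroy")
        self.owner = None             # back on the free list
```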

Hi Perry,

The problem is with unreachable hosts, which end up locking the image forever. When fencing can't be used, there is no way for the management to "release" the image, since it can't verify the host has stopped using it. A leased-lock mechanism, while not providing 100% prevention, does allow a collaborative effort: the image can be released after the lock expires, with the nodes checking that they still own the lease and otherwise stopping writes to the image.

It would have been much better if image access could have been enforced at the storage level, but that is much more complex (and not relevant for images under LVM, for example).

Itamar

Itamar Heim wrote:
Hi Perry,
The problem is with unreachable hosts which are locking the image forever.
When fencing can't be used, there is no way for the management to "release" the image, since it can't verify the host stopped using the image. A leased lock mechanism, while not providing 100% prevention, does allow a collaborative effort to allow releasing the image after the lock expired, by having the nodes check that they still own the lease and stop writing to the images.
If you have an unreachable host that is locking the image forever, you walk into the datacenter and pull the plug. Once that is done, you can use the oVirt Server interface to undefine the VM and release the storage volume. So it can be done without hw fencing; it just involves manual administrator action. Not ideal, but it works :)
It would have been much better if image access could have been enforced at storage level, but that is much more complex (and not relevant for images under LVM for example)
Agreed. We're using the above procedure (pull the plug or hw fencing) until a better mechanism is created. Perry

While this might work for SBC (although most enterprises have the datacenter at remote sites as well, so it is not always that easy), I don't think the solution is viable for CBC. (I am not sure CBC would use iSCSI - probably NFS is a more relevant option - but the leased locking is required there as well, if only as a collaborative way to signal to a non-responding node that it must stop writing to the image.)

On Tue, Oct 21, 2008 at 12:16:01PM -0700, Itamar Heim wrote:
While this might work for SBC (although most enterprises have the datacenter at remote sites as well, so it is not always that easy), I don't think the solution is viable for CBC. (I am not sure CBC would use iSCSI - probably NFS is a more relevant option - but the leased locking is required there as well, if only as a collaborative way to signal to a non-responding node that it must stop writing to the image.)

What about using DLM? This gives fencing for free. See my other post - allowing libvirt to use cluster-wide storage locks would solve your problem?
-- Guido

On Wed, Oct 22, 2008 at 10:21:38PM +0200, Guido Günther wrote:
What about using DLM? This gives fencing for free. See my other post - allowing libvirt to use cluster-wide storage locks would solve your problem?

Yes, it all comes back to clustering. If you have clustering, you can build a reliable image locking system across all types of storage backend. If you don't have clustering, you can make a best effort, likely only enforcing safety within the scope of a single host.

We need libvirt to enforce safety per host using standard POSIX locking, because we can't demand that everyone use clustering. We should optionally use some cluster-based locking scheme if it is available, as a second layer on top of the basic single-host locking.

Daniel
participants (4)
- Daniel P. Berrange
- Guido Günther
- Itamar Heim
- Perry Myers