[libvirt] [PATCH 0/2] Recursively lock backing chain

This is how I tested the feature. Create a couple of disks:

  qemu-img create -f qcow2 /var/lib/libvirt/images/test1.qcow2 1G
  qemu-img create -f qcow2 /var/lib/libvirt/images/test2.qcow2 -b /var/lib/libvirt/images/test1.qcow2
  qemu-img create -f qcow2 /var/lib/libvirt/images/test3.qcow2 -b /var/lib/libvirt/images/test1.qcow2

Then adjust the configuration of two domains:

virsh # domblklist gentoo
 Target     Source
------------------------------------------------
 fda        /var/lib/libvirt/images/fd.img
 vda        /var/lib/libvirt/images/gentoo.qcow2
 vdb        /var/lib/libvirt/images/test1.qcow2
 hdc        /home/zippy/tmp/install-amd64-minimal-20150402.iso

virsh # domblklist fedora
 Target     Source
------------------------------------------------
 sda        /var/lib/libvirt/images/fedora.qcow2
 sdb        /var/lib/libvirt/images/test2.qcow2
 hda        -

With this configuration, fedora and gentoo should not be able to run at the same time:

virsh # start fedora
Domain fedora started

virsh # start gentoo
error: Failed to start domain gentoo
error: resource busy: Lockspace resource '/var/lib/libvirt/images/test1.qcow2' is locked

Cool! But with a slight change, gentoo should be able to run:

virsh # domblklist gentoo
 Target     Source
------------------------------------------------
 fda        /var/lib/libvirt/images/fd.img
 vda        /var/lib/libvirt/images/gentoo.qcow2
 vdb        /var/lib/libvirt/images/test3.qcow2
 hdc        /home/zippy/tmp/install-amd64-minimal-20150402.iso

virsh # start gentoo
Domain gentoo started

virsh # list
 Id    Name                           State
----------------------------------------------------
 10    fedora                         running
 11    gentoo                         running

Awesome!

Michal Privoznik (2):
  qemuProcessStart: Parse backingStore for all disks
  virDomainLockManagerAddImage: Recursively lock backing chain

 src/locking/domain_lock.c | 53 +++++++++++++++++++++++++++++------------------
 src/qemu/qemu_domain.c    | 11 +++-------
 src/qemu/qemu_process.c   |  6 +++++-
 3 files changed, 41 insertions(+), 29 deletions(-)

-- 
2.4.6

Later, in the next commit, we are going to need the backing chain for all the disks within the domain. Let's parse them at the domain startup phase.

Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
 src/qemu/qemu_domain.c  | 11 +++--------
 src/qemu/qemu_process.c |  6 +++++-
 2 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index 91c632c..54bcce9 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -2788,17 +2788,12 @@ qemuDomainCheckDiskPresence(virQEMUDriverPtr driver,
             virFileExists(virDomainDiskGetSource(disk)))
             continue;
 
-        if (qemuDomainDetermineDiskChain(driver, vm, disk, true, true) >= 0)
-            continue;
-
         if (disk->startupPolicy &&
             qemuDomainCheckDiskStartupPolicy(driver, vm, idx,
-                                             cold_boot) >= 0) {
-            virResetLastError();
-            continue;
-        }
+                                             cold_boot) < 0)
+            goto error;
 
-        goto error;
+        virResetLastError();
     }
 
     ret = 0;
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
index f7eb2b6..693ec33 100644
--- a/src/qemu/qemu_process.c
+++ b/src/qemu/qemu_process.c
@@ -4661,7 +4661,11 @@ int qemuProcessStart(virConnectPtr conn,
      * cgroup and security setting.
      */
     for (i = 0; i < vm->def->ndisks; i++) {
-        if (virStorageTranslateDiskSourcePool(conn, vm->def->disks[i]) < 0)
+        virDomainDiskDefPtr disk = vm->def->disks[i];
+
+        if (virStorageTranslateDiskSourcePool(conn, disk) < 0)
+            goto cleanup;
+
+        if (qemuDomainDetermineDiskChain(driver, vm, disk, true, true) < 0)
             goto cleanup;
     }
-- 
2.4.6

On Tue, Sep 08, 2015 at 17:17:18 +0200, Michal Privoznik wrote: The subject is misleading. The backing store is detected from the disk images, not parsed from the XML.
Later, in the next commit, we are going to need backing chain for all the disks within the domain. Let's parse them at domain startup phase.
Why do we need this patch? Currently, qemuDomainDetermineDiskChain still gets called on all disks where it makes sense. The exceptions are:

1) empty drives (this check is also redundant, since qemuDomainDetermineDiskChain will check it again);

2) disk source formats that don't support backing chains, where we just check that the file exists.

This patch would make us call it on those too.

NACK,

Peter

https://bugzilla.redhat.com/show_bug.cgi?id=1192399

It has been a known fact for a while now that we should lock not only the top layer of the backing chain but the rest of it too, and it is equally well known that we are not doing that. Well, up until this commit. The reason is that while one guest can have, for instance, the following backing chain: A (top) -> B -> C (bottom), another can be started just on top of B or C. But libvirt should prevent that, as it will almost certainly mangle the backing chain for the former guest. On the other hand, we want to allow a guest with the following backing chain: D (top) -> B -> C (bottom), because in both cases B and C are there just for reading. In order to achieve that we must lock the rest of the backing chain with the VIR_LOCK_MANAGER_RESOURCE_SHARED flag.

Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
 src/locking/domain_lock.c | 53 +++++++++++++++++++++++++++++------------------
 1 file changed, 33 insertions(+), 20 deletions(-)

diff --git a/src/locking/domain_lock.c b/src/locking/domain_lock.c
index 705b132..faf73f2 100644
--- a/src/locking/domain_lock.c
+++ b/src/locking/domain_lock.c
@@ -69,35 +69,48 @@ static int virDomainLockManagerAddLease(virLockManagerPtr lock,
 
 static int virDomainLockManagerAddImage(virLockManagerPtr lock,
-                                        virStorageSourcePtr src)
+                                        virStorageSourcePtr disk)
 {
     unsigned int diskFlags = 0;
-    int type = virStorageSourceGetActualType(src);
-
-    if (!src->path)
-        return 0;
-
-    if (!(type == VIR_STORAGE_TYPE_BLOCK ||
-          type == VIR_STORAGE_TYPE_FILE ||
-          type == VIR_STORAGE_TYPE_DIR))
-        return 0;
+    virStorageSourcePtr src = disk;
+    int ret = -1;
 
+    /* The top layer of backing chain should have the correct flags set.
+     * The rest should be locked with VIR_LOCK_MANAGER_RESOURCE_SHARED
+     * for it can be shared among other domains. */
     if (src->readonly)
         diskFlags |= VIR_LOCK_MANAGER_RESOURCE_READONLY;
     if (src->shared)
         diskFlags |= VIR_LOCK_MANAGER_RESOURCE_SHARED;
 
-    VIR_DEBUG("Add disk %s", src->path);
-    if (virLockManagerAddResource(lock,
-                                  VIR_LOCK_MANAGER_RESOURCE_TYPE_DISK,
-                                  src->path,
-                                  0,
-                                  NULL,
-                                  diskFlags) < 0) {
-        VIR_DEBUG("Failed add disk %s", src->path);
-        return -1;
+    while (src) {
+        if (!virStorageSourceIsLocalStorage(src))
+            break;
+
+        if (!src->path)
+            break;
+
+        VIR_DEBUG("Add disk %s", src->path);
+        if (virLockManagerAddResource(lock,
+                                      VIR_LOCK_MANAGER_RESOURCE_TYPE_DISK,
+                                      src->path,
+                                      0,
+                                      NULL,
+                                      diskFlags) < 0) {
+            VIR_DEBUG("Failed add disk %s", src->path);
+            goto cleanup;
+        }
+
+        /* Now that we've added the first disk (top layer of backing chain)
+         * into the lock manager, let's traverse the rest of the backing chain
+         * and lock each layer for RO. This should prevent accidentally
+         * starting a domain with RW disk from the middle of the chain. */
+        diskFlags = VIR_LOCK_MANAGER_RESOURCE_SHARED;
+        src = src->backingStore;
     }
 
-    return 0;
+    ret = 0;
+ cleanup:
+    return ret;
 }
-- 
2.4.6

On Tue, Sep 08, 2015 at 17:17:19 +0200, Michal Privoznik wrote:
https://bugzilla.redhat.com/show_bug.cgi?id=1192399
It's been a known fact for a while now that we should lock not only the top layers of the backing chain but the rest of it too, and it's well known too that we are not doing that. Well, up until this commit. The reason is that while one guest can have, for instance, the following backing chain: A (top) -> B -> C (bottom), another can be started just on top of B or C. But libvirt should prevent that, as it will almost certainly mangle the backing chain for the former guest. On the other hand, we want to allow a guest with the
Well, the corruption may still happen, since if you run the guest that uses image 'B' for writing without starting the one that uses 'A', image 'A' will be invalidated anyway.
following backing chain: D (top) -> B -> C (bottom), because in both cases B and C are there just for reading. In order to achieve that we must lock the rest of backing chain with VIR_LOCK_MANAGER_RESOURCE_SHARED flag.
See below ...
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
 src/locking/domain_lock.c | 53 +++++++++++++++++++++++++++++------------------
 1 file changed, 33 insertions(+), 20 deletions(-)
diff --git a/src/locking/domain_lock.c b/src/locking/domain_lock.c
index 705b132..faf73f2 100644
--- a/src/locking/domain_lock.c
+++ b/src/locking/domain_lock.c
@@ -69,35 +69,48 @@ static int virDomainLockManagerAddLease(virLockManagerPtr lock,
 static int virDomainLockManagerAddImage(virLockManagerPtr lock,
-                                        virStorageSourcePtr src)
+                                        virStorageSourcePtr disk)
This function is used both in cases where you are adding the whole disk to the lock manager and in cases where we are adding individual members of the backing chain that may already contain shared portions of the backing chain (e.g. when creating snapshots), which is not desired. virDomainLockDiskAttach should recurse through the backing chain, but virDomainLockImageAttach should not. While this wouldn't pose a big problem for adding disks, when you are removing a single image file from the locking, the code would unlock the whole backing chain.
 {
     unsigned int diskFlags = 0;
-    int type = virStorageSourceGetActualType(src);
-
-    if (!src->path)
-        return 0;
-
-    if (!(type == VIR_STORAGE_TYPE_BLOCK ||
-          type == VIR_STORAGE_TYPE_FILE ||
-          type == VIR_STORAGE_TYPE_DIR))
-        return 0;
+    virStorageSourcePtr src = disk;
+    int ret = -1;
+    /* The top layer of backing chain should have the correct flags set.
+     * The rest should be locked with VIR_LOCK_MANAGER_RESOURCE_SHARED
+     * for it can be shared among other domains. */
     if (src->readonly)
         diskFlags |= VIR_LOCK_MANAGER_RESOURCE_READONLY;
     if (src->shared)
         diskFlags |= VIR_LOCK_MANAGER_RESOURCE_SHARED;
-    VIR_DEBUG("Add disk %s", src->path);
-    if (virLockManagerAddResource(lock,
-                                  VIR_LOCK_MANAGER_RESOURCE_TYPE_DISK,
-                                  src->path,
-                                  0,
-                                  NULL,
-                                  diskFlags) < 0) {
-        VIR_DEBUG("Failed add disk %s", src->path);
-        return -1;
+    while (src) {
+        if (!virStorageSourceIsLocalStorage(src))
+            break;
Also note that once you create a snapshot on a remote storage device on top of an existing local backing chain, the backing chain of that image won't get unlocked.
+
+        if (!src->path)
+            break;
+
+        VIR_DEBUG("Add disk %s", src->path);
+        if (virLockManagerAddResource(lock,
+                                      VIR_LOCK_MANAGER_RESOURCE_TYPE_DISK,
+                                      src->path,
+                                      0,
+                                      NULL,
+                                      diskFlags) < 0) {
+            VIR_DEBUG("Failed add disk %s", src->path);
+            goto cleanup;
+        }
+
+        /* Now that we've added the first disk (top layer of backing chain)
+         * into the lock manager, let's traverse the rest of the backing chain
+         * and lock each layer for RO. This should prevent accidentally
+         * starting a domain with RW disk from the middle of the chain. */
+        diskFlags = VIR_LOCK_MANAGER_RESOURCE_SHARED;
The VIR_LOCK_MANAGER_RESOURCE_SHARED flag means "shared, writable" mode according to lock_driver.h, I think you want to lock it with VIR_LOCK_MANAGER_RESOURCE_READONLY.
+        src = src->backingStore;
     }
 
-    return 0;
+    ret = 0;
+ cleanup:
+    return ret;
 }
Peter

On Tue, Sep 08, 2015 at 05:59:45PM +0200, Peter Krempa wrote:
On Tue, Sep 08, 2015 at 17:17:19 +0200, Michal Privoznik wrote:
https://bugzilla.redhat.com/show_bug.cgi?id=1192399
It's been a known fact for a while now that we should lock not only the top layers of the backing chain but the rest of it too, and it's well known too that we are not doing that. Well, up until this commit. The reason is that while one guest can have, for instance, the following backing chain: A (top) -> B -> C (bottom), another can be started just on top of B or C. But libvirt should prevent that, as it will almost certainly mangle the backing chain for the former guest. On the other hand, we want to allow a guest with the
Well, the corruption may still happen, since if you run the guest that uses image 'B' for writing without starting the one that uses 'A', image 'A' will be invalidated anyway.
Yep, the lock manager won't protect against you doing stupid stuff to the base image, which invalidates children, but that's out of scope really. We only need to consider the concurrent running VM problem here.
following backing chain: D (top) -> B -> C (bottom), because in both cases B and C are there just for reading. In order to achieve that we must lock the rest of backing chain with VIR_LOCK_MANAGER_RESOURCE_SHARED flag.
See below ...
+
+        if (!src->path)
+            break;
+
+        VIR_DEBUG("Add disk %s", src->path);
+        if (virLockManagerAddResource(lock,
+                                      VIR_LOCK_MANAGER_RESOURCE_TYPE_DISK,
+                                      src->path,
+                                      0,
+                                      NULL,
+                                      diskFlags) < 0) {
+            VIR_DEBUG("Failed add disk %s", src->path);
+            goto cleanup;
+        }
+
+        /* Now that we've added the first disk (top layer of backing chain)
+         * into the lock manager, let's traverse the rest of the backing chain
+         * and lock each layer for RO. This should prevent accidentally
+         * starting a domain with RW disk from the middle of the chain. */
+        diskFlags = VIR_LOCK_MANAGER_RESOURCE_SHARED;
The VIR_LOCK_MANAGER_RESOURCE_SHARED flag means "shared, writable" mode according to lock_driver.h, I think you want to lock it with VIR_LOCK_MANAGER_RESOURCE_READONLY.
Yep, backing files should be marked readonly - 'shared' will allow multiple VMs to write to the image at once. So you've only provided mutual exclusion between an exclusive writer and a shared writer.

Regards,
Daniel

-- 
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On 08.09.2015 18:04, Daniel P. Berrange wrote:
On Tue, Sep 08, 2015 at 05:59:45PM +0200, Peter Krempa wrote:
On Tue, Sep 08, 2015 at 17:17:19 +0200, Michal Privoznik wrote:
https://bugzilla.redhat.com/show_bug.cgi?id=1192399
It's been a known fact for a while now that we should lock not only the top layers of the backing chain but the rest of it too, and it's well known too that we are not doing that. Well, up until this commit. The reason is that while one guest can have, for instance, the following backing chain: A (top) -> B -> C (bottom), another can be started just on top of B or C. But libvirt should prevent that, as it will almost certainly mangle the backing chain for the former guest. On the other hand, we want to allow a guest with the
Well, the corruption may still happen, since if you run the guest that uses image 'B' for writing without starting the one that uses 'A', image 'A' will be invalidated anyway.
Yep, the lock manager won't protect against you doing stupid stuff to the base image, which invalidates children, but that's out of scope really. We only need to consider the concurrent running VM problem here.
Agreed. There's no way for us to tell whether the layer we are looking at is the top layer or not. When a snapshot is created, the image underneath is not modified, and there are no reverse records. Unless we want to search every file on the system to see if there's an 'A' when starting from 'B', this is not an argument against my patch.
following backing chain: D (top) -> B -> C (bottom), because in both cases B and C are there just for reading. In order to achieve that we must lock the rest of backing chain with VIR_LOCK_MANAGER_RESOURCE_SHARED flag.
See below ...
+
+        if (!src->path)
+            break;
+
+        VIR_DEBUG("Add disk %s", src->path);
+        if (virLockManagerAddResource(lock,
+                                      VIR_LOCK_MANAGER_RESOURCE_TYPE_DISK,
+                                      src->path,
+                                      0,
+                                      NULL,
+                                      diskFlags) < 0) {
+            VIR_DEBUG("Failed add disk %s", src->path);
+            goto cleanup;
+        }
+
+        /* Now that we've added the first disk (top layer of backing chain)
+         * into the lock manager, let's traverse the rest of the backing chain
+         * and lock each layer for RO. This should prevent accidentally
+         * starting a domain with RW disk from the middle of the chain. */
+        diskFlags = VIR_LOCK_MANAGER_RESOURCE_SHARED;
The VIR_LOCK_MANAGER_RESOURCE_SHARED flag means "shared, writable" mode according to lock_driver.h, I think you want to lock it with VIR_LOCK_MANAGER_RESOURCE_READONLY.
Yep, backing files should be marked readonly - 'shared' will allow multiple VMs to write to the image at once. So you've only provided mutual exclusion between an exclusive writer and a shared writer.
Except that both of our locking drivers treat VIR_LOCK_MANAGER_RESOURCE_READONLY as a no-op. But I think that since shared and exclusive writers are mutually exclusive, it doesn't matter that the lower layers are locked as a shared writer. As soon as somebody wants to lock one of them exclusively, they will fail. Neither virtlockd nor sanlock actually enforces the RO vs RW policy (i.e. that if a file is locked for RO, no writes are allowed); the locks are merely for tracking shared and exclusive holders.

Michal

On Wed, Sep 09, 2015 at 10:13:24 +0200, Michal Privoznik wrote:
On 08.09.2015 18:04, Daniel P. Berrange wrote:
On Tue, Sep 08, 2015 at 05:59:45PM +0200, Peter Krempa wrote:
On Tue, Sep 08, 2015 at 17:17:19 +0200, Michal Privoznik wrote:
https://bugzilla.redhat.com/show_bug.cgi?id=1192399
It's been a known fact for a while now that we should lock not only the top layers of the backing chain but the rest of it too, and it's well known too that we are not doing that. Well, up until this commit. The reason is that while one guest can have, for instance, the following backing chain: A (top) -> B -> C (bottom), another can be started just on top of B or C. But libvirt should prevent that, as it will almost certainly mangle the backing chain for the former guest. On the other hand, we want to allow a guest with the
Well, the corruption may still happen, since if you run the guest that uses image 'B' for writing without starting the one that uses 'A', image 'A' will be invalidated anyway.
Yep, the lock manager won't protect against you doing stupid stuff to the base image, which invalidates children, but that's out of scope really. We only need to consider the concurrent running VM problem here.
Agreed. There's no way for us to tell whether the layer we are looking at is the top layer or not. When a snapshot is created, the image underneath is not modified, and there are no reverse records. Unless we want to search every file on the system to see if there's an 'A' when starting from 'B', this is not an argument against my patch.
What I wanted to point out is that this might give users a false sense of security: in some orderings of VM startup it helps, while in other orderings the data will be corrupted and nothing will prevent it. I think we should point out this fact in the docs very, very explicitly.
...
Yep, backing files should be marked readonly - 'shared' will allow multiple VMs to write to the image at once. So you've only provided mutual exclusion between an exclusive writer, vs a shared writer.
Except that both of our locking drivers treat VIR_LOCK_MANAGER_RESOURCE_READONLY as a no-op.
But I think that since shared and exclusive writers are mutually exclusive, it doesn't matter that the lower layers are locked as a shared writer. As soon as somebody wants to lock one of them exclusively, they will fail. Neither virtlockd nor sanlock actually enforces the RO vs RW policy (i.e. that if a file is locked for RO, no writes are allowed); the locks are merely for tracking shared and exclusive holders.
In that case you should fix the documentation in the lock driver header first, so that you will then be able to justify the use of _SHARED here.
Additionally, this will probably result in more work, since you will need to go through the block job code and make it do the locking the same way you decide to do it here. The backing chain manipulation APIs switch between RW and RO locks for individual backing chain members, so a change to shared locks will need to be made there too.

Peter

On Wed, Sep 09, 2015 at 10:13:24AM +0200, Michal Privoznik wrote:
On 08.09.2015 18:04, Daniel P. Berrange wrote:
following backing chain: D (top) -> B -> C (bottom), because in both cases B and C are there just for reading. In order to achieve that we must lock the rest of backing chain with VIR_LOCK_MANAGER_RESOURCE_SHARED flag.
See below ...
+
+        if (!src->path)
+            break;
+
+        VIR_DEBUG("Add disk %s", src->path);
+        if (virLockManagerAddResource(lock,
+                                      VIR_LOCK_MANAGER_RESOURCE_TYPE_DISK,
+                                      src->path,
+                                      0,
+                                      NULL,
+                                      diskFlags) < 0) {
+            VIR_DEBUG("Failed add disk %s", src->path);
+            goto cleanup;
+        }
+
+        /* Now that we've added the first disk (top layer of backing chain)
+         * into the lock manager, let's traverse the rest of the backing chain
+         * and lock each layer for RO. This should prevent accidentally
+         * starting a domain with RW disk from the middle of the chain. */
+        diskFlags = VIR_LOCK_MANAGER_RESOURCE_SHARED;
The VIR_LOCK_MANAGER_RESOURCE_SHARED flag means "shared, writable" mode according to lock_driver.h, I think you want to lock it with VIR_LOCK_MANAGER_RESOURCE_READONLY.
Yep, backing files should be marked readonly - 'shared' will allow multiple VMs to write to the image at once. So you've only provided mutual exclusion between an exclusive writer and a shared writer.
Except that both of our locking drivers treat VIR_LOCK_MANAGER_RESOURCE_READONLY as a no-op.
But I think that since shared and exclusive writers are mutually exclusive, it doesn't matter that the lower layers are locked as a shared writer. As soon as somebody wants to lock one of them exclusively, they will fail. Neither virtlockd nor sanlock actually enforces the RO vs RW policy (i.e. that if a file is locked for RO, no writes are allowed); the locks are merely for tracking shared and exclusive holders.
Right, so I remember why we ignore READONLY locks now - we were only trying to protect against two processes /writing/ to the same image at once, which would cause unrecoverable data corruption.

We were not trying to protect against one guest having a disk readonly while another guest has the disk read-write. That is bad because the guest holding the image readonly will see inconsistent filesystem state, which may result in its kernel panicking. But the underlying disk is still only written by one guest at a time, so there's no disk image corruption.

This is similar to the scenario you illustrated with backing files. One guest had a backing file open for write, and another guest had it open readonly by virtue of having an upper layer open for write. We're not actually going to suffer data corruption of the backing file here, since we're still protected against concurrent writes to it.

We are going to destroy the upper layer image by invalidating the backing file, but that is true regardless of whether another guest has the upper layer open right now or not.

So in fact I think there's a case to say that this entire patch is out of scope for the lock managers, as their goal is to prevent concurrent writes to an image, not to protect against administrator screwups with backing file writing.

Regards,
Daniel

On Wed, Sep 09, 2015 at 09:49:16AM +0100, Daniel P. Berrange wrote:
On Wed, Sep 09, 2015 at 10:13:24AM +0200, Michal Privoznik wrote:
On 08.09.2015 18:04, Daniel P. Berrange wrote:
following backing chain: D (top) -> B -> C (bottom), because in both cases B and C are there just for reading. In order to achieve that we must lock the rest of backing chain with VIR_LOCK_MANAGER_RESOURCE_SHARED flag.
See below ...
+
+        if (!src->path)
+            break;
+
+        VIR_DEBUG("Add disk %s", src->path);
+        if (virLockManagerAddResource(lock,
+                                      VIR_LOCK_MANAGER_RESOURCE_TYPE_DISK,
+                                      src->path,
+                                      0,
+                                      NULL,
+                                      diskFlags) < 0) {
+            VIR_DEBUG("Failed add disk %s", src->path);
+            goto cleanup;
+        }
+
+        /* Now that we've added the first disk (top layer of backing chain)
+         * into the lock manager, let's traverse the rest of the backing chain
+         * and lock each layer for RO. This should prevent accidentally
+         * starting a domain with RW disk from the middle of the chain. */
+        diskFlags = VIR_LOCK_MANAGER_RESOURCE_SHARED;
The VIR_LOCK_MANAGER_RESOURCE_SHARED flag means "shared, writable" mode according to lock_driver.h, I think you want to lock it with VIR_LOCK_MANAGER_RESOURCE_READONLY.
Yep, backing files should be marked readonly - 'shared' will allow multiple VMs to write to the image at once. So you've only provided mutual exclusion between an exclusive writer and a shared writer.
Except that both of our locking drivers treat VIR_LOCK_MANAGER_RESOURCE_READONLY as a no-op.
But I think that since shared and exclusive writers are mutually exclusive, it doesn't matter that the lower layers are locked as a shared writer. As soon as somebody wants to lock one of them exclusively, they will fail. Neither virtlockd nor sanlock actually enforces the RO vs RW policy (i.e. that if a file is locked for RO, no writes are allowed); the locks are merely for tracking shared and exclusive holders.
Right, so I remember why we ignore READONLY locks now - we were only trying to protect against two processes /writing/ to the same image at once, which would cause unrecoverable data corruption.
We were not trying to protect against one guest having a disk readonly while another guest has the disk read-write. That is bad because the guest holding the image readonly will see inconsistent filesystem state, which may result in its kernel panicking. But the underlying disk is still only written by one guest at a time, so there's no disk image corruption.
This is similar to the scenario you illustrated with backing files. One guest had a backing file open for write, and another guest had it open readonly by virtue of having an upper layer open for write. We're not actually going to suffer data corruption of the backing file here, since we're still protected against concurrent writes to it.
We are going to destroy the upper layer image, by invalidating the backing file, but that is true regardless of whether another guest has the upper layer open right now or not.
So in fact I think there's a case to say that this entire patch is out of scope for the lock managers, as their goal is to prevent concurrent writes to an image, not to protect against administrator screwups with backing file writing.
I'm just jumping into the discussion from the middle, so don't beat me for details I might've missed.

I don't quite get these last two paragraphs. There will always be cases that can screw up the whole backing chain, and there will always be users doing so. But for the cases libvirt is able to know about, we should utilize the locks if we have the option to do so. Whatever two domains are doing concurrently, we can make sure it's OK. If one domain is invalidating a backing file of another domain just because that domain is not running... well, there's plenty of stuff you can break without utilizing backing chains. I think the idea of this patch is good; it just needs some polishing and, after that, a few follow-up patches that will utilize exclusive locks for all files used by block jobs.

Martin

On Wed, Sep 09, 2015 at 02:37:13PM +0200, Martin Kletzander wrote:
On Wed, Sep 09, 2015 at 09:49:16AM +0100, Daniel P. Berrange wrote:
On Wed, Sep 09, 2015 at 10:13:24AM +0200, Michal Privoznik wrote:
On 08.09.2015 18:04, Daniel P. Berrange wrote:
following backing chain: D (top) -> B -> C (bottom), because in both cases B and C are there just for reading. In order to achieve that we must lock the rest of backing chain with VIR_LOCK_MANAGER_RESOURCE_SHARED flag.
See below ...
+
+        if (!src->path)
+            break;
+
+        VIR_DEBUG("Add disk %s", src->path);
+        if (virLockManagerAddResource(lock,
+                                      VIR_LOCK_MANAGER_RESOURCE_TYPE_DISK,
+                                      src->path,
+                                      0,
+                                      NULL,
+                                      diskFlags) < 0) {
+            VIR_DEBUG("Failed add disk %s", src->path);
+            goto cleanup;
+        }
+
+        /* Now that we've added the first disk (top layer of backing chain)
+         * into the lock manager, let's traverse the rest of the backing chain
+         * and lock each layer for RO. This should prevent accidentally
+         * starting a domain with RW disk from the middle of the chain. */
+        diskFlags = VIR_LOCK_MANAGER_RESOURCE_SHARED;
The VIR_LOCK_MANAGER_RESOURCE_SHARED flag means "shared, writable" mode according to lock_driver.h, I think you want to lock it with VIR_LOCK_MANAGER_RESOURCE_READONLY.
Yep, backing files should be marked readonly - 'shared' will allow multiple VMs to write to the image at once. So you've only provided mutual exclusion between an exclusive writer and a shared writer.
Except that both of our locking drivers treat VIR_LOCK_MANAGER_RESOURCE_READONLY as a no-op.
But I think that since shared and exclusive writers are mutually exclusive, it doesn't matter that the lower layers are locked as a shared writer. As soon as somebody wants to lock one of them exclusively, they will fail. Neither virtlockd nor sanlock actually enforces the RO vs RW policy (i.e. that if a file is locked for RO, no writes are allowed); the locks are merely for tracking shared and exclusive holders.
Right, so I remember why we ignore READONLY locks now - we were only trying to protect against two processes /writing/ to the same image at once, which would cause unrecoverable data corruption.
We were not trying to protect against one guest having a disk readonly while another guest has the disk read-write. That is bad because the guest holding the image readonly will see inconsistent filesystem state, which may result in its kernel panicking. But the underlying disk is still only written by one guest at a time, so there's no disk image corruption.
This is similar to the scenario you illustrated with backing files. One guest had a backing file open for write, and another guest had it open readonly by virtue of having an upper layer open for write. We're not actually going to suffer data corruption of the backing file here, since we're still protected against concurrent writes to it.
We are going to destroy the upper layer image, by invalidating the backing file, but that is true regardless of whether another guest has the upper layer open right now or not.
So in fact I think there's a case to say that this entire patch is out of scope for the lock managers, as their goal is to prevent concurrent writes to an image, not to protect against administrator screwups with backing file writing.
I'm just jumping into the discussion from the middle, so don't beat me for details I might've missed.
I don't quite get these last two paragraphs. There will always be cases that can screw up the whole backing chain, and there will always be users doing so. But for the cases libvirt is able to know about, we should utilize the locks if we have the option to do so. Whatever two domains are doing concurrently, we can make sure it's OK. If one domain is invalidating a backing file of another domain just because that domain is not running... well, there's plenty of stuff you can break without utilizing backing chains. I think the idea of this patch is good; it just needs some polishing and, after that, a few follow-up patches that will utilize exclusive locks for all files used by block jobs.
The issue is that this is really special-casing the lock manager operation for backing files. While it may solve the particular scenario outlined in the cover letter, that's only one of many scenarios related to backing files. Furthermore, it does this by changing the semantics of the lock manager's handling of read-only files, but only in the case of backing files. I think this kind of special-casing is something that is liable to come back and cause pain in the future. I realize the lock manager is an appealing hammer to use here, but I don't think it should be used on this particular nail.

Regards,
Daniel
participants (4):
- Daniel P. Berrange
- Martin Kletzander
- Michal Privoznik
- Peter Krempa