[libvirt] [PATCH 0/2] Storage : Fixes for cloning raw volumes

From: Prerna Saxena <prerna@linux.vnet.ibm.com> Date: Mon, 4 May 2015 12:00:46 -0700 This series has some long-overdue fixes for copying of raw storage volumes with libvirt. Prerna Saxena (2): Storage : Suppress metadata refresh for volumes being built. Storage : Fix cloning of raw, sparse volumes. src/storage/storage_backend.c | 5 ++++- src/storage/storage_driver.c | 5 ----- 2 files changed, 4 insertions(+), 6 deletions(-) -- 1.8.3.1 -- Prerna Saxena Linux Technology Centre, IBM Systems and Technology Lab, Bangalore, India

Libvirt periodically calls 'stat' on all volumes in a storage pool, to update fields such as 'target.allocation'. The operation doesnt make sense for a volume which is curently being allocated. Also, the 'target.allocation' sub-field is taken into account while copying a raw image. To suppress any (potential) corruption, libvirt must not attempt to refresh a volume currently being built. Signed-off-by: Prerna Saxena <prerna@linux.vnet.ibm.com> --- src/storage/storage_backend.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/src/storage/storage_backend.c b/src/storage/storage_backend.c index 289f454..355fc7f 100644 --- a/src/storage/storage_backend.c +++ b/src/storage/storage_backend.c @@ -1576,6 +1576,9 @@ virStorageBackendUpdateVolInfo(virStorageVolDefPtr vol, { int ret; + if (vol->building) + return 0; + if ((ret = virStorageBackendUpdateVolTargetInfo(&vol->target, withBlockVolFormat, openflags)) < 0) -- 1.8.3.1 -- Prerna Saxena Linux Technology Centre, IBM Systems and Technology Lab, Bangalore, India

On Tue, May 05, 2015 at 08:43:21AM +0530, Prerna Saxena wrote:
Libvirt periodically calls 'stat' on all volumes in a storage pool, to update fields such as 'target.allocation'.
The operation doesnt make sense for a volume which is curently being allocated.
From the comments in the storage driver, the point of allowing refresh for a volume that is currently being allocated is to track the progress of the allocation.
Also, the 'target.allocation' sub-field is taken into account while copying a raw image. To suppress any (potential) corruption, libvirt must not attempt to refresh a volume currently being built.
What would be the corruption? We do not allow using a volume that is currently building as a source for cloning the volume - storageVolCreateXMLFrom checks for origvol->building: if (origvol->building) { virReportError(VIR_ERR_OPERATION_INVALID, _("volume '%s' is still being allocated."), origvol->name); goto cleanup; } Jan

On Tuesday 05 May 2015 01:52 PM, Ján Tomko wrote:
On Tue, May 05, 2015 at 08:43:21AM +0530, Prerna Saxena wrote:
Libvirt periodically calls 'stat' on all volumes in a storage pool, to update fields such as 'target.allocation'.
The operation doesnt make sense for a volume which is curently being allocated. From the comments in the storage driver, the point of allowing refresh for a volume that is currently being allocated is to track the progress of the allocation.
Also, the 'target.allocation' sub-field is taken into account while copying a raw image. To suppress any (potential) corruption, libvirt must not attempt to refresh a volume currently being built. What would be the corruption?
We do not allow using a volume that is currently building as a source for cloning the volume - storageVolCreateXMLFrom checks for origvol->building:
if (origvol->building) { virReportError(VIR_ERR_OPERATION_INVALID, _("volume '%s' is still being allocated."), origvol->name); goto cleanup; }
While running libvirt on PowerPC, I saw an interesting scenario. The 'target.allocation' field seemed to change for a volume getting allocated, and this would lead to incomplete copy. This would happen at random intervals, not deterministically. While looking through the code, I found this to be the other place in code where the same field seemed to change without a lock. Hence the patch. I have sent the second patch which fixes the erring code too : - remain = vol->target.allocation; + remain = inputvol->target.capacity; Regards, -- Prerna Saxena Linux Technology Centre, IBM Systems and Technology Lab, Bangalore, India

On Tuesday 05 May 2015 03:20 PM, Prerna Saxena wrote:
On Tuesday 05 May 2015 01:52 PM, Ján Tomko wrote:
On Tue, May 05, 2015 at 08:43:21AM +0530, Prerna Saxena wrote:
Libvirt periodically calls 'stat' on all volumes in a storage pool, to update fields such as 'target.allocation'.
The operation doesnt make sense for a volume which is curently being allocated. From the comments in the storage driver, the point of allowing refresh for a volume that is currently being allocated is to track the progress of the allocation.
Also, the 'target.allocation' sub-field is taken into account while copying a raw image. To suppress any (potential) corruption, libvirt must not attempt to refresh a volume currently being built. What would be the corruption?
We do not allow using a volume that is currently building as a source for cloning the volume - storageVolCreateXMLFrom checks for origvol->building:
if (origvol->building) { virReportError(VIR_ERR_OPERATION_INVALID, _("volume '%s' is still being allocated."), origvol->name); goto cleanup; }
While running libvirt on PowerPC, I saw an interesting scenario. The 'target.allocation' field seemed to change for a volume getting allocated, and this would lead to incomplete copy. This would happen at random intervals, not deterministically. While looking through the code, I found this to be the other place in code where the same field seemed to change without a lock. Hence the patch.
I have sent the second patch which fixes the erring code too :
- remain = vol->target.allocation; + remain = inputvol->target.capacity;
More fundamental question -- why do we offload the copying of non-raw images to qemu-img tool, but make libvirt responsible for copying raw volumes ? Would it not be better if libvirt called on 'qemu-img' to copy all types of volumes, including raw ones ? -- Prerna Saxena Linux Technology Centre, IBM Systems and Technology Lab, Bangalore, India

On Tue, May 05, 2015 at 03:24:31PM +0530, Prerna Saxena wrote:
On Tuesday 05 May 2015 03:20 PM, Prerna Saxena wrote:
On Tuesday 05 May 2015 01:52 PM, Ján Tomko wrote:
On Tue, May 05, 2015 at 08:43:21AM +0530, Prerna Saxena wrote:
Libvirt periodically calls 'stat' on all volumes in a storage pool, to update fields such as 'target.allocation'.
The operation doesnt make sense for a volume which is curently being allocated. From the comments in the storage driver, the point of allowing refresh for a volume that is currently being allocated is to track the progress of the allocation.
Also, the 'target.allocation' sub-field is taken into account while copying a raw image. To suppress any (potential) corruption, libvirt must not attempt to refresh a volume currently being built. What would be the corruption?
We do not allow using a volume that is currently building as a source for cloning the volume - storageVolCreateXMLFrom checks for origvol->building:
if (origvol->building) { virReportError(VIR_ERR_OPERATION_INVALID, _("volume '%s' is still being allocated."), origvol->name); goto cleanup; }
While running libvirt on PowerPC, I saw an interesting scenario. The 'target.allocation' field seemed to change for a volume getting allocated, and this would lead to incomplete copy. This would happen at random intervals, not deterministically. While looking through the code, I found this to be the other place in code where the same field seemed to change without a lock. Hence the patch.
Oh, I was thinking of the soure volume for some reason. We correctly lock the pool before calling refreshVol, so changing the object should not be an issue. I think the bug is in storageVolCreateXMLFrom - it drops all the locks, but expects the allocation not to change. In storageVolCreateXML we work around this by creating a shallow copy of the volume.
I have sent the second patch which fixes the erring code too :
- remain = vol->target.allocation; + remain = inputvol->target.capacity;
More fundamental question -- why do we offload the copying of non-raw images to qemu-img tool, but make libvirt responsible for copying raw volumes ? Would it not be better if libvirt called on 'qemu-img' to copy all types of volumes, including raw ones ?
This way, libvirt can create raw volumes even without qemu-img installed. I don't know if there's any other reason. Jan

On Tuesday 05 May 2015 04:22 PM, Ján Tomko wrote:
On Tue, May 05, 2015 at 03:24:31PM +0530, Prerna Saxena wrote:
On Tuesday 05 May 2015 03:20 PM, Prerna Saxena wrote:
On Tuesday 05 May 2015 01:52 PM, Ján Tomko wrote:
On Tue, May 05, 2015 at 08:43:21AM +0530, Prerna Saxena wrote:
Libvirt periodically calls 'stat' on all volumes in a storage pool, to update fields such as 'target.allocation'.
The operation doesnt make sense for a volume which is curently being allocated. From the comments in the storage driver, the point of allowing refresh for a volume that is currently being allocated is to track the progress of the allocation.
Also, the 'target.allocation' sub-field is taken into account while copying a raw image. To suppress any (potential) corruption, libvirt must not attempt to refresh a volume currently being built. What would be the corruption?
We do not allow using a volume that is currently building as a source for cloning the volume - storageVolCreateXMLFrom checks for origvol->building:
if (origvol->building) { virReportError(VIR_ERR_OPERATION_INVALID, _("volume '%s' is still being allocated."), origvol->name); goto cleanup; }
While running libvirt on PowerPC, I saw an interesting scenario. The 'target.allocation' field seemed to change for a volume getting allocated, and this would lead to incomplete copy. This would happen at random intervals, not deterministically. While looking through the code, I found this to be the other place in code where the same field seemed to change without a lock. Hence the patch.
Oh, I was thinking of the soure volume for some reason.
We correctly lock the pool before calling refreshVol, so changing the object should not be an issue. I think the bug is in storageVolCreateXMLFrom - it drops all the locks, but expects the allocation not to change.
Yes, and so I sent this patch that blocks a refresh for volumes being created.
In storageVolCreateXML we work around this by creating a shallow copy of the volume.
I have sent the second patch which fixes the erring code too :
- remain = vol->target.allocation; + remain = inputvol->target.capacity;
More fundamental question -- why do we offload the copying of non-raw images to qemu-img tool, but make libvirt responsible for copying raw volumes ? Would it not be better if libvirt called on 'qemu-img' to copy all types of volumes, including raw ones ?
This way, libvirt can create raw volumes even without qemu-img installed. I don't know if there's any other reason.
I'm sorry, libvirt does not copy raw volumes as a 'fallback mechanism'. Libvirt chooses to copy raw volumes on its own, but calls on qemu-img to copy all other volume types. Is it not better to let qemu-img copy all volume types , including raw ones ? I wanted to check if there are reasons for this decision ? I'll be happy if the community could throw some light. Also cc'ing Cole, who had put in some fixes in this area. Regards, -- Prerna Saxena Linux Technology Centre, IBM Systems and Technology Lab, Bangalore, India

On 05/05/2015 10:44 PM, Prerna Saxena wrote:
On Tuesday 05 May 2015 04:22 PM, Ján Tomko wrote:
On Tue, May 05, 2015 at 03:24:31PM +0530, Prerna Saxena wrote:
On Tuesday 05 May 2015 03:20 PM, Prerna Saxena wrote:
On Tuesday 05 May 2015 01:52 PM, Ján Tomko wrote:
On Tue, May 05, 2015 at 08:43:21AM +0530, Prerna Saxena wrote:
Libvirt periodically calls 'stat' on all volumes in a storage pool, to update fields such as 'target.allocation'.
The operation doesnt make sense for a volume which is curently being allocated. From the comments in the storage driver, the point of allowing refresh for a volume that is currently being allocated is to track the progress of the allocation.
Also, the 'target.allocation' sub-field is taken into account while copying a raw image. To suppress any (potential) corruption, libvirt must not attempt to refresh a volume currently being built. What would be the corruption?
We do not allow using a volume that is currently building as a source for cloning the volume - storageVolCreateXMLFrom checks for origvol->building:
if (origvol->building) { virReportError(VIR_ERR_OPERATION_INVALID, _("volume '%s' is still being allocated."), origvol->name); goto cleanup; }
While running libvirt on PowerPC, I saw an interesting scenario. The 'target.allocation' field seemed to change for a volume getting allocated, and this would lead to incomplete copy. This would happen at random intervals, not deterministically. While looking through the code, I found this to be the other place in code where the same field seemed to change without a lock. Hence the patch.
Oh, I was thinking of the soure volume for some reason.
We correctly lock the pool before calling refreshVol, so changing the object should not be an issue. I think the bug is in storageVolCreateXMLFrom - it drops all the locks, but expects the allocation not to change.
Yes, and so I sent this patch that blocks a refresh for volumes being created.
In storageVolCreateXML we work around this by creating a shallow copy of the volume.
I have sent the second patch which fixes the erring code too :
- remain = vol->target.allocation; + remain = inputvol->target.capacity;
More fundamental question -- why do we offload the copying of non-raw images to qemu-img tool, but make libvirt responsible for copying raw volumes ? Would it not be better if libvirt called on 'qemu-img' to copy all types of volumes, including raw ones ?
This way, libvirt can create raw volumes even without qemu-img installed. I don't know if there's any other reason.
I'm sorry, libvirt does not copy raw volumes as a 'fallback mechanism'. Libvirt chooses to copy raw volumes on its own, but calls on qemu-img to copy all other volume types. Is it not better to let qemu-img copy all volume types , including raw ones ? I wanted to check if there are reasons for this decision ? I'll be happy if the community could throw some light. Also cc'ing Cole, who had put in some fixes in this area.
My recollection is that like Jan says we implement raw volume creation ourselves so that we could work without qemu-img in that case, like if qemu-img isn't installed. And then raw cloning was kind of built upon the raw creation code. But IMO we should drop all that stuff and just use qemu-img unconditionally, it's ubiquitous at this point. - Cole

When virsh vol-clone is attempted on a raw file where capacity > allocation, the resulting cloned volume has a size that matches the virtual-size of the parent; in place of matching its actual, disk size. This patch fixes the cloned disk to have same _allocated_size_ as the parent file from which it was cloned. Reference: https://www.redhat.com/archives/libvir-list/2014-September/msg00064.html Also fixes : https://bugzilla.redhat.com/show_bug.cgi?id=1130739 Signed-off-by: Prerna Saxena <prerna@linux.vnet.ibm.com> --- src/storage/storage_backend.c | 2 +- src/storage/storage_driver.c | 5 ----- 2 files changed, 1 insertion(+), 6 deletions(-) diff --git a/src/storage/storage_backend.c b/src/storage/storage_backend.c index 355fc7f..1a7c0cc 100644 --- a/src/storage/storage_backend.c +++ b/src/storage/storage_backend.c @@ -429,7 +429,7 @@ createRawFile(int fd, virStorageVolDefPtr vol, } #endif - remain = vol->target.allocation; + remain = inputvol->target.capacity; if (inputvol) { /* allow zero blocks to be skipped if we've requested sparse diff --git a/src/storage/storage_driver.c b/src/storage/storage_driver.c index ac4a74a..c511035 100644 --- a/src/storage/storage_driver.c +++ b/src/storage/storage_driver.c @@ -1990,11 +1990,6 @@ storageVolCreateXMLFrom(virStoragePoolPtr obj, if (newvol->target.capacity < origvol->target.capacity) newvol->target.capacity = origvol->target.capacity; - /* Make sure allocation is at least as large as the destination cap, - * to make absolutely sure we copy all possible contents */ - if (newvol->target.allocation < origvol->target.capacity) - newvol->target.allocation = origvol->target.capacity; - if (!backend->buildVolFrom) { virReportError(VIR_ERR_NO_SUPPORT, "%s", _("storage pool does not support" -- 1.8.3.1 -- Prerna Saxena Linux Technology Centre, IBM Systems and Technology Lab, Bangalore, India

On Tue, May 05, 2015 at 08:47:56AM +0530, Prerna Saxena wrote:
When virsh vol-clone is attempted on a raw file where capacity > allocation, the resulting cloned volume has a size that matches the virtual-size of the parent; in place of matching its actual, disk size. This patch fixes the cloned disk to have same _allocated_size_ as the parent file from which it was cloned.
Reference: https://www.redhat.com/archives/libvir-list/2014-September/msg00064.html
Also fixes : https://bugzilla.redhat.com/show_bug.cgi?id=1130739
Signed-off-by: Prerna Saxena <prerna@linux.vnet.ibm.com> --- src/storage/storage_backend.c | 2 +- src/storage/storage_driver.c | 5 ----- 2 files changed, 1 insertion(+), 6 deletions(-)
diff --git a/src/storage/storage_backend.c b/src/storage/storage_backend.c index 355fc7f..1a7c0cc 100644 --- a/src/storage/storage_backend.c +++ b/src/storage/storage_backend.c @@ -429,7 +429,7 @@ createRawFile(int fd, virStorageVolDefPtr vol, } #endif
- remain = vol->target.allocation; + remain = inputvol->target.capacity;
If we're not cloning a volume, inputvol is NULL and this will crash.
if (inputvol) {
After moving the assignment here, we will copy all the data from the input volume, but we still use the original value of 'allocation' to fallocate the start of the file. So for an original 5GB file with 3 GB allocated (where + is an allocated gigabyte): +_+_+ We'll end up with a file that has 4 GB allocated: +++_+ (We'll allocate the first 3 GB, and copy any remaining data) I think we do not have to honor an allocation value less than the capacity of the original volume and just assume it means 'create a sparse file'. So I'd suggest: 1. setting the need_alloc to false if we're cloning a volume and the new allocation is less than the *original* capacity 2. making the fallocate call depend on need_alloc
/* allow zero blocks to be skipped if we've requested sparse
Jan
participants (3)
-
Cole Robinson
-
Ján Tomko
-
Prerna Saxena