[libvirt] Storage backend and RBD improvements

This series of patches is mainly focussed on improving the RBD storage pool backend. In the series it fixes a bug in the storage pool driver regarding volume wiping and it also adds the new flag to instruct storage pools to use TRIM for wiping volumes. Afterwards it adds this functionality to the RBD storage pool and in addition it also adds to ability to clone RBD volumes. The original series of patches were reviewed by John Ferlan and this series are revised based on his great feedback. Wido

Using virCheckFlags() we validate if the provided flags are support flags by this function. The old code would also still run the command since it didn't go to cleanup when a invalid flag was supplied. We now go to cleanup and exit if a invalid flag would be provided. Signed-off-by: Wido den Hollander <wido@widodh.nl> --- src/storage/storage_backend.c | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/src/storage/storage_backend.c b/src/storage/storage_backend.c index 03bc18c..15e9109 100644 --- a/src/storage/storage_backend.c +++ b/src/storage/storage_backend.c @@ -2056,7 +2056,16 @@ virStorageBackendVolWipeLocal(virConnectPtr conn ATTRIBUTE_UNUSED, struct stat st; virCommandPtr cmd = NULL; - virCheckFlags(0, -1); + virCheckFlags(VIR_STORAGE_VOL_WIPE_ALG_ZERO | + VIR_STORAGE_VOL_WIPE_ALG_NNSA | + VIR_STORAGE_VOL_WIPE_ALG_DOD | + VIR_STORAGE_VOL_WIPE_ALG_BSI | + VIR_STORAGE_VOL_WIPE_ALG_GUTMANN | + VIR_STORAGE_VOL_WIPE_ALG_SCHNEIER | + VIR_STORAGE_VOL_WIPE_ALG_PFITZNER7 | + VIR_STORAGE_VOL_WIPE_ALG_PFITZNER33 | + VIR_STORAGE_VOL_WIPE_ALG_RANDOM | + VIR_STORAGE_VOL_WIPE_ALG_LAST, -1); VIR_DEBUG("Wiping volume with path '%s' and algorithm %u", vol->target.path, algorithm); @@ -2078,7 +2087,7 @@ virStorageBackendVolWipeLocal(virConnectPtr conn ATTRIBUTE_UNUSED, if (algorithm != VIR_STORAGE_VOL_WIPE_ALG_ZERO) { const char *alg_char ATTRIBUTE_UNUSED = NULL; - switch (algorithm) { + switch ((virStorageVolWipeAlgorithm) algorithm) { case VIR_STORAGE_VOL_WIPE_ALG_NNSA: alg_char = "nnsa"; break; @@ -2107,6 +2116,7 @@ virStorageBackendVolWipeLocal(virConnectPtr conn ATTRIBUTE_UNUSED, virReportError(VIR_ERR_INVALID_ARG, _("unsupported algorithm %d"), algorithm); + goto cleanup; } cmd = virCommandNew(SCRUB); virCommandAddArgList(cmd, "-f", "-p", alg_char, -- 1.9.1

On 01/27/2016 05:20 AM, Wido den Hollander wrote:
Using virCheckFlags() we validate if the provided flags are support flags by this function.
The old code would also still run the command since it didn't go to cleanup when a invalid flag was supplied.
We now go to cleanup and exit if a invalid flag would be provided.
Signed-off-by: Wido den Hollander <wido@widodh.nl> --- src/storage/storage_backend.c | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-)
diff --git a/src/storage/storage_backend.c b/src/storage/storage_backend.c index 03bc18c..15e9109 100644 --- a/src/storage/storage_backend.c +++ b/src/storage/storage_backend.c @@ -2056,7 +2056,16 @@ virStorageBackendVolWipeLocal(virConnectPtr conn ATTRIBUTE_UNUSED, struct stat st; virCommandPtr cmd = NULL;
- virCheckFlags(0, -1); + virCheckFlags(VIR_STORAGE_VOL_WIPE_ALG_ZERO | + VIR_STORAGE_VOL_WIPE_ALG_NNSA | + VIR_STORAGE_VOL_WIPE_ALG_DOD | + VIR_STORAGE_VOL_WIPE_ALG_BSI | + VIR_STORAGE_VOL_WIPE_ALG_GUTMANN | + VIR_STORAGE_VOL_WIPE_ALG_SCHNEIER | + VIR_STORAGE_VOL_WIPE_ALG_PFITZNER7 | + VIR_STORAGE_VOL_WIPE_ALG_PFITZNER33 | + VIR_STORAGE_VOL_WIPE_ALG_RANDOM | + VIR_STORAGE_VOL_WIPE_ALG_LAST, -1);
Looks like my brain had a misfire - these are bits for 'algorithm' and not 'flags' <sigh>. I'll clean up, no need to resend.
VIR_DEBUG("Wiping volume with path '%s' and algorithm %u", vol->target.path, algorithm); @@ -2078,7 +2087,7 @@ virStorageBackendVolWipeLocal(virConnectPtr conn ATTRIBUTE_UNUSED,
if (algorithm != VIR_STORAGE_VOL_WIPE_ALG_ZERO) { const char *alg_char ATTRIBUTE_UNUSED = NULL; - switch (algorithm) { + switch ((virStorageVolWipeAlgorithm) algorithm) {
FYI: When you do this... Then the "default:" changes to VIR_STORAGE_VOL_WIPE_ALG_LAST and of course you'd need a VIR_STORAGE_VOL_WIPE_ALG_ZERO case. Because of that if construct I moved the switch outside the if, in order to add case ZERO (otherwise it's a dead code path from coverity) I cleaned it up already and pushed. Tks - John
case VIR_STORAGE_VOL_WIPE_ALG_NNSA: alg_char = "nnsa"; break; @@ -2107,6 +2116,7 @@ virStorageBackendVolWipeLocal(virConnectPtr conn ATTRIBUTE_UNUSED, virReportError(VIR_ERR_INVALID_ARG, _("unsupported algorithm %d"), algorithm); + goto cleanup; } cmd = virCommandNew(SCRUB); virCommandAddArgList(cmd, "-f", "-p", alg_char,

When wiping the RBD image will be filled with zeros started at offset 0 and until the end of the volume. This will result in the RBD volume growing to it's full allocation on the Ceph cluster. All data on the volume will be overwritten however, making it unavailable. It does NOT take any RBD snapshots into account. The original data might still be in a snapshot of that RBD volume. Signed-off-by: Wido den Hollander <wido@widodh.nl> --- src/storage/storage_backend_rbd.c | 115 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 115 insertions(+) diff --git a/src/storage/storage_backend_rbd.c b/src/storage/storage_backend_rbd.c index 8c7a80d..7e669ff 100644 --- a/src/storage/storage_backend_rbd.c +++ b/src/storage/storage_backend_rbd.c @@ -732,6 +732,120 @@ static int virStorageBackendRBDResizeVol(virConnectPtr conn ATTRIBUTE_UNUSED, return ret; } +static int +virStorageBackendRBDVolWipeZero(rbd_image_t image, + char *imgname, + rbd_image_info_t info, + uint64_t stripe_count) +{ + int r = -1; + int ret = -1; + uint64_t offset = 0; + uint64_t length; + char *writebuf; + + if (VIR_ALLOC_N(writebuf, info.obj_size * stripe_count) < 0) + goto cleanup; + + while (offset < info.size) { + length = MIN((info.size - offset), (info.obj_size * stripe_count)); + + if ((r = rbd_write(image, offset, length, writebuf)) < 0) { + virReportSystemError(-r, _("writing %zu bytes failed on " + "RBD image %s at offset %zu"), + length, imgname, offset); + goto cleanup; + } + + VIR_DEBUG("Wrote %zu bytes to RBD image %s at offset %zu", + length, imgname, offset); + + offset += length; + } + + ret = 0; + + cleanup: + VIR_FREE(writebuf); + + return ret; +} + +static int +virStorageBackendRBDVolWipe(virConnectPtr conn, + virStoragePoolObjPtr pool, + virStorageVolDefPtr vol, + unsigned int algorithm, + unsigned int flags) +{ + virStorageBackendRBDState ptr; + ptr.cluster = NULL; + ptr.ioctx = NULL; + rbd_image_t image = NULL; + rbd_image_info_t info; + uint64_t stripe_count; + int r = -1; + int ret = -1; + + virCheckFlags(VIR_STORAGE_VOL_WIPE_ALG_ZERO, -1); + + VIR_DEBUG("Wiping RBD image %s/%s", pool->def->source.name, vol->name); + + if (virStorageBackendRBDOpenRADOSConn(&ptr, conn, &pool->def->source) < 0) + goto cleanup; + + if (virStorageBackendRBDOpenIoCTX(&ptr, pool) < 0) + goto cleanup; + + if ((r = rbd_open(ptr.ioctx, vol->name, &image, NULL)) < 0) { + virReportSystemError(-r, _("failed to open the RBD image %s"), + vol->name); + goto cleanup; + } + + if ((r = rbd_stat(image, &info, sizeof(info))) < 0) { + virReportSystemError(-r, _("failed to stat the RBD image %s"), + vol->name); + goto cleanup; + } + + if ((r = rbd_get_stripe_count(image, &stripe_count)) < 0) { + virReportSystemError(-r, _("failed to get stripe count of RBD image %s"), + vol->name); + goto cleanup; + } + + VIR_DEBUG("Need to wipe %zu bytes from RBD image %s/%s", + info.size, pool->def->source.name, vol->name); + + switch ((virStorageVolWipeAlgorithm) algorithm) { + case VIR_STORAGE_VOL_WIPE_ALG_ZERO: + r = virStorageBackendRBDVolWipeZero(image, vol->name, + info, stripe_count); + break; + default: + virReportError(VIR_ERR_INVALID_ARG, _("unsupported algorithm %d"), + algorithm); + goto cleanup; + } + + if (r < 0) { + virReportSystemError(-r, _("failed to wipe RBD image %s"), + vol->name); + goto cleanup; + } + + ret = 0; + + cleanup: + if (image) + rbd_close(image); + + virStorageBackendRBDCloseRADOSConn(&ptr); + + return ret; +} + virStorageBackend virStorageBackendRBD = { .type = VIR_STORAGE_POOL_RBD, @@ -741,4 +855,5 @@ virStorageBackend virStorageBackendRBD = { .refreshVol = virStorageBackendRBDRefreshVol, .deleteVol = virStorageBackendRBDDeleteVol, .resizeVol = virStorageBackendRBDResizeVol, + .wipeVol = virStorageBackendRBDVolWipe }; -- 1.9.1

On 01/27/2016 05:20 AM, Wido den Hollander wrote:
When wiping the RBD image will be filled with zeros started at offset 0 and until the end of the volume.
This will result in the RBD volume growing to it's full allocation on the Ceph cluster. All data on the volume will be overwritten however, making it unavailable.
It does NOT take any RBD snapshots into account. The original data might still be in a snapshot of that RBD volume.
Signed-off-by: Wido den Hollander <wido@widodh.nl> --- src/storage/storage_backend_rbd.c | 115 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 115 insertions(+)
diff --git a/src/storage/storage_backend_rbd.c b/src/storage/storage_backend_rbd.c index 8c7a80d..7e669ff 100644 --- a/src/storage/storage_backend_rbd.c +++ b/src/storage/storage_backend_rbd.c @@ -732,6 +732,120 @@ static int virStorageBackendRBDResizeVol(virConnectPtr conn ATTRIBUTE_UNUSED, return ret; }
+static int +virStorageBackendRBDVolWipeZero(rbd_image_t image, + char *imgname, + rbd_image_info_t info,
^^ I changed this to a pointer (*info) in order to avoid Coverity whines about pass_by_value... Had to obviously change the info. to be info-> and the call to pass by reference (&info).
+ uint64_t stripe_count) +{ + int r = -1; + int ret = -1; + uint64_t offset = 0; + uint64_t length; + char *writebuf; + + if (VIR_ALLOC_N(writebuf, info.obj_size * stripe_count) < 0) + goto cleanup; + + while (offset < info.size) { + length = MIN((info.size - offset), (info.obj_size * stripe_count)); + + if ((r = rbd_write(image, offset, length, writebuf)) < 0) { + virReportSystemError(-r, _("writing %zu bytes failed on " + "RBD image %s at offset %zu"), + length, imgname, offset); + goto cleanup; + } + + VIR_DEBUG("Wrote %zu bytes to RBD image %s at offset %zu", + length, imgname, offset); + + offset += length; + } + + ret = 0; + + cleanup: + VIR_FREE(writebuf); + + return ret; +} + +static int +virStorageBackendRBDVolWipe(virConnectPtr conn, + virStoragePoolObjPtr pool, + virStorageVolDefPtr vol, + unsigned int algorithm, + unsigned int flags) +{ + virStorageBackendRBDState ptr; + ptr.cluster = NULL; + ptr.ioctx = NULL; + rbd_image_t image = NULL; + rbd_image_info_t info; + uint64_t stripe_count; + int r = -1; + int ret = -1; + + virCheckFlags(VIR_STORAGE_VOL_WIPE_ALG_ZERO, -1);
Like 1/5... My bad... I'll clean up... It'll be (0, -1) since the flags argument isn't defined. <sigh>
+ + VIR_DEBUG("Wiping RBD image %s/%s", pool->def->source.name, vol->name); + + if (virStorageBackendRBDOpenRADOSConn(&ptr, conn, &pool->def->source) < 0) + goto cleanup; + + if (virStorageBackendRBDOpenIoCTX(&ptr, pool) < 0) + goto cleanup; + + if ((r = rbd_open(ptr.ioctx, vol->name, &image, NULL)) < 0) { + virReportSystemError(-r, _("failed to open the RBD image %s"), + vol->name); + goto cleanup; + } + + if ((r = rbd_stat(image, &info, sizeof(info))) < 0) { + virReportSystemError(-r, _("failed to stat the RBD image %s"), + vol->name); + goto cleanup; + } + + if ((r = rbd_get_stripe_count(image, &stripe_count)) < 0) { + virReportSystemError(-r, _("failed to get stripe count of RBD image %s"), + vol->name); + goto cleanup; + } + + VIR_DEBUG("Need to wipe %zu bytes from RBD image %s/%s", + info.size, pool->def->source.name, vol->name); + + switch ((virStorageVolWipeAlgorithm) algorithm) { + case VIR_STORAGE_VOL_WIPE_ALG_ZERO: + r = virStorageBackendRBDVolWipeZero(image, vol->name, + info, stripe_count); + break; + default:
Similar to 1/5 this needs to be fully populated... I'll clean up and push. Tks - John
+ virReportError(VIR_ERR_INVALID_ARG, _("unsupported algorithm %d"), + algorithm); + goto cleanup; + } + + if (r < 0) { + virReportSystemError(-r, _("failed to wipe RBD image %s"), + vol->name); + goto cleanup; + } + + ret = 0; + + cleanup: + if (image) + rbd_close(image); + + virStorageBackendRBDCloseRADOSConn(&ptr); + + return ret; +} + virStorageBackend virStorageBackendRBD = { .type = VIR_STORAGE_POOL_RBD,
@@ -741,4 +855,5 @@ virStorageBackend virStorageBackendRBD = { .refreshVol = virStorageBackendRBDRefreshVol, .deleteVol = virStorageBackendRBDDeleteVol, .resizeVol = virStorageBackendRBDResizeVol, + .wipeVol = virStorageBackendRBDVolWipe };

On 01/27/2016 03:20 AM, Wido den Hollander wrote:
When wiping the RBD image will be filled with zeros started at offset 0 and until the end of the volume.
This will result in the RBD volume growing to it's full allocation on the Ceph cluster. All data on the volume will be overwritten however, making it unavailable.
It does NOT take any RBD snapshots into account. The original data might still be in a snapshot of that RBD volume.
Signed-off-by: Wido den Hollander <wido@widodh.nl> --- src/storage/storage_backend_rbd.c | 115 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 115 insertions(+)
+ + VIR_DEBUG("Need to wipe %zu bytes from RBD image %s/%s", + info.size, pool->def->source.name, vol->name);
This statement, and others like it, breaks the build on 32-bit architectures, since info.size is uint64_t but %zu is only 32-bits: ../../src/storage/storage_backend_rbd.c: In function 'virStorageBackendRBDVolWipe': ../../src/storage/storage_backend_rbd.c:1281:15: error: format '%zu' expects argument of type 'size_t', but argument 8 has type 'uint64_t {aka long long unsigned int}' [-Werror=format=] VIR_DEBUG("Need to wipe %zu bytes from RBD image %s/%s", ^ ../../src/util/virlog.h:90:73: note: in definition of macro 'VIR_DEBUG_INT' virLogMessage(src, VIR_LOG_DEBUG, filename, linenr, funcname, NULL, __VA_ARGS__) ^ ../../src/storage/storage_backend_rbd.c:1281:5: note: in expansion of macro 'VIR_DEBUG' VIR_DEBUG("Need to wipe %zu bytes from RBD image %s/%s", ^ I'm preparing an obvious patch, but was a bit surprised that we don't have a 32-bit build-bot to have caught it sooner. -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

This new algorithm adds support for wiping volumes using TRIM. It does not overwrite all the data in a volume, but it tells the backing storage pool/driver that all bytes in a volume can be discarded. It depends on the backing storage pool how this is handled. A SCSI backend might send UNMAP commands to remove all data present on a LUN. A Ceph backend might use rbd_discard() to instruct the Ceph cluster that all data on that RBD volume can be discarded. Signed-off-by: Wido den Hollander <wido@widodh.nl> --- include/libvirt/libvirt-storage.h | 3 +++ src/storage/storage_backend.c | 4 ++++ tools/virsh-volume.c | 2 +- tools/virsh.pod | 1 + 4 files changed, 9 insertions(+), 1 deletion(-) diff --git a/include/libvirt/libvirt-storage.h b/include/libvirt/libvirt-storage.h index 2c55c93..232b4d3 100644 --- a/include/libvirt/libvirt-storage.h +++ b/include/libvirt/libvirt-storage.h @@ -153,6 +153,9 @@ typedef enum { VIR_STORAGE_VOL_WIPE_ALG_RANDOM = 8, /* 1-pass random */ + VIR_STORAGE_VOL_WIPE_ALG_TRIM = 9, /* 1-pass, trim all data on the + volume by using TRIM or DISCARD */ + # ifdef VIR_ENUM_SENTINELS VIR_STORAGE_VOL_WIPE_ALG_LAST /* diff --git a/src/storage/storage_backend.c b/src/storage/storage_backend.c index 15e9109..1bb44a7 100644 --- a/src/storage/storage_backend.c +++ b/src/storage/storage_backend.c @@ -2065,6 +2065,7 @@ virStorageBackendVolWipeLocal(virConnectPtr conn ATTRIBUTE_UNUSED, VIR_STORAGE_VOL_WIPE_ALG_PFITZNER7 | VIR_STORAGE_VOL_WIPE_ALG_PFITZNER33 | VIR_STORAGE_VOL_WIPE_ALG_RANDOM | + VIR_STORAGE_VOL_WIPE_ALG_TRIM | VIR_STORAGE_VOL_WIPE_ALG_LAST, -1); VIR_DEBUG("Wiping volume with path '%s' and algorithm %u", @@ -2112,6 +2113,9 @@ virStorageBackendVolWipeLocal(virConnectPtr conn ATTRIBUTE_UNUSED, case VIR_STORAGE_VOL_WIPE_ALG_RANDOM: alg_char = "random"; break; + case VIR_STORAGE_VOL_WIPE_ALG_TRIM: + alg_char = "trim"; + break; default: virReportError(VIR_ERR_INVALID_ARG, _("unsupported algorithm %d"), diff --git a/tools/virsh-volume.c b/tools/virsh-volume.c index 661c876..35f0cbd 100644 --- a/tools/virsh-volume.c +++ b/tools/virsh-volume.c @@ -906,7 +906,7 @@ static const vshCmdOptDef opts_vol_wipe[] = { VIR_ENUM_DECL(virStorageVolWipeAlgorithm) VIR_ENUM_IMPL(virStorageVolWipeAlgorithm, VIR_STORAGE_VOL_WIPE_ALG_LAST, "zero", "nnsa", "dod", "bsi", "gutmann", "schneier", - "pfitzner7", "pfitzner33", "random"); + "pfitzner7", "pfitzner33", "random", "trim"); static bool cmdVolWipe(vshControl *ctl, const vshCmd *cmd) diff --git a/tools/virsh.pod b/tools/virsh.pod index e830c59..b259507 100644 --- a/tools/virsh.pod +++ b/tools/virsh.pod @@ -3546,6 +3546,7 @@ B<Supported algorithms> pfitzner7 - Roy Pfitzner's 7-random-pass method: random x7. pfitzner33 - Roy Pfitzner's 33-random-pass method: random x33. random - 1-pass pattern: random. + trim - 1-pass trimming the volume using TRIM or DISCARD B<Note>: The availability of algorithms may be limited by the version of the C<scrub> binary installed on the host. -- 1.9.1

On 01/27/2016 05:20 AM, Wido den Hollander wrote:
This new algorithm adds support for wiping volumes using TRIM.
It does not overwrite all the data in a volume, but it tells the backing storage pool/driver that all bytes in a volume can be discarded.
It depends on the backing storage pool how this is handled.
A SCSI backend might send UNMAP commands to remove all data present on a LUN.
A Ceph backend might use rbd_discard() to instruct the Ceph cluster that all data on that RBD volume can be discarded.
Signed-off-by: Wido den Hollander <wido@widodh.nl> --- include/libvirt/libvirt-storage.h | 3 +++ src/storage/storage_backend.c | 4 ++++ tools/virsh-volume.c | 2 +- tools/virsh.pod | 1 + 4 files changed, 9 insertions(+), 1 deletion(-)
diff --git a/include/libvirt/libvirt-storage.h b/include/libvirt/libvirt-storage.h index 2c55c93..232b4d3 100644 --- a/include/libvirt/libvirt-storage.h +++ b/include/libvirt/libvirt-storage.h @@ -153,6 +153,9 @@ typedef enum {
VIR_STORAGE_VOL_WIPE_ALG_RANDOM = 8, /* 1-pass random */
+ VIR_STORAGE_VOL_WIPE_ALG_TRIM = 9, /* 1-pass, trim all data on the + volume by using TRIM or DISCARD */ + # ifdef VIR_ENUM_SENTINELS VIR_STORAGE_VOL_WIPE_ALG_LAST /* diff --git a/src/storage/storage_backend.c b/src/storage/storage_backend.c index 15e9109..1bb44a7 100644 --- a/src/storage/storage_backend.c +++ b/src/storage/storage_backend.c @@ -2065,6 +2065,7 @@ virStorageBackendVolWipeLocal(virConnectPtr conn ATTRIBUTE_UNUSED, VIR_STORAGE_VOL_WIPE_ALG_PFITZNER7 | VIR_STORAGE_VOL_WIPE_ALG_PFITZNER33 | VIR_STORAGE_VOL_WIPE_ALG_RANDOM | + VIR_STORAGE_VOL_WIPE_ALG_TRIM | VIR_STORAGE_VOL_WIPE_ALG_LAST, -1);
VIR_DEBUG("Wiping volume with path '%s' and algorithm %u", @@ -2112,6 +2113,9 @@ virStorageBackendVolWipeLocal(virConnectPtr conn ATTRIBUTE_UNUSED, case VIR_STORAGE_VOL_WIPE_ALG_RANDOM: alg_char = "random"; break; + case VIR_STORAGE_VOL_WIPE_ALG_TRIM: + alg_char = "trim"; + break;
This would be bad... Passing "trim" to the SCRUB image would cause a failure... I've replaced with: case VIR_STORAGE_VOL_WIPE_ALG_TRIM: virReportError(VIR_ERR_ARGUMENT_UNSUPPORTED, "%s", _("'trim' algorithm not supported")); goto cleanup; Since at this point none of the drivers that utilize this function supports the 'trim' option (in fact none do at this point).
default: virReportError(VIR_ERR_INVALID_ARG, _("unsupported algorithm %d"), diff --git a/tools/virsh-volume.c b/tools/virsh-volume.c index 661c876..35f0cbd 100644 --- a/tools/virsh-volume.c +++ b/tools/virsh-volume.c @@ -906,7 +906,7 @@ static const vshCmdOptDef opts_vol_wipe[] = { VIR_ENUM_DECL(virStorageVolWipeAlgorithm) VIR_ENUM_IMPL(virStorageVolWipeAlgorithm, VIR_STORAGE_VOL_WIPE_ALG_LAST, "zero", "nnsa", "dod", "bsi", "gutmann", "schneier", - "pfitzner7", "pfitzner33", "random"); + "pfitzner7", "pfitzner33", "random", "trim");
static bool cmdVolWipe(vshControl *ctl, const vshCmd *cmd) diff --git a/tools/virsh.pod b/tools/virsh.pod index e830c59..b259507 100644 --- a/tools/virsh.pod +++ b/tools/virsh.pod @@ -3546,6 +3546,7 @@ B<Supported algorithms> pfitzner7 - Roy Pfitzner's 7-random-pass method: random x7. pfitzner33 - Roy Pfitzner's 33-random-pass method: random x33. random - 1-pass pattern: random. + trim - 1-pass trimming the volume using TRIM or DISCARD
B<Note>: The availability of algorithms may be limited by the version of the C<scrub> binary installed on the host.
I adjusted the <Note> text to be: B<Note>: The C<scrub> binary will be used to handle the 'nnsa', 'dod', 'bsi', 'gutmann', 'schneier', 'pfitzner7' and 'pfitzner33' algorithms. The availability of the algorithms may be limited by the version of the C<scrub> binary installed on the host. The 'zero' algorithm will write zeroes to the entire volume. For some volumes, such as sparse or rbd volumes, this may result in completely filling the volume with zeroes making it appear to be completely full. As an alternative, the 'trim' algorithm does not overwrite all the data in a volume, rather it expects the storage driver to be able to discard all bytes in a volume. It is up to the storage driver to handle how the discarding occurs. Not all storage drivers or volume types can support 'trim'. Also, since I fixed the switch in 2/5 for storage_backend_rbd.c to not use 'default:' - I added a temporary case for TRIM there (which gets resolved in the next patch). Tks - This has been pushed. John

On Wed, Jan 27, 2016 at 11:20:07AM +0100, Wido den Hollander wrote:
This new algorithm adds support for wiping volumes using TRIM.
It does not overwrite all the data in a volume, but it tells the backing storage pool/driver that all bytes in a volume can be discarded.
It depends on the backing storage pool how this is handled.
A SCSI backend might send UNMAP commands to remove all data present on a LUN.
A Ceph backend might use rbd_discard() to instruct the Ceph cluster that all data on that RBD volume can be discarded.
Signed-off-by: Wido den Hollander <wido@widodh.nl> --- include/libvirt/libvirt-storage.h | 3 +++ src/storage/storage_backend.c | 4 ++++ tools/virsh-volume.c | 2 +- tools/virsh.pod | 1 + 4 files changed, 9 insertions(+), 1 deletion(-)
diff --git a/include/libvirt/libvirt-storage.h b/include/libvirt/libvirt-storage.h index 2c55c93..232b4d3 100644 --- a/include/libvirt/libvirt-storage.h +++ b/include/libvirt/libvirt-storage.h @@ -153,6 +153,9 @@ typedef enum {
VIR_STORAGE_VOL_WIPE_ALG_RANDOM = 8, /* 1-pass random */
+ VIR_STORAGE_VOL_WIPE_ALG_TRIM = 9, /* 1-pass, trim all data on the + volume by using TRIM or DISCARD */ +
Do we need a separate flag for that? VIR_STORAGE_VOL_WIPE_ALG_ZERO already has a similar behavior for sparse files, where we truncate the file size to 0, then back to the original size. We could treat it as 'rbd_discard' in the rbd driver, with a fallback to writing zeroes if that's not supported. Jan

On Mon, Feb 01, 2016 at 01:12:13PM +0100, Ján Tomko wrote:
On Wed, Jan 27, 2016 at 11:20:07AM +0100, Wido den Hollander wrote:
This new algorithm adds support for wiping volumes using TRIM.
It does not overwrite all the data in a volume, but it tells the backing storage pool/driver that all bytes in a volume can be discarded.
It depends on the backing storage pool how this is handled.
A SCSI backend might send UNMAP commands to remove all data present on a LUN.
A Ceph backend might use rbd_discard() to instruct the Ceph cluster that all data on that RBD volume can be discarded.
Signed-off-by: Wido den Hollander <wido@widodh.nl> --- include/libvirt/libvirt-storage.h | 3 +++ src/storage/storage_backend.c | 4 ++++ tools/virsh-volume.c | 2 +- tools/virsh.pod | 1 + 4 files changed, 9 insertions(+), 1 deletion(-)
diff --git a/include/libvirt/libvirt-storage.h b/include/libvirt/libvirt-storage.h index 2c55c93..232b4d3 100644 --- a/include/libvirt/libvirt-storage.h +++ b/include/libvirt/libvirt-storage.h @@ -153,6 +153,9 @@ typedef enum {
VIR_STORAGE_VOL_WIPE_ALG_RANDOM = 8, /* 1-pass random */
+ VIR_STORAGE_VOL_WIPE_ALG_TRIM = 9, /* 1-pass, trim all data on the + volume by using TRIM or DISCARD */ +
Do we need a separate flag for that?
VIR_STORAGE_VOL_WIPE_ALG_ZERO already has a similar behavior for sparse files, where we truncate the file size to 0, then back to the original size.
We did not want to use ALG_ZERO, because that flag should not have a side effect on the underlying physical storage allocation of the file. We added the ALG_TRIM variant as one which explicitly does trimming of the underlying storage. Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

Using VIR_STORAGE_VOL_WIPE_ALG_TRIM a RBD volume can be trimmed down to 0 bytes using rbd_dicard() Effectively all the data on the volume will be lost/gone, but the volume remains available for use afterwards. Starting at offset 0 the storage pool will call rbd_discard() in stripe size * count increments which is usually 4MB. Stripe size being 4MB and count 1. rbd_discard() is available since Ceph version Dumpling (0.67) which dates back to August 2013. Signed-off-by: Wido den Hollander <wido@widodh.nl> --- src/storage/storage_backend_rbd.c | 42 ++++++++++++++++++++++++++++++++++++++- 1 file changed, 41 insertions(+), 1 deletion(-) diff --git a/src/storage/storage_backend_rbd.c b/src/storage/storage_backend_rbd.c index 7e669ff..8a95388 100644 --- a/src/storage/storage_backend_rbd.c +++ b/src/storage/storage_backend_rbd.c @@ -772,6 +772,41 @@ virStorageBackendRBDVolWipeZero(rbd_image_t image, } static int +virStorageBackendRBDVolWipeDiscard(rbd_image_t image, + char *imgname, + rbd_image_info_t info, + uint64_t stripe_count) +{ + int r = -1; + int ret = -1; + uint64_t offset = 0; + uint64_t length; + + VIR_DEBUG("Wiping RBD %s volume using discard)", imgname); + + while (offset < info.size) { + length = MIN((info.size - offset), (info.obj_size * stripe_count)); + + if ((r = rbd_discard(image, offset, length)) < 0) { + virReportSystemError(-r, _("discarding %zu bytes failed on " + "RBD image %s at offset %zu"), + length, imgname, offset); + goto cleanup; + } + + VIR_DEBUG("Discarded %zu bytes of RBD image %s at offset %zu", + length, imgname, offset); + + offset += length; + } + + ret = 0; + + cleanup: + return ret; +} + +static int virStorageBackendRBDVolWipe(virConnectPtr conn, virStoragePoolObjPtr pool, virStorageVolDefPtr vol, @@ -787,7 +822,8 @@ virStorageBackendRBDVolWipe(virConnectPtr conn, int r = -1; int ret = -1; - virCheckFlags(VIR_STORAGE_VOL_WIPE_ALG_ZERO, -1); + virCheckFlags(VIR_STORAGE_VOL_WIPE_ALG_ZERO | + VIR_STORAGE_VOL_WIPE_ALG_TRIM, -1); VIR_DEBUG("Wiping RBD image %s/%s", pool->def->source.name, vol->name); @@ -823,6 +859,10 @@ virStorageBackendRBDVolWipe(virConnectPtr conn, r = virStorageBackendRBDVolWipeZero(image, vol->name, info, stripe_count); break; + case VIR_STORAGE_VOL_WIPE_ALG_TRIM: + r = virStorageBackendRBDVolWipeDiscard(image, vol->name, + info, stripe_count); + break; default: virReportError(VIR_ERR_INVALID_ARG, _("unsupported algorithm %d"), algorithm); -- 1.9.1

On 01/27/2016 05:20 AM, Wido den Hollander wrote:
Using VIR_STORAGE_VOL_WIPE_ALG_TRIM a RBD volume can be trimmed down to 0 bytes using rbd_dicard()
s/dicard/discard
Effectively all the data on the volume will be lost/gone, but the volume remains available for use afterwards.
Starting at offset 0 the storage pool will call rbd_discard() in stripe size * count increments which is usually 4MB. Stripe size being 4MB and count 1.
rbd_discard() is available since Ceph version Dumpling (0.67) which dates back to August 2013.
Signed-off-by: Wido den Hollander <wido@widodh.nl> --- src/storage/storage_backend_rbd.c | 42 ++++++++++++++++++++++++++++++++++++++- 1 file changed, 41 insertions(+), 1 deletion(-)
diff --git a/src/storage/storage_backend_rbd.c b/src/storage/storage_backend_rbd.c index 7e669ff..8a95388 100644 --- a/src/storage/storage_backend_rbd.c +++ b/src/storage/storage_backend_rbd.c @@ -772,6 +772,41 @@ virStorageBackendRBDVolWipeZero(rbd_image_t image, }
static int +virStorageBackendRBDVolWipeDiscard(rbd_image_t image, + char *imgname, + rbd_image_info_t info,
Like WipeZero, I adjusted to be *info to silence coverity. Other adjustments for info-> and &info were similarly made and pushed. Tks - John
+ uint64_t stripe_count) +{ + int r = -1; + int ret = -1; + uint64_t offset = 0; + uint64_t length; + + VIR_DEBUG("Wiping RBD %s volume using discard)", imgname); + + while (offset < info.size) { + length = MIN((info.size - offset), (info.obj_size * stripe_count)); + + if ((r = rbd_discard(image, offset, length)) < 0) { + virReportSystemError(-r, _("discarding %zu bytes failed on " + "RBD image %s at offset %zu"), + length, imgname, offset); + goto cleanup; + } + + VIR_DEBUG("Discarded %zu bytes of RBD image %s at offset %zu", + length, imgname, offset); + + offset += length; + } + + ret = 0; + + cleanup: + return ret; +} + +static int virStorageBackendRBDVolWipe(virConnectPtr conn, virStoragePoolObjPtr pool, virStorageVolDefPtr vol, @@ -787,7 +822,8 @@ virStorageBackendRBDVolWipe(virConnectPtr conn, int r = -1; int ret = -1;
- virCheckFlags(VIR_STORAGE_VOL_WIPE_ALG_ZERO, -1); + virCheckFlags(VIR_STORAGE_VOL_WIPE_ALG_ZERO | + VIR_STORAGE_VOL_WIPE_ALG_TRIM, -1);
VIR_DEBUG("Wiping RBD image %s/%s", pool->def->source.name, vol->name);
@@ -823,6 +859,10 @@ virStorageBackendRBDVolWipe(virConnectPtr conn, r = virStorageBackendRBDVolWipeZero(image, vol->name, info, stripe_count); break; + case VIR_STORAGE_VOL_WIPE_ALG_TRIM: + r = virStorageBackendRBDVolWipeDiscard(image, vol->name, + info, stripe_count); + break; default: virReportError(VIR_ERR_INVALID_ARG, _("unsupported algorithm %d"), algorithm);

RBD supports cloning by creating a snapshot, protecting it and create a child image based on that snapshot afterwards. The RBD storage driver will try to find a snapshot with zero deltas between the current state of the original volume and the snapshot. If such a snapshot is found a clone/child image will be created using the rbd_clone2() function from librbd. rbd_clone2() is available in librbd since Ceph version Dumpling (0.67) which dates back to August 2013. It will use the same features, strip size and stripe count as the parent image. This implementation will only create a single snapshot on the parent image if never changes. This reduces the amount of snapshots created for that RBD image which benefits the performance of the Ceph cluster. During build the decision will be made to use either rbd_diff_iterate() or rbd_diff_iterate2(). The latter is faster, but only available on Ceph versions after 0.94 (Hammer). Cloning is only supported if RBD format 2 is used. All images created by libvirt are already format 2. If a RBD format 1 image is used as the original volume the backend will report a VIR_ERR_OPERATION_UNSUPPORTED error. Signed-off-by: Wido den Hollander <wido@widodh.nl> --- src/storage/storage_backend_rbd.c | 341 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 341 insertions(+) diff --git a/src/storage/storage_backend_rbd.c b/src/storage/storage_backend_rbd.c index 8a95388..1fe6667 100644 --- a/src/storage/storage_backend_rbd.c +++ b/src/storage/storage_backend_rbd.c @@ -32,6 +32,7 @@ #include "base64.h" #include "viruuid.h" #include "virstring.h" +#include "virrandom.h" #include "rados/librados.h" #include "rbd/librbd.h" @@ -663,6 +664,345 @@ virStorageBackendRBDBuildVol(virConnectPtr conn, return ret; } +static int +virStorageBackendRBDImageInfo(rbd_image_t image, + char *volname, + uint64_t *features, + uint64_t *stripe_unit, + uint64_t *stripe_count) +{ + int ret = -1; + int r = 0; + uint8_t oldformat; + + if ((r = rbd_get_old_format(image, &oldformat)) < 0) { + virReportSystemError(-r, _("failed to get the format of RBD image %s"), + volname); + goto cleanup; + } + + if (oldformat != 0) { + virReportError(VIR_ERR_OPERATION_UNSUPPORTED, + _("RBD image %s is old format. Does not support " + "extended features and striping"), + volname); + goto cleanup; + } + + if ((r = rbd_get_features(image, features)) < 0) { + virReportSystemError(-r, _("failed to get the features of RBD image %s"), + volname); + goto cleanup; + } + + if ((r = rbd_get_stripe_unit(image, stripe_unit)) < 0) { + virReportSystemError(-r, _("failed to get the stripe unit of RBD image %s"), + volname); + goto cleanup; + } + + if ((r = rbd_get_stripe_count(image, stripe_count)) < 0) { + virReportSystemError(-r, _("failed to get the stripe count of RBD image %s"), + volname); + goto cleanup; + } + + ret = 0; + + cleanup: + return ret; +} + +/* Callback function for rbd_diff_iterate() */ +static int +virStorageBackendRBDIterateCb(uint64_t offset ATTRIBUTE_UNUSED, + size_t length ATTRIBUTE_UNUSED, + int exists ATTRIBUTE_UNUSED, + void *arg) +{ + /* + * Just set that there is a diff for this snapshot, we do not care where + * + * When it returns a negative number the rbd_diff_iterate() function will stop + * + * That's why we return -1, meaning that there is a difference and we can stop + * searching any further. + */ + *(int*) arg = 1; + return -1; +} + +static int +virStorageBackendRBDSnapshotFindNoDiff(rbd_image_t image, + char *imgname, + virBufferPtr snapname) +{ + int r = -1; + int ret = -1; + int snap_count; + int max_snaps = 128; + size_t i; + int diff; + rbd_snap_info_t *snaps = NULL; + rbd_image_info_t info; + + if ((r = rbd_stat(image, &info, sizeof(info))) < 0) { + virReportSystemError(-r, _("failed to stat the RBD image %s"), + imgname); + goto cleanup; + } + + do { + if (VIR_ALLOC_N(snaps, max_snaps)) + goto cleanup; + + snap_count = rbd_snap_list(image, snaps, &max_snaps); + if (snap_count <= 0) + VIR_FREE(snaps); + + } while (snap_count == -ERANGE); + + if (snap_count <= 0) { + if (snap_count == 0) + ret = 0; + goto cleanup; + } + + VIR_DEBUG("Found %d snapshots for RBD image %s", snap_count, imgname); + + for (i = 0; i < snap_count; i++) { + VIR_DEBUG("Querying diff for RBD snapshot %s@%s", imgname, + snaps[i].name); + + /* The callback will set diff to non-zero if there is a diff */ + diff = 0; + +/* + * rbd_diff_iterate2() is available in versions above Ceph 0.94 (Hammer) + * It uses a object map inside Ceph which is faster than rbd_diff_iterate() + * which iterates all objects. + */ +#if LIBRBD_VERSION_CODE > 266 + r = rbd_diff_iterate2(image, snaps[i].name, 0, info.size, 0, 1, + virStorageBackendRBDIterateCb, (void *)&diff); +#else + r = rbd_diff_iterate(image, snaps[i].name, 0, info.size, + virStorageBackendRBDIterateCb, (void *)&diff); +#endif + + if (r < 0) { + virReportSystemError(-r, _("failed to iterate RBD snapshot %s@%s"), + imgname, snaps[i].name); + goto cleanup; + } + + /* If diff is still set to zero we found a snapshot without deltas */ + if (diff == 0) { + VIR_DEBUG("RBD snapshot %s@%s has no delta", imgname, + snaps[i].name); + virBufferAsprintf(snapname, "%s", snaps[i].name); + ret = 0; + goto cleanup; + } + + VIR_DEBUG("RBD snapshot %s@%s has deltas. Continuing search.", + imgname, snaps[i].name); + } + + ret = 0; + + cleanup: + if (snaps) + rbd_snap_list_end(snaps); + + VIR_FREE(snaps); + + return ret; +} + +static int +virStorageBackendRBDSnapshotCreate(rbd_image_t image, + char *imgname, + char *snapname) +{ + int ret = -1; + int r = -1; + + VIR_DEBUG("Creating RBD snapshot %s@%s", imgname, snapname); + + if ((r = rbd_snap_create(image, snapname)) < 0) { + virReportSystemError(-r, _("failed to create RBD snapshot %s@%s"), + imgname, snapname); + goto cleanup; + } + + ret = 0; + + cleanup: + return ret; +} + +static int +virStorageBackendRBDSnapshotProtect(rbd_image_t image, + char *imgname, + char *snapname) +{ + int r = -1; + int ret = -1; + int protected; + + VIR_DEBUG("Querying if RBD snapshot %s@%s is protected", imgname, snapname); + + if ((r = rbd_snap_is_protected(image, snapname, &protected)) < 0) { + virReportSystemError(-r, _("failed verify if RBD snapshot %s@%s " + "is protected"), imgname, snapname); + goto cleanup; + } + + if (protected == 0) { + VIR_DEBUG("RBD Snapshot %s@%s is not protected, protecting", + imgname, snapname); + + if ((r = rbd_snap_protect(image, snapname)) < 0) { + virReportSystemError(-r, _("failed protect RBD snapshot %s@%s"), + imgname, snapname); + goto cleanup; + } + } else { + VIR_DEBUG("RBD Snapshot %s@%s is already protected", imgname, snapname); + } + + ret = 0; + + cleanup: + return ret; +} + +static int +virStorageBackendRBDCloneImage(rados_ioctx_t io, + char *origvol, + char *newvol) +{ + int r = -1; + int ret = -1; + int order = 0; + uint64_t features; + uint64_t stripe_count; + uint64_t stripe_unit; + virBuffer snapname = VIR_BUFFER_INITIALIZER; + char *snapname_buff = NULL; + rbd_image_t image = NULL; + + if ((r = rbd_open(io, origvol, &image, NULL)) < 0) { + virReportSystemError(-r, _("failed to open the RBD image %s"), + origvol); + goto cleanup; + } + + if ((virStorageBackendRBDImageInfo(image, origvol, &features, &stripe_unit, + &stripe_count)) < 0) + goto cleanup; + + /* + * First we attempt to find a snapshot which has no differences between + * the current state of the RBD image. + * + * This prevents us from creating a new snapshot for every clone operation + * while it could be that the original volume has not changed + */ + if (virStorageBackendRBDSnapshotFindNoDiff(image, origvol, &snapname) < 0) + goto cleanup; + + /* + * the virBuffer snapname will contain a snapshot's name if one without + * deltas has been found. + * + * If it's NULL we have to create a new snapshot and clone from there + */ + snapname_buff = virBufferContentAndReset(&snapname); + + if (snapname_buff == NULL) { + VIR_DEBUG("No RBD snapshot with zero delta could be found for image %s", + origvol); + + virBufferAsprintf(&snapname, "libvirt-%d", (int)virRandomInt(65534)); + + if (virBufferCheckError(&snapname) < 0) + goto cleanup; + + snapname_buff = virBufferContentAndReset(&snapname); + + if (virStorageBackendRBDSnapshotCreate(image, origvol, snapname_buff) < 0) + goto cleanup; + + } + + VIR_DEBUG("Using snapshot name %s for cloning RBD image %s to %s", + snapname_buff, origvol, newvol); + + /* + * RBD snapshots have to be 'protected' before they can be used + * as a parent snapshot for a child image + */ + if ((r = virStorageBackendRBDSnapshotProtect(image, origvol, snapname_buff)) < 0) + goto cleanup; + + VIR_DEBUG("Performing RBD clone from %s to %s", origvol, newvol); + + if ((r = rbd_clone2(io, origvol, snapname_buff, io, newvol, features, + &order, stripe_unit, stripe_count)) < 0) { + virReportSystemError(-r, _("failed to clone RBD volume %s to %s"), + origvol, newvol); + goto cleanup; + } + + VIR_DEBUG("Cloned RBD image %s to %s", origvol, newvol); + + ret = 0; + + cleanup: + virBufferFreeAndReset(&snapname); + VIR_FREE(snapname_buff); + + if (image) + rbd_close(image); + + return ret; +} + +static int +virStorageBackendRBDBuildVolFrom(virConnectPtr conn, + virStoragePoolObjPtr pool, + virStorageVolDefPtr newvol, + virStorageVolDefPtr origvol, + unsigned int flags) +{ + virStorageBackendRBDState ptr; + ptr.cluster = NULL; + ptr.ioctx = NULL; + int ret = -1; + + VIR_DEBUG("Creating clone of RBD image %s/%s with name %s", + pool->def->source.name, origvol->name, newvol->name); + + virCheckFlags(0, -1); + + if (virStorageBackendRBDOpenRADOSConn(&ptr, conn, &pool->def->source) < 0) + goto cleanup; + + if (virStorageBackendRBDOpenIoCTX(&ptr, pool) < 0) + goto cleanup; + + if ((virStorageBackendRBDCloneImage(ptr.ioctx, origvol->name, newvol->name)) < 0) + goto cleanup; + + ret = 0; + + cleanup: + virStorageBackendRBDCloseRADOSConn(&ptr); + return ret; +} + static int virStorageBackendRBDRefreshVol(virConnectPtr conn, virStoragePoolObjPtr pool ATTRIBUTE_UNUSED, virStorageVolDefPtr vol) @@ -892,6 +1232,7 @@ virStorageBackend virStorageBackendRBD = { .refreshPool = virStorageBackendRBDRefreshPool, .createVol = virStorageBackendRBDCreateVol, .buildVol = virStorageBackendRBDBuildVol, + .buildVolFrom = virStorageBackendRBDBuildVolFrom, .refreshVol = virStorageBackendRBDRefreshVol, .deleteVol = virStorageBackendRBDDeleteVol, .resizeVol = virStorageBackendRBDResizeVol, -- 1.9.1

On 01/27/2016 05:20 AM, Wido den Hollander wrote:
RBD supports cloning by creating a snapshot, protecting it and create a child image based on that snapshot afterwards.
The RBD storage driver will try to find a snapshot with zero deltas between the current state of the original volume and the snapshot.
If such a snapshot is found a clone/child image will be created using the rbd_clone2() function from librbd.
rbd_clone2() is available in librbd since Ceph version Dumpling (0.67) which dates back to August 2013.
It will use the same features, strip size and stripe count as the parent image.
This implementation will only create a single snapshot on the parent image if never changes. This reduces the amount of snapshots created for that RBD image which benefits the performance of the Ceph cluster.
During build the decision will be made to use either rbd_diff_iterate() or rbd_diff_iterate2().
The latter is faster, but only available on Ceph versions after 0.94 (Hammer).
Cloning is only supported if RBD format 2 is used. All images created by libvirt are already format 2.
If a RBD format 1 image is used as the original volume the backend will report a VIR_ERR_OPERATION_UNSUPPORTED error.
Signed-off-by: Wido den Hollander <wido@widodh.nl> --- src/storage/storage_backend_rbd.c | 341 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 341 insertions(+)
ACK - pushed John
participants (5)
-
Daniel P. Berrange
-
Eric Blake
-
John Ferlan
-
Ján Tomko
-
Wido den Hollander