[PATCH 0/9] Allow sparse streams for block devices

The way our sparse streams are implemented is: 1) user facing APIs (virStreamSparseRecvAll() and virStreamSparseSendAll()) take callbacks as arguments. These callbacks read/write data or determine if there is a hole in the underlying file and big it is. 2) libvirtd has something similar - virFDStream, except here the functions for read/write of data and hole handling are called directly. Sparse streams were originally implemented for regular files only => both ends of stream has to be regular files. This limitation exists because the callbacks from 1) (implemented in virsh for vol-download/vol-upload commands) and also from 2) (which is basically the same code) uses lseek(..., SEEK_DATA) and/or lseek(..., SEEK_HOLE) to get map of allocated file blocks. They also take a shortcut (valid for regular files) - when one side of the stream is asked to create a hole it merely lseek() + ftruncate(). For regular files this creates a hole and later, when somebody reads it all they get is zeroes. Neither of these two approaches work for block devices. Block devices have no notion of data/hole sections [1], nor can they be truncated. What we can do instead is read data from the block device and check if its full of zeroes. And for "creating a hole" we just write zeroes of requested size. There is a follow up patch that I am working on: this implementation I'm posting here has one disadvantage: after some blocks are read from the block device and they are found to contain data, the whole buffer is freed only to be read again. For instance, when uploading volume, virshStreamInData() is called at the beginning to check if the file containing data to upload doesn't start with a hole. If the file is a block device, then virFileInDataDetectZeroes() is called which reads 1MiB from it, finds (say) data and throws the buffer away. Then virshStreamSource() is called, which reads the 1MiB buffer again. The patch is still under development though. 1: Okay, SSDs keep list of free blocks for wear levelling, but the list is kept private by the SSD controller and I am not aware of any way of getting it. Michal Prívozník (9): libvirt-storage: Document volume upload/download stream format virstring: Introduce virStringIsNull() virfile: Introduce virFileInDataDetectZeroes() virsh: Pass virshStreamCallbackDataPtr to virshStreamSink() and virshStreamSkip() virsh: Track if vol-upload or vol-download work over a block device virshStreamSkip: Emulate skip for block devices virfdstream: Allow sparse stream vol-download virshStreamInData: Handle block devices virfdstream: Emulate skip for block devices src/libvirt-storage.c | 8 +++-- src/libvirt_private.syms | 2 ++ src/util/virfdstream.c | 74 ++++++++++++++++++++++++++++++---------- src/util/virfile.c | 67 ++++++++++++++++++++++++++++++++++++ src/util/virfile.h | 4 +++ src/util/virstring.c | 40 ++++++++++++++++++++++ src/util/virstring.h | 2 ++ tools/virsh-util.c | 52 ++++++++++++++++++++++------ tools/virsh-util.h | 1 + tools/virsh-volume.c | 20 ++++++++++- 10 files changed, 238 insertions(+), 32 deletions(-) -- 2.26.2

For libvirt, the volume is just a binary blob and it doesn't interpret data on volume upload/download. But as it turns out, this unspoken assumption is not clear to our users. Document it explicitly. Suggested in: https://bugzilla.redhat.com/show_bug.cgi?id=1851023#c17 Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/libvirt-storage.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/src/libvirt-storage.c b/src/libvirt-storage.c index a45c8b98c1..8738f6dd14 100644 --- a/src/libvirt-storage.c +++ b/src/libvirt-storage.c @@ -1590,7 +1590,9 @@ virStorageVolCreateXMLFrom(virStoragePoolPtr pool, * * Download the content of the volume as a stream. If @length * is zero, then the remaining contents of the volume after - * @offset will be downloaded. + * @offset will be downloaded. Please note, that the data is + * not interpreted and therefore data received by stream + * client are in the very same format the volume is in. * * If VIR_STORAGE_VOL_DOWNLOAD_SPARSE_STREAM is set in @flags * effective transmission of holes is enabled. This assumes using @@ -1663,7 +1665,9 @@ virStorageVolDownload(virStorageVolPtr vol, * will fail if @offset + @length exceeds the size of the * volume. Otherwise, if @length is non-zero, an error * will be raised if an attempt is made to upload greater - * than @length bytes of data. + * than @length bytes of data. Please note, that the data is + * not interpreted and therefore data sent by stream client + * must be in the very same format the volume is in. * * If VIR_STORAGE_VOL_UPLOAD_SPARSE_STREAM is set in @flags * effective transmission of holes is enabled. This assumes using -- 2.26.2

This function will be used to detect zero buffers (which are going to be interpreted as hole in virStream later). I shamelessly took inspiration from coreutils. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/libvirt_private.syms | 1 + src/util/virstring.c | 40 ++++++++++++++++++++++++++++++++++++++++ src/util/virstring.h | 2 ++ 3 files changed, 43 insertions(+) diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index ae0e253ab9..1d80aeb833 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -3197,6 +3197,7 @@ virStringHasChars; virStringHasControlChars; virStringHasSuffix; virStringIsEmpty; +virStringIsNull; virStringIsPrintable; virStringListAdd; virStringListAutoFree; diff --git a/src/util/virstring.c b/src/util/virstring.c index e9e792f3bf..7f6871d5ab 100644 --- a/src/util/virstring.c +++ b/src/util/virstring.c @@ -1404,3 +1404,43 @@ int virStringParseYesNo(const char *str, bool *result) return 0; } + + +/** + * virStringIsNull: + * @buf: buffer to check + * @len: the length of the buffer + * + * For given buffer @buf and its size @len determine whether + * it contains only zero bytes or not. + * + * Returns: true if buffer is full of zero bytes, + * false otherwise. + */ +bool virStringIsNull(const char *buf, size_t len) +{ + const char *p = buf; + + if (!len) + return true; + + /* Check up to 16 first bytes. memcmp() below uses block of sizeof(long) + * size (in most cases on 64bits is 8 bytes), doing twice the size just to + * be safe. */ + for (;;) { + if (*p) + return false; + + p++; + len--; + + if (!len) + return true; + + if ((len & (2 * sizeof(long))) == 0) + break; + } + + /* Now we know first 16 bytes are NUL, memcmp with self. */ + return memcmp(buf, p, len) == 0; +} diff --git a/src/util/virstring.h b/src/util/virstring.h index 360c68395c..d0e54358ac 100644 --- a/src/util/virstring.h +++ b/src/util/virstring.h @@ -185,6 +185,8 @@ int virStringParsePort(const char *str, int virStringParseYesNo(const char *str, bool *result) G_GNUC_WARN_UNUSED_RESULT; +bool virStringIsNull(const char *buf, size_t len); + /** * VIR_AUTOSTRINGLIST: * -- 2.26.2

On a Friday in 2020, Michal Privoznik wrote:
This function will be used to detect zero buffers (which are going to be interpreted as hole in virStream later).
I shamelessly took inspiration from coreutils.
How is that okay, isn't coreutils GPL-licensed? Jano
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/libvirt_private.syms | 1 + src/util/virstring.c | 40 ++++++++++++++++++++++++++++++++++++++++ src/util/virstring.h | 2 ++ 3 files changed, 43 insertions(+)

On 7/3/20 2:08 PM, Ján Tomko wrote:
On a Friday in 2020, Michal Privoznik wrote:
This function will be used to detect zero buffers (which are going to be interpreted as hole in virStream later).
I shamelessly took inspiration from coreutils.
How is that okay, isn't coreutils GPL-licensed?
Well, I only took inspiration in that code, I haven't copied it. But this idea of using memcmp() is quite old and described for instance in this blog post too: https://rusty.ozlabs.org/?p=560 Do you want to switch to something else instead? If so, how can we make sure it's not already somewhere in a code with incompatible license? Michal

On Fri, Jul 03, 2020 at 03:13:29PM +0200, Michal Privoznik wrote:
On 7/3/20 2:08 PM, Ján Tomko wrote:
On a Friday in 2020, Michal Privoznik wrote:
This function will be used to detect zero buffers (which are going to be interpreted as hole in virStream later).
I shamelessly took inspiration from coreutils.
How is that okay, isn't coreutils GPL-licensed?
Well, I only took inspiration in that code, I haven't copied it. But this idea of using memcmp() is quite old and described for instance in this blog post too:
https://rusty.ozlabs.org/?p=560
Do you want to switch to something else instead? If so, how can we make sure it's not already somewhere in a code with incompatible license?
I think the current code is fine as it is simple logic. If someone described the algorithm you'd end up doing it the same way Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On Fri, Jul 03, 2020 at 12:28:43PM +0200, Michal Privoznik wrote:
This function will be used to detect zero buffers (which are going to be interpreted as hole in virStream later).
I shamelessly took inspiration from coreutils.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/libvirt_private.syms | 1 + src/util/virstring.c | 40 ++++++++++++++++++++++++++++++++++++++++ src/util/virstring.h | 2 ++ 3 files changed, 43 insertions(+)
diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index ae0e253ab9..1d80aeb833 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -3197,6 +3197,7 @@ virStringHasChars; virStringHasControlChars; virStringHasSuffix; virStringIsEmpty; +virStringIsNull; virStringIsPrintable; virStringListAdd; virStringListAutoFree; diff --git a/src/util/virstring.c b/src/util/virstring.c index e9e792f3bf..7f6871d5ab 100644 --- a/src/util/virstring.c +++ b/src/util/virstring.c @@ -1404,3 +1404,43 @@ int virStringParseYesNo(const char *str, bool *result)
return 0; } + + +/** + * virStringIsNull: + * @buf: buffer to check + * @len: the length of the buffer + * + * For given buffer @buf and its size @len determine whether + * it contains only zero bytes or not. + * + * Returns: true if buffer is full of zero bytes, + * false otherwise. + */ +bool virStringIsNull(const char *buf, size_t len) +{ + const char *p = buf; + + if (!len) + return true; + + /* Check up to 16 first bytes. memcmp() below uses block of sizeof(long) + * size (in most cases on 64bits is 8 bytes), doing twice the size just to + * be safe. */
On 32-bit this is only going to do 8 bytes though. I'm not sure why we need "2 * sizeof(long)" instead of "16". Unless we really do want this to have different behaviour based on sizeof(long), in which case the comment could be clearer
+ for (;;) { + if (*p) + return false; + + p++; + len--; + + if (!len) + return true; + + if ((len & (2 * sizeof(long))) == 0) + break; + } + + /* Now we know first 16 bytes are NUL, memcmp with self. */ + return memcmp(buf, p, len) == 0; +} diff --git a/src/util/virstring.h b/src/util/virstring.h index 360c68395c..d0e54358ac 100644 --- a/src/util/virstring.h +++ b/src/util/virstring.h @@ -185,6 +185,8 @@ int virStringParsePort(const char *str, int virStringParseYesNo(const char *str, bool *result) G_GNUC_WARN_UNUSED_RESULT; +bool virStringIsNull(const char *buf, size_t len); + /** * VIR_AUTOSTRINGLIST: * -- 2.26.2
Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

For given file descriptor determine if the current position it is in plus 1MiB (arbitrary chosen value) consists solely from zero bytes or not. This is a block device friendly version of virFileInData(). Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/libvirt_private.syms | 1 + src/util/virfile.c | 67 ++++++++++++++++++++++++++++++++++++++++ src/util/virfile.h | 4 +++ 3 files changed, 72 insertions(+) diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index 1d80aeb833..6b5a751788 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -2070,6 +2070,7 @@ virFileGetMountSubtree; virFileGetXAttr; virFileGetXAttrQuiet; virFileInData; +virFileInDataDetectZeroes; virFileIsCDROM; virFileIsDir; virFileIsExecutable; diff --git a/src/util/virfile.c b/src/util/virfile.c index c034df5931..18b5096b88 100644 --- a/src/util/virfile.c +++ b/src/util/virfile.c @@ -4064,6 +4064,73 @@ virFileInData(int fd G_GNUC_UNUSED, #endif /* !HAVE_DECL_SEEK_HOLE */ +#define DETECT_ZEORES_BLOCK_SIZE (1 * 1024 * 1024) + +/* + * virFileInDataDetectZeroes: + * @fd: file to check + * @inData: true if current position in the @fd is in data section + * @length: amount of bytes until the end of the current section + * + * This behaves exactly like virFileInData() except it doesn't use SEEK_DATA + * and SEEK_HOLE rather than a buffer into which it reads data from @fd and + * detects if the buffer is full of zeroes. Therefore, it is safe to use on + * special files (e.g. block devices). On the other hand it sees only + * DETECT_ZEORES_BLOCK_SIZE ahead. + * + * Returns: 0 on success, + * -1 otherwise. + */ +int +virFileInDataDetectZeroes(int fd, + int *inData, + long long *length) +{ + const size_t buflen = DETECT_ZEORES_BLOCK_SIZE; + g_autofree char *buf = NULL; + off_t cur; + ssize_t r; + int ret = -1; + + /* Get current position */ + cur = lseek(fd, 0, SEEK_CUR); + if (cur == (off_t) -1) { + virReportSystemError(errno, "%s", + _("Unable to get current position in file")); + goto cleanup; + } + + buf = g_new0(char, buflen); + + if ((r = saferead(fd, buf, buflen)) < 0) { + virReportSystemError(errno, "%s", + _("Unable to read from file")); + goto cleanup; + } + + *inData = !virStringIsNull(buf, r); + *length = MIN(r, buflen); + ret = 0; + + cleanup: + /* At any rate, reposition back to where we started. */ + if (cur != (off_t) -1) { + int theerrno = errno; + + if (lseek(fd, cur, SEEK_SET) == (off_t) -1) { + virReportSystemError(errno, "%s", + _("unable to restore position in file")); + ret = -1; + if (theerrno == 0) + theerrno = errno; + } + + errno = theerrno; + } + return ret; +} + + /** * virFileReadValueInt: * @value: pointer to int to be filled in with the value diff --git a/src/util/virfile.h b/src/util/virfile.h index 7a92364a5c..9a5eade609 100644 --- a/src/util/virfile.h +++ b/src/util/virfile.h @@ -352,6 +352,10 @@ int virFileInData(int fd, int *inData, long long *length); +int virFileInDataDetectZeroes(int fd, + int *inData, + long long *length); + G_DEFINE_AUTOPTR_CLEANUP_FUNC(virFileWrapperFd, virFileWrapperFdFree); int virFileGetXAttr(const char *path, -- 2.26.2

These callback will need to know more that the FD they are working on. Pass the structure that is passed to other stream callbacks (e.g. virshStreamSource() or virshStreamSourceSkip()) instead of inventing a new one. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- tools/virsh-util.c | 10 +++++----- tools/virsh-volume.c | 6 +++++- 2 files changed, 10 insertions(+), 6 deletions(-) diff --git a/tools/virsh-util.c b/tools/virsh-util.c index 932d6d0849..89f15efd08 100644 --- a/tools/virsh-util.c +++ b/tools/virsh-util.c @@ -146,9 +146,9 @@ virshStreamSink(virStreamPtr st G_GNUC_UNUSED, size_t nbytes, void *opaque) { - int *fd = opaque; + virshStreamCallbackDataPtr cbData = opaque; - return safewrite(*fd, bytes, nbytes); + return safewrite(cbData->fd, bytes, nbytes); } @@ -186,13 +186,13 @@ virshStreamSkip(virStreamPtr st G_GNUC_UNUSED, long long offset, void *opaque) { - int *fd = opaque; + virshStreamCallbackDataPtr cbData = opaque; off_t cur; - if ((cur = lseek(*fd, offset, SEEK_CUR)) == (off_t) -1) + if ((cur = lseek(cbData->fd, offset, SEEK_CUR)) == (off_t) -1) return -1; - if (ftruncate(*fd, cur) < 0) + if (ftruncate(cbData->fd, cur) < 0) return -1; return 0; diff --git a/tools/virsh-volume.c b/tools/virsh-volume.c index 72394915d8..5cbc2efb7a 100644 --- a/tools/virsh-volume.c +++ b/tools/virsh-volume.c @@ -793,6 +793,7 @@ cmdVolDownload(vshControl *ctl, const vshCmd *cmd) unsigned long long offset = 0, length = 0; bool created = false; virshControlPtr priv = ctl->privData; + virshStreamCallbackData cbData; unsigned int flags = 0; if (vshCommandOptULongLong(ctl, cmd, "offset", &offset) < 0) @@ -820,6 +821,9 @@ cmdVolDownload(vshControl *ctl, const vshCmd *cmd) created = true; } + cbData.ctl = ctl; + cbData.fd = fd; + if (!(st = virStreamNew(priv->conn, 0))) { vshError(ctl, _("cannot create a new stream")); goto cleanup; @@ -830,7 +834,7 @@ cmdVolDownload(vshControl *ctl, const vshCmd *cmd) goto cleanup; } - if (virStreamSparseRecvAll(st, virshStreamSink, virshStreamSkip, &fd) < 0) { + if (virStreamSparseRecvAll(st, virshStreamSink, virshStreamSkip, &cbData) < 0) { vshError(ctl, _("cannot receive data from volume %s"), name); goto cleanup; } -- 2.26.2

We can't use virFileInData() with block devices, but we could use new virFileInDataDetectZeroes(). But to decide we need to know if the FD we are reading data from / writing data to is a block device. Store this information in _virshStreamCallbackData. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- tools/virsh-util.h | 1 + tools/virsh-volume.c | 14 ++++++++++++++ 2 files changed, 15 insertions(+) diff --git a/tools/virsh-util.h b/tools/virsh-util.h index 72653d9735..9ef28cfe0a 100644 --- a/tools/virsh-util.h +++ b/tools/virsh-util.h @@ -72,6 +72,7 @@ typedef virshStreamCallbackData *virshStreamCallbackDataPtr; struct _virshStreamCallbackData { vshControl *ctl; int fd; + bool isBlock; }; int diff --git a/tools/virsh-volume.c b/tools/virsh-volume.c index 5cbc2efb7a..20823e2d10 100644 --- a/tools/virsh-volume.c +++ b/tools/virsh-volume.c @@ -679,6 +679,7 @@ cmdVolUpload(vshControl *ctl, const vshCmd *cmd) virshControlPtr priv = ctl->privData; unsigned int flags = 0; virshStreamCallbackData cbData; + struct stat sb; if (vshCommandOptULongLong(ctl, cmd, "offset", &offset) < 0) return false; @@ -697,8 +698,14 @@ cmdVolUpload(vshControl *ctl, const vshCmd *cmd) goto cleanup; } + if (fstat(fd, &sb) < 0) { + vshError(ctl, _("unable to stat %s"), file); + goto cleanup; + } + cbData.ctl = ctl; cbData.fd = fd; + cbData.isBlock = !!S_ISBLK(sb.st_mode); if (vshCommandOptBool(cmd, "sparse")) flags |= VIR_STORAGE_VOL_UPLOAD_SPARSE_STREAM; @@ -795,6 +802,7 @@ cmdVolDownload(vshControl *ctl, const vshCmd *cmd) virshControlPtr priv = ctl->privData; virshStreamCallbackData cbData; unsigned int flags = 0; + struct stat sb; if (vshCommandOptULongLong(ctl, cmd, "offset", &offset) < 0) return false; @@ -821,8 +829,14 @@ cmdVolDownload(vshControl *ctl, const vshCmd *cmd) created = true; } + if (fstat(fd, &sb) < 0) { + vshError(ctl, _("unable to stat %s"), file); + goto cleanup; + } + cbData.ctl = ctl; cbData.fd = fd; + cbData.isBlock = !!S_ISBLK(sb.st_mode); if (!(st = virStreamNew(priv->conn, 0))) { vshError(ctl, _("cannot create a new stream")); -- 2.26.2

This callback is called when the server sends us STREAM_HOLE meaning there is no real data, only zeroes. For regular files we would just seek() beyond EOF and ftruncate() to create the hole. But for block devices this won't work. Not only we can't seek() beyond EOF, and ftruncate() will fail, this approach won't fill the device with zeroes. We have to do it manually. Partially resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1852528 Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- tools/virsh-util.c | 30 ++++++++++++++++++++++++++---- 1 file changed, 26 insertions(+), 4 deletions(-) diff --git a/tools/virsh-util.c b/tools/virsh-util.c index 89f15efd08..884261eb49 100644 --- a/tools/virsh-util.c +++ b/tools/virsh-util.c @@ -189,11 +189,33 @@ virshStreamSkip(virStreamPtr st G_GNUC_UNUSED, virshStreamCallbackDataPtr cbData = opaque; off_t cur; - if ((cur = lseek(cbData->fd, offset, SEEK_CUR)) == (off_t) -1) - return -1; + if (cbData->isBlock) { + g_autofree char * buf = NULL; + const size_t buflen = 1 * 1024 * 1024; /* 1MiB */ - if (ftruncate(cbData->fd, cur) < 0) - return -1; + /* While for files it's enough to lseek() and ftruncate() to create + * a hole which would emulate zeroes on read(), for block devices + * we have to write zeroes to read() zeroes. And we have to write + * @got bytes of zeroes. Do that in smaller chunks though.*/ + + buf = g_new0(char, buflen); + + while (offset) { + size_t count = MIN(offset, buflen); + ssize_t r; + + if ((r = safewrite(cbData->fd, buf, count)) < 0) + return -1; + + offset -= r; + } + } else { + if ((cur = lseek(cbData->fd, offset, SEEK_CUR)) == (off_t) -1) + return -1; + + if (ftruncate(cbData->fd, cur) < 0) + return -1; + } return 0; } -- 2.26.2

When handling sparse stream, a thread is executed. This thread runs a read() or write() loop (depending what API is called; in this case it's virStorageVolDownload() and this the thread run read() loop). The read() is handled in virFDStreamThreadDoRead() which is then data/hole section aware, meaning it uses virFileInData() to detect data and hole sections and sends TYPE_DATA or TYPE_HOLE virStream messages accordingly. However, virFileInData() does not work with block devices. Simply because block devices don't have data and hole sections. But we can use new virFileInDataDetectZeroes() which is block device friendly for that. Partially resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1852528 Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/util/virfdstream.c | 15 ++++++++++++--- 1 file changed, 12 insertions(+), 3 deletions(-) diff --git a/src/util/virfdstream.c b/src/util/virfdstream.c index 1c32be47a9..b5406fe690 100644 --- a/src/util/virfdstream.c +++ b/src/util/virfdstream.c @@ -396,6 +396,7 @@ struct _virFDStreamThreadData { size_t length; bool doRead; bool sparse; + bool isBlock; int fdin; char *fdinname; int fdout; @@ -419,6 +420,7 @@ virFDStreamThreadDataFree(virFDStreamThreadDataPtr data) static ssize_t virFDStreamThreadDoRead(virFDStreamDataPtr fdst, bool sparse, + bool isBlock, const int fdin, const int fdout, const char *fdinname, @@ -435,8 +437,13 @@ virFDStreamThreadDoRead(virFDStreamDataPtr fdst, ssize_t got; if (sparse && *dataLen == 0) { - if (virFileInData(fdin, &inData, §ionLen) < 0) - goto error; + if (isBlock) { + if (virFileInDataDetectZeroes(fdin, &inData, §ionLen) < 0) + goto error; + } else { + if (virFileInData(fdin, &inData, §ionLen) < 0) + goto error; + } if (length && sectionLen > length - total) @@ -575,6 +582,7 @@ virFDStreamThread(void *opaque) virStreamPtr st = data->st; size_t length = data->length; bool sparse = data->sparse; + bool isBlock = data->isBlock; int fdin = data->fdin; char *fdinname = data->fdinname; int fdout = data->fdout; @@ -611,7 +619,7 @@ virFDStreamThread(void *opaque) } if (doRead) - got = virFDStreamThreadDoRead(fdst, sparse, + got = virFDStreamThreadDoRead(fdst, sparse, isBlock, fdin, fdout, fdinname, fdoutname, length, total, @@ -1289,6 +1297,7 @@ virFDStreamOpenFileInternal(virStreamPtr st, threadData->st = virObjectRef(st); threadData->length = length; threadData->sparse = sparse; + threadData->isBlock = !!S_ISBLK(sb.st_mode); if ((oflags & O_ACCMODE) == O_RDONLY) { threadData->fdin = fd; -- 2.26.2

This is very similar to previous commit. The virshStreamInData() callback is used by virStreamSparseSendAll() to detect whether the file the data is read from is in data or hole section. The SendAll() will then send corresponding type of virStream message to make server create a hole or write actual data. But the callback uses virFileInData() even for block devices, which results in an error. Switch to virFileInDataDetectZeroes() for block devices. Partially resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1852528 Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- tools/virsh-util.c | 16 ++++++++++++---- 1 file changed, 12 insertions(+), 4 deletions(-) diff --git a/tools/virsh-util.c b/tools/virsh-util.c index 884261eb49..867f69577f 100644 --- a/tools/virsh-util.c +++ b/tools/virsh-util.c @@ -230,12 +230,20 @@ virshStreamInData(virStreamPtr st G_GNUC_UNUSED, virshStreamCallbackDataPtr cbData = opaque; vshControl *ctl = cbData->ctl; int fd = cbData->fd; - int ret; - if ((ret = virFileInData(fd, inData, offset)) < 0) - vshError(ctl, "%s", _("Unable to get current position in stream")); + if (cbData->isBlock) { + if (virFileInDataDetectZeroes(fd, inData, offset) < 0) { + vshError(ctl, "%s", _("Unable to get current position in stream")); + return -1; + } + } else { + if (virFileInData(fd, inData, offset) < 0) { + vshError(ctl, "%s", _("Unable to get current position in stream")); + return -1; + } + } - return ret; + return 0; } -- 2.26.2

This is similar to one of previous patches. When receiving stream (on virStorageVolUpload() and subsequent virStreamSparseSendAll()) we may receive a hole. If the volume we are saving the incoming data into is a regular file we just lseek() and ftruncate() to create the hole. But this won't work if the file is a block device. If that is the case we must write zeroes so that any subsequent reader reads nothing just zeroes (just like they would from a hole in a regular file). Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1852528 Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/util/virfdstream.c | 59 +++++++++++++++++++++++++++++++----------- 1 file changed, 44 insertions(+), 15 deletions(-) diff --git a/src/util/virfdstream.c b/src/util/virfdstream.c index b5406fe690..0894be0341 100644 --- a/src/util/virfdstream.c +++ b/src/util/virfdstream.c @@ -512,6 +512,7 @@ virFDStreamThreadDoRead(virFDStreamDataPtr fdst, static ssize_t virFDStreamThreadDoWrite(virFDStreamDataPtr fdst, bool sparse, + bool isBlock, const int fdin, const int fdout, const char *fdinname, @@ -519,7 +520,6 @@ virFDStreamThreadDoWrite(virFDStreamDataPtr fdst, { ssize_t got = 0; virFDStreamMsgPtr msg = fdst->msg; - off_t off; bool pop = false; switch (msg->type) { @@ -547,19 +547,48 @@ virFDStreamThreadDoWrite(virFDStreamDataPtr fdst, } got = msg->stream.hole.len; - off = lseek(fdout, got, SEEK_CUR); - if (off == (off_t) -1) { - virReportSystemError(errno, - _("unable to seek in %s"), - fdoutname); - return -1; - } - - if (ftruncate(fdout, off) < 0) { - virReportSystemError(errno, - _("unable to truncate %s"), - fdoutname); - return -1; + if (isBlock) { + g_autofree char * buf = NULL; + const size_t buflen = 1 * 1024 * 1024; /* 1MiB */ + size_t toWrite = got; + + /* While for files it's enough to lseek() and ftruncate() to create + * a hole which would emulate zeroes on read(), for block devices + * we have to write zeroes to read() zeroes. And we have to write + * @got bytes of zeroes. Do that in smaller chunks though.*/ + + buf = g_new0(char, buflen); + + while (toWrite) { + size_t count = MIN(toWrite, buflen); + ssize_t r; + + if ((r = safewrite(fdout, buf, count)) < 0) { + virReportSystemError(errno, + _("Unable to write %s"), + fdoutname); + return -1; + } + + toWrite -= r; + } + } else { + off_t off; + + off = lseek(fdout, got, SEEK_CUR); + if (off == (off_t) -1) { + virReportSystemError(errno, + _("unable to seek in %s"), + fdoutname); + return -1; + } + + if (ftruncate(fdout, off) < 0) { + virReportSystemError(errno, + _("unable to truncate %s"), + fdoutname); + return -1; + } } pop = true; @@ -625,7 +654,7 @@ virFDStreamThread(void *opaque) length, total, &dataLen, buflen); else - got = virFDStreamThreadDoWrite(fdst, sparse, + got = virFDStreamThreadDoWrite(fdst, sparse, isBlock, fdin, fdout, fdinname, fdoutname); -- 2.26.2

On Fri, Jul 03, 2020 at 12:28:41PM +0200, Michal Privoznik wrote:
The way our sparse streams are implemented is:
1) user facing APIs (virStreamSparseRecvAll() and virStreamSparseSendAll()) take callbacks as arguments. These callbacks read/write data or determine if there is a hole in the underlying file and big it is.
2) libvirtd has something similar - virFDStream, except here the functions for read/write of data and hole handling are called directly.
Sparse streams were originally implemented for regular files only => both ends of stream has to be regular files. This limitation exists because the callbacks from 1) (implemented in virsh for vol-download/vol-upload commands) and also from 2) (which is basically the same code) uses lseek(..., SEEK_DATA) and/or lseek(..., SEEK_HOLE) to get map of allocated file blocks. They also take a shortcut (valid for regular files) - when one side of the stream is asked to create a hole it merely lseek() + ftruncate(). For regular files this creates a hole and later, when somebody reads it all they get is zeroes.
Neither of these two approaches work for block devices. Block devices have no notion of data/hole sections [1], nor can they be truncated. What we can do instead is read data from the block device and check if its full of zeroes. And for "creating a hole" we just write zeroes of requested size.
There is a follow up patch that I am working on: this implementation I'm posting here has one disadvantage: after some blocks are read from the block device and they are found to contain data, the whole buffer is freed only to be read again. For instance, when uploading volume, virshStreamInData() is called at the beginning to check if the file containing data to upload doesn't start with a hole. If the file is a block device, then virFileInDataDetectZeroes() is called which reads 1MiB from it, finds (say) data and throws the buffer away. Then virshStreamSource() is called, which reads the 1MiB buffer again. The patch is still under development though.
Was there a particular user/app feature request for this support ? I'm wondering about the likely use case, because if I was starting over from scratch I'd never implement stream support for storage volumes. Instead I would add APIs for starting/stopping a qemu-nbd server attached to the volume. Probably don't even need a start/ stop pair, could just run in single client mode where we pass back an opened client FD, and have qemu-nbd exit when this is closed. Depending on the user requesting sparse support for blockdevs it may still make sense to provide them an NBD solution, especially if they're likely to have followup feature requests already handled by NBD. NB, this is not an objection to your series here. Since you've already done the work and it is really just giving closer parity between block and file volumes. Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On 7/3/20 1:34 PM, Daniel P. Berrangé wrote:
On Fri, Jul 03, 2020 at 12:28:41PM +0200, Michal Privoznik wrote:
The way our sparse streams are implemented is:
1) user facing APIs (virStreamSparseRecvAll() and virStreamSparseSendAll()) take callbacks as arguments. These callbacks read/write data or determine if there is a hole in the underlying file and big it is.
2) libvirtd has something similar - virFDStream, except here the functions for read/write of data and hole handling are called directly.
Sparse streams were originally implemented for regular files only => both ends of stream has to be regular files. This limitation exists because the callbacks from 1) (implemented in virsh for vol-download/vol-upload commands) and also from 2) (which is basically the same code) uses lseek(..., SEEK_DATA) and/or lseek(..., SEEK_HOLE) to get map of allocated file blocks. They also take a shortcut (valid for regular files) - when one side of the stream is asked to create a hole it merely lseek() + ftruncate(). For regular files this creates a hole and later, when somebody reads it all they get is zeroes.
Neither of these two approaches work for block devices. Block devices have no notion of data/hole sections [1], nor can they be truncated. What we can do instead is read data from the block device and check if its full of zeroes. And for "creating a hole" we just write zeroes of requested size.
There is a follow up patch that I am working on: this implementation I'm posting here has one disadvantage: after some blocks are read from the block device and they are found to contain data, the whole buffer is freed only to be read again. For instance, when uploading volume, virshStreamInData() is called at the beginning to check if the file containing data to upload doesn't start with a hole. If the file is a block device, then virFileInDataDetectZeroes() is called which reads 1MiB from it, finds (say) data and throws the buffer away. Then virshStreamSource() is called, which reads the 1MiB buffer again. The patch is still under development though.
Was there a particular user/app feature request for this support ?
Yes, see BZ linked in 9/9. https://bugzilla.redhat.com/show_bug.cgi?id=1852528 Apparently, VDSM uses streams to copy volumes around.
I'm wondering about the likely use case, because if I was starting over from scratch I'd never implement stream support for storage volumes. Instead I would add APIs for starting/stopping a qemu-nbd server attached to the volume. Probably don't even need a start/ stop pair, could just run in single client mode where we pass back an opened client FD, and have qemu-nbd exit when this is closed.
Depending on the user requesting sparse support for blockdevs it may still make sense to provide them an NBD solution, especially if they're likely to have followup feature requests already handled by NBD.
Agreed, streams should have been for console and screenshot only. They are strictly worse for handling large files than scp/nbd/rsync/... because it all has to go through our event loop. And client even loop. On the other hand, they are multiplexed within virCommand which is an advantage that neither of the aforementioned tools have (no need to open a special port then). Michal

On Fri, Jul 03, 2020 at 01:50:22PM +0200, Michal Privoznik wrote:
On 7/3/20 1:34 PM, Daniel P. Berrangé wrote:
On Fri, Jul 03, 2020 at 12:28:41PM +0200, Michal Privoznik wrote:
The way our sparse streams are implemented is:
1) user facing APIs (virStreamSparseRecvAll() and virStreamSparseSendAll()) take callbacks as arguments. These callbacks read/write data or determine if there is a hole in the underlying file and big it is.
2) libvirtd has something similar - virFDStream, except here the functions for read/write of data and hole handling are called directly.
Sparse streams were originally implemented for regular files only => both ends of stream has to be regular files. This limitation exists because the callbacks from 1) (implemented in virsh for vol-download/vol-upload commands) and also from 2) (which is basically the same code) uses lseek(..., SEEK_DATA) and/or lseek(..., SEEK_HOLE) to get map of allocated file blocks. They also take a shortcut (valid for regular files) - when one side of the stream is asked to create a hole it merely lseek() + ftruncate(). For regular files this creates a hole and later, when somebody reads it all they get is zeroes.
Neither of these two approaches work for block devices. Block devices have no notion of data/hole sections [1], nor can they be truncated. What we can do instead is read data from the block device and check if its full of zeroes. And for "creating a hole" we just write zeroes of requested size.
There is a follow up patch that I am working on: this implementation I'm posting here has one disadvantage: after some blocks are read from the block device and they are found to contain data, the whole buffer is freed only to be read again. For instance, when uploading volume, virshStreamInData() is called at the beginning to check if the file containing data to upload doesn't start with a hole. If the file is a block device, then virFileInDataDetectZeroes() is called which reads 1MiB from it, finds (say) data and throws the buffer away. Then virshStreamSource() is called, which reads the 1MiB buffer again. The patch is still under development though.
Was there a particular user/app feature request for this support ?
Yes, see BZ linked in 9/9.
https://bugzilla.redhat.com/show_bug.cgi?id=1852528
Apparently, VDSM uses streams to copy volumes around.
Ah I see. We might want to ask them if they would find it useful if we added an NBD alternative.
I'm wondering about the likely use case, because if I was starting over from scratch I'd never implement stream support for storage volumes. Instead I would add APIs for starting/stopping a qemu-nbd server attached to the volume. Probably don't even need a start/ stop pair, could just run in single client mode where we pass back an opened client FD, and have qemu-nbd exit when this is closed.
Depending on the user requesting sparse support for blockdevs it may still make sense to provide them an NBD solution, especially if they're likely to have followup feature requests already handled by NBD.
Agreed, streams should have been for console and screenshot only. They are strictly worse for handling large files than scp/nbd/rsync/... because it all has to go through our event loop. And client even loop. On the other hand, they are multiplexed within virCommand which is an advantage that neither of the aforementioned tools have (no need to open a special port then).
If qemu-nbd listen on a private UNIX socket, and we pass back a FD connected to that socket, there's no need to deal with firewalls or permissions. Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On Fri, 2020-07-03 at 12:28 +0200, Michal Privoznik wrote:
Michal Prívozník (9): libvirt-storage: Document volume upload/download stream format virstring: Introduce virStringIsNull() virfile: Introduce virFileInDataDetectZeroes() virsh: Pass virshStreamCallbackDataPtr to virshStreamSink() and virshStreamSkip() virsh: Track if vol-upload or vol-download work over a block device virshStreamSkip: Emulate skip for block devices virfdstream: Allow sparse stream vol-download virshStreamInData: Handle block devices virfdstream: Emulate skip for block devices
Is this change something that we should mention in the release notes? If so, please post a patch that updates them accordingly. Thanks! -- Andrea Bolognani / Red Hat / Virtualization

On 7/3/20 3:59 PM, Andrea Bolognani wrote:
On Fri, 2020-07-03 at 12:28 +0200, Michal Privoznik wrote:
Michal Prívozník (9): libvirt-storage: Document volume upload/download stream format virstring: Introduce virStringIsNull() virfile: Introduce virFileInDataDetectZeroes() virsh: Pass virshStreamCallbackDataPtr to virshStreamSink() and virshStreamSkip() virsh: Track if vol-upload or vol-download work over a block device virshStreamSkip: Emulate skip for block devices virfdstream: Allow sparse stream vol-download virshStreamInData: Handle block devices virfdstream: Emulate skip for block devices
Is this change something that we should mention in the release notes? If so, please post a patch that updates them accordingly. Thanks!
Of course! I was think that I have to write a release note for this one as I was updating NEWS.rst for next release in the morning. But Guess what, of course I forgot :-) Michal

On Fri, 2020-07-03 at 16:06 +0200, Michal Privoznik wrote:
On 7/3/20 3:59 PM, Andrea Bolognani wrote:
Is this change something that we should mention in the release notes? If so, please post a patch that updates them accordingly. Thanks!
Of course! I was think that I have to write a release note for this one as I was updating NEWS.rst for next release in the morning. But Guess what, of course I forgot :-)
That's fine, I'm just trying to be a bit more proactive with the release notes, by pestering people into providing updates themselves along with the code changes instead of always scrambling to document a month worth of commits during the freeze period :) -- Andrea Bolognani / Red Hat / Virtualization
participants (4)
-
Andrea Bolognani
-
Daniel P. Berrangé
-
Ján Tomko
-
Michal Privoznik