[libvirt] [PATCH v2] iohelper: fix reading with O_DIRECT

saferead is not suitable for direct reads. If file size is not multiple of align size then we get EINVAL on the read(2) that is supposed to return 0 because read buffer will not be aligned at this point. Let's not read again after partial read and check that we read everything by comparing the number of total bytes read against file size. --- Diff from v1: - a couple of cosmetic changes src/util/iohelper.c | 42 +++++++++++++++++++++++++++++++++++++----- 1 file changed, 37 insertions(+), 5 deletions(-) diff --git a/src/util/iohelper.c b/src/util/iohelper.c index fe15a92..9aa6ae0 100644 --- a/src/util/iohelper.c +++ b/src/util/iohelper.c @@ -56,6 +56,7 @@ runIO(const char *path, int fd, int oflags) unsigned long long total = 0; bool direct = O_DIRECT && ((oflags & O_DIRECT) != 0); off_t end = 0; + off_t cur; #if HAVE_POSIX_MEMALIGN if (posix_memalign(&base, alignMask + 1, buflen)) { @@ -78,10 +79,22 @@ runIO(const char *path, int fd, int oflags) fdoutname = "stdout"; /* To make the implementation simpler, we give up on any * attempt to use O_DIRECT in a non-trivial manner. */ - if (direct && ((end = lseek(fd, 0, SEEK_CUR)) != 0)) { - virReportSystemError(end < 0 ? errno : EINVAL, "%s", - _("O_DIRECT read needs entire seekable file")); - goto cleanup; + if (direct) { + if ((cur = lseek(fd, 0, SEEK_CUR)) != 0) { + virReportSystemError(cur < 0 ? errno : EINVAL, "%s", + _("O_DIRECT read needs entire seekable file")); + goto cleanup; + } + + if ((end = lseek(fd, 0, SEEK_END)) < 0) { + virReportSystemError(errno, "%s", _("can not seek file end")); + goto cleanup; + } + + if (lseek(fd, cur, SEEK_SET) < 0) { + virReportSystemError(errno, "%s", _("can not seek file begin")); + goto cleanup; + } } break; case O_WRONLY: @@ -109,7 +122,26 @@ runIO(const char *path, int fd, int oflags) while (1) { ssize_t got; - if ((got = saferead(fdin, buf, buflen)) < 0) { + /* in case of O_DIRECT we cannot read again to check for EOF + * after partial buffer read as it is done in saferead */ + if (direct && fdin == fd && end - total < buflen) { + if (total == end) + break; + + while ((got = read(fd, buf, buflen)) < 0 && errno == EINTR) + ; + + if (got < end - total) { + virReportError(VIR_ERR_INTERNAL_ERROR, + _("Unable to read last chunk from %s"), + fdinname); + goto cleanup; + } + } else { + got = saferead(fdin, buf, buflen); + } + + if (got < 0) { virReportSystemError(errno, _("Unable to read %s"), fdinname); goto cleanup; } -- 1.8.3.1

On Wed, Sep 20, 2017 at 02:58:55PM +0300, Nikolay Shirokovskiy wrote:
saferead is not suitable for direct reads. If file size is not multiple of align size then we get EINVAL on the read(2) that is supposed to return 0 because read buffer will not be aligned at this point.
Let's not read again after partial read and check that we read everything by comparing the number of total bytes read against file size.
What scenario did you actually hit this problem in ? IIUC, we should only be using O_DIRECT against block devices or plain files, and in both those cases we should never see short-read unless we hit EOF. Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On 20.09.2017 16:30, Daniel P. Berrange wrote:
On Wed, Sep 20, 2017 at 02:58:55PM +0300, Nikolay Shirokovskiy wrote:
saferead is not suitable for direct reads. If file size is not multiple of align size then we get EINVAL on the read(2) that is supposed to return 0 because read buffer will not be aligned at this point.
Let's not read again after partial read and check that we read everything by comparing the number of total bytes read against file size.
What scenario did you actually hit this problem in ? IIUC, we should only be using O_DIRECT against block devices or plain files, and in both those cases we should never see short-read unless we hit EOF.
Yes. But saferead is generic function and rereads file after short read. Here we got EINVAL because of misalignement. Nikolay

On Wed, Sep 20, 2017 at 04:35:58PM +0300, Nikolay Shirokovskiy wrote:
On 20.09.2017 16:30, Daniel P. Berrange wrote:
On Wed, Sep 20, 2017 at 02:58:55PM +0300, Nikolay Shirokovskiy wrote:
saferead is not suitable for direct reads. If file size is not multiple of align size then we get EINVAL on the read(2) that is supposed to return 0 because read buffer will not be aligned at this point.
Let's not read again after partial read and check that we read everything by comparing the number of total bytes read against file size.
What scenario did you actually hit this problem in ? IIUC, we should only be using O_DIRECT against block devices or plain files, and in both those cases we should never see short-read unless we hit EOF.
Yes. But saferead is generic function and rereads file after short read. Here we got EINVAL because of misalignement.
I understand that, but how have you actually hit this in the real world. AFAICT, the short-read and subsequent problem with misalignemt should be impossible to hit, because any files we use O_DIRECT on should not ever return a shortread. Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On 20.09.2017 16:41, Daniel P. Berrange wrote:
On Wed, Sep 20, 2017 at 04:35:58PM +0300, Nikolay Shirokovskiy wrote:
On 20.09.2017 16:30, Daniel P. Berrange wrote:
On Wed, Sep 20, 2017 at 02:58:55PM +0300, Nikolay Shirokovskiy wrote:
saferead is not suitable for direct reads. If file size is not multiple of align size then we get EINVAL on the read(2) that is supposed to return 0 because read buffer will not be aligned at this point.
Let's not read again after partial read and check that we read everything by comparing the number of total bytes read against file size.
What scenario did you actually hit this problem in ? IIUC, we should only be using O_DIRECT against block devices or plain files, and in both those cases we should never see short-read unless we hit EOF.
Yes. But saferead is generic function and rereads file after short read. Here we got EINVAL because of misalignement.
I understand that, but how have you actually hit this in the real world. AFAICT, the short-read and subsequent problem with misalignemt should be impossible to hit, because any files we use O_DIRECT on should not ever return a shortread.
virsh restore --bypass-cache just fails (at least on my kernel). The problem is that if the state file size is not multiple of buffer then last read will definetly be short read. Nikolay

On Wed, Sep 20, 2017 at 04:45:35PM +0300, Nikolay Shirokovskiy wrote:
On 20.09.2017 16:41, Daniel P. Berrange wrote:
On Wed, Sep 20, 2017 at 04:35:58PM +0300, Nikolay Shirokovskiy wrote:
On 20.09.2017 16:30, Daniel P. Berrange wrote:
On Wed, Sep 20, 2017 at 02:58:55PM +0300, Nikolay Shirokovskiy wrote:
saferead is not suitable for direct reads. If file size is not multiple of align size then we get EINVAL on the read(2) that is supposed to return 0 because read buffer will not be aligned at this point.
Let's not read again after partial read and check that we read everything by comparing the number of total bytes read against file size.
What scenario did you actually hit this problem in ? IIUC, we should only be using O_DIRECT against block devices or plain files, and in both those cases we should never see short-read unless we hit EOF.
Yes. But saferead is generic function and rereads file after short read. Here we got EINVAL because of misalignement.
I understand that, but how have you actually hit this in the real world. AFAICT, the short-read and subsequent problem with misalignemt should be impossible to hit, because any files we use O_DIRECT on should not ever return a shortread.
virsh restore --bypass-cache just fails (at least on my kernel). The problem is that if the state file size is not multiple of buffer then last read will definetly be short read.
Ah ok I see what's happening - its a short read in the sense that we have hit end-of-file, not a short read in the middle of the file. Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On Wed, Sep 20, 2017 at 02:58:55PM +0300, Nikolay Shirokovskiy wrote:
saferead is not suitable for direct reads. If file size is not multiple of align size then we get EINVAL on the read(2) that is supposed to return 0 because read buffer will not be aligned at this point.
Let's not read again after partial read and check that we read everything by comparing the number of total bytes read against file size. ---
Diff from v1: - a couple of cosmetic changes
src/util/iohelper.c | 42 +++++++++++++++++++++++++++++++++++++----- 1 file changed, 37 insertions(+), 5 deletions(-)
diff --git a/src/util/iohelper.c b/src/util/iohelper.c index fe15a92..9aa6ae0 100644 --- a/src/util/iohelper.c +++ b/src/util/iohelper.c @@ -56,6 +56,7 @@ runIO(const char *path, int fd, int oflags) unsigned long long total = 0; bool direct = O_DIRECT && ((oflags & O_DIRECT) != 0); off_t end = 0; + off_t cur;
#if HAVE_POSIX_MEMALIGN if (posix_memalign(&base, alignMask + 1, buflen)) { @@ -78,10 +79,22 @@ runIO(const char *path, int fd, int oflags) fdoutname = "stdout"; /* To make the implementation simpler, we give up on any * attempt to use O_DIRECT in a non-trivial manner. */ - if (direct && ((end = lseek(fd, 0, SEEK_CUR)) != 0)) { - virReportSystemError(end < 0 ? errno : EINVAL, "%s", - _("O_DIRECT read needs entire seekable file")); - goto cleanup; + if (direct) { + if ((cur = lseek(fd, 0, SEEK_CUR)) != 0) { + virReportSystemError(cur < 0 ? errno : EINVAL, "%s", + _("O_DIRECT read needs entire seekable file")); + goto cleanup; + } + + if ((end = lseek(fd, 0, SEEK_END)) < 0) { + virReportSystemError(errno, "%s", _("can not seek file end")); + goto cleanup; + } + + if (lseek(fd, cur, SEEK_SET) < 0) { + virReportSystemError(errno, "%s", _("can not seek file begin")); + goto cleanup; + } } break; case O_WRONLY: @@ -109,7 +122,26 @@ runIO(const char *path, int fd, int oflags) while (1) { ssize_t got;
- if ((got = saferead(fdin, buf, buflen)) < 0) { + /* in case of O_DIRECT we cannot read again to check for EOF + * after partial buffer read as it is done in saferead */ + if (direct && fdin == fd && end - total < buflen) { + if (total == end) + break; + + while ((got = read(fd, buf, buflen)) < 0 && errno == EINTR) + ; + + if (got < end - total) { + virReportError(VIR_ERR_INTERNAL_ERROR, + _("Unable to read last chunk from %s"), + fdinname); + goto cleanup; + } + } else { + got = saferead(fdin, buf, buflen); + } + + if (got < 0) { virReportSystemError(errno, _("Unable to read %s"), fdinname); goto cleanup; }
So the ultimate problem here is that when saferead() hits EOF, it tries to read some more, expecting to see ret == 0, but the buffer that saferead is passing into read() is not suitably aligned, so we get an error report despite not having any data to read. What's fun is that the runIO method already handles the fact that saferead() may return less data than was asked for, so using the saferead() function is pretty much pointless. There is thus no need to have separate codepaths for O_DIRECT vs normal, if we change iohelper to just use read() unconditionally and let it handle all return values with its existing code. IOW can simplify your patch to just this I believe: diff --git a/src/util/iohelper.c b/src/util/iohelper.c index fe15a92e6..5416d4506 100644 --- a/src/util/iohelper.c +++ b/src/util/iohelper.c @@ -109,7 +109,9 @@ runIO(const char *path, int fd, int oflags) while (1) { ssize_t got; - if ((got = saferead(fdin, buf, buflen)) < 0) { + if ((got = read(fdin, buf, buflen)) < 0) { + if (errno == EINTR) + continue; virReportSystemError(errno, _("Unable to read %s"), fdinname); goto cleanup; } Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On 20.09.2017 17:48, Daniel P. Berrange wrote:
On Wed, Sep 20, 2017 at 02:58:55PM +0300, Nikolay Shirokovskiy wrote:
saferead is not suitable for direct reads. If file size is not multiple of align size then we get EINVAL on the read(2) that is supposed to return 0 because read buffer will not be aligned at this point.
Let's not read again after partial read and check that we read everything by comparing the number of total bytes read against file size. ---
Diff from v1: - a couple of cosmetic changes
src/util/iohelper.c | 42 +++++++++++++++++++++++++++++++++++++----- 1 file changed, 37 insertions(+), 5 deletions(-)
diff --git a/src/util/iohelper.c b/src/util/iohelper.c index fe15a92..9aa6ae0 100644 --- a/src/util/iohelper.c +++ b/src/util/iohelper.c @@ -56,6 +56,7 @@ runIO(const char *path, int fd, int oflags) unsigned long long total = 0; bool direct = O_DIRECT && ((oflags & O_DIRECT) != 0); off_t end = 0; + off_t cur;
#if HAVE_POSIX_MEMALIGN if (posix_memalign(&base, alignMask + 1, buflen)) { @@ -78,10 +79,22 @@ runIO(const char *path, int fd, int oflags) fdoutname = "stdout"; /* To make the implementation simpler, we give up on any * attempt to use O_DIRECT in a non-trivial manner. */ - if (direct && ((end = lseek(fd, 0, SEEK_CUR)) != 0)) { - virReportSystemError(end < 0 ? errno : EINVAL, "%s", - _("O_DIRECT read needs entire seekable file")); - goto cleanup; + if (direct) { + if ((cur = lseek(fd, 0, SEEK_CUR)) != 0) { + virReportSystemError(cur < 0 ? errno : EINVAL, "%s", + _("O_DIRECT read needs entire seekable file")); + goto cleanup; + } + + if ((end = lseek(fd, 0, SEEK_END)) < 0) { + virReportSystemError(errno, "%s", _("can not seek file end")); + goto cleanup; + } + + if (lseek(fd, cur, SEEK_SET) < 0) { + virReportSystemError(errno, "%s", _("can not seek file begin")); + goto cleanup; + } } break; case O_WRONLY: @@ -109,7 +122,26 @@ runIO(const char *path, int fd, int oflags) while (1) { ssize_t got;
- if ((got = saferead(fdin, buf, buflen)) < 0) { + /* in case of O_DIRECT we cannot read again to check for EOF + * after partial buffer read as it is done in saferead */ + if (direct && fdin == fd && end - total < buflen) { + if (total == end) + break; + + while ((got = read(fd, buf, buflen)) < 0 && errno == EINTR) + ; + + if (got < end - total) { + virReportError(VIR_ERR_INTERNAL_ERROR, + _("Unable to read last chunk from %s"), + fdinname); + goto cleanup; + } + } else { + got = saferead(fdin, buf, buflen); + } + + if (got < 0) { virReportSystemError(errno, _("Unable to read %s"), fdinname); goto cleanup; }
So the ultimate problem here is that when saferead() hits EOF, it tries to read some more, expecting to see ret == 0, but the buffer that saferead is passing into read() is not suitably aligned, so we get an error report despite not having any data to read.
What's fun is that the runIO method already handles the fact that saferead() may return less data than was asked for, so using the saferead() function is pretty much pointless. There is thus no need to have separate codepaths for O_DIRECT vs normal, if we change iohelper to just use read() unconditionally and let it handle all return values with its existing code.
It depends on whether reads on files/block devices never return partial reads or not. At least there is not clear statement in read(2) AFAIR.
IOW can simplify your patch to just this I believe:> diff --git a/src/util/iohelper.c b/src/util/iohelper.c index fe15a92e6..5416d4506 100644 --- a/src/util/iohelper.c +++ b/src/util/iohelper.c @@ -109,7 +109,9 @@ runIO(const char *path, int fd, int oflags) while (1) { ssize_t got;
- if ((got = saferead(fdin, buf, buflen)) < 0) { + if ((got = read(fdin, buf, buflen)) < 0) { + if (errno == EINTR) + continue; virReportSystemError(errno, _("Unable to read %s"), fdinname); goto cleanup; }
But we still read half the buffer at the file end and then try misaligned read after that. We could consider short-read as a sign of EOF but i'm not sure this is true so I decided to add check using file size measured thru lseek(2). Nikolay

On Wed, Sep 20, 2017 at 06:00:59PM +0300, Nikolay Shirokovskiy wrote:
On 20.09.2017 17:48, Daniel P. Berrange wrote:
IOW can simplify your patch to just this I believe:> diff --git a/src/util/iohelper.c b/src/util/iohelper.c index fe15a92e6..5416d4506 100644 --- a/src/util/iohelper.c +++ b/src/util/iohelper.c @@ -109,7 +109,9 @@ runIO(const char *path, int fd, int oflags) while (1) { ssize_t got;
- if ((got = saferead(fdin, buf, buflen)) < 0) { + if ((got = read(fdin, buf, buflen)) < 0) { + if (errno == EINTR) + continue; virReportSystemError(errno, _("Unable to read %s"), fdinname); goto cleanup; }
But we still read half the buffer at the file end and then try misaligned read after that.
No, we shouldn't be mis-aligned. The 'buf' variable we use is aligned correctly. If we read half a buffer, we'll then write that data back to libvirtd. So when we go back into read(), we'll be reading into the start of 'buf' again, so we're still correctly aligned. This is the key difference against saferead() - that tries to fill up the buffer before we can write any data, whereas with this patch we'll always write whatever we read immediately Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On 20.09.2017 18:05, Daniel P. Berrange wrote:
On Wed, Sep 20, 2017 at 06:00:59PM +0300, Nikolay Shirokovskiy wrote:
On 20.09.2017 17:48, Daniel P. Berrange wrote:
IOW can simplify your patch to just this I believe:> diff --git a/src/util/iohelper.c b/src/util/iohelper.c index fe15a92e6..5416d4506 100644 --- a/src/util/iohelper.c +++ b/src/util/iohelper.c @@ -109,7 +109,9 @@ runIO(const char *path, int fd, int oflags) while (1) { ssize_t got;
- if ((got = saferead(fdin, buf, buflen)) < 0) { + if ((got = read(fdin, buf, buflen)) < 0) { + if (errno == EINTR) + continue; virReportSystemError(errno, _("Unable to read %s"), fdinname); goto cleanup; }
But we still read half the buffer at the file end and then try misaligned read after that.
No, we shouldn't be mis-aligned. The 'buf' variable we use is aligned correctly. If we read half a buffer, we'll then write that data back to libvirtd. So when we go back into read(), we'll be reading into the start of 'buf' again, so we're still correctly aligned. This is the key difference against saferead() - that tries to fill up the buffer before we can write any data, whereas with this patch we'll always write whatever we read immediately
I've tested this patch and this works. From read(2) EINVAL fd is attached to an object which is unsuitable for reading; or the file was opened with the O_DIRECT flag, and either the address specified in buf, the value specified in count, or the current file off‐ set is not suitably aligned. I thought everything should be aligned. But after giving the above sentence a thought - it is not stated ) So in case of EOF at least position may not be aligned. But in open(2): The O_DIRECT flag may impose alignment restrictions on the length and address of user-space buffers and the file offset of I/Os. In Linux alignment restrictions vary by file system and kernel version and might be absent entirely. However there is currently no file system-independent interface for an application to dis‐ cover these restrictions for a given file or file system. Some file systems provide their own interfaces for doing so, for example the XFS_IOC_DIOINFO operation in xfsctl(3). So finally it is not clear should such EOF read be possible or not. But I vote for the simple approach while it works. Nikolay
participants (2)
-
Daniel P. Berrange
-
Nikolay Shirokovskiy