[libvirt] Fix slow storage file allocation with O_DSYNC

Hi. By opening storage file with O_DSYNC before allocating disk blocks for it we made the whole operation terribly slow when the allocation is done using posix_fallocate() on a filesystem which does not support fallocate(), i.e. anything but ext4. And by terribly slow I mean 45 minutes (RHEL5) to 10+ hours (Fedora Rawhide) for a 12GB storage. To fix this issue, we have two options. Either avoid using fallocate() emulation inside posix_fallocate() or stop using O_DSYNC. I prepared a patch for each of these options and they will follow as replies to this introduction. Some numbers for the same 12GB file: - libvirt's internal fallocate emulation using - mmap ~3.5 minutes (will be used whenever mmap() is present) - write ~5 minutes - fsync() instead of O_DSYNC - ~3.5 minutes So basically the two options are equivalent wrt to time consumption. The second one seems to be less hacky at the expense of non-lienar behavior which makes progress reporting less useful. Jirka

If fallocate() is present, use it directly instead of posix_allocate(). If it is not support by the kernel or filesystem, emulate it using mmap() or write(). This change is to work around slow fallocate emulation done by glibc's posix_allocate() when used on files opened with O_DSYNC. Signed-off-by: Jiri Denemark <jdenemar@redhat.com> --- configure.ac | 2 +- src/util/util.c | 24 +++++++++++++++++++++--- 2 files changed, 22 insertions(+), 4 deletions(-) diff --git a/configure.ac b/configure.ac index 682b8b5..c000ac3 100644 --- a/configure.ac +++ b/configure.ac @@ -104,7 +104,7 @@ dnl Use --disable-largefile if you don't want this. AC_SYS_LARGEFILE dnl Availability of various common functions (non-fatal if missing). -AC_CHECK_FUNCS([cfmakeraw regexec uname sched_getaffinity getuid getgid posix_fallocate mmap]) +AC_CHECK_FUNCS([cfmakeraw regexec uname sched_getaffinity getuid getgid posix_fallocate mmap fallocate]) dnl Availability of various not common threadsafe functions AC_CHECK_FUNCS([strerror_r strtok_r getmntent_r getgrnam_r getpwuid_r]) diff --git a/src/util/util.c b/src/util/util.c index 34c585d..6ecc226 100644 --- a/src/util/util.c +++ b/src/util/util.c @@ -128,7 +128,7 @@ ssize_t safewrite(int fd, const void *buf, size_t count) return nwritten; } -#ifdef HAVE_POSIX_FALLOCATE +#if defined(HAVE_POSIX_FALLOCATE) && !defined(HAVE_FALLOCATE) int safezero(int fd, int flags ATTRIBUTE_UNUSED, off_t offset, off_t len) { return posix_fallocate(fd, offset, len); @@ -136,7 +136,7 @@ int safezero(int fd, int flags ATTRIBUTE_UNUSED, off_t offset, off_t len) #else #ifdef HAVE_MMAP -int safezero(int fd, int flags ATTRIBUTE_UNUSED, off_t offset, off_t len) +static int safezero_mmap(int fd, int flags ATTRIBUTE_UNUSED, off_t offset, off_t len) { int r; char *buf; @@ -160,7 +160,7 @@ int safezero(int fd, int flags ATTRIBUTE_UNUSED, off_t offset, off_t len) #else /* HAVE_MMAP */ -int safezero(int fd, int flags ATTRIBUTE_UNUSED, off_t offset, off_t len) +static int safezero_write(int fd, int flags ATTRIBUTE_UNUSED, off_t offset, off_t len) { int r; char *buf; @@ -196,6 +196,24 @@ int safezero(int fd, int flags ATTRIBUTE_UNUSED, off_t offset, off_t len) return 0; } #endif /* HAVE_MMAP */ + +int safezero(int fd, int flags ATTRIBUTE_UNUSED, off_t offset, off_t len) +{ +#ifdef HAVE_FALLOCATE + if (fallocate(fd, 0, offset, len) < 0) { + if (errno != ENOSYS && errno != EOPNOTSUPP) + return -1; + } else + return 0; +#endif + +#ifdef HAVE_MMAP + return safezero_mmap(fd, 0, offset, len); +#else + return safezero_write(fd, 0, offset, len); +#endif +} + #endif /* HAVE_POSIX_FALLOCATE */ #ifndef PROXY -- 1.7.0

According to Jiri Denemark on 3/2/2010 8:26 AM:
If fallocate() is present, use it directly instead of posix_allocate(). If it is not support by the kernel or filesystem, emulate it using mmap() or write().
This change is to work around slow fallocate emulation done by glibc's posix_allocate() when used on files opened with O_DSYNC.
I'm personally undecided between the two approaches at the moment, but here's some stylistic review in the meantime.
dnl Availability of various common functions (non-fatal if missing). -AC_CHECK_FUNCS([cfmakeraw regexec uname sched_getaffinity getuid getgid posix_fallocate mmap]) +AC_CHECK_FUNCS([cfmakeraw regexec uname sched_getaffinity getuid getgid posix_fallocate mmap fallocate])
This is already a long line, and you made it longer. Autoconf will let you line-wrap the list to AC_CHECK_FUNCS; it splits on whitespace, including newlines. And maybe it makes more sense to use AC_CHECK_FUNCS_ONCE instead of AC_CHECK_FUNCS (oh right, 2.59 didn't have the former).
-#ifdef HAVE_POSIX_FALLOCATE +#if defined(HAVE_POSIX_FALLOCATE) && !defined(HAVE_FALLOCATE)
Redundant parenthesis. I know upstream gnulib's maint.mk warns about them; did we turn them off in libvirt's 'make syntax-check'?
#ifdef HAVE_MMAP -int safezero(int fd, int flags ATTRIBUTE_UNUSED, off_t offset, off_t len) +static int safezero_mmap(int fd, int flags ATTRIBUTE_UNUSED, off_t offset, off_t len)
Already there before your patch, but formatting this as: static int safezero_mmap(int... makes life easier for emacs, and for 'grep "^safezero"' to quickly find the implementation.
@@ -196,6 +196,24 @@ int safezero(int fd, int flags ATTRIBUTE_UNUSED, off_t offset, off_t len) return 0; } #endif /* HAVE_MMAP */ + +int safezero(int fd, int flags ATTRIBUTE_UNUSED, off_t offset, off_t len)
Likewise, for: int safezero(int...
+{ +#ifdef HAVE_FALLOCATE + if (fallocate(fd, 0, offset, len) < 0) { + if (errno != ENOSYS && errno != EOPNOTSUPP) + return -1; + } else + return 0; +#endif + +#ifdef HAVE_MMAP + return safezero_mmap(fd, 0, offset, len); +#else + return safezero_write(fd, 0, offset, len); +#endif
Looks reasonable; but why not name both fallbacks safezero_fallback at the points where they are declared (HAVE_MMAP/!HAVE_MMAP), so that you can get rid of the in-function #ifdef HAVE_MMAP here? -- Eric Blake eblake@redhat.com +1-801-349-2682 Libvirt virtualization library http://libvirt.org

According to Eric Blake on 3/2/2010 8:50 AM:
+AC_CHECK_FUNCS([cfmakeraw regexec uname sched_getaffinity getuid getgid posix_fallocate mmap fallocate])
This is already a long line, and you made it longer. Autoconf will let you line-wrap the list to AC_CHECK_FUNCS; it splits on whitespace, including newlines. And maybe it makes more sense to use AC_CHECK_FUNCS_ONCE instead of AC_CHECK_FUNCS (oh right, 2.59 didn't have the former).
Then again, we use gnulib, and gnulib guarantees AC_CHECK_FUNCS_ONCE even for autoconf 2.59. -- Eric Blake eblake@redhat.com +1-801-349-2682 Libvirt virtualization library http://libvirt.org

On Tue, Mar 02, 2010 at 08:50:23AM -0700, Eric Blake wrote:
According to Jiri Denemark on 3/2/2010 8:26 AM:
If fallocate() is present, use it directly instead of posix_allocate(). If it is not support by the kernel or filesystem, emulate it using mmap() or write().
This change is to work around slow fallocate emulation done by glibc's posix_allocate() when used on files opened with O_DSYNC. [..]
-#ifdef HAVE_POSIX_FALLOCATE +#if defined(HAVE_POSIX_FALLOCATE) && !defined(HAVE_FALLOCATE)
Redundant parenthesis. I know upstream gnulib's maint.mk warns about them; did we turn them off in libvirt's 'make syntax-check'?
Hum, what's redundant here ? Daniel -- Daniel Veillard | libxml Gnome XML XSLT toolkit http://xmlsoft.org/ daniel@veillard.com | Rpmfind RPM search engine http://rpmfind.net/ http://veillard.com/ | virtualization library http://libvirt.org/

According to Daniel Veillard on 3/2/2010 9:53 AM:
+#if defined(HAVE_POSIX_FALLOCATE) && !defined(HAVE_FALLOCATE)
Redundant parenthesis. I know upstream gnulib's maint.mk warns about them; did we turn them off in libvirt's 'make syntax-check'?
Hum, what's redundant here ?
It's perfectly acceptable (and less typing) to do: #if defined HAVE_POSIX_FALLOCATE && !defined HAVE_FALLOCATE Whether it is also more legible is a matter of opinion. -- Eric Blake eblake@redhat.com +1-801-349-2682 Libvirt virtualization library http://libvirt.org

On Tue, Mar 02, 2010 at 10:08:50AM -0700, Eric Blake wrote:
According to Daniel Veillard on 3/2/2010 9:53 AM:
+#if defined(HAVE_POSIX_FALLOCATE) && !defined(HAVE_FALLOCATE)
Redundant parenthesis. I know upstream gnulib's maint.mk warns about them; did we turn them off in libvirt's 'make syntax-check'?
Hum, what's redundant here ?
It's perfectly acceptable (and less typing) to do:
#if defined HAVE_POSIX_FALLOCATE && !defined HAVE_FALLOCATE
Whether it is also more legible is a matter of opinion.
Never seen the second form, and I prefer the first one personally :-) Daniel -- Daniel Veillard | libxml Gnome XML XSLT toolkit http://xmlsoft.org/ daniel@veillard.com | Rpmfind RPM search engine http://rpmfind.net/ http://veillard.com/ | virtualization library http://libvirt.org/

If fallocate() is present, use it directly instead of posix_allocate(). If it is not support by the kernel or filesystem, emulate it using mmap() or write().
This change is to work around slow fallocate emulation done by glibc's posix_allocate() when used on files opened with O_DSYNC.
Hmm, there is a dark side of this patch. Since fallocate is not portable, we use posix_fallocate in case fallocate is not available. Although this could happen because another system call with similar semantics exists, it can also happen on older linux distributions where there is no equivalent of fallocate call. And in such distributions (RHEL5 being one example), this patch will make no functional change. So either we could sacrifice portability by not using posix_fallocate at all or come up with something else. Or use the patch from option 2. Jirka

Instead of opening storage file with O_DSYNC, make sure data are written to a disk only before we claim allocation has finished. Signed-off-by: Jiri Denemark <jdenemar@redhat.com> --- src/storage/storage_backend.c | 9 ++++++++- 1 files changed, 8 insertions(+), 1 deletions(-) diff --git a/src/storage/storage_backend.c b/src/storage/storage_backend.c index 849f01b..4cc2f93 100644 --- a/src/storage/storage_backend.c +++ b/src/storage/storage_backend.c @@ -332,6 +332,13 @@ static int createRawFileOpHook(int fd, void *data) { goto cleanup; } } + + if (fsync(fd) < 0) { + ret = errno; + virReportSystemError(errno, _("cannot sync data to file '%s'"), + hdata->vol->target.path); + goto cleanup; + } } cleanup: @@ -357,7 +364,7 @@ virStorageBackendCreateRaw(virConnectPtr conn ATTRIBUTE_UNUSED, } if ((createstat = virFileOperation(vol->target.path, - O_RDWR | O_CREAT | O_EXCL | O_DSYNC, + O_RDWR | O_CREAT | O_EXCL, vol->target.perms.mode, vol->target.perms.uid, vol->target.perms.gid, createRawFileOpHook, &hdata, -- 1.7.0

According to Jiri Denemark on 3/2/2010 8:26 AM:
Instead of opening storage file with O_DSYNC, make sure data are written to a disk only before we claim allocation has finished.
+++ b/src/storage/storage_backend.c @@ -332,6 +332,13 @@ static int createRawFileOpHook(int fd, void *data) { goto cleanup; } } + + if (fsync(fd) < 0) { + ret = errno; + virReportSystemError(errno, _("cannot sync data to file '%s'"), + hdata->vol->target.path); + goto cleanup; + } }
cleanup: @@ -357,7 +364,7 @@ virStorageBackendCreateRaw(virConnectPtr conn ATTRIBUTE_UNUSED, }
if ((createstat = virFileOperation(vol->target.path, - O_RDWR | O_CREAT | O_EXCL | O_DSYNC, + O_RDWR | O_CREAT | O_EXCL,
Are there any other uses of the fd that also need a call to fsync()? This is certainly the less invasive patch of the two options, with no obvious problems to me. You are correct that we have less progress timing visibility into the guts of the fsync() than if we are doing the per-page mmap/write flushing on an O_DSYNC fd. -- Eric Blake eblake@redhat.com +1-801-349-2682 Libvirt virtualization library http://libvirt.org

On Tue, Mar 02, 2010 at 04:26:59PM +0100, Jiri Denemark wrote:
Instead of opening storage file with O_DSYNC, make sure data are written to a disk only before we claim allocation has finished.
Signed-off-by: Jiri Denemark <jdenemar@redhat.com> --- src/storage/storage_backend.c | 9 ++++++++- 1 files changed, 8 insertions(+), 1 deletions(-)
diff --git a/src/storage/storage_backend.c b/src/storage/storage_backend.c index 849f01b..4cc2f93 100644 --- a/src/storage/storage_backend.c +++ b/src/storage/storage_backend.c @@ -332,6 +332,13 @@ static int createRawFileOpHook(int fd, void *data) { goto cleanup; } } + + if (fsync(fd) < 0) { + ret = errno; + virReportSystemError(errno, _("cannot sync data to file '%s'"), + hdata->vol->target.path); + goto cleanup; + } }
cleanup: @@ -357,7 +364,7 @@ virStorageBackendCreateRaw(virConnectPtr conn ATTRIBUTE_UNUSED, }
if ((createstat = virFileOperation(vol->target.path, - O_RDWR | O_CREAT | O_EXCL | O_DSYNC, + O_RDWR | O_CREAT | O_EXCL, vol->target.perms.mode, vol->target.perms.uid, vol->target.perms.gid, createRawFileOpHook, &hdata,
I would tend to prefer this patch on the basis that I would prefer a smaller change to the code at this point before the release. But the O_DSYNC was added after finding that cache effects could just break the qemu startup timeout in the test environment. I'm not 100% sure fsync() will have the same behaviour overall, it should but theory and practice may diverge :-) So I'm tempted to go with this patch but we should validate it on testing before the end of the week. Daniel -- Daniel Veillard | libxml Gnome XML XSLT toolkit http://xmlsoft.org/ daniel@veillard.com | Rpmfind RPM search engine http://rpmfind.net/ http://veillard.com/ | virtualization library http://libvirt.org/

Instead of opening storage file with O_DSYNC, make sure data are written to a disk only before we claim allocation has finished. ... I would tend to prefer this patch on the basis that I would prefer a smaller change to the code at this point before the release. But the O_DSYNC was added after finding that cache effects could just break the qemu startup timeout in the test environment. I'm not 100% sure fsync() will have the same behaviour overall, it should but theory and practice may diverge :-) So I'm tempted to go with this patch but we should validate it on testing before the end of the week.
Oh, the patch missed the release :-/ I pushed it now. Jirka
participants (3)
-
Daniel Veillard
-
Eric Blake
-
Jiri Denemark