[libvirt] [PATCH] Use posix_fallocate() to allocate disk space

Hi,

This is an untested patch to make disk allocations faster and non-fragmented. I'm using posix_fallocate() now but relying on glibc really calling fallocate() if it exists for the file system to be the fastest.

- This fails build because libutil needs to be added as a dependency?

  ../src/.libs/libvirt_driver_storage.a(storage_backend_fs.o): In function `virStorageBackendFileSystemVolCreate':
  /home/amit/src/libvirt/src/storage_backend_fs.c:1023: undefined reference to `safezero'

- What's vol->capacity? Why is ftruncate() needed after the call to (current) safewrite()? My assumption is that the user can specify some max. capacity and wish to allocate only a chunk off it at create-time. Is that correct?

The best case to get a non-fragmented VM image is to have it allocated completely at create-time with fallocate(). Currently xfs and ext4 support the fallocate() syscall (btrfs will, too, when it's ready).

Comments?

Amit

From dfe4780f5990571f026e02e6187cb64505c982c1 Mon Sep 17 00:00:00 2001
From: Amit Shah <amit.shah@redhat.com>
Date: Tue, 24 Feb 2009 16:55:58 +0530
Subject: [PATCH] Use posix_fallocate() to allocate disk space

Using posix_fallocate() to allocate disk space and fill it with zeros is
faster than writing the zeros block-by-block.

Also, for backing file systems that support the fallocate() syscall,
this operation will give us a big speed boost.

The biggest advantage of using this is the file will not be fragmented
for the allocated chunks.

Signed-off-by: Amit Shah <amit.shah@redhat.com>
---
 src/storage_backend_fs.c |   23 ++++++++---------------
 src/util.c               |    5 +++++
 src/util.h               |    1 +
 3 files changed, 14 insertions(+), 15 deletions(-)

diff --git a/src/storage_backend_fs.c b/src/storage_backend_fs.c
index 240de96..74b0fda 100644
--- a/src/storage_backend_fs.c
+++ b/src/storage_backend_fs.c
@@ -1019,21 +1019,14 @@ virStorageBackendFileSystemVolCreate(virConnectPtr conn,
             /* XXX slooooooooooooooooow.
              * Need to add in progress bars & bg thread somehow */
             if (vol->allocation) {
-                unsigned long long remain = vol->allocation;
-                static char const zeros[4096];
-                while (remain) {
-                    int bytes = sizeof(zeros);
-                    if (bytes > remain)
-                        bytes = remain;
-                    if ((bytes = safewrite(fd, zeros, bytes)) < 0) {
-                        virReportSystemError(conn, errno,
-                                             _("cannot fill file '%s'"),
-                                             vol->target.path);
-                        unlink(vol->target.path);
-                        close(fd);
-                        return -1;
-                    }
-                    remain -= bytes;
+                int r;
+                if ((r = safezero(fd, 0, 0, vol->allocation)) < 0) {
+                    virReportSystemError(conn, r,
+                                         _("cannot fill file '%s'"),
+                                         vol->target.path);
+                    unlink(vol->target.path);
+                    close(fd);
+                    return -1;
                 }
             }
 
diff --git a/src/util.c b/src/util.c
index 990433a..1bee7f0 100644
--- a/src/util.c
+++ b/src/util.c
@@ -117,6 +117,11 @@ ssize_t safewrite(int fd, const void *buf, size_t count)
     return nwritten;
 }
 
+int safezero(int fd, int flags, off_t offset, off_t len)
+{
+    return posix_fallocate(fd, offset, len);
+}
+
 #ifndef PROXY
 
 int virFileStripSuffix(char *str,
diff --git a/src/util.h b/src/util.h
index a79cfa7..acaabb1 100644
--- a/src/util.h
+++ b/src/util.h
@@ -31,6 +31,7 @@
 
 int saferead(int fd, void *buf, size_t count);
 ssize_t safewrite(int fd, const void *buf, size_t count);
+int safezero(int fd, int flags, off_t offset, off_t len);
 
 enum {
     VIR_EXEC_NONE = 0,
-- 
1.6.0.6
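
One detail about the safezero() wrapper in the patch: posix_fallocate() returns 0 on success and a positive error number on failure, and it does not set errno, so a caller that tests the result with "< 0" would not see a failure. A minimal sketch of one way to adapt the wrapper, assuming the usual "-1 and errno" convention is wanted (this variant is illustrative, not the code that was posted or merged):

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

/* Hypothetical variant of the safezero() helper from the patch.
 * posix_fallocate() reports failure by returning a positive error
 * number and leaves errno alone, so the result is translated into
 * "return -1 with errno set". The flags parameter is unused and is
 * kept only so the signature matches the patch. */
int safezero(int fd, int flags, off_t offset, off_t len)
{
    (void)flags;

    int r = posix_fallocate(fd, offset, len);
    if (r == 0)
        return 0;

    errno = r;
    return -1;
}
```

With this shape the call site could keep its "< 0" test, but would then report errno rather than the return value.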

On Tue, Feb 24, 2009 at 05:09:31PM +0530, Amit Shah wrote:
> Hi,
>
> This is an untested patch to make disk allocations faster and non-fragmented. I'm using posix_fallocate() now but relying on glibc really calling fallocate() if it exists for the file system to be the fastest.
>
> - This fails build because libutil needs to be added as a dependency?
>
> ../src/.libs/libvirt_driver_storage.a(storage_backend_fs.o): In function `virStorageBackendFileSystemVolCreate':
> /home/amit/src/libvirt/src/storage_backend_fs.c:1023: undefined reference to `safezero'

You'd need to add 'safezero' to src/libvirt_private.syms to allow it to be linked to by the storage driver.

> - What's vol->capacity? Why is ftruncate() needed after the call to (current) safewrite()? My assumption is that the user can specify some max. capacity and wish to allocate only a chunk off it at create-time. Is that correct?

"allocation" refers to the current physical usage of the volume
"capacity" refers to the logical size of the volume

So, you can have a raw sparse file of size 4 GB, but not allocate any disk upfront - just allocated on demand when guest writes to it. Or you can allocate 1 GB upfront, and leave the rest unallocated. So this code is first filling out the upfront allocation the user requested, and then using ftruncate() to extend to a (possibly larger) logical size.

Similarly for qcow files, capacity refers to the logical disk size but qcow is grow on demand, so allocation will be much lower.

Usually allocation <= capacity, but if the volume format has metadata overhead, you can get to a place where allocation > capacity if the entire volume has been written to.

> The best case to get a non-fragmented VM image is to have it allocated completely at create-time with fallocate().

The main problem with this change is that it'll make it harder for us to provide incremental feedback. As per the comment in the code, it is our intention to make the volume creation API run as a background job which provides feedback on progress of allocation, and the ability to cancel the job. Since posix_fallocate() is an all-or-nothing kind of API it wouldn't be very helpful.

What sort of performance boost does this give you? Would we perhaps be able to get close to it by writing in bigger chunks than 4k, or mmap'ing the file and then doing a memset across it?

Daniel
-- 
|: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

On (Tue) Feb 24 2009 [11:58:31], Daniel P. Berrange wrote:
> On Tue, Feb 24, 2009 at 05:09:31PM +0530, Amit Shah wrote:
> > Hi,
> >
> > This is an untested patch to make disk allocations faster and non-fragmented. I'm using posix_fallocate() now but relying on glibc really calling fallocate() if it exists for the file system to be the fastest.
> >
> > - This fails build because libutil needs to be added as a dependency?
> >
> > ../src/.libs/libvirt_driver_storage.a(storage_backend_fs.o): In function `virStorageBackendFileSystemVolCreate':
> > /home/amit/src/libvirt/src/storage_backend_fs.c:1023: undefined reference to `safezero'
>
> You'd need to add 'safezero' to src/libvirt_private.syms to allow it to be linked to by the storage driver.

Thanks; builds now.

> > - What's vol->capacity? Why is ftruncate() needed after the call to (current) safewrite()? My assumption is that the user can specify some max. capacity and wish to allocate only a chunk off it at create-time. Is that correct?
>
> "allocation" refers to the current physical usage of the volume
> "capacity" refers to the logical size of the volume
>
> So, you can have a raw sparse file of size 4 GB, but not allocate any disk upfront - just allocated on demand when guest writes to it. Or you can allocate 1 GB upfront, and leave the rest unallocated. So this code is first filling out the upfront allocation the user requested, and then using ftruncate() to extend to a (possibly larger) logical size.
>
> Similarly for qcow files, capacity refers to the logical disk size but qcow is grow on demand, so allocation will be much lower.
>
> Usually allocation <= capacity, but if the volume format has metadata overhead, you can get to a place where allocation > capacity if the entire volume has been written to.

This case had me puzzled. Thanks for the explanation!
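
The allocation/capacity split explained above can be sketched for a raw file as follows. This is a minimal illustration under stated assumptions: create_raw_volume() is a hypothetical helper, not the libvirt function, and it uses posix_fallocate() for the upfront part followed by ftruncate() for the logical size.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, not libvirt code: reserve `allocation` bytes up
 * front, then extend the file to `capacity` bytes so that the tail
 * stays sparse. Returns 0 on success, -1 with errno set on failure. */
int create_raw_volume(const char *path, off_t allocation, off_t capacity)
{
    int saved;
    int fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return -1;

    if (allocation > 0) {
        /* posix_fallocate() returns an error number, not -1. */
        int r = posix_fallocate(fd, 0, allocation);
        if (r != 0) {
            errno = r;
            goto fail;
        }
    }

    /* Only grow the file; never shrink below what was just allocated. */
    if (capacity > allocation && ftruncate(fd, capacity) < 0)
        goto fail;

    return close(fd);

fail:
    saved = errno;      /* keep the original error across cleanup */
    close(fd);
    unlink(path);
    errno = saved;
    return -1;
}
```

Called with a 1 GB allocation and a 4 GB capacity, this should give a file whose reported size is 4 GB while only about 1 GB of disk is in use, on file systems that support sparse files.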

> > The best case to get a non-fragmented VM image is to have it allocated completely at create-time with fallocate().
>
> The main problem with this change is that it'll make it harder for us to provide incremental feedback. As per the comment in the code, it is our intention to make the volume creation API run as a background job which provides feedback on progress of allocation, and the ability to cancel the job. Since posix_fallocate() is an all-or-nothing kind of API it wouldn't be very helpful.
>
> What sort of performance boost does this give you? Would we perhaps be able to get close to it by writing in bigger chunks than 4k, or mmap'ing the file and then doing a memset across it?

If the file system is asked to zero out a particular block of data (using extents, as is possible in xfs and ext4), it's going to be the fastest method available. Definitely faster than writing chunks in userspace. Of course, my patch is based on untested stuff.

I initially started out by wanting to have the image as defragmented as possible.

There are various parameters which "fast" depends on:

- do we want the image creation to be the fastest operation? (fast, yes, fastest, at the expense of something else, like guest runtime, maybe not.)
- how do guests cope with fragmented images? If the data in the image itself is fragmented, what good is an unfragmented image?

The problem with writing chunks of any size is that it can easily lead to a lot of fragmentation.

If we have the answers to the two questions above, we can make a decision based on actual numbers. If it turns out that a defragmented image file is much better, pretty graphics showing % complete will have to stay as they currently are ;-)

I don't have an ext4 file system on actual hardware to test -- I'll provide numbers when I get to set one up, though.

On (Tue) Feb 24 2009 [11:58:31], Daniel P. Berrange wrote:
> On Tue, Feb 24, 2009 at 05:09:31PM +0530, Amit Shah wrote:
> ...
> > The best case to get a non-fragmented VM image is to have it allocated completely at create-time with fallocate().
>
> The main problem with this change is that it'll make it harder for us to provide incremental feedback. As per the comment in the code, it is our intention to make the volume creation API run as a background job which provides feedback on progress of allocation, and the ability to cancel the job. Since posix_fallocate() is an all-or-nothing kind of API it wouldn't be very helpful.
>
> What sort of performance boost does this give you? Would we perhaps be able to get close to it by writing in bigger chunks than 4k, or mmap'ing the file and then doing a memset across it?

I have a program up at [1] that gives me the following data.

[1] http://fedorapeople.org/gitweb?p=amitshah/public_git/alloc-perf.git;a=blob_p...

I compiled results for ext3, ext4, xfs and btrfs. I used the following methods to allocate a file (1 GB in size) and zero it:

- posix_fallocate()
- mmap() and memset()
- write chunks, sized 4k and 8k.

Results:

---
ext4:

posix-fallocate run time: (approx 0s)
mmap run time: (approx 13s)
4096-sized chunk run time: (approx 15s)
8192-sized chunk run time: (approx 18s)

$ sudo filefrag /mnt/ext4/*
/mnt/ext4/file-chunk4: 29 extents found
/mnt/ext4/file-chunk8: 20 extents found
/mnt/ext4/file-mmap: 38 extents found
/mnt/ext4/file-pf: 1 extent found

---
xfs:

posix-fallocate run time: (approx 0s)
mmap run time: (approx 14s)
4096-sized chunk run time: (approx 18s)
8192-sized chunk run time: (approx 19s)

$ sudo filefrag /mnt/xfs/*
/mnt/xfs/file-chunk4: 3 extents found
/mnt/xfs/file-chunk8: 4 extents found
/mnt/xfs/file-mmap: 2 extents found
/mnt/xfs/file-pf: 1 extent found

---
ext3:

posix-fallocate run time: (approx 18s)
mmap run time: (approx 20s)
4096-sized chunk run time: (approx 22s)
8192-sized chunk run time: (approx 24s)

$ sudo filefrag /mnt/ext3/*
/mnt/ext3/file-chunk4: 38 extents found, perfection would be 9 extents
/mnt/ext3/file-chunk8: 9 extents found
/mnt/ext3/file-mmap: 44 extents found, perfection would be 9 extents
/mnt/ext3/file-pf: 9 extents found

---
btrfs:

posix-fallocate run time: (approx 0s)
mmap run time: (approx 18s)
4096-sized chunk run time: (approx 17s)
8192-sized chunk run time: (approx 19s)

$ sudo filefrag /mnt/btrfs/*
FIBMAP: Invalid argument

---

I have detailed results up at
http://fedorapeople.org/gitweb?p=amitshah/public_git/alloc-perf.git;a=blob_p...

The link to the git tree is
http://fedorapeople.org/gitweb?p=amitshah/public_git/alloc-perf.git

Clearly, extents-based file systems provide a very very fast fallocate() implementation that allocates a new file and zeroes it. Since F11 is going to have ext4 by default, I strongly suggest we switch to posix_fallocate() for Linux hosts. The feedback should not matter on the newer file systems as the alloc is really fast and we anyway don't have an implementation currently for non-extent-based file systems. It really won't be missed for newer hosts.

In spite of this, if some feedback is needed for a non-extents-based file system, a run-time probe for the underlying file system can be made and we could default to a chunk-based allocation in that case.

For systems that do not implement posix_fallocate(), some configure-magic is needed.

Amit
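
One possible shape for the run-time probe suggested above, as a hedged sketch: rather than inspecting the file-system type, call the Linux-specific fallocate() directly and fall back to chunked writes when it fails with EOPNOTSUPP. glibc's posix_fallocate() instead emulates the allocation by writing to the file in that case, which would be consistent with the ext3 timing above. The helper names and the 64 KiB chunk size are assumptions made for illustration.

```c
#define _GNU_SOURCE            /* for fallocate() */
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical fallback: write zeros in fixed-size chunks. A caller
 * could report progress from inside this loop. */
static int zero_fill_chunks(int fd, off_t len)
{
    static const char zeros[64 * 1024];
    off_t done = 0;

    while (done < len) {
        size_t want = sizeof(zeros);
        if ((off_t)want > len - done)
            want = (size_t)(len - done);

        ssize_t n = pwrite(fd, zeros, want, done);
        if (n < 0 && errno == EINTR)
            continue;           /* interrupted, retry */
        if (n <= 0)
            return -1;          /* error or no progress */
        done += n;
    }
    return 0;
}

/* Returns 1 if the file system allocated the range natively, 0 if the
 * chunked fallback was used, -1 on error (errno set). */
int allocate_with_probe(int fd, off_t len)
{
    if (fallocate(fd, 0, 0, len) == 0)
        return 1;

    /* Anything other than "not supported" is a real failure. */
    if (errno != EOPNOTSUPP && errno != ENOSYS)
        return -1;

    return zero_fill_chunks(fd, len) == 0 ? 0 : -1;
}
```

The return value tells the caller which path was taken, so progress feedback only needs to be wired up for the slow one.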

On Mon, Mar 02, 2009 at 03:02:11PM +0530, Amit Shah wrote:
> On (Tue) Feb 24 2009 [11:58:31], Daniel P. Berrange wrote:
> > On Tue, Feb 24, 2009 at 05:09:31PM +0530, Amit Shah wrote:
> > ...
> > > The best case to get a non-fragmented VM image is to have it allocated completely at create-time with fallocate().
> >
> > The main problem with this change is that it'll make it harder for us to provide incremental feedback. As per the comment in the code, it is our intention to make the volume creation API run as a background job which provides feedback on progress of allocation, and the ability to cancel the job. Since posix_fallocate() is an all-or-nothing kind of API it wouldn't be very helpful.
> >
> > What sort of performance boost does this give you? Would we perhaps be able to get close to it by writing in bigger chunks than 4k, or mmap'ing the file and then doing a memset across it?
>
> I have a program up at [1] that gives me the following data.
>
> [1] http://fedorapeople.org/gitweb?p=amitshah/public_git/alloc-perf.git;a=blob_p...
>
> I compiled results for ext3, ext4, xfs and btrfs. I used the following methods to allocate a file (1 GB in size) and zero it:
>
> - posix_fallocate()
> - mmap() and memset()
> - write chunks, sized 4k and 8k.
>
> Results:

[snip details]

> The link to the git tree is
>
> http://fedorapeople.org/gitweb?p=amitshah/public_git/alloc-perf.git
>
> Clearly, extents-based file systems provide a very very fast fallocate() implementation that allocates a new file and zeroes it. Since F11 is going to have ext4 by default, I strongly suggest we switch to posix_fallocate() for Linux hosts. The feedback should not matter on the newer file systems as the alloc is really fast and we anyway don't have an implementation currently for non-extent-based file systems. It really won't be missed for newer hosts.
>
> In spite of this, if some feedback is needed for a non-extents-based file system, a run-time probe for the underlying file system can be made and we could default to a chunk-based allocation in that case.

These results are quite impressive. It is better by orders of magnitude for modern filesystems, and even on ext3 it has a slight edge. So I'm inclined to say we should use posix_fallocate() by default at all times.

When we introduce the API for incremental feedback, we could do something like call fallocate() in 1GB chunks so we can get some reasonable amount of feedback while still keeping it very fast & well allocated on disk.

And of course, just fallocate() the whole thing upfront if the user does not provide a callback for requesting feedback.
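
The chunked approach described here could be sketched as below. The callback type, the helper names, and the use of ECANCELED for a cancelled job are assumptions made for illustration; no such API existed in libvirt at the time of this thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical progress callback: return non-zero to cancel the job. */
typedef int (*alloc_progress_cb)(off_t done, off_t total, void *opaque);

/* Allocate `total` bytes in `step`-sized pieces, reporting after each
 * piece. Without a callback the whole range is allocated in one call. */
int allocate_with_progress(int fd, off_t total, off_t step,
                           alloc_progress_cb cb, void *opaque)
{
    if (cb == NULL || step <= 0)
        step = total;

    for (off_t done = 0; done < total; ) {
        off_t len = (total - done < step) ? total - done : step;

        int r = posix_fallocate(fd, done, len);
        if (r != 0) {
            errno = r;          /* posix_fallocate() does not set errno */
            return -1;
        }
        done += len;

        if (cb != NULL && cb(done, total, opaque) != 0) {
            errno = ECANCELED;
            return -1;
        }
    }
    return 0;
}

/* Example callback: counts invocations and never cancels. */
int count_calls(off_t done, off_t total, void *opaque)
{
    (void)done;
    (void)total;
    ++*(int *)opaque;
    return 0;
}
```

On a cancelled job the caller would still have to decide whether to unlink the partially allocated file.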

> For systems that do not implement posix_fallocate(), some configure-magic is needed.

Daniel

On (Wed) Mar 04 2009 [12:55:22], Daniel P. Berrange wrote:
> On Mon, Mar 02, 2009 at 03:02:11PM +0530, Amit Shah wrote:
> > In spite of this, if some feedback is needed for a non-extents-based file system, a run-time probe for the underlying file system can be made and we could default to a chunk-based allocation in that case.
>
> These results are quite impressive. It is better by orders of magnitude for modern filesystems, and even on ext3 it has a slight edge. So I'm inclined to say we should use posix_fallocate() by default at all times.
>
> When we introduce the API for incremental feedback, we could do something like call fallocate() in 1GB chunks so we can get some reasonable amount of feedback while still keeping it very fast & well allocated on disk.

Oh, btw, fallocate() seems to be an O(1) operation (at least on new file systems). Allocating a 5G file also took 0 seconds plus a few microseconds.

> And of course, just fallocate() the whole thing upfront if the user does not provide a callback for requesting feedback.

Amit

On Tue, Feb 24, 2009 at 05:09:31PM +0530, Amit Shah wrote:
> This is an untested patch to make disk allocations faster and non-fragmented. I'm using posix_fallocate() now but relying on glibc

This is not available everywhere - you need to make a configure.in test

regards
john
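
A sketch of how the C side of such a configure test might be consumed. It assumes a HAVE_POSIX_FALLOCATE macro in config.h, which is what Autoconf's AC_CHECK_FUNCS([posix_fallocate]) would define; the helper name safezero_portable() and the 4 KiB buffer are illustrative choices, not the code that was merged.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* HAVE_POSIX_FALLOCATE is assumed to come from config.h, for example
 * via AC_CHECK_FUNCS([posix_fallocate]) in configure.in. When it is
 * not defined, the portable write loop below is compiled instead.
 * Returns 0 on success, -1 with errno set on failure. */
int safezero_portable(int fd, off_t offset, off_t len)
{
#ifdef HAVE_POSIX_FALLOCATE
    int r = posix_fallocate(fd, offset, len);
    if (r != 0) {
        errno = r;              /* error number is the return value */
        return -1;
    }
    return 0;
#else
    static const char zeros[4096];

    while (len > 0) {
        size_t want = (len < (off_t)sizeof(zeros)) ? (size_t)len
                                                   : sizeof(zeros);
        ssize_t n = pwrite(fd, zeros, want, offset);
        if (n < 0 && errno == EINTR)
            continue;           /* interrupted, retry */
        if (n <= 0)
            return -1;          /* error or no progress */
        offset += n;
        len -= n;
    }
    return 0;
#endif
}
```

Either branch leaves the file at least offset + len bytes long, so callers do not need to know which one was built.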

participants (3): Amit Shah, Daniel P. Berrange, John Levon