
On (Tue) Feb 24 2009 [11:58:31], Daniel P. Berrange wrote:
On Tue, Feb 24, 2009 at 05:09:31PM +0530, Amit Shah wrote: ...
The best way to get a non-fragmented VM image is to have it allocated completely at create-time with fallocate().
The main problem with this change is that it'll make it harder for us to provide incremental feedback. As per the comment in the code, it is our intention to make the volume creation API run as a background job which provides feedback on progress of allocation, and the ability to cancel the job. Since posix_fallocate() is an all-or-nothing kind of API it wouldn't be very helpful.
What sort of performance boost does this give you? Would we perhaps be able to get close to it by writing in bigger chunks than 4k, or mmap'ing the file and then doing a memset across it?
I have a program up at [1] that gives me the following data.

[1] http://fedorapeople.org/gitweb?p=amitshah/public_git/alloc-perf.git;a=blob_p...

I compiled results for ext3, ext4, xfs and btrfs. I used the following methods to allocate a file (1 GB in size) and zero it:

- posix_fallocate()
- mmap() and memset()
- write chunks, sized 4k and 8k.

Results:

---
ext4:

posix-fallocate run time:   (approx 0s)
mmap run time:              (approx 13s)
4096-sized chunk run time:  (approx 15s)
8192-sized chunk run time:  (approx 18s)

$ sudo filefrag /mnt/ext4/*
/mnt/ext4/file-chunk4: 29 extents found
/mnt/ext4/file-chunk8: 20 extents found
/mnt/ext4/file-mmap: 38 extents found
/mnt/ext4/file-pf: 1 extent found

---
xfs:

posix-fallocate run time:   (approx 0s)
mmap run time:              (approx 14s)
4096-sized chunk run time:  (approx 18s)
8192-sized chunk run time:  (approx 19s)

$ sudo filefrag /mnt/xfs/*
/mnt/xfs/file-chunk4: 3 extents found
/mnt/xfs/file-chunk8: 4 extents found
/mnt/xfs/file-mmap: 2 extents found
/mnt/xfs/file-pf: 1 extent found

---
ext3:

posix-fallocate run time:   (approx 18s)
mmap run time:              (approx 20s)
4096-sized chunk run time:  (approx 22s)
8192-sized chunk run time:  (approx 24s)

$ sudo filefrag /mnt/ext3/*
/mnt/ext3/file-chunk4: 38 extents found, perfection would be 9 extents
/mnt/ext3/file-chunk8: 9 extents found
/mnt/ext3/file-mmap: 44 extents found, perfection would be 9 extents
/mnt/ext3/file-pf: 9 extents found

---
btrfs:

posix-fallocate run time:   (approx 0s)
mmap run time:              (approx 18s)
4096-sized chunk run time:  (approx 17s)
8192-sized chunk run time:  (approx 19s)

$ sudo /mnt/btrfs/*
FIBMAP: Invalid argument

---

I have detailed results up at
http://fedorapeople.org/gitweb?p=amitshah/public_git/alloc-perf.git;a=blob_p...

The link to the git tree is
http://fedorapeople.org/gitweb?p=amitshah/public_git/alloc-perf.git

Clearly, extents-based file systems provide a very very fast fallocate() implementation that allocates a new file and zeroes it.
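For readers who do not want to fetch the tree, the three allocation methods compared above can be sketched roughly as follows. This is not the program at [1], only a minimal illustration of what each method does; the function names (alloc_posix_fallocate, alloc_mmap, alloc_chunks) are made up for this sketch, and timing code is left out.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Method 1: ask for the whole range in a single call.
 * posix_fallocate() returns 0 or an error number; it does not set errno. */
static int alloc_posix_fallocate(int fd, off_t size)
{
    return posix_fallocate(fd, 0, size);
}

/* Method 2: extend the file, map it, and memset() across the mapping. */
static int alloc_mmap(int fd, off_t size)
{
    void *p;

    if (ftruncate(fd, size) < 0)
        return -1;
    p = mmap(NULL, (size_t)size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return -1;
    memset(p, 0, (size_t)size);
    return munmap(p, (size_t)size);
}

/* Method 3: write zero-filled chunks (4096 or 8192 bytes in the runs above). */
static int alloc_chunks(int fd, off_t size, size_t chunk)
{
    char *buf = calloc(1, chunk);
    off_t done = 0;
    int ret = 0;

    if (!buf)
        return -1;
    if (lseek(fd, 0, SEEK_SET) < 0) {
        free(buf);
        return -1;
    }
    while (done < size) {
        size_t want = (size - done) < (off_t)chunk ? (size_t)(size - done) : chunk;
        ssize_t n = write(fd, buf, want);

        if (n < 0 && errno == EINTR)
            continue;               /* interrupted, retry the same chunk */
        if (n <= 0) {
            ret = -1;               /* real error, e.g. ENOSPC */
            break;
        }
        done += n;
    }
    free(buf);
    return ret;
}
```

Only the chunk loop has a natural place to report progress or check for cancellation between writes; the other two are a single call or a single memset() from the caller's point of view.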
Since F11 is going to have ext4 by default, I strongly suggest we switch to posix_fallocate() for Linux hosts. The feedback should not matter on the newer file systems, as the allocation is really fast, and we don't currently have an implementation for non-extent-based file systems anyway. It really won't be missed on newer hosts.

If, in spite of this, some feedback is needed on a non-extent-based file system, a run-time probe of the underlying file system can be made, and we could default to chunk-based allocation in that case.

For systems that do not implement posix_fallocate(), some configure-magic is needed.

Amit
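One possible shape for the run-time probe with a chunk-based fallback is sketched below. It rests on assumptions that are not part of the proposal above: it probes with the Linux-specific fallocate() call rather than posix_fallocate(), because fallocate() fails with EOPNOTSUPP when the file system cannot preallocate, whereas glibc's posix_fallocate() quietly falls back to writing into every block. The names progress_cb, fill_chunks, allocate_volume and count_calls are hypothetical and are not libvirt API.

```c
#define _GNU_SOURCE             /* for fallocate() */
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical feedback hook: return non-zero to cancel the allocation. */
typedef int (*progress_cb)(off_t done, off_t total, void *opaque);

/* Slow path: zero-filled chunks, reporting progress after every write. */
static int fill_chunks(int fd, off_t size, size_t chunk,
                       progress_cb cb, void *opaque)
{
    char *buf = calloc(1, chunk);
    off_t done = 0;
    int ret = 0;

    if (!buf)
        return -1;
    if (lseek(fd, 0, SEEK_SET) < 0) {
        free(buf);
        return -1;
    }
    while (done < size) {
        size_t want = (size - done) < (off_t)chunk ? (size_t)(size - done) : chunk;
        ssize_t n = write(fd, buf, want);

        if (n < 0 && errno == EINTR)
            continue;
        if (n <= 0) {
            ret = -1;
            break;
        }
        done += n;
        if (cb && cb(done, size, opaque) != 0) {
            errno = ECANCELED;      /* caller asked us to stop */
            ret = -1;
            break;
        }
    }
    free(buf);
    return ret;
}

/* Probe: try the fast path first, fall back only if it is unsupported. */
static int allocate_volume(int fd, off_t size, progress_cb cb, void *opaque)
{
    if (fallocate(fd, 0, 0, size) == 0) {
        if (cb)
            cb(size, size, opaque); /* one report: everything is done */
        return 0;
    }
    if (errno != EOPNOTSUPP && errno != ENOSYS)
        return -1;                  /* genuine failure, e.g. ENOSPC */
    return fill_chunks(fd, size, 4096, cb, opaque);
}

/* Example callback: counts progress reports and never cancels. */
static int count_calls(off_t done, off_t total, void *opaque)
{
    (void)done;
    (void)total;
    ++*(int *)opaque;
    return 0;
}
```

With this layout the caller sees a single progress report on file systems where preallocation is supported, and per-chunk reports with the chance to cancel everywhere else.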