On Mon, Mar 02, 2009 at 03:02:11PM +0530, Amit Shah wrote:
> On (Tue) Feb 24 2009 [11:58:31], Daniel P. Berrange wrote:
> > On Tue, Feb 24, 2009 at 05:09:31PM +0530, Amit Shah wrote:
> ...
> > > The best case to get a non-fragmented VM image is to have it allocated
> > > completely at create-time with fallocate().
> >
> > The main problem with this change is that it'll make it harder for
> > us to provide incremental feedback. As per the comment in the code,
> > it is our intention to make the volume creation API run as a background
> > job which provides feedback on progress of allocation, and the ability
> > to cancel the job. Since posix_fallocate() is an all-or-nothing kind of
> > API it wouldn't be very helpful.
> >
> > What sort of performance boost does this give you ? Would we perhaps
> > be able to get close to it by writing in bigger chunks than 4k, or
> > mmap'ing the file and then doing a memset across it ?
>
> I have a program up at [1] that gives me the following data.
>
> [1] http://fedorapeople.org/gitweb?p=amitshah/public_git/alloc-perf.git;a=blo...
>
> I compiled results for ext3, ext4, xfs and btrfs. I used the following
> methods to allocate a file (1 GB in size) and zero it:
> - posix_fallocate()
> - mmap() and memset()
> - write chunks, sized 4k and 8k.
>
> Results:
> [snip details]
>
> The link to the git tree is
> http://fedorapeople.org/gitweb?p=amitshah/public_git/alloc-perf.git
>
> Clearly, extents-based file systems provide a very fast fallocate()
> implementation that allocates a new file and zeroes it. Since F11 is
> going to have ext4 by default, I strongly suggest we switch to
> posix_fallocate() for Linux hosts. The feedback should not matter on the
> newer file systems as the alloc is really fast, and we currently don't
> have a feedback implementation for non-extent-based file systems anyway.
> It really won't be missed on newer hosts.
>
> In spite of this, if some feedback is needed for a non-extents-based file
> system, a run-time probe for the underlying file system can be made and
> we could default to a chunk-based allocation in that case.

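For what it's worth, a run-time probe of that kind could look roughly
like the sketch below. This is not existing libvirt code: the helper
name fs_is_extent_based() and the FS_MAGIC_* macro names are made up
here, and it assumes a Linux host where statfs() reports the file
system magic in f_type. One caveat: ext2, ext3 and ext4 all report the
same magic number, so statfs() alone cannot tell ext3 from ext4, and
the helper answers "not known to be extent-based" for all of them.

```c
#include <sys/vfs.h>   /* statfs(); Linux-specific header */

/* Magic numbers as reported in statfs.f_type on Linux. */
#define FS_MAGIC_EXT   0xEF53UL      /* shared by ext2, ext3 and ext4 */
#define FS_MAGIC_XFS   0x58465342UL
#define FS_MAGIC_BTRFS 0x9123683EUL

/*
 * Hypothetical helper: returns 1 if the file system holding `path` is
 * one that always uses extents (xfs, btrfs), 0 if it is not or cannot
 * be told apart (including the whole ext family), -1 if statfs() fails.
 */
int fs_is_extent_based(const char *path)
{
    struct statfs st;
    unsigned long magic;

    if (statfs(path, &st) < 0)
        return -1;

    /* f_type may be a signed long; compare on the low 32 bits only. */
    magic = (unsigned long)st.f_type & 0xffffffffUL;
    return magic == FS_MAGIC_XFS || magic == FS_MAGIC_BTRFS;
}
```

A caller would fall back to chunked writes whenever this returns 0 or
-1, which is the conservative choice.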
These results are quite impressive. It is better by orders of magnitude
for modern filesystems, and even on ext3 it has a slight edge. So I'm
inclined to say we should use posix_fallocate() by default at all
times.
When we introduce the API for incremental feedback, we could do something
like call fallocate() in 1GB chunks so we can get some reasonable amount
of feedback while still keeping it very fast & well allocated on disk.
And of course, just fallocate() the whole thing upfront if the user does
not provide a callback for requesting feedback.
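To make the idea concrete, here is a minimal sketch of chunked
allocation with a progress callback. None of this is existing API: the
names alloc_with_feedback(), alloc_progress_cb and count_calls(), the
"non-zero return means cancel" convention and the 1 GiB default are
all assumptions for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical callback type: return non-zero to cancel the job. */
typedef int (*alloc_progress_cb)(off_t done, off_t total, void *opaque);

#define ALLOC_CHUNK_DEFAULT ((off_t)1 << 30)   /* 1 GiB */

/*
 * Allocate `total` bytes for fd starting at offset 0.
 * Returns 0 on success, an errno value if posix_fallocate() fails,
 * or -1 if the callback asked to cancel.
 */
int alloc_with_feedback(int fd, off_t total, off_t chunk,
                        alloc_progress_cb cb, void *opaque)
{
    off_t done = 0;

    if (total <= 0)
        return EINVAL;

    /* No callback: allocate the whole thing in one call. */
    if (cb == NULL)
        return posix_fallocate(fd, 0, total);

    if (chunk <= 0)
        chunk = ALLOC_CHUNK_DEFAULT;

    while (done < total) {
        off_t len = (total - done < chunk) ? total - done : chunk;
        int rc = posix_fallocate(fd, done, len);

        if (rc != 0)
            return rc;          /* posix_fallocate() returns the error number */
        done += len;
        if (cb(done, total, opaque) != 0)
            return -1;          /* cancelled by the caller */
    }
    return 0;
}

/* Example callback: counts how often it was invoked, never cancels. */
int count_calls(off_t done, off_t total, void *opaque)
{
    (void)done;
    (void)total;
    ++*(int *)opaque;
    return 0;
}
```

A cancelled job leaves the already-allocated part in place, so the
caller would still need to decide whether to truncate or unlink the
file.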
For systems that do not implement posix_fallocate(), some
configure-magic is needed.
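As a sketch of what that might look like on the C side: assume
configure.ac runs AC_CHECK_FUNCS([posix_fallocate]), which defines
HAVE_POSIX_FALLOCATE in config.h when the function is found. The
wrapper name zero_fill_range() and the 64k buffer size are made up for
this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical wrapper: allocate and zero `len` bytes at `offset`.
 * Returns 0 on success or an errno value, like posix_fallocate().
 */
int zero_fill_range(int fd, off_t offset, off_t len)
{
    if (offset < 0 || len <= 0)
        return EINVAL;

#ifdef HAVE_POSIX_FALLOCATE
    return posix_fallocate(fd, offset, len);
#else
    /* Fallback: write zeroes in chunks. */
    char buf[64 * 1024];

    memset(buf, 0, sizeof(buf));
    if (lseek(fd, offset, SEEK_SET) < 0)
        return errno;

    while (len > 0) {
        size_t want = (len < (off_t)sizeof(buf)) ? (size_t)len : sizeof(buf);
        ssize_t n = write(fd, buf, want);

        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted, retry the same chunk */
            return errno;
        }
        len -= n;               /* short writes just loop again */
    }
    return 0;
#endif
}
```

Note the fallback moves the file offset while posix_fallocate() does
not, so callers should not rely on the position afterwards.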
Daniel
--
|: Red Hat, Engineering, London   -o-   http://people.redhat.com/berrange/ :|
|: http://libvirt.org  -o-  http://virt-manager.org  -o-  http://ovirt.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|