Re: [libvirt] [PATCH] Use posix_fallocate() to allocate disk space

24 Feb 2009

On (Tue) Feb 24 2009 [11:58:31], Daniel P. Berrange wrote:
...
On Tue, Feb 24, 2009 at 05:09:31PM +0530, Amit Shah wrote:
...
Hi,
This is an untested patch to make disk allocations faster and
non-fragmented. I'm using posix_fallocate() now, relying on glibc
actually calling fallocate() where the file system supports it, since
that should be the fastest path.
- This fails to build. Does libutil need to be added as a dependency?
../src/.libs/libvirt_driver_storage.a(storage_backend_fs.o): In function
`virStorageBackendFileSystemVolCreate':
/home/amit/src/libvirt/src/storage_backend_fs.c:1023: undefined
reference to `safezero'
You'd need to add 'safezero' to src/libvirt_private.syms to allow it
to be linked to by the storage driver.
Thanks; builds now.
...
...
- What's vol->capacity? Why is ftruncate() needed after the call to
  (current) safewrite()? My assumption is that the user can specify some
  max. capacity and wish to allocate only a chunk off it at create-time.
  Is that correct?
"allocation" refers to the current physical usage of the volume
"capacity" refers to the logical size of the volume
So, you can have a raw sparse file of size 4 GB, but not allocate any disk
space upfront; it is just allocated on demand when the guest writes to it.
Or you can allocate 1 GB upfront, and leave the rest unallocated. So this code is
first filling out the upfront allocation the user requested, and then using
ftruncate() to extend to a (possibly larger) logical size.
Similarly for qcow files, capacity refers to the logical disk size,
but qcow grows on demand, so allocation will be much lower.
Usually allocation <= capacity, but if the volume format has metadata
overhead, you can get to a place where allocation > capacity if the
entire volume has been written to.
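
A minimal sketch of the sequence described above for a raw file, assuming
a POSIX system: reserve the upfront allocation, then extend to the logical
capacity with ftruncate(). The helper name create_raw_volume() and its
parameters are hypothetical and are not taken from the libvirt source.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create a raw volume with `allocation` bytes
 * reserved up front and a logical size of `capacity` bytes.
 * Returns 0 on success, -1 on failure. */
int create_raw_volume(const char *path, off_t allocation, off_t capacity)
{
    int fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return -1;

    /* Reserve the upfront allocation. posix_fallocate() returns an
     * error number directly instead of setting errno. */
    if (allocation > 0 && posix_fallocate(fd, 0, allocation) != 0)
        goto error;

    /* Extend the logical size. On file systems with sparse-file
     * support, the range past `allocation` remains a hole. */
    if (capacity > allocation && ftruncate(fd, capacity) < 0)
        goto error;

    if (close(fd) < 0) {
        unlink(path);
        return -1;
    }
    return 0;

error:
    close(fd);
    unlink(path);
    return -1;
}
```

After a call such as create_raw_volume(path, 1 GB, 4 GB), stat() should
report st_size equal to the capacity, while on Linux st_blocks * 512
roughly tracks the allocation.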
This case had me puzzled. Thanks for the explanation!
...
...
The best case to get a non-fragmented VM image is to have it allocated
completely at create-time with fallocate().
The main problem with this change is that it'll make it harder for
us to provide incremental feedback. As per the comment in the code, 
it is our intention to make the volume creation API run as a background
job which provides feedback on progress of allocation, and the ability
to cancel the job. Since posix_fallocate() is an all-or-nothing kind of
API it wouldn't be very helpful.
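
One possible middle ground, sketched here as an assumption rather than
anything proposed in the patch: call posix_fallocate() on successive
ranges and report progress between calls. The names allocate_with_progress(),
progress_cb and record_percent() are hypothetical. Whether chunked
preallocation keeps the file as contiguous as a single call depends on
the file system and would need measuring.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical callback type: return non-zero to cancel the job. */
typedef int (*progress_cb)(off_t done, off_t total, void *opaque);

/* Preallocate `total` bytes in `chunk`-sized steps, invoking `cb`
 * after each step. Returns 0 on success, ECANCELED if the callback
 * asked to stop, or the error number from posix_fallocate(). */
int allocate_with_progress(int fd, off_t total, off_t chunk,
                           progress_cb cb, void *opaque)
{
    off_t done = 0;

    if (chunk <= 0)
        return EINVAL;

    while (done < total) {
        off_t len = (total - done < chunk) ? total - done : chunk;
        int rc = posix_fallocate(fd, done, len);
        if (rc != 0)
            return rc;
        done += len;
        if (cb && cb(done, total, opaque) != 0)
            return ECANCELED;
    }
    return 0;
}

/* Example callback: store the percentage completed in *opaque. */
int record_percent(off_t done, off_t total, void *opaque)
{
    *(int *)opaque = (int)(done * 100 / total);
    return 0;
}
```

A caller could pass a chunk size of, say, a few hundred megabytes so that
each step still lets the file system allocate large extents while the
callback updates a progress indicator or checks a cancel flag.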
What sort of performance boost does this give you?  Would we perhaps
be able to get close to it by writing in bigger chunks than 4k, or
mmap'ing the file and then doing a memset across it?
If the file system is asked to zero out a particular block of data
(using extents, as is possible in xfs and ext4), it's going to be the
fastest method available. Definitely faster than writing chunks in
userspace.

Of course, my patch is based on untested stuff. I initially started out
by wanting to have the image as defragmented as possible.

There are various parameters which "fast" depends on:

- do we want the image creation to be the fastest operation? (fast, yes,
  fastest, at the expense of something else, like guest runtime, maybe not.)
- how do guests cope with fragmented images? If the data in the image
  itself is fragmented, what good is an unfragmented image?

The problem with writing chunks of any size is that it can easily
lead to a lot of fragmentation. If we have the answers to the two
questions above, we can make a decision based on actual numbers.

If it turns out that a defragmented image file is much better, the pretty
graphics showing % complete will have to stay in the state they're
currently in ;-)

I don't have an ext4 file system on actual hardware to test -- I'll
provide numbers when I get to set one up, though.