[libvirt] [PATCH] storage: use btrfs file clone ioctl when possible

Btrfs provides a copy-on-write clone ioctl so let's try to use it instead
of copying files block by block. The ioctl is executed unconditionally if
it's available and we fall back to block copying if it fails, similarly to
cp --reflink=auto.

Signed-off-by: Oskari Saarenmaa <os@ohmu.fi>
---
 configure.ac                  |  5 +++++
 src/storage/storage_backend.c | 11 +++++++++++
 2 files changed, 16 insertions(+)

diff --git a/configure.ac b/configure.ac
index 553015a..acae92e 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1984,6 +1984,11 @@ fi
 AM_CONDITIONAL([WITH_STORAGE], [test "$with_storage" = "yes"])
 
 dnl
+dnl check for headers for filesystem specific operations
+dnl
+AC_CHECK_HEADERS([linux/btrfs.h])
+
+dnl
 dnl check for (ESX)
 dnl
 
diff --git a/src/storage/storage_backend.c b/src/storage/storage_backend.c
index b7edf85..40bfb73 100644
--- a/src/storage/storage_backend.c
+++ b/src/storage/storage_backend.c
@@ -38,6 +38,9 @@
 # include <sys/ioctl.h>
 # include <linux/fs.h>
 #endif
+#ifdef HAVE_LINUX_BTRFS_H
+# include <linux/btrfs.h>
+#endif
 
 #if WITH_SELINUX
 # include <selinux/selinux.h>
@@ -149,6 +152,13 @@ virStorageBackendCopyToFD(virStorageVolDefPtr vol,
         goto cleanup;
     }
 
+#ifdef HAVE_LINUX_BTRFS_H
+    /* try to perform a btrfs CoW clone */
+    if (ioctl(fd, BTRFS_IOC_CLONE, inputfd) == 0) {
+        goto done;
+    }
+#endif
+
 #ifdef __linux__
     if (ioctl(fd, BLKBSZGET, &wbytes) < 0) {
         wbytes = 0;
@@ -210,6 +220,7 @@ virStorageBackendCopyToFD(virStorageVolDefPtr vol,
         } while ((amtleft -= interval) > 0);
     }
 
+done:
     if (fdatasync(fd) < 0) {
         ret = -errno;
         virReportSystemError(errno, _("cannot sync data to file '%s'"),
-- 
1.8.3.1
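For readers who want to try the "clone first, copy on failure" idea outside libvirt, here is a minimal standalone sketch. It is not the libvirt code: `copy_blocks` and `clone_or_copy` are names made up for this illustration, and the sketch assumes a Linux system. It uses the generic `FICLONE` request, which has the same value as `BTRFS_IOC_CLONE`, and defines it locally so that the btrfs headers are not required.

```c
#include <errno.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>

/* Same request number as BTRFS_IOC_CLONE in <linux/btrfs.h>; defined here
 * so the sketch builds without the btrfs headers. */
#ifndef FICLONE
# define FICLONE _IOW(0x94, 9, int)
#endif

/* Hypothetical helper: plain read/write loop used when cloning fails. */
static int copy_blocks(int dst, int src)
{
    char buf[65536];
    ssize_t n;

    while ((n = read(src, buf, sizeof(buf))) != 0) {
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -errno;
        }
        for (ssize_t off = 0; off < n; ) {
            ssize_t w = write(dst, buf + off, (size_t)(n - off));
            if (w < 0) {
                if (errno == EINTR)
                    continue;
                return -errno;
            }
            off += w;
        }
    }
    return 0;
}

/* Hypothetical helper: returns 1 if the data was cloned, 0 if it was
 * copied block by block, and -errno on failure. */
static int clone_or_copy(int dst, int src)
{
    int ret;

    if (ioctl(dst, FICLONE, src) == 0)
        ret = 1;                      /* CoW clone succeeded */
    else if ((ret = copy_blocks(dst, src)) < 0)
        return ret;                   /* fallback copy failed */

    if (fdatasync(dst) < 0)
        return -errno;
    return ret;
}
```

On a filesystem without clone support the ioctl simply fails and the loop runs, which is the same behaviour the patch aims for with its `goto done` shortcut.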

On Fri, Sep 27, 2013 at 05:02:53PM +0300, Oskari Saarenmaa wrote:
> Btrfs provides a copy-on-write clone ioctl so let's try to use it instead
> of copying files block by block. The ioctl is executed unconditionally if
> it's available and we fall back to block copying if it fails, similarly to
> cp --reflink=auto.

Currently the virStorageVolCreateXMLFrom method does a full allocation of
storage when cloning volumes. This means applications can rely on the image
having enough space when clone completes and won't get ENOSPC in the VM.

AFAICT, this change to do copy-on-write changes the API to do thin
provisioning of the storage during clone, so any future write on either
the new or old volume may generate ENOSPC when btrfs finally copies the
sector.

I don't think this is a good thing. I think applications should have to
explicitly request copy-on-write behaviour for the clone so they know the
implications.

Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|
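The difference between a fully allocated and a thinly provisioned image can be observed from user space. The following is a rough sketch, not anything taken from libvirt: `is_fully_allocated` is a made-up name, and the code assumes that `st_blocks` counts 512-byte units, as it does on Linux.

```c
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper: returns 1 if the file has at least as much space
 * allocated as its apparent size, 0 if it is sparse, -1 if fstat() fails.
 * Assumes st_blocks counts 512-byte units, as on Linux. */
static int is_fully_allocated(int fd)
{
    struct stat st;

    if (fstat(fd, &st) < 0)
        return -1;
    return (long long)st.st_blocks * 512 >= (long long)st.st_size;
}
```

Note that this check only detects holes. As far as I know it cannot tell whether the allocated blocks are shared with another file through a clone, which is exactly the case raised above; detecting that would need something like FIEMAP and its shared-extent flag.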

On Fri, Sep 27, 2013 at 03:19:06PM +0100, Daniel P. Berrange wrote:
> On Fri, Sep 27, 2013 at 05:02:53PM +0300, Oskari Saarenmaa wrote:
> > Btrfs provides a copy-on-write clone ioctl so let's try to use it
> > instead of copying files block by block. The ioctl is executed
> > unconditionally if it's available and we fall back to block copying if
> > it fails, similarly to cp --reflink=auto.
>
> Currently the virStorageVolCreateXMLFrom method does a full allocation of
> storage when cloning volumes. This means applications can rely on the
> image having enough space when clone completes and won't get ENOSPC in
> the VM.
>
> AFAICT, this change to do copy-on-write changes the API to do thin
> provisioning of the storage during clone, so any future write on either
> the new or old volume may generate ENOSPC when btrfs finally copies the
> sector.
>
> I don't think this is a good thing. I think applications should have to
> explicitly request copy-on-write behaviour for the clone so they know the
> implications.

That's a good point. However, it looks like this change would only change
the behavior for the old volumes; new volumes are always created sparsely
and they may already get ENOSPC on write if they contained zero blocks.
This should probably be fixed by calling fallocate instead of lseek when
noticing empty blocks (safezero should probably be used instead, but it's
currently rather unsafe if posix_fallocate isn't available.)

I was wondering if we could reuse the allocation and capacity fields to
decide whether or not to try to do a cow-clone (or sparse allocation of the
cloned bits)? Currently a cloned volume's allocation is always set to at
least the original volume's capacity and the original client-requested
allocation value is not passed on to the code doing the cloning, but we
could pass it on and allow copy-on-write clones if allocation is set to
zero (no space is guaranteed to be available for writing) and also change
sparse cloning to only happen if allocation is lower than capacity.

/ Oskari

On Mon, Sep 30, 2013 at 01:21:18AM +0300, Oskari Saarenmaa wrote:
> On Fri, Sep 27, 2013 at 03:19:06PM +0100, Daniel P. Berrange wrote:
> > On Fri, Sep 27, 2013 at 05:02:53PM +0300, Oskari Saarenmaa wrote:
> > > Btrfs provides a copy-on-write clone ioctl so let's try to use it
> > > instead of copying files block by block. The ioctl is executed
> > > unconditionally if it's available and we fall back to block copying
> > > if it fails, similarly to cp --reflink=auto.
> >
> > Currently the virStorageVolCreateXMLFrom method does a full allocation
> > of storage when cloning volumes. This means applications can rely on
> > the image having enough space when clone completes and won't get ENOSPC
> > in the VM.
> >
> > AFAICT, this change to do copy-on-write changes the API to do thin
> > provisioning of the storage during clone, so any future write on either
> > the new or old volume may generate ENOSPC when btrfs finally copies the
> > sector.
> >
> > I don't think this is a good thing. I think applications should have to
> > explicitly request copy-on-write behaviour for the clone so they know
> > the implications.
>
> That's a good point. However, it looks like this change would only change
> the behavior for the old volumes; new volumes are always created sparsely
> and they may already get ENOSPC on write if they contained zero blocks.
> This should probably be fixed by calling fallocate instead of lseek when
> noticing empty blocks (safezero should probably be used instead, but it's
> currently rather unsafe if posix_fallocate isn't available.)
>
> I was wondering if we could reuse the allocation and capacity fields to
> decide whether or not to try to do a cow-clone (or sparse allocation of
> the cloned bits)? Currently a cloned volume's allocation is always set to
> at least the original volume's capacity and the original client-requested
> allocation value is not passed on to the code doing the cloning, but we
> could pass it on and allow copy-on-write clones if allocation is set to
> zero (no space is guaranteed to be available for writing) and also change
> sparse cloning to only happen if allocation is lower than capacity.

I think just having a VIR_STORAGE_VOL_CLONE_COPY_ON_WRITE flag for the API
would suffice.

Daniel
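A flag-gated variant of the clone attempt might look like the sketch below. The flag name comes from the message above, but its numeric value, the helper name `maybe_cow_clone`, and the choice to report a failed clone to the caller instead of silently falling back are all assumptions made for this illustration. The sketch assumes Linux and defines `FICLONE` (same value as `BTRFS_IOC_CLONE`) locally.

```c
#include <errno.h>
#include <sys/ioctl.h>

/* Same request number as BTRFS_IOC_CLONE in <linux/btrfs.h>. */
#ifndef FICLONE
# define FICLONE _IOW(0x94, 9, int)
#endif

/* Flag name as proposed in the thread; the value here is made up. */
#define VIR_STORAGE_VOL_CLONE_COPY_ON_WRITE (1u << 0)

/* Hypothetical helper. Returns 0 if CoW was not requested (the caller
 * should do a normal full copy), 1 if the clone succeeded, and -errno
 * if CoW was requested but the clone failed. */
static int maybe_cow_clone(int dst, int src, unsigned int flags)
{
    if (!(flags & VIR_STORAGE_VOL_CLONE_COPY_ON_WRITE))
        return 0;
    if (ioctl(dst, FICLONE, src) < 0)
        return -errno;
    return 1;
}
```

With this shape, existing callers that pass no flags keep the full-allocation guarantee, and only applications that opt in take on the thin-provisioning risk described earlier in the thread.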
participants (2)
- Daniel P. Berrange
- Oskari Saarenmaa