[Libvir] Thoughts on remote storage support

Hello!

I had a hobby project, using a model similar to libvirt's remote support, where I needed to manipulate Xen disk images on remote systems. Based on what I learnt from it, I came up with a possible model for libvirt's remote storage support. I present it here for discussion.

We typically store the images in volumes on LVM, or in dedicated file system folders. The folders and volume groups usable by libvirt can be limited in a config file.

It is probably not necessary to differentiate between defined and created files, as you cannot stop and start a file like a domain: you either have it on disk, or not.

Libvirt should not store information on these files; everything should be checked/listed on the fly, so that if you just copy an image to a directory, libvirt can deduce all the information it can about it, and handle it just as if it had created the file itself.

The handle for the file is its path, plus its virConnect object (i.e. the host it is on). For consistency, it may be possible to create an object for it, but as disk images have no persistent properties apart from what is on the disk, and those can always be checked from there, it provides no extra functionality.

I think there is no need to support remote files explicitly, as the domains mount local files/volumes. The file/volume may actually be mounted from a NAS or SAN, of course, but that does not matter because we use the local path names, and AFAIK all virtualization tools use local files or local devices as block devices.

I have added compression to the mix because it is immensely useful. I used lzop in my project, and a full backup and restore was much faster with a compressed backup file than with an uncompressed one. It conserves disk space, as well as CPU/bus capacity.

Zeroing out newly allocated files helps with compressed backups, as well as with security. It also means that no holey (sparse) files can be used.

The objects we are dealing with are disk images.
They have the following properties:
-Path: the Unix path of the file (/mnt/images/fc7.img or /dev/VG/fc7)
-Compression: mountable/compressed
-Type: plain file/LVM volume/what else?
-Size
-Filesystem: swap/ext3/xfs/...
-Is it mounted?

We can do the following operations on the images:

Create -connection -filepath -size
Allocates a new image of the given type, size and name. Libvirt should parse the filepath, determine the base path, check whether it is a directory or a VG, check whether libvirt is allowed to operate on that path/VG, and then create the file/volume. For security reasons, zeroing out the allocated space should be a non-optional step of the allocation.

DirectoryList -connection -directorypath
Plain ls functionality: returns the list of files and any subdirectories. If called on a VG, it returns the volumes in it.

Info -connection -filepath
Returns information on the given file/volume, including size, type, filesystem (if available), whether it is a snapshot (if a volume), and whether it is mounted. The size can be determined by ls or lvs (or lvdisplay), the filesystem by the 'file' command.

Delete -connection -filepath
Deletes the file/volume: find out whether it is a file or a volume, then rm or lvremove it.

Grow -connection -path -filename -newsize
Grows the specified image to the given size. The newly added space is zeroed out.

Shrink? -connection -path -filename -newsize
Shrinks the specified image to the given size. This is very tricky, because to avoid data loss we need to analyze the filesystem size. Of course, we can just say that it is the responsibility of the user; after all, we allow them to delete the file outright as well. We may combine it with Grow and call it Resize.

Growfs -connection -path -filename -newsize?
Grows the filesystem on the image to fill the size of the image, or to the given newsize. Can only be used on unmounted images. It is necessary because some filesystems cannot be grown while mounted, so the guest cannot do it on its own.
Shrinkfs -connection -path -filename -newsize
Shrinks the filesystem to the given size. The same considerations apply as with Growfs; the two may be combined into Resizefs?

Snapshot -connection -filepath -filesize
Creates a snapshot of the given image. This is only possible with images on LVM. Should return the snapshot image name. The snapshot can later be deleted with Delete.

CopyTo -connection -source filepath -target connection -target filepath -snapshot flag -archive flag -overwrite flag
Copies the source image to the target image. If the target connection is on another machine, it is a network copy. If the snapshot flag is set, first create a snapshot of the source image, copy that, then delete the snapshot. If the archive flag is set, the target file will be an archive file (compressed). If the overwrite flag is set, the target file is overwritten if it exists; otherwise existing files are not changed. Even if the source file is compressed, the target file is uncompressed unless the archive flag is set.

CopyContents -connection -source filepath -target connection -target filepath -snapshot flag
Copies the contents of the source file to the target file. The target file must already exist and be no smaller than the source file. The contents of the target file are overwritten, and any extra space is zeroed out.

Archive -connection -filepath
Compresses the given file. Makes sense only on files, not volumes.

Unarchive -connection -filepath
Uncompresses the given file. Makes sense only on archived (compressed) files.

StorageInfo -connection
Returns information on the node's storage configuration: what kinds of filesystems it can handle, which file/VG paths are accessible, how much free space they have, etc.
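The Create operation above combines two checks. The target's parent directory must be one of the directories allowed in the config file. The new file must be filled with real zero bytes rather than left sparse.

Below is a minimal sketch of the plain-file case under those assumptions. The function names and the `allowed_dirs` array are hypothetical stand-ins for whatever libvirt would read from its configuration. The LVM branch (lvcreate, then zeroing the volume) is left out.

```c
/* Hypothetical sketch of Create for plain files; not libvirt code.
 * allowed_dirs stands in for a directory list read from a config file. */
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Return 1 if the directory containing `path` is one of `allowed_dirs`. */
static int image_dir_allowed(const char *path, const char *const *allowed_dirs,
                             size_t n_allowed)
{
    char dir[PATH_MAX], resolved[PATH_MAX], allowed[PATH_MAX];
    const char *slash;
    size_t i, len;

    assert(path != NULL);
    if (path[0] != '/' || (slash = strrchr(path, '/')) == path)
        return 0;                        /* absolute paths below "/" only */
    len = (size_t)(slash - path);
    if (len >= sizeof(dir))
        return 0;
    memcpy(dir, path, len);
    dir[len] = '\0';

    /* Resolve symlinks and "..", so "/images/../etc" is caught. */
    if (realpath(dir, resolved) == NULL)
        return 0;
    for (i = 0; i < n_allowed; i++)
        if (realpath(allowed_dirs[i], allowed) != NULL &&
            strcmp(resolved, allowed) == 0)
            return 1;
    return 0;
}

/* Create a new file and write `size` zero bytes into it, so that every
 * block is really allocated (no holes) and no stale data is exposed. */
static int image_create_zeroed(const char *path, off_t size)
{
    static const char zeros[64 * 1024];  /* static storage: all zeros */
    off_t left = size;
    int fd, rc;

    assert(path != NULL && size >= 0);
    fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);  /* never clobber */
    if (fd < 0)
        return -1;
    while (left > 0) {
        size_t chunk = left < (off_t)sizeof(zeros) ? (size_t)left : sizeof(zeros);
        ssize_t n = write(fd, zeros, chunk);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            close(fd);
            unlink(path);
            return -1;
        }
        left -= n;
    }
    rc = fsync(fd);
    if (close(fd) < 0)
        rc = -1;
    if (rc < 0) {
        unlink(path);
        return -1;
    }
    return 0;
}

/* Hypothetical entry point: policy check first, then allocation. */
static int image_create(const char *path, off_t size,
                        const char *const *allowed_dirs, size_t n_allowed)
{
    if (!image_dir_allowed(path, allowed_dirs, n_allowed)) {
        errno = EACCES;
        return -1;
    }
    return image_create_zeroed(path, size);
}
```

Writing the zeros explicitly, instead of extending the file with ftruncate(), is what rules out holey files. The cost is one full write of the image, which is the price of making zeroing non-optional.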
A typical usage scenario could be something like this:

Aconn = getVirconn("ssh:Ahost");   // Open the connection to host A
Bconn = getVirconn("ssh:Bhost");   // Host B will hold our backup image
Ainfo = StorageInfo(Aconn);        // Get Ainfo
AVGPath = <get the first usable VG path from Ainfo>
NewImage = Create(Aconn, concat(AVGPath, "newimage"), 100000);
    // A 100 MB volume is created and zeroed out
CopyContents(Aconn, "/images/ghost/FC7default.img", Aconn, NewImage, snapshot=no);
    // Copy our pre-created FC7 image to the new image
Growfs(Aconn, NewImage, 0);        // Grow the copied filesystem to fill the whole volume
<Define a new domain, using NewImage as the backing image for the guest's block device>
<Start the new domain>
CopyTo(Aconn, NewImage, Bconn, "/mnt/backups/backup23image", snapshot=yes, archive=yes, overwrite=no);
    // Make an LVM snapshot of NewImage, copy it to host B under the given filename,
    // compressing it on the fly, then remove the snapshot
<Stop the domain>
CopyContents(Bconn, "/mnt/backups/backup23image", Aconn, NewImage, snapshot=no);
    // Restore the backed-up image to NewImage, decompressing it on the fly
<Start the domain>
<Stop the domain>
Grow(Aconn, NewImage, 200000);     // Grow the volume to 200 MB
Growfs(Aconn, NewImage);           // Grow the filesystem to fill the volume
<Start the domain>
.....

Best regards,
István

Tóth István wrote:
Hello!
I had a hobby project, using a model similar to libvirt's remote support, where I needed to manipulate Xen disk images on remote systems. Based on what I learnt from it, I came up with a possible model for libvirt's remote storage support. I present it here for discussion.
We typically store the images in volumes on LVM, or in dedicated file system folders. The folders and volume groups usable by libvirt can be limited in a config file.
It is probably not necessary to differentiate between defined and created files, as you cannot stop and start a file like a domain: you either have it on disk, or not.
Agreed.
Libvirt should not store information on these files; everything should be checked/listed on the fly, so that if you just copy an image to a directory, libvirt can deduce all the information it can about it, and handle it just as if it had created the file itself.
Agreed.
The handle for the file is its path, plus its virConnect object (i.e. the host it is on). For consistency, it may be possible to create an object for it, but as disk images have no persistent properties apart from what is on the disk, and it can always be checked from there, it provides no extra functionality.
Probably the 'handle' is its filename or device name + the virDomain object.

For example, here are the domains and their images running on my Xen host at the moment. I got this by writing a simple script which parses the domain XML:

fc6_0:
  /var/lib/xen/images/fc6_0.img -> xvda
  /var/lib/xen/images/home.disk -> xvdb
fc6_1:
  /var/lib/xen/images/fc6_1.img -> xvda
debian32fv:
  /var/lib/xen/images/debian32fv.img -> hda
f764pv:
  /dev/Images/f764pv -> xvda
freebsd32fv:
  /var/lib/xen/images/freebsd32fv.img -> hda
  [CD] -> hdc
gentoo32fv:
  /var/lib/xen/images/gentoo32fv.img -> hda
I think there is no need to support remote files explicitly, as the domains mount local files/volumes. The file/volume may actually be mounted from a NAS or SAN, of course, but it does not matter because we use the local path names, and AFAIK all virtualization tools use local files or local devices as blockdevs.
I have added compression to the mix because it is immensely useful. I used lzop in my project, and a full backup and restore was much faster with a compressed backup file than with an uncompressed one. It conserves disk space, as well as CPU/bus capacity.
Zeroing out newly allocated files helps with compressed backups, as well as with security. It also means that no holey (sparse) files can be used.
The objects we are dealing with are disk images. They have the following properties:
-Path: the Unix path of the file (/mnt/images/fc7.img or /dev/VG/fc7)
-Compression: mountable/compressed
-Type: plain file/LVM volume/what else?
-Size
-Filesystem: swap/ext3/xfs/...
-Is it mounted?
This is where it gets very complicated. Files or partitions may represent simple filesystems, or partitioned block devices, or LVM PVs, or filesystems or partitions in formats that we have no chance of understanding (e.g. NTFS), or snapshots, or dm-crypt volumes, or compressed data, and so on and so on.

To do this in any feasible way, I'm sure we'll need a library, the obvious one being libparted from GNU parted (http://www.gnu.org/software/parted/). However, parted has a lot of problems, and most specifically it doesn't support LVM.

I am aware of a project to make LVM accessible as a library and to include LVM library support in gparted/libparted, but I'm not sure how it is progressing.
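Without a library, a first approximation of the Filesystem property is to look for well-known magic numbers, the way file(1) does. The sketch below checks a handful of them at their standard on-disk locations:

- the ext2/3/4 superblock magic 0xEF53 at byte 1080
- "XFSB" at byte 0
- "SWAPSPACE2" at the end of the first page, assuming 4 KiB pages
- an LVM2 "LABELONE" label, assumed to be in sector 1
- the 55 AA boot signature at byte 510

The enum and function names are hypothetical. The sketch illustrates the problem described above rather than solving it. FAT and NTFS boot sectors also end in 55 AA, so they are misreported as partitioned disks, and anything else comes back as unknown.

```c
/* Hypothetical probe: guesses what an image contains from magic numbers. */
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

enum image_content {
    CONTENT_UNKNOWN,
    CONTENT_XFS,
    CONTENT_EXT2_3_4,
    CONTENT_SWAP,            /* assumes 4 KiB pages */
    CONTENT_LVM2_PV,         /* only looks at sector 1 */
    CONTENT_MBR_PARTITIONED  /* also matches FAT/NTFS boot sectors */
};

/* Read exactly len bytes at off; 0 on success, -1 on error or short read. */
static int read_at(int fd, off_t off, void *buf, size_t len)
{
    return pread(fd, buf, len, off) == (ssize_t)len ? 0 : -1;
}

static enum image_content probe_image(const char *path)
{
    unsigned char b[10];
    enum image_content kind = CONTENT_UNKNOWN;
    int fd;

    assert(path != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return CONTENT_UNKNOWN;

    /* Filesystem signatures first; the MBR signature is the weakest hint. */
    if (read_at(fd, 0, b, 4) == 0 && memcmp(b, "XFSB", 4) == 0)
        kind = CONTENT_XFS;
    else if (read_at(fd, 1080, b, 2) == 0 && b[0] == 0x53 && b[1] == 0xEF)
        kind = CONTENT_EXT2_3_4;          /* s_magic 0xEF53, little-endian */
    else if (read_at(fd, 4096 - 10, b, 10) == 0 &&
             memcmp(b, "SWAPSPACE2", 10) == 0)
        kind = CONTENT_SWAP;
    else if (read_at(fd, 512, b, 8) == 0 && memcmp(b, "LABELONE", 8) == 0)
        kind = CONTENT_LVM2_PV;
    else if (read_at(fd, 510, b, 2) == 0 && b[0] == 0x55 && b[1] == 0xAA)
        kind = CONTENT_MBR_PARTITIONED;

    close(fd);
    return kind;
}
```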
We can do the following operations on the images: [... operations ...]
I'd prefer to stick to the minimum set of operations needed now and add to them later. But yes, the mix of operations looks OK.

Much of this work is needed by virt-p2v and virt-df (http://et.redhat.com/~rjones/virt-p2v/ and virt-df to be released soon).

Rich.

-- 
Emerging Technologies, Red Hat - http://et.redhat.com/~rjones/
Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod Street, Windsor, Berkshire, SL4 1TE, United Kingdom.
Registered in England and Wales under Company Registration No. 03798903

Oops, my spam filter ate your reply, sorry for the delay.

Richard W.M. Jones wrote:
Tóth István wrote:
Hello!
I had a hobby project, using a model similar to libvirt's remote support, where I needed to manipulate Xen disk images on remote systems. Based on what I learnt from it, I came up with a possible model for libvirt's remote storage support. I present it here for discussion.
We typically store the images in volumes on LVM, or in dedicated file system folders. The folders and volume groups usable by libvirt can be limited in a config file.
It is probably not necessary to differentiate between defined and created files, as you cannot stop and start a file like a domain: you either have it on disk, or not.
Agreed.
Libvirt should not store information on these files; everything should be checked/listed on the fly, so that if you just copy an image to a directory, libvirt can deduce all the information it can about it, and handle it just as if it had created the file itself.
Agreed.
The handle for the file is its path, plus its virConnect object (i.e. the host it is on). For consistency, it may be possible to create an object for it, but as disk images have no persistent properties apart from what is on the disk, and it can always be checked from there, it provides no extra functionality.
Probably the 'handle' is its filename or device name + the virDomain object.
For example, here are the domains and their images running on my Xen host at the moment. I got this by writing a simple script which parses the domain XML:
fc6_0:
  /var/lib/xen/images/fc6_0.img -> xvda
  /var/lib/xen/images/home.disk -> xvdb
fc6_1:
  /var/lib/xen/images/fc6_1.img -> xvda
debian32fv:
  /var/lib/xen/images/debian32fv.img -> hda
f764pv:
  /dev/Images/f764pv -> xvda
freebsd32fv:
  /var/lib/xen/images/freebsd32fv.img -> hda
  [CD] -> hdc
gentoo32fv:
  /var/lib/xen/images/gentoo32fv.img -> hda

Well, this is a good handle for the images that belong to an active domain. But I can see other images lying around (backup images, snapshots, pristine installed images for provisioning new VMs) and you need to refer to those as well. Hence, I still think that it would be better to use host+path. For example, you need to be able to say, in effect: copy "/var/lib/xen/images/fc6_0.img" to "/backups/fc6_xvda_1", and you have to refer to the target image somehow. You could just use the local path in this case, but I think that being able to work with images on other libvirt hosts would be a bonus.
I think there is no need to support remote files explicitly, as the domains mount local files/volumes. The file/volume may actually be mounted from a NAS or SAN, of course, but it does not matter because we use the local path names, and AFAIK all virtualization tools use local files or local devices as blockdevs.
I have added compression to the mix because it is immensely useful. I used lzop in my project, and a full backup and restore was much faster with a compressed backup file than with an uncompressed one. It conserves disk space, as well as CPU/bus capacity.
Zeroing out newly allocated files helps with compressed backups, as well as with security. It also means that no holey (sparse) files can be used.
The objects we are dealing with are disk images. They have the following properties:
-Path: the Unix path of the file (/mnt/images/fc7.img or /dev/VG/fc7)
-Compression: mountable/compressed
-Type: plain file/LVM volume/what else?
-Size
-Filesystem: swap/ext3/xfs/...
-Is it mounted?
This is where it gets very complicated. Files or partitions may represent simple filesystems, or partitioned block devices, or LVM PVs, or filesystems or partitions in formats that we have no chance of understanding (eg. NTFS), or snapshots, or dm_crypt, or compressed and so on and so on.
To do this in any feasible way, I'm sure we'll need a library, the obvious one being libparted from GNU parted (http://www.gnu.org/software/parted/). However, parted has a lot of problems, and most specifically it doesn't support LVM.
I am aware of a project to make LVM accessible as a library and to include LVM library support in gparted/libparted, but I'm not sure how it is progressing.
Indeed, back when I created this specification, the supported mode for Xen was to feed it partitions instead of whole disk images, so the partitioning problem was not apparent. I checked the LVM library project about a month ago, but it does not seem to be in a generally usable shape yet. The supported way is still to build the command lines, run them and parse their output.

In fact, the more I thought about it, and the more scenarios popped into my mind (plus the ones you describe above), the more I think that at least an initial implementation should not try to look inside the partition, exactly because of the problems you mention above. Even if partition/filesystem handling is included in libvirt, it should probably be somewhat orthogonal to the rest of the image handling functions. That means the operations I detailed below (except for the growfs-related ones) for creating, moving, backing up, etc. of raw images, and a separate set of operations that partitions images, adds/removes partitions, creates filesystems, grows them, etc. This limits the complexity to supporting just simple files, block devices and LVM volumes (or the equivalent functionality on other platforms), and the parted-like functionality can be added on top of it.
We can do the following operations on the images: [... operations ...]
I'd prefer to stick to the minimum set of operations needed now and add to them later. But yes, the mix of operations looks OK.
Much of this work is needed by virt-p2v and virt-df (http://et.redhat.com/~rjones/virt-p2v/ and virt-df to be released soon).

virt-p2v looks immensely useful.
Rich.

Tóth István wrote:
Richard W.M. Jones wrote:
For example, here are the domains and their images running on my Xen host at the moment. I got this by writing a simple script which parses the domain XML:
fc6_0:
  /var/lib/xen/images/fc6_0.img -> xvda
  /var/lib/xen/images/home.disk -> xvdb
fc6_1:
  /var/lib/xen/images/fc6_1.img -> xvda
debian32fv:
  /var/lib/xen/images/debian32fv.img -> hda
f764pv:
  /dev/Images/f764pv -> xvda
freebsd32fv:
  /var/lib/xen/images/freebsd32fv.img -> hda
  [CD] -> hdc
gentoo32fv:
  /var/lib/xen/images/gentoo32fv.img -> hda

Well, this is a good handle for the images that belong to an active domain. But I can see other images lying around (backup images, snapshots, pristine installed images for provisioning new VMs) and you need to refer to those as well. Hence, I still think that it would be better to use host+path. For example, you need to be able to say, in effect: copy "/var/lib/xen/images/fc6_0.img" to "/backups/fc6_xvda_1", and you have to refer to the target image somehow. You could just use the local path in this case, but I think that being able to work with images on other libvirt hosts would be a bonus.
There's an open-ended access control problem here. libvirtd runs as root, and host+path gives a way to read and write any file on the system.

Better might be to allow the system administrator to configure directories where backup images, snapshots and so on may be located (through /etc/libvirtd.conf), have libvirtd check this, and also have an additional level of enforcement through SELinux (as is done with Xen images now).

For my rather limited needs with virt-df I was going to propose an API like this:

  virDomainPeekDevice (virDomainPtr domain, const char *path,
                       off_t offset, off_t size, char *result_buffer);

The security check would be something along these lines:

  * path must be a source device (as returned in the domain XML)
  * path must belong to the domain
  * offset, size must be entirely within the path device/file

This check could be extended to allow path to be in the configured backup/snapshot directories. (This is not really thought through at the moment, however; comments welcome.)

With that call we then need to look at "virtualising" libparted so that, instead of making direct read(2), lseek(2) etc. system calls, these may be redirected through a VFS layer which would call virDomainPeekDevice.

(I'm sure I posted something about this to the list, but that was two weeks ago, I've been on holiday, I'm jetlagged, and now I can't find it...)

[...]
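The three checks listed above could look roughly like this. It is a sketch under stated assumptions, not a proposed implementation:

- `struct domain_disks` is a hypothetical stand-in for the source paths extracted from the domain XML.
- `peek_device` only mirrors the parameters of the proposed virDomainPeekDevice and is not libvirt code.

Paths are compared as exact strings, which is the strict choice: a symlink pointing at a domain's disk is rejected.

```c
/* Hypothetical sketch of the virDomainPeekDevice security checks. */
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Stand-in for the <source> paths found in one domain's XML. */
struct domain_disks {
    const char *const *sources;
    size_t n_sources;
};

static int path_is_domain_source(const struct domain_disks *dom, const char *path)
{
    size_t i;
    for (i = 0; i < dom->n_sources; i++)
        if (strcmp(dom->sources[i], path) == 0)
            return 1;
    return 0;
}

/* Returns 0 and fills result_buffer on success, -1 with errno set otherwise. */
static int peek_device(const struct domain_disks *dom, const char *path,
                       off_t offset, off_t size, char *result_buffer)
{
    off_t dev_size;
    ssize_t n;
    int fd;

    assert(dom != NULL && path != NULL && result_buffer != NULL);
    if (!path_is_domain_source(dom, path)) {
        errno = EPERM;                   /* not one of this domain's disks */
        return -1;
    }
    if (offset < 0 || size < 0) {
        errno = EINVAL;
        return -1;
    }
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    /* Seeking to the end gives the size of regular files and block devices. */
    dev_size = lseek(fd, 0, SEEK_END);
    /* Check offset + size <= dev_size without computing offset + size,
     * which could overflow. */
    if (dev_size < 0 || offset > dev_size || size > dev_size - offset) {
        close(fd);
        errno = EINVAL;
        return -1;
    }
    n = pread(fd, result_buffer, (size_t)size, offset);
    close(fd);
    if (n < 0)
        return -1;
    if (n != (ssize_t)size) {
        errno = EIO;                     /* short read */
        return -1;
    }
    return 0;
}
```

Extending it to the configured backup/snapshot directories would mean adding a second accepted case next to path_is_domain_source().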
In fact, the more I thought about it, and the more scenarios popped into my mind (plus the ones you describe above), the more I think that at least an initial implementation should not try to see into the partition, exactly because of the problems you mention above. Even if partition/filesystem handling is included in libvirt, it should probably be somewhat orthogonal to the rest of the image handling functions.
That means the operations I detailed below (except for the growfs-related ones) for creating, moving, backing up, etc. of raw images, and a separate set of operations that partitions images, adds/removes partitions, creates filesystems, grows them, etc.
This limits the complexity to supporting just simple files, block devices and LVM volumes (or the equivalent functionality on other platforms), and the parted-like functionality can be added on top of it.
My thinking about this moved along a bit: what if we explicitly _don't_ think about supporting LVM operations and so on within libvirt? Making a general-purpose solution is a big, intractable problem. Instead we could allow the system administrator to create some operations (again, through /etc/libvirtd.conf [1]):

----- /etc/libvirtd.conf -------------------
allocate partition: "lvcreate -L %size -n %name XenVolGroup"
list partitions: "lvs"
--------------------------------------------

On the libvirt side those turn into standard calls like:

  virConnectListPartitions (...)

If the commands don't exist in libvirtd.conf, those calls fail with a suitable error message. We can set up suggested commands in the default configuration for Linux + LVM, Linux + partitions, Solaris, etc., but this defers the policy to system administrators.

Rich.

[1] libvirtd.conf is only available in the remote case, so perhaps we also need a libvirt.conf for the local case.
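If the commands come from libvirtd.conf, libvirtd still has to substitute %size and %name before running them. Below is a minimal sketch of that substitution, assuming only the two placeholders from the example config. `expand_command` and `safe_word` are hypothetical names.

The expanded string would be handed to a shell. For that reason the sketch refuses any value containing characters other than letters, digits and `._+-`, and any value starting with `-`. This stops a volume name from smuggling in extra commands or options.

```c
/* Hypothetical expansion of an admin-supplied command template. */
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Accept only non-empty words made of [A-Za-z0-9._+-] not starting with '-'. */
static int safe_word(const char *s)
{
    if (*s == '\0' || *s == '-')
        return 0;
    for (; *s; s++)
        if (!isalnum((unsigned char)*s) && strchr("._+-", *s) == NULL)
            return 0;
    return 1;
}

/* Replace %size and %name in tmpl, writing the result to out.
 * Returns 0 on success, -1 on unsafe input or if out is too small. */
static int expand_command(const char *tmpl, const char *size, const char *name,
                          char *out, size_t out_len)
{
    size_t used = 0;

    assert(tmpl && size && name && out && out_len > 0);
    if (!safe_word(size) || !safe_word(name))
        return -1;
    while (*tmpl) {
        const char *value = NULL;        /* NULL: copy one literal character */
        size_t skip = 1, vlen;

        if (strncmp(tmpl, "%size", 5) == 0) {
            value = size;
            skip = 5;
        } else if (strncmp(tmpl, "%name", 5) == 0) {
            value = name;
            skip = 5;
        }
        vlen = value ? strlen(value) : 1;
        if (used + vlen >= out_len)      /* keep room for the final '\0' */
            return -1;
        memcpy(out + used, value ? value : tmpl, vlen);
        used += vlen;
        tmpl += skip;
    }
    out[used] = '\0';
    return 0;
}
```

Splitting the template into an argv array and calling execvp() directly would avoid the shell altogether. The string form is kept here only because it matches the config example.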

On Mon, Oct 15, 2007 at 01:31:47PM +0100, Richard W.M. Jones wrote:
There's an open-ended access control problem here. libvirtd runs as root and host+path gives a way to read and write any file on the system.
Better might be to allow the system administrator to configure directories where backup images, snapshots and so on may be located (through /etc/libvirtd.conf), and have libvirtd check this, and also have an additional level of enforcement through SELinux (as is done with Xen images now).
Yep, that is a good idea. Indeed, some deployments pretty much require it: when running with SELinux enforcing, only /var/lib/xen/images is a valid location, for example. Being able to create/manage files on any part of the filesystem is rather overkill for our needs; admin-defined directory locations should be more than sufficient.

Dan.
-- 
|=- Red Hat, Engineering, Emerging Technologies, Boston. +1 978 392 2496 -=|
|=- Perl modules: http://search.cpan.org/~danberr/ -=|
|=- Projects: http://freshmeat.net/~danielpb/ -=|
|=- GnuPG: 7D3B9505 F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 -=|