[Libvir] Concepts in storage management

So we've had many email threads on the subject of storage, but none have resulted in a satisfactory way forward to implementing any storage management APIs. Part of the problem, I think, is that we've not tried to understand all the various concepts / technologies which are available & how they relate to each other. This mail attempts to outline all the different technologies. There's a short list of API operations, but I don't particularly want to get into API details until we have a good understanding of the concepts.

First and foremost, I don't believe it is acceptable to say we're only going to allow one kind of storage. Storage is the key piece of infrastructure for any serious network, and we have to be able to adapt to the deployment scenarios that present themselves.

Second, there is clearly a huge number of storage technologies here and there's no way we'll implement support for all of them in one go. So we need to prioritize getting the conceptual model correct, to allow us to incrementally support new types of storage backend.
Taxonomy of storage types
=========================

+- Block
|  |
|  +- Disk
|  |  |
|  |  +- Direct attached
|  |  |  |
|  |  |  +- IDE/ATA disk
|  |  |  +- SCSI disk
|  |  |  +- FibreChannel disk
|  |  |  +- USB disk/flash
|  |  |  +- FireWire disk/flash
|  |  |
|  |  +- Remote attached
|  |     |
|  |     +- iSCSI disk
|  |     +- GNBD disk
|  |
|  +- Partition
|  |
|  +- Virtual
|     |
|     +- Direct attached
|     |  |
|     |  +- LVM
|     |  +- ZFS
|     |
|     +- Remote attached
|        |
|        +- Cluster LVM
|
+- FileSystem
|  |
|  +- Direct attached
|  |  |
|  |  +- ext2/3/4
|  |  +- xfs
|  |  +- ZFS
|  |
|  +- Remote attached
|     |
|     +- NFS
|     +- GFS
|     +- OCFS2
|
+- Directory
|
+- File
   |
   +- Raw allocated
   +- Raw sparse
   +- QCow2
   +- VMDK

Storage attributes
==================

- Local vs network (ext3 vs NFS, SCSI vs iSCSI)
- Private vs shared (IDE vs FibreChannel)
- Pool vs volume (LVM VG vs LV, Directory vs File, Disk vs Partition)
- Container vs guest (OpenVZ vs Xen)
- Attributes
  - Compressed
  - Encrypted
  - Auto-extend
- Snapshots
  - RO
  - RW
- Partition table
  - MBR
  - GPT
- UUID
  - 16 hex digits
  - Unique string
  - SCSI WWID (world wide ID)
- Local Path(s) (/dev/sda, /var/lib/xen/images/foo.img)
- Server Hostname
- Server Identifier (export path/target)
- MAC security label (SELinux)
- Redundancy
  - Mirrored
  - Striped
  - Multipath
- Pool operation
  - RO
  - RW

Nesting hierarchy
=================

Many possibilities...
- 1 x Host -> N x iSCSI target -> N x LUN -> N x Partition
- N x Disk/Partition -> 1 x LVM VG -> N x LVM LV
- 1 x Filesystem -> N x directory -> N x file
- 1 x File -> 1 x Block (loopback)

Application users
=================

- virt-manager / virt-install
  - Enumerate available pools
  - Allocate volume from pool
  - Create guest with volume
- virt-clone
  - Copy disks
  - Snapshot disks
- virt-df
  - Filesystem usage
- pygrub
  - Extract kernel/initrd from filesystem
- virt-factory
  - Manage storage pools
  - Pre-migration sanity checks
- virt-backup
  - Snapshot disks
- virt-p2v
  - Snapshot disks

Storage representation
======================

Two core concepts

- Volume
  - a chunk of storage
  - assignable to a guest
  - assignable to a pool
  - optionally part of a pool
- Pool
  - a chunk of storage
  - contains free space
  - allocate to provide volumes
  - comprised of volumes

Recursive!

  n x Volume -> Pool -> n x Volume

Nesting to many levels...

Do we need an explicit Filesystem concept ?

Operations
==========

Limited set of operations to perform

- List host volumes (physical attached devices)
- List pools (logical volume groups, partitioned devs, filesystems)
- List pool volumes (dev partitions, LVM logical volumes, files)
- Define pool (eg create directory, or define iSCSI target)
- Undefine pool (delete directory, undefine iSCSI config)
- Activate pool (mount NFS volume, login to iSCSI target)
- Deactivate pool (unmount volume, logout of iSCSI)
- Dump pool XML (get all the metadata)
- Lookup by path
- Lookup by UUID
- Lookup by name
- Create volume (create a file, allocate a LVM LV, etc)
- Destroy volume (delete a file, deallocate a LVM LV)
- Resize volume (grow or shrink volume)
- Copy volume (copy data between volumes)
- Snapshot volume (snapshot a volume)
- Dump volume XML (get all the metadata)
- Lookup by path
- Lookup by UUID
- Lookup by name

http://www.redhat.com/archives/libvir-list/2007-February/msg00010.html
http://www.redhat.com/archives/libvir-list/2007-September/msg00119.html

Do we also need some explicit Filesystem APIs ?

XML description
===============

The horrible recursiveness & specific attributes are all in the XML description for different storage pool / volume types. This is where we define things like what physical volumes are in a volume group, iSCSI server / target names, login details, etc, etc

XXX fill in the hard stuff for metadata description here

Implementation backends
=======================

- FileSystem/Directory/File - POSIX APIs
- LVM - LVM tools, or libLVM
- Disk/partitions - sysfs / parted
- iSCSI - sysfs / iscsi utils
- ZFS - ZFS tools

Implementation strategy
=======================

Should prioritize implementation according to immediate application needs.

Initial goal to support remote guest creation on par with current capabilities:

- Directory + allocating raw sparse files
- Enumerate existing disks, partitions & LVM volumes

Further work:

- Allocating LVM volumes
- Defining LVM volume groups
- Partitioning disks
- Mounting networked filesystems
- Accessing iSCSI volumes
- Copying existing volumes
- Snapshotting volumes
- Cluster aware filesystems (GFS)
- Various file formats (QCow, VMDK, etc)

Dan.

-- 
|=- Red Hat, Engineering, Emerging Technologies, Boston. +1 978 392 2496 -=|
|=- Perl modules: http://search.cpan.org/~danberr/ -=|
|=- Projects: http://freshmeat.net/~danielpb/ -=|
|=- GnuPG: 7D3B9505 F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 -=|
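[Editorial note: the pool / volume operations above can be sketched as plain Python objects to make the two-concept model concrete. Every class and method name here is a hypothetical illustration for discussion, not the eventual libvirt API.]

```python
# Hypothetical sketch of the pool / volume operations listed above.

class Volume:
    def __init__(self, name, path, capacity):
        self.name = name          # unique name within the pool
        self.path = path          # local path, eg /dev/VG/LV or a file
        self.capacity = capacity  # size in bytes

class Pool:
    def __init__(self, name):
        self.name = name
        self.active = False
        self.volumes = {}

    def activate(self):
        # eg mount an NFS export, or log in to an iSCSI target
        self.active = True

    def deactivate(self):
        # eg unmount the filesystem, or log out of the iSCSI target
        self.active = False

    def create_volume(self, name, path, capacity):
        # eg create a file, or allocate an LVM LV
        vol = Volume(name, path, capacity)
        self.volumes[name] = vol
        return vol

    def destroy_volume(self, name):
        # eg delete the file, or deallocate the LVM LV
        del self.volumes[name]

    def lookup_by_path(self, path):
        return next(v for v in self.volumes.values() if v.path == path)

pool = Pool("default")
pool.activate()
vol = pool.create_volume("guest1", "/var/lib/xen/images/guest1.img", 4 << 30)
```

The point of the sketch is only that the whole operations list collapses onto two object types plus lookup methods; the XML metadata per pool/volume type carries everything backend-specific.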

Daniel P. Berrange wrote:
http://www.redhat.com/archives/libvir-list/2007-September/msg00119.html
Since that thread is split across two months, can I bring to everyone's attention the post I made yesterday:

http://www.redhat.com/archives/libvir-list/2007-October/msg00057.html

In particular the concept at the end that we shouldn't even try to support every possible remote storage, but instead allow the administrator to write "scriptlets" (small shell scripts with a well-defined input & output) to perform a set of operations:

----- /etc/libvirtd.conf -------------------
allocate partition: "lvcreate -L %size -n %name XenVolGroup"
list partitions: "lvs --xml"
--------------------------------------------

We can provide sample scriptlets for different operating systems and storage configurations.

Rich.

-- 
Emerging Technologies, Red Hat - http://et.redhat.com/~rjones/
Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod Street,
Windsor, Berkshire, SL4 1TE, United Kingdom.
Registered in England and Wales under Company Registration No. 03798903
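[Editorial note: a sketch of how such a scriptlet might be expanded before being run -- the %key placeholders are filled in from the API call's arguments, with each value shell-quoted. The placeholder syntax comes from the proposal above; the helper function itself is hypothetical.]

```python
# Hypothetical expansion of a scriptlet template from /etc/libvirtd.conf.

import shlex

def expand_scriptlet(template, **params):
    """Substitute %key placeholders, shell-quoting each value."""
    cmd = template
    for key, value in params.items():
        cmd = cmd.replace("%" + key, shlex.quote(str(value)))
    return cmd

cmd = expand_scriptlet("lvcreate -L %size -n %name XenVolGroup",
                       size="4G", name="guest1")
# cmd could then be handed to the shell / subprocess for execution
```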

On Tue, Oct 16, 2007 at 04:34:26PM +0100, Richard W.M. Jones wrote:
Daniel P. Berrange wrote:
http://www.redhat.com/archives/libvir-list/2007-September/msg00119.html
Since that thread is split across two months, can I bring to everyone's attention the post I made yesterday:
http://www.redhat.com/archives/libvir-list/2007-October/msg00057.html
In particular the concept at the end that we shouldn't even try to support every possible remote storage, but instead allow the administrator to write "scriptlets" (small shell scripts with a well-defined input & output) to perform a set of operations:
This is really just an implementation detail. We still need to define the storage concepts we want to expose in the public API before figuring out the backend implementation. Most of the implementation will pretty much have to follow the scheme of just invoking command line tools like lvcreate and lvs, since formal APIs are scarce.

Dan.

Daniel P. Berrange wrote:
On Tue, Oct 16, 2007 at 04:34:26PM +0100, Richard W.M. Jones wrote:
Daniel P. Berrange wrote:
http://www.redhat.com/archives/libvir-list/2007-September/msg00119.html Since that thread is split across two months, can I bring to everyone's attention the post I made yesterday:
http://www.redhat.com/archives/libvir-list/2007-October/msg00057.html
In particular the concept at the end that we shouldn't even try to support every possible remote storage, but instead allow the administrator to write "scriptlets" (small shell scripts with a well-defined input & output) to perform a set of operations:
This is really just an implementation detail. We still need to define the storage concepts we want to expose in the public API before figuring out the backend implementation. Most of the implementation will pretty much have to follow the scheme of just invoking command line tools like lvcreate and lvs, since formal APIs are scarce.
Well, a basic set of operations would be whatever we need to implement virt-install/virt-manager remotely now, plus other suggestions as they come along. From a fairly brief scan of the virt-install & virt-manager code that would be:

- Create an empty file with given name & size & sparseness.
- Detect if a named device or file exists (basically a remote stat).
- Copy image to remote temporary file (for kernel/CD-ROM).
- Check free space (remote statvfs).

It might be nice to list LVs, but it doesn't seem to be necessary to implement remote virt-* at the moment (AFAICS).

Rich.
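[Editorial note: the four primitives above map onto plain POSIX calls. A local sketch follows -- in the real feature these would run on the remote host via the daemon, and the function names here are illustrative only.]

```python
# Local sketch of the four remote primitives, using plain POSIX calls.

import os
import shutil
import tempfile

def create_sparse(path, size):
    # create an empty sparse file of the given size
    with open(path, "wb") as f:
        f.truncate(size)

def exists(path):
    # "remote stat"
    return os.path.exists(path)

def copy_image(src, dst):
    # copy an image (eg kernel / CD-ROM) to a temporary file
    shutil.copyfile(src, dst)

def free_space(path):
    # "remote statvfs": bytes available to unprivileged users
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize

workdir = tempfile.mkdtemp()
img = os.path.join(workdir, "guest.img")
create_sparse(img, 1 << 20)
```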

On Tue, Oct 16, 2007 at 05:16:33PM +0100, Richard W.M. Jones wrote:
Daniel P. Berrange wrote:
On Tue, Oct 16, 2007 at 04:34:26PM +0100, Richard W.M. Jones wrote:
Daniel P. Berrange wrote:
http://www.redhat.com/archives/libvir-list/2007-September/msg00119.html Since that thread is split across two months, can I bring to everyone's attention the post I made yesterday:
http://www.redhat.com/archives/libvir-list/2007-October/msg00057.html
In particular the concept at the end that we shouldn't even try to support every possible remote storage, but instead allow the administrator to write "scriptlets" (small shell scripts with a well-defined input & output) to perform a set of operations:
This is really just an implementation detail. We still need to define the storage concepts we want to expose in the public API, before figuring out on the backend implementation. Most of the implementation wiill pretty much have to follow the scheme of just invoking command line tools like lvcreate and lvs, since formal APIs are scarse.
Well, a basic set of operations would be whatever we need to implement virt-install/virt-manager remotely now, plus other suggestions as they come along.
From a fairly brief scan of the virt-install & virt-manager code that would be:
- Create an empty file with given name & size & sparseness.
- Detect if a named device or file exists (basically a remote stat).
- Copy image to remote temporary file (for kernel/CD-ROM).
- Check free space (remote statvfs).
It might be nice to list LVs, but it doesn't seem to be necessary to implement remote virt-* at the moment (AFAICS).
Current virt-manager doesn't enumerate block devices at all - it just presents a file selection dialog rooted in /dev, letting you select a block device, be it a disk or a logical volume. Using LVM volumes for guests is probably more common than using raw partitions, based on the user reports I see.

Dan.

Daniel P. Berrange wrote:
Second, there is clearly a huge number of storage technologies here and there's no way we'll implement support for all of them in one go. So we need to prioritize getting the conceptual model correct, to allow us to incrementally support new types of storage backend.
Yeah, I think that's the right place to start. As you say, there are just too many underlying storage technologies to go after all of them at once. <snip>
Storage representation ======================
Two core concepts
- Volume
  - a chunk of storage
  - assignable to a guest
  - assignable to a pool
  - optionally part of a pool
- Pool
  - a chunk of storage
  - contains free space
  - allocate to provide volumes
  - comprised of volumes
Recursive!
n x Volume -> Pool -> n x Volume
Nesting to many levels...
Kind of, though I think there are actually two concepts of Volumes here (if I am understanding correctly). The first concept of volume is "raw storage" -> what you assign to a pool. The second concept is "Volume exported for a guest". I'm not sure that we want to nest those concepts.
Do we need an explicit Filesystem concept ?
Operations ==========
Limited set of operations to perform
- List host volumes (physical attached devices)
- List pools (logical volume groups, partitioned devs, filesystems)
- List pool volumes (dev partitions, LVM logical volumes, files)
- Define pool (eg create directory, or define iSCSI target)
- Undefine pool (delete directory, undefine iSCSI config)
- Activate pool (mount NFS volume, login to iSCSI target)
- Deactivate pool (unmount volume, logout of iSCSI)
- Dump pool XML (get all the metadata)
- Lookup by path
- Lookup by UUID
- Lookup by name
- Create volume (create a file, allocate a LVM LV, etc)
- Destroy volume (delete a file, deallocate a LVM LV)
- Resize volume (grow or shrink volume)
- Copy volume (copy data between volumes)
- Snapshot volume (snapshot a volume)
- Dump volume XML (get all the metadata)
- Lookup by path
- Lookup by UUID
- Lookup by name
http://www.redhat.com/archives/libvir-list/2007-February/msg00010.html
http://www.redhat.com/archives/libvir-list/2007-September/msg00119.html
Do we also need some explicit Filesystem APIs ?
The question I have with all of this is whether it really belongs in libvirt at all. Many of these concepts apply to bare-metal provisioning as well; so it might be a good idea to have a separate "libstorage" that libvirt links to, and that other tools might use.
XML description ===============
The horrible recursiveness & specific attributes are all in the XML description for different storage pool / volume types. This is where we define things like what physical volumes are in a volume group, iSCSI server / target names, login details, etc, etc
XXX fill in the hard stuff for metadata description here
Implementation backends =======================
- FileSystem/Directory/File - POSIX APIs
- LVM - LVM tools, or libLVM
- Disk/partitions - sysfs / parted
- iSCSI - sysfs / iscsi utils
- ZFS - ZFS tools
The problem with most of these, as we all know, is that they only have command-line utilities, and no corresponding libraries. That makes it difficult for a library like libvirt to support them. That is, we can shell out to the commands, but then we run into a situation where different versions of the LVM command, for example, have different output. We have now effectively tied ourselves to a particular version of a tool, which is fairly disappointing. Also, as we have seen with xend, spawning external tools to do work makes error reporting far more difficult (maybe impossible).

So that leaves us the tough question of what to do here. Ideally we would abstract all of the above tools into libraries, re-write the tools to use those, and then have libvirt use the same. I'm not sure if it is practical, however, to wait that long to do such things.

What I might suggest is a hybrid approach. Do the initial implementation with what we have (namely the command line utilities, possibly utilizing rjones' "scriptlet" concept). In parallel, make sure someone starts on making real libraries of the tools, so that we can benefit from that later on.
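[Editorial note: the version-skew worry in miniature -- when shelling out, the only contract is the tool's textual output. Asking the tool for machine-readable, fixed-unit output narrows the risk but does not remove it. The sample below mimics `lvs --noheadings --units b --separator ,` style output; the exact field layout is an assumption for illustration.]

```python
# Hypothetical defensive parser for comma-separated `lvs` output.

def parse_lvs(output):
    """Parse a comma-separated LV listing into (lv, vg, size_bytes) tuples."""
    rows = []
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        lv, vg, size = line.split(",")[:3]
        rows.append((lv, vg, int(size.rstrip("B"))))
    return rows

# canned sample standing in for real tool output
sample = ("  guest1,XenVolGroup,4294967296B\n"
          "  guest2,XenVolGroup,8589934592B\n")
volumes = parse_lvs(sample)
```

Even with a fixed separator and fixed units, a field reorder in a future tool release silently breaks such a parser -- which is exactly the argument for real libraries.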
Implementation strategy =======================
Should prioritize implementation according to immediate application needs
Initial goal to support remote guest creation on par with current capabilities:
- Directory + allocating raw sparse files
- Enumerate existing disks, partitions & LVM volumes
Yep. As long as we make sure the XML is flexible enough to handle the remaining stuff, this seems like a good first place to start. Chris Lalancette

On Tue, Oct 16, 2007 at 12:30:45PM -0400, Chris Lalancette wrote:
Daniel P. Berrange wrote:
Second, there is clearly a huge number of storage technologies here and there's no way we'll implement support for all of them in one go. So we need to prioritize getting the conceptual model correct, to allow us to incrementally support new types of storage backend.
Yeah, I think that's the right place to start. As you say, there are just too many underlying storage technologies to go after all of them at once.
<snip>
Storage representation ======================
Two core concepts
- Volume
  - a chunk of storage
  - assignable to a guest
  - assignable to a pool
  - optionally part of a pool
- Pool
  - a chunk of storage
  - contains free space
  - allocate to provide volumes
  - comprised of volumes
Recursive!
n x Volume -> Pool -> n x Volume
Nesting to many levels...
Kind of, though I think there are actually two concepts of Volumes here (if I am understanding correctly). The first concept of volume is "raw storage" -> what you assign to a pool. The second concept is "Volume exported for a guest". I'm not sure that we want to Nest those concepts.
It is already nested, even if you don't usually see it in the Dom0 host. eg, in the host I assign a LVM volume to a guest. The guest then puts this into its own nested LVM VG & allocates volumes. This nesting isn't normally visible by default, but tools like kpartx make it visible.

Making this nesting visible in the host isn't necessarily something we need to expose in the APIs, but we should consider it when thinking about the storage concepts. Depending on how we end up modelling the storage APIs, we may end up getting the capability 'for free', so artificially restricting it upfront is premature.
Do we need an explicit Filesystem concept ?
Operations ==========
Limited set of operations to perform
- List host volumes (physical attached devices)
- List pools (logical volume groups, partitioned devs, filesystems)
- List pool volumes (dev partitions, LVM logical volumes, files)
- Define pool (eg create directory, or define iSCSI target)
- Undefine pool (delete directory, undefine iSCSI config)
- Activate pool (mount NFS volume, login to iSCSI target)
- Deactivate pool (unmount volume, logout of iSCSI)
- Dump pool XML (get all the metadata)
- Lookup by path
- Lookup by UUID
- Lookup by name
- Create volume (create a file, allocate a LVM LV, etc)
- Destroy volume (delete a file, deallocate a LVM LV)
- Resize volume (grow or shrink volume)
- Copy volume (copy data between volumes)
- Snapshot volume (snapshot a volume)
- Dump volume XML (get all the metadata)
- Lookup by path
- Lookup by UUID
- Lookup by name
http://www.redhat.com/archives/libvir-list/2007-February/msg00010.html
http://www.redhat.com/archives/libvir-list/2007-September/msg00119.html
Do we also need some explicit Filesystem APIs ?
The question I have with all of this is whether it really belongs in libvirt at all. Many of these concepts apply to bare-metal provisioning as well; so it might be a good idea to have a separate "libstorage" that libvirt links to, and that other tools might use.
It is a good question. My thought is that if we went for a 'libstorage' the scope would be dramatically broader than if we focused on the concepts we need for managing virtual machines. Or we provide it in libvirt and, as it evolves, we can factor it out into a standalone library. My inclination is to get a working implementation for libvirt before trying to over-generalize to serve non-virt related applications.
XML description ===============
The horrible recursiveness & specific attributes are all in the XML description for different storage pool / volume types. This is where we define things like what physical volumes are in a volume group, iSCSI server / target names, login details, etc, etc
XXX fill in the hard stuff for metadata description here
Implementation backends =======================
- FileSystem/Directory/File - POSIX APIs
- LVM - LVM tools, or libLVM
- Disk/partitions - sysfs / parted
- iSCSI - sysfs / iscsi utils
- ZFS - ZFS tools
The problem with most of these, as we all know, is that they only have command-line utilities, and no corresponding libraries. That makes it difficult for a library like libvirt to support them. That is, we can shell out to the commands, but then we run into a situation where different versions of the LVM command, for example, have different output. We have now effectively tied ourselves to a particular version of a tool, which is fairly disappointing. Also, as we have seen with xend, spawning external tools to do work makes error reporting far more difficult (maybe impossible).
Yes it is difficult, but we fundamentally have no choice. With a few exceptions there are no libraries we can use, so no matter what we want we'll end up having to invoke external tools to accomplish some tasks. There is work going on in places to improve library coverage (eg Jim Meyering is doing work on an LVM library), but depending on what OS releases we want to target we may or may not be able to leverage this.
So that leaves us the tough question of what to do here. Ideally we would abstract all of the above tools into libraries, and re-write the tools to use those, and then have libvirt use the same. I'm not sure if it is practical, however, to wait that long to do such things.
Libvirt is primarily a technology integration tool / library & as such we need to work with the capabilities that are deployed in the OS' we want to target. If we only want to provide storage management in Fedora 9 or newer, then we can possibly mandate the LVM library. If we want to support Fedora 8 or older, we need to use the LVM command line tools. As long as the details are well hidden, we can support both, or switch from one to the other in the future.
What I might suggest is a hybrid approach. Do the initial implementation with what we have (namely the command line utilities, possibly utilizing rjones' "scriptlet" concept). In parallel, make sure someone starts on making real libraries of the tools, so that we can benefit from that later on.
Yep, this is already going on - eg the LVM library.

Dan.

Daniel P. Berrange wrote:
It is already nested, even if you don't usually see it in the Dom0 host. eg, in the host I assign a LVM volume to a guest. The guest then puts this into its own nested LVM VG & allocates volumes. This nesting isn't normally visible by default, but tools like kpartx make it visible.
Making this nesting visible in the host isn't necessarily something we need to expose in the APIs, but we should consider it when thinking about the storage concepts. Depending on how we end up modelling the storage APIs, we may end up getting the capability 'for free', so artificially restricting it upfront is premature.
That's true. I've always viewed the storage allocated to a guest, whether it be a partition, a file, etc., as an opaque object that the guest can do anything it wants with, even if the host doesn't understand it. There is an argument to be made for the host being able to do maintenance on guest disks (if, for some reason, you can't boot the guest); but I honestly think this sort of maintenance is best done inside the guest container (via the "normal" guest rescue modes - LiveCD, Knoppix, Windows Rescue, etc).
It is a good question. My thought is that if we went for a 'libstorage' the scope would be dramatically broader than if we focused on the concepts we need for managing virtual machines. Or we provide it in libvirt and, as it evolves, we can factor it out into a standalone library. My inclination is to get a working implementation for libvirt before trying to over-generalize to serve non-virt related applications.
Agreed that the scope for a libstorage could get out of hand. But I think if we keep the initial scope targeted enough for what we want (i.e. integration with libvirt/virt matters), we can get away with having a separate library, and the rest can grow organically from that. Really, the only part that needs to be generic is the XML; if we can get that down, the implementation doesn't matter. Basically with this model we wouldn't be dependent on libvirt internal data structures/functions, so we wouldn't have the pain of extracting it later. Of course, that is more work too :).

Chris Lalancette

Hi Dan, FWIW, this looks pretty much spot on to me ... I'm not sure there's a lot to discuss :-) On Tue, 2007-10-16 at 16:19 +0100, Daniel P. Berrange wrote:
Application users =================
- virt-manager / virt-install
  - Enumerate available pools
  - Allocate volume from pool
  - Create guest with volume
Nice that you list these and concentrate on them.
Two core concepts
- Volume
  - a chunk of storage
  - assignable to a guest
  - assignable to a pool
  - optionally part of a pool
- Pool
  - a chunk of storage
  - contains free space
  - allocate to provide volumes
  - comprised of volumes
Recursive!
n x Volume -> Pool -> n x Volume
Nesting to many levels...
Hmm, I'd try and avoid the confusion associated with this nesting concept ... What kind of uses for it are you thinking? (e.g. something like allocate a big raw sparse volume, create a pool from it and then allocate volumes from that pool?)

Is it that you need to somehow represent the storage that a pool can allocate from, and that that winds up being similar to how you represent storage that is assignable to a pool? If that's the case, maybe fold the concept of "what a pool allocates from" into the pool concept and make the volume concept just about "what is assignable to guests".
Operations ==========
Limited set of operations to perform
- List host volumes (physical attached devices)
- List pools (logical volume groups, partitioned devs, filesystems)
- List pool volumes (dev partitions, LVM logical volumes, files)
Perhaps there should be a default pool for each host, so that to list host volumes you just list the volumes from the default pool?

Cheers,
Mark.

On Wed, Oct 17, 2007 at 02:55:21PM +0100, Mark McLoughlin wrote:
Hi Dan, FWIW, this looks pretty much spot on to me ... I'm not sure there's a lot to discuss :-)
On Tue, 2007-10-16 at 16:19 +0100, Daniel P. Berrange wrote:
Application users =================
- virt-manager / virt-install
  - Enumerate available pools
  - Allocate volume from pool
  - Create guest with volume
Nice that you list these and concentrate on them.
Two core concepts
- Volume
  - a chunk of storage
  - assignable to a guest
  - assignable to a pool
  - optionally part of a pool
- Pool
  - a chunk of storage
  - contains free space
  - allocate to provide volumes
  - comprised of volumes
Recursive!
n x Volume -> Pool -> n x Volume
Nesting to many levels...
Hmm, I'd try and avoid the confusion associated with this nesting concept ...
What kind of uses for it are you thinking?
This mention of recursion seems to have caused a lot of confusion... All I really mean by it is that libvirt has two notions:

- A volume
- A pool

When you define a pool, the XML description may refer to one or more volumes which are the source of the pool. eg if you define a new LVM volume group, you provide one or more physical volumes. Given a pool, you may carve out one or more volumes. eg you carve out logical volumes.

So, the APIs at the libvirt level aren't directly 'recursive' - you just have a concept of a pool & a volume object. As you work with these two concepts you may end up creating things which are recursive in nature. In fact, even if you don't consciously define anything recursive, it is indirectly recursive, since a Fedora guest will turn a disk it is assigned into a LVM vol group & logical vols.

So in summary, the 'recursion' is just a fundamental property of the storage stack, but not something we need to directly express in libvirt APIs - the mere concepts of a volume & a pool are sufficient.
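[Editorial note: Dan's point in miniature -- with only the two notions, a Pool defined from source volumes and Volumes carved out of a Pool, multi-level nesting falls out on its own. All names in this sketch are illustrative, not API.]

```python
# Hypothetical two-notion model: nesting emerges without being modelled.

class Volume:
    def __init__(self, name):
        self.name = name

class Pool:
    def __init__(self, name, sources):
        self.name = name
        self.sources = sources  # the volumes this pool is built from

    def carve(self, name):
        # allocate a new volume out of the pool's free space
        return Volume(name)

# in the host: a partition becomes the physical volume of a VG
pv = Volume("/dev/sda3")
host_vg = Pool("XenVolGroup", [pv])
guest_disk = host_vg.carve("guest1-disk")

# inside the guest: the very same volume seeds another pool level
guest_vg = Pool("VolGroup00", [guest_disk])
guest_root = guest_vg.carve("LogVol00")
```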
Operations ==========
Limited set of operations to perform
- List host volumes (physical attached devices)
- List pools (logical volume groups, partitioned devs, filesystems)
- List pool volumes (dev partitions, LVM logical volumes, files)
Perhaps there should be a default pool for each host so that to list host volumes you just list the volumes from the default pool?
It depends on the deployment scenario, but certainly in a 'fat dom0' scenario I imagine you could always provide a default pool (eg /var/lib/xen/images).

Whether to treat the host as a pool for its physically attached devices is an interesting idea. One alternative is to have an explicit API for listing all host devices (eg, 'lshal'), since I'd certainly like to be able to enumerate any USB devices & any PCI devices, as well as any physical network adapters.

Dan.

On Wed, Oct 17, 2007 at 04:02:01PM +0100, Daniel P. Berrange wrote:
On Wed, Oct 17, 2007 at 02:55:21PM +0100, Mark McLoughlin wrote:
Recursive!
n x Volume -> Pool -> n x Volume
Nesting to many levels...
Hmm, I'd try and avoid the confusion associated with this nesting concept ...
What kind of uses for it are you thinking?
This mention of recursion seems to have caused alot of confusion...
Recursion is actually the wrong word. It is really a directed acyclic graph / multi-level hierarchy. Still, we only need 2 levels in any libvirt API: a pool & a volume.
All I really mean by it is that libvirt has two notions
- A volume
- A pool
When you define a pool, the XML description may refer to one or more volumes which are the source of the pool. eg if you define a new LVM volume group, you provide one or more physical volumes.
Given a pool, you may carve out one or more volumes. eg you carve out logical volumes.
So, the APIs at the libvirt level aren't directly 'recursive' - you just have a concept of a pool & a volume object. As you work with these two concepts you may end up creating things which are recursive in nature. In fact, even if you don't consciously define anything recursive, it is indirectly recursive, since a Fedora guest will turn a disk it is assigned into a LVM vol group & logical vols.
So in summary, the 'recursion' is just a fundamental property of the storage stack, but not something we need to directly express in libvirt APIs - the mere concepts of a volume & a pool is sufficient.
Operations ==========
Limited set of operations to perform
- List host volumes (physical attached devices)
- List pools (logical volume groups, partitioned devs, filesystems)
- List pool volumes (dev partitions, LVM logical volumes, files)
Perhaps there should be a default pool for each host so that to list host volumes you just list the volumes from the default pool?
It depends on the deployment scenario, but certainly in a 'fat dom0' scenario I imagine you could always provide a default pool (eg /var/lib/xen/images)
Whether to treat the host as a pool for its physically attached devices is an interesting idea. One alternative is to have an explicit API for listing all host devices (eg, 'lshal'), since I'd certainly like to be able to enumerate any USB devices & any PCI devices, as well as any physical network adapters.
Dan. -- |=- Red Hat, Engineering, Emerging Technologies, Boston. +1 978 392 2496 -=| |=- Perl modules: http://search.cpan.org/~danberr/ -=| |=- Projects: http://freshmeat.net/~danielpb/ -=| |=- GnuPG: 7D3B9505 F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 -=|
-- Libvir-list mailing list Libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list

On Tue, Oct 16, 2007 at 04:19:29PM +0100, Daniel P. Berrange wrote:
Storage attributes
==================
- Local vs network (ext3 vs NFS, SCSI vs iSCSI)
- Private vs shared (IDE vs FibreChannel)
- Pool vs volume (LVM VG vs LV, Directory vs File, Disk vs Partition)
- Container vs guest (OpenVZ vs Xen)
- Attributes
  - Compressed
  - Encrypted
  - Auto-extend
- Snapshots
  - RO
  - RW
- Partition table
  - MBR
  - GPT
- UUID
  - 16 hex digits
  - Unique string
  - SCSI WWID (world wide ID)
- Local Path(s) (/dev/sda, /var/lib/xen/images/foo.img)
- Server Hostname
- Server Identifier (export path/target)
- MAC security label (SELinux)
- Redundancy
  - Mirrored
  - Striped
  - Multipath
- Pool operation
  - RO
  - RW
It was mentioned offlist that I didn't include security/authorization in this mail. I had it in my offline notes...

- NFS
  - server side ACL based on client IP ranges
  - Kerberos GSSAPI. Client credentials taken from /etc/krb5.tab
- iSCSI
  - server side ACL based on client IP ranges
  - CHAP username+password supplied when attaching target to client
  - Spec for Kerberos. Not GSSAPI based. Not implemented in the Linux client or server. Frowned upon by IETF kerberos experts since it isn't GSSAPI
- QCow
  - passphrase needed by the process (eg QEMU) accessing the file
- dm-crypt
  - passphrase needed when activating the volume

Dan.
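These per-protocol credentials could map onto the draft pool XML along the following lines. This is a sketch only: the element and attribute names follow the `<auth>` and `<encrypt>` examples elsewhere in this thread, not a finalised schema.

```xml
<!-- Sketch: possible <auth> representations, per the draft schema -->

<!-- iSCSI with CHAP -->
<source host="someserver" export="sometarget">
  <auth type="chap" username="joe" password="123456"/>
</source>

<!-- NFS with Kerberos, credentials taken from a keytab -->
<source host="someserver" export="/vol/files">
  <auth type="kerberos" keytab="/etc/server.tab"/>
</source>

<!-- QCow encryption: passphrase held by the accessing process -->
<format type="qcow2">
  <encrypt key="123456"/>
</format>
```

The IP-range ACLs are server-side policy, so they arguably do not belong in the client-side pool description at all.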

On Tue, Oct 16, 2007 at 04:19:29PM +0100, Daniel P. Berrange wrote:
Application users
=================
- virt-manager / virt-install
  - Enumerate available pools
  - Allocate volume from pool
  - Create guest with volume
When we support migration the storage API should let us do sanity checking prior to migration. The metadata provided for a pool and a volume should allow an algorithm sort of like this.

For each disk assigned to the guest:

- Lookup volume associated with the path on the source host
- Lookup volume associated with the path on the dest host
- If the dest volume is missing, refuse to migrate
- If the dest volume has a different UUID, refuse to migrate (sync UUID to SCSI worldwide name perhaps?)
- Lookup pool associated with the volume on the source host
- Lookup pool associated with the volume on the dest host
- If the pool is different, then refuse to migrate (catches the case of a different NFS mount being used, or it being a local internal storage pool, for example)

Dan.
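That check loop is mechanical enough to sketch directly. The `Host`/`Vol`/`Pool` types and the `lookup_volume()` call below are stand-ins for whatever the real API ends up returning, not existing libvirt functions:

```python
# Sketch of the pre-migration sanity check. All names here are
# hypothetical stand-ins, not real libvirt API.
from collections import namedtuple

Pool = namedtuple("Pool", "uuid")
Vol = namedtuple("Vol", "uuid pool")

class Host:
    def __init__(self, volumes):
        self._volumes = volumes            # maps guest disk path -> Vol

    def lookup_volume(self, path):
        return self._volumes.get(path)

def can_migrate(guest_disks, src, dst):
    """True only if every guest disk resolves to the same volume and
    pool on both hosts."""
    for path in guest_disks:
        src_vol = src.lookup_volume(path)
        dst_vol = dst.lookup_volume(path)
        if dst_vol is None:
            return False                   # dest volume missing
        if dst_vol.uuid != src_vol.uuid:
            return False                   # different volume at that path
        if dst_vol.pool.uuid != src_vol.pool.uuid:
            return False                   # eg a different NFS mount
    return True
```

The pool comparison is what catches the subtle failure mode: identical paths on both hosts that are actually backed by different storage.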

Here is an updated document about storage which attempts to formulate the various pieces of metadata in an XML representation. In addition I have introduced a 3rd concept of a 'device'. A device represents any physical device attached to a host, be it a disk, a sound card, a USB gizmo, or anything else you would see with 'lshal' (or in sysfs).

I considered Mark's suggestion that we have a 'host pool' in which physical storage devices live, but I think it is important to directly represent the physical devices as a concept separately from pools & volumes. This is because we need this in other areas - we need network device info when setting up networking in guests & virtual networks, we need USB & PCI device info to do device pass-through to the guest, etc. Finally, I have also included more info about permissions & security.

Taxonomy of storage types
=========================

 |
 +- Block
 |   |
 |   +- Disk
 |   |   |
 |   |   +- Direct attached
 |   |   |   |
 |   |   |   +- IDE/ATA disk
 |   |   |   +- SCSI disk
 |   |   |   +- FibreChannel disk
 |   |   |   +- USB disk/flash
 |   |   |   +- FireWire disk/flash
 |   |   |
 |   |   +- Remote attached
 |   |       |
 |   |       +- iSCSI disk
 |   |       +- GNBD disk
 |   |
 |   +- Partition
 |   |
 |   +- Virtual
 |       |
 |       +- Direct attached
 |       |   |
 |       |   +- LVM
 |       |   +- ZFS
 |       |
 |       +- Remote attached
 |           |
 |           +- Cluster LVM
 |
 +- FileSystem
 |   |
 |   +- Direct attached
 |   |   |
 |   |   +- ext2/3/4
 |   |   +- xfs
 |   |   +- ZFS
 |   |
 |   +- Remote attached
 |       |
 |       +- NFS
 |       +- GFS
 |       +- OCFS2
 |
 +- Directory
 |
 +- File
     |
     +- Raw allocated
     +- Raw sparse
     +- QCow2
     +- VMDK

Storage attributes
==================

- Local vs network (ext3 vs NFS, SCSI vs iSCSI)
- Private vs shared (IDE vs FibreChannel)
- Pool vs volume (LVM VG vs LV, Directory vs File, Disk vs Partition)
- Container vs guest (OpenVZ vs Xen)
- Attributes
  - Compressed
  - Encrypted
  - Auto-extend
- Snapshots
  - RO
  - RW
- Partition table
  - MBR
  - GPT
- UUID
  - 16 hex digits
  - Unique string
  - SCSI WWID (world wide ID)
- Local Path(s)
- Server Hostname
- Server Identifier (export path/target)
- MAC security label (SELinux)
- Redundancy (mirrored/striped/multipath)
- Pool operation
  - RO
  - RW
- Authentication
  - Username / Password
  - Client IP/MAC address
  - Kerberos / GSSAPI
  - Passphrase

Nesting hierarchy
=================

- 1 x Host -> N x iSCSI target -> N x LUN -> N x Partition
- N x Disk/Partition -> 1 x LVM VG -> N x LVM LV
- 1 x Filesystem -> N x directory -> N x file
- 1 x File -> 1 x Block (loopback)

Application users
=================

- virt-manager / virt-install
  - Enumerate available pools
  - Allocate volume from pool
  - Create guest with volume
- virt-clone
  - Copy disks
  - Snapshot disks
- virt-df
  - Filesystem usage
- pygrub
  - Extract kernel/initrd from filesystem
- virt-factory
  - Manage storage pools
  - Pre-migration sanity checks
- virt-backup
  - Snapshot disks
- virt-p2v
  - Snapshot disks

Storage representation
======================

Three core concepts:

- Device - a physical device attached to a host
  - associated with a bus / subsystem (scsi, usb, ide, etc)
  - bus-specific identifier (vendor+product ID?)
  - a driver type
  - unique id / serial number
  - device name for its current mapping into the filesystem
- Pool - a pool of storage
  - contains free space
  - allocated to provide volumes
  - comprised of devices, or a remote server
- Volume - a chunk of storage
  - assignable to a guest
  - part of a pool

XML description
===============

Storage pools
-------------

High level:

- Type - the representation of the storage pool
- Source - the underlying data storage location
- Target - mapping to local filesystem (if applicable)

The XML only provides information that describes the pool itself, ie information about the physical devices underlying the pool is not maintained here.
- A directory within a filesystem

    <pool type="dir">
      <name>xenimages</name>
      <uuid>12345678-1234-1234-1234-123456781234</uuid>
      <target file="/var/lib/xen/images"/>
      <permissions>
        <mode>0700</mode>
        <owner>root</owner>
        <group>virt</group>
        <label>xen_image_t</label>
      </permissions>
    </pool>

- A dedicated filesystem

    <pool type="fs">
      <name>xenimages</name>
      <uuid>12345678-1234-1234-1234-123456781234</uuid>
      <source dev="/dev/sda1"/>
      <target file="/var/lib/xen/images"/>
      <permissions>
        <mode>0700</mode>
        <owner>root</owner>
        <group>virt</group>
        <label>xen_image_t</label>
      </permissions>
    </pool>

- A dedicated disk

    <pool type="disk">
      <name>xenimages</name>
      <uuid>12345678-1234-1234-1234-123456781234</uuid>
      <source dev="/dev/sda"/>
      <permissions>
        <mode>0700</mode>
        <owner>root</owner>
        <group>virt</group>
        <label>xen_image_t</label>
      </permissions>
    </pool>

- A logical volume group with 3 physical volumes

    <pool type="lvm">
      <name>xenimages</name>
      <uuid>12345678-1234-1234-1234-123456781234</uuid>
      <source dev="/dev/sda1"/>
      <source dev="/dev/sdb1"/>
      <source dev="/dev/sdc1"/>
      <target dev="/dev/VirtVG"/>
    </pool>

- A network filesystem

    <pool type="nfs">
      <name>xenimages</name>
      <uuid>12345678-1234-1234-1234-123456781234</uuid>
      <source host="someserver" export="/vol/files">
        <auth type="kerberos" keytab="/etc/server.tab"/>
      </source>
      <target file="/var/lib/xen/images"/>
      <permissions>
        <mode>0700</mode>
        <owner>root</owner>
        <group>virt</group>
        <label>xen_image_t</label>
      </permissions>
    </pool>

- An iSCSI target

    <pool type="iscsi">
      <name>xenimages</name>
      <uuid>12345678-1234-1234-1234-123456781234</uuid>
      <source host="someserver" export="sometarget">
        <auth type="chap" username="joe" password="123456"/>
      </source>
    </pool>

XXX Some kind of indication as to whether a pool allows creation of new volumes, or merely use of existing ones
XXX flag for whether volumes will be file or block based
XXX capacity / usage information if available
XXX indicate whether a pool can be activated/deactivated, vs being permanently in an active state

Storage volumes
---------------

High level:

- Unique name within pool
- Data format type (qcow, raw, vmdk)

- FS / Dir / NFS volume

    <volume type="file">
      <name>foo</name>
      <format type="qcow2">
        <encrypt key="123456"/>
        <compress/>
      </format>
      <capacity>1000000</capacity>
      <allocation>100</allocation>
      <permissions>
        <mode>0700</mode>
        <owner>root</owner>
        <group>virt</group>
        <label>xen_image_t</label>
      </permissions>
      <target file="/var/lib/xen/images/foo.img"/>
    </volume>

- iSCSI / LVM / Partition

    <volume type="block">
      <name>foo</name>
      <capacity>1000000</capacity>
      <allocation>100</allocation>
      <permissions>
        <mode>0700</mode>
        <owner>root</owner>
        <group>virt</group>
        <label>xen_image_t</label>
      </permissions>
      <target dev="/dev/HostVG/foo"/>
      <snapshots>
        <snapshot name="bar"/>
      </snapshots>
    </volume>

XXX VMWare's VMDK can be made up of many chained files
XXX QCow stores snapshots internally, with a name, while LVM stores them as separate volumes with a link. Listing snapshots alongside the master volume seems to allow both to be represented.
XXX flag to indicate whether it is resizable?

Host devices
------------

This is not just limited to storage devices. Basically a representation of the same data provided by HAL (cf lshal)

- Opaque name, vendor & product strings
- Subsystem specific unique identifier for vendor/product/model
- Capability type, eg storage, sound, network, etc

    <device>
      <name>/org/freedesktop/Hal/devices/volume_part2_size_99920701440</name>
      <vendor name="Some Vendor"/>
      <product name="Some Disk"/>
      <subsystem type="usb">
        <product id="2345"/>
        <vendor id="2345"/>
      </subsystem>
      <class type="storage">
        <block dev="/dev/
        <bus type="ide"/>
        <drive type="cdrom"/>
      </class>
    </device>

NB, 'class' sort of maps to HAL's 'capability' field, though HAL allows for multiple capabilities per device.
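For a sense of how little machinery a client needs to consume these documents, here is a sketch that pulls the high-level fields out of a draft pool description with only the standard library XML parser. The schema is the draft one from this thread, not a finalised format:

```python
# Sketch: parse the draft pool XML from this thread (schema is a
# proposal, not a finalised libvirt format) using only the stdlib.
import xml.etree.ElementTree as ET

POOL_XML = """
<pool type="fs">
  <name>xenimages</name>
  <uuid>12345678-1234-1234-1234-123456781234</uuid>
  <source dev="/dev/sda1"/>
  <target file="/var/lib/xen/images"/>
</pool>
"""

def parse_pool(xml_text):
    """Return the type/source/target triple described under 'High level'."""
    root = ET.fromstring(xml_text)
    target = root.find("target")
    return {
        "type": root.get("type"),
        "name": root.findtext("name"),
        "uuid": root.findtext("uuid"),
        "sources": [s.get("dev") for s in root.findall("source")],
        "target": target.get("file") if target is not None else None,
    }

pool = parse_pool(POOL_XML)
```

Multiple `<source>` elements (as in the LVM example) naturally come back as a list, which fits the "pool built from one or more volumes" model.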
Operations
==========

Limited set of operations to perform

For devices:

- List devices
- List devices by class
- Lookup by path
- Lookup by name

For pools:

- List pools (logical volume groups, partitioned devs, filesystems)
- Define pool (eg create directory, or define iSCSI target)
- Undefine pool
- Activate pool (mount NFS volume, login to iSCSI target)
- Deactivate pool
- Dump pool XML
- Lookup by path
- Lookup by UUID
- Lookup by name

For volumes:

- List volumes (takes a pool as a param)
- Create volume (takes a pool as a param)
- Destroy volume
- Resize volume
- Copy volume
- Snapshot volume
- Dump volume XML
- Lookup by path
- Lookup by UUID
- Lookup by name

http://www.redhat.com/archives/libvir-list/2007-February/msg00010.html
http://www.redhat.com/archives/libvir-list/2007-September/msg00119.html

Implementation
==============

- devices - sysfs / HAL
- FileSystem/Directory/File - POSIX APIs
- LVM - lvm tools
- Disk/partitions - sysfs / HAL / parted
- iSCSI - sysfs / HAL / iscsi utils
- ZFS - ZFS tools

NB, HAL gets all its info from sysfs, so we can choose to use HAL, or go directly to sysfs. The former is more easily portable to Solaris, but does require the software dependency stack to include HAL, DBus, ConsoleKit, PolicyKit and GLib. We already have a DBus dep via Avahi.

Dan.
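The "go directly to sysfs" option for block device enumeration is nearly trivial; a minimal sketch, with the sysfs root overridable so it can be exercised off a real Linux host (the function name is illustrative, not a proposed API):

```python
# Sketch: enumerate block devices straight from sysfs, the non-HAL
# option discussed above. Function name is illustrative only.
import os

def list_block_devices(sysfs_root="/sys/block"):
    """Return block device names visible under sysfs (eg sda, dm-0)."""
    try:
        return sorted(os.listdir(sysfs_root))
    except OSError:
        return []        # no sysfs here (non-Linux host, test environment)
```

What this cannot give you is the vendor/product/capability metadata in the device XML above; that is exactly the gap HAL fills, at the cost of the dependency stack noted in the Implementation section.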
participants (4)

- Chris Lalancette
- Daniel P. Berrange
- Mark McLoughlin
- Richard W.M. Jones