On Thu, 7 Feb 2013 16:09:31 +0000,
"Daniel P. Berrange" <berrange@redhat.com> wrote:
> On Mon, Feb 04, 2013 at 08:13:20PM +0400, Alexander Gordeev wrote:
> > Hi!
> >
> > I'd like to develop libvirt integration with Parallels Cloud
> > Storage (Pstorage), which is a new distributed file system from
> > Parallels designed specifically for storing VM images:
> >
> > http://www.parallels.com/products/pcs/cloud-storage/
> Yay, yet another filesystem :-( Everyone seems to think they need to
> write their own custom network/cluster/cloud FUSE filesystem these
> days.
Do you think Parallels would invest many man-years of development if we
could just take existing solutions? :) Pstorage offers a unique
combination of features: strong consistency, replication, and high
availability. It's also very fast.
Let's compare:
1. Ceph uses btrfs, which is not considered stable yet.
2. Sheepdog supports only synchronous writes and has poor performance.
3. GlusterFS doesn't offer strong consistency. Strong consistency is
required for real filesystems (NTFS, ext3/4, ...) because that's what
HDDs provide.
Please correct me if I'm wrong. I haven't tried these filesystems
myself; this is just a summary from our internal wiki.
Pstorage includes three components: metadata servers, chunk servers,
and clients. Metadata servers hold the distributed metadata and
synchronize it using Paxos (this is how we provide strong consistency).
Chunk servers store replicated pieces of VM images. The preferred
method of access is the FUSE client, but you are not limited to it.
You can also access a Pstorage cluster directly using the client
library (the FUSE client uses it too, of course). There is also an
LD_PRELOAD library that intercepts syscalls like open, read, write,
etc.; I think it's the same approach as in GlusterFS.
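
Until a dedicated backend exists, a Pstorage FUSE mount can already be
used through a plain 'dir' pool. A minimal sketch, assuming the cluster
is mounted at /mnt/pstorage (the path is just an example):

  <pool type='dir'>
    <name>pstorage</name>
    <target>
      <path>/mnt/pstorage</path>
    </target>
  </pool>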
The FUSE client is very fast when used with O_DIRECT + aio; synchronous
access is much slower. For example, on a single machine with one HDD,
async access achieves almost the full speed of the HDD, 120-130 MB/s,
while sync access is about 2.5 MB/s. AFAIK this cannot be fixed without
some improvements in the kernel. We have some patches, but we don't
expect them to reach mainline soon. Still, we want to make it easy to
use Pstorage with KVM/Xen and libvirt now. BTW, I think
user-configurable per-pool defaults can be useful for others as well.
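
For reference, these are the per-disk settings in question: cache='none'
makes QEMU open the image with O_DIRECT, and io='native' selects Linux
aio. A minimal disk element (the image path is just an example):

  <disk type='file' device='disk'>
    <driver name='qemu' type='raw' cache='none' io='native'/>
    <source file='/mnt/pstorage/vm1.img'/>
    <target dev='vda' bus='virtio'/>
  </disk>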
> > It's not a POSIX FS, but there is a FUSE client for it that can be
> > used to access and manipulate images. It's quite fast, but only
> > when used with O_DIRECT + aio. I tried to set up several KVMs on top
> > of a Pstorage mount using virt-manager. It worked well, but I had to:
> > 1. tune the cache and IO settings for every disk
> > 2. disable space allocation by libvirt, because it uses sync IO and
> > is therefore slow
> >
> > I tried to find ways to solve the first issue, and IMHO this can be
> > done by adding a way to specify per-pool defaults for cache and IO
> > settings. I didn't find any way to do this in the current code or UI.
> > I'd also like to add a new storage backend that will be a 'dir'
> > backend with the extra ability to manage Pstorage mount points, plus
> > UI integration. I'd like to merge my work into the main tree when
> > it's finished, if possible.
> I don't think that putting cache/IO defaults in the XML is really
> appropriate. There are too many different scenarios which want
> different settings to be able to clearly identify one set of
> perfect settings. I see this as primarily a documentation problem.
> Someone needs to write something describing the different settings
> for each storage pool & what the pros/cons are for each. Downstream
> app developers can then make decisions about suitable defaults for
> their use cases.
Do you mean that I should patch virt-manager/oVirt to store per-pool
defaults instead? IMO that's a bad idea. Imagine that I create a pool
on one machine and select some reasonable defaults for it in this
instance of virt-manager. Then I create an image from another instance
of virt-manager on a different machine, which doesn't have these
defaults and therefore will not apply them. I don't think this is
user-friendly.
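
To illustrate what I mean, per-pool defaults could live in the pool XML
itself, so that every client picks them up automatically. This is only
a sketch of one possible syntax; no such element exists in libvirt
today:

  <pool type='dir'>
    <name>pstorage</name>
    <target>
      <path>/mnt/pstorage</path>
    </target>
    <!-- hypothetical element: default driver settings for volumes in
         this pool -->
    <volume_defaults cache='none' io='native'/>
  </pool>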
I already made a step-by-step guide about creating a VM on a Pstorage
pool, and it's not exactly easy. You should definitely avoid
preallocating the image, because it will be preallocated synchronously,
and you have to set cache=none and io=native before starting the VM for
the installation, or it will take hours.
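
On the preallocation side, creating the volume sparse avoids the slow
synchronous allocation entirely; in the volume XML that means requesting
zero initial allocation (the name and size are just examples):

  <volume>
    <name>vm1.img</name>
    <capacity unit='G'>20</capacity>
    <allocation>0</allocation>
    <target>
      <format type='raw'/>
    </target>
  </volume>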
I think selecting some reasonable defaults is the right thing to do,
rather than forcing users to read documentation.
--
Alexander