On Tue, Aug 10, 2010 at 12:44:06PM -0400, David Teigland wrote:
Hi,
We've been working on a program called sync_manager that implements
shared-storage-based leases to protect shared resources. One way we'd like
to use it is to protect VM images that reside on shared storage,
i.e. preventing two VMs on two hosts from using the same image at once.
There are two different, but related, problems here:
- Preventing 2 different VMs using the same disk
- Preventing the same VM running on 2 hosts at once
The first requires a lease per configured disk (since
a guest can have multiple disks). The latter requires a lease per
VM and can ignore the specifics of what disks are configured.
IIUC, sync-manager is aiming for the latter.
It's functional, and the next big step is using it through
libvirt.
sync_manager "wraps" a process, i.e. acquires a lease, forks & execs a
process, renews the lease while the process runs, and releases the lease
when the process exits. While the process runs, it has exclusive access
to whatever resource was named in the lease that was acquired.
There are complications around migration we need to consider too.
During migration, you actually need QEMU running on two hosts at
once. IIRC the idea is that before starting the migration operation,
we'd have to tell sync-manager to mark the lease as shared with a
specific host. The destination QEMU would have to startup in shared
mode, and upgrade this to an exclusive lock when migration completes,
or quit when migration fails.
For libvirt integration, this means we need to consider an approach
that is much broader than just adding a wrapper process.
A command would be something like this:
sync_manager daemon -i <host_id> -n <vm_id> -l <lease> -c <command> <args>
<host_id> is an integer between 1 and 2000 that is statically
assigned to each host.
<vm_id> is a unique identifier for the process that will be run,
e.g. the vm name or uuid.
<lease> defines the shared storage area that sync_manager should
use for performing the disk-paxos based synchronization.
It consists of <resource_name>:<path>:<offset>, where
<resource_name> is likely to be the vm name/uuid (or the
name of the vm's disk image), and <path>:<offset> is an
area of shared storage that has been allocated for
sync_manager to use (a separate area for each resource/vm).
Can you give some real examples of the lease arg ? I guess <path> must
exclude the ':' character, or have some defined escaping scheme.
<command> <args>
would be the qemu command line that is currently used.
We expect these new config values will need a place to live in the libvirt
xml config file, and libvirt will need to fork sync_manager -c qemu rather
than qemu directly. At least those are the most obvious things that need
doing, there are sure to be other api or functional issues.
The <host_id> obviously needs to be in /etc/libvirt/sync-manager.conf
since that's a per-host config. I assume the shared storage area is per
host too ?
That leaves just the VM name/uuid as a per-VM config option, and we
obviously already have that in XML. Is there actually any extra
attribute we need to track per-guest in the XML ? If not, this will
simplify life, because we won't have to track sync-manager-specific
attributes.
sync_manager only forks when running the command, and doesn't change any
fds, so any command output should appear unchanged to libvirt. Would
there be any problem with sync_manager also printing its own warnings and
errors to stderr?
That's fine. Printing to stderr upon error is actually required. This is
the only way we can get VM startup failure messages back up to the user.
While the main point of sync_manager is the disk leases to synchronize
access among hosts, it also uses posix locks to synchronize access among
local processes.
http://git.fedorahosted.org/git/?p=sync_manager.git
No docs ?
In terms of integration with libvirt, I think it is desirable that we keep
libvirt and sync-manager loosely coupled. ie We don't want to hardcode
libvirt using sync-manager, nor do we want to hardcode sync-manager only
working with libvirt.
This says to me that we need to provide a well defined plugin system for
providing a 'supervisor process' for QEMU guests. Essentially a dlopen()
module that provides a handful (< 10) callbacks which are triggered in
appropriate codepaths. At minimum I expect we need
- A callback at ARGV building, to let extra sync-manager ARGV to be injected
- A callback at VM startup. Not needed for sync-manager, but to allow for
alternate impls that aren't based around supervising.
- A callback at VM shutdown. Just to cleanup resources
- A callback in the VM destroy method, in case we need to do something
different other than just kill($PID) the QEMU $PID. (eg to perhaps
tell sync-manager to kill QEMU instead of killing it ourselves)
- Several callbacks at various stages of migration to deal with
lock downgrade/upgrade
The one further complication is with the security drivers. IIUC, we will
absolutely not want QEMU to have any access to the shared storage lease
area. The problem is that if we just inject the wrapper process as is,
sync-manager will end up running with exact same privileges as QEMU.
ie same UID:GID, and same selinux context. I'm really not at all sure
how to deal with this problem, because our core design is that the thing
we spawn inherits the privileges we setup at fork() time. We don't want
to delegate the security setup to sync-manager, because it introduces
a huge variable condition in the security system. We need guaranteed
consistent security setup for QEMU, regardless of supervisor process
in use.
Daniel
--
|: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://deltacloud.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|