On Wed, Aug 11, 2010 at 05:59:55PM +0100, Daniel P. Berrange wrote:
> On Tue, Aug 10, 2010 at 12:44:06PM -0400, David Teigland wrote:
> > Hi,
> >
> > We've been working on a program called sync_manager that implements
> > shared-storage-based leases to protect shared resources. One way we'd like
> > to use it is to protect vm images that reside on shared storage,
> > i.e. preventing two VMs on two hosts from using the same image at once.
> There are two different, but related problems here:
> - Preventing 2 different VMs using the same disk
> - Preventing the same VM running on 2 hosts at once
> The first requires that there is a lease per configured disk (since
> a guest can have multiple disks). The latter requires a lease per
> VM and can ignore specifics of what disks are configured.
> IIUC, sync-manager is aiming for the latter.
The present integration effort is aiming for the latter. sync_manager
itself aims to be agnostic about what it's managing.
> > It's functional, and the next big step is using it through libvirt.
> >
> > sync_manager "wraps" a process, i.e. acquires a lease, forks&execs a
> > process, renews the lease while the process runs, and releases the lease
> > when the process exits. While the process runs, it has exclusive access
> > to whatever resource was named in the lease that was acquired.
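To make that concrete, the wrap lifecycle is roughly the following; a
simplified sketch, where the lease helpers are hypothetical stand-ins
for the disk-paxos code, not the actual sync_manager implementation:

    #include <sys/wait.h>
    #include <unistd.h>

    #define RENEW_INTERVAL 10   /* seconds; illustrative value */

    /* hypothetical helpers standing in for the disk-paxos code */
    extern int acquire_lease(void);
    extern int renew_lease(void);
    extern int release_lease(void);

    /* acquire -> fork/exec -> renew while it runs -> release */
    static int wrap(char **argv)
    {
            pid_t pid;
            int status = -1;

            if (acquire_lease() < 0)
                    return -1;

            pid = fork();
            if (pid == 0) {
                    execvp(argv[0], argv);
                    _exit(127);
            }

            /* WNOHANG returns 0 while the child is still running */
            while (waitpid(pid, &status, WNOHANG) == 0) {
                    renew_lease();
                    sleep(RENEW_INTERVAL);
            }

            release_lease();
            return status;
    }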
> There are complications around migration we need to consider too.
> During migration, you actually need QEMU running on two hosts at
> once. IIRC the idea is that before starting the migration operation,
> we'd have to tell sync-manager to mark the lease as shared with a
> specific host. The destination QEMU would have to start up in shared
> mode, and upgrade this to an exclusive lock when migration completes,
> or quit when migration fails.
sync_manager leases can only be exclusive, so it's a matter of transferring
ownership of the exclusive lock from source host to destination host. We
have not yet added lease transfer capabilities to sync_manager, but it
might look something like this:
S = source host, sm-S = sync_manager on S, ...
D = destination host, sm-D = sync_manager on D, ...
1. sm-S holds the lease, and is monitoring qemu
2. migration begins from S to D
3. libvirt-D runs sm-D: sync_manager -c qemu with the addition of a new
sync_manager option --receive-lease
4. sm-D writes its hostid D to the lease area signaling sm-S that it wants
to be the lease owner when S is done with it
5. sm-D begins monitoring the lease owner on disk (which is still S)
6. sm-D forks qemu-D
7. sm-S sees that D wants the lease
8. qemu-S exits with success
9. sm-S sees qemu-S exit with success
10. sm-S writes D as the lease owner into the lease area and exits
(in the non-migration/transfer case, sm-S writes owner=LEASE_FREE)
11. sm-D (still monitoring the lease owner) sees that it has become the
owner, and begins renewing the lease
12. qemu-D runs fully
I don't know enough (anything) about qemu migration yet to say if those
steps work correctly or safely. One concern is that qemu-D should not
enter a state where it can write until we are certain that D has been
written as the lease's owner.
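To make the handoff concrete, here's a rough C sketch of sm-D's side
(steps 4, 5 and 11). The record layout and all names here are invented
for illustration; this is not the actual sync_manager on-disk format,
and real reads and writes would go through the disk-paxos machinery
rather than plain pread/pwrite:

    #include <stdint.h>
    #include <unistd.h>

    #define LEASE_FREE 0

    /* invented layout for the record at <path>:<offset> */
    struct lease_record {
            char     resource_name[64];
            uint64_t owner_hostid;   /* current owner, or LEASE_FREE */
            uint64_t wanted_hostid;  /* host asking for a transfer */
    };

    /* step 4: signal sm-S that we want the lease when S is done */
    static int request_transfer(int fd, off_t off, uint64_t my_id)
    {
            struct lease_record rec;

            if (pread(fd, &rec, sizeof(rec), off) != sizeof(rec))
                    return -1;
            rec.wanted_hostid = my_id;
            if (pwrite(fd, &rec, sizeof(rec), off) != sizeof(rec))
                    return -1;
            return fsync(fd);
    }

    /* steps 5 and 11: watch the owner field; only after it reads
       back as our own host id may we begin renewing, and only then
       should qemu-D be allowed to write to the disk image */
    static int wait_for_ownership(int fd, off_t off, uint64_t my_id)
    {
            struct lease_record rec;

            for (;;) {
                    if (pread(fd, &rec, sizeof(rec), off) != sizeof(rec))
                            return -1;
                    if (rec.owner_hostid == my_id)
                            return 0;  /* owner now; start renewing */
                    if (rec.owner_hostid == LEASE_FREE)
                            return -1; /* S released instead of handing over */
                    sleep(1);          /* still owned by S */
            }
    }

The ordering addresses the concern above: qemu-D can be forked early
(step 6), but wait_for_ownership() returning 0 is the gate before it
may touch the image.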
> >   sync_manager daemon -i <host_id> -n <vm_id> -l <lease> -c <command> <args>
> >
> > <lease> defines the shared storage area that sync_manager should
> > use for performing the disk-paxos based synchronization.
> > It consists of <resource_name>:<path>:<offset>, where
> > <resource_name> is likely to be the vm name/uuid (or the
> > name of the vm's disk image), and <path>:<offset> is an
> > area of shared storage that has been allocated for
> > sync_manager to use (a separate area for each resource/vm).
> Can you give some real examples of the lease arg? I guess <path> must
> exclude the ':' character, or have some defined escaping scheme.
-l vm0:/dev/vg/lease_area:0
(exclude : from paths)
Manually setting up, initializing and keeping track of lease areas would be
a pain, so we'll definitely be looking at adding that to higher-level tools.
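As an illustration of how simple excluding ':' from paths makes the
parsing, a hypothetical helper (not the actual sync_manager option
parsing) could be:

    #include <stdio.h>

    /* parse "<resource_name>:<path>:<offset>",
       e.g. "vm0:/dev/vg/lease_area:0"; relies on ':' being
       excluded from <path> */
    struct lease_arg {
            char resource_name[64];
            char path[256];
            unsigned long long offset;
    };

    static int parse_lease_arg(const char *arg, struct lease_arg *la)
    {
            if (sscanf(arg, "%63[^:]:%255[^:]:%llu",
                       la->resource_name, la->path, &la->offset) != 3)
                    return -1;
            return 0;
    }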
> The <host_id> obviously needs to be in /etc/libvirt/sync-manager.conf
> since that's a per-host config. I assume the shared storage area is per
> host too?
> That leaves just the VM name/uuid as a per-VM config option, and we
> obviously already have that in XML. Is there actually any extra
> attribute we need to track per-guest in the XML? If not this will
> simplify life, because we won't have to track sync-manager specific
> attributes.
With the plugin style hooks you describe below, it seems all the
sync_manager config could be kept separate from the libvirt config.
> In terms of integration with libvirt, I think it is desirable that we
> keep libvirt and sync-manager loosely coupled, i.e. we don't want to
> hardcode libvirt using sync-manager, nor do we want to hardcode
> sync-manager only working with libvirt.
> This says to me that we need to provide a well-defined plugin system for
> providing a 'supervisor process' for QEMU guests. Essentially a dlopen()
> module that provides a handful (< 10) of callbacks which are triggered in
> appropriate codepaths. At minimum I expect we need
> - A callback at ARGV building, to let extra sync-manager ARGV be injected
> - A callback at VM startup. Not needed for sync-manager, but to allow for
>   alternate impls that aren't based around supervising.
> - A callback at VM shutdown. Just to clean up resources
> - A callback in the VM destroy method, in case we need to do something
>   other than just kill($PID) the QEMU $PID (e.g. to perhaps
>   tell sync-manager to kill QEMU instead of killing it ourselves)
> - Several callbacks at various stages of migration to deal with
>   lock downgrade/upgrade
sounds good
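To make sure we're picturing the same thing, here's a rough sketch of
such a callback table; every name below is invented, this is not an
existing libvirt interface:

    #include <sys/types.h>

    typedef struct _qemuSupervisorDriver qemuSupervisorDriver;

    struct _qemuSupervisorDriver {
            const char *name;

            /* inject extra argv (e.g. the sync_manager wrapper and
               its -i/-n/-l options) while building the QEMU command
               line */
            int (*buildArgv)(const char *vmname, char ***argv, int *argc);

            /* lifecycle hooks; startup would be a no-op for
               sync_manager, but allows impls that aren't based
               around supervising */
            int (*vmStartup)(const char *vmname);
            int (*vmShutdown)(const char *vmname);

            /* replace the plain kill() of the QEMU pid, e.g. ask
               sync_manager to kill QEMU instead of doing it
               ourselves */
            int (*vmDestroy)(const char *vmname, pid_t pid);

            /* migration stages, for the lease transfer steps above */
            int (*migrateBegin)(const char *vmname);
            int (*migrateFinish)(const char *vmname, int success);
    };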
> The one further complication is with the security drivers. IIUC, we
> will absolutely not want QEMU to have any access to the shared storage
> lease area. The problem is that if we just inject the wrapper process
> as is, sync-manager will end up running with the exact same privileges
> as QEMU, i.e. same UID:GID, and same selinux context. I'm really not at
> all sure how to deal with this problem, because our core design is that
> the thing we spawn inherits the privileges we set up at fork() time. We
> don't want to delegate the security setup to sync-manager, because it
> introduces a huge variable condition in the security system. We need a
> guaranteed consistent security setup for QEMU, regardless of the
> supervisor process in use.
It might not be a big problem for qemu to write to its own lease area,
but writing to another's probably would be (e.g. at a different offset on
the same LV). That implies a separate lease LV per qemu; I'll have to find
out how close that gets to LVM scalability limits.
Dave