[libvirt] using sync_manager with libvirt

Hi,

We've been working on a program called sync_manager that implements shared-storage-based leases to protect shared resources. One way we'd like to use it is to protect vm images that reside on shared storage, i.e. preventing two vm's on two hosts from using the same image at once. It's functional, and the next big step is using it through libvirt.

sync_manager "wraps" a process, i.e. acquires a lease, forks&execs a process, renews the lease while the process runs, and releases the lease when the process exits. While the process runs, it has exclusive access to whatever resource was named in the lease that was acquired.

A command would be something like this:

sync_manager daemon -i <host_id> -n <vm_id> -l <lease> -c <command> <args>

<host_id> is an integer between 1 and 2000 that is statically assigned to each host.

<vm_id> is a unique identifier for the process that will be run, e.g. the vm name or uuid.

<lease> defines the shared storage area that sync_manager should use for performing the disk-paxos based synchronization. It consists of <resource_name>:<path>:<offset>, where <resource_name> is likely to be the vm name/uuid (or the name of the vm's disk image), and <path>:<offset> is an area of shared storage that has been allocated for sync_manager to use (a separate area for each resource/vm).

<command> <args> would be the qemu command line that is currently used.

We expect these new config values will need a place to live in the libvirt xml config file, and libvirt will need to fork sync_manager -c qemu rather than qemu directly. At least those are the most obvious things that need doing; there are sure to be other api or functional issues.

sync_manager only forks when running the command, and doesn't change any fd's, so any command output should appear unchanged to libvirt. Would there be any problem with sync_manager also printing its own warnings and errors to stderr?

While the main point of sync_manager is the disk leases to synchronize access among hosts, it also uses posix locks to synchronize access among local processes.

http://git.fedorahosted.org/git/?p=sync_manager.git

Thanks,
Dave
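To make the wrap model above concrete, here is a minimal C sketch of that lifecycle: acquire the lease, fork and exec the command, renew while it runs, release when it exits. The acquire/renew/release functions are stubs for illustration only, not sync_manager's real internals.

#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

/* Placeholder lease operations -- NOT the real sync_manager code, just
 * stubs so the control flow below is runnable.  They print to stderr,
 * which is also where the wrapper's own warnings would go. */
static int acquire_lease(const char *resource)
{
    fprintf(stderr, "acquiring lease on %s\n", resource);
    return 0;
}

static int renew_lease(const char *resource)
{
    fprintf(stderr, "renewing lease on %s\n", resource);
    return 0;
}

static void release_lease(const char *resource)
{
    fprintf(stderr, "releasing lease on %s\n", resource);
}

int main(int argc, char **argv)
{
    if (argc < 3) {
        fprintf(stderr, "usage: %s <resource> <command> [args...]\n", argv[0]);
        return 1;
    }
    const char *resource = argv[1];

    /* never start the command unless the lease was acquired */
    if (acquire_lease(resource) != 0)
        return 1;

    pid_t pid = fork();
    if (pid < 0) {
        release_lease(resource);
        return 1;
    }
    if (pid == 0) {
        /* fds are left untouched, so the child's output reaches the
         * caller unchanged, as described above */
        execvp(argv[2], &argv[2]);
        _exit(127);
    }

    int status = 0;
    while (waitpid(pid, &status, WNOHANG) == 0) {
        if (renew_lease(resource) != 0) {
            /* if renewal fails, the command must not be allowed to keep
             * writing to the protected resource */
            kill(pid, SIGKILL);
            waitpid(pid, &status, 0);
            break;
        }
        sleep(10); /* renewal interval, arbitrary for this sketch */
    }

    release_lease(resource);
    return WIFEXITED(status) ? WEXITSTATUS(status) : 1;
}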

On Tue, Aug 10, 2010 at 12:44:06PM -0400, David Teigland wrote:
Hi,
We've been working on a program called sync_manager that implements shared-storage-based leases to protect shared resources. One way we'd like to use it is to protect vm images that reside on shared storage, i.e. preventing two vm's on two hosts from using the same image at once.
There's two different, but related problems here:

- Preventing 2 different VMs using the same disk
- Preventing the same VM running on 2 hosts at once

The first requires that there is a lease per configured disk (since a guest can have multiple disks). The latter requires a lease per VM and can ignore specifics of what disks are configured. IIUC, sync-manager is aiming for the latter.
It's functional, and the next big step is using it through libvirt.
sync_manager "wraps" a process, i.e. acquires a lease, forks&execs a process, renews the lease wile the process runs, and releases the lease when the process exits. While the process runs, it has exclusive access to whatever resource was named in the lease that was acquired.
There are complications around migration we need to consider too. During migration, you actually need QEMU running on two hosts at once. IIRC the idea is that before starting the migration operation, we'd have to tell sync-manager to mark the lease as shared with a specific host. The destination QEMU would have to startup in shared mode, and upgrade this to an exclusive lock when migration completes, or quit when migration fails. For libvirt integration, this means we need to consider an approach that is much broader than just adding a wrapper process.
A command would be something like this:
sync_manager daemon -i <host_id> -n <vm_id> -l <lease> -c <command> <args>
<host_id> is an integer between 1 and 2000 that is statically assigned to each host.
<vm_id> is a unique identifier for the process that will be run, e.g. the vm name or uuid.
<lease> defines the shared storage area that sync_manager should use for performing the disk-paxos based synchronization. It consists of <resource_name>:<path>:<offset>, where <resource_name> is likely to be the vm name/uuid (or the name of the vm's disk image), and <path>:<offset> is an area of shared storage that has been allocated for sync_manager to use (a separate area for each resource/vm).
Can you give some real examples of the lease arg ? I guess <path> must exclude the ':' character, or have some defined escaping scheme.
<command> <args> would be the qemu command line that is currently used.
We expect these new config values will need a place to live in the libvirt xml config file, and libvirt will need to fork sync_manager -c qemu rather than qemu directly. At least those are the most obvious things that need doing, there are sure to be other api or functional issues.
The <host_id> obviously needs to be in /etc/libvirt/sync-manager.conf since that's a per-host config. I assume the shared storage area is per host too ?

That leaves just the VM name/uuid as a per-VM config option, and we obviously already have that in XML. Is there actually any extra attribute we need to track per-guest in the XML ? If not this will simplify life, because we won't have to track sync-manager specific attributes.
sync_manager only forks when running the command, and doesn't change any fd's, so any command output should appear unchanged to libvirt. Would there be any problem with sync_manager also printing its own warnings and errors to stderr?
That's fine. Printing to stderr upon error is actually required. This is the only way we can get VM startup failure messages back up to the user.
While the main point of sync_manager is the disk leases to synchronize access among hosts, it also uses posix locks to synchronize access among local processes.
No docs ?

In terms of integration with libvirt, I think it is desirable that we keep libvirt and sync-manager loosely coupled. ie We don't want to hardcode libvirt using sync-manager, nor do we want to hardcode sync-manager only working with libvirt.

This says to me that we need to provide a well defined plugin system for providing a 'supervisor process' for QEMU guests. Essentially a dlopen() module that provides a handful (< 10) callbacks which are triggered in appropriate codepaths. At minimum I expect we need:

- A callback at ARGV building, to let extra sync-manager ARGV be injected
- A callback at VM startup. Not needed for sync-manager, but to allow for alternate impls that aren't based around supervising.
- A callback at VM shutdown. Just to cleanup resources
- A callback in the VM destroy method, in case we need to do something different other than just kill($PID) the QEMU $PID. (eg to perhaps tell sync-manager to kill QEMU instead of killing it ourselves)
- Several callbacks at various stages of migration to deal with lock downgrade/upgrade

The one further complication is with the security drivers. IIUC, we will absolutely not want QEMU to have any access to the shared storage lease area. The problem is that if we just inject the wrapper process as is, sync-manager will end up running with the exact same privileges as QEMU, ie same UID:GID, and same selinux context.

I'm really not at all sure how to deal with this problem, because our core design is that the thing we spawn inherits the privileges we set up at fork() time. We don't want to delegate the security setup to sync-manager, because it introduces a huge variable condition in the security system. We need guaranteed consistent security setup for QEMU, regardless of the supervisor process in use.

Daniel

On Wed, Aug 11, 2010 at 05:59:55PM +0100, Daniel P. Berrange wrote:
On Tue, Aug 10, 2010 at 12:44:06PM -0400, David Teigland wrote:
Hi,
We've been working on a program called sync_manager that implements shared-storage-based leases to protect shared resources. One way we'd like to use it is to protect vm images that reside on shared storage, i.e. preventing two vm's on two hosts from using the same image at once.
There's two different, but related problems here:
- Preventing 2 different VMs using the same disk
- Preventing the same VM running on 2 hosts at once

The first requires that there is a lease per configured disk (since a guest can have multiple disks). The latter requires a lease per VM and can ignore specifics of what disks are configured.
IIUC, sync-manager is aiming for the latter.
The present integration effort is aiming for the latter. sync_manager itself aims to be agnostic about what it's managing.
It's functional, and the next big step is using it through libvirt.
sync_manager "wraps" a process, i.e. acquires a lease, forks&execs a process, renews the lease wile the process runs, and releases the lease when the process exits. While the process runs, it has exclusive access to whatever resource was named in the lease that was acquired.
There are complications around migration we need to consider too. During migration, you actually need QEMU running on two hosts at once. IIRC the idea is that before starting the migration operation, we'd have to tell sync-manager to mark the lease as shared with a specific host. The destination QEMU would have to startup in shared mode, and upgrade this to an exclusive lock when migration completes, or quit when migration fails.
sync_manager leases can only be exclusive, so it's a matter of transferring ownership of the exclusive lock from the source host to the destination host. We have not yet added lease transfer capabilities to sync_manager, but it might look something like this:

S = source host, sm-S = sync_manager on S, ...
D = destination host, sm-D = sync_manager on D, ...

1. sm-S holds the lease, and is monitoring qemu
2. migration begins from S to D
3. libvirt-D runs sm-D: sync_manager -c qemu with the addition of a new sync_manager option --receive-lease
4. sm-D writes its hostid D to the lease area signaling sm-S that it wants to be the lease owner when S is done with it
5. sm-D begins monitoring the lease owner on disk (which is still S)
6. sm-D forks qemu-D
7. sm-S sees that D wants the lease
8. qemu-S exits with success
9. sm-S sees qemu-S exit with success
10. sm-S writes D as the lease owner into the lease area and exits (in the non-migration/transfer case, sm-S writes owner=LEASE_FREE)
11. sm-D (still monitoring the lease owner) sees that it has become the owner, and begins renewing the lease
12. qemu-D runs fully

I don't know enough (anything) about qemu migration yet to say if those steps work correctly or safely. One concern is that qemu-D should not enter a state where it can write until we are certain that D has been written as the lease's owner.
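A rough illustration of the handshake in steps 4-11, with the on-disk lease area reduced to two integers. The field names and LEASE_FREE value are made up for this sketch; the real on-disk format is whatever sync_manager's disk-paxos code defines.

#include <stdio.h>

#define LEASE_FREE 0

struct lease_area {
    int owner;      /* host_id currently holding the lease */
    int next_owner; /* host_id asking to receive it (step 4) */
};

/* Destination side, steps 4, 5 and 11: announce interest, then watch the
 * owner field.  sm-D would re-read the area from disk periodically and
 * only let qemu-D run unpaused once own_lease() returns true. */
static void request_lease(struct lease_area *disk, int my_host_id)
{
    disk->next_owner = my_host_id;
}

static int own_lease(const struct lease_area *disk, int my_host_id)
{
    return disk->owner == my_host_id;
}

/* Source side, steps 7-10: after qemu-S exits cleanly, either hand the
 * lease to the waiting host or mark it free. */
static void release_or_transfer(struct lease_area *disk, int my_host_id)
{
    if (disk->next_owner != LEASE_FREE && disk->next_owner != my_host_id)
        disk->owner = disk->next_owner;
    else
        disk->owner = LEASE_FREE;
    disk->next_owner = LEASE_FREE;
}

int main(void)
{
    struct lease_area disk = { .owner = 1, .next_owner = LEASE_FREE }; /* S = host 1 */

    request_lease(&disk, 2);        /* D = host 2, step 4 */
    release_or_transfer(&disk, 1);  /* qemu-S has exited, steps 8-10 */

    printf("owner is now host %d, D may unpause: %s\n",
           disk.owner, own_lease(&disk, 2) ? "yes" : "no");
    return 0;
}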
sync_manager daemon -i <host_id> -n <vm_id> -l <lease> -c <command> <args>
<lease> defines the shared storage area that sync_manager should use for performing the disk-paxos based synchronization. It consists of <resource_name>:<path>:<offset>, where <resource_name> is likely to be the vm name/uuid (or the name of the vm's disk image), and <path>:<offset> is an area of shared storage that has been allocated for sync_manager to use (a separate area for each resource/vm).
Can you give some real examples of the lease arg ? I guess <path> must exclude the ':' character, or have some defined escaping scheme.
-l vm0:/dev/vg/lease_area:0 (exclude : from paths)

Manually setting up, initializing and keeping track of lease areas would be a pain, so we'll definitely be looking at adding that to higher level tools.
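As a sketch, splitting that -l argument under the simple no-':'-in-paths rule could look like this (the struct and function names are invented for illustration):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct lease_arg {
    char name[64];
    char path[256];
    unsigned long long offset;
};

static int parse_lease_arg(const char *arg, struct lease_arg *out)
{
    /* <resource_name>:<path>:<offset>, e.g. "vm0:/dev/vg/lease_area:0" */
    const char *c1 = strchr(arg, ':');  /* first ':' ends the name    */
    const char *c2 = strrchr(arg, ':'); /* last ':' starts the offset */

    if (!c1 || c2 == c1)
        return -1; /* need at least two separators */

    snprintf(out->name, sizeof(out->name), "%.*s", (int)(c1 - arg), arg);
    snprintf(out->path, sizeof(out->path), "%.*s", (int)(c2 - c1 - 1), c1 + 1);
    out->offset = strtoull(c2 + 1, NULL, 10);
    return 0;
}

int main(void)
{
    struct lease_arg l;

    if (parse_lease_arg("vm0:/dev/vg/lease_area:0", &l) != 0)
        return 1;
    printf("name=%s path=%s offset=%llu\n", l.name, l.path, l.offset);
    return 0;
}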
The <host_id> obviously needs to be in /etc/libvirt/sync-manager.conf since that's a per-host config. I assume the shared storage area is per host too ?
That leaves just the VM name/uuid as a per-VM config option, and we obviously already have that in XML. Is there actually any extra attribute we need to track per-guest in the XML ? If not this will simplify life, because we won't have to track sync-manager specific attributes.
With the plugin style hooks you describe below, it seems all the sync_manager config could be kept separate from the libvirt config.
In terms of integration with libvirt, I think it is desirable that we keep libvirt and sync-manager loosely coupled. ie We don't want to hardcode libvirt using sync-manager, nor do we want to hardcode sync-manager only working with libvirt.
This says to me that we need to provide a well defined plugin system for providing a 'supervisor process' for QEMU guests. Essentially a dlopen() module that provides a handful (< 10) callbacks which are triggered in appropriate codepaths. At minimum I expect we need
- A callback at ARGV building, to let extra sync-manager ARGV be injected
- A callback at VM startup. Not needed for sync-manager, but to allow for alternate impls that aren't based around supervising.
- A callback at VM shutdown. Just to cleanup resources
- A callback in the VM destroy method, in case we need to do something different other than just kill($PID) the QEMU $PID. (eg to perhaps tell sync-manager to kill QEMU instead of killing it ourselves)
- Several callbacks at various stages of migration to deal with lock downgrade/upgrade
sounds good
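To give the proposed interface a concrete shape, a dlopen plugin table along the lines of that list might look roughly like this. Every name here is invented for illustration; libvirt does not define such an interface today.

#include <stddef.h>

typedef struct {
    /* inject extra argv (e.g. the sync_manager wrapper) while the QEMU
     * command line is being built */
    int (*build_argv)(const char *vm_uuid, char ***argv, size_t *argc);

    /* hooks for implementations that are not wrapper-based */
    int (*vm_started)(const char *vm_uuid, int qemu_pid);
    int (*vm_stopped)(const char *vm_uuid);

    /* give the supervisor a chance to kill QEMU itself instead of libvirt
     * sending the signal directly */
    int (*vm_destroy)(const char *vm_uuid, int qemu_pid);

    /* migration phases, for lease handover/downgrade/upgrade */
    int (*migrate_begin_source)(const char *vm_uuid, const char *dest_host);
    int (*migrate_begin_dest)(const char *vm_uuid, const char *src_host);
    int (*migrate_complete)(const char *vm_uuid, int success);
} qemu_supervisor_ops;

/* A plugin .so would export one well-known symbol returning its table,
 * which libvirt would resolve with dlsym() after dlopen(). */
const qemu_supervisor_ops *qemu_supervisor_get_ops(void);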
The one further complication is with the security drivers. IIUC, we will absolutely not want QEMU to have any access to the shared storage lease area. The problem is that if we just inject the wrapper process as is, sync-manager will end up running with the exact same privileges as QEMU, ie same UID:GID, and same selinux context.

I'm really not at all sure how to deal with this problem, because our core design is that the thing we spawn inherits the privileges we set up at fork() time. We don't want to delegate the security setup to sync-manager, because it introduces a huge variable condition in the security system. We need guaranteed consistent security setup for QEMU, regardless of the supervisor process in use.
It might not be a big problem for qemu to write to its own lease area, but writing to another's probably would (e.g. at a different offset on the same lv). That implies a separate lease lv per qemu; I'll have to find out how close that gets to lvm scalability limits. Dave

On 08/11/10 - 03:37:12PM, David Teigland wrote:
There are complications around migration we need to consider too. During migration, you actually need QEMU running on two hosts at once. IIRC the idea is that before starting the migration operation, we'd have to tell sync-manager to mark the lease as shared with a specific host. The destination QEMU would have to startup in shared mode, and upgrade this to an exclusive lock when migration completes, or quit when migration fails.
sync_manager leases can only be exclusive, so it's a matter of transferring ownership of the exclusive lock from the source host to the destination host. We have not yet added lease transfer capabilities to sync_manager, but it might look something like this:
S = source host, sm-S = sync_manager on S, ...
D = destination host, sm-D = sync_manager on D, ...
1. sm-S holds the lease, and is monitoring qemu
2. migration begins from S to D
3. libvirt-D runs sm-D: sync_manager -c qemu with the addition of a new sync_manager option --receive-lease
4. sm-D writes its hostid D to the lease area signaling sm-S that it wants to be the lease owner when S is done with it
5. sm-D begins monitoring the lease owner on disk (which is still S)
6. sm-D forks qemu-D
7. sm-S sees that D wants the lease
8. qemu-S exits with success
9. sm-S sees qemu-S exit with success
10. sm-S writes D as the lease owner into the lease area and exits (in the non-migration/transfer case, sm-S writes owner=LEASE_FREE)
11. sm-D (still monitoring the lease owner) sees that it has become the owner, and begins renewing the lease
12. qemu-D runs fully
Unfortunately, this is not how migration works in qemu/kvm. Using your nomenclature above, it's more like the following: A guest is running on S. A migration is then initiated, at which point D fires up a qemu process with a -incoming argument. This is sort of a container process that will receive all of the migration data. Crucially for sync-manager, though, qemu completely starts up and "attaches" to all of the resources (including disks) *while* qemu at S is still running. Then it enters a sort of paused state (where the guest cannot run), and receives all of the migration data. Once all of the migration data has been received, the guest on S is destroyed, and the guest on D is unpaused. That's why Dan mentioned that we need two hosts to access the disk at once. -- Chris Lalancette

On 08/11/2010 02:53 PM, Chris Lalancette wrote:
Unfortunately, this is not how migration works in qemu/kvm. Using your nomenclature above, it's more like the following:
A guest is running on S. A migration is then initiated, at which point D fires up a qemu process with a -incoming argument. This is sort of a container process that will receive all of the migration data. Crucially for sync-manager, though, qemu completely starts up and "attaches" to all of the resources (including disks) *while* qemu at S is still running. Then it enters a sort of paused state (where the guest cannot run), and receives all of the migration data. Once all of the migration data has been received, the guest on S is destroyed, and the guest on D is unpaused. That's why Dan mentioned that we need two hosts to access the disk at once.
On the other hand, does D do any writes to the disk prior to the point at which it is unpaused? Would it work if D can be granted a read-only lease to access the disk for the duration of the migration, and then be converted over to read-write at the point when S is destroyed?

On a related vein, libguestfs provides things like 'guestfish --ro', which is documented as a safe way to do read-only access of a disk image in use by another VM. That serves as another case where we want to be able to provide read-only access to a disk while someone else holds the read-write lease.

-- Eric Blake eblake@redhat.com

On Wed, Aug 11, 2010 at 03:07:29PM -0600, Eric Blake wrote:
On 08/11/2010 02:53 PM, Chris Lalancette wrote:
Unfortunately, this is not how migration works in qemu/kvm. Using your nomenclature above, it's more like the following:
A guest is running on S. A migration is then initiated, at which point D fires up a qemu process with a -incoming argument. This is sort of a container process that will receive all of the migration data. Crucially for sync-manager, though, qemu completely starts up and "attaches" to all of the resources (including disks) *while* qemu at S is still running. Then it enters a sort of paused state (where the guest cannot run), and receives all of the migration data. Once all of the migration data has been received, the guest on S is destroyed, and the guest on D is unpaused. That's why Dan mentioned that we need two hosts to access the disk at once.
On the other hand, does D do any writes to the disk prior to the point at which it is unpaused? Would it work if D can be granted a read-only lease to access the disk for the duration of the migration, and then be converted over to read-write at the point when S is destroyed?
Even if sync_manager had read/write lease semantics, this use case wouldn't translate onto it, because S is in write mode the entire time that D is in read mode, and read locks are not compatible with write locks. sync_manager shouldn't be viewed as something that's trying to add any new protection to the migration case. It's just trying to accurately represent, on disk, where qemu is unpaused. Dave

On Wed, Aug 11, 2010 at 04:53:20PM -0400, Chris Lalancette wrote:
1. sm-S holds the lease, and is monitoring qemu
2. migration begins from S to D
3. libvirt-D runs sm-D: sync_manager -c qemu with the addition of a new sync_manager option --receive-lease
4. sm-D writes its hostid D to the lease area signaling sm-S that it wants to be the lease owner when S is done with it
5. sm-D begins monitoring the lease owner on disk (which is still S)
6. sm-D forks qemu-D
7. sm-S sees that D wants the lease
8. qemu-S exits with success
9. sm-S sees qemu-S exit with success
10. sm-S writes D as the lease owner into the lease area and exits (in the non-migration/transfer case, sm-S writes owner=LEASE_FREE)
11. sm-D (still monitoring the lease owner) sees that it has become the owner, and begins renewing the lease
12. qemu-D runs fully
Unfortunately, this is not how migration works in qemu/kvm. Using your nomenclature above, it's more like the following:
A guest is running on S. A migration is then initiated, at which point D fires up a qemu process with a -incoming argument.
libvirt starts qemu -incoming on D, right? So with sync_manager, libvirt would start: sync_manager --receive-lease -c qemu -incoming
This is sort of a container process that will receive all of the migration data. Crucially for sync-manager, though, qemu completely starts up and "attaches" to all of the resources (including disks) *while* qemu at S is still running. Then it enters a sort of paused state (where the guest cannot run), and receives all of the migration data.
That should all be fine.
Once all of the migration data has been received, the guest on S is destroyed,
ok, sm-S sees qemu-S exit at that point.
and the guest on D is unpaused.
The critical bit would be ensuring that sm-S has written owner=D into the lease area before qemu-D is unpaused. Hooking into the sequence at that point in time might be too difficult or ugly, I don't know.
That's why Dan mentioned that we need two hosts to access the disk at once.
It would be easiest, of course, if the lease owner always represented where qemu was running, but that obviously won't work with migration. So we have to settle for the lease owner always representing where qemu is unpaused. Dave

On Wed, Aug 11, 2010 at 05:19:27PM -0400, David Teigland wrote:
On Wed, Aug 11, 2010 at 04:53:20PM -0400, Chris Lalancette wrote:
1. sm-S holds the lease, and is monitoring qemu
2. migration begins from S to D
3. libvirt-D runs sm-D: sync_manager -c qemu with the addition of a new sync_manager option --receive-lease
4. sm-D writes its hostid D to the lease area signaling sm-S that it wants to be the lease owner when S is done with it
5. sm-D begins monitoring the lease owner on disk (which is still S)
6. sm-D forks qemu-D
7. sm-S sees that D wants the lease
8. qemu-S exits with success
9. sm-S sees qemu-S exit with success
10. sm-S writes D as the lease owner into the lease area and exits (in the non-migration/transfer case, sm-S writes owner=LEASE_FREE)
11. sm-D (still monitoring the lease owner) sees that it has become the owner, and begins renewing the lease
12. qemu-D runs fully
Unfortunately, this is not how migration works in qemu/kvm. Using your nomenclature above, it's more like the following:
A guest is running on S. A migration is then initiated, at which point D fires up a qemu process with a -incoming argument.
libvirt starts qemu -incoming on D, right? So with sync_manager, libvirt would start: sync_manager --receive-lease -c qemu -incoming
Yes that is correct
This is sort of a container process that will receive all of the migration data. Crucially for sync-manager, though, qemu completely starts up and "attaches" to all of the resources (including disks) *while* qemu at S is still running. Then it enters a sort of paused state (where the guest cannot run), and receives all of the migration data.
That should all be fine.
Once all of the migration data has been received, the guest on S is destroyed,
ok, sm-S sees qemu-S exit at that point.
and the guest on D is unpaused.
The critical bit would be ensuring that sm-S has written owner=D into the lease area before qemu-D is unpaused. Hooking into the sequence at that point in time might be too difficult or ugly, I don't know.
The main hard bit here is that QEMU gives us no indication that migration has completed. We 'detect' it by issuing a 'cont' command to unpause the CPUs - this command blocks until migration is done. Clearly this won't work for SM, but this isn't SM's problem. We need to fix this in QEMU so that we get an async notification of migration completion, so we can then tell SM to upgrade the lease, before we then issue 'cont' to start CPUs.
That's why Dan mentioned that we need two hosts to access the disk at once.
It would be easiest, of course, if the lease owner always represented where qemu was running, but that obviously won't work with migration. So we have to settle for the lease owner always representing where qemu is unpaused.
I think my other mail is in fact describing the same thing as you are, I was just using different terminology :-)

Daniel

On Wed, Aug 11, 2010 at 03:37:12PM -0400, David Teigland wrote:
On Wed, Aug 11, 2010 at 05:59:55PM +0100, Daniel P. Berrange wrote:
On Tue, Aug 10, 2010 at 12:44:06PM -0400, David Teigland wrote:
Hi,
We've been working on a program called sync_manager that implements shared-storage-based leases to protect shared resources. One way we'd like to use it is to protect vm images that reside on shared storage, i.e. preventing two vm's on two hosts from using the same image at once.
There's two different, but related problems here:
- Preventing 2 different VMs using the same disk
- Preventing the same VM running on 2 hosts at once

The first requires that there is a lease per configured disk (since a guest can have multiple disks). The latter requires a lease per VM and can ignore specifics of what disks are configured.
IIUC, sync-manager is aiming for the latter.
The present integration effort is aiming for the latter. sync_manager itself aims to be agnostic about what it's managing.
Ok, it makes a bit of a difference to how we integrate with it in libvirt. If we want to ever let sync-manager do per-disk leases then we'd want to pass more info to the SM callbacks so it knows what disks QEMU has, instead of just its name
It's functional, and the next big step is using it through libvirt.
sync_manager "wraps" a process, i.e. acquires a lease, forks&execs a process, renews the lease wile the process runs, and releases the lease when the process exits. While the process runs, it has exclusive access to whatever resource was named in the lease that was acquired.
There are complications around migration we need to consider too. During migration, you actually need QEMU running on two hosts at once. IIRC the idea is that before starting the migration operation, we'd have to tell sync-manager to mark the lease as shared with a specific host. The destination QEMU would have to startup in shared mode, and upgrade this to an exclusive lock when migration completes, or quit when migration fails.
sync_manager leases can only be exclusive, so it's a matter of transferring ownership of the exclusive lock from the source host to the destination host. We have not yet added lease transfer capabilities to sync_manager, but it might look something like this:
In the past I've discussed with the original SM authors the idea of introducing a shared lease concept. The idea was:

- QEMU is running on S with an exclusive lease
- User requests migration to D
- SM on S downgrades the exclusive lease to a shared lease, shared only with host D
- libvirt starts QEMU on D host to accept migration
- SM on D grabs the shared lease
- libvirt starts migration on S
- If migration succeeds
  - libvirt kills QEMU on S
  - SM on D upgrades its shared lease to exclusive
- If migration fails
  - libvirt kills QEMU on D
  - SM on S upgrades its shared lease to exclusive
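Restating those transitions as a sketch (sync_manager has no shared-lease state today, as Dave notes, so the states below are purely illustrative):

#include <stdio.h>

enum lease_state {
    LEASE_NONE,
    LEASE_SHARED,     /* held during migration, shared only with the peer */
    LEASE_EXCLUSIVE,  /* normal running state */
};

struct host { const char *name; enum lease_state lease; };

/* Migration start: S downgrades to shared, D acquires the shared lease. */
static void migration_begin(struct host *src, struct host *dst)
{
    src->lease = LEASE_SHARED;
    dst->lease = LEASE_SHARED;
}

/* Migration end: whichever side keeps running upgrades back to exclusive,
 * the other side's QEMU is killed and its lease dropped. */
static void migration_finish(struct host *src, struct host *dst, int success)
{
    if (success) {
        src->lease = LEASE_NONE;
        dst->lease = LEASE_EXCLUSIVE;
    } else {
        dst->lease = LEASE_NONE;
        src->lease = LEASE_EXCLUSIVE;
    }
}

int main(void)
{
    struct host s = { "S", LEASE_EXCLUSIVE };
    struct host d = { "D", LEASE_NONE };

    migration_begin(&s, &d);
    migration_finish(&s, &d, 1); /* successful migration */

    printf("%s: %d, %s: %d\n", s.name, s.lease, d.name, d.lease);
    return 0;
}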
S = source host, sm-S = sync_manager on S, ...
D = destination host, sm-D = sync_manager on D, ...
1. sm-S holds the lease, and is monitoring qemu
2. migration begins from S to D
3. libvirt-D runs sm-D: sync_manager -c qemu with the addition of a new sync_manager option --receive-lease
4. sm-D writes its hostid D to the lease area signaling sm-S that it wants to be the lease owner when S is done with it
5. sm-D begins monitoring the lease owner on disk (which is still S)
6. sm-D forks qemu-D
7. sm-S sees that D wants the lease
8. qemu-S exits with success
9. sm-S sees qemu-S exit with success
10. sm-S writes D as the lease owner into the lease area and exits (in the non-migration/transfer case, sm-S writes owner=LEASE_FREE)
11. sm-D (still monitoring the lease owner) sees that it has become the owner, and begins renewing the lease
12. qemu-D runs fully
I don't know enough (anything) about qemu migration yet to say if those steps work correctly or safely. One concern is that qemu-D should not enter a state where it can write until we are certain that D has been written as the lease's owner.
Unfortunately the way migration works with QEMU prevents this scenario. This led us to invent the idea of a shared lease that is only used during migration.
The one further complication is with the security drivers. IIUC, we will absolutely not want QEMU to have any access to the shared storage lease area. The problem is that if we just inject the wrapper process as is, sync-manager will end up running with the exact same privileges as QEMU, ie same UID:GID, and same selinux context. I'm really not at all sure how to deal with this problem, because our core design is that the thing we spawn inherits the privileges we set up at fork() time. We don't want to delegate the security setup to sync-manager, because it introduces a huge variable condition in the security system. We need guaranteed consistent security setup for QEMU, regardless of the supervisor process in use.
It might not be a big problem for qemu to write to its own lease area, but writing to another's probably would (e.g. at a different offset on the same lv). That implies a separate lease lv per qemu; I'll have to find out how close that gets to lvm scalability limits.
Since SM is such an important process / job, I think it is really worth trying to get strict separation between SM and QEMU. Our goal with QEMU security is that QEMU can never access any host resource that isn't explicitly assigned via the XML config. This implies that it shouldn't be allowed to access any SM data, even if this would theoretically not cause problems for SM mutual exclusion.

Regards,
Daniel

On 08/11/2010 05:27 PM, Daniel P. Berrange wrote:
On Wed, Aug 11, 2010 at 03:37:12PM -0400, David Teigland wrote:
On Wed, Aug 11, 2010 at 05:59:55PM +0100, Daniel P. Berrange wrote:
On Tue, Aug 10, 2010 at 12:44:06PM -0400, David Teigland wrote: ... There's two different, but related problems here:
- Preventing 2 different VMs using the same disk
- Preventing the same VM running on 2 hosts at once

The first requires that there is a lease per configured disk (since a guest can have multiple disks). The latter requires a lease per VM and can ignore specifics of what disks are configured.
IIUC, sync-manager is aiming for the latter.
If we only aim for the latter, then there is no protection mechanism to prevent two sysadmins using the same storage from independently creating two vms that use the same backend disk accidentally.

On the other hand, we also need to be able to support the concept of a single block device shared among multiple guests intentionally (i.e. clustered filesystems, or applications that know how to properly use shared storage)

So in addition to per-disk exclusive-write leases, do we also need per-disk shared-write leases? Or do we just say that disks that are marked as 'shared' just don't get leases at all?
The present integration effort is aiming for the latter. sync_manager itself aims to be agnostic about what it's managing.
Ok, it makes a bit of a difference to how we integrate with it in libvirt. If we want to ever let sync-manager do per-disk leases then we'd want to pass more info to the SM callbacks so it knows what disks QEMU has, instead of just its name
I dunno, but if the end goal is the latter, then why not do it correctly from the start rather than integrating halfway and then having a second round of integration to move from per-vm leasing to per-disk leasing.

Perry

On Wed, Aug 18, 2010 at 07:44:18PM -0400, Perry Myers wrote:
On 08/11/2010 05:27 PM, Daniel P. Berrange wrote:
On Wed, Aug 11, 2010 at 03:37:12PM -0400, David Teigland wrote:
On Wed, Aug 11, 2010 at 05:59:55PM +0100, Daniel P. Berrange wrote:
On Tue, Aug 10, 2010 at 12:44:06PM -0400, David Teigland wrote: ... There's two different, but related problems here:
- Preventing 2 different VMs using the same disk
- Preventing the same VM running on 2 hosts at once

The first requires that there is a lease per configured disk (since a guest can have multiple disks). The latter requires a lease per VM and can ignore specifics of what disks are configured.
IIUC, sync-manager is aiming for the latter.
If we only aim for the latter, then there is no protection mechanism to prevent two sysadmins using the same storage from independently creating two vms that use the same backend disk accidentally.
On the other hand, we also need to be able to support the concept of a single block device shared among multiple guests intentionally (i.e. clustered filesystems, or applications that know how to properly use shared storage)
So in addition to per-disk exclusive-write leases, do we also need per-disk shared-write leases? Or do we just say that disks that are marked as 'shared' just don't get leases at all?
The present integration effort is aiming for the latter. sync_manager itself aims to be agnostic about what it's managing.
Ok, it makes a bit of a difference to how we integrate with it in libvirt. If we want to ever let sync-manager do per-disk leases then we'd want to pass more info to the SM callbacks so it knows what disks QEMU has, instead of just its name
I dunno, but if the end goal is the latter, then why not do it correctly from the start rather than integrating halfway and then having a second round of integration to move from per-vm leasing to per-disk leasing.
I'm only aware of one goal, and the current plan is to implement it correctly and completely. That goal is to lock vm images so if the vm happens to run on two hosts, only one instance can access the image. It seems unlikely that any other purposes for sync_manager would change how we're planning to protect vm images. Dave

On Thu, Aug 19, 2010 at 11:12:25AM -0400, David Teigland wrote:
I'm only aware of one goal, and the current plan is to implement it correctly and completely. That goal is to lock vm images so if the vm happens to run on two hosts, only one instance can access the image.
(That's slightly misleading; technically, the lock prevents a second qemu process from even being started.)

On 08/19/2010 01:23 PM, David Teigland wrote:
On Thu, Aug 19, 2010 at 11:12:25AM -0400, David Teigland wrote:
I'm only aware of one goal, and the current plan is to implement it correctly and completely. That goal is to lock vm images so if the vm happens to run on two hosts, only one instance can access the image.
Ok. So for the first implementation of sync_manager it will still be possible for someone to corrupt data by configuring two separate vms to accidentally use the same storage volumes. That's fine for the first pass, just something to keep in mind for later.
(That's slightly misleading; technically, the lock prevents a second qemu process from even being started.)
ack Perry

On Sun, Aug 22, 2010 at 12:13:16PM -0400, Perry Myers wrote:
On 08/19/2010 01:23 PM, David Teigland wrote:
On Thu, Aug 19, 2010 at 11:12:25AM -0400, David Teigland wrote:
I'm only aware of one goal, and the current plan is to implement it correctly and completely. That goal is to lock vm images so if the vm happens to run on two hosts, only one instance can access the image.
Ok. So for the first implementation of sync_manager it will still be possible for someone to corrupt data by configuring two separate vms to accidentally use the same storage volumes. That's fine for the first pass, just something to keep in mind for later.
Ideally, hosts should be configured from a common central point where a full view of the configuration is possible. Then it would be trivial to detect that kind of error by just looking at the configuration. If you don't have central configuration, then using a distributed system (like disk leases) to detect image assignment errors could be done, but it also pushes the problem down to the level of configuring the distributed system correctly, i.e. host id or lease area assignment errors. Dave

From: libvir-list-bounces@redhat.com [mailto:libvir-list-bounces@redhat.com] On Behalf Of Daniel P. Berrange ...
A command would be something like this:
sync_manager daemon -i <host_id> -n <vm_id> -l <lease> -c <command> <args>
<host_id> is an integer between 1 and 2000 that is statically assigned to each host.
<vm_id> is a unique identifier for the process that will be run, e.g. the vm name or uuid.
<lease> defines the shared storage area that sync_manager should use for performing the disk-paxos based synchronization. It consists of <resource_name>:<path>:<offset>, where <resource_name> is likely to be the vm name/uuid (or the name of the vm's disk image), and <path>:<offset> is an area of shared storage that has been allocated for sync_manager to use (a separate area for each resource/vm).
Can you give some real examples of the lease arg ? I guess <path> must exclude the ':' character, or have some defined escaping scheme.
<command> <args> would be the qemu command line that is currently used.
We expect these new config values will need a place to live in the libvirt xml config file, and libvirt will need to fork sync_manager -c qemu rather than qemu directly. At least those are the most obvious things that need doing, there are sure to be other api or functional issues.
The <host_id> obviously needs to be in /etc/libvirt/sync-manager.conf since that's a per-host config. I assume the shared storage area is per host too ?
That leaves just the VM name/uuid as a per-VM config option, and we obviously already have that in XML. Is there actually any extra attribute we need to track per-guest in the XML ? If not this will simplify life, because we won't have to track sync-manager specific attributes.

[IH] the shared storage is per shared storage domain the host accesses, which can be multiple / change during host lifetime, so easiest as a parameter. Actually, same goes for the host id - since the host id can (and will) be different for each storage domain. (if hosts A,B,C are using shared storage S1, their IDs are probably 1,2,3. If B,C,D are sharing storage S2, they are probably 1,2,3 for that storage domain). The important thing is the host id is unique for all hosts per storage lease area; it's not really per host.
participants (6)
- Chris Lalancette
- Daniel P. Berrange
- David Teigland
- Eric Blake
- Itamar Heim
- Perry Myers