[libvirt] [PATCH] qemu: don't label anything before locking the domain

If locking the domain failed, the files were already labelled and thus we restored the previous label on them.  With disks on NFS this means the domain that already holds the lock gets permission denied.

This code moves the labelling part into the command hook, since it is still privileged there, and also moves the clearing of VIR_QEMU_PROCESS_STOP_NO_RELABEL from stop_flags to right after the handshake that follows the hook.

Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1113327
Signed-off-by: Martin Kletzander <mkletzan@redhat.com>
---
 src/qemu/qemu_process.c | 69 ++++++++++++++++++++++++++++---------------------
 1 file changed, 39 insertions(+), 30 deletions(-)

diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
index 5b598be..bc751b9 100644
--- a/src/qemu/qemu_process.c
+++ b/src/qemu/qemu_process.c
@@ -2700,6 +2700,8 @@ struct qemuProcessHookData {
     virQEMUDriverPtr driver;
     virBitmapPtr nodemask;
     virQEMUDriverConfigPtr cfg;
+    const char *stdin_path;
+    int stdin_fd;
 };
 
 static int qemuProcessHook(void *data)
@@ -2739,6 +2741,34 @@ static int qemuProcessHook(void *data)
     if (virNumaSetupMemoryPolicy(h->vm->def->numatune, h->nodemask) < 0)
         goto cleanup;
 
+    /*
+     * Only after we managed to get a domain lock we can label
+     * domain-related objects.
+     */
+    VIR_DEBUG("Setting domain security labels");
+    if (virSecurityManagerSetAllLabel(h->driver->securityManager,
+                                      h->vm->def, h->stdin_path) < 0)
+        goto cleanup;
+
+    if (h->stdin_fd != -1) {
+        /* if there's an fd to migrate from, and it's a pipe, put the
+         * proper security label on it
+         */
+        struct stat stdin_sb;
+
+        VIR_DEBUG("setting security label on pipe used for migration");
+
+        if (fstat(h->stdin_fd, &stdin_sb) < 0) {
+            virReportSystemError(errno,
+                                 _("cannot stat fd %d"), h->stdin_fd);
+            goto cleanup;
+        }
+        if (S_ISFIFO(stdin_sb.st_mode) &&
+            virSecurityManagerSetImageFDLabel(h->driver->securityManager,
+                                              h->vm->def, h->stdin_fd) < 0)
+            goto cleanup;
+    }
+
     ret = 0;
 
  cleanup:
@@ -3702,6 +3732,8 @@ int qemuProcessStart(virConnectPtr conn,
     hookData.driver = driver;
     /* We don't increase cfg's reference counter here. */
     hookData.cfg = cfg;
+    hookData.stdin_path = stdin_path;
+    hookData.stdin_fd = stdin_fd;
 
     VIR_DEBUG("Beginning VM startup process");
 
@@ -4082,6 +4114,12 @@ int qemuProcessStart(virConnectPtr conn,
         goto cleanup;
     }
 
+    /* Security manager labeled all devices, therefore
+     * if any operation from now on fails and we goto cleanup,
+     * where virSecurityManagerRestoreAllLabel() is called
+     * (hidden under qemuProcessStop) we need to restore labels. */
+    stop_flags &= ~VIR_QEMU_PROCESS_STOP_NO_RELABEL;
+
     VIR_DEBUG("Setting up domain cgroup (if required)");
     if (qemuSetupCgroup(driver, vm, nodemask) < 0)
         goto cleanup;
@@ -4092,36 +4130,7 @@ int qemuProcessStart(virConnectPtr conn,
         qemuProcessInitCpuAffinity(driver, vm, nodemask) < 0)
         goto cleanup;
 
-    VIR_DEBUG("Setting domain security labels");
-    if (virSecurityManagerSetAllLabel(driver->securityManager,
-                                      vm->def, stdin_path) < 0)
-        goto cleanup;
-
-    /* Security manager labeled all devices, therefore
-     * if any operation from now on fails and we goto cleanup,
-     * where virSecurityManagerRestoreAllLabel() is called
-     * (hidden under qemuProcessStop) we need to restore labels. */
-    stop_flags &= ~VIR_QEMU_PROCESS_STOP_NO_RELABEL;
-
-    if (stdin_fd != -1) {
-        /* if there's an fd to migrate from, and it's a pipe, put the
-         * proper security label on it
-         */
-        struct stat stdin_sb;
-
-        VIR_DEBUG("setting security label on pipe used for migration");
-
-        if (fstat(stdin_fd, &stdin_sb) < 0) {
-            virReportSystemError(errno,
-                                 _("cannot stat fd %d"), stdin_fd);
-            goto cleanup;
-        }
-        if (S_ISFIFO(stdin_sb.st_mode) &&
-            virSecurityManagerSetImageFDLabel(driver->securityManager, vm->def, stdin_fd) < 0)
-            goto cleanup;
-    }
-
-    VIR_DEBUG("Labelling done, completing handshake to child");
+    VIR_DEBUG("Affinity/cgroups set, completing handshake to child");
     if (virCommandHandshakeNotify(cmd) < 0) {
         goto cleanup;
     }
-- 
2.0.0

On Thu, Jun 26, 2014 at 01:20:02PM +0200, Martin Kletzander wrote:
If locking the domain failed, the files were already labelled and thus we restored the previous label on them.  With disks on NFS this means the domain that already holds the lock gets permission denied.
This code moves the labelling part into the command hook, since it is still privileged there, and also moves the clearing of VIR_QEMU_PROCESS_STOP_NO_RELABEL from stop_flags to right after the handshake that follows the hook.
This problem description / fix doesn't make much sense to me.  IIUC the control flow is

- Parent runs fork()
- Parent waits for handshake notify
- Child runs hook
- Hook *only* registers with lock daemon
- Child sends handshake notify to parent
- Child waits for handshake response
- Parent receives handshake notify
- Parent does labelling
- Parent sends handshake response
- Child execs QEMU
- QEMU launches but CPUs are paused
- Parent acquires disk locks
- Parent tells QEMU to start CPUs

Note that the hook does not acquire any locks - it merely connects to the lock daemon.  Locks are not acquired until the CPUs are ready to be started.  So I don't see how moving labelling into the hook solves anything.

Note that the goal of the locking code as it is today was only to prevent the content of the disk image being corrupted by 2 QEMUs running concurrently.  The design as it is succeeds in this.  Stopping changes to the labelling was not attempted.  Yes, this will result in a running QEMU losing access to a disk if another QEMU attempts to start and use those disks, but the content is protected in this way.

It isn't actually possible to protect against concurrent changes to both the content and the labelling with a single lock, because there are differing lock ordering & protection rules required for these.

To do that, we actually need to incorporate use of the lock manager into the security drivers using a separate lock space and use locking rules that apply explicitly to the needs of the labelling.

Regards,
Daniel

-- 
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
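
For reference, the handshake sequence above corresponds roughly to libvirt's virCommand API as used around qemuProcessStart(); the following is a condensed, non-verbatim sketch of the parent side only (error paths, command construction and most setup omitted; hookData stands for the qemuProcessHookData struct the patch extends):

    /* Condensed parent-side sketch, not the literal qemuProcessStart() code. */
    virCommandSetPreExecHook(cmd, qemuProcessHook, &hookData); /* child-side hook */
    virCommandRequireHandshake(cmd);

    if (virCommandRunAsync(cmd, NULL) < 0)   /* fork(); child runs the hook */
        goto cleanup;

    if (virCommandHandshakeWait(cmd) < 0)    /* block until the child's notify */
        goto cleanup;

    /* parent-side work goes here - labelling, in the current code */

    if (virCommandHandshakeNotify(cmd) < 0)  /* tell the child it may exec QEMU */
        goto cleanup;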

On Thu, Jun 26, 2014 at 12:42:52PM +0100, Daniel P. Berrange wrote:
On Thu, Jun 26, 2014 at 01:20:02PM +0200, Martin Kletzander wrote:
If locking the domain failed, the files were already labelled and thus we restored the previous label on them.  With disks on NFS this means the domain that already holds the lock gets permission denied.
This code moves the labelling part into the command hook, since it is still privileged there, and also moves the clearing of VIR_QEMU_PROCESS_STOP_NO_RELABEL from stop_flags to right after the handshake that follows the hook.
This problem description / fix doesn't make much sense to me.
IIUC the control flow is
- Parent runs fork()
- Parent waits for handshake notify
- Child runs hook
- Hook *only* registers with lock daemon
- Child sends handshake notify to parent
- Child waits for handshake response
- Parent receives handshake notify
- Parent does labelling
- Parent sends handshake response
- Child execs QEMU
- QEMU launches but CPUs are paused
- Parent acquires disk locks
- Parent tells QEMU to start CPUs
Note that the hook does not acquire any locks - it merely connects to the lock daemon. Locks are not acquired until the CPUs are ready to be started. So I don't see how moving labelling into the hook solves anything.
Note that the goal of the locking code as it is today was only to prevent the content of the disk image being corrupted by 2 QEMUs running concurrently.  The design as it is succeeds in this.  Stopping changes to the labelling was not attempted.  Yes, this will result in a running QEMU losing access to a disk if another QEMU attempts to start and use those disks, but the content is protected in this way.
It isn't actually possible to protect against concurrent changes to both the content and the labelling with a single lock, because there are differing lock ordering & protection rules required for these.
To do that, we actually need to incorporate use of the lock manager into the security drivers using a separate lock space and use locking rules that apply explicitly to the needs of the labelling.
Specifically what the security drivers would have to do is

- Acquire exclusive lock on the image
- If not already labelled
    - Label image
  Else
    - See if current labelling is readonly or shared and this matches desired labelling
- Release the exclusive lock on the image

So the lock only has to be held for the short time that the labelling is being changed.  This is very different from the existing content lock, which must be held for the entire time the guest is running.

This all really ties back into the previous problem we've tried to solve of tracking the original image label so we can correctly restore it upon guest shutdown.  Both the locking and that tracking have to be solved at the same time - two facets of the same problem.

Regards,
Daniel

-- 
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
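
In code terms the per-image critical section described above might look like the sketch below; every helper used here (imageLockAcquire, imageGetCurrentLabel, imageIsLabelled, labelAllowsSharing, imageSetLabel, imageLockRelease) is a hypothetical placeholder, not existing libvirt API:

    /* Hypothetical sketch of the labelling critical section; none of
     * these helpers exist in libvirt today. */
    static int
    secDriverSetImageLabel(const char *path, const char *wantedLabel,
                           bool readonly_or_shared)
    {
        int ret = -1;
        char *current = NULL;

        if (imageLockAcquire(path) < 0)          /* exclusive lock, held briefly */
            return -1;

        if (imageGetCurrentLabel(path, &current) < 0)
            goto cleanup;

        if (!imageIsLabelled(current)) {
            /* nobody labelled it yet: apply our label */
            if (imageSetLabel(path, wantedLabel) < 0)
                goto cleanup;
        } else if (!readonly_or_shared ||
                   !labelAllowsSharing(current, wantedLabel)) {
            /* already labelled for another VM and not shareable */
            goto cleanup;
        }

        ret = 0;
     cleanup:
        free(current);
        imageLockRelease(path);                  /* lock only covers the labelling */
        return ret;
    }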

On Thu, Jun 26, 2014 at 12:42:52PM +0100, Daniel P. Berrange wrote:
On Thu, Jun 26, 2014 at 01:20:02PM +0200, Martin Kletzander wrote:
If locking the domain failed, the files were already labelled and thus we restored the previous label on them.  With disks on NFS this means the domain that already holds the lock gets permission denied.
This code moves the labelling part into the command hook, since it is still privileged there, and also moves the clearing of VIR_QEMU_PROCESS_STOP_NO_RELABEL from stop_flags to right after the handshake that follows the hook.
This problem description / fix doesn't make much sense to me.
IIUC the control flow is
- Parent runs fork()
- Parent waits for handshake notify
- Child runs hook
- Hook *only* registers with lock daemon
- Child sends handshake notify to parent
- Child waits for handshake response
- Parent receives handshake notify
- Parent does labelling
- Parent sends handshake response
- Child execs QEMU
- QEMU launches but CPUs are paused
- Parent acquires disk locks
- Parent tells QEMU to start CPUs
Note that the hook does not acquire any locks - it merely connects to the lock daemon. Locks are not acquired until the CPUs are ready to be started. So I don't see how moving labelling into the hook solves anything.
Oh, my fault, I hadn't realized we're just registering there.
Note that the goal of the locking code as it is today was only to prevent the content of the disk image being corrupted by 2 QEMUs running concurrently.  The design as it is succeeds in this.  Stopping changes to the labelling was not attempted.  Yes, this will result in a running QEMU losing access to a disk if another QEMU attempts to start and use those disks, but the content is protected in this way.
It isn't actually possible to protect against concurrent changes to both the content and the labelling with a single lock, because there are differing lock ordering & protection rules required for these.
To do that, we actually need to incorporate use of the lock manager into the security drivers using a separate lock space and use locking rules that apply explicitly to the needs of the labelling.
It occurred to me too that this might be either fixed, or the fix made easier, after Michal's patches are applied (not my area, though):

http://www.redhat.com/archives/libvir-list/2014-March/msg00826.html

What I think is that it would "(almost) solve it" automatically, since it would restore the original label, even though there would be a small window during which the first QEMU process doesn't have access to the disk.  But that is definitely a better result than now.

Martin
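
The "restore the original label" idea can be illustrated with plain libselinux calls; this is only a minimal sketch of the concept, not how Michal's series actually implements it:

    #include <stdio.h>
    #include <selinux/selinux.h>

    /* Minimal illustration: save the current label before relabelling
     * and put it back afterwards.  Not the actual libvirt implementation. */
    static int
    relabel_with_restore(const char *path, const char *new_label)
    {
        char *orig = NULL;

        if (getfilecon(path, &orig) < 0)        /* remember the original label */
            return -1;

        if (setfilecon(path, new_label) < 0) {  /* apply the VM's label */
            freecon(orig);
            return -1;
        }

        /* ... guest runs ... */

        if (setfilecon(path, orig) < 0)         /* restore what was there before */
            fprintf(stderr, "failed to restore label on %s\n", path);

        freecon(orig);
        return 0;
    }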

On Thu, Jun 26, 2014 at 01:57:34PM +0200, Martin Kletzander wrote:
On Thu, Jun 26, 2014 at 12:42:52PM +0100, Daniel P. Berrange wrote:
On Thu, Jun 26, 2014 at 01:20:02PM +0200, Martin Kletzander wrote:
If locking the domain failed, the files were already labelled and thus we restored the previous label on them.  With disks on NFS this means the domain that already holds the lock gets permission denied.
This code moves the labelling part into the command hook, since it is still privileged there, and also moves the clearing of VIR_QEMU_PROCESS_STOP_NO_RELABEL from stop_flags to right after the handshake that follows the hook.
This problem description / fix doesn't make much sense to me.
IIUC the control flow is
- Parent runs fork()
- Parent waits for handshake notify
- Child runs hook
- Hook *only* registers with lock daemon
- Child sends handshake notify to parent
- Child waits for handshake response
- Parent receives handshake notify
- Parent does labelling
- Parent sends handshake response
- Child execs QEMU
- QEMU launches but CPUs are paused
- Parent acquires disk locks
- Parent tells QEMU to start CPUs
Note that the hook does not acquire any locks - it merely connects to the lock daemon. Locks are not acquired until the CPUs are ready to be started. So I don't see how moving labelling into the hook solves anything.
Oh, my fault, I hadn't realized we're just registering there.
Note that the goal of the locking code as it is today was only to prevent the content of the disk image being corrupted by 2 QEMUs running concurrently.  The design as it is succeeds in this.  Stopping changes to the labelling was not attempted.  Yes, this will result in a running QEMU losing access to a disk if another QEMU attempts to start and use those disks, but the content is protected in this way.
It isn't actually possible to protect against concurrent changes to both the content and the labelling with a single lock, because there are differing lock ordering & protection rules required for these.
To do that, we actually need to incorporate use of the lock manager into the security drivers using a separate lock space and use locking rules that apply explicitly to the needs of the labelling.
It occurred to me too that this might be either fixed, or the fix made easier, after Michal's patches are applied (not my area, though):
http://www.redhat.com/archives/libvir-list/2014-March/msg00826.html
What I think is that it would "(almost) solve it" automatically, since it would restore the original label, even though there would be a small window during which the first QEMU process doesn't have access to the disk.  But that is definitely a better result than now.
Once the security managers are doing locking, they can look at what the current label is, and if it is set to a label used by another VM, they can avoid changing the label at all.  It might need a bit of cleverness in the migration code path, but nothing too bad.

Regards,
Daniel

-- 
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
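
The check Daniel suggests is small in code terms; a hypothetical fragment of a security driver's set-label path, where labelBelongsToOtherVM() is a placeholder for whatever lookup the driver would grow, and getfilecon()/setfilecon() stand in for whichever labelling backend is in use:

    /* Hypothetical: skip relabelling when the image already carries a
     * label owned by another running VM. */
    char *current = NULL;

    if (getfilecon(path, &current) < 0)
        goto error;

    if (labelBelongsToOtherVM(current, wanted_label)) {
        /* image is already labelled for another running VM (shared or
         * readonly use): leave the label alone */
    } else if (setfilecon(path, wanted_label) < 0) {
        freecon(current);
        goto error;
    }

    freecon(current);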