[libvirt] [RFC] secdrivers remembering original labels

Dear list, we have this very old bug [1] that I just keep pushing in front of me. I made several attempts to fix it. However, none of them made into the tree. I guess it's time to have discussion what to do about it. IIRC, the algorithm that I implemented last was to keep original label in XATTRs (among with some ref counter) and the last one to restore the label will look there and find the original label. There was a problem with two libvirtds fighting over a file on shared FS. So I guess my question is can we come up with a design that would work? Or at least work to the extent that we're satisfied with? Personally, I like the XATTRs approach. And to resolve the NFS race we can invent yet another lockspace to guard labelling - I also have a bug for that [2] (although, I'm not that familiar with lockspaces). I guess doing disk metadata locking is not going to be trivial, is it? Michal 1: https://bugzilla.redhat.com/show_bug.cgi?id=547546 2: https://bugzilla.redhat.com/show_bug.cgi?id=1524792

On Fri, Jul 27, 2018 at 09:56:23AM +0200, Michal Privoznik wrote:
Dear list,
we have this very old bug [1] that I just keep pushing in front of me. I made several attempts to fix it. However, none of them made into the tree. I guess it's time to have discussion what to do about it. IIRC, the algorithm that I implemented last was to keep original label in XATTRs (among with some ref counter) and the last one to restore the label will look there and find the original label. There was a problem with two libvirtds fighting over a file on shared FS.
IIRC there was a second problem around security of XATTRs. eg, if we set an XATTR it was possible for QEMU to change it once we given QEMU privs on the image. Thus a malicious QEMU can cause libvirtd to change the image to an arbitrary user on shutdown. I think it was possible to deal with that on Linux, because the kernel hsa some xattr namespacing that is reserved for only root. This protection is worthless though when using NFS images as you can't trust the remote NFS client to honour the same restriction.
So I guess my question is can we come up with a design that would work? Or at least work to the extent that we're satisfied with?
Personally, I like the XATTRs approach. And to resolve the NFS race we can invent yet another lockspace to guard labelling - I also have a bug for that [2] (although, I'm not that familiar with lockspaces). I guess doing disk metadata locking is not going to be trivial, is it?
Yeah, for sure we need two distinct locking regimes. Our current locking aims to protect disk content, and the lifetime of the lock is the entire duration of the QEMU process. For XATTRs, we only need write protection for the short time window in which the security drivers execute, which is at start, during hotplug/unplug, during shutdown, and finally migration. I think we could achieve this with the virtlockd if we make it use the same locking file, but a different byte offset within the file. Just need to make sure it doesn't clash with the byte offset QEMU has chosen, nor the current offset we use. Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On 07/27/2018 11:01 AM, Daniel P. Berrangé wrote:
On Fri, Jul 27, 2018 at 09:56:23AM +0200, Michal Privoznik wrote:
Dear list,
we have this very old bug [1] that I just keep pushing in front of me. I made several attempts to fix it. However, none of them made into the tree. I guess it's time to have discussion what to do about it. IIRC, the algorithm that I implemented last was to keep original label in XATTRs (among with some ref counter) and the last one to restore the label will look there and find the original label. There was a problem with two libvirtds fighting over a file on shared FS.
IIRC there was a second problem around security of XATTRs. eg, if we set an XATTR it was possible for QEMU to change it once we given QEMU privs on the image. Thus a malicious QEMU can cause libvirtd to change the image to an arbitrary user on shutdown.
I think it was possible to deal with that on Linux, because the kernel hsa some xattr namespacing that is reserved for only root.
Yes, there are basically 4 levels (defined by prefix of attribute name): security. system. trusted. user. For the first two there is no restriction on side of VFS (but there might be some coming from underlying FS and/or security module). The trusted. can be get/set only by CAP_SYS_ADMIN process. And finally user. can be set only one regular files (plus some other restrictions), but it's available for basically everybody. IIRC, I used trusted. namespace because of CAP_SYS_ADMIN. Nota bene, if a file is stored on NFS and only CAP_SYS_ADMIN can change the attribute's value they are CAP_SYS_ADMIN already - they might change seclabel too. Why bother mangling libvirt's records. IOW, if somebody runs qemu with CAP_SYS_ADMIN all bets are off anyway. This would of course mean that there's no label restore for session daemon, but that can hardly ever work. Do we even initialize DAC/SELinux drivers for session daemon?
This protection is worthless though when using NFS images as you can't trust the remote NFS client to honour the same restriction.
So I guess my question is can we come up with a design that would work? Or at least work to the extent that we're satisfied with?
Personally, I like the XATTRs approach. And to resolve the NFS race we can invent yet another lockspace to guard labelling - I also have a bug for that [2] (although, I'm not that familiar with lockspaces). I guess doing disk metadata locking is not going to be trivial, is it?
Yeah, for sure we need two distinct locking regimes. Our current locking aims to protect disk content, and the lifetime of the lock is the entire duration of the QEMU process. For XATTRs, we only need write protection for the short time window in which the security drivers execute, which is at start, during hotplug/unplug, during shutdown, and finally migration.
I guess migration would be covered by start. Since these locks would be held only for very short period (basically they would be acquired just before we try to set seclabel and released after disk content lock is successfully acquired) they can be exclusive. However, since we want to protect more than just disk seclabels, we need to acquire this new lock from more places.
I think we could achieve this with the virtlockd if we make it use the same locking file, but a different byte offset within the file. Just need to make sure it doesn't clash with the byte offset QEMU has chosen, nor the current offset we use.
Makes sense. So the first step is to introduce metadata locking. I'll look into that. Michal

On Fri, Jul 27, 2018 at 01:27:40PM +0200, Michal Privoznik wrote:
On 07/27/2018 11:01 AM, Daniel P. Berrangé wrote:
On Fri, Jul 27, 2018 at 09:56:23AM +0200, Michal Privoznik wrote:
Dear list,
we have this very old bug [1] that I just keep pushing in front of me. I made several attempts to fix it. However, none of them made into the tree. I guess it's time to have discussion what to do about it. IIRC, the algorithm that I implemented last was to keep original label in XATTRs (among with some ref counter) and the last one to restore the label will look there and find the original label. There was a problem with two libvirtds fighting over a file on shared FS.
IIRC there was a second problem around security of XATTRs. eg, if we set an XATTR it was possible for QEMU to change it once we given QEMU privs on the image. Thus a malicious QEMU can cause libvirtd to change the image to an arbitrary user on shutdown.
I think it was possible to deal with that on Linux, because the kernel hsa some xattr namespacing that is reserved for only root.
Yes, there are basically 4 levels (defined by prefix of attribute name):
security. system. trusted. user.
For the first two there is no restriction on side of VFS (but there might be some coming from underlying FS and/or security module). The trusted. can be get/set only by CAP_SYS_ADMIN process. And finally user. can be set only one regular files (plus some other restrictions), but it's available for basically everybody.
IIRC, I used trusted. namespace because of CAP_SYS_ADMIN. Nota bene, if a file is stored on NFS and only CAP_SYS_ADMIN can change the attribute's value they are CAP_SYS_ADMIN already - they might change seclabel too. Why bother mangling libvirt's records. IOW, if somebody runs qemu with CAP_SYS_ADMIN all bets are off anyway.
Yep, if QEMU has CAP_SYS_ADMIN they're doomed no matter what.
This would of course mean that there's no label restore for session daemon, but that can hardly ever work. Do we even initialize DAC/SELinux drivers for session daemon?
The DAC driver doesn't work for session daemons (since you can't setuid obviously), but the sVirt driver *does* work.
This protection is worthless though when using NFS images as you can't trust the remote NFS client to honour the same restriction.
So I guess my question is can we come up with a design that would work? Or at least work to the extent that we're satisfied with?
Personally, I like the XATTRs approach. And to resolve the NFS race we can invent yet another lockspace to guard labelling - I also have a bug for that [2] (although, I'm not that familiar with lockspaces). I guess doing disk metadata locking is not going to be trivial, is it?
Yeah, for sure we need two distinct locking regimes. Our current locking aims to protect disk content, and the lifetime of the lock is the entire duration of the QEMU process. For XATTRs, we only need write protection for the short time window in which the security drivers execute, which is at start, during hotplug/unplug, during shutdown, and finally migration.
I guess migration would be covered by start. Since these locks would be held only for very short period (basically they would be acquired just before we try to set seclabel and released after disk content lock is successfully acquired) they can be exclusive.
However, since we want to protect more than just disk seclabels, we need to acquire this new lock from more places.
I think we could achieve this with the virtlockd if we make it use the same locking file, but a different byte offset within the file. Just need to make sure it doesn't clash with the byte offset QEMU has chosen, nor the current offset we use.
Makes sense. So the first step is to introduce metadata locking. I'll look into that.
Yes, in fact the metdata locking is desirable even ignoring the original task of restoring original disk labels. The metadata locking alone would better protect against startup races between system libvirtds accessing the same NFS. Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On 07/27/2018 11:01 AM, Daniel P. Berrangé wrote:
On Fri, Jul 27, 2018 at 09:56:23AM +0200, Michal Privoznik wrote:
Dear list,
we have this very old bug [1] that I just keep pushing in front of me. I made several attempts to fix it. However, none of them made into the tree. I guess it's time to have discussion what to do about it. IIRC, the algorithm that I implemented last was to keep original label in XATTRs (among with some ref counter) and the last one to restore the label will look there and find the original label. There was a problem with two libvirtds fighting over a file on shared FS.
IIRC there was a second problem around security of XATTRs. eg, if we set an XATTR it was possible for QEMU to change it once we given QEMU privs on the image. Thus a malicious QEMU can cause libvirtd to change the image to an arbitrary user on shutdown.
I think it was possible to deal with that on Linux, because the kernel hsa some xattr namespacing that is reserved for only root. This protection is worthless though when using NFS images as you can't trust the remote NFS client to honour the same restriction.
So I guess my question is can we come up with a design that would work? Or at least work to the extent that we're satisfied with?
Personally, I like the XATTRs approach. And to resolve the NFS race we can invent yet another lockspace to guard labelling - I also have a bug for that [2] (although, I'm not that familiar with lockspaces). I guess doing disk metadata locking is not going to be trivial, is it?
Yeah, for sure we need two distinct locking regimes. Our current locking aims to protect disk content, and the lifetime of the lock is the entire duration of the QEMU process. For XATTRs, we only need write protection for the short time window in which the security drivers execute, which is at start, during hotplug/unplug, during shutdown, and finally migration.
I think we could achieve this with the virtlockd if we make it use the same locking file, but a different byte offset within the file. Just need to make sure it doesn't clash with the byte offset QEMU has chosen, nor the current offset we use.
So do we really need virtlockd for this? I mean, the whole purpose of it is to hold locks so that libvirtd can restart independently, without losing the lock attached. However, since the metadata lock will be held only for a fraction of a second and will be not held through more than a single API aren't we just fine with libvirtd holding the lock? The way I imagine this to work is the following: lock_for_metadata(vm->def); // locks an unique byte in all the disks qemuSecuritySetAllLabel(); ... qemuProcessStartCPUs(); // disk content lock is acquired here unlock_for_metadata(vm->def); Another way to look at it is: it's libvirtd who relabels all the paths and the metadata lock should guard just that. Speaking of which, we need these locks to wait for each other, not just set-or-fail. Again, something that goes against virtlockd view of locking. Frankly, I've started implementing this with virtlockd already, but the changes I made so far simply do not feel right, e.g. I have to change lock_protocol.x so that virLockSpaceProtocolAcquireResourceArgs has this 'type' argument which tells virtlockd which byte we want to lock. Michal

On Mon, Aug 06, 2018 at 10:02:44AM +0200, Michal Prívozník wrote:
On 07/27/2018 11:01 AM, Daniel P. Berrangé wrote:
On Fri, Jul 27, 2018 at 09:56:23AM +0200, Michal Privoznik wrote:
Dear list,
we have this very old bug [1] that I just keep pushing in front of me. I made several attempts to fix it. However, none of them made into the tree. I guess it's time to have discussion what to do about it. IIRC, the algorithm that I implemented last was to keep original label in XATTRs (among with some ref counter) and the last one to restore the label will look there and find the original label. There was a problem with two libvirtds fighting over a file on shared FS.
IIRC there was a second problem around security of XATTRs. eg, if we set an XATTR it was possible for QEMU to change it once we given QEMU privs on the image. Thus a malicious QEMU can cause libvirtd to change the image to an arbitrary user on shutdown.
I think it was possible to deal with that on Linux, because the kernel hsa some xattr namespacing that is reserved for only root. This protection is worthless though when using NFS images as you can't trust the remote NFS client to honour the same restriction.
So I guess my question is can we come up with a design that would work? Or at least work to the extent that we're satisfied with?
Personally, I like the XATTRs approach. And to resolve the NFS race we can invent yet another lockspace to guard labelling - I also have a bug for that [2] (although, I'm not that familiar with lockspaces). I guess doing disk metadata locking is not going to be trivial, is it?
Yeah, for sure we need two distinct locking regimes. Our current locking aims to protect disk content, and the lifetime of the lock is the entire duration of the QEMU process. For XATTRs, we only need write protection for the short time window in which the security drivers execute, which is at start, during hotplug/unplug, during shutdown, and finally migration.
I think we could achieve this with the virtlockd if we make it use the same locking file, but a different byte offset within the file. Just need to make sure it doesn't clash with the byte offset QEMU has chosen, nor the current offset we use.
So do we really need virtlockd for this? I mean, the whole purpose of it is to hold locks so that libvirtd can restart independently, without losing the lock attached. However, since the metadata lock will be held only for a fraction of a second and will be not held through more than a single API aren't we just fine with libvirtd holding the lock? The way I imagine this to work is the following:
There were two reasons for virtlockd. The first was the persistent holding of locks, but the other reasons is that POSIX fcntl() locks are horrific. If one thread has an FD open with a lock held, if another thread opens and closes the same underlying file, the first thread's lock will get dropped. We can't be confident that other threads won't open the file, so the only way to be safe was to put locking in to a separate process where we know exactly what all threads are doing.
Frankly, I've started implementing this with virtlockd already, but the changes I made so far simply do not feel right, e.g. I have to change lock_protocol.x so that virLockSpaceProtocolAcquireResourceArgs has this 'type' argument which tells virtlockd which byte we want to lock.
We could perhaps make use of the "flags" field, giving a flg to identify the lockspace to use. This could be turned into an offset server side. Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On 08/06/2018 10:30 AM, Daniel P. Berrangé wrote:
So do we really need virtlockd for this? I mean, the whole purpose of it is to hold locks so that libvirtd can restart independently, without losing the lock attached. However, since the metadata lock will be held only for a fraction of a second and will be not held through more than a single API aren't we just fine with libvirtd holding the lock? The way I imagine this to work is the following:
There were two reasons for virtlockd. The first was the persistent holding of locks, but the other reasons is that POSIX fcntl() locks are horrific. If one thread has an FD open with a lock held, if another thread opens and closes the same underlying file, the first thread's lock will get dropped.
We can't be confident that other threads won't open the file, so the only way to be safe was to put locking in to a separate process where we know exactly what all threads are doing.
Ah. That's terrible. But IIUC avoidable by using OFDs instead (which are not available on BSD I presume).
Frankly, I've started implementing this with virtlockd already, but the changes I made so far simply do not feel right, e.g. I have to change lock_protocol.x so that virLockSpaceProtocolAcquireResourceArgs has this 'type' argument which tells virtlockd which byte we want to lock.
We could perhaps make use of the "flags" field, giving a flg to identify the lockspace to use. This could be turned into an offset server side.
Okay, I'll try this. Thanks! Michal

On Mon, Aug 06, 2018 at 12:01:03PM +0200, Michal Prívozník wrote:
On 08/06/2018 10:30 AM, Daniel P. Berrangé wrote:
So do we really need virtlockd for this? I mean, the whole purpose of it is to hold locks so that libvirtd can restart independently, without losing the lock attached. However, since the metadata lock will be held only for a fraction of a second and will be not held through more than a single API aren't we just fine with libvirtd holding the lock? The way I imagine this to work is the following:
There were two reasons for virtlockd. The first was the persistent holding of locks, but the other reasons is that POSIX fcntl() locks are horrific. If one thread has an FD open with a lock held, if another thread opens and closes the same underlying file, the first thread's lock will get dropped.
We can't be confident that other threads won't open the file, so the only way to be safe was to put locking in to a separate process where we know exactly what all threads are doing.
Ah. That's terrible. But IIUC avoidable by using OFDs instead (which are not available on BSD I presume).
Yep, OFD is a (nice) Linux-ism but only exists in pretty recent Linux, and no non-Linux OS, so we can't rely on it. Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
participants (3)
-
Daniel P. Berrangé
-
Michal Privoznik
-
Michal Prívozník