On Wed, Oct 17, 2018 at 12:52:50PM +0200, Michal Privoznik wrote:
On 10/17/2018 11:45 AM, Daniel P. Berrangé wrote:
> On Wed, Oct 17, 2018 at 11:06:46AM +0200, Michal Privoznik wrote:
>> Trying to use virtlockd to lock metadata turns out to be too big
>> a gun. Since we will always spawn a separate process for relabeling
>> we are safe to use thread-unsafe POSIX locks and take virtlockd
>> completely out of the picture.
>
> Note safety of POSIX locks is not merely about other threads.
>
> If *any* code we invoke from security_dac or security_selinux
> were to open() & close() the file we're labelling we'll
> lose the locks.
>
> This is a little concerning since we call out to a few functions
> in src/util that might not be aware they should *never* open()
> and close() the path they're given. It looks like we got lucky at
> the moment, but this has future gun -> foot -> ouch potential.
>
> Likewise we also call out to libselinux APIs. I've looked at
> the code for the APIs we call and again, right now, we look
> to be safe, but there's some risk there.
>
> I'm not sure whether this is big enough risk to invalidate this
> forking approach or not though. This kind of problem was the
> reason why virtlockd was created to hold locks in a completely
> separate process from the code with the critical section.
> But unless we check for hard links virtlockd will suffer from the POSIX
> locks too. For instance, assume that B is a hardlink to A. Then, if a
> process opens A and locks it, and another process opens B and closes it,
> the lock is released, even though A and B look distinct.
Yes, that is correct - there's an implicit assumption that all VMs
are using the same path if there are multiple hard links. Fortunately
this scenario is fairly rare - more common is symlinks, which can be
canonicalized.
> The problem with virtlockd is that our current virlockspace impl is not
> aware of different offsets. We can try to make it so, but that would
> complicate the code a lot. Just imagine that there's a domain already
> running and it holds a disk content lock over file C. If a user wants to
> start a domain (which also has disk C), libvirtd will try to lock C again
> (even though at a different offset) but it will fail to do so because C is
> already locked.
Yep, we would need intelligence around offsets to avoid opening it
twice for each offset.
> Another problem with virtlockd that I faced was that with the current
> implementation the connection to virtlockd is opened in metadataLock(),
> where the connection FD is dup()-ed to prevent the connection closing.
> The connection is then closed in metadataUnlock(), where the dup()-ed FD
> is closed. This works pretty well, except for fork(). If a fork()
> occurs between metadataLock() and metadataUnlock() the duplicated FD is
> leaked into the child and we can't be certain when the child will close
> it. The worst case scenario is that the child closes the FD while we
> are holding some resources, which results in virtlockd killing libvirtd
> (= the registered owner of the locks).
I don't think that's the case actually. If you have an FD that is a
UNIX/TCP socket and it gets dup()-ed, the remote end of the socket won't
see HUP until all duplicates of the socket are closed. IOW, both the
parent & child process would need to close their respective dups.
> Alternatively, we can switch to OFD locks and have the feature work
> only on Linux. If any BSD users want the feature they will have
> to push their kernel developers to implement some sensible file locking.
> Also, one of the questions is what we want metadata locking to be. So
> far I am aiming at having exclusive access to the file that we are
> chown()-ing (or setfilecon()-ing) so that I can put code around it that
> manipulates the XATTR where the original owner would be stored. This way
> the metadata lock is held only for a fraction of a second.
>
> I guess we don't want the metadata lock to be held through the whole run
> of a domain, i.e. obtained at domain start and released at domain
> shutdown. I don't see much benefit in that.
Yes, we must only lock it for the short time that we're reading or writing
the metadata. We must not hold the lock long term, because other threads
or processes need access to the metadata while the VM is running, for
example to deal with validating shared disk labels.
On further reflection, I think we'll be safe enough with what you're
proposing. For it to go wrong, we would have to have a bug in the code
which accidentally open()s + close()s the file, and be on an old kernel
that lacks OFD locks, and two processes would have to be in the critical
section at the same time in that short < 1 second window.
Regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|