On 10/17/2018 01:46 PM, Daniel P. Berrangé wrote:
On Wed, Oct 17, 2018 at 12:52:50PM +0200, Michal Privoznik wrote:
> On 10/17/2018 11:45 AM, Daniel P. Berrangé wrote:
>> On Wed, Oct 17, 2018 at 11:06:46AM +0200, Michal Privoznik wrote:
>>> Trying to use virtlockd to lock metadata turns out to be too big
>>> a gun. Since we will always spawn a separate process for relabeling
>>> we are safe to use thread-unsafe POSIX locks and take virtlockd
>>> completely out of the picture.
>>
>> Note safety of POSIX locks is not merely about other threads.
>>
>> If *any* code we invoke from security_dac or security_selinux
>> were to open() & close() the file we're labelling we'll
>> lose the locks.
>>
>> This is a little concerning since we call out to a few functions
>> in src/util that might not be aware they should *never* open()
>> and close() the path they're given. It looks like we got lucky at
>> the moment, but this has future gun -> foot -> ouch potential.
>>
>> Likewise we also call out to libselinux APIs. I've looked at
>> the code for the APIs we call and again, right now, we look
>> to be safe, but there's some risk there.
>>
>> I'm not sure whether this is big enough risk to invalidate this
>> forking approach or not though. This kind of problem was the
>> reason why virtlockd was created to hold locks in a completely
>> separate process from the code with the critical section.
>
> But unless we check for hard links virtlockd will suffer from the POSIX
> locks too. For instance, assume that B is a hardlink to A. Then, if a
> process opens A and locks it, and another process opens B and closes it,
> the lock is released. Even though A and B look distinct.
Yes, that is correct - there's an implicit assumption that all VMs
are using the same path if there are multiple hard links. Fortunately
this scenario is fairly rare - more common is symlinks, which can be
canonicalized.
> The problem with virtlockd is that our current virlockspace impl is not
> aware of different offsets. We can try to make it so, but that would
> complicate the code a lot. Just imagine that there's a domain already
> running and it holds a disk content lock over file C. If a user wants to
> start a domain (which too has disk C), libvirtd will try to lock C again
> (even though at a different offset) but it will fail to do so because C is
> already locked.
Yep, we would need intelligence around offsets to avoid opening it
twice, once for each offset.
> Another problem with virtlockd that I faced was that with the current
> implementation the connection to virtlockd is opened in metadataLock(),
> where the connection FD is dup()-ed to prevent the connection closing. The
> connection is then closed in metadataUnlock(), where the dup()-ed FD is
> closed. This works pretty well, except for fork(). If a fork()
> occurs between metadataLock() and metadataUnlock(), the duplicated FD is
> leaked into the child and we can't be certain when the child will close
> it. The worst case scenario is that the child closes the FD while we
> are holding some resources, which results in virtlockd killing libvirtd
> (= the registered owner of the locks).
I don't think that's the case actually. If you have an FD that is a
UNIX/TCP socket that gets dup()-ed, the remote end of the socket won't
see HUP until all duplicates of the socket are closed. IOW, both the
parent & child process would need to close their respective dups.
It is the case. I've seen it out in the wild (when starting several
domains at once):
https://www.redhat.com/archives/libvir-list/2018-September/msg01138.html
> Alternatively, we can switch to OFD locks and have the feature work only
> on Linux. If any BSD users want the feature they will have
> to push their kernel developers to implement some sensible file locking.
>
> Also, one of the questions is what we want metadata locking to be. So
> far I am aiming at having exclusive access to the file that we are
> chown()-ing (or setfcon()-ing) so that I can put code around it that
> manipulates the XATTR where the original owner would be stored. This way the
> metadata lock is held only for a fraction of a second.
> I guess we don't want the metadata lock to be held through the whole run of
> a domain, i.e. obtained at domain start and released at domain shutdown. I
> don't see much benefit in that.
Yes, we must only lock it for the short time that we're reading or writing
the metadata. We must not hold the lock long term, because other threads
or processes need access to the metadata while the VM is running, for
example to deal with validating shared disk labels.
On further reflection, I think we'll be safe enough with what you're
proposing. For it to go wrong, we would have to have a bug in the code
which accidentally opens + closes the locked file, and be on an old kernel
that lacks OFD locks, and two processes must be in the critical section at
the same time in that short < 1 second window.
Cool. Thanks.
Let me just say that I find it disturbing that POSIX doesn't specify
file locks that are usable in a multithreaded app. Sure, we can have
some abstraction layer over lockf() + open() + close(), but it can hardly
be something usable - nor something that every app should have to implement.
Michal