On 09/26/2018 11:14 PM, John Ferlan wrote:
On 9/26/18 5:46 AM, Michal Privoznik wrote:
> On 09/25/2018 10:52 PM, John Ferlan wrote:
>>
>>
>> On 9/21/18 5:29 AM, Michal Privoznik wrote:
>>> There is this latent bug which can result in virtlockd killing
>>> libvirtd. The problem is when the following steps occur:
>>>
>>> Parent | Child
>>> ------------------------------------------------------+----------------
>>> 1) virSecurityManagerMetadataLock(path); |
>>> This results in connection FD to virtlockd to be |
>>> dup()-ed and saved for later use. |
>>> |
>>> 2) Another thread calling fork(); |
>>> This results in the FD from step 1) to be cloned |
>>> |
>>> 3) virSecurityManagerMetadataUnlock(path); |
>>> Here, the FD is closed, but the connection is |
>>> still kept open because child process has |
>>> duplicated FD |
>>> |
>>> 4) virSecurityManagerMetadataLock(differentPath); |
>>> Again, new connection is opened, FD is dup()-ed |
>>> |
>>> 5) | exec() or exit()
>>>
>>> Step 5) results in closing the connection from 1) to be closed,
>>> finally. However, at this point virtlockd looks at its internal
>>> records if PID from 1) does not still own any resources. Well it
>>> does - in step 4) it locked differentPath. This results in
>>> virtlockd killing libvirtd.
>>>
>>> Note that this happens because we do some relabelling even before
>>> fork() at which point vm->pid is still -1. When this value is
>>> passed down to transaction code it simply runs the transaction
>>> in parent process (=libvirtd), which means the owner of metadata
>>> locks is the parent process.
>>>
>>> Signed-off-by: Michal Privoznik <mprivozn(a)redhat.com>
>>>
>>> Signed-off-by: Michal Privoznik <mprivozn(a)redhat.com>
>>
>> Always good to double up on the S-o-b's when there's locks and forks
>> involved. That makes sure the fork gets one too ;-)
>>
>>
>>> ---
>>> src/security/security_dac.c | 12 ++++++------
>>> src/security/security_selinux.c | 12 ++++++------
>>> 2 files changed, 12 insertions(+), 12 deletions(-)
>>>
>>
>> I read the cover, it's not simpler than the above.
>
> Yeah, this is not that simple bug. I'll try to explain it differently:
>
> On fork(), all FDs of the parent are duplicated to child. So even if
> parent closes one of them, the connection is not closed until child
> closes the same FD. And since metadata locking is very fragile when it
> comes to connection to virtlockd, this fork() behaviour actually plays
> against us because it violates our assumption that connection is closed
> at the end of metadataUnlock(). In fact it is not - child has it still open.
>
> If there was a flag like DONT_DUP_ON_FORK I could just set it on
> virtlockd connection in metadataLock() and be done.
Avoid the forking exec's works too ;-}
OK - instead of talking in generalities - which forking dup'd fd is in
play here? I don't want to assume or guess any more.
The connection FD.
Maybe I'm misreading your comments or the other way around. I'll try to
explain what is the problem one last time, in the simplest way I can.
1) Thread A wants to start a domain so it calls
virSecurityManagerMetadataLock().
2) The function opens a connection to virtlockd (represented by FD) and
duplicates the FD so that connection is kept open until unlock.
3) Now that locks are acquired, the thread does chown() and setfcon().
4) the virSecurityManagerMetadataUnlock() function is called, which
causes virtlockd to release all the locks. Subsequently to that, the
function closes the duplicated FD obtained in step 2).
The duplication is done because both virLockManagerAcquire() and
virLockManagerRelease() open new connection, do the RPC talk and close
the connection. That's just the way the functions are written. Now,
because of FD duplication happening in step 2) the connection from
virLockManagerAcquire() is not really closed until step 4) where the
duplicated FD is closed.
This is important to realize: the connection is closed only when the
last FD representing it is closed. If I were to duplicate the connection
FD a hundred times and then close 99 of those duplicates the connection
would still be kept open.
Now, lets put some threads into the picture. Imagine thread B which
called fork() just in between steps 2) and 3). On fork, ALL partent FDs
are duplicated. So the connection FD is duplicated into the child
process too. Therefore, in step 4) where we think the connection is
closed - well it is not really because there is still one FD around - in
the child.
And this is huge problem, because we do not know when child is going to
close it. It may be (unintentionally) an evil child and close the FD at
the worst time possible - when libvirtd holds some locks.
Important thing to realize #2: whenever a connection closes, virtlockd
looks into its internal records and if PID on the other side held some
locks then: a) the locks are releasd, b) the PID is killed.
I'm sorry, I can't explain this any simpler. I'll try to find somebody
who understands this and have them review the patch.
As an alternative approach, if we did not close the connection in 2) and
4) we don't have to care when the child closes the FD because it will
not cause the connection to actually close. I wrote a patch like that
for previous versions but it was disliked. Maybe I can write it from
scratch again and have that one ACKed instead.
>
> Perhaps I could use pthread_atfork() but that might be very tricky - I
> am not sure if some library that we are linking with does not use it and
> it pthreads are capable of dealing with us and the library registering
> their own callbacks.
>
>> It also indicates
>> that patches 1-3 don't have any bearing. So to me that says - 2 series,
>> one the artsy/fartsy cleanup variety and the second this edge/corner
>> condition chase based on timing and/or other interactions.
>>
>> My first question is - have you bisected your changes? Since this only
>> seems to matter for metadata locking, you don't necessarily have to
>> start too early, but I would assume that prior to commit 8b8aefb3 there
>> is no issue.
>
> I haven't. I'm not quite sure why would I do that. I already know the
> problem and yes, it's in metadata locking. From quick skim through git
> log: 7a9ca0fa and 6d855abc14338 look like the result of bisect.
>
Um, well I'd include c34f11998e into that list since that's where
metadata locking API's were introduced. The problem space described
there is I think exactly what you've been describing as the problem you
were trying to avoid.
>>
>> My second question is - given my analysis in patch2, was testing of this
>> patch done with or without patch 2 and 3? Hard to review this without
>> having my patch2 analysis rattling around in short term memory that
>> works. At this point, I'm thinking they were in place and may have had
>> an impact especially since that dup() call was removed, but it's
>> mentioned during this commit message, which is really confusing.
>
> I've tested the whole series not individual patches. So this patch was
> tested with 2 and 3 applied.
>
> However, it's true that since we will spawn a separate process for every
> transaction (and dup() is called only from there) it is not possible to
> actually have FD leak (because the transaction code does not fork()
> further). Having said that, there are other callers that duplicate
> socket FD and the problem is real (even though after this patch metadata
> locking is not affected).
>
>>
>> Prior to patch2/3, virNetSocketDupFD has 4 callers of which only one
>> calls with 'false' and that's the path having problems.
>>
>> libxlMigrateDstReceive (using true)
>> libxlDomainMigrationSrcPerform (using true)
>> qemuMigrationSrcConnect (using true)
>
> Since these were calling virNetSocketDupFD(cloexec=true) anyway, nothing
> changed for them.
>
>> virNetClientDupFD (using false)
>>
>> Since your testing certainly doesn't include the first 3, I'll focus on
>> the last one which has but one caller:
>>
>> virLockManagerLockDaemonAcquire
>>
>> meaning without patches 2-3, we got a "dup()" socket back instead of
>> changing the existing socket to F_DUPFD_CLOEXEC. I think I just talked
>> myself in a circle, but perhaps gives you an understanding of the
>> struggle with all the patches in one series.
>>
>>> diff --git a/src/security/security_dac.c b/src/security/security_dac.c
>>> index 5aea386e7c..876cca0f2f 100644
>>> --- a/src/security/security_dac.c
>>> +++ b/src/security/security_dac.c
>>> @@ -561,12 +561,12 @@ virSecurityDACTransactionCommit(virSecurityManagerPtr
mgr ATTRIBUTE_UNUSED,
>>> goto cleanup;
>>> }
>>>
>>> - if ((pid == -1 &&
>>> - virSecurityDACTransactionRun(pid, list) < 0) ||
>>> - (pid != -1 &&
>>> - virProcessRunInMountNamespace(pid,
>>> - virSecurityDACTransactionRun,
>>> - list) < 0))
>>> + if (pid == -1)
>>> + pid = getpid();
>>> +
>>> + if (virProcessRunInMountNamespace(pid,
>>> + virSecurityDACTransactionRun,
>>> + list) < 0)
>>
>> Going back to commit d41c1621, it essentially states:
>>
>> "To differentiate whether transaction code should fork() or not the @pid
>> argument now accepts -1 (which means do not fork)."
>>
>> This commit then backtracks or undoes that by using a getpid() causing
>> every transaction to fork.
>>
>> The "logic" for the decision to pass pid == -1 or not is introduced in
>> the subsequent commit 37131ada based on qemuDomainNamespaceEnabled.
>>
>> Thus with this patch every domain w/ or w/o namespace and daemon commit
>> path will fork. So if this logic is to stay, then there's no reason to
>> have the namespace check either, is there? Is the vm->pid different than
>> getpid()? IOW: I think this patch is too big of a leap.
>
> Yes, vm->pid is different to getpid(). If there is a domain running with
> namespaces enabled then we definitely wants to use vm->pid to enter the
> qemu's namespace and relabel /dev/blah there instead of the namespace
> that libvirtd (=getpid()) is running. However, some labelling is done
> prior to running qemu (and setting up its namespace). For instance:
> qemuDomainWriteMasterKeyFile, qemuProcessStartManagedPRDaemon,
> qemuProcessMakeDir - they all call qemuSecurityDomainSetPathLabel() at
> the time when no namespace was built yet.
>
> But since we initialize vm->pid to -1 then the namspace check feels
> redundant. I agree with that. In virSecurityDACTransactionCommit() it
> does not matter if pid = -1 because vm->pid is -1 or because no
> namespaces are used and therefore -1 comes from variable initialization
> in qemuSecurityDomainSetPathLabel().
>
My point here was setup towards the subsequent point of rather than
doing the forks for domain's always just because daemon's now need to do
that. Why can't we figure a way to get that pid for the subsequent fork
for the narrower case which is the metadata lock or daemon lock.
>>
>> However, the contention from the cover is that this issue only happens
>> when metadata locking is enabled.
>>
>> Thus, rather than getpid() for everyone, maybe it's a getpid() for
>> metadata lock consumers. Seems for pid == -1 we could query @mgr for
>> what it knows, like virSecurityManagerGetPrivateData and then if
>> priv->type == VIR_LOCK_MANAGER_OBJECT_TYPE_DAEMON, use getpid(). Of
>> course then @mgr loses its ATTRIBUTE_UNUSED.
>
> The whole point of always fork()-ing is to avoid leaking FD even into
> child process. Patches 2 and 3 are trying to avoid leaking FD into
> exec(), this one is trying to avoid leaking it into fork(). I mean, if
> the connection FD is cloned into child, we can't be sure when child
> decides to close it. It may close it at the worst possible moment - when
> libvirtd holds some locks (in virtlockd). And because virtlockd locks
> are not associated with connection (only with PID) from virtlockd
> perspective it looks like libvirtd has closed connection and yet is
> still holding some locks. BANG! that's where virtlockd kills libvirtd.
This still says to me not using the getpid() for domain locks when
namespaces aren't in play isn't a problem. IOW: If !namespace, then pid
== -1 is perfectly fine for domain locks.
It's just this new daemon path where we need to fork, IIUC.
>
> I don't think there is any other way to prevent fork() cloning an FD.
>
OpenVMS (hahaha)
I don't think we want to revert all metadata locking patches only to
implement support for different beast. Also, we don't need distributed
locking.
Michal