Re: [libvirt] [PATCH 11/14] rpc: virnetserver: Fix race on srv->nclients_unauth

21 Dec 2017

On Thu, Dec 21, 2017 at 01:18 PM +0100, John Ferlan <jferlan@redhat.com> wrote:
...
On 12/21/2017 06:29 AM, Marc Hartmayer wrote:
...
On Fri, Dec 15, 2017 at 07:16 PM +0100, John Ferlan <jferlan@redhat.com> wrote:
...
On 12/12/2017 06:36 AM, Marc Hartmayer wrote:
...
There is a race between virNetServerProcessClients (main thread) and
remoteDispatchAuthList/remoteDispatchAuthPolkit/remoteSASLFinish (worker
thread) that can lead to decrementing srv->nclients_unauth when it's
zero. Since virNetServerCheckLimits relies on the value
srv->nclients_unauth the underrun causes libvirtd to stop accepting
new connections forever.
Example race scenario (assuming libvirtd is using policykit and the
client is privileged):
  1. The client calls the RPC remoteDispatchAuthList =>
     remoteDispatchAuthList is executed on a worker thread (Thread
     T1). We're assuming now the execution stops for some time before
     the line 'virNetServerClientSetAuth(client, 0)'
  2. The client closes the connection irregularly. This causes the
     event loop to wake up and virNetServerProcessClient to be
     called (on the main thread T0). During the
     virNetServerProcessClients the srv lock is hold. The condition
     virNetServerClientNeedAuth(client) will be checked and as the
     authentication is not finished right now
     virNetServerTrackCompletedAuthLocked(srv) will be called =>
     --srv->nclients_unauth => 0
  3. The Thread T1 continues, marks the client as authenticated, and
     calls virNetServerTrackCompletedAuthLocked(srv) =>
     --srv->nclients_unauth => --0 => wrap around as nclient_unauth is
     unsigned
  4. virNetServerCheckLimits(srv) will disable the services forever
To fix it, add an auth_pending field to the client struct so that it
is now possible to determine if the authentication process has already
been handled for this client.
Setting the authentication method to none for the client in
virNetServerProcessClients is not a proper way to indicate that the
counter has been decremented, as this would imply that the client is
authenticated.
Additionally, adjust the existing test cases for this new field.
In any case, I think perhaps this can be split into two patches - one
that just sets/clears the auth_pending and then what's leftover as the
fix.
I’m not sure it’s necessary as the correct usage of auth_pending is
already the fix. But let’s see then.
OK. I understand. I assume by this point I was juggling in my head a few
of the various adjustments so I figured that by adding the flag and
separating the logic between just the set/clear flag and the reason it
was added would be easier to review, but then again that would make the
set/clear patch essentially meaningless.
...
...
...

diff --git a/src/libvirt_remote.syms b/src/libvirt_remote.syms
index 7fcae970484d..cef2794e1122 100644
--- a/src/libvirt_remote.syms
+++ b/src/libvirt_remote.syms
@@ -138,6 +138,7 @@ virNetServerClientGetUNIXIdentity;
 virNetServerClientImmediateClose;
[…snip…]
...
...
...
...
@@ -375,6 +376,7 @@ static virNetServerClientPtr
 virNetServerClientNewInternal(unsigned long long id,
                               virNetSocketPtr sock,
                               int auth,
+                              bool auth_pending,
 #ifdef WITH_GNUTLS
                               virNetTLSContextPtr tls,
 #endif
@@ -384,6 +386,12 @@ virNetServerClientNewInternal(unsigned long long id,
 {
     virNetServerClientPtr client;
+    /* If the used authentication method implies that the new client
+     * is automatically authenticated, the authentication cannot be
+     * pending */
+    if (auth_pending && virNetServerClientAuthMethodImpliesAuthenticated(auth))
                           ^^^^^^
Why check twice? I'm not clear on the need for this check.  The setting
of auth_pending is based on @auth in the caller or perhaps the value of
auth_pending already. If there's some sort of check to be made perhaps
it's in ExecRestart.
Normally I would prefer an assert here… The intent behind this guard was
that it should prevent the wrong usage of this function.
But I’ll move the check.
I think this is perhaps why I made the separate the patches comment. I
was starting to overflow my heap.
...
...
Also, if we returned NULL here without some sort of error, then couldn't
we get the libvirt failed for some reason error message which is always
not fun to track down.
Does it makes sense to leave the check here (and additionally add it to
ExecRestart) and throw an error message here?
Adding a check, a message, and having a NULL return to guard against
some future change passing the wrong parameters perhaps means that
future patch submitter (and eventual reviewer) didn't clearly understand
the purpose of the function...
Yep. That’s why I like asserts… (for functions used internally
only). You can 'tell' other developers what are the
preconditions/postconditions and help them to understand the code
flow. Sorry for the off topic…

Short answer, I’ll move the check.
...
Still this is a *NewInternal function - I would hope passing correct
arguments wouldn't be problem for it.
Me too.
...
...
...
...
+        return NULL;
+
     if (virNetServerClientInitialize() < 0)
         return NULL;
@@ -393,6 +401,7 @@ virNetServerClientNewInternal(unsigned long long id,
     client->id = id;
     client->sock = virObjectRef(sock);
     client->auth = auth;
+    client->auth_pending = auth_pending;
     client->readonly = readonly;
 #ifdef WITH_GNUTLS
     client->tlsCtxt = virObjectRef(tls);
@@ -440,6 +449,7 @@ virNetServerClientPtr virNetServerClientNew(unsigned long long id,
 {
     virNetServerClientPtr client;
     time_t now;
+    bool auth_pending = !virNetServerClientAuthMethodImpliesAuthenticated(auth);
I think it's much easier for readability to have result be auth !=
VIR_NET_SERVER_SERVICE_AUTH_NONE instead of this call... Perhaps even
easier to find all instances of AUTH_NONE.
This function/condition is used in four places. But sure, if you’re
still saying it’s easier to read and to understand that a user is
authenticated (and therefore the authentication is no longer pending) if
auth_method == VIR_NET_SERVER_SERVICE_AUTH_NONE then I’ll replace this
function at all places :)
By this time I wasn't sure what was going to happen from Dan's patch 8
comment and the virNetServerClientNeedAuth[Locked] functions vs. the
name chosen (ImpliesAuthenticated) that was AFAICT trying to convey what
you discovered.
Whether the comparison is hidden behind some call or a direct comparison
I guess just becomes an implementation detail. I use cscope a lot to
find things, so searching on NONE is perhaps easier that searching on
NONE, finding some API that returning true/false based off of NONE, and
then searching on callers for that too.  OTOH, it's cleaner to have a
helper <sigh>.
Shorter answer, your call. If you want the helper, that's fine.
I’ll think about it.

[…snip]

--
Beste Grüße / Kind regards
   Marc Hartmayer

IBM Deutschland Research & Development GmbH
Vorsitzende des Aufsichtsrats: Martina Koederitz
Geschäftsführung: Dirk Wittkopp
Sitz der Gesellschaft: Böblingen
Registergericht: Amtsgericht Stuttgart, HRB 243294