Last Comment Bug 767158 - remove synchronous DNS resolution in nsAuthSSPI.cpp
: remove synchronous DNS resolution in nsAuthSSPI.cpp
Status: REOPENED
[Snappy:P1][necko-backlog][ntlm]
:
Product: Core
Classification: Components
Component: Networking: DNS (show other bugs)
: Trunk
: All Windows 7
: -- major with 3 votes (vote)
: mozilla19
Assigned To: Nobody; OK to take it and work on it
:
Mentors:
Depends on: 785050
Blocks: 766973
  Show dependency treegraph
 
Reported: 2012-06-21 14:31 PDT by Josh Aas
Modified: 2016-02-16 11:22 PST (History)
7 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
patch 0 (29.73 KB, patch)
2012-07-21 18:29 PDT, Patrick McManus [:mcmanus]
honzab.moz: review-
Details | Diff | Review
patch 1 (27.68 KB, patch)
2012-07-23 12:23 PDT, Patrick McManus [:mcmanus]
honzab.moz: review+
Details | Diff | Review

Description Josh Aas 2012-06-21 14:31:47 PDT
extensions/auth/nsAuthSSPI.cpp

We probably shouldn't ever make synchronous DNS resolution calls, and we definitely shouldn't do it on the main thread. Removing this stuff is a high priority.
Comment 1 Josh Aas 2012-06-21 14:39:36 PDT
The bad call is:

extensions/auth/nsAuthSSPI.cpp:
94 static nsresult
95 MakeSN(const char *principal, nsCString &result)
[...]
121     rv = dns->Resolve(Substring(buf, index + 1),
122                       nsIDNSService::RESOLVE_CANONICAL_NAME,
123                       getter_AddRefs(record));
Comment 2 Patrick McManus [:mcmanus] 2012-07-11 06:40:54 PDT
nsAuthSSPI is an implementation of nsIAuthModule which has only a synchronous interface.

fixing this bug will require making nsIAuthModule asynchronous and breaking out of tree users.

The goal is to remove the synchronous DNS resolve API so there is no point in adding a second interface concurrently.
Comment 3 Patrick McManus [:mcmanus] 2012-07-21 11:37:24 PDT
As mentioned in comment 2, nsIAuthModule is a synchronous API and there are a bunch of implementors of it. To make it async would require converting all of those callers even though only one implementation (nsAuthSSPI) is engaging in janky behavior, and even then only in one of the two paths it implements (kerberos SSPI, not NTLM). Even after changing those implementations the callers would need to be updated, and if that level was synchronous the process would need to be repeated until we hit an existing body of code that code go async.

That stopping point turns out to be nsIHttpChannelAuthProvider<-nsIHttpAuthenticator<-nsIAuthModule

Instead of changing all of that code, I took advantage of the fact that we know we need to do this blocking DNS call well before it is currently being done. So I put a property on the particular kerberos code that requires it and let nsIHttpChannelAuthProvider query that property at an early point that is already potentially asynchronous. (I use the point where we might pop up a credential dialog, that blocks for user input). Its convenient to do the non blocking DNS at that stage and then just feed the nsIAuthModule the canonicalized version of the hostname as usual in a synchronous way because the lookup was already completed.

the result is still annoying, but not nearly as bad as making nsIAuthModule async.

As for testing - I don't have kerberos setup and don't think I even have the right OS pieces to get SSPI/Negotiate/Kerberos running. I have tested this with windows NTLM (forcing the canonicalization to happen with the flag for that (which is wrong per the comments - so I have removed the flag before submitting the patch) and that's going to exercise all of the relevant code paths.. its the closest I could come.

I'll post the patch after try-server approves.
Comment 4 Patrick McManus [:mcmanus] 2012-07-21 18:29:19 PDT
Created attachment 644703 [details] [diff] [review]
patch 0

not the prettiest thing..
Comment 5 Honza Bambas (:mayhemer) 2012-07-23 07:22:05 PDT
Comment on attachment 644703 [details] [diff] [review]
patch 0

Review of attachment 644703 [details] [diff] [review]:
-----------------------------------------------------------------

r- for two reasons.

This doesn't look that bad, IMO, it appears to be a good idea to use the async mechanism we already have got.  Except the forwarder class.  I think it is not necessary to have that little "monster".  See the comments in the code.


But, there are another major issues with this patch.  

1. nsHttpChannelAuthProvider is AddRef'ed on one of the resolver threads.  But it is not implemented as thread-safe:

>	xul.dll!nsHttpChannelAuthProvider::AddRef()  Line 1409 + 0x93 bytes	C++
 	xul.dll!nsCOMPtr<nsIDNSListener>::nsCOMPtr<nsIDNSListener>(nsIDNSListener * aRawPtr=0x0bae7df8)  Line 537	C++
 	xul.dll!`anonymous namespace'::DNSListenerProxy::OnLookupCompleteRunnable::OnLookupCompleteRunnable(nsIDNSListener * aListener=0x0bae7df8, nsICancelable * aRequest=0x0b843f4c, nsIDNSRecord * aRecord=0x0b30da58, unsigned int aStatus=0x00000000)  Line 515 + 0x27 bytes	C++
 	xul.dll!`anonymous namespace'::DNSListenerProxy::OnLookupComplete(nsICancelable * aRequest=0x0b843f4c, nsIDNSRecord * aRecord=0x0b30da58, unsigned int aStatus=0x00000000)  Line 545 + 0x33 bytes	C++

This is causing an assertion failure.  Thus you should have an extra class to use as a resolver listener that would be thread safe, or turn the authprovider to be thread-safe.  I personally would prefer the former, otherwise we can loose a way to detect that authprovider is being used on more then one thread in the future.

2. I presume that performing the resolution for a canonical name doesn't mean to break a common NTLM auth, is that so?  I checked the host name is identical to the original host name after resolution.  However, when I force the resolution to happen, this patch prevents me to auth on my NTLM enabled IIS server.  Getting "You are not authorized to view this page due to invalid authentication headers".

::: netwerk/protocol/http/nsHttpChannel.cpp
@@ +4252,5 @@
> +    rv = GetURI(getter_AddRefs(uri));
> +    if (NS_FAILED(rv))
> +        return rv;
> +    return uri->GetAsciiHost(host);
> +}

This should just forward to nsHttpChannelAuthProvider, if any, to pick the auth host: if there is a canon host resolved, return it, otherwise return mAuthChannel->GetURI->GetAsciiHost.

::: netwerk/protocol/http/nsHttpChannelAuthProvider.cpp
@@ +1081,5 @@
>              static_cast<nsAuthInformationHolder*>(aAuthInfo);
> +    if (holder)
> +        ident->Set(holder->Domain().get(),
> +                   holder->User().get(),
> +                   holder->Password().get());

In braces please.

::: netwerk/protocol/http/nsIHttpAuthenticableChannel.idl
@@ +73,5 @@
>  
>      /**
> +     * The host portion of the URI, possibly canonicalized
> +     */
> +    readonly attribute ACString asciiHost;

Rename this to asciiHostForAuth.
Comment 6 Patrick McManus [:mcmanus] 2012-07-23 12:23:16 PDT
Created attachment 645024 [details] [diff] [review]
patch 1

Thanks Honza, your comments made this better.

I didn't set the r? flag on the update (which should have all the review comments) because I can't reproduce the problem you described starting with "I presume".. more on that later. But in any event - lets use this version of the patch moving forward.
Comment 7 Patrick McManus [:mcmanus] 2012-07-23 12:23:29 PDT
(In reply to Honza Bambas (:mayhemer) from comment #5)

> 
> 2. I presume that performing the resolution for a canonical name doesn't
> mean to break a common NTLM auth, is that so?  I checked the host name is
> identical to the original host name after resolution.  However, when I force
> the resolution to happen, this patch prevents me to auth on my NTLM enabled
> IIS server.  Getting "You are not authorized to view this page due to
> invalid authentication headers".
> 

hmm.. I tested almost exactly like that.
1] setup a samba PDC that does ntlm
2] have apache require ntlm for access
3] force ntlm in nsAuthSSPI to do canonical resolution (which it shouldn't..)
4] confirm via debug output that the resolution didn't actually change the name
5] login (or be denied based on bad credentials)

other than the fact I was using apache/samba does that sound the same? Obviously if the additional name resolution didn't result in a change it shouldn't break anything. (and if it did result in a change the test is invalid).

can you privately send me a url (probably tunneled through a nat?) and credentials I can try with? maybe retest with the new patch?
Comment 8 Honza Bambas (:mayhemer) 2012-07-25 10:59:28 PDT
Comment on attachment 645024 [details] [diff] [review]
patch 1

Review of attachment 645024 [details] [diff] [review]:
-----------------------------------------------------------------

r=honzab except some referencing issues.

This patch version works in my local test environment for direct NTLM (IIS 7.5 w/ extended val) and proxy NTLM (ISA 2006 configured to let all users auth).

::: netwerk/protocol/http/nsHttpChannelAuthProvider.cpp
@@ +247,5 @@
> +    if (mDNSQuery) {
> +        mDNSQuery->Cancel(status);
> +        mDNSQuery = nsnull;
> +    }
> +    mDNSCallback = nsnull;

Why do you actually need to hold the callback?

@@ +783,5 @@
> +        nsresult rv = ResolveHost();
> +        if (NS_SUCCEEDED(rv))
> +            return NS_ERROR_IN_PROGRESS;
> +        return rv;
> +    }

Hmm.. when the auth method changes you should truncate mCanonicalizedHost.  Followup bug is probably OK to do that.

@@ +1378,5 @@
> +                                                         nsIDNSRecord  *record,
> +                                                         nsresult       rv)
> +{
> +    nsCString cname;
> +    mAuthProvider->SetDNSQuery(nsnull);

Most elegant (also according a need for this particular method) would be to let all this code live in some semi-private method of nsHttpChannelAuthProvider that DNSCallback::OnLookupComplete would call with necessary args.

::: netwerk/protocol/http/nsHttpChannelAuthProvider.h
@@ +148,5 @@
>      PRUint32                          mTriedProxyAuth           : 1;
>      PRUint32                          mTriedHostAuth            : 1;
>      PRUint32                          mSuppressDefensiveAuth    : 1;
> +
> +    PRUint32                          mResolvedHost             : 1;

No new blank line before this line.

@@ +161,5 @@
> +            : mAuthProvider(authProvider)
> +        { }
> +
> +    private:
> +        nsHttpChannelAuthProvider        *mAuthProvider; // weak ref

This can live potentially longer then the provider.  Since you don't need to ref callback from the provider, I think it would be better to have this as a strong ref.
Comment 9 Patrick McManus [:mcmanus] 2012-07-25 18:31:31 PDT
Thanks honza. fyi mResolvedHost was uninitialized - I fixed that too.

(In reply to Honza Bambas (:mayhemer) from comment #8)
> 
> Hmm.. when the auth method changes you should truncate mCanonicalizedHost. 
> Followup bug is probably OK to do that.

do you think its sufficient to just truncate (and set mresolvedhost to false) in processauthentication()? That looks like it would do it, if so we can do it in this patch.
Comment 10 Honza Bambas (:mayhemer) 2012-07-26 06:21:39 PDT
(In reply to Patrick McManus [:mcmanus] from comment #9)
> do you think its sufficient to just truncate (and set mresolvedhost to
> false) in processauthentication()? That looks like it would do it, if so we
> can do it in this patch.

It would be sufficient.  But the right question is "when to do it?".  And that might be tricky.  I would have to think of this first.  It is imo ok for a followup.
Comment 11 Patrick McManus [:mcmanus] 2012-07-26 06:39:18 PDT
(In reply to Honza Bambas (:mayhemer) from comment #10)
> (In reply to Patrick McManus [:mcmanus] from comment #9)
> > do you think its sufficient to just truncate (and set mresolvedhost to
> > false) in processauthentication()? That looks like it would do it, if so we
> > can do it in this patch.
> 
> It would be sufficient.  But the right question is "when to do it?".  And
> that might be tricky.  I would have to think of this first.  It is imo ok
> for a followup.

as long as its ok by you, I'm going to check it in as part of processauthentication() because I think there's a potential bug there without it. (you get a ssspi auth challenge and do the c14n, the subsequent auth fails because user provided bogus creds, and now you get challenged with basic ntlm.. we would reuse the c14'd name there which isn't right and could lead to login failure..) we can optimize that in another bug, but I think its not really necessary - we're basically talking about taking an extra dns lookup that probably hits the cache in a very edge case anyhow, right?
Comment 12 Patrick McManus [:mcmanus] 2012-07-26 06:40:30 PDT
if I wasn't clear - I meant every time processauthentication was called truncate mc14n.
Comment 13 Honza Bambas (:mayhemer) 2012-07-26 06:47:30 PDT
Yes, that is probably the right place, but I cannot confirm it right now.  If we need to canon the name again then the lookup will be with high probability satisfied from the cache.
Comment 14 Patrick McManus [:mcmanus] 2012-07-26 08:33:51 PDT
https://hg.mozilla.org/integration/mozilla-inbound/rev/959f9da9f85e
Comment 15 Patrick McManus [:mcmanus] 2012-07-26 08:34:40 PDT
(In reply to Honza Bambas (:mayhemer) from comment #13)
> Yes, that is probably the right place, but I cannot confirm it right now. 

I stepped through it in the windows debugger - watched it get cleared after a bad auth attempt and then all reset with a cache hit the next time.
Comment 16 Matt Brubeck (:mbrubeck) 2012-07-26 14:07:43 PDT
https://hg.mozilla.org/mozilla-central/rev/959f9da9f85e
Comment 17 Patrick McManus [:mcmanus] 2012-10-29 12:11:46 PDT
backout of 767158 and 785050 from ff17 beta
  https://hg.mozilla.org/releases/mozilla-beta/rev/757f408c1494
Comment 18 Patrick McManus [:mcmanus] 2012-12-03 13:10:53 PST
due to 804605 backed out of ff18 beta
Comment 19 Patrick McManus [:mcmanus] 2013-01-22 09:11:12 PST
I've taken 767158 out of the tree completely pending a better fix for 804605. that requires reopening 766973. The only known offender of DNS on the main thread is in conjunction windows integrated auth and involves looking up a different record type for a hostname we have already successfully resolved (in order to connect to) so it should have minimal impact. nonetheless, that's something to get fixed so we can remove the synchronous API.

Note You need to log in before you can comment on or make changes to this bug.