Closed Bug 1355858 Opened 3 years ago Closed Last year

When force-disabling spdy, we should ensure the conninfo doesn't allow spdy on future connections to the same host

Categories

(Core :: Networking: HTTP, enhancement, P3)


Tracking


RESOLVED FIXED
mozilla65
Tracking Status
firefox65 --- fixed

People

(Reporter: u408661, Assigned: u408661)

References

Details

(Whiteboard: [ntlm][necko-triaged])

Attachments

(3 files, 1 obsolete file)

See bug 1346392 comment 75 (specifically, items 6 and 7). After making it so talking h2 to a server that wants NTLM works (by falling back to h1), we still, in some cases, dial back to that server with h2, even though we can be fairly certain that won't work (for subresources and the like). While this works (we fall back to h1 again), it's sub-optimal. We should fix it so that, once we've forced a fallback, we won't try dialing back with h2.
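The intended behavior can be sketched as a small in-memory set keyed by origin. This is a hypothetical stand-in for the real conninfo-based bookkeeping in Necko — the names here are illustrative, not actual Gecko APIs:

```cpp
#include <set>
#include <string>

// Illustrative sketch only: a memory-resident set of origins that have been
// forced to fall back from h2 to http/1.1. Real Necko keys this off the
// connection info, not a bare origin string.
class SpdyFallbackList {
 public:
  // Record that this origin required a forced h1 fallback (e.g. NTLM auth).
  void MarkFallback(const std::string& origin) { mOrigins.insert(origin); }

  // Consulted before dialing a new connection: should we still offer h2?
  bool MayOfferH2(const std::string& origin) const {
    return mOrigins.count(origin) == 0;
  }

 private:
  std::set<std::string> mOrigins;  // memory-only, so it resets on restart
};
```

Because the set is memory-only, a browser restart would give the origin another chance at h2 — which is exactly why the "still h1 after restart" observation below is surprising.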
Whiteboard: [necko-next]
Whiteboard: [necko-next] → [necko-active]
Hi Troy, sorry to bug you on a different... bug, but I'm hoping you would be willing to test a build to see if it fixes the issues you noted in bug 1346392 comment 75 (namely, that we keep trying h2 before failing over to h1). If so, the builds are at https://archive.mozilla.org/pub/firefox/try-builds/hurley@mozilla.com-56dd3152ccfb352e1f46a97b2a8064101a13fb43/

Thanks!
Flags: needinfo?(bugzilla)
Sure, I'll test it out when I get home tonight.
I used my repro environment to test Nick's fix identified as:

20170421140148
https://hg.mozilla.org/try/rev/56dd3152ccfb352e1f46a97b2a8064101a13fb43

SUMMARY:

The testing was successful.  The fix addressed the bug and no other regressions were found.  I was able to successfully browse to content using NTLM and Kerberos authentication, even if both the client and server were configured to support HTTP/2 over TLS 1.2.  Firefox successfully downgraded the HTTP/2 connection to HTTP/1.1 when initiating a request using NTLM or Kerberos authentication.

Future requests for resources on "downgraded hosts" didn't attempt to try HTTP/2 again.  They continued using HTTP/1.1.  This was true even if I closed the browser and restarted it.  The only way to force the client to attempt an HTTP/2 request again to a "downgraded host" was to clear the browser cache.

I've uploaded Firefox Nightly network logs for the NTLM and Kerberos tests in case anyone wants to review them.

DETAILS:

+----------------------+------------------------+--------------------+------------------+
| AuthN / Connection   | HTTP/1.1 (unencrypted) | HTTP/1.1 (TLS 1.2) | HTTP/2 (TLS 1.2) |
+----------------------+------------------------+--------------------+------------------+
| Anonymous            | Pass (1)               | Pass (1)           | Pass (2)         |
+----------------------+------------------------+--------------------+------------------+
| Basic                | Pass (1)               | Pass (1)           | Pass (2)         |
+----------------------+------------------------+--------------------+------------------+
| Digest (MD5-sess)    | Fail (3)               | Fail (3)           | Fail (4)         |
+----------------------+------------------------+--------------------+------------------+
| Kerberos (Negotiate) | Pass (5)               | Pass (5)           | Pass (6)         |
+----------------------+------------------------+--------------------+------------------+
| NTLM (Direct)        | Pass (5)               | Pass (5)           | Pass (7)         |
+----------------------+------------------------+--------------------+------------------+
| NTLM (Negotiate)     | Pass (5)               | Pass (5)           | Pass (7)         |
+----------------------+------------------------+--------------------+------------------+

1: The requests were initiated as HTTP/1.1, and the responses were successfully received as HTTP/1.1 as expected.

2: The requests were initiated as HTTP/2, and the responses were successfully received as HTTP/2 as expected.

3: The initial request was initiated as HTTP/1.1, and the response was successfully received as HTTP/1.1 as expected.  However, when accessing additional resources defined in the HTML page (images), the client included an Authorization header as HTTP/1.1 that was rejected by the server with an HTTP 401 response.  This resulted in a new authentication prompt to the end user.  When the end user provided the same credentials again, the request was retried and was successful.  I believe this is a known issue, tracked by Bugzilla bug 270447.  This failure is unrelated to this change.

4: The initial request was initiated as HTTP/2, and the response was successfully received as HTTP/2 as expected.  However, when accessing additional resources defined in the HTML page (images), the client included an Authorization header as HTTP/2 that was rejected by the server with an HTTP 401 response.  This resulted in a new authentication prompt to the end user.  When the end user provided the same credentials again, the request was retried and was successful.  I believe this is a known issue, tracked by Bugzilla bug 270447.  This failure is unrelated to this change.

5: The requests were initiated as HTTP/1.1, and the responses were successfully received as HTTP/1.1 as expected.  In addition, the existing connection was reused when requesting some of the additional content, so no further re-authentication attempts were necessary for those requests.  It appears that some new connections were created when requesting additional content, and those new connections performed the authentication handshake over HTTP/1.1 as expected.

6: The initial request was initiated as HTTP/2, and the initial HTTP 401 response was successfully received as HTTP/2 as expected.  The client then resent the request with an Authorization header over HTTP/1.1 as expected.  The content was received over HTTP/1.1 as expected.  The existing connection was reused when requesting some of the additional content, so no further re-authentication attempts were necessary for those requests.   It appears that some new connections were created when requesting additional content, and those new connections performed the Kerberos authentication handshake over HTTP/1.1 as expected.  They didn't attempt to try HTTP/2 again.

7: The initial request was initiated as HTTP/2, and the initial HTTP 401 response was successfully received as HTTP/2 as expected.  The client then resent the request with an Authorization header over HTTP/1.1 as expected.  The remaining steps of the NTLM handshake were completed over HTTP/1.1 as expected and the content was received over HTTP/1.1 as expected.  The existing connection was reused when requesting some of the additional content, so no further re-authentication attempts were necessary for those requests.   It appears that some new connections were created when requesting additional content, and those new connections performed the NTLM authentication handshake over HTTP/1.1 as expected.  They didn't attempt to try HTTP/2 again.
Flags: needinfo?(bugzilla)
Thanks, Troy! This all looks pretty great to me, except for:

(In reply to Troy Starr from comment #3)
> Future requests for resources on "downgraded hosts" didn't attempt to try
> HTTP/2 again.  They continued using HTTP/1.1.  This was true even if I
> closed the browser and restarted it.  The only way to force the client to
> attempt an HTTP/2 request again to a "downgraded host" was to clear the
> browser cache.

particularly the part about not trying h2 even after closing the browser. The blacklist I added is memory-only, so unless there's something I'm missing (which there probably is), we should try again with h2 after a restart. Given you said the blacklisting goes away if the cache is cleared, I suspect we're keeping something in there that results in us disabling h2, but I can't for the life of me think of what.

Patrick, can you think of anything that might fit the behaviour described above?
Flags: needinfo?(mcmanus)
BTW, let me know if you'd like to directly test your builds against my repro server.  I can send you a test account and test URLs via e-mail.  Note that my test environment is only accessible via IPv6, so you'd need an IPv6-capable Internet connection.
is this the patch? https://hg.mozilla.org/try/rev/794cc79823ffa97fa55d92b61187fd19e7ca1edf

I'd prefer a flag in nsConnectionEntry.. indeed you can probably set the existing flag in the ConnInfo in the connectionEntry and have it work.. would need to test that. that will mean no new string manipulation and for whatever sense of gc/mem-accounting we have for nsConnectionEntry (not much - but some) would just apply.
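Patrick's suggestion amounts to keeping the "no spdy" decision as a bit on the per-origin connection entry rather than doing new string manipulation on hash keys. A rough sketch, with invented names standing in for nsConnectionEntry:

```cpp
#include <string>

// Hypothetical stand-in for nsConnectionEntry: the per-origin bucket that
// owns idle/active connections. Storing the flag here means it is freed and
// accounted together with the entry itself, and needs no new string handling.
struct ConnectionEntrySketch {
  std::string hashKey;  // origin key, unchanged by the fallback
  bool noSpdy = false;  // set once h2 is known not to work for this origin
};

inline void ForceNoSpdy(ConnectionEntrySketch& ent) { ent.noSpdy = true; }

inline bool MayOfferH2(const ConnectionEntrySketch& ent) { return !ent.noSpdy; }
```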

this doesn't address your question - next comment!
Flags: needinfo?(mcmanus)
(In reply to Nicholas Hurley [:nwgh][:hurley] (also hurley@todesschaf.org) from comment #4)

> 
> particularly the part about not trying h2 even after closing the browser.
> The blacklist I added is memory-only, so unless there's something I'm
> missing (which there probably is), we should try again with h2 after a
> restart. Given you said the blacklisting goes away if the cache is cleared,
> I suspect we're keeping something in there that results in us disabling h2,
> but I can't for the life of me think of what.
> 
> Patrick, can you think of anything that might fit the behaviour described
> above?

I'm puzzled as well - unless the responses themselves are coming from the cache :)

Another possibility is that the browser wasn't actually restarted - sometimes I think I've done that but really there's another open window hidden somewhere, which means I'm just opening and closing windows instead of resetting the browser. But that wouldn't really jibe with the 'clear cache' comment.. then again, you didn't put anything in your patch that drops the blacklist on clear cache (that would be a good idea, actually).

sorry. other than retesting very carefully and adding logs, I don't know.
(In reply to Patrick McManus [:mcmanus] from comment #7)
> I'm puzzled as well - unless the responses themselves are coming from the
> cache :)

I think the first browse after restarting the browser is coming from the cache.  The F12 network inspector shows "cached" for all of the resources on that "first browse after restarting the browser" attempt, and I don't get an NTLM authentication prompt.

Then I press Ctrl+F5 to refresh the page, which does give me the NTLM authentication prompt.  Here the F12 network inspector shows that it's actually requesting and receiving those resources from the server.  But all of those requests are made via HTTP/1.1, including the first request.

If the "first browse after restarting the browser" is being served from the cache, could that be seeding the in-memory blacklist for future requests to that server during this browser session?
what about a normal ctrl-r reload? I think you're doing the "hard reload" which disables a lot of features in the name of sanity. h2 might coincidentally be one of them (i.e. not from the blacklist, just from the hard reload.)
(In reply to Patrick McManus [:mcmanus] from comment #9)
> what about a normal ctrl-r reload?

I see the same behavior with Ctrl+R - the requests are made via HTTP/1.1, including the first request.
that's a little odd but I bet it's still not part of nick's patch - you can send along a log from about:networking (start it before you hit ctrl-r) and I bet nick can sort it out.
Sounds good, I've attached a network log of my Ctrl+R reload attempt using the same test build.
Attached patch Another attempt (obsolete) — Splinter Review
Patrick - is this more along the lines of what you're thinking?
Attachment #8863940 - Flags: feedback?(mcmanus)
So looking at the log, the transaction definitely has NS_HTTP_DISALLOW_SPDY set in its caps - and this happens before the first time in the log that ForceNoSpdy is called. So, either something is connecting to the host in the background before you started the log (about:newtab thumbnails, perhaps?) or the caps are getting persisted somewhere.
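For readers unfamiliar with the caps field: it is a bitmask on the transaction, and NS_HTTP_DISALLOW_SPDY is one bit in it. A toy illustration of how such a bit is set and later checked — the bit values here are made up, the real constants live in the Necko headers:

```cpp
#include <cstdint>

// Made-up bit positions purely for illustration; do not rely on these values.
constexpr uint32_t SKETCH_ALLOW_KEEPALIVE = 1u << 0;
constexpr uint32_t SKETCH_DISALLOW_SPDY = 1u << 3;

// Mirrors the effect of `mCaps |= NS_HTTP_DISALLOW_SPDY` in the patch.
inline uint32_t ForceNoSpdyCaps(uint32_t caps) {
  return caps | SKETCH_DISALLOW_SPDY;
}

// How later connection-setup code would test the bit.
inline bool SpdyDisallowed(uint32_t caps) {
  return (caps & SKETCH_DISALLOW_SPDY) != 0;
}
```

The puzzle in the log is precisely that this bit is already present on the transaction before anything in the log has had a chance to set it.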

Troy, I'm willing to dig further into this behavior if you're still willing to email me some credentials (I have ipv6 access both from my home office and the office office, so no worries on that bit).
Flags: needinfo?(bugzilla)
(In reply to Nicholas Hurley [:nwgh][:hurley] (also hurley@todesschaf.org) from comment #14)
> Troy, I'm willing to dig further into this behavior if you're still willing
> to email me some credentials (I have ipv6 access both from my home office
> and the office office, so no worries on that bit).

You bet, I've sent test account credentials and URLs to your Mozilla e-mail address.
Flags: needinfo?(bugzilla)
Comment on attachment 8863940 [details] [diff] [review]
Another attempt

Review of attachment 8863940 [details] [diff] [review]:
-----------------------------------------------------------------

yeah. thanks

::: netwerk/protocol/http/nsHttpChannel.cpp
@@ +5660,5 @@
>  
>      mAllowSpdy = 0;
>      mCaps |= NS_HTTP_DISALLOW_SPDY;
>  
> +    gHttpHandler->BlacklistSpdy(mConnectionInfo);

I know this does what the bug title says - when channel.forceNoSpdy() disable spdy in the future..

but that's kind of counter-intuitive for a channel interface to impact state beyond the channel.

There seems only to be the one user of forceNoSpdy (the auth code). Can we at least change the name of the method to something like disableSpdyForOrigin() ?
Attachment #8863940 - Flags: feedback?(mcmanus) → feedback+
(In reply to Patrick McManus [:mcmanus] from comment #16)
> Comment on attachment 8863940 [details] [diff] [review]
> Another attempt
> 
> Review of attachment 8863940 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> yeah. thanks
> 
> ::: netwerk/protocol/http/nsHttpChannel.cpp
> @@ +5660,5 @@
> >  
> >      mAllowSpdy = 0;
> >      mCaps |= NS_HTTP_DISALLOW_SPDY;
> >  
> > +    gHttpHandler->BlacklistSpdy(mConnectionInfo);
> 
> I know this does what the bug title says - when channel.forceNoSpdy()
> disable spdy in the future..
> 
> but that's kind of counter-intuitive for a channel interface to impact state
> beyond the channel.
> 
> There seems only to be the one user of forceNoSpdy (the auth code). Can we
> at least change the name of the method to something like
> disableSpdyForOrigin() ?

That seems eminently reasonable. I'll make the change, get this one tested against Troy's site, and get it wrapped up for a real review. Thanks!
Ugh, I'm calling DontReuse (which Close()s the h2 session) on the main thread, but Close() is socket thread-only. This will take a bit of time to rework & test.
Oh boy, it gets even worse - even older code around sticky connections ends up calling DontReuse (through nsHttpChannel::CloseStickyConnection) on the main thread, resulting in the same assertions in debug builds. I'll try re-doing this with a non-debug build to see if I can get anything useful, or if we'll have to go deeper down the rabbit hole. (I shudder to think, at this point, just how many nasty bugs/assertions are hiding in our connection-based auth code paths.)
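The thread-affinity problem described above is generic: a method asserts it runs on one thread (here, DontReuse/Close on the socket thread), but callers invoke it from another (the main thread). The usual fix is to dispatch a runnable to the owning thread instead of calling directly. A minimal sketch, with a plain function queue standing in for Necko's socket-thread event target (all names hypothetical):

```cpp
#include <functional>
#include <queue>

// Toy event queue standing in for the socket thread's event target. In Gecko
// this would be an nsIEventTarget with dispatch machinery; here we just
// collect callables and later drain them "on" the owning thread.
class SocketThreadQueue {
 public:
  void Dispatch(std::function<void()> task) { mTasks.push(std::move(task)); }

  void Drain() {  // runs queued work on the socket thread
    while (!mTasks.empty()) {
      mTasks.front()();
      mTasks.pop();
    }
  }

 private:
  std::queue<std::function<void()>> mTasks;
};

struct Http2SessionSketch {
  bool closed = false;
  void DontReuse() { closed = true; }  // socket-thread-only in the real code
};

// A main-thread caller defers the close instead of invoking it directly.
inline void RequestDontReuse(SocketThreadQueue& st, Http2SessionSketch& s) {
  st.Dispatch([&s] { s.DontReuse(); });
}
```

The price of this pattern is that the close becomes asynchronous, which is part of why reworking and retesting takes time.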
Whiteboard: [necko-active] → [necko-active][ntlm]
I filed a bug for the other assert I've seen so far: https://bugzilla.mozilla.org/show_bug.cgi?id=1362508

However, this gets curiouser and curiouser... in my testing now, even after closing the browser, if I return to https://ntlm.troystarr.net/, the entire page is loaded from cache. Without asking for authentication. In the logs, I never see any request generated for /, only for the pngs on the page. Also, if I load about:cache prior to loading the page, nothing from ntlm.troystarr.net shows up as being in the cache. However, upon loading the page and refreshing about:cache, all the resources show up, as well as having a load count > 1.

I've been testing with opt builds to just avoid the assertions that are happening, but I'm thinking at this point I may need to actually hack around the assertion mentioned above (as well as any others I may come across during future testing) to run this in a debug build under rr so I can really inspect things.
Bulk priority update: https://bugzilla.mozilla.org/show_bug.cgi?id=1399258
Priority: -- → P1
Priority: P1 → P2
Whiteboard: [necko-active][ntlm] → [ntlm]
Whiteboard: [ntlm] → [ntlm][necko-triaged]
Priority: P2 → P3
In certain cases (such as the case from bug 1050329, where a server claims to speak h2, but really doesn't), we will end up trying every connection to that server as h2 before falling back to http/1.1. Once we know a server is very badly behaved, it doesn't make sense to keep trying h2 (at least for the current browsing session). This adds an in-memory blacklist of origins & conninfos on which we won't try h2, so that, beyond the first connection to an origin, we don't waste round trips trying h2, failing, and then dialing back with http/1.1.

Depends on D8436
Attachment #8863940 - Attachment is obsolete: true
Pushed by hurley@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/54a37700d0e6
Blacklist hosts from spdy who are very poorly behaved. r=dragana
https://hg.mozilla.org/mozilla-central/rev/54a37700d0e6
Status: NEW → RESOLVED
Closed: Last year
Resolution: --- → FIXED
Target Milestone: --- → mozilla65