Last Comment Bug 760608 - Preconnecting first connection with NullTransaction to non-SPDY HTTPS servers broken, hang for 30 seconds
: Preconnecting first connection with NullTransaction to non-SPDY HTTPS servers...
Status: VERIFIED FIXED
[http-conn]
:
Product: Core
Classification: Components
Component: Networking: HTTP (show other bugs)
: Trunk
: All All
: -- major (vote)
: mozilla16
Assigned To: Patrick McManus [:mcmanus]
:
Mentors:
: 758899 760912 (view as bug list)
Depends on:
Blocks: 466524 729133 758899 760912
  Show dependency treegraph
 
Reported: 2012-06-01 11:12 PDT by Honza Bambas (:mayhemer)
Modified: 2012-06-21 06:41 PDT (History)
6 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
fixed
fixed


Attachments
example log (125.25 KB, text/plain)
2012-06-01 11:12 PDT, Honza Bambas (:mayhemer)
no flags Details
v1 (2.09 KB, patch)
2012-06-04 09:16 PDT, Honza Bambas (:mayhemer)
no flags Details | Diff | Splinter Review
patch 0 (1.37 KB, patch)
2012-06-04 09:56 PDT, Patrick McManus [:mcmanus]
honzab.moz: review+
Details | Diff | Splinter Review
alternative 1 - v0 (3.83 KB, patch)
2012-06-04 15:03 PDT, Patrick McManus [:mcmanus]
no flags Details | Diff | Splinter Review
alternative 1 - v1 (5.23 KB, patch)
2012-06-05 08:35 PDT, Patrick McManus [:mcmanus]
honzab.moz: review+
akeybl: approval‑mozilla‑aurora+
Details | Diff | Splinter Review

Description Honza Bambas (:mayhemer) 2012-06-01 11:12:04 PDT
Created attachment 629260 [details]
example log

This is here more for a record and potential backout/disable-only bug, since it will be fixed with bug 715905.

I found this out during work on allowing HTTPS testing in xpcshell.

Problem is that NullTransaction gets stuck on waiting for a response.  But we didn't send any request, so nothing can arrive from the server.

Transaction timeout logic kills this transaction in 30 seconds and the original pending transaction gets then restarted.

I believe I've hit this bug few times my self during daily usage wondering what's the hack is going on.


I vote for disabling of using NullTransaction for SSL pre-negotiation.


The log attached is an example of using a fresh profile with just following changes:
- disabled forgery and attack detection
- HTTP cache was emptied before start
- blank start page
- added an SSL exception for my local IIS testing server's certificate

I was connecting my local IIS server as https://127.0.0.1/
I experience the same hang with ssltunnel

To reproduce this bug opening cache entry must be slower then creation of the connection.  On some mobile devices in WiFi I believe this can happen quit often.

This is not reproducible when I'm connecting https://localhost (both IIS and ssltunnel) since the first address necko tries is ::1 and it fails.  This is changing the timing and no NullTransaction is then created.
Comment 1 Patrick McManus [:mcmanus] 2012-06-01 14:54:05 PDT
I think this is a pipeline integration bug. I'm not sure what you're seeing that makes it look ssl on http/1 only.. ?

here's the key part:

2012-06-01 17:39:11.771000 UTC - 5760[6ec1d48]: nsHttpConnection::EnsureNPNComplete cc26440 negotiated to ''
2012-06-01 17:39:11.771000 UTC - 5760[6ec1d48]: nsHttpConnection::EnsureNPNComplete setting complete to true
2012-06-01 17:39:11.771000 UTC - 5760[6ec1d48]:   writing transaction request stream

NPN and the SSL handshake go ok ('' - plain http). and then it calls readsegments..

2012-06-01 17:39:11.771000 UTC - 5760[6ec1d48]: nsHttpPipeline::ReadSegments [this=cb320b8 count=32768]
2012-06-01 17:39:11.771000 UTC - 5760[6ec1d48]:   ReadSegments returned [rv=0 read=0 sock-cond=0]

nulltransaction::readsegments is supposed to return BASE_STREAM_CLOSED and set isDone on itself - which will result in the transaction being completed.. but the pipeline layer has turned that into rv=NS_OK somehow.

I would guess this pipeline code is the issue
      // return EOF if send buffer is empty
        if (avail == 0) {
            *countRead = 0;
            return NS_OK;
        }
Comment 2 Honza Bambas (:mayhemer) 2012-06-02 03:49:43 PDT
Patrick, do you think you can provide a simple and stable fix for this?
Comment 3 Patrick McManus [:mcmanus] 2012-06-02 07:57:07 PDT
(In reply to Honza Bambas (:mayhemer) from comment #2)
> Patrick, do you think you can provide a simple and stable fix for this?

assuming my interpretation of the log in comment 1 is correct - definitely.
Comment 4 Honza Bambas (:mayhemer) 2012-06-03 05:42:40 PDT
(In reply to Patrick McManus [:mcmanus] from comment #1)
> I'm not sure what you're seeing
> that makes it look ssl on http/1 only.. ?

Non-ssl connections are not armed with NullTransaction.

SPDY-sends out some data very soon, that moves the negotiation forward again, I think.  I've simply never seen case of a hang when there were SPDY and also NullTransaction involved on a connection.  It may also indicate the bug is actually in the pipelining code, since pipelines are not present when running SPDY.
Comment 5 Honza Bambas (:mayhemer) 2012-06-04 09:16:22 PDT
Created attachment 629799 [details] [diff] [review]
v1

This is a simple patch disabling SSL pre-negotiation with NullTransaction.
Comment 6 Patrick McManus [:mcmanus] 2012-06-04 09:37:36 PDT
(In reply to Honza Bambas (:mayhemer) from comment #5)
> Created attachment 629799 [details] [diff] [review]
> v1
> 
> This is a simple patch disabling SSL pre-negotiation with NullTransaction.

I would hope we don't need that - I'm testing a very simple fix now. SSL is a big part of this feature, especially when combined with the search box logic. And we're only on nightly (with it likely slipping into week 1 of aurora).. so lots of time to verify the
Comment 7 Honza Bambas (:mayhemer) 2012-06-04 09:52:27 PDT
We really should fix this quickly and safely before tomorrow.
Comment 8 Patrick McManus [:mcmanus] 2012-06-04 09:56:36 PDT
Created attachment 629832 [details] [diff] [review]
patch 0
Comment 9 Patrick McManus [:mcmanus] 2012-06-04 10:03:21 PDT
(In reply to Patrick McManus [:mcmanus] from comment #8)
> Created attachment 629832 [details] [diff] [review]
> patch 0

@try too.

took me a little while to repro - for whatever reason https://www.mozilla.com was writing to the connection soon after handshake (even though it doesn't speak spdy), triggering OnSocketReadable so the bug wasn't evident. I found another SSL server that didn't do that and could clearly see the problem by introducing a short sleep in nsHttpChannel immediately after the SpeculativeLookup call. (so the main thread slept for 300ms while neck did its thing - like a slow cache lookup).

I could then, finally, confirm the patch did what was intended and fixed the issue.
Comment 10 Patrick McManus [:mcmanus] 2012-06-04 10:09:39 PDT
(In reply to Honza Bambas (:mayhemer) from comment #4)

> 
> SPDY-sends out some data very soon, that moves the negotiation forward
> again, I think.

I didn't think of that. You're seeing the settings frame from the server that it normally sends first thing after the 3whs. (it's not a required element, but everybody does it.)
Comment 11 Honza Bambas (:mayhemer) 2012-06-04 11:37:01 PDT
Comment on attachment 629832 [details] [diff] [review]
patch 0

Review of attachment 629832 [details] [diff] [review]:
-----------------------------------------------------------------

I have to confirm this fixes this issue.

I'm leaving this open for now.  Since I'm not able to test this patch very well, I'm not much keen to land this.

::: netwerk/protocol/http/NullHttpTransaction.cpp
@@ +91,5 @@
>  
>  PRUint32
>  NullHttpTransaction::Available()
>  {
> +  mIsDone = true;

You are just putting code where it fits to you.  Call to Available() changing state of an object is quit strange.

::: netwerk/protocol/http/nsHttpPipeline.cpp
@@ +613,5 @@
>          // return EOF if send buffer is empty
>          if (avail == 0) {
>              *countRead = 0;
> +            if (IsDone())
> +                return NS_BASE_STREAM_CLOSED;

This change worries me since we have absolutely no tests for keep-alive and pipelined connections (my fault).  This is a core change hard to test what trouble it may cause.
Comment 12 Honza Bambas (:mayhemer) 2012-06-04 11:53:19 PDT
Comment on attachment 629832 [details] [diff] [review]
patch 0

Testing with pipelining on and aggressive using google image search as source of large parallelism this works.

Josh, I'd like to let you potentially veto based on whether we want to disable a never released feature of preconnecting ssl session while opening a cache entry (my patch) OR to allow potentially dangerous change at core level that may influence user even w/o pipelining on (Pat's patch).

I more tend to put this in now, however I hate only manual testing.
Comment 13 Patrick McManus [:mcmanus] 2012-06-04 12:18:21 PDT
> 
> ::: netwerk/protocol/http/NullHttpTransaction.cpp
> @@ +91,5 @@
> >  
> >  PRUint32
> >  NullHttpTransaction::Available()
> >  {
> > +  mIsDone = true;
> 
> You are just putting code where it fits to you.  Call to Available()
> changing state of an object is quit strange.
> 

not true!

you might recall that until a couple weeks ago with the happy-eyeball change, null-transaction was *always* done. (there wasn't even a state variable). The change was made so that it only marked itself done based on I/O happening - so that the TCP retry stuff could determine whether or not the handshake was still in progress. This change is consistent with that - nsHttpPipeline uses nsAHttpTransaction::Available() as a I/O function. (in lieu of ReadSegments basically).

as a testing comment - the only user of nsAHttpTransaction::Available is nsHttpPipeline
Comment 14 Patrick McManus [:mcmanus] 2012-06-04 12:19:05 PDT
I'm going to look into another more conservative approach. Honza, you might be more comfortable with that if it bears fruit.
Comment 15 Josh Aas 2012-06-04 12:19:45 PDT
Comment on attachment 629832 [details] [diff] [review]
patch 0

Review of attachment 629832 [details] [diff] [review]:
-----------------------------------------------------------------

This is a request for comment on a Necko-specific issue that I'm not qualified to assess. If you're OK with either option and you need a tie-breaker opinion I think that should be Patrick's.
Comment 16 Patrick McManus [:mcmanus] 2012-06-04 15:03:36 PDT
Created attachment 629957 [details] [diff] [review]
alternative 1 - v0

Here is an alternative approach - do not use nsHttpPipeline for CLASS_SOLO transactions. NullHttpTransaction's are SOLO.

SOLO's are never reused until the connection is quiescent, so this doesn't remove any functionality. (specifically it is not a return to the old bursty behavior.. transactions eligible to pipeline are not SOLO just because they don't have a pipeline partner at the time of scheduling).. SOLO is for things like audio/video, and XHR, etc..

I liked the property of driving everything through nsHttpPipeline, but it was leading to hacks like the other version of this patch. So I'm ok with the split.

The code for this patch is still small. It also contains an addition of the Keep-Alive flag to the NullTransaction caps in case the in originating transaction was missing it. It doesn't make much sense without it and I saw that happen in at least one of my tests.
Comment 17 Patrick McManus [:mcmanus] 2012-06-04 18:50:55 PDT
Comment on attachment 629957 [details] [diff] [review]
alternative 1 - v0

I need to dig into a test fail before resetting r?
Comment 18 Honza Bambas (:mayhemer) 2012-06-05 01:57:05 PDT
(In reply to Patrick McManus [:mcmanus] from comment #13)
> not true!
> 
> you might recall that until a couple weeks ago with the happy-eyeball
> change, null-transaction was *always* done. (there wasn't even a state
> variable). The change was made so that it only marked itself done based on
> I/O happening - so that the TCP retry stuff could determine whether or not
> the handshake was still in progress. 

Yes, the IO happening means to call Read or WriteSegments to me.

> This change is consistent with that -
> nsHttpPipeline uses nsAHttpTransaction::Available() as a I/O function. (in
> lieu of ReadSegments basically).

Hmm.. calling Available doesn't mean to do actual IO.  Anyone can call this method to just check the state.

> 
> as a testing comment - the only user of nsAHttpTransaction::Available is
> nsHttpPipeline

Yes, for now.  When we introduce a change that may call Available sooner then actual Read/Write is done then we may regress the IPv6 fallback patch that introduced mIsDone.  That is simply not correct to me.

(In reply to Patrick McManus [:mcmanus] from comment #16)
> Created attachment 629957 [details] [diff] [review]
> alternative 1 - v0
> 
> Here is an alternative approach - do not use nsHttpPipeline for CLASS_SOLO
> transactions. NullHttpTransaction's are SOLO.

Thanks!  That seems to me to be the correct approach, I like it.
Comment 19 Bill Gianopoulos [:WG9s] 2012-06-05 06:26:04 PDT
(In reply to Patrick McManus [:mcmanus] from comment #16)
> Created attachment 629957 [details] [diff] [review]
> alternative 1 - v0

I can report that I have verified that this patch corrects the issue I reported in Bug 758899.

I have builds available at http://www.wg9s.com/mozilla/firefox that include this patch.

So far, Win32 and Linux 32-bit.  Android (Native and XUL) and Linux 64-bit are still in progress.
Comment 20 Patrick McManus [:mcmanus] 2012-06-05 08:35:20 PDT
Created attachment 630179 [details] [diff] [review]
alternative 1 - v1

it turns out we do need to disable the SSL handshake via NullHttpTransaction in the presence of a HTTP proxy because NullHttpTransaction does not know how to drive the CONNECT sequence. (all HTTP parsing is done in http transaction, and the CONNECT tunnel is a HTTP transaction of sorts).

on one hand this is unfortunate as SSL is end to end even in the presence of the proxy and a potentially high lantecny operation.. otoh explicit proxy users are a distinct minority so most folks will still get the optimizaton.

rather than teaching nullhttptransaction about how to do tunnels we can just expect that this is part of bug 715905.

@try again
Comment 21 Bill Gianopoulos [:WG9s] 2012-06-05 08:44:47 PDT
I will be doing my windows build over with this (since it is the only one from which I can test the issue from bug 758899).  I will report back with results.
Comment 22 Patrick McManus [:mcmanus] 2012-06-05 08:50:53 PDT
(In reply to Honza Bambas (:mayhemer) from comment #18)

> Hmm.. calling Available doesn't mean to do actual IO.  Anyone can call this
> method to just check the state.
> 

I don't really buy it - it's not an idl or some header with a inter-module scope. We can define it how we want to for the most value. And it seems like I/O to me in the same sense that PEEK is clearly I/O even though its non-destructive.

But I/O is probably the wrong thing anyhow.. mDone was really introduced to draw a line between the TCP/SSL handshake and the other transaction logic.. probly should have just introduced a settr at that point rather than implicitly from I/O functions.

likely all moot anyhow as I suspect we're going with the no-nshttptransaction-on-solo patch anyhow :)
Comment 23 Patrick McManus [:mcmanus] 2012-06-05 08:53:04 PDT
(In reply to Bill Gianopoulos [:WG9s] from comment #21)
> I will be doing my windows build over with this (since it is the only one
> from which I can test the issue from bug 758899).  I will report back with
> results.

thanks bill
Comment 24 Patrick McManus [:mcmanus] 2012-06-05 10:57:31 PDT
try is looking fine
Comment 25 Honza Bambas (:mayhemer) 2012-06-06 11:10:36 PDT
Comment on attachment 630179 [details] [diff] [review]
alternative 1 - v1

Review of attachment 630179 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks.  This looks good to me.  Tested for my case.

r=honzab
Comment 26 Bill Gianopoulos [:WG9s] 2012-06-06 11:22:00 PDT
(In reply to Patrick McManus [:mcmanus] from comment #20)
> Created attachment 630179 [details] [diff] [review]
> alternative 1 - v1
This patch has been tested and fixes bug 758899 also.
Comment 27 Patrick McManus [:mcmanus] 2012-06-06 12:51:41 PDT
we'll let this bake for a few days on m-c and then nom for aurora if all goes well.

https://hg.mozilla.org/integration/mozilla-inbound/rev/3b1f2455579a
Comment 28 Patrick McManus [:mcmanus] 2012-06-06 12:53:19 PDT
*** Bug 760912 has been marked as a duplicate of this bug. ***
Comment 29 Patrick McManus [:mcmanus] 2012-06-06 12:54:30 PDT
*** Bug 758899 has been marked as a duplicate of this bug. ***
Comment 30 Ed Morley [:emorley] 2012-06-07 05:49:47 PDT
https://hg.mozilla.org/mozilla-central/rev/3b1f2455579a
Comment 31 Bill Gianopoulos [:WG9s] 2012-06-07 07:40:53 PDT
Verified fixed with the latest nightly.

Mozilla/5.0 (Windows NT 6.1; WOW64; rv:16.0) Gecko/16.0 Firefox/16.0a1
http://hg.mozilla.org/mozilla-central/rev/7e4c2abb9fc9
Comment 32 Patrick McManus [:mcmanus] 2012-06-11 12:19:47 PDT
Comment on attachment 630179 [details] [diff] [review]
alternative 1 - v1

[Approval Request Comment]
Bug caused by (feature/regressing bug #): 729123
User impact if declined: occasional connection hang with ssl site. reload fixes it.
Testing completed (on m-c, etc.): on m-c for 5 days and verified fixed by non-developer
Risk to taking this patch (and alternatives if risky): low - it fundamentally reverts one particular code path to old behavior
String or UUID changes made by this patch: none 

bug impacts only nightly (fixed already) and aurora.
Comment 33 Patrick McManus [:mcmanus] 2012-06-12 17:43:24 PDT
https://hg.mozilla.org/releases/mozilla-aurora/rev/384eece25973

Note You need to log in before you can comment on or make changes to this bug.