Closed Bug 760608 Opened 7 years ago Closed 7 years ago

Preconnecting first connection with NullTransaction to non-SPDY HTTPS servers broken, hang for 30 seconds

Categories

(Core :: Networking: HTTP, defect, major)

defect
Not set
major

Tracking

()

VERIFIED FIXED
mozilla16
Tracking Status
firefox15 --- fixed
firefox16 --- fixed

People

(Reporter: mayhemer, Assigned: mcmanus)

References

(Blocks 1 open bug)

Details

(Whiteboard: [http-conn])

Attachments

(4 files, 1 obsolete file)

Attached file example log
This is here more for a record and potential backout/disable-only bug, since it will be fixed with bug 715905.

I found this out during work on allowing HTTPS testing in xpcshell.

Problem is that NullTransaction gets stuck on waiting for a response.  But we didn't send any request, so nothing can arrive from the server.

Transaction timeout logic kills this transaction in 30 seconds and the original pending transaction gets then restarted.

I believe I've hit this bug few times my self during daily usage wondering what's the hack is going on.


I vote for disabling of using NullTransaction for SSL pre-negotiation.


The log attached is an example of using a fresh profile with just following changes:
- disabled forgery and attack detection
- HTTP cache was emptied before start
- blank start page
- added an SSL exception for my local IIS testing server's certificate

I was connecting my local IIS server as https://127.0.0.1/
I experience the same hang with ssltunnel

To reproduce this bug opening cache entry must be slower then creation of the connection.  On some mobile devices in WiFi I believe this can happen quit often.

This is not reproducible when I'm connecting https://localhost (both IIS and ssltunnel) since the first address necko tries is ::1 and it fails.  This is changing the timing and no NullTransaction is then created.
I think this is a pipeline integration bug. I'm not sure what you're seeing that makes it look ssl on http/1 only.. ?

here's the key part:

2012-06-01 17:39:11.771000 UTC - 5760[6ec1d48]: nsHttpConnection::EnsureNPNComplete cc26440 negotiated to ''
2012-06-01 17:39:11.771000 UTC - 5760[6ec1d48]: nsHttpConnection::EnsureNPNComplete setting complete to true
2012-06-01 17:39:11.771000 UTC - 5760[6ec1d48]:   writing transaction request stream

NPN and the SSL handshake go ok ('' - plain http). and then it calls readsegments..

2012-06-01 17:39:11.771000 UTC - 5760[6ec1d48]: nsHttpPipeline::ReadSegments [this=cb320b8 count=32768]
2012-06-01 17:39:11.771000 UTC - 5760[6ec1d48]:   ReadSegments returned [rv=0 read=0 sock-cond=0]

nulltransaction::readsegments is supposed to return BASE_STREAM_CLOSED and set isDone on itself - which will result in the transaction being completed.. but the pipeline layer has turned that into rv=NS_OK somehow.

I would guess this pipeline code is the issue
      // return EOF if send buffer is empty
        if (avail == 0) {
            *countRead = 0;
            return NS_OK;
        }
Blocks: 466524
Patrick, do you think you can provide a simple and stable fix for this?
(In reply to Honza Bambas (:mayhemer) from comment #2)
> Patrick, do you think you can provide a simple and stable fix for this?

assuming my interpretation of the log in comment 1 is correct - definitely.
(In reply to Patrick McManus [:mcmanus] from comment #1)
> I'm not sure what you're seeing
> that makes it look ssl on http/1 only.. ?

Non-ssl connections are not armed with NullTransaction.

SPDY-sends out some data very soon, that moves the negotiation forward again, I think.  I've simply never seen case of a hang when there were SPDY and also NullTransaction involved on a connection.  It may also indicate the bug is actually in the pipelining code, since pipelines are not present when running SPDY.
Attached patch v1Splinter Review
This is a simple patch disabling SSL pre-negotiation with NullTransaction.
Assignee: nobody → honzab.moz
Status: NEW → ASSIGNED
Attachment #629799 - Flags: review?(mcmanus)
(In reply to Honza Bambas (:mayhemer) from comment #5)
> Created attachment 629799 [details] [diff] [review]
> v1
> 
> This is a simple patch disabling SSL pre-negotiation with NullTransaction.

I would hope we don't need that - I'm testing a very simple fix now. SSL is a big part of this feature, especially when combined with the search box logic. And we're only on nightly (with it likely slipping into week 1 of aurora).. so lots of time to verify the
We really should fix this quickly and safely before tomorrow.
Attached patch patch 0Splinter Review
Attachment #629832 - Flags: review?(honzab.moz)
(In reply to Patrick McManus [:mcmanus] from comment #8)
> Created attachment 629832 [details] [diff] [review]
> patch 0

@try too.

took me a little while to repro - for whatever reason https://www.mozilla.com was writing to the connection soon after handshake (even though it doesn't speak spdy), triggering OnSocketReadable so the bug wasn't evident. I found another SSL server that didn't do that and could clearly see the problem by introducing a short sleep in nsHttpChannel immediately after the SpeculativeLookup call. (so the main thread slept for 300ms while neck did its thing - like a slow cache lookup).

I could then, finally, confirm the patch did what was intended and fixed the issue.
(In reply to Honza Bambas (:mayhemer) from comment #4)

> 
> SPDY-sends out some data very soon, that moves the negotiation forward
> again, I think.

I didn't think of that. You're seeing the settings frame from the server that it normally sends first thing after the 3whs. (it's not a required element, but everybody does it.)
Comment on attachment 629832 [details] [diff] [review]
patch 0

Review of attachment 629832 [details] [diff] [review]:
-----------------------------------------------------------------

I have to confirm this fixes this issue.

I'm leaving this open for now.  Since I'm not able to test this patch very well, I'm not much keen to land this.

::: netwerk/protocol/http/NullHttpTransaction.cpp
@@ +91,5 @@
>  
>  PRUint32
>  NullHttpTransaction::Available()
>  {
> +  mIsDone = true;

You are just putting code where it fits to you.  Call to Available() changing state of an object is quit strange.

::: netwerk/protocol/http/nsHttpPipeline.cpp
@@ +613,5 @@
>          // return EOF if send buffer is empty
>          if (avail == 0) {
>              *countRead = 0;
> +            if (IsDone())
> +                return NS_BASE_STREAM_CLOSED;

This change worries me since we have absolutely no tests for keep-alive and pipelined connections (my fault).  This is a core change hard to test what trouble it may cause.
Comment on attachment 629832 [details] [diff] [review]
patch 0

Testing with pipelining on and aggressive using google image search as source of large parallelism this works.

Josh, I'd like to let you potentially veto based on whether we want to disable a never released feature of preconnecting ssl session while opening a cache entry (my patch) OR to allow potentially dangerous change at core level that may influence user even w/o pipelining on (Pat's patch).

I more tend to put this in now, however I hate only manual testing.
Attachment #629832 - Flags: superreview?(joshmoz)
Attachment #629832 - Flags: review?(honzab.moz)
Attachment #629832 - Flags: review+
> 
> ::: netwerk/protocol/http/NullHttpTransaction.cpp
> @@ +91,5 @@
> >  
> >  PRUint32
> >  NullHttpTransaction::Available()
> >  {
> > +  mIsDone = true;
> 
> You are just putting code where it fits to you.  Call to Available()
> changing state of an object is quit strange.
> 

not true!

you might recall that until a couple weeks ago with the happy-eyeball change, null-transaction was *always* done. (there wasn't even a state variable). The change was made so that it only marked itself done based on I/O happening - so that the TCP retry stuff could determine whether or not the handshake was still in progress. This change is consistent with that - nsHttpPipeline uses nsAHttpTransaction::Available() as a I/O function. (in lieu of ReadSegments basically).

as a testing comment - the only user of nsAHttpTransaction::Available is nsHttpPipeline
I'm going to look into another more conservative approach. Honza, you might be more comfortable with that if it bears fruit.
Comment on attachment 629832 [details] [diff] [review]
patch 0

Review of attachment 629832 [details] [diff] [review]:
-----------------------------------------------------------------

This is a request for comment on a Necko-specific issue that I'm not qualified to assess. If you're OK with either option and you need a tie-breaker opinion I think that should be Patrick's.
Attachment #629832 - Flags: superreview?(joshmoz)
Attached patch alternative 1 - v0 (obsolete) — Splinter Review
Here is an alternative approach - do not use nsHttpPipeline for CLASS_SOLO transactions. NullHttpTransaction's are SOLO.

SOLO's are never reused until the connection is quiescent, so this doesn't remove any functionality. (specifically it is not a return to the old bursty behavior.. transactions eligible to pipeline are not SOLO just because they don't have a pipeline partner at the time of scheduling).. SOLO is for things like audio/video, and XHR, etc..

I liked the property of driving everything through nsHttpPipeline, but it was leading to hacks like the other version of this patch. So I'm ok with the split.

The code for this patch is still small. It also contains an addition of the Keep-Alive flag to the NullTransaction caps in case the in originating transaction was missing it. It doesn't make much sense without it and I saw that happen in at least one of my tests.
Attachment #629957 - Flags: review?(honzab.moz)
Comment on attachment 629957 [details] [diff] [review]
alternative 1 - v0

I need to dig into a test fail before resetting r?
Attachment #629957 - Flags: review?(honzab.moz)
(In reply to Patrick McManus [:mcmanus] from comment #13)
> not true!
> 
> you might recall that until a couple weeks ago with the happy-eyeball
> change, null-transaction was *always* done. (there wasn't even a state
> variable). The change was made so that it only marked itself done based on
> I/O happening - so that the TCP retry stuff could determine whether or not
> the handshake was still in progress. 

Yes, the IO happening means to call Read or WriteSegments to me.

> This change is consistent with that -
> nsHttpPipeline uses nsAHttpTransaction::Available() as a I/O function. (in
> lieu of ReadSegments basically).

Hmm.. calling Available doesn't mean to do actual IO.  Anyone can call this method to just check the state.

> 
> as a testing comment - the only user of nsAHttpTransaction::Available is
> nsHttpPipeline

Yes, for now.  When we introduce a change that may call Available sooner then actual Read/Write is done then we may regress the IPv6 fallback patch that introduced mIsDone.  That is simply not correct to me.

(In reply to Patrick McManus [:mcmanus] from comment #16)
> Created attachment 629957 [details] [diff] [review]
> alternative 1 - v0
> 
> Here is an alternative approach - do not use nsHttpPipeline for CLASS_SOLO
> transactions. NullHttpTransaction's are SOLO.

Thanks!  That seems to me to be the correct approach, I like it.
Assignee: honzab.moz → mcmanus
(In reply to Patrick McManus [:mcmanus] from comment #16)
> Created attachment 629957 [details] [diff] [review]
> alternative 1 - v0

I can report that I have verified that this patch corrects the issue I reported in Bug 758899.

I have builds available at http://www.wg9s.com/mozilla/firefox that include this patch.

So far, Win32 and Linux 32-bit.  Android (Native and XUL) and Linux 64-bit are still in progress.
it turns out we do need to disable the SSL handshake via NullHttpTransaction in the presence of a HTTP proxy because NullHttpTransaction does not know how to drive the CONNECT sequence. (all HTTP parsing is done in http transaction, and the CONNECT tunnel is a HTTP transaction of sorts).

on one hand this is unfortunate as SSL is end to end even in the presence of the proxy and a potentially high lantecny operation.. otoh explicit proxy users are a distinct minority so most folks will still get the optimizaton.

rather than teaching nullhttptransaction about how to do tunnels we can just expect that this is part of bug 715905.

@try again
Attachment #629957 - Attachment is obsolete: true
Attachment #630179 - Flags: review?(honzab.moz)
I will be doing my windows build over with this (since it is the only one from which I can test the issue from bug 758899).  I will report back with results.
(In reply to Honza Bambas (:mayhemer) from comment #18)

> Hmm.. calling Available doesn't mean to do actual IO.  Anyone can call this
> method to just check the state.
> 

I don't really buy it - it's not an idl or some header with a inter-module scope. We can define it how we want to for the most value. And it seems like I/O to me in the same sense that PEEK is clearly I/O even though its non-destructive.

But I/O is probably the wrong thing anyhow.. mDone was really introduced to draw a line between the TCP/SSL handshake and the other transaction logic.. probly should have just introduced a settr at that point rather than implicitly from I/O functions.

likely all moot anyhow as I suspect we're going with the no-nshttptransaction-on-solo patch anyhow :)
Depends on: 760912
(In reply to Bill Gianopoulos [:WG9s] from comment #21)
> I will be doing my windows build over with this (since it is the only one
> from which I can test the issue from bug 758899).  I will report back with
> results.

thanks bill
Blocks: 760912
No longer depends on: 760912
try is looking fine
Comment on attachment 630179 [details] [diff] [review]
alternative 1 - v1

Review of attachment 630179 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks.  This looks good to me.  Tested for my case.

r=honzab
Attachment #630179 - Flags: review?(honzab.moz) → review+
(In reply to Patrick McManus [:mcmanus] from comment #20)
> Created attachment 630179 [details] [diff] [review]
> alternative 1 - v1
This patch has been tested and fixes bug 758899 also.
we'll let this bake for a few days on m-c and then nom for aurora if all goes well.

https://hg.mozilla.org/integration/mozilla-inbound/rev/3b1f2455579a
Target Milestone: --- → mozilla16
Duplicate of this bug: 760912
Duplicate of this bug: 758899
https://hg.mozilla.org/mozilla-central/rev/3b1f2455579a
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Verified fixed with the latest nightly.

Mozilla/5.0 (Windows NT 6.1; WOW64; rv:16.0) Gecko/16.0 Firefox/16.0a1
http://hg.mozilla.org/mozilla-central/rev/7e4c2abb9fc9
Status: RESOLVED → VERIFIED
Comment on attachment 630179 [details] [diff] [review]
alternative 1 - v1

[Approval Request Comment]
Bug caused by (feature/regressing bug #): 729123
User impact if declined: occasional connection hang with ssl site. reload fixes it.
Testing completed (on m-c, etc.): on m-c for 5 days and verified fixed by non-developer
Risk to taking this patch (and alternatives if risky): low - it fundamentally reverts one particular code path to old behavior
String or UUID changes made by this patch: none 

bug impacts only nightly (fixed already) and aurora.
Attachment #630179 - Flags: approval-mozilla-aurora?
Attachment #630179 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #629799 - Flags: review?(mcmanus)
Whiteboard: [http-conn]
You need to log in before you can comment on or make changes to this bug.