Last Comment Bug 739522 - google.com sites with SPDY stop being able to load. Hang at "Looking up foo.google.com...". Continuous CPU usage.
: google.com sites with SPDY stop being able to load. Hang at "Looking up foo....
Status: RESOLVED FIXED
[spdy][qa-][http-conn]
:
Product: Core
Classification: Components
Component: Networking (show other bugs)
: Trunk
: x86_64 Mac OS X
: -- normal (vote)
: mozilla15
Assigned To: Patrick McManus [:mcmanus]
:
Mentors:
https://mail.google.com
: 729343 751689 (view as bug list)
Depends on: 744788 753663 755467
Blocks:
  Show dependency treegraph
 
Reported: 2012-03-26 22:44 PDT by Matthew N. [:MattN] (behind on reviews)
Modified: 2012-06-21 06:41 PDT (History)
13 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
+
fixed
+
fixed
fixed


Attachments
Sample 3/3 shortly after opening a Gmail tab (which never loads) (288.78 KB, text/plain)
2012-03-26 22:44 PDT, Matthew N. [:MattN] (behind on reviews)
no flags Details
Sample from possibly related shutdown hang (79.72 KB, text/plain)
2012-03-26 23:38 PDT, Matthew N. [:MattN] (behind on reviews)
no flags Details
patch 0 (3.50 KB, patch)
2012-04-13 21:02 PDT, Patrick McManus [:mcmanus]
no flags Details | Diff | Review
patch 1 (5.79 KB, patch)
2012-04-23 18:21 PDT, Patrick McManus [:mcmanus]
no flags Details | Diff | Review
patch v2 (3.49 KB, patch)
2012-05-03 11:02 PDT, Patrick McManus [:mcmanus]
no flags Details | Diff | Review
ptach v3 (3.40 KB, patch)
2012-05-03 11:11 PDT, Patrick McManus [:mcmanus]
honzab.moz: review+
akeybl: approval‑mozilla‑aurora+
akeybl: approval‑mozilla‑beta+
Details | Diff | Review
patch 3 aurora (ff14) (3.29 KB, patch)
2012-05-03 14:49 PDT, Patrick McManus [:mcmanus]
no flags Details | Diff | Review
patch 3 beta (ff13) (3.99 KB, patch)
2012-05-03 14:51 PDT, Patrick McManus [:mcmanus]
no flags Details | Diff | Review

Description Matthew N. [:MattN] (behind on reviews) 2012-03-26 22:44:59 PDT
Created attachment 609622 [details]
Sample 3/3 shortly after opening a Gmail tab (which never loads)

Google websites using SPDY stopped loading for me after I had been using the browser for a few hours.  I had disconnected from Ethernet and enabled Wifi when I moved to a meeting room but I'm not sure if this is related.  I have 3 Gmail app tabs and they were displaying Gmail's Connecting UI with the "try again" and countdown.  Doing a reload of the tab caused them to load indefinitely with a spinning throbber.  Opening new tabs to https://www.google.com/ or https://mail.google.com/ showed "Connecting..." in the tab title and "Looking up foo.google.com..." in Firefox's status overlay.  Firefox stayed between 15% and 40% CPU usage for hours since this started causing the CPU fan to run faster.

Affected Sites (SPDY):
https://www.google.com/
https://mail.google.com/
https://docs.google.com/

Unaffected Sites:
https://encrypted.google.com/
https://maps.google.com/
(both are non-SPDY for me)

Build identifier: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:14.0) Gecko/20120326 Firefox/14.0a1 http://hg.mozilla.org/mozilla-central/rev/ba4983d9c1f9

I do have the SPDY indicator 1.1 extension installed along with many others.

I have more samples, netstat output, Wireshark output, and other data which I will also share privately.

  Library Versions
        Expected minimum version        Version in use
        NSPR        4.9        4.9
        NSS        3.13.3.0 Basic ECC        3.13.3.0 Basic ECC
        NSS Util        3.13.3.0        3.13.3.0
        NSS SSL        3.13.3.0 Basic ECC        3.13.3.0 Basic ECC
        NSS S/MIME        3.13.3.0 Basic ECC        3.13.3.0 Basic ECC
Comment 1 Matthew N. [:MattN] (behind on reviews) 2012-03-26 23:38:42 PDT
Created attachment 609631 [details]
Sample from possibly related shutdown hang

Not sure if this is related but the browser hung on shutdown after these problems with nsXMLHttpRequest::Send(nsIVariant*) on the stack.
Comment 2 Patrick McManus [:mcmanus] 2012-03-26 23:49:51 PDT
In email Matthew indicates this could be related to moving from wired to wireless ethernet.

I think I may have also seen this once and was unable to diagnose or reproduce. That event would also be related to a network interface being disabled and re-enabled.

nspr http logging would probably show what it was blocking on.

I'll think about if there are any other paths to diagnosing.
Comment 3 Matthew N. [:MattN] (behind on reviews) 2012-03-27 01:08:17 PDT
By the way, I also had the following non-default possibly related prefs:
security.OCSP.require = true
network.http.pipelining = true
network.http.pipelining.ssl = true
network.http.pipelining.maxrequests = 8  (I don't know why this was changed from 32)

I tried reproducing the bug in a new profile with the same non-default prefs and NSPR logging (NSPR_LOG_MODULES=nsHttp:5,nsSocketTransport:5,nsHostResolver:5) but I was unsuccessful.  I was trying different combinations of switching between Wired and Wi-fi in the middle of page loads and XHR.

Another theory is that this is related to NoScript because its changelog mentioned being broken on Nightly due to bug 719154. Not sure what the problem was though. https://addons.mozilla.org/en-US/firefox/addon/noscript/versions/2.3.6rc4 

I problem is that I don't know exact STR because I didn't notice the problem until the meeting ended because I wasn't really using the laptop during the meeting.  I'm not even sure it's related to switching networks.  

I was using ethernet at my desk with Wi-fi disabled then I pulled the ethernet out to walk to the meeting room.  I didn't sleep the computer and it would have no connection while walking.  At some point during the meeting I tried an MXR search and noticed I didn't have an internet connection because my Wi-Fi was disabled. I enabled Wi-Fi and searched MXR but didn't pay attention to the Google tabs until an hour later or so.  When I saw that the Gmail loading notification was showing on the gmail app tabs, I refreshed all 3 and then they never finished loading.  I didn't pay much attention since I went back to other applications and then later did a Google search which also failed.
Comment 4 Martin Thomson [:mt:] 2012-03-29 10:08:53 PDT
I can confirm this. 

I have IPv6 access from the IETF meeting network, and the connection logs show that I have been using it to connect.  That is the only variable that I am aware of that might affect things.

CPU usage is moderate on one core, but it's a pretty fast machine and the amount of noise is high.

Matthew, I have pipelining on, but that doesn't affect SPDY at all.  Nor does OCSP since that is all in the clear.  Don't bother chasing those down.
Comment 5 Patrick McManus [:mcmanus] 2012-04-09 11:20:42 PDT
*** Bug 729343 has been marked as a duplicate of this bug. ***
Comment 6 Martin Thomson [:mt:] 2012-04-09 12:53:44 PDT
I've now been surfing with SPDY enabled for a long while without seeing the bug resurface again.  That's a little frustrating for me trying to reproduce it, but I'm sure much worse for Patrick who wants to fix it :)
Comment 7 Patrick McManus [:mcmanus] 2012-04-13 20:47:46 PDT
* I can't reproduce but don't have any reason to believe this has gone away fully. There are 4 credible reports in total of it - including one of my own. If someone does reproduce and they don't have NSPR logging going, could you try going to work offline and then online again and see if that helps? (this will detach all existing sockets).

* 2 or 3 of those reports involved starting a packet capture when things went into this state, and no network traffic was observed when hitting reload

* I can think of two circumstances where no traffic would be generated when clicking reload:

 1] The new http transaction is placed on an existing spdy session (i.e. tcp session that has already carried spdy transactions) and a new SYN-frame is generated but cannot be sent because the TCP CWND is exhausted - having been filled by previous sends. TCP retries are going on, but due to exponential backoff they are few and far between and by the time the capture is setup they are so far apart as to be non-observable. The TCP connection is a black hole - perhaps being abandoned by a NAT that either isn't providing a RST or wasn't reachable at all during the time the retransmits really happened. There are a few variations on this theme - but basically it means we have an "active" connection that we aren't figuring out is broken and we continue to schedule new transactions onto it and those new transactions never send anything.

 2] When using spdy with a known spdy host we basically don't open parallel connections. If something were to prevent a connection from completing its half open state that would prevent any new connections from being attempted until that was resolved. The bug reports tend to involve wireless networks, bridge settings being played with (in my case), shutting of laptops, etc.. so it isn't hard to imagine how that might have happened. We should aggressively do something about that, but the parallelism of HTTP/1 has made it a non-pressing concern. It seems obvious now that if we have a hard limit of 1 and that 1 is non-functioning that's a worse situation.

The plan wrt theory #1 - The K-A code from 728113 should certainly help with this if we really get to deadlock somehow. That code is in ff13, which I'm not sure if any of the bug reports was using. I don't think they were. Additionally, 738012 and 744788 have been fixed on ff14 and should be ported to 13 to ensure its correct operation there.

My gut actually thinks theory #2 is more likely because this scenario would more easily result in the "no observable transmissions" situation. I'm going to attach a patch to this bug that says we should proceed with parallelism on any connection attempt at least once every 3 seconds (subject to the normal upper http connection limits) if there are no active sessions that have successfully confirmed SPDY through NPN yet. With any site that can handshake within a few whole seconds that will give us the current behavior and sites much slower than that will simply see an extra connection. (if both connections get established one will end up receiving all the transactions and the other will naturally quiesce and timeout fairly shortly in a harmless manner).
Comment 8 Patrick McManus [:mcmanus] 2012-04-13 21:02:26 PDT
Created attachment 614997 [details] [diff] [review]
patch 0
Comment 9 Honza Bambas (:mayhemer) 2012-04-15 12:10:53 PDT
 3] missing call to AsyncWait on the output stream or input stream in spdy or logic issue that in some way breaks/prevents the wait+callback chain
 4] some glitch in the dispatch logic - transaction is forever pending in the queue
 5] "stopped loading for me after I had been using the browser for a few hours" -> some issue with GoAway or getting close to stream max number 

Ping timeout (Bug 728113) should fix TCP send buffer congestion (1] and 2] as well probably).

I don't thing the patch is the solution here.  Implementing some kind of a transaction dispatch watch dog (is in queue too long or is dispatched on a connection too long, but has not been even sent out -> redispatch forcefully on a new connection) would be IMO better, but it would just hide the true problem if there is one different from a connection stuck on the TCP level.

Would be great to test this with a Firefox versions that has/has not the SPDY timeout and pipelining patches (regardless the preferences, the code changes are important here).  Matt, looks like you were enable to reproduce.  With what version exactly did you experience the bug?  With what version exactly did those 2 reports happened?

> Doing a reload of the tab caused them to load indefinitely with a spinning throbber.

What was in the status bar?  Connecting, sending or receiving?  Do you remember?  Was the spin the black with a cycling dot [1] or the colored blue [2]?

[1] http://mxr.mozilla.org/mozilla-central/source/browser/themes/pinstripe/tabbrowser/connecting.png
[2] http://mxr.mozilla.org/mozilla-central/source/browser/themes/pinstripe/tabbrowser/loading.png
Comment 10 [not reading bugmail] 2012-04-15 16:54:53 PDT
(In reply to Honza Bambas (:mayhemer) from comment #9)

> > Doing a reload of the tab caused them to load indefinitely with a spinning throbber.

I thought I see this too on older versions since FF4 - maybe the max http connections pref was too high. network.http.max-connections;256 so I been running at 48 or 60 instead, but then maybe its offline bug 744788 which I mentioned over here:
http://hacks.mozilla.org/2012/02/spdy-brings-responsive-and-scalable-transport-to-firefox-11/
Comment 11 Patrick McManus [:mcmanus] 2012-04-15 20:32:12 PDT
(In reply to Honza Bambas (:mayhemer) from comment #9)
>  3] missing call to AsyncWait on the output stream or input stream in spdy
> or logic issue that in some way breaks/prevents the wait+callback chain

I think this is unlikely because a new event (keep clicking reload or submitting new stuff) would get it going again and the report is that its totally locked up.

>  4] some glitch in the dispatch logic - transaction is forever pending in
> the queue

This seems most likely to me - The only permutation of this I can think of is what the patch works around. The connection-establishment issue feels real even if I've never seen it - a stall while getting through half-open or npn-negotiation will essentially create a halt

>  5] "stopped loading for me after I had been using the browser for a few
> hours" -> some issue with GoAway or getting close to stream max number 
> 

interesting thoughts, but exhausting 23 bits of stream number - no way that happened imo. Especially the time I saw it which was fairly near startup, but I suspect in all cases its implausible. 8 million transactions is a lot :). GoAway is always followed by a TCP FIN wihout us having to ack it in any way (at least from *.google.com) so that isn't likely the issue - that would generate a close up in httpconnection::onsocketreadable.

I think some kind of loss (maybe nat induced, or routing change) is a factor.. I got it when playing with a bridge on localhost (though I haven't been able to reproduce despite lots of trying), martin saw it using ipv6 on a wireless net in a conference setting, and another report involved suspending the laptop and moving between access points.. so that's why I lean towards those loss kind of scenarios more than a vanilla dispatch race condition or problem in handling another condition like goaway - but its true we can't tell for sure.

If we can identify real scenarios I would rather have fixes in place for them even if we can't be sure they are the source of this.

> Ping timeout (Bug 728113) should fix TCP send buffer congestion (1] and 2]
> as well probably).
> 

it would fix 1, but not 2.. there is no ping with the half-open conn. I feel like 1 is plausible but unlikely - you have to assume a lot of stuff for #1 to be non-observable in a packet cap.

> I don't thing the patch is the solution here.  Implementing some kind of a
> transaction dispatch watch dog (is in queue too long or is dispatched on a
> connection too long, but has not been even sent out -> redispatch forcefully
> on a new connection) would be IMO better, but it would just hide the true
> problem if there is one different from a connection stuck on the TCP level.
> 

I feel like http/1's parallelism is the reason we don't need to track and do cancellation on connection open for http/1. (though maybe we should - I r-'d a patch that did that basically because the patch was buggy not because it was a bad idea.. it has some minor benefits).. the patch here brings some of that safety-net into play for spdy at almost no cost.
Comment 12 Honza Bambas (:mayhemer) 2012-04-16 09:14:16 PDT
Patrick, can you please say when exactly you believe we get stack?  TCP connect or SSL nego or what is it?

I'm just curious why current timeout logic doesn't catch this.  TCP timeout may be broken by a bug in the system stack or driver (unlikely).  Otherwise we catch it as a failure to connect in some 21 seconds.  SSL has no timeout, so when this is not caught by TLS intolerance timer (unlikely), we may get stuck.

According (your) comments, I more believe this is a transaction scheduling issue we introduced recently, but who knows w/o logs.

The connection timeout should be placed to the SPDY code (session) and let all transactions pending on the that timed-out session rescheduled (returned to the pending queue probably), maybe only after a new session is established.  After some amount of retries we may even fallback to HTTP/1.
Comment 13 Patrick McManus [:mcmanus] 2012-04-16 18:51:41 PDT
(In reply to Honza Bambas (:mayhemer) from comment #12)
> Patrick, can you please say when exactly you believe we get stack?  TCP
> connect or SSL nego or what is it?

either of those are plausible. Any point before SSL handshake being complete really, though I don't know the nss code well enough - maybe there are timeouts detected in there?

Its also quite possible the bug was filed based on something the KA handling should fix - especially with the recent offline-timer fixes. I can guess which is more likely but there is no way to really know.

> 
> I'm just curious why current timeout logic doesn't catch this.  TCP timeout
> may be broken by a bug in the system stack or driver (unlikely).  Otherwise
> we catch it as a failure to connect in some 21 seconds.

where does 21 come from? A system setting? On all platforms? os x included and linux included? Is that true both when arp for next hop succeeeds and fails (which often changes this behavior of system stacks).. I wasn't able to find any necko handling - but maybe its buried in nspr somewhere? I agree this patch isn't relevant if we do connection timeout handling (all the way through ssl handshake) without relying on the system. That's actually the patch I thought about writing here - but its way more complicated.

  SSL has no timeout,

where is the tcp timeout? (other than in the OS)

> The connection timeout should be placed to the SPDY code (session) and let
> all transactions pending on the that timed-out session rescheduled (returned
> to the pending queue probably), maybe only after a new session is
> established. 

That's basically exactly what the K-A code does already.

The patch attached here deals with the case where there are no fully established spdy sessions (thus no transactions on them), but there is 1 in the process of being established.. therefore we are waiting for that to complete.
Comment 14 Honza Bambas (:mayhemer) 2012-04-17 09:52:57 PDT
Re timeouts: we rely on the system TCP timeout and have the 250ms backup.  PSM has a timer to detect TLS intolerant servers.  But it is not used for sites already recognized as TLS intolerant.  So, it's not perfect.

Sorry to repeat my self, but bug 715905 could have a single timeout coverage for all these cases...

I suggest to wait until somebody experience this bug with bug 729736 and bug 744788 fixed.

If so, then this bug is either strictly related to SPDY and is a problem in the logic implementation (transaction queuing) or, is an issue with system implementation bug - we are not getting error reports (unlikely) - and may affect all protocols including ftp and mail protocols.

Based on that, your fix has to be placed strictly to SPDY code as a workaround for currently unidentified issue or should be, on the other side, in the platform touching all protocols (in nsSocketTransport and in PSM).
Comment 15 Patrick McManus [:mcmanus] 2012-04-17 17:51:50 PDT
this is not worth my time to squabble over. I think you're making a mistake resisting an in depth defense to a problem without a known root cause that comes at very low risk.

that being said, I agree that it is probably not the root cause of the issue.

it might be.

I hope its not.

I'll mark it fixed. If you insist that it be left open I'll insist that you take the patch for it there is no point in leaving it open if you don't think its actionable.
Comment 16 Honza Bambas (:mayhemer) 2012-04-18 03:50:20 PDT
I said let's wait.  But people can just open new bugs.
Comment 17 Patrick McManus [:mcmanus] 2012-04-23 18:20:34 PDT
Apparently my subconscious can't let this one just go - I woke up thinking about it 2 days in a row.

Should this reoccur I want to have an answer to "my browser is stuck, how can I help debug this without restarting and starting nspr logging?".  We've gotten pcaps and those have sort of exhausted their usefulness, and after that I'm short of ideas.

So I've reworked the patch to simply log to the js error console the conditions under which an SSL connection gets stalled if the connection entry has been deferring opens for more than 3 seconds. Unlike the previous version of the patch, no connection management behavior is changed - it just logs.

At least this way if it recurs (which it might not), we can rule this code path in or out as the root of the issue and decide at that time what to do about it.

Included here are a few basic convenience functions for doing the console logging. There are some other extreme events in HTTP land that probably could use some visibility outside of NSPR logging and this provides an easy way to do it.
Comment 18 Patrick McManus [:mcmanus] 2012-04-23 18:21:36 PDT
Created attachment 617729 [details] [diff] [review]
patch 1
Comment 19 Honza Bambas (:mayhemer) 2012-04-24 11:58:52 PDT
Good idea.

Few additions:
- 3 secs is to early to log, there can be conditions where the connection may take longer and we may just have confusing false negatives (ocsp, lagging nets)
- better: simply every 30 seconds log that a connection hangs ("conn to x.y.z hangs for 30 secs", then "(the same) conn to x.y.z hangs for 60 secs" etc..)
- but what I would love to have is a log of stuck transaction queuing: when there is a transaction that gets queued for more then say 30 secs (and every ongoing 30 secs) log the number of queued transactions for a host ; this may give us a picture whether transactions just get put to the queue on every page reload (that users obviosly will do)
Comment 20 Honza Bambas (:mayhemer) 2012-04-25 13:36:19 PDT
Comment on attachment 617729 [details] [diff] [review]
patch 1

See comment 19.
Comment 21 Patrick McManus [:mcmanus] 2012-05-03 10:54:19 PDT
(11:56:42 AM) mcmanus: twitter has been performing a little weird for me for the last day or so.. and that was my lead.
(11:56:56 AM) mcmanus: tell me what you think of this..
(11:57:23 AM) mcmanus: a connection entry gets "mUsingSpdy" set to true at some point and that never gets unset.
(11:57:33 AM) mcmanus: that was on purpose.
(11:57:49 AM) mcmanus: twitter is only using a little bit of spdy..
(11:58:05 AM) mcmanus: which brings us to RestrictConnections()..
(11:58:22 AM) mcmanus: that code basically says if using spdy only allow 1 active connection at a time.
(11:58:44 AM) mcmanus: but if the host is no longer using spdy, that 1 active connection becomes the bottleneck.
(11:59:05 AM) mcmanus: If you run a single COMET style hanging GET over it, everything deadlocks.
(11:59:19 AM) mcmanus: (assuming that 1 conection isn't spdy, but the host is marked spdy).
(12:00:05 PM) mcmanus: so.. you go to twitter.. use some spdy.. all is good. you go away and those connections eventually timeout.
(12:00:43 PM) mcmanus: you come back, and this time the handshake is http/1.1 and you aren't allowed any parallelism.. so stuff is slow. if there is a COMET like thing in there, it pretty much deadlocks.
(12:02:11 PM) mcmanus: the google.com "wedged" report might have stemmed from an epsiode where they either briefly disabled spdy, or we went to HTTP/1 due to the TLS intolerance timeouts..
(12:02:59 PM) mcmanus: I think it can be fixed real safely.
(12:03:04 PM) mayhemer__: ok, so we should set the mUsingSpdy flag always like mUsingSpdy = (NPN == "spdy/?"), right?
(12:03:38 PM) mayhemer__: I'm just curious how is that possible that a site (host) changes its mind on using SPDY
(12:03:47 PM) mayhemer__: if I understand well
(12:04:05 PM) mcmanus: re changes its mind: well obviously it's on 2 differrent connections.
(12:04:08 PM) mcmanus: load balancers
(12:04:14 PM) mcmanus: a/b testing (that's what twitter is doing)
(12:04:26 PM) mayhemer__: aha
(12:04:45 PM) mcmanus: and (forced by us) TLS intolerance.. if we fallback to SSLv3 we don't do spdy because we don't have NPN
(12:05:55 PM) mayhemer__: true
(12:05:57 PM) mcmanus: I think the right fix is actually a refinement to restrictConnections().
(12:06:46 PM) mcmanus: if we are restricting because the host is a known spdy host and has an active connection then we just have to make sure that the active connection either is already Spdy or is still in the ssl handshake
(12:07:07 PM) mcmanus: if that's not true, then allow some parallelism.
(12:07:36 PM) mcmanus: That's a pretty safe thing to push to beta.
(12:07:50 PM) mayhemer__: agree
(12:09:13 PM) mayhemer__: if we had an array mActiveSPDYConns it would be fairly simple to just check for its number of items
(12:09:20 PM) mcmanus: cool. I'll go code it up and test the path a little bit. it will need a different patch for aurora/beta because the restrictConnections() code moved to its own function with speculative connect - but the code is the same.
(12:09:40 PM) mcmanus: mActiveSPDYConns isn't enough - it needs suff still negotiating NPN.
Comment 22 Patrick McManus [:mcmanus] 2012-05-03 10:58:42 PDT
I've been able to reliably demonstrate a problem with mixed http/spdy hosts by just adding code that marks every http SSL host as "having seen spdy".. 

I can confirm that patch v2 (about to be posted) fixes that up by testing the assumptions of the restricted connection path (i.e. that it is restricted because there is currently a session in handshake or actively doing tcp), if that assumption cannot be confirmed it just allows some parallelism. I think this is very safe.

I can think of at least 4 ways a host could be mixed http/spdy.. there are probably more:

* server side a/b testing. (some sites are known to be doing this to evaluate spdy)
* load balancers and farms that aren't 100% spdy
* server side maintenance of spdy gateways
* firefox could have gotten into a TLS intolerance mode due to an ephemeral closed session or timeout. That means we would have used SSLv3 which never carries spdy.
Comment 23 Patrick McManus [:mcmanus] 2012-05-03 11:02:13 PDT
Created attachment 620781 [details] [diff] [review]
patch v2
Comment 24 Patrick McManus [:mcmanus] 2012-05-03 11:05:09 PDT
Comment on attachment 620781 [details] [diff] [review]
patch v2

left the printf in.. will update.
Comment 25 Patrick McManus [:mcmanus] 2012-05-03 11:11:34 PDT
Created attachment 620787 [details] [diff] [review]
ptach v3
Comment 26 Honza Bambas (:mayhemer) 2012-05-03 11:28:54 PDT
Comment on attachment 620787 [details] [diff] [review]
ptach v3

Review of attachment 620787 [details] [diff] [review]:
-----------------------------------------------------------------

r=honzab

::: netwerk/protocol/http/nsHttpConnectionMgr.cpp
@@ +1209,5 @@
> +    // There is a concern that a host is using a mix of HTTP/1 and SPDY.
> +    // In that case we don't want to restrict connections just because
> +    // there is a single active HTTP/1 session in use.
> +    if (ent->mUsingSpdy && ent->mActiveConns.Length()) {
> +        bool confirmedRestrict = false;

You actually don't need confirmedRestrict and work directly with doRestrict, but up to you.
Comment 27 Patrick McManus [:mcmanus] 2012-05-03 11:40:10 PDT
https://tbpl.mozilla.org/?tree=Try&rev=e54e089d6bf5
Comment 28 Patrick McManus [:mcmanus] 2012-05-03 14:05:35 PDT
*** Bug 751689 has been marked as a duplicate of this bug. ***
Comment 29 Patrick McManus [:mcmanus] 2012-05-03 14:49:53 PDT
Created attachment 620861 [details] [diff] [review]
patch 3 aurora (ff14)

fyi this is what the patch looks like ported to aurora (ff14)
Comment 30 Patrick McManus [:mcmanus] 2012-05-03 14:51:49 PDT
Created attachment 620862 [details] [diff] [review]
patch 3 beta (ff13)

fyi this is what patch 3 looks like ported to beta (ff13)
Comment 31 Patrick McManus [:mcmanus] 2012-05-03 17:55:42 PDT
https://hg.mozilla.org/integration/mozilla-inbound/rev/fbce8a608761
Comment 32 Patrick McManus [:mcmanus] 2012-05-04 06:08:16 PDT
Comment on attachment 620787 [details] [diff] [review]
ptach v3

[Approval Request Comment]
Regression caused by (bug #): no regression; longstanding bug
User impact if declined: potential hang of highly visible websites running spdy such as twiter or google.com; only fixable by restart of browser. 

Testing completed (on m-c, etc.): on m-c. Manual whitebox testing can create environemnt to trigger the bug and illustrate fix mitigates it.

Risk to taking this patch (and alternatives if risky): low. Essentially this patch just removes an optimization under some identifiable circumstances

String changes made by this patch: none
Comment 33 Ed Morley [:emorley] 2012-05-04 10:57:26 PDT
https://hg.mozilla.org/mozilla-central/rev/fbce8a608761
Comment 34 Alex Keybl [:akeybl] 2012-05-06 18:53:47 PDT
Comment on attachment 620787 [details] [diff] [review]
ptach v3

[Triage Comment]
Based upon possible user impact and the risk evaluation provided, approving for both Aurora 14 and Beta 13.
Comment 36 Ioana (away) 2012-05-25 08:48:13 PDT
Guys, has any of you found any reliable STR that I could use to verify this bug fix? I don't see any in the comments above, nor can I reproduce it on older builds.
Comment 37 Anthony Hughes (:ashughes) [GFX][QA][Mentor] 2012-05-29 10:48:01 PDT
(In reply to Ioana Budnar [QA] from comment #36)
> Guys, has any of you found any reliable STR that I could use to verify this
> bug fix? I don't see any in the comments above, nor can I reproduce it on
> older builds.

*bump*

Anything QA can do here to verify the fix?
Comment 38 Patrick McManus [:mcmanus] 2012-05-29 13:16:54 PDT
there is no reliable str for this. days of browsing with google.com and twitter would have shown the problem eventually.
Comment 39 Anthony Hughes (:ashughes) [GFX][QA][Mentor] 2012-05-29 13:33:52 PDT
Okay, thanks Patrick. Marking this as [qa-] given that there is no reliable STR. Please mark [qa+] if this changes. If you are able to reproduce the issue on an affected build, please try to verify this for us.

Thanks
Comment 40 [:philipp] 2012-06-07 05:42:46 PDT
on SUMO forums, some users still report page loading problems for gmail & twitter after they have upgraded to firefox 13 and with spdy enabled
Comment 41 [:philipp] 2012-06-07 10:56:31 PDT
it was reported by a user that the problem only occurred when network.http.keep-alive was set to false. 
with a fresh profile of firefox 13 on windows 7 64bit, i could replicate this behaviour. as soon as network.http.keep-alive is switched to false the secure pages of google.com & twitter.com are stuck during loading when spdy is enabled.
Comment 42 Patrick McManus [:mcmanus] 2012-06-07 11:29:08 PDT
(In reply to philipp from comment #41)
> it was reported by a user that the problem only occurred when
> network.http.keep-alive was set to false. 
> with a fresh profile of firefox 13 on windows 7 64bit, i could replicate
> this behaviour. as soon as network.http.keep-alive is switched to false the
> secure pages of google.com & twitter.com are stuck during loading when spdy
> is enabled.

interesting.. I will try and repro. Is there confirmation from other reporters?

can you be a little more specific with the str regarding urls?
Comment 43 [:philipp] 2012-06-07 12:01:36 PDT
https://www.google.com and https://twitter.com/ (loads partly) are already affected when i test it with the configuration as described in comment #41

this is the thread where a the finding was originally posted, https://support.mozilla.org/questions/928182#answer-339391

i'll try to catch up with the other users who posted during the last days, which had problems with spdy & see if they also had network.http.keep-alive set to false.
https://support.mozilla.org/questions/928137
https://support.mozilla.org/questions/928816 
https://support.mozilla.org/questions/928888
Comment 44 Patrick McManus [:mcmanus] 2012-06-07 12:55:37 PDT
Philipp, thanks - can you open a new bug and cc me? The symptom maybe the same as this bug - but this is a different issue.

I can repro with windows (but not the first time with linux.. but maybe I didn't try hard enough) on ff 16 -> ff 13 with the keepalive setting. well, I can repro some trouble loading the page, not quite as bad as you describe, but definitely a problem.

Note You need to log in before you can comment on or make changes to this bug.