767742 - SPDY makes fennec take a lot longer to load links clicked on through Google search results

Reporter

Description

•

12 years ago

Fennec takes significantly longer to load pages, measured from when I tap on a link to when the page actually starts loading (i.e., I'm no longer just seeing the previous page). This is a general problem, but one place I can remember seeing it specifically is the URL (a restaurant's website).

Click on Menus, then (say) Prix Fixe. It can take multiple minutes to start getting the page.

In the stock browser, the same action gets a response almost immediately.

Joe Drew (not getting mail)

Reporter

Comment 1

•

12 years ago

Of course, once I file this bug I find it's not as 100% reproducible as I thought it was. Also, I'm taking a stab and saying this might be networking, but I haven't done any investigation at all.

Boris Zbarsky [:bzbarsky]

Comment 2

•

12 years ago

Joe, if you can reproduce semi-reliably... can you check whether Fennec enables pipelining? If it does, what effect does disabling that have?

Joe Drew (not getting mail)

Reporter

Comment 3

•

12 years ago

Fennec does enable pipelining, but disabling it doesn't seem to make a difference.

When Firefox gets into this state, it'll be slow even to start loading entirely different domains. For example, I tried to navigate from the URL mentioned here to this page in Fennec, and eventually gave up and opened my laptop.

It's as though we don't have enough connections in the pool, or we're not killing the ones we are trying to load when navigating.

Brad Lassey [:blassey] (use needinfo?)

Updated

•

12 years ago

Whiteboard: [lame-network]

Mark Finkle (:mfinkle) (use needinfo?)

Comment 4

•

12 years ago

CC'ing Patrick "Mr. Network" McManus

Joe Drew (not getting mail)

Reporter

Comment 5

•

12 years ago

It has gotten to the point that I can't stand using Fennec because of this bug. :-( As evidence, I submit that I am commenting on this bug while on vacation!

Jason Duell

Comment 6

•

12 years ago

I can see two avenues here:  1) looking at connection limits and other settings/limits in our TCP/HTTP stack that are fennec-specific (Patrick), and 2) possible perf issues with our e10s-specific logic (bug 648433, which I'm going to bump toward the top of my TODO list).   

There could of course be other reasons, but these seem like obvious things to look at first.

Josh Matthews [:jdm]

Comment 7

•

12 years ago

(In reply to Jason Duell (:jduell) from comment #6)
> I can see two avenues here:  1) looking at connection limits and other
> settings/limits in our TCP/HTTP stack that are fennec-specific (Patrick),
> and 2) possible perf issues with our e10s-specific logic (bug 648433, which
> I'm going to bump toward the top of my TODO list).   
> 
> There could of course be other reasons, but these seem like obvious things
> to look at first.

Fennec doesn't use e10s any more.

Jason Duell

Comment 8

•

12 years ago

> Fennec doesn't use e10s any more.

Sorry--B2G on the brain :)

Patrick ran the benchmarks that gave us our current connection limit numbers, etc.

Assignee: nobody → mcmanus

Joe Drew (not getting mail)

Reporter

Comment 9

•

12 years ago

My observation is that I get about 2-3 page loads at startup before enormous, enormous delays start kicking in. I'd love to provide my profile (if possible)/prefs/etc to anyone to reproduce this.

Joe Drew (not getting mail)

Reporter

Comment 10

•

11 years ago

I don't see this a heck of a lot on my personal phone these days, but our Play store feedback is *rife* with reports of precisely this problem.

Joe Drew (not getting mail)

Reporter

Updated

•

11 years ago

tracking-fennec: --- → ?

Brad Lassey [:blassey] (use needinfo?)

Comment 11

•

11 years ago

Josh, can stoneridge confirm or refute that we're slower than chrome in page load?

Flags: needinfo?(joshmoz)

Josh Aas

Comment 12

•

11 years ago

(In reply to Brad Lassey [:blassey] from comment #11)
> Josh, can stoneridge confirm or refute that we're slower than chrome in page
> load?

Stone Ridge cannot answer that question for us right now.

Flags: needinfo?(joshmoz)

Brad Lassey [:blassey] (use needinfo?)

Comment 13

•

11 years ago

(In reply to Josh Aas (Mozilla Corporation) from comment #12)
> (In reply to Brad Lassey [:blassey] from comment #11)
> > Josh, can stoneridge confirm or refute that we're slower than chrome in page
> > load?
> 
> Stone Ridge cannot answer that question for us right now.

any other suggestions then?

Flags: needinfo?(joshmoz)

Josh Aas

Comment 14

•

11 years ago

(In reply to Brad Lassey [:blassey] from comment #13)

> any other suggestions then?

Not right now. Randall Dow has been doing some profiling but hasn't found anything particularly problematic that we don't already know about yet, at least on the network layer. Unfortunately this bug is too vague for us to do much more, and I suspect we've improved the situation significantly since it was filed.

We're always working on new ways to measure page load time and find room for improvement; that's the best way to move forward on reports like this that don't point to specific issues.

Flags: needinfo?(joshmoz)

Brad Lassey [:blassey] (use needinfo?)

Comment 15

•

11 years ago

Patrick, what can we do here? Is this actionable?

Flags: needinfo?(mcmanus)

Patrick McManus [:mcmanus]

Assignee

Comment 16

•

11 years ago

(In reply to Brad Lassey [:blassey] from comment #15)
> Patrick, what can we do here? Is this actionable?

I suggest it needs more testing to find a reproduction, and only then can we figure out what the cause is. As it stands, what is there to go on in this bug other than "a restaurant website used to be slow on fennec and now isn't.. but periodically something is so slow as to be unusable"?

Don't take that to mean I'm questioning the validity of the bug, I'm definitely not! But the only thing I can think of here is legwork to get some data about an observed incident.

My understanding was that Randall was going to do some of that but hadn't come up with anything yet. Hopefully he will weigh in.

Flags: needinfo?(mcmanus)

Joe Drew (not getting mail)

Reporter

Comment 17

•

11 years ago

I think that I've narrowed down what makes this happen. Search for a site on Google, one you (presumably) haven't been to before. Click on that site. It takes a lot longer to load than other browsers.

This makes me think that it has something to do with SPDY, so I've disabled that to see if it makes any difference.

Honza Bambas (:mayhemer)

Comment 18

•

11 years ago

(In reply to Joe Drew (:JOEDREW! \o/) from comment #17)
> This makes me think that it has something to do with SPDY, so I've disabled
> that to see if it makes any difference.

And the result?

Joe Drew (not getting mail)

Reporter

Comment 19

•

11 years ago

So far so good!

Joe Drew (not getting mail)

Reporter

Comment 20

•

11 years ago

To the point that I'm going to make this bug specific to that, in fact.

Summary: Loading pages takes much longer in Fennec than the stock browser → SPDY makes fennec take a lot longer to load links clicked on through Google search results

Brad Lassey [:blassey] (use needinfo?)

Updated

•

11 years ago

tracking-fennec: ? → +

Brad Lassey [:blassey] (use needinfo?)

Comment 21

•

11 years ago

Patrick, should we be looking to disable spdy until this is resolved?

Flags: needinfo?(mcmanus)

Patrick McManus [:mcmanus]

Assignee

Comment 22

•

11 years ago

I've been off the grid the last week, sorry for the lagging reply.

(In reply to Brad Lassey [:blassey] from comment #21)
> Patrick, should we be looking to disable spdy until this is resolved?

let's see what the balance is here first through further assessment. spdy has major gains for high latency environments.

Has anyone other than Joe been able to reproduce? (i.e. could it be environmental in some part?)

(In reply to Joe Drew (:JOEDREW! \o/) from comment #20)
> To the point that I'm going to make this bug specific to that, in fact.

Joe - what I think is most interesting here is that the performance problem is with HTTP/1 done after SPDY usage. i.e. it isn't with the SPDY transfer directly. There is a limited amount of surface where these things interact so we should focus on that.

The most obvious candidate to me would be the max connection pool. Both protocols count against that allocation, but HTTP/1 is much more dependent on turnover in the pool (because spdy tends to be quite stable in practice, while HTTP/1 is constantly setting up and tearing down connections - one of spdy's advantages). The max connection pool on mobile is much smaller than on desktop (20 vs 256)[*] - so I could hypothesize that the HTTP/1 site is relying on heavy turnover of the pool, and the persistent spdy sessions in it are not cooperating (i.e. they are buggy and do not go away under pressure), which leads to connection starvation when loading the HTTP/1 site. The same bug would exist on desktop but be pretty invisible due to the higher allocation of sockets overall.

That's just an idea that fits the report. You could test it by increasing network.http.max-connections to 256. Would you do that?

-P

[*] connection management needs to be rethought a little bit for mobile and b2g as the capability of those platforms has exploded in the last couple of years - there are potential wins there.. but a same-as-desktop strategy is obviously not right as the bottleneck is still generally CPU, RAM is more constrained, and due to the high latency involved you need to be vigilant against generating huge queues through parallelism that will ruin any kind of interactivity on cancel events etc..
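The starvation hypothesis in this comment can be illustrated with a small model. This is only a sketch under stated assumptions, not Necko code: `PoolModel`, `OpenHttp1`, and `AdmitHttp1` are hypothetical names, and the model assumes the persistent spdy sessions never leave the pool, which is the suspected bug. Only the limits 20 and 256 come from the comment.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of one global connection limit shared by both protocols.
class PoolModel {
public:
    // maxConns stands in for network.http.max-connections.
    // spdySessions is the number of long-lived sessions assumed to stay put.
    PoolModel(uint32_t maxConns, uint32_t spdySessions)
        : mMaxConns(maxConns), mSpdySessions(spdySessions), mHttp1Conns(0) {}

    // Slots left for HTTP/1 once the persistent sessions are counted.
    uint32_t Http1Budget() const {
        return mSpdySessions >= mMaxConns ? 0 : mMaxConns - mSpdySessions;
    }

    // Try to open one more HTTP/1 connection; false means the request waits.
    bool OpenHttp1() {
        if (mHttp1Conns >= Http1Budget())
            return false;
        ++mHttp1Conns;
        assert(mHttp1Conns + mSpdySessions <= mMaxConns);
        return true;
    }

private:
    uint32_t mMaxConns;
    uint32_t mSpdySessions;
    uint32_t mHttp1Conns;
};

// Count how many of `wanted` parallel HTTP/1 connections are admitted.
inline uint32_t AdmitHttp1(PoolModel& pool, uint32_t wanted) {
    uint32_t admitted = 0;
    for (uint32_t i = 0; i < wanted; ++i) {
        if (pool.OpenHttp1())
            ++admitted;
    }
    return admitted;
}
```

With an illustrative 18 lingering sessions, a pool capped at 20 admits only 2 of 12 requested HTTP/1 connections, while a pool capped at 256 admits all 12. That matches the idea that the same bug would be nearly invisible on desktop.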

Flags: needinfo?(mcmanus)

Patrick McManus [:mcmanus]

Assignee

Comment 23

•

11 years ago

Joe - I can't reproduce this (more in another comment) but for now:

1] can you increase network.http.max-connections to 256 and retest?

2] is this on wifi or a data connection? Does it matter where you do it - e.g. home vs office?

3] do you know of anybody else that can reproduce (i.e. should we be looking at environmental factors)?

4] can you confirm your STR as:
1 - clear cache
2 - quit firefox
3 - restart firefox
4* - goto https://www.google.com and search for papilonpark.com
5 - goto papilonpark.com

and the observed behavior is that with step 4* removed the loading of papilonpark.com is much faster. How much faster?

Can you use adb to get an HTTP log?

Any chance you can get a wireshark capture? I do have a setup here to get caps of my fennec (when it's on wifi at least), but I know that's a royal pain.

Thanks!

Flags: needinfo?(joe)

Patrick McManus [:mcmanus]

Assignee

Comment 24

•

11 years ago

I can't reproduce a problem here using the STR from comment 23 on my nexus S using firefox 21 (2-17 nightly build) over wifi. I can upload packet caps if anyone wants them of 2 runs to compare. Each takes about 30 seconds to complete. (the version after spdy actually completes a touch faster but nothing significant.)

One totally unrelated thing I do find interesting about the traces (it appears consistent in each of them) is that the first stylesheet referenced in the HEAD doesn't start a request for itself until about 650ms after the reference to it is received - that's a lot slower than I would have presumed given that not much else is competing for the processor yet. Maybe that's something jduell or mayhemer could profile? (But it's unrelated to this bug - clearly.) That's not what desktop does, but it might just be a CPU power difference - nonetheless it results in an idle network for a while, and that's always a resource we want to exploit.

My next step is to read some code and try to force the scenario I describe in comment 22 to see if I can confirm that the spdy session goes away under http pressure.. I may just not be creating the pressure in the same way joe does - so an artificial experiment might still yield something.

after that I'm gonna need more inputs to go farther.

Patrick McManus [:mcmanus]

Assignee

Comment 25

•

11 years ago

(In reply to Patrick McManus [:mcmanus] from comment #24)

> My next step is to read some code and try and force the scenario I describe
> in comment 22 to see if I can confirm that the spdy session goes away under
> http pressure.. 

I was able to see a problem here - so it might be this issue in the end.

I've made a potential fix and have a build started here:

https://tbpl.mozilla.org/?tree=Try&rev=a7ab72687811

Joe can you try that out?

Josh Aas

Comment 26

•

11 years ago

(In reply to Patrick McManus [:mcmanus] from comment #24)

> One totally unrelated thing I do find interesting about the traces (it
> appears consistent in each of them) is that the first stylesheet referenced
> in the HEAD doesn't start a request for itself until about 650ms after the
> reference to it is received - that's a lot slower than I would have presumed
> given that not much else is competing for the processor yet. Maybe something
> that could be profiled by jduell or mayhemer? (but it's unrelated to this bug -
> clearly.) That's not what desktop does but it might just be a CPU power
> difference - nonetheless it results in an idle network for a while and
> that's always a resource we want to exploit.

I filed bug 845531 about this. Nice catch.

Joe Drew (not getting mail)

Reporter

Comment 27

•

11 years ago

(In reply to Patrick McManus [:mcmanus] from comment #23)
> 1] can you increase network.http.max-connections to 256 and retest?

First question - should I do this testing or test your potential fix? It's a little tricky to reproduce, though at times I can reproduce it every time, which is why I turned off SPDY altogether in order to see if I could reproduce it.
 
> 2] is this on wifi or a data connection? Does it matter where you do it -
> e.g. home vs office?

I clearly remember having this problem at home, but cannot recall whether it has ever happened at the office.
 
> 3] do you know of anybody else that can reproduce (i.e. should we be looking
> at environmental factors)?

Not personally, but you can see people complaining about these problems in the Google Play market.

> 4] can you confirm your STR 

Unfortunately not, though it's always "Google search for something" then "click on that something".

I can get traces, etc, if I can reliably reproduce it again; I'll keep my eye out.

Flags: needinfo?(joe) → needinfo?(mcmanus)

Patrick McManus [:mcmanus]

Assignee

Comment 28

•

11 years ago

(In reply to Joe Drew (:JOEDREW! \o/) from comment #27)
> (In reply to Patrick McManus [:mcmanus] from comment #23)
> > 1] can you increase network.http.max-connections to 256 and retest?
> 
> First question - should I do this testing or test your potential fix? 

Testing just the pref change would confirm the general problem, while the try-build contains a specific fix for that hypothesis.

I would start with the try-build (because that's something we could checkin), but if that continued to fail I would still try the pref change to see if it was just the fix that wasn't sufficient and we were still onto something.

The pref change on its own does have the potential to cause you other problems with responsiveness and maybe even OOM. (which is why it isn't set that way by default :))

Flags: needinfo?(mcmanus)

Joe Drew (not getting mail)

Reporter

Comment 29

•

11 years ago

Argh. Patrick, I went to install this (way late) and, because my Nightly is newer than it, Android won't let me. Can you roll another try build?

Flags: needinfo?(mcmanus)

Patrick McManus [:mcmanus]

Assignee

Comment 30

•

11 years ago

(In reply to Joe Drew (:JOEDREW! \o/) from comment #29)
> Argh. Patrick, I went to install this (way late) and, because my Nightly is
> newer than it, Android won't let me. Can you roll another try build?

no problem!
https://tbpl.mozilla.org/?tree=Try&rev=d07c8f60b143

Flags: needinfo?(mcmanus)

Kevin Brosnan [Ex-Mozilla]

Comment 31

•

11 years ago

FYI on newer Nexus devices you need to use adb install -d -r fennec.apk to downgrade.

Mark Finkle (:mfinkle) (use needinfo?)

Comment 32

•

11 years ago

Pinging Joe

Flags: needinfo?(joe)

Joe Drew (not getting mail)

Reporter

Comment 33

•

11 years ago

It seems to be better, though it's a little tough to say definitively. I definitely haven't seen the long delays that characterized this bug since I've been running Patrick's build.

My recommendation is to give this a shot in mozilla-central; if I start seeing this problem regularly again, I'll open a new bug!

Flags: needinfo?(joe)

Patrick McManus [:mcmanus]

Assignee

Updated

•

11 years ago

Component: Networking → Networking: HTTP

Whiteboard: [lame-network] → [spdy]

Patrick McManus [:mcmanus]

Assignee

Comment 34

•

11 years ago

Attached patch patch 0 — Details — Splinter Review

Attachment #726617 - Flags: review?(honzab.moz)

Honza Bambas (:mayhemer)

Comment 35

•

11 years ago

Comment on attachment 726617 [details] [diff] [review]
patch 0

Review of attachment 726617 [details] [diff] [review]:
-----------------------------------------------------------------

r=honzab based mainly on tests.

We may want to split the limits for http/1 and spdy.  I think I suggested that once; what was the result?

::: netwerk/protocol/http/nsHttpConnectionMgr.cpp
@@ +842,5 @@
> +{
> +    if (!ent->mUsingSpdy)
> +        return PL_DHASH_NEXT;
> +
> +    nsHttpConnectionMgr *self = (nsHttpConnectionMgr *) closure;

static cast please

@@ +846,5 @@
> +    nsHttpConnectionMgr *self = (nsHttpConnectionMgr *) closure;
> +    for (uint32_t index = 0; index < ent->mActiveConns.Length(); ++index) {
> +        nsHttpConnection *conn = ent->mActiveConns[index];
> +        if (conn->UsingSpdy() && conn->CanReuse()) {
> +            conn->DontReuse();

To confirm: this calls Close() on the SpdySession[23] that releases the connection handle that makes the http connection close, is that so?

Since this may also call Close on transactions, could MakeConnection be potentially reentered?

@@ +852,5 @@
> +            if (self->mNumIdleConns + self->mNumActiveConns + 1 <= self->mMaxConns)
> +                return PL_DHASH_STOP;
> +        }
> +    }
> +    

White space

Attachment #726617 - Flags: review?(honzab.moz) → review+

Patrick McManus [:mcmanus]

Assignee

Comment 36

•

11 years ago

Thanks Honza!

(In reply to Honza Bambas (:mayhemer) from comment #35)
>
> We may want to split the limits for http/1 and spdy. 

I've been thinking about this and I don't think there is a lot of point.. but it could surely be done. here's my reasoning:

There are 2 kinds of limits, per-host, and global. per-host is basically uninteresting with spdy sockets due to the large spdy mux.. it's possible you end up with up to 6 of them at the very beginning (because we don't yet know they are going to be spdy), but eventually it ought to settle down. If these "extras" end up getting purged earlier because of counting http/1 and spdy together (and putting pressure on the limits) I think that's basically a good thing.. and I think that's what is going on in this use case joe had (where the limits on mobile are a lot lower).

For the global counter we want to count both spdy and http together in this case because that number is pretty much acting as a bound on the amount of parallelization we want to do and how much buffering the network can absorb. This heuristic has always been there but I think we should actually get rid of it - instead what we want is a max number of active (i.e. have transactions live on them) TCP streams.. the number of idle global ones is pretty uninteresting to us beyond the per-host limits. (we've effectively disabled it on desktop by making it so large which has been good - rdow was just telling the story last week of how it was biting him a couple years ago set at 30.. now on desktop the parallelism story is in the hands of sharding decisions for both better and for worse)
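
The difference between the two global policies discussed here can be sketched roughly as follows. This is an illustration under assumptions, not the connection manager's code: `ConnCounts` and both function names are hypothetical, and the combined check only mirrors the shape of the `mNumIdleConns + mNumActiveConns + 1 <= mMaxConns` test quoted in the review.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical global counters, covering HTTP/1 and spdy connections alike.
struct ConnCounts {
    uint32_t idle;    // open connections with no transaction on them
    uint32_t active;  // connections with at least one live transaction
};

// Combined policy: idle and active connections both count against one
// global maximum, so lingering idle connections can block a new one.
inline bool RoomUnderCombinedLimit(const ConnCounts& c, uint32_t maxConns) {
    assert(maxConns > 0);
    return c.idle + c.active + 1 <= maxConns;
}

// Alternative floated in the comment: bound only the connections that have
// live transactions, and leave idle ones to the per-host limits.
inline bool RoomUnderActiveOnlyLimit(const ConnCounts& c, uint32_t maxActive) {
    assert(maxActive > 0);
    return c.active + 1 <= maxActive;
}
```

For example, with 15 idle and 5 active connections and a limit of 20, the combined check refuses a new connection while the active-only check would allow it. The counts 15 and 5 are made up for illustration.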


> @@ +846,5 @@
> > +    nsHttpConnectionMgr *self = (nsHttpConnectionMgr *) closure;
> > +    for (uint32_t index = 0; index < ent->mActiveConns.Length(); ++index) {
> > +        nsHttpConnection *conn = ent->mActiveConns[index];
> > +        if (conn->UsingSpdy() && conn->CanReuse()) {
> > +            conn->DontReuse();
> 
> To confirm: this calls Close() on the SpdySession[23] that releases the
> connection handle that makes the http connection close, is that so?

That's true if there are no active transactions on the spdysession. If there are active transactions we just have the dont-reuse flag set and the session is cleaned up when the transactions complete (and no new transactions are started on that session).

> Since this may also call Close on transactions, could MakeConnection be
> potentially reentered?

It won't close any transactions, and the connection recycling happens via the normal reclaimconnection() path that takes a trip through PostEvent.
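
The DontReuse() behaviour described in this reply can be modelled with a small state machine. This is a sketch under assumptions rather than the real nsHttpConnection or SpdySession code: `SessionModel` and its members are hypothetical, and the model collapses the asynchronous PostEvent trip into a direct call.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a multiplexed session and its connection.
class SessionModel {
public:
    explicit SessionModel(uint32_t activeTransactions)
        : mActive(activeTransactions), mDontReuse(false), mClosed(false) {}

    bool CanReuse() const { return !mDontReuse && !mClosed; }
    bool IsClosed() const { return mClosed; }

    // Mark the session as not reusable. With nothing in flight it closes
    // right away; otherwise it stays open until the last transaction ends.
    // No transaction is cancelled here.
    void DontReuse() {
        mDontReuse = true;
        MaybeClose();
    }

    // New work is refused once the session has been marked.
    bool AddTransaction() {
        if (!CanReuse())
            return false;
        ++mActive;
        return true;
    }

    // A transaction finished normally; this may release the session.
    void CompleteTransaction() {
        assert(mActive > 0);
        --mActive;
        MaybeClose();
    }

private:
    void MaybeClose() {
        if (mDontReuse && mActive == 0)
            mClosed = true;
    }

    uint32_t mActive;
    bool mDontReuse;
    bool mClosed;
};
```

In this model an idle session closes as soon as DontReuse() is called, while a session with two transactions in flight refuses new work and closes only after both complete.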

Patrick McManus [:mcmanus]

Assignee

Comment 37

•

11 years ago

  https://hg.mozilla.org/integration/mozilla-inbound/rev/3924eba670bb

Ed Morley [:emorley]

Comment 38

•

11 years ago

https://hg.mozilla.org/mozilla-central/rev/3924eba670bb

Status: NEW → RESOLVED

Closed: 11 years ago

Resolution: --- → FIXED

Target Milestone: --- → mozilla22