Closed Bug 767742 Opened 12 years ago Closed 11 years ago

SPDY makes fennec take a lot longer to load links clicked on through Google search results

Categories

(Core :: Networking: HTTP, defect)

ARM
Android
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla22
Tracking Status
fennec + ---

People

(Reporter: joe, Assigned: mcmanus)

Details

(Whiteboard: [spdy])

Attachments

(1 file)

Fennec takes significantly longer to load pages, measured from when I tap on a link to when the page actually starts loading (i.e., I'm no longer just seeing the previous page). This is a general problem, but one place I can remember seeing it specifically is the URL (a restaurant's website).

Click on Menus, then (say) Prix Fixe. It can take multiple minutes to start getting the page.

In the stock browser, the same action gets a response almost immediately.
Of course, once I file this bug I find it's not as 100% reproducible as I thought it was. Also, I'm taking a stab and saying this might be networking, but I haven't done any investigation at all.
Joe, if you can reproduce semi-reliably... can you check whether Fennec enables pipelining? If it does, what effect does disabling that have?
Fennec does enable pipelining, but disabling it doesn't seem to make a difference.

When Firefox gets into this state, it'll be slow even to start loading entirely different domains. For example, I tried to navigate from the URL mentioned here to this page in Fennec, and eventually gave up and opened my laptop.

It's as though we don't have enough connections in the pool, or we're not killing the ones we are trying to load when navigating.
Whiteboard: [lame-network]
CC'ing Patrick "Mr. Network" McManus
It has gotten to the point that I can't stand using Fennec because of this bug. :-( As evidence, I submit that I am commenting on this bug while on vacation!
I can see two avenues here:  1) looking at connection limits and other settings/limits in our TCP/HTTP stack that are fennec-specific (Patrick), and 2) possible perf issues with our e10s-specific logic (bug 648433, which I'm going to bump toward the top of my TODO list).   

There could of course be other reasons, but these seem like obvious things to look at first.
(In reply to Jason Duell (:jduell) from comment #6)
> I can see two avenues here:  1) looking at connection limits and other
> settings/limits in our TCP/HTTP stack that are fennec-specific (Patrick),
> and 2) possible perf issues with our e10s-specific logic (bug 648433, which
> I'm going to bump toward the top of my TODO list).   
> 
> There could of course be other reasons, but these seem like obvious things
> to look at first.

Fennec doesn't use e10s any more.
> Fennec doesn't use e10s any more.

Sorry--B2G on the brain :)

Patrick ran the benchmarks that gave us our current connection limit numbers, etc.
Assignee: nobody → mcmanus
My observation is that I get about 2-3 page loads at startup before enormous, enormous delays start kicking in. I'd love to provide my profile (if possible)/prefs/etc to anyone to reproduce this.
I don't see this a heck of a lot on my personal phone these days, but our Play store feedback is *rife* with reports of precisely this problem.
tracking-fennec: --- → ?
Josh, can stoneridge confirm or refute that we're slower than chrome in page load?
Flags: needinfo?(joshmoz)
(In reply to Brad Lassey [:blassey] from comment #11)
> Josh, can stoneridge confirm or refute that we're slower than chrome in page
> load?

Stone Ridge cannot answer that question for us right now.
Flags: needinfo?(joshmoz)
(In reply to Josh Aas (Mozilla Corporation) from comment #12)
> (In reply to Brad Lassey [:blassey] from comment #11)
> > Josh, can stoneridge confirm or refute that we're slower than chrome in page
> > load?
> 
> Stone Ridge cannot answer that question for us right now.

any other suggestions then?
Flags: needinfo?(joshmoz)
(In reply to Brad Lassey [:blassey] from comment #13)

> any other suggestions then?

Not right now. Randall Dow has been doing some profiling but hasn't found anything particularly problematic that we don't already know about yet, at least on the network layer. Unfortunately this bug is too vague for us to do much more, and I suspect we've improved the situation significantly since it was filed.

We're always working on new ways to measure page load time and find room for improvement; that's the best way to move forward on reports like this that don't point to specific issues.
Flags: needinfo?(joshmoz)
Patrick, what can we do here? Is this actionable?
Flags: needinfo?(mcmanus)
(In reply to Brad Lassey [:blassey] from comment #15)
> Patrick, what can we do here? Is this actionable?

I suggest it needs more testing to find a reproduction, and only then can we figure out what the cause is. As it stands, what is there to go on in this bug other than "a restaurant website used to be slow on fennec and now isn't... but periodically something is so slow as to be unusable"?

Don't take that to mean I'm questioning the validity of the bug, I'm definitely not! But the only thing I can think of here is legwork to get some data about an observed incident.

My understanding was that Randall was going to do some of that but hadn't come up with anything yet. Hopefully he will weigh in.
Flags: needinfo?(mcmanus)
I think that I've narrowed down what makes this happen. Search for a site on Google, one you (presumably) haven't been to before. Click on that site. It takes a lot longer to load than other browsers.

This makes me think that it has something to do with SPDY, so I've disabled that to see if it makes any difference.
(In reply to Joe Drew (:JOEDREW! \o/) from comment #17)
> This makes me think that it has something to do with SPDY, so I've disabled
> that to see if it makes any difference.

And the result?
So far so good!
To the point that I'm going to make this bug specific to that, in fact.
Summary: Loading pages takes much longer in Fennec than the stock browser → SPDY makes fennec take a lot longer to load links clicked on through Google search results
tracking-fennec: ? → +
Patrick, should we be looking to disable spdy until this is resolved?
Flags: needinfo?(mcmanus)
I've been off the grid the last week, sorry for the lagging reply.

(In reply to Brad Lassey [:blassey] from comment #21)
> Patrick, should we be looking to disable spdy until this is resolved?

let's see what the balance is here first through further assessment. spdy has major gains for high latency environments.

Has anyone other than Joe been able to reproduce? (i.e. could it be environmental in some part?)

(In reply to Joe Drew (:JOEDREW! \o/) from comment #20)
> To the point that I'm going to make this bug specific to that, in fact.

Joe - what I think is most interesting here is that the performance problem is with HTTP/1 done after SPDY usage. i.e. it isn't with the SPDY transfer directly. There is a limited amount of surface where these things interact so we should focus on that.

The most obvious candidate to me would be the max connection pool. Both protocols count against that allocation, but HTTP/1 is much more dependent on turnover in the pool (because spdy tends to be quite stable in practice, while HTTP/1 is constantly setting up and tearing down connections - one of spdy's advantages). The max connection pool on mobile is much smaller than on desktop (20 vs 256)[*] - so I could hypothesize that the HTTP/1 site is relying on heavy turnover of the pool, and the persistent spdy sessions in it are not cooperating (i.e. buggy) by going away under pressure, which leads to connection starvation when loading the HTTP/1 site. The same bug would exist on desktop but be pretty invisible due to the higher allocation of sockets overall.

That's just an idea that fits the report. You could test it out by increasing network.http.max-connections to 256. Would you do that?

-P

[*] connection management needs to be rethought a little bit for mobile and b2g as the capability of those platforms has exploded in the last couple of years - there are potential wins there.. but a same-as-desktop strategy is obviously not right as the bottleneck is still generally CPU, RAM is more constrained, and due to the high latency involved you need to be vigilant against generating huge queues through parallelism that will ruin any kind of interactivity on cancel events etc..
Flags: needinfo?(mcmanus)
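The starvation hypothesis from the comment above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Gecko code: PoolModel, OpenHttp1 and OpenedOutOf are hypothetical names, and the figure of 18 lingering SPDY sessions used in the usage note is invented for illustration. Only the caps of 20 (mobile) and 256 (desktop) come from the thread.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model (not Gecko code): one global connection pool shared by
// long-lived SPDY sessions and short-lived HTTP/1 connections.
struct PoolModel {
    uint32_t maxConns;      // global cap, cf. network.http.max-connections
    uint32_t spdySessions;  // persistent sessions that never leave the pool
    uint32_t http1Conns;    // HTTP/1 connections currently open

    PoolModel(uint32_t max, uint32_t spdy)
        : maxConns(max), spdySessions(spdy), http1Conns(0) {}

    // Slots that an HTTP/1 page load can still claim.
    uint32_t FreeSlots() const {
        uint32_t used = spdySessions + http1Conns;
        return used >= maxConns ? 0 : maxConns - used;
    }

    // Try to open one HTTP/1 connection; returns false when the pool is full.
    bool OpenHttp1() {
        if (FreeSlots() == 0)
            return false;
        ++http1Conns;
        assert(spdySessions + http1Conns <= maxConns);
        return true;
    }
};

// How many of `wanted` parallel HTTP/1 connections could actually be opened.
inline uint32_t OpenedOutOf(PoolModel pool, uint32_t wanted) {
    uint32_t opened = 0;
    for (uint32_t i = 0; i < wanted; ++i) {
        if (pool.OpenHttp1())
            ++opened;
    }
    return opened;
}
```

In this model, OpenedOutOf(PoolModel(20, 18), 6) yields 2 while OpenedOutOf(PoolModel(256, 18), 6) yields 6, which matches the suggestion that the same bug would be nearly invisible on desktop.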
Joe - I can't reproduce this (more in another comment) but for now

1] can you increase network.http.max-connections to 256 and retest?

2] is this on wifi or a data connection? Does it matter where you do it - e.g. home vs office?

3] do you know of anybody else that can reproduce (i.e. should we be looking at environmental factors)?

4] can you confirm your STR as:
1 - clear cache
2 - quit firefox
3 - restart firefox
4* - goto https://www.google.com and search for papilonpark.com
5 - goto papilonpark.com

and the observed behavior is that with step 4* removed the loading of papilonpark.com is much faster. How much faster?

Can you use adb to get an HTTP log?

Any chance you can get a wireshark capture? I do have a setup here to get caps of my fennec (when it's on wifi at least), but I know that's a royal pain.

Thanks!
Flags: needinfo?(joe)
I can't reproduce a problem here using the STR from comment 23 on my nexus S using firefox 21 (2-17 nightly build) over wifi. I can upload packet caps if anyone wants them of 2 runs to compare. Each takes about 30 seconds to complete. (the version after spdy actually completes a touch faster but nothing significant.)

One totally unrelated thing I do find interesting about the traces (it appears consistent in each of them) is that the first stylesheet referenced in the HEAD doesn't start a request for itself until about 650ms after the reference to it is received - that's a lot slower than I would have presumed given that not much else is competing for the processor yet. Maybe something that could be profiled, jduell or mayhemer? (but it's unrelated to this bug - clearly.) That's not what desktop does but it might just be a CPU power difference - nonetheless it results in an idle network for a while and that's always a resource we want to exploit.

My next step is to read some code and try to force the scenario I describe in comment 22 to see if I can confirm that the spdy session goes away under http pressure.. I may just be not creating the pressure in the same way joe does - so an artificial experiment might still yield something.

after that I'm gonna need more inputs to go farther.
(In reply to Patrick McManus [:mcmanus] from comment #24)

> My next step is to read some code and try and force the scenario I describe
> in comment 22 to see if I can confirm that the spdy session goes away under
> http pressure.. 

I was able to see a problem here - so it might be this issue in the end.

I've made a potential fix and have a build started here:

https://tbpl.mozilla.org/?tree=Try&rev=a7ab72687811

Joe can you try that out?
(In reply to Patrick McManus [:mcmanus] from comment #24)

> One totally unrelated thing I do find interesting about the traces (it
> appears consistent in each of them) is that the first stylesheet referenced
> in the HEAD doesn't start a request for itself until about 650ms after the
> reference to it is received - that's a lot slower than I would have presumed
> given that not much else is competing for the processor yet. Maybe something
> that could be profiled, jduell or mayhemer? (but it's unrelated to this bug -
> clearly.) That's not what desktop does but it might just be a CPU power
> difference - nonetheless it results in an idle network for a while and
> that's always a resource we want to exploit.

I filed bug 845531 about this. Nice catch.
(In reply to Patrick McManus [:mcmanus] from comment #23)
> 1] can you increase network.http.max-connections to 256 and retest?

First question - should I do this testing or test your potential fix? It's a little tricky to reproduce, though sometimes I can do it always, which is why I turned off SPDY altogether in order to see if I could reproduce it.
 
> 2] is this on wifi or a data connection? Does it matter where you do it -
> e.g. home vs office?

I clearly remember having this problem at home, but cannot recall whether it has ever happened at the office.
 
> 3] do you know of anybody else that can reproduce (i.e. should we be looking
> at environmental factors)?

Not personally, but you can see people complaining about these problems in the Google Play market.

> 4] can you confirm your STR 

Unfortunately not, though it's always "Google search for something" then "click on that something".

I can get traces, etc, if I can reliably reproduce it again; I'll keep my eye out.
Flags: needinfo?(joe) → needinfo?(mcmanus)
(In reply to Joe Drew (:JOEDREW! \o/) from comment #27)
> (In reply to Patrick McManus [:mcmanus] from comment #23)
> > 1] can you increase network.http.max-connections to 256 and retest?
> 
> First question - should I do this testing or test your potential fix? 

Testing just the pref change would confirm the general problem, while the try-build contains a specific fix for that hypothesis.

I would start with the try-build (because that's something we could checkin), but if that continued to fail I would still try the pref change to see if it was just the fix that wasn't sufficient and we were still onto something.

The pref change on its own does have the potential to cause you other problems with responsiveness and maybe even OOM. (which is why it isn't set that way by default :))
Flags: needinfo?(mcmanus)
Argh. Patrick, I went to install this (way late) and, because my Nightly is newer than it, Android won't let me. Can you roll another try build?
Flags: needinfo?(mcmanus)
(In reply to Joe Drew (:JOEDREW! \o/) from comment #29)
> Argh. Patrick, I went to install this (way late) and, because my Nightly is
> newer than it, Android won't let me. Can you roll another try build?

no problem!
https://tbpl.mozilla.org/?tree=Try&rev=d07c8f60b143
Flags: needinfo?(mcmanus)
FYI on newer Nexus devices you need to use adb install -d -r fennec.apk to downgrade.
Pinging Joe
Flags: needinfo?(joe)
It seems to be better, though it's a little tough to say definitively. I definitely haven't seen the long delays that characterized this bug since I've been running Patrick's build.

My recommendation is to give this a shot in mozilla-central; if I start seeing this problem regularly again, I'll open a new bug!
Flags: needinfo?(joe)
Component: Networking → Networking: HTTP
Whiteboard: [lame-network] → [spdy]
Attached patch: patch 0
Attachment #726617 - Flags: review?(honzab.moz)
Comment on attachment 726617 [details] [diff] [review]
patch 0

Review of attachment 726617 [details] [diff] [review]:
-----------------------------------------------------------------

r=honzab based mainly on tests.

We may want to split the limits for http/1 and spdy.  I think I was suggesting that once, what was the result?

::: netwerk/protocol/http/nsHttpConnectionMgr.cpp
@@ +842,5 @@
> +{
> +    if (!ent->mUsingSpdy)
> +        return PL_DHASH_NEXT;
> +
> +    nsHttpConnectionMgr *self = (nsHttpConnectionMgr *) closure;

static cast please

@@ +846,5 @@
> +    nsHttpConnectionMgr *self = (nsHttpConnectionMgr *) closure;
> +    for (uint32_t index = 0; index < ent->mActiveConns.Length(); ++index) {
> +        nsHttpConnection *conn = ent->mActiveConns[index];
> +        if (conn->UsingSpdy() && conn->CanReuse()) {
> +            conn->DontReuse();

To confirm: this calls Close() on the SpdySession[23] that releases the connection handle that makes the http connection close, is that so?

Since this may also call Close on transactions, could MakeConnection be potentially reentered?

@@ +852,5 @@
> +            if (self->mNumIdleConns + self->mNumActiveConns + 1 <= self->mMaxConns)
> +                return PL_DHASH_STOP;
> +        }
> +    }
> +    

White space
Attachment #726617 - Flags: review?(honzab.moz) → review+
Thanks Honza!

(In reply to Honza Bambas (:mayhemer) from comment #35)
>
> We may want to split the limits for http/1 and spdy. 

I've been thinking about this and I don't think there is a lot of point.. but it could surely be done. here's my reasoning:

There are 2 kinds of limits, per-host, and global. per-host is basically uninteresting with spdy sockets due to the large spdy mux.. it's possible you end up with up to 6 of them at the very beginning (because we don't yet know they are going to be spdy), but eventually it ought to settle down. If these "extras" end up getting purged earlier because of counting http/1 and spdy together (and putting pressure on the limits) I think that's basically a good thing.. and I think that's what is going on in this use case joe had (where the limits on mobile are a lot lower).

For the global counter we want to count both spdy and http together in this case because that number is pretty much acting as a bound on the amount of parallelization we want to do and how much buffering the network can absorb. This heuristic has always been there but I think we should actually get rid of it - instead what we want is a max number of active (i.e. have transactions live on them) TCP streams.. the number of idle global ones is pretty uninteresting to us beyond the per-host limits. (we've effectively disabled it on desktop by making it so large, which has been good - rdow was just telling the story last week of how it was biting him a couple years ago set at 30.. now on desktop the parallelism story is in the hands of sharding decisions for both better and for worse)
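
The difference between the two global heuristics discussed above (counting every open connection versus counting only those with live transactions) might be sketched as follows. This is an illustrative sketch, not the connection manager's code: ConnCounters, RoomUnderTotalCap and RoomUnderActiveCap are hypothetical names, and the numbers in the usage note are made up.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical counters; the names are illustrative, not Gecko's.
struct ConnCounters {
    uint32_t active;  // connections with live transactions on them
    uint32_t idle;    // open connections with nothing in flight
};

// Heuristic as described in the thread: idle and active connections both
// count against the global cap when deciding whether one more may be opened.
inline bool RoomUnderTotalCap(const ConnCounters& c, uint32_t maxConns) {
    return c.active + c.idle + 1 <= maxConns;
}

// Alternative floated in the thread: bound only the connections that are
// actually carrying transactions, ignoring idle ones at the global level.
inline bool RoomUnderActiveCap(const ConnCounters& c, uint32_t maxActive) {
    return c.active + 1 <= maxActive;
}
```

With 5 active and 15 idle connections and a cap of 20, RoomUnderTotalCap reports no room while RoomUnderActiveCap still allows a new connection, which is the case where idle sessions alone block new work.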


> @@ +846,5 @@
> > +    nsHttpConnectionMgr *self = (nsHttpConnectionMgr *) closure;
> > +    for (uint32_t index = 0; index < ent->mActiveConns.Length(); ++index) {
> > +        nsHttpConnection *conn = ent->mActiveConns[index];
> > +        if (conn->UsingSpdy() && conn->CanReuse()) {
> > +            conn->DontReuse();
> 
> To confirm: this calls Close() on the SpdySession[23] that releases the
> connection handle that makes the http connection close, is that so?

That's true if there are no active transactions on the spdysession. If there are active transactions we just have the dont-reuse flag set and the session is cleaned up when the transactions complete (and no new transactions are started on that session).

> Since this may also call Close on transactions, could MakeConnection be
> potentially reentered?

It won't close any transactions, and the connection recycling happens via the normal ReclaimConnection() path that takes a trip through PostEvent.
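
The purge behaviour discussed in this review can be approximated by a small standalone model. It is a sketch only: ConnModel, ManagerModel and PurgeSpdyForRoom are hypothetical stand-ins rather than the real nsHttpConnection or nsHttpConnectionMgr, and the real close path is asynchronous (via PostEvent) whereas this model closes idle sessions immediately for simplicity.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a connection; not the real nsHttpConnection.
struct ConnModel {
    bool usingSpdy;
    bool canReuse;
    uint32_t activeTransactions;
    bool closed;

    ConnModel(bool spdy, bool reusable, uint32_t txns)
        : usingSpdy(spdy), canReuse(reusable),
          activeTransactions(txns), closed(false) {}

    // As described in the thread: no new transactions start on a flagged
    // session; it closes right away only if nothing is in flight.
    void DontReuse() {
        canReuse = false;
        if (activeTransactions == 0)
            closed = true;
    }
};

// Hypothetical stand-in for the connection manager's global bookkeeping.
struct ManagerModel {
    uint32_t maxConns;
    std::vector<ConnModel> conns;

    uint32_t OpenCount() const {
        uint32_t n = 0;
        for (const ConnModel& c : conns) {
            if (!c.closed)
                ++n;
        }
        return n;
    }

    // Flag reusable SPDY connections until one more connection would fit
    // under the cap; returns how many connections were flagged.
    uint32_t PurgeSpdyForRoom() {
        uint32_t marked = 0;
        for (ConnModel& c : conns) {
            if (c.usingSpdy && c.canReuse) {
                c.DontReuse();
                ++marked;
                if (OpenCount() + 1 <= maxConns)
                    break;
            }
        }
        return marked;
    }
};
```

In this model a session with active transactions is only flagged and stays open until its transactions finish, so the purge keeps walking to the next SPDY connection; an idle session frees its slot at once and ends the walk.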
https://hg.mozilla.org/mozilla-central/rev/3924eba670bb
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla22