Closed Bug 1275917 Opened 3 years ago Closed 3 years ago

Recently, some websites don't finish loading and the throbber keeps spinning forever

Categories

(Core :: Networking, defect)


Tracking


RESOLVED FIXED
mozilla50
Tracking Status
firefox48 --- unaffected
firefox49 + fixed
firefox50 + fixed

People

(Reporter: mstange, Assigned: mcmanus)

References

Details

(Keywords: regression, Whiteboard: [necko-active])

Attachments

(1 file, 1 obsolete file)

This started happening for me a few days ago, on OS X Nightly: Sometimes, when I open a new tab and navigate to a page (by entering a URL or by Cmd-clicking a link), the tab stays blank and the throbber keeps spinning forever, until I give up and reload the page. I think the page title was displayed in the tab during that time, but I'm not 100% sure.

Unfortunately I don't have URLs. I'll grab them when I see the issue again. It has happened to me about 15 times over the last 5 days or so.
I've seen things like this too. Also, perhaps related, my gmail tabs don't ever finish loading.
I saw this on yesterday's nightly, but not (so far) today. This kind of thing can be tricky to get in front of the right people; it's generally a matter of what has been added/removed from the load group, which is something necko orchestrates but doesn't actually do the adding/removing for, so it could literally be anywhere.

If it comes back for me I'll try to capture it in rr.
I've hit it a couple of times in the last few days, including this morning. I will try to get logging.
I feel like I've had a couple of extra cases of pages not loading (which would then reload fine); that would cause the throbber issue if it were a small subresource that wasn't loading.
This needs an owner soon. Jason, please dispatch.
Assignee: nobody → jduell.mcbugs
Whiteboard: [necko-active]
I mostly get this on reddit.com, about once a day. I use uBlock.
[Tracking Requested - why for this release]: Pages don't finish loading, which is pretty obviously user-visible.
I see this quite often on https://stephendonner.wordpress.com/ - but not, strangely, wpengine.com-hosted WordPress sites, like https://blog.mozilla.org/webqa/ (perhaps due to different CDNs?)

In particular, on the sites where I've noticed it happening, it's often stuck at "Waiting for...s[x].wp.com" where x=1, 2, etc.

Nightly, both OS X and Windows 10 builds.
Tracking 50+ - as Ryan notes this is very visible to the end user.
Duplicate of this bug: 1279764
Did anyone try clearing the cache when this happens? (Use Control+Shift+Delete to get the dialog for that.)

I recently cleared all history (browsing, download, form, search), cache, active logins, and offline website data. Then I also deleted the profile folder in AppData\Local, leaving the one in AppData\Roaming. It seemed like the problem was happening more and more often before, and this helped. It hasn't happened yet after that cleanup.
Duplicate of this bug: 1274787
I've had situations where a page wouldn't load and the throbber on the tab would spin.  Clicking on my bookmark for the site a second time during the spinning throbber would usually get the page to load.  I don't know if my problem is the same as this bug report.

Because of Bug 1271944 I disabled the predictor.  Lo and behold I haven't had the problem I stated above since then.
Clicking on the bookmark a second time almost never helped me. Once a URL stopped loading properly, there were very few occasions when I got it to load eventually without restarting Firefox.
The problem happened again, and clearing only the cache via the Control+Shift+Delete interface made the URL work again. I did not restart Firefox or clear any other things. This is in 50.0a1 (2016-06-19) 64-bit in Windows 10 build 14367.
comment 15
Flags: needinfo?(honzab.moz)
Setting network.predictor.enable-prefetch to false in about:config has not helped me in 49.0a2 (2016-06-20) in 32-bit Windows 7. Reddit links get stuck, and clearing only the cache makes them work again.

I've seen cache clearing work enough times that I believe in it as a workaround. Cache cleaning has never failed me. It's not a very nice workaround because of impact on performance and bandwidth usage, and the need to repeat it when the problem comes up again.

I'm using my usual extensions. I went back to my regular profile and re-enabled them after seeing that disabling extensions, safe mode, and even a new profile didn't help.
We should land bug 1239686, at least on Nightly; it will be helpful for diagnostics in situations like this.
(In reply to Dragana Damjanovic [:dragana] from comment #18)
> We should land bug 1239686, at least on Nightly; it will be helpful for
> diagnostics in situations like this.

FWIW, I tried to get a Necko log on Nightly builds, but seem to be hitting bug 1272704 :-(
I have been seeing this problem for a while now, probably similar to the time this bug has been open.  However the problem almost exclusively happened when trying to open Google search result links, so I'd been blaming it on Google (and particularly their 'redirect' rewriting of the links) and/or my ISP's connection to Google.  I had gotten an addon to fix Google's redirect links to see if that would fix it, and it did reduce the number of such problems a great deal.

However over the last couple of weeks I've had another problem that, now that I look at it, may be related.  Open up the debugging area, go to the Network tab, and reload a page.  When I do that, almost all loaded assets from the page have a Connecting time (orange bar, Connecting in the Timings tab) ranging from 500 milliseconds to over 5 seconds.  The most common delay is between 1 and 2 seconds.  Total page load time tends to be on the order of 8-10 seconds instead of the expected 2-3 seconds (for relatively complex pages).  All other timing elements are either 0 ms (DNS resolution, Sending, Receiving) or a reasonably expected time (Waiting lasts for a time typically comparable to a ping result).  Also note that this is the case even when the vast majority of the page elements will return a 304 for already being in the cache.

Safe mode and such had no real effect.  Last week I did a full profile reset, and for a couple days the problem went away (possibly related to the suggestion of fully clearing the cache?).  But, eventually, it was back.  At the present time, reloading a page generally has the page sitting on the grey (unconnected) spinner for several seconds before it switches to the blue (connected) spinner, and even then, the time it takes for the page to load is excessively long compared to other browsers.

However, it does seem like there could be a connection between very slow connection times, and perpetual connection times that never finish loading at all.

Currently using the Aurora/Developer channel, v49.
Working through another bug I had opened, #1271944, which deals with how prefetch affects load times, got me to try turning off the prefetch mechanism. I found that turning prefetch off completely eliminates the slow Connecting times I described above. I've only been using this for a short time, and will note if it reverts to the old bad behavior, but it's possible there is a connection between the two issues.
I'm getting this heavily on 9gag - http://9gag.com/gag/aK3PB2g
I open it with a middle mouse click in a new tab, to let it load in the background.
It never completes loading. Also, the image is fuzzy/partially downloaded.

I had opened another bug just for the image loading (fuzzy/pixelated) issue, though - many sites' images are extremely slow to load - https://bugzilla.mozilla.org/show_bug.cgi?id=1268237
I think this is probably a dup of bug 1281366.  Can the affected people try tomorrow's nightly and see if that improves things?  (I mean, it's an intermittent thing, but still...)
(In reply to Nathan Froyd [:froydnj] from comment #23)
> I think this is probably a dup of bug 1281366.  Can the affected people try
> tomorrow's nightly and see if that improves things?  (I mean, it's an
> intermittent thing, but still...)

thanks nathan - that's a good pointer
(In reply to Patrick McManus [:mcmanus] from comment #24)
> (In reply to Nathan Froyd [:froydnj] from comment #23)
> > I think this is probably a dup of bug 1281366.  Can the affected people try
> > tomorrow's nightly and see if that improves things?  (I mean, it's an
> > intermittent thing, but still...)
> 
> thanks nathan - that's a good pointer

Dropping ni based on this.
Flags: needinfo?(honzab.moz)
Doesn't seem to be helping here.
The situation is much better here.
Reddit still gives me the permaspinner frequently with a trunk build from today. I've seen it on other sites intermittently as well.
(In reply to Ryan VanderMeulen [:RyanVM] from comment #28)
> Reddit still gives me the permaspinner frequently with a trunk build from
> today. I've seen it on other sites intermittently as well.

If it is somehow reproducible for you, can you make an http log?
Thanks!
Flags: needinfo?(ryanvm)
I haven't been able to get an NSPR log no matter what I do :(
Flags: needinfo?(ryanvm)
Ryan, have you managed to get something from mozregression, as you suggested on IRC?
Flags: needinfo?(ryanvm)
See if you get any improvement if you...

1. toggle the following to false:

network.predictor.enable-prefetch
network.prefetch-next

2. toggle the following to true:
network.dns.disablePrefetch

3. Set the value of the following to 0:

network.predictor.prefetch-force-valid-for
By the way, be sure to do all steps and not just one or the other.
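For convenience, all four prefs above can be applied in one shot through user.js in a scratch profile instead of flipping them individually in about:config. This is only a sketch; the profile directory path is an assumption, so point it at the profile you actually test with:

```shell
# Sketch: apply the four prefetch-related prefs via user.js.
# PROFILE_DIR is an assumption -- substitute your own test profile path.
PROFILE_DIR="${PROFILE_DIR:-/tmp/ff-prefetch-test-profile}"
mkdir -p "$PROFILE_DIR"
cat >> "$PROFILE_DIR/user.js" <<'EOF'
user_pref("network.predictor.enable-prefetch", false);
user_pref("network.prefetch-next", false);
user_pref("network.dns.disablePrefetch", true);
user_pref("network.predictor.prefetch-force-valid-for", 0);
EOF
echo "Wrote prefs to $PROFILE_DIR/user.js"
```

Firefox reads user.js at startup and copies the values into prefs.js, so restart the browser after writing it.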
Seems like things have improved recently. I'm using Developer Edition in Windows 7 with network.predictor.enable-prefetch set to false and Nightly in Windows 10 Insider with network.predictor.enable-prefetch left at the default, true. On both, all I've seen recently is sites failing to load once, and then successfully loading when I try again. There haven't been any recent occasions when I had to clear the cache or restart Firefox to access a site.
I've implemented a limit for nspr log file size (rotate, bug 1244306).  If this still reproduces for anyone, at least occasionally, please:

- install a try build from [1] (Nightly capable of log size limiting)
- enable logging by setting following environment variables:

  MOZ_LOG=rotate:100,nsHttp:4,cache2:5,LoadGroup:5
  MOZ_LOG_FILE=/some/writable/directory/moz.log

  (or on Windows: MOZ_LOG_FILE=c:\some\writable\directory\moz.log)


When the problem reproduces, close firefox, and then zip and upload all moz.log.* files from your some/writable/directory (there should be files with .0, .1, .2, .3 extensions, each having up to 25MB)

Don't forget to note which URL was the troubling one (the page URL at least).

Thanks.

[1] http://archive.mozilla.org/pub/firefox/try-builds/honzab.moz@firemni.cz-27bc6057389372e2d844755c56b02b0c44d619e9/
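On Linux/OS X the setup above can be scripted along these lines. This is a sketch only; the log directory and the path to the try-build binary are assumptions:

```shell
# Sketch: rotated-logging environment for the try build described above.
# LOGDIR and the firefox binary path are assumptions -- adjust for your system.
LOGDIR="${LOGDIR:-$HOME/moz-logs}"
mkdir -p "$LOGDIR"
export MOZ_LOG="rotate:100,nsHttp:4,cache2:5,LoadGroup:5"
export MOZ_LOG_FILE="$LOGDIR/moz.log"
echo "Rotated logs will land in $LOGDIR as moz.log.0 .. moz.log.3"
# /path/to/try-build/firefox    # uncomment: reproduce the hang, quit, then zip moz.log.*
```

After reproducing, quit Firefox and zip every moz.log.* file in that directory, as requested above.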
Tracking 49+ for the same reason as in Comment 9.
I'm still seeing this. I'll start using the try build now.
(In reply to Markus Stange [:mstange] from comment #37)
> I'm still seeing this. I'll start using the try build now.

Markus, please use nsHttp:5 instead of 4.  Somebody screwed up the logging recently again, so that it doesn't log the important stuff.  You need all the "5" crap :)
Got logs for a problem on reddit.com, forwarding them to Patrick, since it is apparently a spdy problem.
(In reply to IU from comment #32)
> See if you get any improvement if you...
> 
> 1. toggle the following to false:
> 
> network.predictor.enable-prefetch
> network.prefetch-next
> 
> 2. toggle the following to true:
> network.dns.disablePrefetch
> 
> 3. Set the value of the following to 0:
> 
> network.predictor.prefetch-force-valid-for

I was seeing the problem with a gmail tab where it would never complete loading (the throbber kept spinning and File->Print reports the page is still loading). The firefox parent process also seems to be much busier and the OS reports it using much more CPU than normal. top (on OS X) shows the firefox process continually spiking in CPU usage until the gmail tab is closed.

With the above config changes and today's nightly, the gmail tab loads normally and CPU usage by the parent is considerably lower, per a 30-second visual check of top (on OS X).
Tracking this down via mozregression turned out to be more of a PITA than I'd originally thought. However, I did manage to end up at bug 865314, which looks like a plausible candidate given comment 39. So that's encouraging at least. It might also explain why some people have had better results by disabling the predictor?

Some notes:
1) Ctrl+F5 was needed to reproduce, and sometimes it took quite a few attempts before it finally would. AFAICT, just F5 didn't work for reproducing.
2) Bisection was done on Win10. I wasn't able to reproduce this on Linux. Not saying other OSes are unaffected, just that Win10 appears to be easier to reproduce it on.
3) The hangs generally appeared to be waiting on a random ad server or thumbs.redditmedia.com.
Flags: needinfo?(ryanvm) → needinfo?(mcmanus)
(In reply to Ryan VanderMeulen [:RyanVM] from comment #41)
> Tracking this down via mozregression turned out to be more of a PITA than
> I'd originally thought. However, I did manage to end up at bug 865314, which
> looks like a plausible-looking candidate given comment 39. So that's
> encouraging at least. 

that's interesting - thanks. Nick is going to start looking at this while I'm afk this week. 865314 has surfaced latent/hard-to-otherwise-see bugs in the past so that's one possibility.

>Also might explain why some people have had better
> results by disabling the predictor?

plausible but hard to know at this stage..

> 
> Some notes:
> 1) Ctrl+F5 was needed to reproduce, and sometimes it took quite a few
> attempts before it finally would. AFAICT, just F5 didn't work for
> reproducing.

so that makes sense - ctrl-f5 creates new connections, while plain f5 does not.

> 2) Bisection was done on Win10. I wasn't able to reproduce this on Linux.
> Not saying other OSes are unaffected, just that Win10 appears to be easier
> to reproduce it on.
> 3) The hangs generally appeared to be waiting on a random ad server or
> thumbs.redditmedia.com.

thanks again.. any particular reddit url or just the homepage?
Flags: needinfo?(mcmanus)
Just the homepage.
Another sample, if needed.  Just created a temp profile for the try build and tried to load the page.

Log available here: https://drive.google.com/file/d/0B0A0wGYYRRdaWjJabjh5SVMwWTA/view?usp=sharing

Page load in log starts around line 99,000.

Site: https://codeblog.jonskeet.uk/

Loading on this page regularly stalls when trying to load one of two scripts from https://s2.wp.com/_static (+stuff).  Sometimes also stalls on a stylesheet at s1.wp.com; I didn't check when running the try build.  Entire page loads in Chrome in under a second.
Thanks to the folks who have provided logs.

From the log in comment 44:


2016-06-29 21:47:17.516000 UTC - [Socket Thread]: V/nsHttp nsHttpConnection::DontReuse 8da2020 spdysession=11e83000
2016-06-29 21:47:17.516000 UTC - [Socket Thread]: I/nsHttp Http2Session::DontReuse 11e83000
2016-06-29 21:47:17.516000 UTC - [Socket Thread]: I/nsHttp Http2Session::Close 11e83000 0
2016-06-29 21:47:17.516000 UTC - [Socket Thread]: I/nsHttp Http2Session::GenerateGoAway 11e83000 code=0
2016-06-29 21:47:17.516000 UTC - [Socket Thread]: V/nsHttp Http2Session::LogIO 11e83000 stream=0 id=0x0 [Generate GoAway]
2016-06-29 21:47:17.515000 UTC - [Socket Thread]: V/nsHttp ReportSpdyConnection 8a09880,s2.wp.com conn 8da2020 prefers 8a09100,s1.wp.com
2016-06-29 21:47:17.516000 UTC - [Socket Thread]: V/nsHttp ReportSpdyConnection graceful close of conn=8da2020 ent=8a09880 to migrate to preferred (desharding)
2016-06-29 21:47:17.516000 UTC - [Socket Thread]: V/nsHttp nsHttpConnection::StartSpdy moves single transaction ae85400 into SpdySession 11e83000

So that's a transaction being moved into a session that has already been marked goAway; that's not going to work. It also shouldn't happen :) Let's figure out why.
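For anyone scanning their own rotated logs for the same failure, a quick filter over the lifecycle lines quoted above narrows things down fast. A sketch; the moz.log.* file names assume the MOZ_LOG_FILE setup suggested in comment 36:

```shell
# Sketch: pull the connection-coalescing lifecycle lines out of rotated logs.
# The pattern strings come from the log excerpt above; moz.log.* naming is an
# assumption based on the MOZ_LOG_FILE setup suggested earlier in this bug.
grep -n -E 'GenerateGoAway|DontReuse|ReportSpdyConnection|StartSpdy moves single transaction' moz.log.* 2>/dev/null | head -40
```

A GoAway followed shortly by a "StartSpdy moves single transaction" on the same session id is the signature of the problem described here.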
Assignee: jduell.mcbugs → mcmanus
Attachment #8768059 - Flags: review?(hurley)
The analysis in comment 45 is not quite right.

The transaction in question is actually the first transaction for the h2 session. When h2 was discovered via ALPN, the connection found that it was no longer the preferred connection for this origin: in parallel, a new connection had been established that should do the coalescing. As a side effect it marked itself dontreuse, which in turn impacted the starting of the first transaction.

This is a long-standing bug in both h2 and spdy, but bug 865314 makes it far more obvious and frequent because it introduces new parallel TLS handshakes; it would be pretty hard to trip across otherwise.

The fix is just a late check: when AddStream is done, see whether the h2/spdy session has been marked dontreuse, and if so toss the transaction back to the connection manager (so it ends up on the preferred connection).
Blocks: 865314
Attachment #8768059 - Flags: review?(hurley) → review+
Thanks all - especially David - the testcase in comment 44 is pretty close to a minimal repro. Much appreciated.
Sorry, had to back this out for causing near-perma failures like https://treeherder.mozilla.org/logviewer.html#?job_id=31259526&repo=mozilla-inbound
Flags: needinfo?(mcmanus)
Backout by cbook@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/ee56ec8064fb
Backed out changeset 7132b115ab91 for causing timeouts in test_notification_http2.js
Thanks :tomcat

The patch rescheduled streams on sessions marked goaway, but a pushed stream is pre-bound to a particular session, so rescheduling it just sends it back to the same place in a loop. In that case the goaway doesn't matter, because the stream was already initiated (by the server).

So the fix here is just to make the patch conditional on !pushed.
Flags: needinfo?(mcmanus)
Attachment #8768059 - Attachment is obsolete: true
Attachment #8768612 - Flags: review?(hurley) → review+
https://hg.mozilla.org/mozilla-central/rev/752282c3e152
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla50
Comment on attachment 8768612 [details] [diff] [review]
1st transaction on dontresused h2 session

Approval Request Comment
[Feature/regressing bug #]: 865314 (actually a long-standing bug, but almost unnoticeable without 865314), which landed on 49
[User impact if declined]: occasional lost http transactions
[Describe test coverage new/current, TreeHerder]: well exercised by existing treeherder tests
[Risks and why]: low - very targeted patch
[String/UUID change made/needed]: none
Attachment #8768612 - Flags: approval-mozilla-aurora?
Just updated my local build and reddit is much happier today!
Comment on attachment 8768612 [details] [diff] [review]
1st transaction on dontresused h2 session

The reddit must flow. Please uplift to aurora!
Attachment #8768612 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+