Closed Bug 961318 Opened 10 years ago Closed 10 years ago

Random big delays (up to 5-10s) when opening new windows/chrome documents

Tracking

()

Status:

VERIFIED FIXED

Milestone:

mozilla29

Tracking Flags:

Tracking

Status

firefox28

verified

firefox29

verified

People

(Reporter: dagger.bugzilla, Assigned: luke)

References

Details

(Keywords: regression)

Attachments

(5 files)

test.js 10 years ago dagger.bugzilla 1.90 KB, application/x-javascript		Details
dialog-xul.txt 10 years ago dagger.bugzilla 778 bytes, text/plain		Details
capture.gif 10 years ago dagger.bugzilla 380.21 KB, image/gif		Details
fix-off-thread-heuristics 10 years ago Luke Wagner [:luke] 9.74 KB, patch	billm : review+	Details \| Diff \| Splinter Review
rebased for aurora 10 years ago Luke Wagner [:luke] 11.37 KB, patch	lsblakk : approval-mozilla-aurora+	Details \| Diff \| Splinter Review

dagger.bugzilla

Reporter

Description

•

10 years ago

Attached file test.js — Details

I'm seeing periodic delays of up to 5-10s when opening new dialogs and windows (any windows, not just browser.xul ones), and when opening some chrome documents (at least about:config) into tabs. Weirdly enough, the browser stays completely responsive while the windows load and doesn't completely peg the CPU -- the window just takes a long time to show up.

I've attached a script that measures the window open delay. Run it on a chrome scratchpad (Enable devtools.chrome.enabled -> Shift-F4 -> Environment -> Browser), and it'll open and immediately close a dialog 30 times, printing timing info to the console. The length of the delays seems to depend on memory consumption, so opening up some tabs with a memory-heavy page (I guess Gmail or Facebook; http://us.gizmodo.com/ perhaps if you don't have accounts) will make them a lot more noticeable.

Here's an example of the script output for me (with my Nightly profile that's consuming ~950 MB of RAM). Every 10s or so, the window takes 4-7s to open:

Delay: 31ms (0s since start)
Delay: 4109ms (5.6s since start)
Delay: 93ms (7.3s since start)
Delay: 110ms (8.9s since start)
Delay: 47ms (10.4s since start)
Delay: 6344ms (18.3s since start)
Delay: 31ms (19.8s since start)
Delay: 47ms (21.4s since start)
Delay: 31ms (22.9s since start)
Delay: 6953ms (31.3s since start)
Delay: 47ms (32.9s since start)
Delay: 94ms (34.5s since start)
Delay: 125ms (36.1s since start)
Delay: 6157ms (43.8s since start)
Delay: 62ms (45.4s since start)
Delay: 125ms (47s since start)
Delay: 109ms (48.6s since start)
Delay: 4375ms (54.5s since start)

I see the same problem in a clean profile, provided I open some pages to push the RAM usage up.

Regression range is https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=06c8949f78a0&tochange=1d9356c6c769 so I strongly suspect bug 931864.
github

dagger.bugzilla

Reporter

Comment 1

•

10 years ago

Attached file dialog-xul.txt — Details

This is the XUL the script loads in the dialog (i.e. the big data: URI). This line loads an empty script:

<script type="application/javascript;version=1.7" src="data:application/x-javascript;charset=utf-8,"/>

Oddly enough, without this line, the dialog consistently opens in <150ms. Using an actual .js file instead of a data: URI makes no difference, but if the script is inline:

<script type="application/javascript;version=1.7"> // script here </script>

it works just fine with no delays.

(I also tested without the "load" handler, running the JS immediately, and it made no difference.)

dagger.bugzilla

Reporter

Comment 2

•

10 years ago

Attached image capture.gif — Details

And here's a .gif for good measure.

dagger.bugzilla

Reporter

Updated

•

10 years ago

Blocks: 931864

dagger.bugzilla

Reporter

Comment 3

•

10 years ago

> github

Whoops. Anybody reading #developers from an hour or so ago will know why I randomly have "github" there ;)

Boris Zbarsky [:bzbarsky]

Updated

•

10 years ago

tracking-firefox29: --- → ?

Luke Wagner [:luke]

Assignee

Comment 4

•

10 years ago

What the patch in bug 931864 did is make async script parsing block on the completion of incremental GC.  (Before the patch, we'd just parse the async script synchronously making the determination of whether an async script was parsed asychronously highly non-deterministic.)  Thus, on the slow load times, the window is being opened during a full GC and the delay is the time it takes for the GC to complete.  GC time is proportional to the amount of life stuff, so it makes sense that you see worse times for a big profile.

I'm not sure if there is an easy fix for this or how much work is justified to fix it.  This only happens for async scripts (which is why the inline script in your experiment wasn't affected) where there is already an expectation of delay (from network latency or waiting for other scripts to finish execution).  Also, this only happens for full GCs which are not usually so frequent and close together (closing windows triggers them, which is why your script hits this regularly).

One thought is that, since the GC is incremental, it'll take a longer to complete (because it is letting content run in between GC slices) so perhaps we could do something to try to rush the incremental GC when an async script is waiting to be parsed.  Any thoughts on that Bill?

Luke Wagner [:luke]

Assignee

Updated

•

10 years ago

Status: UNCONFIRMED → NEW

Ever confirmed: true

Boris Zbarsky [:bzbarsky]

Comment 5

•

10 years ago

> This only happens for async scripts (which is why the inline script in your experiment
> wasn't affected) where there is already an expectation of delay (from network latency or
> waiting for other scripts to finish execution).

Then chrome code shouldn't be using async scripts on the blocking path to showing a window, perhaps?  Right now it does.  The user experience of hitting Cmd-N or Ctrl-N and then there being no visible response for 5s is a lot worse than an immediate window appearing that's not usable for 5s (which itself is a lot worse than an immediate and immediately usable window).

Luke Wagner [:luke]

Assignee

Comment 6

•

10 years ago

Yeah, although that also sounds unpleasant.

Alternatively, we could say that, if an off-thread parse task is started, we disable collection of the atom's zone.  This has the possibility of starvation given a constant stream of async script parse jobs, but it doesn't seem like it would arise in practice...

Luke Wagner [:luke]

Assignee

Comment 7

•

10 years ago

On second though, if the atoms compartment did get starved for collection, eventually we'd get one of those powerful (purge/shrinking?) GCs which I assume is non-incremental?  Just guessing here, perhaps Bill could weigh in.

Flags: needinfo?(wmccloskey)

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 8

•

10 years ago

I don't really like the idea of disabling collection of the atoms zone. First, we don't currently have a mechanism to cancel collection of a given compartment when the GC has already started. It might be hard to add this, too, since we could have already executed some |if (isFullGC)| checks, and it we'd have to back off from that somehow.

The starvation thing is also an issue. We don't have any good backstop there. Eventually when the zone got big enough, we'd trigger a non-incremental GC. However, the triggering logic doesn't take the actual string size into account--only sizeof(JSString)--so we could end up with pretty bad behavior. We also have the malloc bytes trigger, but I doubt it would work in this case since it gets reset on every GC.

Could we put in a check so that we only do the parsing off-thread during a GC if the script is huge? My understanding is that the original patch was meant to handle multi-megabyte asm.js scripts. I'm guessing that even browser.js isn't that big.

We really need to find a long-term fix for the atoms zone. I talked about the problem with Brian during the GC work week, but I don't remember the conclusion we came to. Maybe we can talk to him when he gets back.

Flags: needinfo?(wmccloskey)

Luke Wagner [:luke]

Assignee

Comment 9

•

10 years ago

(In reply to Bill McCloskey (:billm) from comment #8)
> Could we put in a check so that we only do the parsing off-thread during a
> GC if the script is huge? My understanding is that the original patch was
> meant to handle multi-megabyte asm.js scripts. I'm guessing that even
> browser.js isn't that big.

Yeah, that's probably the best short-term fix.  Caching isn't even used unless the script is >10,000 chars.

Luke Wagner [:luke]

Assignee

Comment 10

•

10 years ago

browser.js seems to be 549,150 chars.  However, it doesn't seem to be async compiled again after the initial load.  Other than that, when opening a new tab, the biggest script to be async compiled is 58,877 chars (chrome://browser/content/newtab/newTab.js).  The second biggest script is less than 10,000 chars (the caching minimum threshold).

It seems kinda sad to disable async parsing for *all* scripts <59K chars just because of this one potential problem.  Rather, I'm thinking we disable if rt->activeGCInAtomsZone() AND num-chars < 500,000.  This seems like a fair tradeoff: if a full GC is in progress, it's probably faster not to wait for the GC to finish; but for big scripts, we'd still like to avoid the main-thread jank.

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 11

•

10 years ago

(In reply to Luke Wagner [:luke] from comment #10)
> It seems kinda sad to disable async parsing for *all* scripts <59K chars
> just because of this one potential problem.  Rather, I'm thinking we disable
> if rt->activeGCInAtomsZone() AND num-chars < 500,000.  This seems like a
> fair tradeoff: if a full GC is in progress, it's probably faster not to wait
> for the GC to finish; but for big scripts, we'd still like to avoid the
> main-thread jank.

Yes, I think that makes sense.

Luke Wagner [:luke]

Assignee

Comment 12

•

10 years ago

Attached patch fix-off-thread-heuristics — Details — Splinter Review

With this patch, the script in comment 0 produces pretty regular <30ms load times.  I also threw in the heuristic of not using async parsing for tiny scripts because, seriously, it's not cheap; we create a whole Zone for each task.

Assignee: nobody → luke

Status: NEW → ASSIGNED

Attachment #8362725 - Flags: review?(wmccloskey)

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 13

•

10 years ago

Comment on attachment 8362725 [details] [diff] [review]
fix-off-thread-heuristics

Review of attachment 8362725 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jsworkers.cpp
@@ +234,5 @@
> +{
> +    // Off thread parsing can't occur during incremental collections on the
> +    // atoms compartment, to avoid triggering barriers. (Outside the atoms
> +    // compartment, the compilation will use a new zone which doesn't require
> +    // barriers itself.) If an atoms-zone GC is in progress, hold off on

Rather than say "which doesn't require barriers itself", I think it would be clearer to say "that is never collected".

Attachment #8362725 - Flags: review?(wmccloskey) → review+

Luke Wagner [:luke]

Assignee

Comment 14

•

10 years ago

Bill, I was just thinking that 1,000,000 is overly conservative.  Several large asm.js codebases are in the 100,000-1,000,000 range, with 100-300ms compile times.  Again, browser.js is above this limit, but it isn't parsed repeatedly (thank goodness; XUL prototype cache, I'm guessing?).

Luke Wagner [:luke]

Assignee

Comment 15

•

10 years ago

So does 100,000 seem reasonable?

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 16

•

10 years ago

Sure, that's fine.

Luke Wagner [:luke]

Assignee

Comment 17

•

10 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/dc82a01d0cfe

Nathan Froyd [:froydnj]

Comment 18

•

10 years ago

Backed out in https://hg.mozilla.org/integration/mozilla-inbound/rev/2eff560d9458 for bustage like:

https://tbpl.mozilla.org/php/getParsedLog.php?id=33303269&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=33303288&tree=Mozilla-Inbound

Phil Ringnalda (:philor)

Comment 19

•

10 years ago

And just for variety, Android 2.2 (but not Android 2.2 Armv6) mochitest-4, https://tbpl.mozilla.org/php/getParsedLog.php?id=33303231&tree=Mozilla-Inbound

Luke Wagner [:luke]

Assignee

Comment 20

•

10 years ago

D'oh!  In the shell, offThreadCompileScript wants to ignore these heuristics, so I'll add a CompileOption.forceCaching that is set only from this shell function.

The one mochitest failure will be an activeGCInAtomsZone for a script < 100,000 chars.  The scripts are already generated to be bigger than 10,000 chars, so I'll just bump them up in size.

Luke Wagner [:luke]

Assignee

Comment 21

•

10 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/f97076de7eb0

Carsten Book [:Tomcat]

Comment 22

•

10 years ago

https://hg.mozilla.org/mozilla-central/rev/f97076de7eb0

Status: ASSIGNED → RESOLVED

Closed: 10 years ago

Resolution: --- → FIXED

Target Milestone: --- → mozilla29

Benjamin Kerensa [:bkerensa]

Updated

•

10 years ago

Keywords: verifyme

Benjamin Kerensa [:bkerensa]

Updated

•

10 years ago

tracking-firefox29: ? → +

dagger.bugzilla

Reporter

Comment 23

•

10 years ago

I can verify this is fixed on Windows, if that's enough for VERIFIED.

Should this be uplifted to Aurora, since bug 931864 landed on 28?

Boris Zbarsky [:bzbarsky]

Comment 24

•

10 years ago

Good catch!  Nominating this for 28 tracking.

tracking-firefox28: --- → ?

Keywords: regression

Luke Wagner [:luke]

Assignee

Comment 25

•

10 years ago

Attached patch rebased for aurora — Details — Splinter Review

[Approval Request Comment]
Bug caused by (feature/regressing bug #): 931864
User impact if declined: random Chrome stalls when loading
Testing completed (on m-c, etc.): m-c
Risk to taking this patch (and alternatives if risky): low

(Attaching patch with simple rebase conflicts fixed.)

Attachment #8367539 - Flags: approval-mozilla-aurora?

Lukas Blakk [:lsblakk] use ?needinfo

Updated

•

10 years ago

status-firefox28: --- → affected

status-firefox29: --- → fixed

tracking-firefox28: ? → +

Lukas Blakk [:lsblakk] use ?needinfo

Updated

•

10 years ago

Attachment #8367539 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+

Ryan VanderMeulen [:RyanVM]

Comment 26

•

10 years ago

https://hg.mozilla.org/releases/mozilla-aurora/rev/959dcabaffb1

status-firefox28: affected → fixed

Catalin Varga [QA][:VarCat]

Comment 27

•

10 years ago

I've verified the bug using the following environment:

Firefox 29
User Agent:Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20100101 Firefox/29.0
Build Id: 20140203004003
Windows 7 x64

Firefox 28
User Agent:Mozilla/5.0 (Windows NT 6.1; WOW64; rv:28.0) Gecko/20100101 Firefox/28.0
Build Id:20140203004003
Windows 7 x64

FF Aurora 29
User Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:28.0) Gecko/20100101 Firefox/28.0
Build Id: 20140204004003
Mac OS X 10.9

FF Aurora 29
User Agent:Mozilla/5.0 (X11; Linux i686; rv:29.0) Gecko/20100101 Firefox/29.0
Build Id:20140204004003
Ubuntu 12.04 x32

Status: RESOLVED → VERIFIED

status-firefox28: fixed → verified

status-firefox29: fixed → verified

Keywords: verifyme

You need to log in before you can comment on or make changes to this bug.