Closed Bug 961318 Opened 10 years ago Closed 10 years ago

Random big delays (up to 5-10s) when opening new windows/chrome documents

Categories

(Core :: JavaScript Engine, defect)

defect
Not set
normal

Tracking

()

VERIFIED FIXED
mozilla29
Tracking Status
firefox28 + verified
firefox29 + verified

People

(Reporter: dagger.bugzilla, Assigned: luke)

References

Details

(Keywords: regression)

Attachments

(5 files)

Attached file test.js
I'm seeing periodic delays of up to 5-10s when opening new dialogs and windows (any windows, not just browser.xul ones), and when opening some chrome documents (at least about:config) into tabs. Weirdly enough, the browser stays completely responsive while the windows load and doesn't completely peg the CPU -- the window just takes a long time to show up.

I've attached a script that measures the window open delay. Run it on a chrome scratchpad (Enable devtools.chrome.enabled -> Shift-F4 -> Environment -> Browser), and it'll open and immediately close a dialog 30 times, printing timing info to the console. The length of the delays seems to depend on memory consumption, so opening up some tabs with a memory-heavy page (I guess Gmail or Facebook; http://us.gizmodo.com/ perhaps if you don't have accounts) will make them a lot more noticeable.

Here's an example of the script output for me (with my Nightly profile that's consuming ~950 MB of RAM). Every 10s or so, the window takes 4-7s to open:

Delay: 31ms (0s since start)
Delay: 4109ms (5.6s since start)
Delay: 93ms (7.3s since start)
Delay: 110ms (8.9s since start)
Delay: 47ms (10.4s since start)
Delay: 6344ms (18.3s since start)
Delay: 31ms (19.8s since start)
Delay: 47ms (21.4s since start)
Delay: 31ms (22.9s since start)
Delay: 6953ms (31.3s since start)
Delay: 47ms (32.9s since start)
Delay: 94ms (34.5s since start)
Delay: 125ms (36.1s since start)
Delay: 6157ms (43.8s since start)
Delay: 62ms (45.4s since start)
Delay: 125ms (47s since start)
Delay: 109ms (48.6s since start)
Delay: 4375ms (54.5s since start)

I see the same problem in a clean profile, provided I open some pages to push the RAM usage up.

Regression range is https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=06c8949f78a0&tochange=1d9356c6c769 so I strongly suspect bug 931864.
github
Attached file dialog-xul.txt
This is the XUL the script loads in the dialog (i.e. the big data: URI). This line loads an empty script:

<script type="application/javascript;version=1.7" src="data:application/x-javascript;charset=utf-8,"/>

Oddly enough, without this line, the dialog consistently opens in <150ms. Using an actual .js file instead of a data: URI makes no difference, but if the script is inline:

<script type="application/javascript;version=1.7"> // script here </script>

it works just fine with no delays.

(I also tested without the "load" handler, running the JS immediately, and it made no difference.)
Attached image capture.gif
And here's a .gif for good measure.
Blocks: 931864
> github

Whoops. Anybody reading #developers from an hour or so ago will know why I randomly have "github" there ;)
What the patch in bug 931864 did is make async script parsing block on the completion of incremental GC.  (Before the patch, we'd just parse the async script synchronously making the determination of whether an async script was parsed asychronously highly non-deterministic.)  Thus, on the slow load times, the window is being opened during a full GC and the delay is the time it takes for the GC to complete.  GC time is proportional to the amount of life stuff, so it makes sense that you see worse times for a big profile.

I'm not sure if there is an easy fix for this or how much work is justified to fix it.  This only happens for async scripts (which is why the inline script in your experiment wasn't affected) where there is already an expectation of delay (from network latency or waiting for other scripts to finish execution).  Also, this only happens for full GCs which are not usually so frequent and close together (closing windows triggers them, which is why your script hits this regularly).

One thought is that, since the GC is incremental, it'll take a longer to complete (because it is letting content run in between GC slices) so perhaps we could do something to try to rush the incremental GC when an async script is waiting to be parsed.  Any thoughts on that Bill?
Status: UNCONFIRMED → NEW
Ever confirmed: true
> This only happens for async scripts (which is why the inline script in your experiment
> wasn't affected) where there is already an expectation of delay (from network latency or
> waiting for other scripts to finish execution).

Then chrome code shouldn't be using async scripts on the blocking path to showing a window, perhaps?  Right now it does.  The user experience of hitting Cmd-N or Ctrl-N and then there being no visible response for 5s is a lot worse than an immediate window appearing that's not usable for 5s (which itself is a lot worse than an immediate and immediately usable window).
Yeah, although that also sounds unpleasant.

Alternatively, we could say that, if an off-thread parse task is started, we disable collection of the atom's zone.  This has the possibility of starvation given a constant stream of async script parse jobs, but it doesn't seem like it would arise in practice...
On second though, if the atoms compartment did get starved for collection, eventually we'd get one of those powerful (purge/shrinking?) GCs which I assume is non-incremental?  Just guessing here, perhaps Bill could weigh in.
Flags: needinfo?(wmccloskey)
I don't really like the idea of disabling collection of the atoms zone. First, we don't currently have a mechanism to cancel collection of a given compartment when the GC has already started. It might be hard to add this, too, since we could have already executed some |if (isFullGC)| checks, and it we'd have to back off from that somehow.

The starvation thing is also an issue. We don't have any good backstop there. Eventually when the zone got big enough, we'd trigger a non-incremental GC. However, the triggering logic doesn't take the actual string size into account--only sizeof(JSString)--so we could end up with pretty bad behavior. We also have the malloc bytes trigger, but I doubt it would work in this case since it gets reset on every GC.

Could we put in a check so that we only do the parsing off-thread during a GC if the script is huge? My understanding is that the original patch was meant to handle multi-megabyte asm.js scripts. I'm guessing that even browser.js isn't that big.

We really need to find a long-term fix for the atoms zone. I talked about the problem with Brian during the GC work week, but I don't remember the conclusion we came to. Maybe we can talk to him when he gets back.
Flags: needinfo?(wmccloskey)
(In reply to Bill McCloskey (:billm) from comment #8)
> Could we put in a check so that we only do the parsing off-thread during a
> GC if the script is huge? My understanding is that the original patch was
> meant to handle multi-megabyte asm.js scripts. I'm guessing that even
> browser.js isn't that big.

Yeah, that's probably the best short-term fix.  Caching isn't even used unless the script is >10,000 chars.
browser.js seems to be 549,150 chars.  However, it doesn't seem to be async compiled again after the initial load.  Other than that, when opening a new tab, the biggest script to be async compiled is 58,877 chars (chrome://browser/content/newtab/newTab.js).  The second biggest script is less than 10,000 chars (the caching minimum threshold).

It seems kinda sad to disable async parsing for *all* scripts <59K chars just because of this one potential problem.  Rather, I'm thinking we disable if rt->activeGCInAtomsZone() AND num-chars < 500,000.  This seems like a fair tradeoff: if a full GC is in progress, it's probably faster not to wait for the GC to finish; but for big scripts, we'd still like to avoid the main-thread jank.
(In reply to Luke Wagner [:luke] from comment #10)
> It seems kinda sad to disable async parsing for *all* scripts <59K chars
> just because of this one potential problem.  Rather, I'm thinking we disable
> if rt->activeGCInAtomsZone() AND num-chars < 500,000.  This seems like a
> fair tradeoff: if a full GC is in progress, it's probably faster not to wait
> for the GC to finish; but for big scripts, we'd still like to avoid the
> main-thread jank.

Yes, I think that makes sense.
With this patch, the script in comment 0 produces pretty regular <30ms load times.  I also threw in the heuristic of not using async parsing for tiny scripts because, seriously, it's not cheap; we create a whole Zone for each task.
Assignee: nobody → luke
Status: NEW → ASSIGNED
Attachment #8362725 - Flags: review?(wmccloskey)
Comment on attachment 8362725 [details] [diff] [review]
fix-off-thread-heuristics

Review of attachment 8362725 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jsworkers.cpp
@@ +234,5 @@
> +{
> +    // Off thread parsing can't occur during incremental collections on the
> +    // atoms compartment, to avoid triggering barriers. (Outside the atoms
> +    // compartment, the compilation will use a new zone which doesn't require
> +    // barriers itself.) If an atoms-zone GC is in progress, hold off on

Rather than say "which doesn't require barriers itself", I think it would be clearer to say "that is never collected".
Attachment #8362725 - Flags: review?(wmccloskey) → review+
Bill, I was just thinking that 1,000,000 is overly conservative.  Several large asm.js codebases are in the 100,000-1,000,000 range, with 100-300ms compile times.  Again, browser.js is above this limit, but it isn't parsed repeatedly (thank goodness; XUL prototype cache, I'm guessing?).
So does 100,000 seem reasonable?
And just for variety, Android 2.2 (but not Android 2.2 Armv6) mochitest-4, https://tbpl.mozilla.org/php/getParsedLog.php?id=33303231&tree=Mozilla-Inbound
D'oh!  In the shell, offThreadCompileScript wants to ignore these heuristics, so I'll add a CompileOption.forceCaching that is set only from this shell function.

The one mochitest failure will be an activeGCInAtomsZone for a script < 100,000 chars.  The scripts are already generated to be bigger than 10,000 chars, so I'll just bump them up in size.
https://hg.mozilla.org/mozilla-central/rev/f97076de7eb0
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla29
I can verify this is fixed on Windows, if that's enough for VERIFIED.

Should this be uplifted to Aurora, since bug 931864 landed on 28?
Good catch!  Nominating this for 28 tracking.
Keywords: regression
[Approval Request Comment]
Bug caused by (feature/regressing bug #): 931864
User impact if declined: random Chrome stalls when loading
Testing completed (on m-c, etc.): m-c
Risk to taking this patch (and alternatives if risky): low

(Attaching patch with simple rebase conflicts fixed.)
Attachment #8367539 - Flags: approval-mozilla-aurora?
Attachment #8367539 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
I've verified the bug using the following environment:

Firefox 29
User Agent:Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20100101 Firefox/29.0
Build Id: 20140203004003
Windows 7 x64

Firefox 28
User Agent:Mozilla/5.0 (Windows NT 6.1; WOW64; rv:28.0) Gecko/20100101 Firefox/28.0
Build Id:20140203004003
Windows 7 x64

FF Aurora 29
User Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:28.0) Gecko/20100101 Firefox/28.0
Build Id: 20140204004003
Mac OS X 10.9

FF Aurora 29
User Agent:Mozilla/5.0 (X11; Linux i686; rv:29.0) Gecko/20100101 Firefox/29.0
Build Id:20140204004003
Ubuntu 12.04 x32
Status: RESOLVED → VERIFIED
Keywords: verifyme
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: