Open Bug 960986 Opened 6 years ago Updated 2 years ago

Chrome Worker startup is very slow on second run. Disabling nglayout.debug.disable_xul_cache removes slowdown

Categories

(Core :: DOM: Workers, defect, P5)

defect

Tracking

()

People

(Reporter: smacleod, Unassigned)

Details

(Whiteboard: [Async:blocker])

While investigating Bug 959130 we noticed the startup time of Chrome Workers was > 600ms (rMBP, slower on windows VM).

It appears these long startup times happen on second run, with a cold startup being much quicker (around 100ms on rMBP). These timings are from sending first message, to JS execution inside the worker. Worker startup timings were taken of the Session Store worker (browser/components/sessionstore/src/SessionWorker.js)

Setting "nglayout.debug.disable_xul_cache" to true results in all timings being "fast".

STR:

 - Rebuild browser/
 - Run Firefox and observe timing to start SessionWorker (Will be ~100ms)
 - Close Firefox
 - Run Firefox and observe timing to start SessionWorker (Will be > 600ms)
 - Subsequent runs will also have large timing

 - Set nglayout.debug.disable_xul_cache = true
 - Rebuild browser/
 - Run Firefox and observe timing to start SessionWorker (Will be ~100ms)
 - Close Firefox
 - Run Firefox and observe timing to start SessionWorker (will be ~100ms)
This doesn't sound like a worker bug but more like something inside our cache implementation.
Is there anything we can do to debug this? Where should we start?
I did some bisecting in StartupCache::GetBuffer() and found the following:

Excluding browser/base/content/browser.js and browser/components/customizableui/content/panelUI.js keeps the session worker startup time at ~250ms, I put this at the top of the function:

> if (!strcmp(id, "xulcache/resource/app/chrome/browser/content/browser/browser.js")) {
>   return NS_ERROR_FAILURE;
> }
>
> if (!strcmp(id, "xulcache/resource/app/chrome/browser/content/browser/customizableui/panelUI.js")) {
>   return NS_ERROR_FAILURE;
> }

Removing one of those bail outs brings the session worker startup time back to 600-700ms on the second start i.e. with a filled startup cache.
Removing the DeserializeOutOfLine() call that was introduced with bug 188744 [1] (I think) keeps the worker fast, without excluding browser.js and panelUI.js. Now I'm sure that call has a purpose but something is going wrong somewhere here.

https://github.com/mozilla/gecko-dev/commit/b11fb7ca9d1c5c22e93d60ce0dc187705fcb32da
Adding Olli who knows a bit about the XUL prototype cache. Maybe he has a clue what's going on.
BTW, it's not only the session worker that suffers. The shared OS.File worker (resource://gre/modules/osfile/osfile_async_worker.js) is delayed by roughly the same amount of time. Measure in ChrashMonitor.jsm that uses it right on startup.
Starting with an empty startup cache, WorkerPrivate::WorkerPrivate() is called and DedicatedWorkerGlobalScope::PostMessage() right after that.

Starting with a filled startup cache there is a slight delay (probably around 400ms) between the worker creation and the call to PostMessage().
With a filled startup cache the CompileScriptRunnable seems to be taking longer, i.e. scriptLoader->LoadWorkerScript():

https://mxr.mozilla.org/mozilla-central/source/dom/workers/WorkerPrivate.cpp#880
LoadWorkerScript() with empty startup cache:

(292 ms) + (239 ms)

Filled startup cache:

(735 ms) + (675 ms)
We'll have Telemetry once bug 916076 lands.
Measuring ScriptLoaderRunnable times for an empty startup cache, measure from Run() to OnStreamCompleteInternal():

 -- ScriptLoaderRunnable (resource://gre/modules/osfile/osfile_async_worker.js) (23 ms)
 -- ScriptLoaderRunnable (resource:///modules/sessionstore/SessionWorker.js) (18 ms)
 -- ScriptLoaderRunnable (resource://gre/modules/osfile.jsm) (5 ms)
 -- ScriptLoaderRunnable (resource://gre/modules/osfile.jsm) (7 ms)
 -- ScriptLoaderRunnable (resource://gre/modules/workers/require.js) (7 ms)
 -- ScriptLoaderRunnable (resource://gre/modules/osfile/osfile_unix_back.jsm) (7 ms)
 -- ScriptLoaderRunnable (resource://gre/modules/osfile/osfile_shared_front.jsm) (7 ms)
 -- ScriptLoaderRunnable (resource://gre/modules/osfile/osfile_unix_front.jsm) (7 ms)
 -- ScriptLoaderRunnable (resource://gre/modules/workers/require.js) (7 ms)
 -- ScriptLoaderRunnable (resource://gre/modules/osfile/osfile_unix_back.jsm) (7 ms)
 -- ScriptLoaderRunnable (resource://gre/modules/osfile/osfile_shared_front.jsm) (7 ms)
 -- ScriptLoaderRunnable (resource://gre/modules/osfile/osfile_unix_front.jsm) (7 ms)

A filled cache yields the following:

 -- ScriptLoaderRunnable (resource://gre/modules/osfile/osfile_async_worker.js) (54 ms)
 -- ScriptLoaderRunnable (resource:///modules/sessionstore/SessionWorker.js) (49 ms)
 -- ScriptLoaderRunnable (resource://gre/modules/osfile.jsm) (3 ms)
 -- ScriptLoaderRunnable (resource://gre/modules/osfile.jsm) (3 ms)
 -- ScriptLoaderRunnable (resource://gre/modules/workers/require.js) (94 ms)
 -- ScriptLoaderRunnable (resource://gre/modules/osfile/osfile_unix_back.jsm) (94 ms)
 -- ScriptLoaderRunnable (resource://gre/modules/osfile/osfile_shared_front.jsm) (94 ms)
 -- ScriptLoaderRunnable (resource://gre/modules/osfile/osfile_unix_front.jsm) (94 ms)
 -- ScriptLoaderRunnable (resource://gre/modules/workers/require.js) (93 ms)
 -- ScriptLoaderRunnable (resource://gre/modules/osfile/osfile_unix_back.jsm) (93 ms)
 -- ScriptLoaderRunnable (resource://gre/modules/osfile/osfile_shared_front.jsm) (93 ms)
 -- ScriptLoaderRunnable (resource://gre/modules/osfile/osfile_unix_front.jsm) (93 ms)
I was afraid that the slow down could be caused by workers/require.js, but comment 12 seems to suggest that this isn't the case, as some of the 94ms scripts are loaded using require(), others using importScripts().
It looks like browser.xul and its script files are loaded off the startup cache and that leads to the browser window being ready much earlier. That I think means the worker scripts have to fight for resources with the images and style sheets of the first browser window that then leads to a delayed worker initialization. Does that sound plausible?
Worker scripts aren't put into the startup cache so it seems like the best option would be to go with bug 959598 and not use Workers in the (critical) startup path.
Why wouldn't we just startup cache worker scripts that we care about?
That of course is another possibility. It just seems bad to rely on that list of worker scripts we care about being up to date at all times. Wouldn't it be easier to discourage from using workers for time-critical startup code? In hindsight, it does sound like a bad idea.
We don't have to manually maintain a list; we could startupcache all ChromeWorkers.
"Don't use parallelism" seems like a bad solution to me.
Right, startup caching all ChromeWorkers would indeed do. With a ~250ms (on a fast machine) worker startup time on first run, I wonder how fast we can become with the startup cache. I am totally not against caching worker scripts, it just seems the easier option to not use workers here?

If you need (or would benefit from) parallelism at startup then sure, use workers. But afaik we won't win much by using workers for disk I/O at startup where the disk is under heavy load and the bottleneck anyway?
Whiteboard: [Async:blocker]
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) (AFK Jan 22/23) from comment #18)
> We don't have to manually maintain a list; we could startupcache all
> ChromeWorkers.

Sounds like we should probably do this either way. Are there downsides I'm missing? Do we need to be concerned about increasing the startupcache size significantly, or other weird interactions?

(In reply to Tim Taubert [:ttaubert] from comment #20)
> But afaik we won't win much by using workers for disk I/O at
> startup where the disk is under heavy load and the bottleneck anyway?

Indeed. Sounds like we should fix bug 959598 and generally avoid relying on OS.File in the startup path.
Note that we're strongly providing a native version of OS.File.read that would completely side-step the issue.
Ok, we have numbers for the OS.File worker:
- 80% of measures show a startup above 200ms;
- 42% are above 600ms;
- 3% are above 5 seconds!

Source: http://telemetry.mozilla.org/#nightly/29/OSFILE_WORKER_LAUNCH_MS (from January 23rd).

That's pretty bad. Gavin, you have a bird's eye view on this. Can you put someone in charge?
Flags: needinfo?(gavin.sharp)
vladan, can you find someone to own "startupcache all ChromeWorkers"?
Flags: needinfo?(gavin.sharp)
Flags: needinfo?(vdjeric)
Asking Aaron for his opinion about comment #24, during the perf meeting it seemed like he definitely has one :)
Flags: needinfo?(vdjeric) → needinfo?(aklotz)
Normally I'm hesitant to add more stuff to the startup cache. Since this problem appears to be caused due to various interdependencies between some things that are already in the cache and some things that are not, I'm more inclined to go along with this provided that we verify that the changes do indeed mitigate this problem.

This stuff is precompiled during make package, so it's going to need some enhancements in http://mxr.mozilla.org/mozilla-central/source/toolkit/mozapps/installer/packager.py#116 and possibly http://mxr.mozilla.org/mozilla-central/source/toolkit/mozapps/installer/precompile_cache.js
Flags: needinfo?(aklotz)
Priority: -- → P5
You need to log in before you can comment on or make changes to this bug.