Last Comment Bug 749617 - 10% tpaint regression with compartment-per-global
: 10% tpaint regression with compartment-per-global
Product: Core
Classification: Components
Component: JavaScript Engine (show other bugs)
: unspecified
: All All
: -- normal (vote)
: mozilla15
Assigned To: general
: Jason Orendorff [:jorendorff]
Depends on: 752205 1245233
Blocks: cpg
  Show dependency treegraph
Reported: 2012-04-27 07:37 PDT by Bobby Holley (:bholley) (busy with Stylo)
Modified: 2016-02-02 11:53 PST (History)
13 users (show)
See Also:
Crash Signature:
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---

patch (2.71 KB, patch)
2012-04-30 08:43 PDT, Brian Hackett (:bhackett)
no flags Details | Diff | Splinter Review
updated patch (2.75 KB, text/plain)
2012-04-30 12:02 PDT, Bobby Holley (:bholley) (busy with Stylo)
no flags Details
tpaint setup (29.20 KB, application/x-bzip2)
2012-05-01 09:11 PDT, Bobby Holley (:bholley) (busy with Stylo)
no flags Details
rm codeString (6.10 KB, patch)
2012-05-01 22:50 PDT, Luke Wagner [:luke]
n.nethercote: review+
Details | Diff | Splinter Review
improve getLocalNamesArray (14.14 KB, patch)
2012-05-01 22:51 PDT, Luke Wagner [:luke]
jwalden+bmo: review+
Details | Diff | Splinter Review
make CloneScript fast (24.03 KB, patch)
2012-05-01 23:34 PDT, Luke Wagner [:luke]
n.nethercote: review+
Details | Diff | Splinter Review

Description Bobby Holley (:bholley) (busy with Stylo) 2012-04-27 07:37:06 PDT
This is the very last CPG blocker.

CPG gives us a 7-23% tpaint regression, where tpaint measures the time it takes to open up a new window and do the initial paint.

I spent some time with a profiler, and the issue appears to be that we're spending time doing script analysis for scripts we cloned across globals.

Gecko caches scripts in various places (like the XUL prototype cache). These scripts live in a compartment, so luke added some code in JS_ExecuteScript to just clone them if they come from the wrong compartment:

So my question is, given that we only need to clone the script so that it lives in a new compartment, is there some way we can reuse the old analysis? It seems like there isn't a huge amount of stuff that changed. These scripts were being reused against multiple globals before, so they can't be tied to any particular global. It's just a compartment issue.

CCing people who know about this stuff. Compartment-per-global is ready to land, modulo this issue. If you guys want it as bad as I do, then hopefully someone can lend a hand to push this over the line. :-)
Comment 1 Bobby Holley (:bholley) (busy with Stylo) 2012-04-27 07:40:05 PDT
Also, the hot function seems to be js::analysis::ScriptAnalysis::analyzeBytecode, at least in the profiler.
Comment 2 Luke Wagner [:luke] 2012-04-27 16:45:14 PDT
One simple idea is to turn off TI in the clone's destination compartment.  There is even reason that this may be an overall speedup: clearly this code is not running for long if analysis time is significant (although that may just be how tpaint exercises it).  Also, since everything gets thrown out on every GC, every time we run this chrome code we may be analyzing the scripts anew.

If nothing else, the patch should be simple (is it as easy as just cx->compartment->inferenceEnabled = false?) so perhaps worth just try'ing.
Comment 3 Brian Hackett (:bhackett) 2012-04-29 18:31:57 PDT
Right now we analyzeBytecode in all scripts that are executed, whether TI is enabled or not.  I'm don't think this behavior is actually necessary and will try fixing this first thing tomorrow.
Comment 4 Bobby Holley (:bholley) (busy with Stylo) 2012-04-30 01:59:29 PDT
(In reply to Brian Hackett (:bhackett) from comment #3)
> Right now we analyzeBytecode in all scripts that are executed, whether TI is
> enabled or not.

Right. These are chrome scripts, so TI is already disabled on the compartment.

> I'm don't think this behavior is actually necessary and will try fixing
> this first thing tomorrow.

Huzzah! Thanks Brian.
Comment 5 Brian Hackett (:bhackett) 2012-04-30 08:43:57 PDT
Created attachment 619580 [details] [diff] [review]

Patch which works in the shell, about to push to try.
Comment 6 Bobby Holley (:bholley) (busy with Stylo) 2012-04-30 12:02:37 PDT
Created attachment 619634 [details]
updated patch

First patch didn't work, so Brian and I did some pair debugging. Attaching the resulting patch.

This patch makes analyzeBytecode disappear from the profile, but unfortunately doesn't make the tpaint regression go away. So it must be something else... :\
Comment 7 Bobby Holley (:bholley) (busy with Stylo) 2012-04-30 12:05:59 PDT
I've got to head out for the evening, though I'll check mail before heading to bed when I get home. If we're lucky, whatever luke discovers about the cssquery regression might indicate what's going on here.
Comment 8 Bobby Holley (:bholley) (busy with Stylo) 2012-05-01 06:13:43 PDT
Alright, so it looks like bhackett's patch didn't do the trick here. I've been running numbers all morning. Each run opens 50 windows, and I did a number of runs for various configurations. Here are the parameters:

precpg/cpg: with/without cpg
cloneall: a patch that clones _every_ script in JS_ExecuteScript.
bhackett3: bhackett's 3rd patch (demonstrated to take analyzeBytecode out of the profile)

Here are my results (each number represents the average opening time (in ms) of a 50-window run, with the first open removed from the average):

precpg + cloneall + bhackett3: 192,194,192
precpg + cloneall: 199, 201, 193,195,198 (not sure why there's such high variance here)
precpg: 181,185,187,181

cpg + cloneall + bhackett3: 200,202,202
cpg + bhackett3: 202, 200,199
cpg: 200, 201, 202

So this tells us a few things. First, cloneall makes a significant difference when applied to precpg, but almost none when applied to cpg. This implies that the cloning is the source of the performance regression. Second, bhackett3 made either slight or no impact in clone-heavy cases, even though it made analyzeBytecode disappear off the profile.

So all this seems to indicate that there's some other way that CloneScript is hurting us. I'm going to see if I can convince Instruments to give me another lead.
Comment 9 Bobby Holley (:bholley) (busy with Stylo) 2012-05-01 06:45:50 PDT
I just checked that none of the numbers change if I turned off TI, chrome jit, and content jit. So the issue here is somewhere else.
Comment 10 Bobby Holley (:bholley) (busy with Stylo) 2012-05-01 07:04:57 PDT
So I just added some test code to precpg that _just_ does a CloneScript in JS_ExecuteScript, but doesn't actually use the result (it still executes the old scriptArg, if it can).

I got 189,191,193,197. So a good chunk (half-ish?) of the overhead is just cloning the script, regardless of whether we're running it.
Comment 11 Brian Hackett (:bhackett) 2012-05-01 07:10:10 PDT
At a glance, js::CloneScript looks hideously inefficient --- it XDR-encodes the script into a byte stream, then decodes it into the new compartment.
Comment 12 Boris Zbarsky [:bz] (still a bit busy) 2012-05-01 09:01:28 PDT
You're doing a, right?  So the scripts you're cloning include chrome://browser/content/browser.js, at least.  All 13000 lines of it.  How long does it take to CloneScript just that one script?
Comment 13 Bobby Holley (:bholley) (busy with Stylo) 2012-05-01 09:11:15 PDT
Created attachment 619948 [details]
tpaint setup

I had to do some hacking on tpaint, so I'm attaching my resulting working directory (with a git repo inside of it). There are two branches, "master" and "profiling". The latter does programmatic enabling/disabling of the profiler.

To run, you need to do the following:
* enable remote xul with a pref
* enable mozafterpaint with a pref
* ./firefox -P yourprofile -chrome file:///full/path/to/winopen.xul?mozafterpaint=1%26phase1=20

The phase1= parameter specifies how many runs it does.
Comment 14 Bobby Holley (:bholley) (busy with Stylo) 2012-05-01 09:15:16 PDT
Rather than making CloneScript faster, it probably makes the most sense just to do bug 679940.
Comment 15 Bobby Holley (:bholley) (busy with Stylo) 2012-05-01 09:16:26 PDT
This is the log of the clone going on:
Comment 16 Luke Wagner [:luke] 2012-05-01 11:02:09 PDT
Sharking "" (which exercises this whole JS_Execute/CloneScript path) shows about 5.1% of total time under CloneScript and 2.2% total time under XDRAtom.  Since atoms are shareable across compartments, this suggests a hack to avoid serializing+deserializing+hashing atoms.
Comment 17 Bobby Holley (:bholley) (busy with Stylo) 2012-05-01 11:33:48 PDT
(In reply to Luke Wagner [:luke] from comment #16)
> Sharking "" (which exercises this whole JS_Execute/CloneScript
> path) shows about 5.1% of total time under CloneScript and 2.2% total time
> under XDRAtom.  Since atoms are shareable across compartments, this suggests
> a hack to avoid serializing+deserializing+hashing atoms.

That sounds good - we can do that before the landing, and track bug 679940 (but not block on it). Is this the sort of patch you can put together in a jiffy?
Comment 18 Luke Wagner [:luke] 2012-05-01 11:36:26 PDT
Actually, there is more CloneScript than I thought; a lot of it comes from JS_CloneFunctionObject (a lot of that from XBL) and more XDRAtom than I thought (under XDRStaticBlock).  I'll try to do this hack and see what it gets us.
Comment 19 Luke Wagner [:luke] 2012-05-01 22:50:12 PDT
Created attachment 620198 [details] [diff] [review]
rm codeString

Simple preparatory patch: assert (as is already true) that script constant strings are atoms.
Comment 20 Luke Wagner [:luke] 2012-05-01 22:51:06 PDT
Created attachment 620199 [details] [diff] [review]
improve getLocalNamesArray

getLocalNamesArray should provide BindingKind in addition to name.
Comment 21 Nicholas Nethercote [:njn] 2012-05-01 23:21:48 PDT
Comment on attachment 620198 [details] [diff] [review]
rm codeString

> template<XDRMode mode>
> bool
> js::XDRAtom(XDRState<mode> *xdr, JSAtom **atomp)
> {
>     if (mode == XDR_ENCODE) {
>-        JSString *str = *atomp;
>-        return xdr->codeString(&str);
>+        uint32_t nchars = (*atomp)->length();
>+        if (!xdr->codeUint32(&nchars))
>+            return false;
>+        jschar *chars = const_cast<jschar *>((*atomp)->getChars(xdr->cx()));
>+        if (!chars)
>+            return false;
>+        return xdr->codeChars(chars, nchars);
>     }

I was expecting to see a corresponding change in DECODE code for this... I guess there must be a reason why it's not necessary?
Comment 22 Luke Wagner [:luke] 2012-05-01 23:25:33 PDT
(In reply to Nicholas Nethercote [:njn] from comment #21)
Yep.  (Wow, you're fast!  Real patch in when some more try results come in.)
Comment 23 Luke Wagner [:luke] 2012-05-01 23:34:34 PDT
Created attachment 620207 [details] [diff] [review]
make CloneScript fast

This patch implements CloneScript by doing a clone directly (no XDR involved).  This avoids all the atomization overhead as well as the copying overhead to and from script/script->data and the XDR buffer.

Running the same micro-bench, this patch cuts the slowdown almost in half (from 20% slower to 11% slower).  I'm still waiting for the full try run:
but based on the partial results of a previous try push (with a borked assert), I see the tpaint slowdown at ~3% on 10.6 and 10.7.
Comment 24 Luke Wagner [:luke] 2012-05-02 11:02:45 PDT
Green on try
Comment 25 Jeff Walden [:Waldo] (remove +bmo to email) 2012-05-02 12:39:59 PDT
Comment on attachment 620199 [details] [diff] [review]
improve getLocalNamesArray

Review of attachment 620199 [details] [diff] [review]:

You sure we don't want low-bit twiddling instead?  Chicken!
Comment 26 Nicholas Nethercote [:njn] 2012-05-02 17:39:17 PDT
Comment on attachment 620207 [details] [diff] [review]
make CloneScript fast

Review of attachment 620207 [details] [diff] [review]:

I don't love this patch, but it'll do.  My main concern is about the JSCompartment maps, see below.

::: js/src/jsfun.cpp
@@ +373,5 @@
>  template<XDRMode mode>
>  bool
>  js::XDRInterpretedFunction(XDRState<mode> *xdr, JSObject **objp, JSScript *parentScript)
>  {
> +    /* NB: updates to XDRInterpretedFunction should propagate to CloneInterpretedFunction. */

You mean "if you modify this function you should modify CloneInterpretedFunction as well"?  If so, can you make the comment clearer?  (Maybe "Keep this in sync with CloneInterpretedFunction".)  And should CloneInterpretedFunction have a similar comment pointing back here?

::: js/src/jsscript.cpp
@@ +397,5 @@
>  template<XDRMode mode>
>  bool
>  js::XDRScript(XDRState<mode> *xdr, JSScript **scriptp, JSScript *parentScript)
>  {
> +    /* NB: updates to XDRScript should propagate to CloneScript. */

Again, I found "propagate" unclear here.

@@ -1137,5 @@
>      /*
>       * We assume that calloc aligns on sizeof(Value) if the size we ask to
>       * allocate divides sizeof(Value).
>       */
> -    JS_STATIC_ASSERT(sizeof(Value) == sizeof(double));

The comment is kind of bogus too.  A better comment would be "We assume that calloc aligns on sizeof(Value) if the size we request is greater than sizeof(Value)."

And can you add an assertion that |data| actually is sizeof(Value)-aligned?  Thanks!

@@ +1749,2 @@
>  JSScript *
> +js::CloneScript(JSContext *cx, JSScript *src)

This function is disturbingly low-level, but I guess you gotta do what you gotta do :/

JSCompartment has three maps holding optional per-JSScript info:  scriptCountsMap, sourceMapMap, debugScriptMap.  Does the info in these maps need to be copied for the new script?  I'm not sure.  If it does, it does;  if it does not, a comment explaining why would be helpful.

@@ +1763,5 @@
> +    size_t size = ScriptDataSize(cx, src->length, src->numNotes(), src->natoms,
> +                                 nobjects, nregexps, ntrynotes, nconsts, nglobals,
> +                                 nClosedArgs, nClosedVars);
> +
> +    uint8_t *data = static_cast<uint8_t *>(cx->calloc_(JS_ROUNDUP(size, sizeof(Value))));

Can you assert that this is sizeof(Value)-aligned as well?

::: js/src/vm/RegExpObject.cpp
@@ +757,5 @@
>  template<XDRMode mode>
>  bool
>  js::XDRScriptRegExpObject(XDRState<mode> *xdr, HeapPtrObject *objp)
>  {
> +    /* NB: updates to XDRScriptRegExpObject should propagate to CloneScriptRegExpObject. */

Ditto for "propagate", and the comment pointing back.

::: js/src/vm/ScopeObject.cpp
@@ +903,5 @@
>  template<XDRMode mode>
>  bool
>  js::XDRStaticBlockObject(XDRState<mode> *xdr, JSScript *script, StaticBlockObject **objp)
>  {
> +    /* NB: updates to XDRStaticBlockObject should propagate to CloneStaticBlockObject. */

Ditto for "propagate".

@@ +1018,5 @@
>  js::XDRStaticBlockObject(XDRState<XDR_DECODE> *xdr, JSScript *script, StaticBlockObject **objp);
> +
> +JSObject *
> +js::CloneStaticBlockObject(JSContext *cx, StaticBlockObject &srcBlock,
> +                           const AutoObjectVector &objects, JSScript *src)

I won't pretend to understand this function -- I don't know anything about StaticBlockObject.
Comment 28 Ed Morley [:emorley] 2012-05-04 09:31:57 PDT
Followup to fix warnings:
Comment 29 Luke Wagner [:luke] 2012-05-04 09:39:07 PDT
On failure return values, I'm insufferable.

Note You need to log in before you can comment on or make changes to this bug.