10% tpaint regression with compartment-per-global

RESOLVED FIXED in mozilla15



JavaScript Engine
5 years ago
2 years ago


(Reporter: bholley, Unassigned)


Dependency tree / graph

Firefox Tracking Flags

(Not tracked)



(5 attachments, 1 obsolete attachment)

This is the very last CPG blocker.

CPG gives us a 7-23% tpaint regression, where tpaint measures the time it takes to open up a new window and do the initial paint.

I spent some time with a profiler, and the issue appears to be that we're spending time doing script analysis for scripts we cloned across globals.

Gecko caches scripts in various places (like the XUL prototype cache). These scripts live in a compartment, so luke added some code in JS_ExecuteScript to just clone them if they come from the wrong compartment:

So my question is, given that we only need to clone the script so that it lives in a new compartment, is there some way we can reuse the old analysis? It seems like there isn't a huge amount of stuff that changed. These scripts were being reused against multiple globals before, so they can't be tied to any particular global. It's just a compartment issue.

CCing people who know about this stuff. Compartment-per-global is ready to land, modulo this issue. If you guys want it as bad as I do, then hopefully someone can lend a hand to push this over the line. :-)
Also, the hot function seems to be js::analysis::ScriptAnalysis::analyzeBytecode, at least in the profiler.

Comment 2

5 years ago
One simple idea is to turn off TI in the clone's destination compartment.  There is even reason that this may be an overall speedup: clearly this code is not running for long if analysis time is significant (although that may just be how tpaint exercises it).  Also, since everything gets thrown out on every GC, every time we run this chrome code we may be analyzing the scripts anew.

If nothing else, the patch should be simple (is it as easy as just cx->compartment->inferenceEnabled = false?) so perhaps worth just try'ing.
Right now we analyzeBytecode in all scripts that are executed, whether TI is enabled or not.  I'm don't think this behavior is actually necessary and will try fixing this first thing tomorrow.
(In reply to Brian Hackett (:bhackett) from comment #3)
> Right now we analyzeBytecode in all scripts that are executed, whether TI is
> enabled or not.

Right. These are chrome scripts, so TI is already disabled on the compartment.

> I'm don't think this behavior is actually necessary and will try fixing
> this first thing tomorrow.

Huzzah! Thanks Brian.
Created attachment 619580 [details] [diff] [review]

Patch which works in the shell, about to push to try.
Created attachment 619634 [details]
updated patch

First patch didn't work, so Brian and I did some pair debugging. Attaching the resulting patch.

This patch makes analyzeBytecode disappear from the profile, but unfortunately doesn't make the tpaint regression go away. So it must be something else... :\
Attachment #619580 - Attachment is obsolete: true
I've got to head out for the evening, though I'll check mail before heading to bed when I get home. If we're lucky, whatever luke discovers about the cssquery regression might indicate what's going on here.
Alright, so it looks like bhackett's patch didn't do the trick here. I've been running numbers all morning. Each run opens 50 windows, and I did a number of runs for various configurations. Here are the parameters:

precpg/cpg: with/without cpg
cloneall: a patch that clones _every_ script in JS_ExecuteScript.
bhackett3: bhackett's 3rd patch (demonstrated to take analyzeBytecode out of the profile)

Here are my results (each number represents the average opening time (in ms) of a 50-window run, with the first open removed from the average):

precpg + cloneall + bhackett3: 192,194,192
precpg + cloneall: 199, 201, 193,195,198 (not sure why there's such high variance here)
precpg: 181,185,187,181

cpg + cloneall + bhackett3: 200,202,202
cpg + bhackett3: 202, 200,199
cpg: 200, 201, 202

So this tells us a few things. First, cloneall makes a significant difference when applied to precpg, but almost none when applied to cpg. This implies that the cloning is the source of the performance regression. Second, bhackett3 made either slight or no impact in clone-heavy cases, even though it made analyzeBytecode disappear off the profile.

So all this seems to indicate that there's some other way that CloneScript is hurting us. I'm going to see if I can convince Instruments to give me another lead.
Summary: Reduce the performance impact of analyzing scripts cloned across globals → 10% tpaint regression with compartment-per-global
I just checked that none of the numbers change if I turned off TI, chrome jit, and content jit. So the issue here is somewhere else.
So I just added some test code to precpg that _just_ does a CloneScript in JS_ExecuteScript, but doesn't actually use the result (it still executes the old scriptArg, if it can).

I got 189,191,193,197. So a good chunk (half-ish?) of the overhead is just cloning the script, regardless of whether we're running it.
At a glance, js::CloneScript looks hideously inefficient --- it XDR-encodes the script into a byte stream, then decodes it into the new compartment.
You're doing a window.open, right?  So the scripts you're cloning include chrome://browser/content/browser.js, at least.  All 13000 lines of it.  How long does it take to CloneScript just that one script?
Created attachment 619948 [details]
tpaint setup

I had to do some hacking on tpaint, so I'm attaching my resulting working directory (with a git repo inside of it). There are two branches, "master" and "profiling". The latter does programmatic enabling/disabling of the profiler.

To run, you need to do the following:
* enable remote xul with a pref
* enable mozafterpaint with a pref
* ./firefox -P yourprofile -chrome file:///full/path/to/winopen.xul?mozafterpaint=1%26phase1=20

The phase1= parameter specifies how many runs it does.
Rather than making CloneScript faster, it probably makes the most sense just to do bug 679940.
This is the log of the clone going on: http://pastebin.mozilla.org/1610307

Comment 16

5 years ago
Sharking "window.open" (which exercises this whole JS_Execute/CloneScript path) shows about 5.1% of total time under CloneScript and 2.2% total time under XDRAtom.  Since atoms are shareable across compartments, this suggests a hack to avoid serializing+deserializing+hashing atoms.
(In reply to Luke Wagner [:luke] from comment #16)
> Sharking "window.open" (which exercises this whole JS_Execute/CloneScript
> path) shows about 5.1% of total time under CloneScript and 2.2% total time
> under XDRAtom.  Since atoms are shareable across compartments, this suggests
> a hack to avoid serializing+deserializing+hashing atoms.

That sounds good - we can do that before the landing, and track bug 679940 (but not block on it). Is this the sort of patch you can put together in a jiffy?

Comment 18

5 years ago
Actually, there is more CloneScript than I thought; a lot of it comes from JS_CloneFunctionObject (a lot of that from XBL) and more XDRAtom than I thought (under XDRStaticBlock).  I'll try to do this hack and see what it gets us.

Comment 19

5 years ago
Created attachment 620198 [details] [diff] [review]
rm codeString

Simple preparatory patch: assert (as is already true) that script constant strings are atoms.
Attachment #620198 - Flags: review?(n.nethercote)

Comment 20

5 years ago
Created attachment 620199 [details] [diff] [review]
improve getLocalNamesArray

getLocalNamesArray should provide BindingKind in addition to name.
Attachment #620199 - Flags: review?(jwalden+bmo)
Comment on attachment 620198 [details] [diff] [review]
rm codeString

> template<XDRMode mode>
> bool
> js::XDRAtom(XDRState<mode> *xdr, JSAtom **atomp)
> {
>     if (mode == XDR_ENCODE) {
>-        JSString *str = *atomp;
>-        return xdr->codeString(&str);
>+        uint32_t nchars = (*atomp)->length();
>+        if (!xdr->codeUint32(&nchars))
>+            return false;
>+        jschar *chars = const_cast<jschar *>((*atomp)->getChars(xdr->cx()));
>+        if (!chars)
>+            return false;
>+        return xdr->codeChars(chars, nchars);
>     }

I was expecting to see a corresponding change in DECODE code for this... I guess there must be a reason why it's not necessary?
Attachment #620198 - Flags: review?(n.nethercote) → review+

Comment 22

5 years ago
(In reply to Nicholas Nethercote [:njn] from comment #21)
Yep.  (Wow, you're fast!  Real patch in when some more try results come in.)

Comment 23

5 years ago
Created attachment 620207 [details] [diff] [review]
make CloneScript fast

This patch implements CloneScript by doing a clone directly (no XDR involved).  This avoids all the atomization overhead as well as the copying overhead to and from script/script->data and the XDR buffer.

Running the same window.open micro-bench, this patch cuts the slowdown almost in half (from 20% slower to 11% slower).  I'm still waiting for the full try run:
but based on the partial results of a previous try push (with a borked assert), I see the tpaint slowdown at ~3% on 10.6 and 10.7.
Attachment #620207 - Flags: review?(n.nethercote)

Comment 24

5 years ago
Green on try
Comment on attachment 620199 [details] [diff] [review]
improve getLocalNamesArray

Review of attachment 620199 [details] [diff] [review]:

You sure we don't want low-bit twiddling instead?  Chicken!
Attachment #620199 - Flags: review?(jwalden+bmo) → review+
Comment on attachment 620207 [details] [diff] [review]
make CloneScript fast

Review of attachment 620207 [details] [diff] [review]:

I don't love this patch, but it'll do.  My main concern is about the JSCompartment maps, see below.

::: js/src/jsfun.cpp
@@ +373,5 @@
>  template<XDRMode mode>
>  bool
>  js::XDRInterpretedFunction(XDRState<mode> *xdr, JSObject **objp, JSScript *parentScript)
>  {
> +    /* NB: updates to XDRInterpretedFunction should propagate to CloneInterpretedFunction. */

You mean "if you modify this function you should modify CloneInterpretedFunction as well"?  If so, can you make the comment clearer?  (Maybe "Keep this in sync with CloneInterpretedFunction".)  And should CloneInterpretedFunction have a similar comment pointing back here?

::: js/src/jsscript.cpp
@@ +397,5 @@
>  template<XDRMode mode>
>  bool
>  js::XDRScript(XDRState<mode> *xdr, JSScript **scriptp, JSScript *parentScript)
>  {
> +    /* NB: updates to XDRScript should propagate to CloneScript. */

Again, I found "propagate" unclear here.

@@ -1137,5 @@
>      /*
>       * We assume that calloc aligns on sizeof(Value) if the size we ask to
>       * allocate divides sizeof(Value).
>       */
> -    JS_STATIC_ASSERT(sizeof(Value) == sizeof(double));

The comment is kind of bogus too.  A better comment would be "We assume that calloc aligns on sizeof(Value) if the size we request is greater than sizeof(Value)."

And can you add an assertion that |data| actually is sizeof(Value)-aligned?  Thanks!

@@ +1749,2 @@
>  JSScript *
> +js::CloneScript(JSContext *cx, JSScript *src)

This function is disturbingly low-level, but I guess you gotta do what you gotta do :/

JSCompartment has three maps holding optional per-JSScript info:  scriptCountsMap, sourceMapMap, debugScriptMap.  Does the info in these maps need to be copied for the new script?  I'm not sure.  If it does, it does;  if it does not, a comment explaining why would be helpful.

@@ +1763,5 @@
> +    size_t size = ScriptDataSize(cx, src->length, src->numNotes(), src->natoms,
> +                                 nobjects, nregexps, ntrynotes, nconsts, nglobals,
> +                                 nClosedArgs, nClosedVars);
> +
> +    uint8_t *data = static_cast<uint8_t *>(cx->calloc_(JS_ROUNDUP(size, sizeof(Value))));

Can you assert that this is sizeof(Value)-aligned as well?

::: js/src/vm/RegExpObject.cpp
@@ +757,5 @@
>  template<XDRMode mode>
>  bool
>  js::XDRScriptRegExpObject(XDRState<mode> *xdr, HeapPtrObject *objp)
>  {
> +    /* NB: updates to XDRScriptRegExpObject should propagate to CloneScriptRegExpObject. */

Ditto for "propagate", and the comment pointing back.

::: js/src/vm/ScopeObject.cpp
@@ +903,5 @@
>  template<XDRMode mode>
>  bool
>  js::XDRStaticBlockObject(XDRState<mode> *xdr, JSScript *script, StaticBlockObject **objp)
>  {
> +    /* NB: updates to XDRStaticBlockObject should propagate to CloneStaticBlockObject. */

Ditto for "propagate".

@@ +1018,5 @@
>  js::XDRStaticBlockObject(XDRState<XDR_DECODE> *xdr, JSScript *script, StaticBlockObject **objp);
> +
> +JSObject *
> +js::CloneStaticBlockObject(JSContext *cx, StaticBlockObject &srcBlock,
> +                           const AutoObjectVector &objects, JSScript *src)

I won't pretend to understand this function -- I don't know anything about StaticBlockObject.
Attachment #620207 - Flags: review?(n.nethercote) → review+

Comment 27

5 years ago
Thanks again njn!

Last Resolved: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla15
Followup to fix warnings:

Comment 29

5 years ago
On failure return values, I'm insufferable.
Depends on: 752205
Depends on: 1245233
You need to log in before you can comment on or make changes to this bug.