Closed Bug 591836 Opened 14 years ago Closed 14 years ago

JM: Make FrameState work for large, linear scripts

Categories

(Core :: JavaScript Engine, defect)

x86_64
Linux
defect
Not set
normal

Tracking

()

RESOLVED FIXED
Tracking Status
blocking2.0 --- beta7+

People

(Reporter: dvander, Assigned: dvander)

References

Details

(Whiteboard: [ETA: 9/24] fixed-in-tracemonkey[suspect-regress-ts_cold_generated_med][suspect-regress-dromaeo_css][suspect-regress-dromaeo_jslib])

Attachments

(10 files)

8.71 KB, patch
dmandelin
: review+
Details | Diff | Splinter Review
28.98 KB, patch
dmandelin
: review+
Details | Diff | Splinter Review
4.97 KB, patch
dmandelin
: review+
Details | Diff | Splinter Review
3.33 KB, patch
dmandelin
: review+
Details | Diff | Splinter Review
10.51 KB, patch
dmandelin
: review+
Details | Diff | Splinter Review
2.16 KB, patch
dmandelin
: review+
Details | Diff | Splinter Review
13.81 KB, patch
dmandelin
: review+
Details | Diff | Splinter Review
4.27 KB, patch
dmandelin
: review+
Details | Diff | Splinter Review
6.36 KB, application/javascript
Details
2.31 KB, patch
dmandelin
: review+
Details | Diff | Splinter Review
The test case in bug 591631 reveals that FrameState can exhibit quadratic behavior with large, linear scripts. This was originally required to safely sync for GC points. Now that we have conservative GC, and precise stack depth syncing at ops, we can make this algorithm much simpler.
blocking2.0: --- → beta6+
This patch removes the "base" array in FrameState. "base" was a vector of FrameEntry pointers, and was used to tell whether a FrameEntry was tracked. Now, the "tracked" bit is stored directly in the FrameEntry.

The purpose of this patch is to ween FrameState off the tracker.

At this patch (baseline):
 interations: 100; total dur: 10073ms; per item: 100.73ms
Attachment #470369 - Flags: review?(dmandelin)
This patch changes the register pinning invariant such that RegisterState::save is guaranteed NULL if the register is not pinned.

The purpose of this patch is to make it possible to walk FrameState::regstate and correctly determine register assignment. A side effect is that all uses of pinReg must be matched by either unpinReg, or unpinKilledReg. It's no longer acceptable to rely on sync-and-kill implicitly unpinning registers.

To make some of this clearer there are also some function renamings.
Attachment #470371 - Flags: review?(dmandelin)
Get rid of a function that's not needed after part 1.
Attachment #470372 - Flags: review?(dmandelin)
Right now we're guaranteed that by walking the tracker backwards, we observe every copy before its backing store.

This patch guarantees the same invariant if we want to walk the stack backwards.
Attachment #470373 - Flags: review?(dmandelin)
This patch rewrites the constant-sync algorithm, and ImmutableSync, in two ways:
 * First, all registers are synced up-front. This removes some complexity, and guarantees that we never have to spill.
 * Next, only |Uses| is traversed, instead of the entire tracker. This works because registers have already been synced.

interations: 100; total dur: 2466ms; per item: 24.66ms
Attachment #470374 - Flags: review?(dmandelin)
This patch changes FrameState::merge() to only look at the register state, rather than the tracker.

interations: 100; total dur: 961ms; per item: 9.61ms
Attachment #470375 - Flags: review?(dmandelin)
This patch tries to decide whether it's more advantageous to use the tracker to find copies, or the stack. The tracker is better when there are many slots, but few items tracked (lots of control flow). The stack is better when the code is very linear, since the tracker must be swept twice: Once to find the best candidate, and another time to fix up references.

Unfortunately this patch doesn't help a whole lot because we create lots of copies for arithmetic, which sets the "isCopied" bit, even though a pop is imminent. That means we perform a very wide search for copies and come up with nothing.

We could improve this by precisely tracking how many times each FE is copied. The bookkeeping overhead may be worth it. In this case, it could save ~130ms.

interations: 100; total dur: 894ms; per item: 8.94ms
Attachment #470377 - Flags: review?(dmandelin)
This patch makes syncAndKill() simpler and faster:
 * Registers are synced up-front.
 * Only stack uses are scanned.
 * Any live registers are re-scanned at the end and evicted. This is cheap because they are guaranteed to have been synced.

interations: 100; total dur: 843ms; per item: 8.43ms
Attachment #470378 - Flags: review?(dmandelin)
Out of curiousity... how does interp do on the same hardware for the relevant testcase?
Attachment #470369 - Flags: review?(dmandelin) → review+
Comment on attachment 470371 [details] [diff] [review]
part 2: new register pinning invariants

Looks good. Pinning seems much improved. 

This comment has a lot of redundancy:

>+    /*
>+     * Some RegisterState invariants.
>+     *
>+     *  If |fe| is non-NULL, |save| is NULL.
>+     *  If |save| is non-NULL, |fe| is NULL.
>+     *  That is, both |fe| and |save| cannot be non-NULL.
>+     *
>+     *  If either |fe| or |save| is non-NULL, the register is not in freeRegs.
>+     *  If both |fe| and |save| are NULL, the register is either in freeRegs,
>+     *  or owned by the compiler.
>+     */
>     struct RegisterState {

The first 3 are all equivalent. I'd pick just one, probably the last. 

The last section is good. Except you might want to gloss on "owned by compiler". I think this expands to "the compiler code for the current op is managing the register out of the FrameState" or something like that.

Otherwise, r+.
Attachment #470371 - Flags: review?(dmandelin) → review+
Attachment #470372 - Flags: review?(dmandelin) → review+
Attachment #470373 - Flags: review?(dmandelin) → review+
Attachment #470374 - Flags: review?(dmandelin) → review+
Attachment #470375 - Flags: review?(dmandelin) → review+
Attachment #470377 - Flags: review?(dmandelin) → review+
Attachment #470378 - Flags: review?(dmandelin) → review+
(In reply to comment #7)
> Unfortunately this patch doesn't help a whole lot because we create lots of
> copies for arithmetic, which sets the "isCopied" bit, even though a pop is
> imminent. That means we perform a very wide search for copies and come up with
> nothing.

I was idly taking a look at the compiler, and there are several places (including very common ops) where the following code pattern is used:

>    /* Perform a swap. */
>    frame.dup2();
>    frame.shift(-3);
>    frame.shift(-1);

Would it be worth introducing a function frame.swap() which swaps the top two entries without unnecessary copying?
(In reply to comment #11)
> (In reply to comment #7)
> >    /* Perform a swap. */
> >    frame.dup2();
> >    frame.shift(-3);
> >    frame.shift(-1);
> 
> Would it be worth introducing a function frame.swap() which swaps the top two
> entries without unnecessary copying?

It probably doesn't matter too much, because the copying should only occur at compile time--the register allocator should be able to avoid generating actual copies. dvander, is that right, or is there a chance this could create more memory moves?

Adding a swap() function to FrameState would make the API a little nicer, though.
(In reply to comment #12)
Yes, compile-time only, but comment #11 might have been concerned about the compile-time cost of uncopy().

The copy created by the swap will live at the end of the stack, so the uncopy() operation, if it needs to happen, should be very fast. But we should verify this. The big problem is code like:

  GETLOCAL 2
  GETLOCAL 4000
  ADD
  SETLOCAL 2

The GETLOCAL 2 creates a copy on the stack, and the distance between slot 2 and the copy is guaranteed to be at least (4000 - 2 + 1). The ADD pops the copy, but slot 2 is still marked as being copied.

When SETLOCAL 2 comes into play, uncopy() has to sweep for any copies, but there are none.
Yes, I was worried about compile-time cost.  I was having trouble convincing myself that the uncopy() wouldn't be expensive if there were lots of variables around for the frame to track.  Oddly enough, when I try to benchmark it, it goes the opposite direction: compilation time for my "lots of variables" version is about 1/3 of that for a similar test without lots of variables.  Hmmm...

Test case 1: lots of variables:

d1=(new Date).getTime(); 
a="da=(new Date).getTime();"; 
for (i=0;i<100000;i++){a+="v"+i+"=0;";} 
for (i=0;i<100000;i++){a+="v0+=v"+i+";"} 
d2=(new Date).getTime(); 
eval(a); d3=da; d4=(new Date).getTime(); 
eval(a); d5=da; d6=(new Date).getTime(); 
print("Generate:",d2-d1,"Compile:",d3-d2,"(",d5-d4,") Execute:",d4-d3,"(",d6-d5,")");

Test case 2: not so many variables:

d1=(new Date).getTime(); 
a="da=(new Date).getTime();"; 
for (i=0;i<100000;i++){a+="v0=0;";} 
for (i=0;i<100000;i++){a+="v0+=v0;"} 
d2=(new Date).getTime(); 
eval(a); d3=da; d4=(new Date).getTime(); 
eval(a); d5=da; d6=(new Date).getTime(); 
print("Generate:",d2-d1,"Compile:",d3-d2,"(",d5-d4,") Execute:",d4-d3,"(",d6-d5,")");

Each test case builds a big string and then calls eval() on it twice, and reports timing results.  For the first "lots of variables" version, my aged laptop gets:

/c/mozilla-build/tracemonkey-802d34381fe4/js/src/build_opt
$ js -m twftest3a.js
Generate: 209 Compile: 1138 ( 1086 ) Execute: 199 ( 165 )

For the second "not so many variables" version, I get:
/c/mozilla-build/tracemonkey-802d34381fe4/js/src/build_opt
$ js -m twftest4a.js
Generate: 36 Compile: 2858 ( 3153 ) Execute: 133 ( 133 )
The first script is so long it doesn't JIT compile. We don't yet support some of the more obscure opcodes that occur in extremely long scripts (in this case, JSOP_INDEXBASE).

Neither test uses the "var" keyword, so the variables will not be tracked as local. Instead they'll be name lookups.

Wrapping the code inside a function and using the var keyword should let you test uncopy() performance. Keep in mind that the number of local variables is capped at 65,535 or something, and that uncopy() is more likely to be slow if it had an early declaration and late use.
Thanks!  Update testcases which maybe look at the right thing:

1. Creates lots of local variables, then repeatedly uses the earliest-declared of these.

d1=(new Date).getTime(); 
a="da=(new Date).getTime(); function f() {"; 
for (i=0;i<10000;i++){a+="var v"+i+"=0;";} 
for (i=0;i<10000;i++){a+="v0+=v"+i+";"}
a+="} f();"; 
d2=(new Date).getTime(); 
eval(a); d3=da; d4=(new Date).getTime(); 
eval(a); d5=da; d6=(new Date).getTime(); 
print("Generate:",d2-d1,"Compile:",d3-d2,"(",d5-d4,") Execute:",d4-d3,"(",d6-d5,")");

vs. 

2. similar version using only one variable:

d1=(new Date).getTime(); 
a="da=(new Date).getTime(); function f() {"; 
for (i=0;i<10000;i++){a+="var v0=0;";} 
for (i=0;i<10000;i++){a+="v0+=v0;"}
a+="} f();"; 
d2=(new Date).getTime(); 
eval(a); d3=da; d4=(new Date).getTime(); 
eval(a); d5=da; d6=(new Date).getTime(); 
print("Generate:",d2-d1,"Compile:",d3-d2,"(",d5-d4,") Execute:",d4-d3,"(",d6-d5,")");

Results:
1. js -m twftest3b.js
Generate: 23 Compile: 364 ( 331 ) Execute: 29 ( 29 )

2. js -m twftest4b.js
Generate: 2 Compile: 59 ( 59 ) Execute: 22 ( 23 )

This time the result is more as expected, at least in terms of more variables being more expensive.
Playing with this kind of testcase further, it doesn't look like there is anything pathological -- the expense seems to be just the overhead for adding those variables.  So sorry for the distraction...
For what its worth, the "lots of local variables" testcase I was playing with spends most of its time in the parser -- it triggers quadratic-in-the-number-of-local-variables behavior via a linear lookup for adding new variables.  The stack trace to the hot routine is like:

 	js.exe!js::Shape::search(js::Shape * * startp=0x01507a98, jsid id={...}, bool adding=false)  Line 788 + 0x7 bytes	C++
 	js.exe!JSFunction::lookupLocal(JSContext * cx=0x01271a18, JSAtom * atom=0x01f8d6e0, unsigned int * indexp=0x00000000)  Line 3236 + 0x21 bytes	C++
>	js.exe!BindVarOrConst(JSContext * cx=0x01271a18, BindData * data=0x0013f0b4, JSAtom * atom=0x01f8d6e0, JSTreeContext * tc=0x0013f2c8)  Line 3668 + 0x15 bytes	C++
 	js.exe!js::Parser::variables(bool inLetHead=false)  Line 5992 + 0x1c bytes	C++
 	js.exe!js::Parser::statement()  Line 5578 + 0xd bytes	C++
 	js.exe!js::Parser::statements()  Line 3189 + 0x8 bytes	C++
 	js.exe!js::Parser::functionBody()  Line 1309 + 0x8 bytes	C++
 	js.exe!js::Parser::functionDef(JSAtom * funAtom=0x007f9e50, js::Parser::FunctionType type=GENERAL, unsigned int lambda=0)  Line 2933 + 0xb bytes	C++
 	js.exe!js::Parser::functionStmt()  Line 3110	C++
 	js.exe!js::Parser::statement()  Line 4732 + 0xb bytes	C++
 	js.exe!js::Compiler::compileScript(JSContext * cx=0x01271a18, JSObject * scopeChain=0x015021f8, JSStackFrame * callerFrame=0x00e30108, JSPrincipals * principals=0x00000000, unsigned long tcflags=33693696, const wchar_t * chars=0x02b00040, unsigned int length=573934, _iobuf * file=0x00000000, const char * filename=0x01280471, unsigned int lineno=8, JSString * source=0x01f3cf00, unsigned int staticLevel=2)  Line 864 + 0x8 bytes	C++
 	js.exe!obj_eval(JSContext * cx=0x01271a18, unsigned int argc=1, js::Value * vp=0x00e30160)  Line 1242 + 0x33 bytes	C++
(In reply to comment #18)
> For what its worth, the "lots of local variables" testcase I was playing with
> spends most of its time in the parser -- it triggers
> quadratic-in-the-number-of-local-variables behavior via a linear lookup for
> adding new variables.  The stack trace to the hot routine is like:
> 
>      js.exe!js::Shape::search(js::Shape * * startp=0x01507a98, jsid id={...},
> bool adding=false)  Line 788 + 0x7 bytes    C++

Whoops -- need some maybeHash love there, but it's not on the path specialized for JSFunction::addLocal. Filed bug 595918 on this, must fix.

/be
Attached file another test case
This test case also exhibits pathological behaviour (10 seconds to compile), because the compile-time algorithms deoptimize inside |try| blocks.

Our method of handling |try| is to sync the entire frame on every slow exit, so when jumping to the exception handler, live variables are readable. That's crap both for test cases like this, and in general it causes code explosion.

An alternate solution is to sync immediately when setting local variables/args inside try blocks. It'll slightly slow down code inside |try|. v8 seems to do something similar, at a slightly greater expense (it seems to require a load before the store - maybe storing to a scope object).
Amusingly, with this patch, we run the |try| test case 1% faster than the non-|try| version. v8 is 20% slower on the |try| version.

(Either way, we're 2X slower than v8.)
Attachment #474935 - Flags: review?(dmandelin)
Attachment #474935 - Flags: review?(dmandelin) → review+
Whiteboard: [ETA: 9/24]
http://hg.mozilla.org/tracemonkey/rev/1af8f0c895bc
Whiteboard: [ETA: 9/24] → [ETA: 9/24] fixed-in-tracemonkey
http://hg.mozilla.org/tracemonkey/rev/c6000ca7e7fd

Backed out from PGO bustage.
Whiteboard: [ETA: 9/24] fixed-in-tracemonkey → [ETA: 9/24]
Blocks: 598839
Switching to --enable-shared-js let PGO complete for me on try, so I recommend that as a way to land this hotly-desired patch!
http://hg.mozilla.org/tracemonkey/rev/1bbc0fc10747
Whiteboard: [ETA: 9/24] → [ETA: 9/24] fixed-in-tracemonkey
Blocks: 596837
Blocks: 600373
Depends on: 600419
http://hg.mozilla.org/mozilla-central/rev/0ff917ab21e4
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Extrapolating from mass-orange on the tinderbox and from IRC remarks, this probably caused mozjs.dll not to get properly packaged with Windows builds.
(In reply to comment #27)
> Extrapolating from mass-orange on the tinderbox and from IRC remarks, this
> probably caused mozjs.dll not to get properly packaged with Windows builds.

Bug 600493.
Depends on: 600493
Whiteboard: [ETA: 9/24] fixed-in-tracemonkey → [ETA: 9/24] fixed-in-tracemonkey[suspect-regress-ts_cold_generated_med]
A changeset from this bug was associated with a XP Ts, Cold MED Dirty Profile regression on Firefox. boo-urns :(

  Previous: avg 420.356 stddev 3.679 of 30 runs up to a9d1ad0bc386
  New     : avg 430.832 stddev 1.905 of 5 runs since a60414d076b5
  Change  : +10.475 (2.49% / z=2.847)
  Graph   : http://mzl.la/bZFaB3

The regression occurred from changesets in the following range:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=a9d1ad0bc386&tochange=a60414d076b5

The tag [suspect-regress-ts_cold_generated_med] has been added to the status whiteboard;
please remove it only once you have confirmed this bug is not the cause
of the regression.
Whiteboard: [ETA: 9/24] fixed-in-tracemonkey[suspect-regress-ts_cold_generated_med] → [ETA: 9/24] fixed-in-tracemonkey[suspect-regress-ts_cold_generated_med][suspect-regress-dromaeo_css]
A changeset from this bug was associated with a Win7 Dromaeo (CSS) regression on Firefox. boo-urns :(

  Previous: avg 2014.419 stddev 40.480 of 30 runs up to a9d1ad0bc386
  New     : avg 1901.610 stddev 12.432 of 5 runs since a60414d076b5
  Change  : -112.809 (-5.6% / z=2.787)
  Graph   : http://mzl.la/9gFu4n

The regression occurred from changesets in the following range:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=a9d1ad0bc386&tochange=a60414d076b5

The tag [suspect-regress-dromaeo_css] has been added to the status whiteboard;
please remove it only once you have confirmed this bug is not the cause
of the regression.
Whiteboard: [ETA: 9/24] fixed-in-tracemonkey[suspect-regress-ts_cold_generated_med][suspect-regress-dromaeo_css] → [ETA: 9/24] fixed-in-tracemonkey[suspect-regress-ts_cold_generated_med][suspect-regress-dromaeo_css][suspect-regress-dromaeo_jslib]
A changeset from this bug was associated with a Win7 Dromaeo (jslib) regression on Firefox. boo-urns :(

  Previous: avg 127.610 stddev 4.222 of 30 runs up to a9d1ad0bc386
  New     : avg 116.384 stddev 0.751 of 5 runs since a60414d076b5
  Change  : -11.226 (-8.8% / z=2.659)
  Graph   : http://mzl.la/bZu5UP

The regression occurred from changesets in the following range:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=a9d1ad0bc386&tochange=a60414d076b5

The tag [suspect-regress-dromaeo_jslib] has been added to the status whiteboard;
please remove it only once you have confirmed this bug is not the cause
of the regression.
A changeset from this bug was associated with a XP Dromaeo (CSS) regression on Firefox. boo-urns :(

  Previous: avg 2045.275 stddev 49.676 of 30 runs up to a9d1ad0bc386
  New     : avg 1936.120 stddev 13.937 of 5 runs since a60414d076b5
  Change  : -109.155 (-5.34% / z=2.197)
  Graph   : http://mzl.la/b0dlou

The regression occurred from changesets in the following range:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=a9d1ad0bc386&tochange=a60414d076b5

The tag [suspect-regress-dromaeo_css] has been added to the status whiteboard;
please remove it only once you have confirmed this bug is not the cause
of the regression.
A changeset from this bug was associated with a XP Dromaeo (jslib) regression on Firefox. boo-urns :(

  Previous: avg 129.703 stddev 4.099 of 30 runs up to a9d1ad0bc386
  New     : avg 117.954 stddev 0.660 of 5 runs since a60414d076b5
  Change  : -11.749 (-9.06% / z=2.866)
  Graph   : http://mzl.la/avZij4

The regression occurred from changesets in the following range:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=a9d1ad0bc386&tochange=a60414d076b5

The tag [suspect-regress-dromaeo_jslib] has been added to the status whiteboard;
please remove it only once you have confirmed this bug is not the cause
of the regression.
Blocks: 601109
No longer blocks: 601109
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: