Closed Bug 504587 Opened 15 years ago Closed 15 years ago

TM: investigate v8-richards performance

(Core :: JavaScript Engine, defect, P1)
x86 Linux

RESOLVED DUPLICATE of bug 518487
mozilla1.9.2

(Reporter: sayrer, Assigned: brendan)

Attachments: 1 file, 22 obsolete files (5.16 KB, patch)
This test supposedly traces relatively well, but our perf is not particularly competitive (though it is much faster than Fx3).

Julian is going to take a look at the quality of LIR and asm we're generating.
Depends on: 503424
A couple of preliminary comments:

js/src/v8/run-richards.js isn't so useful for repeatable comparative
benchmarking, because it appears to run as many iterations as it can
in one second.  Am using SunSpider/tests/v8-richards.js instead, as it
repeatably runs for 350 iterations.

Initial results from a 32-bit optimized jsshell show that it
(jsshell + generated code) runs for 98.2 million basic blocks.
Of these, 83.5% are in the functions js_UnboxDouble(int),
js_UnboxInt32(int), js_BoxInt32 and js_BoxDouble.  The number
of basic blocks in JIT-generated code is only about 1% of the
total, by comparison.

Of course this doesn't say anything specific about instruction
counts or where all the cycles went.  But it gives some ballpark
feel for what the main costs are.  Continuing to investigate.
status update:

jsshell running v8-richards runs for about 3.5 billion x86 insns, depending
on build options.  Of these:

246 050 915 insns in 22 368 256 calls to js_UnboxDouble
 68 058 108 insns in  7 562 012 calls to js_UnboxInt32
 70 388 157 insns in  3 351 817 calls to js_BoxInt32
 94 199 942 insns in  2 140 908 calls to js_BoxDouble

Further investigation shows almost all of the insns in these 4 calls are
on the fast paths (lowest bit is set, so just >> 1 and return, etc).
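The fast paths in question can be sketched as follows.  This is a
sketch of the tagging scheme described (low bit set means the value is
an int, payload in the remaining bits), not necessarily the exact jsval
encoding; the function names just mirror the ones above.

```javascript
// Sketch of the kind of tagged-value fast path described above.
// Tag layout here is illustrative: low bit set => the value is an
// int, stored in the upper 31 bits.

function boxInt32(i) {
  // Shift the payload up and set the low tag bit.
  return (i << 1) | 1;
}

function unboxInt32(v) {
  if ((v & 1) === 1) {
    // Fast path: tag bit set, just shift the payload back down.
    return v >> 1;
  }
  // Slow path (doubles, objects, ...) would go here.
  throw new Error("not an int jsval");
}
```

On the fast path this is just a mask test, a shift and a return, which
is why inlining it into LIR looked attractive.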

I implemented a partial inline of js_UnboxInt32, so that the fast 
path is done directly in LIR, calling out to the fn only if the
lowest bit is not set (using forward branches and labels in LIR).

That works, but is slower.  Profiling shows a substantial increase in 
icache misses and branch mispredictions.  Am investigating why.  This
benchmark appears to have a surprisingly high icache miss rate anyway,
about 1.2%.  Considering that there are only 34 active traces, this
is a bit strange.  (can 34 traces fill up a 64k I1 cache?)
This is very obviously WIP, but shows it's relatively easy to do.
It gives a small (1%) slowdown on v8-richards, so is not recommended.

New strategy (now I realise I'm looking for roughly a factor of
2 performance loss) is to build chrome and try to reduce the size of
the test case whilst preserving the performance difference.
Some more low level figures.  These compare 32-bit TM compiled "-O2" with
32-bit V8 compiled "-O3 -fomit-frame-pointer", so TM is at a small relative
disadvantage, but not much.

What's evident is:

* V8 requires just over half the instructions that TM does

* V8 manages to do the whole thing in fixed point, no FP,
  which is not the case with TM.  TM does almost 29 million
  FP loads/stores, V8 does about 34 thousand.

* Both V8 and TM comfortably keep the data in D1 (essentially
  zero D1 miss rate), but only V8 manages to keep the insns in
  I1 (268k I1 misses vs almost 42 million for TM).

I'm sure this has nothing much to do with Nanojit -- it must
be a higher level phenomenon to do with jstracer.


-------------- #### v8 #### --------------
user times: 0.308 0.332 0.360

==24316== I   refs:      1,669,315,545
==24316== I1  misses:          268,223
==24316== L2i misses:            8,569
==24316== I1  miss rate:          0.01%
==24316== L2i miss rate:          0.00%
==24316==
==24316== D   refs:        713,605,023  (500,682,881 rd   + 212,922,142 wr)
==24316== D1  misses:          148,545  (     88,461 rd   +      60,084 wr)
==24316== L2d misses:           40,063  (      6,893 rd   +      33,170 wr)
==24316== D1  miss rate:           0.0% (        0.0%     +         0.0%  )
==24316== L2d miss rate:           0.0% (        0.0%     +         0.0%  )
==24316==
==24316== L2 refs:             416,768  (    356,684 rd   +      60,084 wr)
==24316== L2 misses:            48,632  (     15,462 rd   +      33,170 wr)
==24316== L2 miss rate:            0.0% (        0.0%     +         0.0%  )
==24316==
==24316== Branches:        346,592,089  (343,912,720 cond +   2,679,369 ind)
==24316== Mispredicts:       9,462,357  (  8,114,948 cond +   1,347,409 ind)
==24316== Mispred rate:            2.7% (        2.3%     +        50.2%   )

==24435== IR-level counts by type:
==24435==    Type        Loads       Stores       AluOps
==24435==    -------------------------------------------
==24435==    I1              0            0  688,484,494
==24435==    I8     10,831,622      612,699  135,873,871
==24435==    I16        16,250       20,675      140,534
==24435==    I32   489,816,561  212,667,488 1,496,876,771
==24435==    I64         1,340        2,040        6,550
==24435==    I128            0            0            0
==24435==    F32           870            6            6
==24435==    F64        15,241       18,652       44,521
==24435==    V128            0            0            0


-------------- JSSHELL ORIGINAL --------------
user times: 0.864 0.872 0.888

==21158== I   refs:      3,584,825,170
==21158== I1  misses:       41,953,492
==21158== L2i misses:            5,672
==21158== I1  miss rate:          1.17%
==21158== L2i miss rate:          0.00%
==21158==
==21158== D   refs:      1,670,310,720  (1,122,641,178 rd   + 547,669,542 wr)
==21158== D1  misses:          158,251  (      132,178 rd   +      26,073 wr)
==21158== L2d misses:           36,050  (       17,362 rd   +      18,688 wr)
==21158== D1  miss rate:           0.0% (          0.0%     +         0.0%  )
==21158== L2d miss rate:           0.0% (          0.0%     +         0.0%  )
==21158==
==21158== L2 refs:          42,111,743  (   42,085,670 rd   +      26,073 wr)
==21158== L2 misses:            41,722  (       23,034 rd   +      18,688 wr)
==21158== L2 miss rate:            0.0% (          0.0%     +         0.0%  )
==21158==
==21158== Branches:        542,647,807  (  542,522,855 cond +     124,952 ind)
==21158== Mispredicts:      11,744,745  (   11,713,792 cond +      30,953 ind)
==21158== Mispred rate:            2.1% (          2.1%     +        24.7%   )

==25013== IR-level counts by type:
==25013==    Type        Loads       Stores       AluOps
==25013==    -------------------------------------------
==25013==    I1              0            0 1,234,299,003
==25013==    I8      1,630,040      591,149  131,858,369
==25013==    I16     6,766,107    4,364,883    4,385,942
==25013==    I32  1,081,810,453  491,730,952 2,599,907,232
==25013==    I64    30,245,994   25,237,135        4,287
==25013==    I128            0            0            0
==25013==    F32            14            0            0
==25013==    F64     2,176,798   26,686,976   94,042,506
==25013==    V128            0            0    2,140,908
(In reply to comment #1)
> Initial results from a 32-bit optimized jsshell show that it
> (jsshell + generated code) runs for 98.2 million basic blocks.
> Of these, 83.5% are in the functions js_UnboxDouble(int),
> js_UnboxInt32(int), js_BoxInt32 and js_BoxDouble.  The number
> of basic blocks in JIT-generated code is only about 1% of the
> total, by comparison.

Ignore these numbers, they are bogus.  AFAIK all other numbers in
this report are legit, though.
I believe the slowness is at least in part due to inadequate removal
of redundant guards in the LIR of the innermost loop.  That is:

1   Scheduler.prototype.schedule = function () {
2     this.currentTcb = this.list;
3     while (this.currentTcb != null) {
4       if (this.currentTcb.isHeldOrSuspended()) {
5         this.currentTcb = this.currentTcb.link;
6       } else {
7         this.currentId = this.currentTcb.id;
8         this.currentTcb = this.currentTcb.run();
9       }
10   }
11 };

The hottest trace begins at the start of the while loop and runs
through the then-branch of the if.  It contains 48 guards.  Examination
of these shows that at least one of the guards to do with the access
of .currentTcb at the line marked 3 is redundant w.r.t. guards
which appear earlier in the trace, which have to do with the access of
.currentTcb at the line marked 2.

Am investigating why these didn't get CSEd and removed.

I think all the hot traces involve this loop, so perhaps they are all
similarly afflicted.
Julian, this is great. We've known of redundant guard problems since last year; it will be huge to eliminate them. Hope it's straightforward once diagnosed.

/be
In the traces I looked at, the redundant guards were not being
removed because their guard expressions were not CSEd together, due
to containing loads.

It appears that LIR offers a way to say a load is CSEable, and NJ knows
what to do with these, but jstracer.cpp almost completely fails to mark
any loads it generates as CSEable.  Grepping for LIR_ldc in jstracer.cpp
shows just one instance.
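
The mechanism being described can be sketched as a toy filter over a
linear instruction stream.  This is only an illustration of the idea,
not Nanojit's actual CseFilter, and the instruction representation is
made up:

```javascript
// Toy CSE pass over a linear instruction stream: loads marked
// cseable are commoned; any store conservatively invalidates all
// remembered loads, since it might alias them.

function cseLoads(instrs) {
  const seen = new Map();   // "base+offset" -> name of earlier load
  const out = [];
  for (const ins of instrs) {
    if (ins.op === "ld") {
      const key = ins.base + "+" + ins.offset;
      if (ins.cseable && seen.has(key)) {
        // Reuse the earlier load's result; emit no new load.
        out.push({ op: "alias", name: ins.name, of: seen.get(key) });
        continue;
      }
      if (ins.cseable) seen.set(key, ins.name);
      out.push(ins);
    } else if (ins.op === "st") {
      seen.clear();         // a store may clobber anything
      out.push(ins);
    } else {
      out.push(ins);
    }
  }
  return out;
}
```

Two cseable loads of the same address common up, but any intervening
store kills the remembered loads -- which is why marking loads cseable
(and, later, adding real aliasing information) matters.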

I got a quick upper bound on how much better we could do if loads 
generated by jstracer were properly annotated.  There are 43 places
in jstracer where loads are formed.  I did a binary search, switching
them individually between cse-able and non-cse-able, to find the largest
set that could be marked cse-able and still have trace-tests.js
work properly, and found that only 5 places needed to be marked
non-cse-able.

Of course this is a complete hack and is unsafe, but it does give an
upper bound.

Results are modest.  In the hottest trace, the number of guards
is reduced from 45 to 41.  Overall run time is reduced by around 8%,
and the instruction count is reduced by about 6%, from 3.584 billion
to 3.370 billion.

So not the factor 2 we need, but a start.  One interesting thing is
that if I make it treat all loads as cse-able, then the number of
guards in the hottest trace falls from 45 to 26.  Of course it then
doesn't work, but it would be interesting to look at why so many more
guards vanish.
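
The search procedure above might be sketched as follows, assuming a
single offending site can be isolated at a time (with interacting
sites something like full delta debugging would be needed).  Here
`runTests` is a hypothetical stand-in for rebuilding and running
trace-tests.js against a given set of cse-able sites:

```javascript
// Find a large subset of load sites that can be marked cse-able
// while the test oracle still passes.  Repeatedly bisects for one
// culprit site and disables it.  Assumes culprits are isolable one
// at a time; `runTests` returns true iff the suite passes.

function findSafeSites(sites, runTests) {
  const enabled = new Set(sites);
  while (!runTests(enabled)) {
    let candidates = [...enabled];
    while (candidates.length > 1) {
      const half = candidates.slice(0, Math.ceil(candidates.length / 2));
      const trial = new Set(enabled);
      for (const s of half) trial.delete(s);
      // If disabling this half fixes the tests, the culprit is in it.
      candidates = runTests(trial) ? half : candidates.slice(half.length);
    }
    enabled.delete(candidates[0]);  // mark this site non-cse-able
  }
  return enabled;
}
```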
I extended the profiling patch in bug 503424 to count the number
of static exits and static instruction bytes in each fragment,
in order to study the effects of messing with CSE & guard removal.

The results are astonishing (or at least, I was astonished).
It seems like the tracer has gone wild on this benchmark.  It
generates 34 traces, the longest of which has no less than 860
exits in it.  The total amount of code generated is 92kb, so it's
no surprise that cachegrind reports serious icache thrashing
on a Core 2 with a 32K I1 (41.9 million I1 misses, vs a mere
268k I1 misses for v8, as per comment #4).

I wonder if there's a way to quantify the extent to which the
tracer is generating multiple overlapping traces.  I'm guessing
here, but 92k of code with 3661 static exits feels excessive for
this small benchmark, and I suspect rampant trace duplication.

The longest fragment, FragID=000034, begins at the top level loop
that calls runRichards() multiple times, although where it goes
after that I don't know.  It seems reasonable (desirable, even)
to compile this whole benchmark into one huge trace.  But it doesn't
seem desirable to compile it into a whole set of mostly mutually
redundant overlapping traces, if that's what is happening.

Profile follows.  "se" is static exits, "sib" is static instruction bytes.

----------------- Per-fragment execution counts ------------------

Total count = 11836222

  0:     (31.55%   3734848)     31.55%   3734848     (se  45, sib 1000)    FragID=000002
  1:     (50.99%   6035396)     19.44%   2300548     (se 149, sib 3304)    FragID=000003
  2:     (63.55%   7521844)     12.56%   1486448     (se 181, sib 4091)    FragID=000010
  3:     (71.98%   8519342)      8.43%    997498     (se 118, sib 2847)    FragID=000016
  4:     (76.35%   9037339)      4.38%    517997     (se 218, sib 5030)    FragID=000004
  5:     (79.31%   9387336)      2.96%    349997     (se  92, sib 2360)    FragID=000023
  6:     (82.08%   9714934)      2.77%    327598     (se  27, sib  807)    FragID=000008
  7:     (84.82%  10039380)      2.74%    324446     (se 185, sib 4274)    FragID=000019
  8:     (87.55%  10363128)      2.74%    323748     (se  52, sib 1294)    FragID=000017
  9:     (90.04%  10657476)      2.49%    294348     (se  67, sib 1700)    FragID=000018
 10:     (92.08%  10899324)      2.04%    241848     (se  52, sib 1261)    FragID=000012
 11:     (93.73%  11093922)      1.64%    194598     (se  23, sib  574)    FragID=000011
 12:     (95.20%  11267869)      1.47%    173947     (se  73, sib 1861)    FragID=000025
 13:     (95.89%  11349767)      0.69%     81898     (se  54, sib 1324)    FragID=000013
 14:     (96.58%  11431315)      0.69%     81548     (se 128, sib 3315)    FragID=000014
 15:     (97.27%  11512513)      0.69%     81198     (se 117, sib 2864)    FragID=000020
 16:     (97.71%  11565711)      0.45%     53198     (se  52, sib 1297)    FragID=000021
 17:     (98.06%  11606658)      0.35%     40947     (se 119, sib 3150)    FragID=000024
 18:     (98.33%  11638856)      0.27%     32198     (se  52, sib 1297)    FragID=000022
 19:     (98.59%  11668954)      0.25%     30098     (se  67, sib 1658)    FragID=000009
 20:     (98.84%  11699050)      0.25%     30096     (se  78, sib 2074)    FragID=000028
 21:     (99.08%  11727048)      0.24%     27998     (se  89, sib 2186)    FragID=000026
 22:     (99.31%  11755046)      0.24%     27998     (se 105, sib 2721)    FragID=000029
 23:     (99.49%  11776044)      0.18%     20998     (se  52, sib 1351)    FragID=000027
 24:     (99.64%  11793541)      0.15%     17497     (se  61, sib 1596)    FragID=000031
 25:     (99.75%  11806139)      0.11%     12598     (se   7, sib  107)    FragID=000005
 26:     (99.85%  11818736)      0.11%     12597     (se 136, sib 3399)    FragID=000015
 27:     (99.96%  11830984)      0.10%     12248     (se  21, sib  700)    FragID=000030
 28:     (99.99%  11834482)      0.03%      3498     (se 132, sib 3253)    FragID=000032
 29:     (99.99%  11835179)      0.01%       697     (se  64, sib 1699)    FragID=000006
 30:     (99.99%  11835527)      0.00%       348     (se 128, sib 3385)    FragID=000007
 31:     (100.00%  11835875)      0.00%       348     (se  57, sib 1544)    FragID=000033
 32:     (100.00%  11836222)      0.00%       347     (se 860, sib 23196)    FragID=000034
 33:     (100.00%  11836222)      0.00%         0     (se   0, sib    0)    FragID=000001

Total code bytes = 92519, total static exits = 3661
Sounds like overspecialization. Look at the individual side exits where the tree splits up to see what we specialized and why it's so variant (maybe object shape?).
Investigating further using enhanced exit-profiling patch in
bug 503424 comment #14.  One thing that's immediately obvious
is that for the top 10 hot frags (accounting for 90% of the total
frag count), most runs never make it to the trace end -- that's
definitely a minority.  Cursory examination suggests traces are
exiting early due to control flow in the app, not due to shape
or type problems.
#11: This is pretty good news. We can actually do something about excessive tail duplication due to control flow. Ed had a mechanism in tamarin tracing to estimate tail duplication and when a trace gets too branchy, it just stops early (inside the loop) and we start a new trace. That new trace is then targeted by all future traces that reach the same point in the program. Essentially we merge traces back together, at the expense of some memory traffic (take a side exit, and immediately reload the state). This inhibits tree compilation and loop-based optimization, but it beats getting annihilated by too many traces. Implementing this would also greatly speed up 3d-cube.
Poking around with the profiler on a couple of other tests
(trace-test, v8-crypto) it looks like it is very common for
hot traces not to make it to the end.  Some do, but many don't.
(In reply to comment #12)
> Ed had a mechanism in tamarin tracing to
> estimate tail duplication and when a trace gets too branchy,

Got any further details on this?  (pointers to code or comments?)

The top 10 frags account for 90.0% of the 11.83 million frag
starts.  Of those 10.64 million, only 2.90 million made it
through to the end.  So we're getting a trace completion
rate of about 27%.

AFAICS the problem isn't that the traces are too branchy; they're
just way too long.  At some point it's pointless to extend the trace
because app-level control flow causes the probability of reaching
the end to decline exponentially.  The problem seems to be that jstracer
is committing to making these huge long traces based on woefully
inadequate profiling data picked up by the interpreter, and as a
result has to make a whole bunch of near-duplicates.

That notwithstanding: generating too much code is totalling I1 but
that can only be a small part of the performance gap w.r.t v8/SFX.
Any way you slice it, we're burning twice as many instructions as
v8, so there must be some translation-level badness happening too.

Investigation continues.

> Implementing this would also greatly speed up 3d-cube.

A quick spin with the profiler doesn't support the notion that 3d-cube
is suffering from heavily duplicated or overlong traces.  Although I'm
not sure if I used the right workload.
(In reply to comment #14)
> (In reply to comment #12)
> > Ed had a mechanism in tamarin tracing to
> > estimate tail duplication and when a trace gets too branchy,
> 
> Got any further details on this?  (pointers to code or comments?)

The general sketch: insert special opcodes into the interpreted bytecode stream which mark merge nodes in the original control flow graph.  When tracing, limit the number of merge nodes traversed on a single trace.  This would let the tracer do some initial tail duplication, but eventually it would stop.  By stopping, traces could begin on a CFG merge node, and all other traces leading there would transfer to the single trace starting there.

If the limit were set to 1, then trace trees == superblocks, and there would be no tail duplication.

Implementation details are still visible in tamarin-tracing.

It did help control code duplication, but we never came up with a good way to set the limit -- any given setting was only good for a narrow range of tests, and we didn't explore more adaptive ways to choose a trace length limit.
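
A sketch of how such a merge-node limit might look inside a recorder.
The "merge" opcode, the MERGE_LIMIT value, and the shape of the
recording loop are all illustrative, not tamarin-tracing's actual
implementation:

```javascript
// While recording, count control-flow merge nodes crossed and stop
// the trace once a limit is hit, so a fresh trace can start at that
// merge point and other traces reaching it can transfer there.

const MERGE_LIMIT = 4;

function recordTrace(bytecode, startPc) {
  const trace = [];
  let merges = 0;
  for (let pc = startPc; pc < bytecode.length; pc++) {
    const op = bytecode[pc];
    if (op === "merge") {
      merges++;
      if (merges > MERGE_LIMIT) {
        // Stop here; future traces reaching this pc jump to the
        // single trace recorded starting at it.
        return { trace, stoppedAt: pc };
      }
    }
    trace.push(op);
  }
  return { trace, stoppedAt: -1 };
}
```

With MERGE_LIMIT set to 1, every trace stops at the first merge node
it crosses -- the trace trees == superblocks case mentioned above.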

> > Implementing this would also greatly speed up 3d-cube.
> 
> A quick spin with the profiler doesn't support the notion that 3d-cube
> is suffering from heavily duplicated or overlong traces.  Although I'm
> not sure if I used the right workload.

Last time I looked, 3d-cube suffered more from code bloat due to excessive inlining than it did from tail duplication.  It's been a little while, so obviously YMMV.
(In reply to comment #14)
> (In reply to comment #12)
> > Ed had a mechanism in tamarin tracing to
> > estimate tail duplication and when a trace gets too branchy,
> 
> Got any further details on this?  (pointers to code or comments?)
> 
> The top 10 frags account for 90.0% of the 11.83 million frag
> starts.  Of those 10.64 million, only 2.90 million made it
> through to the end.  So we're getting a trace completion
> rate of about 27%.
> 
> AFAICS the problem isn't that the traces are too branchy; they're
> just way too long.  At some point it's pointless to extend the trace
> because app-level control flow causes the probability of reaching
> the end to decline exponentially.  The problem seems to be that jstracer
> is committing to making these huge long traces based on woefully
> inadequate profiling data picked up by the interpreter, and as a
> result has to make a whole bunch of near-duplicates.

This is interesting, but I'm not sure I understand correctly. Can you quantify this in more detail? Here is an example of what I'm getting at: Apparently for these 10 traces, 27% of entries to those traces reach the end. (Btw, what does "reach the end" mean? Exit via LOOP_EXIT?) So presumably this means the native code at the end of the trace is executed 27% of the time. 

My stats show that the cost to record a trace is 100-400x the cost to interpret the same code. Running natively is then much faster than interpretation, so it takes 100-500 native runs to break even. So as long as there are no other effects (which is part of my question), if the traces are entered 400-2000 times or more, then I would predict they are worthwhile to compile and are helping run faster. 
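
That break-even argument can be made concrete with a small model.
Costs are in arbitrary per-iteration units; the recording-overhead
figure in the example call is taken from the 100-400x range above,
while the native speedup factor is illustrative:

```javascript
// Back-of-envelope break-even model: recording one iteration costs
// recordCostPerIter; every native run afterwards saves
// (interpCost - nativeCost) versus interpreting that iteration.

function breakEvenRuns(recordCostPerIter, interpCost, nativeCost) {
  return Math.ceil(recordCostPerIter / (interpCost - nativeCost));
}

// e.g. recording = 200x interpreting, native code 2.5x faster than
// the interpreter:
const runs = breakEvenRuns(200, 1, 1 / 2.5);
```

With those inputs the result lands in the 100-500 range quoted above.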

> That notwithstanding: generating too much code is totalling I1 but
> that can only be a small part of the performance gap w.r.t v8/SFX.
> Any way you slice it, we're burning twice as many instructions as
> v8, so there must be some translation-level badness happening too.
> 
> Investigation continues.

So we run 2x as many x86 instructions? Can you measure our overhead (recording, compilation, and trace switching costs)?
(In reply to comment #16)

> these 10 traces, 27% of entries to those traces reach the end. (Btw, what does
> "reach the end" mean? Exit via LOOP_EXIT?)

Gets past the last xt, xf or x in the trace.  So presumably that means 
hits the LIR_loop at the end of the trace and jumps back to the start
(right?  my understanding is more than somewhat shaky)

> So presumably this means the native
> code at the end of the trace is executed 27% of the time. 

Yes.

> My stats show that the cost to record a trace is 100-400x the cost to interpret
> the same code. Running natively is then much faster than interpretation, so it
> takes 100-500 native runs to break even. So as long as there are no other
> effects (which is part of my question), if the traces are entered 400-2000
> times or more, then I would predict they are worthwhile to compile and are
> helping run faster. 

In that case, why is HOTLOOP in jstracer.cpp set to 2?  AIUI this means a
loop will be compiled after only two iterations.

> So we run 2x as many x86 instructions? Can you measure our overhead (recording,
> compilation, and trace switching costs)?

Not easily.  I can measure the time in generated code vs the rest,
as shown below.  It looks like everything else is insignificant.
Although am not confident about the v8 split since none of the
listed function names have an I'm-involved-in-generating-machine-code
feel.

---- jsshell ----

3,583,157,156  PROGRAM TOTAL

3,103,031,257  ???:??? (generated code)
  223,682,650  jsbuiltins.cpp:js_UnboxDouble(int)
   79,213,596  jsbuiltins.cpp:js_BoxDouble(JSContext*,
   68,058,108  jsbuiltins.cpp:js_UnboxInt32(int)
   36,869,987  jsbuiltins.cpp:js_BoxInt32(JSContext*, 
    9,384,010  jsarray.cpp:js_Array_dense_setelem(JSCo
    3,415,448  jstracer.cpp:bool VisitFrameSlots<Deter
    3,158,402  nanojit/Assembler.cpp:nanojit::Assemble


---- v8 ----

1,669,313,816  PROGRAM TOTALS

1,613,301,031  ???:??? (generated code)
    7,058,265  ???:v8::internal::Scanner::ScanIdentifi
    4,817,178  ???:v8::internal::Scanner::Scan()
    3,977,770  ???:strcmp
    3,072,383  ???:v8::internal::String::ComputeLength
    3,059,655  ???:unibrow::Utf8::ReadBlock(unibrow::B
    1,926,402  ???:v8::internal::Utf8SymbolKey::IsMatc
    1,331,737  ???:v8::internal::Runtime::FunctionForN
    1,273,022  ???:v8::internal::Token::Lookup(char co
    1,207,470  ???:v8::internal::Scanner::Next()
    1,097,319  ???:unibrow::CharacterStream::Length()
Numbers in comment #17 are insn counts, btw.
(In reply to comment #17)
> (In reply to comment #16)
> 
> > these 10 traces, 27% of entries to those traces reach the end. (Btw, what does
> > "reach the end" mean? Exit via LOOP_EXIT?)
> 
> Gets past the last xt, xf or x in the trace.  So presumably that means 
> hits the LIR_loop at the end of the trace and jumps back to the start
> (right?  my understanding is more than somewhat shaky)

Sounds right to me. For some reason I also find this confusing to think about. I suppose if we had basic block profiling it would be easier to understand sub-trace execution counts. :-)

> > So presumably this means the native
> > code at the end of the trace is executed 27% of the time. 
> 
> Yes.
> 
> > My stats show that the cost to record a trace is 100-400x the cost to interpret
> > the same code. Running natively is then much faster than interpretation, so it
> > takes 100-500 native runs to break even. So as long as there are no other
> > effects (which is part of my question), if the traces are entered 400-2000
> > times or more, then I would predict they are worthwhile to compile and are
> > helping run faster. 
> 
> In that case, why is HOTLOOP in jstracer.cpp set to 2?  AIUI this means a
> loop will be compiled after only two iterations.

That is a very good question and well worth revisiting. AIUI, HOTLOOP=2 was chosen by empirical tuning. I think that for the initial tuning set (mainly SunSpider), most code runs either once or 1000+ times. That makes 2 the ideal choice because then we start running natively ASAP on code that will pay off.

> > So we run 2x as many x86 instructions? Can you measure our overhead (recording,
> > compilation, and trace switching costs)?
> 
> Not easily.  I can measure the time in generated code vs the rest,
> as shown below.  It looks like everything else is insignificant.
> Although am not confident about the v8 split since none of the
> listed function names have an I'm-involved-in-generating-machine-code
> feel.

OK. Your stats [snipped] are clear enough--it's definitely something with the generated code. We could probably estimate trace switching costs as the number of trace entries times the estimated cost of jumping from one trace to another. I think it's 10-15 instructions or so. If the traces are long, it's probably not that important, unless we tend to exit them *really* early.
All this stuff about overlong traces is, I believe, secondary.

Current front-runner theory (with much help from jorendorff) is
that property accesses are poorly optimized.  v8-richards has
lots of accesses of the form foo.bar.xyzzy, and many of these are
duplicated along the traces.  For example:

(around line 304):

TaskControlBlock.prototype.isHeldOrSuspended = function () {
  return (this.state & STATE_HELD) != 0 || (this.state == STATE_SUSPENDED);
};

produces a trace where the left side of the || is false, and so goes
to the right side:

00000: 304  getthisprop "state"
00003: 304  name "STATE_HELD"
00006: 304  bitand
00007: 304  zero
00008: 304  ne
00009: 304  or 19 (10)
00012: 304  getthisprop "state"
00015: 304  name "STATE_SUSPENDED"
00018: 304  eq
00019: 304  return
00020: 304  stop

The first  getthisprop "state"  produces 5 guards, fine, ok:

 xt6:  xt guard(class is With) -> pc=0x81f2048 imacpc=(nil) sp+24 rp+4
 xt7:  xt eq9 -> pc=0x81f2048 imacpc=(nil) sp+24 rp+4
 xf9:  xf guard(native-map) -> pc=0x81f2048 imacpc=(nil) sp+24 rp+4
 xf10: xf guard_kshape -> pc=0x81f2048 imacpc=(nil) sp+24 rp+4
 xt8:  xt eq11 -> pc=0x81f2048 imacpc=(nil) sp+24 rp+4

but the second one produces 4, which strikes me as grossly redundant:

 xt10: xt guard(class is With) -> pc=0x81f2054 imacpc=(nil) sp+24 rp+4
 xf11: xf guard(native-map) -> pc=0x81f2054 imacpc=(nil) sp+24 rp+4
 xf12: xf guard_kshape -> pc=0x81f2054 imacpc=(nil) sp+24 rp+4
 xt11: xt eq16 -> pc=0x81f2054 imacpc=(nil) sp+24 rp+4

Examination of the controlling expressions for xf9 and xf11 shows
they are identical, so xf11 should be removable.  NJ can nearly
do that, but fails because they both involve three memory loads,
and so it fails to CSE those and hence fails to identify the
controlling expressions overall as identical.

Fixing this properly would require marking those loads as not
aliasing any intervening stores or helper calls, of which there
are both in this example.  NJ can then common them up.

Required actions:

- (definitely) look in jstracer where these loads are made, and
  mark them ldc (cse-able).  This of course requires justifying
  that they are cse-able.

- (probably) make LIR's notion of aliasing more flexible, so that
  we can express the facts like: this store writes the stack, but
  that load reads the heap, so the store doesn't invalidate CSEs
  involving the load.  Redo CseFilter to handle this.  Also add
  such alias-set annotations to helper calls.

So if that works out it'd help remove one guard, at least.
Whether it will help remove others, I don't know.

There are lots of other similar examples.  For example, the 
main loop of this program is:

  while (this.currentTcb != null) {
    if (this.currentTcb.isHeldOrSuspended()) {
      this.currentTcb = this.currentTcb.link;
    } else {
      this.currentId = this.currentTcb.id;
      this.currentTcb = this.currentTcb.run();
    }
  }

Lots of duplicated poking of this, this.currentTcb, et al, within
traces.  Feels like there should be quite some mileage to be had here.
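
At the source level, the commoning being asked for amounts to
something like this hand-CSEd version of schedule().  The Scheduler
stub is a minimal stand-in for the benchmark's class, and the rewrite
assumes callees only observe this.currentTcb at the points where the
field is kept in sync:

```javascript
// Hand-CSEd schedule(): this.currentTcb is read once per iteration
// into a local, which is what commoning those loads would achieve.
// The optimizer must prove no intervening store or call can change
// the property -- exactly the aliasing information discussed above.

function Scheduler(list) {       // minimal stand-in, not the real class
  this.list = list;
  this.currentTcb = null;
  this.currentId = -1;
}

Scheduler.prototype.schedule = function () {
  var tcb = this.currentTcb = this.list;
  while (tcb != null) {
    if (tcb.isHeldOrSuspended()) {
      tcb = tcb.link;
    } else {
      this.currentId = tcb.id;
      tcb = tcb.run();
    }
    this.currentTcb = tcb;       // keep the field in sync for callees
  }
};
```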

----------

Jason also mentioned something to the effect that the "class is
With" tests are redundant, or mutually redundant, or constant, or
some such.  Unfortunately I didn't understand the details, or how
it's proposed to detect and exploit such redundancy.  Could you
please clarify?
(minor additional comment):
Also in a getprop translation (not a getthisprop), I saw the code
below, which amounts to

    if (ld5 == 0) leave the trace
    if ((ld5 & 7) != 0) leave the trace

I set GNU superopt the task of finding a short branch free sequence that
computes x == 0 || (x & 7) != 0, but so far it only found sequences using
the x86 carry flag, which aren't expressible in LIR.  Hence no good.
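
For what it's worth, a candidate that avoids the carry flag, using the
standard (x | -x) sign trick to compute x != 0, can at least be checked
by brute force.  Whether it would be cheap enough expressed as LIR
(neg, or, shift, xor) is a separate question; this only verifies the
arithmetic:

```javascript
// Reference predicate from the guard pair above.
function leaveTraceRef(x) {
  return x === 0 || (x & 7) !== 0;
}

// Carry-free candidate: (x | -x) >>> 31 is 1 iff x != 0 (32-bit),
// so (nonZero ^ 1) is 1 iff x == 0; OR in the low tag bits.
function leaveTraceBranchFree(x) {
  const nonZero = (x | -x) >>> 31;
  return ((x & 7) | (nonZero ^ 1)) !== 0;
}
```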

    About to try emitting guard code for SideExit=0xf7f193e4 exitType=BRANCH
    ld4 = ld cmov2[28]
    ld5 = ld ld4[4]
    eq4 = eq ld5, 0
    xt3: xt eq4 -> pc=0x81f0d0d imacpc=(nil) sp+8 rp+0 (GuardID=006)

    About to try emitting guard code for SideExit=0xf7f193e4 exitType=BRANCH
    JSVAL_TAGMASK
    and2 = and ld5, JSVAL_TAGMASK
    eq5 = eq and2, 0
    xf4: xf eq5 -> pc=0x81f0d0d imacpc=(nil) sp+8 rp+0 (GuardID=007)

I _think_ that's generated by

   guard(false, lir->ins_eq0(v_ins), exit);
   guard(true,  lir->ins2i(LIR_eq,
                           lir->ins2(LIR_piand, v_ins, 
                                     INS_CONSTWORD(JSVAL_TAGMASK)),
                           JSVAL_OBJECT),
                 exit);

in TraceRecorder::unbox_jsval.
Sigh, this was a forgotten action item from early tracemonkey daze: get CSE going for shape guards. I remember talking about it with Andreas but I failed to file a bug. Sorry about that.

We should be able to use ldc instructions for things like loading obj->map and then obj->map (cast to JSScope*) -> shape, for a given obj. Jason, thoughts?

Comment 21 suggests a better jsval tagging scheme (bug 360324), to separate null from object tag (we already distinguish the trace types). I'll update bug 360324.

/be
I would suggest solving this at a slightly higher level, since shape guards can be redundant even when they are not common-able. Consider the following sequence of operations:

x.a = 1
x.b = 2
x.c = 3

The shape evolves, and each time we emit a guard for a different shape. That can't be CSEd. However, the guards are still redundant because traces are a linear sequence of instructions. You can't get to x.b = 2 without going past x.a = 1, which already contains a guard. The shape evolves in a predictable (constant) fashion.

My suggested solution is a set of objects that have been shape-guarded along this trace (or a bitset in parallel to the tracker). It's not as high-level and pretty as solving this in LIR, but it might be easier and more effective.

Along branches it would be nice if this information could be communicated to attaching traces, but I guess that is not super important. It will lead to a couple more redundant guards at every branch. At least we don't get tons of redundant guards on the same trace, though. Let's just hack it up and see how much things improve with the simple per-trace fix. I bet this also helps SunSpider here and there.
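The per-trace set might be sketched like this.  makeGuardEmitter and
emitShapeGuard are hypothetical names, not jstracer's API, and
shape-evolving stores are left out on the grounds argued above that a
trace's shape evolution is predictable:

```javascript
// Per-trace shape-guard dedup: remember which objects have already
// been shape-guarded on this trace and skip emitting a second guard.

function makeGuardEmitter(emitShapeGuard) {
  const guarded = new Set();
  return function guardShape(obj) {
    if (guarded.has(obj))
      return false;            // already guarded on this trace
    guarded.add(obj);
    emitShapeGuard(obj);       // emit the actual LIR guard
    return true;
  };
}
```

Each object costs one guard per trace, however many times its
properties are touched along the trace.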
I don't know that the last part is sound yet. Jason, your brainwaves are wanted.

/be
This patch speeds up richards for me nicely, from 1212 (best of three runs without the patch) to 1409.

But for some reason I haven't had time to find, the patch slows down v8 raytrace from 240 to 213 (approximately: best-of-three again).

Julian, would you have time to measure the detailed difference this makes? Thanks.

/be
(In reply to comment #25)
> This patch speeds up richards for me nicely, from 1212 (best of three runs
> without the patch) to 1409.

Nice!  Unfortunately causes 2 regressions on trace-tests:
testComparisons,testMethodInitSafety

I'd be willing to bet it's due to the use of ldcs.  I was playing
around last night with ldc-ifying loads associated with
offsetof(JSScope, shape) and offsetof(JSObject, map) and got 
precisely the same two failures.  I think that backing off on the
latter ones made the failure go away.  Investigating.
(In reply to comment #26)

I spoke too soon re the ldcs, unfortunately:

* 31064:d2362511eaa9 (tip as of just now) runs trace-tests, no fails

* with Brendan's patch, 2 failures: testComparisons,testMethodInitSafety

* changing the 4 x ldcp back to ldp in the patch does not help
> We should be able to use ldc instructions for things like loading obj->map
> and then obj->map (cast to JSScope*) -> shape, for a given obj. Jason,
> thoughts?

ldc-ifying a field load is OK if the field is const.

ldc-ifying obj->map or map->shape (or JSObjectMap::shape as brendan's patch has it) would definitely be bad. Both can change! (If there are no intervening calls/stores that we can't account for)

But all the ldc-ification in Brendan's actual patch looks OK... though this one is a little precarious:

             guard(HAS_FUNCTION_CLASS(JSVAL_TO_OBJECT(v)),
                   lir->ins2(LIR_eq,
                             lir->ins2(LIR_piand,
-                                      lir->insLoad(LIR_ldp, v_ins, offsetof(JSObject, classword)),
+                                      lir->insLoad(LIR_ldcp, v_ins, offsetof(JSObject, classword)),
                                       INS_CONSTWORD(~JSSLOT_CLASS_MASK_BITS)),
                             INS_CONSTPTR(&js_FunctionClass)),

This one is safe not because the classword is actually constant over the lifetime of all objects--for arrays, it isn't--but because we never mutate functions into nonfunctions or vice versa.

Tricksy. That deserves a comment maybe.

> Nice!  Unfortunately causes 2 regressions on trace-tests:
> testComparisons,testMethodInitSafety
>
> I'd be willing to bet it's due to the use of ldcs.

I just wrote "I'll take that bet!" and mid-aired with your retraction. Rats. :)
Comment on attachment 393245 [details] [diff] [review]
patch to remove obj-is-native tests/guards, use ldc*, and minimize shape guard to once per object per trace

> 
>-static const JSObjectMap SharedArrayMap = { &js_ArrayObjectOps };
>+static const JSObjectMap SharedArrayMap = { &js_ArrayObjectOps, JSObjectMap::SHAPELESS };
> 

Native array objects. Nice.

>+bool
>+TraceRecorder::alreadyGuardedShape(JSObject* obj)
>+{
>+    if (!guardedShapeTable.ops) {
>+        JS_DHashTableInit(&guardedShapeTable, JS_DHashGetStubOps(), NULL,
>+                          sizeof(JSDHashEntryStub), JS_DHASH_MIN_SIZE);
>+    }
>+
>+    JSDHashEntryStub* stub = (JSDHashEntryStub*)
>+        JS_DHashTableOperate(&guardedShapeTable, obj, JS_DHASH_ADD);
>+    if (!stub)
>+        return false;
>+
>+    if (!stub->key) {
>+        stub->key = obj;
>+        return false;
>+    }
>+    JS_ASSERT(stub->key == obj);
>+    return true;
>+}

This isn't safe. Imagine we are tracing a function f(a, b) { a.x = b.x = 1; } and at recording time it's called with a == b == obj. We would emit a guard for a, but not for b. Next time we might call with a != b and then miss the shape guard for b (since we don't enforce/guard that a and b must alias).

What you should check instead is that the definition of a (tracker.get(a)) has been shape-guarded on. That's always safe.

>         map_ins = map(obj2_ins);
>-        LIns* ops_ins;
>-        if (!map_is_native(obj2->map, map_ins, ops_ins))
>-            ABORT_TRACE("non-native map");
>-
>-        LIns* shape_ins = addName(lir->insLoad(LIR_ld, map_ins, offsetof(JSScope, shape)),
>+        LIns* shape_ins = addName(lir->insLoad(LIR_ld, map_ins, offsetof(JSObjectMap, shape)),
>                                   "obj2_shape");

This is a really nice simplification. Does this improve our SS score? There is a lot of array access in there.
Thanks, Andreas -- I should have known better, was in a hurry.

Indeed still have not had time to measure this patch's effect on SS or other benchmarketing scores. This rev does fix the raytrace regression, and still wins big on richards (1744 new, 1445 old best of 3). Help wanted on perf-testing and analysis, as before.

/be
Attachment #393245 - Attachment is obsolete: true
Did someone say perf analysis? Adding Gregor :)
Comment on attachment 393317 [details] [diff] [review]
remove obj-is-native tests/guards, use ldc*, and minimize shape guard to once per obj_ins per trace

Looks good. Does this pass tryserver?
Fails testMethodInitSafety in trace-test.js -- digging into it as I am able, not able to spend much time on it tonight, seem to have some kind of food poisoning...

/be
> What you should check instead is that the definition of a (tracker.get(a)) has
> been shape-guarded on. That's always safe.

It's not always safe, because objects can change shape! Calling out from trace to most builtins, probably including everything that uses bailExit, must invalidate all preceding shape guards.
I'd love to review a patch with just ldc and JSObjectMap::shape. Those seem like they are probably ready to land (though I haven't thought through the consequences of moving shape--at least we're losing some assertion coverage there).

Eliminating redundant guards will take a while.
Jason, nothing is safe in the presence of shape-modifying builtins. Not even elimination of redundant ldc loads, since the shape mutation might occur invisibly in the builtin. So you might fail a guard because you CSE'd the load (ldc), even though it's different now (and expected to be different). We should do Brendan's patch but purge as needed.
(In reply to comment #36)
> (ldc), even though its different now (and expected to be different). We should
> do Brendan's patch but purge as needed.

IIUC you mean we should do something like Brendan's patch, and emit ldcs,
but at a (potentially) shape modifying builtin, we need to invalidate all
from-memory CSEs.  Right?  In that case I don't think we have a way to
express that in LIR.  LIR only knows "this load aliases no stores (including
stores done by helper calls)" or "this load aliases all stores (...)".
No, I don't think that's what comment 36 means to say.

Brendan's patch uses CSEable loads in the appropriate cases, i.e. where the field never changes, so a CSE load actually fits Nanojit's model.

(In reply to comment #36)
> Jason, nothing is safe in the presence of shape-modifying builtins. Not even
> elimination of redundant ldc loads since the shape mutation might occur
> invisible in the builtin.

Well, yeah, using LIR_ldc to load JSScope::shape is a total non-starter. But it is safe to use it for those other fields, right?

> So you might fail a guard because you CSEed the load
> (ldc), even though its different now (and expected to be different).

Heh. That too.

> We should do Brendan's patch but purge as needed.

Julian, I think this means the approach in the patch, which makes a high-level cache separate from Nanojit's CSE cache, is the right way to go.  We just need to blow away the high-level cache each time we emit a call to an unsafe builtin/native.
(In reply to comment #38)
> Julian, I think this means the approach in the patch, which makes a high-level
> cache separate from Nanojit's CSE cache, is the right way to go.  We just need
> to blow away the high-level cache each time we emit a call to an unsafe
> builtin/native.

Instead, why not bail off trace from the builtin if (not necessarily predictable from the builtin) shape changes unpredictably?

/be
(In reply to comment #39)
> (In reply to comment #38)
> > Julian, I think this means the approach in the patch, which makes a high-level
> > cache separate from Nanojit's CSE cache, is the right way to go.  We just need
> > to blow away the high-level cache each time we emit a call to an unsafe
> > builtin/native.
> 
> Instead, why not bail off trace from the builtin if (not necessarily
> predictable from the builtin) shape changes unpredictably?

Strike "Instead", we need what Jason said. The testMethodInitSafety failure is due to a (currently; see bug 497789 soon -- I hope -- for less random method evolution shaping) random shape generated due to method call scope branding[1].

If this happens during recording, we should purge the guardedShapeTable entries for the reshaped object. If it happens on-trace then we should bail off trace.

/be

[1] JS lacks classes with methods, but when you call a function-valued property for the first time on a given receiver, the receiver's property map is branded in a cheap-ish way to approximate inferred nominal typing.
Attached patch patch, v3 (obsolete) — Splinter Review
Maybe I should spin this out into a separate bug. Happy to do so.

I could also split it up, but I'm trying to go forward in a straight line (which carries risk, for sure).

/be
Attachment #393317 - Attachment is obsolete: true
Attachment #393693 - Flags: review?(jorendorff)
(In reply to comment #28)
> This one is safe not because the classword is actually constant over the
> lifetime of all objects--for arrays, it isn't--but because we never mutate
> functions into nonfunctions or vice versa.

I didn't comment on this (the context shows the array class exclusion), but I will do it now.

/be
(In reply to comment #41)
> Created an attachment (id=393693) [details]
> patch, v3

Cool.  Runs trace-tests ok.  Gives a fairly reliable 8%-9% reduction
in run time for a less than 6% reduction in insn count, possibly
due to 23% reduction in I1 misses.  Other figures (data refs, etc)
are unsurprising.

        cpu(s)  insns(M)  I1miss(M)  
Before: 0.81     3581      30.52
After:  0.74     3381      23.61

Cpu times averaged over 10 runs.

Further details to follow, incl overall sunspider measurements.
Attached patch patch, v3a (obsolete) — Splinter Review
Attachment #393693 - Attachment is obsolete: true
Attachment #393834 - Flags: review?(jorendorff)
Attachment #393693 - Flags: review?(jorendorff)
(In reply to comment #43)
SunSpider, averaged over 500 runs, shows no overall gain; it's a
mixed bag, with changes in the -3 % to +3% range; overall a 0.4%
lossage.  I need to peer at traces & trace profiles for selected
slowdowns, e.g. string/unpack-code.  However bug 509648 is atm blocking
performance analysis, so that's the current priority.


TEST                   COMPARISON            FROM                 TO

=============================================================================

** TOTAL **:           ??                1000.1ms +/- 0.4%   1004.1ms +/- 0.4%

=============================================================================

  3d:                  -                  153.8ms +/- 0.6%    153.0ms +/- 0.6%
    cube:              ??                  42.0ms +/- 0.8%     42.3ms +/- 1.0%
    morph:             1.026x as fast      35.5ms +/- 0.8%     34.6ms +/- 0.7%
    raytrace:          -                   76.2ms +/- 0.6%     76.1ms +/- 0.6%

  access:              1.007x as fast     147.8ms +/- 0.5%    146.7ms +/- 0.5%
    binary-trees:      1.029x as fast      37.0ms +/- 0.7%     36.0ms +/- 0.6%
    fannkuch:          *1.007x as slow*    68.6ms +/- 0.5%     69.1ms +/- 0.5%
    nbody:             1.032x as fast      29.5ms +/- 0.7%     28.5ms +/- 0.7%
    nsieve:            *1.028x as slow*    12.7ms +/- 1.5%     13.1ms +/- 0.8%

  bitops:              1.031x as fast      40.9ms +/- 0.7%     39.7ms +/- 0.6%
    3bit-bits-in-byte: *1.046x as slow*     1.6ms +/- 2.9%      1.7ms +/- 3.0%
    bits-in-byte:      ??                   8.8ms +/- 0.8%      8.9ms +/- 0.8%
    bitwise-and:       ??                   2.5ms +/- 2.0%      2.5ms +/- 3.2%
    nsieve-bits:       1.053x as fast      28.0ms +/- 0.8%     26.6ms +/- 0.6%

  controlflow:         ??                  37.1ms +/- 0.4%     37.2ms +/- 0.6%
    recursive:         ??                  37.1ms +/- 0.4%     37.2ms +/- 0.6%

  crypto:              *1.016x as slow*    57.7ms +/- 0.5%     58.6ms +/- 0.7%
    aes:               *1.016x as slow*    31.8ms +/- 0.6%     32.3ms +/- 0.8%
    md5:               *1.019x as slow*    16.3ms +/- 0.5%     16.6ms +/- 0.8%
    sha1:              *1.012x as slow*     9.6ms +/- 0.7%      9.7ms +/- 1.0%

  date:                1.013x as fast     142.8ms +/- 0.4%    141.0ms +/- 0.5%
    format-tofte:      ??                  67.6ms +/- 0.5%     67.8ms +/- 0.6%
    format-xparb:      1.028x as fast      75.2ms +/- 0.5%     73.2ms +/- 0.6%

  math:                ??                  42.7ms +/- 0.7%     42.9ms +/- 0.7%
    cordic:            ??                  12.9ms +/- 1.4%     13.0ms +/- 1.0%
    partial-sums:      ??                  22.2ms +/- 0.6%     22.3ms +/- 0.7%
    spectral-norm:     ??                   7.6ms +/- 1.1%      7.6ms +/- 0.9%

  regexp:              ??                  37.9ms +/- 0.9%     38.4ms +/- 0.9%
    dna:               ??                  37.9ms +/- 0.9%     38.4ms +/- 0.9%

  string:              *1.021x as slow*   339.4ms +/- 0.5%    346.5ms +/- 0.5%
    base64:            *1.028x as slow*    18.0ms +/- 0.7%     18.5ms +/- 0.7%
    fasta:             *1.026x as slow*    66.6ms +/- 0.5%     68.4ms +/- 0.5%
    tagcloud:          *1.013x as slow*   107.8ms +/- 0.6%    109.2ms +/- 0.6%
    unpack-code:       *1.021x as slow*   116.0ms +/- 0.6%    118.4ms +/- 0.6%
    validate-input:    *1.028x as slow*    31.1ms +/- 1.5%     31.9ms +/- 0.8%
Attached patch patch, v4 (obsolete) — Splinter Review
Happy to split this up if it helps, or give it away to someone else to split.

Not sure why this isn't pure win. This revision dumps the guardedShapeTable from TraceRecorder::deepAbort, which would seem to be important for correctness.

/be
Attachment #393834 - Attachment is obsolete: true
Attachment #394060 - Flags: review?(jorendorff)
Attachment #393834 - Flags: review?(jorendorff)
Attached patch patch, v5 (obsolete) — Splinter Review
Ahem.

/be
Attachment #394060 - Attachment is obsolete: true
Attachment #394137 - Flags: review?(jorendorff)
Attachment #394060 - Flags: review?(jorendorff)
Tryserver results look good -- please double-check me here (look for a0b08627fb50 at http://tinderbox.mozilla.org/showbuilds.cgi?tree=MozillaTry).

/be
(In reply to comment #48)
> Tryserver results look good -- please double-check me here 

lgtm
Sunspider results for patch v4 (comment #46):

Sunspider v8 indicates clear wins for richards and crypto,
minor improvements for earley-boyer and raytrace.  Minor regression
for deltablue -- investigating.

Sunspider default set differences are practically at the noise level,
but there are as many minor regressions as minor improvements.  Most
of these are listed as significant by the analysis program.  Overall
stated result is 1.003x as fast (fwiw).


------------------------- v8 set --- 100 runs -------------------------

TEST              COMPARISON            FROM                 TO           

=============================================================================

** TOTAL **:      -                 13770.0ms +/- 0.3%   13720.0ms +/- 0.2%

=============================================================================

  v8:             -                 13770.0ms +/- 0.3%   13720.0ms +/- 0.2%
    crypto:       1.040x as fast      939.8ms +/- 1.5%     903.4ms +/- 1.6%
    deltablue:    *1.008x as slow*   8313.1ms +/- 0.4%    8382.1ms +/- 0.3%
    earley-boyer: 1.006x as fast     2009.5ms +/- 0.3%    1998.1ms +/- 0.4%
    raytrace:     1.005x as fast     1709.4ms +/- 0.3%    1700.7ms +/- 0.3%
    richards:     1.085x as fast      798.2ms +/- 0.2%     735.7ms +/- 0.1%



----------------------- default set --- 100 runs -----------------------
dflt 500 runs results:

TEST                   COMPARISON            FROM                 TO   

=============================================================================

** TOTAL **:           1.003x as fast    959.3ms +/- 0.1%   956.2ms +/- 0.1%

=============================================================================

  3d:                  -                 148.9ms +/- 0.3%   148.8ms +/- 0.3%
    cube:              *1.017x as slow*   39.7ms +/- 0.4%    40.3ms +/- 0.5%
    morph:             *1.009x as slow*   35.3ms +/- 0.5%    35.6ms +/- 0.5%
    raytrace:          1.014x as fast     73.9ms +/- 0.5%    72.9ms +/- 0.3%

  access:              1.019x as fast    143.6ms +/- 0.2%   140.9ms +/- 0.2%
    binary-trees:      1.025x as fast     35.5ms +/- 0.5%    34.7ms +/- 0.4%
    fannkuch:          -                  66.9ms +/- 0.3%    66.8ms +/- 0.2%
    nbody:             1.067x as fast     29.2ms +/- 0.4%    27.4ms +/- 0.3%
    nsieve:            *1.013x as slow*   11.8ms +/- 0.4%    12.0ms +/- 0.5%

  bitops:              -                  39.9ms +/- 0.3%    39.8ms +/- 0.3%
    3bit-bits-in-byte: ??                  1.6ms +/- 2.7%     1.6ms +/- 2.7%
    bits-in-byte:      -                   8.5ms +/- 0.6%     8.5ms +/- 0.6%
    bitwise-and:       ??                  2.4ms +/- 1.8%     2.4ms +/- 2.0%
    nsieve-bits:       -                  27.4ms +/- 0.3%    27.3ms +/- 0.3%

  controlflow:         1.018x as fast     36.7ms +/- 0.3%    36.1ms +/- 0.3%
    recursive:         1.018x as fast     36.7ms +/- 0.3%    36.1ms +/- 0.3%

  crypto:              ??                 56.1ms +/- 0.4%    56.3ms +/- 0.7%
    aes:               ??                 30.9ms +/- 0.5%    31.0ms +/- 1.3%
    md5:               ??                 15.8ms +/- 0.5%    15.9ms +/- 0.4%
    sha1:              -                   9.4ms +/- 0.5%     9.3ms +/- 0.7%

  date:                1.020x as fast    139.7ms +/- 0.2%   137.0ms +/- 0.2%
    format-tofte:      1.012x as fast     66.9ms +/- 0.2%    66.1ms +/- 0.3%
    format-xparb:      1.028x as fast     72.8ms +/- 0.2%    70.8ms +/- 0.2%

  math:                -                  41.4ms +/- 0.3%    41.4ms +/- 0.3%
    cordic:            -                  12.4ms +/- 0.6%    12.4ms +/- 0.5%
    partial-sums:      -                  21.6ms +/- 0.3%    21.6ms +/- 0.3%
    spectral-norm:     -                   7.3ms +/- 0.6%     7.3ms +/- 0.7%

  regexp:              1.018x as fast     35.7ms +/- 0.2%    35.1ms +/- 0.2%
    dna:               1.018x as fast     35.7ms +/- 0.2%    35.1ms +/- 0.2%

  string:              *1.011x as slow*  317.4ms +/- 0.1%   320.9ms +/- 0.1%
    base64:            *1.022x as slow*   17.5ms +/- 0.5%    17.8ms +/- 0.4%
    fasta:             *1.009x as slow*   65.9ms +/- 0.1%    66.5ms +/- 0.2%
    tagcloud:          -                 103.3ms +/- 0.2%   103.3ms +/- 0.2%
    unpack-code:       *1.011x as slow*  100.6ms +/- 0.2%   101.7ms +/- 0.2%
    validate-input:    *1.050x as slow*   30.0ms +/- 0.2%    31.5ms +/- 0.3%
Trace-level comments for patch v4 (comment #46), for v8-richards:

34 traces (Fragments) created.

Number of instruction bytes generated falls from 92461 to 82955
(good)

Number of static exits generated falls from 3661 to 3069
(quite striking)

The longest generated trace falls from 23715 bytes/860 exits
to 19215 bytes/614 exits.

This is all good.  It's pretty obvious that the greatest improvements
are on the longer traces, but that's as expected given the nature of
the patch.
(In reply to comment #51)
To reiterate comment #14, the trace completion rate stands at
a measly 27%.  This means that 73% of trace starts involve a slow
trace-to-trace transition, right?

Of course the patch doesn't change this, since it merely removes
exits which would never have been taken anyway.  (hmm.  that'd be
a good way to verify correctness of the patch.  hmm)
Improving trace-to-trace transition sounds like a natural next step. We can now recover register allocation information from LIR, so we could peek into registers or read straight from spill slots, bypassing the stores to the stack/globals the trace we just came from did.
(In reply to comment #50)
> Minor regression for deltablue -- investigating.

Summary: patch v4 is effective even here, but tickles inherent
badness in the trace selector.

Patch v4 appears to give 0.8% slowdown on deltablue.

Turns out: v4 is actually effective on deltablue too.  It reduces
the instruction count by 3.4%, yet it runs slower, probably because
it increases the number of icache misses by 11.5%.  All other
microarchitectural bogeys I can measure (D1/L2 misses, & mispredicts)
are more or less unchanged.

It appears that the patch interacts with the trace selector in some
way, causing it to generate (even) more duplicated translations of
innermost-y loops.  The total number of fragments started increases
by 0.1% (to 61.0 million).  Their average static length decreases by
more than 6%, reflecting the effectiveness of the patch.  But the total
number of fragments seems to have gone up, hence the total amount of
code being run, and I1 misses, is up.

Before the patch, the top 50 traces account for 84% of the total executed.
Afterwards, they account for only 79%.
(In reply to comment #47)
> Ahem.

Brendan, what's the difference between v5 and v4?  Is it just
spaceleak avoidance fixes, or do I need to rerun all the perf
measurements?
As per discussion with Julian, earlier investigation of deltablue showed that it has deeply nested loops that get their vars hoisted:

for (var ...)
   function
   for (var ...)
      function
      for (var ...)

Each loop is captured with a tree. For each var we capture one tree with type(var) = undefined and then the actual loop version of it. This leads to a combinatorial explosion as we go further out from the innermost loop.

This urgently calls for block shrink-wrapping vars (var to let conversion), or adding liveness analysis that can tell that the undefined value of the var is never read. But that is basically the proof necessary for shrink wrapping, so I think we should go with that.
Attachment #394257 - Attachment is obsolete: true
(In reply to comment #46)
> Created an attachment (id=394060) [details]
> patch, v4

Summary:

Congratulations if you made it this far without drowning in numbers.

Speaking wrt patch v4, not v5, which I haven't tested:

* IMO, we should land this and have done.  It's a clear win at the
  individual-trace-translation level.

* In both richards and deltablue, we are generating way too much code,
  due to excessive trace duplication.  This is particularly marked for
  deltablue, in which:

  - there are many duplicated traces (at least 28 variants of the innermost
    loop)

  - almost all of them exit practically at the start

  TM's performance on deltablue is very poor compared to v8.  Andreas will
  open a new bug for deltablue performance.
(In reply to comment #54)
> Turns out: v4 is actually effective on deltablue too.  It reduces
> the instruction count by 3.4%, yet it runs slower, probably because
> it increases the number of icache misses by 11.5%. 

On rereading: 11.5% by itself is meaningless.  I mean, 11.5% increase
from an already high baseline.  If each executed insn counts as one
icache read, then the icache miss rate has increased from 1.69% to 1.96%.
I think there are holes in this scheme. Testing today.
(In reply to comment #60)

Yes.  I thought it odd that it produces different traces for
deltablue, not merely shorter versions of the same ones.
The holes I think I see are:

1. It seems like this

>@@ -7969,6 +8018,11 @@ TraceRecorder::guardPropertyCacheHit(LIn
>                                      JSPropCacheEntry* entry,
>                                      jsuword& pcval)
> {
>+    if (alreadyGuardedShape(obj_ins, aobj)) {
>+        pcval = entry->vword;
>+        return JSRS_CONTINUE;
>+    }
>+

should only skip the kshape guard, if there is one, not a
kobj guard or any other guards this method emits.

2. (In reply to comment #40)
> If this happens during recording, we should purge the guardedShapeTable entries
> for the reshaped object. If it happens on-trace then we should bail off trace.

I see the purge in JSScope::generateOwnShape, but I don't see where we bail if an unpredictable shape change happens on-trace.
3. It looks like JSScope::extend and JSScope::replacingShapeChange change an object's shape without going through JSScope::generateOwnShape. (JSScope::remove could do it too, but doesn't bother.)
>-     * properties without magic getters and setters), and its scope->shape
>+     * properties without magic getters and setters), and its scope->map.shape

This change is bogus.

>-#define OBJ_SHAPE(obj)                  (OBJ_SCOPE(obj)->shape)
>+#define OBJ_SHAPE(obj)                  ((obj)->map->shape)

Up to now, OBJ_SCOPE asserts that obj is native. To retain assertion coverage,
this could assert that the result is not SHAPELESS. Maybe that is silly. It's a
shame to lose good asserts though.

>@@ -1819,6 +1823,7 @@ TraceRecorder::deepAbort()
> {
>     debug_only_print0(LC_TMTracer|LC_TMAbort, "deep abort");
>     deepAborted = true;
>+    forgetGuardedShapes();
> }

Why is this necessary? The trace is going to be thrown away, so it seems like
this can't affect correctness.

>@@ -7828,8 +7877,8 @@ TraceRecorder::map_is_native(JSObjectMap
>         return false;
> #undef OP
> 
>-    ops_ins = addName(lir->insLoad(LIR_ldp, map_ins, int(offsetof(JSObjectMap, ops))), "ops");
>-    LIns* n = lir->insLoad(LIR_ldp, ops_ins, op_offset);
>+    ops_ins = addName(lir->insLoad(LIR_ldcp, map_ins, int(offsetof(JSObjectMap, ops))), "ops");
>+    LIns* n = lir->insLoad(LIR_ldcp, ops_ins, op_offset);

This means JSObjectMap::ops and the relevant members of JSObjectMap must be
immutable. To a great extent we can enforce this in C++, and I think we should.

  struct JSObjectMap {
      // Here the first const means individual ops don't change, and the second
      // means we never swap out the whole vtable for another.  The JIT depends
      // on both promises of immutability.
      const JSObjectOps * const ops;
      ...

      explicit JSObjectMap(const JSObjectOps *ops) : ops(ops) {}
  };

At the least a comment there about this especially unobvious dependency is
merited.

>@@ -7988,7 +8042,7 @@ TraceRecorder::guardPropertyCacheHit(LIn
>                 lir->insLoad(LIR_ld,
>-                             addName(lir->insLoad(LIR_ldp, cx_ins, offsetof(JSContext, runtime)),
>+                             addName(lir->insLoad(LIR_ldcp, cx_ins, offsetof(JSContext, runtime)),
>                                      "runtime"),
>                              offsetof(JSRuntime, protoHazardShape)),

Similarly:

  struct JSContext {
      ...
      JSRuntime * const runtime;
      ...
      explicit JSContext(JSRuntime *rt) : runtime(rt) {}
  };

>@@ -8348,12 +8404,11 @@ TraceRecorder::guardPrototypeHasNoIndexe
>         return JSRS_STOP;
> 
>     while (guardHasPrototype(obj, obj_ins, &obj, &obj_ins, exit)) {
>+        if (alreadyGuardedShape(obj_ins, obj))
>+            continue;
>+

Is it possible to have a TR::guardShape method instead of calling
alreadyGuardedShape before each shape guard?
Attachment #394137 - Flags: review?(jorendorff) → review-
(In reply to comment #55)
> (In reply to comment #47)
> > Ahem.
> 
> Brendan, what's the difference between v5 and v4?  Is it just
> spaceleak avoidance fixes, or do I need to rerun all the perf
> measurements?

Belated reply: yes, just the leak fix.

(In reply to comment #56)
> ...
> This urgently calls for block shrink-wrapping vars (var to let conversion), or
> add liveness analysis that can tell that the undefined value of the var is
> never read. But that basically is the proof necessary for shrink wrapping, so I
> think we should go with that.

Bug 456588, on my list and I should get to it soon. But mrbkap or jorendorff could steal and get it done with my advice.

(In reply to comment #62)
> The holes I think I see are:
> 
> 1. It seems like this
> 
> >@@ -7969,6 +8018,11 @@ TraceRecorder::guardPropertyCacheHit(LIn
> >                                      JSPropCacheEntry* entry,
> >                                      jsuword& pcval)
> > {
> >+    if (alreadyGuardedShape(obj_ins, aobj)) {
> >+        pcval = entry->vword;
> >+        return JSRS_CONTINUE;
> >+    }
> >+
> 
> should only skip the kshape guard, if there is one, not a
> kobj guard or any other guards this method emits.

Oops. Will fix.

Jason, many thanks for reading this -- I've been preoccupied over the last week with various stuff that has meant I needed your brain (but in a good non-zombie way ;-).

> 2. (In reply to comment #40)
> > If this happens during recording, we should purge the guardedShapeTable entries
> > for the reshaped object. If it happens on-trace then we should bail off trace.
> 
> I see the purge in JSScope::generateOwnShape, but I don't see where we bail if
> an unpredictable shape change happens on-trace.

I changed JSScope::generateOwnShape to call js_LeaveTrace, but then convinced myself that the recording-time defense of purging guardedShapeTable meant we did not need the runtime pessimism.

The assumption is that shape evolution other than by generateOwnShape() is deterministic from recording start to end, and then replays exactly once we've guarded (due to LIR's linear SSA nature) on trace -- and of course re-guarded after a purge.

Is this sound? It ought to be, in my opinion. Anything adding non-determinism here is a bug to fix.

(In reply to comment #63)
> 3. It looks like JSScope::extend and JSScope::replacingShapeChange change an
> object's shape without going through JSScope::generateOwnShape.

But such changes are predictable.

> (JSScope::remove could do it too, but doesn't bother.)

Yes, that's long-standing code. I'm not sure we can rewind to the previous shape when lastProp is deleted and there are no middle deletes. Have to think about delete harder.

Will work on a new patch based on all these comments now.

/be
To be more precise, if the recorder witnesses a generateOwnShape, it purges guardedShapeTable but does not abort recording. New shapes for the given scope evolve unpredictably from this point on, and the first dependency on such a shape will be guarded again.

When the trace executes, the generateOwnShape need not leave trace, however. Perhaps there is no subsequent guard on a mispredicted shape, so it's premature to always leave trace. The re-guarded shape should mismatch.

This all suggests that generateOwnShape is too blunt an instrument, and we should build up more deterministic and trace-stable shape evolutionary paths. But that is for bug 497789.

/be
Attached patch patch, v6 (obsolete) — Splinter Review
Passes the spanky new trace-test python3.1-driven tests.

Hope it's ok to take this bug -- further work in followup bugs so we can have one patch landing per resolved bug still seems best. Julian, please keep up the great analysis, here or in new bugs.

/be
Assignee: jseward → brendan
Attachment #394137 - Attachment is obsolete: true
Attachment #394567 - Flags: review?(jorendorff)
Comment on attachment 394567 [details] [diff] [review]
patch, v6

function B() {}
B.prototype.x = 1;
var d = new B;

var names = ['z', 'z', 'z', 'z', 'z', 'z', 'z', 'x'];
for (var i = 0; i < names.length; i++) {
    x = d.x;  // guard on shapeOf(d)
    d[names[i]] = 2;  // unpredicted shape change
    y = d.x;  // guard here is elided
}
assertEq(y, 2);  // Assertion failed: got 1, expected 2
Attachment #394567 - Flags: review?(jorendorff) → review-
In comment 68, the shape change happens in SetProperty_tn (soon to be SetPropertyByName), but the moral of the story is, whenever we call into arbitrary native code, anything can happen.

You could probably arrange it so that if a shape changes unexpectedly in native code, we detect it in JSScope and deep-bail there. That's a trade-off though: better best-case behavior for horrible worst-case behavior. (Deep-bailing unconditionally kicks us back to the interpreter; we cannot at present patch a deep-bail exit.) It's not clear to me that the worst case is pathological. I bet it could happen in real-world code (e.g. two variables refer to different objects at record time, but the same object at run time).
(In reply to comment #68)
> (From update of attachment 394567 [details] [diff] [review])
> function B() {}
[snip]

I'd like to make sure test cases like this, that find a bug even in development, get added to the suite. Should we just add this right away?
(In reply to comment #69)
> In comment 68, the shape change happens in SetProperty_tn (soon to be
> SetPropertyByName), but the moral of the story is, whenever we call into
> arbitrary native code, anything can happen.

But the recorder is generating that call, so it can purge the entire guardedShapeTable, eh?
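
To make that concrete, here's a toy model of the memoize-and-purge scheme (all names hypothetical -- this is not the actual recorder data structure, just the bookkeeping idea):

```javascript
// Toy model of a recorder's guarded-shape table. A shape guard for an
// object is elided if we already guarded that object on this trace;
// calling into arbitrary native code forgets everything, since any
// shape may change behind our back.
class Recorder {
  constructor() {
    this.guardedShapes = new Map(); // obj -> shape seen at record time
    this.guardsEmitted = 0;
  }
  guardShape(obj, shape) {
    if (this.guardedShapes.get(obj) === shape)
      return; // redundant guard elided
    this.guardedShapes.set(obj, shape);
    this.guardsEmitted++;
  }
  // Recording a call to arbitrary native code: purge the whole table.
  forgetGuardedShapes() {
    this.guardedShapes.clear();
  }
}

const r = new Recorder();
const o = {};
r.guardShape(o, 265);        // emits a guard
r.guardShape(o, 265);        // elided
r.forgetGuardedShapes();     // unpredictable native call recorded
r.guardShape(o, 265);        // must re-guard
console.log(r.guardsEmitted); // 2
```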

Great feedback as always -- got any more on the other bits of the patch? I like the const-ipation (painless). I'll get a new patch up shortly.

/be
Attached patch patch, v7 (obsolete) — Splinter Review
Attachment #394567 - Attachment is obsolete: true
Attachment #394617 - Flags: review?(jorendorff)
Bugzilla interdiff lies. Here's the non-noise diff-of-patches output:

545c545,555
< @@ -10880,13 +10934,10 @@ TraceRecorder::prop(JSObject* obj, LIns*
---
> @@ -9894,6 +9948,9 @@ TraceRecorder::enterDeepBailCall()
>      // Tell nanojit not to discard or defer stack writes before this call.
>      LIns* guardRec = createGuardRecord(exit);
>      lir->insGuard(LIR_xbarrier, guardRec, guardRec);
> +
> +    // Forget about guarded shapes, since deep bailers can reshape the world.
> +    forgetGuardedShapes();
>  }
>  
>  JS_REQUIRES_STACK void

/be
For further nutty consistency, I renamed like so:

249c249
< +            tr->purgeGuardedShapeTableByObject(object);
---
> +            tr->forgetGuardedShapesForObject(object);
405c405
< +PurgeGuardedShapeByObject(JSDHashTable *table, JSDHashEntryHdr *hdr, uint32 number, void *arg)
---
> +ForgetGuardedShapesForObject(JSDHashTable *table, JSDHashEntryHdr *hdr, uint32 number, void *arg)
414c414
< +TraceRecorder::purgeGuardedShapeTableByObject(JSObject* obj)
---
> +TraceRecorder::forgetGuardedShapesForObject(JSObject* obj)
417c417
< +        JS_DHashTableEnumerate(&guardedShapeTable, PurgeGuardedShapeByObject, obj);
---
> +        JS_DHashTableEnumerate(&guardedShapeTable, ForgetGuardedShapesForObject, obj);
628c628
< +    void purgeGuardedShapeTableByObject(JSObject* obj);
---
> +    void forgetGuardedShapesForObject(JSObject* obj);

/be
There's a place where we emit a deep-bail call without calling enterDeepBailCall. It's in TR::emitNativeCall. (definitely silly, please fix it)

If this new approach is correct, then I think we can assert in JSScope, when the shape changes:

  // Check for unpredictable shape change. If this fails, the alreadyGuarded
  // cache may have elided necessary shape guards.
  JS_ASSERT_IF(JS_ON_TRACE(cx), cx->deepBail);

Maybe that assertion is too strict and we would get false alarms. I hope not.
(In reply to comment #70)
> 
> Should we just add this right away?

yes. r=me
Attached patch patch, v8 (obsolete) — Splinter Review
As requested (cx->bailExit, not cx->deepBail). Passes good old js/tests and the new trace-test regime. Tryserver'ing now, expect a clean bill of health. Great to see daylight at the end of this tunnel (and it's not an oncoming train! ;-).

/be
Attachment #394617 - Attachment is obsolete: true
Attachment #394617 - Flags: review?(jorendorff)
(In reply to comment #61)
> (In reply to comment #60)
> 
> Yes.  I thought it odd that it produces different traces for
> deltablue, not merely shorter versions of the same ones.

Julian, if you have time could you please test the latest patch and confirm it doesn't do something unexpected on deltablue? Thanks,

/be
(In reply to comment #77)
>  Passes good old js/tests and the new trace-test regime.

Can someone explain what the new trace-test regime is?  I seem to have missed the memo...
You have to install python from source and then run the python script in trace-test.
(In reply to comment #80)
> You have to install python from source and then run the python script in
> trace-test.

there should be one that works with python 2.3+ at this point, covering any recent OSX or linux.
I tested patch v7 (not the latest) overnight and it still
doesn't seem conclusively a win to me.  I'll try patch v8
today.  One thing I've learn the hard way is that Linux is
a really sucky platform to do measurement runs on, since 
the per-run variation is so high.

Compared to the patch v4 results, the gain for richards has fallen from
8.5% to 6.9%, and the 4% gain for crypto is now less than 1%.  This
might be measurement noise or it might reflect increasing conservatism
in v7 resulting from efforts to make it completely correct.

I'll try with patch v8 on a quiet MacOS box.  Am tired of dealing with
Linux's measurement noise.

Are we sure that v8-crypto is entirely deterministic?  (eg, it doesn't
start off doing any random number generation nonsense that varies
from run to run?)  I ask because it seems to have a much higher
measurement noise level than the rest of them (see below).

(50 run averages):

** TOTAL **:      -                 13182.4ms +/- 0.3%   13177.2ms +/- 0.3%
  v8:             -                 13182.4ms +/- 0.3%   13177.2ms +/- 0.3%
    crypto:       -                   948.8ms +/- 1.9%     941.8ms +/- 2.2%
    deltablue:    ??                 8140.9ms +/- 0.4%    8153.4ms +/- 0.4%
    earley-boyer: *1.013x as slow*   1976.7ms +/- 0.4%    2002.4ms +/- 0.3%
    raytrace:     *1.012x as slow*   1312.3ms +/- 0.3%    1327.8ms +/- 0.4%
    richards:     1.069x as fast      803.7ms +/- 0.1%     751.7ms +/- 0.1%
(In reply to comment #79)
> (In reply to comment #77)
> >  Passes good old js/tests and the new trace-test regime.
> 
> Can someone explain what the new trace-test regime is?  I seemed to have missed
> the memo...

$ make -n check
/usr/bin/python ../trace-test/trace-test.py \
	        -x slow ./dist/bin/js

To add a new test, put some code in a file, js/src/trace-test/tests/*/*.js
and hg add it. The new regime will automatically pick it up.
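
A minimal test file might look like this (a hypothetical example; assertEq is provided by the shell, shimmed here only so the sketch runs standalone):

```javascript
// assertEq is a SpiderMonkey shell builtin; shim it so this sketch
// also runs outside the shell.
if (typeof assertEq === 'undefined') {
  var assertEq = function (got, expected) {
    if (got !== expected)
      throw new Error('got ' + got + ', expected ' + expected);
  };
}

// A typical trace test warms a loop up enough to get recorded and then
// executed on trace, and checks the final result.
function testSimpleLoop() {
  var o = {x: 1};
  var sum = 0;
  for (var i = 0; i < 20; i++)
    sum += o.x;
  return sum;
}
assertEq(testSimpleLoop(), 20);
```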
$ grep -w random v8/crypto.js
  while(rng_pptr < rng_psize) {  // extract some randomness from Math.random()
    t = Math.floor(65536 * Math.random());
// PKCS#1 (type 2, random) pad input string s to n bytes, and return a bigint
  while(n > 2) { // random non-zero pad
// Undo PKCS#1 (type 2, random) padding and, if valid, return the plaintext
// Generate a new random private key B bits long, using public expt E

/be
(In reply to comment #77)
> Created an attachment (id=394756) [details]
> patch, v8

Numbers for v8-tests of patch v8, Mac Mini, Core 2 Duo, 2.13 GHz,
100 iterations.  Looks like a clear win.  Interestingly the patch
helps more than just richards.  Default-test-set numbers to follow.

  v8:             1.031x as fast    16845.0ms +/- 0.1%   16345.7ms +/- 0.1%
    crypto:       -                  1074.4ms +/- 0.8%    1066.8ms +/- 0.9%
    deltablue:    1.046x as fast    10225.3ms +/- 0.0%    9774.1ms +/- 0.1%
    earley-boyer: -                  2742.4ms +/- 0.3%    2740.6ms +/- 0.3%
    raytrace:     *1.006x as slow*   1785.6ms +/- 0.2%    1795.7ms +/- 0.2%
    richards:     1.050x as fast     1017.3ms +/- 0.0%     968.4ms +/- 0.1%
(In reply to comment #77)
> Created an attachment (id=394756) [details]
> patch, v8

Numbers for default test set for patch v8, same setup as #85, 
500 iterations.  Seems like much of a muchness.

** TOTAL **:           1.001x as fast    1214.3ms +/- 0.0%   1213.1ms +/- 0.0%

  3d:                  1.003x as fast     188.6ms +/- 0.0%    188.1ms +/- 0.0%
    cube:              -                   51.0ms +/- 0.1%     51.0ms +/- 0.1%
    morph:             -                   39.2ms +/- 0.1%     39.2ms +/- 0.1%
    raytrace:          1.005x as fast      98.3ms +/- 0.0%     97.9ms +/- 0.1%

  access:              *1.005x as slow*   178.9ms +/- 0.1%    179.9ms +/- 0.0%
    binary-trees:      *1.040x as slow*    50.5ms +/- 0.1%     52.5ms +/- 0.1%
    fannkuch:          1.006x as fast      76.2ms +/- 0.0%     75.8ms +/- 0.1%
    nbody:             1.024x as fast      34.6ms +/- 0.1%     33.8ms +/- 0.1%
    nsieve:            *1.013x as slow*    17.6ms +/- 0.2%     17.8ms +/- 0.2%

  bitops:              1.010x as fast      47.8ms +/- 0.1%     47.4ms +/- 0.1%
    3bit-bits-in-byte: -                    2.1ms +/- 1.4%      2.1ms +/- 1.4%
    bits-in-byte:      1.002x as fast      11.0ms +/- 0.2%     11.0ms +/- 0.1%
    bitwise-and:       -                    3.2ms +/- 1.1%      3.2ms +/- 1.1%
    nsieve-bits:       1.013x as fast      31.4ms +/- 0.1%     31.0ms +/- 0.1%

  controlflow:         *1.017x as slow*    44.8ms +/- 0.1%     45.5ms +/- 0.1%
    recursive:         *1.017x as slow*    44.8ms +/- 0.1%     45.5ms +/- 0.1%

  crypto:              *1.011x as slow*    72.7ms +/- 0.2%     73.6ms +/- 0.2%
    aes:               ??                  41.0ms +/- 0.4%     41.1ms +/- 0.4%
    md5:               *1.031x as slow*    20.3ms +/- 0.2%     20.9ms +/- 0.1%
    sha1:              *1.010x as slow*    11.4ms +/- 0.4%     11.5ms +/- 0.4%

  date:                1.009x as fast     173.9ms +/- 0.0%    172.4ms +/- 0.0%
    format-tofte:      1.014x as fast      82.5ms +/- 0.1%     81.4ms +/- 0.1%
    format-xparb:      1.004x as fast      91.4ms +/- 0.0%     91.0ms +/- 0.0%

  math:                *1.003x as slow*    43.3ms +/- 0.2%     43.4ms +/- 0.2%
    cordic:            *1.011x as slow*    14.1ms +/- 0.2%     14.3ms +/- 0.3%
    partial-sums:      -                   19.9ms +/- 0.2%     19.8ms +/- 0.2%
    spectral-norm:     ??                   9.3ms +/- 0.4%      9.4ms +/- 0.4%

  regexp:              ??                  52.7ms +/- 0.1%     52.8ms +/- 0.1%
    dna:               ??                  52.7ms +/- 0.1%     52.8ms +/- 0.1%

  string:              1.004x as fast     411.5ms +/- 0.0%    409.9ms +/- 0.0%
    base64:            1.013x as fast      24.0ms +/- 0.1%     23.7ms +/- 0.2%
    fasta:             *1.004x as slow*    82.7ms +/- 0.1%     83.0ms +/- 0.0%
    tagcloud:          *1.002x as slow*   130.2ms +/- 0.0%    130.4ms +/- 0.0%
    unpack-code:       1.013x as fast     132.9ms +/- 0.1%    131.2ms +/- 0.0%
    validate-input:    -                   41.6ms +/- 0.1%     41.6ms +/- 0.1%
(In reply to comment #85)
> > patch, v8
> 
> Numbers for v8-tests of patch v8, Mac Mini, Core 2 Duo, 2.13 GHz,
> 100 iterations.  Looks like a clear win.  Interestingly the patch
> helps more than just richards.  Default-test-set numbers to follow.
> 
>   v8:             1.031x as fast    16845.0ms +/- 0.1%   16345.7ms +/- 0.1%
>     crypto:       -                  1074.4ms +/- 0.8%    1066.8ms +/- 0.9%
>     deltablue:    1.046x as fast    10225.3ms +/- 0.0%    9774.1ms +/- 0.1%
>     earley-boyer: -                  2742.4ms +/- 0.3%    2740.6ms +/- 0.3%
>     raytrace:     *1.006x as slow*   1785.6ms +/- 0.2%    1795.7ms +/- 0.2%
>     richards:     1.050x as fast     1017.3ms +/- 0.0%     968.4ms +/- 0.1%

For the raytrace regression there, instruction counts are down around 1.5%
but the I1 and to a lesser extent L2 miss rates are up.  I'll investigate,
but that requires first rebasing the frag profiling patch.
No LIR_dbreak-generated int3 traps in my testing.

/be
Attachment #394756 - Attachment is obsolete: true
Attachment #394883 - Flags: review?(jorendorff)
Comment on attachment 394883 [details] [diff] [review]
patch, v9 with dvander's lir->insAssert/LIR_dbreak patch

For the nanojit changes.

/be
Attachment #394883 - Flags: review?(graydon)
Status: NEW → ASSIGNED
Priority: -- → P1
Target Milestone: --- → mozilla1.9.2
Attachment #394883 - Flags: review?(graydon) → review+
Comment on attachment 394883 [details] [diff] [review]
patch, v9 with dvander's lir->insAssert/LIR_dbreak patch

Debug-break stuff? looks fine to me.
Attached patch v9a, refreshed to tm tip (obsolete) — Splinter Review
Attachment #394883 - Attachment is obsolete: true
Attachment #394944 - Flags: review?(jorendorff)
Attachment #394883 - Flags: review?(jorendorff)
(In reply to comment #90)
> (From update of attachment 394883 [details] [diff] [review])
> Debug-break stuff? looks fine to me.

Looks ok to me too.  I tried changing the insAssert condition
to something obviously false, and checked it really does trap.
Which it does.
Fragment profiling results for patch v8 (which, aiui, is functionally
equivalent to v9 and v9a).

Compared to baseline (which you can see in comment #9), the number
of fragment starts increases from 11.836 million to 12.863 million.

This is worrying, since my understanding was that the patch merely
removed unused guards, and would not affect the higher level trace
selection mechanism.

Indeed, this is new behavior.  In version 4 of patch (see comment #51),
which was the last version I fragprofiled, the number of frag starts
was unaffected by the patch (11.836 mill).

The program still runs circa 5% faster because the traces are shorter
overall, despite the extra 1 million trace-to-trace transfers.

-------------

Details: I think almost all the extra activity manifests itself
in extra entries to FragID=000013.  This trace is 50ish guards
long.  I just show the first three here:
  
 Recording starting from ../SunSpider/tests/v8-richards.js:184@11 
                                                  (FragID=000013)
 00070: 331  callprop "run"
     About to try emitting guard code for SideExit=0x825f9bc exitType=BRANCH
     start
     state = iparam 0 ecx
     sp = ld state[0]
     rp = ld state[4]
     cx = ld state[8]
     eos = ld state[12]
     eor = ld state[16]
     ld630 = ld state[880]
     $global0 = i2f ld630
     ld631 = ld state[872]
     $global1 = i2f ld631
     ld632 = ld state[888]
     $global2 = i2f ld632
     ld633 = ld state[864]
     $global3 = i2f ld633
     ld634 = ld state[840]
     $global4 = i2f ld634
     ld635 = ld state[856]
     $global5 = i2f ld635
     ld636 = ld state[936]
     $global6 = i2f ld636
     $args0 = ld sp[-24]
     $args1 = ld sp[-16]
     $arguments0 = ld sp[-8]
     $stack0 = ld sp[0]
     $stack1 = ld sp[8]
     $stack2 = ld sp[16]
     $arguments0 = ld sp[24]
     $var0 = ld sp[32]
     $stack0 = ld sp[40]
     map = ld $stack0[0]
     ops = ld map[0]
     ld637 = ld ops[12]
     ptr
     guard(native-map) = eq ld637, ptr
     xf463: xf guard(native-map) -> pc=0x81ee922 imacpc=(nil) sp+48 rp+4 
                                    (GuardID=001)
  
     About to try emitting guard code for SideExit=0x825fa60 exitType=BRANCH
     shape = ld map[16]
     265
     guard_kshape = eq shape, 265
     xf464: xf guard_kshape -> pc=0x81ee922 imacpc=(nil) sp+48 rp+4 
                               (GuardID=002)
  
     About to try emitting guard code for SideExit=0x825fb04 exitType=BRANCH
     proto = ld $stack0[8]
     0
     eq398 = eq proto, 0
     xt359: xt eq398 -> pc=0x81ee922 imacpc=(nil) sp+48 rp+4 (GuardID=003)
  

Without the patch, the trace is executed 81898 times, and
always goes to the end:
       
      FragID=000013, total count 81898:
         Looped (057)     81898 (100.00%)
       
With patch v8, this fragment is executed 12 x more often.  But
all of those extra entries are thrown out at the first shape
check:
       
      FragID=000013, total count 1079398:
         GuardID=002     997500 (92.41%)
         Looped (052)     81898 ( 7.59%)

So it seems like the trace is being entered a lot more, but all
those new entries fail the first shape check.
(In reply to comment #76)
> (In reply to comment #70)
> > 
> > Should we just add this right away?
> 
> yes. r=me

A few days behind on this, but I think everyone should feel free, at any time, to add any tracing testcase to the suite, reviewed at the discretion of the committer.  I don't think anyone will disagree, but it seems best to make sure everyone's on the same page with this.
at tm-tip, before this patch:

recorder: started(6), aborted(2), completed(34), different header(0), trees trashed(0), slot promoted(0), unstable loop variable(2), breaks(0), returns(0), unstableInnerCalls(1), blacklisted(0)
monitor: triggered(72), exits(72), type mismatch(0), global mismatch(0)

after:

recorder: started(4), aborted(2), completed(31), different header(0), trees trashed(0), slot promoted(0), unstable loop variable(1), breaks(0), returns(0), unstableInnerCalls(1), blacklisted(0)
monitor: triggered(66), exits(66), type mismatch(0), global mismatch(0)

I could have sworn there was something much different yesterday but now I can't reproduce it.
Attached patch patch, v10 (obsolete) — Splinter Review
Revised to include #ifdef DEBUG code to dump shape guards to /tmp/shapes.dump. Julian, this is rough but it might be usable to say more about guards in profiler output. We would need to map GuardId to this information somehow, possibly just in the .dump file.

/be
Attachment #394944 - Attachment is obsolete: true
Attachment #394944 - Flags: review?(jorendorff)
(In reply to comment #94)
> A few days behind on this, but I think everyone should feel free, at any time,
> to add any tracing testcase to the suite, reviewed at the discretion of the
> committer.

Well, if (as in this case) the test passes on tip, absolutely. But usually a new test case is relevant because it fails.
(In reply to comment #95)
> at tm-tip, before this patch:

I see that too, although the numbers are not as different as yours.
There's one more "started" and one more "aborted", but that's it.
This is with 31574:12a9bea2d331 (Sat Aug 15).  Without patch:

recorder: started(4), aborted(1), completed(33), different header(0), trees trashed(4), slot promoted(0), unstable loop variable(1), breaks(0), returns(0), unstableInnerCalls(1), blacklisted(0)
monitor: triggered(70), exits(70), type mismatch(0), global mismatch(0)

With patch v8:

recorder: started(5), aborted(2), completed(33), different header(0), trees trashed(4), slot promoted(0), unstable loop variable(1), breaks(0), returns(0), unstableInnerCalls(1), blacklisted(0)
monitor: triggered(71), exits(71), type mismatch(0), global mismatch(0)
> Revised to include #ifdef DEBUG code to dump shape guards to /tmp/shapes.dump.
> Julian, this is rough but it might be usable to say more about guards in
> profiler output.

Trying to clarify what you mean.

AIUI a shape is a 24 bit int which uniquely identifies a structural type.
The shape guards throw us off-trace if the shape of an object does not 
match its shape at recording-time.

So what info do you want the profiler to collect?  I'm thinking, something
like: for each shape guard that fails, the expected and actual shape.  And
also a mapping from the shapes themselves to the type it refers to, as 
printed out by the v10 patch into /tmp/shapes.log.

Is that right?  Or completely the wrong direction?
I collected more info on the with/without-patch discrepancy
noted in comments #95 and #98.  This is easily reproduced using r31775
with and without patch v9 (r31775 because the patch no longer applies to tip).

I first changed all the ldcp's back to ldp's.  Difference persists.
So it's not to do with bogus claims about non-aliasing.

I disabled eliding of unnecessary shape checks in TraceRecorder::guardShape.
Difference persists.

So the impression I get is, even after disabling the code generation changes
that are driven by (what I assume is) a redundant shape check analysis,
the difference is still there.

-----------------

diffing the output of TMFLAGS=minimal shows that, relatively late in the
day (after 32 of 34 total recordings are made), the v9-ised version starts
to record a new trace that the baseline one doesn't. But then it aborts:

 Recording starting from ../../../SunSpider/tests/v8-richards.js:535@608
 Abort recording of tree ../../../SunSpider/tests/v8-richards.js:535@608 at
 ../../../SunSpider/tests/v8-richards.js:331@70: Inner tree is trying to
 grow, abort outer recording.

and that's for the reason BRANCH_EXIT.

----------------

I also compared outputs for TMFLAGS=tracer,minimal,stats,abort.  This
seems to show a divergence very much earlier in the process, after completion
of the 6th recording.  The first differing line is

 trying to attach another branch to the tree (hits = 0)

which only the baseline version does, not the v9 version.  Baseline 
version continues with a few lines of "Looking for compat peer",
leaving/entering a trace, synthesising a shallow frame.  The two
versions finally reconverge at the line

 trying to attach another branch to the tree (hits = 1)

IOW the baseline version has

 trying to attach another branch to the tree (hits = 0)
 (a few lines of other stuff)
 trying to attach another branch to the tree (hits = 1)

whereas the v9 version only has

 trying to attach another branch to the tree (hits = 1)

-----------------------

Another thing I noticed is that in many places where lists of types
are printed, the v9 version would have an extra type listed:

 checkType slot 10: interp=I typemap=I isNum=1 promoteInt=1

that's not present with the baseline version.

------------

Is any of this meaningful/relevant?
Here's a small case which produces one trace without the patch and
two with it:

$ TMFLAGS=minimal ./Rbase/js/src/BuildD/js -j ./foo.js

Recording starting from ./foo.js:5@8 (FragID=000001)
Recording completed at  ./foo.js:5@25 via endLoop (FragID=000001)


$ TMFLAGS=minimal ./Rv10/js/src/BuildD/js -j ./foo.js

Recording starting from ./foo.js:5@8 (FragID=000001)
Recording completed at  ./foo.js:5@25 via endLoop (FragID=000001)

Recording starting from ./foo.js:5@8 (FragID=000002)
Recording completed at  ./foo.js:5@25 via endLoop (FragID=000002)


Test case is testDeepPropertyShadowing from trace-tests:

function testDeepPropertyShadowing()
{
    function h(node) {
        var x = 0;
        while (node) {
            x++;
            node = node.parent;
        }
        return x;
    }
    var tree = {__proto__: {__proto__: {parent: null}}};
    h(tree);
    h(tree);
    tree.parent = {};
    assertEq(h(tree), 2);
}

testDeepPropertyShadowing();
Depends on: 513160
Attached patch patch v10, refreshed (obsolete) — Splinter Review
No change other than merging. We understand better what's going on. I'll write it up when I have time and see about a real fix.

/be
Attachment #395192 - Attachment is obsolete: true
Depends on: 513407
Attached patch patch v11 (obsolete) — Splinter Review
Tested on top of patch for bug 471214 (see bug 471214 comment 73), I stopped in every JSScope::generateOwnShape and verified that each such shape was generated for a specific method assignment to a prototype object, or a scope branding on first call to a method of an as-yet unbranded prototype. These look minimal, although there is more to do in bug 497789 to share shape evolutionary paths.

However, this patch gives three aborts:

Abort recording of tree richards.js:190@11 at richards.js:531@30: No compatible inner tree.
Abort recording of tree base.js:159@23 at richards.js:470@47: Inner tree is trying to grow, abort outer recording.
Abort recording of tree base.js:159@23 at richards.js:337@70: Inner tree is trying to grow, abort outer recording.

Without this patch, with the patch for bug 471214, the aborts are:

Abort recording of tree richards.js:190@11 at richards.js:531@30: No compatible inner tree.
Abort recording of tree base.js:159@23 at richards.js:470@47: Inner tree is trying to grow, abort outer recording.

With no patches applied, the same two aborts present.

The reason testDeepPropertyShadowing gets two traces with the patch, one without, is that the guardedShapeTable eliminates redundant shape guards, so that the hits count on the guard's side exit's target meets the HOTEXIT threshold. Without the patch we get two guards, two side exits, and insufficient heat at the first's branch exit to keep recording.

I could use fragment profiler data, again -- Julian, I hope your travel went well enough! Thanks,

/be
Attachment #397381 - Attachment is obsolete: true
(In reply to comment #103)
> The reason testDeepPropertyShadowing gets two traces with the patch, one
> without, is that the guardedShapeTable eliminates redundant shape guards, so
> that the hits count on the guard's side exit's target meets the HOTEXIT
> threshold. Without the patch we get two guards, two side exits, and
> insufficient heat at the first's branch exit to keep recording.

The relevant code is in AttemptToExtendTree:

    Fragment* c;
    if (!(c = anchor->target)) {
        Allocator& alloc = *JS_TRACE_MONITOR(cx).allocator;
        c = new (alloc) Fragment(cx->fp->regs->pc);
        c->root = anchor->from->root;
        debug_only_printf(...);
        anchor->target = c;
    ...

Where anchor is the VMSideExit* passed in. With redundant guards, we have two anchors for the same ip, so we see null anchor->target twice and create a new fragment, c, with zero hits count. With the patch, we have one guard, one side exit, one place to create and save anchor->target aka c, so we find it and its hits counter reaches the critical HOTEXIT threshold.
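
A toy model of the effect (HOTEXIT value and names are made up, but the counting is the point):

```javascript
// Model: each side exit owns a hit counter, and we only start
// extending the tree once a single exit's counter reaches HOTEXIT.
const HOTEXIT = 2; // hypothetical threshold

function runExits(exits) {
  // Each trace execution bumps exactly one exit's counter.
  let extended = false;
  for (const exit of exits) {
    exit.hits++;
    if (exit.hits >= HOTEXIT) extended = true;
  }
  return extended;
}

// Without the patch: two redundant guards give two distinct side
// exits, each taken once; neither gets hot.
console.log(runExits([{hits: 0}, {hits: 0}])); // false

// With the patch: one guard, one shared side exit, taken twice; hot.
const shared = {hits: 0};
console.log(runExits([shared, shared])); // true
```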

With the v11 patch on top of the bug 471214 patch, I wonder whether the fragment profiler still shows the same compiled fragments, with more executions of one particular one -- where the extra executions all branch-exit early. Need that profiler!

/be
Attached patch patch v12 (obsolete) — Splinter Review
Attachment #397494 - Attachment is obsolete: true
Interdiff lies, patch v12 is just a refresh.

/be
(In reply to comment #105)
> Created an attachment (id=397582) [details]
> patch v12

Comparing (32099:c1a97865c476 + bug 471214 att 397899)
to        (32099:c1a97865c476 + bug 471214 att 397899 + this bug patch v12)

Note that  bug 471214 att 397899  is not the final version of that
bug's patch, but at least does not appear to cause any perf regressions.

Summary: v12 is faster than baseline (as before), and has lower insn
counts and I1 misses.  However it still has different behaviour at
the fragprofiling level -- entries increase from 11,836,222 to
12,863,820.

ten runs native:
  base 8.76
  v12  8.36

base

==8163== I   refs:      3,735,417,294
==8163== I1  misses:       32,483,178
==8163== L2i misses:            5,789
==8163== I1  miss rate:          0.86%
==8163== L2i miss rate:          0.00%
==8163==
==8163== D   refs:      1,790,558,493  (1,187,668,439 rd   + 602,890,054 wr)
==8163== D1  misses:          159,349  (      133,304 rd   +      26,045 wr)
==8163== L2d misses:           41,320  (       20,611 rd   +      20,709 wr)
==8163== D1  miss rate:           0.0% (          0.0%     +         0.0%  )
==8163== L2d miss rate:           0.0% (          0.0%     +         0.0%  )
==8163==
==8163== L2 refs:          32,642,527  (   32,616,482 rd   +      26,045 wr)
==8163== L2 misses:            47,109  (       26,400 rd   +      20,709 wr)
==8163== L2 miss rate:            0.0% (          0.0%     +         0.0%  )
==8163==
==8163== Branches:        541,992,592  (  541,898,572 cond +      94,020 ind)
==8163== Mispredicts:      11,523,392  (   11,497,063 cond +      26,329 ind)
==8163== Mispred rate:            2.1% (          2.1%     +        28.0%   )

v12

==8187== I   refs:      3,594,439,441
==8187== I1  misses:       28,277,581
==8187== L2i misses:            5,794
==8187== I1  miss rate:          0.78%
==8187== L2i miss rate:          0.00%
==8187==
==8187== D   refs:      1,737,612,513  (1,131,324,134 rd   + 606,288,379 wr)
==8187== D1  misses:          191,409  (      163,599 rd   +      27,810 wr)
==8187== L2d misses:           44,319  (       23,062 rd   +      21,257 wr)
==8187== D1  miss rate:           0.0% (          0.0%     +         0.0%  )
==8187== L2d miss rate:           0.0% (          0.0%     +         0.0%  )
==8187==
==8187== L2 refs:          28,468,990  (   28,441,180 rd   +      27,810 wr)
==8187== L2 misses:            50,113  (       28,856 rd   +      21,257 wr)
==8187== L2 miss rate:            0.0% (          0.0%     +         0.0%  )
==8187==
==8187== Branches:        494,290,098  (  494,191,015 cond +      99,083 ind)
==8187== Mispredicts:      11,331,061  (   11,304,453 cond +      26,608 ind)
==8187== Mispred rate:            2.2% (          2.2%     +        26.8%   )

base

Total count = 11836222

           Entry counts         Entry counts         Static
         ------Self------     ----Cumulative---   Exits  IBytes   FragID

  0:     31.55%   3734848      31.55%   3734848      46  1043     000002
  1:     19.44%   2300548      50.99%   6035396     150  3315     000003
  2:     12.56%   1486448      63.55%   7521844     182  4107     000010
  3:      8.43%    997498      71.98%   8519342     119  2874     000016
  4:      4.38%    517997      76.35%   9037339     219  5047     000004
  5:      2.96%    349997      79.31%   9387336      93  2390     000023
  6:      2.77%    327598      82.08%   9714934      27   831     000008
  7:      2.74%    324446      84.82%  10039380     186  4280     000019
  8:      2.74%    323748      87.55%  10363128      53  1324     000017
  9:      2.49%    294348      90.04%  10657476      68  1729     000018

v12

Total count = 12863820

           Entry counts         Entry counts         Static
         ------Self------     ----Cumulative---   Exits  IBytes   FragID

  0:     29.03%   3734848      29.03%   3734848      45  1079     000002
  1:     17.88%   2300548      46.92%   6035396     142  3432     000003
  2:     11.56%   1486448      58.47%   7521844     176  4238     000010
  3:      8.39%   1079398      66.86%   8601242      52  1368     000013
  4:      7.75%    997498      74.62%   9598740     112  2933     000016
  5:      4.03%    517997      78.64%  10116737     211  5250     000004
  6:      2.72%    349997      81.37%  10466734      88  2424     000023
  7:      2.55%    327598      83.91%  10794332      27   855     000008
  8:      2.52%    324446      86.43%  11118778     180  4407     000018
  9:      2.52%    323748      88.95%  11442526      50  1310     000017
Attachment #397582 - Attachment is obsolete: true
Attachment #398548 - Attachment is obsolete: true
(In reply to comment #109)
> rebased again, still essentially v12 plus tracking changes

Performance numbers for this patch are essentially the same as in
comment #107.  (when applied to TM r32149, which is the latest
version to which it will apply cleanly).
Attached patch rebased yet again (obsolete) — Splinter Review
Attachment #398792 - Attachment is obsolete: true
Attached patch rebased one more time (obsolete) — Splinter Review
Attachment #399660 - Attachment is obsolete: true
Depends on: 516069
Splitting the patch up into dependency bugs. First one is bug 516069.

/be
Attachment #399910 - Attachment is obsolete: true
Depends on: 516075
Attached patch the residuum (obsolete) — Splinter Review
Fragment profiling help welcome. I really hope this doesn't introduce anything unexpected. The LIR_dbreak-based assertions should confirm it's memoizing constant shape guards correctly.

The SubShapeOf(cx, obj, shape) built-in, used when a predictable shape (one along OBJ_SCOPE(obj)'s property tree lineage at scope->lastProp, where the scope has not had its own overriding shape assigned) is guarded, tries to match the recording-time shape to an older property on the ancestor line, in case the object being used to access the wanted property has been extended with new properties since recording-time.

SubShapeOf helps when objects related by pure property extension are accessed from the same site.
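
Roughly, in a toy model where each shape records its predecessor along the property-tree lineage (names and representation hypothetical, not the real shape encoding):

```javascript
// Model of SubShapeOf: the guarded (record-time) shape matches if it
// lies on the current shape's ancestor line, i.e. the object has only
// been extended with new properties since recording.
function subShapeOf(currentShape, guardedShape) {
  for (let s = currentShape; s !== null; s = s.parent) {
    if (s === guardedShape)
      return true;
  }
  return false;
}

// shape lineage: root -> {x} -> {x, y}
const root = {id: 0, parent: null};
const sx   = {id: 1, parent: root};
const sxy  = {id: 2, parent: sx};

console.log(subShapeOf(sxy, sx));  // true: pure extension since recording
console.log(subShapeOf(sx, sxy));  // false: guarded shape not an ancestor
```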

The comment in TR::guardShape says what I think we must do next: figure out how to implement a PIC instead of branch-exiting on unrelated shape access from the same site. v8/richards.js is all about this: it uses different JS constructors with prototypes, where the unrelated shapes have methods and properties named the same and at the same offset (slot), and it expects the VM to optimize.

Taking a branch exit means re-recording and growing the tree, but with the same code except for the guarded shape.

Branch exit from an inner tree while recording an outer tree gets you an aborted outer tree, and we hope to re-record inner then outer with the new shape guarded in the inner, but again if the code is the same (same slot, etc.), then this is suboptimal compared to a PIC.

And branch exits can reach MAX_BRANCHES and fail hard.

I'll file a followup bug if this patch passes review and frag-profiling muster, and get on with the PIC work tomorrow.
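
For the record, the PIC idea in a hypothetical JS model (the real thing would be generated code patching itself, not an interpreter-side table):

```javascript
// Hypothetical polymorphic inline cache for one property-access site.
// Instead of branch-exiting on a single guarded shape, the site keeps
// a small list of (shape -> slot) entries and only falls back to the
// slow path on a miss.
const MAX_PIC_ENTRIES = 4;

function makeSite(name) {
  const entries = []; // [{shape, slot}]
  return function getProp(obj) {
    for (const e of entries) {
      if (e.shape === obj.shape)
        return obj.slots[e.slot]; // cache hit: no exit, no re-record
    }
    // Cache miss: slow-path lookup, then grow the cache.
    const slot = obj.layout[name];
    if (entries.length < MAX_PIC_ENTRIES)
      entries.push({shape: obj.shape, slot: slot});
    return obj.slots[slot];
  };
}

// Two unrelated shapes with 'run' at the same slot, as in v8-richards.
const a = {shape: 1, layout: {run: 0}, slots: ['A.run']};
const b = {shape: 2, layout: {run: 0}, slots: ['B.run']};
const getRun = makeSite('run');
console.log(getRun(a), getRun(b), getRun(a)); // 'A.run' 'B.run' 'A.run'
```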

/be
Attachment #400163 - Attachment is obsolete: true
Attachment #400655 - Flags: review?(jorendorff)
Attachment #400655 - Attachment is obsolete: true
Attachment #400655 - Flags: review?(jorendorff)
Comment on attachment 400655 [details] [diff] [review]
the residuum

SubShapeOf is badly broken (say trace-tests) but I have no time left today to work on this.

/be
Depends on: 518448
Depends on: TM:PIC
I think this is a dup -- anyone disagree, undup and take ownership.

/be
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → DUPLICATE
FWIW, I've (safely) implemented the CSE-ing of loads and am getting roughly a 20% speed-up.  That's higher than the 8% Julian got in comment 8 when he did it unsafely;  I guess other things have changed in the meantime that make the CSE-ing have a bigger benefit.