Closed Bug 536641 Opened 15 years ago Closed 13 years ago

TM: cheap static analysis of loop variables to avoid overflow guards

Categories

(Core :: JavaScript Engine, defect)

x86
macOS
defect
Not set
normal

Tracking


RESOLVED WONTFIX

People

(Reporter: n.nethercote, Unassigned)

References

Details

Attachments

(2 files, 2 obsolete files)

Raised in bug 536630:  in simple loops like this:

  for (var i = 0; i < 100; i++) {
      ...
  }

we do an unnecessary overflow check.  It would be nice to have a cheap static analysis that identifies when it can be avoided.

Some care is needed -- we have to ensure that 'i' can't be modified by the loop body.

Although before doing a lot of work on this it would be good to determine if the potential benefits are significant.
Attached file SS results
As an easy limit study I tried turning off overflow checking.  SS results are attached.  Overall it was a ~1.7% (13ms) win, with big improvements for 3bit-bits-in-byte (1.77x), bitwise-and (1.23x), crypto-md5 (1.35x), crypto-sha1 (1.13x) and slight improvements everywhere else.

This result will overstate the best results achievable with this bug, as it removes all overflow checks, not just those in simple for-loops.  It's hard to know what proportion is really removable.  But it's encouraging that bitwise-and does well: it consists of a single small loop where the only overflow check is on the loop variable increment, so we can see that the overflow check cost is significant for it.
Blocks: 537868
I'm working on this issue. I will keep you up to date.

If you have any suggestions or ideas, please share them with me.
I was just thinking about this last night!  Here's the analysis I sketched; unless you want to forbid all function calls or possibly-side-effecting property accesses, you need pretty much all of these.  The upvar lexdep stuff should make step 4 relatively straightforward.

Initial analysis:
1) if the for-loop's cond is of the form TOK_RELOP[JSOP_LT](TOK_NAME,TOK_NUMBER) and
2) the loop control is initialized to an integer and
3) the loop control's name is only updated within the loop's update and
4) there are no closures which reference the loop control and
5) the loop update is an increment expression and
6) the bounding constant plus the constant operand of the increment expression is less than the maximum integer value then
WIN.

js_FoldConstants looks like a good place to perform the analysis: if the analysis indicates that the loop has an integer-suited loop control, then annotate the for-head NOP with a different source note to propagate this fact (or use an opcode?)
It's nice to have a starting point. I'll follow this path.

Thanks.
I am working on this bug, and I would like to discuss the best approach to handling it. My original idea was to work at the bytecode level; now I am also considering working on the abstract syntax tree (AST).

At the bytecode level, once the monitor is activated, I look for branches and patterns such as "i < 100". Once a segment is built, I check whether 'i' is modified inside the segment. If it is not, I remove all the overflow tests that use 'i', other bounded variables, or constants that cannot cause overflow.

On the AST, I can use the tree structure to save some time: I simply look for loop nodes and check whether the induction variable is changed in their subnodes. The problem with working on the AST is that I can't see how to propagate the information that a name is bounded down to the place where the segment is produced.

What do you think about these two approaches?
Bytecode is being read by JaegerMonkey JIT code and a CFG built, IIRC. That might be easier to analyze than the AST.

What shaver describes is good for AST analysis of "for (... i = 0; i < 100; i++) { /* no i changes here */ }" loops. That would give quick joy but I bet the rich target is where the code uses N, not 100, for some variable N.

/be
(In reply to comment #5)

The N case seems like a good target, since I think it could be any expression. Even if N changes wildly on trace, if the root of the tree specialized it as an integer, "i" can't overflow. If a branch exits with a double or string or something, we'll just build a new tree anyway.
Currently I'm working with the TraceMonkey baseline. Should I migrate to the JaegerMonkey baseline?
(In reply to comment #8)
> Currently I'm working with TraceMonkey 
> baseline. Should I migrate to JaegerMonkey baseline?

I think it's best to stick with TraceMonkey--it won't change as much underneath you, and the work will get tested and used faster.

At first, I thought it didn't even apply to JM, but I think the results actually could help JM as well, which is really cool. But we'll figure that out later.
Dear all,

My idea is to go over the bytecodes, looking for patterns like "i < N", so that I can attach the constraint (i < 100) to the name "i". For instance, given a loop like:

for (i = 0; i < 100; i++) {
}

tracemonkey produces, at some point in the bytecodes, the pattern below:

00040:  name "i"
00043:  int8 100
00045:  lt
00046:  ifne 20 (-26)
...

So, I changed the macro DO_OP, in file jsinterp.cpp, to print the bytecodes that it is processing. However, it does not print the "ifne" bytecode: it prints the "lt", and then the bytecode that would be executed after the ifne. So, questions:

1) What happened to this bytecode inside the js_Interpret() function?

2) Instead of monitoring the DO_OP macro, is it possible to use the TraceRecorder and the TraceMonitor to discover the constraints that apply on the variables? Say (i < 100), (i < b), (i > 100), etc. It would be nice to have a way to identify the head of the loops.
(In reply to comment #10)
> Dear all,
> 
> My idea is to go over the bytecodes, looking for patterns like "i < N", so that
> I can put the constraint (i < 100) onto name "i". For instance, given a loop
> like:
> 
> for (i = 0; i < 100; i++) {
> }

Side note: your code example above seems to be using a global variable |i|. This is also indicated by the |name| op in your disassembly output (getarg/getlocal are used for locals). Of course, globals have more analysis constraints, because any function called inside the current one could potentially mutate a global, so you probably want to start with locals.

> tracemonkey produces, at some point in the bytecodes, the pattern below:
> 
> 00040:  name "i"
> 00043:  int8 100
> 00045:  lt
> 00046:  ifne 20 (-26)
> ...
> 
> So, I changed the macro DO_OP, in file jsinterp.cpp, to print the bytecodes
> that it is processing. However, it does not print the "ifne bytecode. It prints
> the "lt", and then the bytecode that would be executed after the ifne. So,
> questions:
> 
> 1) What happened to this bytecode inside the js_Interpret() function?

I'm glad you asked. :-) SpiderMonkey has an "op fusion" optimization, where certain pairs of opcodes are executed together, in order to save on dispatch costs and value store/load costs. This happens for (a) relationals followed by ifs (as in your example) and (b) sets followed by pops. If you look at the macro RELATIONAL_OP inside jsops.cpp, you will see that it invokes another macro TRY_BRANCH_AFTER_COND. If the instruction following the relational is ifeq/ifne, this macro will directly execute the conditional branch without using DO_OP. I think you can make your instrumentation correct by adding it inside the 'if (diff <= 1)' if statement.

> 2) Instead of monitoring the DO_OP macro, is it possible to use the
> TraceRecorder and the TraceMonitor to discover the constraints that apply on
> the variables? Say (i < 100), (i < b), (i > 100), etc. It would be nice to have
> a way to identify the head of the loops.

I'm not sure exactly what you want to do, but it seems like it would work, as long as you are only trying to analyze individual traces as they are recorded instead of the whole loop. Another fact of interest is that the bytecode |trace| occurs at each loop header for the tracer's benefit (I think it also occurs at function entry points, and maybe another place or 2, so you'll have to check).

And now I recall why you asked about JM -- the new abstract interpretation function is in there. You could probably import it into your code right now if you wanted to use it--it is in methodjit/BytecodeAnalyzer.{h.cpp} and I don't think it depends on any other part of the method JIT. It figures out incoming non-fallthrough CFG edges for each instruction, so you could probably use that, possibly with a tweak or two, plus the |trace| bytecodes to identify loops directly.
Hi there,

I'm glad to inform you that I have the first version of the overflow-elimination algorithm! It works only for local induction variables, but we will now work on adding other analyses.

The code needs some cleanups; after that I will share it with you.

I have a question about testing: which test suite is most appropriate to run? I used to use trace-test.js, but it is no longer available.
Rodrigo, could you attach a patch? No problem if it is rough, it would be good to take a look and help advise.

/be
(In reply to comment #11)
> I'm glad you asked. :-) SpiderMonkey has an "op fusion" optimization, where
> certain pairs of opcodes are executed together, in order to save on dispatch
> costs and value store/load costs.

Fusion also improves branch prediction by replicating "come from" code. Indirect branch prediction in Intel Core 2 Duo and beyond is surprisingly good. It still uses the branch or jump address as key, though, so splitting cases helps.

/be
(In reply to comment #12)
> Hi there,
> 
> I'm glad to inform to you that I have the first version of the algorithm for
> overflow elimination! It works only for local variables of induction, but now
> we will work in order to add other analyses. 
> 
> The code needs some cleanups, after that I will share with you.
> 
> I have some doubt to test. Which test suite is more appropriate to run? I used
> to use trace-test.js, but it no longer available.

The new way is to use js/src/trace-test/trace-test.py. For example, from a directory like js/src/debug, I usually run:

python ../trace-test/trace-test.py --no-slow shell/js
I found a bug, and for that reason I did not send the patch earlier.

About the algorithm:

1 - It is still a crude attempt.
2 - It only handles cases like i < n, where both are local variables.
3 - It eliminates overflow tests for "i" if and only if i and n are not modified inside the trace.
4 - I think it is possible to insert some more sophisticated analyses without much overhead. I will now work on improving them.

Please send me feedback.
Attached patch Patch to avoid overflow guards (obsolete) — Splinter Review
Rodrigo: looks like you forgot to |hg add| the new files.  Also, there are a lot of extraneous whitespace/line-ending changes.
Ohh my bad.
Attached patch Patch to avoid overflow guards (obsolete) — Splinter Review
Attachment #433320 - Attachment is obsolete: true
Dear guys,

I coded a simple analysis to remove overflow tests. The analysis runs in time linear in the number of instructions seen in the trace. I have written a small report about the analysis, available at http://www.dcc.ufmg.br/~rsol/docs/ovs.pdf (starting in section 3). In a nutshell, my analysis identifies about 60% of the overflow tests and removes approximately half of the tests that it recognizes. On average I decrease the size of the generated x86 code by about 3%, at the expense of increasing compilation time by about 5%. Still, I believe my code can be greatly improved for performance after some review.
Rodrigo's done work to get this going, someone should give him the courtesy of a review.  I would but I barely know anything about SM's bytecode.  dmandelin, could you take a look?

Rodrigo, can you do measurements on SunSpider and V8?  You can get both of them here:  http://svn.webkit.org/repository/webkit/trunk/SunSpider/.  You'll need to run the 'sunspider' script twice, once for before, once for after, then use 'sunspider-compare-results' to compare the results.
Yes, this is great work. I cc'ed Brian, hoping that we can suck this into his static analysis framework.
The only issue I see there is that the analysis data structures in the inference bug are geared towards whole-program, "X always holds for a bytecode" information, whereas Rodrigo's analysis computes information on a trace-by-trace basis (and different things could hold for a given bytecode depending on the trace).

I do have a question about the analysis though (also tracing in general).  When a loop is executed multiple times (the whole loop, not individual iterations), are the traces regenerated and/or reanalyzed to remove overflow checks?  The analysis pulls loop boundaries from the interpreter during trace recording (really cool), so in a 'for (i = 0; i < N; i++)' loop it knows what N is and analyzes based on that, but if the loop is hit again N might be different.
We can assert that N is a specific value if we think it's worth specializing on it. Ideally, though, some analysis would prove that fact, since mis-speculating can be costly.
(In reply to comment #25)
> We can assert that N is a specific value if we think its worth specializing on

s/assert/guard/ ;-)

/be
Nicholas: I will definitely give SunSpider a try. It is currently a holiday in Brazil and I am out of town, but I will do it as early as this Monday, when I get back to school.

Brian: the analysis runs once, when the trace is created. I did not think about the situation where the loop is hit again, i.e., when you get out of the loop, go to the interpreter, and then come back to the loop. Actually, if possible, I would like suggestions on how to handle this case: should I invalidate the old trace and try to produce a new one?
Hi,

I am sending SunSpider's results:

Effectiveness results for local variables

script                      ovs removed   ovs total   % removed
---------------------------------------------------------------
3d-cube                              23         124         19%
3d-morph                              3          16         19%
3d-raytrace                          87         484         18%
Access-fannkuch                      26         310          8%
Access-nbody                         20          71         28%
Access-nsieve                        12          12        100%
Access-binary-tree                    0           0           -
Bitops-3bit-bits-in-byte              2           2        100%
Bitops-bits-in-byte                   3           3        100%
bitwise-and                           0           0           -
Bitops-nsieve-bits                    8           8        100%
controlflow                           0           0           -
Crypto-aes                           69         633         11%
Crypto-md5                            3           3        100%
Crypto-sha1                          14          14        100%
Date-format-tofte                     3           3        100%
Date-format-xparb                     8          97          8%
Math-cordic                           5           5        100%
Math-partial-sums                     1           7         14%
Math-spectral-norm                   20          42         48%
regexp-dna                            0           0           -
String-base64                         8          20         40%
String-fasta                          2          18         11%
String-tagcloud                       3          12         25%
String-validate-input                 9          49         18%


********************************************************************
SunSpider Results

TEST                   COMPARISON            FROM                 TO             DETAILS

=============================================================================

** TOTAL **:           *1.093x as slow*  4373.6ms +/- 0.5%   4779.1ms +/- 0.5%     significant

=============================================================================

  3d:                  *1.157x as slow*   830.6ms +/- 0.1%    961.3ms +/- 0.2%     significant
    cube:              *1.28x as slow*    386.9ms +/- 0.1%    494.2ms +/- 0.1%     significant
    morph:             *1.025x as slow*    78.5ms +/- 0.5%     80.5ms +/- 2.5%     significant
    raytrace:          *1.059x as slow*   365.2ms +/- 0.1%    386.6ms +/- 0.1%     significant

  access:              *1.005x as slow*   547.7ms +/- 0.5%    550.5ms +/- 0.2%     significant
    binary-trees:      *1.008x as slow*   130.3ms +/- 0.3%    131.3ms +/- 0.3%     significant
    fannkuch:          1.008x as fast     244.1ms +/- 0.4%    242.2ms +/- 0.1%     significant
    nbody:             *1.024x as slow*   114.8ms +/- 1.2%    117.5ms +/- 0.9%     significant
    nsieve:            *1.017x as slow*    58.5ms +/- 0.6%     59.5ms +/- 0.6%     significant

  bitops:              *1.044x as slow*   102.5ms +/- 0.5%    107.0ms +/- 0.9%     significant
    3bit-bits-in-byte: -                    2.8ms +/- 10.8%      2.8ms +/- 10.8% 
    bits-in-byte:      *1.21x as slow*     13.6ms +/- 2.7%     16.4ms +/- 2.3%     significant
    bitwise-and:       ??                   2.8ms +/- 10.8%      3.1ms +/- 7.3%     not conclusive: might be *1.107x as slow*
    nsieve-bits:       *1.017x as slow*    83.3ms +/- 0.4%     84.7ms +/- 0.4%     significant

  controlflow:         ??                  19.0ms +/- 0.0%     19.1ms +/- 1.2%     not conclusive: might be *1.005x as slow*
    recursive:         ??                  19.0ms +/- 0.0%     19.1ms +/- 1.2%     not conclusive: might be *1.005x as slow*

  crypto:              ??                 334.0ms +/- 6.0%    334.6ms +/- 5.9%     not conclusive: might be *1.002x as slow*
    aes:               -                  183.9ms +/- 10.9%    178.8ms +/- 11.1% 
    md5:               *1.034x as slow*   108.7ms +/- 0.3%    112.4ms +/- 0.3%     significant
    sha1:              *1.048x as slow*    41.4ms +/- 0.9%     43.4ms +/- 0.9%     significant

  date:                *1.108x as slow*   805.0ms +/- 0.1%    891.9ms +/- 0.1%     significant
    format-tofte:      *1.150x as slow*   532.9ms +/- 0.2%    612.8ms +/- 0.2%     significant
    format-xparb:      *1.026x as slow*   272.1ms +/- 0.2%    279.1ms +/- 0.1%     significant

  math:                *1.022x as slow*    72.3ms +/- 0.7%     73.9ms +/- 0.5%     significant
    cordic:            *1.030x as slow*    27.0ms +/- 0.0%     27.8ms +/- 1.1%     significant
    partial-sums:      ??                  25.9ms +/- 0.9%     26.1ms +/- 0.9%     not conclusive: might be *1.008x as slow*
    spectral-norm:     *1.031x as slow*    19.4ms +/- 1.9%     20.0ms +/- 0.0%     significant

  regexp:              1.008x as fast      85.8ms +/- 0.7%     85.1ms +/- 0.3%     significant
    dna:               1.008x as fast      85.8ms +/- 0.7%     85.1ms +/- 0.3%     significant

  string:              *1.114x as slow*  1576.7ms +/- 0.2%   1755.7ms +/- 0.2%     significant
    base64:            ??                  57.8ms +/- 0.5%     58.0ms +/- 0.6%     not conclusive: might be *1.003x as slow*
    fasta:             *1.037x as slow*   344.8ms +/- 0.3%    357.4ms +/- 0.6%     significant
    tagcloud:          *1.110x as slow*   538.9ms +/- 0.2%    598.3ms +/- 0.1%     significant
    unpack-code:       *1.191x as slow*   547.5ms +/- 0.3%    651.8ms +/- 0.5%     significant
    validate-input:    *1.029x as slow*    87.7ms +/- 0.4%     90.2ms +/- 0.3%     significant
Rodrigo, those times are slow, it looks like you're running the interpreter.  Can you rerun using --args='-j' so that the JIT is running?
Hi,

I am sending you the patch with the algorithm. I am trying to improve its performance, but without success so far. I am using Shark to profile the code, and it seems that the bottleneck is the data structure responsible for mapping variables to nodes in the graph. I have tried an STL map and a simple array to keep this information, and both were inefficient.

Any help will be welcome.

Thanks.
Attachment #433456 - Attachment is obsolete: true
Obsolete with the removal of tracejit.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → WONTFIX