Closed Bug 536641 Opened 15 years ago Closed 13 years ago

TM: cheap static analysis of loop variables to avoid overflow guards

Categories

(Core :: JavaScript Engine, defect)

x86
macOS
defect
Not set
normal

Tracking


RESOLVED WONTFIX

People

(Reporter: n.nethercote, Unassigned)

References

Details

Attachments

(2 files, 2 obsolete files)

Raised in bug 536630:  in simple loops like this:

  for (var i = 0; i < 100; i++) {
      ...
  }

we do an unnecessary overflow check.  It would be nice to have a cheap static analysis that identifies when it can be avoided.

Some care is needed -- we have to ensure that 'i' can't be modified by the loop body.

Although before doing a lot of work on this it would be good to determine if the potential benefits are significant.
Attached file SS results
As an easy limit study I tried turning off overflow checking.  SS results are attached.  Overall it was a ~1.7% (13ms) win, with big improvements for 3bit-bits-in-byte (1.77x), bitwise-and (1.23x), crypto-md5 (1.35x), crypto-sha1 (1.13x) and slight improvements everywhere else.

This result will overstate the best results achievable with this bug, as it removes all overflow checks, not just those in simple for-loops.  It's hard to know what proportion is really removable.  But it's encouraging that bitwise-and does well: it consists of a single small loop where the only overflow check is on the loop variable increment, so we can see that the overflow check cost is significant for it.
Blocks: 537868
I'm working on this issue. I will keep you up to date.

If you have any suggestions or ideas, please share them with me.
I was just thinking about this last night!  Here's the analysis I sketched; unless you want to forbid all function calls or possibly-side-effecting property accesses, you need pretty much all of these.  The upvar lexdep stuff should make step 4 relatively straightforward.

Initial analysis:
1) if the for-loop's cond is of the form TOK_RELOP[JSOP_LT](TOK_NAME,TOK_NUMBER) and
2) the loop control is initialized to an integer and
3) the loop control's name is only updated within the loop's update and
4) there are no closures which reference the loop control and
5) the loop update is an increment expression and
6) the bounding constant plus the constant operand of the increment expression is less than the maximum integer value then
WIN.

js_FoldConstants looks like a good place to perform the analysis: if the analysis indicates that the loop has an integer-suited loop control, then annotate the for-head NOP with a different source note to propagate this fact (or use an opcode?)
It's nice to have a starting point. I'll follow this path.

Thanks.
I am working on this bug, and I would like to discuss the best approach to handling it. My original idea was to work at the bytecode level; now I am also considering working on the abstract syntax tree (AST).

At the bytecode level, once the monitor is activated, I look for branches and patterns such as "i < 100". Once a segment is built, I check whether 'i' is modified inside the segment. If it is not, I remove all the overflow tests that use 'i', other bounded variables, or constants that cannot cause overflow.

On the AST, I can use the tree structure to save some time: I simply look for loop nodes and check whether the induction variable is changed in their subnodes. The problem with working on the AST is that I can't see how to propagate the information that a name is bounded down to the place where the segment is produced.

What do you think about these two approaches?
Bytecode is being read by JaegerMonkey JIT code and a CFG built, IIRC. That might be easier to analyze than the AST.

What shaver describes is good for AST analysis of "for (... i = 0; i < 100; i++) { /* no i changes here */ }" loops. That would give quick joy but I bet the rich target is where the code uses N, not 100, for some variable N.

/be
(In reply to comment #5)

The N case seems like a good target, since I think it could be any expression. Even if N changes wildly on trace, if the root of the tree specialized it as an integer, "i" can't overflow. If a branch exits with a double or string or something, we'll just build a new tree anyway.
Currently I'm working with the TraceMonkey baseline. Should I migrate to the JaegerMonkey baseline?
(In reply to comment #8)
> Currently I'm working with TraceMonkey 
> baseline. Should I migrate to JaegerMonkey baseline?

I think it's best to stick with TraceMonkey--it won't change as much underneath you, and the work will get tested and used faster.

At first, I thought it didn't even apply to JM, but I think the results actually could help JM as well, which is really cool. But we'll figure that out later.
Dear all,

My idea is to go over the bytecodes, looking for patterns like "i < N", so that I can attach the constraint (i < 100) to the name "i". For instance, given a loop like:

for (i = 0; i < 100; i++) {
}

tracemonkey produces, at some point in the bytecodes, the pattern below:

00040:  name "i"
00043:  int8 100
00045:  lt
00046:  ifne 20 (-26)
...

So, I changed the macro DO_OP, in file jsinterp.cpp, to print the bytecodes that it is processing. However, it does not print the "ifne" bytecode: it prints the "lt", and then the bytecode that would be executed after the ifne. So, questions:

1) What happened to this bytecode inside the js_Interpret() function?

2) Instead of monitoring the DO_OP macro, is it possible to use the TraceRecorder and the TraceMonitor to discover the constraints that apply on the variables? Say (i < 100), (i < b), (i > 100), etc. It would be nice to have a way to identify the head of the loops.
(In reply to comment #10)
> Dear all,
> 
> My idea is to go over the bytecodes, looking for patterns like "i < N", so that
> I can put the constraint (i < 100) onto name "i". For instance, given a loop
> like:
> 
> for (i = 0; i < 100; i++) {
> }

Side note: your code example above seems to be using a global variable |i|. This is also indicated by the |name| op in your disassembly output (getarg/getlocal are used for locals). Of course, globals have more analysis constraints, because any function called inside the current one could potentially mutate a global, so you probably want to start with locals.

> tracemonkey produces, at some point in the bytecodes, the pattern below:
> 
> 00040:  name "i"
> 00043:  int8 100
> 00045:  lt
> 00046:  ifne 20 (-26)
> ...
> 
> So, I changed the macro DO_OP, in file jsinterp.cpp, to print the bytecodes
> that it is processing. However, it does not print the "ifne bytecode. It prints
> the "lt", and then the bytecode that would be executed after the ifne. So,
> questions:
> 
> 1) What happened to this bytecode inside the js_Interpret() function?

I'm glad you asked. :-) SpiderMonkey has an "op fusion" optimization, where certain pairs of opcodes are executed together, in order to save on dispatch costs and value store/load costs. This happens for (a) relationals followed by ifs (as in your example) and (b) sets followed by pops. If you look at the macro RELATIONAL_OP inside jsops.cpp, you will see that it invokes another macro TRY_BRANCH_AFTER_COND. If the instruction following the relational is ifeq/ifne, this macro will directly execute the conditional branch without using DO_OP. I think you can make your instrumentation correct by adding it inside the 'if (diff <= 1)' if statement.

> 2) Instead of monitoring the DO_OP macro, is it possible to use the
> TraceRecorder and the TraceMonitor to discover the constraints that apply on
> the variables? Say (i < 100), (i < b), (i > 100), etc. It would be nice to have
> a way to identify the head of the loops.

I'm not sure exactly what you want to do, but it seems like it would work, as long as you are only trying to analyze individual traces as they are recorded instead of the whole loop. Another fact of interest is that the bytecode |trace| occurs at each loop header for the tracer's benefit (I think it also occurs at function entry points, and maybe another place or 2, so you'll have to check).

And now I recall why you asked about JM -- the new abstract interpretation function is in there. You could probably import it into your code right now if you wanted to use it--it is in methodjit/BytecodeAnalyzer.{h.cpp} and I don't think it depends on any other part of the method JIT. It figures out incoming non-fallthrough CFG edges for each instruction, so you could probably use that, possibly with a tweak or two, plus the |trace| bytecodes to identify loops directly.
Hi there,

I'm glad to inform you that I have the first version of the overflow-elimination algorithm! It works only for local induction variables, but we will now work on adding other analyses.

The code needs some cleanups; after that I will share it with you.

I have a question about testing: which test suite is most appropriate to run? I used to use trace-test.js, but it is no longer available.
Rodrigo, could you attach a patch? No problem if it is rough, it would be good to take a look and help advise.

/be
(In reply to comment #11)
> I'm glad you asked. :-) SpiderMonkey has an "op fusion" optimization, where
> certain pairs of opcodes are executed together, in order to save on dispatch
> costs and value store/load costs.

Fusion also improves branch prediction by replicating "come from" code. Indirect branch prediction in Intel Core 2 Duo and beyond is surprisingly good. It still uses the branch or jump address as key, though, so splitting cases helps.

/be
(In reply to comment #12)
> Hi there,
> 
> I'm glad to inform to you that I have the first version of the algorithm for
> overflow elimination! It works only for local variables of induction, but now
> we will work in order to add other analyses. 
> 
> The code needs some cleanups, after that I will share with you.
> 
> I have some doubt to test. Which test suite is more appropriate to run? I used
> to use trace-test.js, but it no longer available.

The new way is to use js/src/trace-test/trace-test.py. For example, from a directory like js/src/debug, I usually run:

python ../trace-test/trace-test.py --no-slow shell/js
I found a bug, and for that reason I did not send the patch earlier.

About the algorithm:

1 - It is still a crude attempt.
2 - It only handles cases like i < n, where both are local variables.
3 - It eliminates overflow tests for "i" if and only if i and n are not modified inside the trace.
4 - I think it is possible to insert some more sophisticated analyses without much overhead. I will now work on improving them.

Please send me feedback.
Attached patch Patch to avoid overflow guards (obsolete) — Splinter Review
Rodrigo: looks like you forgot to |hg add| the new files.  Also, there are a lot of extraneous whitespace/line-ending changes.
Ohh my bad.
Attached patch Patch to avoid overflow guards (obsolete) — Splinter Review
Attachment #433320 - Attachment is obsolete: true
Dear guys,

I coded a simple analysis to remove overflow tests. The analysis runs in time linear in the number of instructions seen in the trace. I have written a small report about the analysis, available at http://www.dcc.ufmg.br/~rsol/docs/ovs.pdf (starting in section 3). In a nutshell, my analysis identifies about 60% of the overflow tests and removes approximately half of the tests that it recognizes. On average I decrease the size of the generated x86 code by about 3%, at the expense of increasing compilation time by about 5%. Still, I believe my code can be greatly improved for performance after some review.
Rodrigo's done work to get this going, someone should give him the courtesy of a review.  I would but I barely know anything about SM's bytecode.  dmandelin, could you take a look?

Rodrigo, can you do measurements on SunSpider and V8?  You can get both of them here:  http://svn.webkit.org/repository/webkit/trunk/SunSpider/.  You'll need to run the 'sunspider' script twice, once for before, once for after, then use 'sunspider-compare-results' to compare the results.
Yes, this is great work. I cc'ed Brian, hoping that we can suck this into his static analysis framework.
The only issue I see there is that the analysis data structures in the inference bug are geared towards whole-program, "X always holds for a bytecode" information, whereas Rodrigo's analysis computes information on a trace-by-trace basis (and different things could hold for a given bytecode depending on the trace).

I do have a question about the analysis though (also tracing in general).  When a loop is executed multiple times (the whole loop, not individual iterations), are the traces regenerated and/or reanalyzed to remove overflow checks?  The analysis pulls loop boundaries from the interpreter during trace recording (really cool), so in a 'for (i = 0; i < N; i++)' loop it knows what N is and analyzes based on that, but if the loop is hit again N might be different.
We can assert that N is a specific value if we think it's worth specializing on it. Ideally, though, some analysis would prove that fact, since mis-speculating can be costly.
(In reply to comment #25)
> We can assert that N is a specific value if we think its worth specializing on

s/assert/guard/ ;-)

/be
Nicholas: I will definitely give SunSpider a try. It is currently a holiday in Brazil and I am out of town, but I will do it as early as this Monday, when I get back to school.

Brian: the analysis runs once, when the trace is created. I did not think about the situation where the loop is hit again, i.e., when you get out of the loop, go to the interpreter, and then come back to the loop. Actually, if possible, I would like suggestions on how to handle this case: should I invalidate the old trace and try to produce a new one?
Hi,

I am sending SunSpider's results:

Effectiveness results for local variables

script                      ovs removed   ovs total   % removed
---------------------------------------------------------------
3d-cube                              23         124         19%
3d-morph                              3          16         19%
3d-raytrace                          87         484         18%
Access-fannkuch                      26         310          8%
Access-nbody                         20          71         28%
Access-nsieve                        12          12        100%
Access-binary-tree                    0           0           -
Bitops-3bit-bits-in-byte              2           2        100%
Bitops-bits-in-byte                   3           3        100%
bitwise-and                           0           0           -
Bitops-nsieve-bits                    8           8        100%
controlflow                           0           0           -
Crypto-aes                           69         633         11%
Crypto-md5                            3           3        100%
Crypto-sha1                          14          14        100%
Date-format-tofte                     3           3        100%
Date-format-xparb                     8          97          8%
Math-cordic                           5           5        100%
Math-partial-sums                     1           7         14%
Math-spectral-norm                   20          42         48%
regexp-dna                            0           0           -
String-base64                         8          20         40%
String-fasta                          2          18         11%
String-tagcloud                       3          12         25%
String-validate-input                 9          49         18%


********************************************************************
SunSpider Results

TEST                   COMPARISON            FROM                 TO             DETAILS

=============================================================================

** TOTAL **:           *1.093x as slow*  4373.6ms +/- 0.5%   4779.1ms +/- 0.5%     significant

=============================================================================

  3d:                  *1.157x as slow*   830.6ms +/- 0.1%    961.3ms +/- 0.2%     significant
    cube:              *1.28x as slow*    386.9ms +/- 0.1%    494.2ms +/- 0.1%     significant
    morph:             *1.025x as slow*    78.5ms +/- 0.5%     80.5ms +/- 2.5%     significant
    raytrace:          *1.059x as slow*   365.2ms +/- 0.1%    386.6ms +/- 0.1%     significant

  access:              *1.005x as slow*   547.7ms +/- 0.5%    550.5ms +/- 0.2%     significant
    binary-trees:      *1.008x as slow*   130.3ms +/- 0.3%    131.3ms +/- 0.3%     significant
    fannkuch:          1.008x as fast     244.1ms +/- 0.4%    242.2ms +/- 0.1%     significant
    nbody:             *1.024x as slow*   114.8ms +/- 1.2%    117.5ms +/- 0.9%     significant
    nsieve:            *1.017x as slow*    58.5ms +/- 0.6%     59.5ms +/- 0.6%     significant

  bitops:              *1.044x as slow*   102.5ms +/- 0.5%    107.0ms +/- 0.9%     significant
    3bit-bits-in-byte: -                    2.8ms +/- 10.8%      2.8ms +/- 10.8% 
    bits-in-byte:      *1.21x as slow*     13.6ms +/- 2.7%     16.4ms +/- 2.3%     significant
    bitwise-and:       ??                   2.8ms +/- 10.8%      3.1ms +/- 7.3%     not conclusive: might be *1.107x as slow*
    nsieve-bits:       *1.017x as slow*    83.3ms +/- 0.4%     84.7ms +/- 0.4%     significant

  controlflow:         ??                  19.0ms +/- 0.0%     19.1ms +/- 1.2%     not conclusive: might be *1.005x as slow*
    recursive:         ??                  19.0ms +/- 0.0%     19.1ms +/- 1.2%     not conclusive: might be *1.005x as slow*

  crypto:              ??                 334.0ms +/- 6.0%    334.6ms +/- 5.9%     not conclusive: might be *1.002x as slow*
    aes:               -                  183.9ms +/- 10.9%    178.8ms +/- 11.1% 
    md5:               *1.034x as slow*   108.7ms +/- 0.3%    112.4ms +/- 0.3%     significant
    sha1:              *1.048x as slow*    41.4ms +/- 0.9%     43.4ms +/- 0.9%     significant

  date:                *1.108x as slow*   805.0ms +/- 0.1%    891.9ms +/- 0.1%     significant
    format-tofte:      *1.150x as slow*   532.9ms +/- 0.2%    612.8ms +/- 0.2%     significant
    format-xparb:      *1.026x as slow*   272.1ms +/- 0.2%    279.1ms +/- 0.1%     significant

  math:                *1.022x as slow*    72.3ms +/- 0.7%     73.9ms +/- 0.5%     significant
    cordic:            *1.030x as slow*    27.0ms +/- 0.0%     27.8ms +/- 1.1%     significant
    partial-sums:      ??                  25.9ms +/- 0.9%     26.1ms +/- 0.9%     not conclusive: might be *1.008x as slow*
    spectral-norm:     *1.031x as slow*    19.4ms +/- 1.9%     20.0ms +/- 0.0%     significant

  regexp:              1.008x as fast      85.8ms +/- 0.7%     85.1ms +/- 0.3%     significant
    dna:               1.008x as fast      85.8ms +/- 0.7%     85.1ms +/- 0.3%     significant

  string:              *1.114x as slow*  1576.7ms +/- 0.2%   1755.7ms +/- 0.2%     significant
    base64:            ??                  57.8ms +/- 0.5%     58.0ms +/- 0.6%     not conclusive: might be *1.003x as slow*
    fasta:             *1.037x as slow*   344.8ms +/- 0.3%    357.4ms +/- 0.6%     significant
    tagcloud:          *1.110x as slow*   538.9ms +/- 0.2%    598.3ms +/- 0.1%     significant
    unpack-code:       *1.191x as slow*   547.5ms +/- 0.3%    651.8ms +/- 0.5%     significant
    validate-input:    *1.029x as slow*    87.7ms +/- 0.4%     90.2ms +/- 0.3%     significant
Rodrigo, those times are slow, it looks like you're running the interpreter.  Can you rerun using --args='-j' so that the JIT is running?
Hi,

I am sending you the patch with the algorithm. I am trying to improve its performance, but without success so far. I am using Shark to profile the code, and it seems that the bottleneck is the data structure responsible for mapping variables to nodes in the graph. I have tried an STL map and a simple array to keep this information, and both were inefficient.

Any help will be welcome.

Thanks.
Attachment #433456 - Attachment is obsolete: true
Obsolete with the removal of tracejit.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → WONTFIX