Closed Bug 535925 Opened 16 years ago Closed 14 years ago

TM: High loop and branch exits and poor performance with JIT in this example

Tracking

()

Status:

RESOLVED WORKSFORME

People

(Reporter: duncan.loveday, Assigned: dvander)

References

Details

(Keywords: perf, regression, testcase)

Attachments

(3 files, 1 obsolete file)

test case 16 years ago Duncan Loveday 691.91 KB, application/x-javascript		Details
test case that runs much better with minor changes 16 years ago Duncan Loveday 691.92 KB, application/x-javascript		Details
Reduced testcase 16 years ago Duncan Loveday 526 bytes, application/x-javascript		Details
Reduced testcase v2 16 years ago Duncan Loveday 402 bytes, application/x-javascript		Details

Duncan Loveday

Reporter

Description

•

16 years ago

User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 (.NET CLR 2.0.50727) Build Identifier: Local m-c build as of 18/12/2009 The attached has a loop structure that seems to perform badly with JIT. I get large numbers of LOOP and BRANCH exits Reproducible: Always Steps to Reproduce: 1. Run the attached in the shell with TMFLAGS=stats 2. 3. Actual Results: $ ./js test.js Time=2696 $ ./js -j test.js Time=5874 $ export TMFLAGS=stats $ ./js -j test.js Time=7915 recorder: started(27), aborted(20), completed(135), different header(0), trees trashed(11), slot promoted(0), unstable loop variable(93), breaks(0), returns(0), merged loop exits(37), unstableInnerCalls(5), blacklisted(3) monitor: exits(81794), BRANCH(36160), CASE(0), DEFAULT(0), LOOP(45634), NESTED(0), MISMATCH(0), OOM(0), OVERFLOW(0), UNSTABLE_LOOP(0), TIMEOUT(0), DEEP_BAIL(0), STATUS(0), RECURSIVE_UNLINKED(0), RECURSIVE_LOOP(0), RECURSIVE_MISMATCH(0), RECURSIVE_EMPTY_RP(0), RECURSIVE_SLURP_FAIL(0), RECURSIVE_SLURP_MISMATCH(0), timeouts(0), type mismatch(0), triggered(81794), global mismatch(0), flushed(1) $

Duncan Loveday

Reporter

Comment 1

•

16 years ago

Attached file test case — Details

Duncan Loveday

Reporter

Updated

•

16 years ago

Blocks: 530956

Keywords: perf, regression, testcase

Duncan Loveday

Reporter

Comment 2

•

16 years ago

Attached file test case that runs much better with minor changes — Details

This might have something to do with initialisation of locals to null within a nested loop structure. Attaching a slightly modified case that runs a ton better. The only difference is a couple of locals are not initialised. These locals are set but never read so it shouldn't be affecting the control flow. $ diff test.js test2.js 33,34c33,34 < var inboundREQGraphID=null; < var leftArrowGraphID=null; --- > var inboundREQGraphID; //=null; > var leftArrowGraphID; //=null; $ diff -u test.js test2.js --- test.js 2009-12-19 15:01:45.078125000 +0000 +++ test2.js 2009-12-19 15:01:20.468750000 +0000 @@ -30,8 +30,8 @@ var messageOtherTierID=values[7]; var isREQOrRCT=(messageType=="REQ" || messageType=="RCT"); - var inboundREQGraphID=null; - var leftArrowGraphID=null; + var inboundREQGraphID; //=null; + var leftArrowGraphID; //=null; var lTierName=null, otherTierKey=null; var otherTierIDorName=(messageOtherTierID=="" ? (dir=="IN" ? from : to) : messageOtherTierID); $ ./js -j test.js Time=7453 recorder: started(27), aborted(20), completed(135), different header(0), trees trashed(11), slot promoted(0), unstable loop variable(93), breaks(0), returns(0), merged loop exits(37), unstableInnerCalls(5), blacklisted(3) monitor: exits(81794), BRANCH(36160), CASE(0), DEFAULT(0), LOOP(45634), NESTED(0), MISMATCH(0), OOM(0), OVERFLOW(0), UNSTABLE_LOOP(0), TIMEOUT(0), DEEP_BAIL(0), STATUS(0), RECURSIVE_UNLINKED(0), RECURSIVE_LOOP(0), RECURSIVE_MISMATCH(0), RECURSIVE_EMPTY_RP(0), RECURSIVE_SLURP_FAIL(0), RECURSIVE_SLURP_MISMATCH(0), timeouts(0), type mismatch(0), triggered(81794), global mismatch(0), flushed(1) $ ./js -j test2.js Time=3391 recorder: started(32), aborted(26), completed(154), different header(0), trees trashed(13), slot promoted(0), unstable loop variable(94), breaks(0), returns(0), merged loop exits(50), unstableInnerCalls(9), blacklisted(5) monitor: exits(3618), BRANCH(1736), CASE(0), DEFAULT(0), LOOP(1882), NESTED(0), MISMATCH(0), OOM(0), OVERFLOW(0), UNSTABLE_LOOP(0), TIMEOUT(0), DEEP_BAIL(0), STATUS(0), RECURSIVE_UNLINKED(0), RECURSIVE_LOOP(0), RECURSIVE_MISMATCH(0), RECURSIVE_EMPTY_RP(0), RECURSIVE_SLURP_FAIL(0), RECURSIVE_SLURP_MISMATCH(0), timeouts(0), type mismatch(0), triggered(3618), global mismatch(0), flushed(1) $

David Anderson [:dvander] - inactive, e-mail if emergency

Assignee

Comment 3

•

16 years ago

I have been working on a patch to help mitigate this problem, so taking. It is sort of a prerequisite to fixing the NES emulator, JPEG decoder, etc - pretty much anything with deep nesting and lots of variables.

Assignee: general → dvander

David Anderson [:dvander] - inactive, e-mail if emergency

Assignee

Updated

•

16 years ago

Status: UNCONFIRMED → ASSIGNED

Ever confirmed: true

Duncan Loveday

Reporter

Comment 4

•

16 years ago

I talked myself into believing the work on merge traces might help here (though not sure it explains the difference between the two test cases). I counted 384 paths from top to bottom of the inner loop (don't know if they're all taken).

Depends on: 516264

Duncan Loveday

Reporter

Updated

•

16 years ago

No longer depends on: 516264

Duncan Loveday

Reporter

Comment 5

•

16 years ago

(In reply to comment #4) > I talked myself into believing the work on merge traces might help here... No, that can't be right. The second test case is no more branchy than the first. I'll shut up.

David Anderson [:dvander] - inactive, e-mail if emergency

Assignee

Comment 6

•

16 years ago

Duncan Loveday

Reporter

Comment 7

•

16 years ago

I think I see. So I ought to be able to reduce the test case quite a bit - the branchiness is a distraction. Might have a go at that tomorrow. One thing that I don't follow: There are other variables that are declared, initialised and set in much the same way, e.g. "var lTierName=null" but don't seem to disrupt tracing to the same extent (if at all).

Duncan Loveday

Reporter

Comment 8

•

16 years ago

Attached file Reduced testcase (obsolete) — Details

I think this is more or less minimal and hopefully shows the same effect. Seems at least four variables are required (perhaps you cater for up to 16 different type maps ?) My results are No JIT ====== $ ./js.exe test4.js cnt=1000000 Time=8765 WITHOUT INITIALISATION TO null ============================== $ ./js.exe -j test4.js cnt=1000000 Time=265 recorder: started(7), aborted(0), completed(22), different header(0), trees trashed(7), slot promoted(0), unstable loop variable(2), breaks(0), returns(0), merged loop exits(15), unstableInnerCalls(0), blacklisted(0) monitor: exits(73), BRANCH(32), CASE(0), DEFAULT(0), LOOP(41), NESTED(0), MISMATCH(0), OOM(0), OVERFLOW(0), UNSTABLE_LOOP(0), TIMEOUT(0), DEEP_BAIL(0), STATUS(0), RECURSIVE_UNLINKED(0), RECURSIVE_LOOP(0), RECURSIVE_MISMATCH(0), RECURSIVE_EMPTY_RP(0), RECURSIVE_SLURP_FAIL(0), RECURSIVE_SLURP_MISMATCH(0), timeouts(0), type mismatch(0), triggered(73), global mismatch(0), flushed(1) $ WITH INITIALISATION TO null =========================== $ ./js.exe -j test4.js cnt=1000000 Time=15328 recorder: started(30), aborted(18), completed(28), different header(0), trees trashed(12), slot promoted(0), unstable loop variable(27), breaks(0), returns(0), merged loop exits(0), unstableInnerCalls(2), blacklisted(5) monitor: exits(999987), BRANCH(23), CASE(0), DEFAULT(0), LOOP(999961), NESTED(0), MISMATCH(0), OOM(0), OVERFLOW(0), UNSTABLE_LOOP(3), TIMEOUT(0), DEEP_BAIL(0), STATUS(0), RECURSIVE_UNLINKED(0), RECURSIVE_LOOP(0), RECURSIVE_MISMATCH(0), RECURSIVE_EMPTY_RP(0), RECURSIVE_SLURP_FAIL(0), RECURSIVE_SLURP_MISMATCH(0), timeouts(0), type mismatch(0), triggered(999987), global mismatch(0), flushed(1) $

Duncan Loveday

Reporter

Comment 9

•

16 years ago

Attached file Reduced testcase v2 — Details

Slightly reduced

Attachment #418816 - Attachment is obsolete: true

David Anderson [:dvander] - inactive, e-mail if emergency

Assignee

Comment 10

•

16 years ago

Here is the culprit: checking vm types 0x83e0dc (ip: 0x30cc45): args0=O/F args1=O/O arguments0=O/N var0=I/I var1=O/N var2=S/N null != tag4 checking vm types 0x849384 (ip: 0x30cc45): args0=O/F args1=O/O arguments0=O/N var0=I/I var1=O/S string != tag0 checking vm types 0x84f6dc (ip: 0x30cc45): args0=O/F args1=O/O arguments0=O/N var0=I/I var1=O/N var2=S/N null != tag4 checking vm types 0x850de4 (ip: 0x30cc45): args0=O/F args1=O/O arguments0=O/N var0=I/I var1=O/S string != tag0 checking vm types 0x85157c (ip: 0x30cc45): args0=O/F args1=O/O arguments0=O/N var0=I/I var1=O/S string != tag0 Blacklisted: too many peer trees. Merge traces won't help here. This loop has exponential type explosion and we stop trying to specialize it further. The long-term strategy for this type of code is to back off on aggressive specialization. In the short term we should be able to do some sort of dumb analysis and figure out that those variables don't matter for connecting the loop edge to the header. My first thought was to do this entirely in the tracer, but it doesn't seem possible. Say a trace has variable |x| that is only used or defined in one or more branches from the main trace. When leaving the trace, there is no (efficient) way to know which set of branches were taken in order to flush native values with the correct types. We could trash trees if this problem came up, and blacklist those slots from being lazily specialized, but that seems pretty complicated, and awful. So it seems like a dumb parser analysis is the best way to try tackling this. If we can mark which variables have a definition that dominates all uses, we can exclude those variables from poisoning tree stability.

Duncan Loveday

Reporter

Comment 11

•

16 years ago

In my last attachment the variables can't impact the loop because they're not in scope. This is an orthogonal point but has the suggestion been made that once the system has stopped recording traces, whether due to branchiness or too many type maps or any other reason, stats are kept on trace entries vs exits ? If the percentage of bad exits exceeds a threshold, delete all the compiled traces and black list the loop. Crude but would allow code that can't currently be traced well to revert to interpreter performance rather than suffer a drop in performance due to continual exits.

David Anderson [:dvander] - inactive, e-mail if emergency

Assignee

Comment 12

•

16 years ago

(In reply to comment #11) > In my last attachment the variables can't impact the loop Yup, that is the sort of thing we should be able to recognize. > but has the suggestion been made that once the system has stopped recording traces Yes, definitely - I don't think there's an open bug on it yet, but we need a heuristic for situations where a very short inner tree's time is dominated by the entry/exit expense, and an outer tree around it doesn't trace.

David Anderson [:dvander] - inactive, e-mail if emergency

Assignee

Comment 13

•

16 years ago

The ultimate goal is bug 536277, a hybrid approach where we can de-specialize, still get decent perf on the things that don't trace well, and quickly transition to the inner loops which do trace well.

Duncan Loveday

Reporter

Comment 14

•

16 years ago

Thanks David. For me, I feel TM has yet to show its true potential. It performs brilliantly on very simple control structures but the gains quickly evaporate for code of even modest complexity. I now understand a little more about why this is so and I can see there's a great deal of good work going which I am sure will make things better and better so keep it up !

Duncan Loveday

Reporter

Comment 15

•

16 years ago

> ... we need a heuristic for situations where a very short inner tree's time is dominated by the entry/exit expense, and an outer tree around it doesn't trace. Yes, indeed, that is the hard part. I played about with a simple patch where whenever an outer tree is blacklisted, all inner trees are ignored. Huge gain in the test case here but as you point out that's only because the inner tree is so short and entered so often. With a longer inner tree and a short, untraceable outer the patch destroys JIT performance of a well-traced inner loop. The only heuristic I can think of is to record the elapsed time spent executing the inner tree vs time spent looking up fragments etc. Short loops would have a low percentage of time spent actually executing the tree.

Boris Zbarsky [:bzbarsky]

Updated

•

16 years ago

Depends on: 470779

Ryan VanderMeulen [:RyanVM]

Comment 16

•

14 years ago

Current js shell results for attached reduced testcase v2: Interp: 856.004 ms -j: 872.811 ms -m: 75.313 ms -m -n: 46.262 ms Looks like JM+TI wins here.

Blocks: 467263

Status: ASSIGNED → RESOLVED

Closed: 14 years ago

Resolution: --- → WORKSFORME

You need to log in before you can comment on or make changes to this bug.