Closed Bug 777583 Opened 8 years ago Closed 7 years ago

IonMonkey: Unregress 3d-raytrace performance

Categories

(Core :: JavaScript Engine, defect)

defect
Not set

Tracking


RESOLVED FIXED
mozilla18

People

(Reporter: dvander, Assigned: nbp)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(Whiteboard: [ion:p1:fx18])

Attachments

(5 files, 1 obsolete file)

Attached file benchmark
We're losing about 3ms on 3d-raytrace on my machine (13.6ms -> 16.8ms). The attached benchmark attempts to increase the runtime - I'm not sure how representative this is of the original benchmark, but there is clearly a difference:

JM+TI: 440ms
Ion:   644ms
V8:    295ms
Whiteboard: [js:p1:fx18]
Whiteboard: [js:p1:fx18] → [ion:p1:fx18]
Depends on: 780020
Depends on: 780052
Depends on: 781052
Assignee: general → nicolas.b.pierron
Status: NEW → ASSIGNED
Attached file Reduced benchmark.
This benchmark is a reduced test case showing a ~2.75x slowdown on the IonMonkey branch between --no-ion (~1.1ms) and the default JM+Ion combination (~3.0ms) with the same build.  It gets even worse with the commented line: ~1.2ms to ~4.1ms (a ~3.6x slowdown).

I don't know whether this reduced test case is representative of the current performance issue.  I noticed that the issue seems to disappear when we increase the loop counter inside the original benchmark: JM+Ion then performs better than JM alone and better than Ion alone.  The same behaviour can be reproduced on this test case by increasing the loop boundary.

Increasing the loop boundary above 10000 should cause a recompilation from JM to Ion, so the error is likely located in JM when IonMonkey is enabled.
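
For reference, here is a minimal sketch of the shape of test case described above (this is not the attached file; names and constants are illustrative):

  // The callee's for-loop body runs only once per call, yet the loop still
  // bumps the use count, so the function gets hot quickly.
  function callee(v) {
    var sum = 0;
    for (var i = 0; i < 1; i++)
      sum += v * v;
    return sum;
  }

  function benchmark(limit) {
    var total = 0;
    // Raising `limit` well above the JM->Ion threshold makes the slowdown
    // fade, as described above.
    for (var j = 0; j < limit; j++)
      total += callee(j);
    return total;
  }

  var start = Date.now();
  benchmark(10000);
  print(Date.now() - start);  // `print` is the JS shell builtin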
(In reply to Nicolas B. Pierron [:pierron] [:nbp] from comment #1)
> Created attachment 655133 [details]
> Reduced benchmark.
> 
> Increasing the loop boundary above 10000 should cause a recompilation from
> JM to Ion, so the error is likely located in JM when IonMonkey is enabled.

Correction: the performance issue noticed here is caused by Ion->JM calls made after the compilation of SceneIntersect, which happens early because its use count is incremented by 3 per call (my guess: at the function entrance and at the loop condition), even though the body of the for loop runs only once per function call.

Based on profiling, we spend half of InvokeFunction deciding whether or not to call js::RunScript on the script (see the profile below), while the JM-compiled script itself takes only about 1/3 of the js::RunScript time to execute.

The useCount of TriangleIntersect is high enough for it to be inlined under the new inlining rules for small bytecode, but the type oracle refuses because its arguments would have to be guarded with type barriers.

Adding argument type barriers when we inline should remove the Ion->JM calls and thus improve performance on this micro-benchmark.  The same problem also appears in the original benchmark, where 7 already-hot scripts and 6 not-so-hot scripts have their inlining refused because of argument type barriers.  Given the cost of Ion->JM calls and the short runtime of the small sunspider benchmarks, this is likely a cause of the slowdown.
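
As a hedged illustration of the kind of call site involved (not code from the benchmark): a small callee whose argument type set mixes int32 and double, so inlining it would require argument type barriers.

  function triangleish(x) {            // hypothetical stand-in for TriangleIntersect
    return x * x + 1;
  }

  function scene(n) {
    var acc = 0;
    for (var i = 0; i < n; i++) {
      acc += triangleish(i);           // int32 argument
      if (i % 1000 === 0)
        acc += triangleish(i + 0.5);   // occasional double widens the argument type set
    }
    return acc;
  }

  scene(20000);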


====
With a loop limit at 5000 instead of 10000, only SceneIntersect is compiled and:

5000   10000  loop boundary.

25.04% 38.39% EnterIon
 `  1.43%  3.23% PR_SetThreadPrivate [1]
21.66% 30.74% C++ --> Ion
21.02% 29.30% SceneIntersect (Ion)
20.28% 24.25% VM Wrapper
19.99% 23.91% js::ion::InvokeFunction
       …
10.13% 12.40% js::RunScript
 `  1.11%  1.86% js::ion::CanEnter [2]
       …
 3.07%  3.59% C++ --> JM
 2.69%  3.14% TriangleIntersect (JM)
Attachment #657467 - Flags: review?(dvander)
This is useful when the caller cannot be compiled with Ion, but the callee can be.
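
A hedged sketch of the situation this targets (the try/catch is only a stand-in for something Ion refuses to compile; the names are invented):

  function hotCallee(x) {
    return x * x + 1;          // small and hot: gets an IonScript
  }

  function jmOnlyCaller(n) {
    var acc = 0;
    try {                      // assumed here to keep the caller in JM
      for (var i = 0; i < n; i++)
        acc += hotCallee(i);
    } catch (e) {
      acc = -1;
    }
    return acc;
  }

  jmOnlyCaller(100000);

With the patch, the call IC in jmOnlyCaller should be able to call directly into hotCallee's Ion code.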
Attachment #657469 - Flags: review?(dvander)
Compared to the previous version of this patch, in addition to handling the exclusive type guard, it also handles the corner case of a double & int type set monitored before a call site where only ints are given to the callee, because setNumber converts doubles into integers when possible.

The performance impact of this optimization should be that more code gets inlined and that we avoid calls from IonMonkey to JaegerMonkey during the startup of the script.  I don't see any improvement on sunspider, but I am persuaded this still saves us from a startup performance regression, even if sunspider timings do not highlight the regression happening during the JM->Ion transition.
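
A hedged sketch of that corner case (illustrative, not taken from the patch): the type set monitored at the call site contains both int and double because of the double constant, yet the values actually passed are all int32-representable, so they are stored as ints.

  function callee(n) {
    return n + 1;
  }

  function caller(i) {
    // `i * 2.0` is typed as possibly double because of the 2.0 literal,
    // but for small integer inputs the result fits in an int32.
    return callee(i * 2.0);
  }

  for (var i = 0; i < 20000; i++)
    caller(i);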
Attachment #657467 - Attachment is obsolete: true
Attachment #657467 - Flags: review?(dvander)
Attachment #657969 - Flags: review?(dvander)
What benchmark or code is this intended to improve? What do you mean by a "startup performance regression", and is that measurable?
(In reply to David Anderson [:dvander] from comment #6)
> What benchmark or code is this intended to improve? What do you mean by a
> "startup performance regression", and is that measurable?

It definitely optimizes the reduced benchmark.  I was thinking of making a new harness to highlight start-up performance better than the sunspider benchmarks do.  The sunspider benchmarks are too short to show start-up performance, and compilation cost may account for a few extra ms, leading us to think we are making bad optimizations even when our changes are supposed to improve start-up performance.  The current sunspider harness is not informative enough to judge heuristic improvements properly.

I plan to spend a day experimenting with a new harness that highlights start-up performance by re-using the current benchmarks.  Then I will go through the IonMonkey history and re-evaluate previous heuristic changes, to see whether this correlates with past sunspider variations.
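
A minimal sketch of what such a harness could look like (purely hypothetical, not the harness mentioned above): time the first, cold run of a workload separately from warmed-up runs, so that compilation and JM->Ion transition costs stay visible instead of being averaged away.

  function timeColdAndWarm(workload, warmRuns) {
    var t0 = Date.now();
    workload();                        // cold run: includes compilation cost
    var cold = Date.now() - t0;

    var t1 = Date.now();
    for (var i = 0; i < warmRuns; i++)
      workload();                      // warm runs: steady-state code
    var warm = (Date.now() - t1) / warmRuns;

    return { cold: cold, warm: warm };
  }

  // e.g. timeColdAndWarm(function () { runRaytraceOnce(); }, 100);
  // (runRaytraceOnce is a placeholder for one benchmark iteration)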
Comment on attachment 657469 [details] [diff] [review]
JM Call ICs, Generate an Ion stub when an IonScript is seen.

Review of attachment 657469 [details] [diff] [review]:
-----------------------------------------------------------------

I think this patch will attach an Ion stub, but it will do so while disconnecting the normal Call stub. If the Ion code gets discarded, the JM stub might then flow back into the call IC and try to generate another one.

I think generateFullCallStub needs to call ic.updateLastOolJump(), and hasStubOolJump needs to be updated to take into account pool[Pool_ScriptStub].
Attachment #657469 - Flags: review?(dvander)
Comment on attachment 657969 [details] [diff] [review]
Inline with type-checked arguments.

This deserves a new bug, since it doesn't affect 3d-raytrace and there is already a significant patch here. However, I'm worried that this is a lot of code motion and we don't understand what the performance improvement is. The patch looks straightforward, but it does a lot of stuff, so we should justify it harder - ideally by measuring some positive effect on notable tests.
Attachment #657969 - Flags: review?(dvander)
Comment on attachment 657969 [details] [diff] [review]
Inline with type-checked arguments.

This patch does not provide any optimization yet.  It removes a restriction, making it possible to inline a case we were unable to inline before, which is supposed to avoid too many Ion->JM calls.  Sadly it does not seem to improve the benchmarks, because it creates an extra compilation cost induced by inlining the additional functions.

I guess a heuristic change is needed, either to limit where this patch applies while still avoiding Ion->JM calls, or to balance the inlining decision so that the overhead of inlining does not kill sunspider performance.
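
A hypothetical sketch of the kind of use-count ratio gate measured in the next comment (the names and the exact comparison are assumptions; the landed check may differ):

  // Refuse to inline a callee that is cold relative to its caller: the hotter
  // the caller, the more often the callee must have run to be worth inlining.
  function shouldInline(callerUseCount, calleeUseCount, ratio) {
    return calleeUseCount * ratio >= callerUseCount;
  }

  shouldInline(12800, 100, 128);  // true:  100 * 128 >= 12800
  shouldInline(12800,  50, 128);  // false:  50 * 128 <  12800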
I compared the master branch against attachment 657969 [details] [diff] [review] plus this patch, which makes sure we don't inline everything.  Each sunspider benchmark was run 5000 times and each kraken benchmark 500 times on an Intel(R) Core(TM) i7-2820QM CPU @ 2.30GHz.

ratio       sunspider               kraken
master:  190.5ms +/- 0.0%       1841.2ms +/- 0.1%
     1:  197.6ms +/- 0.0%       1859.2ms +/- 0.1%
     2:  195.1ms +/- 0.0%       1818.0ms +/- 0.1%
     4:  194.1ms +/- 0.0%       1818.0ms +/- 0.1%
     6:  192.9ms +/- 0.0%       1818.5ms +/- 0.1%
     8:  190.4ms +/- 0.0%       1816.6ms +/- 0.1%
    12:  190.6ms +/- 0.0%       1809.8ms +/- 0.1%
    16:  190.1ms +/- 0.0%       1807.4ms +/- 0.1%
    32:  190.3ms +/- 0.0%       1805.9ms +/- 0.1%  <--- best kraken
    64:  189.9ms +/- 0.0%       1806.8ms +/- 0.1%
   128:  189.7ms +/- 0.0%       1807.5ms +/- 0.1%  <--- best sunspider
   256:  190.1ms +/- 0.1%       1808.8ms +/- 0.1%
   512:  190.3ms +/- 0.0%       1806.1ms +/- 0.1%

The kraken improvements mostly come from attachment 657969 [details] [diff] [review], which gives similar results for the larger ratio values; the reason is that the longer run times flatten out the cost of eager inlining.  The sunspider improvements mostly come from the heuristic of this patch, which prevents eager inlining when the function to be inlined has not been called often enough.  On its own, attachment 657969 [details] [diff] [review] does not perform well on sunspider because it widens the set of inlinable functions; this can be seen by pushing the ratio value to the highest numbers.

On top of the 128 configuration (master + attachment 657969 [details] [diff] [review] + heuristic), attachment 657469 [details] [diff] [review] regresses sunspider by 0.2ms (189.7 -> 189.9) and improves kraken by 14.1ms (1807.5 -> 1793.4).

sunspider: master vs. 128 (master + attachment 657969 [details] [diff] [review] + heuristic)

TEST                   COMPARISON            FROM                 TO             DETAILS
=============================================================================
** TOTAL **:           1.004x as fast    190.5ms +/- 0.0%   189.7ms +/- 0.0%     significant
=============================================================================

  3d:                  *1.004x as slow*   34.1ms +/- 0.1%    34.2ms +/- 0.0%     significant
    cube:              1.029x as fast     13.7ms +/- 0.1%    13.3ms +/- 0.1%     significant
    morph:             *1.005x as slow*    4.8ms +/- 0.1%     4.8ms +/- 0.1%     significant
    raytrace:          *1.031x as slow*   15.6ms +/- 0.1%    16.1ms +/- 0.1%     significant

  access:              1.006x as fast     16.4ms +/- 0.1%    16.3ms +/- 0.1%     significant
    binary-trees:      *1.015x as slow*    2.4ms +/- 0.1%     2.5ms +/- 0.1%     significant
    fannkuch:          1.018x as fast      7.6ms +/- 0.1%     7.4ms +/- 0.1%     significant
    nbody:             -                   3.7ms +/- 0.1%     3.7ms +/- 0.1% 
    nsieve:            -                   2.8ms +/- 0.1%     2.8ms +/- 0.1% 

  bitops:              *1.012x as slow*   10.8ms +/- 0.1%    10.9ms +/- 0.1%     significant
    3bit-bits-in-byte: 1.002x as fast      1.1ms +/- 0.1%     1.1ms +/- 0.1%     significant
    bits-in-byte:      *1.035x as slow*    3.5ms +/- 0.2%     3.6ms +/- 0.2%     significant
    bitwise-and:       -                   2.9ms +/- 0.1%     2.9ms +/- 0.1% 
    nsieve-bits:       *1.003x as slow*    3.2ms +/- 0.1%     3.2ms +/- 0.1%     significant

  controlflow:         *1.012x as slow*    2.4ms +/- 0.1%     2.5ms +/- 0.1%     significant
    recursive:         *1.012x as slow*    2.4ms +/- 0.1%     2.5ms +/- 0.1%     significant

  crypto:              1.006x as fast     21.9ms +/- 0.1%    21.7ms +/- 0.1%     significant
    aes:               1.008x as fast     10.5ms +/- 0.1%    10.4ms +/- 0.1%     significant
    md5:               1.006x as fast      7.8ms +/- 0.1%     7.7ms +/- 0.1%     significant
    sha1:              1.002x as fast      3.6ms +/- 0.1%     3.6ms +/- 0.1%     significant

  date:                1.003x as fast     28.6ms +/- 0.1%    28.5ms +/- 0.1%     significant
    format-tofte:      1.017x as fast     16.8ms +/- 0.1%    16.6ms +/- 0.1%     significant
    format-xparb:      *1.016x as slow*   11.8ms +/- 0.1%    11.9ms +/- 0.1%     significant

  math:                1.008x as fast     13.6ms +/- 0.1%    13.4ms +/- 0.1%     significant
    cordic:            1.020x as fast      3.3ms +/- 0.1%     3.3ms +/- 0.1%     significant
    partial-sums:      1.005x as fast      7.4ms +/- 0.1%     7.4ms +/- 0.1%     significant
    spectral-norm:     1.003x as fast      2.8ms +/- 0.1%     2.8ms +/- 0.1%     significant

  regexp:              1.004x as fast      9.9ms +/- 0.1%     9.8ms +/- 0.0%     significant
    dna:               1.004x as fast      9.9ms +/- 0.1%     9.8ms +/- 0.0%     significant

  string:              1.011x as fast     52.9ms +/- 0.1%    52.3ms +/- 0.1%     significant
    base64:            1.004x as fast      4.3ms +/- 0.1%     4.2ms +/- 0.1%     significant
    fasta:             1.002x as fast      6.2ms +/- 0.1%     6.2ms +/- 0.1%     significant
    tagcloud:          1.009x as fast     16.9ms +/- 0.1%    16.7ms +/- 0.1%     significant
    unpack-code:       1.022x as fast     19.2ms +/- 0.1%    18.8ms +/- 0.1%     significant
    validate-input:    ??                  6.4ms +/- 0.1%     6.4ms +/- 0.1%     not conclusive: might be *1.001x as slow*

PS: maybe I should have moved these patches to a new bug named "regress ss-3d-raytrace and improve the rest" …
Attachment #663363 - Flags: review?(kvijayan)
After looking at one profile, I can confirm that most of the cost comes from compilation.  Reducing usesBeforeCompile by a few thousand, so that we avoid any invalidation, does not improve performance since we still spend approximately 2ms compiling.  Removing all inlined functions is even more costly (16.3ms) because we then make Ion-to-JM calls.  In the inlining log below, the unmarked lines correspond to the master branch (15.6ms).  Lines added by attachment 657969 [details] [diff] [review] are marked with a (+) sign; the heuristic change provided in attachment 663363 [details] [diff] [review] does not change anything here.  Removing the large functions (*) from the list of inlined functions gives a little boost (15.7ms).

  [Scripts] Analyzing script ./tests/sunspider-1.0/3d-raytrace.js:160 (0xadd8add8) (usecount=10240) (maxloopcount=0)
  [Scripts] Analyzing script ./tests/sunspider-1.0/3d-raytrace.js:189 (0xadd8add8) (usecount=10252) (maxloopcount=17)
+*[Scripts] Inlining script ./tests/sunspider-1.0/3d-raytrace.js:160 (0xadd8add8)
+ [Scripts] Inlining script ./tests/sunspider-1.0/3d-raytrace.js:83 (0xadd8add8)
  [Scripts] Inlining script ./tests/sunspider-1.0/3d-raytrace.js:67 (0xadd8add8)
  [Scripts] Inlining script ./tests/sunspider-1.0/3d-raytrace.js:79 (0xadd8add8)
  [Scripts] Inlining script ./tests/sunspider-1.0/3d-raytrace.js:79 (0xadd8add8)
+ [Scripts] Inlining script ./tests/sunspider-1.0/3d-raytrace.js:83 (0xadd8add8)
  [Scripts] Inlining script ./tests/sunspider-1.0/3d-raytrace.js:38 (0xadd8add8)
  [Scripts] Inlining script ./tests/sunspider-1.0/3d-raytrace.js:71 (0xadd8add8)
  [Scripts] Inlining script ./tests/sunspider-1.0/3d-raytrace.js:34 (0xadd8add8)
  [Scripts] Inlining script ./tests/sunspider-1.0/3d-raytrace.js:52 (0xadd8add8)
  [Scripts] Inlining script ./tests/sunspider-1.0/3d-raytrace.js:79 (0xadd8add8)
  [Scripts] Inlining script ./tests/sunspider-1.0/3d-raytrace.js:83 (0xadd8add8)
  [Scripts] Inlining script ./tests/sunspider-1.0/3d-raytrace.js:38 (0xadd8add8)
  [Scripts] Inlining script ./tests/sunspider-1.0/3d-raytrace.js:75 (0xadd8add8)
+ [Scripts] Inlining script ./tests/sunspider-1.0/3d-raytrace.js:52 (0xadd8add8)
+ [Scripts] Inlining script ./tests/sunspider-1.0/3d-raytrace.js:52 (0xadd8add8)
+ [Scripts] Inlining script ./tests/sunspider-1.0/3d-raytrace.js:38 (0xadd8add8)
  [Scripts] Analyzing script ./tests/sunspider-1.0/3d-raytrace.js:244 (0xadd8add8) (usecount=10247) (maxloopcount=14)
 *[Scripts] Inlining script ./tests/sunspider-1.0/3d-raytrace.js:160 (0xadd8add8)

All the functions with a line number below 100 are small functions and should not have much impact on compilation time.  Compiling them as stand-alone IonCode just after the compilation of 3d-raytrace.js:189, combined with the (*) improvement, gives a tiny boost (15.5ms), and we have to do so if we want to avoid Ion-to-JM calls.
Comment on attachment 663363 [details] [diff] [review]
Add caller/callee use count ratio to balance inlining cost.

Review of attachment 663363 [details] [diff] [review]:
-----------------------------------------------------------------

Nice idea.
Attachment #663363 - Flags: review?(kvijayan) → review+
Depends on: 793905
Comment on attachment 663363 [details] [diff] [review]
Add caller/callee use count ratio to balance inlining cost.

https://hg.mozilla.org/integration/mozilla-inbound/rev/67f93efc40c8
Attachment #663363 - Flags: review+
Comment on attachment 663363 [details] [diff] [review]
Add caller/callee use count ratio to balance inlining cost.

Oops, wrong flag.
Attachment #663363 - Flags: review+ → checkin+
https://hg.mozilla.org/mozilla-central/rev/6d6fc6e9472a
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Flags: in-testsuite-
Resolution: --- → FIXED
Target Milestone: --- → mozilla18
Blocks: 796114