Closed Bug 777583 Opened 8 years ago Closed 7 years ago

IonMonkey: Unregress 3d-raytrace performance

Categories

(Core :: JavaScript Engine, defect)

defect
Not set

Tracking


RESOLVED FIXED
mozilla18

People

(Reporter: dvander, Assigned: nbp)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(Whiteboard: [ion:p1:fx18])

Attachments

(5 files, 1 obsolete file)

Attached file benchmark
We're losing about 3ms on 3d-raytrace on my machine (13.6ms -> 16.8ms). The attached benchmark attempts to increase the runtime - I'm not sure how representative this is of the original benchmark, but there is clearly a difference:

JM+TI: 440ms
Ion:   644ms
V8:    295ms
Whiteboard: [js:p1:fx18]
Whiteboard: [js:p1:fx18] → [ion:p1:fx18]
Depends on: 780020
Depends on: 780052
Depends on: 781052
Assignee: general → nicolas.b.pierron
Status: NEW → ASSIGNED
Attached file Reduced benchmark.
This benchmark is a reduced test case showing a ~2.75x slowdown on the IonMonkey branch between --no-ion (~1.1ms) and the default JM+Ion combination (~3.0ms) with the same build.  It gets even worse with the commented line: ~1.2ms to ~4.1ms (a ~3.6x slowdown).

I don't know whether this reduced test case is representative of the current performance issue.  I noticed that the issue seems to disappear when we increase the loop counter inside the original benchmark: JM+Ion then performs better than JM alone and better than Ion alone.  The same behaviour can be reproduced on this test case by increasing the loop boundary.

Increasing the loop boundary above 10000 should cause a recompilation from JM to Ion, so the error is likely located in JM when IonMonkey is enabled.
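
For reference, here is a minimal sketch of the shape of test case described above (this is not the attached file; names and constants are illustrative):

  // The callee's for-loop body runs only once per call, yet the loop still
  // bumps the use count, so the function gets hot quickly.
  function callee(v) {
    var sum = 0;
    for (var i = 0; i < 1; i++)
      sum += v * v;
    return sum;
  }

  function benchmark(limit) {
    var total = 0;
    // Raising `limit` well above the JM->Ion threshold makes the slowdown
    // fade, as described above.
    for (var j = 0; j < limit; j++)
      total += callee(j);
    return total;
  }

  var start = Date.now();
  benchmark(10000);
  print(Date.now() - start);  // `print` is the JS shell builtin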
(In reply to Nicolas B. Pierron [:pierron] [:nbp] from comment #1)
> Created attachment 655133 [details]
> Reduced benchmark.
> 
> Increasing the loop boundary above 10000 should cause a recompilation from
> JM to Ion, so the error is likely located in JM when IonMonkey is enabled.

Correction: the performance issue noticed here is caused by Ion->JM calls made after the compilation of SceneIntersect, which happens early because its use count is incremented by 3 per call (my guess: at the function entrance and at the loop condition), even though the body of the for loop runs only once per function call.

Based on profiling, we spend half of InvokeFunction deciding whether or not to call js::RunScript on the script (see the profile below), while the JM-compiled script itself takes only about 1/3 of the js::RunScript time to execute.

The useCount of TriangleIntersect is high enough for it to be inlined under the new inlining rules for small bytecode, but the type oracle refuses because its arguments would have to be guarded with type barriers.

Adding argument type barriers when we inline should remove the Ion->JM calls and thus improve performance on this micro-benchmark.  The same problem also appears in the original benchmark, where 7 already-hot scripts and 6 not-so-hot scripts have their inlining refused because of argument type barriers.  Given the cost of Ion->JM calls and the short runtime of the small sunspider benchmarks, this is likely a cause of the slowdown.
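
As a hedged illustration of the kind of call site involved (not code from the benchmark): a small callee whose argument type set mixes int32 and double, so inlining it would require argument type barriers.

  function triangleish(x) {            // hypothetical stand-in for TriangleIntersect
    return x * x + 1;
  }

  function scene(n) {
    var acc = 0;
    for (var i = 0; i < n; i++) {
      acc += triangleish(i);           // int32 argument
      if (i % 1000 === 0)
        acc += triangleish(i + 0.5);   // occasional double widens the argument type set
    }
    return acc;
  }

  scene(20000);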


====
With a loop limit at 5000 instead of 10000, only SceneIntersect is compiled and:

5000   10000  loop boundary.

25.04% 38.39% EnterIon
 `  1.43%  3.23% PR_SetThreadPrivate [1]
21.66% 30.74% C++ --> Ion
21.02% 29.30% SceneIntersect (Ion)
20.28% 24.25% VM Wrapper
19.99% 23.91% js::ion::InvokeFunction
       …
10.13% 12.40% js::RunScript
 `  1.11%  1.86% js::ion::CanEnter [2]
       …
 3.07%  3.59% C++ --> JM
 2.69%  3.14% TriangleIntersect (JM)
Attachment #657467 - Flags: review?(dvander)
This is useful when the caller cannot be compiled with Ion, but the callee can be.
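
A hedged sketch of the situation this targets (the try/catch is only a stand-in for something Ion refuses to compile; the names are invented):

  function hotCallee(x) {
    return x * x + 1;          // small and hot: gets an IonScript
  }

  function jmOnlyCaller(n) {
    var acc = 0;
    try {                      // assumed here to keep the caller in JM
      for (var i = 0; i < n; i++)
        acc += hotCallee(i);
    } catch (e) {
      acc = -1;
    }
    return acc;
  }

  jmOnlyCaller(100000);

With the patch, the call IC in jmOnlyCaller should be able to call directly into hotCallee's Ion code.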
Attachment #657469 - Flags: review?(dvander)
Compared to the previous version of this patch, in addition to handling the exclusive type guard, it also handles the corner case of a double & int type set monitored before a call site where only ints are given to the callee, because setNumber converts doubles into integers when possible.

The performance impact of this optimization should be that more code gets inlined and that we avoid calls from IonMonkey to JaegerMonkey during the startup of the script.  I don't see any improvement on sunspider, but I am persuaded this still saves us from a startup performance regression, even if sunspider timings do not highlight the regression happening during the JM->Ion transition.
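
A hedged sketch of that corner case (illustrative, not taken from the patch): the type set monitored at the call site contains both int and double because of the double constant, yet the values actually passed are all int32-representable, so they are stored as ints.

  function callee(n) {
    return n + 1;
  }

  function caller(i) {
    // `i * 2.0` is typed as possibly double because of the 2.0 literal,
    // but for small integer inputs the result fits in an int32.
    return callee(i * 2.0);
  }

  for (var i = 0; i < 20000; i++)
    caller(i);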
Attachment #657467 - Attachment is obsolete: true
Attachment #657467 - Flags: review?(dvander)
Attachment #657969 - Flags: review?(dvander)
What benchmark or code is this intended to improve? What do you mean by a "startup performance regression", and is that measurable?
(In reply to David Anderson [:dvander] from comment #6)
> What benchmark or code is this intended to improve? What do you mean by a
> "startup performance regression", and is that measurable?

It definitely optimizes the reduced benchmark.  I was thinking of making a new harness to highlight start-up performance better than the sunspider benchmarks do.  The sunspider benchmarks are too short to show start-up performance, and compilation cost may account for a few extra ms, leading us to think we are making bad optimizations even when our changes are supposed to improve start-up performance.  The current sunspider harness is not informative enough to judge heuristic improvements properly.

I plan to spend a day experimenting with a new harness that highlights start-up performance by re-using the current benchmarks.  Then I will go through the IonMonkey history and re-evaluate previous heuristic changes, to see whether this correlates with past sunspider variations.
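
A minimal sketch of what such a harness could look like (purely hypothetical, not the harness mentioned above): time the first, cold run of a workload separately from warmed-up runs, so that compilation and JM->Ion transition costs stay visible instead of being averaged away.

  function timeColdAndWarm(workload, warmRuns) {
    var t0 = Date.now();
    workload();                        // cold run: includes compilation cost
    var cold = Date.now() - t0;

    var t1 = Date.now();
    for (var i = 0; i < warmRuns; i++)
      workload();                      // warm runs: steady-state code
    var warm = (Date.now() - t1) / warmRuns;

    return { cold: cold, warm: warm };
  }

  // e.g. timeColdAndWarm(function () { runRaytraceOnce(); }, 100);
  // (runRaytraceOnce is a placeholder for one benchmark iteration)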
Comment on attachment 657469 [details] [diff] [review]
JM Call ICs, Generate an Ion stub when an IonScript is seen.

Review of attachment 657469 [details] [diff] [review]:
-----------------------------------------------------------------

I think this patch will attach an Ion stub, but it will do so while disconnecting the normal Call stub. If the Ion code gets discarded, the JM stub might then flow back into the call IC and try to generate another one.

I think generateFullCallStub needs to call ic.updateLastOolJump(), and hasStubOolJump needs to be updated to take into account pool[Pool_ScriptStub].
Attachment #657469 - Flags: review?(dvander)
Comment on attachment 657969 [details] [diff] [review]
Inline with type-checked arguments.

This deserves a new bug, since it doesn't affect 3d-raytrace and there is already a significant patch here. However, I'm worried that this is a lot of code motion and we don't understand what the performance improvement is. The patch looks straightforward, but it does a lot of stuff, so we should justify it harder - ideally by measuring some positive effect on notable tests.
Attachment #657969 - Flags: review?(dvander)
Comment on attachment 657969 [details] [diff] [review]
Inline with type-checked arguments.

This patch does not provide any optimization yet.  It removes a restriction, making it possible to inline a case we were unable to inline before, which is supposed to avoid too many Ion->JM calls.  Sadly it does not seem to improve the benchmarks, because it creates an extra compilation cost induced by inlining the additional functions.

I guess a heuristic change is needed, either to limit where this patch applies while still avoiding Ion->JM calls, or to balance the inlining decision so that the overhead of inlining does not kill sunspider performance.
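
A hypothetical sketch of the kind of use-count ratio gate measured in the next comment (the names and the exact comparison are assumptions; the landed check may differ):

  // Refuse to inline a callee that is cold relative to its caller: the hotter
  // the caller, the more often the callee must have run to be worth inlining.
  function shouldInline(callerUseCount, calleeUseCount, ratio) {
    return calleeUseCount * ratio >= callerUseCount;
  }

  shouldInline(12800, 100, 128);  // true:  100 * 128 >= 12800
  shouldInline(12800,  50, 128);  // false:  50 * 128 <  12800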
I compared the master branch against attachment 657969 [details] [diff] [review] plus this patch, which makes sure we don't inline everything.  Each sunspider benchmark was run 5000 times and each kraken benchmark 500 times on an Intel(R) Core(TM) i7-2820QM CPU @ 2.30GHz.

ratio       sunspider               kraken
master:  190.5ms +/- 0.0%       1841.2ms +/- 0.1%
     1:  197.6ms +/- 0.0%       1859.2ms +/- 0.1%
     2:  195.1ms +/- 0.0%       1818.0ms +/- 0.1%
     4:  194.1ms +/- 0.0%       1818.0ms +/- 0.1%
     6:  192.9ms +/- 0.0%       1818.5ms +/- 0.1%
     8:  190.4ms +/- 0.0%       1816.6ms +/- 0.1%
    12:  190.6ms +/- 0.0%       1809.8ms +/- 0.1%
    16:  190.1ms +/- 0.0%       1807.4ms +/- 0.1%
    32:  190.3ms +/- 0.0%       1805.9ms +/- 0.1%  <--- best kraken
    64:  189.9ms +/- 0.0%       1806.8ms +/- 0.1%
   128:  189.7ms +/- 0.0%       1807.5ms +/- 0.1%  <--- best sunspider
   256:  190.1ms +/- 0.1%       1808.8ms +/- 0.1%
   512:  190.3ms +/- 0.0%       1806.1ms +/- 0.1%

The kraken improvements mostly come from attachment 657969 [details] [diff] [review], which gives similar results for the larger ratio values; the reason is that the longer run times flatten out the cost of eager inlining.  The sunspider improvements mostly come from the heuristic of this patch, which prevents eager inlining when the function to be inlined has not been called often enough.  On its own, attachment 657969 [details] [diff] [review] does not perform well on sunspider because it widens the set of inlinable functions; this can be seen by pushing the ratio value to the highest numbers.

On top of the 128 configuration (master + attachment 657969 [details] [diff] [review] + heuristic), attachment 657469 [details] [diff] [review] regresses sunspider by 0.2ms (189.7 -> 189.9) and improves kraken by 14.1ms (1807.5 -> 1793.4).

sunspider: master vs. 128 (master + attachment 657969 [details] [diff] [review] + heuristic)

TEST                   COMPARISON            FROM                 TO             DETAILS
=============================================================================
** TOTAL **:           1.004x as fast    190.5ms +/- 0.0%   189.7ms +/- 0.0%     significant
=============================================================================

  3d:                  *1.004x as slow*   34.1ms +/- 0.1%    34.2ms +/- 0.0%     significant
    cube:              1.029x as fast     13.7ms +/- 0.1%    13.3ms +/- 0.1%     significant
    morph:             *1.005x as slow*    4.8ms +/- 0.1%     4.8ms +/- 0.1%     significant
    raytrace:          *1.031x as slow*   15.6ms +/- 0.1%    16.1ms +/- 0.1%     significant

  access:              1.006x as fast     16.4ms +/- 0.1%    16.3ms +/- 0.1%     significant
    binary-trees:      *1.015x as slow*    2.4ms +/- 0.1%     2.5ms +/- 0.1%     significant
    fannkuch:          1.018x as fast      7.6ms +/- 0.1%     7.4ms +/- 0.1%     significant
    nbody:             -                   3.7ms +/- 0.1%     3.7ms +/- 0.1% 
    nsieve:            -                   2.8ms +/- 0.1%     2.8ms +/- 0.1% 

  bitops:              *1.012x as slow*   10.8ms +/- 0.1%    10.9ms +/- 0.1%     significant
    3bit-bits-in-byte: 1.002x as fast      1.1ms +/- 0.1%     1.1ms +/- 0.1%     significant
    bits-in-byte:      *1.035x as slow*    3.5ms +/- 0.2%     3.6ms +/- 0.2%     significant
    bitwise-and:       -                   2.9ms +/- 0.1%     2.9ms +/- 0.1% 
    nsieve-bits:       *1.003x as slow*    3.2ms +/- 0.1%     3.2ms +/- 0.1%     significant

  controlflow:         *1.012x as slow*    2.4ms +/- 0.1%     2.5ms +/- 0.1%     significant
    recursive:         *1.012x as slow*    2.4ms +/- 0.1%     2.5ms +/- 0.1%     significant

  crypto:              1.006x as fast     21.9ms +/- 0.1%    21.7ms +/- 0.1%     significant
    aes:               1.008x as fast     10.5ms +/- 0.1%    10.4ms +/- 0.1%     significant
    md5:               1.006x as fast      7.8ms +/- 0.1%     7.7ms +/- 0.1%     significant
    sha1:              1.002x as fast      3.6ms +/- 0.1%     3.6ms +/- 0.1%     significant

  date:                1.003x as fast     28.6ms +/- 0.1%    28.5ms +/- 0.1%     significant
    format-tofte:      1.017x as fast     16.8ms +/- 0.1%    16.6ms +/- 0.1%     significant
    format-xparb:      *1.016x as slow*   11.8ms +/- 0.1%    11.9ms +/- 0.1%     significant

  math:                1.008x as fast     13.6ms +/- 0.1%    13.4ms +/- 0.1%     significant
    cordic:            1.020x as fast      3.3ms +/- 0.1%     3.3ms +/- 0.1%     significant
    partial-sums:      1.005x as fast      7.4ms +/- 0.1%     7.4ms +/- 0.1%     significant
    spectral-norm:     1.003x as fast      2.8ms +/- 0.1%     2.8ms +/- 0.1%     significant

  regexp:              1.004x as fast      9.9ms +/- 0.1%     9.8ms +/- 0.0%     significant
    dna:               1.004x as fast      9.9ms +/- 0.1%     9.8ms +/- 0.0%     significant

  string:              1.011x as fast     52.9ms +/- 0.1%    52.3ms +/- 0.1%     significant
    base64:            1.004x as fast      4.3ms +/- 0.1%     4.2ms +/- 0.1%     significant
    fasta:             1.002x as fast      6.2ms +/- 0.1%     6.2ms +/- 0.1%     significant
    tagcloud:          1.009x as fast     16.9ms +/- 0.1%    16.7ms +/- 0.1%     significant
    unpack-code:       1.022x as fast     19.2ms +/- 0.1%    18.8ms +/- 0.1%     significant
    validate-input:    ??                  6.4ms +/- 0.1%     6.4ms +/- 0.1%     not conclusive: might be *1.001x as slow*

PS: maybe I should have moved these patches to a new bug named "regress ss-3d-raytrace and improve the rest" …
Attachment #663363 - Flags: review?(kvijayan)
After looking at one profile, I can confirm that most of the cost comes from compilation.  Reducing usesBeforeCompile by a few thousand, so that we avoid any invalidation, does not improve performance since we still spend approximately 2ms compiling.  Removing all inlined functions is even more costly (16.3ms) because we then make Ion-to-JM calls.  In the inlining log below, the unmarked lines correspond to the master branch (15.6ms).  Lines added by attachment 657969 [details] [diff] [review] are marked with a (+) sign; the heuristic change provided in attachment 663363 [details] [diff] [review] does not change anything here.  Removing the large functions (*) from the list of inlined functions gives a little boost (15.7ms).

  [Scripts] Analyzing script ./tests/sunspider-1.0/3d-raytrace.js:160 (0xadd8add8) (usecount=10240) (maxloopcount=0)
  [Scripts] Analyzing script ./tests/sunspider-1.0/3d-raytrace.js:189 (0xadd8add8) (usecount=10252) (maxloopcount=17)
+*[Scripts] Inlining script ./tests/sunspider-1.0/3d-raytrace.js:160 (0xadd8add8)
+ [Scripts] Inlining script ./tests/sunspider-1.0/3d-raytrace.js:83 (0xadd8add8)
  [Scripts] Inlining script ./tests/sunspider-1.0/3d-raytrace.js:67 (0xadd8add8)
  [Scripts] Inlining script ./tests/sunspider-1.0/3d-raytrace.js:79 (0xadd8add8)
  [Scripts] Inlining script ./tests/sunspider-1.0/3d-raytrace.js:79 (0xadd8add8)
+ [Scripts] Inlining script ./tests/sunspider-1.0/3d-raytrace.js:83 (0xadd8add8)
  [Scripts] Inlining script ./tests/sunspider-1.0/3d-raytrace.js:38 (0xadd8add8)
  [Scripts] Inlining script ./tests/sunspider-1.0/3d-raytrace.js:71 (0xadd8add8)
  [Scripts] Inlining script ./tests/sunspider-1.0/3d-raytrace.js:34 (0xadd8add8)
  [Scripts] Inlining script ./tests/sunspider-1.0/3d-raytrace.js:52 (0xadd8add8)
  [Scripts] Inlining script ./tests/sunspider-1.0/3d-raytrace.js:79 (0xadd8add8)
  [Scripts] Inlining script ./tests/sunspider-1.0/3d-raytrace.js:83 (0xadd8add8)
  [Scripts] Inlining script ./tests/sunspider-1.0/3d-raytrace.js:38 (0xadd8add8)
  [Scripts] Inlining script ./tests/sunspider-1.0/3d-raytrace.js:75 (0xadd8add8)
+ [Scripts] Inlining script ./tests/sunspider-1.0/3d-raytrace.js:52 (0xadd8add8)
+ [Scripts] Inlining script ./tests/sunspider-1.0/3d-raytrace.js:52 (0xadd8add8)
+ [Scripts] Inlining script ./tests/sunspider-1.0/3d-raytrace.js:38 (0xadd8add8)
  [Scripts] Analyzing script ./tests/sunspider-1.0/3d-raytrace.js:244 (0xadd8add8) (usecount=10247) (maxloopcount=14)
 *[Scripts] Inlining script ./tests/sunspider-1.0/3d-raytrace.js:160 (0xadd8add8)

All the functions with a line number below 100 are small functions and should not have much impact on compilation time.  Compiling them as stand-alone IonCode just after the compilation of 3d-raytrace.js:189, combined with the (*) improvement, gives a tiny boost (15.5ms), and we have to do so if we want to avoid Ion-to-JM calls.
Comment on attachment 663363 [details] [diff] [review]
Add caller/callee use count ratio to balance inlining cost.

Review of attachment 663363 [details] [diff] [review]:
-----------------------------------------------------------------

Nice idea.
Attachment #663363 - Flags: review?(kvijayan) → review+
Depends on: 793905
Comment on attachment 663363 [details] [diff] [review]
Add caller/callee use count ratio to balance inlining cost.

https://hg.mozilla.org/integration/mozilla-inbound/rev/67f93efc40c8
Attachment #663363 - Flags: review+
Comment on attachment 663363 [details] [diff] [review]
Add caller/callee use count ratio to balance inlining cost.

Oops, wrong flag.
Attachment #663363 - Flags: review+ → checkin+
https://hg.mozilla.org/mozilla-central/rev/6d6fc6e9472a
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Flags: in-testsuite-
Resolution: --- → FIXED
Target Milestone: --- → mozilla18
Blocks: 796114