JM+TI 1.5x slower than v8 in JS-based JPEG encoder

RESOLVED FIXED

Status

()

defect
RESOLVED FIXED
10 years ago
5 years ago

People

(Reporter: blizzard, Assigned: dvander)

Tracking

(Blocks 1 bug)

Trunk
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(Not tracked)

Details

()

Attachments

(1 attachment)

This guy ported over a JPEG encoder into JS.  TraceMonkey (3.6b1) is about 10x slower than Safari.

Comment 1

10 years ago
We should turn this into a shell test case and then do some analysis.
Bug 531225 will help the worker version, probably.
Depends on: 531225
jitstats for the non-worker version:

recorder: started(85), aborted(28), completed(197), different header(0), trees trashed(0), slot promoted(0), unstable loop variable(85), breaks(0), returns(0), merged loop exits(116), unstableInnerCalls(17), blacklisted(8)
monitor: exits(6085), timeouts(1), type mismatch(0), triggered(6086), global mismatch(0), flushed(0)

Aborts (piped through sort | uniq -c | sort -n -r):

   5 Abort recording of tree jpegencoder.js:558@263 at jpegencoder.js:560@275: No compatible inner tree.
   3 Abort recording of tree jpegencoder.js:558@263 at jpegencoder.js:560@289: Inner tree is trying to grow, abort outer recording.
   2 Abort recording of tree jpegencoder.js:629@270 at jpegencoder.js:635@312: No compatible inner tree.
   2 Abort recording of tree jpegencoder.js:629@270 at jpegencoder.js:560@289: Inner tree is trying to grow, abort outer recording.
   2 Abort recording of tree jpegencoder.js:629@270 at jpegencoder.js:550@213: Inner tree is trying to grow, abort outer recording.
   2 Abort recording of tree jpegencoder.js:629@270 at jpegencoder.js:550@203: No compatible inner tree.
   2 Abort recording of tree jpegencoder.js:194@30 at jpegencoder.js:195@39: No compatible inner tree.
   1 Abort recording of tree jpegencoder.js:629@270 at jpegencoder.js:558@263: No compatible inner tree.
   1 Abort recording of tree jpegencoder.js:629@270 at jpegencoder.js:558@263: Inner tree is trying to stabilize, abort outer recording.
   1 Abort recording of tree jpegencoder.js:629@270 at jpegencoder.js:536@68: No compatible inner tree.
   1 Abort recording of tree jpegencoder.js:629@270 at jpegencoder.js:263@66: Inner tree is trying to grow, abort outer recording.
   1 Abort recording of tree jpegencoder.js:627@261 at jpegencoder.js:629@270: No compatible inner tree.
   1 Abort recording of tree jpegencoder.js:558@263 at jpegencoder.js:564@341: No compatible inner tree.
   1 Abort recording of tree jpegencoder.js:558@263 at jpegencoder.js:263@66: Inner tree is trying to grow, abort outer recording.
   1 Abort recording of tree jpegencoder.js:558@263 at jpegencoder.js:258@34: Inner tree is trying to grow, abort outer recording.
   1 Abort recording of tree jpegencoder.js:558@263 at jpegencoder.js:257@24: No compatible inner tree.
   1 Abort recording of tree jpegencoder.js:219@20 at jpegencoder.js:221@31: No compatible inner tree.
If I make Blacklist(jsbytecode* pc) a no-op, that doesn't change the wall-clock times much as far as I can tell.  Just gets rid of that blacklisted(8) and adds a few hundred triggered/exits.

Looking at the profile:

35% under js_GetPropertyHelper, called from GetClosureVar.
 5% in GetClosureVar itself.
18% under TraceRecorder::monitorRecording (about half of this under closeLoop,
    17% under JSOP_GETELEM, 6% under js_GetScrNoteCached, etc).
 4% under TraceRecorder::recordLoopEdge, mostly under attemptTreeCall doing the
    emit, prepare, snapshot dance.
32% under ExecuteTree.  The time under here breaks down as (percentages of
    ExecuteTree time; all times in not under):
    10% js_Array_dense_setelem_double
     6% js_NewDoubleInRootedValue
     5% js_UnboxDouble
     4% VisitFrameSlots<FlusNativeStackFrameVisitor>
     4% js_DoubleToInt32
     3% VisitFrameSlots<BuildNativeFrameVisitor>
     3% LeaveTree
     2% js_SetReservedSlot
     2% js_Array_dense_setelem_int
    31% actual jit-generated instructions
2.2% spent in js_Interpret

General diagnosis, without having examined the typemaps for those compatible inner tree aborts is that we have int/double mismatch issues in the typemaps, lots of cost in GetClosureVar (maybe bug 530255 will help?).  We're also storing doubles in arrays, which is making alarms ring in my head; that ought not to be happening in this testcase, I'd think.
I guess all those doubles might be coming from us reading values out of arrays....  I thought we had a bug on that.
Applying the patch from bug 486213 doesn't seem to make the jitstats or wall-clock time any better.  And the profile still shows doubles all over.

Updated

10 years ago
Assignee: general → dvander
I made a shell test case out of this (attachment forthcoming). Roughly, where we are with performance here:

TM Interp: 424ms (x64)
TM JIT:    180ms (x64)
v8:         70ms (x86)
Nitro:      30ms (x64)

We seem to have a few problems.
 1. Hoisted undefineds poison trace type stability.
 2. We always read numbers from arrays as doubles.
 3. Because of #2, we get more type instability.
 4. We spend time in the interpreter.

So to test the first two theories I edited the JPEG decoder to hoist definitions to the top of functions. This got us down to around 160ms. Then I disabled the Oracle and we got down to very few inner compatible tree problems, and that brought us to 140ms.

Clearly those are both big problems we need to tackle, but there's still something else dominating this test case. TraceVis says:

interpret      257.881230
monitor          5.255656
record          12.698644
compile         12.710549
execute          1.352516
native         213.188171
-------------------------
Subtotal       503.086766
Non-JS           0.000000
=========================
Total          503.086766

recorder: started(40), aborted(7), completed(81), different header(0), trees trashed(34), slot promoted(0), unstable loop variable(2), breaks(0), returns(0), merged loop exits(43), unstableInnerCalls(4), blacklisted(0)
monitor: exits(880), timeouts(0), type mismatch(0), triggered(880), global mismatch(0), flushed(0)

So tree connectivity is still not perfect... will investigate further.

Comment 8

10 years ago
We should work on the array unbox path:

https://bugzilla.mozilla.org/show_bug.cgi?id=486213

We have a patch, but if I remember correctly it regressed 3d-cube a lot for some unknown reason.
Current js shell results for jpeg_encoder_basic:
Interp: 558ms
JM: 77ms
JM+TI: 71ms
d8: 46ms

Current js shell results for altered_jpeg_encoder:
Interp: 560ms
JM: 78ms
JM+TI: 72ms
d8: 48ms

Looks like v8 is still about 1.5x faster than JM+TI.
Blocks: 467263
OS: Mac OS X → All
Hardware: x86 → All
Summary: TraceMonkey 10x slower in JS-based JPEG encoder → JM+TI 1.5x slower than v8 in JS-based JPEG encoder
Version: 1.9.2 Branch → Trunk
We are now as fast as d8 with --ion-parallel-compile=on (20 ms). With baseline we are a few ms faster (17 ms).
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.