Bug 676515 - More detailed opcode-level profiling
Status: RESOLVED FIXED
Product: Core
Classification: Components
Component: JavaScript Engine
: unspecified
: x86 Mac OS X
: -- normal
: ---
Assigned To: general
Depends on: 637393
Reported: 2011-08-04 06:03 PDT by Brian Hackett (:bhackett)
Modified: 2012-01-10 17:02 PST
CC List: 4 users


Attachments
WIP (46.17 KB, patch)
2011-08-04 12:42 PDT, Brian Hackett (:bhackett)
no flags
patch (52.18 KB, patch)
2011-08-05 05:52 PDT, Brian Hackett (:bhackett)
sphink: review+

Description Brian Hackett (:bhackett) 2011-08-04 06:03:42 PDT
It would be cool to capture more information about a script's execution than the raw per-PC counts for each run mode that are currently collected.  To start, for JM/JM+TI:

- Number of stub calls executed at each opcode.
- Number of executions of the opcode times the number of bytes of inline code.
- Number of executions of the opcode times the number of PIC stubs.

The latter two look a little odd.  In many cases we compile ops entirely as inline code or as PICs, but there is great variety in how good the result is; there may be lots of tests generated for unknown types, sync code due to register pressure, code generated for loop-invariant stuff we didn't hoist, ICs we needed to use, and so forth.  Counting the number of bytes of inline code is easy to do (vs. grubbing with the assembler), and will hopefully work as a decent surrogate for the amount of memory traffic and number of instructions.

Getting this to work right will require that accumulating these counts does not affect the code generated for the op itself (currently the pc counters allocate registers, but this should be easy to fix).

Having this info should make it easier to identify code that is not just hot but also likely to have performance problems.

In the hopefully not too distant future, it would also be great to combine this data with compiler knowledge about which paths it generated for each op, and why, and convey this to JS programmers through a debugger.  The compiler knows how good the code it is generating is, but this info is opaque to programmers, and making it available (in such a way that it can be acted on without detailed knowledge of how compilers work) would, I think, really make it easier for people to improve their code's performance.
Comment 1 Brian Hackett (:bhackett) 2011-08-04 12:42:56 PDT
Created attachment 550795
WIP

Working patch for JM, needs more testing.  This removes any effects PC profiling has on register allocation, and adds three more counters:

- Number of stub calls made from the opcode while running in JM.
- Aggregate length of the inline path for the opcode, for each time it executed in JM.
- Aggregate length of all PIC stubs for the opcode, for each time it executed in JM.

Example:

for (var i = 0; i < 10000; i++) {
  a.foo = i;
  if (i < N)
    a[i] = 0;
}

For different values of N, the SETPROP on a.foo will behave differently at runtime.

N = 5: After warmup in the interpreter the access will be monomorphic.

39/0/9961/2/358596/0 x   29  setprop "foo"

Interpretation:

39 executions in the interpreter
0 in the tjit
9961 in the mjit
2 stub calls: one for warmup (faintly pointless now that the script itself gets to warm up first), plus the call to fill in the inline path.
358596 bytes of inline code executed (36 bytes per execution)
0 bytes of PIC stub code executed

N = 50: The access will be polymorphic, with 10 stubs generated (aside: with TI we let scripts run 40 times in the interpreter rather than 16 without TI).

39/0/9961/12/358596/2687715 x   29  setprop "foo"

39 same
0 same
9961 same
12 stub calls: the extra 10 are for generating the different PIC stubs
358596 same
2687715 bytes of PIC stub code, a lot more (about 270 bytes per execution). As with the inline code, not all of this is guaranteed to run on each execution.

N = 500: The access will be megamorphic, and the PIC will be disabled.

39/0/9961/9961/358596/3240 x   29  setprop "foo"

39 same
0 same
9961 same
9961 stub calls; no execution stays entirely in inline/PIC stub code
358596 same
3240 a small amount of PIC stub code was executed before too many stubs were generated and the PIC was disabled.
Comment 2 Brian Hackett (:bhackett) 2011-08-05 05:52:33 PDT
Created attachment 551018
patch
Comment 3 Brian Hackett (:bhackett) 2011-08-05 05:59:22 PDT
http://hg.mozilla.org/projects/jaegermonkey/rev/b93ba9765288
Comment 4 Steve Fink [:sfink] [:s:] 2011-08-05 16:37:37 PDT
Comment on attachment 551018
patch

Review of attachment 551018:
-----------------------------------------------------------------

Great stuff! I'm going to trust you on the calculations. Now we just need to stuff a readtsc in these...

::: js/src/jsscript.h
@@ +398,5 @@
> +        METHODJIT_CODE  = 4,
> +        METHODJIT_PICS  = 5,
> +        COUNT = 6,
> +    };
> +

Can you rename 'COUNT' to something more descriptive? I know it's JSPCCounters::COUNT, but it strikes me as a little cryptic in the member implementations. NUM_COUNTS or NUM_COUNTERS would be fine.

Why explicitly assign the values? Setting the first to 0 makes some amount of sense, but the exact values of the rest aren't very interesting.
