Add coarse performance counters to VM for benchmarking

ASSIGNED
Unassigned

Status

Tamarin
Virtual Machine
ASSIGNED
8 years ago
6 years ago

People

(Reporter: Edwin Smith, Unassigned)

Tracking

(Depends on: 1 bug)

unspecified
Future
Dependency tree / graph
Bug Flags:
flashplayer-injection -
flashplayer-qrb +
flashplayer-bug -

Details

(Whiteboard: has-patch)

Attachments

(1 attachment, 1 obsolete attachment)

(Reporter)

Description

8 years ago
This elaborates on an idea for benchmark tracking.  instead of having the test print a performance value, let the VM do it.  (analogous to using a profiler instead of instrumenting code).

We already do this for memory, and it makes sense to do it for performance.

A simple counter that is incremented at coarse control flow boundaries would be independent of execution mechanism and optimization level.  Combined with a vm-level timer instead of code-level.  Tests would still need to be somewhat deterministic, but no longer need to measure themselves.
(Reporter)

Updated

8 years ago
Priority: -- → P1
Target Milestone: --- → flash10.2
(Reporter)

Comment 1

8 years ago
Created attachment 448087 [details] [diff] [review]
add "blops" metric

First prototype.   keeps two counters:
call_count
loop_count

call count is incremented once per call, in the prolog.  this includes native methods.

loop_count is incremented:
  * once per branch, if the branch direction is backwards, whether or not it is taken.
  * once per OP_lookupswitch, regardless of case or direction

the VM reports "blops"  (blocks per second) as a loose measure, computed as:  
(call_count + loop_count) / ms / 10.0

It is simpler for the interpreter to increment loop_count at the point of the branch rather than the loop header, since loop headers are not identified in ABC.  Its simpler for the JIT to increment loop_count unconditionally just before a conditional back-branch.

For switch, its simpler to just increment loop_count always.  Its expensive for the jit to add increment code to the subset of edges that are backwards (code on edge? table lookup?).

in testing, call_count isn't identical for jit vs interp -- i dont know why yet.  I suspect inlined cast expressions:   T(expr).

the patch also modifies runtest.py to use "metric blops" if found.  indeed, we can now run sunspider and v8 side by side and get beleivable higher-is-better scores, while ignoring all test output and only using vm-generated metrics.

issues
 - i'm measuring about a 3% counter overhead cost on boids.as.  that's higher than i'd like, but in principle we can live with it.
 - need to debug call_count.
 - looping inside native method is not accounted for
 - there is no tie-in with vprof or dtrace, and there probably should be (for perf and gc, arguably), or even a simple universal telemetry mechanism.  (channelling Lars here...)

design question
 - should we just count loops and calls and do all the analysis after the fact (outside the vm?).  on the one hand, i LIKE having the vm generate a single # as input to various scripts.  on the other hand, we'll get more and more metrics over time, and we'll be doing post-game analysis anyway.
Assignee: nobody → edwsmith
Status: NEW → ASSIGNED
Attachment #448087 - Flags: feedback?(lhansen)

Comment 2

8 years ago
Comment on attachment 448087 [details] [diff] [review]
add "blops" metric

I like it, and I believe in what you write.  I think it's most useful to have the shell print a single metric; if there's need for something else add a command line switch to select a different format.

All of this needs to be conditionalized, obviously - 3% is too much overhead for production runs in the Flash Player.  For the GC, there's MMGC_POLICY_PROFILING which is on by default in the shell and off by default in the player (5% overhead).
Attachment #448087 - Flags: feedback?(lhansen) → feedback+
(Reporter)

Comment 3

8 years ago
I think we dont want to count builtin code, for two reasons:

1. In ad-hoc testing I find I can reduce the counter overhead significantly by not counting in native methods.  Probably this is because it is expensive to access AvmCore.  Maybe we should use a static counter instead of one on AvmCore, but read on:

2. For native methods, we can only get the call count, not the loop count.  For builtin AS3, we can get both, but if we change builtin code from C++ to AS3 or back, we will affect the metric and we probably don't want to.  e.g. by making changes to AS3.push().
(Reporter)

Comment 4

8 years ago
Created attachment 448535 [details] [diff] [review]
refined as "metric steps"

Changes since last patch
* conditionalized
* merged call_count and loop_count as "step_count"
* count backedges whether taken or not
* count all OP_lookupswitch (too messy to worry about which path taken)
* do not count builtin code (as3 or native)
* rebased to TR tip

these changes reduced metric overhead in boids to within noise.

open issues:
- print metric to stdout?
- need commandline switch to enable metric?  (should all vm metrics be under one switch?)
- use static mem isnstead of AvmCore?  (one metric for whole process, or one per vm instance?)
Attachment #448087 - Attachment is obsolete: true

Comment 5

8 years ago
Nice, and I'm starting to wonder if it makes sense look at a more dynamic/comprehensive means of generating this data.

We'll probably need something similar in future as input for jit policy decisions and having a generic mechansim that handles both use-cases would be nice.

Although you could argue that the effort/engineering required for the jit policy effort is much higher and suitably involved that having this patch available now is more valuable.
(Reporter)

Comment 6

8 years ago
(In reply to comment #5)

> Although you could argue that the effort/engineering required for the jit
> policy effort is much higher and suitably involved that having this patch
> available now is more valuable.

exactly :-)  If we end up adding more counters and a general mechanism seems worthwhile, i'm all for it.  not a blocker tho.

Updated

8 years ago
Blocks: 572860
(Reporter)

Updated

8 years ago
Whiteboard: has-patch
(Reporter)

Updated

8 years ago
Assignee: edwsmith → nobody

Updated

7 years ago
Flags: flashplayer-bug-
Whiteboard: has-patch → has-patch, must-fix-candidate

Updated

7 years ago
Depends on: 645018

Updated

7 years ago
Flags: flashplayer-qrb+
Flags: flashplayer-injection-
Target Milestone: Q3 11 - Serrano → Q4 11 - Anza

Comment 7

7 years ago
This still seems like a useful idea and measurement.  Perhaps the blops measurement simply turns into a Telemetry measurement.

Updated

7 years ago
Whiteboard: has-patch, must-fix-candidate → has-patch

Comment 8

7 years ago
Retargeting to future as this has lingered fo so long.  Ed, retarget if you feel it is needed for HM.
Priority: P1 → --
Target Milestone: Q4 11 - Anza → Future
You need to log in before you can comment on or make changes to this bug.