Closed Bug 583955 Opened 14 years ago Closed 6 years ago

JIT speed could be improved

Categories

(Tamarin Graveyard :: Baseline JIT (CodegenLIR), defect, P1)

defect

Tracking

(Not tracked)

RESOLVED WONTFIX
Q1 12 - Brannan

People

(Reporter: edwsmith, Unassigned)

References

Details

(Whiteboard: Tracking, Pacman)

strategy:  use -Dverifyonly -Ojit,
process a large number of input files,
and see what happens.

This is a tracking bug to capture analysis and serve as an anchor for small fixes.
Preliminary results show a lot of time (15% of verifyJit() self+children) in _bzero, mostly due to calling HashMap::clear() on the VarTracker.checked map, whose bucket size is hardcoded to 16700.  bug 565489 should choose a more reasonable size, possibly based on code_length.
Depends on: 565489
disabling mprotect on code pages saves about 10%
with the above two changes applied, 13% of overall time is in deadvars().

Almost 100% of methods have frame size < 64, so both VarTracker and deadvars could use a faster bitmap.  (just a word, no BitSet abstraction).

Methods that have no back edges should not require iteration in deadvars.  we could do a single backwards sweep (deadvars_kill only) to save time.  Furthermore this sweep could be concurrent with code generation to avoid traversing LIR one extra time.  (*very* similar to nanojit::StackFilter).
disabling deadvars and cse saves a further 13% of verifyJit self+children time.
If you focus on just emitMD(), the biggest bottom-up hotspot is the self-time
in Assembler::gen().  Likewise, if you focus on verifyCommon() (the lir
generation stage), then the biggest hotspot (7%) is self-time in verifyBlock().

gen() is essentially O(num-lir-instructions) and verifyBlock() is essentially
O(num-abc-instructions).  this supports the notion we should simply jit less
code, and/or possibly use peephole analysis to visit fewer abc bytecodes.
Target Milestone: --- → flash10.2
Priority: -- → P1
(In reply to comment #5)
> this supports the notion we should simply jit less
> code, and/or possibly use peephole analysis to visit fewer abc bytecodes.

This is nonsensical since every bytecode must be verified.

Callgrind shows that verifyBlock() incurs a disproportionate % of I1 misses.  The method alone is 29K on x86 (32k I1 cache), not to mention everything it calls.
(In reply to comment #6)

> This is nonsensical since every bytecode must be verified.

If peepholing significantly reduces instruction count and is performed early, and the verifier pipeline is deep (ie many levels of opcode dispatch) then it might actually be true - assuming it's OK to peephole unverified code.  But I expect the effect to be marginal.

> Callgrind shows that verifyBlock() incurs a disproportionate % of I1 misses. 
> The method alone is 29K on x86 (32k I1 cache), not to mention everything it
> calls.

That is probably true for the interpreter too (the wordcode interpreter body was above 32KB on ARM at some point).  I'm not sure what to do about it - for the interpreter (especially the wordcode interpreter) we can factor out common instructions or common cases of instructions and group the code for those instructions while shunting other cases off to elsewhere, and that will probably be worthwhile, but will it be worthwhile for the verifier?  Is it reasonably maintainable?
We can make sure the inline choices are sensible, and generally reduce duplication with old-school procedural factoring, but theres probably not much more to do.

I'm not sure how much of what follows is worthwhile, given other hotspots... but continuing the discussion:

More important is probably factoring the verifier so we dont over-use it (Simon's patches do this, but need updating.  the first of the series is Bug 538181 and is ready to land).  Phase 1 needs to do a lot of checking and iterate to a fixed point, but after that we could avoid re-checking everything.  Phase 2 consists of:

abc-interp:  SW->RAA
wordcode:  SW->RAA->WE
jit:  SW->RA->JIT

(SW=ScopeWriter, RA=RestArgsAnalyzer, WE=WordcodeEmitter, JIT=CodegenLIR)

We could do better than ScopeWriter.  many methods don't include OP_newfunction, OP_newclass, or OP_newactivation, making ScopeWriter a no-op.  Even when its not a no-op, we know all the relevant types at the end of phase 1, they're just lost.  We could make a list of Scope-Chain "commits" and just execute it once phase 1 is finished.

That would render phase2 to be purely for emitting code.  It does not benefit from any of the checks in verifyBlock(), it only wants the types in FrameState, which come from the [many] symbol table lookups done in the various opcodes.  In the long run, that early binding logic could/should be factored out; the translators need to continually improve it (make it more precise) but the verifier wants to freeze it and/or version it.
Depends on: 531840
Whiteboard: Tracking, Pacman
Depends on: 585433
(In reply to comment #3)
> Methods that have no back edges should not require iteration in deadvars.  we
> could do a single backwards sweep (deadvars_kill only) to save time. 
> Furthermore this sweep could be concurrent with code generation to avoid
> traversing LIR one extra time.  (*very* similar to nanojit::StackFilter).

idea moved to bug 585433
Depends on: 584935
Depends on: 563146
Depends on: 593188
No longer depends on: 585433
Depends on: 585433
Ed, we had some discussion a while ago about 64-bit ints being bad in BitSet.  They require a C++ helper for shifting when calling bitnum2mask?

Not sure if changing this would change any performance during JITing but thought I would mention it.
Depends on: 620133
Flags: flashplayer-bug-
Flags: flashplayer-injection-
Flags: flashplayer-qrb+
Target Milestone: Q3 11 - Serrano → Q1 12 - Brannan
Depends on: 661133
Tamarin is a dead project now. Mass WONTFIX.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX
Tamarin isn't maintained anymore. WONTFIX remaining bugs.
You need to log in before you can comment on or make changes to this bug.