Closed
Bug 515722
Opened 15 years ago
Closed 13 years ago
Improve finddef inline cache for jit'd code
Categories
(Tamarin Graveyard :: Baseline JIT (CodegenLIR), defect, P3)
Tracking
(Not tracked)
RESOLVED
WONTFIX
Q2 12 - Cyril
People
(Reporter: edwsmith, Assigned: edwsmith)
References
Details
(Whiteboard: PACMAN, benchmark-needed)
Attachments
(2 files, 3 obsolete files)
10.35 KB, patch
5.73 KB, patch
The wordcode interpreter implements a per-method cache of finddef results to speed up finddef. Let's adapt this to JIT'd code too.
Assignee
Comment 1•15 years ago
This is pure refactoring. finddefNs and finddefNsset were specialized for some reason, but we aren't leveraging that and the inline cache will specialize even further.
This patch combines the emitter code for OP_finddef into one place and eliminates the Ns/Nsset split.
Assignee: nobody → edwsmith
Attachment #399795 -
Flags: review?(rreitmai)
Assignee
Comment 2•15 years ago
(This patch is independent of the finddefNs cleanup.)
This refactors the lookup cache code so it can be turned on with either nanojit or wordcode. Posting this for review, but I don't plan to land it as-is.
Known problems:
1. While it separates the code, it only implements one instance of the cache. There is no guarantee that JIT code and wordcode would build the same cache (number of slots, ordering).
2. The lookup cache code is spread around too much for good taste, e.g. LookupCacheBuilder lives in WordcodeEmitter.cpp/h.
The right thing to do is clean this up so we can share code but not instances of the cache.
Or, we can make the JIT and wordcode preclude each other: a quick fix, since we don't expect them to coexist, but less than ideal, since it's good to keep the option open.
Attachment #399801 -
Flags: review?(lhansen)
Updated•15 years ago
Attachment #399795 -
Flags: review?(rreitmai) → review+
Comment 3•15 years ago
Comment on attachment 399795 [details] [diff] [review]
finddefNs and finddefNsset both call MethodEnv::finddef so collapse them and clean up.
Forgot to mention the flush_add(i) change. Is that a debugging leftover?
Assignee
Comment 4•15 years ago
Yes, will revert it.
Updated•15 years ago
Attachment #399801 -
Flags: review?(lhansen) → review+
Comment 5•15 years ago
(In reply to comment #2)
> or, we can made jit and wordcode preclude each other. quick fix since we dont
> expect them to coexist but less than ideal since its good to keep the option
> open.
Agree it's good to keep the option open, but only if it doesn't cost us too much. In this case almost any cost is probably too high; I say make them mutually exclusive for now and add comments to avmfeatures.as that explain why it has to be that way.
Assignee
Comment 6•15 years ago
Comment on attachment 399795 [details] [diff] [review]
finddefNs and finddefNsset both call MethodEnv::finddef so collapse them and clean up.
pushed https://hg.mozilla.org/tamarin-redux/rev/aed8ed056f4c
Attachment #399795 -
Attachment is obsolete: true
Assignee
Comment 7•15 years ago
The previous patch factors the LookupCache out from the wordcode translator so we can use it in either wordcode or the JIT.
This patch enables it in the JIT. When we emit a call to finddef(), we allocate a cache slot using the multiname as a key. At runtime, if we've got a valid cache entry for that slot, we return it and avoid the expensive finddef() lookup.
Allocation of the cache itself is done lazily. We could do it sooner, but for only a little gain (a single compare and branch in the common case) at a bigger code-complexity cost.
Deferred issue: some cache slots will be unused if the corresponding finddef call is found to be dead code later in the compilation pipeline.
Attachment #400771 -
Flags: review?(rreitmai)
Comment 8•15 years ago
Comment on attachment 400771 [details] [diff] [review]
Uses lookup cache to optimize finddef in jit code
Shouldn't the method go in MethodEnv.h/.cpp rather than vm_fops.h?
Assignee
Comment 9•15 years ago
I understand the precedent; however, I'm working on several changes that involve very specialized helpers that are only called from the JIT. (Eventually, nearly all JIT callouts will go to specialized helpers that only sometimes call generic VM methods.) They really don't belong on MethodEnv (or Toplevel, or AvmCore), where many other existing helpers live.
Of course, vm_fops.h is an odd place to put them. A dedicated file for JIT helpers is what I want, and vm_fops.h is exactly that, despite the odd name.
Assignee
Comment 10•15 years ago
Comment on attachment 399801 [details] [diff] [review]
separates LookupCache into its own VMCFG setting
pushed http://hg.mozilla.org/tamarin-redux/rev/15b85ac876bb
Assignee
Updated•15 years ago
Attachment #399801 -
Attachment is obsolete: true
Assignee
Comment 11•15 years ago
Comment on attachment 400771 [details] [diff] [review]
Uses lookup cache to optimize finddef in jit code
Whoever can get to it first. Thanks.
Attachment #400771 -
Flags: review?(stejohns)
Comment 12•15 years ago
Comment on attachment 400771 [details] [diff] [review]
Uses lookup cache to optimize finddef in jit code
re: "fixme: this allocates a cache slot even if the finddef ultimately becomes dead", what is the implication -- ie, how bad is this problem?
re: "todo - do this earlier" -- same question
Attachment #400771 -
Flags: review?(stejohns) → review+
Assignee
Comment 13•15 years ago
(In reply to comment #12)
> (From update of attachment 400771 [details] [diff] [review])
> re: "fixme: this allocates a cache slot even if the finddef ultimately becomes
> dead", what is the implication -- ie, how bad is this problem?
The overall cache memory used in some big test cases is about 1/8 the size of the precomputed multinames, or 1/30th the size of the generated code.
I haven't determined what percentage of that small amount is ultimately unused, but in any case it's small.
> re: "todo - do this earlier" -- same question
Insignificant, AFAICT. We load the cache pointer anyway, so the cost is one extra test + branch that predicts well; basically nil. As code paths around this get hotter, it might matter more.
I'll clarify the comments before landing.
Assignee
Comment 14•15 years ago
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Assignee
Updated•15 years ago
Attachment #400771 -
Flags: review?(rreitmai) → review-
Assignee
Updated•15 years ago
Attachment #400771 -
Flags: review-
Assignee
Comment 15•15 years ago
Ad-hoc profiling is showing finddef_cache() in the top 10 hotspots in enough apps that it's worth reopening this bug and tuning the implementation some more.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Updated•15 years ago
Priority: -- → P3
Target Milestone: --- → flash10.1
Updated•15 years ago
Whiteboard: PACMAN
Assignee
Updated•15 years ago
Assignee: edwsmith → nobody
Assignee
Comment 16•15 years ago
Here is the hot path through finddef_cache on x64 (warning: AT&T syntax), compiled with gcc -O3:
Dump of assembler code for function avmplus::finddef_cache
0: push %rbp
1: mov %rsp,%rbp
4: mov %rbx,-0x28(%rbp)
8: mov %r12,-0x20(%rbp)
12: mov %edx,%ebx
14: mov %r13,-0x18(%rbp)
18: mov %r15,-0x8(%rbp)
22: mov %rdi,%r13
25: mov %r14,-0x10(%rbp)
29: sub $0x30,%rsp
33: mov 0x20(%rdi),%r14
37: mov %rsi,%r15
40: lea 0x20(%rdi),%r12
44: test %r14,%r14 if (env->cache)
47: je <+182>
53: mov 0x8(%r13),%rax
57: mov 0x28(%rax),%rax
61: mov 0x8(%rax),%r12 core = env->core()
65: mov %ebx,%eax
67: shl $0x4,%rax
71: lea (%r14,%rax,1),%rbx
75: mov (%rbx),%eax
77: cmp 0x728(%r12),%eax compare to global timestamp
85: je <+176>
176: mov 0x8(%rbx),%r13 load cache[slot].object
180: jmp <+142>
142: mov %r13,%rax
145: mov -0x28(%rbp),%rbx
149: mov -0x20(%rbp),%r12
153: mov -0x18(%rbp),%r13
157: mov -0x10(%rbp),%r14
161: mov -0x8(%rbp),%r15
165: leaveq
166: retq
Assignee
Updated•15 years ago
Assignee: nobody → edwsmith
Assignee
Updated•15 years ago
Attachment #400771 -
Attachment is obsolete: true
Assignee
Comment 17•15 years ago
Instead of a table per MethodEnv, this uses one per AbcEnv, indexed by name_index. The table can be sized before any calls, and the result can be shared by all MethodEnvs. The fast path avoids a null check to allocate the table.
In preliminary testing, the table is about 10x bigger than the current sum of all MethodEnv tables (i.e. bad). Posting here just to save the prototype.
Assignee
Comment 18•15 years ago
Streamlined finddef_cache somewhat:
* Pass AvmCore* as a parameter; JIT'd code will inline the pointer as a constant, but looking it up in finddef_cache() costs three dependent loads.
* Move all slow-path code into finddef_miss(), including cache initialization and handling ordinary misses.
On x64, the fast path now looks like this (15 instructions):
0x100077a90 push rbp
0x100077a91 mov rbp, rsp
0x100077a94 mov r8, [rdi+32]
0x100077a98 test r8, r8
0x100077a9b jz 0x0000000100077ab4
0x100077a9d mov eax, edx
0x100077a9f shl rax, 4
0x100077aa3 add rax, r8
0x100077aa6 mov r8, [rax+8]
0x100077aaa mov eax, [rax]
0x100077aac cmp eax, [rsi+1824]
0x100077ab2 jz L1 (taken on fast path)
0x100077ab4 leave
0x100077ab5 jmp finddef_miss
0x100077aba nop [rax+rax+0]
L1
0x100077ac0 mov rax, r8
0x100077ac3 leave
0x100077ac4 ret
Assignee
Updated•15 years ago
Summary: implement finddef inline cache for jit'd code → Improve finddef inline cache for jit'd code
Assignee
Updated•15 years ago
OS: Mac OS X → All
Hardware: x86 → All
Assignee
Comment 19•15 years ago
Microbenchmark. No measurable time is spent in finddef_cache with this test on a 2.6 GHz Nehalem Xeon. So either we're seeing a lower than expected hit rate on larger apps, or worse performance on Core 2. Downgrading expected gain to "trivial" until further testing is done.
package {
function t():int {
var c
var i:int
for (i = 0; i < 1000000; i++) {
c = Object
c = String
c = Number
c = Date
c = XML
c = int
}
return i
}
function run() {
var t1:Number = getTimer()
var t2:Number
var c:Number = 0
do {
c += t()
t2 = getTimer()
} while ((t2 - t1) < 10000.0) // run for 10 sec
print(c/(t2-t1)) // score normalized to iterations/sec
}
run()
}
Severity: normal → trivial
Assignee
Updated•15 years ago
Component: Virtual Machine → JIT Compiler (NanoJIT)
Assignee
Comment 20•14 years ago
Since the last test run above, I have seen finddef_cache still be hot in bigger Flex apps. Cache effects are likely coming into play in ways the microbenchmark does not capture.
Furthermore, once bug 567284 is finished, we may be able to significantly streamline the cache: if the semantics are such that finddef<Name> always returns a specific value (never null and never throwing an exception), then at verify time we could determine the coordinates of an object (domain-depth D, table index T) that would allow a load-and-compare cache check.
Once bug 563944 is resolved, we could also contemplate inlining the fast path of finddef_cache().
Depends on: 563944
Assignee
Updated•14 years ago
Assignee: edwsmith → nobody
Updated•14 years ago
Flags: flashplayer-bug-
Comment 21•14 years ago
Edwin, can we close this one?
No longer depends on: Andre
Flags: flashplayer-injection-
Assignee
Comment 22•14 years ago
No, this is still valid. You could argue that it's blocked waiting on valid benchmark data, however.
Comment 23•14 years ago
Rescheduling to Brannan.
Target Milestone: Q3 11 - Serrano → Q1 12 - Brannan
Comment 24•13 years ago
Trevor, who will do the benchmark honors? Removing the Andre blocker.
Assignee: nobody → trbaker
No longer depends on: Andre
Assignee: trbaker → edwsmith
Flags: flashplayer-qrb+
QA Contact: vm → nanojit
Whiteboard: PACMAN → PACMAN, benchmark-needed
Target Milestone: Q1 12 - Brannan → Q2 12 - Cyril
Assignee
Comment 25•13 years ago
Subsumed by bug 733807.
Status: REOPENED → RESOLVED
Closed: 15 years ago → 13 years ago
Resolution: --- → WONTFIX