Closed Bug 515722 Opened 15 years ago Closed 13 years ago

Improve finddef inline cache for jit'd code

Categories

(Tamarin Graveyard :: Baseline JIT (CodegenLIR), defect, P3)

defect

Tracking

(Not tracked)

RESOLVED WONTFIX
Q2 12 - Cyril

People

(Reporter: edwsmith, Assigned: edwsmith)

References

Details

(Whiteboard: PACMAN, benchmark-needed)

Attachments

(2 files, 3 obsolete files)

Wordcode interpreter implements a per-method cache of finddef results, to speed up finddef. lets adapt this to jit-ed code too.
This is pure refactoring. finddefNs and finddefNsset were specialized for some reason, but we aren't leveraging that and the inline cache will specialize even further. This patch combines the emitter code for OP_finddef into one place and eliminates the Ns/Nsset split.
Assignee: nobody → edwsmith
Attachment #399795 - Flags: review?(rreitmai)
(this patch is independent of the finddefNs cleanup) This refactors the lookup cache code so it can be turned on with either nanojit or wordcode. posting this to review but i dont plan to land as-is. known problems: 1. While it separates the code, it only implements one instance of the cache. there is no guarantee that jit code and word code would build the same cache (# of slots, ordering). 2. lookup cache code is too spread around for good taste. e.g. LookupCacheBuilder lives in WordcodeEmitter.cpp/h. the right thing to do is clean this up so we can share code but not instances of the cache. or, we can made jit and wordcode preclude each other. quick fix since we dont expect them to coexist but less than ideal since its good to keep the option open.
Attachment #399801 - Flags: review?(lhansen)
Attachment #399795 - Flags: review?(rreitmai) → review+
Comment on attachment 399795 [details] [diff] [review] finddefNs and finddefNsset both call MethodEnv::finddef so collapse them and clean up. forgot to mention the flush_add(i) change. Is that debugging remains.
yes, will revert it
Attachment #399801 - Flags: review?(lhansen) → review+
(In reply to comment #2) > or, we can made jit and wordcode preclude each other. quick fix since we dont > expect them to coexist but less than ideal since its good to keep the option > open. agree it's good to keep the option open but only if it doesn't cost us too much. in this case almost any cost is probably too high; I say make them mutually exclusive for now and add comments to avmfeatures.as that explain why it has to be that way.
Comment on attachment 399795 [details] [diff] [review] finddefNs and finddefNsset both call MethodEnv::finddef so collapse them and clean up. pushed https://hg.mozilla.org/tamarin-redux/rev/aed8ed056f4c
Attachment #399795 - Attachment is obsolete: true
Previous patch factors the LookupCache out from the Wordcode translator so we can use it in either wordcode or the jit. This patch enables it in the jit. When we emit a call to finddef(), we allocate a cache slot using the multiname as a key. At runtime, if we've got a valid cache entry for that slot, return it and avoid calling the expensive finddef() lookup. Allocation of the cache itself is done lazily. We could do it sooner but for only little gain (a single compare & branch in the common case) at bigger code complexity cost. deferred issue: some cache slots will be unused if the coresponding finddef call is found to be dead code later in the compilation pipeline.
Attachment #400771 - Flags: review?(rreitmai)
Comment on attachment 400771 [details] [diff] [review] Uses lookup cache to optimize finddef in jit code shouldn't the method go in MethodEnv.h/.cpp rather than vm_fops.h?
I understand the precedent, however i'm working on several changes that involve using very specialized helpers that are only called from the JIT. (eventually, nearly all jit callouts will go to specialized helpers that only sometimes call generic vm methods). They really don't belong on MethodEnv (or Toplevel, or AvmCore), where many other existing helpers live. 'course, vm_fops.h is an odd place to put them. A dedicated file for jit helpers is what I want, and vm_fops.h is exactly that, despite the odd name.
Comment on attachment 399801 [details] [diff] [review] separates LookupCache into its own VMCFG setting pushed http://hg.mozilla.org/tamarin-redux/rev/15b85ac876bb
Attachment #399801 - Attachment is obsolete: true
Comment on attachment 400771 [details] [diff] [review] Uses lookup cache to optimize finddef in jit code whoever can get to it first. thanks.
Attachment #400771 - Flags: review?(stejohns)
Comment on attachment 400771 [details] [diff] [review] Uses lookup cache to optimize finddef in jit code re: "fixme: this allocates a cache slot even if the finddef ultimately becomes dead", what is the implication -- ie, how bad is this problem? re: "todo - do this earlier" -- same question
Attachment #400771 - Flags: review?(stejohns) → review+
(In reply to comment #12) > (From update of attachment 400771 [details] [diff] [review]) > re: "fixme: this allocates a cache slot even if the finddef ultimately becomes > dead", what is the implication -- ie, how bad is this problem? the overall cache memory used in some big test cases is about 1/8 the size of precomputed multinames, or 1/30th the size of generated code. Haven't determined what % of that small memory is ultimately unused, but in any case its small. > re: "todo - do this earlier" -- same question insignificant afaikt. we load the cache pointer, anyway, so the cost is one extra test + branch that predicts well. basically nil. as code paths around this get hotter, it might matter more. I'll clarify the comments before landing.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Blocks: 460079
Attachment #400771 - Flags: review?(rreitmai) → review-
Attachment #400771 - Flags: review-
ad-hoc profiling is showing finddef_cache() in the top 10 hotspots in enough apps that its worth re-opening this bug and tuning the implementation some more.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Priority: -- → P3
Target Milestone: --- → flash10.1
Target Milestone: flash10.1 → flash10.2
Whiteboard: PACMAN
Assignee: edwsmith → nobody
Here is the hot path through finddef_cache on x64 (warning: ATT syntax), compiled with gcc -O3 Dump of assembler code for function avmplus::finddef_cache 0: push %rbp 1: mov %rsp,%rbp 4: mov %rbx,-0x28(%rbp) 8: mov %r12,-0x20(%rbp) 12: mov %edx,%ebx 14: mov %r13,-0x18(%rbp) 18: mov %r15,-0x8(%rbp) 22: mov %rdi,%r13 25: mov %r14,-0x10(%rbp) 29: sub $0x30,%rsp 33: mov 0x20(%rdi),%r14 37: mov %rsi,%r15 40: lea 0x20(%rdi),%r12 44: test %r14,%r14 if (env->cache) 47: je <+182> 53: mov 0x8(%r13),%rax 57: mov 0x28(%rax),%rax 61: mov 0x8(%rax),%r12 core = env->core() 65: mov %ebx,%eax 67: shl $0x4,%rax 71: lea (%r14,%rax,1),%rbx 75: mov (%rbx),%eax 77: cmp 0x728(%r12),%eax compare to global timestamp 85: je <+176> 176: mov 0x8(%rbx),%r13 load cache[slot].object 180: jmp <+142> 142: mov %r13,%rax 145: mov -0x28(%rbp),%rbx 149: mov -0x20(%rbp),%r12 153: mov -0x18(%rbp),%r13 157: mov -0x10(%rbp),%r14 161: mov -0x8(%rbp),%r15 165: leaveq 166: retq
Assignee: nobody → edwsmith
Attachment #400771 - Attachment is obsolete: true
Instead of a table per MethodEnv, uses one per AbcEnv, indexed by name_index. The table can be sized before any calls, and the result can be shared by all MethodEnvs. The fast path avoids a null check to allocate the table. In preliminary testing, the table is about 10x bigger than the current sum of all MethodEnv tables. (i.e. bad). Posting here just to save the prototype.
Streamlined finddef_cache somewhat: * pass AvmCore* as a parameter; jit'd code will inline the ptr as a constant, but looking it up in finddef_cache() costs three dependent loads. * move all slow-path code into finddef_miss(), including cache initialization and handling ordinary misses. on x64, the fast path now looks like this (15 instructions). 0x100077a90 push rbp 0x100077a91 mov rbp, rsp 0x100077a94 mov r8, [rdi+32] 0x100077a98 test r8, r8 0x100077a9b jz 0x0000000100077ab4 0x100077a9d mov eax, edx 0x100077a9f shl rax, 4 0x100077aa3 add rax, r8 0x100077aa6 mov r8, [rax+8] 0x100077aaa mov eax, [rax] 0x100077aac cmp eax, [rsi+1824] 0x100077ab2 jz L1 (taken on fast path) 0x100077ab4 leave 0x100077ab5 jmp finddef_miss 0x100077aba nop [rax+rax+0] L1 0x100077ac0 mov rax, r8 0x100077ac3 leave 0x100077ac4 ret
Summary: implement finddef inline cache for jit'd code → Improve finddef inline cache for jit'd code
OS: Mac OS X → All
Hardware: x86 → All
microbenchmark. no measurable time is in finddef_cache with this test, on a 2.6 ghz Nehalem Xeon. so, either we're seeing a lower than expected hit rate on larger apps, or worse performance on Core 2. Downgrading expected gain to "trivial" until further testing is done. package { function t():int { var c var i:int for (i = 0; i < 1000000; i++) { c = Object c = String c = Number c = Date c = XML c = int } return i } function run() { var t1:Number = getTimer() var t2:Number var c:Number = 0 do { c += t() t2 = getTimer() } while ((t2 - t1) < 10000.0) // run for 10 sec print(c/(t2-t1)) // score normalized to iterations/sec } run() }
Severity: normal → trivial
Component: Virtual Machine → JIT Compiler (NanoJIT)
Since the last test run above, I have seen finddef_cache still be hot in bigger flex apps. Cache effects are likely coming into play in ways the microbenchmark does not capture. Furthermore, once bug 567284 is finished, we may be able to significantly streamline the cache; if the semantics are such that once finddef<Name> always returns a specific value (never null or throwing an execption), then at verify time we could determine the coordinates of an object (domain-depth D, table index T) that would allow a load-and-compare cache check. Once bug 563944 is resolved we could also contemplate inlining the fast path of finddef_cache().
Depends on: 563944
Assignee: edwsmith → nobody
Flags: flashplayer-bug-
Depends on: Andre
Edwin, can we close this one?
No longer depends on: Andre
Flags: flashplayer-injection-
Depends on: Andre
No, this is still valid. You could argue that it's blocked waiting on valid benchmark data, however.
Rescheduling to Brannan.
Target Milestone: Q3 11 - Serrano → Q1 12 - Brannan
Trevor, who will do the benchmark honor? Removing Andre Blocker.
Assignee: nobody → trbaker
No longer depends on: Andre
Assignee: trbaker → edwsmith
Flags: flashplayer-qrb+
QA Contact: vm → nanojit
Whiteboard: PACMAN → PACMAN, benchmark-needed
Target Milestone: Q1 12 - Brannan → Q2 12 - Cyril
Subsumed by bug 733807.
Status: REOPENED → RESOLVED
Closed: 15 years ago13 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: