Closed Bug 650102 Opened 14 years ago Closed 7 years ago

Optimize slot tracers

Categories

(Tamarin Graveyard :: Garbage Collection (mmGC), defect, P3)

Tracking

(Not tracked)

RESOLVED WONTFIX
Q1 12 - Brannan

People

(Reporter: lhansen, Unassigned)

Attachments

(1 file, 2 obsolete files)

Experiments (bug 619913) suggest that the slot tracers are slow. This would be because there is significant overhead to walking the bit table. Though it's possible to do that walking faster, it's still a lot of overhead. A better scheme is to precompute the tracer per type. Ideally we'd do that by jitting code, but a lot can be done with just C++.
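To make the contrast concrete, here is a minimal sketch of the two approaches, assuming a bit table in which bit i means that word i of the object holds a GC pointer. The names `Marker`, `traceByBitTable`, and `PrecomputedTracer` are hypothetical and are not taken from the Tamarin sources or from any patch on this bug; the precomputed form here is simply a decoded list of slot indices, which is one plain-C++ way of doing the per-type work up front.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the collector's mark routine: it records
// every non-null pointer it is handed.
struct Marker {
    std::vector<const void*> seen;
    void mark(const void* p) { if (p) seen.push_back(p); }
};

// Baseline: interpret the per-type bit table on every trace.
// Every word of the object costs a bit test, pointer or not.
inline void traceByBitTable(void* const* words, const uint32_t* bits,
                            size_t nwords, Marker& m) {
    for (size_t i = 0; i < nwords; ++i)
        if (bits[i / 32] & (1u << (i % 32)))
            m.mark(words[i]);
}

// Precomputed: decode the bit table once per type, then visit only
// the pointer slots on each trace.
struct PrecomputedTracer {
    std::vector<uint32_t> slots;  // word indices of the pointer slots

    PrecomputedTracer(const uint32_t* bits, size_t nwords) {
        assert(bits != nullptr);
        for (size_t i = 0; i < nwords; ++i)
            if (bits[i / 32] & (1u << (i % 32)))
                slots.push_back(static_cast<uint32_t>(i));
    }

    void trace(void* const* words, Marker& m) const {
        for (uint32_t i : slots)
            m.mark(words[i]);
    }
};
```

Both tracers visit the same slots in the same order; the difference is only in where the decoding cost is paid.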
Attached patch: Tentative patch (obsolete)
Optimizes the slot tracers by generating trees of C++ functions that perform the tracing, so there is no bit table interpretation overhead. Some overhead remains in decoding atoms, however. Jitting this code would reduce the call depth and would allow the GC*, m_sizeofInstance, and maybe other values to be inlined, so there's likely value in taking this further. Shows a large speedup on splay.as for 64-bit (on Mac); otherwise not much happening.
The patch is not quite ready for landing; Traits::computeTracers only handles up to 200 32-bit slot fields, i.e., objects of size up to about 800 bytes. It's easy enough to increase the static limit, but it would probably be better to clean up the code to handle the general case.
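As a rough illustration of tracing through pre-built C++ functions rather than a bit-table interpreter, the sketch below specialises one leaf function per 4-bit pattern and selects a flat sequence of leaves once per type. This is an assumption-laden simplification, not the patch's code: the patch builds trees of functions, and the names `MarkSink`, `traceLeaf`, `buildTracer`, and `runTracer`, the 4-slot group size, and the single 32-bit table are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the collector's mark routine.
struct MarkSink {
    std::vector<const void*> seen;
    void mark(const void* p) { if (p) seen.push_back(p); }
};

using LeafFn = void (*)(void* const* words, MarkSink& m);

// One leaf per 4-bit pattern. Pattern is a compile-time constant, so
// the compiler can drop the tests and keep only the needed mark calls.
template <unsigned Pattern>
void traceLeaf(void* const* words, MarkSink& m) {
    if (Pattern & 1u) m.mark(words[0]);
    if (Pattern & 2u) m.mark(words[1]);
    if (Pattern & 4u) m.mark(words[2]);
    if (Pattern & 8u) m.mark(words[3]);
}

static const LeafFn kLeaves[16] = {
    traceLeaf<0>,  traceLeaf<1>,  traceLeaf<2>,  traceLeaf<3>,
    traceLeaf<4>,  traceLeaf<5>,  traceLeaf<6>,  traceLeaf<7>,
    traceLeaf<8>,  traceLeaf<9>,  traceLeaf<10>, traceLeaf<11>,
    traceLeaf<12>, traceLeaf<13>, traceLeaf<14>, traceLeaf<15>
};

// Built once per type: pick the leaf matching each group of 4 slots.
// Bits at or above nwords are assumed to be zero.
inline std::vector<LeafFn> buildTracer(uint32_t bits, size_t nwords) {
    assert(nwords <= 32);
    std::vector<LeafFn> fns;
    for (size_t base = 0; base < nwords; base += 4)
        fns.push_back(kLeaves[(bits >> base) & 0xFu]);
    return fns;
}

// Run on every trace: one indirect call per group of 4 slots.
inline void runTracer(const std::vector<LeafFn>& fns,
                      void* const* words, MarkSink& m) {
    for (LeafFn f : fns) {
        f(words, m);
        words += 4;
    }
}
```

The sketch also shows where the remaining cost sits: each group of slots still costs an indirect call, which is the kind of overhead jitting would be expected to remove.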
Comment on attachment 526095 (Tentative patch):
Patch does not compile on Windows:
c:\hg\try\core\Traits.h(485) : error C2062: type 'void' unexpected
c:\hg\try\core\Traits.h(485) : error C2238: unexpected token(s) preceding ';'
Attached patch: Tentative patch, v2 (obsolete)
This has been cleaned up. There's a large comment block on functionality and optimization opportunities, the most important of which is the use of type information (which still does not require using the JIT). This patch, and a generator script in a different patch, are in my redux-exact patch queue: users/lhansen_adobe.com/redux-exact. According to Brent, this does not compile with Visual Studio because of the combination of "static" and "FASTCALL". I have not had time to investigate. ISTR there's some restriction on the ordering of those in Visual Studio.
Attachment #526095 - Attachment is obsolete: true
This compiles properly on Windows with MSVC++.
Attachment #526297 - Attachment is obsolete: true
I like the concept a lot. What's the performance improvement that results? FYI: the original motivation for using the bitfield was the addition of the cached TraitsBindings; we didn't want to have to possibly re-gen a TraitsBindings in order to destroy an object. The bitfield was conceived as a compact way to store this info. I took a stab at JITting these functions a while back, but abandoned the effort; I don't recall whether this was due to technical issues or merely an apparent lack of perf improvement. I'm sure it's still in bugzilla somewhere...
The performance results appear to vary with microarchitecture. We're seeing significant speedups on Core 2 Duo and Xeon, but nothing much on i7. We're still investigating; nothing's conclusive yet. The generated x86 code is not all that great: there seems to be a fair amount of boilerplate that's not well motivated (and it's worse without FASTCALL, at least for GCC). Jitting would probably allow us to do significantly better, and would allow us to avoid two levels of calls, which could be important. But whether the code would be significantly /faster/ as a result, as opposed to merely smaller, is not known.
Careful measurements by Brent across a number of platforms (ARM, x86, x86-64, with several microarchitectures) show that the optimization as it stands is not clearly a win over an optimized bit-scanning loop (part of this patch). There are many reasons why this could be. For example, the call tree has multiple levels of calls, with several branch instructions in each leaf, while the bit-scanning loop has much better locality and a small number of branches. The code generated for the call tree is not great. There could be stalls due to the indirect calls. On 64-bit we get only 2.5 bits/call, so the overhead per bit is relatively high. Jitting would still be an interesting experiment, but so far as we can tell the call tree optimization is not worthwhile. The optimized bit-scanning loop will be broken out from attachment 528019 and offered as a separate patch on a new bug.
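For readers unfamiliar with the alternative, a bit-scanning loop of the general kind discussed here might look like the sketch below. It is not the loop from attachment 528019, which is not reproduced in this bug; `MarkLog`, `lowestSetBit`, and `traceByBitScan` are hypothetical names, and the assumption is again that bit i of the table marks word i of the object as a pointer slot. The idea is to jump straight to each set bit instead of testing every bit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the collector's mark routine.
struct MarkLog {
    std::vector<const void*> seen;
    void mark(const void* p) { if (p) seen.push_back(p); }
};

// Index of the lowest set bit of w; w must be non-zero.
inline unsigned lowestSetBit(uint32_t w) {
    assert(w != 0);
#if defined(__GNUC__) || defined(__clang__)
    return static_cast<unsigned>(__builtin_ctz(w));  // compiler intrinsic
#else
    unsigned n = 0;  // portable fallback: shift until bit 0 is set
    while (!(w & 1u)) { w >>= 1; ++n; }
    return n;
#endif
}

// Visit only the set bits of each 32-bit table word. All-zero table
// words cost a single compare, and the loop body is small and local.
inline void traceByBitScan(void* const* words, const uint32_t* bits,
                           size_t nbitwords, MarkLog& m) {
    for (size_t k = 0; k < nbitwords; ++k) {
        uint32_t w = bits[k];
        while (w != 0) {
            m.mark(words[k * 32 + lowestSetBit(w)]);
            w &= w - 1;  // clear the lowest set bit
        }
    }
}
```

Compared with a call tree, this keeps everything in one short loop with a couple of branches, which matches the locality argument made in the comment above.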
Target Milestone: Q3 11 - Serrano → Q4 11 - Anza
Depends on: 654086
Target Milestone: Q4 11 - Anza → Q1 12 - Brannan
Assignee: lhansen → nobody
Status: ASSIGNED → NEW
Flags: flashplayer-qrb+
Tamarin isn't maintained anymore. WONTFIX remaining bugs.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX