Closed
Bug 650102
Opened 14 years ago
Closed 7 years ago
Optimize slot tracers
Categories
(Tamarin Graveyard :: Garbage Collection (mmGC), defect, P3)
Tamarin Graveyard
Garbage Collection (mmGC)
Tracking
(Not tracked)
RESOLVED
WONTFIX
Q1 12 - Brannan
People
(Reporter: lhansen, Unassigned)
References
Details
Attachments
(1 file, 2 obsolete files)
43.90 KB, patch
Experiments (bug 619913) suggest that the slot tracers are slow, likely because of significant overhead in walking the bit table. It's possible to make that walking faster, but substantial overhead would remain. A better scheme is to precompute the tracer per type. Ideally we'd do that by jitting code, but a lot can be done with just C++.
Comment 1 • 14 years ago (Reporter)
Optimizes the slot tracers by generating trees of C++ functions that perform the tracing, so there is no bit table interpretation overhead. Some overhead remains in decoding atoms, however; jitting this code would reduce the call depth and would allow the GC*, m_sizeofInstance, and perhaps other values to be inlined, so there's likely value in taking this further.
Shows a large speedup on splay.as for 64-bit (on Mac); otherwise not much change.
Comment 2 • 14 years ago (Reporter)
The patch is not quite ready for landing; Traits::computeTracers only handles up to 200 32-bit slot fields, i.e., objects of size up to about 800 bytes. It's easy enough to increase the static limit, but it would probably be better to clean up the code to handle the general case.
Comment 3 • 14 years ago
Comment on attachment 526095 [details] [diff] [review]
Tentative patch
Patch does not compile on windows:
c:\hg\try\core\Traits.h(485) : error C2062: type 'void' unexpected
c:\hg\try\core\Traits.h(485) : error C2238: unexpected token(s) preceding ';'
Comment 4 • 14 years ago (Reporter)
This has been cleaned up. There's a large comment block on functionality and optimization opportunities, the most important of which is the use of type information (which still would not require using the JIT).
This patch, and a generator script in a different patch, are in my redux-exact patch queue: users/lhansen_adobe.com/redux-exact.
According to Brent, this does not compile with Visual Studio because of the combination of "static" and "FASTCALL"; I have not had time to investigate. ISTR there's some restriction on the ordering of those keywords in Visual Studio.
Attachment #526095 - Attachment is obsolete: true
Comment 5 • 14 years ago (Reporter)
This compiles properly on Windows with MSVC++.
Attachment #526297 - Attachment is obsolete: true
Comment 6 • 14 years ago
I like the concept a lot. What's the performance improvement that results?
FYI: the original motivation for using the bitfield was the addition of the cached TraitsBindings; we didn't want to have to possibly re-gen a TraitsBindings in order to destroy an object. The bitfield was conceived as a compact way to store this info.
I took a stab at JITting these functions a while back, but abandoned the effort; I don't recall whether that was due to technical issues or merely an apparent lack of perf improvement. I'm sure it's still in bugzilla somewhere...
Comment 7 • 14 years ago (Reporter)
The performance results appear to vary with microarchitecture. We're seeing significant speedups on Core 2 Duo and Xeon, but nothing much on i7. We're still investigating, nothing's conclusive yet.
The generated x86 code is not all that great; there seems to be a fair amount of boilerplate that's not well motivated (and it's worse without FASTCALL, at least for GCC). Jitting would probably allow us to do significantly better, and would allow us to avoid two levels of calls, which could be important. But whether the code would be significantly /faster/ as a result, as opposed to merely smaller, is not known.
Comment 8 • 14 years ago (Reporter)
Careful measurements by Brent across a number of platforms (ARM, x86, x86-64, with several microarchitectures) show that the optimization as it stands is not clearly a win over an optimized bit-scanning loop (part of this patch).
There are many possible reasons. For example, the call tree has multiple levels of calls, with several branch instructions in each leaf, while the bit-scanning loop has much better locality and a small number of branches. The code generated for the call tree is not great. There could be stalls due to the indirect calls. On 64-bit we get only 2.5 bits/call, so the overhead per bit is relatively high. Jitting would still be an interesting experiment, but so far as we can tell the call tree optimization is not worthwhile.
The optimized bit-scanning loop will be broken out from attachment 528019 [details] [diff] [review] and offered as a separate patch on a new bug.
Target Milestone: Q3 11 - Serrano → Q4 11 - Anza
Updated • 14 years ago (Reporter)
Target Milestone: Q4 11 - Anza → Q1 12 - Brannan
Updated • 14 years ago (Reporter)
Assignee: lhansen → nobody
Status: ASSIGNED → NEW
Comment 9 • 7 years ago
Tamarin isn't maintained anymore. WONTFIX remaining bugs.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX