Closed
Bug 465322
Opened 16 years ago
Closed 16 years ago
[redux] Shrink the interpreter stack frame
Categories
(Tamarin Graveyard :: Virtual Machine, defect)
Tamarin Graveyard
Virtual Machine
Tracking
(Not tracked)
VERIFIED
FIXED
People
(Reporter: lhansen, Assigned: lhansen)
Details
Attachments
(1 file, 2 obsolete files)
116.06 KB,
patch
|
stejohns
:
review+
edwsmith
:
review+
|
Details | Diff | Splinter Review |
Analysis has shown that the interpreter stack frame can be huge, ranging from 550 bytes on armcc/ARM to 2900 bytes on vc++/ARM. This is an issue for small-stack systems and can even be a performance liability as the stack will grow deep for even moderately recursive functions. The reasons are a combination of compilers' inability to reuse block-scoped variables, lots of local variables, and poor code generation (inability to reuse temps; many spurious temps not removed, probably because the function is so large). This is a tracking bug for a fix for the problem.
Assignee | ||
Comment 1•16 years ago
|
||
Compiles on many platforms with the following frame sizes (obtained from the generated code): Mac 32 668 Mac 64 376 Win 32 600 Win 64 1552 (!) ARM VC++ 752 The atrocious frame size on Win64 is the result of many more spurious temps than even the Win32 build, and of course they're all 8 bytes. Groan. The main open issue is whether the design around the CallStackNode - now relegated to alloca'd memory, which may in turn be shunted onto the heap - is correct, since the object actually has a destructor that won't be called.
(In reply to comment #1) > Mac 32 668 > Mac 64 376 I'm surprised that Mac 64 has a smaller stack frame size than Mac 32 -- does that gcc backend just do more aggressive temp reuse? (Bodes well for moving to 64-bit userland on the Mac, though!)
Assignee | ||
Comment 3•16 years ago
|
||
(In reply to comment #2) > (In reply to comment #1) > > Mac 32 668 > > Mac 64 376 > > I'm surprised that Mac 64 has a smaller stack frame size than Mac 32 -- does > that gcc backend just do more aggressive temp reuse? (Bodes well for moving to > 64-bit userland on the Mac, though!) I'm surprised too. In fact, 376/8 = 47, which is just two more than the number of local register-allocable variables in that function. The obvious guess is that the extra registers available are put to good use. I wouldn't extrapolate too much from this, the interpreter function is hard to compile and far from typical, but yes, it's encouraging. My analysis is that gcc for ia32 is not doing a good job. I've looked at the code emitted by gcc 4.0.1, and it systematically fails to reuse compiler-introduced temporaries. Any improvement in that area in the 64-bit compiler should have a significant impact. (gcc 4.2 creates a larger frame yet but I've no analysis as to why.) The problem with VC++ 9.0 for ia32 is that it introduces a large number of spill locations for EAX in the interpreter. It spills to these but never loads from them, and the net result is a large stack frame and a substantial amount of redundant computation. I don't yet know why this is happening; looks like a compiler bug / misfeature. The code emitted by VC9 for ARM and x64 have exactly the same problem, but from the numbers above it could look like there are even more spill locations introduced for x64 (I'm just guessing).
Assignee | ||
Comment 4•16 years ago
|
||
Attachment #348816 -
Attachment is obsolete: true
Attachment #348990 -
Flags: review?(stejohns)
Assignee | ||
Updated•16 years ago
|
Attachment #348990 -
Flags: review?(edwsmith)
Updated•16 years ago
|
Attachment #348990 -
Flags: review?(stejohns) → review+
Comment 5•16 years ago
|
||
Comment on attachment 348990 [details] [diff] [review] Patch in WordcodeEmitter.cpp you could use skipU30() rather than multiple readU30() calls.
Updated•16 years ago
|
Attachment #348990 -
Flags: review?(edwsmith) → review+
Assignee | ||
Comment 6•16 years ago
|
||
Pushed to tamarin-redux as changeset 1140:4d56af395653 Compared to the same code without this fix, there are some performance changes (interpreter-only code). These are the deltas above 4% for GCC 4.0 with -O3 on MacOS 10.5.5, lightly loaded system: jsbench/FFT.as 18748 17767 5.5 jsbench/HeapSort.as 8879 9503 -6.6 jsbench/LUFact.as 11974 10995 8.9 jsbench/SOR.as 71817 65771 9.2 jsbench/typed/HeapSort.as 9580 10606 -9.7 jsbench/typed/RayTracer.as 54300 52110 4.2 I wish I could say there was a trend here but the main difference between HeapSort and the others is that HeapSort conses up a lot fewer doubles, probably. I'm not going to worry about it at this time.
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Comment 7•15 years ago
|
||
Resolved fixed engineering / work item that has been pushed. Setting status to verified.
Status: RESOLVED → VERIFIED
You need to log in
before you can comment on or make changes to this bug.
Description
•