Closed Bug 465322 Opened 16 years ago Closed 16 years ago

[redux] Shrink the interpreter stack frame

Categories

(Tamarin Graveyard :: Virtual Machine, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: lhansen, Assigned: lhansen)

Details

Attachments

(1 file, 2 obsolete files)

Analysis has shown that the interpreter stack frame can be huge, ranging from 550 bytes on armcc/ARM to 2900 bytes on vc++/ARM.  This is a problem for small-stack systems and can even be a performance liability, since the stack grows deep for moderately recursive functions.  The causes are a combination of the compilers' failure to reuse stack slots for block-scoped variables, a large number of local variables, and poor code generation (temps not reused; many spurious temps left in place, probably because the function is so large).
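To illustrate the pattern (this is not the Tamarin interpreter; the opcodes and names below are invented), consider a large switch-based dispatch loop in which every case declares its own block-scoped temporaries.  If the compiler does not overlap the stack slots of temporaries whose lifetimes are disjoint, the frame grows roughly with the number of cases:

  // Sketch only: each case has its own block-scoped locals.  Ideally the
  // compiler overlaps their stack slots (the lifetimes are disjoint), but a
  // weak allocator gives every temporary a distinct slot.
  #include <cstddef>
  #include <cstdint>

  double interp(const uint8_t* pc, double* stack, size_t sp)
  {
      for (;;) {
          switch (*pc++) {
          case 0x01: {                  // hypothetical "add"
              double rhs = stack[--sp]; // temp A
              double lhs = stack[--sp]; // temp B
              stack[sp++] = lhs + rhs;
              break;
          }
          case 0x02: {                  // hypothetical "mul"
              double rhs = stack[--sp]; // temps C and D could reuse the
              double lhs = stack[--sp]; // slots of A and B, but may not
              stack[sp++] = lhs * rhs;
              break;
          }
          // ... hundreds more cases, each with its own temporaries ...
          default:
              return sp ? stack[sp - 1] : 0.0;
          }
      }
  }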

This is a tracking bug for a fix for the problem.
Compiles on many platforms with the following frame sizes (obtained from the generated code):

  Mac 32    668
  Mac 64    376
  Win 32    600
  Win 64    1552 (!)
  ARM VC++  752

The atrocious frame size on Win64 is the result of many more spurious temps than even the Win32 build, and of course they're all 8 bytes.  Groan.

The main open issue is whether the design around the CallStackNode - now relegated to alloca'd memory, which may in turn be shunted onto the heap - is correct, since the object actually has a destructor that won't be called.
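For context, a minimal sketch of the pattern in question (the Node class below is a stand-in, not the real CallStackNode): an object constructed with placement new into alloca'd memory gets no automatic destructor call, so any cleanup in the destructor is skipped unless someone invokes it explicitly.

  // Sketch only; assumes a POSIX-style alloca() (it is _alloca in <malloc.h>
  // on Windows) and a stand-in class instead of the real CallStackNode.
  #include <alloca.h>
  #include <new>

  struct Node {                // stand-in for CallStackNode
      Node()  {}               // e.g. link into a call-stack list
      ~Node() {}               // e.g. unlink; the cleanup that would be skipped
  };

  void enterFrame()
  {
      // alloca() returns raw, uninitialized stack memory, so the object must
      // be created with placement new ...
      void* mem  = alloca(sizeof(Node));
      Node* node = new (mem) Node();

      // ... and nothing ever runs the destructor unless it is called by hand:
      node->~Node();           // omit this and ~Node() never executes
  }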
Assignee: nobody → lhansen
Attachment #348538 - Attachment is obsolete: true
Status: NEW → ASSIGNED
(In reply to comment #1)
>   Mac 32    668
>   Mac 64    376

I'm surprised that Mac 64 has a smaller stack frame size than Mac 32 -- does that gcc backend just do more aggressive temp reuse?  (Bodes well for moving to 64-bit userland on the Mac, though!)
(In reply to comment #2)
> (In reply to comment #1)
> >   Mac 32    668
> >   Mac 64    376
> 
> I'm surprised that Mac 64 has a smaller stack frame size than Mac 32 -- does
> that gcc backend just do more aggressive temp reuse?  (Bodes well for moving to
> 64-bit userland on the Mac, though!)

I'm surprised too.  In fact, 376/8 = 47, which is just two more than the number of local register-allocable variables in that function.  The obvious guess is that the extra registers available are put to good use.

I wouldn't extrapolate too much from this; the interpreter function is hard to compile and far from typical.  But yes, it's encouraging.

My analysis is that gcc for ia32 is not doing a good job.  I've looked at the code emitted by gcc 4.0.1, and it systematically fails to reuse compiler-introduced temporaries.  Any improvement in that area in the 64-bit compiler should have a significant impact.  (gcc 4.2 creates a still larger frame, but I have no analysis of why.)  The problem with VC++ 9.0 for ia32 is that it introduces a large number of spill locations for EAX in the interpreter.  It spills to these but never loads from them, and the net result is a large stack frame and a substantial amount of redundant computation.  I don't yet know why this is happening; it looks like a compiler bug or misfeature.  The code emitted by VC9 for ARM and x64 has exactly the same problem, and judging from the numbers above there may be even more spill locations introduced for x64 (I'm just guessing).
Attached patch Patch (Splinter Review)
Attachment #348816 - Attachment is obsolete: true
Attachment #348990 - Flags: review?(stejohns)
Attachment #348990 - Flags: review?(edwsmith)
Attachment #348990 - Flags: review?(stejohns) → review+
Comment on attachment 348990 [details] [diff] [review]
Patch

In WordcodeEmitter.cpp you could use skipU30() rather than multiple readU30() calls.
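For readers unfamiliar with the ABC bytecode, a u30 is a variable-length unsigned integer: up to five bytes, seven payload bits per byte, with the high bit set on continuation bytes.  The helpers below are hypothetical (the real readU30() and skipU30() in Tamarin may have different signatures); they just show why skipping is cheaper than decoding a value that is then discarded.

  #include <cstdint>

  // Decode a u30: accumulate 7 bits per byte until the high bit is clear.
  static uint32_t readU30(const uint8_t*& p)
  {
      uint32_t result = 0;
      for (int shift = 0; shift < 35; shift += 7) {
          uint8_t b = *p++;
          result |= uint32_t(b & 0x7F) << shift;
          if (!(b & 0x80))
              break;
      }
      return result & 0x3FFFFFFF;       // only 30 bits are significant
  }

  // Skip n consecutive u30 operands without materializing their values.
  static void skipU30(const uint8_t*& p, int n)
  {
      while (n-- > 0)
          while (*p++ & 0x80)           // step past continuation bytes
              ;
  }

  // Usage: instead of decoding and discarding,
  //     (void) readU30(pc); (void) readU30(pc);
  // the operands of an ignored instruction can simply be stepped over:
  //     skipU30(pc, 2);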
Attachment #348990 - Flags: review?(edwsmith) → review+
Pushed to tamarin-redux as changeset 1140:4d56af395653

Compared to the same code without this fix, there are some performance changes (interpreter-only code).  These are the deltas above 4% for GCC 4.0 with -O3 on MacOS 10.5.5, lightly loaded system:

jsbench/FFT.as                   18748   17767     5.5
jsbench/HeapSort.as               8879    9503    -6.6
jsbench/LUFact.as                11974   10995     8.9
jsbench/SOR.as                   71817   65771     9.2
jsbench/typed/HeapSort.as         9580   10606    -9.7
jsbench/typed/RayTracer.as       54300   52110     4.2

I wish I could say there was a trend here, but the main difference between HeapSort and the others is probably that HeapSort conses up a lot fewer doubles.  I'm not going to worry about it at this time.
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Resolved fixed engineering / work item that has been pushed.  Setting status to verified.
Status: RESOLVED → VERIFIED