Reducing memory operations from prologue & epilogue of JITed code

VERIFIED WONTFIX

Status

Tamarin
Tracing Virtual Machine
VERIFIED WONTFIX
10 years ago
8 years ago

People

(Reporter: Jungwoo Ha, Unassigned)

(Reporter)

Description

10 years ago
User-Agent:       Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; InfoPath.1; MS-RTC LM 8; .NET CLR 3.0.04506.648; .NET CLR 3.5.21022)
Build Identifier: 

To align the stack, the prologue pushes EBP twice, and the epilogue pops it twice.
The first PUSH EBP can be replaced with SUB ESP, 4, and the matching POP with ADD ESP, 4.
This grows the code on x86 (SUB ESP, 4 encodes in 3 bytes, versus 1 byte for PUSH EBP), but it saves one memory operation. I ran it on Mac OS X with a 2.8GHz Core 2 Duo and got a slight performance improvement. Here are the patch and the performance results.

---
diff -r b7fa522d969b nanojit/Nativei386.cpp
--- a/nanojit/Nativei386.cpp    Tue Jun 03 08:54:09 2008 -0700
+++ b/nanojit/Nativei386.cpp    Tue Jun 03 15:38:32 2008 -0700
@@ -90,7 +90,8 @@
         NIns *patchEntry = _nIns;
                MR(FP, SP);
                PUSHr(FP); // push ebp twice to align frame on 8bytes
-               PUSHr(FP);
+               //PUSHr(FP);
+               SUBi(SP, 4);

                for(Register i=FirstReg; i <= LastReg; i = nextreg(i))
                        if (needSaving&rmask(i))
@@ -175,7 +176,8 @@
                for (Register i=UnknownReg; i >= FirstReg; i = prevreg(i))
                        if (restore&rmask(i)) { POP(i); }

-               POP(FP);
+               //POP(FP);
+               ADDi(SP,4);
                POP(FP);
         return  _nIns;
     }
---

./runtests.py -i 50 sunspider
Executing tests at 2008-05-31 12:36:50.245578
avm: /Users/habals/tamarin-tracing-unmod/dist/shell/avmshell
avm2: /Users/habals/tamarin-tracing/dist/shell/avmshell


test                                                   avm    avm2     %sp

sunspider/access-binary-trees.as                      84.0    84.0     0.0
sunspider/access-fannkuch.as                         138.0   136.0     1.4
sunspider/access-nbody.as                            160.0   160.0     0.0
sunspider/access-nsieve.as                            60.0    60.0     0.0
sunspider/bitops-3bit-bits-in-byte.as                 14.0    14.0     0.0
sunspider/bitops-bits-in-byte.as                      40.0    40.0     0.0
sunspider/bitops-bitwise-and.as                      206.0   201.0     2.4
sunspider/bitops-nsieve-bits.as                       52.0    52.0     0.0
sunspider/controlflow-recursive.as                    30.0    29.0     3.3
sunspider/crypto-aes.as                              169.0   169.0     0.0
sunspider/crypto-sha1.as                              39.0    39.0     0.0
sunspider/math-cordic.as                              52.0    52.0     0.0
sunspider/math-partial-sums.as                       196.0   194.0     1.0
sunspider/math-spectral-norm.as                       33.0    33.0     0.0
sunspider/s3d-cube.as                                155.0   154.0     0.6
sunspider/s3d-morph.as                                77.0    75.0     2.6
sunspider/string-fasta.as                            159.0   155.0     2.5
---
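
The %sp column appears to be the percentage by which the avm2 (patched) time is lower than the avm (unmodified) time. That reading is inferred from the numbers above, not from anything the runtests.py output states. A minimal sketch of the arithmetic, where speedup_percent is a hypothetical name:

```cpp
// Hypothetical helper, not part of runtests.py: percent reduction of the
// patched build's time (avm2) relative to the unmodified build's time (avm).
constexpr double speedup_percent(double avm, double avm2) {
    return (avm - avm2) / avm * 100.0;
}
```

For example, access-fannkuch gives (138 - 136) / 138 * 100, which is about 1.4, matching the table.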

I found that after these pushes in the prologue, SUBi SP,40 is executed.
I think removing these pushes and combining the adjustments into a single SUB instruction would give a better performance improvement.
However, one of the EBP values on the stack is used, so I'm not sure how to get rid of it.
Any comments on whether this is possible, or the right way to go?
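
To make the tradeoff concrete, here is a rough bookkeeping sketch in C++ that compares three instruction sequences by their total ESP adjustment, stack memory writes, and code size. It is not nanojit code: every name in it is made up for illustration, it ignores MOV EBP, ESP and any register saves emitted between these instructions, and it assumes the short x86 encodings (1 byte for PUSH EBP, 3 bytes for SUB ESP with an 8-bit immediate). The "folded" variant is only a guess at what combining the adjustments might look like while keeping one saved EBP. It does not model which stack slot the rest of the code reads.

```cpp
#include <cstddef>

// Illustrative bookkeeping only; none of these names come from nanojit.
struct Op {
    int esp_delta;   // change to ESP in bytes (the stack grows downward)
    int mem_ops;     // stack memory writes performed by the instruction
    int code_bytes;  // size of the short x86 encoding
};

struct Summary { int esp_delta; int mem_ops; int code_bytes; };

constexpr Op PUSH_EBP   {-4,  1, 1};  // 55
constexpr Op SUB_ESP_4  {-4,  0, 3};  // 83 EC 04
constexpr Op SUB_ESP_40 {-40, 0, 3};  // 83 EC 28
constexpr Op SUB_ESP_44 {-44, 0, 3};  // 83 EC 2C

// Sum the effect of a straight-line instruction sequence.
template <std::size_t N>
constexpr Summary summarize(const Op (&seq)[N]) {
    Summary s{0, 0, 0};
    for (std::size_t i = 0; i < N; ++i) {
        s.esp_delta  += seq[i].esp_delta;
        s.mem_ops    += seq[i].mem_ops;
        s.code_bytes += seq[i].code_bytes;
    }
    return s;
}

constexpr Op kOriginal[] = {PUSH_EBP, PUSH_EBP, SUB_ESP_40};   // current prologue
constexpr Op kPatched[]  = {SUB_ESP_4, PUSH_EBP, SUB_ESP_40};  // the attached patch
constexpr Op kFolded[]   = {PUSH_EBP, SUB_ESP_44};             // hypothetical combined SUB
```

Under these assumptions all three sequences move ESP by the same 48 bytes. The patch trades one memory write for two extra bytes of code, and the folded form would also save code bytes if the stack layout it implies turns out to be acceptable.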


Reproducible: Always

(Reporter)

Comment 1

10 years ago
Created attachment 323628 [details] [diff] [review]
replacing push & pop to sub & add in prologue


Comment 2

10 years ago
The prologue on Windows aligns esp to 8 bytes and makes sure ebp is also aligned. Because of how the code is organized, the prologue is messier and larger than it should be.

Further directions that expand the scope but may have more benefit:
- How about an optimized prologue for Windows that integrates the 8-byte alignment of esp with everything else, rather than two mini-prologues?

- Should the pushes of esi, etc. occur after saving ebp, to make the prologue "standard" (this aids debugging)?

- The prologue is only executed when transitioning between the interpreter and traces, not when jumping from one trace to another, and it's exactly the same prologue for every trace. We could hand-code the prologue once and jump directly to a no-prologue trace. This would mean having 1 prologue, period, vs. 1 per trace like now. Code size would then not matter.

- Related: when calling a helper function that takes a floating-point value (e.g. fmod), we do PUSH(ECX) twice, then a store to write the fp value. Should we instead do sub esp,8? What's the size/speed tradeoff? Are there any issues with esp folding?

Ed
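
For reference, the 8-byte alignment of esp mentioned in Comment 2 amounts to rounding the stack pointer down to a multiple of 8, since the x86 stack grows toward lower addresses. A minimal sketch of that arithmetic in C++ follows. The function names are hypothetical, the alignment is assumed to be a power of two, and this is not how nanojit itself emits the alignment code:

```cpp
#include <cstdint>

// Round a stack pointer down to the given alignment.
// Assumes `alignment` is a power of two (8 in the case discussed here).
constexpr std::uint32_t align_down(std::uint32_t sp, std::uint32_t alignment) {
    return sp & ~(alignment - 1u);
}

// Number of padding bytes that must be reserved to reach that alignment.
constexpr std::uint32_t padding_for(std::uint32_t sp, std::uint32_t alignment) {
    return sp - align_down(sp, alignment);
}
```

With a 4-byte-aligned esp the padding is either 0 or 4 bytes. A 4-byte pad can be reserved with a single SUB ESP, 4 and no memory write.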

Updated

9 years ago
Status: NEW → RESOLVED
Last Resolved: 9 years ago
Resolution: --- → WONTFIX

Updated

8 years ago
Status: RESOLVED → VERIFIED