Closed Bug 522314 Opened 12 years ago Closed 12 years ago

TM merge: Make x86 virtual stack pointer code optional - mirror of bug 514827

Categories

(Tamarin Graveyard :: Baseline JIT (CodegenLIR), defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: rreitmai, Assigned: rreitmai)

References

Details

(Whiteboard: fixed-in-tracemonkey)

Attachments

(2 files, 2 obsolete files)

avoid redundant stack pointer updates when calling functions - bug 514827
Summary: TM merge: Tamarin mirror of bug 514827 : redundant stack ptr updates → TM merge: Redundant stack ptr updates x86 - mirror of bug 514827
Depends on: 514827
Attached patch stk x86 (obsolete) — Splinter Review
Works against the latest tamarin, but further performance testing is needed: on Windows this is showing a slowdown on deltablue and gameoflife; it might just be noise.
Mac looks fine.
Attachment #406769 - Flags: review?(edwsmith)
From gal:

Weird! I'd hazard a guess that it's our old stinky enemy "stack alignment" come back to bite again, but that's pretty random (just looking at the patch). 

I did the original patch. It's a clear win in most cases. There is one case where it can be slower: PUSH m32 translates to a MOV FP+x, reg and a findRegFor. There is a comment in the original bug that we could probably avoid the extra spills there by pre-scanning the arguments. Are we sure what the slowdown here is really from? I went to great lengths to avoid misalignment. However, note that on win32 we never aligned the stack right (not before, not after the patch).
Digging further into this, I'm noticing more register pressure with the patch.  Without the patch we use 'push -n(ebp)' forms that don't chew up a register; post-patch we use 'mov m(esp), reg'.

As an example here's a hunk of code:

             push -60(ebp)          ebx(ldc1) esi(env) edi(or1)
             push edi               ebx(ldc1) esi(env) edi(or1)
             push 13279828          ebx(ldc1) esi(env)
             push -84(ebp)          ebx(ldc1) esi(env)
             push esi               ebx(ldc1) esi(env)
             call setprop_index     ebx(ldc1) esi(env)
             add esp,20             ebx(ldc1) esi(env)
             mov eax,-60(ebp)       ebx(ldc1) esi(env)  <= restore or2

which post-patch becomes:
             mov 16(esp),edi        eax(or3) ecx(or1) ebx(ldc1) esi(env) edi(or2)
             mov 12(esp),ecx        eax(or3) ecx(or1) ebx(ldc1) esi(env) edi(or2)
             mov 8(esp),13279828    eax(or3) ebx(ldc1) esi(env) edi(or2)
             mov 4(esp),eax         eax(or3) ebx(ldc1) esi(env) edi(or2)
             mov 0(esp),esi         ebx(ldc1) esi(env) edi(or2)
             call setprop_index     ebx(ldc1) esi(env) edi(or2)
             mov ecx,edi            ebx(ldc1) esi(env) edi(or2)

And I imagine as register pressure goes up, the post-patch version gets even worse since we constantly have to shuffle into and out of registers.  With the push() version we aren't required to restore the value to a register prior to use.
strictly speaking, the fact that we have to load then store hasn't changed; what changed is that it's now explicit.  the pusharg code could use a scratch register (eax?) that's known to be clobbered by the call, or be carefully crafted to only steal registers that it's using for other args, instead of creating new register pressure upstream of the call.

is this ready for review or is more testing being done?  i would hazard a guess that using push will work better on thiscall and fastarg abi's (callee pops args), and using esp-relative stores works better for cdecl (caller pops args).  at least one stack adjustment per call is still necessary for callee-pops abi's, right?
note: on windows, method calls are "thiscall" and on gcc (mac/linux) method calls are cdecl.  this could account for the slowdown.
Ed, we think the issue here is explicit ESP-relative addressing. That might be slow on Pentium D. We don't see a slowdown on core2.
On core2 you are not supposed to fiddle with ESP if possible and what we do matches closely what icc generates for function calls. The final re-adjustment in case of fastcall is annoying (callee pops arguments, we have to move ESP back up to our old watermark).
Attached file performance results (obsolete) —
Here is some more data on various CPUs.  All I can say is I'm more confused now than before.
It's becoming clear (to me at least) that we can't accept this change from tracemonkey directly.

So instead, I'm proposing a merged version which supports both modes.  

A runtime option config.fixed_esp determines which form to use.  There is a small amount of extra code needed to support both modes but it seems like the best option at this point.

Plus we gain the flexibility of dynamically switching techniques, which could be keyed to processor type and/or other requirements.
obsoletes the previous patch and blends existing tamarin and tracemonkey code.
Attachment #406769 - Attachment is obsolete: true
Attachment #407658 - Flags: superreview?(edwsmith)
Attachment #406769 - Flags: review?(edwsmith)
Attachment #407658 - Flags: review?(gal)
(In reply to comment #9)
> Its becoming clear (to me at least) that we can't accept this change from
> tracemonkey directly.

I'm puzzled / curious as to why the change is bad for Tamarin but good for TraceMonkey.
The speedup for TM is not substantial. This patch is mandatory if we want to free up EBP as a general purpose register. If we are ok with dropping that goal, we can back away from this patch. We should probably at least keep the virtual stack offset piece (stkd) and use it when writing doubles to the stack instead of repeated SUBi ESP, 8. So my vote is to drop the patch, except for that, due to unclear performance impact.
Some suggestions to make #s more relevant:

1. calculate %stdev instead of stdev, that is, %stdev = stdev/avg.  this will normalize all the stdev results; currently they are in the measurement units, and it's easier to understand if they are unitless %.

2. calculate "significance" as %diff / (%stdev1 + %stdev2).  Tests with large standard deviations (lots of noise) thus require a larger %diff before the change can be considered significant.

Not to touch on religious issues here... some small changes are still desirable (or not desirable) even if they are "insignificant" calculated this way.  but more importantly this will highlight the truly significant changes (with significance outside the noise band, say > 1 or < -1).
(In reply to comment #12)
> The speedup for TM is not substantial. This patch is mandatory if we want to
> free up EBP as a general purpose register. If we are ok with dropping that
> goal, we can back away from this patch. We should probably at least keep the
> virtual stack offset piece (stkd) and use it when writing doubles to the stack
> instead of repeated SUBi ESP, 8. So my vote is to drop the patch, except for
> that, due to unclear performance impact.

on #nanojit, the possibility came up of mov [esp+off], R requiring a SIB byte being a factor.  we could recode this so stores are relative to EBP, which doesn't require a SIB byte.  (ignore the goal of not using EBP, for now).
one more thought:

the benefit from using EBP to reduce register pressure might actually outweigh any potential slowdowns from storing arguments relative to ESP and explicitly adjusting ESP after callee-pops-args calls (thiscall, fastcall).  in that case we should consider the patches together instead of separately.

lastly, let's not look at the performance in isolation from code size.

* not adjusting ESP saves code size
* mov [ebp+d], R is bigger than push
* mov [esp+d], R is bigger than mov [ebp+d]

bigger+slower  => bad
smaller+slower => maybe good or bad
bigger+faster => maybe good or bad
smaller+faster => good

a big size savings for a small slowdown => probably good
big speedup for small growth => probably good.

you can see where i'm going.  i'm not ready to give up on the whole thing yet.  also, if the whole thing is stinky on msvc and okay on gcc, then it's likely because of a much higher use of the "thiscall" convention with msvc, which could be rethought.  (more cdecl calls? more fastcall calls?)
Attachment #407658 - Flags: review?(gal) → review+
Attachment #407658 - Flags: superreview?(edwsmith)
Attachment #407658 - Flags: superreview+
Attachment #407658 - Flags: review?(gal)
Attachment #407658 - Flags: review+
Comment on attachment 407658 [details] [diff] [review]
support both x86 stk options

> 
> 		bool sse2;
>+        bool fixed_esp;
> 		bool use_cmov;
> 

indentation

>         /**
>@@ -151,6 +152,7 @@
> 		static const Runmode runmode_default;
> 		static const bool cseopt_default;
> 		static const bool sse2_default;
>+        static const bool fixed_esp_default;

indentation

>-            asm_arg(ARGSIZE_P, ins->arg(argc), EAX);
>+            asm_arg(ARGSIZE_P, ins->arg(argc), EAX, stkd);
>+            if (!config.fixed_esp) stkd = 0;

We usually put the stkd = 0 in the next line here.

>         }
> 
>         for(uint32_t i=0; i < argc; i++)
>@@ -240,11 +254,16 @@
>             if (n < max_regs && sz != ARGSIZE_F) {
>                 r = argRegs[n++]; // tell asm_arg what reg to use
>             }
>-            asm_arg(sz, ins->arg(j), r);
>+            asm_arg(sz, ins->arg(j), r, stkd);
>+            if (!config.fixed_esp) stkd = 0;

Here too.

> #ifdef TM_MERGE
>     //
>     // 22Jul09 rickr - Enabling the evict causes a 10% slowdown on primes
>@@ -1388,7 +1430,10 @@
>             evictIfActive(FST0);
> #endif

Not enabling this is a bug. If you have an argument that is in FST0 and the argument is supplied to a function twice, we generate bad code. We should bisect for the bug #. Ed suggested a better fix than evicting FST0 every time. dvander might remember the bug # and details.
Attachment #407658 - Flags: review?(gal) → review+
Ok, preparing to land this patch as it gives us the most flexibility at this point and leads us closer to a merged codebase, without getting bogged down in proving which approach is *better*.

Suggest we open a separate bug in order to duel it out; we can then migrate post-merge to whatever scheme we agree upon.

As for comment #17:
 (1) all tabs have been removed; they were relics of cut&paste from surrounding code.
 (2) TM_MERGE is not strictly part of this codebase; please comment on bug 506146
pushed http://hg.mozilla.org/tamarin-redux/rev/e726f5ba7e6d

Note: the optimization suggestion of using mov for quads (rather than pushes) should be picked up in any follow-up patches.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Summary: TM merge: Redundant stack ptr updates x86 - mirror of bug 514827 → TM merge: Make x86 virtual stack pointer code optional - mirror of bug 514827
Engineering work item.  Marking as verified.
Status: RESOLVED → VERIFIED