Bug 522314 (Closed)
Opened 16 years ago · Closed 16 years ago
TM merge: Make x86 virtual stack pointer code optional - mirror of bug 514827
Categories: Tamarin Graveyard :: Baseline JIT (CodegenLIR), defect
Status: VERIFIED FIXED
People: (Reporter: rreitmai, Assigned: rreitmai)
Whiteboard: fixed-in-tracemonkey
Attachments: (2 files, 2 obsolete files)
patch, 9.00 KB (gal: review+, edwsmith: superreview+): avoid redundant stack pointer updates when calling functions - bug 514827
text/plain, 12.15 KB
Updated by Assignee • 16 years ago
Summary: TM merge: Tamarin mirror of bug 514827 : redundant stack ptr updates → TM merge: Redundant stack ptr updates x86 - mirror of bug 514827
Comment 1 (Assignee) • 16 years ago
Works against the latest Tamarin, but further performance testing is needed: on Windows it shows a slowdown on deltablue and gameoflife, which might just be noise.
Mac looks fine.
Updated by Assignee • 16 years ago
Attachment #406769 - Flags: review?(edwsmith)
Comment 2 (Assignee) • 16 years ago
From gal:
Weird! I'd hazard a guess that it's our old stinky enemy "stack alignment" come back to bite us again, but that's pretty random (just looking at the patch).
I did the original patch. It's a clear win in most cases. There is one case where it can be slower: PUSH m32 translates to a MOV FP+x, reg and a findRegFor. There is a comment in the original bug that we could probably avoid extra spills there by pre-scanning the arguments. Are we sure what the slowdown is really from here? I went to great lengths to avoid misalignment. Note, however, that on win32 we never aligned the stack right (not before the patch, not after it).
Comment 3 (Assignee) • 16 years ago
Digging further into this, I'm noticing more register pressure with the patch. Without the patch we use 'push -n(ebp)' forms that don't chew up a register; post-patch we use 'mov m(esp), reg'.
As an example here's a hunk of code:
push -60(ebp) ebx(ldc1) esi(env) edi(or1)
push edi ebx(ldc1) esi(env) edi(or1)
push 13279828 ebx(ldc1) esi(env)
push -84(ebp) ebx(ldc1) esi(env)
push esi ebx(ldc1) esi(env)
call setprop_index ebx(ldc1) esi(env)
add esp,20 ebx(ldc1) esi(env)
mov eax,-60(ebp) ebx(ldc1) esi(env) <= restore or2
which post-patch becomes:
mov 16(esp),edi eax(or3) ecx(or1) ebx(ldc1) esi(env) edi(or2)
mov 12(esp),ecx eax(or3) ecx(or1) ebx(ldc1) esi(env) edi(or2)
mov 8(esp),13279828 eax(or3) ebx(ldc1) esi(env) edi(or2)
mov 4(esp),eax eax(or3) ebx(ldc1) esi(env) edi(or2)
mov 0(esp),esi ebx(ldc1) esi(env) edi(or2)
call setprop_index ebx(ldc1) esi(env) edi(or2)
mov ecx,edi ebx(ldc1) esi(env) edi(or2)
And I imagine as register pressure goes up, the post-patch version gets even worse since we constantly have to shuffle into and out of registers. With the push() version we aren't required to restore the value to a register prior to use.
Comment 4 • 16 years ago
Strictly speaking, the fact that we have to load then store hasn't changed; what changed is that it's now explicit. The pusharg code could use a scratch register (eax?) that's known to be clobbered by the call, or be carefully crafted to only steal registers that it's already using for other args, instead of creating new register pressure upstream of the call.
Is this ready for review, or is more testing being done? I would hazard a guess that using push will work better for thiscall and fastcall ABIs (callee pops args), and that using esp-relative stores works better for cdecl (caller pops args). At least one stack adjustment per call is still necessary for callee-pops ABIs, right?
Comment 5 • 16 years ago
Note: on Windows, method calls are "thiscall", while with gcc (Mac/Linux) method calls are cdecl. This could account for the slowdown.
Comment 6 • 16 years ago
Ed, we think the issue here is explicit ESP-relative addressing. That might be slow on Pentium D. We don't see a slowdown on core2.
Comment 7 • 16 years ago
On core2 you are supposed to avoid fiddling with ESP where possible, and what we do closely matches what icc generates for function calls. The final re-adjustment in the fastcall case is annoying (the callee pops the arguments, so we have to move ESP back up to our old watermark).
Comment 8 (Assignee) • 16 years ago
Here is some more data on various CPUs. All I can say is I'm more confused now than before.
Comment 9 (Assignee) • 16 years ago
It's becoming clear (to me at least) that we can't accept this change from TraceMonkey directly.
So instead, I'm proposing a merged version which supports both modes.
A runtime option, config.fixed_esp, determines which form to use. There is a small amount of extra code needed to support both modes, but it seems like the best option at this point.
Plus, we gain the flexibility of dynamically switching techniques, which could be keyed to processor type and/or other requirements.
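To illustrate the idea, here is a hypothetical sketch (not the actual CodegenLIR code; the instruction strings, slot layout, and helper names are invented) of what the two modes boil down to for a call's argument setup:

```python
# Hypothetical sketch of the two argument-setup strategies; not the
# actual CodegenLIR code. Register names and slot layout are invented.

def emit_args_push(args):
    # Legacy mode: push each argument (right to left). ESP moves
    # implicitly, so for a caller-pops ABI (cdecl) the caller must
    # re-adjust ESP after the call.
    code = ["push %s" % a for a in reversed(args)]
    code.append("call f")
    code.append("add esp,%d" % (4 * len(args)))
    return code

def emit_args_fixed_esp(args):
    # fixed_esp mode: ESP stays parked at a pre-reserved watermark and
    # arguments are stored into esp-relative slots, so no per-call ESP
    # adjustment is emitted for a caller-pops call.
    code = ["mov %d(esp),%s" % (4 * i, a) for i, a in enumerate(args)]
    code.append("call f")
    return code

def emit_args(args, fixed_esp):
    # The runtime switch corresponding to config.fixed_esp.
    return emit_args_fixed_esp(args) if fixed_esp else emit_args_push(args)
```

For cdecl this saves the per-call ESP adjustment, at the cost that memory operands must first be loaded into a register, which is the register-pressure effect noted in comment 3.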
Comment 10 (Assignee) • 16 years ago
Obsoletes the previous patch and blends the existing Tamarin and TraceMonkey code.
Attachment #406769 - Attachment is obsolete: true
Attachment #407658 - Flags: superreview?(edwsmith)
Attachment #406769 - Flags: review?(edwsmith)
Updated by Assignee • 16 years ago
Attachment #407658 - Flags: review?(gal)
Comment 11 • 16 years ago
(In reply to comment #9)
> Its becoming clear (to me at least) that we can't accept this change from
> tracemonkey directly.
I'm puzzled / curious as to why the change is bad for Tamarin but good for TraceMonkey.
Comment 12 • 16 years ago
The speedup for TM is not substantial. This patch is mandatory if we want to free up EBP as a general purpose register. If we are ok with dropping that goal, we can back away from this patch. We should probably at least keep the virtual stack offset piece (stkd) and use it when writing doubles to the stack instead of repeated SUBi ESP, 8. So my vote is to drop the patch, except for that, due to unclear performance impact.
Comment 13 • 16 years ago
Attachment #407629 - Attachment is obsolete: true
Comment 14 • 16 years ago
Some suggestions to make the numbers more relevant:
1. Calculate %stdev instead of stdev, that is, %stdev = stdev/avg. This normalizes all the stdev results; currently they are in the measurement units, and it's easier to understand if they are a unitless %.
2. Calculate "significance" as %diff / (%stdev1 + %stdev2). Tests with large standard deviations (lots of noise) thus require a larger %diff before the change can be considered significant.
Not to touch on religious issues here... some small changes are still desirable (or not desirable) even if they are "insignificant" calculated this way. But more importantly, this will highlight the truly significant changes (those with significance outside the noise band, say > 1 or < -1).
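The two suggested statistics can be sketched in a few lines. This is an illustrative sketch: the function names are mine, and %diff is taken as (after - before) / before:

```python
import statistics

def pct_stdev(samples):
    # %stdev = stdev / avg: unitless, so runs measured in different
    # units (ms, iterations/sec, ...) become comparable.
    return statistics.stdev(samples) / statistics.mean(samples)

def significance(before, after):
    # %diff divided by the combined noise bands; a value outside
    # roughly -1..1 suggests the change exceeds the measurement noise.
    pct_diff = (statistics.mean(after) - statistics.mean(before)) / statistics.mean(before)
    return pct_diff / (pct_stdev(before) + pct_stdev(after))
```

With quiet runs a 10% difference scores well outside the noise band, while the same absolute difference on noisy runs scores well inside it.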
Comment 15 • 16 years ago
(In reply to comment #12)
> The speedup for TM is not substantial. This patch is mandatory if we want to
> free up EBP as a general purpose register. If we are ok with dropping that
> goal, we can back away from this patch. We should probably at least keep the
> virtual stack offset piece (stkd) and use it when writing doubles to the stack
> instead of repeated SUBi ESP, 8. So my vote is to drop the patch, except for
> that, due to unclear performance impact.
On #nanojit, the possibility that mov [esp+off], R requires a SIB byte was raised as a factor. We could recode this so stores are relative to EBP, which doesn't require a SIB byte (ignoring the goal of not using EBP, for now).
Comment 16 • 16 years ago
One more thought: the benefit from using EBP to reduce register pressure might actually outweigh any potential slowdowns from storing arguments relative to ESP and explicitly adjusting ESP after callee-pops-args calls (thiscall, fastcall). In that case we should consider the patches together instead of separately.
Lastly, let's not look at the performance in isolation from code size:
* not adjusting ESP saves code size
* mov [ebp+d], R is bigger than push
* mov [esp+d], R is bigger than mov [ebp+d]
bigger+slower => bad
smaller+slower => maybe good or bad
bigger+faster => maybe good or bad
smaller+faster => good
a big size savings for a small slowdown => probably good
a big speedup for small growth => probably good
You can see where I'm going; I'm not ready to give up on the whole thing yet. Also, if the whole thing is stinky on msvc and okay on gcc, then it's likely because of a much higher use of the "thiscall" convention with msvc, which could be rethought (more cdecl calls? more fastcall calls?).
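On the code-size point, rough encoding lengths for the forms in question can be tallied. The byte counts below follow the standard x86 encodings (assuming 8-bit displacements throughout), and the two sample sequences mirror the five-argument call from comment 3; this is an illustrative tally, not JIT output:

```python
# Rough x86 instruction lengths in bytes for the forms discussed above,
# assuming 8-bit displacements. Illustrative tally, not JIT output.
SIZE = {
    "push r32":              1,  # 50+rd
    "push imm32":            5,  # 68 id
    "push disp8(ebp)":       3,  # FF /6, modrm, disp8
    "mov disp8(ebp), r32":   3,  # 89 /r, modrm, disp8
    "mov disp8(esp), r32":   4,  # 89 /r, modrm, SIB, disp8
    "mov disp8(esp), imm32": 8,  # C7 /0, modrm, SIB, disp8, imm32
    "add esp, imm8":         3,  # 83 /0, ib
}

# The five-argument call from comment 3, in each style (the call itself
# is the same size in both and is omitted):
push_seq = ["push disp8(ebp)", "push r32", "push imm32",
            "push disp8(ebp)", "push r32", "add esp, imm8"]
mov_seq = ["mov disp8(esp), r32", "mov disp8(esp), r32",
           "mov disp8(esp), imm32", "mov disp8(esp), r32",
           "mov disp8(esp), r32"]

push_bytes = sum(SIZE[i] for i in push_seq)  # 16
mov_bytes = sum(SIZE[i] for i in mov_seq)    # 24
```

So for this particular hunk the push form is smaller even including its add esp, consistent with "mov [esp+d], R is bigger than push"; the picture can shift once displacement widths and the ESP adjustments saved across many cdecl calls are weighed in.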
Updated • 16 years ago
Attachment #407658 - Flags: review?(gal) → review+
Updated • 16 years ago
Attachment #407658 - Flags: superreview?(edwsmith) → superreview+
Attachment #407658 - Flags: review?(gal) → review+
Comment 17 • 16 years ago
Comment on attachment 407658 [details] [diff] [review]
support both x86 stk options
>
> bool sse2;
>+ bool fixed_esp;
> bool use_cmov;
>
indentation
> /**
>@@ -151,6 +152,7 @@
> static const Runmode runmode_default;
> static const bool cseopt_default;
> static const bool sse2_default;
>+ static const bool fixed_esp_default;
indentation
>- asm_arg(ARGSIZE_P, ins->arg(argc), EAX);
>+ asm_arg(ARGSIZE_P, ins->arg(argc), EAX, stkd);
>+ if (!config.fixed_esp) stkd = 0;
We usually put the stkd = 0 on the next line here.
> }
>
> for(uint32_t i=0; i < argc; i++)
>@@ -240,11 +254,16 @@
> if (n < max_regs && sz != ARGSIZE_F) {
> r = argRegs[n++]; // tell asm_arg what reg to use
> }
>- asm_arg(sz, ins->arg(j), r);
>+ asm_arg(sz, ins->arg(j), r, stkd);
>+ if (!config.fixed_esp) stkd = 0;
Here too.
> #ifdef TM_MERGE
> //
> // 22Jul09 rickr - Enabling the evict causes a 10% slowdown on primes
>@@ -1388,7 +1430,10 @@
> evictIfActive(FST0);
> #endif
Not enabling this is a bug. If you have an argument that is in FST0 and the argument is supplied to a function twice, we generate bad code. We should bisect for the bug #. Ed suggested a better fix than evicting FST0 every time. dvander might remember the bug # and details.
Comment 18 (Assignee) • 16 years ago
OK, preparing to land this patch, as it gives us the most flexibility at this point and leads us closer to a merged codebase, without getting bogged down in proving which approach is *better*.
I suggest we open a separate bug in order to duel it out; we can then migrate post-merge to whatever scheme we agree upon.
As for comment 17:
(1) all tabs have been removed; they were relics of cut&paste from surrounding code.
(2) TM_MERGE is not strictly part of this codebase; please comment on bug 506146.
Comment 19 (Assignee) • 16 years ago
Pushed http://hg.mozilla.org/tamarin-redux/rev/e726f5ba7e6d
Note that the optimization suggestion of using mov for quads (rather than pushes) should go into any follow-up patches.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Updated • 16 years ago
Summary: TM merge: Redundant stack ptr updates x86 - mirror of bug 514827 → TM merge: Make x86 virtual stack pointer code optional - mirror of bug 514827
Comment 20 • 16 years ago
Whiteboard: fixed-in-tracemonkey