Closed Bug 140412 Opened 22 years ago Closed 22 years ago

Modest performance enhancement for x86 unix xptcall using gcc

Categories

(Core :: XPCOM, defect)

x86
Linux
defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: tenthumbs, Assigned: tenthumbs)

References

Details

(Keywords: perf)

Attachments

(3 files, 12 obsolete files)

17.18 KB, patch
shaver
: review+
blizzard
: superreview+
Details | Diff | Splinter Review
17.07 KB, patch
Details | Diff | Splinter Review
16.93 KB, patch
Details | Diff | Splinter Review
Note that XPTC_InvokeByIndex calculates the address of 
invoke_copy_to_stack every time it's called. Since the address doesn't 
change that's inefficient. On Linux at least, the compiler generates 
simple calls for static functions. There's no reason why mozilla 
shouldn't either.

Here's a very experimental patch. In a simple test rig, it's a solid 8%
faster than the current version but I don't expect that much from
mozilla in the long term. It compiles cleanly on my Linux with egcs
1.1.2, gcc 2.95.3, and gcc 3.0.4 but that's not saying much. It may also 
be necessary to add -fkeep-inline-functions to the command line for very 
high optimization levels.

Looking for testers to see if this really helps.
Attached patch patch (obsolete) — Splinter Review
john, can you review this patch?
The patch is correct but the problem is that the function definition might be
omitted.  For gcc 3 and up there is the function attribute

  __attribute__ ((used))

which should be added to the definition to accomplish that.  For other compilers
it remains to be seem.  It could very well be that at least gcc 2.96 removed the
definition.
Yep, compiling with -O3 gives link errors but adding
-fkeep-inline-functions fixes it.

The big problem here is that gcc doesn't know we're calling functions
from the asm so we have to jump through hoops to keep gcc from screwing
up. Only the variable n needs to be preserved. I have a version that 
does this
  register void *esp __asm__ ("esp");//bind name to hardware reg
  void * volatile saved_esp = esp;
  __asm__ (...);
  esp = saved_esp;
which works but I'm not sure some version of gcc won't mess with this.
Adding the asm is kind of silly.  The asm cannot be optimized away and add code
and cost while the whole patch is about performance tuning.

I'd suggest making the improvement conditional on gcc 3.1 and up (gcc 3.1
introduce the 'used' attribute).  So you}d change the beginning of the
invoke_copy_to_stack definition to

static void
#if __GNUC__ > 3 || (__GNUC__ == 3 && __GNUC_MINOR__ >= 1)
__attribute__ ((used))
#endif
invoke_copy_to_stack(PRUint32 paramCount, nsXPTCVariant* s, PRUint32* d) 
{

The patch (the asm stuff) then also has to use the same #if expression.
I was unclear. I meant in XPTC_InvokeByIndex not in 
invoke_copy_to_stack.
Attached patch assembler-only patch (obsolete) — Splinter Review
Not surprisingly, rewriting XPTC_InvokeByIndex entirely in assembler
produces even more improvement. In this patch the current assemble code
is reused with just some minor editing. Only the function prologue and
epilogue are new. It's untested in a real mozilla but it compiles and
works correctly in my test rig.
The patch looks OK.  In fact: it fixes one major problem.  Don't know why I
haven't seen it before.  The current code is not signal-safe.  It uses memory
below the stack pointer which is an absolute no-no.  Especially with the current
libpthread implementation on Linux which creates a awfull lot of signals.

What the patch still doesn't handle is in inlining problem.  The #if for the
ATTRIBUTE_USED definition should use the expressions I mentioned last time:

  __GNUC__ > 3 || (__GNUC__ == 3 && __GNUC_MINOR__ >= 1)

But this still leaves gcc 3.0 and gcc 2.xx.  If using special parameter is an
option do thiat.  Otherwise create and non-static function which is never used
and which calls this function.  Such as in:

#if __GNUC__ && (__GNUC__ < 3 || (__GNUC__ == 3 && __GNUC_MINOR__ = 0))
void
XPTC_NeverCalledPleaseIgnore (void)
{
  (void) invoke_copy_to_stack (0, 0, 0);
}
#endif

Depending on how the Makefiles are sructured this might be preferrable.


Back to the code.  While it's correct there are still some improvements possible:

- as correctly noted, the PIC register loading isn't necessary.  The function
  doesn't use global variable nor does it call non-static functions.  This,
  on x86, removes the requirement for a correctly set up %ebx

  I would strongly recommend to add a comment explaining this, though.

- since %ebx is not modified it is not necessary to preserve its value.
  Therefore the "pushl %ebx" in the prolog and the matching "popl %ebx" in
  the epilogue are not needed.

- as for the size computation, there is no need to define a symbol.  Replace
  the lines of the .Lmozilla2 definition and the .size line with

  .size XPTC_InvokeByIndex, . - XPTC_InvokeByIndex

  You've almost used the '.' address operator already.


The rest looks good and because of the major bug this patch fixes it should go
in before 1.0.
Attached patch assembler-only patch 2 (obsolete) — Splinter Review
This version incorporates the latest suggestions )assuming I haven't
made any types). I also added an option to use __attribute__ ((stdcall))
which may be slightly faster.

This patch is a major change. Maybe it would be better to put this 
version in a new file, say xptcinvoke_gcc_x86_unix.cpp, and switch over 
each platform in the makefile after proper tested
Attachment #81373 - Attachment is obsolete: true
The stdcall change in the latest patch is broken.  You only changed the
correction of the stack pointer and not the way the parameters are passed to
invoke_copy_to_stack.  The pushls must be replaced by simple register loads.

While this can be a benefit I still wouldn't suggest using it.  gcc is very bad
at  handling stdcall.  Normally the first thing it does in a stdcall function is
to spill the registers.

So, I'd suggest removing the MOZ_USE_STDCALL stuff.  At the same time you can
remove the addl  $0x0c, %esp after the invoke_copy_to_stack.  Since a stack
frame is created the stack pointer need not be corrected after every function call.


And one more thing: you've left the PIC computation code in.  That's fine.  But
the code itself isn't.  This

+
"pushl %ebx\n\t"
+
"call  .Lmozilla1\n"
+
".Lmozilla1:\n\t"
+
"popl  %ebx\n\t"
+
"addl  $_GLOBAL_OFFSET_TABLE_+[. - .Lmozilla1], %ebx\n\t"

code is very unfriendly to the branch prediction unit of i686 and up.  These
processors depend on a call being undone by a ret.  So you should write it like
this:

     pushl %ebx
     call 0f
     .subsection 1
0:   movl (%esp), %ebx
     ret
     .previous
     addl $_GLOBAL_OFFSET_TABLE_, %ebx

This is the code gcc would use for -mtune=i686.


Finally, would somebody please add "dataloss" as a keyword?  The patch fixes a
possibly big problem even though the bug itself is not about this.
Are you confusing stdcall with regparm? The docs say

 "stdcall
    On the Intel 386, the stdcall attribute causes the compiler to
assume that the called function will pop off the stack space used to
pass arguments, unless it takes a variable number of arguments."

so it's just a question of who flushes the arguments from the stack. I
can't say I see any of my compilers behaving badly with stdcall but I'll
remove it anyway.

We must pop the invoke_copy_to_stack arguments because we're creating an
argument list for another function in two pieces and they had better be
contiguous.

I'm leery of the .subsection stuff. Do all the binutils packages out
there support it?

New patch upcoming shortly.

I must be missing something but I just can't see where the current code
references negative offsets from esp.
> Are you confusing stdcall with regparm? The docs say

Oops, yes.  I'm never enabling stdcall without regparms myself.  OK, scratch
that.  But the comment about removing the stack pointer adjustment is still
valid.  Plus, not doing the adjustment will also align the stack for the
indirect call later on.

> I'm leery of the .subsection stuff. 

For a time long enough to depend on it here.  Versions which don't support it
cannot be used to build moz anyway.

> I must be missing something but I just can't see where the current code
> references negative offsets from esp.

Tell me to never read sources in moz.  I misread subl  %8, %%esp in the
originial code as subl  $8, %%esp.  The font wasn't usable.  Oh well, this is
void then, too.
Attached patch assembler-only patch 3 (obsolete) — Splinter Review
I've changed the PIC code. If it becomes an issue I can fix it. I left 
the stdcall stuff because it can't hurt and might help. I'm going to 
leave the rest alone for the moment because I want to fix up xptcstubs.
Attachment #81487 - Attachment is obsolete: true
reassigning to tenthumbs@cybernex.net since he is doing the work.  
Assignee: dougt → tenthumbs
Attached patch xptcinvoke + xptcstubs patch 1 (obsolete) — Splinter Review
Here's a first try at a unified approach for xptcinvoke and xptcstubs.	
To avoid contaminating other platforms, it creates new files and patches
Makefile.in. If something goes wrong, just revert Makefile.in.

I hope to make an optimized build this evening or tomorrow but I don't
have a lot of time. If anyone wants to experiment in their copious spare
time, feel free to do so. :-)
Attachment #81207 - Attachment is obsolete: true
Attachment #81675 - Attachment is obsolete: true
Attached patch xptcinvoke + xotcstubs patch 2 (obsolete) — Splinter Review
Fixed some dumb typos in the Makefile.in patch.

This patch compiles and runs on Linux. It appears to be faster but no numbers
yet.
Attachment #81851 - Attachment is obsolete: true
Attached patch xptcinvoke + xptcstubs patch 3 (obsolete) — Splinter Review
Just minor changes here. The stdcall trick helps performance slightly so 
it's turned on in the Makefile.in patch. The keeper functions are more 
elegant but they are exposed in the library so they're turned off and 
-fkeep-inline-functions is on.
Attachment #81916 - Attachment is obsolete: true
The new XPTC_InvokeByIndex is 18-22% faster than the old one depending 
on the compiler and options used. The nsXPTCStubBase::Stub* functions 
are also slightly faster, maybe 1% but not much more.

In mozilla this translates to about 1% speed improvement but the noise 
is high so it may be less. I'm not sure if 1% is disappointing or 
surprising. There is presumably improvement in other mozilla operations 
but I don't know how to test for that.

Adding the perf keyword. Changing the summary from "Modest performance 
enhancement for xptcinvoke_unixish_x86.cpp" to "Modest performance 
enhancement for x86 unix xptcall using gcc."

The latest patch is solid so I guess it's time to start looking for a
review. 
Keywords: perf
Summary: Modest performance enhancement for xptcinvoke_unixish_x86.cpp → Modest performance enhancement for x86 unix xptcall using gcc
drepper and jband should do the reviews.  

tenthumbs how much testing have you done?
I can only test on Linux which is one reason the patch creates new
files. It can be easily undone. No other platform is affected.

I checked that all my compilers correctly handle the assembler code at
various optimizations. I also get a functional mozilla with the new
code.

Of course, more testing can't hurt.
In xptcinvoke_gcc_x86_unix.cpp: the function invoke_copy_to_stack contains some
hack with a comment which mentions the Solaris compiler.  This is not relevant
for this file; it is only for gcc so the hack and the comment can go.  I.e., the
code should move into a default clause.  Similarly, the XPTC_InvokeByIndex code
contains support for cfront-style code generation.  This isn't interesting here
either.

The xptc_invoke_copy_to_stack_keeper should get a comment.  There is one in the
header but if you're only reading the .cpp file itself it is not obvious why
this code is needed.

In XPTC_InvokeByIndex: Is think I mentioned it before, the addl after the call
to invoke_copy_to_stack should never be necessary.  The destruction of the stack
frame will take care of this.


Beside this the patch look good.  I'll test it a bit very soon myself.
Attached patch xptcinvoke + xptcstubs patch 4 (obsolete) — Splinter Review
Removed solaris hack, added some more comments.
Attachment #82665 - Attachment is obsolete: true
You're absolutely right about the Solaris hack and the need for more
comments. Done.

My idea is that users of unixish_x86  who also use gcc can switch to 
this code after testing. If none of them use Cfront then it can go. I'm 
not sure so I'll keep it around for a while.

XPTC_InvokeByIndex actually calls functions like foo(that, ...) so it's 
necessary to remove invoke_copy_to_stack's arguments from the stack so 
the real stack frame is properly constructed.
My testing showed no problems with the latest patch.  Thanks for the work.
r=drepper
shaver, jband, can I get either of you to review this patch?
Attached patch final patch (obsolete) — Splinter Review
I've been thinking (always a bad sign) and I realized that using
-fkeep-inline-functions has some drawwbacks so this patch uses the
keeper functions which now have no side effects and return something
harmless. See the comment in the new xptc_gcc_x86_unix.h for more
details. Only the keeper functions are changed.

I've tested with egcs-1.1.2, gcc-2.95.3, and gcc-3.0.4 using, -O3, 
-finline-functions, etc. and they all produce the desired code. I've 
tested with gcc-3.1 and it also does as I want.

This is the final patch unless something drastic appears.
Attachment #83536 - Attachment is obsolete: true
Still OK with me.  I'm using the old patch for quite some time now.
Attached patch Stable Stub function interfaces (obsolete) — Splinter Review
This is an updated patch which replaces the Stub functions with all-assembler
code.  More recent versions of gcc (correctly) fail to work with the old code.
A few more words on attachment 88224 [details] [diff] [review].

The stub functions are declared to not take parameters and relied on the frame
pointer being set up to find the parameters.  This is  very fragile even and
failed with more recent compilers (gcc 3).  The old code, replaced by the code
in this bug, also has the problem but since gcc would now use these files it's
not a problem anymore once the patch is applied.

The solution used here is based on a patch by Gwenole Beauchesne
<gbeauchesne@mandrakesoft.com>, adopted by me for the new code.  It uses
assembler throughout, eliminated most dependencies on the compiler.  Only the
name mangling is an issue which is, I think, adequately dealt with.

I've tested the patch with gcc 2.96.
I'm getting problems with a gcc 3.1 build.  I see link errors like this:

../../../dist/bin/libxpcom.so: undefined reference to `_ZN14nsXPTCStubBase5Stub2
37Ev'
../../../dist/bin/libxpcom.so: undefined reference to `_ZN14nsXPTCStubBase5Stub1
34Ev'


[root@stripples unix]# nm libxptcmd.a  | grep Stub104
         U _ZN14nsXPTCStubBase5Stub104Ev
000005d0 T _ZN14nsXPTCStubBase7Stub104Ev

Looks like there's still a name manging problem of some kind.
Oh, here's some of the gcc -E output:

asm(".section   \".text\"\n\t" ".align  2\n\t" ".if     " "104" " < 10\n\t"
".globl     _ZN14nsXPTCStubBase5Stub" "104" "Ev\n\t" ".type 
_ZN14nsXPTCStubBase5Stub" "104" "Ev,@function\n" "_ZN14nsXPTCStubBase5Stub"
"104" "Ev:\n\t" ".elseif    " "104" " < 100\n\t" ".globl    
_ZN14nsXPTCStubBase6Stub" "104" "Ev\n\t" ".type _ZN14nsXPTCStubBase6Stub" "104"
"Ev,@function\n" "_ZN14nsXPTCStubBase6Stub" "104" "Ev:\n
\t" ".elseif    " "104" " < 1000\n\t" ".globl     _ZN14nsXPTCStubBase7Stub"
"104" "Ev\n\t" ".type      _ZN14nsXPTCStubBase7Stub" "104" "
Ev,@function\n" "_ZN14nsXPTCStubBase7Stub" "104" "Ev:\n\t" ".else\n\t" ".err   
\"stub number " "104" " >= 1000 not yet supported\"\n\t"
 ".endif\n\t" "movl     $" "104" ", %eax\n\t" "jmp      SharedStub\n\t" ".size 
_ZN14nsXPTCStubBase5Stub" "104" "Ev,.-_ZN14nsXPTCStubBas
e5Stub" "104" "Ev");
Attached patch Correct .size statements (obsolete) — Splinter Review
Oops, little problems with the gcc 3 code.  Fixed now and at least compiles and
links.
Attachment #88224 - Attachment is obsolete: true
Attached patch merged patch (obsolete) — Splinter Review
I think you want to merge my last patch in. Using -fkeep-inline-functions
doesn't upgrade gracefully with the compiler.

How does this look?
I understand you're problem in principal but do not really know (even after
reading the comment) why the code in the dummy functions must be more complex. 
What exactly was the effect of the old code?  The function was called?  Were
PrepareAndDispatch and invoke_copy_to_stack inlined into the keeper functions?

Anyway, the patch is OK but you should be able to lose th "movl" instructions in
the asms you added to the keeper functions.  gcc has no way to look into asms
and since the functions are never called it doesn't matter that we swindle.

If you make this change you would get a review from me if that would mean anything.
I've just made the change myself.  Hope you don't mind.  The result seems to
work fine here.
Attachment #84621 - Attachment is obsolete: true
Attachment #88261 - Attachment is obsolete: true
Attachment #88286 - Attachment is obsolete: true
In general the keeper functions would be externally visible. I don't
want random people to actually call since that could crash mozilla so I
reference the static functions and return something harmless. Yes, the
movl's probably aren't necessary but I haven't tried all possible
compiler versions with all possible options so I decided to be safe.
It's not all that important.
Comment on attachment 88311 [details] [diff] [review]
merged patch + my proposed changes

automated tests all pass with gcc 2.96 and gcc 3.1.  Also produces builds that
work.  Tested some JS components and in general things look OK.  The code seems
OK as well.

sr=blizzard
Attachment #88311 - Flags: superreview+
I've done some testing and the last patch seems to work.

And one general comment: I hope people here realize that this patch isn't
anymore about 'perf' only.  This is a bug fix now.  So, either lose perf or add
whatever other keyword is needed.
Automated tests on Red Hat 6.2 all pass as well.  It also produces a working
build on 6.2 with egcs.
tenthumbs, does this look good enough for you to stamp?
The last patch compiles correct code with my egcs, 2.95.3, and 3.0.4 so 
it should do. If you're looking for a r= from me you've got it, for what 
little it's worth. Note that I don't have checkin privileges.
Comment on attachment 88311 [details] [diff] [review]
merged patch + my proposed changes

Looks pretty good.  Between ulrich and tenthumbs and me, I think this more than
deserves an r=.
Attachment #88311 - Flags: review+
With the latest changes there's not too much C++ left. It would be 
possible to put all the assembler into one file and the C++ into 
another. This would allow us to use the assembler directly which means 
we could add debugging symbols so one could step through this stuff with 
gdb. The disadvantage is that invoke_copy_to_stack and 
PrepareAndDispatch would be externally visible unless we do some linker 
magic. Any thoughts?
I wouldn't bother making that change.  The amount of assembler code used is very
small [1].  If you wanted, you could add the debug info in the asm() statements
as well.

[1] I just realized that there is no reason to keep the SharedStub function. 
Just rearrange the first two parameters of PrepareAndDispatch, create a "C"
name, and add attribute((regparm(1))).  Then there really is only the
XPTC_InvokeByIndex function left as a bigger asm piece.
A little optimization in SharedStub.  Instead of getting rid of SharedStub
(which would add about 2000 bytes in code size) we can use the optimization in
this patch which uses gcc's abilities to modify the calling convention of
PrepareAndDispatch.  The resulting file is a few bytes shorter but each call
should be quite a bit faster.
*** Bug 155971 has been marked as a duplicate of this bug. ***
So, can we get this checked in? My gcc 3.1 builds are DOA without this patch.
I'm concerned that some future gcc version will change the regparm 
conventions and we'll have to do this all over again. If it were me, I 
would prefer to make all the __attribute__ stuff macros in 
xptc_gcc_x86_unix.h so it can be changed readily if necessary. However I 
think it's time to actually do this and see what happens.

BTW, has anyone ever investigated whether the stub functions can have C 
linkage? It would make life _so_ much easier.
It occurred to me that the SharedStub function is just 2 instructions and a
jump, so why not just include it directly into the stubs. That makes each stub
8 bytes larger but that's still smaller than the current version plus it
eliminates one jump. Also made the regparm thing a macro as I suggested before.
I don't think I broke anything.
No, don't use the "inline the stub".  The size increase is too significant for
the almost unmeasurable benefits.  I've tried it.  The processor is very well
able to predict the unconditional jump correctly.

I do think my last patch should be used and yes, it should be applied.  Can the
r/sr be moved to it?
Well, today's processor is tomorrow's legacy crap. I'm not a big fan of 
doing something just because it works now. Also, when did a few bytes 
matter to mozilla?

Be that as it may, it is definitely time to do something.
Unconditional jumps will be handled rather better than worse in new processors.

And we are not talking about a few bytes.  With 4096 stub and 8 bytes more for
each this are 32kb more code.

In other places I've faught hard to reduce the size by hundreds of bytes.
But xpcom/reflect/xptcall/public/*.{h,inc} only have 256 stubs. Where 
are the others? Also any patch here reduces the size of each stub so 
there is a net win no matter what we do.
OK, in the code there are only 256 stubs.  I used at some point code which had more.

Still, 256 stubs with 8 bytes more each, still adds 2kb with almost no benefit
at all.  Do some measurements.  The unconditional jump almost completely disappears.

In other bugs people argue against a few kb of code for a new feature (even thse
required by standards).  Here you argue about adding 2kb to save a few
nanoseconds.  I'm the first to appreciate optimizations, believe me, but this
isn't one.  I initially wrote the code exactly as you did but reverted since
it's not worth it.

Anyway, could some r/sr make a call and move the reviewer marks to one of the
patches?
I started this bug about performance, i.e. speed, so I will always 
prefer speed over size. The current stubs are at least 42 bytes which is 
twice the size or more of the new versions so I see no reason not to 
prefer speed. We get a size reduction as a side effect.

I agree with you, though, that this should be dealt with now. I won't 
object to any choice someone makes.
I thought we _had_ reviewed one of the patches (88311).  Why wasn't it checked
in then?

The follow-on optimizations are great, and I'd love to get the regparm
adjustment for 1.1final, but we don't need to keep everything in its current
broken state until then.  Can someone please mail drivers@mozilla.org asking for
trunk approval (perhaps branch as well -- blizzard, you should help us make that
call), and get it checked in ASAP?
If Ulrich is right and there's no benifit from inlining, then I say we take his
and get the space savings.  These stubs are used in lots of places so that
little bit will add up.
Comment on attachment 88311 [details] [diff] [review]
merged patch + my proposed changes

a=asa (on behalf of drivers) for checkin to 1.1
Attachment #88311 - Flags: approval+
I checked in attachment 88311 [details] [diff] [review] to the trunk. Please file separate bugs for any
additional changes.
Status: NEW → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
Blocks: 417605
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: