Closed
Bug 140412
Opened 23 years ago
Closed 23 years ago
Modest performance enhancement for x86 unix xptcall using gcc
Categories
(Core :: XPCOM, defect)
Tracking
()
RESOLVED
FIXED
People
(Reporter: tenthumbs, Assigned: tenthumbs)
References
Details
(Keywords: perf)
Attachments
(3 files, 12 obsolete files)
|
17.18 KB,
patch
|
shaver
:
review+
blizzard
:
superreview+
asa
:
approval+
|
Details | Diff | Splinter Review |
|
17.07 KB,
patch
|
Details | Diff | Splinter Review | |
|
16.93 KB,
patch
|
Details | Diff | Splinter Review |
Note that XPTC_InvokeByIndex calculates the address of
invoke_copy_to_stack every time it's called. Since the address doesn't
change that's inefficient. On Linux at least, the compiler generates
simple calls for static functions. There's no reason why mozilla
shouldn't either.
Here's a very experimental patch. In a simple test rig, it's a solid 8%
faster than the current version but I don't expect that much from
mozilla in the long term. It compiles cleanly on my Linux with egcs
1.1.2, gcc 2.95.3, and gcc 3.0.4 but that's not saying much. It may also
be necessary to add -fkeep-inline-functions to the command line for very
high optimization levels.
Looking for testers to see if this really helps.
Comment 2•23 years ago
|
||
john, can you review this patch?
Comment 3•23 years ago
|
||
The patch is correct but the problem is that the function definition might be
omitted. For gcc 3 and up there is the function attribute
__attribute__ ((used))
which should be added to the definition to accomplish that. For other compilers
it remains to be seem. It could very well be that at least gcc 2.96 removed the
definition.
Yep, compiling with -O3 gives link errors but adding
-fkeep-inline-functions fixes it.
The big problem here is that gcc doesn't know we're calling functions
from the asm so we have to jump through hoops to keep gcc from screwing
up. Only the variable n needs to be preserved. I have a version that
does this
register void *esp __asm__ ("esp");//bind name to hardware reg
void * volatile saved_esp = esp;
__asm__ (...);
esp = saved_esp;
which works but I'm not sure some version of gcc won't mess with this.
Comment 5•23 years ago
|
||
Adding the asm is kind of silly. The asm cannot be optimized away and add code
and cost while the whole patch is about performance tuning.
I'd suggest making the improvement conditional on gcc 3.1 and up (gcc 3.1
introduce the 'used' attribute). So you}d change the beginning of the
invoke_copy_to_stack definition to
static void
#if __GNUC__ > 3 || (__GNUC__ == 3 && __GNUC_MINOR__ >= 1)
__attribute__ ((used))
#endif
invoke_copy_to_stack(PRUint32 paramCount, nsXPTCVariant* s, PRUint32* d)
{
The patch (the asm stuff) then also has to use the same #if expression.
I was unclear. I meant in XPTC_InvokeByIndex not in
invoke_copy_to_stack.
Not surprisingly, rewriting XPTC_InvokeByIndex entirely in assembler
produces even more improvement. In this patch the current assemble code
is reused with just some minor editing. Only the function prologue and
epilogue are new. It's untested in a real mozilla but it compiles and
works correctly in my test rig.
Comment 8•23 years ago
|
||
The patch looks OK. In fact: it fixes one major problem. Don't know why I
haven't seen it before. The current code is not signal-safe. It uses memory
below the stack pointer which is an absolute no-no. Especially with the current
libpthread implementation on Linux which creates a awfull lot of signals.
What the patch still doesn't handle is in inlining problem. The #if for the
ATTRIBUTE_USED definition should use the expressions I mentioned last time:
__GNUC__ > 3 || (__GNUC__ == 3 && __GNUC_MINOR__ >= 1)
But this still leaves gcc 3.0 and gcc 2.xx. If using special parameter is an
option do thiat. Otherwise create and non-static function which is never used
and which calls this function. Such as in:
#if __GNUC__ && (__GNUC__ < 3 || (__GNUC__ == 3 && __GNUC_MINOR__ = 0))
void
XPTC_NeverCalledPleaseIgnore (void)
{
(void) invoke_copy_to_stack (0, 0, 0);
}
#endif
Depending on how the Makefiles are sructured this might be preferrable.
Back to the code. While it's correct there are still some improvements possible:
- as correctly noted, the PIC register loading isn't necessary. The function
doesn't use global variable nor does it call non-static functions. This,
on x86, removes the requirement for a correctly set up %ebx
I would strongly recommend to add a comment explaining this, though.
- since %ebx is not modified it is not necessary to preserve its value.
Therefore the "pushl %ebx" in the prolog and the matching "popl %ebx" in
the epilogue are not needed.
- as for the size computation, there is no need to define a symbol. Replace
the lines of the .Lmozilla2 definition and the .size line with
.size XPTC_InvokeByIndex, . - XPTC_InvokeByIndex
You've almost used the '.' address operator already.
The rest looks good and because of the major bug this patch fixes it should go
in before 1.0.
This version incorporates the latest suggestions )assuming I haven't
made any types). I also added an option to use __attribute__ ((stdcall))
which may be slightly faster.
This patch is a major change. Maybe it would be better to put this
version in a new file, say xptcinvoke_gcc_x86_unix.cpp, and switch over
each platform in the makefile after proper tested
Attachment #81373 -
Attachment is obsolete: true
Comment 10•23 years ago
|
||
The stdcall change in the latest patch is broken. You only changed the
correction of the stack pointer and not the way the parameters are passed to
invoke_copy_to_stack. The pushls must be replaced by simple register loads.
While this can be a benefit I still wouldn't suggest using it. gcc is very bad
at handling stdcall. Normally the first thing it does in a stdcall function is
to spill the registers.
So, I'd suggest removing the MOZ_USE_STDCALL stuff. At the same time you can
remove the addl $0x0c, %esp after the invoke_copy_to_stack. Since a stack
frame is created the stack pointer need not be corrected after every function call.
And one more thing: you've left the PIC computation code in. That's fine. But
the code itself isn't. This
+
"pushl %ebx\n\t"
+
"call .Lmozilla1\n"
+
".Lmozilla1:\n\t"
+
"popl %ebx\n\t"
+
"addl $_GLOBAL_OFFSET_TABLE_+[. - .Lmozilla1], %ebx\n\t"
code is very unfriendly to the branch prediction unit of i686 and up. These
processors depend on a call being undone by a ret. So you should write it like
this:
pushl %ebx
call 0f
.subsection 1
0: movl (%esp), %ebx
ret
.previous
addl $_GLOBAL_OFFSET_TABLE_, %ebx
This is the code gcc would use for -mtune=i686.
Finally, would somebody please add "dataloss" as a keyword? The patch fixes a
possibly big problem even though the bug itself is not about this.
| Assignee | ||
Comment 11•23 years ago
|
||
Are you confusing stdcall with regparm? The docs say
"stdcall
On the Intel 386, the stdcall attribute causes the compiler to
assume that the called function will pop off the stack space used to
pass arguments, unless it takes a variable number of arguments."
so it's just a question of who flushes the arguments from the stack. I
can't say I see any of my compilers behaving badly with stdcall but I'll
remove it anyway.
We must pop the invoke_copy_to_stack arguments because we're creating an
argument list for another function in two pieces and they had better be
contiguous.
I'm leery of the .subsection stuff. Do all the binutils packages out
there support it?
New patch upcoming shortly.
I must be missing something but I just can't see where the current code
references negative offsets from esp.
Comment 12•23 years ago
|
||
> Are you confusing stdcall with regparm? The docs say
Oops, yes. I'm never enabling stdcall without regparms myself. OK, scratch
that. But the comment about removing the stack pointer adjustment is still
valid. Plus, not doing the adjustment will also align the stack for the
indirect call later on.
> I'm leery of the .subsection stuff.
For a time long enough to depend on it here. Versions which don't support it
cannot be used to build moz anyway.
> I must be missing something but I just can't see where the current code
> references negative offsets from esp.
Tell me to never read sources in moz. I misread subl %8, %%esp in the
originial code as subl $8, %%esp. The font wasn't usable. Oh well, this is
void then, too.
| Assignee | ||
Comment 13•23 years ago
|
||
I've changed the PIC code. If it becomes an issue I can fix it. I left
the stdcall stuff because it can't hurt and might help. I'm going to
leave the rest alone for the moment because I want to fix up xptcstubs.
Attachment #81487 -
Attachment is obsolete: true
Comment 14•23 years ago
|
||
reassigning to tenthumbs@cybernex.net since he is doing the work.
Assignee: dougt → tenthumbs
| Assignee | ||
Comment 15•23 years ago
|
||
Here's a first try at a unified approach for xptcinvoke and xptcstubs.
To avoid contaminating other platforms, it creates new files and patches
Makefile.in. If something goes wrong, just revert Makefile.in.
I hope to make an optimized build this evening or tomorrow but I don't
have a lot of time. If anyone wants to experiment in their copious spare
time, feel free to do so. :-)
Attachment #81207 -
Attachment is obsolete: true
Attachment #81675 -
Attachment is obsolete: true
| Assignee | ||
Comment 16•23 years ago
|
||
Fixed some dumb typos in the Makefile.in patch.
This patch compiles and runs on Linux. It appears to be faster but no numbers
yet.
Attachment #81851 -
Attachment is obsolete: true
| Assignee | ||
Comment 17•23 years ago
|
||
Just minor changes here. The stdcall trick helps performance slightly so
it's turned on in the Makefile.in patch. The keeper functions are more
elegant but they are exposed in the library so they're turned off and
-fkeep-inline-functions is on.
Attachment #81916 -
Attachment is obsolete: true
| Assignee | ||
Comment 18•23 years ago
|
||
The new XPTC_InvokeByIndex is 18-22% faster than the old one depending
on the compiler and options used. The nsXPTCStubBase::Stub* functions
are also slightly faster, maybe 1% but not much more.
In mozilla this translates to about 1% speed improvement but the noise
is high so it may be less. I'm not sure if 1% is disappointing or
surprising. There is presumably improvement in other mozilla operations
but I don't know how to test for that.
Adding the perf keyword. Changing the summary from "Modest performance
enhancement for xptcinvoke_unixish_x86.cpp" to "Modest performance
enhancement for x86 unix xptcall using gcc."
The latest patch is solid so I guess it's time to start looking for a
review.
Keywords: perf
Summary: Modest performance enhancement for xptcinvoke_unixish_x86.cpp → Modest performance enhancement for x86 unix xptcall using gcc
Comment 19•23 years ago
|
||
drepper and jband should do the reviews.
tenthumbs how much testing have you done?
| Assignee | ||
Comment 20•23 years ago
|
||
I can only test on Linux which is one reason the patch creates new
files. It can be easily undone. No other platform is affected.
I checked that all my compilers correctly handle the assembler code at
various optimizations. I also get a functional mozilla with the new
code.
Of course, more testing can't hurt.
Comment 21•23 years ago
|
||
In xptcinvoke_gcc_x86_unix.cpp: the function invoke_copy_to_stack contains some
hack with a comment which mentions the Solaris compiler. This is not relevant
for this file; it is only for gcc so the hack and the comment can go. I.e., the
code should move into a default clause. Similarly, the XPTC_InvokeByIndex code
contains support for cfront-style code generation. This isn't interesting here
either.
The xptc_invoke_copy_to_stack_keeper should get a comment. There is one in the
header but if you're only reading the .cpp file itself it is not obvious why
this code is needed.
In XPTC_InvokeByIndex: Is think I mentioned it before, the addl after the call
to invoke_copy_to_stack should never be necessary. The destruction of the stack
frame will take care of this.
Beside this the patch look good. I'll test it a bit very soon myself.
| Assignee | ||
Comment 22•23 years ago
|
||
Removed solaris hack, added some more comments.
Attachment #82665 -
Attachment is obsolete: true
| Assignee | ||
Comment 23•23 years ago
|
||
You're absolutely right about the Solaris hack and the need for more
comments. Done.
My idea is that users of unixish_x86 who also use gcc can switch to
this code after testing. If none of them use Cfront then it can go. I'm
not sure so I'll keep it around for a while.
XPTC_InvokeByIndex actually calls functions like foo(that, ...) so it's
necessary to remove invoke_copy_to_stack's arguments from the stack so
the real stack frame is properly constructed.
Comment 24•23 years ago
|
||
My testing showed no problems with the latest patch. Thanks for the work.
r=drepper
Comment 25•23 years ago
|
||
shaver, jband, can I get either of you to review this patch?
| Assignee | ||
Comment 26•23 years ago
|
||
I've been thinking (always a bad sign) and I realized that using
-fkeep-inline-functions has some drawwbacks so this patch uses the
keeper functions which now have no side effects and return something
harmless. See the comment in the new xptc_gcc_x86_unix.h for more
details. Only the keeper functions are changed.
I've tested with egcs-1.1.2, gcc-2.95.3, and gcc-3.0.4 using, -O3,
-finline-functions, etc. and they all produce the desired code. I've
tested with gcc-3.1 and it also does as I want.
This is the final patch unless something drastic appears.
Attachment #83536 -
Attachment is obsolete: true
Comment 27•23 years ago
|
||
Still OK with me. I'm using the old patch for quite some time now.
Comment 28•23 years ago
|
||
This is an updated patch which replaces the Stub functions with all-assembler
code. More recent versions of gcc (correctly) fail to work with the old code.
Comment 29•23 years ago
|
||
A few more words on attachment 88224 [details] [diff] [review].
The stub functions are declared to not take parameters and relied on the frame
pointer being set up to find the parameters. This is very fragile even and
failed with more recent compilers (gcc 3). The old code, replaced by the code
in this bug, also has the problem but since gcc would now use these files it's
not a problem anymore once the patch is applied.
The solution used here is based on a patch by Gwenole Beauchesne
<gbeauchesne@mandrakesoft.com>, adopted by me for the new code. It uses
assembler throughout, eliminated most dependencies on the compiler. Only the
name mangling is an issue which is, I think, adequately dealt with.
I've tested the patch with gcc 2.96.
Comment 30•23 years ago
|
||
I'm getting problems with a gcc 3.1 build. I see link errors like this:
../../../dist/bin/libxpcom.so: undefined reference to `_ZN14nsXPTCStubBase5Stub2
37Ev'
../../../dist/bin/libxpcom.so: undefined reference to `_ZN14nsXPTCStubBase5Stub1
34Ev'
[root@stripples unix]# nm libxptcmd.a | grep Stub104
U _ZN14nsXPTCStubBase5Stub104Ev
000005d0 T _ZN14nsXPTCStubBase7Stub104Ev
Looks like there's still a name manging problem of some kind.
Comment 31•23 years ago
|
||
Oh, here's some of the gcc -E output:
asm(".section \".text\"\n\t" ".align 2\n\t" ".if " "104" " < 10\n\t"
".globl _ZN14nsXPTCStubBase5Stub" "104" "Ev\n\t" ".type
_ZN14nsXPTCStubBase5Stub" "104" "Ev,@function\n" "_ZN14nsXPTCStubBase5Stub"
"104" "Ev:\n\t" ".elseif " "104" " < 100\n\t" ".globl
_ZN14nsXPTCStubBase6Stub" "104" "Ev\n\t" ".type _ZN14nsXPTCStubBase6Stub" "104"
"Ev,@function\n" "_ZN14nsXPTCStubBase6Stub" "104" "Ev:\n
\t" ".elseif " "104" " < 1000\n\t" ".globl _ZN14nsXPTCStubBase7Stub"
"104" "Ev\n\t" ".type _ZN14nsXPTCStubBase7Stub" "104" "
Ev,@function\n" "_ZN14nsXPTCStubBase7Stub" "104" "Ev:\n\t" ".else\n\t" ".err
\"stub number " "104" " >= 1000 not yet supported\"\n\t"
".endif\n\t" "movl $" "104" ", %eax\n\t" "jmp SharedStub\n\t" ".size
_ZN14nsXPTCStubBase5Stub" "104" "Ev,.-_ZN14nsXPTCStubBas
e5Stub" "104" "Ev");
Comment 32•23 years ago
|
||
Oops, little problems with the gcc 3 code. Fixed now and at least compiles and
links.
Attachment #88224 -
Attachment is obsolete: true
| Assignee | ||
Comment 33•23 years ago
|
||
I think you want to merge my last patch in. Using -fkeep-inline-functions
doesn't upgrade gracefully with the compiler.
How does this look?
Comment 34•23 years ago
|
||
I understand you're problem in principal but do not really know (even after
reading the comment) why the code in the dummy functions must be more complex.
What exactly was the effect of the old code? The function was called? Were
PrepareAndDispatch and invoke_copy_to_stack inlined into the keeper functions?
Anyway, the patch is OK but you should be able to lose th "movl" instructions in
the asms you added to the keeper functions. gcc has no way to look into asms
and since the functions are never called it doesn't matter that we swindle.
If you make this change you would get a review from me if that would mean anything.
Comment 35•23 years ago
|
||
I've just made the change myself. Hope you don't mind. The result seems to
work fine here.
Attachment #84621 -
Attachment is obsolete: true
Attachment #88261 -
Attachment is obsolete: true
Attachment #88286 -
Attachment is obsolete: true
| Assignee | ||
Comment 36•23 years ago
|
||
In general the keeper functions would be externally visible. I don't
want random people to actually call since that could crash mozilla so I
reference the static functions and return something harmless. Yes, the
movl's probably aren't necessary but I haven't tried all possible
compiler versions with all possible options so I decided to be safe.
It's not all that important.
Comment 37•23 years ago
|
||
Comment on attachment 88311 [details] [diff] [review]
merged patch + my proposed changes
automated tests all pass with gcc 2.96 and gcc 3.1. Also produces builds that
work. Tested some JS components and in general things look OK. The code seems
OK as well.
sr=blizzard
Attachment #88311 -
Flags: superreview+
Comment 38•23 years ago
|
||
I've done some testing and the last patch seems to work.
And one general comment: I hope people here realize that this patch isn't
anymore about 'perf' only. This is a bug fix now. So, either lose perf or add
whatever other keyword is needed.
Comment 39•23 years ago
|
||
Automated tests on Red Hat 6.2 all pass as well. It also produces a working
build on 6.2 with egcs.
Comment 40•23 years ago
|
||
tenthumbs, does this look good enough for you to stamp?
| Assignee | ||
Comment 41•23 years ago
|
||
The last patch compiles correct code with my egcs, 2.95.3, and 3.0.4 so
it should do. If you're looking for a r= from me you've got it, for what
little it's worth. Note that I don't have checkin privileges.
Comment 42•23 years ago
|
||
Comment on attachment 88311 [details] [diff] [review]
merged patch + my proposed changes
Looks pretty good. Between ulrich and tenthumbs and me, I think this more than
deserves an r=.
Attachment #88311 -
Flags: review+
| Assignee | ||
Comment 43•23 years ago
|
||
With the latest changes there's not too much C++ left. It would be
possible to put all the assembler into one file and the C++ into
another. This would allow us to use the assembler directly which means
we could add debugging symbols so one could step through this stuff with
gdb. The disadvantage is that invoke_copy_to_stack and
PrepareAndDispatch would be externally visible unless we do some linker
magic. Any thoughts?
Comment 44•23 years ago
|
||
I wouldn't bother making that change. The amount of assembler code used is very
small [1]. If you wanted, you could add the debug info in the asm() statements
as well.
[1] I just realized that there is no reason to keep the SharedStub function.
Just rearrange the first two parameters of PrepareAndDispatch, create a "C"
name, and add attribute((regparm(1))). Then there really is only the
XPTC_InvokeByIndex function left as a bigger asm piece.
Comment 45•23 years ago
|
||
A little optimization in SharedStub. Instead of getting rid of SharedStub
(which would add about 2000 bytes in code size) we can use the optimization in
this patch which uses gcc's abilities to modify the calling convention of
PrepareAndDispatch. The resulting file is a few bytes shorter but each call
should be quite a bit faster.
Comment 46•23 years ago
|
||
*** Bug 155971 has been marked as a duplicate of this bug. ***
Comment 47•23 years ago
|
||
So, can we get this checked in? My gcc 3.1 builds are DOA without this patch.
| Assignee | ||
Comment 48•23 years ago
|
||
I'm concerned that some future gcc version will change the regparm
conventions and we'll have to do this all over again. If it were me, I
would prefer to make all the __attribute__ stuff macros in
xptc_gcc_x86_unix.h so it can be changed readily if necessary. However I
think it's time to actually do this and see what happens.
BTW, has anyone ever investigated whether the stub functions can have C
linkage? It would make life _so_ much easier.
| Assignee | ||
Comment 49•23 years ago
|
||
It occurred to me that the SharedStub function is just 2 instructions and a
jump, so why not just include it directly into the stubs. That makes each stub
8 bytes larger but that's still smaller than the current version plus it
eliminates one jump. Also made the regparm thing a macro as I suggested before.
I don't think I broke anything.
Comment 50•23 years ago
|
||
No, don't use the "inline the stub". The size increase is too significant for
the almost unmeasurable benefits. I've tried it. The processor is very well
able to predict the unconditional jump correctly.
I do think my last patch should be used and yes, it should be applied. Can the
r/sr be moved to it?
| Assignee | ||
Comment 51•23 years ago
|
||
Well, today's processor is tomorrow's legacy crap. I'm not a big fan of
doing something just because it works now. Also, when did a few bytes
matter to mozilla?
Be that as it may, it is definitely time to do something.
Comment 52•23 years ago
|
||
Unconditional jumps will be handled rather better than worse in new processors.
And we are not talking about a few bytes. With 4096 stub and 8 bytes more for
each this are 32kb more code.
In other places I've faught hard to reduce the size by hundreds of bytes.
| Assignee | ||
Comment 53•23 years ago
|
||
But xpcom/reflect/xptcall/public/*.{h,inc} only have 256 stubs. Where
are the others? Also any patch here reduces the size of each stub so
there is a net win no matter what we do.
Comment 54•23 years ago
|
||
OK, in the code there are only 256 stubs. I used at some point code which had more.
Still, 256 stubs with 8 bytes more each, still adds 2kb with almost no benefit
at all. Do some measurements. The unconditional jump almost completely disappears.
In other bugs people argue against a few kb of code for a new feature (even thse
required by standards). Here you argue about adding 2kb to save a few
nanoseconds. I'm the first to appreciate optimizations, believe me, but this
isn't one. I initially wrote the code exactly as you did but reverted since
it's not worth it.
Anyway, could some r/sr make a call and move the reviewer marks to one of the
patches?
| Assignee | ||
Comment 55•23 years ago
|
||
I started this bug about performance, i.e. speed, so I will always
prefer speed over size. The current stubs are at least 42 bytes which is
twice the size or more of the new versions so I see no reason not to
prefer speed. We get a size reduction as a side effect.
I agree with you, though, that this should be dealt with now. I won't
object to any choice someone makes.
Comment 56•23 years ago
|
||
I thought we _had_ reviewed one of the patches (88311). Why wasn't it checked
in then?
The follow-on optimizations are great, and I'd love to get the regparm
adjustment for 1.1final, but we don't need to keep everything in its current
broken state until then. Can someone please mail drivers@mozilla.org asking for
trunk approval (perhaps branch as well -- blizzard, you should help us make that
call), and get it checked in ASAP?
Comment 57•23 years ago
|
||
If Ulrich is right and there's no benifit from inlining, then I say we take his
and get the space savings. These stubs are used in lots of places so that
little bit will add up.
Comment 58•23 years ago
|
||
Comment on attachment 88311 [details] [diff] [review]
merged patch + my proposed changes
a=asa (on behalf of drivers) for checkin to 1.1
Attachment #88311 -
Flags: approval+
Comment 59•23 years ago
|
||
I checked in attachment 88311 [details] [diff] [review] to the trunk. Please file separate bugs for any
additional changes.
Status: NEW → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•