Closed
Bug 140412
Opened 22 years ago
Closed 22 years ago
Modest performance enhancement for x86 unix xptcall using gcc
Categories
(Core :: XPCOM, defect)
Tracking
()
RESOLVED
FIXED
People
(Reporter: tenthumbs, Assigned: tenthumbs)
References
Details
(Keywords: perf)
Attachments
(3 files, 12 obsolete files)
17.18 KB,
patch
|
shaver
:
review+
blizzard
:
superreview+
asa
:
approval+
|
Details | Diff | Splinter Review |
17.07 KB,
patch
|
Details | Diff | Splinter Review | |
16.93 KB,
patch
|
Details | Diff | Splinter Review |
Note that XPTC_InvokeByIndex calculates the address of invoke_copy_to_stack every time it's called. Since the address doesn't change that's inefficient. On Linux at least, the compiler generates simple calls for static functions. There's no reason why mozilla shouldn't either. Here's a very experimental patch. In a simple test rig, it's a solid 8% faster than the current version but I don't expect that much from mozilla in the long term. It compiles cleanly on my Linux with egcs 1.1.2, gcc 2.95.3, and gcc 3.0.4 but that's not saying much. It may also be necessary to add -fkeep-inline-functions to the command line for very high optimization levels. Looking for testers to see if this really helps.
Comment 2•22 years ago
|
||
john, can you review this patch?
Comment 3•22 years ago
|
||
The patch is correct but the problem is that the function definition might be omitted. For gcc 3 and up there is the function attribute __attribute__ ((used)) which should be added to the definition to accomplish that. For other compilers it remains to be seem. It could very well be that at least gcc 2.96 removed the definition.
Yep, compiling with -O3 gives link errors but adding -fkeep-inline-functions fixes it. The big problem here is that gcc doesn't know we're calling functions from the asm so we have to jump through hoops to keep gcc from screwing up. Only the variable n needs to be preserved. I have a version that does this register void *esp __asm__ ("esp");//bind name to hardware reg void * volatile saved_esp = esp; __asm__ (...); esp = saved_esp; which works but I'm not sure some version of gcc won't mess with this.
Comment 5•22 years ago
|
||
Adding the asm is kind of silly. The asm cannot be optimized away and add code and cost while the whole patch is about performance tuning. I'd suggest making the improvement conditional on gcc 3.1 and up (gcc 3.1 introduce the 'used' attribute). So you}d change the beginning of the invoke_copy_to_stack definition to static void #if __GNUC__ > 3 || (__GNUC__ == 3 && __GNUC_MINOR__ >= 1) __attribute__ ((used)) #endif invoke_copy_to_stack(PRUint32 paramCount, nsXPTCVariant* s, PRUint32* d) { The patch (the asm stuff) then also has to use the same #if expression.
I was unclear. I meant in XPTC_InvokeByIndex not in invoke_copy_to_stack.
Not surprisingly, rewriting XPTC_InvokeByIndex entirely in assembler produces even more improvement. In this patch the current assemble code is reused with just some minor editing. Only the function prologue and epilogue are new. It's untested in a real mozilla but it compiles and works correctly in my test rig.
Comment 8•22 years ago
|
||
The patch looks OK. In fact: it fixes one major problem. Don't know why I haven't seen it before. The current code is not signal-safe. It uses memory below the stack pointer which is an absolute no-no. Especially with the current libpthread implementation on Linux which creates a awfull lot of signals. What the patch still doesn't handle is in inlining problem. The #if for the ATTRIBUTE_USED definition should use the expressions I mentioned last time: __GNUC__ > 3 || (__GNUC__ == 3 && __GNUC_MINOR__ >= 1) But this still leaves gcc 3.0 and gcc 2.xx. If using special parameter is an option do thiat. Otherwise create and non-static function which is never used and which calls this function. Such as in: #if __GNUC__ && (__GNUC__ < 3 || (__GNUC__ == 3 && __GNUC_MINOR__ = 0)) void XPTC_NeverCalledPleaseIgnore (void) { (void) invoke_copy_to_stack (0, 0, 0); } #endif Depending on how the Makefiles are sructured this might be preferrable. Back to the code. While it's correct there are still some improvements possible: - as correctly noted, the PIC register loading isn't necessary. The function doesn't use global variable nor does it call non-static functions. This, on x86, removes the requirement for a correctly set up %ebx I would strongly recommend to add a comment explaining this, though. - since %ebx is not modified it is not necessary to preserve its value. Therefore the "pushl %ebx" in the prolog and the matching "popl %ebx" in the epilogue are not needed. - as for the size computation, there is no need to define a symbol. Replace the lines of the .Lmozilla2 definition and the .size line with .size XPTC_InvokeByIndex, . - XPTC_InvokeByIndex You've almost used the '.' address operator already. The rest looks good and because of the major bug this patch fixes it should go in before 1.0.
This version incorporates the latest suggestions )assuming I haven't made any types). I also added an option to use __attribute__ ((stdcall)) which may be slightly faster. This patch is a major change. Maybe it would be better to put this version in a new file, say xptcinvoke_gcc_x86_unix.cpp, and switch over each platform in the makefile after proper tested
Attachment #81373 -
Attachment is obsolete: true
Comment 10•22 years ago
|
||
The stdcall change in the latest patch is broken. You only changed the correction of the stack pointer and not the way the parameters are passed to invoke_copy_to_stack. The pushls must be replaced by simple register loads. While this can be a benefit I still wouldn't suggest using it. gcc is very bad at handling stdcall. Normally the first thing it does in a stdcall function is to spill the registers. So, I'd suggest removing the MOZ_USE_STDCALL stuff. At the same time you can remove the addl $0x0c, %esp after the invoke_copy_to_stack. Since a stack frame is created the stack pointer need not be corrected after every function call. And one more thing: you've left the PIC computation code in. That's fine. But the code itself isn't. This + "pushl %ebx\n\t" + "call .Lmozilla1\n" + ".Lmozilla1:\n\t" + "popl %ebx\n\t" + "addl $_GLOBAL_OFFSET_TABLE_+[. - .Lmozilla1], %ebx\n\t" code is very unfriendly to the branch prediction unit of i686 and up. These processors depend on a call being undone by a ret. So you should write it like this: pushl %ebx call 0f .subsection 1 0: movl (%esp), %ebx ret .previous addl $_GLOBAL_OFFSET_TABLE_, %ebx This is the code gcc would use for -mtune=i686. Finally, would somebody please add "dataloss" as a keyword? The patch fixes a possibly big problem even though the bug itself is not about this.
Assignee | ||
Comment 11•22 years ago
|
||
Are you confusing stdcall with regparm? The docs say "stdcall On the Intel 386, the stdcall attribute causes the compiler to assume that the called function will pop off the stack space used to pass arguments, unless it takes a variable number of arguments." so it's just a question of who flushes the arguments from the stack. I can't say I see any of my compilers behaving badly with stdcall but I'll remove it anyway. We must pop the invoke_copy_to_stack arguments because we're creating an argument list for another function in two pieces and they had better be contiguous. I'm leery of the .subsection stuff. Do all the binutils packages out there support it? New patch upcoming shortly. I must be missing something but I just can't see where the current code references negative offsets from esp.
Comment 12•22 years ago
|
||
> Are you confusing stdcall with regparm? The docs say Oops, yes. I'm never enabling stdcall without regparms myself. OK, scratch that. But the comment about removing the stack pointer adjustment is still valid. Plus, not doing the adjustment will also align the stack for the indirect call later on. > I'm leery of the .subsection stuff. For a time long enough to depend on it here. Versions which don't support it cannot be used to build moz anyway. > I must be missing something but I just can't see where the current code > references negative offsets from esp. Tell me to never read sources in moz. I misread subl %8, %%esp in the originial code as subl $8, %%esp. The font wasn't usable. Oh well, this is void then, too.
Assignee | ||
Comment 13•22 years ago
|
||
I've changed the PIC code. If it becomes an issue I can fix it. I left the stdcall stuff because it can't hurt and might help. I'm going to leave the rest alone for the moment because I want to fix up xptcstubs.
Attachment #81487 -
Attachment is obsolete: true
Comment 14•22 years ago
|
||
reassigning to tenthumbs@cybernex.net since he is doing the work.
Assignee: dougt → tenthumbs
Assignee | ||
Comment 15•22 years ago
|
||
Here's a first try at a unified approach for xptcinvoke and xptcstubs. To avoid contaminating other platforms, it creates new files and patches Makefile.in. If something goes wrong, just revert Makefile.in. I hope to make an optimized build this evening or tomorrow but I don't have a lot of time. If anyone wants to experiment in their copious spare time, feel free to do so. :-)
Attachment #81207 -
Attachment is obsolete: true
Attachment #81675 -
Attachment is obsolete: true
Assignee | ||
Comment 16•22 years ago
|
||
Fixed some dumb typos in the Makefile.in patch. This patch compiles and runs on Linux. It appears to be faster but no numbers yet.
Attachment #81851 -
Attachment is obsolete: true
Assignee | ||
Comment 17•22 years ago
|
||
Just minor changes here. The stdcall trick helps performance slightly so it's turned on in the Makefile.in patch. The keeper functions are more elegant but they are exposed in the library so they're turned off and -fkeep-inline-functions is on.
Attachment #81916 -
Attachment is obsolete: true
Assignee | ||
Comment 18•22 years ago
|
||
The new XPTC_InvokeByIndex is 18-22% faster than the old one depending on the compiler and options used. The nsXPTCStubBase::Stub* functions are also slightly faster, maybe 1% but not much more. In mozilla this translates to about 1% speed improvement but the noise is high so it may be less. I'm not sure if 1% is disappointing or surprising. There is presumably improvement in other mozilla operations but I don't know how to test for that. Adding the perf keyword. Changing the summary from "Modest performance enhancement for xptcinvoke_unixish_x86.cpp" to "Modest performance enhancement for x86 unix xptcall using gcc." The latest patch is solid so I guess it's time to start looking for a review.
Keywords: perf
Summary: Modest performance enhancement for xptcinvoke_unixish_x86.cpp → Modest performance enhancement for x86 unix xptcall using gcc
Comment 19•22 years ago
|
||
drepper and jband should do the reviews. tenthumbs how much testing have you done?
Assignee | ||
Comment 20•22 years ago
|
||
I can only test on Linux which is one reason the patch creates new files. It can be easily undone. No other platform is affected. I checked that all my compilers correctly handle the assembler code at various optimizations. I also get a functional mozilla with the new code. Of course, more testing can't hurt.
Comment 21•22 years ago
|
||
In xptcinvoke_gcc_x86_unix.cpp: the function invoke_copy_to_stack contains some hack with a comment which mentions the Solaris compiler. This is not relevant for this file; it is only for gcc so the hack and the comment can go. I.e., the code should move into a default clause. Similarly, the XPTC_InvokeByIndex code contains support for cfront-style code generation. This isn't interesting here either. The xptc_invoke_copy_to_stack_keeper should get a comment. There is one in the header but if you're only reading the .cpp file itself it is not obvious why this code is needed. In XPTC_InvokeByIndex: Is think I mentioned it before, the addl after the call to invoke_copy_to_stack should never be necessary. The destruction of the stack frame will take care of this. Beside this the patch look good. I'll test it a bit very soon myself.
Assignee | ||
Comment 22•22 years ago
|
||
Removed solaris hack, added some more comments.
Attachment #82665 -
Attachment is obsolete: true
Assignee | ||
Comment 23•22 years ago
|
||
You're absolutely right about the Solaris hack and the need for more comments. Done. My idea is that users of unixish_x86 who also use gcc can switch to this code after testing. If none of them use Cfront then it can go. I'm not sure so I'll keep it around for a while. XPTC_InvokeByIndex actually calls functions like foo(that, ...) so it's necessary to remove invoke_copy_to_stack's arguments from the stack so the real stack frame is properly constructed.
Comment 24•22 years ago
|
||
My testing showed no problems with the latest patch. Thanks for the work. r=drepper
Comment 25•22 years ago
|
||
shaver, jband, can I get either of you to review this patch?
Assignee | ||
Comment 26•22 years ago
|
||
I've been thinking (always a bad sign) and I realized that using -fkeep-inline-functions has some drawwbacks so this patch uses the keeper functions which now have no side effects and return something harmless. See the comment in the new xptc_gcc_x86_unix.h for more details. Only the keeper functions are changed. I've tested with egcs-1.1.2, gcc-2.95.3, and gcc-3.0.4 using, -O3, -finline-functions, etc. and they all produce the desired code. I've tested with gcc-3.1 and it also does as I want. This is the final patch unless something drastic appears.
Attachment #83536 -
Attachment is obsolete: true
Comment 27•22 years ago
|
||
Still OK with me. I'm using the old patch for quite some time now.
Comment 28•22 years ago
|
||
This is an updated patch which replaces the Stub functions with all-assembler code. More recent versions of gcc (correctly) fail to work with the old code.
Comment 29•22 years ago
|
||
A few more words on attachment 88224 [details] [diff] [review]. The stub functions are declared to not take parameters and relied on the frame pointer being set up to find the parameters. This is very fragile even and failed with more recent compilers (gcc 3). The old code, replaced by the code in this bug, also has the problem but since gcc would now use these files it's not a problem anymore once the patch is applied. The solution used here is based on a patch by Gwenole Beauchesne <gbeauchesne@mandrakesoft.com>, adopted by me for the new code. It uses assembler throughout, eliminated most dependencies on the compiler. Only the name mangling is an issue which is, I think, adequately dealt with. I've tested the patch with gcc 2.96.
Comment 30•22 years ago
|
||
I'm getting problems with a gcc 3.1 build. I see link errors like this: ../../../dist/bin/libxpcom.so: undefined reference to `_ZN14nsXPTCStubBase5Stub2 37Ev' ../../../dist/bin/libxpcom.so: undefined reference to `_ZN14nsXPTCStubBase5Stub1 34Ev' [root@stripples unix]# nm libxptcmd.a | grep Stub104 U _ZN14nsXPTCStubBase5Stub104Ev 000005d0 T _ZN14nsXPTCStubBase7Stub104Ev Looks like there's still a name manging problem of some kind.
Comment 31•22 years ago
|
||
Oh, here's some of the gcc -E output: asm(".section \".text\"\n\t" ".align 2\n\t" ".if " "104" " < 10\n\t" ".globl _ZN14nsXPTCStubBase5Stub" "104" "Ev\n\t" ".type _ZN14nsXPTCStubBase5Stub" "104" "Ev,@function\n" "_ZN14nsXPTCStubBase5Stub" "104" "Ev:\n\t" ".elseif " "104" " < 100\n\t" ".globl _ZN14nsXPTCStubBase6Stub" "104" "Ev\n\t" ".type _ZN14nsXPTCStubBase6Stub" "104" "Ev,@function\n" "_ZN14nsXPTCStubBase6Stub" "104" "Ev:\n \t" ".elseif " "104" " < 1000\n\t" ".globl _ZN14nsXPTCStubBase7Stub" "104" "Ev\n\t" ".type _ZN14nsXPTCStubBase7Stub" "104" " Ev,@function\n" "_ZN14nsXPTCStubBase7Stub" "104" "Ev:\n\t" ".else\n\t" ".err \"stub number " "104" " >= 1000 not yet supported\"\n\t" ".endif\n\t" "movl $" "104" ", %eax\n\t" "jmp SharedStub\n\t" ".size _ZN14nsXPTCStubBase5Stub" "104" "Ev,.-_ZN14nsXPTCStubBas e5Stub" "104" "Ev");
Comment 32•22 years ago
|
||
Oops, little problems with the gcc 3 code. Fixed now and at least compiles and links.
Attachment #88224 -
Attachment is obsolete: true
Assignee | ||
Comment 33•22 years ago
|
||
I think you want to merge my last patch in. Using -fkeep-inline-functions doesn't upgrade gracefully with the compiler. How does this look?
Comment 34•22 years ago
|
||
I understand you're problem in principal but do not really know (even after reading the comment) why the code in the dummy functions must be more complex. What exactly was the effect of the old code? The function was called? Were PrepareAndDispatch and invoke_copy_to_stack inlined into the keeper functions? Anyway, the patch is OK but you should be able to lose th "movl" instructions in the asms you added to the keeper functions. gcc has no way to look into asms and since the functions are never called it doesn't matter that we swindle. If you make this change you would get a review from me if that would mean anything.
Comment 35•22 years ago
|
||
I've just made the change myself. Hope you don't mind. The result seems to work fine here.
Attachment #84621 -
Attachment is obsolete: true
Attachment #88261 -
Attachment is obsolete: true
Attachment #88286 -
Attachment is obsolete: true
Assignee | ||
Comment 36•22 years ago
|
||
In general the keeper functions would be externally visible. I don't want random people to actually call since that could crash mozilla so I reference the static functions and return something harmless. Yes, the movl's probably aren't necessary but I haven't tried all possible compiler versions with all possible options so I decided to be safe. It's not all that important.
Comment 37•22 years ago
|
||
Comment on attachment 88311 [details] [diff] [review] merged patch + my proposed changes automated tests all pass with gcc 2.96 and gcc 3.1. Also produces builds that work. Tested some JS components and in general things look OK. The code seems OK as well. sr=blizzard
Attachment #88311 -
Flags: superreview+
Comment 38•22 years ago
|
||
I've done some testing and the last patch seems to work. And one general comment: I hope people here realize that this patch isn't anymore about 'perf' only. This is a bug fix now. So, either lose perf or add whatever other keyword is needed.
Comment 39•22 years ago
|
||
Automated tests on Red Hat 6.2 all pass as well. It also produces a working build on 6.2 with egcs.
Comment 40•22 years ago
|
||
tenthumbs, does this look good enough for you to stamp?
Assignee | ||
Comment 41•22 years ago
|
||
The last patch compiles correct code with my egcs, 2.95.3, and 3.0.4 so it should do. If you're looking for a r= from me you've got it, for what little it's worth. Note that I don't have checkin privileges.
Comment on attachment 88311 [details] [diff] [review] merged patch + my proposed changes Looks pretty good. Between ulrich and tenthumbs and me, I think this more than deserves an r=.
Attachment #88311 -
Flags: review+
Assignee | ||
Comment 43•22 years ago
|
||
With the latest changes there's not too much C++ left. It would be possible to put all the assembler into one file and the C++ into another. This would allow us to use the assembler directly which means we could add debugging symbols so one could step through this stuff with gdb. The disadvantage is that invoke_copy_to_stack and PrepareAndDispatch would be externally visible unless we do some linker magic. Any thoughts?
Comment 44•22 years ago
|
||
I wouldn't bother making that change. The amount of assembler code used is very small [1]. If you wanted, you could add the debug info in the asm() statements as well. [1] I just realized that there is no reason to keep the SharedStub function. Just rearrange the first two parameters of PrepareAndDispatch, create a "C" name, and add attribute((regparm(1))). Then there really is only the XPTC_InvokeByIndex function left as a bigger asm piece.
Comment 45•22 years ago
|
||
A little optimization in SharedStub. Instead of getting rid of SharedStub (which would add about 2000 bytes in code size) we can use the optimization in this patch which uses gcc's abilities to modify the calling convention of PrepareAndDispatch. The resulting file is a few bytes shorter but each call should be quite a bit faster.
Comment 46•22 years ago
|
||
*** Bug 155971 has been marked as a duplicate of this bug. ***
Comment 47•22 years ago
|
||
So, can we get this checked in? My gcc 3.1 builds are DOA without this patch.
Assignee | ||
Comment 48•22 years ago
|
||
I'm concerned that some future gcc version will change the regparm conventions and we'll have to do this all over again. If it were me, I would prefer to make all the __attribute__ stuff macros in xptc_gcc_x86_unix.h so it can be changed readily if necessary. However I think it's time to actually do this and see what happens. BTW, has anyone ever investigated whether the stub functions can have C linkage? It would make life _so_ much easier.
Assignee | ||
Comment 49•22 years ago
|
||
It occurred to me that the SharedStub function is just 2 instructions and a jump, so why not just include it directly into the stubs. That makes each stub 8 bytes larger but that's still smaller than the current version plus it eliminates one jump. Also made the regparm thing a macro as I suggested before. I don't think I broke anything.
Comment 50•22 years ago
|
||
No, don't use the "inline the stub". The size increase is too significant for the almost unmeasurable benefits. I've tried it. The processor is very well able to predict the unconditional jump correctly. I do think my last patch should be used and yes, it should be applied. Can the r/sr be moved to it?
Assignee | ||
Comment 51•22 years ago
|
||
Well, today's processor is tomorrow's legacy crap. I'm not a big fan of doing something just because it works now. Also, when did a few bytes matter to mozilla? Be that as it may, it is definitely time to do something.
Comment 52•22 years ago
|
||
Unconditional jumps will be handled rather better than worse in new processors. And we are not talking about a few bytes. With 4096 stub and 8 bytes more for each this are 32kb more code. In other places I've faught hard to reduce the size by hundreds of bytes.
Assignee | ||
Comment 53•22 years ago
|
||
But xpcom/reflect/xptcall/public/*.{h,inc} only have 256 stubs. Where are the others? Also any patch here reduces the size of each stub so there is a net win no matter what we do.
Comment 54•22 years ago
|
||
OK, in the code there are only 256 stubs. I used at some point code which had more. Still, 256 stubs with 8 bytes more each, still adds 2kb with almost no benefit at all. Do some measurements. The unconditional jump almost completely disappears. In other bugs people argue against a few kb of code for a new feature (even thse required by standards). Here you argue about adding 2kb to save a few nanoseconds. I'm the first to appreciate optimizations, believe me, but this isn't one. I initially wrote the code exactly as you did but reverted since it's not worth it. Anyway, could some r/sr make a call and move the reviewer marks to one of the patches?
Assignee | ||
Comment 55•22 years ago
|
||
I started this bug about performance, i.e. speed, so I will always prefer speed over size. The current stubs are at least 42 bytes which is twice the size or more of the new versions so I see no reason not to prefer speed. We get a size reduction as a side effect. I agree with you, though, that this should be dealt with now. I won't object to any choice someone makes.
I thought we _had_ reviewed one of the patches (88311). Why wasn't it checked in then? The follow-on optimizations are great, and I'd love to get the regparm adjustment for 1.1final, but we don't need to keep everything in its current broken state until then. Can someone please mail drivers@mozilla.org asking for trunk approval (perhaps branch as well -- blizzard, you should help us make that call), and get it checked in ASAP?
Comment 57•22 years ago
|
||
If Ulrich is right and there's no benifit from inlining, then I say we take his and get the space savings. These stubs are used in lots of places so that little bit will add up.
Comment 58•22 years ago
|
||
Comment on attachment 88311 [details] [diff] [review] merged patch + my proposed changes a=asa (on behalf of drivers) for checkin to 1.1
Attachment #88311 -
Flags: approval+
Comment 59•22 years ago
|
||
I checked in attachment 88311 [details] [diff] [review] to the trunk. Please file separate bugs for any additional changes.
Status: NEW → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•