Closed
Bug 532198
Opened 15 years ago
Closed 15 years ago
Reimplement xptcinvoke for arm/linux in C
Categories
(Core :: XPCOM, defect)
Tracking
()
RESOLVED
FIXED
mozilla1.9.3a5
People
(Reporter: reed, Assigned: glandium)
References
Details
Attachments
(2 files, 7 obsolete files)
17.66 KB,
text/plain
|
Details | |
11.18 KB,
patch
|
jbramley
:
review+
vlad
:
review+
benjamin
:
superreview+
|
Details | Diff | Splinter Review |
Upstreaming a patch from Kevin Welton <Kevin.Welton@arm.com> to fix NS_InvokeByIndex() to use BLX for method invocations when compiling for Thumb. This comes from http://linux.onarm.com/bugzilla/show_bug.cgi?id=36.
Attachment #415476 -
Flags: review?(timeless)
Attachment #415476 -
Flags: review?(benjamin)
Reporter | ||
Updated•15 years ago
|
Reporter | ||
Updated•15 years ago
|
See Also: → https://launchpad.net/bugs/488354
Comment 1•15 years ago
|
||
looks like that's patching a patch that was already backed out. http://hg.mozilla.org/mozilla-central/rev/daa5346f19d6
We can use the #ifdef _thumb_ for future work though.
Reporter | ||
Comment 2•15 years ago
|
||
(In reply to comment #1)
> looks like that's patching a patch that was already backed out.
> http://hg.mozilla.org/mozilla-central/rev/daa5346f19d6
That's the backout commit...
Reporter | ||
Comment 3•15 years ago
|
||
This patch applies fine to trunk with a few minor offsets.
Updated•15 years ago
|
Attachment #415476 -
Flags: review?(vladimir)
Attachment #415476 -
Flags: review?(timeless)
Attachment #415476 -
Flags: review?(benjamin)
There's no reason to not use BLX all the time, no? I thought that's what the updated xptcinvoke patch did.
Attachment #415476 -
Flags: review-
Comment on attachment 415476 [details] [diff] [review]
patch - v1
for lazy people, that's bug 530087
>+ "blx ip \n\t" /* call it... */
> "mov lr, pc \n\t" /* call it... */
comments should match the conventions of our file, that means the exact same whitespace, no compression.
>+ "blx ip \n\t" /* copy args to the stack like the
>+ * compiler would */
i think:
/* .... */
/* .... */
>+#ifdef __thumb__
>+ "blx ip \n\t" /* call method */
>+ "mov lr, pc \n\t" /* call method */
Comment 6•15 years ago
|
||
You can (and should) use BLX all the time in this case, but it won't then work on ARMv4. The JIT doesn't work on ARMv4 anyway, and I don't think we care about such old machines.
Having said that, some distributions might care about ARMv4, and they might want support, even if it's slow. (Don't ask me why!) A solution here would be to use the old-style "MOV pc, ip" on ARMv4 (using a compile-time switch of some kind) and BLX everywhere else. We would also have to disable the JIT for ARMv4, but ARMv4 distributions must do that anyway because the JIT hasn't had ARMv4 support for a long time.
NOTE: Using "MOV pc, ip" will break the return stack prediction on newer cores, so even if you don't want to interwork, use BLX <reg>. The performance difference should be significant.
Updated•15 years ago
|
Assignee: nobody → Jacob.Bramley
Comment 7•15 years ago
|
||
The new patch uses BLX all the time, and thus imposes an ARMv5T baseline requirement for XPCOM. (We previously had this requirement for the JIT anyway, but that could be trivially disabled for crazy people who wanted to compile for ARMv4.)
I've (barely) tested this with the ARM build, but couldn't get GCC 4.3.3 to produce a Thumb-2 build. (The error looked like a compiler error in an unrelated module.) I could massage it into building just the bits we care about with -mthumb, but it takes a long time to build anyway, let alone with me hacking the build system.
In any case, I've delayed enough for such a trivial patch, so here it is. Whilst it's trivial, it wants pushing to the try server or something like that before it's actually pushed.
Attachment #415476 -
Attachment is obsolete: true
Attachment #428899 -
Flags: review?(vladimir)
Attachment #415476 -
Flags: review?(vladimir)
Attachment #428899 -
Flags: review?(vladimir) → review+
Assignee | ||
Comment 8•15 years ago
|
||
(In reply to comment #6)
> Having said that, some distributions might care about ARMv4, and they might
> want support, even if it's slow. (Don't ask me why!)
Here is why: openmoko is armv4. Debian supports openmoko.
Comment 9•15 years ago
|
||
(In reply to comment #8)
> (In reply to comment #6)
> Here is why: openmoko is armv4. Debian supports openmoko.
Hmm. The Neo FreeRunner uses an ARM920T, which is an ARMv4T device. However, it sits at 400MHz and has just 128MB of RAM. I admit that this is the closest ARMv4(T) device I've seen to being a realistic platform, but is it really worth going out of our way to support? We would have to include #ifdef blocks in our code to support ARMv4T, and these blocks will need testing.
Yeah, it's highly unlikely that we'll support any ARMv4 platforms going forward. (Even ARMv5 is borderline, but there's little cost in supporting it if we want to support ARMv6 anyway.)
Assignee | ||
Comment 11•15 years ago
|
||
How about "simply" replacing the assembly with (almost) C code only ? The compiler will take care of using the proper opcode for the target architecture. It will also create more efficient code and also inline the invoke_* functions.
I get a 25% speed improvement on the TestXPTCInvoke speed test with an ARMv4T target. (note it ends up using bl on armv4t, and blx when using e.g. an armv5t target)
Original assembly:
direct took 3.78 seconds
invoke took 50.84 seconds
So, invoke overhead was ~ 47.06 seconds (~ 93%)
With C, without inlining:
direct took 3.62 seconds
invoke took 38.90 seconds
So, invoke overhead was ~ 35.28 seconds (~ 91%)
With inlined invoke functions:
direct took 3.62 seconds
invoke took 37.53 seconds
So, invoke overhead was ~ 33.91 seconds (~ 90%)
Note I'm not entirely sure the attached patch is totally safe for any compiler, but it works properly when built with gcc 4.4. If this is not the case, I'm pretty sure this can be made safer.
Also note invoke_count_words could also be removed, to improve speed even more, as per bug 530087.
Attachment #433119 -
Flags: review?(vladimir)
Assignee | ||
Comment 12•15 years ago
|
||
(In reply to comment #11)
> I get a 25% speed improvement on the TestXPTCInvoke speed test with an ARMv4T
> target. (note it ends up using bl on armv4t, and blx when using e.g. an armv5t
> target)
(Note this was compiled for ARMv4T, but run on an ARMv5T)
Comment 13•15 years ago
|
||
That's a neat solution. The only worry I have is the fiddling of the stack pointer within C code; that is likely to break even in the same compiler with unrelated code changes causing it to prod the stack immediately before calling "func".
Actually, I'd be interested in seeing the assembly for that, as I'd like to see what it's doing differently.
(In reply to comment #11)
> Note I'm not entirely sure the attached patch is totally safe for any compiler,
> but it works properly when built with gcc 4.4. If this is not the case, I'm
> pretty sure this can be made safer.
That doesn't matter; the original assembly was extremely compiler-dependent!
> Also note invoke_count_words could also be removed, to improve speed even more,
> as per bug 530087.
Not without sacrificing memory: invoke_copy_to_stack on WinCE fills the stack downwards and avoids calling invoke_count_words to work out how much space to reserve. On Linux, with EABI, 64-bit arguments must be 64-bit aligned, so it's impossible for invoke_copy_to_stack to know where to put each argument (with one pass) if it's filling them backwards; the alignment requirements on an argument early in the list can shift everything above it.
The best solution I have is to reserve arg_count*2 words on the stack and waste stack space if we have 32-bit arguments. That will allow us to stick to one pass, skipping invoke_copy_to_stack entirely and filling the arguments forwards with proper alignment, at the cost of some stack space.
(In reply to comment #12)
> (In reply to comment #11)
> > I get a 25% speed improvement on the TestXPTCInvoke speed test with an ARMv4T
> > target. (note it ends up using bl on armv4t, and blx when using e.g. an armv5t
> > target)
>
> (Note this was compiled for ARMv4T, but run on an ARMv5T)
"bl" and "blx" will exhibit equivalent performance; the only difference is when interworking is involved. Return stack prediction is used for both (but maybe not on an ARMv5T device).
There are two issues here. I think you know these anyway, but for the sake of documentation & clarity in the bug:
* Correctness on a system with mixed ARM & Thumb code: "blx" _must_ be used here.
* Performance:
- ARMv4T interworking branches aren't return-stack-predicted on newer cores, and this costs a huge amount of performance. (I've seen figures ranging from 5% to 50%, but I don't know what would happen in this particular case.)
- ARMv4 long-range branches involve doing "mov pc, <dst_reg>". That's what the old code used to do. ("bl" can't branch to an address in a register and is restricted to PC±32MB.)
----
I was planning to work on this (and 530087) further over the next few days because it's been in my "to do" list for ages, but you're welcome to take it if you want to.
Assignee | ||
Comment 14•15 years ago
|
||
(In reply to comment #13)
> That's a neat solution. The only worry I have is the fiddling of the stack
> pointer within C code; that is likely to break even in the same compiler with
> unrelated code changes causing it to prod the stack immediately before calling
> "func".
I think it should be safe, actually, because the compiler knows the stack pointer is being fiddled with by the assembly, and avoids using it. Without forcing the stack pointer with the assembly, the compiled code modifies sp to read the first 3 arguments. (iirc, once with add and once with ldmia)
While this would actually do the same thing, it wouldn't have the same guarantees.
> Actually, I'd be interested in seeing the assembly for that, as I'd like to see
> what it's doing differently.
Will attach.
> There are two issues here. I think you know these anyway, but for the sake of
> documentation & clarity in the bug:
> * Correctness on a system with mixed ARM & Thumb code: "blx" _must_ be used
> here.
> * Performance:
> - ARMv4T interworking branches aren't return-stack-predicted on newer cores,
> and this costs a huge amount of performance. (I've seen figures ranging from 5%
> to 50%, but I don't know what would happen in this particular case.)
> - ARMv4 long-range branches involve doing "mov pc, <dst_reg>". That's what
> the old code used to do. ("bl" can't branch to an address in a register and is
> restricted to PC±32MB.)
The undeniable advantage of the C code is that the compiler will just use the right construct depending on the target architecture. No need to #ifdef __thumb__ or whatever.
> I was planning to work on this (and 530087) further over the next few days
> because it's been in my "to do" list for ages, but you're welcome to take it if
> you want to.
Since I've gone this far already, I think I'll just take it ;)
Assignee: Jacob.Bramley → mh+mozilla
Comment 15•15 years ago
|
||
(In reply to comment #14)
> The undeniable advantage of the C code is that the compiler will just use the
> right construct depending on the target architecture. No need to #ifdef
> __thumb__ or whatever.
Oh, certainly. C code is always preferable to assembly for that very reason, unless you need to do something that C code can't express efficiently (though that isn't the case here).
I'm still not entirely convinced about the SP-fiddling though. It seems unsafe to me. For instance, how do you know that the stack_space array is at the bottom of your frame? C doesn't guarantee anything there (even though you declare it last). Yes, it probably works, but it feels unsafe. That's why the original was in assembly, so it was clear what was happening.
In the case of writing to SP in the asm block: Yes, the compiler knows you've clobbered it, but C mandates that you have a stack and I'd be surprised if the compiler copes well with you doing that. If it isn't using a frame pointer, for example, it won't know how to wind back the stack pointer. Perhaps writing to SP in this way forces it to use a frame pointer, but none of that behaviour is documented (as far as I know) so it's likely to break between even minor compiler revisions.
I like the C implementation, but it does make me feel nervous :-)
Assignee | ||
Comment 16•15 years ago
|
||
(In reply to comment #15)
> (In reply to comment #14)
> > The undeniable advantage of the C code is that the compiler will just use the
> > right construct depending on the target architecture. No need to #ifdef
> > __thumb__ or whatever.
>
> Oh, certainly. C code is always preferable to assembly for that very reason,
> unless you need to do something that C code can't express efficiently (though
> that isn't the case here).
>
> I'm still not entirely convinced about the SP-fiddling though. It seems unsafe
> to me. For instance, how do you know that the stack_space array is at the
> bottom of your frame?
Because it is the only thing that is allocated on stack in the function. But we can probably use __builtin_alloca instead if you're uncomfortable with that, it generates the same code.
> C doesn't guarantee anything there (even though you
> declare it last). Yes, it probably works, but it feels unsafe. That's why the
> original was in assembly, so it was clear what was happening.
>
> In the case of writing to SP in the asm block: Yes, the compiler knows you've
> clobbered it, but C mandates that you have a stack and I'd be surprised if the
> compiler copes well with you doing that. If it isn't using a frame pointer, for
> example, it won't know how to wind back the stack pointer. Perhaps writing to
> SP in this way forces it to use a frame pointer, but none of that behaviour is
> documented (as far as I know) so it's likely to break between even minor
> compiler revisions.
I don't know if -fomit-frame-pointer is supposed to work on gcc arm, but i can't get gcc to stop using the frame pointer. The function is always marked with "frame_needed = 1", while other functions from the same source are marked as "frame_needed = 0", and as such don't use a frame pointer (even without -fomit-frame-pointer ; I guess it is included in -O2).
Anyways, I can think of several other approaches using a bit more assembly, but still leaving most of the work to the C code, I'll check what kind of code get generated with that.
Here is the assembly i get with the already attached C code (forcing -fno-inline), with an armv4t target:
stmfd sp!, {r4, r5, r6, r7, r8, r9, fp, lr}
mov r5, r0
add fp, sp, #28
mov r6, r1
mov r0, r2
mov r1, r3
mov r7, r2
mov r8, r3
bl _ZL18invoke_count_wordsjP13nsXPTCVariant(PLT)
mov r3, r0, asl #2
add r3, r3, #18
bic r3, r3, #7
sub sp, sp, r3
mov r1, r7
mov r2, r8
add r0, sp, #4
bl _ZL20invoke_copy_to_stackPjjP13nsXPTCVariant(PLT)
mov r4, sp
add r3, sp, #16
#APP
@ 200 "xptcinvoke_arm.cpp" 1
mov sp, r3
@ 0 "" 2
mov r0, r5
ldr r3, [r4, #12]
ldr ip, [r5, #0]
ldmib r4, {r1, r2} @ phole ldm
ldr ip, [ip, r6, asl #2]
mov lr, pc
bx ip
sub sp, fp, #28
ldmfd sp!, {r4, r5, r6, r7, r8, r9, fp, lr}
bx lr
Here it uses bl for the function calls and bx for the method invoke.
With a armv5t target, the end changes to the following:
ldmib r4, {r1, r2} @ phole ldm
mov lr, pc
ldr pc, [ip, r6, asl #2]
sub sp, fp, #28
ldmfd sp!, {r4, r5, r6, r7, r8, r9, fp, pc}
Without the assembly, this is how the first three 32-bits words are read from the stack:
ldr r3, [sp, #12]
ldr ip, [r5, #0]
ldmib sp, {r1, r2} @ phole ldm
In the end, sp would point on the third 32-bits word, instead of the fourth. Plus, the code is highly dependent on optimization level, while with the sp fiddling, the generated code is safe in all cases I tested.
Comment 17•15 years ago
|
||
> Because it is the only thing that is allocated on stack in the function. But we
> can probably use __builtin_alloca instead if you're uncomfortable with that, it
> generates the same code.
Not necessarily: For one thing, the "register" qualfier is just a hint, but
there may also be compiler temporaries that go on the stack.
Admittedly, for a function of this size and complexity, it's a fairly safe
assumption, but I think perhaps it needs a comment warning others from adding
code without considering the consequences, as it would be dreadfully easy to
break.
Is there any way we can ASSERT this? I don't think there's an easy way, but I
suppose in the assembly snippet we could assert that sp == stack_space before
assigning it:
#ifdef DEBUG
__asm__ __volatile__ (
"cmp sp, %1"
"moveq %0, #1"
"movne %0, #0"
"add sp, #(4*4)"
: "=r" (stack_ok)
: "r" (stack_space)
: "sp"
);
#else
__asm__ __volatile__ (
"add sp, #(4*4)" /* sp is at stack_space[0], so advance it to [4].
* (This saves &stack_space[4] from being allocated
* a register. */
:
:
: "sp"
);
#endif
> I don't know if -fomit-frame-pointer is supposed to work on gcc arm, [...]
It sounds like touching the SP does force frame_needed to 1, so it's probably
safe.
> With a armv5t target, the end changes to the following:
> ldmib r4, {r1, r2} @ phole ldm
> mov lr, pc
> ldr pc, [ip, r6, asl #2]
> sub sp, fp, #28
> ldmfd sp!, {r4, r5, r6, r7, r8, r9, fp, pc}
Ouch. I don't know why the compiler is emiting "ldr pc, [...]". It'd be
preferable to load into a temporary (such as "ip") and then blx to that. Hmm.
What do you get if you specify "-mcpu=cortex-a8"? It's possible that the
compiler is assuming that it doesn't matter for ARMv5T devices (and it's
probably right).
(In any case, this is a compiler-related issue that will occur in all the
compiled code.)
> Without the assembly, this is how the first three 32-bits words are read from
> the stack:
> ldr r3, [sp, #12]
> ldr ip, [r5, #0]
> ldmib sp, {r1, r2} @ phole ldm
>
> In the end, sp would point on the third 32-bits word, instead of the fourth.
Nope, sp isn't updated there. It's a strange way to load the values as a single
ldm would suffice, but I suppose it would work and wouldn't affect performance
much.
----
I just have one final concern, and then I'll leave you alone :-) We need to
guarantee that the stack space is 64-bit aligned (for EABI compliance), but I'm
not sure how you do that. Indeed, you removed some alignment code from
invoke_count_words. Have you confirmed that this is actually correct? Any issues
would only show up with 64-bit arguments (and only then if they don't fall
across an alignment boundary).
Comment 18•15 years ago
|
||
I missed the fairly obvious call to "ASSERT(stack_ok)" in my DEBUG example :-)
Assignee | ||
Comment 19•15 years ago
|
||
(In reply to comment #17)
> I just have one final concern, and then I'll leave you alone :-) We need to
> guarantee that the stack space is 64-bit aligned (for EABI compliance), but I'm
> not sure how you do that. Indeed, you removed some alignment code from
> invoke_count_words. Have you confirmed that this is actually correct?
Stack allocation already guarantees the stack pointer is going to be aligned. That's why the allocation on stack is space + 1, and the pointer passed to invoke_copy_to_stack is &space_stack[1], so that we start filling the stack on an unaligned word, and that the fourth argument is correctly aligned.
Anyways, I am currently testing a rather different approach (combined with the removal of invoke_count_words, because it makes things easier) that doesn't do too much magic with the stack.
I should be done (including full xpcshell-tests run) by tonight or tomorrow.
Assignee | ||
Updated•15 years ago
|
Attachment #433119 -
Flags: review?(vladimir)
Comment 20•15 years ago
|
||
Right-o. This will also resolve bug 530087 :-)
Assignee | ||
Comment 21•15 years ago
|
||
I have a incomplete patch for the moment but the results are already good:
(somehow the machine I'm using got faster since yesterday, so I reiterated all the benchmarks for comparison)
original code:
direct took 2.20 seconds
invoke took 29.81 seconds
So, invoke overhead was ~ 27.61 seconds (~ 93%)
with attachment 433119 [details] [diff] [review]:
direct took 2.20 seconds
invoke took 22.66 seconds
So, invoke overhead was ~ 20.45 seconds (~ 90%)
with the WIP patch:
direct took 2.22 seconds
invoke took 16.94 seconds
So, invoke overhead was ~ 14.72 seconds (~ 87%)
We're looking at almost 2x improvement, though the speed test is actually biased, as it only tries one type of calls, that doesn't even need to use the stack.
The current status of the WIP is that it is clean of any assembly and works properly, except that it lacks proper comments, and more importantly, breaks the non EABI case (which I unfortunately can't test, but I can at least try to make it theoretically work).
Assignee | ||
Comment 22•15 years ago
|
||
Some values for a modified speed test, using 10000000 iterations (10 times less) over AddMixedFloats instead of AddTwoInts:
original code:
direct took 10.72 seconds
invoke took 17.41 seconds
So, invoke overhead was ~ 6.69 seconds (~ 38%)
with attachment 433119 [details] [diff] [review]:
direct took 10.79 seconds
invoke took 16.69 seconds
So, invoke overhead was ~ 5.90 seconds (~ 35%)
with the WIP patch:
direct took 10.78 seconds
invoke took 14.58 seconds
So, invoke overhead was ~ 3.80 seconds (~ 26%)
Comment 23•15 years ago
|
||
I wouldn't worry about supporting non-EABI in Linux. It might be nice to have a conditional #error in case someone tries it. GCC still supports non-EABI for now, at least, though I don't know of anyone using it.
A 2* improvement is fantastic! I'm guessing that's because you only make one pass of the argument list; I saw similar effects on the WinCE version with that modification (which was really simple for WinCE's calling convention).
Assignee | ||
Comment 24•15 years ago
|
||
The patch is unfortunately not easily readable. It's probably better to apply it to see the resulting file. Please tell me if the comments are not clear enough.
It is now free of assembly, and I think it is now perfectly safe.
FYI, the main reason I went with the copy_double_word function is that somehow, gcc would generate 2 almost identical assembly snippets for the double word copies when I just inlined the code like it is in the original code: 1 for nsXPTType::T_I64 and nsXPTType::T_U64 and another one for nsXPTType::T_DOUBLE.
And when I say almost identical, it's almost pathetic that it can't factor:
add r0, r0, #7 add r0, r0, #7
bic r7, r0, #7 bic r7, r0, #7
cmp r7, r5 cmp r5, r7
moveq r7, lr moveq r7, lr
sub r9, r3, #16 sub r9, r3, #16
ldmia r9, {r8-r9} ldmia r9, {r8-r9}
add r0, r7, #4 add r0, r7, #4
stmia r7, {r8-r9} stmia r7, {r8-r9}
b .L5 b .L5
As a nice side effect, it made it cleaner to keep non-eabi support, so I just implemented it.
During the development of this patch, I hit some other pathetic cases of stupid gcc optimizations that I'll blog about.
Attachment #428899 -
Attachment is obsolete: true
Attachment #433119 -
Attachment is obsolete: true
Attachment #433581 -
Flags: review?(Jacob.Bramley)
Comment 25•15 years ago
|
||
Mike - I will apply the patch and test on an N900 and N810. Are there tests apps you have been using to get your measurements?
I was going to test using some talos suites too.
Comment 26•15 years ago
|
||
When building for N900:
c++ -o xptcinvoke_arm.o -c -fvisibility=hidden -DMOZILLA_INTERNAL_API -DOSTYPE=\"Linux2.6\" -DOSARCH=Linux -DEXPORT_XPTC_API -D_IMPL_NS_COM -I/home/mfinkle/mozilla-192/mozilla/xpcom/reflect/xptcall/src/md/unix/../.. -I/home/mfinkle/mozilla-192/mozilla/xpcom/reflect/xptcall/src/md/unix/../../../../xptinfo/src -I/home/mfinkle/mozilla-192/mozilla/xpcom/reflect/xptcall/src/md/unix -I. -I../../../../../../dist/include -I../../../../../../dist/include/nsprpub -I/home/mfinkle/mozilla-192/mobile/xulrunner/dist/include/nspr -I/home/mfinkle/mozilla-192/mobile/xulrunner/dist/include/nss -fPIC -fno-rtti -fno-exceptions -Wall -Wpointer-arith -Woverloaded-virtual -Wsynth -Wno-ctor-dtor-privacy -Wno-non-virtual-dtor -Wcast-align -Wno-invalid-offsetof -Wno-long-long -pedantic -fno-strict-aliasing -pthread -pipe -DNDEBUG -DTRIMMED -g -fno-inline -Os -freorder-blocks -fno-reorder-functions -finline-limit=50 -O2 -DMOZILLA_CLIENT -include ../../../../../../mozilla-config.h -Wp,-MD,.deps/xptcinvoke_arm.pp /home/mfinkle/mozilla-192/mozilla/xpcom/reflect/xptcall/src/md/unix/xptcinvoke_arm.cpp
/home/mfinkle/mozilla-192/mozilla/xpcom/reflect/xptcall/src/md/unix/xptcinvoke_arm.cpp: In function 'PRUint32* copy_double_word(PRUint32*, PRUint32*, PRUint32*, PRUint64*)':
/home/mfinkle/mozilla-192/mozilla/xpcom/reflect/xptcall/src/md/unix/xptcinvoke_arm.cpp:75: warning: cast from 'PRUint32*' to 'PRUint64*' increases required alignment of target type
/home/mfinkle/mozilla-192/mozilla/xpcom/reflect/xptcall/src/md/unix/xptcinvoke_arm.cpp: In function 'nsresult NS_InvokeByIndex_P(nsISupports*, PRUint32, PRUint32, nsXPTCVariant*)':
/home/mfinkle/mozilla-192/mozilla/xpcom/reflect/xptcall/src/md/unix/xptcinvoke_arm.cpp:158: error: ISO C++ forbids variable-size array 'stack_space'
make[8]: *** [xptcinvoke_arm.o] Error 1
Assignee | ||
Comment 27•15 years ago
|
||
Mark, can you try squashing this patch ?
Assignee | ||
Comment 28•15 years ago
|
||
(In reply to comment #25)
> Mike - I will apply the patch and test on an N900 and N810. Are there tests
> apps you have been using to get your measurements?
The measurements in e.g. comment 11 come from TestXPTCInvoke in xpcom/reflect/xptcinvoke/tests (you need to uncomment the DoSpeedTest call)
Comment 29•15 years ago
|
||
(In reply to comment #27)
> Created an attachment (id=433641) [details]
> Additional patch
>
> Mark, can you try squashing this patch ?
With this patch applied on the first patch, I get a successful build.
However, when attempting to launch Firefox Mobile on my N900, I get a Segmentation fault.
Assignee | ||
Comment 30•15 years ago
|
||
(In reply to comment #29)
> However, when attempting to launch Firefox Mobile on my N900, I get a
> Segmentation fault.
Can you try building the TestXPTCInvoke program in xpcom/reflect/xptcinvoke/tests (just make -C xpcom/reflect/xptcinvoke/tests) and then run it with:
dist/bin/run-mozilla.sh dist/bin/TestXPTCInvoke ? How far does it go ?
Assignee | ||
Comment 31•15 years ago
|
||
Also, is the N900 toolchain building EABI binaries ?
Comment 32•15 years ago
|
||
(In reply to comment #30)
> Can you try building the TestXPTCInvoke program in
> xpcom/reflect/xptcinvoke/tests (just make -C xpcom/reflect/xptcinvoke/tests)
> and then run it with:
> dist/bin/run-mozilla.sh dist/bin/TestXPTCInvoke ? How far does it go ?
Completes the whole thing. Even with DoSpeedTests uncommented.
> Also, is the N900 toolchain building EABI binaries ?
I think so. The arch in the packaged filenames says "guneabi"
Comment 33•15 years ago
|
||
I have not been able to install gdb on my N900, but I have seen that the segfault is happening in invoke_copy_to_stack. I'll try to narrow it down.
Updated•15 years ago
|
Attachment #433581 -
Flags: review?(Jacob.Bramley) → review+
Comment 34•15 years ago
|
||
Comment on attachment 433581 [details] [diff] [review]
C-only version
>+ current = (PRUint32 *)(((PRUint32)current + 7) & 0xfffffff8);
'~7' is clearer to me than '0xfffffff8'. I suppose that's subjective,
though.
>+ *((PRUint64*) current) = *dw;
I don't think that's valid in the non-EABI case where 'current' isn't
8-byte aligned. It will probably work anyway, but I think C assumes that
64-bit values have 64-bit alignment; LDRD and STRD had this restriction
on older architectures. I think the non-EABI and EABI cases should be
entirely contained in the #ifdef to avoid these issues.
>+ /* Wrap when reaching the end of the stack buffer */
>+ if (d == end) d = stk;
It might be wise to add a debug assertion here to check that 'd' is
within 'stk' and '&stk[end]'. That will catch some logic errors that can
be hard to spot, and easy to introduce in future patches.
> * The routine will issue calls to count the number of words
> * required for argument passing and to copy the arguments to
> * the stack.
We don't do this any more. Well, not explicitly anyway. Can you clobber
this bit of the comment please?
----
I like the look of this patch (with the 'Additional patch' on top), and it appears correct so I've marked it 'r+'. However, the N900 crash obviously needs to diagnosed & fixed before this is pushed.
Thanks,
Jacob
Assignee | ||
Comment 35•15 years ago
|
||
(In reply to comment #34)
> '~7' is clearer to me than '0xfffffff8'. I suppose that's subjective,
> though.
I prefer, too, but I just left what already was there.
> >+ *((PRUint64*) current) = *dw;
>
> I don't think that's valid in the non-EABI case where 'current' isn't
> 8-byte aligned. It will probably work anyway, but I think C assumes that
> 64-bit values have 64-bit alignment; LDRD and STRD had this restriction
> on older architectures. I think the non-EABI and EABI cases should be
> entirely contained in the #ifdef to avoid these issues.
I don't think there is a 64-bit alignment requirement on architectures that matter. FWIW, in Debian we target armv4t, and upstream Mozilla codebase already targets at least armv5.
> >+ /* Wrap when reaching the end of the stack buffer */
> >+ if (d == end) d = stk;
>
> It might be wise to add a debug assertion here to check that 'd' is
> within 'stk' and '&stk[end]'. That will catch some logic errors that can
> be hard to spot, and easy to introduce in future patches.
Fair enough.
> I like the look of this patch (with the 'Additional patch' on top), and it
> appears correct so I've marked it 'r+'. However, the N900 crash obviously needs
> to diagnosed & fixed before this is pushed.
Totally agreed. I'd like some more feedback from Mark if possible.
Assignee | ||
Comment 36•15 years ago
|
||
Any update, Mark ?
Comment 37•15 years ago
|
||
We are looking into this. Brian Crowder was able to get a stack from a build:
#0 0xbe9ff4a8 in ?? ()
#1 0x43162a2c in invoke_copy_to_stack (stk=0x12, paramCount=3198154224, s=0x8) at /home/mfinkle/mozilla-192/mozilla/xpcom/reflect/xptcall/src/md/unix/xptcinvoke_arm.cpp:92
#2 0x42693488 in XPCWrappedNative::CallMethod (ccx=..., mode=<value optimized out>)
at /home/mfinkle/mozilla-192/mozilla/js/src/xpconnect/src/xpcwrappednative.cpp:2722
#3 0x4269c2c0 in XPCWrappedNative::GetAttribute (ccx=...) at /home/mfinkle/mozilla-192/mozilla/js/src/xpconnect/src/xpcprivate.h:2524
#4 0x4269993c in XPC_WN_GetterSetter (cx=0x41c05800, obj=<value optimized out>, argc=0, argv=0x40455608, vp=0xbe9ff708)
at /home/mfinkle/mozilla-192/mozilla/js/src/xpconnect/src/xpcwrappednativejsops.cpp:1784
#5 0x406d7c0c in js_Invoke (cx=0x41c05800, argc=0, vp=0x40455600, flags=<value optimized out>)
Comment 38•15 years ago
|
||
I have been testing on mozilla-192. I will start testing on mozilla-central as well and report.
Assignee | ||
Comment 39•15 years ago
|
||
The xptcinvoke code is the same on 1.9.2 and trunk. It would be interesting to have the values for d, stk and paramCount on your stacktrace, but that would either need to break on the invoke_copy_to_stack function entry, or run an unoptimized build.
Comment 40•15 years ago
|
||
I made a non-optimized build from mozilla-central. The build works fine on my N900.
I am now going to make a non-optimized build from mozilla-192 and try it.
Comment 41•15 years ago
|
||
Non optimized build from mozilla-192 segfaults on startup. Here is the stack:
Starting program: /home/opt/fennec/fennec
Program received signal SIGSEGV, Segmentation fault.
0xbe9efb70 in ?? ()
0xbe9efb70: andeq r0, r0, r0
(gdb) bt
#0 0xbe9efb70 in ?? ()
#1 0x43a98b30 in invoke_copy_to_stack (stk=0x421d2200, paramCount=1109287616, s=0x8)
at /home/mfinkle/mozilla-192/mozilla/xpcom/reflect/xptcall/src/md/unix/xptcinvoke_arm.cpp:92
#2 0x426db3f8 in XPCWrappedNative::CallMethod (ccx=..., mode=XPCWrappedNative::CALL_GETTER)
at /home/mfinkle/mozilla-192/mozilla/js/src/xpconnect/src/xpcwrappednative.cpp:2722
#3 0x426ebdf4 in XPCWrappedNative::GetAttribute (ccx=...)
at /home/mfinkle/mozilla-192/mozilla/js/src/xpconnect/src/xpcprivate.h:2524
#4 0x426e5fa0 in XPC_WN_GetterSetter (cx=0x40446200, obj=0x421d2200, argc=0, argv=0x404b05a0, vp=0xbe9eff3c)
at /home/mfinkle/mozilla-192/mozilla/js/src/xpconnect/src/xpcwrappednativejsops.cpp:1784
#5 0x4076948c in js_Invoke (cx=0x40446200, argc=0, vp=0x404b0598, flags=2)
at /home/mfinkle/mozilla-192/mozilla/js/src/jsinterp.cpp:1360
Comment 42•15 years ago
|
||
I tried stepping through invoke_copy_to_stack:
(gdb) step 1
80 in /home/mfinkle/mozilla-192/mozilla/xpcom/reflect/xptcall/src/md/unix/xptcinvoke_arm.cpp
(gdb) step 1
88 in /home/mfinkle/mozilla-192/mozilla/xpcom/reflect/xptcall/src/md/unix/xptcinvoke_arm.cpp
(gdb) step 1
86 in /home/mfinkle/mozilla-192/mozilla/xpcom/reflect/xptcall/src/md/unix/xptcinvoke_arm.cpp
(gdb) step 1
87 in /home/mfinkle/mozilla-192/mozilla/xpcom/reflect/xptcall/src/md/unix/xptcinvoke_arm.cpp
(gdb) step 1
92 in /home/mfinkle/mozilla-192/mozilla/xpcom/reflect/xptcall/src/md/unix/xptcinvoke_arm.cpp
(gdb) step 1
nsXPTCVariant::IsPtrData (this=<value optimized out>) at ../../../../../../dist/include/xptcall.h:110
110 ../../../../../../dist/include/xptcall.h: No such file or directory.
in ../../../../../../dist/include/xptcall.h
(gdb) step 1
invoke_copy_to_stack (stk=0xbe809940, paramCount=1, s=0xbe809b70)
at /home/mfinkle/mozilla-192/mozilla/xpcom/reflect/xptcall/src/md/unix/xptcinvoke_arm.cpp:91
91 /home/mfinkle/mozilla-192/mozilla/xpcom/reflect/xptcall/src/md/unix/xptcinvoke_arm.cpp: No such file or directory.
in /home/mfinkle/mozilla-192/mozilla/xpcom/reflect/xptcall/src/md/unix/xptcinvoke_arm.cpp
(gdb) step 1
92 in /home/mfinkle/mozilla-192/mozilla/xpcom/reflect/xptcall/src/md/unix/xptcinvoke_arm.cpp
(gdb) step 1
94 in /home/mfinkle/mozilla-192/mozilla/xpcom/reflect/xptcall/src/md/unix/xptcinvoke_arm.cpp
(gdb) step 1
92 in /home/mfinkle/mozilla-192/mozilla/xpcom/reflect/xptcall/src/md/unix/xptcinvoke_arm.cpp
(gdb) step 1
88 in /home/mfinkle/mozilla-192/mozilla/xpcom/reflect/xptcall/src/md/unix/xptcinvoke_arm.cpp
(gdb) step 1
Program received signal SIGSEGV, Segmentation fault.
0xbe809b70 in ?? ()
0xbe809b70: andeq r0, r0, r0
Looks like the if(s->IsPtrData()) test is true and we attempt to continue. I should have been checking locals after each step. More coming.
Comment 43•15 years ago
|
||
Mike - One thing I overlooked: This appears to be broken only on mozilla-192, but not on mozilla-central. So it should be OK to checkin the patches you have and we can continue to work on getting this to mozilla-192.
Jacob & Brian - Do you guys agree?
Comment 44•15 years ago
|
||
I would personally prefer to understand the 192 broken-ness before landing it anywhere.
Assignee | ||
Comment 45•15 years ago
|
||
(In reply to comment #44)
> I would personally prefer to understand the 192 broken-ness before landing it
> anywhere.
Especially considering that really, there should be no difference between both.
Comment 46•15 years ago
|
||
(In reply to comment #44)
> I would personally prefer to understand the 192 broken-ness before landing it
> anywhere.
I certainly agree! The brokenness is probably still there in mozilla-central, but just not exercised. It might be fine, but until we understand what's happening, I don't trust it.
mfinkle, is this crashing on any/the first xptcall invocation? Or just a specific one? Stepping through it an instruction at a time and printing registers would be helpful; a gdb command that does 'si ; x/i $pc ; info reg' would probably be helpful, instead of 'step 1' above.
Comment 48•15 years ago
|
||
(In reply to comment #47)
> mfinkle, is this crashing on any/the first xptcall invocation? Or just a
> specific one?
Happens on the first invocation
> Stepping through it an instruction at a time and printing
> registers would be helpful; a gdb command that does 'si ; x/i $pc ; info reg'
> would probably be helpful, instead of 'step 1' above.
I am collecting the data now. Will attach a dump of the session.
Comment 49•15 years ago
|
||
Here is the gdb session leading up to the crash - using the "si ; x/i $pc ; info reg" commands
Assignee | ||
Comment 50•15 years ago
|
||
D'oh, the problem was so obvious and so problematic that I wonder how it could work without crashes on my builds, or with Mark's m-c builds O_o
The problem was that the stack buffer was too small in the paramCount == 1 case, and invoke_copy_to_stack would copy outside the buffer boundaries (it would write *before* the buffer, where the saved pc is for function return, in non optimized builds at least). The added assertion would catch this if this ever happens again.
Mark, can you check this patch ? I'll also try it on my end.
Attachment #433581 -
Attachment is obsolete: true
Attachment #433641 -
Attachment is obsolete: true
Assignee | ||
Comment 51•15 years ago
|
||
This version actually builds. I'm still waiting for the build to fully finish to actually test.
Attachment #438688 -
Attachment is obsolete: true
Assignee | ||
Comment 52•15 years ago
|
||
With a brown paper bag fix, this one seems to work properly. More testing undergoing.
Attachment #438697 -
Attachment is obsolete: true
Assignee | ||
Comment 53•15 years ago
|
||
Comment on attachment 438700 [details] [diff] [review]
Patch v3.2
I verified this one works for me. I think it will work for Mark, too.
Attachment #438700 -
Flags: review?(Jacob.Bramley)
Comment 54•15 years ago
|
||
Comment on attachment 438700 [details] [diff] [review]
Patch v3.2
This patch works for me! Nice job Mike.
Running Fennec on my N900. It launches fine and I am browsing pages and running sunspider.
Comment 55•15 years ago
|
||
I tested Ts on my own Fennec build with and without the patch:
(averaged over 10 runs)
Ts without: 5912ms
Ts with: 5787ms
Nice 14% improvement
Comment 56•15 years ago
|
||
Problem in Google Spreadsheet formula. Correct percentage: 2%
Not as good, but still positive!
Comment 57•15 years ago
|
||
Comment on attachment 438700 [details] [diff] [review]
Patch v3.2
Looks good to me. Nice job!
Attachment #438700 -
Flags: review?(Jacob.Bramley) → review+
Assignee | ||
Updated•15 years ago
|
Summary: Fix NS_InvokeByIndex() to use BLX for method invocations when compiling for Thumb → Reimplement xptcinvoke for arm in C
Assignee | ||
Updated•15 years ago
|
Summary: Reimplement xptcinvoke for arm in C → Reimplement xptcinvoke for arm/linux in C
Assignee | ||
Updated•15 years ago
|
Attachment #438700 -
Flags: review?(vladimir)
Comment on attachment 438700 [details] [diff] [review]
Patch v3.2
Looks good to me! Note that I'm not an xpcom peer -- bsmedberg is though, so asking him just to confirm this patch. Code looks fine to me, and is much cleaner. Thanks!
Attachment #438700 -
Flags: superreview?(benjamin)
Attachment #438700 -
Flags: review?(vladimir)
Attachment #438700 -
Flags: review+
Comment 59•15 years ago
|
||
Comment on attachment 438700 [details] [diff] [review]
Patch v3.2
I'll delegate to people who know!
Attachment #438700 -
Flags: superreview?(benjamin) → superreview+
Assignee | ||
Comment 60•15 years ago
|
||
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla1.9.3a5
Comment 62•15 years ago
|
||
Comment on attachment 438700 [details] [diff] [review]
Patch v3.2
Code is baking on trunk and would be helpful on 192 for Fennec 1.1
Attachment #438700 -
Flags: approval1.9.2.5?
Comment 63•15 years ago
|
||
Comment on attachment 438700 [details] [diff] [review]
Patch v3.2
dropping for fennec 1.1
Attachment #438700 -
Flags: approval1.9.2.5?
You need to log in
before you can comment on or make changes to this bug.
Description
•