Closed Bug 803601 Opened 12 years ago Closed 12 years ago

SIGILL on armv6 while loading wiki.mozilla.org

Categories

(Core :: JavaScript Engine, defect)

ARM
Android
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla20

People

(Reporter: kats, Unassigned)

References

Details

(Whiteboard: [ion:t])

Attachments

(5 files)

Attached file mozconfig
I have a local armv6 build of firefox for android, and when I load wiki.mozilla.org it crashes with SIGILL. I'll attach my mozconfig and GDB backtrace. This is using the latest mozilla-central code unmodified (cset 7fcac3016159) on a Galaxy Q device running Android 2.3.4.
ugh.
today is the day of the un-fun bugs.
One thing that you can do (to confirm that this is the actual problem) is to print the call instruction that got you into PushActiveVMFrame.  (Not knowing how much you know) You can print it out by x/i FOO-4, where FOO is the return address listed for frame 1 in your backtrace (in this case 0x51a36bbc, but in practice, it may be different).  The instruction *should* be

bl 0x51a37f??

(possibly a different address)
but your machine is most likely going to say

blx 0x51a37f??

This is a problem since blx asks the processor to call the given function *AND* transition from executing ARM code to executing THUMB code, and armv6 builds are ARM-only.  Theoretically, we should never emit a blx with a relative offset, only bl with a relative offset and blx with a register.  While the body of JaegerTrampoline is hard-coded with a blx in it, the linker should be able to recognize that it is computing a call from ARM code to ARM code, and emit the correct bl instruction.  My current opinion is that this is a bug in the toolchain, and a rather finicky one at that.
You are correct:

(gdb) x/i 0x51a72bb8
   0x51a72bb8 <JaegerTrampoline+48>:	blx	0x51a73fac <PushActiveVMFrame(js::VMFrame&)>

FWIW I'm using the new mozilla-repackaged NDK toolchain which has GCC 4.6.3 and gold as the linker. That's not what's used on the builder machines (yet) so it may not affect release builds. If it is a toolchain issue specific to the mozilla-repackaged NDK then this bug should block bug 769099.
Marty, Kats - this has bitten us a few times. How should we triage? If it only happens in local builds, is it worth prioritizing?
Whiteboard: [ion:t]
Unlike with linux, if you aren't a developer, and are running firefox on android you *are* running a firefox that we built, so hopefully will be free of this issue.  Unfortunately, having the possibility of scaring off a new developer due to mysterious crashes and linker bugs seems bad.  The easiest thing to do would likely be to document this failure mode inside of our building-on-android documentation, however, it is already an unwieldy document, and making it more complex probably isn't a great idea.  We already have a sigsegv handler.  Would it be reasonable to add in a sigill handler that knows enough to look up the stack, and print out an error about bad linkers?
or actually, much simpler, at build time (this is a build time failure after all), objdump the shell (or the actual browser if there is an elf lying around), and grep for blx.  I used something like that to initially diagnose this bug.
Is there some way to prevent this crash from happening? Even ugly hacks are fine since I just need to do it on my local builds.
The crash with SIGILL also happens when I build for armv6 using NDK r8c. As we would like to switch over all Fennec builds to r8c, I would like to get more info on this bug and see if we can fix it.

(In reply to Marty Rosenberg [:mjrosenb] from comment #2)
> This is a problem since blx asks the processor to call the given function
> *AND* transition from executing ARM code to executing THUMB code, and armv6
> builds are ARM-only.  Theoretically, we should never emit a blx with a
> relative offset, only bl with a relative offset and blx with a register. 
> While the body of JaegerTrampoline is hard-coded with a blx in it, the
> linker should be able to recognize that it is computing a call from ARM code
> to ARM code, and emit the correct bl instruction.  My current opinion is
> that this is a bug in the toolchain, and a rather finicky one at that.

Are there linker flags we can twiddle with to make it do the right thing here? Alternatively, can we ifdef the blx instruction in JaegerTrampoline and use a bl instruction on ARMv6 devices instead?
Blocks: 816993
> Are there linker flags we can twiddle with to make it do the right thing
> here? Alternatively, can we ifdef the blx instruction in JaegerTrampoline
> and use a bl instruction on ARMv6 devices instead?

The assembler should be treating |bl label| and |blx label| identically, since it'll determine if the target is ARM or THUMB at link time.  I don't know of any such flags, but we may want to poke google, since it is likely a linker bug (and I don't know how much they've modified the code and haven't upstreamed yet).
Please try the attached patch. It fixes the problem for me.

This problem can be also reproduced with a small testcase on an ARM linux system:

$ cat test.S
.text
.arch armv7-a
.global main
.arm

#ifdef HAVE_PROPER_TYPES
.type func, %function
#endif
func:
        bx          lr

#ifdef HAVE_PROPER_TYPES
.type main, %function
#endif
main:
        push        {r4, lr}
        blx         func
        pop         {r4, pc}

$ readelf -a a.out | grep func
    72: 00008434     0 NOTYPE  LOCAL  DEFAULT   13 func

$ gcc test.S && ./a.out
Illegal instruction

$ gcc -DHAVE_PROPER_TYPES test.S
test.S: Assembler messages:
test.S:17: Warning: blx to 'func' an ARM ISA state function changed to bl

$ readelf -a a.out | grep func
    72: 00008434     0 FUNC    LOCAL  DEFAULT   13 func

$ ./a.out
(In reply to Siarhei Siamashka from comment #9)
> Please try the attached patch. It fixes the problem for me.

How did you verify this? I tried your patch but it didn't seem to fix the problem for me.
Ifdefing the blx instructions under #ifdef MOZ_THUMB2 and using bl otherwise did seem to work.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #10)
> (In reply to Siarhei Siamashka from comment #9)
> > Please try the attached patch. It fixes the problem for me.
> 
> How did you verify this?

The standard packaged firefox-17 in gentoo linux suffers from this SIGILL problem on my ARM system. After applying the patch to firefox and rebuilding it using the standard gentoo emerge tool, the problem disappears.

On a side note, whenever I try to crosscompile firefox on my PC (obviously for faster build times), the problem is even not reproducible with or without any patches.

In all cases I'm using binutils-2.22 if that matters.

> I tried your patch but it didn't seem to fix the problem for me.

Thanks for testing. That's interesting. Have you tried to verify (using readelf) that the symbol type is now correct for JaegerTrampoline (FUNC instead of NOTYPE) in libxul.so? Also could you get a backtrace and confirm that the crash still happens in JaegerTrampoline / PushActiveVMFrame?

I just wonder if there could be other similar problems somewhere else. For example, the patch tries to fix not just JaegerTrampoline alone, but a whole bunch of Jaeger* functions. But it might be not enough.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #11)
> Ifdefing the blx instructions under #ifdef MOZ_THUMB2 and using bl otherwise
> did seem to work.

The use of bl instruction should be fine whenever there is no arm->thumb or thumb->arm state change.

Still, as we have discussed with Marty Rosenberg on irc, there is http://infocenter.arm.com/help/topic/com.arm.doc.ihi0044e/IHI0044E_aaelf.pdf which says "The linker is only required to provide interworking support for symbols of type STT_FUNC (interworking for untyped symbols must be encoded directly in the object file)" in "4.5.2 Symbol Types" section. Hence it is generally a good idea to ensure that all the functions have correct symbol types in libxul.so and in the other binaries/libraries.
(In reply to Siarhei Siamashka from comment #12)
> Thanks for testing. That's interesting. Have you tried to verify (using
> readelf) that the symbol type is now correct for JaegerTrampoline (FUNC
> instead of NOTYPE) in libxul.so?

I just tried this and it does report as type FUNC:

kats@kgupta-pc obj-android$ ~/android/ndk/toolchains/arm-linux-androideabi-4.6/prebuilt/linux-x86/bin/arm-linux-androideabi-readelf -a dist/fennec/libxul.so | grep Jaeger
  2842: 02e47a08     0 FUNC    GLOBAL DEFAULT   11 JaegerThrowpoline
  3070: 02e479f0     0 FUNC    GLOBAL DEFAULT   11 JaegerTrampolineReturn
  3071: 02e479b8     0 FUNC    GLOBAL DEFAULT   11 JaegerTrampoline
  3072: 02e47a2c     0 FUNC    GLOBAL DEFAULT   11 JaegerInterpolineScripted
  3073: 02e47a34     0 FUNC    GLOBAL DEFAULT   11 JaegerInterpoline
  3074: 02e47a70     0 FUNC    GLOBAL DEFAULT   11 JaegerStubVeneer

> Also could you get a backtrace and confirm
> that the crash still happens in JaegerTrampoline / PushActiveVMFrame?

Yup, still the same spot:

(gdb) bt
#0  0x52160e2e in PushActiveVMFrame (f=...) at /home/kats/zspace/mozilla-git/js/src/methodjit/MethodJIT.cpp:116
#1  0x5215f9ec in JaegerTrampoline () from /home/kats/zspace/mozilla-git/obj-android-armv6/dist/bin/libxul.so
#2  0x521611b8 in js::mjit::EnterMethodJIT (cx=0x4a51ef00, fp=0x49500110, code=0x4a610ca0, stackLimit=0x498e0000, partial=true) at /home/kats/zspace/mozilla-git/js/src/methodjit/MethodJIT.cpp:1057
#3  0x521614a0 in CheckStackAndEnterMethodJIT (cx=0x4a51ef00, fp=0x49500110, code=0x4a610ca0, partial=true) at /home/kats/zspace/mozilla-git/js/src/methodjit/MethodJIT.cpp:1115

(gdb) x/i 0x5215f9ec-4
   0x5215f9e8 <JaegerTrampoline+48>:	blx	0x52160e2c <PushActiveVMFrame(js::VMFrame&)>

FWIW I'm building this using the r8c version of the android NDK, which uses gcc-4.6 and gold 1.10 (from binutils 2.21).
> FWIW I'm building this using the r8c version of the android NDK, which uses
> gcc-4.6 and gold 1.10 (from binutils 2.21).

It could very well be a gold bug. Try BFD ld.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #14)
> FWIW I'm building this using the r8c version of the android NDK, which uses
> gcc-4.6 and gold 1.10 (from binutils 2.21).

Thanks for the feedback. I tried to experiment a bit more and constructed another small testcase. Looks like all the binutils releases up to 2.23.x are affected, but the latest binutils git is not. I bisected binutils to find when it was fixed and identified the following commit:

commit 4f194bb6afc5f0ec37720bdff01b6e6fbfce53f0
Author: Julian Brown <julian@codesourcery.com>
Date:   Wed Nov 28 16:52:57 2012 +0000

        gas/
        * config/tc-arm.c (md_apply_fix): Fix conversion of BL to BLX for
        local targets in Thumb mode.
    
        gas/testsuite/
        * gas/arm/bl-local-2.s: New test.
        * gas/arm/bl-local-2.d: New.

http://sourceware.org/git/?p=binutils.git;a=commit;h=4f194bb6afc5f0ec
Thanks for tracking that down! I'm a little confused though - you said in comment #12 you were using binutils-2.22 and it worked with that, but in comment #16 you said that all binutils releases up to 2.23.x are affected, which seems contradictory. I'm curious to know what the difference was between the testcase you created in comment #9 and the one you used in comment #16, since the one in #9 worked for you in binutils 2.22 but clearly the one in #16 did not. It seems to me that there may have been a different bug fixed between 2.21 (which didn't work on JaegerTrampoline for me) and 2.22 (which *did* work on JaegerTrampoline for you).
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #17)
> Thanks for tracking that down! I'm a little confused though - you said in
> comment #12 you were using binutils-2.22 and it worked with that,

Yes, I have initially reproduced the problem with binutils-2.22, which was the version I had installed at that time. The comment #12 also mentioned that the reproducibility of the problem was not 100% reliable for me (crosscompilation with a slightly different configuration did not trigger it). So I could not be 100% sure that the problem had been really fixed, even though it seemed to be gone. That's why your feedback was extremely valuable.

> but in comment #16 you said that all binutils releases up to 2.23.x are
> affected, which seems contradictory. I'm curious to know what the difference
> was between the testcase you created in comment #9 and the one you used in
> comment #16, since the one in #9 worked for you in binutils 2.22 but clearly
> the one in #16 did not.

Based on your feedback, it was clear to me that fixing only symbol types was not enough. So I downgraded my version of binutils to 2.21 (gold) and tried to construct a testcase, which would break on BL/BLX even with the correct symbol types. That's how I got https://bugzilla.mozilla.org/attachment.cgi?id=695363 test program. Then I tried to see, which versions of binutils are affected, and it looked like all the releases up to the latest 23.x have this bug (both gold and bfd). Then the next natural step would have been to submit a bug at http://sourceware.org/bugzilla/ in the case if the problem was also reproducible with binutils git. But it was not. So I just bisected binutils to find the commit which seems to solve the problem and reported it here.

As for the differences between the testcases in comment #9 and in comment #16. The former shows that having correct symbol types is generally important and not doing so is a fault on the Mozilla side. The latter shows that there is some bug in binutils even when having correct symbol types. Both of these bugs need to be fixed. And in the case if further testing reveals that the SIGILL problem still persists, then we are going to have a little more fun here and another bug to find :)

> It seems to me that there may have been a different bug fixed
> between 2.21 (which didn't work on JaegerTrampoline for me) and
> 2.22 (which *did* work on JaegerTrampoline for you).

More likely it's the same bug with unstable reproducibility, which just can arbitrarily show up or go away even on minor changes.

And as a side note, in the case of having multiple bugs affecting the same use case, the shotgun debugging approach (try something random -> see that it does not fully solve the problem -> go back to square one) does not really work. We need to have a really good understating about what's going on and just fix bugs one at a time.
I built gold locally using the latest version of the code in the binutils CVS repo and used that to build FF on ARMv6 using your .type %function patch, and it still crashes :( I verified using readelf that the symbol types are correct and the crash is still because of a BLX instruction in JaegerTrampoline. gold prints out this as it's version number:
  GNU gold (GNU Binutils 2.23.51.20121227) 1.11

Note that I only replaced the ld binaries in my NDK installation, not any of the other binutils binaries, if that makes any difference.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #19)
> Note that I only replaced the ld binaries in my NDK installation, not any of
> the other binutils binaries, if that makes any difference.

Based on how http://sourceware.org/git/?p=binutils.git;a=commit;h=4f194bb6afc5f0ec looks, I believe the problem might be in gas, and not in ld. Have you tried to also check if the SIGILL problem disappears for https://bugzilla.mozilla.org/attachment.cgi?id=695363 ?
Or it could be indeed unrelated. Are Mozilla builds for ARMv6 using no thumb at all?

In any case, I have difficulties reproducing the issue, because it tends to randomly disappear even with minor unrelated changed here :( Maybe I should try to do an Android build myself.

If BLX->BL change makes the issue disappear and nobody is up to investigate the root cause, maybe it's a usable workaround?
(In reply to Siarhei Siamashka from comment #20)
> Based on how
> http://sourceware.org/git/?p=binutils.git;a=commit;h=4f194bb6afc5f0ec looks,
> I believe the problem might be in gas, and not in ld.

Oh, good point. I will replace that as well and see what happens.

> Have you tried to also
> check if the SIGILL problem disappears for
> https://bugzilla.mozilla.org/attachment.cgi?id=695363 ?

Hmm, I don't seem to be able to reproduce the SIGILL problem on that test case using my stock NDK toolchain. I tried with and without the .type directives and it seems to work either way.

(In reply to Siarhei Siamashka from comment #21)
> Or it could be indeed unrelated. Are Mozilla builds for ARMv6 using no thumb
> at all?

Yeah, currently ARMv6 builds are not using thumb at all.

> In any case, I have difficulties reproducing the issue, because it tends to
> randomly disappear even with minor unrelated changed here :( Maybe I should
> try to do an Android build myself.

Do let me know if you try this, and/or if you need any help getting things set up. There are instructions on how to build at https://wiki.mozilla.org/Mobile/Fennec/Android

> 
> If BLX->BL change makes the issue disappear and nobody is up to investigate
> the root cause, maybe it's a usable workaround?

I have pushed a test build to the tryserver to see if this works. https://tbpl.mozilla.org/?tree=Try&rev=2c0744d1d581
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #22)
> (In reply to Siarhei Siamashka from comment #20)
> > Based on how
> > http://sourceware.org/git/?p=binutils.git;a=commit;h=4f194bb6afc5f0ec looks,
> > I believe the problem might be in gas, and not in ld.
> 
> Oh, good point. I will replace that as well and see what happens.

Same thing with the new as binary.
The try build at https://tbpl.mozilla.org/?tree=Try&rev=2c0744d1d581 works great on the handful of ARMv6 devices I tried it on. I don't get any SIGILL crashes. I think we should land attachment 694766 [details] [diff] [review] because it is the "right" thing to do but I would also like to stick in a workaround for whatever linker/assembler problems exist on the stock NDK-r8c toolchain so that we can switch ARMv6 builds to that toolchain. I will upload the workaround patch that fixed it for me.
Comment on attachment 696369 [details] [diff] [review]
prevent SIGILL crashes

Review of attachment 696369 [details] [diff] [review]:
-----------------------------------------------------------------

Interesting, I thought I'd tried this when the bug first came up, and it didn't fix the issue.  Then again, I was also using ndk r5c at the time, so if it works, go for it!
Attachment #696369 - Flags: review?(mrosenberg) → review+
https://hg.mozilla.org/mozilla-central/rev/f9be5a3d9b3c
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla20
(In reply to Marty Rosenberg [:mjrosenb] from comment #26)
> Interesting, I thought I'd tried this when the bug first came up, and it
> didn't fix the issue.  Then again, I was also using ndk r5c at the time, so
> if it works, go for it!

Just to be on a safe side, it is better to mimic the assembly code that is normally produced by C compilers:
1. Use .type and .size directives for functions
2. Don't mix arm and thumb code in the same source
3. Use BL for all the function calls (unless they are indirect) and let the linker fixup them to BLX if necessary.

Anything else drives us into the poorly tested territory. If/when this bug shows up again, just try the other things from the list above.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: