Closed Bug 697301 Opened 8 years ago Closed 7 years ago

all Android crashes with mozalloc_abort at the top of stack have garbled stacks

Categories

(Toolkit :: Crash Reporting, defect, major)

ARM
Android
defect
Not set
major

Tracking


RESOLVED WORKSFORME
mozilla12
Tracking Status
firefox9 --- affected
firefox10 --- affected
firefox11 --- affected

People

(Reporter: dbaron, Assigned: glandium)

Details

Attachments

(1 file)

After looking at bug 696906, I looked through the top crashes for Aurora/9.0a2 Fennec here:
https://crash-stats.mozilla.com/topcrasher/byversion/Fennec/9.0a2/7

It looks to me like all of the crashes with signatures of the form "mozalloc_abort | ..." for various values of "..." have garbled stacks, such that it's impossible to tell what's actually going on.

One example for each of the five crashes I looked at:
bp-d12bf4bd-6527-4083-990a-76a1f2111019
bp-4d601c3a-d6e3-4a2d-a217-c953f2111023
bp-2eb14835-7cf7-407d-b365-5060b2111025
bp-9fd15a0a-5540-420d-9f9c-ac5232111023
bp-d86e8664-9241-4c9b-a8e1-1e55e2111020

These stacks all look useless:  the caller of mozalloc_abort isn't something that would call it, and in many cases (e.g., the first) there are other chains of functions that clearly can't call each other.

I looked around at some other crashes, and there clearly are some crashes where we are getting useful crash stacks, such as these:
bp-d45a835a-f524-442c-9512-75ae12111019
bp-342b0ff7-38bc-42d7-b041-5581e2111024
bp-f4c3794f-59a7-4f56-8240-75ea42111023

I'm not sure why the mozalloc_abort ones are different, but it seems like there's something wrong with the stack walking for Android/ARM.
Note: the crashes with Java_org_mozilla_gecko_GeckoAppShell_reportJavaCrash in the signature are Java crashes and should have a "Java Signature", as in bug 679176. There's a bug on file to have Socorro report those (bug 686973).

I am unsure whether there is an issue with Breakpad and virtual methods (https://crash-stats.mozilla.com/report/index/d12bf4bd-6527-4083-990a-76a1f2111019). I hope to find some sort of STR (steps to reproduce) for this particular case.
I don't think getting steps to reproduce is critical here: we have raw crash dumps to debug on the crash-stats server. The problem lies in converting those raw crash dumps to stack traces, and that code should be possible to debug entirely with data we already have.
The problem here is that we don't have symbols for libc in the crashes you point out. Frame 1 (in libc) is just __libc_android_abort, but the stack walker can't reliably get past that without symbols. This is why I put together my Android Symbol Sender extension:
https://addons.mozilla.org/en-US/mobile/addon/android-symbol-sender/

The only other idea I had to make stacks more reliable was to do the stack walking client-side, since on ARM and other architectures like x86-64 all the stack unwind info for all libraries is present on the client. That's filed as bug 650239.
The problem is actually much worse. As the compiler knows mozalloc_abort doesn't return, it doesn't care about keeping the return address, and as such, lr is just garbage and there is no way to guess it from the stack.
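For anyone not familiar with the attribute, here is a minimal sketch of the source-level contract involved. The names sketch_abort and sketch_checked_index are hypothetical stand-ins, not the mozalloc code, and the standard [[noreturn]] attribute is used here in place of MOZ_NORETURN. What the ARM compiler actually emits for such a function depends on its version and flags, so this only shows why it is allowed to treat the return address as dead.

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical stand-in for a MOZ_NORETURN abort function.
// The attribute promises the compiler that control never comes back,
// so nothing obliges it to keep the return address around.
[[noreturn]] void sketch_abort(const char* msg) {
    std::fputs(msg, stderr);
    std::fputs("\n", stderr);
    std::abort();
}

// Hypothetical caller: on the failure path the compiler knows no code
// after the sketch_abort() call is reachable.
int sketch_checked_index(const int* data, int len, int i) {
    if (i < 0 || i >= len) {
        sketch_abort("index out of range");
    }
    return data[i];
}
```

On the non-failing path the caller behaves like any other function; only the failure path is affected by the attribute.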
For what it's worth, x86 and x64 are apparently safe on most platforms. I validated on 32-bit OSX, Win32, Linux and Linux64 (we apparently don't run xpcshell tests on win64 try) by trying to allocate ~4GB of memory on 32-bit builds and 42GB on 64-bit builds with moz_xmalloc. The stack trace was useful on all of those platforms, which makes ARM the only one affected.
Interestingly, moz_xmalloc(42GB) didn't trigger mozalloc_abort on 64-bit OSX. I think we should file a bug for that.
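The check described above can be sketched roughly as follows. This is not the real moz_xmalloc; sketch_xmalloc and sketch_abort are hypothetical names, and the only assumption is the general shape of an infallible allocator: try to allocate, and abort instead of returning null.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstdlib>

// Hypothetical stand-in for mozalloc_abort.
[[noreturn]] void sketch_abort(const char* msg) {
    std::fputs(msg, stderr);
    std::fputs("\n", stderr);
    std::abort();
}

// Hypothetical infallible allocator: never returns null, aborts instead.
void* sketch_xmalloc(std::size_t size) {
    void* p = std::malloc(size);
    if (!p) {
        sketch_abort("out of memory");
    }
    return p;
}
```

Calling sketch_xmalloc with a size the process cannot satisfy should reach the abort path, and the resulting crash report can then be inspected for a sane stack. Whether a given size really fails depends on the allocator and the OS, as the 64-bit OSX result shows.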
I think a fix/workaround would be to force TouchBadMemory not to be inlined. My original idea was to remove MOZ_NORETURN from the mozalloc_abort definition, but that would probably affect optimizations in its callers, whereas once we're in mozalloc_abort we don't care whether mozalloc_abort itself is optimally compiled.
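A rough sketch of that workaround, under stated assumptions: SKETCH_NEVER_INLINE, sketch_touch_memory and sketch_abort are hypothetical names, the helper takes the address as a parameter purely so it can be exercised safely, and this is not the attached patch. The idea is only that the deliberately faulting read lives in its own out-of-line function instead of being merged into the noreturn abort function.

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical "never inline" macro, spelled per compiler.
#if defined(__GNUC__) || defined(__clang__)
#  define SKETCH_NEVER_INLINE __attribute__((noinline))
#elif defined(_MSC_VER)
#  define SKETCH_NEVER_INLINE __declspec(noinline)
#else
#  define SKETCH_NEVER_INLINE
#endif

int g_dummy_counter = 0;

// Reads through addr. Kept out of line so the faulting load sits in its
// own function, reached by a real call from the abort function.
SKETCH_NEVER_INLINE void sketch_touch_memory(const volatile int* addr) {
    g_dummy_counter += *addr;
}

// Hypothetical stand-in for mozalloc_abort; stays noreturn so callers
// keep whatever optimizations they get from that.
[[noreturn]] void sketch_abort(const char* msg) {
    std::fputs(msg, stderr);
    std::fputs("\n", stderr);
    sketch_touch_memory(nullptr);  // deliberate null read, expected to fault
    std::abort();                  // fallback in case the read returns
}
```

This keeps the attribute on the abort function, which is the trade-off described above. Whether it is enough for the stack walker to get from the abort function back to its caller on ARM is a separate question.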
Attachment #587376 - Flags: review?(jones.chris.g) → review+
Assignee: nobody → mh+mozilla
https://hg.mozilla.org/mozilla-central/rev/9f00bf6379a6
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Whiteboard: [inbound]
Target Milestone: --- → mozilla12
I'm afraid this might not be enough :-/
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I think some other bugs made this better. It's probably not worth keeping this one open anymore. If we spot new problems, we'll file new bugs.
Status: REOPENED → RESOLVED
Closed: 8 years ago → 7 years ago
Resolution: --- → WORKSFORME