Closed Bug 664510 Opened 13 years ago Closed 13 years ago

Get valid crashreporter reports again

Categories

(Firefox for Android Graveyard :: General, defect)

ARM
Android
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: azakai, Assigned: ted)

References

Details

(Whiteboard: [mobile_dev_needed][android_tier_1])

We have some bugs where we crash on talos. With crashreporter working there, we could get useful data which otherwise is extremely difficult to acquire.
To clarify, the issue is that our crash reports have no useful information for symbols in the OS libs (like libc). If we build our own libc etc. we can do that with debug symbols.
Ignore comments 0 and 1. This is a broader issue than it seemed.

We are apparently getting corrupt stack traces basically all the time.

Stacks only show libc, and sometimes other system libraries, but never our own code. This happens both in crashreporter - almost all the recent top crashes are like this now - and when manually running a debugger on device.

Example: https://crash-stats.mozilla.com/report/index/555d3584-6131-4aec-a9fc-8d4dc2110609

This seems to be a recent regression.
Summary: Get crashreporter reports from talos runs → Get valid crashreporter reports again
Any update on this? I assumed this would be fixed by now, since it is a regression and not getting proper crash reports is a pretty serious problem.
I do not believe that crash reports are broken in general. I tested crashing both content and chrome processes on the June 24 nightly, and they showed up with perfect stacks on socorro.
(In reply to comment #1)
> To clarify, the issue is that our crash reports have no useful information
> for symbols in the OS libs (like libc). If we build our own libc etc. we can
> do that with debug symbols.

Ted wanted to grab the symbols from the system libraries with an extension; unfortunately, our dynamic linker is broken and doesn't permit that (bug 647288).
(In reply to comment #4)
> I do not believe that crash reports are broken in general. I tested crashing
> both content and chrome processes on the June 24 nightly, and they showed up
> with perfect stacks on socorro.

It isn't 100% broken, but we saw this both when debugging recently in bug 662936, and in most of the recent top crashers.
Who is working on this bug, and is there an ETA for fixing it? My understanding is that nobody is looking at bug 662936 until this is resolved.
dougt may have already found part of the stack trace issue in general (something with the library loader), but I don't think it can explain problems with crashreporter (which doesn't depend on the library loader AFAIK). I can't make a guess as to ETA, but we are doing our best.

I am also working on bug 662936 in parallel, trying some ideas that do not depend on this bug.
ted and jdm inform me on IRC that the issue is that we are 'stuck' inside libc calls, without a way to get a proper stack trace from there. The relevant bugs are bug 668210 and bug 644707. The latter has a potential partial solution; I will look into that. In general, though, there might not be a way to fix this for all cases.
Specifically, these are almost certainly calls to libc!abort(). Since we don't have debug symbols for system libraries on crash-stats, and there's no frame pointer on ARM, the stack walker just scans the stack looking for possible return addresses. Clearly it gets lost and wanders off into the weeds. Bug 644707 would probably fix the abort() case; bug 668210 is a bit more work, but would help fix the general case by crowdsourcing symbol data. It would also get us function names for system library stack frames, which would be nice.
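To illustrate the stack scanning described above, here is a minimal Python sketch. It is not Breakpad's actual implementation: it simply treats any stack word that falls inside a loaded module's address range as a possible return address. The `Module` type, the `scan_stack` helper, the address ranges, and the stack contents are all made up for illustration.

```python
from typing import NamedTuple


class Module(NamedTuple):
    name: str
    start: int  # first address of the module's code (hypothetical)
    end: int    # one past the last address of the module's code


def scan_stack(stack_words, modules):
    """Return (module name, offset) for every stack word that points into a module."""
    frames = []
    for word in stack_words:
        for mod in modules:
            if mod.start <= word < mod.end:
                frames.append((mod.name, word - mod.start))
                break
    return frames


# Hypothetical load addresses, chosen only for this example.
MODULES = [
    Module("libc.so", 0xAFD00000, 0xAFD40000),
    Module("libxul.so", 0x80000000, 0x81000000),
]

# Hypothetical raw stack memory: a mix of plain data and code-like values.
STACK = [0x00000001, 0xAFD1234C, 0x0000BEEF, 0xAFD0A010, 0x80123456, 0x7E000010]

frames = scan_stack(STACK, MODULES)
for name, offset in frames:
    print(f"{name}@{offset:#x}")
```

With these made-up values the scan reports two libc.so frames before it reaches libxul.so. A scanner like this cannot tell a genuine return address from a stale value left behind by an earlier call, which is one way a walker can get lost when it has no symbol data for the system libraries.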
bug 644707 should have fixed the majority of these.
Depends on: 644707
Was that only pushed to Nightly? This won't resolve Aurora or Beta crashes, will it?

(In reply to comment #12)
> bug 644707 should have fixed the majority of these.
Only Nightly so far.

If we see an improvement in Nightly crash report quality, and no new problems due to this patch, then we should ask for this to be in Aurora and Beta.
Whiteboard: [mobile_dev_needed]
Whiteboard: [mobile_dev_needed] → [mobile_dev_needed][android_tier_1]
I think that's unavoidable for any stack that dies inside of libc.so at this point.
I'll try to get bug 668210 revived, that may be our only hope.
Assignee: nobody → ted.mielczarek
(In reply to Ted Mielczarek [:ted, :luser] from comment #17)
> I'll try to get bug 668210 revived, that may be our only hope.

No update here in over 3 weeks on this android_tier_1 bug. Any progress?
Sorry, I've been working on bug 668210, but I apparently failed to update Bugzilla. I have a working extension there; we'll need to get it installed on a variety of devices to get useful symbols into crash-stats.
Preliminary results are encouraging. Looking at the last 4 hours of crash reports:
https://crash-stats.mozilla.com/query/query?product=Fennec&version=ALL%3AALL&range_value=4&range_unit=hours&date=09%2F08%2F2011+06%3A41%3A09&query_search=signature&query_type=contains&query=&reason=&build_id=&process_type=any&hang_type=any&do_query=1

The #1 topcrash is __libc_android_abort, which is probably a rollup of all those distinct libc.so@xxx crashes. We'll probably need to skiplist that signature to get more distinct crash reports out of it, but it looks like it's having a positive effect.
Depends on: 668210
Filed bug 685888 on skiplisting that signature.
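As a rough illustration of why skiplisting helps, here is a minimal Python sketch of signature generation. It is not Socorro's actual code: `generate_signature`, the `PREFIX_FRAMES` set, and every frame name other than `__libc_android_abort` are hypothetical. The assumption is that a frame on the list is kept in the signature but does not end it, so the next frame down gets appended as well.

```python
def generate_signature(frames, prefix_frames, separator=" | "):
    """Join frame names from the top of the stack, stopping after the
    first frame that is not in prefix_frames."""
    parts = []
    for name in frames:
        parts.append(name)
        if name not in prefix_frames:
            break
    return separator.join(parts)


# Hypothetical list of frames that should not end a signature on their own.
PREFIX_FRAMES = {"__libc_android_abort"}

# Two made-up crashes that both die in abort() but for different reasons.
crash_a = ["__libc_android_abort", "example_assert_failure", "example_main"]
crash_b = ["__libc_android_abort", "example_out_of_memory", "example_main"]

without_list = {generate_signature(c, set()) for c in (crash_a, crash_b)}
with_list = {generate_signature(c, PREFIX_FRAMES) for c in (crash_a, crash_b)}

print(sorted(without_list))
print(sorted(with_list))
```

Without the list, both made-up crashes collapse into the single signature `__libc_android_abort`; with it, each one gets a distinct signature that names the caller.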
I've done all I can do here, and I think the situation has improved. It's not perfect, but I don't think it ever will be.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED