Open Bug 1808616 Opened 1 year ago Updated 3 months ago

Investigate to determine why MOZ_CRASH_UNSAFE_PRINTF was not reported in launch crash

Categories

(GeckoView :: Core, task, P2)

All
Android

Tracking

(Not tracked)

People

(Reporter: zmckenney, Assigned: zmckenney)

Details

(Whiteboard: [geckoview:m111][geckoview:m112][geckoview:m113])

In Bug 1807716 it appears MOZ_CRASH_UNSAFE_PRINTF was not caught by the crash reporter and we had no reporting on a launch crash. This ticket is to investigate why this wasn't reported and whether we are currently capturing other variations of MOZ_CRASH_* without issue.

Component: Crash Reporting → Core
Product: Fenix → GeckoView

111

Severity: -- → N/A
Rank: 210
Priority: -- → P2
Whiteboard: [geckoview:m111]
Rank: 210 → 111
Assignee: nobody → zmckenney

Investigation Results

Problem 1.) When a user opened the app, it could crash AFTER the CrashReporter was created (at initializeGlean() in FenixApplication). Logging added to the native code shows that MOZ_CRASH_UNSAFE_PRINTF asserts properly and hands off to MOZ_Crash, which completes as expected. Higher in the stack, however, this crash was not caught, recorded to file, or reported in my testing (more details below). Crash in Glean was here.

Problem 2.) When a user opened the app, it could crash BEFORE the CrashReporter was created if a crash file was found. This occurred as soon as GleanCrashReporterService was created: the file is parsed and added to CrashMetrics.crashCount in AC via this line, and that add() function in turn calls the native code, which causes the crash.

I suspect that when we merged this PR to move the engine warmup above initializeGlean, it "fixed" the initializeGlean crash. A user who did not have a crash file at next launch would no longer see the app crash. A user who DID have a crash file (whether due to Problem 1 or not) would encounter Problem 2, which would not be reported because it occurs before the CrashReporter has been created. Also note that if at any point the user got a new crash file (for example by going to about:crashparent), they would be stuck crashing without reporting.
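The cases described above can be summarized as a small decision table. The following Java sketch is only a model of the hypothesis in this comment, not Fenix or Android Components code; the class, enum, and parameter names are all made up for illustration.

```java
// Hypothetical model of the launch scenarios described above; none of these
// names exist in Fenix or Android Components.
final class LaunchOutcomeSketch {
    enum Outcome {
        NO_LAUNCH_CRASH,
        CRASH_AFTER_REPORTER_CREATED,   // Problem 1
        CRASH_BEFORE_REPORTER_CREATED   // Problem 2
    }

    static Outcome launch(boolean crashFilePresent, boolean warmupMovedAboveInitGlean) {
        if (crashFilePresent) {
            // Problem 2: the crash file is parsed as soon as GleanCrashReporterService
            // is created, which is before the CrashReporter exists.
            return Outcome.CRASH_BEFORE_REPORTER_CREATED;
        }
        if (!warmupMovedAboveInitGlean) {
            // Problem 1: the crash happens at initializeGlean(), after the
            // CrashReporter was created.
            return Outcome.CRASH_AFTER_REPORTER_CREATED;
        }
        // Warmup moved above initializeGlean and no crash file: the launch succeeds.
        return Outcome.NO_LAUNCH_CRASH;
    }

    public static void main(String[] args) {
        for (boolean file : new boolean[] {false, true}) {
            for (boolean moved : new boolean[] {false, true}) {
                System.out.println("crashFile=" + file + " warmupMoved=" + moved
                        + " -> " + launch(file, moved));
            }
        }
    }
}
```

In this model the crash-file check comes first because Problem 2 happens earlier in startup than Problem 1, so a leftover crash file decides the outcome whether or not the warmup change is present.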

Extra Details

If a user updated after we pushed the PR fix above to Nightly (and had no crash file), they could navigate to pages that would break (e.g. ign.com), but those crashes would still be reported.

A potential reason Problem 1 did not get reported is that the process was killed and a DeadObjectException was thrown.

2023-01-30 22:05:51.449  1339-1366  BootReceiver            system_process                       I  Copying /data/tombstones/tombstone_14 to DropBox (SYSTEM_TOMBSTONE)
2023-01-30 22:05:51.456  1339-20581 ActivityManager         system_process                       W  Exception thrown during pause
                                                                                                    android.os.DeadObjectException
                                                                                                    	at android.os.BinderProxy.transactNative(Native Method)
                                                                                                    	at android.os.BinderProxy.transact(Binder.java:764)
                                                                                                    	at android.app.IApplicationThread$Stub$Proxy.schedulePauseActivity(IApplicationThread.java:1079)
                                                                                                    	at com.android.server.am.ActivityStack.startPausingLocked(ActivityStack.java:1347)
                                                                                                    	at com.android.server.am.ActivityStack.finishActivityLocked(ActivityStack.java:3779)
                                                                                                    	at com.android.server.am.ActivityStack.finishActivityLocked(ActivityStack.java:3721)
                                                                                                    	at com.android.server.am.ActivityStack.finishTopRunningActivityLocked(ActivityStack.java:3602)
                                                                                                    	at com.android.server.am.ActivityStackSupervisor.finishTopRunningActivityLocked(ActivityStackSupervisor.java:2124)
                                                                                                    	at com.android.server.am.AppErrors.handleAppCrashLocked(AppErrors.java:668)
                                                                                                    	at com.android.server.am.AppErrors.makeAppCrashingLocked(AppErrors.java:500)
                                                                                                    	at com.android.server.am.AppErrors.crashApplicationInner(AppErrors.java:376)
                                                                                                    	at com.android.server.am.AppErrors.crashApplication(AppErrors.java:321)
                                                                                                    	at com.android.server.am.ActivityManagerService.handleApplicationCrashInner(ActivityManagerService.java:14375)
                                                                                                    	at com.android.server.am.NativeCrashListener$NativeCrashReporter.run(NativeCrashListener.java:85) 
Whiteboard: [geckoview:m111] → [geckoview:m111][geckoview:m112]
See Also: → 1805974
See Also: → 1805973
Priority: P2 → P1

A better and more final answer to "why MOZ_CRASH_UNSAFE_PRINTF was not reported in launch crash" is that LaunchCrashHandlerService had not yet been created when the crash occurred in libmozglue.

Adding a log in MinidumpCallback(), which invokes the launch crash handler service (here), and adding a MOZ_CRASH_UNSAFE_PRINTF() in loadGeckoLibs() (here) validates this.
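The same kind of check can be illustrated with a plain Java analogy: a crash that happens before a handler is installed never reaches that handler's log. This is only a sketch of the timing argument. It uses Thread.setDefaultUncaughtExceptionHandler as a stand-in, which is not the native minidump path, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Analogy only: a Java uncaught-exception handler stands in for the native
// crash callback. All names here are hypothetical.
final class HandlerTimingSketch {
    // Stand-in for the log line added inside the crash callback.
    static final List<String> callbackLog = new ArrayList<>();

    // Simulates a crash in the given startup phase on a throwaway thread.
    static void crashIn(String phase) {
        Thread t = new Thread(() -> {
            throw new IllegalStateException("crash in " + phase);
        });
        t.start();
        try {
            t.join(); // wait until the thread has died and any handler has run
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
    }

    static void installHandler() {
        Thread.setDefaultUncaughtExceptionHandler(
                (thread, error) -> callbackLog.add(error.getMessage()));
    }

    static List<String> run() {
        callbackLog.clear();
        Thread.setDefaultUncaughtExceptionHandler(null);
        crashIn("library loading"); // no handler yet: nothing reaches callbackLog
        installHandler();
        crashIn("later startup");   // handler installed: the callback logs it
        return new ArrayList<>(callbackLog);
    }

    public static void main(String[] args) {
        System.out.println("Crashes seen by the handler: " + run());
    }
}
```

Only the second crash shows up in the handler's log. The first one is still printed to stderr by the JVM's default behavior, but the handler never sees it, which mirrors a crash that occurs before the reporting service exists.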

Whiteboard: [geckoview:m111][geckoview:m112] → [geckoview:m111][geckoview:m112][geckoview:m113]
See Also: 1805974
See Also: 1805973

The Android team has not been keeping our P1 bug list up to date, so we're resetting all our P1 bugs to P2 to avoid signalling that we're actively working on bugs that we're not. The BMO documentation https://wiki.mozilla.org/BMO/UserGuide/BugFields#priority says P1 means "fix in the current release cycle" and P2 means "fix in the next release cycle or the following (nightly + 1 or nightly + 2)".

If you are actively working on this bug and expect to ship it in Fx 122 or 123, then please restore the priority back to P1.

Priority: P1 → P2