Closed Bug 1557096 Opened 8 months ago Closed 6 months ago

GeckoSession.ContentDelegate.onCrash() gets called for crashes and when content process gets killed - making it hard to know when to auto-restore

Categories

(GeckoView :: General, defect, P1)

Unspecified
Android
defect

Tracking

(firefox67 wontfix, firefox68 wontfix, firefox69 fixed, firefox70 fixed)

RESOLVED FIXED
mozilla70
Tracking Status
firefox67 --- wontfix
firefox68 --- wontfix
firefox69 --- fixed
firefox70 --- fixed

People

(Reporter: sebastian, Assigned: awaseem)

References

()

Details

(Keywords: crash, Whiteboard: [geckoview:fenix:m7])

Attachments

(2 files, 2 obsolete files)

There are two situations where the content process can die:

  • (1) The content process crashes
  • (2) The content process gets killed (by Android or by running kill -9 on it)

GeckoView API:

  • (1) When the content process crashes
    • GeckoView notifies our registered crashHandler in a different process.
    • GeckoView calls GeckoSession.ContentDelegate.onCrash() for affected sessions
  • (2) When the content process gets killed
    • GeckoView calls GeckoSession.ContentDelegate.onCrash() for affected sessions

This makes sense since (2) is not actually a crash and therefore doesn't invoke the crashHandler.

AC:

  • The crashHandler implementation (lib-crash) handles the crash via a service running in a different process and notifies the app about this non-fatal crash via an Intent.
  • The implementation of GeckoSession.ContentDelegate.onCrash() sets a crashed flag on the Session and notifies its observers.

Fenix:

  • For non-fatal crashes (Intent from lib-crash): It shows an in-app UI that let's the user decide whether to report it and whether to restore the crashes tab(s) or not.

  • Session.crashed is not handled by the app right now (like reported in issue 1577).

Problem:

Session.crashed coming from GeckoSession.ContentDelegate.onCrash() can be caused by (1) content process crash or (2) content process getting killed.

The app wants to handle those cases differently:

  • (1) is already handled through the in-app crash UI and the crashed flag will be used to determine which sessions need to be restored.

  • (2) The app (or the code in AC) could automatically restore the crashed sessions since that is expected to happen when Android has killed the process.

When GeckoSession.ContentDelegate.onCrash() we have no way to know whether this is actually a crash or just the process getting killed.

We could try to correlate onCrash() getting called with the Intent coming from the crashHandler (onCrash() without ´Intent` coming in: auto-restore). But both actions are happening independently, asynchronously from different processes. That sounds super fragile and error-prone.

When GeckoView calls onCrash() could it tell us whether that is from an actual crash (or have two different callbacks)?

For example WebView's onRenderProcessGone() receives a RenderProcessGoneDetail object with a didCrash() method.

Summary: GeckoSession.ContentDelegate.onCrash() getting called for crashes and when content process gets killed - making it hard to know when to auto-restore → GeckoSession.ContentDelegate.onCrash() gets called for crashes and when content process gets killed - making it hard to know when to auto-restore

Adding [geckoview:fenix:m6] whiteboard tag because we should decide what we want to do before Fenix MVP.

Keywords: crash
Priority: -- → P1
Whiteboard: [geckoview:fenix:m6]

Sebastian says Fenix would like this for MVP so the app knows whether to reload the page or show the crash reporter. Maybe add separate onCrash and onKill methods? Or a reason flag to onCrash?

Alvina said she will take this bug.

Assignee: nobody → awaseem

FYI I don't think there's a way to know why the content process is gone besides looking at the crash reporter.

This is what WebView does: https://cs.chromium.org/chromium/src/android_webview/browser/aw_browser_terminator.cc?l=116-134&rcl=8506cd4563dc9736e314dc7732499cddafc0af89

Also, looking at the WebView docs it says it will report that didCrash() == true when Android kills the proces because of low memory (ie. case 2 of the OP).

Indicates whether the render process was observed to crash, or whether it was killed by the system. If the render process was killed, this is most likely caused by the system being low on memory.

I'm not sure if having a different behavior between crash and low memory kill is the correct thing to do here. If we lose the content process when the app is in the foreground, that's a serious issue and I would argue we should show the crashed tab UI. We could always refresh the tab on the background (and show the crashreporter prompt if we have a stacktrace).

Chris, we won't have this fix for MVP but let's keep it at high priority so we can ship a fix as soon as it's available.

Whiteboard: [geckoview:fenix:m6] → [geckoview:fenix:m7]

Sebastian says this bug is a high priority and he would like this fix for a Fenix 1.0.1 dot release.

Whiteboard: [geckoview:fenix:m7] → [geckoview:fenix:m6]
See Also: → 1561402
Attachment #9073959 - Attachment description: Bug 1557096 - created GeckoView:ContentKill request in GeckoViewContent.jsm to differentiate between crashes and kills. → Bug 1557096 - Add ContentDelegate.onKill() to differentiate between content process crashes and kills
Attachment #9073959 - Attachment description: Bug 1557096 - Add ContentDelegate.onKill() to differentiate between content process crashes and kills → Bug 1557096 - Add ContentDelegate.onKill() to differentiate between content process crashes and kills.
Attachment #9071068 - Attachment is obsolete: true

We'd like to fix this bug in GV 69, so we'll need to uplift the fix to Beta.

OS: All → Android
Whiteboard: [geckoview:fenix:m6] → [geckoview:fenix:m7]
Pushed by jwillcox@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/81f121f3a7cb
Add ContentDelegate.onKill() to differentiate between content process crashes and kills. r=geckoview-reviewers,agi,snorp

Backed out changeset 81f121f3a7cb (Bug 1557096) for ES lint failure on GeckoViewContent.jsm

Push with failure: https://treeherder.mozilla.org/#/jobs?repo=autoland&searchStr=linting%2Copt%2Csource-test-mozlint-eslint%2C%28es%29&fromchange=ff2f7f88eb4451607c01dc6e3a36f8f2565cac27&tochange=b6f7a0ad2f329e0d237aacea2e677faef3e565f3&selectedJob=256027132

Backout link: https://hg.mozilla.org/integration/autoland/rev/b6f7a0ad2f329e0d237aacea2e677faef3e565f3

Failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=256027132&repo=autoland&lineNumber=289
[task 2019-07-11T21:03:26.487Z] Installing eslint for mach using "/usr/local/bin/node /usr/local/bin/npm install --loglevel=error"...
[task 2019-07-11T21:03:26.487Z]
[task 2019-07-11T21:03:26.487Z] eslint installed successfully!
[task 2019-07-11T21:03:26.487Z]
[task 2019-07-11T21:03:26.487Z] NOTE: Your local eslint binary is at /builds/worker/checkouts/gecko/node_modules/.bin/eslint
[task 2019-07-11T21:03:26.487Z]
[task 2019-07-11T21:17:32.074Z] TEST-UNEXPECTED-ERROR | /builds/worker/checkouts/gecko/mobile/android/modules/geckoview/GeckoViewContent.jsm:224:15 | Replace browser?&&?aSubject.get("childID")?!=?browser.frameLoader.childID with ?????????????browser?&&?????????????aSubject.get("childID")?!=?browser.frameLoader.childID??????????? (prettier/prettier)
[taskcluster 2019-07-11 21:17:32.571Z] === Task Finished ===
[taskcluster 2019-07-11 21:17:33.418Z] Unsuccessful task run with exit code: 1 completed in 1268.372 seconds

Flags: needinfo?(awaseem)
Pushed by jwillcox@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/4629e855ea33
Add ContentDelegate.onKill() to differentiate between content process crashes and kills. r=geckoview-reviewers,agi,snorp
Status: NEW → RESOLVED
Closed: 6 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla70
Regressions: 1565794
Regressions: 1495113
Regressions: 1565422
Regressions: 1502276

Oh, there have been a bunch of regressions linked. Will this block uplifting?

Backout by rgurzau@mozilla.com:
https://hg.mozilla.org/mozilla-central/rev/0fb761cd8fa4
Backed out 2 changesets (bug 1557096, bug 1565410) for multiple regressions linked to Bug 1557096. a=backout CLOSED TREE

:sebastian Hi, the linked regressions resulted in backing out this bug and Bug 1565410 for conflicting with the backout.

Backout link: https://hg.mozilla.org/mozilla-central/rev/0fb761cd8fa448eb511e8b2520ede5a3230853ed

Status: RESOLVED → REOPENED
Flags: needinfo?(s.kaspari)
Resolution: FIXED → ---
Target Milestone: mozilla70 → ---
Flags: needinfo?(s.kaspari) → needinfo?(snorp)
No longer regressions: 1565794

Alvina, if the Robocop test failures in bug 1565794 are not actually caused by your onCrash fix, can we re-land your patches?

We also need to uplift your patches to GV 69 Beta.

Sorry about the trouble caused here. Alvina, Agi could you try to reland this? I tried importing the patch here but it's conflicting on this file https://hg.mozilla.org/mozilla-central/diff/a2180268af1d0d8943eb32535b8d3503756eef75/mobile/android/geckoview/src/main/java/org/mozilla/geckoview/doc-files/CHANGELOG.md#l1.13 from Bug 1565410 which is now fixed. Thank you.

Flags: needinfo?(agi)

I fixed the merge conflict with the changelog, so it should be ready to land now :)

Flags: needinfo?(snorp)
Flags: needinfo?(awaseem)
Flags: needinfo?(agi)
Pushed by csabou@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/6b80e71611e5
Add ContentDelegate.onKill() to differentiate between content process crashes and kills. r=geckoview-reviewers,agi,snorp
Status: REOPENED → RESOLVED
Closed: 6 months ago6 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla70

Please nominate this patch for Beta approval when you get a chance.

Flags: needinfo?(awaseem)

Comment on attachment 9073959 [details]
Bug 1557096 - Add ContentDelegate.onKill() to differentiate between content process crashes and kills. a=RyanVM

Beta/Release Uplift Approval Request

  • User impact if declined: This change affects how the app knows whether to reload the page or show the crash reporter.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): The change is not risky since mostly pre-existing crash event handlers were modified in name rather than anything new being added.
  • String changes made/needed:
Flags: needinfo?(awaseem)
Attachment #9073959 - Flags: approval-mozilla-beta?

Comment on attachment 9073959 [details]
Bug 1557096 - Add ContentDelegate.onKill() to differentiate between content process crashes and kills. a=RyanVM

Improves GeckoView's ability to determine whether a page crashed or was force-killed (and therefore which UI to show). Approved for GV69.

Attachment #9073959 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Conflicts when I tried to uplift this patch:
warning: conflicts while merging mobile/android/geckoview/src/androidTest/java/org/mozilla/geckoview/test/ContentDelegateTest.kt!
warning: conflicts while merging mobile/android/geckoview/src/main/java/org/mozilla/geckoview/doc-files/CHANGELOG.md!

Flags: needinfo?(awaseem)

I'll fix it and push again, thanks

Flags: needinfo?(awaseem)
Flags: needinfo?(agi)

Backed out for apilint failure at mobile/android/geckoview/src/main/java/org/mozilla/gecko/GeckoAppShell.java:40:

https://hg.mozilla.org/releases/mozilla-beta/rev/ba09b92b6286a1752031d434151bc5d1aa210be4

This causes also test failures:
https://treeherder.mozilla.org/logviewer.html#?job_id=258333418&repo=mozilla-beta
https://treeherder.mozilla.org/logviewer.html#?job_id=258333414&repo=mozilla-beta

Push with failures: https://treeherder.mozilla.org/#/jobs?repo=mozilla-beta&resultStatus=usercancel%2Ctestfailed%2Cbusted%2Cexception&revision=ca01f11722729d60639e56c7aeb6ea26a1b2d78c
Apilint log: https://treeherder.mozilla.org/logviewer.html#?job_id=258329486&repo=mozilla-beta

/builds/worker/workspace/build/src/mobile/android/geckoview/src/main/java/org/mozilla/gecko/GeckoAppShell.java:40: error: cannot find symbol
TEST-UNEXPECTED-FAIL | method public static void killContentProcess() | Method missing threading annotation. Needs one of: @MainThread, @UiThread, @WorkerThread, @BinderThread, @AnyThread.

Please fix the issue and resubmit.

Flags: needinfo?(awaseem)

Just pushed. Not sure if it's right still, so please let me know!

Flags: needinfo?(awaseem) → needinfo?(s.kaspari)
Flags: needinfo?(snorp)
Flags: needinfo?(agi)

Ryan, Alvina has a patch up for beta here https://phabricator.services.mozilla.com/D39732

Flags: needinfo?(snorp) → needinfo?(ryanvm)
Flags: needinfo?(s.kaspari)
Flags: needinfo?(ryanvm)
Flags: needinfo?(agi)

Looks like the apilint job at least is still broken with the Beta patch:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=258849247&repo=mozilla-beta&lineNumber=2968

Flags: needinfo?(awaseem)

That is just a warning (confusing I know, it's fixed in m-c) the actual problem is here: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=258849247&repo=mozilla-beta&lineNumber=2979-2984

[task 2019-07-29T19:02:11.507Z] 19:02:11     INFO -  ERROR: The api changelog file is out of date. Please update the file at
[task 2019-07-29T19:02:11.508Z] 19:02:11     INFO -  /builds/worker/workspace/build/src/mobile/android/geckoview/src/main/java/org/mozilla/geckoview/doc-files/CHANGELOG.md
[task 2019-07-29T19:02:11.508Z] 19:02:11     INFO -  and then modify the [api-version] line as following:
[task 2019-07-29T19:02:11.508Z] 19:02:11     INFO -  >>>>
[task 2019-07-29T19:02:11.508Z] 19:02:11     INFO -  [api-version]: 7b53d1a6703cc7f613de81c422dccdd95b6153fe
[task 2019-07-29T19:02:11.508Z] 19:02:11     INFO -  <<<<

Alvina has addressed the apilint issue. Agi will land Alvina's patch in beta.

Flags: needinfo?(awaseem) → needinfo?(agi)
Attachment #9081343 - Attachment description: Bug 1557096 - Add ContentDelegate.onKill() to differentiate between content process crashes and kills. → Bug 1557096 - Add ContentDelegate.onKill() to differentiate between content process crashes and kills. , a=RyanVM
Attachment #9073959 - Attachment description: Bug 1557096 - Add ContentDelegate.onKill() to differentiate between content process crashes and kills. → Bug 1557096 - Add ContentDelegate.onKill() to differentiate between content process crashes and kills. a=RyanVM
Attachment #9081343 - Attachment description: Bug 1557096 - Add ContentDelegate.onKill() to differentiate between content process crashes and kills. , a=RyanVM → Bug 1557096 - Add ContentDelegate.onKill() to differentiate between content process crashes and kills. a=RyanVM
Attachment #9082518 - Attachment is obsolete: true
Flags: needinfo?(agi)
You need to log in before you can comment on or make changes to this bug.