Crash in [@ mozilla::layers::PCompositorManager::CreateEndpoints]
Categories
(GeckoView :: General, defect, P5)
Tracking
(firefox67 unaffected, firefox68 wontfix, firefox69 wontfix, firefox70 wontfix)
Tracking | Status
---|---
firefox67 | unaffected
firefox68 | wontfix
firefox69 | wontfix
firefox70 | wontfix
People
(Reporter: cpeterson, Unassigned)
References
Details
(Keywords: crash, regression, topcrash)
Crash Data
Attachments
(1 obsolete file)
This bug is for crash report bp-438d7634-2103-4bcf-a85c-56ac50190416.
Top 10 frames of crashing thread:
0 libxul.so mozilla::layers::PCompositorManager::CreateEndpoints ipc/glue/ProtocolUtils.h:852
1 libxul.so mozilla::gfx::GPUProcessManager::CreateContentBridges gfx/ipc/GPUProcessManager.cpp:797
2 libxul.so mozilla::dom::ContentParent::LaunchSubprocessInternal const dom/ipc/ContentParent.cpp:2539
3 libxul.so mozilla::dom::ContentParent::LaunchSubprocessInternal dom/ipc/ContentParent.cpp:2211
4 libxul.so mozilla::dom::ContentParent::LaunchSubprocessSync dom/ipc/ContentParent.cpp:2234
5 libxul.so mozilla::dom::ContentParent::GetNewOrUsedBrowserProcess dom/ipc/ContentParent.cpp:899
6 libxul.so mozilla::dom::ContentParent::CreateBrowser dom/ipc/ContentParent.cpp:1142
7 libxul.so nsFrameLoader::TryRemoteBrowser dom/base/nsFrameLoader.cpp:2605
8 libxul.so nsFrameLoader::ShowRemoteFrame dom/base/nsFrameLoader.cpp:868
9 libxul.so nsFrameLoader::Show dom/base/nsFrameLoader.cpp:737
This is a Fenix crash.
Comment 1 (Reporter) • 5 years ago
Here is a recent history of Fenix's GV versions (from Gecko.kt) and their CreateEndpoints crash counts:
GV Version | Build ID | Date Vendored | Crash Count | Crashes/Day | Changeset
---|---|---|---|---|---
68 Beta | 20190604110028 | 2019-06-05 | TBD | TBD |
68 Beta | 20190527103257 | 2019-05-28 | 3928 | 655 |
68 Beta | 20190520141152 | 2019-05-22 | 2567 | 183 |
68 Nightly | 20190508111321 | 2019-05-08 | 288 | 58 | 65a693623cee0837b4ad0d23241c84cd3ea23e3a
68 Nightly | 20190503041749 | 2019-05-03 | 0 | 0 | 083106d8fc7407c880a3a044c83d4e15e5961063
68 Nightly | 20190429095544 | 2019-05-03 | 0 | 0 |
68 Nightly | 20190422094240 | 2019-04-22 | 0 | 0 |
It looks like the crash first appeared in Build ID 20190508111321 and got worse in subsequent versions. I don't know how many Fenix users we had on each of these versions.
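For reference, the crashes/day figures in the table are presumably total crashes divided by the number of days the build was in the field. The exact measurement windows aren't stated, so the dates in this sketch are my assumption; it is only meant to show the arithmetic:

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

public class CrashRate {
    // Crashes/day for a vendored GV build: total crashes divided by the days
    // the build was in the field. The caller supplies the window; the comment
    // above doesn't state how its own windows were chosen.
    static long crashesPerDay(long crashCount, LocalDate first, LocalDate last) {
        long days = Math.max(1, ChronoUnit.DAYS.between(first, last));
        return Math.round((double) crashCount / days);
    }

    public static void main(String[] args) {
        // Build 20190520141152: 2567 crashes between vendoring (2019-05-22)
        // and the next vendored build (2019-06-05), a 14-day window.
        System.out.println(crashesPerDay(2567,
                LocalDate.of(2019, 5, 22), LocalDate.of(2019, 6, 5)));
    }
}
```

With those assumed dates, the 2567-crash build works out to 183/day, matching the table.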
Here is the pushlog between Build ID 20190503041749 and 20190508111321:
That pushlog includes the ARM64 PGO change (bug 1543212), and 89.61% of these CreateEndpoints crashes are from ARM64 devices. Perhaps this is a PGO problem?
Comment 2 (Reporter) • 5 years ago
That pushlog includes the ARM64 PGO change (bug 1543212), and 89.61% of these CreateEndpoints crashes are from ARM64 devices. Perhaps this is a PGO problem?
I no longer think this is an ARM64-specific or PGO bug. ARMv7 has a smaller crash volume than ARM64, but the same trend and regression range. I think Fenix users just have a lot of ARM64 devices.
Comment 3 • 5 years ago
:cpeterson, since this bug is a regression, could you fill (if possible) the regressed_by field?
For more information, please visit auto_nag documentation.
Comment 4 (Reporter) • 5 years ago
(In reply to Release mgmt bot [:sylvestre / :calixte] from comment #3)
:cpeterson, since this bug is a regression, could you fill (if possible) the regressed_by field?
We know this bug is a regression because we have a regression range, but we don't yet know which bug caused it.
Comment 5 (Reporter) • 5 years ago
Jed or Gian-Carlo, do you have any suggestions for how we can narrow down this GeckoView top crash?
GeckoView's main process is crashing because a content process fails to launch, but we don't know why the content process fails to launch.
Jed cleaned up GeckoChildProcessHost.cpp's handling of LaunchAndroidService errors (in bug 1548525). This might turn the main-process crashes into blank tabs. That's a better user experience, but it will be harder to diagnose from user bug reports. The GeckoView team added a Java exception (bug 1555447) closer to the error point inside LaunchAndroidService so we can include more diagnostic information in the exception report. (Fenix has not updated to that GeckoView version yet, so we have no data.)
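The idea behind throwing a Java exception near the error point can be sketched as follows. This is a hypothetical illustration, not the actual GeckoView code; the class, method, and message format are all invented:

```java
// Hypothetical sketch of the bug 1555447 approach: throw near the
// launch-failure point so the crash report carries diagnostic context
// (process type, error code), instead of the failure surfacing later
// as an opaque native crash in the C++ IPC layer.
final class ChildProcessLauncher {
    static int start(String processType, int serviceError) {
        if (serviceError != 0) {
            throw new RuntimeException("Failed to launch " + processType
                    + " process: service error " + serviceError);
        }
        return 1234; // pretend pid on success
    }
}
```

The point is that the exception message, thrown where the failure is detected, ends up in the crash report with context that a downstream native crash would lack.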
I think this crash might be a regression in this (five-day) regression range:
Comment 6 • 5 years ago
Jed or Gian-Carlo, do you have any suggestions for how we can narrow down this GeckoView top crash?
GeckoView's main process is crashing because a content process fails to launch, but we don't know why the content process fails to launch.
I suggest asking the Android team; there's no reason to believe this is a generic IPC problem, nor that the problem is in the IPC layer to begin with. I assume we have no specific STR and no usable error info from the field, so you're SOL here. Is it device- or chipset-specific?
I went through that list, and if we exclude the ARM64-specific changes (comment 2), I see few specific candidates.
Wild swings at a guess:
https://hg.mozilla.org/mozilla-central/rev/86c7622bc218906e28cf73c18c7e3b867d7839e3 (silent failure instead of NPE due to mistaken logic?)
https://hg.mozilla.org/mozilla-central/rev/4abfc5c2a4ee418302a99a2662a78f88b84d44b9 (because the stack claims it's the GPU process, related https://hg.mozilla.org/mozilla-central/rev/c737fc642ddcaa5846383f906ec3f24145d4a7a5 patch is in CompositorBridgeChild...I'm surprised Android is using a GPU process anyway?!)
https://hg.mozilla.org/mozilla-central/rev/a43b68bbe67db228211e414a6e33490ce46a8ba4 (version guard seems fine so doubtful)
https://hg.mozilla.org/mozilla-central/rev/d029091d7fd831b30cbc4a660c03f962b20fa44e (don't see how it can NPE or anything)
Comment 7 (Reporter) • 5 years ago
(In reply to Gian-Carlo Pascutto [:gcp] from comment #6)
I assume we have no specific STR and no usable error info from the field so you're SOL here. Is it device or chipset specific?
Correct. We have no STR and no clear correlation to device or chipset.
Wild swings at a guess:
https://hg.mozilla.org/mozilla-central/rev/86c7622bc218906e28cf73c18c7e3b867d7839e3 (silent failure instead of NPE due to mistaken logic?)
This change to shut down early instead of crashing might be interesting. It might explain why we don't have crash reports for a content process crash corresponding to this parent process crash.
https://hg.mozilla.org/mozilla-central/rev/4abfc5c2a4ee418302a99a2662a78f88b84d44b9 (because the stack claims it's the GPU process, related https://hg.mozilla.org/mozilla-central/rev/c737fc642ddcaa5846383f906ec3f24145d4a7a5 patch is in CompositorBridgeChild...I'm surprised Android is using a GPU process anyway?!)
We're not using a GPU process on Android. I am told (bug 1548525 comment 4) these function names are misleading because the same code is used for connecting to a GPU process or GPU code in the parent process.
Here is an alternate regression range (GV Build ID 20190508111321 to 20190520141152) based on Socorro's trend graphs instead of crash counts by Fenix's GV versions. Unfortunately, I don't see any obvious regression candidates.
Comment 8 • 5 years ago
Unfortunately, I don't see any obvious regression candidates.
Huge list.
https://hg.mozilla.org/releases/mozilla-beta/rev/74252063fc9ec6f72bd27968444ac5de397c6bc1 (can't tell for sure if this is run at launch, but it's a lot of new code)
https://hg.mozilla.org/releases/mozilla-beta/rev/4fbfc8798cad0ecc7ae867ed943b91b24789a500 (maybe it's stripping too much now? But that would cause a total failure, not an intermittent?)
https://hg.mozilla.org/releases/mozilla-beta/rev/e00d479b6a92650637b9347d9ae18bd8da3b9493 (subtle ARM JIT bug? also seems unlikely)
Comment 9 (Reporter) • 5 years ago
There is hope that a recently fixed Fenix crash may have been triggering this GeckoView crash: since the fix, this crash signature's volume has been decreasing. Hopefully that's not just because Fenix beta testers are less active on the weekend...
Comment 10 • 5 years ago
Adding [@ libxul.so@0xaae770 | libxul.so@0xdfc4bc] as Dylan symbolicated those to the endpoints issue.
Comment 11 (Reporter) • 5 years ago
Fenix MVP is going to release with or without this fix, so it's not technically a Fenix MVP blocker. I'll move this bug to the next Fenix milestone (M7).
Comment 12 • 5 years ago
This is almost surely a GeckoView bug, so I'm putting it back in the right component.
We shipped this fix to our nightly users and it seems promising so far. I'll update this bug when we have more data.
Comment 13 • 5 years ago
The java.lang.RuntimeException: at org.mozilla.gecko.process.GeckoProcessManager.start(GeckoProcessManager.java) signature is quite high in Fenix release, with over 7K crashes: https://bit.ly/2Luq2Li
Comment 14 • 5 years ago
Adding another comment here after we shipped Fenix 1.0.1. My device shows the Gecko ID for that release as 20190611143747. The signature in comment 13 shows 9218 crashes for that build ID. It seems hard to believe we could generate that many crashes after shipping, but it also looks as if people were crashing with that build ID as far back as 7-4 in crash stats.
Comment 15 • 5 years ago
(In reply to Marcia Knous [:marcia - needinfo? me] from comment #14)
Adding another comment here after we shipped Fenix 1.0.1. My device shows the Gecko ID for that release as 20190611143747. The signature in Comment 13 shows 9218 crashes for that build ID. It seems hard to believe we could generate that many crashes after shipping, but it also looks as if people were crashing as far back as 7-4 with that build ID in crash stats.
1.0.1 did not update the version of GV being shipped, so the build id is the same as in 1.0.0.
Comment 16 (Reporter) • 5 years ago
Bobby is rewriting the content process launching code in e10s-multi bug 1530770. That might fix or reveal the unknown root cause of the content process launch failures.
Comment 17 • 5 years ago
Added exception handling for the expected failure case.
Comment 18 (Reporter) • 5 years ago
Unassigning Elliot because this code will change when we land e10s-multi. James says we don't want to catch this exception because it is a diagnostic to help us monitor how frequent this fatal error is in the wild.
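The trade-off described here can be sketched as follows (hypothetical names, not actual GeckoView code): catching the launch-failure exception converts a reportable crash into a silent failure, which removes exactly the signal we want to monitor.

```java
// Hypothetical sketch of the trade-off in comment 18. Swallowing the launch
// RuntimeException hides the fatal error from crash telemetry (the user
// likely just sees a blank tab); letting it propagate keeps the error
// visible and countable in crash stats.
final class LaunchPolicy {
    static boolean launchSwallowing(Runnable launch) {
        try {
            launch.run();
            return true;
        } catch (RuntimeException e) {
            // Silent failure: no crash report, so the error rate in the
            // wild can no longer be monitored from this signal.
            return false;
        }
    }
}
```

Leaving the exception uncaught is the deliberate choice here: the crash itself is the diagnostic.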
Comment 19 (Reporter) • 5 years ago
Dropping to P2 because we expect e10s-multi to rewrite a lot of this process launching code. This crash might go away then.
In the meantime, we'll keep monitoring the crash volume. Over the last month, we've seen about 500 crash reports from GV 68.0b in Fenix, but we haven't seen any from GV 69.0b. Perhaps this crash was fixed by Fenix's process warming code change?
Comment 20 • 5 years ago
(In reply to Chris Peterson [:cpeterson] from comment #19)
Dropping to P2 because we expect e10s-multi to rewrite a lot of this process launching code. This crash might go away then.
In the meantime, we'll keep monitoring the crash volume. Over the last month, we've seen about 500 crash reports from GV 68.0b in Fenix, but we haven't seen any from GV 69.0b. Perhaps this crash was fixed by Fenix's process warming code change?
We have seen a lot from 69.0b. The signature of this bug was changed by bug 1555447; it is now here: https://crash-stats.mozilla.com/signature/?product=Fenix&signature=java.lang.RuntimeException%3A%20at%20org.mozilla.gecko.process.GeckoProcessManager.start%28GeckoProcessManager.java%29&version=69.0b0&date=%3C2019-08-29T18%3A34%3A06%2B00%3A00&date=%3E%3D2019-08-22T18%3A34%3A06%2B00%3A00
Comment 21 (Reporter) • 5 years ago
P2 while we wait for e10s-multi. Tagging as [geckoview:fenix:m9] so we revisit this bug in August.
Comment 22 (Reporter) • 5 years ago
The crash volume is still pretty high: about 200 reports per day.
Comment 24 • 5 years ago
Adding a related signature that is showing up in FennecAndroid crash stats.
Comment 25 • 5 years ago
This no longer appears to be happening.