Open Bug 1557449 Opened 6 months ago Updated 20 days ago

Crash in [@ mozilla::layers::PCompositorManager::CreateEndpoints]

Categories

(GeckoView :: General, defect, P2, critical)

Platform: Unspecified, Android

Tracking

(firefox67 unaffected, firefox68 wontfix, firefox69 wontfix, firefox70 ?)


People

(Reporter: cpeterson, Unassigned)

References

(Depends on 1 open bug)

Details

(Keywords: crash, regression, topcrash)

Crash Data

Attachments

(1 obsolete file)

This bug is for crash report bp-438d7634-2103-4bcf-a85c-56ac50190416.

Top 10 frames of crashing thread:

0 libxul.so mozilla::layers::PCompositorManager::CreateEndpoints ipc/glue/ProtocolUtils.h:852
1 libxul.so mozilla::gfx::GPUProcessManager::CreateContentBridges gfx/ipc/GPUProcessManager.cpp:797
2 libxul.so mozilla::dom::ContentParent::LaunchSubprocessInternal const dom/ipc/ContentParent.cpp:2539
3 libxul.so mozilla::dom::ContentParent::LaunchSubprocessInternal dom/ipc/ContentParent.cpp:2211
4 libxul.so mozilla::dom::ContentParent::LaunchSubprocessSync dom/ipc/ContentParent.cpp:2234
5 libxul.so mozilla::dom::ContentParent::GetNewOrUsedBrowserProcess dom/ipc/ContentParent.cpp:899
6 libxul.so mozilla::dom::ContentParent::CreateBrowser dom/ipc/ContentParent.cpp:1142
7 libxul.so nsFrameLoader::TryRemoteBrowser dom/base/nsFrameLoader.cpp:2605
8 libxul.so nsFrameLoader::ShowRemoteFrame dom/base/nsFrameLoader.cpp:868
9 libxul.so nsFrameLoader::Show dom/base/nsFrameLoader.cpp:737

This is a Fenix crash.

Depends on: 1548525
Depends on: 1555447

Here is a recent history of Fenix's GV versions (from Gecko.kt) and their CreateEndpoints crash counts:

GV Version   Build ID        Date Vendored  Crash Count  Crashes/Day  Changeset
68 Beta      20190604110028  2019-06-05     TBD          TBD
68 Beta      20190527103257  2019-05-28     3928         655
68 Beta      20190520141152  2019-05-22     2567         183
68 Nightly   20190508111321  2019-05-08     288          58           65a693623cee0837b4ad0d23241c84cd3ea23e3a
68 Nightly   20190503041749  2019-05-03     0            0            083106d8fc7407c880a3a044c83d4e15e5961063
68 Nightly   20190429095544  2019-05-03     0            0
68 Nightly   20190422094240  2019-04-22     0            0

Looks like the crash first appeared in Build ID 20190508111321 and got worse in subsequent versions. I don't know how many Fenix users we had for these different versions.

Here is the pushlog between Build ID 20190503041749 and 20190508111321:

https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=083106d8fc7407c880a3a044c83d4e15e5961063&tochange=65a693623cee0837b4ad0d23241c84cd3ea23e3a

That pushlog includes ARM64 PGO bug 1543212, and 89.61% of these CreateEndpoints crashes are from ARM64 devices. Perhaps this is a PGO problem?


I no longer think this is an ARM64-specific or PGO bug. ARMv7 has a smaller crash volume than ARM64, but the same trend and regression range. I think Fenix users just have a lot of ARM64 devices.

:cpeterson, since this bug is a regression, could you fill (if possible) the regressed_by field?
For more information, please visit auto_nag documentation.

Flags: needinfo?(cpeterson)

(In reply to Release mgmt bot [:sylvestre / :calixte] from comment #3)

:cpeterson, since this bug is a regression, could you fill (if possible) the regressed_by field?

We know this bug is a regression because we have a regression range, but we don't yet know which bug caused the regression.

Flags: needinfo?(cpeterson)

Jed or Gian-Carlo, do you have any suggestions for how we can narrow down this GeckoView top crash?

GeckoView's main process is crashing because a content process fails to launch, but we don't know why the content process fails to launch.

Jed cleaned up the GeckoChildProcessHost.cpp's handling of LaunchAndroidService errors (in bug 1548525). This might turn the main process crashes into blank tabs. That's a better user experience but will be harder to diagnose from user bug reports. The GeckoView team added a Java exception (bug 1555447) closer to the error point inside LaunchAndroidService so we can stuff more diagnostic information in an exception report. (Fenix has not updated to that GeckoView version yet, so we have no data.)
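The idea in bug 1555447 (throwing closer to the error point so the exception message carries diagnostics) can be sketched as follows. This is a hypothetical illustration, not the actual GeckoView code; the class name, `buildLaunchError` helper, and its parameters are all invented for the example.

```java
// Hypothetical sketch of packing launch diagnostics into an exception
// message near the failure point. The real error path lives in
// GeckoProcessManager / GeckoChildProcessHost; this helper is invented.
public class ProcessLaunchDiagnostics {
    static RuntimeException buildLaunchError(String processType, int bindResult,
                                             boolean serviceResolved) {
        // Everything cheap to collect at the error point goes into the
        // message, because the message is all the crash report will show.
        String msg = String.format(
            "Failed to launch %s process: bindResult=%d serviceResolved=%b",
            processType, bindResult, serviceResolved);
        return new RuntimeException(msg);
    }

    public static void main(String[] args) {
        System.out.println(buildLaunchError("tab", -1, true).getMessage());
    }
}
```

The payoff is that the exception report's signature and message localize the failure, instead of a distant native crash in CreateEndpoints that says nothing about why the launch failed.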

I think this crash might be a regression in this (five-day) regression range:

https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=083106d8fc7407c880a3a044c83d4e15e5961063&tochange=65a693623cee0837b4ad0d23241c84cd3ea23e3a

Flags: needinfo?(jld)
Flags: needinfo?(gpascutto)

Jed or Gian-Carlo, do you have any suggestions for how we can narrow down this GeckoView top crash?
GeckoView's main process is crashing because a content process fails to launch, but we don't know why the content process fails to launch.

I suggest asking the Android team, as there's no reason to believe this is a generic IPC problem nor that the problem is in the IPC layer to begin with. I assume we have no specific STR and no usable error info from the field so you're SOL here. Is it device or chipset specific?

I went through that list and, excluding the ARM64-specific changes (comment 2), I see few specific candidates.

Wild swings at a guess:
https://hg.mozilla.org/mozilla-central/rev/86c7622bc218906e28cf73c18c7e3b867d7839e3 (silent failure instead of NPE due to mistaken logic?)
https://hg.mozilla.org/mozilla-central/rev/4abfc5c2a4ee418302a99a2662a78f88b84d44b9 (because the stack claims it's the GPU process, related https://hg.mozilla.org/mozilla-central/rev/c737fc642ddcaa5846383f906ec3f24145d4a7a5 patch is in CompositorBridgeChild...I'm surprised Android is using a GPU process anyway?!)

https://hg.mozilla.org/mozilla-central/rev/a43b68bbe67db228211e414a6e33490ce46a8ba4 (version guard seems fine so doubtful)
https://hg.mozilla.org/mozilla-central/rev/d029091d7fd831b30cbc4a660c03f962b20fa44e (don't see how it can NPE or anything)

Flags: needinfo?(jld)
Flags: needinfo?(gpascutto)

(In reply to Gian-Carlo Pascutto [:gcp] from comment #6)

I assume we have no specific STR and no usable error info from the field so you're SOL here. Is it device or chipset specific?

Correct. We have no STR and no clear correlation to device or chipset.

https://crash-stats.mozilla.com/signature/?product=Fenix&signature=mozilla%3A%3Alayers%3A%3APCompositorManager%3A%3ACreateEndpoints&date=%3E%3D2018-12-07T19%3A15%3A00.000Z&date=%3C2019-06-07T19%3A15%3A00.000Z#summary

Wild swings at a guess:
https://hg.mozilla.org/mozilla-central/rev/86c7622bc218906e28cf73c18c7e3b867d7839e3 (silent failure instead of NPE due to mistaken logic?)

This change to shut down early instead of crashing might be interesting. That might explain why we don't have crash reports for a content process crash corresponding to this parent process crash.

https://hg.mozilla.org/mozilla-central/rev/4abfc5c2a4ee418302a99a2662a78f88b84d44b9 (because the stack claims it's the GPU process, related https://hg.mozilla.org/mozilla-central/rev/c737fc642ddcaa5846383f906ec3f24145d4a7a5 patch is in CompositorBridgeChild...I'm surprised Android is using a GPU process anyway?!)

We're not using a GPU process on Android. I am told (bug 1548525 comment 4) these function names are misleading because the same code is used for connecting either to a GPU process or to GPU code in the parent process.
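A hedged illustration of why those frame names mislead: a single "connect to the compositor" entry point can serve both the separate-GPU-process and in-parent-process configurations, so stacks go through code named after the GPU process even on Android, which takes the in-process branch. All names here are invented for the sketch.

```java
// Hypothetical sketch: one connect entry point serving two configurations.
// Android takes the in-parent-process branch but still runs through shared
// code whose names mention the GPU process.
interface CompositorEndpoint {
    String location();
}

public class CompositorConnect {
    static CompositorEndpoint connect(boolean useGpuProcess) {
        if (useGpuProcess) {
            return () -> "gpu-process";   // desktop with a separate GPU process
        }
        return () -> "parent-process";    // Android: GPU code in the parent
    }

    public static void main(String[] args) {
        System.out.println(connect(false).location());
    }
}
```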

Here is an alternate regression range (GV Build ID 20190508111321 to 20190520141152) based on Socorro's trend graphs instead of crash counts by Fenix's GV versions. Unfortunately, I don't see any obvious regression candidates.

https://hg.mozilla.org/releases/mozilla-beta/pushloghtml?fromchange=65a693623cee0837b4ad0d23241c84cd3ea23e3a&tochange=2f46b19ec841c939b84fa40be8b326a11141e6da

Unfortunately, I don't see any obvious regression candidates.

Huge list.

https://hg.mozilla.org/releases/mozilla-beta/rev/74252063fc9ec6f72bd27968444ac5de397c6bc1 (can't tell for sure if this is run at launch, but it's a lot of new code)
https://hg.mozilla.org/releases/mozilla-beta/rev/4fbfc8798cad0ecc7ae867ed943b91b24789a500 (maybe it's stripping too much now? But that would cause a total failure, not an intermittent?)
https://hg.mozilla.org/releases/mozilla-beta/rev/e00d479b6a92650637b9347d9ae18bd8da3b9493 (subtle ARM JIT bug? also seems unlikely)

There is hope that a recent Fenix bug may have been triggering this GeckoView crash. That Fenix bug was fixed recently, and this crash signature's volume has been decreasing. Hopefully that's not just because Fenix beta testers are less active on the weekend...

https://crash-stats.mozilla.com/signature/?signature=mozilla%3A%3Alayers%3A%3APCompositorManager%3A%3ACreateEndpoints&date=%3E%3D2019-05-10T15%3A34%3A00.000Z&date=%3C2019-06-10T15%3A34%3A00.000Z#graphs

Adding [@ libxul.so@0xaae770 | libxul.so@0xdfc4bc] as Dylan symbolicated those to the endpoints issue.

Crash Signature: [@ mozilla::ipc::CreateEndpoints<T>] [@ mozilla::layers::PCompositorManager::CreateEndpoints] → [@ mozilla::ipc::CreateEndpoints<T>] [@ mozilla::layers::PCompositorManager::CreateEndpoints] [@ libxul.so@0xaae770 | libxul.so@0xdfc4bc]
Crash Signature: [@ mozilla::ipc::CreateEndpoints<T>] [@ mozilla::layers::PCompositorManager::CreateEndpoints] [@ libxul.so@0xaae770 | libxul.so@0xdfc4bc] → [@ mozilla::ipc::CreateEndpoints<T>] [@ mozilla::layers::PCompositorManager::CreateEndpoints] [@ libxul.so@0xaae770] [libxul.so@0xdfc4bc]
Crash Signature: [@ mozilla::ipc::CreateEndpoints<T>] [@ mozilla::layers::PCompositorManager::CreateEndpoints] [@ libxul.so@0xaae770] [libxul.so@0xdfc4bc] → [@ mozilla::ipc::CreateEndpoints<T> ] [@ mozilla::layers::PCompositorManager::CreateEndpoints ] [@ libxul.so@0xaae770 | libxul.so@0xdfc4bc ]
Crash Signature: [@ mozilla::ipc::CreateEndpoints<T> ] [@ mozilla::layers::PCompositorManager::CreateEndpoints ] [@ libxul.so@0xaae770 | libxul.so@0xdfc4bc ] → [@ mozilla::ipc::CreateEndpoints<T> ] [@ mozilla::layers::PCompositorManager::CreateEndpoints ] [@ libxul.so@0xaae770 | libxul.so@0xdfc4bc ] [@ java.lang.RuntimeException: at org.mozilla.gecko.process.GeckoProcessManager.start(GeckoProcessManager.java)]

Fenix MVP is going to release with or without this fix, so it's not technically a Fenix MVP blocker. I'll move this bug to the next Fenix milestone (M7).

Whiteboard: [geckoview:fenix:m6] → [geckoview:fenix:m7]

This is almost surely a GeckoView bug so I'm putting it back in the right component.

We shipped this fix to our nightly users and it seems promising so far. I'll update this bug when we have more data.

Component: IPC → General
Product: Core → GeckoView

java.lang.RuntimeException: at org.mozilla.gecko.process.GeckoProcessManager.start(GeckoProcessManager.java) signature is quite high in Fenix release, with over 7K crashes: https://bit.ly/2Luq2Li

Adding another comment here after we shipped Fenix 1.0.1. My device shows the Gecko ID for that release as 20190611143747. The signature in Comment 13 shows 9218 crashes for that build ID. It seems hard to believe we could generate that many crashes after shipping, but it also looks as if people were crashing as far back as 7-4 with that build ID in crash stats.

(In reply to Marcia Knous [:marcia - needinfo? me] from comment #14)

Adding another comment here after we shipped Fenix 1.0.1. My device shows the Gecko ID for that release as 20190611143747. The signature in Comment 13 shows 9218 crashes for that build ID. It seems hard to believe we could generate that many crashes after shipping, but it also looks as if people were crashing as far back as 7-4 with that build ID in crash stats.

1.0.1 did not update the version of GV being shipped, so the build id is the same as in 1.0.0.

Bobby is rewriting the content process launching code in e10s-multi bug 1530770. That might fix or reveal the unknown root cause of the content process launch failures.

Depends on: android-e10s-multi
Assignee: nobody → estirling

Unassigning Elliot because this code will change when we land e10s-multi. James says we don't want to catch this exception because it is a diagnostic that helps us monitor how frequently this fatal error occurs in the wild.
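The reasoning above (an exception left uncaught on purpose so crash reporting can count it) can be sketched like this. `launchOrDie` and its boolean flag are hypothetical, not the real GeckoProcessManager API; the test harness catches the exception only to inspect it.

```java
// Hypothetical sketch: a deliberately unhandled RuntimeException acts as a
// diagnostic signal. Callers do not catch it, so the platform's
// uncaught-exception reporter turns each occurrence into a countable report.
public class DiagnosticCrash {
    static void launchOrDie(boolean launchSucceeded) {
        if (!launchSucceeded) {
            // Intentionally NOT caught anywhere in the launch path:
            // swallowing it here would hide the failure from crash stats.
            throw new RuntimeException("content process failed to launch");
        }
    }

    public static void main(String[] args) {
        try {
            launchOrDie(false);
        } catch (RuntimeException e) {
            System.out.println("would be reported as: " + e.getMessage());
        }
    }
}
```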

Assignee: estirling → nobody
Attachment #9085736 - Attachment is obsolete: true

Dropping to P2 because we expect e10s-multi to rewrite a lot of this process launching code. This crash might go away then.

In the meantime, we'll keep monitoring the crash volume. Over the last month, we've seen about 500 crash reports from GV 68.0b in Fenix, but we haven't seen any from GV 69.0b. Perhaps this crash was fixed by Fenix's process warming code change?

Priority: P1 → P2

(In reply to Chris Peterson [:cpeterson] from comment #19)

Dropping to P2 because we expect e10s-multi to rewrite a lot of this process launching code. This crash might go away then.

In the meantime, we'll keep monitoring the crash volume. Over the last month, we've seen about 500 crash reports from GV 68.0b in Fenix, but we haven't seen any from GV 69.0b. Perhaps this crash was fixed by Fenix's process warming code change?

We have seen a lot from 69.0b. The signature of this bug was changed by bug 1555447; it is now here: https://crash-stats.mozilla.com/signature/?product=Fenix&signature=java.lang.RuntimeException%3A%20at%20org.mozilla.gecko.process.GeckoProcessManager.start%28GeckoProcessManager.java%29&version=69.0b0&date=%3C2019-08-29T18%3A34%3A06%2B00%3A00&date=%3E%3D2019-08-22T18%3A34%3A06%2B00%3A00
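As a side note, a Socorro signature query URL like the one above can be built mechanically by percent-encoding the signature string. This is a hypothetical helper, not part of any Gecko code; note that `URLEncoder` form-encodes spaces as '+' where the hand-written links use %20, and query-string parsers conventionally treat the two the same.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: build a crash-stats query URL for the new Java
// signature by form-encoding the raw signature string.
public class SignatureQuery {
    public static void main(String[] args) throws Exception {
        String signature = "java.lang.RuntimeException: at "
            + "org.mozilla.gecko.process.GeckoProcessManager.start(GeckoProcessManager.java)";
        String url = "https://crash-stats.mozilla.com/signature/?product=Fenix&signature="
            + URLEncoder.encode(signature, StandardCharsets.UTF_8.name());
        System.out.println(url);
    }
}
```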

P2 while we wait for e10s-multi. Tagging as [geckoview:fenix:m9] so we revisit this bug in August.

Whiteboard: [geckoview:fenix:m7] → [geckoview:fenix:m9]

The crash volume is still pretty high: about 200 reports per day.

Rank: 22
Whiteboard: [geckoview:fenix:m9]
Blocks: 1587994
Duplicate of this bug: 1587994

Adding a related signature that is showing up in FennecAndroid crash stats.

Crash Signature: org.mozilla.gecko.process.GeckoProcessManager.start(GeckoProcessManager.java)] → org.mozilla.gecko.process.GeckoProcessManager.start(GeckoProcessManager.java)] [@ java.lang.RuntimeException ]