Crash in [@ mozilla::dom::ContentChild::~ContentChild]
Categories
(Core :: DOM: Content Processes, defect, P2)
Tracking

| | Tracking | Status |
|---|---|---|
| firefox-esr115 | --- | unaffected |
| firefox-esr128 | --- | unaffected |
| firefox129 | --- | unaffected |
| firefox130 | + | wontfix |
| firefox131 | --- | wontfix |
| firefox132 | --- | disabled |
| firefox133 | --- | disabled |
People
(Reporter: gsvelto, Assigned: mccr8, NeedInfo)
References
(Regression)
Details
(Keywords: crash, regression, topcrash)
Crash Data
Attachments
(1 file, 1 obsolete file)
2.04 KB, text/plain
Crash report: https://crash-stats.mozilla.org/report/index/6c95ec01-b53b-4a69-b815-dbed50240829
MOZ_CRASH Reason: MOZ_CRASH(Content Child shouldn't be destroyed.)
Top 10 frames:
0 libxul.so mozilla::dom::ContentChild::~ContentChild() dom/ipc/ContentChild.cpp:653
1 libxul.so mozilla::dom::ContentProcess::~ContentProcess() dom/ipc/ContentProcess.cpp:67
2 libxul.so mozilla::DefaultDelete<mozilla::ipc::ProcessChild>::operator()(mozilla::ipc::... mfbt/UniquePtr.h:460
2 libxul.so mozilla::UniquePtr<mozilla::ipc::ProcessChild, mozilla::DefaultDelete<mozilla... mfbt/UniquePtr.h:302
2 libxul.so mozilla::UniquePtr<mozilla::ipc::ProcessChild, mozilla::DefaultDelete<mozilla... mfbt/UniquePtr.h:250
2 libxul.so XRE_InitChildProcess(int, char**, XREChildData const*) toolkit/xre/nsEmbedFunctions.cpp:655
3 libmozglue.so Java_org_mozilla_gecko_mozglue_GeckoLoader_nativeRun mozglue/android/APKOpen.cpp:400
4 libart.so libart.so@0x351e30
5 ? @0x0000000112d9a1ac
We're tripping over an existing assertion. This seems to be hitting Android extremely hard; some volume was already visible on beta and nightly. Desktop Firefox is also affected, but with a much smaller volume and only on nightly and beta. The volume here is high enough that we should act on this quickly.
Updated•3 months ago
Comment 1•3 months ago
Jari, could we get this bug triaged by your team asap? This may become a release blocker. Thanks!
Comment 2•3 months ago
The assertion was introduced by bug 1281223, apparently to debug a problem back then. I wonder if our invariants have changed and it is simply no longer true, but I am not sure whom to ask. I was not able to find any obvious bug that could have caused this either. Nika, are you aware of something that could have had an impact here? In general I would assume/hope that nothing bad enough can happen without that assertion to justify bothering our users with this crash.
Updated•3 months ago
Comment 3•3 months ago
Just adding a few more datapoints:
- These are all content process crashes (of course, because of ~ContentChild).
- The first build ID I see this starting with seems to be 20240723211328 for Firefox and 20240724215903 for Fenix (both nightly).
- Bug 1834864 landed just a few days earlier but was immediately uplifted to beta 129, so that does not seem to fit with the beta crashes.
- Bug 1728331 looks pretty relevant for what it does, but it has already been fixed in 129, too.
- The first Fenix beta crashes seem to arrive with beta 5. I cannot spot anything outstanding in that push range, though, and the earlier 130 beta uplifts also seem unrelated to me at first glance.
- On the incident channel, :aryx said that these are all startup crashes (with respect to the single process).
The stack trace indicates that XRE_InitChildProcess reaches its very end, which normally should never happen, as IIRC we run our shutdown and exit(0) on top of the message loop. The only case where this can happen seems to be a failure during process->Init(...), which would immediately return and thus destroy the UniquePtr<ProcessChild> process declared on the stack.
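To make that control flow concrete, here is a minimal, self-contained C++ sketch. It is not the actual Gecko code: the class and function names below are simplified stand-ins, and the destructor just prints instead of calling MOZ_CRASH. It only illustrates how an early Init() failure makes the owning unique pointer destroy the child object at scope exit, which is exactly where the destructor assertion fires.

```cpp
#include <cstdio>
#include <memory>

// Simplified stand-in for ProcessChild/ContentChild; not the real Gecko classes.
struct FakeContentProcess {
  bool Init() {
    // Pretend initialization fails, e.g. because a required resource
    // (the omnijar in the real crash) is missing or corrupted.
    return false;
  }
  ~FakeContentProcess() {
    // In Gecko this is where MOZ_CRASH("Content Child shouldn't be destroyed.")
    // would fire; modeled here as a message.
    std::fprintf(stderr, "destructor reached: the real code would MOZ_CRASH here\n");
  }
};

// Rough analogue of the tail end of XRE_InitChildProcess: if Init() fails we
// return immediately, the unique_ptr goes out of scope and destroys the process
// object, instead of running the message loop and leaving via QuickExit().
int InitChildProcessSketch() {
  auto process = std::make_unique<FakeContentProcess>();
  if (!process->Init()) {
    return 1;  // early return -> ~FakeContentProcess runs here
  }
  // MessageLoop::Run() would go here and, in release builds, never return.
  return 0;
}

int main() { return InitChildProcessSketch(); }
```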
So it appears to me that we encounter an error during process initialization and that will make the dtor angry (for probably no good reason in that case). Bug 1471720 landed July 23 and introduced a new failure case during startup if the omnijar is corrupted.
For Fenix I would expect to see the corresponding parent process crashes on bug 1909700, but I do not see them. I would assume that these crashes are just a symptom of a corrupted installation.
Comment 4•3 months ago
The bug is linked to a topcrash signature, which matches the following criteria:
- Top 10 AArch64 and ARM crashes on nightly
- Top 10 AArch64 and ARM crashes on beta
- Top 10 AArch64 and ARM crashes on release
For more information, please visit BugBot documentation.
Comment 5•3 months ago
The bug is marked as tracked for firefox130 (beta) and tracked for firefox131 (nightly). However, the bug still isn't assigned.
:jstutte, could you please find an assignee for this tracked bug? If you disagree with the tracking decision, please talk with the release managers.
For more information, please visit BugBot documentation.
Comment 6•3 months ago
I can take this for investigation, but I might only be able to provide a patch that hides the crashes for content processes. But first I'd want to understand what happens to the parent processes there.
Comment 7•3 months ago
Set release status flags based on info from the regressing bug 1471720
Comment 8•3 months ago
It looks like the changes from bug 1471720 that affect the parent process do not have any effect on Fenix. Indeed, there are still ServiceWorkerRegistrar crashes coming in for newer builds from Fenix. There must be something different about how GeckoView uses XPCOM, apparently.
In any case, I wonder how the Fenix parent process can even try to start child processes before it crashes itself. Unless we are in a situation where the parent somehow started normally, but then the files got corrupted and from then on each child cannot start, which might be an indication of an update running under the hood while the browser is still running? Is that even a possible scenario for Android apps?
Edit: Talking with other, more Android-savvy people, we think this is not possible. At least it really should not be.
Reporter
Comment 9•3 months ago
I've dredged through a few minidumps and all the aggregations but I can't find a pattern. It's particularly odd that we get most crashes from a single vendor, but it's not just that one, and there's nothing in common between the crashes on their devices and the crashes on other vendors' devices. If there weren't this much skew towards Vivo I'd say that the distribution of the crashes was random, but apparently it isn't.
Reporter
Comment 10•3 months ago
I've noticed that among the nightly/beta crashes we have Pixel 9 Pro XL, Pixel 8 Pro, Pixel 8a and Fairphone 5 phones being affected. Does anyone have one of those devices and see if the crash is reproducible there?
Comment 11•3 months ago
(In reply to Gabriele Svelto [:gsvelto] from comment #10)
I've noticed that among the nightly/beta crashes we have Pixel 9 Pro XL, Pixel 8 Pro, Pixel 8a and Fairphone 5 phones being affected. Does anyone have one of those devices and see if the crash is reproducible there?
I have a Pixel 8a, effectively on default settings (new phone), it seems fine currently (no crashes). That said, when I first started Firefox nightly on the device and was trying to activate Firefox Sync via QR code, about:crashes says I got 6 crashes in a row. But then it got better? Those crashes were not submitted and I'm submitting them now and will take a look.
Comment 12•3 months ago
Hm, the crashes all fail to submit.
Comment 13•3 months ago
I prepared https://treeherder.mozilla.org/jobs?repo=try&revision=876d372f44cd5ed7fdf31b831c92065c5f60b281 which should moot the dtor crash in case of init failures, I think. I assume this would either result in other crashes or send some error back to the parent, but I did not follow through on that yet.
Arrgh, I have some moz-phab update issues on the machine I made this with...
Comment 14•3 months ago
Comment 15•3 months ago
This crash prevents us from having better error propagation on child process startup failures. This patch will only moot those crashes; it will not improve that error reporting.
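For illustration only, here is a rough, self-contained C++ sketch of what "mooting" the destructor crash for init failures could look like. The names are invented and the actual patch attached to this bug may take a different approach; the idea is just to skip the fatal assertion when initialization never completed.

```cpp
#include <cstdio>
#include <cstdlib>
#include <memory>

// Invented class; not the real ContentChild/ContentProcess code.
class ChildSketch {
 public:
  bool Init(bool simulateFailure) {
    if (simulateFailure) {
      return false;  // bail out before marking initialization as complete
    }
    mInitialized = true;
    return true;
  }

  ~ChildSketch() {
    if (!mInitialized) {
      // Init() never completed, so being destroyed is expected: don't crash.
      // (A real patch might additionally report the failure to the parent.)
      std::fprintf(stderr, "destroyed before Init() completed; not crashing\n");
      return;
    }
    // Destruction after a successful init is the genuine invariant violation.
    std::fprintf(stderr, "unexpected destruction after successful init\n");
    std::abort();
  }

 private:
  bool mInitialized = false;
};

int main() {
  auto child = std::make_unique<ChildSketch>();
  child->Init(/* simulateFailure */ true);
  child.reset();  // prints the "not crashing" message instead of aborting
  return 0;
}
```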
Comment 16•3 months ago
(In reply to Jens Stutte [:jstutte] from comment #8)
It looks like the changes from bug 1471720 that affect the parent process do not have any effect on Fenix. Indeed, there are still ServiceWorkerRegistrar crashes coming in for newer builds from Fenix. There must be something different about how GeckoView uses XPCOM, apparently.
GeckoView does have a different entry point for the parent process; namely, I believe it calls GeckoStart (https://searchfox.org/mozilla-central/rev/cb5faf5dd5176494302068c553da97b4d08aa339/toolkit/xre/nsAndroidStartup.cpp#25). That being said, that function should be calling XRE_main, after which point it's very similar to the parent process.
The big notable difference around omnijar startup is probably that in the parent process, Omnijar::Init is called in a different place on Android (https://searchfox.org/mozilla-central/rev/cb5faf5dd5176494302068c553da97b4d08aa339/toolkit/xre/nsAppRunner.cpp#6161-6172). Because the omnijar will already be initialized at that point, this in turn means we don't try to initialize the omnijar in the normal place in NS_InitXPCOM, so we won't get the NS_ERROR_OMNIJAR_CORRUPT error message which you are expecting (https://searchfox.org/mozilla-central/rev/cb5faf5dd5176494302068c553da97b4d08aa339/xpcom/build/XPCOMInit.cpp#381-395).
I wonder if this could actually be related to the parent process validation not happening as expected in some way, though I am currently still unsure how this behaviour could have happened.
Comment 17•3 months ago
A new note: I'm currently not seeing how the content process could be failing to start in new ways after the changes, as we actually initialize the content process omnijar early (https://searchfox.org/mozilla-central/rev/cb5faf5dd5176494302068c553da97b4d08aa339/dom/ipc/ContentProcess.cpp#155), so we wouldn't be hitting the xpcom init codepath anyway.
The only way that could happen is if neither the greomni nor appomni arguments were provided on the command line to the child process (https://searchfox.org/mozilla-central/rev/cb5faf5dd5176494302068c553da97b4d08aa339/xpcom/build/Omnijar.cpp#213-243), which I believe should never happen on Android (as that indicates an unpacked build, which I believe is impossible on Android).
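To illustrate the fallback logic being described, here is a small, self-contained C++ sketch. The helper names are invented and the flag spellings (-greomni/-appomni) are only assumptions about the child command line; the real parsing lives in Omnijar.cpp at the link above. The point is simply: if neither argument is present, the code concludes it is running from an unpacked build and skips the packaged-omnijar path.

```cpp
#include <cstring>
#include <iostream>
#include <string>

// Invented helper: returns the value following the given flag, or "" if absent.
static std::string GetArgValue(int argc, char** argv, const char* flag) {
  for (int i = 1; i + 1 < argc; ++i) {
    if (std::strcmp(argv[i], flag) == 0) {
      return argv[i + 1];
    }
  }
  return "";
}

int main(int argc, char** argv) {
  std::string greOmni = GetArgValue(argc, argv, "-greomni");
  std::string appOmni = GetArgValue(argc, argv, "-appomni");

  if (greOmni.empty() && appOmni.empty()) {
    // Neither argument was provided: treat this as an unpacked build and skip
    // early omnijar init. Comment 17 argues this should be impossible on
    // Android, since Android builds are always packaged.
    std::cout << "no omnijar args: assuming unpacked build\n";
    return 0;
  }

  std::cout << "greomni=" << greOmni << " appomni=" << appOmni << "\n";
  return 0;
}
```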
So perhaps there is some situation where the parent process sees a corrupted omnijar and doesn't fall over with the expected crash (as I explain in comment 16), and somehow we are able to start a content process without an omnijar, and are seeing crashes early in startup? I find it hard to believe that is possible though, without seeing crashes for the parent process.
Reporter
Comment 18•3 months ago
(In reply to Andrew Sutherland [:asuth] (he/him) from comment #11)
I have a Pixel 8a, effectively on default settings (new phone), it seems fine currently (no crashes). That said, when I first started Firefox nightly on the device and was trying to activate Firefox Sync via QR code, about:crashes says I got 6 crashes in a row. But then it got better? Those crashes were not submitted and I'm submitting them now and will take a look.
This is interesting: looking at the distribution of crashes on Socorro shows something similar. Users seem to experience a burst of several crashes in a relatively short amount of time (usually tens of seconds at most). I wonder whether Firefox recovers and restarts afterwards - like it did for you - or whether they never get a chance to load a page. If only we had comments on Fenix crash reports... they'll be coming later this year AFAIK, but it's already too late for this.
Reporter
Comment 19•3 months ago
This is a long shot but since we don't have many leads: I put up a call for help on Mastodon to see if we can find affected users. I also scoured Reddit and there don't seem to be any comments about Firefox for Android being unstable. I'll re-check the crashes tonight to see if I can find users that have been affected repeatedly, or if the pattern of a burst of crashes followed by nothing repeats itself.
Comment 20•3 months ago
Going through every error return path in NS_InitXPCOM in case that's the failing codepath:
- https://searchfox.org/mozilla-central/rev/cb5faf5dd5176494302068c553da97b4d08aa339/xpcom/build/XPCOMInit.cpp#257-260 - Should never be hit, as we definitely haven't called NS_InitXPCOM before.
- https://searchfox.org/mozilla-central/rev/cb5faf5dd5176494302068c553da97b4d08aa339/xpcom/build/XPCOMInit.cpp#318-320 - Can't be hit in a content process due to the XRE_IsParentProcess() condition.
- https://searchfox.org/mozilla-central/rev/cb5faf5dd5176494302068c553da97b4d08aa339/xpcom/build/XPCOMInit.cpp#327-329 - The nsThreadManager was already initialized earlier, so this second call does nothing.
- https://searchfox.org/mozilla-central/rev/cb5faf5dd5176494302068c553da97b4d08aa339/xpcom/build/XPCOMInit.cpp#336-338 - Implementation is actually infallible.
- https://searchfox.org/mozilla-central/rev/cb5faf5dd5176494302068c553da97b4d08aa339/xpcom/build/XPCOMInit.cpp#363-365 - Implementation can only fail if aAppFileLocationProvider is null, which is checked in the caller.
- https://searchfox.org/mozilla-central/rev/cb5faf5dd5176494302068c553da97b4d08aa339/xpcom/build/XPCOMInit.cpp#392-394 - The failure we are suspicious about. Should not be hit on Android, as we'll initialize the omnijar earlier, unless the -greOmni argument is missing or an invalid path.
  - Potentially we're in some weird buggy state where this has happened somehow, but I would expect the parent process to be pretty broken in that case, as we shouldn't have a functional omnijar (where it's set on the command line).
- https://searchfox.org/mozilla-central/rev/cb5faf5dd5176494302068c553da97b4d08aa339/xpcom/build/XPCOMInit.cpp#401-418 - CommandLine::IsInitialized() is true, as it was initialized earlier.
- https://searchfox.org/mozilla-central/rev/cb5faf5dd5176494302068c553da97b4d08aa339/xpcom/build/XPCOMInit.cpp#432-435 - This can only fail if pthread_key_create fails, which seems quite unlikely (and also extremely difficult for us to recover from).
- https://searchfox.org/mozilla-central/rev/cb5faf5dd5176494302068c553da97b4d08aa339/xpcom/build/XPCOMInit.cpp#458-461 - Surprisingly, nsComponentManagerImpl::Init is completely infallible.
This makes me think that unless we're having some issue where the omnijar isn't being initialized, it's unlikely it's this codepath. Looking at the failure paths in the caller (ContentProcess::Init):
- https://searchfox.org/mozilla-central/rev/cb5faf5dd5176494302068c553da97b4d08aa339/dom/ipc/ContentProcess.cpp#123-126 - Mandatory flags; it would be surprising if they were missing.
- https://searchfox.org/mozilla-central/rev/cb5faf5dd5176494302068c553da97b4d08aa339/dom/ipc/ContentProcess.cpp#128-130 - Could potentially fail if we failed to map the preferences shared memory at process startup (https://searchfox.org/mozilla-central/rev/cb5faf5dd5176494302068c553da97b4d08aa339/ipc/glue/ProcessUtils_common.cpp#169-177). I don't think this has changed lately though, so it would be surprising if it started failing now. It could spuriously fail if there is some fairly extreme resource exhaustion, I suppose.
- https://searchfox.org/mozilla-central/rev/cb5faf5dd5176494302068c553da97b4d08aa339/dom/ipc/ContentProcess.cpp#132-135 - Shouldn't ever be hit, as we'll never pass the flags on Android (https://searchfox.org/mozilla-central/rev/cb5faf5dd5176494302068c553da97b4d08aa339/ipc/glue/ProcessUtils_common.cpp#199-200), so we should immediately early-return with success (https://searchfox.org/mozilla-central/rev/cb5faf5dd5176494302068c553da97b4d08aa339/ipc/glue/ProcessUtils_common.cpp#235-239).
- https://searchfox.org/mozilla-central/rev/cb5faf5dd5176494302068c553da97b4d08aa339/dom/ipc/ContentProcess.cpp#141-143 - Should only fail if MOZ_ANDROID_LIBDIR isn't set (https://searchfox.org/mozilla-central/rev/cb5faf5dd5176494302068c553da97b4d08aa339/xpcom/build/BinaryPath.h#138-143). I don't immediately see how https://searchfox.org/mozilla-central/rev/cb5faf5dd5176494302068c553da97b4d08aa339/mobile/android/geckoview/src/main/java/org/mozilla/gecko/mozglue/GeckoLoader.java#220 could be skipped, so I think that's unlikely.
- https://searchfox.org/mozilla-central/rev/cb5faf5dd5176494302068c553da97b4d08aa339/dom/ipc/ContentProcess.cpp#148-150 - Appears to have the same failure path as GetGREDir above.
- https://searchfox.org/mozilla-central/rev/cb5faf5dd5176494302068c553da97b4d08aa339/dom/ipc/ContentProcess.cpp#159-161 - Discussed above.
If all of those steps succeed, that means we should be making it to MessageLoop::Run (https://searchfox.org/mozilla-central/rev/cb5faf5dd5176494302068c553da97b4d08aa339/toolkit/xre/nsEmbedFunctions.cpp#644-645), which should never return in a release build, due to the ProcessChild::QuickExit() call in ContentChild::ActorDestroy (https://searchfox.org/mozilla-central/rev/cb5faf5dd5176494302068c553da97b4d08aa339/dom/ipc/ContentChild.cpp#2207). The actor should have successfully been bound to IPC.
The only other way I can think of which could cause failures is if we failed to run the appshell in the content process, which could cause the function to exit, but it appears that nsBaseAppShell::Run should behave as expected (https://searchfox.org/mozilla-central/rev/cb5faf5dd5176494302068c553da97b4d08aa339/widget/nsBaseAppShell.cpp#141-154), and it doesn't seem to be overridden for Android at first glance.
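To summarize the expected control flow walked through above as a compilable sketch: this is not the real code (the functions below are invented stand-ins for the steps in ContentProcess::Init and XRE_InitChildProcess), it only shows that if every init step succeeds we should block in the message loop and leave via a QuickExit-style call, so the only way to reach the destructor is an early init failure.

```cpp
#include <cstdio>
#include <cstdlib>

// Invented stand-ins, one per failure point discussed above.
static bool ParseMandatoryFlags() { return true; }
static bool MapSharedPrefsMemory() { return true; }
static bool InitOmnijarFromArgs() { return true; }
static bool InitXPCOMSketch() { return true; }

[[noreturn]] static void QuickExitSketch() {
  // Analogue of ProcessChild::QuickExit() in ContentChild::ActorDestroy:
  // the process leaves here, so nothing after the message loop ever runs.
  std::_Exit(0);
}

static void RunMessageLoopSketch() {
  // Stands in for MessageLoop::Run(); in a release build it should only be
  // exited via QuickExitSketch(), never by returning.
  QuickExitSketch();
}

int main() {
  if (!ParseMandatoryFlags() || !MapSharedPrefsMemory() ||
      !InitOmnijarFromArgs() || !InitXPCOMSketch()) {
    // Any failure here returns early, destroying the process object, which is
    // the only path on which the destructor assertion can be reached.
    std::fprintf(stderr, "init failed\n");
    return 1;
  }
  RunMessageLoopSketch();
  // Unreachable if QuickExit really runs; reaching this would mean the loop
  // returned, which is the surprising case discussed in this bug.
  return 0;
}
```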
TL;DR I'm a bit stumped as to what could be causing this failure right now.
Comment 21•3 months ago
:mccr8, is there any chance you could look over the comments above, and see if you can find errors in my reasoning? We're trying to figure out how the specific android crash is happening, and it seems most likely to be due to a failure during content process startup.
Assignee
Comment 22•3 months ago
I don't really know this code well enough to have a better analysis, unfortunately. I wrote some patches so we can try to confirm:
a) that it is actually an init failure, and
b) where it is failing.
Reporter
Comment 23•3 months ago
After poring over the crashes again I've got a theory of what might be happening, but I need to test it. Content processes on Android aren't children of the main process; they're service instances launched via Android's activity manager. While working on the out-of-process crash reporter I noticed that service instances can survive the death of the main process. So I'm wondering if an update could lead to the main process being killed while some launches of child processes which had been queued up still go ahead. These would try to access the omnijar and fail, because the updater already removed it.
Updated•3 months ago
Comment 24•3 months ago
(removing Jari's initial needinfo, things went on a bit in the meantime)
Reporter
Comment 25•3 months ago
I've been looking at crashes grouped by installation time, version and device - to aggregate around a single user, more or less - and it seems that the problem isn't happening just once. There are groups of crashes happening a few tens of seconds to a few minutes apart, which would indicate the user tried to launch Firefox again and encountered the same problem.
Comment 26•3 months ago
These are the first crashes with the diagnostic patches.
Summarizing what we know:
- All (above) crashes so far fall over mozilla::Omnijar::Init(), which fits well with bug 1471720 being the regressing bug.
- We still see no sign of similar crashes from the parent process on Fenix.
- We do not know what the user-perceivable impact of these crashes is (if any).
- We do not know exactly if and how we end up with apparently freshly started content processes without the parent running, see comment 23. Edit: gsvelto noted on Slack that without a parent we should not see the crash reports.
This makes it pretty likely that whatever happened before bug 1471720 surfaced this problem was not really better for the user (probably other opaque crashes).
Updated•3 months ago
Updated•3 months ago
Comment 27•3 months ago
This bug's root cause is probably Android specific and I will not be able to investigate it.
Comment 28•3 months ago
Set release status flags based on info from the regressing bug 1471720
Assignee
Comment 29•3 months ago
Adding the new signature after bug 1915988, although after bug 1915998 it will likely change again.
Updated•3 months ago
Comment 30•3 months ago
(In reply to Andrew Sutherland [:asuth] (he/him) from comment #11)
I have a Pixel 8a, effectively on default settings (new phone), it seems fine currently (no crashes). That said, when I first started Firefox nightly on the device and was trying to activate Firefox Sync via QR code, about:crashes says I got 6 crashes in a row. But then it got better? Those crashes were not submitted and I'm submitting them now and will take a look.
In a 1:1 I went back to look at this and 1 crash of 8 managed to submit: https://crash-stats.mozilla.org/report/index/5f24fd71-f2b1-4f52-903b-dbecb0240827
Confusingly, the about:crashes UI is now different than it was when I posted comment 11. There were submit buttons then, although they failed to submit; now there's just "Share" and "Socorro". (Do the buttons disappear after some amount of time?)
The general steps I took here were:
- I had a brand new Pixel 8a that during setup I hooked it up to my Pixel 6 Pro which had nightly installed so that it migrated the contents of the 6 Pro to the 8a. In terms of app installation, when the process completed, the Play Store was asynchronously downloading all the apps I'd previously installed in the background. I use Firefox Nightly so it eventually got installed.
- I think I left the device alone for a while at this point to download apps. In particular, it's conceivable that enough time passed (hours) that an update for Nightly became available.
- I went to use Firefox nightly and the onboarding flow suggested I sign in to my Firefox Account. I did this, running Firefox Nightly on my linux desktop and going to the URL that results in a QR code being produced.
- On the Android device, when I took a picture of the QR code, the tab crashed, or the tab it navigated to crashed. My memory is not 100% on this, but I think basically I kept taking pictures of the QR code figuring that the tab crash should be transient. But it kept crashing. I believe I then maybe used the back button and tried the flow again, but it still crashed. I am certain I then used the process manager to close Firefox (I hit the rightmost square icon thing to the right of the home button circle in the bottom Android chrome UI, then swiped up on Firefox). I feel like it was still crashing after closing and reopening, so I put it down for a few minutes as I task-switched, and when I came back to it and tried one last time, it worked (without visibly crashing).
- As noted in comment 12, when I found out about this bug I went to about:crashes and none of the crashes had been submitted and when I clicked "submit all" they all failed, and when I clicked on "submit" for each one, they individually failed.
:kaya, does the extra context of setting up the Firefox Account / Firefox Sync change anything about the potential scenario?
Updated•3 months ago
Reporter
Comment 31•3 months ago
We might have to re-evaluate this. :willdurand reached out in the #crashreporting channel and mentioned he saw a significant spike in crash pings coming from extension processes in Fenix:
https://sql.telemetry.mozilla.org/queries/96371#237925
This is the only bug we know of that spiked significantly so there's a very significant probability that this issue and the telemetry spike are correlated.
Comment 32•3 months ago
(In reply to Gabriele Svelto [:gsvelto] from comment #31)
We might have to re-evaluate this. :willdurand reached out in the #crashreporting channel and mentioned he saw a significant spike in crash pings coming from extension processes in Fenix:
https://sql.telemetry.mozilla.org/queries/96371#237925
This is the only bug we know of that spiked significantly so there's a very significant probability that this issue and the telemetry spike are correlated.
If I remove the filter:
AND metrics.string.crash_remote_Type = "extension"
I cannot see any real uptick, as if crashes moved from one content process type to another.
Comment 33•2 months ago
This is a reminder regarding comment #5!
The bug is marked as tracked for firefox130 (release) and tracked for firefox131 (beta). We have limited time to fix this, the soft freeze is in 14 days. However, the bug still isn't assigned.
Comment 34•2 months ago
This is a reminder regarding comment #5!
The bug is marked as tracked for firefox130 (release) and tracked for firefox131 (beta). We have limited time to fix this, the soft freeze is in 8 days. However, the bug still isn't assigned.
Updated•2 months ago
Comment 35•2 months ago
This is a reminder regarding comment #5!
The bug is marked as tracked for firefox131 (beta). We have limited time to fix this, the soft freeze is in 2 days. However, the bug still isn't assigned.
Comment 36•2 months ago
IIUC the latest stats from crash-stats, we now only see the "Android builds are always packaged" crash reason.
That can mean either that the argument was missing from the command line (which sounds unlikely) or that the file was not there or was not accessible due to permissions.
I still do not see any omnijar-related crashes in the parent process. It is as if the content process launches come out of nowhere.
Comment 37•2 months ago
When we first encountered "Android builds are always packaged", :nika wrote on Slack some time ago:
So FWIW we do actually do XRE_GetFileFromPath in the parent process as well, before getting the path which would be passed down to the child (https://searchfox.org/mozilla-central/rev/cc01f11adfacca9cd44a75fd140d2fdd8f9a48d4/toolkit/xre/nsAppRunner.cpp#5829-5834)
And that has to return a non-null value on Android, otherwise the parent process XRE_main will exit with status 2. https://searchfox.org/mozilla-central/rev/cc01f11adfacca9cd44a75fd140d2fdd8f9a48d4/toolkit/xre/nsAppRunner.cpp#5917
Though, uhh, apparently Android ignores XRE_main returning? https://searchfox.org/mozilla-central/rev/cc01f11adfacca9cd44a75fd140d2fdd8f9a48d4/toolkit/xre/nsAndroidStartup.cpp#54-55
Now most of the crashes are of that type.
Could it be that GeckoThread.run can be called afterwards, calling Java_org_mozilla_gecko_mozglue_GeckoLoader_nativeRun, which in turn wants to create some content process even though the parent process creation failed but the failure was ignored?
If so, I assume we would only see parent process failures if we handled XRE_main failures fatally on the Java side (which we probably should do anyways)? Unfortunately this would just move (and reduce) the number of crashes to become parent process crashes, and not really give us a hint why the file is missing or not accessible.
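As a tiny, self-contained C++ sketch of the suggested change in behaviour (invented names; the real call site is GeckoStart in nsAndroidStartup.cpp, which currently ignores XRE_main's return value): treating a failing return value as fatal would surface the failure in the parent process instead of in every child launched afterwards.

```cpp
#include <cstdio>

// Invented stand-in for XRE_main; pretend parent startup failed (e.g. because
// XRE_GetFileFromPath could not resolve the omnijar path) and returned 2.
static int RunGeckoMainSketch() { return 2; }

int main() {
  int rc = RunGeckoMainSketch();
  if (rc != 0) {
    // Suggested behaviour: stop here and report the parent failure, instead of
    // silently continuing and letting later child-process launches crash.
    std::fprintf(stderr, "parent startup failed with %d, stopping\n", rc);
    return rc;
  }
  // Only get here if the parent actually started; children may be launched.
  std::fprintf(stderr, "parent started normally\n");
  return 0;
}
```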
Updated•2 months ago
Comment 38•2 months ago
Adding an NI just as a reminder (feel free to divert it to someone else).
It was discussed on Slack to tone down the crash noise by making this crash non-fatal.
Updating flags since this bug is lower severity now.
Marking 131 as fix-optional in case the patch is ready/trivial for the 131 dot release planned for October 15th.
Assignee
Updated•2 months ago
Assignee
Comment 39•2 months ago
We have 4 different signatures here.
- mozilla::Omnijar::ChildProcessInit (bp-ef596ce5-467e-47fe-873c-4f3620241003). These all have the crash reason "Android builds are always packaged", which was added in bug 1915998. This is currently the high volume crash. I think we can work around this by making ChildProcessInit and ContentProcess::Init fallible, bailing out of those methods in the !greOmni case, and changing the assertion to be DIAGNOSTIC (see the sketch after this list).
- mozilla::Omnijar::Init (bp-dd7d9b31-76a1-469b-ba15-93a0a0241002). These all have the crash reason "Omnijar::Init failed: NS_ERROR_FILE_CORRUPTED", which was also added in bug 1915998. This is happening on desktop and Fenix, in both parent and child processes. It is a different error than the previous one: we found the OmniJAR file, but the contents are broken. The volume here isn't too high (only about 90 reports in the last month), so I think we should leave it alone and spin it off into a new bug.
- NS_InitXPCOM (bp-6897fa16-9dd9-4b59-bf2a-819360240914). These all have the crash reason Omnijar::Init(). This was added in bug 1915988. If we're bailing out of ContentProcess::Init after Omnijar::ChildProcessInit fails, then we won't re-try initializing the omnijar, so we shouldn't hit this.
- mozilla::dom::ContentChild::~ContentChild (bp-5ff2e126-c4d3-447f-a146-1de6a0241003). This is the original crash that the bug was filed for. The actual crash is old, but the code change that caused it to start happening was in bug 1471720. I need to dig through that some more to figure out how to prevent this crash.
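For the first item, here is a minimal, self-contained C++ sketch of what "fallible init plus a diagnostic-only assertion" could look like. The names are invented and a plain macro stands in for MOZ_DIAGNOSTIC_ASSERT; the actual patch may differ in the details.

```cpp
#include <cstdio>
#include <cstdlib>
#include <optional>
#include <string>

// Invented stand-in for MOZ_DIAGNOSTIC_ASSERT: fatal only in diagnostic builds.
#ifdef DIAGNOSTIC_BUILD
#define DIAGNOSTIC_ASSERT(cond, msg)                                   \
  do {                                                                 \
    if (!(cond)) {                                                     \
      std::fprintf(stderr, "diagnostic assert: %s\n", msg);            \
      std::abort();                                                    \
    }                                                                  \
  } while (0)
#else
#define DIAGNOSTIC_ASSERT(cond, msg) \
  do {                               \
    (void)(cond);                    \
  } while (0)
#endif

// Returns true on success; false if the packaged omnijar path is missing,
// instead of crashing the process unconditionally.
static bool ChildProcessInitSketch(const std::optional<std::string>& greOmni) {
  DIAGNOSTIC_ASSERT(greOmni.has_value(), "Android builds are always packaged");
  if (!greOmni) {
    return false;  // fallible: propagate the failure to the caller
  }
  // ... open and mount the omnijar here ...
  return true;
}

static bool ContentProcessInitSketch(const std::optional<std::string>& greOmni) {
  if (!ChildProcessInitSketch(greOmni)) {
    return false;  // bail out of Init; the process can then exit cleanly
  }
  return true;
}

int main() {
  // Simulate the failing case: no greomni argument reached the child.
  bool ok = ContentProcessInitSketch(std::nullopt);
  std::fprintf(stderr, "content process init %s\n", ok ? "succeeded" : "failed");
  return ok ? 0 : 1;
}
```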
Comment 40•2 months ago
(In reply to Andrew McCreight [:mccr8] from comment #39)
- mozilla::Omnijar::ChildProcessInit (bp-ef596ce5-467e-47fe-873c-4f3620241003). These all have the crash reason "Android builds are always packaged", which was added in bug 1915998. This is currently the high volume crash. I think we can work around this by making ChildProcessInit and ContentProcess::Init fallible, and bail out of those methods in the !greOmni case, changing the assertion to be DIAGNOSTIC.
I think we should also add something to be able to distinguish the possible reasons as of comment 36, which I think still applies to this case:
That can mean either that the argument was missing from the command line (which sounds unlikely) or that the file was not there or had inaccessible rights.
Assignee
Comment 41•2 months ago
(In reply to Jens Stutte [:jstutte] from comment #40)
I think we should also add something to be able to distinguish the possible reasons as of comment 36 which I think still applies to this case:
Yes, Nika suggested that, too, so it is part of my patch in bug 1922707.
Assignee
Comment 42•2 months ago
I've split [@ mozilla::Omnijar::Init] into a separate bug, bug 1923198.
Updated•2 months ago
Assignee
Comment 43•1 month ago
My patch in bug 1922707 split one failure state into two, per Nika's suggestion. We're starting to get results back from Nightly, and the failure reason for all of them is MOZ_CRASH(XRE_GetFileFromPath failed). Not very surprising.
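For context on what splitting one failure state into two buys us, here is a small, self-contained C++ sketch with invented helpers and plain aborts standing in for MOZ_CRASH. Giving each failure its own crash reason is what lets crash-stats tell the two cases apart, as seen above.

```cpp
#include <cstdio>
#include <cstdlib>

// Plain stand-in for MOZ_CRASH with a reason string.
[[noreturn]] static void CrashWithReason(const char* reason) {
  std::fprintf(stderr, "crash reason: %s\n", reason);
  std::abort();
}

// Invented checks for the two conditions we want to distinguish.
static bool HaveGreOmniArg() { return true; }         // was the flag passed at all?
static bool FileFromPathResolves() { return false; }  // did the path resolve to a file?

int main() {
  if (!HaveGreOmniArg()) {
    CrashWithReason("greomni argument missing");  // one failure state...
  }
  if (!FileFromPathResolves()) {
    CrashWithReason("XRE_GetFileFromPath failed");  // ...now reported separately
  }
  return 0;
}
```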
There's a lot going on in this bug, so we should probably file new bugs for the handful of different failure conditions we're hitting. I listed these in comment 39. I've filed the first one as bug 1923198, so we need a bug for the Android mozilla::Omnijar::ChildProcessInit failures and for the desktop NS_InitXPCOM failures.
Assignee
Comment 44•1 month ago
I'm moving the [@ mozilla::Omnijar::ChildProcessInit] failure (the common Android one) to bug 1924182.
Assignee
Comment 45•1 month ago
Fenix crashes where the signature is NS_InitXPCOM, and the major version is > 129: There are 27 of these. All but 4 have the build id 20240830222238, which is the Nightly build after bug 1915988 landed and before bug 1915998 landed.
Of the remaining 4, one (bp-4c35fdad-4730-4204-937c-7c3e00241001) is on release and looks like a null deref on xpcomLib. I think we can ignore that because there's only one.
The final three of these crashes (eg bp-3a36d9d5-33af-4cf9-94f0-584ff0240914) look like they could be from the same device. They all supposedly have the build id 20240912092307, but the actual hg revision the crash stack links to is the same one as the 20240830222238 build crashes, so I think these could be from the period where the Nightly Android crash reporter was incorrectly reporting build ids or something. So I think all of the Fenix NS_InitXPCOM crashes were "fixed" by turning them into mozilla::Omnijar::Init crashes.
There are still some residual NS_InitXPCOM crashes on desktop, but that's not what this bug was about, and the volume isn't super high so I'm inclined to close this and somebody can file a new bug if it is an issue.
Finally, as I think I said before, the ~ContentChild crash was "fixed" by turning it into a different crash.
Assignee
Updated•1 month ago