Open Bug 1394788 Opened 2 years ago Updated 11 months ago

Crash in MessageLoop::PostTask_Helper

Categories

(Firefox for Android :: General, defect, P5, critical)

55 Branch
Unspecified
Android
defect

Tracking


Tracking Status
fennec + ---
firefox55 --- wontfix
firefox56 + wontfix
firefox57 --- fix-optional

People

(Reporter: marcia, Unassigned)

References

Details

(Keywords: crash, regression)

Crash Data

This bug was filed from the Socorro interface and is 
report bp-fccc50c8-4076-469b-805e-06bd60170829.
=============================================================

Crash that is spiking on Android in 55.0.2: 3766 crashes/  : http://bit.ly/2gnN7l5

Also occurs on desktop on both Windows and Mac, although in much smaller numbers. There are a number of intermittent test failure bugs on file, such as Bug 1394665 with the same signature.

ni on snorp and nevin to see if they can figure out what might be causing this spike on Android.
Flags: needinfo?(snorp)
Flags: needinfo?(cnevinchen)
A lot of these seem to be called from mozilla::layers::UiCompositorControllerChild::Destroy().
Other Android ones seem to come from mozilla::layers::AndroidDynamicToolbarAnimator::UpdateFrameMetrics(),
but there are very few Android crashes outside of the UiCompositorControllerChild ones.
Crashes in PostTask_Helper are also a big problem for Android tests currently - bug 1394428.
See Also: → 1394428
Joe, is there someone on your team who can take a look?
Flags: needinfo?(jcheng)
Thanks for ni.
This doesn't look like a front-end bug since it crashes in native code. Sorry, I have no idea how to fix it.

Hi Jing Wei
Do you have any idea about this crash?
Flags: needinfo?(cnevinchen) → needinfo?(topwu.tw)
Whiteboard: [FNC][SPT57.3][INT]
Hi Jingwei, as discussed, please check it first to clarify if this will need other team's help, tks!
Flags: needinfo?(jcheng)
Since the crash happens in C++ code and comment 0 says it also occurs on desktop, let's wait for the platform team's help with their wisdom.
Flags: needinfo?(topwu.tw)
Joe, Wesly, ~90% of these crashes are on fennec, with about 4k crashes a week on release. "wait until some other team gets to it" doesn't seem like a winning strategy?
Flags: needinfo?(wehuang)
Flags: needinfo?(jcheng)
rbarker, can you take a look? Seems like you may be familiar with this code.
Flags: needinfo?(rbarker)
Likely needs snorp's team on it. Ni snorp.
Flags: needinfo?(jcheng)
I believe this may be a dup of Bug 1394428 which I am currently looking into.
Flags: needinfo?(rbarker)
Flags: needinfo?(wehuang)
Hopefully the patches in bug 1392705 can help here.
See Also: → 1392705
Yeah hopefully this goes away with the other patches.
Flags: needinfo?(snorp)
[Tracking Requested - why for this release]: Very high Android crash rate (and not insignificant Desktop crash rate)
tracking-fennec: --- → ?
Track 56+ as the volume of crashes in 56 is high.
See Also: → 1359148
Whiteboard: [FNC][SPT57.3][INT]
This is the #1 top crash on Android Nightly for the 20170918100058 build.
See bug 1392705, which may fix the Android portion of this.  This merged to m-c on 9/18.

For non-Android crashes, such as Windows:
https://crash-stats.mozilla.com/signature/?platform=Windows&signature=MessageLoop%3A%3APostTask_Helper&date=%3E%3D2017-09-14T07%3A44%3A00.000Z&date=%3C2017-09-21T07%3A44%3A00.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_columns=install_time&_sort=-date&page=1#reports

this bug will still be relevant. These seem to come from all over the place, and don't (all) appear to be shutdown crashes, though that's worth checking.

There's some from MediaManager shutdown, which is called from GetProfileBeforeChange (non-e10s) or XpcomWillShutdown (e10s Content).  Basically, we're posting a Task to the MediaManager thread telling it to cleanup and shutdown.

Others I see crashing in calls from vsync, and a number from the GeckoIO Thread from IPC reception or OnChannelError.

Perhaps some non-atomic/locking oddness, since pump_ can be accessed from multiple threads? Though pump_ seems to be set once and never touched again (unless the entire object is being destroyed). The one access that (potentially) uses pump_ after object destruction is in PostTask(), which grabs a stack-based ref to pump_ in order to call pump->ScheduleWork(). I don't see how that could cause the crashing access (pump_->GetXPCOMThread()) to fail consistently, but all of this is tricky.
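For anyone not familiar with the pattern being described: here is a minimal sketch of the "grab a stack-based ref before using the pump" idiom, with std::shared_ptr standing in for Mozilla's RefPtr. Pump, Loop, and the method names are all illustrative stand-ins, not the real MessageLoop code; the point is just why the stack ref keeps PostTask() safe while a direct member read (like the crashing pump_->GetXPCOMThread()) is not.

```cpp
#include <cassert>
#include <memory>

// Illustrative stand-in for the real MessagePump.
struct Pump {
    bool ScheduleWork() { return true; }
};

struct Loop {
    std::shared_ptr<Pump> pump_;  // plays the role of RefPtr<MessagePump>

    // Unsafe: reads the member directly. If teardown clears pump_ on
    // another thread mid-call, this dereferences a dead pointer.
    bool PostTaskUnsafe() { return pump_->ScheduleWork(); }

    // Safer: take a stack-based strong reference first, as the analysis
    // above describes PostTask() doing. The pump then outlives this call
    // even if the member is cleared concurrently.
    bool PostTaskSafe() {
        std::shared_ptr<Pump> local = pump_;  // stack-based ref
        if (!local) return false;             // loop already torn down
        return local->ScheduleWork();
    }
};
```

Note that the stack ref only protects the pump object itself; it does nothing for the read of the pump_ member, which is why a use-after-destroy of the whole Loop can still crash in either variant.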

Bill -- This is an ongoing crash source, with apparently some underlying issue causing it to get hit in many places.  Also see what Android was doing that made this a topcrash there until bug 1392705 landed; perhaps that's a clue about what's failing on other platforms.
Flags: needinfo?(wmccloskey)
tracking-fennec: ? → +
Priority: -- → P1
Assignee: nobody → rbarker
For the Windows crashes, I suspect that bug 1395330 caused most of these to stop happening. My hypothesis was that we were crashing because we were shutting down a thread while there was still an IPC channel alive that could post messages to that thread. The assertion catches that situation and crashes us before it can happen. (And we are seeing quite a lot of crashes from bug 1395330.)

Looking at Windows crashes for 57, it looks like they mostly stopped around 9/6, which is around when bug 1395330 landed. I see one crash from a 9/18 build that seems unrelated:
https://crash-stats.mozilla.com/report/index/1e0f4bfb-0066-40a7-938a-cba850170919
Perhaps the message loop for mMediaThread has already been shut down there? In any case, it's different from the IPC crashes.

We still need to fix the crashes arising from bug 1395330, but at least now we have a better sense of what is going on: people are forgetting to Close() their channels before they shut down their threads.
Flags: needinfo?(wmccloskey)
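A toy model of the ordering rule described above: a channel that posts tasks to a worker thread must be closed before the thread shuts down, otherwise a racing Send() posts to a dead loop. Worker and Channel here are hypothetical illustrations, not Mozilla's real IPC classes.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

class Worker {
public:
    Worker() : done_(false), thread_([this] { Run(); }) {}

    // Refuses the post once shutdown has begun, instead of touching a
    // dead queue -- the failure mode this bug crashes on.
    bool Post(std::function<void()> task) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (done_) return false;
        queue_.push(std::move(task));
        cv_.notify_one();
        return true;
    }

    void Shutdown() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            done_ = true;
        }
        cv_.notify_one();
        thread_.join();  // drains remaining tasks, then exits
    }

private:
    void Run() {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            cv_.wait(lock, [this] { return done_ || !queue_.empty(); });
            if (done_ && queue_.empty()) return;
            auto task = std::move(queue_.front());
            queue_.pop();
            lock.unlock();
            task();
            lock.lock();
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> queue_;
    bool done_;
    std::thread thread_;
};

class Channel {
public:
    explicit Channel(Worker* worker) : worker_(worker) {}
    void Close() { worker_ = nullptr; }  // must happen before Worker::Shutdown()
    bool Send(std::function<void()> msg) {
        return worker_ && worker_->Post(std::move(msg));
    }
private:
    Worker* worker_;
};
```

The correct teardown order is Close() first, then Shutdown(); reversing it leaves a window where Send() races against the dying thread, which is the "forgetting to Close() their channels" bug the assertion in bug 1395330 catches.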
(In reply to Bill McCloskey (:billm) from comment #19)

> We still need to fix the crashes arising from bug 1395330, but at least now
> we have a better sense of what is going on: people are forgetting to Close()
> their channels before they shut down their threads.

Curious about next steps... does this need someone to hunt for these cases -- is there an easy way to find them?
Flags: needinfo?(wmccloskey)
Bug 1398070 mostly fixed this issue. We're still seeing a few PostTask_Helper crashes, but they're not IPC related. A lot of them seem to be graphics related. Hopefully that team can fix them.
Flags: needinfo?(wmccloskey)
200 crashes in the last week on 56, so I'm not considering this a dot release issue for 56. Crash volume in 57 beta looks pretty low.
[triage] 42 crashes on 58 in the past 7 days and that includes fennec. This is also no longer a top crasher. Given fennec engineering resources, this is non-critical so removing P1.

Randall, please unassign if you're not working on this.
Flags: needinfo?(rbarker)
Priority: P1 → P3
Assignee: rbarker → nobody
Flags: needinfo?(rbarker)
Re-triaging per https://bugzilla.mozilla.org/show_bug.cgi?id=1473195

Needinfo :susheel if you think this bug should be re-triaged.
Priority: P3 → P5