crash in OOM | large | mozalloc_abort | mozalloc_handle_oom | moz_xmalloc | std::priv::_Deque_base<T>::_M_initialize_map

RESOLVED FIXED in Firefox 47

Status

()

defect
--
critical
RESOLVED FIXED
3 years ago
3 years ago

People

(Reporter: snorp, Assigned: esawin)

Tracking

({crash})

45 Branch
Firefox 48
Unspecified
Android
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox47 fixed, firefox48 fixed, fennec46+)

Details

(crash signature)

Attachments

(2 attachments)

This bug was filed from the Socorro interface and is 
report bp-c1fe5a7f-fc67-4fd0-ad5c-9b4ea2160330.
=============================================================

Most of the reports seem to be OOMing in media decoder, but unknown if that is the cause.
Looks like fallout from bug 1248792
Assignee: nobody → esawin
Blocks: 1248792
tracking-fennec: --- → 46+
The early return in a double-booked shutdown situation is problematic.

A forced shutdown through the destructor would return early if the shutdown procedure had already been initiated, in which case the object would be destructed without a proper shutdown of the decoder thread.

We should also prevent any illegal state transitions (flushing could potentially override a shutdown) when trying to set the state to give the call site feedback on the success of the transition.
Attachment #8739187 - Flags: review?(snorp)
Comment on attachment 8739187 [details] [diff] [review]
0001-Bug-1262456-1.1-Prevent-interruption-of-the-decoder-.patch

Review of attachment 8739187 [details] [diff] [review]:
-----------------------------------------------------------------

Nice catch!
Attachment #8739187 - Flags: review?(snorp) → review+
https://hg.mozilla.org/mozilla-central/rev/8bce23713e8f
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 48
The crash is still occurring. The device distribution looks suspicious, with over 30% of all crashes on the GT-I9060I (Samsung Galaxy Grand Neo), which afaik is not a mainstream device like the S-series.
Status: RESOLVED → REOPENED
Keywords: leave-open
Resolution: FIXED → ---
Let's replace the queue container adapter with the underlying container deque to gain access to its clear implementation (std::queue is not perfectly suited for our case because of the flushing mechanics).

This will help diagnosing the actual issue here, since deque::clear does not require a temporary container and therefore avoids the allocation which should give us a different crash stack.
Attachment #8742409 - Flags: review?(snorp)
Attachment #8742409 - Flags: review?(snorp) → review+
Closing this and will post a new bug with the new signature. So far nothing related has popped up on Nightly, we should uplift this to get clearer results.
Status: REOPENED → RESOLVED
Closed: 3 years ago3 years ago
Resolution: --- → FIXED
Keywords: leave-open
Blocks: 1273523
Comment on attachment 8739187 [details] [diff] [review]
0001-Bug-1262456-1.1-Prevent-interruption-of-the-decoder-.patch

Approval Request Comment
[Feature/regressing bug #]: Media playback, blocks uplift of bug 1273523
[User impact if declined]: Top media crash
[Describe test coverage new/current, TreeHerder]: On Nightly and Aurora for weeks
[Risks and why]: Low, only affects media decoder shutdown
[String/UUID change made/needed]: none
Attachment #8739187 - Flags: approval-mozilla-beta?
Attachment #8742409 - Flags: approval-mozilla-beta?
Comment on attachment 8739187 [details] [diff] [review]
0001-Bug-1262456-1.1-Prevent-interruption-of-the-decoder-.patch

Need this patch for medium/high volume top Fennec 47 crash, this patch has been in 48 for over a month, seems safe. Beta47+
Attachment #8739187 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Hi Margaret, Snorp, FYI. We had conflicts uplifting patch from bug 1273523 to Beta and Eugen would like to uplift this patch instead of adding a new rebased patch to Beta47 in that bug. I have approved it for uplift to Beta47. If you have any concerns, please let me know.

At this point in Beta47 cycle, I am only taking fixes for critical recent regressions and severe stability/security issues. I'd appreciate all the help I can get in evaluating risk associated with this uplift. Thanks!
Flags: needinfo?(snorp)
Flags: needinfo?(margaret.leibovic)
Attachment #8742409 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
I defer to Snorp and Eugen here, this is outside my area of expertise.
Flags: needinfo?(margaret.leibovic)
Snorp commented on IRC that he thinks we should take these two patches.
No crashes have been encountered on the 47.0b10 build in today's testing session. 

The testing took place on the following devices: Samsung Galaxy Note 5(Android 5.1.1), Xiaomi mi i4 (Android 5.0.2), Samsung Galaxy S5 (Android 5.0.2), Nexus 5 (Android 6.0.1), Nexus 6P (Android N Beta), 	Alcatel OneTouch, and on tablets: Xiaomi MI pad2 (Android 5.1.1), Nexus 9 (Android 5.1), Nexus 9 (Android 6.1).

I will leave this issue open in order to gather more information in the Crash-stats reports.
As per bug 1273523 comment 10, I suspect this is an stlport bug, not a Mozilla bug.

Look at the results of this search:

https://crash-stats.mozilla.com/search/?oom_allocation_size=2147483656&_facets=oom_allocation_size&_facets=signature&_facets=platform&_facets=proto_signature&_facets=version&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=process_type&_columns=platform_pretty_version#facet-oom_allocation_size

Look at the different facet tabs and you can see all these crashes are coming from std::priv::_Deque_base<T>::_M_initialize_map, which is from the std:deque usage in mozilla::MediaCodecDataDecoder::PeekNextSample, and they're all Android. There are also crashes in version 46.0b13.

Looking at the stack traces, the decoder is just calling clear() on the std::deque, which seems totally reasonable. Smells like an stlport bug to me. That would explain why the steps taken so far don't seem to help.

If I'm right, the good news is that Nathan Froyd recently did the hard work to get us off stlport, onto a more modern C++ stdlib. The bad news is that we backporting stlport removal is likely hard or impossible...
(In reply to Nicholas Nethercote [:njn] from comment #18)
> Looking at the stack traces, the decoder is just calling clear() on the
> std::deque, which seems totally reasonable. Smells like an stlport bug to
> me. That would explain why the steps taken so far don't seem to help.

In bug 1273523 we fix the UB (possibly heap corruption) caused by calling deque::pop_front() on an empty deque, which would explain the subsequent deque::clear() call to behave abnormally. Previously, we've only asserted this constraint in debug builds.
> In bug 1273523 we fix the UB (possibly heap corruption) caused by calling
> deque::pop_front() on an empty deque, which would explain the subsequent
> deque::clear() call to behave abnormally. Previously, we've only asserted
> this constraint in debug builds.

Ah. In that case my analysis is mistaken and the problem is in our code. Thank you for the explanation.
(In reply to Ritu Kothari (:ritu) from comment #15)
> Snorp commented on IRC that he thinks we should take these two patches.

Clearing needinfo since the question was answered and the issue was resolved.
Flags: needinfo?(snorp)
You need to log in before you can comment on or make changes to this bug.