Closed Bug 1341360 Opened 3 years ago Closed 2 years ago

[Fennec] Crash in libstagefright.so

Categories

(Firefox for Android :: Audio/Video, defect, P2, blocker)

50 Branch
Unspecified
Android
defect

Tracking

()

RESOLVED FIXED
Tracking Status
firefox51 --- wontfix
firefox52 + wontfix
firefox53 --- wontfix
firefox54 --- fixed

People

(Reporter: philipp, Assigned: jhlin)

References

Details

(Keywords: crash, regression)

Crash Data

Attachments

(1 file)

[Tracking Requested - why for this release]:

This bug was filed from the Socorro interface and is 
report bp-a8f63bf6-9355-4801-8e51-337dc2170221.
=============================================================

crashes in libstagefright.so are increasing in the 52.0b cycle - they are now accounting for 17% of crashes in 52.0b6 (but were just ~8% in 51.0b).
This bug 1292144 as well?  Note that volume is dropping, perhaps it's moved to one of the signatures tracked here?
See Also: → 1292144
Marking as blocking 52, we should probably find out what causes this spike.
the numbers regarding crash volume in comment #0 were based on all crash reports with a signature starting with libstagefright.so*.

when looking into correlations for the top signatures in 52.0b they all seem to be related to a particular device/chipset:
* libstagefright.so@0xcd7f1 (100.0% in signature vs 00.84% overall) android_model = Lenovo TAB 2 A10-70F
* libstagefright.so@0x11d8e6 (100.0% in signature vs 00.52% overall) android_model = SM-G355M
* libstagefright.so@0x869ed (100.0% in signature vs 01.72% overall) android_hardware = mt6755 (OPPO)
* libstagefright.so@0xdd51d (100.0% in signature vs 02.66% overall) android_hardware = mt6752 (Sony)
* libstagefright.so@0x1ea722 (100.0% in signature vs 05.29% overall) android_hardware = mt6582
Summary: Crash in libstagefright.so@0xcd7f1 → Crash in libstagefright.so
It would make sense for the addresses in the module to be version/device specific; they may very well all be custom :)
Blake, John, this is another media-related crash that has apparently spiked on Beta.
Flags: needinfo?(jolin)
Flags: needinfo?(bwu)
Component: General → Audio/Video
Summary: Crash in libstagefright.so → [Fennec] Crash in libstagefright.so
Most, if not all, of collected logcat dumps show the visited page creates and uses many codec simultaneously. Looks like a video gallery page with more than a few video element could exhaust mediaserver and bring it down. This in turn will mess IPC/binder up and confuse sanity check code in ACodec, then abort browser process.

James, does Fennec have some way to constrain the number of codecs in use?
Flags: needinfo?(jolin)
Flags: needinfo?(bwu)
See Also: → 1339496
Flags: needinfo?(snorp)
Ok, we had similar problems before OOP decoding too. Eugen was working on some patches to limit decoders, but I don't know the status there.
Flags: needinfo?(snorp) → needinfo?(esawin)
We can limit the number of supported decoder pairs via the "media.decoder.limit" pref since version 50.
Currently, we set the limit to 1 by default on devices running Android 4.2 and earlier [1].

[1] https://searchfox.org/mozilla-central/source/dom/media/MediaPrefs.h#185
Flags: needinfo?(esawin)
For Android API 18 and later, hard-coding a limit on devices that support multiple decoder instances seems too extreme to me. How about rejecting software decoder by comparing the return value of MediaCodec.getName() [1] with "OMX.google*" and "OMX.*" like what stagefright does [2]?

[1] https://developer.android.com/reference/android/media/MediaCodec.html#getName()
[2] http://androidxref.com/5.0.0_r2/xref/frameworks/av/media/libstagefright/OMXCodec.cpp#147
Flags: needinfo?(snorp)
Flags: needinfo?(esawin)
I like that idea. It seems like we already talked about something like that? I think we wanted to reject software vp9 in some cases.
Flags: needinfo?(snorp)
Either way, we should only impose those limits on affected devices. Each of those signatures seems to be specific to 1-2 SoCs.
Flags: needinfo?(esawin)
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #10)
> I like that idea. It seems like we already talked about something like that?
> I think we wanted to reject software vp9 in some cases.

 I found
  - bug 1232911, which rejects VPX for specific devices, and
  - bug 1286738, which rejects software VP9 to make YouTube serve H.264 to Fennec
 
 Both use predefined conditions rather than running status. It seems another mechanism that rejects during initialization is needed for this bug. Will upload the patch later.
(In reply to Eugen Sawin [:esawin] from comment #11)
> Either way, we should only impose those limits on affected devices. Each of
> those signatures seems to be specific to 1-2 SoCs.

 The majority of crashes are mt*(MTK), few are Exynos(Samsung) and qcom (Qualcomm), and a handful of some that I don't recognize. Imposing the limits on MTK devices only should significantly reduce the number of crashes, but applying this universally seems safer to me.
We don't have the plan to have dot release for 51. Mark 51 won't fix.
52 RC build is around the corner, it would be great if we can have a new build for beta users as trial to see whether crash rate is trending down or not.
Severity: critical → blocker
Some crashes result from ACodec  : frameworks/av/media/libstagefright/ACodec.cpp:1762 CHECK_EQ( metaData->eType,kMetadataBufferTypeGrallocSource) failed: 0 vs. 1
After looking at this in crash stats, it seems the spike started around January 23/24 on fennec 50.1.0 (which was released over a month before that), and then followed on 51.0.2 and 51.0.3 when that was rolled out.  So it seems like this was caused by other factors, and I'm not convinced 52 is actually any worse off than previous versions.
I'm removing the blocker flag as I'm not convinced this is actually much worse in 52 than 50 or 51.  It *does* look like something we should try to figure out and fix with high priority, but at this point I don't think holding the release is necessary or useful.
Turns out merely rejecting software decoder is not sufficient. After applying the patch I can still crash mediaserver process when visiting the page.

Also browsed with Chrome and it looks just fine. However when examining the log, I noticed that it doesn't use software decoder either. Will dig deeper to see if it has other resource management tricks.
Assignee: nobody → jolin
Priority: -- → P1
I have the patch using only hardware video decoder that seems reduce the chances of crash on my Z3C/Android 5.1.1. But it cannot stop mediaserver process crash from triggering assert/abort code. That looks like a known issue to Google which was fixed in Mashmallow [1].

Since OOP decoding is going to be on in 53, it seems to me that spending more time on this one is not a good investment of efforts.

James and Blake, what do you think? Should we uplift the patch, or just use the existing pref?

FWIW, I vote for the later. :P

 [1] https://android.googlesource.com/platform/frameworks/av/+/48a31bf3f1c1ed5953a4e64f71cdf528f3a38ee5
Flags: needinfo?(snorp)
Flags: needinfo?(bwu)
SGTM. Let's uplift to 53 first. From the crash reports, it looks most cashes comes from low end phones. I am going to get some of those phones and we can verify your patch as well. 
And agree that we also need to make OOP more stable to mitigate this kind of crashes.
Flags: needinfo?(bwu)
Yup, sounds fine. Thanks!
Flags: needinfo?(snorp)
(In reply to John Lin [:jolin][:jhlin] from comment #20)
> ...
> 
> Since OOP decoding is going to be on in 53, it seems to me that spending
> more time on this one is not a good investment of efforts.

This is a surprise - I thought out of process decoding is only enabled by default where we have the compositor process, which is Windows.
(In reply to Milan Sreckovic [:milan] from comment #23)
> (In reply to John Lin [:jolin][:jhlin] from comment #20)
> > ...
> > 
> > Since OOP decoding is going to be on in 53, it seems to me that spending
> > more time on this one is not a good investment of efforts.
> 
> This is a surprise - I thought out of process decoding is only enabled by
> default where we have the compositor process, which is Windows.

 Fennec has its own OOP decoding implementation. Please see bug 1257777.
Duplicate of this bug: 1339496
Duplicate of this bug: 1345912
What patch are you aiming to uplift, here?
Flags: needinfo?(jolin)
Will send the patch for reviewing later. Thanks for the reminder.
Flags: needinfo?(jolin)
MediaCodec.createDecoderByType() allocates SW decoder automatically when it fails to get a HW one. To avoid that, use MediaCodecList to query decoder names ourselves and call MediaCodec.createByCodecName() instead.
Attachment #8846927 - Flags: review?(esawin)
Comment on attachment 8846927 [details] [diff] [review]
Query the component names and reject SW decoders.

Review of attachment 8846927 [details] [diff] [review]:
-----------------------------------------------------------------

Just some nits/suggestions.

::: mobile/android/geckoview/src/main/java/org/mozilla/gecko/util/HardwareCodecCapabilityUtils.java
@@ +45,5 @@
> +    return !findDecoderNamesForMimeType(aMimeType, false /* HW only */, true /* 1st only */).isEmpty();
> +  }
> +
> +  private static List<String> findDecoderNamesForMimeType(String aMimeType,
> +          boolean aHardwareOnly,

Alternatively, we could add an enum (e.g., HARDWARE_ONLY, HARDWARE_SOFTWARE) instead to improve readability without inline comments.

@@ +46,5 @@
> +  }
> +
> +  private static List<String> findDecoderNamesForMimeType(String aMimeType,
> +          boolean aHardwareOnly,
> +          boolean aFirstOnly) {

Do we expect retrieving the whole list to cause a performance regression? I'm not sure whether adding the aFirstOnly flag is warranted.

@@ +63,5 @@
>        for (String mimeType : info.getSupportedTypes()) {
>          if (mimeType.equals(aMimeType)) {
> +          names.add(name);
> +          if (aFirstOnly) {
> +            break iterate;

Please move the break into the outer loop (cond: aFirstOnly && names.size()) to avoid the label.
Attachment #8846927 - Flags: review?(esawin) → review+
Duplicate of this bug: 1348785
Mass wontfix for bugs affecting firefox 52.
We put our resource to ship OOP in 54 to mitigate libstagefright crashes.
These signatures dropped significantly recently.
Looking at the URLs, before April 20 we were crashing really often on some specific porn websites (xnxx.com, xvideos.com), with ~20000 reports in one week; we are now crashing way more rarely on those websites (~800 reports over the past week).
From 11 to 18, these two websites accounted for ~75% of crash reports where the URL field is not empty (~10% of all crash reports).
This is an assigned P1 bug without activity in two weeks. 

If you intend to continue working on this bug for the current release/iteration/sprint, remove the 'stale-bug' keyword.

Otherwise we'll reset the priority of the bug back to '--' on Monday, August 28th.
Keywords: stale-bug
Mass change P1->P2 to align with new Mozilla triage process.
Priority: P1 → P2
With out-of-process decoding(Bug 1333323), it is not a problem anymore.
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → INVALID
Resolution: INVALID → FIXED
You need to log in before you can comment on or make changes to this bug.