Closed Bug 925645 Opened 6 years ago Closed 3 years ago

[Galaxy Nexus] A WebGL page makes the phone crash - crash in mozilla::gl::GLContext::MakeCurrent(bool)

Categories

(Core :: Canvas: WebGL, defect, critical)

ARM
Android
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla52
Tracking Status
firefox45 - wontfix
firefox46 --- wontfix
firefox47 --- wontfix
firefox48 --- wontfix
firefox49 --- wontfix
firefox-esr45 --- affected
fennec 47+ ---
firefox50 --- fixed
firefox51 --- fixed
firefox52 --- fixed

People

(Reporter: julienw, Assigned: eflores)

References

()

Details

(Keywords: crash, crashreportid, topcrash-android-armv7, Whiteboard: [native-crash][in-the-wild][gfx-noted])

Crash Data

Attachments

(2 files, 2 obsolete files)

STR:
* open Firefox for Android
* navigate to http://www.catuhe.com/msdn/ballGame/
* wait

Expected:
* the page loads and a beautiful WebGL based game works

Actual:
* something displays and the phone reboots after some time

Tested on Galaxy Nexus with Firefox for Android v25 (which is the current Beta).
Severity: normal → critical
Version: unspecified → Firefox 25
Installed latest Nightly and it doesn't reboot the phone but instead it crashes the process.

Which is better since I know have a crash report :)

https://crash-stats.mozilla.com/report/index/c9bd9f1b-a908-434f-9e60-8332c2131011

Could it be related to bug 884047 ?
Keywords: crash, crashreportid
Crash Signature: [@ mozilla::gl::GLContext::MakeCurrent(bool)]
Component: Graphics, Panning and Zooming → Canvas: WebGL
Product: Firefox for Android → Core
Whiteboard: [native-crash]
Version: Firefox 25 → Trunk
Summary: [Galaxy Nexus] A WebGL page makes the phone reboot → [Galaxy Nexus] A WebGL page makes the phone crash - crash in mozilla::gl::GLContext::MakeCurrent(bool)
Whiteboard: [native-crash] → [native-crash][in-the-wild]
On latest nightly, it still reboots my device. Tested on Samsung Galaxy Ace 2 (GT-I8160).

So bug 925608 does not resolve this, sadly. And sadly, this reboots the device instead of crashing the process...
Crash Signature: [@ mozilla::gl::GLContext::MakeCurrent(bool)] → [@ mozilla::gl::GLContext::MakeCurrent(bool)] [@ mozilla::gl::GLContext::MakeCurrent]
Assignee: nobody → edwin
[Tracking Requested - why for this release]:
This is a topcrash in Fennec, currently #8 in Release, #12 in Beta, and #24 in Aurora. There's also a significant volume in Firefox on Windows, currently at #41 in Release and #49 in Beta, insignificant on Aurora and Nightly (not sure if that should be tracked separately).

Specific to Android, here are the top-5 devices:
3.314% are using the Samsung Galaxy S4 (GT-I9500) on Android 5.0 (API-21)
3.290% are using the Samsung Galaxy S5 (SM-G900V) on Android 5.0 (API-21)
1.294% are using the Motorola Droid Turbo (XT1254) on Android 5.1 (API-22)
1.286% are using the Google Nexus 7 on Android 6.0 (API-23)
1.166% are using the Samsung Galaxy S4 (GT-I9500) on Android 4.4 (API-19)
tracking-fennec: --- → ?
Whiteboard: [native-crash][in-the-wild] → [native-crash][in-the-wild][gfx-noted]
This is likely fixed by bug 1224199. Will have to let it bake for a bit and see.
tracking-fennec: ? → 47+
(In reply to Edwin Flores [:eflores] [:edwin] from comment #4)
> This is likely fixed by bug 1224199. Will have to let it bake for a bit and
> see.

It looks like bug 1224199 probably didn't fix this - 

https://crash-stats.mozilla.com/signature/?product=FennecAndroid&version=47.0b&signature=mozilla%3A%3Agl%3A%3AGLContext%3A%3AMakeCurrent&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&page=1#reports

Builds on 5/9 and 5/2 include those patches, but we're still seeing crashes with the same callstack.
Volume has dropped significantly on Fennec 48:

https://crash-stats.mozilla.com/signature/?date=%3E2016-01-01&product=FennecAndroid&signature=mozilla%3A%3Agl%3A%3AGLContext%3A%3AMakeCurrent#graph

The charts show no crashes from 48 nightly or aurora. Still a small number in 48b1, but nowhere near what we saw in 47b1.

Not sure exactly what caused this.
@mozilla::gl::GLContext::MakeCurrent(Bool) is gone but @mozilla::gl::GLContext::MakeCurrent still exists in high volume. At present volume this ranks at #3 in Firefox @ 4.52% and #4 in Fennec @ 2.53%. 

Fennec 49: 6 crashes
Fennec 48: 232 crashes
Fennec 47: 13,403 crashes
Fennec 46: 695 crashes

Firefox 50: 17 crashes
Firefox 49: 0 crashes
Firefox 48: 85 crashes
Firefox 47: 2,885 crashes
Firefox 46: 52 crashes

I'm not sure if the Firefox crashes are a different issue than what we're seeing on Fennec but it's definitely topcrash volume in both products. Please needinfo me if you need a separate bug for the Firefox crashes.
Finally managed to reproduce this, if only intermittently.

The cause is largely the same as outlined in bug 1224199 comment 1.

What I didn't realise when writing that patch, was that the GLScreenBuffer's SurfaceFactory can change. When it does, GLScreenBuffer can still be holding onto TextureClientSharedSurface objects, extending their lifetime beyond that of their SurfaceFactory. We can then go on to use those surfaces, adding them to the CompositorBridgeChild's |mTexturesWaitingRecycled| list.

In the meantime, the GLContext can die. The SurfaceFactory it indirectly owns will die with it, and call |CancelWaitForRecycle| on the surfaces created by that factory.

The surfaces created by the old factory, however, will still be waiting for the compositor recycle notification. When that notification arrives, we destroy the surface, and this crash happens again.
Possible fixes:

1. Turn SharedSurface::mGL into a strong ref and move ownership of the GLScreenBuffer to prevent a cycle (as suggested by nical).
2. Make GLContext derive from SupportsWeakPtr and turn SharedSurface::mGL into a WeakPtr<>. Then we can check in ~SharedSurface_* whether the GLContext still exists.
3. Clear mFront and mBack in GLScreenBuffer::Morph.

Pursuing the last option currently, but it's causing more horribleness than it's solving right now.
This patch makes us replace mFront and mBack in GLScreenBuffer::Morph() with surfaces created by the new factory. That ensures that the old factory isn't outlived by its surfaces.

Most of the meat is in GLScreenBuffer.cpp; the rest is plumbing to let callers of Morph() handle its failure.
Attachment #8790302 - Flags: review?(jgilbert)
And for good measure, let's make mGL a weak pointer and check that it still exists before calling into MakeCurrent().
Attachment #8790316 - Flags: review?(jgilbert)
Comment on attachment 8790302 [details] [diff] [review]
Replace mFront and mBack in GLScreenBuffer::Morph

Review of attachment 8790302 [details] [diff] [review]:
-----------------------------------------------------------------

Some small changes after a try run. Have fixed locally.

::: dom/canvas/OffscreenCanvas.cpp
@@ +160,5 @@
>  
>          UniquePtr<gl::SurfaceFactory> factory =
>            gl::GLScreenBuffer::CreateFactory(gl, caps, forwarder, flags);
>  
> +        if (!factory || !screen->Morph(Move(factory)))

From the context, this doesn't look like it should be fatal. Should be |if (factory && !screen->Morph(...))|.

::: gfx/layers/client/ClientCanvasLayer.cpp
@@ +44,5 @@
>  
>    mCanvasClient = nullptr;
>  
> +  if (!mGLContext) {
> +    return false;

This should be |true|.

@@ +78,5 @@
>        // Absolutely must have a factory here, so create a basic one
>        mFactory = MakeUnique<SurfaceFactory_Basic>(mGLContext, caps, mFlags);
>      }
>    } else {
> +    if (!factory || !screen->Morph(Move(factory))) {

|if (factory && ...)| as well.
Latest patch. Still has some issues on Windows that were uncovered by a try run, so not re-requesting review yet.
Attachment #8790302 - Attachment is obsolete: true
Attachment #8790302 - Flags: review?(jgilbert)
Latest patch. See comment 10 for description.

Still fails |webgl-color-test.html?native-gl| reftest on try, but that appears to be a driver thing. The nVidia driver on test infra is old enough that we blacklist it usually, but force-enable it for reftests. I managed to reproduce it with a loaner box, and confirmed that newer drivers fix it.
Attachment #8791611 - Attachment is obsolete: true
Attachment #8793832 - Flags: review?(jgilbert)
Crash volume for signature 'mozilla::gl::GLContext::MakeCurrent':
 - nightly (version 52): 0 crashes from 2016-09-19.
 - aurora  (version 51): 0 crashes from 2016-09-19.
 - beta    (version 50): 0 crashes from 2016-09-20.
 - release (version 49): 34 crashes from 2016-09-05.
 - esr     (version 45): 1395 crashes from 2016-06-01.

Crash volume on the last weeks (Week N is from 10-03 to 10-09):
            W. N-1  W. N-2
 - nightly       0       0
 - aurora        0       0
 - beta          0       0
 - release      27       7
 - esr          81     103

Affected platform: Windows

Crash rank on the last 7 days:
           Browser   Content     Plugin
 - nightly
 - aurora
 - beta
 - release #1510     #2564
 - esr     #137
Crash volume for signature 'mozilla::gl::GLContext::MakeCurrent':
 - nightly (version 52): 0 crashes from 2016-09-19.
 - aurora  (version 51): 0 crashes from 2016-09-19.
 - beta    (version 50): 3 crashes from 2016-09-20.
 - release (version 49): 137 crashes from 2016-09-05.
 - esr     (version 45): 887 crashes from 2016-07-25.

Crash volume on the last weeks (Week N is from 10-17 to 10-23):
            W. N-1  W. N-2  W. N-3  W. N-4
 - nightly       0       0       0       0
 - aurora        0       0       0       0
 - beta          0       3       0       0
 - release      46      42      27       7
 - esr          85      87      72      84

Affected platform: Windows

Crash rank on the last 7 days:
           Browser   Content     Plugin
 - nightly
 - aurora
 - beta
 - release #1184     #1077
 - esr     #116
Comment on attachment 8790316 [details] [diff] [review]
Turn GLContext::mGL into a WeakPtr

Review of attachment 8790316 [details] [diff] [review]:
-----------------------------------------------------------------

I don't love this solution, but we should take it to fix the crash.

::: gfx/gl/SharedSurface.h
@@ +59,5 @@
>                           SurfaceFactory* factory);
>  
>      const SharedSurfaceType mType;
>      const AttachmentType mAttachType;
> +    WeakPtr<GLContext> const mGL;

const WeakPtr<>
Attachment #8790316 - Flags: review?(jgilbert) → review+
Comment on attachment 8793832 [details] [diff] [review]
Replace mFront and mBack in GLScreenBuffer::Morph

Review of attachment 8793832 [details] [diff] [review]:
-----------------------------------------------------------------

I'd really prefer not to take this. Not r-, but f- for now.
Attachment #8793832 - Flags: review?(jgilbert) → feedback-
Pushed by eflores@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/1d1dd53f792c
Turn SharedSurface::mGL into a WeakPtr<> - r=jgilbert
Comment on attachment 8790316 [details] [diff] [review]
Turn GLContext::mGL into a WeakPtr

Approval Request Comment
[Feature/regressing bug #]: WebGL.
[User impact if declined]: Crashes on WebGL destruction.
[Describe test coverage new/current, TreeHerder]: None.
[Risks and why]: Very low. Just adding a lifetime check.
[String/UUID change made/needed]: None.
Attachment #8790316 - Flags: approval-mozilla-beta?
Attachment #8790316 - Flags: approval-mozilla-aurora?
Comment on attachment 8790316 [details] [diff] [review]
Turn GLContext::mGL into a WeakPtr

Crash fix, wasn't able to verify on 52 (no reports on that version), let's hope this fixes the crash, Aurora51+, Beta50+
Attachment #8790316 - Flags: approval-mozilla-beta?
Attachment #8790316 - Flags: approval-mozilla-beta+
Attachment #8790316 - Flags: approval-mozilla-aurora?
Attachment #8790316 - Flags: approval-mozilla-aurora+
https://hg.mozilla.org/mozilla-central/rev/1d1dd53f792c
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla52
Depends on: 1312940
You need to log in before you can comment on or make changes to this bug.