Closed Bug 1720841 Opened 2 months ago Closed 2 months ago

Webrender hard crashes Samsung Galaxy Tab 4 [Adreno 305] / Galaxy S4 [Adreno 330]

Categories

(Core :: Graphics: WebRender, defect)

Firefox 90
Unspecified
Android
defect

Tracking

()

VERIFIED FIXED
92 Branch
Tracking Status
firefox-esr78 --- unaffected
firefox90 --- verified
firefox91 --- verified
firefox92 --- verified

People

(Reporter: ke5trel, Assigned: jnicol)

References

(Blocks 2 open bugs)

Details

(Keywords: crash, regression)

Attachments

(3 files)

Samsung Galaxy Tab 4 (SM-T530)
Android 5.0.2
Qualcomm Adreno (TM) 305
OpenGL ES 3.0 V@84.0

Opening almost any site (eg https://mozilla.org) causes the content to turn black, followed by the screen turning black and the device rebooting. about:config similarly crashes. about:support works, though buttons are drawn with a diagonal artifact, rounded corners are missing and scrolling is choppy.

Disabling WebRender by installing Nightly 89 which uses OpenGL, changing gfx.webrender.force-disabled = true then updating to the latest Nightly 92 works.

logcat does not provide any information about the crash.

Severity: -- → S2

Same thing happens with:
Samsung Galaxy S4 (GT-I9506)
Android 5.0.1
Qualcomm Adreno (TM) 330
OpenGL ES 3.0 V@84.0

Summary: Webrender hard crashes Samsung Galaxy Tab 4 [Adreno 305] → Webrender hard crashes Samsung Galaxy Tab 4 [Adreno 305] / Galaxy S4 [Adreno 330]

:jnicol, can you comment to the bug?

Flags: needinfo?(jnicol)

Looks like a pretty severe bug in Android 5 or old Adreno driver versions.

On my Nexus 5 running Android 5.0, V@95.0 I can also render about:support but it fails if I try to render a more complex page. However, it doesn't cause the device to reboot, it just causes Firefox to stop responding.

logcat does not provide any information about the crash.

Could you please attach a copy of the logcat anyway? There may be something in there that is helpful. Thanks

Flags: needinfo?(jnicol) → needinfo?(ke5trel)
Attached file galaxy-tab4-log.zip

Sure, I also included OpenGL traces. The last libEGL entry is always glViewport(0, 0, 1024, 512);.

Flags: needinfo?(ke5trel)

Thanks. You're right, I can't see anything relating to the crash in there. Whereas on my Nexus 5 I see a lot of errors from the Adreno driver.

I'm hoping that we are indeed seeing the same issue, despite the different logcats. I have a fix for my issue, could you please test it and see if it works for yours too? https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/QjY0FENQQa-buEB6Wu29bg/runs/0/artifacts/public/build/geckoview_example.apk

Flags: needinfo?(ke5trel)

The test build works for me without crashing. 👍

Flags: needinfo?(ke5trel)

That's brilliant, thanks for testing.

Here's what I've discovered about this bug:

  • I did a mozregression --good 2020-07-20 --app gve --pref gfx.webrender.all:true to see if this was regressed by something prior to webrender being enabled on this device. Older builds rendered completely incorrectly due to bug 1630356, but importantly did not crash. This gave me this range.
  • By hand, I established that Bug 1682365 - Pt 2 - Remove texture array usage from render task graph was responsible, specifically just the changes to cs_blur.glsl
  • It doesn't make much sense why those changes to cs_blur (switching from sampler2DArray to sampler2D) would cause this bug, but there is clearly something in the shader the old adreno drivers does not like, and we were just getting lucky before.
  • I reduced the current cs_blur shader by hand to see what would stop the crash, and it appears to be due to the unbounded for loop: https://searchfox.org/mozilla-central/rev/a14ecd829fdb1e9780b7c100016c6a3d951cf7da/gfx/wr/webrender/res/cs_blur.glsl#158
  • If I limit the number of iterations, eg by looping to min(vSupport, SOME_CONST) then we avoid the crash.

Glenn, I don't really understand the blur code, but am I correct in thinking that we can safely use 300 as a limit here.

  • We have a max blur radius of 100 here
  • Then we calculate vSupport here as 3 times the blur radius. And 3*100 = 300.

Or am I misunderstanding?

Flags: needinfo?(gwatson)

Yes, it looks like we should be able to clamp that value there. Odd that it would avoid a hard driver crash, but it looks like a valid workaround to me!

Flags: needinfo?(gwatson)

Cool. Yeah, bit strange. I think old drivers can have issues with shaders that they cannot verify will complete statically.

In the cs_blur fragment shader there is a for loop with a number of
iterations determined by a flat varying. On some old Adreno drivers
this causes severe issues including hangs and crashes. These can be
avoided by clamping the number of iterations to a statically known
value.

Assignee: nobody → jnicol
Status: NEW → ASSIGNED
Pushed by jnicol@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/16308825081d
Clamp loop condition variable to static value in cs_blur. r=gw
Status: ASSIGNED → RESOLVED
Closed: 2 months ago
Resolution: --- → FIXED
Target Milestone: --- → 92 Branch

Should we uplift this to 91? (maybe even 90?)

Flags: needinfo?(jnicol)

Yes definitely. Probably worth doing for 90 too: it only affects a small number of users, but makes the browser practically unusable.

I'll make the request.

Flags: needinfo?(jnicol)

Comment on attachment 9232031 [details]
Bug 1720841 - Clamp loop condition variable to static value in cs_blur. r?gw

Beta/Release Uplift Approval Request

  • User impact if declined: Crashes, app not responding, phone rebooting for users with Adreno 3xx GPUs running Android 5.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): Adds a limit to the number of loop iterations in a shader, which shouldn't be exceeded anyway so won't affect rendering. It should make life easier for drivers to have a statically known max number of loop iterations, however, there can always be unintended consequences of modifying shader code.
  • String changes made/needed:
Attachment #9232031 - Flags: approval-mozilla-release?
Attachment #9232031 - Flags: approval-mozilla-beta?

I think this should be low risk, however we've been bitten in the past by changes to shader code which "should" be fine. So maybe best to let this bake on nightly/beta for a few days before it hits release?

Comment on attachment 9232031 [details]
Bug 1720841 - Clamp loop condition variable to static value in cs_blur. r?gw

Crash fix, approved for 91 beta 5, thanks.

Attachment #9232031 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

I was not able to reproduce the crash on the 92.0a1 Nightly from 7/21, nor on the 92.0.0-beta.2 build.
Tested with a Samsung Galaxy Tab A6 (Android 5.1.1), SM-T285.

Status: RESOLVED → VERIFIED

Jftr, the latest nightly works fine for me as well with
Samsung Galaxy S4 (GT-I9506)
Android 5.0.1
Qualcomm Adreno (TM) 330

Do you want a bug for the button rendering in about:support mentioned in https://bugzilla.mozilla.org/show_bug.cgi?id=1720841#c0?

Flags: needinfo?(jnicol)

(In reply to Stefan [:stefanh] from comment #20)

Jftr, the latest nightly works fine for me as well with
Samsung Galaxy S4 (GT-I9506)
Android 5.0.1
Qualcomm Adreno (TM) 330

Do you want a bug for the button rendering in about:support mentioned in https://bugzilla.mozilla.org/show_bug.cgi?id=1720841#c0?

That's great!

Yes please on the new bug. Thanks!

Flags: needinfo?(jnicol)

(In reply to Jamie Nicol [:jnicol] from comment #21)

Yes please on the new bug. Thanks!
Filed bug 1721795.

Duplicate of this bug: 1722068

Comment on attachment 9232031 [details]
Bug 1720841 - Clamp loop condition variable to static value in cs_blur. r?gw

Approved for Fenix 90.1.3.

Attachment #9232031 - Flags: approval-mozilla-release? → approval-mozilla-release+

I was not able to reproduce the crash/ANR on the 90.1.3 RC build.
Tested with a Samsung Galaxy Tab A6 (Android 5.1.1), SM-T285.

You need to log in before you can comment on or make changes to this bug.