Closed Bug 1485441 Opened 6 years ago Closed 6 years ago

Crash in mozilla::layers::ShaderProgramOGL::GetProgram

Categories

(Core :: Graphics: Layers, defect, P1)

62 Branch
Unspecified
Android
defect

Tracking

()

RESOLVED FIXED
mozilla64
Tracking Status
geckoview62 + fixed
firefox-esr52 --- unaffected
firefox-esr60 --- unaffected
firefox61 --- wontfix
firefox62 blocking fixed
firefox63 --- fixed
firefox64 --- fixed

People

(Reporter: marcia, Assigned: jnicol)

References

Details

(Keywords: crash, regression, Whiteboard: gfx-noted)

Crash Data

Attachments

(4 files)

This bug was filed from the Socorro interface and is
report bp-bcd8bc7d-54d3-4c6a-bc93-b6dd90180822.
=============================================================

Seen while looking at b19 data: https://bit.ly/2w6bDfT. Present in 61 as well. No crashes seen in 63 nightly. 48 crashes/19 installs so far on b19.

URLs are varied, and there are no useful comments. APIs range from 28 all the way down to 18.

Top 10 frames of crashing thread:

0 libxul.so mozilla::layers::ShaderProgramOGL::GetProgram gfx/layers/opengl/OGLShaderProgram.cpp:953
1 libxul.so mozilla::layers::CompositorOGL::ActivateProgram gfx/layers/opengl/CompositorOGL.cpp:1052
2 libxul.so void mozilla::layers::CompositorOGL::DrawGeometry<mozilla::gfx::RectTyped<mozilla::gfx::UnknownUnits, float> > gfx/layers/opengl/CompositorOGL.cpp:1293
3 libGLESv2_adreno.so libGLESv2_adreno.so@0xc3e27 
4 dalvik-main space (region space) (deleted) dalvik-main space @0x31c6fffe 
5 dalvik-main space (region space) (deleted) dalvik-main space @0x32419ffe 
6 libGLESv2_adreno.so libGLESv2_adreno.so@0xc352f 
7 libxul.so mozilla::gl::GLContext::fUniform4fv gfx/gl/GLContext.h:1754
8 libxul.so mozilla::gl::GLContext::fBindBuffer gfx/gl/GLContext.h:838
9 libxul.so mozilla::gl::GLContext::fBindBuffer gfx/gl/GLContext.h:838

=============================================================
Looks like we fail to compile a shader. As a result GetShaderProgramFor returns a null pointer which we end up passing to CompositorOGL::ActivateProgram. We can add a simple null check although it means some content would silently not render. We don't have a software fallback on android so I suppose we can't do much better than that unless we can get a phone that reproduces the bug to investigate and work around what is probably a driver bug.

Since this is affecting beta it's a bit late for exploratory investigations so I propose that we add a nightly-only assertion in the hope that we can eventually catch the bug there and early out of rendering layers that we couldn't compile the shader for in release.
We should also log the failing shader config in gfxCriticalNote.
Assignee: nobody → nical.bugzilla
Priority: -- → P3
Whiteboard: gfx-noted
Comment on attachment 9003432 [details]
Bug 1485441 - Add diagnostics to GL compositor shader failures and avoid crashing release builds. r=jnicol

Jamie Nicol [:jnicol] has approved the revision.
Attachment #9003432 - Flags: review+
[Tracking Requested - why for this release]: 224 crashes in b19 so far. Previous betas only had a handful of crashes.
Comment on attachment 9003466 [details]
Bug 1485441 - Add diagnostics to GL compositor shader failures and avoid crashing release builds. r=jnicol

Jamie Nicol [:jnicol] has approved the revision.
Attachment #9003466 - Flags: review+
Pushed by nsilva@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/83488f147740
Add diagnostics to GL compositor shader failures and avoid crashing release builds. r=jnicol
https://hg.mozilla.org/mozilla-central/rev/83488f147740
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla63
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: mozilla63 → ---
Attachment #9003432 - Attachment is obsolete: true
This is the #1 topcrash on Fennec 62 by a pretty large margin. But I'm still not sure it makes sense to consider this patch for uplift given that we're trading off crashes for rendering errors?
There is a crash with the added gfx critical notes: cfg: features: 0x2001 multiplier: 1 op: CompositionOp::OP_OVER
The features are  ENABLE_RENDER_COLOR and ENABLE_DEAA

So the layer we are trying to render is a solid color with a perspective, rotation, or non-integer transformation which makes us request some anti-aliasing. I'm guessing the DEAA part is what's causing the issue here, because the non-antialiased solid color shader should have showed up before we run into this combination.

We could probably try to fall back to not using DEAA when building the shader fails.
the signature took off in 62.0b19 - these were the bugs going into that build: https://mzl.la/2wb8RoG
In the channel meeting today, this was deemed as a release blocking issue. We have decided to keep updates at 5% for Fennec 62.0 until there is a resolution on this issue.
The top crashing device so far is ONEPLUS A6003, which is running Qualcomm Snapdragon 845, followed by Samsung Galaxy Note 9 which also has the same processor. Ioana does QA have one of these devices? There are also other devices in the list that you may have - is your device inventory list up to date? Thanks.
Flags: needinfo?(ioana.chiorean)
Jamie is taking this bug from Nical.
Assignee: nical.bugzilla → jnicol
Comment on attachment 9003466 [details]
Bug 1485441 - Add diagnostics to GL compositor shader failures and avoid crashing release builds. r=jnicol

As discussed on irc, let's uplift the patch as-is to release. This will stop the crashes, but could mean some pages are rendered incorrectly

Approval Request Comment
[Feature/Bug causing the regression]: Not sure yet, seems to have been introduced in 62b19
[User impact if declined]: Crashes for users with OnePlus 6 or galaxy note 9 phones
[Is this code covered by automated tests?]: N/A
[Has the fix been verified in Nightly?]: Has been on nightly for a while and hasn't caused any issues, but we didn't have any crashes on nightly to begin with so can't check it's stopped them.
[Needs manual test from QE? If yes, steps to reproduce]: No
[List of other uplifts needed for the feature/fix]: None
[Is the change risky?]: No
[Why is the change risky/not risky?]: Rather than crashing we will not render some content. This could mean pages render incorrectly but that will be better than crashing
[String changes made/needed]: None
Attachment #9003466 - Flags: approval-mozilla-release?
Attached patch patch for betaSplinter Review
For beta we can temporarily change the diagnostic assert to a release assert so that we can collect some crash reports with the added information.

Approval Request Comment
[Feature/Bug causing the regression]: Not sure
[User impact if declined]: Silent failures, meaning we cannot identify the cause of the bug
[Is this code covered by automated tests?]: No
[Has the fix been verified in Nightly?]: N/A
[Needs manual test from QE? If yes, steps to reproduce]: N/A
[List of other uplifts needed for the feature/fix]:None
[Is the change risky?]: A little bit
[Why is the change risky/not risky?]: This will cause more crashes on beta, but will enable us to identify the cause.
[String changes made/needed]: None
Attachment #9006567 - Flags: review?(nical.bugzilla)
Attachment #9006567 - Flags: approval-mozilla-beta?
status-geckoview62=affected because we will probably want to uplift the final fix to the GECKOVIEW_62_RELBRANCH.
Even though we are throttled at 5%, it looks as if we have already amassed over 1600K crashes. From some of the comments it sounds as if it they are crashing pretty easily.
That's pretty bad. Jeff could you (or someone) steal the release patch review?
Flags: needinfo?(jmuizelaar)
Actually, the Beta patch would be more helpful so we could get it into a Fennec 63b4 build tomorrow.
Attachment #9006567 - Flags: review?(nical.bugzilla) → review+
Comment on attachment 9006567 [details] [diff] [review]
patch for beta

Will hopefully give us better diagnostic info for these crashes. Approved for Fennec 63.0b4.
Attachment #9006567 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Flags: needinfo?(jmuizelaar)
We should just buy the device and debug it locally.
I can now reproduce this reliably on the OnePlus 6, on seemingly any webgl content. This one works for example: weblgsamples.org/aquarium/aquarium.html. It reproduces on the current beta from the play store (62.0b21), but not stable (61.0.2, hasn't got the 62 update yet), and interestingly not nightly (2018-09-05)

The reason I couldn't reproduce earlier today was because I was either testing webgl content (on central), or the beta branch (but the wrong content). Luckily happened to try the right combination this evening.

I am now kicking off a beta build on my laptop, which hopefully will have finished by the time I wake up tomorrow morning! Should be a lot easier to debug now I'm able to reproduce.
(In reply to [:philipp] from comment #11)
> the signature took off in 62.0b19 - these were the bugs going into that
> build: https://mzl.la/2wb8RoG

Bug 1477817 seems like a possible candidate if it's crashing with WebGL content.
Yes I can confirm that bug 1477817 is the cause of the regression.

It makes us set some additional robustness flags when creating a gl context for webgl. Later when we try to link the shader program which will composite the webgl surface, it fails. I'm not sure why that is yet.

I can write a patch which removes the additional robustness flags on the affected gpu.
Given that we know it also caused regressions on Desktop, I think I'd prefer to backout from 62 at this point. Would be interesting to know why 63 doesn't seem affected by this or bug 1489099, though.
Blocks: 1477817
I've landed the backout of bug 1477817 on m-r for 62.0.1. Per discussion with Jamie, investigation for 63+ will continue on.

https://hg.mozilla.org/releases/mozilla-release/rev/4e6207ecd553
Attachment #9003466 - Flags: approval-mozilla-release? → approval-mozilla-release-
With the backout are we no longer blocking 62?
Flags: needinfo?(ryanvm)
I'm not planning to unthrottle Fennec 62 until we ship 62.0.1 at this point. But no, there's no further engineering action needed for 62 at this point. Per my IRC discussion with Jamie today, though, it sounds like Beta63 still reproduces easily.
Flags: needinfo?(ryanvm)
Jamie, do we still need to ship the release assert on beta, or should we back that out now that you can repro?
Flags: needinfo?(jnicol)
We should back that out now, have all the information I need.
Flags: needinfo?(jnicol)
The problem seems to be that after a context with the EGL_CONTEXT_OPENGL_ROBUST_ACCESS_EXT flag set is created (which we set for webgl contexts, since bug 1477817), future attempts to link programs (even if from a different context) will fail. Programs linked before the robust context is created continue to work fine. It seems highly likely to me that this is a driver bug rather than a mistake in our code.

However, it doesn't seem to be quite any shader that will fail, so that perhaps warrants further investigation.

For beta, I think it might make sense to avoid setting the troublesome flag on Adreno 630s. I will write a patch for that.

For central, I am curious enough to continue investigating which shaders are causing issues and why.

Jeff, do you have any thoughts on how important the flag is, and how much of a loss it would be to disable it?
Flags: needinfo?(jgilbert)
RBAB means we don't need to validate index buffer contents before calls to DrawElements, which is a perf win, particularly if the user updates their index buffer often. (This resets our cache right now)

It's also a general improved security assurance that each context can only access its own data.

We want to be on it as much as possible, but it's not critical.
Flags: needinfo?(jgilbert)
For the record, that failure case is /insane/. Do we have any approach for making minimal testcases to give to vendors? We should definitely report this to the IHV.
(In reply to Jeff Gilbert [:jgilbert] from comment #36)
> RBAB means we don't need to validate index buffer contents before calls to
> DrawElements, which is a perf win, particularly if the user updates their
> index buffer often. (This resets our cache right now)
> 
> It's also a general improved security assurance that each context can only
> access its own data.
> 
> We want to be on it as much as possible, but it's not critical.

Okay, thanks for the explanation. Sounds like disabling it for beta is sensible.

> For the record, that failure case is /insane/.

100% agree! I'd like to spend some more time to figure out what exactly about some of our shaders it doesn't like, then we can give them a testcase.
Attachment #9006567 - Flags: approval-mozilla-beta+ → approval-mozilla-beta-
(In reply to Marcia Knous [:marcia - needinfo? me] from comment #13)
> The top crashing device so far is ONEPLUS A6003, which is running Qualcomm
> Snapdragon 845, followed by Samsung Galaxy Note 9 which also has the same
> processor. Ioana does QA have one of these devices? There are also other
> devices in the list that you may have - is your device inventory list up to
> date? Thanks.

Replied on the mail, forgot to take the NI out.
Flags: needinfo?(ioana.chiorean)
^ We do't have those devices - was the reply - sorry for the spam
Ah, it's going to be harder than I'd assumed to disable this specifically on Adreno 630s... We usually do this by checking glGetString(GL_RENDERER). But that won't work until after creating a context, by which point it will be too late. So we'll need to create a dummy context, load the symbols, call that function, then create our real context with the flag set depending on the renderer string. Yuck. Alternatively, the android SDK can give us the phone model, but that wouldn't protect us against future devices with this GPU.

Playing around with my reduced testcase a bit more, I've established it doesn't like that we index the uniform array uLayerRects with the variable aCoord.w. If we only use constant indices then it links the program fine. So there might be a cleaner workaround by modifying the shader code instead.

Let's see if speaking to quallcom can give us a nice solution.
It was causing the linking of some shaders to fail.

MozReview-Commit-ID: 6KGGZX01pM2
So I'm going through another bug, and I came across this comment:
    // see bug 929506 comment 29. wglGetProcAddress requires a current context.

At least WGL does not guarantee that GetProcAddress gives the same pfn back for different contexts.
Can we check if that's maybe the case with fLinkProgram here?
In InitWithPrefixImpl, we only call MakeCurrent *after* we load all the basic pfns, like fLinkProgram. If the driver's giving us different symbols with RBAB than without, that would explain the failures here.

Any chance you can test this?
Flags: needinfo?(jnicol)
Sure I can test that. However, the failure occurs in the compositor's context, which does not have rbab enabled. Even if we got a different symbol for the webgl context with rbab, that would presumably still be a problem.
Flags: needinfo?(jnicol)
Attachment #9015854 - Attachment description: Bug 1485441 - Don't enable robust buffer access on Adreno 630 devices. → Bug 1485441 - Don't enable robust buffer access on Adreno 630 devices
There is no change to the loaded symbol address for glLinkProgram. Regardless of before or after MakeCurrent, or with rbab enabled or disabled.
Pushed by jnicol@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/b2ddfd679997
Don't enable robust buffer access on Adreno 630 devices r=jgilbert
https://hg.mozilla.org/mozilla-central/rev/b2ddfd679997
Status: REOPENED → RESOLVED
Closed: 6 years ago6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla64
Comment on attachment 9015854 [details]
Bug 1485441 - Don't enable robust buffer access on Adreno 630 devices

I know it's late, but I think we should uplift this.

[Beta/Release Uplift Approval Request]

Feature/Bug causing the regression: Bug 1477817

User impact if declined: Some shaders will fail to compile after loading webgl content on adreno 630. This will probably mean some content wont be rendered.

Is this code covered by automated tests?: No

Has the fix been verified in Nightly?: Yes

Needs manual test from QE?: No

If yes, steps to reproduce: 

List of other uplifts needed: None

Risk to taking this patch: Low

Why is the change risky/not risky? (and alternatives if risky): Disables a recently introduced feature for certain GPUs only.

String changes made/needed:
Attachment #9015854 - Flags: approval-mozilla-beta?
Comment on attachment 9015854 [details]
Bug 1485441 - Don't enable robust buffer access on Adreno 630 devices

Looks safe, code coverage is correct too, approved for our last fennec beta. Thanks
Attachment #9015854 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Comment on attachment 9015854 [details]
Bug 1485441 - Don't enable robust buffer access on Adreno 630 devices

I am resetting the approval request until we have finished the Beta to Release merge and I will approve again for both beta & release branches after the merge so as that we have the patch on the final release.

[Beta/Release Uplift Approval Request]

Feature/Bug causing the regression: None

User impact if declined: 

Is this code covered by automated tests?: Yes

Has the fix been verified in Nightly?: Yes

Needs manual test from QE?: Yes

If yes, steps to reproduce: 

List of other uplifts needed: None

Risk to taking this patch: Low

Why is the change risky/not risky? (and alternatives if risky): 

String changes made/needed:
Attachment #9015854 - Flags: approval-mozilla-beta+ → approval-mozilla-beta?
Comment on attachment 9015854 [details]
Bug 1485441 - Don't enable robust buffer access on Adreno 630 devices

The Beta to Release merge is done. Uplift approved for 63.0b15 and 63rc1.
Attachment #9015854 - Flags: approval-mozilla-release+
Attachment #9015854 - Flags: approval-mozilla-beta?
Attachment #9015854 - Flags: approval-mozilla-beta+
Attachment #9003432 - Attachment is obsolete: false
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: