Closed
Bug 1485441
Opened 7 years ago
Closed 7 years ago
Crash in mozilla::layers::ShaderProgramOGL::GetProgram
Categories
(Core :: Graphics: Layers, defect, P1)
Tracking
()
RESOLVED
FIXED
mozilla64
| Tracking | Status | |
|---|---|---|
| geckoview62 | + | fixed |
| firefox-esr52 | --- | unaffected |
| firefox-esr60 | --- | unaffected |
| firefox61 | --- | wontfix |
| firefox62 | blocking | fixed |
| firefox63 | --- | fixed |
| firefox64 | --- | fixed |
People
(Reporter: marcia, Assigned: jnicol)
References
Details
(Keywords: crash, regression, Whiteboard: gfx-noted)
Crash Data
Attachments
(4 files)
|
46 bytes,
text/x-phabricator-request
|
jnicol
:
review+
|
Details | Review |
|
46 bytes,
text/x-phabricator-request
|
jnicol
:
review+
RyanVM
:
approval-mozilla-release-
|
Details | Review |
|
861 bytes,
patch
|
jrmuizel
:
review+
RyanVM
:
approval-mozilla-beta-
|
Details | Diff | Splinter Review |
|
46 bytes,
text/x-phabricator-request
|
pascalc
:
approval-mozilla-beta+
pascalc
:
approval-mozilla-release+
|
Details | Review |
This bug was filed from the Socorro interface and is
report bp-bcd8bc7d-54d3-4c6a-bc93-b6dd90180822.
=============================================================
Seen while looking at b19 data: https://bit.ly/2w6bDfT. Present in 61 as well. No crashes seen in 63 nightly. 48 crashes/19 installs so far on b19.
URLs are varied, and there are no useful comments. APIs range from 28 all the way down to 18.
Top 10 frames of crashing thread:
0 libxul.so mozilla::layers::ShaderProgramOGL::GetProgram gfx/layers/opengl/OGLShaderProgram.cpp:953
1 libxul.so mozilla::layers::CompositorOGL::ActivateProgram gfx/layers/opengl/CompositorOGL.cpp:1052
2 libxul.so void mozilla::layers::CompositorOGL::DrawGeometry<mozilla::gfx::RectTyped<mozilla::gfx::UnknownUnits, float> > gfx/layers/opengl/CompositorOGL.cpp:1293
3 libGLESv2_adreno.so libGLESv2_adreno.so@0xc3e27
4 dalvik-main space (region space) (deleted) dalvik-main space @0x31c6fffe
5 dalvik-main space (region space) (deleted) dalvik-main space @0x32419ffe
6 libGLESv2_adreno.so libGLESv2_adreno.so@0xc352f
7 libxul.so mozilla::gl::GLContext::fUniform4fv gfx/gl/GLContext.h:1754
8 libxul.so mozilla::gl::GLContext::fBindBuffer gfx/gl/GLContext.h:838
9 libxul.so mozilla::gl::GLContext::fBindBuffer gfx/gl/GLContext.h:838
=============================================================
Comment 1•7 years ago
|
||
Looks like we fail to compile a shader. As a result GetShaderProgramFor returns a null pointer which we end up passing to CompositorOGL::ActivateProgram. We can add a simple null check although it means some content would silently not render. We don't have a software fallback on android so I suppose we can't do much better than that unless we can get a phone that reproduces the bug to investigate and work around what is probably a driver bug.
Since this is affecting beta it's a bit late for exploratory investigations so I propose that we add a nightly-only assertion in the hope that we can eventually catch the bug there and early out of rendering layers that we couldn't compile the shader for in release.
We should also log the failing shader config in gfxCriticalNote.
Assignee: nobody → nical.bugzilla
Priority: -- → P3
Whiteboard: gfx-noted
Comment 2•7 years ago
|
||
| Assignee | ||
Comment 3•7 years ago
|
||
Comment on attachment 9003432 [details]
Bug 1485441 - Add diagnostics to GL compositor shader failures and avoid crashing release builds. r=jnicol
Jamie Nicol [:jnicol] has approved the revision.
Attachment #9003432 -
Flags: review+
Comment 4•7 years ago
|
||
| Reporter | ||
Comment 5•7 years ago
|
||
[Tracking Requested - why for this release]: 224 crashes in b19 so far. Previous betas only had a handful of crashes.
tracking-firefox62:
--- → ?
| Assignee | ||
Comment 6•7 years ago
|
||
Comment on attachment 9003466 [details]
Bug 1485441 - Add diagnostics to GL compositor shader failures and avoid crashing release builds. r=jnicol
Jamie Nicol [:jnicol] has approved the revision.
Attachment #9003466 -
Flags: review+
Pushed by nsilva@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/83488f147740
Add diagnostics to GL compositor shader failures and avoid crashing release builds. r=jnicol
Comment 8•7 years ago
|
||
| bugherder | ||
Status: NEW → RESOLVED
Closed: 7 years ago
status-firefox63:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla63
Updated•7 years ago
|
Status: RESOLVED → REOPENED
status-firefox-esr52:
--- → unaffected
status-firefox-esr60:
--- → unaffected
Resolution: FIXED → ---
Target Milestone: mozilla63 → ---
Updated•7 years ago
|
Attachment #9003432 -
Attachment is obsolete: true
Updated•7 years ago
|
Comment 9•7 years ago
|
||
This is the #1 topcrash on Fennec 62 by a pretty large margin. But I'm still not sure it makes sense to consider this patch for uplift given that we're trading off crashes for rendering errors?
Comment 10•7 years ago
|
||
There is a crash with the added gfx critical notes: cfg: features: 0x2001 multiplier: 1 op: CompositionOp::OP_OVER
The features are ENABLE_RENDER_COLOR and ENABLE_DEAA
So the layer we are trying to render is a solid color with a perspective, rotation, or non-integer transformation which makes us request some anti-aliasing. I'm guessing the DEAA part is what's causing the issue here, because the non-antialiased solid color shader should have showed up before we run into this combination.
We could probably try to fall back to not using DEAA when building the shader fails.
Comment 11•7 years ago
|
||
the signature took off in 62.0b19 - these were the bugs going into that build: https://mzl.la/2wb8RoG
In the channel meeting today, this was deemed as a release blocking issue. We have decided to keep updates at 5% for Fennec 62.0 until there is a resolution on this issue.
| Reporter | ||
Comment 13•7 years ago
|
||
The top crashing device so far is ONEPLUS A6003, which is running Qualcomm Snapdragon 845, followed by Samsung Galaxy Note 9 which also has the same processor. Ioana does QA have one of these devices? There are also other devices in the list that you may have - is your device inventory list up to date? Thanks.
Flags: needinfo?(ioana.chiorean)
| Reporter | ||
Comment 14•7 years ago
|
||
Here are some sample URLs from the beta reports:
* http://wordfinder.yourdictionary.com/unscramble/remember_tiles/bdeinty
* https://www.oreillyauto.com/shop/b/cooling---heating-16773/radiator-20386/12defc6c370f
* https://www.sothebysrealty.com/eng/sales/detail/180-l-5119-7m2tt7/oceanfront-castle-melbourne-beach-fl-32951
* https://www.space.com/39905-jupiter-cyclones-geometric-clusters.html
* https://www.guidingtech.com/nova-microsoft-android-launcher-comparison/
* https://www.travelpulse.com/news/destinations/7-reasons-to-visit-manchester-new-hampshire.html
* https://comicbook.com/marvel/2018/08/30/guardians-of-the-galaxy-vol-3-james-gunn-soundtrack-questions/
Updated•7 years ago
|
Priority: P3 → P1
| Assignee | ||
Comment 16•7 years ago
|
||
Comment on attachment 9003466 [details]
Bug 1485441 - Add diagnostics to GL compositor shader failures and avoid crashing release builds. r=jnicol
As discussed on irc, let's uplift the patch as-is to release. This will stop the crashes, but could mean some pages are rendered incorrectly
Approval Request Comment
[Feature/Bug causing the regression]: Not sure yet, seems to have been introduced in 62b19
[User impact if declined]: Crashes for users with OnePlus 6 or galaxy note 9 phones
[Is this code covered by automated tests?]: N/A
[Has the fix been verified in Nightly?]: Has been on nightly for a while and hasn't caused any issues, but we didn't have any crashes on nightly to begin with so can't check it's stopped them.
[Needs manual test from QE? If yes, steps to reproduce]: No
[List of other uplifts needed for the feature/fix]: None
[Is the change risky?]: No
[Why is the change risky/not risky?]: Rather than crashing we will not render some content. This could mean pages render incorrectly but that will be better than crashing
[String changes made/needed]: None
Attachment #9003466 -
Flags: approval-mozilla-release?
| Assignee | ||
Comment 17•7 years ago
|
||
For beta we can temporarily change the diagnostic assert to a release assert so that we can collect some crash reports with the added information.
Approval Request Comment
[Feature/Bug causing the regression]: Not sure
[User impact if declined]: Silent failures, meaning we cannot identify the cause of the bug
[Is this code covered by automated tests?]: No
[Has the fix been verified in Nightly?]: N/A
[Needs manual test from QE? If yes, steps to reproduce]: N/A
[List of other uplifts needed for the feature/fix]:None
[Is the change risky?]: A little bit
[Why is the change risky/not risky?]: This will cause more crashes on beta, but will enable us to identify the cause.
[String changes made/needed]: None
Attachment #9006567 -
Flags: review?(nical.bugzilla)
Attachment #9006567 -
Flags: approval-mozilla-beta?
Comment 18•7 years ago
|
||
status-geckoview62=affected because we will probably want to uplift the final fix to the GECKOVIEW_62_RELBRANCH.
status-firefox64:
--- → affected
status-geckoview62:
--- → affected
| Reporter | ||
Comment 19•7 years ago
|
||
Even though we are throttled at 5%, it looks as if we have already amassed over 1600K crashes. From some of the comments it sounds as if it they are crashing pretty easily.
Comment 20•7 years ago
|
||
That's pretty bad. Jeff could you (or someone) steal the release patch review?
Flags: needinfo?(jmuizelaar)
Comment 21•7 years ago
|
||
Actually, the Beta patch would be more helpful so we could get it into a Fennec 63b4 build tomorrow.
Updated•7 years ago
|
Attachment #9006567 -
Flags: review?(nical.bugzilla) → review+
Comment 22•7 years ago
|
||
Comment on attachment 9006567 [details] [diff] [review]
patch for beta
Will hopefully give us better diagnostic info for these crashes. Approved for Fennec 63.0b4.
Attachment #9006567 -
Flags: approval-mozilla-beta? → approval-mozilla-beta+
Updated•7 years ago
|
Flags: needinfo?(jmuizelaar)
Comment 23•7 years ago
|
||
We should just buy the device and debug it locally.
| Assignee | ||
Comment 24•7 years ago
|
||
I can now reproduce this reliably on the OnePlus 6, on seemingly any webgl content. This one works for example: weblgsamples.org/aquarium/aquarium.html. It reproduces on the current beta from the play store (62.0b21), but not stable (61.0.2, hasn't got the 62 update yet), and interestingly not nightly (2018-09-05)
The reason I couldn't reproduce earlier today was because I was either testing webgl content (on central), or the beta branch (but the wrong content). Luckily happened to try the right combination this evening.
I am now kicking off a beta build on my laptop, which hopefully will have finished by the time I wake up tomorrow morning! Should be a lot easier to debug now I'm able to reproduce.
Comment 25•7 years ago
|
||
(In reply to [:philipp] from comment #11)
> the signature took off in 62.0b19 - these were the bugs going into that
> build: https://mzl.la/2wb8RoG
Bug 1477817 seems like a possible candidate if it's crashing with WebGL content.
| Assignee | ||
Comment 26•7 years ago
|
||
Yes I can confirm that bug 1477817 is the cause of the regression.
It makes us set some additional robustness flags when creating a gl context for webgl. Later when we try to link the shader program which will composite the webgl surface, it fails. I'm not sure why that is yet.
I can write a patch which removes the additional robustness flags on the affected gpu.
Comment 27•7 years ago
|
||
Given that we know it also caused regressions on Desktop, I think I'd prefer to backout from 62 at this point. Would be interesting to know why 63 doesn't seem affected by this or bug 1489099, though.
Blocks: 1477817
Comment 28•7 years ago
|
||
| bugherder uplift | ||
Updated•7 years ago
|
Comment 29•7 years ago
|
||
I've landed the backout of bug 1477817 on m-r for 62.0.1. Per discussion with Jamie, investigation for 63+ will continue on.
https://hg.mozilla.org/releases/mozilla-release/rev/4e6207ecd553
Updated•7 years ago
|
Attachment #9003466 -
Flags: approval-mozilla-release? → approval-mozilla-release-
Comment 31•7 years ago
|
||
I'm not planning to unthrottle Fennec 62 until we ship 62.0.1 at this point. But no, there's no further engineering action needed for 62 at this point. Per my IRC discussion with Jamie today, though, it sounds like Beta63 still reproduces easily.
Flags: needinfo?(ryanvm)
Comment 32•7 years ago
|
||
Jamie, do we still need to ship the release assert on beta, or should we back that out now that you can repro?
Flags: needinfo?(jnicol)
| Assignee | ||
Comment 33•7 years ago
|
||
We should back that out now, have all the information I need.
Flags: needinfo?(jnicol)
Comment 34•7 years ago
|
||
Release assert backed out from beta:
https://hg.mozilla.org/releases/mozilla-beta/rev/f66a7bf84380
| Assignee | ||
Comment 35•7 years ago
|
||
The problem seems to be that after a context with the EGL_CONTEXT_OPENGL_ROBUST_ACCESS_EXT flag set is created (which we set for webgl contexts, since bug 1477817), future attempts to link programs (even if from a different context) will fail. Programs linked before the robust context is created continue to work fine. It seems highly likely to me that this is a driver bug rather than a mistake in our code.
However, it doesn't seem to be quite any shader that will fail, so that perhaps warrants further investigation.
For beta, I think it might make sense to avoid setting the troublesome flag on Adreno 630s. I will write a patch for that.
For central, I am curious enough to continue investigating which shaders are causing issues and why.
Jeff, do you have any thoughts on how important the flag is, and how much of a loss it would be to disable it?
Flags: needinfo?(jgilbert)
Comment 36•7 years ago
|
||
RBAB means we don't need to validate index buffer contents before calls to DrawElements, which is a perf win, particularly if the user updates their index buffer often. (This resets our cache right now)
It's also a general improved security assurance that each context can only access its own data.
We want to be on it as much as possible, but it's not critical.
Flags: needinfo?(jgilbert)
Comment 37•7 years ago
|
||
For the record, that failure case is /insane/. Do we have any approach for making minimal testcases to give to vendors? We should definitely report this to the IHV.
| Assignee | ||
Comment 38•7 years ago
|
||
(In reply to Jeff Gilbert [:jgilbert] from comment #36)
> RBAB means we don't need to validate index buffer contents before calls to
> DrawElements, which is a perf win, particularly if the user updates their
> index buffer often. (This resets our cache right now)
>
> It's also a general improved security assurance that each context can only
> access its own data.
>
> We want to be on it as much as possible, but it's not critical.
Okay, thanks for the explanation. Sounds like disabling it for beta is sensible.
> For the record, that failure case is /insane/.
100% agree! I'd like to spend some more time to figure out what exactly about some of our shaders it doesn't like, then we can give them a testcase.
Comment 39•7 years ago
|
||
| uplift | ||
Backout merged to GECKOVIEW_62_RELBRANCH:
https://hg.mozilla.org/releases/mozilla-release/rev/494b2c115bc1
tracking-geckoview62:
--- → +
Updated•7 years ago
|
Attachment #9006567 -
Flags: approval-mozilla-beta+ → approval-mozilla-beta-
Comment 40•7 years ago
|
||
(In reply to Marcia Knous [:marcia - needinfo? me] from comment #13)
> The top crashing device so far is ONEPLUS A6003, which is running Qualcomm
> Snapdragon 845, followed by Samsung Galaxy Note 9 which also has the same
> processor. Ioana does QA have one of these devices? There are also other
> devices in the list that you may have - is your device inventory list up to
> date? Thanks.
Replied on the mail, forgot to take the NI out.
Flags: needinfo?(ioana.chiorean)
Comment 41•7 years ago
|
||
^ We do't have those devices - was the reply - sorry for the spam
| Assignee | ||
Comment 42•7 years ago
|
||
Ah, it's going to be harder than I'd assumed to disable this specifically on Adreno 630s... We usually do this by checking glGetString(GL_RENDERER). But that won't work until after creating a context, by which point it will be too late. So we'll need to create a dummy context, load the symbols, call that function, then create our real context with the flag set depending on the renderer string. Yuck. Alternatively, the android SDK can give us the phone model, but that wouldn't protect us against future devices with this GPU.
Playing around with my reduced testcase a bit more, I've established it doesn't like that we index the uniform array uLayerRects with the variable aCoord.w. If we only use constant indices then it links the program fine. So there might be a cleaner workaround by modifying the shader code instead.
Let's see if speaking to quallcom can give us a nice solution.
| Assignee | ||
Comment 43•7 years ago
|
||
It was causing the linking of some shaders to fail.
MozReview-Commit-ID: 6KGGZX01pM2
Comment 44•7 years ago
|
||
So I'm going through another bug, and I came across this comment:
// see bug 929506 comment 29. wglGetProcAddress requires a current context.
At least WGL does not guarantee that GetProcAddress gives the same pfn back for different contexts.
Can we check if that's maybe the case with fLinkProgram here?
In InitWithPrefixImpl, we only call MakeCurrent *after* we load all the basic pfns, like fLinkProgram. If the driver's giving us different symbols with RBAB than without, that would explain the failures here.
Any chance you can test this?
Flags: needinfo?(jnicol)
| Assignee | ||
Comment 45•7 years ago
|
||
Sure I can test that. However, the failure occurs in the compositor's context, which does not have rbab enabled. Even if we got a different symbol for the webgl context with rbab, that would presumably still be a problem.
Flags: needinfo?(jnicol)
Updated•7 years ago
|
Attachment #9015854 -
Attachment description: Bug 1485441 - Don't enable robust buffer access on Adreno 630 devices. → Bug 1485441 - Don't enable robust buffer access on Adreno 630 devices
| Assignee | ||
Comment 46•7 years ago
|
||
There is no change to the loaded symbol address for glLinkProgram. Regardless of before or after MakeCurrent, or with rbab enabled or disabled.
Comment 47•7 years ago
|
||
Pushed by jnicol@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/b2ddfd679997
Don't enable robust buffer access on Adreno 630 devices r=jgilbert
Comment 48•7 years ago
|
||
| bugherder | ||
Status: REOPENED → RESOLVED
Closed: 7 years ago → 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla64
| Assignee | ||
Comment 49•7 years ago
|
||
Comment on attachment 9015854 [details]
Bug 1485441 - Don't enable robust buffer access on Adreno 630 devices
I know it's late, but I think we should uplift this.
[Beta/Release Uplift Approval Request]
Feature/Bug causing the regression: Bug 1477817
User impact if declined: Some shaders will fail to compile after loading webgl content on adreno 630. This will probably mean some content wont be rendered.
Is this code covered by automated tests?: No
Has the fix been verified in Nightly?: Yes
Needs manual test from QE?: No
If yes, steps to reproduce:
List of other uplifts needed: None
Risk to taking this patch: Low
Why is the change risky/not risky? (and alternatives if risky): Disables a recently introduced feature for certain GPUs only.
String changes made/needed:
Attachment #9015854 -
Flags: approval-mozilla-beta?
Comment 50•7 years ago
|
||
Comment on attachment 9015854 [details]
Bug 1485441 - Don't enable robust buffer access on Adreno 630 devices
Looks safe, code coverage is correct too, approved for our last fennec beta. Thanks
Attachment #9015854 -
Flags: approval-mozilla-beta? → approval-mozilla-beta+
Comment 51•7 years ago
|
||
Comment on attachment 9015854 [details]
Bug 1485441 - Don't enable robust buffer access on Adreno 630 devices
I am resetting the approval request until we have finished the Beta to Release merge and I will approve again for both beta & release branches after the merge so as that we have the patch on the final release.
[Beta/Release Uplift Approval Request]
Feature/Bug causing the regression: None
User impact if declined:
Is this code covered by automated tests?: Yes
Has the fix been verified in Nightly?: Yes
Needs manual test from QE?: Yes
If yes, steps to reproduce:
List of other uplifts needed: None
Risk to taking this patch: Low
Why is the change risky/not risky? (and alternatives if risky):
String changes made/needed:
Attachment #9015854 -
Flags: approval-mozilla-beta+ → approval-mozilla-beta?
Comment 52•7 years ago
|
||
Comment on attachment 9015854 [details]
Bug 1485441 - Don't enable robust buffer access on Adreno 630 devices
The Beta to Release merge is done. Uplift approved for 63.0b15 and 63rc1.
Attachment #9015854 -
Flags: approval-mozilla-release+
Attachment #9015854 -
Flags: approval-mozilla-beta?
Attachment #9015854 -
Flags: approval-mozilla-beta+
Comment 53•7 years ago
|
||
| bugherder uplift | ||
Updated•7 years ago
|
Attachment #9003432 -
Attachment is obsolete: false
You need to log in
before you can comment on or make changes to this bug.
Description
•