Open Bug 1609191 Opened 9 months ago Updated 2 months ago

Some Adreno 5xx devices crash during shader compilation

Categories

(Core :: Graphics: WebRender, defect, P2)

72 Branch
All
Android
defect

Tracking


ASSIGNED

People

(Reporter: ktaeleman, Assigned: jnicol, NeedInfo)

References

(Blocks 1 open bug)

Details

(Keywords: crash, leave-open, Whiteboard: wr-android)

Crash Data

Attachments

(1 file)

Moto G7 play (Adreno 506):
https://crash-stats.mozilla.org/report/index/47cd05e9-e6c5-4055-951e-87f970200114

Xiaomi Redmi 7A (Adreno 505):
https://crash-stats.mozilla.org/report/index/20c9ded7-0daf-4263-b96b-2b2190200113

On both devices the application had been running for under 30 seconds, pointing to on-demand shader compilation.

Fixing the crash signature so it shows up properly in Socorro.

Crash Signature: libllvm-glnext.so → [@ libllvm-glnext.so@0x732bb0 ]
No longer blocks: wr-74-android

This seems to be happening both on Fenix with WebRender and on Fennec, in both cases on Adreno 505 and 506.

Whiteboard: wr-android
Blocks: wr-adreno5xx6xx
No longer blocks: wr-75-android
Severity: normal → S3

Sotaro mentioned he tried to reproduce this crash with an Adreno 506 but did not see the crash in the example app + WebRender. Are there clear STR? Does it happen for you in the example app?

Flags: needinfo?(ktaeleman)

No, we haven't been able to reproduce this crash locally, but are seeing ~10 crashes per day on nightly.
Maybe it's a specific shader causing the issue.

@sotaro: Would it be possible to force-compile all shaders to see if that could be the issue? I don't know if we have all the combinations predefined, so this may not be possible, and I'm not sure how long it would take.

Flags: needinfo?(ktaeleman) → needinfo?(sotaro.ikeda.g)

If you want to find out which shader it is, you could try something like this:
https://searchfox.org/mozilla-central/rev/61fceb7c0729773f544a9656f474e36cd636e5ea/js/src/jit/x86-shared/Assembler-x86-shared.cpp#119-126
i.e. store the name of the shader on the stack, and we could read it out of the minidumps.
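
A minimal sketch of that breadcrumb trick, in the spirit of the linked Assembler code but written here as a hypothetical Rust helper (none of these names exist in the tree; the volatile writes and black_box are only there to keep the compiler from optimising the copy away):

    use std::hint::black_box;
    use std::ptr::write_volatile;

    /// Copy the shader name into a stack buffer with volatile writes just
    /// before the call that crashes, so the name can be read back out of the
    /// minidump's stack memory. `link` stands in for the glLinkProgram call.
    fn link_with_breadcrumb(shader_name: &str, link: impl FnOnce()) {
        let mut breadcrumb = [0u8; 64];
        for (dst, byte) in breadcrumb.iter_mut().zip(shader_name.bytes()) {
            // Volatile writes stop the compiler from eliding the copy.
            unsafe { write_volatile(dst, byte) };
        }
        link();
        // Keep the buffer alive across the call so it stays on the stack.
        black_box(&breadcrumb);
    }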

All crashes in Socorro have the following GraphicsCriticalError message. But the crashes in comment 0 did not have the message.

|[0][GFX1-]: Failed to create EGLContext!: 0x300c

0x300c error means EGL_BAD_PARAMETER.
https://searchfox.org/mozilla-central/rev/61fceb7c0729773f544a9656f474e36cd636e5ea/gfx/gl/GLContextProviderEGL.cpp#301

(In reply to Kris Taeleman (:ktaeleman) from comment #4)

> @sotaro: Would it be possible to force-compile all shaders to see if that could be the issue? I don't know if we have all the combinations predefined, so this may not be possible, and I'm not sure how long it would take.

The ShaderPrecacheFlags::FULL_COMPILE flag seems to make WebRender compile the majority of shaders at startup, though it might not cover all shader combinations.

https://searchfox.org/mozilla-central/rev/61fceb7c0729773f544a9656f474e36cd636e5ea/gfx/webrender_bindings/src/bindings.rs#3747
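
For reference, a rough sketch of what requesting that looks like on the Rust embedder side, assuming the RendererOptions / ShaderPrecacheFlags API at this revision (the field name is from memory, so treat it as an assumption rather than the exact code in bindings.rs):

    use webrender::{RendererOptions, ShaderPrecacheFlags};

    /// Build renderer options that ask WebRender to eagerly build (most of)
    /// the shaders at startup instead of compiling them on demand at first use.
    fn full_compile_options() -> RendererOptions {
        RendererOptions {
            precache_flags: ShaderPrecacheFlags::FULL_COMPILE,
            ..RendererOptions::default()
        }
    }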

:ktaeleman, how do you know the crashes were caused by on-demand shader compilation?

Flags: needinfo?(sotaro.ikeda.g)

> :ktaeleman, how do you know the crashes were caused by on-demand shader compilation?

I think we're just guessing, based on libllvm-glnext.so in the crash signature.

No longer blocks: wr-android

On a Moto g7 play, I can reproduce fairly reliably by enabling gfx.webrender.debug.show-overdraw. (I was just randomly toggling prefs to see if anything caused a crash!)

I'm sceptical that enough users are flipping this pref in the wild to give us these crash numbers, so maybe there are multiple ways to trigger it.

(In reply to Sotaro Ikeda [:sotaro] from comment #6)

> All crashes in Socorro have the following GraphicsCriticalError message. But the crashes in comment 0 did not have the message.
>
> |[0][GFX1-]: Failed to create EGLContext!: 0x300c
>
> 0x300c error means EGL_BAD_PARAMETER.
> https://searchfox.org/mozilla-central/rev/61fceb7c0729773f544a9656f474e36cd636e5ea/gfx/gl/GLContextProviderEGL.cpp#301

This is a red herring: since bug 1474281 we attempt to create an OpenGL context first, then fall back to GLES. The error message is from failing to create the GL context, but the GLES context is created successfully immediately afterwards.

I cannot reproduce this crash ever in GVE, but can reproduce in Fenix. Setting ShaderPrecacheFlags::FULL_COMPILE makes it crash at startup. Sometimes in a debug overdraw shader, but not always, so I don't think that is important. The specific shader which crashes seems to vary: sometimes it is the first one, sometimes a few compile successfully before the crash.

Figured a bit more of this out:

  • I can in fact reproduce from GVE fairly easily, but it is even easier in Sample Browser / Fenix. I think this is because SkiaGL (used for the Android UI) can either trigger the crash itself or help set up the required state for the crash to occur.
  • The crash seems to occur when calling glLinkProgram when one of the shader sources is identical to a shader source used for a previously linked program, perhaps due to a bug in some driver-internal code which attempts to cache shaders (see the repro sketch after this list).
  • This is a very common scenario when gfx.webrender.debug.show-overdraw is enabled, as long as gfx.webrender.use-optimized-shaders is also enabled. This is because the shader optimization pass makes it so that:
    a) The vertex source for a debug-overdraw variant is identical to the non-debug-overdraw variant (as debug overdraw only affects the fragment shader).
    b) Different shaders' debug overdraw variants have the exact same fragment source as each other (because it just outputs a fixed colour).
  • The specific shader which crashed kept changing because of webrender's shader cache. Say we have programs A, B, and C which all have identical fragment shader source. On the first run A will be successfully compiled and cached, then B will cause the crash. On the second run A will be loaded from the cache, B will be successfully compiled, then C will cause the crash. And so on.
  • Even with this knowledge, I have been unable to reproduce in wrench or a custom-written test app. And, as I said, it occurs less frequently in GVE than in Fenix, so there must be some required state that I haven't figured out yet.
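
A pseudo-repro of the suspected pattern, sketched against the gleam Gl trait that WebRender already uses (VS_A / VS_B and the surrounding GL setup are placeholders, and the exact gleam signatures should be double-checked; this is only meant to show the sharing pattern, not a confirmed standalone repro):

    use gleam::gl::{self, Gl};

    /// Compile a vertex/fragment pair and link it into a program. On the
    /// affected Adreno 505/506 devices the crash happens inside link_program()
    /// for the second program that reuses identical shader source.
    fn build_program(gl: &dyn Gl, vs_src: &str, fs_src: &str) -> gl::GLuint {
        let vs = gl.create_shader(gl::VERTEX_SHADER);
        gl.shader_source(vs, &[vs_src.as_bytes()]);
        gl.compile_shader(vs);

        let fs = gl.create_shader(gl::FRAGMENT_SHADER);
        gl.shader_source(fs, &[fs_src.as_bytes()]);
        gl.compile_shader(fs);

        let program = gl.create_program();
        gl.attach_shader(program, vs);
        gl.attach_shader(program, fs);
        gl.link_program(program);
        program
    }

    // Both programs get byte-identical fragment source, which is exactly what
    // the optimised debug-overdraw variants end up with:
    //   let p1 = build_program(gl, VS_A, SHARED_FS); // links fine
    //   let p2 = build_program(gl, VS_B, SHARED_FS); // second link crashes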

Now, debug overdraw will not be the reason users are hitting this crash: it's unlikely many users have enabled a debug option, it only causes crashes in conjunction with optimized shaders (which only landed a month ago), and some (most?) crashes appear to be on fenix stable rather than nightly, which means webrender probably isn't even enabled. My theory is therefore that webgl pages are causing this. A website could attempt to compile 2 different programs which share a common shader. Perhaps even visiting a webgl app multiple times could trigger this. I have, however, been unable to trigger this myself by writing a webgl app.

Fixing webrender's optimized debug-overdraw shaders is simple (we can add a unique comment to each shader's source). While I'm not certain webgl is the cause of these crashes, perhaps we should try to do the same there and see if the numbers go down. Jeff, do you think that'd be reasonable? Speculatively appending a unique comment to the end of each shader string webgl passes to glShaderSource()? (On Adreno 506 only)
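
For illustration, a minimal sketch of that kind of workaround (not the actual patch that landed; the names here are made up):

    use std::sync::atomic::{AtomicU64, Ordering};

    static NEXT_SHADER_ID: AtomicU64 = AtomicU64::new(0);

    /// Append a comment containing a unique token to the shader source so the
    /// driver never sees two byte-identical sources. GLSL comments are ignored
    /// by the compiler, but hopefully not by whatever source-keyed cache the
    /// driver maintains internally.
    fn make_source_unique(source: &str) -> String {
        let id = NEXT_SHADER_ID.fetch_add(1, Ordering::Relaxed);
        format!("{}\n// unique-id: {}\n", source, id)
    }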

Flags: needinfo?(jgilbert)

Wow, a bug in driver-level shader caching sounds awful. It's worth trying a comment, but it might cache with comments stripped. If that doesn't work, we can try adding a random unused variable or something.

Great dissection!

Flags: needinfo?(jgilbert)
Crash Signature: [@ libllvm-glnext.so@0x732bb0 ] → [@ libllvm-glnext.so@0x732bb0 ] [@ libllvm-glnext.so@0x732610 ] [@ libllvm-glnext.so@0x732ba0 ] [@ libllvm-glnext.so@0x732600 ] [@ libllvm-glnext.so@0x7acb6c ]
See Also: → 1595821

On some Adreno 505 and 506 devices we are encountering driver crashes during
glLinkProgram(). The only circumstance in which we have been able to reproduce
locally is when the show-overdraw debug option is enabled. The reason appears to
be that, due to shader optimisation, the debug overdraw variants of many shaders
have identical source code. The crash seems to occur when linking a shader which
has identical source code to a previously linked shader.

This does not, however, explain the significant number of crashes in the
wild, because a) it's unlikely many users are enabling overdraw debugging, and
b) some crash reports predate the commit which enabled shader optimisation.
It is possible, though, that for a different reason we are compiling multiple
shaders with identical source code.

To attempt to work around this crash, this change adds a random comment to the
end of each shader source string on the affected devices.

Assignee: nobody → jnicol
Status: NEW → ASSIGNED
Pushed by jnicol@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/bbe5ed51273b
Ensure shader sources are always unique to workaround adreno crash. r=gw

I'm not very optimistic about this fixing the bug, so let's leave this open for now

Keywords: leave-open
Pushed by ccoroiu@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/e76ed046c4f9
Backed out changeset bbe5ed51273b for webrender failures. CLOSED TREE

Sorry, stupid mistake. I forgot to update webrender's Cargo.lock.

Flags: needinfo?(jnicol)
Pushed by jnicol@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/c08d4fe356e5
Ensure shader sources are always unique to workaround adreno crash. r=gw

@jnicol: Could you add a resolved callstack to this bug?

Flags: needinfo?(jnicol)