Closed Bug 1638593 Opened 5 years ago Closed 4 years ago

Fenix crashes when playing video on Samsung with WebRender enabled

Categories

(Core :: Graphics: WebRender, defect)

78 Branch
ARM64
Android
defect

Tracking

()

RESOLVED FIXED
89 Branch
Tracking Status
firefox-esr78 --- unaffected
firefox78 --- wontfix
firefox87 --- wontfix
firefox88 --- fixed
firefox89 --- fixed

People

(Reporter: ekager, Assigned: jnicol)

References

Details

(Keywords: crash)

Crash Data

Attachments

(1 file)

Originally reported as https://github.com/mozilla-mobile/fenix/issues/10655

Steps to reproduce

  1. Set gfx.webrender.all to true and restart browser
  2. Open https://dtf.ru/new
  3. Find any embedded youtube video
  4. Start playing video
  5. Crash
    If browser not crashed, scroll for another youtube video or refresh page and try again

about:crashes leads me to this page:
https://crash-stats.mozilla.org/report/index/0ae9329c-13c1-4ae7-8f39-199730200514

Expected behavior

Video played without crashes

Actual behavior

Browser crashed

Device information

  • Android device: Samsung S6 Edge / Android 7.0 (Exynos 7420 / Mali-T760)
  • Fenix version: All Fenix versions affected
Blocks: wr-mali
Crash Signature: [@ libGLES_mali.so@0x80db3c ]
Keywords: crash
OS: Unspecified → Android
Hardware: Unspecified → ARM64

This sounds quite serious! We should fix this.
Jamie, could you repro?

Severity: -- → S2
Flags: needinfo?(jnicol)

Lowering severity to S3 as we aren't yet shipping WR on that platform by default (but we want to!).

Severity: S2 → S3

As you say, we're not shipping to Mali yet. Mali-Gxx should be soon, but Txxx will take a bit longer.

FWIW I cannot reproduce on a Galaxy S9 (Mali-G72).

Flags: needinfo?(jnicol)
Blocks: wr-malit-release
No longer blocks: wr-mali

I cannot reproduce on a Galaxy A5 2017 (Mali-T830). All crashes matching that signature are on a T-760, so maybe it's very GPU-specific. I'd say proceed with enabling on nightly for all Mali-Ts, and keep an eye on this. If the numbers spike we can disable for affected devices, and try to get our hands on one so we can fix it.

From comment 0, I wonder if GL_OES_EGL_image_external_essl3 does not work well on Mali-T760.

I have got Samsung S6 Edge with Android 7.0 (Mali-T760). I could not reproduce the crash of comment 0. But I could reproduce the crash of the following STR

On latest m-c, crash happened with the STR of comment 6.


glDrawElementsInstanced 0x00000000bf3594e0
gleam::ffi_gles::Gles2::DrawElementsInstanced::h125c6ce4e56035a1 gles_bindings.rs:2261
_$LT$gleam..gl..GlesFns$u20$as$u20$gleam..gl..Gl$GT$::draw_elements_instanced::hab5c351f21ff4bb7 gles_fns.rs:1215
webrender::device::gl::Device::draw_indexed_triangles_instanced_u16::h495291381f2d1352 gl.rs:3500
webrender::renderer::Renderer::draw_instanced_batch::h706a197f30bde517 mod.rs:2705
webrender::renderer::Renderer::draw_blurs::hb4bb8972829eb3c8 mod.rs:3825
webrender::renderer::Renderer::draw_alpha_target::hf0e51b15df50cd87 mod.rs:3965
webrender::renderer::Renderer::draw_frame::h869ab55439a4a366 mod.rs:4674
webrender::renderer::Renderer::render_impl::he0ce2eec139c9826 mod.rs:2131
webrender::renderer::Renderer::update::hc0f12041d169ecc4 mod.rs:1447
wr_renderer_update bindings.rs:627
mozilla::wr::RendererOGL::UpdateAndRender(mozilla::Maybe<mozilla::gfx::IntSizeTyped<mozilla::gfx::UnknownUnits> > const&, mozilla::Maybe<mozilla::wr::ImageFormat> const&, mozilla::Maybe<mozilla::Range<unsigned char> > const&, bool*, mozilla::wr::RendererStats*) RendererOGL.cpp:173
mozilla::wr::RenderThread::UpdateAndRender(mozilla::wr::WrWindowId, mozilla::layers::BaseTransactionId<mozilla::VsyncIdType> const&, mozilla::TimeStamp const&, bool, mozilla::Maybe<mozilla::gfx::IntSizeTyped<mozilla::gfx::UnknownUnits> > const&, mozilla::Maybe<mozilla::wr::ImageFormat> const&, mozilla::Maybe<mozilla::Range<unsigned char> > const&, bool*)

During the STR of comment 7, RenderAndroidSurfaceTextureHost was not used. Then video seemed not related to the crash. Instead, draw_instanced_batch() call within Renderer::draw_blurs() seemed to related to the crash.

Bug 1640960 comment 37 added non-instancing rendering. But enabling it also caused the crash like the following :(


glDrawElements 0x00000000b65e9f04
gleam::ffi_gles::Gles2::DrawElements::hdfbcc41d8584cb83 gles_bindings.rs:2257
_$LT$gleam..gl..GlesFns$u20$as$u20$gleam..gl..Gl$GT$::draw_elements::hcc601f4a57c4077e gles_fns.rs:1197
webrender::device::gl::Device::draw_indexed_triangles::h389c952e9bbe322f gl.rs:3487
webrender::renderer::Renderer::draw_instanced_batch::h706a197f30bde517 mod.rs:2710
webrender::renderer::Renderer::draw_blurs::hb4bb8972829eb3c8 mod.rs:3825
webrender::renderer::Renderer::draw_alpha_target::hf0e51b15df50cd87 mod.rs:3965
webrender::renderer::Renderer::draw_frame::h869ab55439a4a366 mod.rs:4674
webrender::renderer::Renderer::render_impl::he0ce2eec139c9826 mod.rs:2131
webrender::renderer::Renderer::update::hc0f12041d169ecc4 mod.rs:1447
wr_renderer_update bindings.rs:627
mozilla::wr::RendererOGL::UpdateAndRender(mozilla::Maybe<mozilla::gfx::IntSizeTyped<mozilla::gfx::UnknownUnits> > const&, mozilla::Maybe<mozilla::wr::ImageFormat> const&, mozilla::Maybe<mozilla::Range<unsigned char> > const&, bool*, mozilla::wr::RendererStats*) RendererOGL.cpp:173
mozilla::wr::RenderThread::UpdateAndRender(mozilla::wr::WrWindowId, mozilla::layers::BaseTransactionId<mozilla::VsyncIdType> const&, mozilla::TimeStamp const&, bool, mozilla::Maybe<mozilla::gfx::IntSizeTyped<mozilla::gfx::UnknownUnits> > const&, mozilla::Maybe<mozilla::wr::ImageFormat> const&, mozilla::Maybe<mozilla::Range<unsigned char> > const&, bool*) RenderThread.cpp:480

Sotaro, did you manage to find out anything else about this? Are you able to make a testcase which reproduces the crash?

Flags: needinfo?(sotaro.ikeda.g)

Closed bug 1696989 as a duplicate of this, and added its signature to this bug.

As I wrote in that bug, since bug 1701633 landed some crash reports are trickling in with the GraphicsDrawShader annotation. We appear to be crashing either during cs_clip_box_shadow_TEXTURE_2D or cs_blur_ALPHA_TARGET draw calls. The blur one matches Sotaro's findings in comment 8.

Also, if we search crash stats for libGLES_mali.so, these 2 signatures are the highest. But there are quite a few other signatures on various Mali-T8xx devices too which may be related. Some of these have not had recent enough crashes to contain the GraphicsDrawShader annotation, but some have and they contain the same 2 shaders. So it's likely that this bug applies to more Mali-T devices than just the T760.

I am able to reproduce the cs_blur case on the site f95zone.to/latest (NSFW), and I can reproduce the cs_clip_box_shadow crash on https://www.w3schools.com/cssref/tryit.asp?filename=trycss3_image_gallery

Crash Signature: [@ libGLES_mali.so@0x80db3c ] → [@ libGLES_mali.so@0x80db3c ] [@ libGLES_mali.so@0x761254 ]
Flags: needinfo?(sotaro.ikeda.g)

When investigating this I noticed that after adding enough logging to draw_alpha_target() the crashes no longer reproduced. Specifically after adding a lot of logging between the glClear()s for ones and zeroes, and the draw call which triggers the crash on the respective pages (either blurs or box shadows). Likewise, adding a glFlush() after the clears avoids the crash, as does removing the clears altogether (though of course that breaks rendering).

This explains why cs_blur and cs_clip_box_shadow are affected - what those two shaders have in common is that we always clear the regions of the alpha target (with either ones or zeroes) prior to rendering with them.

So my theory is that there's some sort of timing issue between calling glClear on an alpha target and then doing a draw call. However, on an alternative line of investigation I was playing around with the shaders themselves. I tried removing the textureSize() calls and hard coding the values instead, just in case this bug was in any way similar to bug 1692848. It seemed a long shot given that was a Mali driver regression in Android 11 on Mali-G devices, and was specific to YUV/texture external. However, to my surprise it did actually reliably prevent the crash. Furthermore, applying the same workaround as in bug 1692848 (ie continue to call textureSize, but also include a sample call in the vertex shader) did reliably prevent the cs_blur crash, but not the cs_clip_box_shadow crash.

The textureSize line of investigation doesn't seem nearly as satisfying as the glClear one to me. My hunch would be that the glClear is the real underlying issue, and that some shader changes can affect the timings in such a way that the crash may or may not occur.

So my proposed fix is to call glFlush after the clears in draw_alpha_target() on Mali-T

On several Mali-T devices we have observed crashes during draw calls
when rendering to an alpha target immediately after clearing regions
of the target with glClear. This appears to be timing-related as it
seems more likely to occur the greater the number of regions that have
been cleared, and it no longer occurs after adding enough
logging. Explicitly calling glFlush after the clears prevents the
crash.

Assignee: nobody → jnicol
Status: NEW → ASSIGNED

ARM have confirmed with me that this is a suitable fix.

Attachment #9214421 - Attachment description: Bug 1638593 - Call glFlush to avoid driver crash during draw call on Mali-T. r?gw → Bug 1638593 - Use shader to clear alpha targets on Mali-Txxx devices. r?gw,kvark
Pushed by jnicol@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/76ecb50a6999 Use shader to clear alpha targets on Mali-Txxx devices. r=gw

Dzmitry pointed out in the review that we could use the ps_clear shader instead to perform the clears, like we currently do on some platforms for picture cache tiles due to bug. That way we don't need to flush. I tested that this still avoids the crash, and changed the patch to do that.

Comment on attachment 9214421 [details]
Bug 1638593 - Use shader to clear alpha targets on Mali-Txxx devices. r?gw,kvark

Beta/Release Uplift Approval Request

  • User impact if declined: Crashes on some websites for users with Mali-T devices
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): Very localized change which only affects devices which are affected by the crash. Replaces usage the glClear function that causes the crash with an alternative clearing method. This alternative clear is already used elsewhere in the code on other platforms, so is unlikely to cause issues
  • String changes made/needed:
Attachment #9214421 - Flags: approval-mozilla-release?
Attachment #9214421 - Flags: approval-mozilla-beta?

We should certainly uplift this to beta, but possibly release too given the crash volume and that it's a relatively low-risk fix?

Comment on attachment 9214421 [details]
Bug 1638593 - Use shader to clear alpha targets on Mali-Txxx devices. r?gw,kvark

Extremely unlikely that we're going to dot release 87 at this point, but approved for Fenix 88.0.0-beta.6.

Attachment #9214421 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 89 Branch

Comment on attachment 9214421 [details]
Bug 1638593 - Use shader to clear alpha targets on Mali-Txxx devices. r?gw,kvark

88 is in RC now and there are no plans for an 87 dot release.

Attachment #9214421 - Flags: approval-mozilla-release? → approval-mozilla-release-
Crash Signature: [@ libGLES_mali.so@0x80db3c ] [@ libGLES_mali.so@0x761254 ] → [@ libGLES_mali.so@0x80db3c ] [@ libGLES_mali.so@0x761254 ] [@ libGLES_mali.so@0x685c0c]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: