Closed Bug 1524280 Opened 5 years ago Closed 5 years ago

Crash in mozilla::wr::Moz2DRenderCallback (ExternalSourceSurfaceCreation PLAY failure) - Still on Google Slides?

Categories

(Core :: Graphics: WebRender, defect, P3)

Hardware: x86_64
OS: All
Type: defect

Tracking


VERIFIED FIXED
mozilla69
Tracking Status
firefox-esr60 --- unaffected
firefox67 --- wontfix
firefox68 --- verified
firefox69 --- verified

People

(Reporter: jan, Assigned: aosmond)

References

(Blocks 2 open bugs)

Details

(Keywords: crash, nightly-community, regression)

Crash Data

Attachments

(2 files, 1 obsolete file)

+++ This bug was initially created as a clone of Bug #1516011 +++

(Andrew Osmond [:aosmond] from bug 1516011 comment 20)

I see https://crash-stats.mozilla.com/report/index/9f210bca-5925-4eb8-8e08-8fdd50190115#tab-details which includes my fix. I am now puzzled as to the root cause.

Still happening on Nvidia/Win10. From yesterday:

This bug is for crash report bp-40087b07-eaf5-48b7-8d80-8c9890190131.

MOZ_RELEASE_ASSERT(false)
GraphicsCriticalError |[G0][GFX1-]: Updating unknown shared surface: 979252543489 (t=548804) |[G1][GFX1-]: Replay failure: ExternalSourceSurfaceCreation PLAY (t=619060) |[G2][GFX1-]: Replay failure: ExternalSourceSurfaceCreation PLAY (t=619060)

Top 10 frames of crashing thread:

0 xul.dll static bool mozilla::wr::Moz2DRenderCallback gfx/webrender_bindings/Moz2DImageRenderer.cpp:435
1 xul.dll wr_moz2d_render_cb gfx/webrender_bindings/Moz2DImageRenderer.cpp:474
2 xul.dll static struct  gfx/webrender_bindings/src/moz2d_renderer.rs:526
3 xul.dll static void rayon::iter::plumbing::bridge_producer_consumer::helper<rayon::vec::VecProducer<webrender_bindings::moz2d_renderer::Job>, rayon::iter::map::MapConsumer<rayon::iter::collect::consumer::CollectConsumer< third_party/rust/rayon/src/iter/plumbing/mod.rs:418
4 xul.dll static void rayon_core::job::{{impl}}::execute<rayon_core::latch::SpinLatch, closure,  third_party/rust/rayon-core/src/job.rs:113
5 xul.dll static void rayon_core::registry::WorkerThread::wait_until_cold<rayon_core::latch::CountLatch> third_party/rust/rayon-core/src/registry.rs:567
6 xul.dll static void std::sys_common::backtrace::__rust_begin_short_backtrace<closure,  z:/libstd/sys_common/backtrace.rs:137
7 xul.dll static void alloc::boxed::{{impl}}::call_box< z:/liballoc/boxed.rs:672
8 xul.dll static void std::sys::windows::thread::{{impl}}::new::thread_start src/libstd/sys/windows/thread.rs:56
9 kernel32.dll BaseThreadInitThunk 

See Also: → 1516011
Crash Signature: [@ mozilla::wr::Moz2DRenderCallback] → [@ mozilla::wr::Moz2DRenderCallback] [@ wr_moz2d_render_cb ]
OS: Windows 10 → All
Priority: -- → P4
Assignee: nobody → aosmond
Blocks: wr-67
No longer blocks: stage-wr-trains
Priority: P4 → P3

I just had bp-77f1a7fd-b465-4b47-aba4-467cf0190319 on Google Slides (macOS), which seems related.

Blocks: wr-68
No longer blocks: wr-67

Per bug 1540853 comment 4 and bug 1540853 comment 6, this is the #2 WR-related crash in the 66 release experiment, and #3 in 67 beta. It would be good to track it down if we can.

Just from looking at the codepaths it seems that if the parent bails out here then that could result in this crash. Can we add some logging or such to catch this scenario? Or maybe insert a dummy entry into the sInstances table so that the replay doesn't just fail?

(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #5)

Just from looking at the codepaths it seems that if the parent bails out here then that could result in this crash. Can we add some logging or such to catch this scenario? Or maybe insert a dummy entry into the sInstances table so that the replay doesn't just fail?

Good point. If the memory map fails, we will crash. This should not be considered an impossible/unlikely error, especially on . The general image use case is here:

https://searchfox.org/mozilla-central/rev/b59a99943de4dd314bae4e44ab43ce7687ccbbec/gfx/webrender_bindings/RendererOGL.cpp#30

That path fails gracefully, and we do see those warnings in the critical logs from time to time (either for the same failure or a different one).
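(For context, a minimal sketch in plain C++ of what failing gracefully looks like here; the type and function names are invented for illustration and are not the actual Moz2DImageRenderer.cpp code. The idea is to log a replay failure and skip the item when the surface is not mapped, instead of hitting the release assert.)

#include <cstdint>
#include <cstdio>
#include <unordered_map>

// Stand-in for the real shared-surface type; the name is illustrative only.
struct SharedSurface { /* pixel data, size, format, ... */ };

// Hypothetical registry keyed by external image ID; the real code keeps a
// table of surfaces registered by the content process.
static std::unordered_map<uint64_t, SharedSurface> sSurfaces;

// Fail gracefully: log a critical error and skip the item when the surface
// was never mapped, rather than MOZ_RELEASE_ASSERT(false).
bool ReplayExternalSurfaceCreation(uint64_t aExternalImageId) {
  auto it = sSurfaces.find(aExternalImageId);
  if (it == sSurfaces.end()) {
    std::fprintf(stderr,
                 "Replay failure: ExternalSourceSurfaceCreation, extId %llu\n",
                 static_cast<unsigned long long>(aExternalImageId));
    return false;  // the caller can skip this item instead of crashing
  }
  // ... rasterize the display item using it->second ...
  return true;
}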

Keywords: leave-open
Pushed by aosmond@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/5d9d5394671f
Add logging for crash reports to help diagnose why the buffer is missing. r=kats

So https://crash-stats.mozilla.com/report/index/ff27f77c-0f16-41f8-a1f3-a71510190507 and https://crash-stats.mozilla.com/report/index/42a16c3a-5917-4041-ac42-fa9590190508 both have this patch but no relevant messages in the GraphicsCriticalError field. So maybe the crash is coming from something else?

https://crash-stats.mozilla.com/report/index/b028aea6-06fa-451e-9236-123bc0190506#tab-metadata has this: DataSourceSurface of SharedSurfaces does not exist for extId:21474837290, maybe that's relevant?

Google Slides is rendered entirely with blob images, so I suppose it makes sense that it is the most likely place to observe the problem. These do not appear to be animated images, so it appears the following is insufficient or missing some corner case:

https://searchfox.org/mozilla-central/rev/11cfa0462a6b5d8c5e2111b8cfddcf78098f0141/gfx/layers/wr/WebRenderCommandBuilder.cpp#181

(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #10)

https://crash-stats.mozilla.com/report/index/b028aea6-06fa-451e-9236-123bc0190506#tab-metadata has this: DataSourceSurface of SharedSurfaces does not exist for extId:21474837290, maybe that's relevant?

Also seen in bug 1553522.

See Also: → 1553522

I have had some modest, if tedious, success in reproducing.

With some tweaks, in particular setting image.mem.surfacecache.min_expiration_ms to something very small, I have been able to reproduce this consistently on nightly and local builds.

STR:

  1. Go to about:config and set image.mem.surfacecache.min_expiration_ms to 1 and gfx.webrender.blob.paint-flashing to true. Restart browser.
  2. Browse to https://www.yr.no/kart/?spr=eng#lat=57.75946&lon=11.89664&zoom=6&laga=nedb%C3%B8rskyer&proj=3575.
  3. Once the page loads, move the mouse between blob tiles with a weather icon.
  4. Crashes very quickly.
Has STR: --- → yes

I did a mozregression in case it was illuminating, but it mostly just points to bug 1456555, which is an artifact of my STR. I was able to make it crash without bug 1456555, but not consistently.

What the logs suggest is that AsyncImagePipelineManager::HoldExternalImage is not keeping the external surfaces alive long enough. This was a key assumption in the lifetime design. I'm not sure it is HoldExternalImage's fault exactly -- it is doing the job it was asked to do; I suspect the blob lives longer than our hold on it, rather than HoldExternalImage failing to hold onto the surface long enough. I think we should move to an implementation where we are more explicit about the blob dependencies. That would probably prevent similar problems in the future. That's what I'm looking into now.

When a blob image invalidates, it doesn't always repaint the entire
blob. When we stored the shared surface references in the DIGroup, it
would incorrectly forget about images referenced by items that were not
invalidated when it repainted. As such, it could free them too early and
cause a crash when rasterizing the blob in the compositor process.

This did not crash most of the time because the image cache would bail
us out. It takes a full minute for the image cache to expire, but the
issue was more readily reproducible when that timeout was shortened.

We now store the references in BlobItemData, on a per display item
basis. This ensures that when any given item is invalidated, it will
continue referencing any resources that it needs.

Additionally we now also post the releasing of the shared surface image
keys and external image ID to the main thread. This allows the current
transaction to complete before freeing the surface, which guards against
cases where the surface is referenced and released somehow in the space
of the same transaction.
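
(A rough sketch of the ownership change described above, in plain C++ with invented names -- not the actual WebRenderCommandBuilder.cpp types, and std::shared_ptr standing in for Gecko's RefPtr. The point is that each display item records its own surface dependencies, so repainting only the invalidated items cannot drop surfaces still needed by untouched items.)

#include <memory>
#include <utility>
#include <vector>

struct SharedSurface { /* ... */ };
using SharedSurfaceRef = std::shared_ptr<SharedSurface>;  // models RefPtr

// Before the fix (conceptually): the references lived on the whole group and
// were rebuilt from only the invalidated items on each repaint, so surfaces
// used by untouched items could be released too early.
//
// After the fix: each display item keeps its own strong references.
struct BlobItemData {
  std::vector<SharedSurfaceRef> mExternalSurfaces;  // per-item strong refs
};

struct DIGroup {
  std::vector<BlobItemData> mItems;
};

// Called when an item is (re)painted: record which surfaces it used. Items
// that were not invalidated keep their existing references untouched, so
// their surfaces stay alive until those items are repainted or destroyed.
void RecordItemSurfaces(BlobItemData& aItem,
                        std::vector<SharedSurfaceRef> aUsedSurfaces) {
  aItem.mExternalSurfaces = std::move(aUsedSurfaces);
}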

We now also post the releasing of the shared surface image keys and
external image ID to the main thread. This allows the current
transaction to complete before freeing the surface, which guards against
cases where the surface is referenced and released somehow in the space
of the same transaction.
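
(And a sketch of the deferred release described here, again in plain C++ with a stand-in task queue rather than Gecko's actual main-thread dispatch helpers. Instead of dropping the last reference inline, ownership is handed to a task queued for the main thread, so the surface cannot be freed within the same transaction that still references it.)

#include <functional>
#include <memory>
#include <queue>
#include <utility>

struct SharedSurface { /* ... */ };

// Hypothetical stand-in for the main-thread event queue; the real code
// dispatches a runnable to the main thread instead.
static std::queue<std::function<void()>> sMainThreadTasks;

// Defer releasing the surface (and, in the real fix, its image keys and
// external image ID) until the posted task runs on the main thread. By then
// the current transaction has completed, so a reference taken and released
// within the same transaction cannot outlive the surface.
void PostReleaseToMainThread(std::shared_ptr<SharedSurface> aSurface) {
  sMainThreadTasks.push([surface = std::move(aSurface)]() mutable {
    surface.reset();  // last reference dropped here, after the transaction
  });
}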

Attachment #9067998 - Attachment description: Bug 1524280 - Store the blob image reference to each shared surface in BlobItemData. → Bug 1524280 - Part 2. Store the blob image reference to each shared surface in BlobItemData.
Attachment #9062483 - Attachment is obsolete: true
Pushed by aosmond@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/8b66f48247b6
Part 1. Ensure we always post when freeing SharedUserData. r=jrmuizel
https://hg.mozilla.org/integration/mozilla-inbound/rev/4d3bc03ee7a5
Part 2. Store the blob image reference to each shared surface in BlobItemData. r=jrmuizel
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla69

I verified this with the STR in nightly builds before and after the patch landed. It has been fixed.

Comment on attachment 9068095 [details]
Bug 1524280 - Part 1. Ensure we always post when freeing SharedUserData.

Beta/Release Uplift Approval Request

  • User impact if declined: Users may experience crashes when browsing sites with a lot of SVG images when using WebRender. In particular, Google Slides is impacted.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  1. Go to about:config and set image.mem.surfacecache.min_expiration_ms to 1 and gfx.webrender.blob.paint-flashing to true. Restart browser.
  2. Browse to https://www.yr.no/kart/?spr=eng#lat=57.75946&lon=11.89664&zoom=6&laga=nedb%C3%B8rskyer&proj=3575.
  3. Once the page loads, move the mouse between blob tiles with a weather icon.
  4. Should no longer crash.
  • List of other uplifts needed: Needs both parts 1 and 2 from bug 1524280
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): We are making the lifetimes of everything longer. It is unlikely to increase the crash rate, even if done wrong, because the lifetimes are already too short. The image cache has been hiding the issue by retaining the surfaces for an extra 60 seconds (in case the cache is queried again) and will continue to do so.
  • String changes made/needed: None
Attachment #9068095 - Flags: approval-mozilla-beta?
Attachment #9067998 - Flags: approval-mozilla-beta?

Comment on attachment 9068095 [details]
Bug 1524280 - Part 1. Ensure we always post when freeing SharedUserData.

webrender crash fix, approved for 68.0b7

Attachment #9068095 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #9067998 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Flags: qe-verify+
QA Whiteboard: [qa-triaged]

Hello,

Reproduced the issue with Firefox 69.0a1 (20190522152821) on Windows 10 x64, macOS 10.14 and Ubuntu 18.04.

On Ubuntu 18.04 and macOS 10.14, Firefox crashed and the “Crash Reporter” was displayed after doing the steps from comment 27.
On Windows 10 x64, no “Crash Reporter” window appeared and the browser window turned white. On the about:crashes page I observed 4-5 unsubmitted crash reports (depending on how many times the browser window turned white), all having the same signature (“mozilla::wr::Moz2DRenderCallback” - e.g. link).

The issue is verified using Firefox 69.0a1 (20190604214415) and Firefox 68.0b7 (20190603181408) on Windows 10 x64, Ubuntu 18.04, and macOS 10.14 (no crashes or white windows were encountered).

Status: RESOLVED → VERIFIED
Flags: qe-verify+
See Also: → 1509870