Crash in mozalloc_abort | abort | core::option::expect_failed | webrender::resource_cache::ResourceCache::get_cached_image

RESOLVED FIXED in mozilla57

Status

()

defect
P1
normal
RESOLVED FIXED
2 years ago
Last year

People

(Reporter: kah0922, Assigned: nical)

Tracking

(Depends on 1 bug)

57 Branch
mozilla57
x86_64
All
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox56 unaffected, firefox57 unaffected)

Details

(Whiteboard: [wr-mvp] [gfx-noted], crash signature)

Attachments

(1 attachment, 1 obsolete attachment)

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0
Build ID: 20170906100107

Steps to reproduce:

It seems to happen randomly, especially from switching tabs.
I've encountered this on both of my systems running Intel and AMD graphics.
Crash reports: https://crash-stats.mozilla.com/signature/?product=Firefox&signature=mozalloc_abort%20%7C%20abort%20%7C%20core%3A%3Aoption%3A%3Aexpect_failed%20%7C%20webrender%3A%3Aresource_cache%3A%3AResourceCache%3A%3Aget_cached_image


Actual results:

The browser crashed.


Expected results:

It should not crash.
Component: Untriaged → Graphics: WebRender
OS: Unspecified → Linux
Product: Firefox → Core
Hardware: Unspecified → x86_64
Duplicate of this bug: 1397459
I'm assuming you're running with webrender enabled.
Crash Signature: [@ core::option::expect_failed | webrender::resource_cache::ResourceCache::get_cached_image] [@ mozalloc_abort | abort | core::option::expect_failed | webrender::resource_cache::ResourceCache::get_cached_image]
Yeah, webrender is enanbled, as is webrenderest, layers-free, and blob-images.
Priority: P3 → P2
Whiteboard: [gfx-noted] → [wr-mvp] [gfx-noted]
Yeah you can get this pretty reliably in a ~minute just opening tabs to websites and toggling youtube videos in and out of fullscreen.

This is probably(?) the last major blocker for making webrender dogfoodable.
Depends on: 1398533
Duplicate of this bug: 1397779
This is the #9 Windows topcrash in Nightly 20170908220146.
Did some investigation tonight into why this is happening:

* Always a resource with this format `ImageRequest { key: ImageKey(IdNamespace(1), 309), rendering: Auto, tile: None }`

* Always removed as a result of running the ~WebRenderImageData destructor.

* Problem destructor generally shows up in a big batch of destructors, suggesting some kind of bulk event.

* Image seems to always be removed very close to the crash, suggesting it's used on the same/next frame.
Assignee: nobody → a.beingessner
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Priority: P2 → P1
Target Milestone: --- → mozilla57
Crash Signature: [@ core::option::expect_failed | webrender::resource_cache::ResourceCache::get_cached_image] [@ mozalloc_abort | abort | core::option::expect_failed | webrender::resource_cache::ResourceCache::get_cached_image] → [@ core::option::expect_failed | webrender::resource_cache::ResourceCache::get_cached_image] [@ mozalloc_abort | abort | core::option::expect_failed | webrender::resource_cache::ResourceCache::get_cached_image] [@ mozalloc_abort | abort | core::option::…
OS: Linux → All
Some more notes:

* It's a WrImageMask that's going missing; looks like the mask that fades out text that's too long to fit on a tab.

* It's possible this is related to a graphical glitch where the whole chrome flickers if your tabs are full enough to fill the title bar.
Shot in the dark, markus any ideas on this?
Flags: needinfo?(mstange)
Oh, we were suspicious that maybe it was an EmptyTransaction thing, but changing WebRenderLayerManager::EndEmptyTransaction to just `return false` doesn't fix it.
(In reply to Alexis Beingessner [:Gankro] from comment #8)
> * It's possible this is related to a graphical glitch where the whole chrome
> flickers if your tabs are full enough to fill the title bar.

If this bug occurs with webrendest, then it has nothing to do with the tab text glitch. The tab text glitch (bug 1389476) is due to changing layerization, but with webrendest there is no layerization.
Flags: needinfo?(mstange)
Ok so we've figured this out. The issue is that when we go to EndTransaction, we don't atomically batch up deleting images  from the previous display list (DiscardImages -> update_resources) with the sending of the new display list. So we end up deleting the images, then sending the display list (set_display_list), creating a race condition where a render can happen between those two events.

The solution is to get the ResourceUpdates from DiscardImages and stuff them in set_display_list, so that this is properly atomic.

Nical has volunteered to fix this, as he's in the middle of refactoring all of this code anyway.
Assignee: a.beingessner → nical.bugzilla
This should do.
I think that fonts have the same issue (although we churn them a lot less so it's less likely to happen).
Attachment #8907172 - Flags: review?(a.beingessner)
Comment on attachment 8907172 [details] [diff] [review]
Apply deferred image key deletions to the next transaction

Review of attachment 8907172 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/wr/WebRenderLayerManager.cpp
@@ +822,5 @@
> +  for (const auto& key : mImageKeysToDelete) {
> +    resourceUpdates.DeleteImage(key);
> +  }
> +  mImageKeysToDelete.Clear();
> +

1:28.60 In file included from /Users/ABeingessner/dev/gecko/obj-x86_64-apple-darwin16.7.0/gfx/layers/Unified_cpp_gfx_layers11.cpp:74:
 1:28.60 /Users/ABeingessner/dev/gecko/gfx/layers/wr/WebRenderLayerManager.cpp:809:24: error: invalid range expression of type 'mozilla::wr::ResourceUpdateQueue'; no viable 'begin' function available
 1:28.60   for (const auto& key : mImageKeysToDelete) {
 1:28.60                        ^ ~~~~~~~~~~~~~~~~~~
 1:28.60 /Users/ABeingessner/dev/gecko/gfx/layers/wr/WebRenderLayerManager.cpp:810:5: error: use of undeclared identifier 'resourceUpdates'
 1:28.60     resourceUpdates.DeleteImage(key);
 1:28.60     ^
 1:28.60 2 errors generated.
Yeah the patch is rebased on top of bug 1393031. In the current state of mozilla-central mImageKeysToDelete is not an nsTArray (yet).
Comment on attachment 8907172 [details] [diff] [review]
Apply deferred image key deletions to the next transaction

Tried to crash the browser for 10 minutes with this, and nothing! \o/
Attachment #8907172 - Flags: review?(a.beingessner) → review+
Note to checkin people, this patch is based on top of https://bugzilla.mozilla.org/show_bug.cgi?id=1393031, which has been pushed to inbound but hasn't hit unified yet. Dunno what the process is!
Keywords: checkin-needed
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/5dff64340068
Apply deferred image key deletions to the next transaction. r=Gankro
Keywords: checkin-needed
Pushed by nsilva@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/70f5f23a429f
Apply deferred image key deletions to the next transaction. r=Gankro
Although not all of bug 1393031 has made it to m-c yet, the part needed by this patch is in now.
Flags: needinfo?(nical.bugzilla)
https://hg.mozilla.org/mozilla-central/rev/70f5f23a429f
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Backed out for letting Talos g1 timeout and spam its log on Linux x64 QuantumRender opt:

https://hg.mozilla.org/mozilla-central/rev/c99a1520c7a5af8f3769d3a4fce38697ce47cbb2

Merge to mozilla-central with failing job: https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=7aceaf8bcb9f582db0f93488b48ef7019e348dba&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=runnable
Log of failing job (>50MB due to the log spam): https://treeherder.mozilla.org/logviewer.html#?job_id=131349043&repo=mozilla-central
It contains many lines like these:
> Delete the non-exist key:ImageKey(IdNamespace(1), 1)
Status: RESOLVED → REOPENED
Flags: needinfo?(nical.bugzilla)
Resolution: FIXED → ---
Target Milestone: mozilla57 → ---
Target Milestone: --- → mozilla57
This is the #7 Windows topcrash in Nightly 20170915100121.
D'oh! I think that this is caused by our bad pattern of creating an image and immediately queueing it's destruction in various places. That'll take a bit longer to fix so I'll start by delaying the deletion of the images by one transaction which sucks for memory usage but should fix the issue while we work through the nastiness of the image key management.
Flags: needinfo?(nical.bugzilla)
Nical, this is still WebRender only, right?
Flags: needinfo?(nical.bugzilla)
Same patch with an additional array of image keys that buffers the deletion and to delay them to the transaction after next.

try push looks good so far: try push with this patch looks good so far: https://treeherder.mozilla.org/#/jobs?repo=try&revision=1c27c1f61b9bae2e9d2e645d599b916b4c13c3a9
Attachment #8907172 - Attachment is obsolete: true
Flags: needinfo?(nical.bugzilla)
Attachment #8909345 - Flags: review?(a.beingessner)
(In reply to Milan Sreckovic [:milan] from comment #26)
> Nical, this is still WebRender only, right?

Absolutely.
Comment on attachment 8909345 [details] [diff] [review]
Apply deferred image key deletions to the next transaction (delayed by one transaction)

Review of attachment 8909345 [details] [diff] [review]:
-----------------------------------------------------------------

r=me with typo fix

::: gfx/layers/wr/WebRenderLayerManager.h
@@ +249,5 @@
>  
>  private:
>    nsIWidget* MOZ_NON_OWNING_REF mWidget;
>    nsTArray<wr::ImageKey> mImageKeysToDelete;
> +  // TODO - This is needed because we have some code that creates a image keys

typo: "a images" -> "images"
Attachment #8909345 - Flags: review?(a.beingessner) → review+
Pushed by nsilva@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/9f8c6be7d2f4
Apply deferred image key deletions to the next transaction. r=Gankro
https://hg.mozilla.org/mozilla-central/rev/9f8c6be7d2f4
Status: REOPENED → RESOLVED
Closed: 2 years ago2 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.