Crash in [@ core::option::expect_failed | webrender::renderer::TextureResolver::bind]
Categories
(Core :: Graphics: WebRender, defect, P2)
Tracking
()
Tracking | Status | |
---|---|---|
firefox68 | --- | disabled |
People
(Reporter: jan, Unassigned)
References
(Blocks 1 open bug)
Details
(Keywords: crash, nightly-community, regression, Whiteboard: [fxperf:p1])
Crash Data
Attachments
(4 files)
Happens on Twitter if gfx.webrender.split-render-roots is true.
This bug is for crash report bp-b5ec86b8-cc89-46c9-b65e-ee4ca0190324.
Top 10 frames of crashing thread:
0 libxul.so MOZ_Crash mfbt/Assertions.h:314
1 libxul.so GeckoCrash toolkit/xre/nsAppRunner.cpp:5093
2 libxul.so gkrust_shared::panic_hook toolkit/library/rust/shared/lib.rs:237
3 libxul.so core::ops::function::Fn::call src/libcore/ops/function.rs:78
4 libxul.so std::panicking::rust_panic_with_hook src/libstd/panicking.rs:495
5 libxul.so std::panicking::continue_panic_fmt src/libstd/panicking.rs:398
6 libxul.so rust_begin_unwind src/libstd/panicking.rs:325
7 libxul.so core::panicking::panic_fmt src/libcore/panicking.rs:95
8 libxul.so core::option::expect_failed src/libcore/option.rs:1008
9 libxul.so webrender::renderer::TextureResolver::bind src/libcore/option.rs:322
Comment 1•5 years ago
|
||
Hello Doug - is this related to document splitting?
Comment 2•5 years ago
|
||
Yes, this presents itself with document splitting on and the gfx.webrender.debug.texture-cache.disable-shrink pref off. I've created Bug 1538710 to track the work of no longer depending on this pref.
Comment 4•5 years ago
|
||
The TextureResolver::bind signature is currently the #1 GPU process crash for WR-enabled nightly builds. crash-stats search results. There's ~127 crashes in the last 7 days. It seems unlikely that this is only triggered when document-splitting is enabled.
Comment 5•5 years ago
|
||
The crash signature was very low-volume until March 23, when it suddenly spiked. That's the same time the document-splitting stuff landed. So I'm fairly sure the spike in crashes is a result of those patches. Doug, do you have cycles to prioritize this, since it seems to affect users even in the default configuration of document-splitting disabled?
Comment 6•5 years ago
|
||
Yeah. I'm going to prioritize landing the work for bug 1538710 and see how that affects these numbers. Then if they're still present (likely based on comment 5), I'll dig into what else might be causing this.
Comment 7•5 years ago
|
||
Looks like this is still happening. Here's a crash from today's nightly: https://crash-stats.mozilla.com/report/index/147b0aab-757b-43b8-acaa-aef9d0190417
Updated•5 years ago
|
Comment 8•5 years ago
•
|
||
I haven't found any smoking gun yet. However, here is a summary of my spelunking so far:
-
Something is missing from the TextureCache's
texture_cache_map
-
This means one of two possibilities:
- The item has not been added yet, and yet we render with a frame that holds the key. This should not be possible, as in order to be added to the frame the item must have been queued to be added to
texture_cache_map
viaTextureCacheAllocationKind::Alloc
- The item has been removed, and we render with a frame that has not been updated. This is likely what's going on, and it's what was going on in bug 1538710. However, that was specific to multiple documents, where one document's frame build would trigger a clearing of cached resources, and the other document's frame would not be built, leading to a stale frame when it came time to render
- The item has not been added yet, and yet we render with a frame that holds the key. This should not be possible, as in order to be added to the frame the item must have been queued to be added to
-
Progressing from the assumption that we are removing the relevant
texture_cache_map
item, triggering a render, and yet not building a frame, we need to examine how we remove items fromtexture_cache_map
. -
The item is actually removed here
-
To hit this, we must go through
TextureUpdateList::push_free
-
This is called from
TextureCache::free
for a standalone entry, and fromTextureCache::clear_shared
-
TextureCache::free
is called from three places:TextureCache::expire_old_entries
- this is ruled out as it is only called during a frame buildTextureCache::upsert_entry
- this is ruled out for the same reasonTextureCache::clear_all
- this can be called independently outside of a frame build
-
TextureCache::clear_shared
is called from two places:TextureCache::maybe_reclaim_shared_memory
- this was the cause of bug 1538710, in which it was moved toTextureCache::before_frames
, where it forces us to build a frame by settingrequire_frame_build
totrue
TextureCache::clear_all
- (as stated above) this can be called independently outside of a frame build
-
-
-
This leaves us with the conclusion that either we've missed something, or we have to call
TextureCache::clear_all
without doing a frame build, but still trigger a render. -
So, going down that rabbit hole, there's two places where we can trigger
TextureCache::clear_all
- Through
NotifyWebRenderContextPurge
, which occurs immediately after a render call, meaning we would have to render again in order to hit it. - Through
NotifyMemoryPressure
, which seems to be triggerable any time(?)
- Through
-
So, assuming
NotifyMemoryPressure
is the culprit, because it's a possible way of clearing items fromtexture_cache_map
which isn't directly tied with a frame build, the question becomes whether we can then trigger a render while continuing to avoid the frame build. -
Before the fixes for bug 1538710, this was relatively easy - if we called
TextureCache::clear_all
from a memory pressure event, we could have the main document withframe_is_valid
equal totrue
, causingrender_frame
to be true andbuild_frame
to be false insideRenderBackend::update_document
, which would lead us to callwr_notifier_new_frame_ready
, without having built an actual frame (?). After bug 1538710, it seems trickier, though there is this strange comment referencing what seems to be a similar problem in the handling ofResultMsg::UpdateResources
, though that code seems to be dead. EDIT: ResultMsg::UpdateResources is indeed used, and seems to be used correctly. I just missed it because it only shows up in searchfox for text searches. I think my next step is to do some archaeology on that code and see if anything pops out. However, I still don't have any hunches as to how the document splitting patches would have affected any of this.
Comment 9•5 years ago
|
||
This is a bit of a stab in the dark, but I wanted to rule it out
as a potential cause. See comment in the code for some justification.
Comment 10•5 years ago
|
||
Marking as leave-open while we see if the above fix actually fixes the issue. (Even if it does we'll probably want to land a follow-up patch to clean it up)
Comment 11•5 years ago
|
||
Pushed by dothayer@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/6cbc4908bc4b Speculatively revert SetDocumentView usage r=kats
Comment 12•5 years ago
|
||
bugherder |
Comment 13•5 years ago
|
||
Hmm, not resolved, it seems. Back to the search...
Comment 14•5 years ago
|
||
Kats, I'm trying to make sense of why this has leveled off at 0 with a little blip on the 042209 build. I would chalk it up to my patch if it weren't for that blip. Do you think I should keep scratching my head at this or wait and see if it stays at 0?
Comment 15•5 years ago
|
||
There are likely multiple root causes for this crash; as I mentioned in comment 5 it was pre-existing, but spiked with the document splitting stuff. So if it's gone now except for a minor blip here and there I would say that the spike caused by document splitting has been resolved, and it's back to the baseline pre-existing crashes.
That being said, according to the little crash data table on this bug it looks like the 20190418221600 build was the first one that fixed the problem (if 20190419094745 was another "blip") which points to something before this patch landed. If that's true, the fix window would be https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=02b89c29412b6c1444fe32a4847e5261e2bb3d00&tochange=2ccc6648064315964dd23039ad28ebf7d9f82999 which contains bug 1536487 but I'm not sure how that might have fixed this.
It might be worth backing out the fix you landed to see if the crashes come back, to verify that it was your patch that fixed it. If so, then we can work on a more proper fix. If the crashes don't come back, then we can probably assume it was fixed by something else.
Comment 16•5 years ago
|
||
FYI, the crashes I reported in Bug 1538944 are no longer occurring.
Comment 17•5 years ago
|
||
Are you able to bisect using mozregression to find out exactly what fixed it?
Comment 18•5 years ago
|
||
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #17)
Are you able to bisect using mozregression to find out exactly what fixed it?
(Since bug 1538944 is specifically with document splitting enabled I'm nearly certain that's bug 1538710. I'm not sure that's worth bisecting given what we're trying to diagnose is non-document-splitting occurrences of this crash.)
Comment 19•5 years ago
|
||
I can bisect if you like, but my original bug was duped to this one -- should it have been duped to bug 1538710 instead?
Reporter | ||
Comment 20•5 years ago
|
||
The dupe is correct. Comment 0 was fixed by bug 1538710.
Comment 21•5 years ago
|
||
Yeah I guess we made a bit of a mess in repurposing this bug for the other crashes (which share the same signature).
Comment 22•5 years ago
|
||
Bisected. Bug 1538944 was fixed by Bug 1538710.
Comment 23•5 years ago
|
||
Thanks for the confirmation!
Comment 24•5 years ago
|
||
In trying to diagnose bug 1538540, I'm hitting my limits as far as
simply staring at the code and trying to work out possible ways to
hit the crash goes. This assertion will split the search space into
clear-related causes and non-clear-related causes to narrow things
down.
Comment 25•5 years ago
|
||
Pushed by dothayer@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/91859883a54d Sanity check frames after TextureCache clears r=bholley
Comment 26•5 years ago
|
||
Backed out changeset 91859883a54d (bug 1538540) for causing tests crashes with [@ GeckoCrash]
push that caused the backout: https://treeherder.mozilla.org/#/jobs?repo=autoland&resultStatus=testfailed%2Cbusted%2Cexception%2Cretry%2Cusercancel%2Crunning%2Cpending%2Crunnable&revision=91859883a54dbdd7cc5376f379466ceb5e8b4e9b
backout: https://hg.mozilla.org/integration/autoland/rev/ad490f39a317715e59875f6a630b60f345dd0bf6
Comment 27•5 years ago
|
||
Well, this actually looks like what we were hoping to catch, I think, and without having to make it to Nightly. So tentatively, hooray!
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
Comment 28•5 years ago
|
||
Eh, unfortunately it was a red herring - I needed to move a line down a few lines to not split the render_impl call. So this will in fact have to make it to Nightly to give useful information.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=f8dbda0ea7c74466ef00188e03e8bfa0c208de69
Comment 29•5 years ago
|
||
Pushed by dothayer@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/6462a738b3c5 Sanity check frames after TextureCache clears r=bholley
Comment 30•5 years ago
|
||
bugherder |
Comment 31•5 years ago
|
||
So, we are seeing crashes at the assertion I added. The document splitting pref is off in the crash, so I can see no possible way that the document being checked at the crash site is not the main, default document.
I can only see one way for this to happen, and it sounds really far-fetched, but it's all I've got right now.
- Somehow
TextureCache::clear_shared
is called. Let's say it's throughDebugCommand::ClearCaches
. I don't know if this is feasible or not, but we could also call it throughTextureCache::before_frames
, though we'd need some explanation for how we didn't then build a frame. - A
ResultMsg::UpdateResources
is sent withoutmemory_pressure
set totrue
. It seems this can only happen by the user initiating a capture viaCtrl+Shift+3
. - We call
Renderer::render_impl
without having updatedactive_documents
. This could occur ifdoc.has_pixels()
isfalse
(we would end up callingself.notifier.new_frame_ready
, triggering a render), or if there's an entry inactive_documents
withframe.must_be_drawn() == true
.
I think it's unlikely that this is the cause of the crashes we see, but since it's definitely possible, we should at least address it. So I think we need to fix the UpdateResource
message in save_capture
. Dzmitry, assuming my reasoning checks out to you, do you have any thoughts on how you'd like to see this done? We could set memory_pressure
to true, effectively invalidating the active_documents
. Or we could send over the frame built earlier in the method. Or something else?
Comment 32•5 years ago
|
||
I also doubt we can see any reasonable amount of crashes from Ctrl+Shift+3, given that it's only used by us and :darkspirit.
Looking at the fresh documents_seen
code, it appears to me that we'd hit the assertion if we had any outstanding texture cache tasks for an old document, and then we cleared the texture cache and sent a new document. The "PublishDocument" match arm would still call render_impl
hoping to get those old document updates to complete before replacing the document. Perhaps, we could relax the assertion to not fire if we reach the render_impl
only to get those texture cache tasks resolved (as opposed to actually be drawn on screen)?
Reporter | ||
Comment 33•5 years ago
|
||
Seen on Socorro. Added by comment 30:
bp-b63e0b8f-e4ca-4f9f-bbc5-99bb70190502
Cleared texture cache without sending new document frame.
Reporter | ||
Comment 34•5 years ago
•
|
||
Could it be related to bug 1547911?
-
[GeForce GTX 1050 Ti Mobile] (0x1c8c) only has crashes of this bug and bug 1547911.
Comment 35•5 years ago
|
||
(In reply to Dzmitry Malyshau [:kvark] from comment #32)
I also doubt we can see any reasonable amount of crashes from Ctrl+Shift+3, given that it's only used by us and :darkspirit.
Looking at the fresh
documents_seen
code, it appears to me that we'd hit the assertion if we had any outstanding texture cache tasks for an old document, and then we cleared the texture cache and sent a new document. The "PublishDocument" match arm would still callrender_impl
hoping to get those old document updates to complete before replacing the document. Perhaps, we could relax the assertion to not fire if we reach therender_impl
only to get those texture cache tasks resolved (as opposed to actually be drawn on screen)?
Hmm, I'm not sure we're on the same page. Is this what you're referring to:
- In
Renderer::update()
:- Process a
ResultMsg::PublishDocument
with some texture cache tasks
- This inserts into
documents_seen
- Process a
ResultMsg::PublishDocument
with texture cache updates containing theclears_shared_cache
flag
- This triggers
render_impl
for the document from thePublishDocument
call above, which callsupdate_texture_cache
with the updates from i., and clearsdocuments_seen
- After the
render_impl
call, we add ourselves back todocuments_seen
and set thetexture_cache_cleared
flag
- Process a
I don't see any problem with the above sequence of events - the assertion should not fail there or in a subsequent render
call. Could you clarify what sequence of events you mean if it's not that?
Reporter | ||
Comment 36•5 years ago
•
|
||
Win10, GTX1060
I can reproduce, but -so far- only with gfx.webrender.split-render-roots;true:
Play the video on https://bugzilla.mozilla.org/show_bug.cgi?id=1548247 and go fullscreen.
bp-7fa21db6-ba9f-4199-bc8b-247600190502
Edit:
Debian Testing, KDE, X11, Macbook Pro: bp-3fe21d28-8156-4de4-8209-0cc180190502
Comment 37•5 years ago
|
||
Interesting - I'm on Win10 GTX960 and can't reproduce with that video. Are you able to reproduce 100% of the time?
It's strange to me that this would be specific to a particular graphics card. The code paths related to this don't seem like they would change much on different hardware - the only exception to that is this one, but I know the GTX960 supports ARB_robustness and I imagine the GTX1060 does as well?
Reporter | ||
Comment 38•5 years ago
•
|
||
(In reply to Doug Thayer [:dthayer] from comment #37)
Are you able to reproduce 100% of the time?
No, but almost. Could it be related to fade-in/out of video controls? Sometimes I move my mouse to unhide the progress bar.
Further observation: "mozilla.org is now full screen" doesn't show up at the top when going fullscreen with document splitting.
Reporter | ||
Updated•5 years ago
|
Comment 40•5 years ago
|
||
Ah - I hit it! Great! I'll debug this tomorrow and see if it could shed any light on the non-doc-splitting case.
Comment 41•5 years ago
|
||
Sounds like you are in process of solving this. Please reach back if getting stuck.
Comment 42•5 years ago
|
||
If a document has an empty rect for whatever reason (in the case observed, this was
due to running a full screen video in the content document, which occludes the
chrome document), then we still want to ensure that we build a frame if the
texture cache has performed a clear, as otherwise we may try to access stale
items from the texture resolver in render_impl.
Comment 43•5 years ago
|
||
Pushed by dothayer@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/ee580278f872 Respect texture_cache rebuild reqs even if doc has no pixels r=kvark
Comment 44•5 years ago
|
||
bugherder |
Reporter | ||
Comment 45•5 years ago
|
||
One person got a new crash of the original signature:
bp-77e3ec60-5cc8-48c7-bc0e-048bf0190508 [@ core::option::expect_failed | webrender::renderer::TextureResolver::bind ]
no entry found for key
I might have run into the same yesterday (I think by switching tabs?):
bp-5730ecfc-069d-4df7-942d-d0ffd0190507 [@ GeckoCrash ]
no entry found for key
Comment 46•5 years ago
|
||
I've hit that on two machines on builds since the most recent patch, but in both cases I was running with document splitting enabled. Jan, were you also running with document splitting? It's a bit easier to see how we might hit this with document splitting enabled, as it could just be a resource being attributed to the incorrect document. Still needs to be fixed, but it's certainly less urgent.
Reporter | ||
Comment 47•5 years ago
|
||
(In reply to Doug Thayer [:dthayer] from comment #46)
Jan, were you also running with document splitting?
Yes. :)
Reporter | ||
Comment 48•5 years ago
•
|
||
bp-ca665ce1-566a-4c6e-9c65-4fe4f0190527 (using llvmpipe on Debian Testing)
Just had this picture enlarged: https://twitter.com/einstueckkunst/status/1133074885754986502
Then, suddenly, website content became empty/white. It came back after switching to another tab and back.
Updated•5 years ago
|
Comment 49•5 years ago
|
||
Unassigning from me. I think (correct me if wrong) we talked about this in Whistler; this seems like a bug that will be less efficient for me to tackle at this point. However see comment 31 for some (apparently incomplete) analysis on where this could be coming from.
Reporter | ||
Comment 50•5 years ago
•
|
||
bp-188b3580-1588-4dc9-8bb3-89b0e0190706
Cleared texture cache without sending new document frame.
(Assert was added by comment 30.)
Document splitting is enabled. 13 app tabs. Opened 7 YouTube tabs by Ctrl+Click. Started to play one video. GPU process crash, window became transparent. (I thought I accidentally minimized Nightly, so I clicked on its icon, but nothing happend (because I minimized a transparent window), and when I clicked again, Nightly reopened, but fortunately with window content.)
Updated•5 years ago
|
Comment 52•5 years ago
|
||
Assigning to myself to monitor this, hopefully document splitting should be stable enough to start dogfooding it again.
Updated•5 years ago
|
Comment 53•4 years ago
|
||
The leave-open keyword is there and there is no activity for 6 months.
:jbonisteel, maybe it's time to close this bug?
Reporter | ||
Comment 54•4 years ago
|
||
Obsolete by bug 1622360.
Description
•