Closed Bug 1538540 Opened 1 year ago Closed 3 months ago

Crash in [@ core::option::expect_failed | webrender::renderer::TextureResolver::bind]

Categories

(Core :: Graphics: WebRender, defect, P2)

Unspecified
All
defect

Tracking

()

RESOLVED INCOMPLETE
Tracking Status
firefox68 --- disabled

People

(Reporter: darkspirit, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: crash, nightly-community, regression, Whiteboard: [fxperf:p1])

Crash Data

Attachments

(4 files)

Happens on Twitter if gfx.webrender.split-render-roots is true.

This bug is for crash report bp-b5ec86b8-cc89-46c9-b65e-ee4ca0190324.

Top 10 frames of crashing thread:

0 libxul.so MOZ_Crash mfbt/Assertions.h:314
1 libxul.so GeckoCrash toolkit/xre/nsAppRunner.cpp:5093
2 libxul.so gkrust_shared::panic_hook toolkit/library/rust/shared/lib.rs:237
3 libxul.so core::ops::function::Fn::call src/libcore/ops/function.rs:78
4 libxul.so std::panicking::rust_panic_with_hook src/libstd/panicking.rs:495
5 libxul.so std::panicking::continue_panic_fmt src/libstd/panicking.rs:398
6 libxul.so rust_begin_unwind src/libstd/panicking.rs:325
7 libxul.so core::panicking::panic_fmt src/libcore/panicking.rs:95
8 libxul.so core::option::expect_failed src/libcore/option.rs:1008
9 libxul.so webrender::renderer::TextureResolver::bind src/libcore/option.rs:322

Hello Doug - is this related to document splitting?

Flags: needinfo?(dothayer)
Priority: -- → P2

Yes, this presents itself with document splitting on and the gfx.webrender.debug.texture-cache.disable-shrink pref off. I've created Bug 1538710 to track the work of no longer depending on this pref.

Flags: needinfo?(dothayer)
Depends on: 1538710
Duplicate of this bug: 1538944

The TextureResolver::bind signature is currently the #1 GPU process crash for WR-enabled nightly builds. crash-stats search results. There's ~127 crashes in the last 7 days. It seems unlikely that this is only triggered when document-splitting is enabled.

The crash signature was very low-volume until March 23, when it suddenly spiked. That's the same time the document-splitting stuff landed. So I'm fairly sure the spike in crashes is a result of those patches. Doug, do you have cycles to prioritize this, since it seems to affect users even in the default configuration of document-splitting disabled?

Flags: needinfo?(dothayer)

Yeah. I'm going to prioritize landing the work for bug 1538710 and see how that affects these numbers. Then if they're still present (likely based on comment 5), I'll dig into what else might be causing this.

Flags: needinfo?(dothayer)

Looks like this is still happening. Here's a crash from today's nightly: https://crash-stats.mozilla.com/report/index/147b0aab-757b-43b8-acaa-aef9d0190417

Assignee: nobody → dothayer
Status: NEW → ASSIGNED
Whiteboard: [fxperf:p1]

I haven't found any smoking gun yet. However, here is a summary of my spelunking so far:

  • Something is missing from the TextureCache's texture_cache_map

  • This means one of two possibilities:

    • The item has not been added yet, and yet we render with a frame that holds the key. This should not be possible, as in order to be added to the frame the item must have been queued to be added to texture_cache_map via TextureCacheAllocationKind::Alloc
    • The item has been removed, and we render with a frame that has not been updated. This is likely what's going on, and it's what was going on in bug 1538710. However, that was specific to multiple documents, where one document's frame build would trigger a clearing of cached resources, and the other document's frame would not be built, leading to a stale frame when it came time to render
  • Progressing from the assumption that we are removing the relevant texture_cache_map item, triggering a render, and yet not building a frame, we need to examine how we remove items from texture_cache_map.

  • The item is actually removed here

  • To hit this, we must go through TextureUpdateList::push_free

    • This is called from TextureCache::free for a standalone entry, and from TextureCache::clear_shared

      • TextureCache::free is called from three places:

        • TextureCache::expire_old_entries - this is ruled out as it is only called during a frame build
        • TextureCache::upsert_entry - this is ruled out for the same reason
        • TextureCache::clear_all - this can be called independently outside of a frame build
      • TextureCache::clear_shared is called from two places:

        • TextureCache::maybe_reclaim_shared_memory - this was the cause of bug 1538710, in which it was moved to TextureCache::before_frames, where it forces us to build a frame by setting require_frame_build to true
        • TextureCache::clear_all - (as stated above) this can be called independently outside of a frame build
  • This leaves us with the conclusion that either we've missed something, or we have to call TextureCache::clear_all without doing a frame build, but still trigger a render.

  • So, going down that rabbit hole, there's two places where we can trigger TextureCache::clear_all

  • So, assuming NotifyMemoryPressure is the culprit, because it's a possible way of clearing items from texture_cache_map which isn't directly tied with a frame build, the question becomes whether we can then trigger a render while continuing to avoid the frame build.

  • Before the fixes for bug 1538710, this was relatively easy - if we called TextureCache::clear_all from a memory pressure event, we could have the main document with frame_is_valid equal to true, causing render_frame to be true and build_frame to be false inside RenderBackend::update_document, which would lead us to call wr_notifier_new_frame_ready, without having built an actual frame (?). After bug 1538710, it seems trickier, though there is this strange comment referencing what seems to be a similar problem in the handling of ResultMsg::UpdateResources, though that code seems to be dead. EDIT: ResultMsg::UpdateResources is indeed used, and seems to be used correctly. I just missed it because it only shows up in searchfox for text searches. I think my next step is to do some archaeology on that code and see if anything pops out. However, I still don't have any hunches as to how the document splitting patches would have affected any of this.

This is a bit of a stab in the dark, but I wanted to rule it out
as a potential cause. See comment in the code for some justification.

Marking as leave-open while we see if the above fix actually fixes the issue. (Even if it does we'll probably want to land a follow-up patch to clean it up)

Keywords: leave-open
Pushed by dothayer@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/6cbc4908bc4b
Speculatively revert SetDocumentView usage r=kats

Hmm, not resolved, it seems. Back to the search...

Kats, I'm trying to make sense of why this has leveled off at 0 with a little blip on the 042209 build. I would chalk it up to my patch if it weren't for that blip. Do you think I should keep scratching my head at this or wait and see if it stays at 0?

Flags: needinfo?(kats)

There are likely multiple root causes for this crash; as I mentioned in comment 5 it was pre-existing, but spiked with the document splitting stuff. So if it's gone now except for a minor blip here and there I would say that the spike caused by document splitting has been resolved, and it's back to the baseline pre-existing crashes.

That being said, according to the little crash data table on this bug it looks like the 20190418221600 build was the first one that fixed the problem (if 20190419094745 was another "blip") which points to something before this patch landed. If that's true, the fix window would be https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=02b89c29412b6c1444fe32a4847e5261e2bb3d00&tochange=2ccc6648064315964dd23039ad28ebf7d9f82999 which contains bug 1536487 but I'm not sure how that might have fixed this.

It might be worth backing out the fix you landed to see if the crashes come back, to verify that it was your patch that fixed it. If so, then we can work on a more proper fix. If the crashes don't come back, then we can probably assume it was fixed by something else.

Flags: needinfo?(kats)

FYI, the crashes I reported in Bug 1538944 are no longer occurring.

Are you able to bisect using mozregression to find out exactly what fixed it?

Flags: needinfo?(yoasif)

(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #17)

Are you able to bisect using mozregression to find out exactly what fixed it?

(Since bug 1538944 is specifically with document splitting enabled I'm nearly certain that's bug 1538710. I'm not sure that's worth bisecting given what we're trying to diagnose is non-document-splitting occurrences of this crash.)

I can bisect if you like, but my original bug was duped to this one -- should it have been duped to bug 1538710 instead?

Flags: needinfo?(yoasif)

The dupe is correct. Comment 0 was fixed by bug 1538710.

Yeah I guess we made a bit of a mess in repurposing this bug for the other crashes (which share the same signature).

Bisected. Bug 1538944 was fixed by Bug 1538710.

Thanks for the confirmation!

In trying to diagnose bug 1538540, I'm hitting my limits as far as
simply staring at the code and trying to work out possible ways to
hit the crash goes. This assertion will split the search space into
clear-related causes and non-clear-related causes to narrow things
down.

Pushed by dothayer@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/91859883a54d
Sanity check frames after TextureCache clears r=bholley

Well, this actually looks like what we were hoping to catch, I think, and without having to make it to Nightly. So tentatively, hooray!

Flags: needinfo?(dothayer)
Attachment #9061768 - Attachment description: Bug 1538540 - Sanity check frames after TextureCache clears r?bholley → Bug 1538540 - Sanity check frames after TextureCache clears r=bholley
Attachment #9061768 - Attachment description: Bug 1538540 - Sanity check frames after TextureCache clears r=bholley → Bug 1538540 - Sanity check frames after TextureCache clears r?bholley
Attachment #9061768 - Attachment description: Bug 1538540 - Sanity check frames after TextureCache clears r?bholley → Bug 1538540 - Sanity check frames after TextureCache clears r=bholley

Eh, unfortunately it was a red herring - I needed to move a line down a few lines to not split the render_impl call. So this will in fact have to make it to Nightly to give useful information.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=f8dbda0ea7c74466ef00188e03e8bfa0c208de69

Pushed by dothayer@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/6462a738b3c5
Sanity check frames after TextureCache clears r=bholley

So, we are seeing crashes at the assertion I added. The document splitting pref is off in the crash, so I can see no possible way that the document being checked at the crash site is not the main, default document.

I can only see one way for this to happen, and it sounds really far-fetched, but it's all I've got right now.

  1. Somehow TextureCache::clear_shared is called. Let's say it's through DebugCommand::ClearCaches. I don't know if this is feasible or not, but we could also call it through TextureCache::before_frames, though we'd need some explanation for how we didn't then build a frame.
  2. A ResultMsg::UpdateResources is sent without memory_pressure set to true. It seems this can only happen by the user initiating a capture via Ctrl+Shift+3.
  3. We call Renderer::render_impl without having updated active_documents. This could occur if doc.has_pixels() is false (we would end up calling self.notifier.new_frame_ready, triggering a render), or if there's an entry in active_documents with frame.must_be_drawn() == true.

I think it's unlikely that this is the cause of the crashes we see, but since it's definitely possible, we should at least address it. So I think we need to fix the UpdateResource message in save_capture. Dzmitry, assuming my reasoning checks out to you, do you have any thoughts on how you'd like to see this done? We could set memory_pressure to true, effectively invalidating the active_documents. Or we could send over the frame built earlier in the method. Or something else?

Flags: needinfo?(dmalyshau)

I also doubt we can see any reasonable amount of crashes from Ctrl+Shift+3, given that it's only used by us and :darkspirit.

Looking at the fresh documents_seen code, it appears to me that we'd hit the assertion if we had any outstanding texture cache tasks for an old document, and then we cleared the texture cache and sent a new document. The "PublishDocument" match arm would still call render_impl hoping to get those old document updates to complete before replacing the document. Perhaps, we could relax the assertion to not fire if we reach the render_impl only to get those texture cache tasks resolved (as opposed to actually be drawn on screen)?

Flags: needinfo?(dmalyshau)

Seen on Socorro. Added by comment 30:
bp-b63e0b8f-e4ca-4f9f-bbc5-99bb70190502

Cleared texture cache without sending new document frame.

Crash Signature: [@ core::option::expect_failed | webrender::renderer::TextureResolver::bind] [@ webrender::renderer::SourceTextureResolver::bind ] → [@ core::option::expect_failed | webrender::renderer::TextureResolver::bind] [@ webrender::renderer::SourceTextureResolver::bind ] [@ webrender::renderer::Renderer::render_impl ]

(In reply to Dzmitry Malyshau [:kvark] from comment #32)

I also doubt we can see any reasonable amount of crashes from Ctrl+Shift+3, given that it's only used by us and :darkspirit.

Looking at the fresh documents_seen code, it appears to me that we'd hit the assertion if we had any outstanding texture cache tasks for an old document, and then we cleared the texture cache and sent a new document. The "PublishDocument" match arm would still call render_impl hoping to get those old document updates to complete before replacing the document. Perhaps, we could relax the assertion to not fire if we reach the render_impl only to get those texture cache tasks resolved (as opposed to actually be drawn on screen)?

Hmm, I'm not sure we're on the same page. Is this what you're referring to:

  • In Renderer::update():
    1. Process a ResultMsg::PublishDocument with some texture cache tasks
    • This inserts into documents_seen
    1. Process a ResultMsg::PublishDocument with texture cache updates containing the clears_shared_cache flag
    • This triggers render_impl for the document from the PublishDocument call above, which calls update_texture_cache with the updates from i., and clears documents_seen
    • After the render_impl call, we add ourselves back to documents_seen and set the texture_cache_cleared flag

I don't see any problem with the above sequence of events - the assertion should not fail there or in a subsequent render call. Could you clarify what sequence of events you mean if it's not that?

Flags: needinfo?(dmalyshau)
Attached video 2019-05-02 22-22-31.mp4

Win10, GTX1060
I can reproduce, but -so far- only with gfx.webrender.split-render-roots;true:
Play the video on https://bugzilla.mozilla.org/show_bug.cgi?id=1548247 and go fullscreen.
bp-7fa21db6-ba9f-4199-bc8b-247600190502

Edit:
Debian Testing, KDE, X11, Macbook Pro: bp-3fe21d28-8156-4de4-8209-0cc180190502

Interesting - I'm on Win10 GTX960 and can't reproduce with that video. Are you able to reproduce 100% of the time?

It's strange to me that this would be specific to a particular graphics card. The code paths related to this don't seem like they would change much on different hardware - the only exception to that is this one, but I know the GTX960 supports ARB_robustness and I imagine the GTX1060 does as well?

See Also: → 1548708

(In reply to Doug Thayer [:dthayer] from comment #37)

Are you able to reproduce 100% of the time?

No, but almost. Could it be related to fade-in/out of video controls? Sometimes I move my mouse to unhide the progress bar.
Further observation: "mozilla.org is now full screen" doesn't show up at the top when going fullscreen with document splitting.

Duplicate of this bug: 1548711
Crash Signature: [@ core::option::expect_failed | webrender::renderer::TextureResolver::bind] [@ webrender::renderer::SourceTextureResolver::bind ] [@ webrender::renderer::Renderer::render_impl ] → [@ core::option::expect_failed | webrender::renderer::TextureResolver::bind] [@ webrender::renderer::SourceTextureResolver::bind ] [@ webrender::renderer::Renderer::render_impl ] [@ @0x7f959d36ba8e]

Ah - I hit it! Great! I'll debug this tomorrow and see if it could shed any light on the non-doc-splitting case.

Sounds like you are in process of solving this. Please reach back if getting stuck.

Flags: needinfo?(dmalyshau)

If a document has an empty rect for whatever reason (in the case observed, this was
due to running a full screen video in the content document, which occludes the
chrome document), then we still want to ensure that we build a frame if the
texture cache has performed a clear, as otherwise we may try to access stale
items from the texture resolver in render_impl.

Pushed by dothayer@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/ee580278f872
Respect texture_cache rebuild reqs even if doc has no pixels r=kvark

One person got a new crash of the original signature:
bp-77e3ec60-5cc8-48c7-bc0e-048bf0190508 [@ core::option::expect_failed | webrender::renderer::TextureResolver::bind ]

no entry found for key

I might have run into the same yesterday (I think by switching tabs?):
bp-5730ecfc-069d-4df7-942d-d0ffd0190507 [@ GeckoCrash ]

no entry found for key

I've hit that on two machines on builds since the most recent patch, but in both cases I was running with document splitting enabled. Jan, were you also running with document splitting? It's a bit easier to see how we might hit this with document splitting enabled, as it could just be a resource being attributed to the incorrect document. Still needs to be fixed, but it's certainly less urgent.

(In reply to Doug Thayer [:dthayer] from comment #46)

Jan, were you also running with document splitting?

Yes. :)

bp-ca665ce1-566a-4c6e-9c65-4fe4f0190527 (using llvmpipe on Debian Testing)
Just had this picture enlarged: https://twitter.com/einstueckkunst/status/1133074885754986502
Then, suddenly, website content became empty/white. It came back after switching to another tab and back.

See Also: → 1557491

Unassigning from me. I think (correct me if wrong) we talked about this in Whistler; this seems like a bug that will be less efficient for me to tackle at this point. However see comment 31 for some (apparently incomplete) analysis on where this could be coming from.

Assignee: dothayer → nobody
Status: ASSIGNED → NEW

bp-188b3580-1588-4dc9-8bb3-89b0e0190706

Cleared texture cache without sending new document frame.

(Assert was added by comment 30.)
Document splitting is enabled. 13 app tabs. Opened 7 YouTube tabs by Ctrl+Click. Started to play one video. GPU process crash, window became transparent. (I thought I accidentally minimized Nightly, so I clicked on its icon, but nothing happend (because I minimized a transparent window), and when I clicked again, Nightly reopened, but fortunately with window content.)

Duplicate of this bug: 1565323
Crash Signature: [@ core::option::expect_failed | webrender::renderer::TextureResolver::bind] [@ webrender::renderer::SourceTextureResolver::bind ] [@ webrender::renderer::Renderer::render_impl ] [@ @0x7f959d36ba8e] → [@ core::option::expect_failed | webrender::renderer::TextureResolver::bind] [@ webrender::renderer::SourceTextureResolver::bind ] [@ webrender::renderer::Renderer::render_impl ] [@ @0x7f959d36ba8e] [@ @0x7fcfafc7ed6e]

Assigning to myself to monitor this, hopefully document splitting should be stable enough to start dogfooding it again.

Assignee: nobody → a.beingessner
Crash Signature: [@ core::option::expect_failed | webrender::renderer::TextureResolver::bind] [@ webrender::renderer::SourceTextureResolver::bind ] [@ webrender::renderer::Renderer::render_impl ] [@ @0x7f959d36ba8e] [@ @0x7fcfafc7ed6e] → [@ core::option::expect_failed | webrender::renderer::TextureResolver::bind] [@ webrender::renderer::SourceTextureResolver::bind ] [@ webrender::renderer::Renderer::render_impl ] [@ @0x7f959d36ba8e] [@ @0x7fcfafc7ed6e]
Assignee: a.beingessner → nobody

The leave-open keyword is there and there is no activity for 6 months.
:jbonisteel, maybe it's time to close this bug?

Flags: needinfo?(jbonisteel)

Obsolete by bug 1622360.

Status: NEW → RESOLVED
Closed: 3 months ago
Flags: needinfo?(jbonisteel)
Keywords: leave-open
Resolution: --- → INCOMPLETE
You need to log in before you can comment on or make changes to this bug.