Bugzilla

Comment 2

•

5 years ago

Yes, this presents itself with document splitting on and the gfx.webrender.debug.texture-cache.disable-shrink pref off. I've created Bug 1538710 to track the work of no longer depending on this pref.

Flags: needinfo?(dothayer)

Updated

•

5 years ago

Depends on: 1538710

Comment 4

•

5 years ago

The TextureResolver::bind signature is currently the #1 GPU process crash for WR-enabled nightly builds. crash-stats search results. There's ~127 crashes in the last 7 days. It seems unlikely that this is only triggered when document-splitting is enabled.

Comment 5

•

5 years ago

The crash signature was very low-volume until March 23, when it suddenly spiked. That's the same time the document-splitting stuff landed. So I'm fairly sure the spike in crashes is a result of those patches. Doug, do you have cycles to prioritize this, since it seems to affect users even in the default configuration of document-splitting disabled?

Flags: needinfo?(dothayer)

Comment 6

•

5 years ago

Yeah. I'm going to prioritize landing the work for bug 1538710 and see how that affects these numbers. Then if they're still present (likely based on comment 5), I'll dig into what else might be causing this.

Flags: needinfo?(dothayer)

Comment 7

•

5 years ago

Looks like this is still happening. Here's a crash from today's nightly: https://crash-stats.mozilla.com/report/index/147b0aab-757b-43b8-acaa-aef9d0190417

Updated

•

5 years ago

Assignee: nobody → dothayer

Status: NEW → ASSIGNED

Whiteboard: [fxperf:p1]

Comment 8

•

5 years ago

•

Edited

I haven't found any smoking gun yet. However, here is a summary of my spelunking so far:

Something is missing from the TextureCache's texture_cache_map
This means one of two possibilities:
- The item has not been added yet, and yet we render with a frame that holds the key. This should not be possible, as in order to be added to the frame the item must have been queued to be added to texture_cache_map via TextureCacheAllocationKind::Alloc
- The item has been removed, and we render with a frame that has not been updated. This is likely what's going on, and it's what was going on in bug 1538710. However, that was specific to multiple documents, where one document's frame build would trigger a clearing of cached resources, and the other document's frame would not be built, leading to a stale frame when it came time to render
Progressing from the assumption that we are removing the relevant texture_cache_map item, triggering a render, and yet not building a frame, we need to examine how we remove items from texture_cache_map.
The item is actually removed here
To hit this, we must go through TextureUpdateList::push_free
- This is called from TextureCache::free for a standalone entry, and from TextureCache::clear_shared
  - TextureCache::free is called from three places:
    - TextureCache::expire_old_entries - this is ruled out as it is only called during a frame build
    - TextureCache::upsert_entry - this is ruled out for the same reason
    - TextureCache::clear_all - this can be called independently outside of a frame build
  - TextureCache::clear_shared is called from two places:
    - TextureCache::maybe_reclaim_shared_memory - this was the cause of bug 1538710, in which it was moved to TextureCache::before_frames, where it forces us to build a frame by setting require_frame_build to true
    - TextureCache::clear_all - (as stated above) this can be called independently outside of a frame build
This leaves us with the conclusion that either we've missed something, or we have to call TextureCache::clear_all without doing a frame build, but still trigger a render.
So, going down that rabbit hole, there's two places where we can trigger TextureCache::clear_all
- Through NotifyWebRenderContextPurge, which occurs immediately after a render call, meaning we would have to render again in order to hit it.
- Through NotifyMemoryPressure, which seems to be triggerable any time(?)
So, assuming NotifyMemoryPressure is the culprit, because it's a possible way of clearing items from texture_cache_map which isn't directly tied with a frame build, the question becomes whether we can then trigger a render while continuing to avoid the frame build.
Before the fixes for bug 1538710, this was relatively easy - if we called TextureCache::clear_all from a memory pressure event, we could have the main document with frame_is_valid equal to true, causing render_frame to be true and build_frame to be false inside RenderBackend::update_document, which would lead us to call wr_notifier_new_frame_ready, without having built an actual frame (?). After bug 1538710, it seems trickier, though there is this strange comment referencing what seems to be a similar problem in the handling of ResultMsg::UpdateResources, though that code seems to be dead. EDIT: ResultMsg::UpdateResources is indeed used, and seems to be used correctly. I just missed it because it only shows up in searchfox for text searches. I think my next step is to do some archaeology on that code and see if anything pops out. However, I still don't have any hunches as to how the document splitting patches would have affected any of this.

Comment 9

•

5 years ago

Attached file Bug 1538540 - Speculatively revert SetDocumentView usage r?kats — Details

This is a bit of a stab in the dark, but I wanted to rule it out
as a potential cause. See comment in the code for some justification.

Comment 10

•

5 years ago

Marking as leave-open while we see if the above fix actually fixes the issue. (Even if it does we'll probably want to land a follow-up patch to clean it up)

Keywords: leave-open

Stefan Hindli [:stefan_hindli]

Comment 11

•

5 years ago

Pushed by dothayer@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/6cbc4908bc4b
Speculatively revert SetDocumentView usage r=kats

Comment 12

•

5 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/6cbc4908bc4b

Comment 13

•

5 years ago

Hmm, not resolved, it seems. Back to the search...

Comment 14

•

5 years ago

Kats, I'm trying to make sense of why this has leveled off at 0 with a little blip on the 042209 build. I would chalk it up to my patch if it weren't for that blip. Do you think I should keep scratching my head at this or wait and see if it stays at 0?

Flags: needinfo?(kats)

Comment 15

•

5 years ago

There are likely multiple root causes for this crash; as I mentioned in comment 5 it was pre-existing, but spiked with the document splitting stuff. So if it's gone now except for a minor blip here and there I would say that the spike caused by document splitting has been resolved, and it's back to the baseline pre-existing crashes.

That being said, according to the little crash data table on this bug it looks like the 20190418221600 build was the first one that fixed the problem (if 20190419094745 was another "blip") which points to something before this patch landed. If that's true, the fix window would be https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=02b89c29412b6c1444fe32a4847e5261e2bb3d00&tochange=2ccc6648064315964dd23039ad28ebf7d9f82999 which contains bug 1536487 but I'm not sure how that might have fixed this.

It might be worth backing out the fix you landed to see if the crashes come back, to verify that it was your patch that fixed it. If so, then we can work on a more proper fix. If the crashes don't come back, then we can probably assume it was fixed by something else.

Flags: needinfo?(kats)

Asif Youssuff

Comment 16

•

5 years ago

FYI, the crashes I reported in Bug 1538944 are no longer occurring.

Comment 17

•

5 years ago

Are you able to bisect using mozregression to find out exactly what fixed it?

Flags: needinfo?(yoasif)

Comment 18

•

5 years ago

(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #17)

Are you able to bisect using mozregression to find out exactly what fixed it?

(Since bug 1538944 is specifically with document splitting enabled I'm nearly certain that's bug 1538710. I'm not sure that's worth bisecting given what we're trying to diagnose is non-document-splitting occurrences of this crash.)

Asif Youssuff

Comment 19

•

5 years ago

I can bisect if you like, but my original bug was duped to this one -- should it have been duped to bug 1538710 instead?

Flags: needinfo?(yoasif)

Reporter

Comment 20

•

5 years ago

The dupe is correct. Comment 0 was fixed by bug 1538710.

Comment 21

•

5 years ago

Yeah I guess we made a bit of a mess in repurposing this bug for the other crashes (which share the same signature).

Asif Youssuff

Comment 22

•

5 years ago

Bisected. Bug 1538944 was fixed by Bug 1538710.

Comment 23

•

5 years ago

Thanks for the confirmation!

Comment 24

•

5 years ago

Attached file Bug 1538540 - Sanity check frames after TextureCache clears r=bholley — Details

In trying to diagnose bug 1538540, I'm hitting my limits as far as
simply staring at the code and trying to work out possible ways to
hit the crash goes. This assertion will split the search space into
clear-related causes and non-clear-related causes to narrow things
down.

failures: https://treeherder.mozilla.org/#/jobs?repo=autoland&resultStatus=testfailed%2Cbusted%2Cexception&classifiedState=unclassified&selectedJob=243864738&searchStr=linux%2Cx64%2Cquantumrender%2Cdebug%2Cmochitests%2Ctest-linux64-qr%2Fdebug-mochitest-e10s-8%2Cm%288%29

Comment 25

•

5 years ago

Pushed by dothayer@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/91859883a54d
Sanity check frames after TextureCache clears r=bholley

Andrei Ciure[:aciure]

Comment 26

•

5 years ago

Backed out changeset 91859883a54d (bug 1538540) for causing tests crashes with [@ GeckoCrash]

push that caused the backout: https://treeherder.mozilla.org/#/jobs?repo=autoland&resultStatus=testfailed%2Cbusted%2Cexception%2Cretry%2Cusercancel%2Crunning%2Cpending%2Crunnable&revision=91859883a54dbdd7cc5376f379466ceb5e8b4e9b

backout: https://hg.mozilla.org/integration/autoland/rev/ad490f39a317715e59875f6a630b60f345dd0bf6

Flags: needinfo?(dothayer)

Comment 27

•

5 years ago

Well, this actually looks like what we were hoping to catch, I think, and without having to make it to Nightly. So tentatively, hooray!

Flags: needinfo?(dothayer)

Phabricator Automation

Updated

•

5 years ago

Attachment #9061768 - Attachment description: Bug 1538540 - Sanity check frames after TextureCache clears r?bholley → Bug 1538540 - Sanity check frames after TextureCache clears r=bholley

Phabricator Automation

Updated

•

5 years ago

Attachment #9061768 - Attachment description: Bug 1538540 - Sanity check frames after TextureCache clears r=bholley → Bug 1538540 - Sanity check frames after TextureCache clears r?bholley

Phabricator Automation

Updated

•

5 years ago

Attachment #9061768 - Attachment description: Bug 1538540 - Sanity check frames after TextureCache clears r?bholley → Bug 1538540 - Sanity check frames after TextureCache clears r=bholley

https://treeherder.mozilla.org/#/jobs?repo=try&revision=f8dbda0ea7c74466ef00188e03e8bfa0c208de69

Comment 28

•

5 years ago

Eh, unfortunately it was a red herring - I needed to move a line down a few lines to not split the render_impl call. So this will in fact have to make it to Nightly to give useful information.

Cristian Brindusan [:cbrindusan]

Comment 29

•

5 years ago

Pushed by dothayer@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/6462a738b3c5
Sanity check frames after TextureCache clears r=bholley

Comment 30

•

5 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/6462a738b3c5

Dzmitry Malyshau [:kvark]

Comment 31

•

5 years ago

So, we are seeing crashes at the assertion I added. The document splitting pref is off in the crash, so I can see no possible way that the document being checked at the crash site is not the main, default document.

I can only see one way for this to happen, and it sounds really far-fetched, but it's all I've got right now.

Somehow TextureCache::clear_shared is called. Let's say it's through DebugCommand::ClearCaches. I don't know if this is feasible or not, but we could also call it through TextureCache::before_frames, though we'd need some explanation for how we didn't then build a frame.
A ResultMsg::UpdateResources is sent without memory_pressure set to true. It seems this can only happen by the user initiating a capture via Ctrl+Shift+3.
We call Renderer::render_impl without having updated active_documents. This could occur if doc.has_pixels() is false (we would end up calling self.notifier.new_frame_ready, triggering a render), or if there's an entry in active_documents with frame.must_be_drawn() == true.

I think it's unlikely that this is the cause of the crashes we see, but since it's definitely possible, we should at least address it. So I think we need to fix the UpdateResource message in save_capture. Dzmitry, assuming my reasoning checks out to you, do you have any thoughts on how you'd like to see this done? We could set memory_pressure to true, effectively invalidating the active_documents. Or we could send over the frame built earlier in the method. Or something else?

Flags: needinfo?(dmalyshau)

Comment 32

•

5 years ago

I also doubt we can see any reasonable amount of crashes from Ctrl+Shift+3, given that it's only used by us and :darkspirit.

Looking at the fresh documents_seen code, it appears to me that we'd hit the assertion if we had any outstanding texture cache tasks for an old document, and then we cleared the texture cache and sent a new document. The "PublishDocument" match arm would still call render_impl hoping to get those old document updates to complete before replacing the document. Perhaps, we could relax the assertion to not fire if we reach the render_impl only to get those texture cache tasks resolved (as opposed to actually be drawn on screen)?

Flags: needinfo?(dmalyshau)

Reporter

Comment 33

•

5 years ago

Seen on Socorro. Added by comment 30:
bp-b63e0b8f-e4ca-4f9f-bbc5-99bb70190502

Cleared texture cache without sending new document frame.

Crash Signature: [@ core::option::expect_failed | webrender::renderer::TextureResolver::bind] [@ webrender::renderer::SourceTextureResolver::bind ] → [@ core::option::expect_failed | webrender::renderer::TextureResolver::bind] [@ webrender::renderer::SourceTextureResolver::bind ] [@ webrender::renderer::Renderer::render_impl ]

Reporter

Comment 34

•

5 years ago

•

Edited

Could it be related to bug 1547911?

"Updating unknown shared surface"
"Managed to allocate after flush"
[GeForce GTX 1050 Ti Mobile] (0x1c8c) only has crashes of this bug and bug 1547911.

Comment 35

•

5 years ago

(In reply to Dzmitry Malyshau [:kvark] from comment #32)

I also doubt we can see any reasonable amount of crashes from Ctrl+Shift+3, given that it's only used by us and :darkspirit.

Looking at the fresh documents_seen code, it appears to me that we'd hit the assertion if we had any outstanding texture cache tasks for an old document, and then we cleared the texture cache and sent a new document. The "PublishDocument" match arm would still call render_impl hoping to get those old document updates to complete before replacing the document. Perhaps, we could relax the assertion to not fire if we reach the render_impl only to get those texture cache tasks resolved (as opposed to actually be drawn on screen)?

Hmm, I'm not sure we're on the same page. Is this what you're referring to:

In Renderer::update():
1. Process a ResultMsg::PublishDocument with some texture cache tasks
- This inserts into documents_seen
1. Process a ResultMsg::PublishDocument with texture cache updates containing the clears_shared_cache flag
- This triggers render_impl for the document from the PublishDocument call above, which calls update_texture_cache with the updates from i., and clears documents_seen
- After the render_impl call, we add ourselves back to documents_seen and set the texture_cache_cleared flag

I don't see any problem with the above sequence of events - the assertion should not fail there or in a subsequent render call. Could you clarify what sequence of events you mean if it's not that?

Flags: needinfo?(dmalyshau)

Reporter

Comment 36

•

5 years ago

•

Edited

Attached video 2019-05-02 22-22-31.mp4 — Details

Win10, GTX1060
I can reproduce, but -so far- only with gfx.webrender.split-render-roots;true:
Play the video on https://bugzilla.mozilla.org/show_bug.cgi?id=1548247 and go fullscreen.
bp-7fa21db6-ba9f-4199-bc8b-247600190502

Edit:
Debian Testing, KDE, X11, Macbook Pro: bp-3fe21d28-8156-4de4-8209-0cc180190502

Comment 37

•

5 years ago

Interesting - I'm on Win10 GTX960 and can't reproduce with that video. Are you able to reproduce 100% of the time?

It's strange to me that this would be specific to a particular graphics card. The code paths related to this don't seem like they would change much on different hardware - the only exception to that is this one, but I know the GTX960 supports ARB_robustness and I imagine the GTX1060 does as well?

Reporter

Updated

•

5 years ago

Comment 38

•

5 years ago

•

Edited

(In reply to Doug Thayer [:dthayer] from comment #37)

Are you able to reproduce 100% of the time?

No, but almost. Could it be related to fade-in/out of video controls? Sometimes I move my mouse to unhide the progress bar.
Further observation: "mozilla.org is now full screen" doesn't show up at the top when going fullscreen with document splitting.

Reporter

Updated

•

5 years ago

Crash Signature: [@ core::option::expect_failed | webrender::renderer::TextureResolver::bind] [@ webrender::renderer::SourceTextureResolver::bind ] [@ webrender::renderer::Renderer::render_impl ] → [@ core::option::expect_failed | webrender::renderer::TextureResolver::bind] [@ webrender::renderer::SourceTextureResolver::bind ] [@ webrender::renderer::Renderer::render_impl ] [@ @0x7f959d36ba8e]

Dzmitry Malyshau [:kvark]

Comment 40

•

5 years ago

Ah - I hit it! Great! I'll debug this tomorrow and see if it could shed any light on the non-doc-splitting case.

Comment 41

•

5 years ago

Sounds like you are in process of solving this. Please reach back if getting stuck.

Flags: needinfo?(dmalyshau)

Comment 42

•

5 years ago

Attached file Bug 1538540 - Respect texture_cache rebuild reqs even if doc has no pixels r?nical — Details

If a document has an empty rect for whatever reason (in the case observed, this was
due to running a full screen video in the content document, which occludes the
chrome document), then we still want to ensure that we build a frame if the
texture cache has performed a clear, as otherwise we may try to access stale
items from the texture resolver in render_impl.

Cristina Coroiu [:ccoroiu]

Comment 43

•

5 years ago

Pushed by dothayer@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/ee580278f872
Respect texture_cache rebuild reqs even if doc has no pixels r=kvark

Comment 44

•

5 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/ee580278f872

Reporter

Comment 45

•

5 years ago

One person got a new crash of the original signature:
bp-77e3ec60-5cc8-48c7-bc0e-048bf0190508 [@ core::option::expect_failed | webrender::renderer::TextureResolver::bind ]

no entry found for key

I might have run into the same yesterday (I think by switching tabs?):
bp-5730ecfc-069d-4df7-942d-d0ffd0190507 [@ GeckoCrash ]

no entry found for key

Comment 46

•

5 years ago

I've hit that on two machines on builds since the most recent patch, but in both cases I was running with document splitting enabled. Jan, were you also running with document splitting? It's a bit easier to see how we might hit this with document splitting enabled, as it could just be a resource being attributed to the incorrect document. Still needs to be fixed, but it's certainly less urgent.

Reporter

Comment 47

•

5 years ago

(In reply to Doug Thayer [:dthayer] from comment #46)

Jan, were you also running with document splitting?

Yes. :)

Reporter

Comment 48

•

5 years ago

•

Edited

bp-ca665ce1-566a-4c6e-9c65-4fe4f0190527 (using llvmpipe on Debian Testing)
Just had this picture enlarged: https://twitter.com/einstueckkunst/status/1133074885754986502
Then, suddenly, website content became empty/white. It came back after switching to another tab and back.

[:philipp]

Updated

•

5 years ago

status-firefox68: affected → disabled

Reporter

Updated

•

5 years ago

Comment 49

•

5 years ago

Unassigning from me. I think (correct me if wrong) we talked about this in Whistler; this seems like a bug that will be less efficient for me to tackle at this point. However see comment 31 for some (apparently incomplete) analysis on where this could be coming from.

Assignee: dothayer → nobody

Status: ASSIGNED → NEW

bp-188b3580-1588-4dc9-8bb3-89b0e0190706

Reporter

Comment 50

•

5 years ago

•

Edited

Cleared texture cache without sending new document frame.

(Assert was added by comment 30.)
Document splitting is enabled. 13 app tabs. Opened 7 YouTube tabs by Ctrl+Click. Started to play one video. GPU process crash, window became transparent. (I thought I accidentally minimized Nightly, so I clicked on its icon, but nothing happend (because I minimized a transparent window), and when I clicked again, Nightly reopened, but fortunately with window content.)

BugBot [:suhaib / :marco/ :calixte]

Updated

•

5 years ago

Crash Signature: [@ core::option::expect_failed | webrender::renderer::TextureResolver::bind] [@ webrender::renderer::SourceTextureResolver::bind ] [@ webrender::renderer::Renderer::render_impl ] [@ @0x7f959d36ba8e] → [@ core::option::expect_failed | webrender::renderer::TextureResolver::bind] [@ webrender::renderer::SourceTextureResolver::bind ] [@ webrender::renderer::Renderer::render_impl ] [@ @0x7f959d36ba8e] [@ @0x7fcfafc7ed6e]

Aria Beingessner [:Gankra]

Comment 52

•

5 years ago

Assigning to myself to monitor this, hopefully document splitting should be stable enough to start dogfooding it again.

Assignee: nobody → a.beingessner

Crash Signature: [@ core::option::expect_failed | webrender::renderer::TextureResolver::bind] [@ webrender::renderer::SourceTextureResolver::bind ] [@ webrender::renderer::Renderer::render_impl ] [@ @0x7f959d36ba8e] [@ @0x7fcfafc7ed6e] → [@ core::option::expect_failed | webrender::renderer::TextureResolver::bind] [@ webrender::renderer::SourceTextureResolver::bind ] [@ webrender::renderer::Renderer::render_impl ] [@ @0x7f959d36ba8e] [@ @0x7fcfafc7ed6e]

Aria Beingessner [:Gankra]

Updated

•

5 years ago

Assignee: a.beingessner → nobody

BugBot [:suhaib / :marco/ :calixte]

Comment 53

•

4 years ago

The leave-open keyword is there and there is no activity for 6 months.
:jbonisteel, maybe it's time to close this bug?

Flags: needinfo?(jbonisteel)