Closed Bug 1624881 Opened 5 years ago Closed 5 years ago

Crash in [@ webrender::spatial_tree::SpatialTree::get_relative_transform]

Categories

(Core :: Graphics: WebRender, defect, P1)

Unspecified
Windows 10
defect

Tracking

()

VERIFIED FIXED
mozilla76
Tracking Status
firefox-esr68 --- unaffected
firefox74 --- unaffected
firefox75 --- unaffected
firefox76 + verified

People

(Reporter: RyanVM, Assigned: kvark)

References

(Blocks 1 open bug)

Details

(Keywords: crash, regression)

Crash Data

Attachments

(1 file)

This is the #2 overall top GPU crash on Nightly at the moment. Any thoughts, Miko?

Top 10 frames of crashing thread:

0 xul.dll RustMozCrash mozglue/static/rust/wrappers.cpp:16
1 xul.dll mozglue_static::panic_hook mozglue/static/rust/lib.rs:89
2 xul.dll core::ops::function::Fn::call<fn src/libcore/ops/function.rs:232
3 xul.dll std::panicking::rust_panic_with_hook src/libstd/panicking.rs:475
4 xul.dll std::panicking::begin_panic<str*> src/libstd/panicking.rs:404
5 xul.dll webrender::spatial_tree::SpatialTree::get_relative_transform gfx/wr/webrender/src/spatial_tree.rs:268
6 xul.dll webrender::prim_store::SpaceMapper<webrender_api::units::LayoutPixel, webrender_api::units::PicturePixel>::set_target_spatial_node<webrender_api::units::LayoutPixel, webrender_api::units::PicturePixel> gfx/wr/webrender/src/prim_store/mod.rs:286
7 xul.dll webrender::picture::PicturePrimitive::post_update gfx/wr/webrender/src/picture.rs:5766
8 xul.dll webrender::picture::PictureUpdateState::update gfx/wr/webrender/src/picture.rs:3674
9 xul.dll webrender::picture::PictureUpdateState::update gfx/wr/webrender/src/picture.rs:3662

Flags: needinfo?(mikokm)
Blocks: wr-stability

(In reply to Ryan VanderMeulen [:RyanVM] from comment #0)

This is the #2 overall top GPU crash on Nightly at the moment. Any thoughts, Miko?

This seems to be an older crash that started with build id 20200130095303. It seems that bug 1611948 touched this code in build id 20200129213721 (the previous day). This replaced the older signature of webrender::clip_scroll_tree::ClipScrollTree::get_relative_transform1 (bug 1507323, bug 1509347).

Flags: needinfo?(mikokm)
Blocks: wr-76
Priority: -- → P1

Nical, can you take a look?

Flags: needinfo?(nical.bugzilla)

I know very little of this code.

The crash has been happening on the following urls among others:

Simply visiting these sites and scrolling around didn't let me reproduce the crash it could be that login is required to run into the issue.
It looks like the users having that crash reproduced it several times on the same page.

On crashstats there's only been Windows crashes (238) which is a bit surprising by the look of the signature but worth noting.

The assert we hit is https://hg.mozilla.org/mozilla-central/annotate/c72482e89c472d7d194bd97fec6e57549a79de90/gfx/wr/webrender/src/spatial_tree.rs#l268 introduced in bug 1524797. I guess we have this assert because we expect content to be only mapped to coordinate spaces of a parent stacking context. There were a lot of picture caching changes since the assertion was introduced. Perhaps this assumption doesn't hold true anymore when mapping to picture cache surfaces?

Flags: needinfo?(nical.bugzilla)
Flags: needinfo?(gwatson)
Flags: needinfo?(dmalyshau)

Bugbug thinks this bug is a regression, but please revert this change in case of error.

Keywords: regression

Yeah, something is calling that function on two nodes that aren't related :/ I'll look into it.

Assignee: nobody → dmalyshau
Status: NEW → ASSIGNED
Flags: needinfo?(dmalyshau)

I'm unable to repro this (on Linux with picture caching enabled)

Miko, I wonder if something gets wonky with the DL interning. Did you have to do something about the spatial nodes? If you replace some of the nodes in an incremental DL update, this would ruin the assumption of get_relative_transform. That would also explain to some extent why we can't reproduce it: there would be a sequence of actions needed in order for the updated spatial nodes to be provided by Gecko. Does it ring any bells?

Flags: needinfo?(mikokm)

That does sound plausible - one other regression we've had with the same assert in that function was related to RDL, I believe.

Flags: needinfo?(gwatson)

(In reply to Dzmitry Malyshau [:kvark] from comment #7)

Miko, I wonder if something gets wonky with the DL interning. Did you have to do something about the spatial nodes? If you replace some of the nodes in an incremental DL update, this would ruin the assumption of get_relative_transform. That would also explain to some extent why we can't reproduce it: there would be a sequence of actions needed in order for the updated spatial nodes to be provided by Gecko. Does it ring any bells?

It is definitely possible that some of these crashes are caused by that. However, the crash volume has been steady for months before any DL interning code landed: see comment 1 and comment 3.

Flags: needinfo?(mikokm)

We had this assert firing before for sure, and we fixed a few cases that are linked here. Is there a known crash happening after the fixes and before the DL interning?

FWIW, all the previous firings where a matter of loading a particular page. This one is more elusive, hence the suspicion on the DL interning.

(In reply to Dzmitry Malyshau [:kvark] from comment #10)

We had this assert firing before for sure, and we fixed a few cases that are linked here. Is there a known crash happening after the fixes and before the DL interning?

During the last six months, we have had 2431 reported crashes with these two signatures. Out of these, 347 have happened after WR DL interning was enabled, while the crash rate has remained steady 20-40 per day.

(In reply to Dzmitry Malyshau [:kvark] from comment #11)

FWIW, all the previous firings where a matter of loading a particular page. This one is more elusive, hence the suspicion on the DL interning.

This is most likely a problem related to GPU process crash/recovery, from correlations:

(60.00% in signature vs 01.29% overall) GFX_ERROR "Receive IPC close with reason=AbnormalShutdown" = true [66.67% vs 40.00% if process_type = gpu]

Miko,

During the last six months, we have had 2431 reported crashes with these two signatures. Out of these, 347 have happened after WR DL interning was enabled, while the crash rate has remained steady 20-40 per day.

Thank you for looking into this! Sounds like a non-interning problem them, I'll look more into it.

This is most likely a problem related to GPU process crash/recovery, from correlations:

Isn't this the other way around? If we hit the assert, we end up with a GPU process crash, hence the correlation.

we expect that the children spatial node goes after the parent.
If it's not the case, we can still fall back to a full world transform,
but it's not going to be correct with regards to flattening.
As we don't know the circumstances and are unable to reproduce the issue,
making this fallback could be a reasonable thing to do for now.

Pushed by dmalyshau@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/085e6b728abb Work around the relative transform failed expectations r=gw
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla76
Flags: qe-verify+

I couldn't reproduce this with the info I got from this bug and turning on WR on a Windows 10 machine on old builds previous to the fix. In Mozilla Crash Reports I can only see two crashes in 77.0a1 but none in 76 beta. Is it safe to say that 76 is fixed and close this bug based on the stats?

Flags: needinfo?(dmalyshau)

Yes, it's safe to say the bug is fixed. I do want to follow-up with a proper fix though, because what we submitted is more of a workaround.

Flags: needinfo?(dmalyshau)

(In reply to Dzmitry Malyshau [:kvark] from comment #18)

Yes, it's safe to say the bug is fixed. I do want to follow-up with a proper fix though, because what we submitted is more of a workaround.

Thanks. I'll go ahead and mark this as verified fixed in this case.

Status: RESOLVED → VERIFIED
Flags: qe-verify+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: