Closed Bug 1540853 Opened 5 years ago Closed 5 years ago

GPU crash rate is higher with WebRender on 66 experiment

Categories

(Core :: Graphics: WebRender, defect, P2)

RESOLVED FIXED

People

(Reporter: jrmuizel, Assigned: kats)

References

Details

(Whiteboard: [wr-april])

We should look into why and what we can do about it.

Blocks: wr-67

It looks like we don’t gather enough crash reports to see where the additional GPU crashes that we see in the release experiment are coming from.

Also here's the data: https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/94118/dashboard/94226

I'll take a look at this. Jeff provided me with some pointers on how to get crash stacks out of telemetry and possibly extract signatures from them that we can use to compare between WR and non-WR.
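
Roughly, the idea is that each crash ping carries the raw (unsymbolicated) stack of the crashing process, which can be turned into a request for the Mozilla symbolication server. Below is a minimal sketch of that step; the ping field names (stackTraces, modules, crash_info, frames, ip, module_index, base_addr) and the exact payload shape expected by the /symbolicate/v5 endpoint are assumptions on my part, not verified schema.

  # Hedged sketch: build a symbolication request from one crash ping's raw stack.
  # Field names in the ping payload and the request format are assumptions.
  import requests

  SYMBOLICATION_URL = "https://symbolication.services.mozilla.com/symbolicate/v5"

  def build_job(stack_traces):
      """Convert one ping's stackTraces blob into a symbolication job."""
      modules = stack_traces.get("modules", [])
      memory_map = [[m.get("debug_file", ""), m.get("debug_id", "")] for m in modules]

      # Use the crashing thread if the ping identifies one, else thread 0.
      crashing = stack_traces.get("crash_info", {}).get("crashing_thread") or 0
      frames = stack_traces["threads"][crashing]["frames"]

      stack = []
      for f in frames:
          idx = f.get("module_index")
          ip = int(f["ip"], 16)
          if idx is None or idx < 0 or idx >= len(modules):
              stack.append([-1, ip])          # frame outside any known module
          else:
              base = int(modules[idx]["base_addr"], 16)
              stack.append([idx, ip - base])  # module-relative offset
      return {"memoryMap": memory_map, "stacks": [stack]}

  def symbolicate(jobs):
      """POST a batch of jobs to the symbolication server and return the JSON reply."""
      resp = requests.post(SYMBOLICATION_URL, json={"jobs": jobs}, timeout=120)
      resp.raise_for_status()
      return resp.json()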

Assignee: nobody → kats
Whiteboard: [wr-april]

I've successfully managed to get symbolicated stack traces from the enabled/disabled experiment cohorts using the telemetry crash pings. Now I just need to figure out how to consolidate the stack traces into "signatures" of sorts, so that the results are more comparable and actionable.
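
Roughly what I have in mind for the consolidation, as a sketch: keep the first few interesting symbolicated frames and join them with " | " (the format used in the lists below), falling back to module@offset when no function name is available. The skip list and the three-frame cutoff here are placeholders, not the real crash-stats signature rules.

  # Hedged sketch of signature consolidation; skip rules are placeholders.
  SKIP_PREFIXES = (
      "RaiseException",
      "KiFastSystemCallRet",
  )

  def frame_name(frame):
      """Prefer the symbolicated function name, else fall back to module@offset."""
      if frame.get("function"):
          return frame["function"]
      module = frame.get("module", "unknown")
      offset = frame.get("module_offset", "0x0")
      return "%s@%s" % (module, offset)

  def make_signature(frames, max_frames=3):
      names = [frame_name(f) for f in frames]
      # Drop leading frames that never identify the real crash site.
      while names and names[0].startswith(SKIP_PREFIXES):
          names.pop(0)
      if not names:
          return "EMPTY: no frame data available"
      return " | ".join(names[:max_frames])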

Et voila: https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/100216/command/100247

Note that I only used a 0.1 sampling ratio because I didn't want to DoS the symbolication server, so there might be some sampling bias, but even so the numbers are pretty clear. The top few GPU process crash signatures with WR on are below, ordered by descending frequency; the ones marked with an asterisk are not present in the "WR off" cohort. A rough sketch of the comparison follows the list.

  229, Microsoft::WRL::Wrappers::HandleT<T>::Close
  175, PR_JoinThread | nsThread::Shutdown | nsThreadPool::Shutdown
  140, RegistryKeyWatcher::~RegistryKeyWatcher
  131, mozilla::ipc::MessageChannel::~MessageChannel | mozilla::ipc::IToplevelProtocol::ToplevelState::~ToplevelState
* 107, webrender::prim_store::PrimitiveStore::prepare_interned_prim_for_render
   48, CContext::TID3D11DeviceContext_Map_<T>
*  38, mozilla::wr::Moz2DRenderCallback
   37, TppRaiseHandleStatus
*  34, core::option::expect_failed | webrender::prim_store::PrimitiveStore::update_visibility
*  28, OOM | unknown | mozalloc_abort | mozalloc_handle_oom | gkrust_shared::oom_hook::hook
*  12, core::option::expect_failed | webrender::prim_store::PrimitiveStore::update_picture
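
For what it's worth, the sampling and the WR-on/WR-off comparison behind the list above amount to something like the sketch below. `wr_on_signatures` and `wr_off_signatures` are hypothetical per-ping signature lists produced by make_signature() for each cohort; nothing here is the actual notebook code.

  # Hedged sketch: downsample pings, then tally WR-on signatures and star
  # the ones that never show up in the WR-off cohort.
  import random
  from collections import Counter

  def sample(pings, ratio=0.1, seed=0):
      """Downsample pings so the symbolication server isn't hammered."""
      rng = random.Random(seed)
      return [p for p in pings if rng.random() < ratio]

  def compare_cohorts(wr_on_signatures, wr_off_signatures, top_n=15):
      """Top WR-on signatures, starring the ones absent from the WR-off cohort."""
      on_counts = Counter(wr_on_signatures)
      off_counts = Counter(wr_off_signatures)
      lines = []
      for sig, count in on_counts.most_common(top_n):
          marker = "*" if sig not in off_counts else " "
          lines.append("%s %4d, %s" % (marker, count, sig))
      return "\n".join(lines)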

webrender::prim_store::PrimitiveStore::prepare_interned_prim_for_render is bug 1519833.

Kats, can you confirm that we see similar numbers on 67 beta?

Flags: needinfo?(kats)

Here's an equivalent notebook using 67 beta crashes with WR enabled (submitted after March 1, as with the previous notebook): https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/100338/command/100348

Top 3 WR results look similar to 66 release:

275, EMPTY: no frame data available
223, webrender::prim_store::PrimitiveStore::prepare_interned_prim_for_render
197, OOM | unknown | mozalloc_abort | mozalloc_handle_oom | gkrust_shared::oom_hook::hook
134, mozilla::wr::Moz2DRenderCallback
86, GeckoCrash
73, wbload64.pdb@0x10bf5
61, nvwgf2um.pdb@0x3a7cd7
46, wbload64.pdb@0x10895
38, mozilla::wr::RenderCompositorANGLE::EndFrame
25, mozilla::ipc::FatalError | mozilla::ipc::IProtocol::HandleFatalError | mozilla::layers::PWebRenderBridgeParent::OnMessageReceived
15, core::option::expect_failed | std::collections::hash::map::{{impl}}::index<T> | webrender::display_list_flattener::DisplayListFlattener::flatten_item
Flags: needinfo?(kats)

Current crash frequency for 67.0b11 + 67.0b12 is below. There's a tail of crashes that occurred only once that I'm omitting.

39, wbload64.pdb@0x10bf5
29, OOM | small
25, EMPTY: no frame data available
20, mozilla::wr::Moz2DRenderCallback
15, nvd3dumx_cfg.pdb@0xb14db4
13, OOM | large | mozalloc_abort | mozalloc_handle_oom | gkrust_shared::oom_hook::hook
12, wbload64.pdb@0x10895
11, mozilla::wr::RenderCompositorANGLE::EndFrame
10, mozilla::ipc::FatalError | mozilla::ipc::IProtocol::HandleFatalError | mozilla::layers::PWebRenderBridgeParent::OnMessageReceived
9, GeckoCrash
8, nvwgf2um.pdb@0x3a7cd7
6, nvwgf2umx_cfg.pdb@0xf25073
4, nvwgf2um_cfg.pdb@0xdf4c22
3, webrender::picture::TileCache::pre_update
3, nvwgf2um_cfg.pdb@0xdbcf72
3, nvd3dumx_cfg.pdb@0x9a579c
3, core::result::unwrap_failed<T> | webrender_api::display_list::BuiltDisplayListIter::next_raw
3, core::option::expect_failed | webrender::display_list_flattener::NodeIdToIndexMapper::get_spatial_node_index
2, webrender::resource_cache::ResourceCache::update_image_template
2, webrender::prim_store::PrimitiveStore::update_visibility
2, webrender::prim_store::PrimitiveStore::prepare_prim_for_render
2, webrender::batch::AlphaBatchBuilder::add_prim_to_batch
2, vcruntime140.amd64.pdb@0xcd63
2, nvwgf2um.pdb@0x8c5aeb
2, nvd3dumx_cfg.pdb@0x9d5f0c
2, _chkstk | rayon::iter::plumbing::bridge_producer_consumer::helper<T>
2, 

Some caveats to keep in mind when looking at data from the crash pings vs from crash-stats: https://bugzilla.mozilla.org/show_bug.cgi?id=1544246#c7

Blocks: wr-68
No longer blocks: wr-67

I think this is technically fixed now but we need experiment data from 67 to confirm.

Latest experiment data from 67 release shows that GPU crashes are still higher with WR on, although the crash rate is much lower in the main/content processes, such that the overall crash rate is lower with WR.

I looked at the list of GPU crashes with WR on in 67, using both crash-stats and Databricks, and the wbload.dll crash (fixed in bug 1544435 in 68+) was dominating the numbers. Recall that this crash happens 4 times on every startup for affected users, so it's vastly overreported relative to real crashes. If we discount these crashes, I think the crash rates for the GPU process with WR on and off are roughly equivalent, so I'll close this bug.
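
For the record, the discounting is nothing fancier than dropping those signatures before totalling; a back-of-the-envelope sketch is below, where treating every signature that mentions wbload64.pdb as that startup crash is an assumption.

  # Hedged sketch: exclude the overreported wbload.dll startup crashes
  # (bug 1544435) before comparing GPU-process crash totals per cohort.
  def discounted_total(signature_counts):
      return sum(
          count
          for sig, count in signature_counts.items()
          if "wbload64.pdb" not in sig
      )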

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED