Closed Bug 1540853 Opened 5 years ago Closed 5 years ago

GPU crash rate is higher with WebRender on 66 experiment

Categories

(Core :: Graphics: WebRender, defect, P2)

RESOLVED FIXED

People

(Reporter: jrmuizel, Assigned: kats)

References

Details

(Whiteboard: [wr-april])

We should look into why and what we can do about it.

Blocks: wr-67

It looks like we don’t gather enough crash reports to see where the additional GPU crashes that we see in the release experiment are coming from.

Also here's the data: https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/94118/dashboard/94226

I'll take a look at this. Jeff provided me with some pointers on how to get crash stacks out of telemetry and possibly extract signatures from them that we can use to compare between WR and non-WR.
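
Roughly, the idea is that each crash ping carries the raw (unsymbolicated) stack of the crashing process, which can be turned into a request for the Mozilla symbolication server. Below is a minimal sketch of that step; the ping field names (stackTraces, modules, crash_info, frames, ip, module_index, base_addr) and the exact payload shape expected by the /symbolicate/v5 endpoint are assumptions on my part, not verified schema.

  # Hedged sketch: build a symbolication request from one crash ping's raw stack.
  # Field names in the ping payload and the request format are assumptions.
  import requests

  SYMBOLICATION_URL = "https://symbolication.services.mozilla.com/symbolicate/v5"

  def build_job(stack_traces):
      """Convert one ping's stackTraces blob into a symbolication job."""
      modules = stack_traces.get("modules", [])
      memory_map = [[m.get("debug_file", ""), m.get("debug_id", "")] for m in modules]

      # Use the crashing thread if the ping identifies one, else thread 0.
      crashing = stack_traces.get("crash_info", {}).get("crashing_thread") or 0
      frames = stack_traces["threads"][crashing]["frames"]

      stack = []
      for f in frames:
          idx = f.get("module_index")
          ip = int(f["ip"], 16)
          if idx is None or idx < 0 or idx >= len(modules):
              stack.append([-1, ip])          # frame outside any known module
          else:
              base = int(modules[idx]["base_addr"], 16)
              stack.append([idx, ip - base])  # module-relative offset
      return {"memoryMap": memory_map, "stacks": [stack]}

  def symbolicate(jobs):
      """POST a batch of jobs to the symbolication server and return the JSON reply."""
      resp = requests.post(SYMBOLICATION_URL, json={"jobs": jobs}, timeout=120)
      resp.raise_for_status()
      return resp.json()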

Assignee: nobody → kats
Whiteboard: [wr-april]

I've successfully managed to get symbolicated stack traces from the enabled/disabled experiment cohorts using the telemetry crash pings. Now I just need to figure out how to consolidate the stack traces into "signatures" of sorts, so that the results are more comparable and actionable.
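
Roughly what I have in mind for the consolidation, as a sketch: keep the first few interesting symbolicated frames and join them with " | " (the format used in the lists below), falling back to module@offset when no function name is available. The skip list and the three-frame cutoff here are placeholders, not the real crash-stats signature rules.

  # Hedged sketch of signature consolidation; skip rules are placeholders.
  SKIP_PREFIXES = (
      "RaiseException",
      "KiFastSystemCallRet",
  )

  def frame_name(frame):
      """Prefer the symbolicated function name, else fall back to module@offset."""
      if frame.get("function"):
          return frame["function"]
      module = frame.get("module", "unknown")
      offset = frame.get("module_offset", "0x0")
      return "%s@%s" % (module, offset)

  def make_signature(frames, max_frames=3):
      names = [frame_name(f) for f in frames]
      # Drop leading frames that never identify the real crash site.
      while names and names[0].startswith(SKIP_PREFIXES):
          names.pop(0)
      if not names:
          return "EMPTY: no frame data available"
      return " | ".join(names[:max_frames])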

Et voila: https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/100216/command/100247

Note that I only used a 0.1 sampling ratio because I didn't want to DoS the symbolication server, so there might be some sampling bias, but even so the numbers are pretty clear. The top few GPU process crash signatures with WR on are below, ordered by descending frequency; the ones marked with an asterisk are not present in the "WR off" cohort. A rough sketch of the comparison follows the list.

  229, Microsoft::WRL::Wrappers::HandleT<T>::Close
  175, PR_JoinThread | nsThread::Shutdown | nsThreadPool::Shutdown
  140, RegistryKeyWatcher::~RegistryKeyWatcher
  131, mozilla::ipc::MessageChannel::~MessageChannel | mozilla::ipc::IToplevelProtocol::ToplevelState::~ToplevelState
* 107, webrender::prim_store::PrimitiveStore::prepare_interned_prim_for_render
   48, CContext::TID3D11DeviceContext_Map_<T>
*  38, mozilla::wr::Moz2DRenderCallback
   37, TppRaiseHandleStatus
*  34, core::option::expect_failed | webrender::prim_store::PrimitiveStore::update_visibility
*  28, OOM | unknown | mozalloc_abort | mozalloc_handle_oom | gkrust_shared::oom_hook::hook
*  12, core::option::expect_failed | webrender::prim_store::PrimitiveStore::update_picture
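
For what it's worth, the sampling and the WR-on/WR-off comparison behind the list above amount to something like the sketch below. `wr_on_signatures` and `wr_off_signatures` are hypothetical per-ping signature lists produced by make_signature() for each cohort; nothing here is the actual notebook code.

  # Hedged sketch: downsample pings, then tally WR-on signatures and star
  # the ones that never show up in the WR-off cohort.
  import random
  from collections import Counter

  def sample(pings, ratio=0.1, seed=0):
      """Downsample pings so the symbolication server isn't hammered."""
      rng = random.Random(seed)
      return [p for p in pings if rng.random() < ratio]

  def compare_cohorts(wr_on_signatures, wr_off_signatures, top_n=15):
      """Top WR-on signatures, starring the ones absent from the WR-off cohort."""
      on_counts = Counter(wr_on_signatures)
      off_counts = Counter(wr_off_signatures)
      lines = []
      for sig, count in on_counts.most_common(top_n):
          marker = "*" if sig not in off_counts else " "
          lines.append("%s %4d, %s" % (marker, count, sig))
      return "\n".join(lines)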

webrender::prim_store::PrimitiveStore::prepare_interned_prim_for_render is bug 1519833.

Kats, can you confirm that we see similar numbers on 67 beta?

Flags: needinfo?(kats)

Here's an equivalent notebook using 67 beta crashes with WR enabled (submitted after March 1, as with the previous notebook): https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/100338/command/100348

Top 3 WR results look similar to 66 release:

275, EMPTY: no frame data available
223, webrender::prim_store::PrimitiveStore::prepare_interned_prim_for_render
197, OOM | unknown | mozalloc_abort | mozalloc_handle_oom | gkrust_shared::oom_hook::hook
134, mozilla::wr::Moz2DRenderCallback
86, GeckoCrash
73, wbload64.pdb@0x10bf5
61, nvwgf2um.pdb@0x3a7cd7
46, wbload64.pdb@0x10895
38, mozilla::wr::RenderCompositorANGLE::EndFrame
25, mozilla::ipc::FatalError | mozilla::ipc::IProtocol::HandleFatalError | mozilla::layers::PWebRenderBridgeParent::OnMessageReceived
15, core::option::expect_failed | std::collections::hash::map::{{impl}}::index<T> | webrender::display_list_flattener::DisplayListFlattener::flatten_item
Flags: needinfo?(kats)

Current crash frequency for 67.0b11 + 67.0b12 is below. There's a tail of crashes that occurred only once that I'm omitting.

39, wbload64.pdb@0x10bf5
29, OOM | small
25, EMPTY: no frame data available
20, mozilla::wr::Moz2DRenderCallback
15, nvd3dumx_cfg.pdb@0xb14db4
13, OOM | large | mozalloc_abort | mozalloc_handle_oom | gkrust_shared::oom_hook::hook
12, wbload64.pdb@0x10895
11, mozilla::wr::RenderCompositorANGLE::EndFrame
10, mozilla::ipc::FatalError | mozilla::ipc::IProtocol::HandleFatalError | mozilla::layers::PWebRenderBridgeParent::OnMessageReceived
9, GeckoCrash
8, nvwgf2um.pdb@0x3a7cd7
6, nvwgf2umx_cfg.pdb@0xf25073
4, nvwgf2um_cfg.pdb@0xdf4c22
3, webrender::picture::TileCache::pre_update
3, nvwgf2um_cfg.pdb@0xdbcf72
3, nvd3dumx_cfg.pdb@0x9a579c
3, core::result::unwrap_failed<T> | webrender_api::display_list::BuiltDisplayListIter::next_raw
3, core::option::expect_failed | webrender::display_list_flattener::NodeIdToIndexMapper::get_spatial_node_index
2, webrender::resource_cache::ResourceCache::update_image_template
2, webrender::prim_store::PrimitiveStore::update_visibility
2, webrender::prim_store::PrimitiveStore::prepare_prim_for_render
2, webrender::batch::AlphaBatchBuilder::add_prim_to_batch
2, vcruntime140.amd64.pdb@0xcd63
2, nvwgf2um.pdb@0x8c5aeb
2, nvd3dumx_cfg.pdb@0x9d5f0c
2, _chkstk | rayon::iter::plumbing::bridge_producer_consumer::helper<T>
2, 

Some caveats to keep in mind when looking at data from the crash pings vs from crash-stats: https://bugzilla.mozilla.org/show_bug.cgi?id=1544246#c7

Blocks: wr-68
No longer blocks: wr-67

I think this is technically fixed now but we need experiment data from 67 to confirm.

Latest experiment data from 67 release shows that GPU crashes are still higher with WR on, although the crash rate is much lower in the main/content processes, such that the overall crash rate is lower with WR.

I looked at the list of GPU crashes with WR on in 67, using both crash-stats and Databricks, and the wbload.dll crash (fixed in bug 1544435 in 68+) was dominating the numbers. Recall that this crash happens 4 times on every startup for affected users, so it's vastly overreported relative to real crashes. If we discount these crashes, I think the crash rates for the GPU process with WR on and off are roughly equivalent, so I'll close this bug.
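
For the record, the discounting is nothing fancier than dropping those signatures before totalling; a back-of-the-envelope sketch is below, where treating every signature that mentions wbload64.pdb as that startup crash is an assumption.

  # Hedged sketch: exclude the overreported wbload.dll startup crashes
  # (bug 1544435) before comparing GPU-process crash totals per cohort.
  def discounted_total(signature_counts):
      return sum(
          count
          for sig, count in signature_counts.items()
          if "wbload64.pdb" not in sig
      )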

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED