GPU crash rate is higher with WebRender on 66 experiment
Categories: Core :: Graphics: WebRender, defect, P2
People: Reporter: jrmuizel, Assigned: kats
Whiteboard: [wr-april]
We should look into why and what we can do about it.
Reporter | Comment 1 • 6 years ago
It looks like we don’t gather enough crash reports to see where the additional GPU crashes that we see in the release experiment are coming from.
Also here's the data: https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/94118/dashboard/94226
Assignee | Comment 2 • 6 years ago
I'll take a look at this. Jeff provided me with some pointers on how to get crash stacks out of telemetry and possibly extract signatures from them that we can use to compare between WR and non-WR.
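As a rough illustration of the first step (getting crash stacks out of telemetry into a symbolication service), here is a hedged sketch of building a symbolication request from the raw module/offset frames found in a crash ping. The payload shape is an assumption modeled on Mozilla's Symbolication API; the module name, debug ID, and offsets below are made up for illustration.

```python
# Hypothetical sketch: turn a raw crash-ping stack (module + offset pairs)
# into a request payload for a symbolication service. The payload shape is
# an assumption modeled on Mozilla's Symbolication API, not taken from
# this bug; the module/debug-id values are invented.

def build_symbolication_request(modules, frames):
    """modules: list of (debug_file, debug_id) tuples.
    frames: list of (module_index, module_offset) tuples."""
    return {
        "jobs": [{
            "memoryMap": [list(m) for m in modules],
            "stacks": [[list(f) for f in frames]],
        }]
    }

payload = build_symbolication_request(
    modules=[("xul.pdb", "44E4EC8C2F41492B9369D6B9A059577C2")],
    frames=[(0, 0x10BF5), (0, 0x10895)],
)
```

The response from such a service would map each (module, offset) pair back to a function name, which is what the later comments' signature lists are built from.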
Updated • 6 years ago
Assignee | Comment 3 • 6 years ago
I've successfully managed to get symbolicated stack traces from the enabled/disabled experiment cohorts using the telemetry crash pings. Now I just need to figure out how to consolidate the stack traces into "signatures" of sorts, so that they're easier to compare and act on.
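A much-simplified sketch of what "consolidating into signatures" could look like: skip abort/assert boilerplate frames, then join the first few remaining frames with " | " (the format used in the lists below). Socorro's real signature-generation algorithm is far more elaborate; the skip list here is purely illustrative.

```python
# Illustrative only: collapse a symbolicated stack into a crash
# "signature" by skipping crash-machinery prelude frames and joining the
# top few remaining frames. The skip list is a made-up simplification of
# what a real signature generator (e.g. Socorro's) does.

SKIP_PREFIXES = ("mozalloc_abort", "abort", "RustMozCrash", "MOZ_Crash")

def signature(frames, depth=3):
    interesting = [f for f in frames if not f.startswith(SKIP_PREFIXES)]
    return " | ".join(interesting[:depth]) or "EMPTY: no frame data available"

sig = signature([
    "mozalloc_abort",
    "core::option::expect_failed",
    "webrender::prim_store::PrimitiveStore::update_visibility",
])
# sig == "core::option::expect_failed | webrender::prim_store::PrimitiveStore::update_visibility"
```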
Assignee | Comment 4 • 6 years ago
Et voila: https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/100216/command/100247
Note that I only used a 0.1 sampling ratio because I didn't want to DoS the symbolication server. So there might be some sampling bias, but even so the numbers are pretty clear. The top few GPU process crash signatures with WR on are below, ordered by descending frequency; the ones marked with a star are not present in the "WR off" cohort.
229, Microsoft::WRL::Wrappers::HandleT<T>::Close
175, PR_JoinThread | nsThread::Shutdown | nsThreadPool::Shutdown
140, RegistryKeyWatcher::~RegistryKeyWatcher
131, mozilla::ipc::MessageChannel::~MessageChannel | mozilla::ipc::IToplevelProtocol::ToplevelState::~ToplevelState
* 107, webrender::prim_store::PrimitiveStore::prepare_interned_prim_for_render
48, CContext::TID3D11DeviceContext_Map_<T>
* 38, mozilla::wr::Moz2DRenderCallback
37, TppRaiseHandleStatus
* 34, core::option::expect_failed | webrender::prim_store::PrimitiveStore::update_visibility
* 28, OOM | unknown | mozalloc_abort | mozalloc_handle_oom | gkrust_shared::oom_hook::hook
* 12, core::option::expect_failed | webrender::prim_store::PrimitiveStore::update_picture
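The cohort comparison above can be sketched as follows: tally signatures in each cohort, star the ones absent from the "WR off" cohort, and scale sampled counts back up by the sampling ratio. The data here is toy data; only the mechanics mirror the comparison.

```python
# Hedged sketch of the cohort comparison: count signatures per cohort
# (toy data below), star the ones absent from the WR-off cohort, and
# scale a 0.1 sample's counts by 1/0.1 to estimate population totals.
from collections import Counter

def compare(wr_on_sigs, wr_off_sigs, sampling_ratio=0.1):
    on, off = Counter(wr_on_sigs), Counter(wr_off_sigs)
    rows = []
    for sig, n in on.most_common():
        star = "* " if sig not in off else "  "
        rows.append((star, round(n / sampling_ratio), sig))
    return rows

rows = compare(
    ["A", "A", "B"],   # WR-on sample (toy)
    ["A"],             # WR-off sample (toy)
)
# rows == [("  ", 20, "A"), ("* ", 10, "B")]
```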
Reporter | Comment 5 • 6 years ago
webrender::prim_store::PrimitiveStore::prepare_interned_prim_for_render is bug 1519833.
Kats, can you confirm that we see similar numbers on 67 beta?
Assignee | Comment 6 • 6 years ago
Here's an equivalent notebook using 67 beta crashes with WR enabled (submitted after March 1, as with the previous notebook): https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/100338/command/100348
Top 3 WR results look similar to 66 release:
275, EMPTY: no frame data available
223, webrender::prim_store::PrimitiveStore::prepare_interned_prim_for_render
197, OOM | unknown | mozalloc_abort | mozalloc_handle_oom | gkrust_shared::oom_hook::hook
134, mozilla::wr::Moz2DRenderCallback
86, GeckoCrash
73, wbload64.pdb@0x10bf5
61, nvwgf2um.pdb@0x3a7cd7
46, wbload64.pdb@0x10895
38, mozilla::wr::RenderCompositorANGLE::EndFrame
25, mozilla::ipc::FatalError | mozilla::ipc::IProtocol::HandleFatalError | mozilla::layers::PWebRenderBridgeParent::OnMessageReceived
15, core::option::expect_failed | std::collections::hash::map::{{impl}}::index<T> | webrender::display_list_flattener::DisplayListFlattener::flatten_item
Assignee | Comment 7 • 6 years ago
Current crash frequency for 67.0b11 + 67.0b12 is below. There's a tail of crashes that occurred only once that I'm omitting.
39, wbload64.pdb@0x10bf5
29, OOM | small
25, EMPTY: no frame data available
20, mozilla::wr::Moz2DRenderCallback
15, nvd3dumx_cfg.pdb@0xb14db4
13, OOM | large | mozalloc_abort | mozalloc_handle_oom | gkrust_shared::oom_hook::hook
12, wbload64.pdb@0x10895
11, mozilla::wr::RenderCompositorANGLE::EndFrame
10, mozilla::ipc::FatalError | mozilla::ipc::IProtocol::HandleFatalError | mozilla::layers::PWebRenderBridgeParent::OnMessageReceived
9, GeckoCrash
8, nvwgf2um.pdb@0x3a7cd7
6, nvwgf2umx_cfg.pdb@0xf25073
4, nvwgf2um_cfg.pdb@0xdf4c22
3, webrender::picture::TileCache::pre_update
3, nvwgf2um_cfg.pdb@0xdbcf72
3, nvd3dumx_cfg.pdb@0x9a579c
3, core::result::unwrap_failed<T> | webrender_api::display_list::BuiltDisplayListIter::next_raw
3, core::option::expect_failed | webrender::display_list_flattener::NodeIdToIndexMapper::get_spatial_node_index
2, webrender::resource_cache::ResourceCache::update_image_template
2, webrender::prim_store::PrimitiveStore::update_visibility
2, webrender::prim_store::PrimitiveStore::prepare_prim_for_render
2, webrender::batch::AlphaBatchBuilder::add_prim_to_batch
2, vcruntime140.amd64.pdb@0xcd63
2, nvwgf2um.pdb@0x8c5aeb
2, nvd3dumx_cfg.pdb@0x9d5f0c
2, _chkstk | rayon::iter::plumbing::bridge_producer_consumer::helper<T>
2,
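Producing a frequency list like the one above, with the one-off tail omitted, amounts to a count-and-filter. A minimal sketch with toy data:

```python
# Sketch of the frequency table above: count crashes per signature and
# drop the tail of signatures seen only once. Toy data; the real counts
# come from the telemetry crash pings.
from collections import Counter

def frequency_table(signatures, min_count=2):
    counts = Counter(signatures)
    return [(n, sig) for sig, n in counts.most_common() if n >= min_count]

table = frequency_table(["X", "X", "X", "Y", "Y", "Z"])
# table == [(3, "X"), (2, "Y")]  -- the one-off "Z" is omitted
```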
Assignee | Comment 8 • 6 years ago
Some caveats to keep in mind when looking at data from the crash pings vs from crash-stats: https://bugzilla.mozilla.org/show_bug.cgi?id=1544246#c7
Reporter | Updated • 6 years ago
Assignee | Comment 9 • 6 years ago
I think this is technically fixed now but we need experiment data from 67 to confirm.
Assignee | Comment 10 • 6 years ago
Latest experiment data from 67 release shows that GPU crashes are still higher with WR on, although the crash rate is much lower in the main/content processes, such that the overall crash rate is lower with WR.
I looked at the list of GPU crashes with WR on in 67, using both crash-stats and Databricks, and the wbload.dll crash (fixed in bug 1544435 for 68+) dominated the numbers. Recall that this crash happens four times on every startup for affected users, so it is vastly overreported relative to real crashes. If we discount these crashes, I think the GPU process crash rates with WR on and off are roughly equivalent, so I'll close this bug.
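The "discount the overreported signature, then compare" reasoning can be shown with a small arithmetic sketch. All the numbers below are invented; only the method (remove one dominant signature from the WR-on totals, then compare per-usage crash rates) reflects the comment above.

```python
# Illustrative arithmetic for the conclusion above: remove a dominant,
# overreported signature (the wbload.dll crash) from the WR-on totals
# before comparing GPU-process crash rates. All numbers are made up.

def crash_rate(crashes_by_sig, usage_hours, discount=()):
    """Crashes per 1000 usage hours, excluding discounted signatures."""
    total = sum(n for sig, n in crashes_by_sig.items() if sig not in discount)
    return total / usage_hours * 1000

wr_on = {"wbload64.pdb@0x10bf5": 39, "OOM | small": 29, "other": 50}
wr_off = {"OOM | small": 30, "other": 48}

raw = crash_rate(wr_on, usage_hours=10_000)
discounted = crash_rate(wr_on, usage_hours=10_000,
                        discount={"wbload64.pdb@0x10bf5"})
baseline = crash_rate(wr_off, usage_hours=10_000)
# With the toy numbers, discounted (7.9) is close to baseline (7.8),
# while the raw rate (11.8) is not.
```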