Open Bug 1546671 Opened 5 years ago Updated 2 years ago

Investigate high OOM rate for WebRender

Categories

(Core :: Graphics: WebRender, defect, P2)

Other Branch
defect

People

(Reporter: kats, Unassigned)

References

Details

https://metrics.mozilla.com/webrender/dashboard_nvidia.html#nightly shows WR at almost 400% of the non-WR rate for out-of-memory crashes. That's not good. The beta graph just below it shows a more "reasonable" ~140%.

This needs some investigation to figure out what's going on.

I used Databricks to get the Windows GPU process "OOM | small" crashes with WR enabled on beta since 20190414 and collated the stack frames to get a better idea of which code is OOMing. The most frequent stacks are below (I aggressively pruned stack frames to make them more readable); the leading number on each line is the crash count.

13, static void webrender::scene_builder::SceneBuilder::run() | static void std::sys_common::backtrace::__rust_begin_short_backtrace<closure,()>(struct closure) | static void alloc::boxed::{{impl}}::call_box<(),closure>(struct closure *, <NoType>)
5, moz_xmalloc | mozilla::BufferList<InfallibleAllocPolicy>::AllocateSegment(unsigned __int64,unsigned __int64) | mozilla::BufferList<InfallibleAllocPolicy>::WriteBytes(char const *,unsigned __int64)
4, static union core::result::Result<(), alloc::collections::CollectionAllocErr> std::collections::hash::map::HashMap<webrender_api::display_item::ClipId, webrender::display_list_flattener::ClipNode, core::hash::BuildHasherDefault<fxhash::FxHasher>>::try_resize<webrender_api::display_item::ClipId,webrender::display_list_flattener::ClipNode,core::hash::BuildHasherDefault<fxhash::FxHasher>>(unsigned __int64, std::collections::hash::table::Fallibility) | static void webrender::display_list_flattener::NodeIdToIndexMapper::add_clip_chain(union webrender_api::display_item::ClipId, struct webrender::clip::ClipChainId, unsigned __int64) | static union core::option::Option<webrender_api::display_list::BuiltDisplayListIter> webrender::display_list_flattener::DisplayListFlattener::flatten_item(struct webrender_api::display_list::DisplayItemRef, struct webrender_api::api::PipelineId, bool)
3, static union core::result::Result<(), alloc::collections::CollectionAllocErr> std::collections::hash::map::HashMap<(i32, i32), webrender::picture::Tile, core::hash::BuildHasherDefault<fxhash::FxHasher>>::try_resize<(i32, i32),webrender::picture::Tile,core::hash::BuildHasherDefault<fxhash::FxHasher>>(unsigned __int64, std::collections::hash::table::Fallibility) | static void webrender::picture::TileCache::pre_update(struct euclid::rect::TypedRect<f32, webrender_api::units::LayoutPixel>, struct webrender::frame_builder::FrameVisibilityContext *, struct webrender::frame_builder::FrameVisibilityState *, struct webrender::picture::SurfaceIndex)
3, static struct webrender::util::Allocation<webrender::picture::PicturePrimitive> webrender::util::{{impl}}::alloc<webrender::picture::PicturePrimitive>(struct alloc::vec::Vec<webrender::picture::PicturePrimitive> *) | static void webrender::display_list_flattener::DisplayListFlattener::pop_stacking_context() | static union core::option::Option<webrender_api::display_list::BuiltDisplayListIter> webrender::display_list_flattener::DisplayListFlattener::flatten_item(struct webrender_api::display_list::DisplayItemRef, struct webrender_api::api::PipelineId, bool)
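
For reference, here is a standalone sketch of the collation step. It is not the actual Databricks job; it assumes the raw crash stacks have already been exported to a local text file (the name stacks.txt is made up), one pipe-separated stack per line.

```rust
use std::collections::HashMap;
use std::fs;

fn main() -> std::io::Result<()> {
    // Assumed input: one crash stack per line, frames separated by " | ".
    let raw = fs::read_to_string("stacks.txt")?;
    let mut counts: HashMap<String, usize> = HashMap::new();

    for line in raw.lines() {
        // "Aggressive pruning": keep only the innermost few frames so that
        // near-identical stacks collapse into the same bucket.
        let pruned: Vec<&str> = line.split(" | ").take(3).collect();
        *counts.entry(pruned.join(" | ")).or_insert(0) += 1;
    }

    // Print the most frequent pruned stacks first, in "count, stack" form.
    let mut sorted: Vec<(String, usize)> = counts.into_iter().collect();
    sorted.sort_by(|a, b| b.1.cmp(&a.1));
    for (stack, count) in sorted.iter().take(10) {
        println!("{}, {}", count, stack);
    }
    Ok(())
}
```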

The bulk of the OOMs seem to be in the content process rather than the GPU process, which is somewhat good news in that they won't take down the whole browser (at least not right away). However, aggregating the crashes by buildid doesn't show any clear regression window where the count went up.

Miko pointed me to bug 1541092, which might be related in that it is also about an increase in Windows content-process OOMs, in that case after an arena size was bumped from 8k to 32k. So in general it seems that as the allocation size increases, so does the OOM rate, which points to some sort of memory fragmentation problem, probably in the allocator. Since WR makes larger allocations than non-WR, we hit this more often. At least that's my best theory right now.
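
To illustrate why larger allocations would hit fragmentation harder, here is a toy free-list model. It is purely illustrative (it is not jemalloc, and it is not meant to reflect how our allocator actually manages memory): with an arena carved into 8 KiB slots and every other slot freed, half the arena is free, yet a single 32 KiB request still cannot be satisfied.

```rust
// Toy model of allocator fragmentation, for illustration only.
const SLOT: usize = 8 * 1024;      // 8 KiB slots
const ARENA: usize = 1024 * 1024;  // 1 MiB arena
const SLOTS: usize = ARENA / SLOT;

/// First-fit search for `need` contiguous free slots; returns the start index.
fn first_fit(free: &[bool], need: usize) -> Option<usize> {
    let mut run = 0;
    for (i, &is_free) in free.iter().enumerate() {
        run = if is_free { run + 1 } else { 0 };
        if run == need {
            return Some(i + 1 - need);
        }
    }
    None
}

fn main() {
    // Start fully allocated, then free every other 8 KiB slot.
    let mut free = vec![false; SLOTS];
    for i in (0..SLOTS).step_by(2) {
        free[i] = true;
    }

    let free_bytes = free.iter().filter(|&&f| f).count() * SLOT;
    println!("free: {} KiB of {} KiB", free_bytes / 1024, ARENA / 1024);

    // A single-slot (8 KiB) request fits, but a 4-slot (32 KiB) request
    // does not, even though 512 KiB of the arena are free.
    println!("8 KiB request fits:  {}", first_fit(&free, 1).is_some());
    println!("32 KiB request fits: {}", first_fit(&free, 4).is_some());
}
```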

See Also: → 1541092

I looked at the dashboard again, and now the NVIDIA beta summary shows WR doing better than non-WR for OOM crashes. Nightly got worse, though: it's now around 600%.

But... on AMD and Intel, WR is better than non-WR. So I suspect we just don't have enough crashes/data for a meaningful comparison, and the numbers are fluctuating a lot as a result. Downgrading to P2 and flagging this as something we should look at more closely for 68, but I'm less concerned now than I was a few days ago.

Blocks: wr-68
Priority: P1 → P2
Assignee: nobody → a.beingessner

Unassigning myself. I agree with kats' conclusion that this data is too noisy to motivate a deeper investigation (e.g., NVIDIA Nightly now has us with fewer OOMs than non-WR, but beta has more).

Assignee: a.beingessner → nobody
No longer blocks: wr-68
Severity: normal → S3